Patent 2413022 Summary

(12) Patent Application: (11) CA 2413022
(54) English Title: WHOLE CELL ENGINEERING BY MUTAGENIZING A SUBSTANTIAL PORTION OF A STARTING GENOME, COMBINING MUTATIONS, AND OPTIONALLY REPEATING
(54) French Title: INGENIERIE CELLULAIRE COMPLETE PAR MUTAGENESE D'UNE PARTIE SUBSTANTIELLE D'UN GENOME DE DEPART, PAR COMBINAISON DE MUTATIONS ET EVENTUELLEMENT REPETITION
Status: Deemed Abandoned and Beyond the Period of Reinstatement - Pending Response to Notice of Disregarded Communication
Bibliographic Data
(51) International Patent Classification (IPC):
  • C12N 15/11 (2006.01)
  • A01K 67/00 (2006.01)
  • C12N 15/09 (2006.01)
  • C12N 15/10 (2006.01)
  • C12N 15/82 (2006.01)
  • G01N 33/534 (2006.01)
  • G01N 33/68 (2006.01)
(72) Inventors :
  • SHORT, JAY M. (United States of America)
(73) Owners :
  • DIVERSA CORPORATION
(71) Applicants :
  • DIVERSA CORPORATION (United States of America)
(74) Agent: GOWLING WLG (CANADA) LLP
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2001-06-14
(87) Open to Public Inspection: 2001-12-20
Availability of licence: N/A
Dedicated to the Public: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/US2001/019367
(87) International Publication Number: WO 2001096551
(85) National Entry: 2002-12-13

(30) Application Priority Data:
Application No. Country/Territory Date
09/594,459 (United States of America) 2000-06-14
09/677,584 (United States of America) 2000-09-30

Abstracts

English Abstract


An invention comprising cellular transformation, directed evolution, and
screening methods for creating novel transgenic organisms having desirable
properties. Thus in one aspect, this invention relates to a method of
generating a transgenic organism, such as a microbe or a plant, having a
plurality of traits that are differentially activatable. Also, a method of
retooling genes and gene pathways by the introduction of regulatory sequences,
such as promoters, that are operable in an intended host, thus conferring
operability to a novel gene pathway when it is introduced into an intended
host. For example a novel man-made gene pathway, generated based on
microbially-derived progenitor templates, that is operable in a plant cell.
Furthermore, a method of generating novel host organisms having increased
expression of desirable traits, recombinant genes, and gene products.


French Abstract

L'invention porte sur des procédés de transformation cellulaire, d'évolution dirigée et de criblage en vue de créer de nouveaux organismes transgéniques aux propriétés souhaitées. En variante, cette invention porte sur un procédé de génération d'un organisme transgénique tel qu'un microbe ou une plante présentant une pluralité de caractéristiques pouvant être activées de manière différentielle. L'invention porte aussi sur un procédé permettant de restructurer des gènes et des mécanismes d'action génétiques par l'introduction de séquences régulatrices telles que des promoteurs pouvant agir dans un hôte déterminé, ce qui confère une opérabilité à un nouveau mécanisme d'action génétique lorsqu'il est introduit dans un hôte déterminé. Par exemple, un nouveau mécanisme d'action génétique artificiel, généré à partir de gabarits de progéniteurs dérivés de microbes, peut être utilisé dans une cellule végétale. L'invention porte en outre sur de nouveaux organismes hôtes dont les caractéristiques souhaitées, les gènes de recombinaison et les produits géniques ont une expression accrue.

Claims

Note: Claims are shown in the official language in which they were submitted.


What is claimed is:
1. A method of producing an improved organism having a desirable trait
comprising:
a) obtaining an initial population of organisms, b) generating a set of
mutagenized
organisms, such that when all the genetic mutations in the set of mutagenized
organisms
are taken as a whole, there is represented a set of substantial genetic
mutations, and c)
detecting the presence of said improved organism.
2. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of a knocking out of at least 15 different genes.
3. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of a knocking out of at least 50 different genes.
4. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of a knocking out of at least 100 different genes.
5. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an introduction of at least 15 different genes.
6. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an introduction of at least 50 different genes.
7. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an introduction of at least 100 different genes.
8. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an alteration in the expression of at least 15 different
genes.
9. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an alteration in the expression of at least 50 different
genes.
10. The method of claim 1, wherein the set of substantial genetic mutations in
step b)
is comprised of an alteration in the expression of at least 100 different
genes.
11. A method of producing an improved organism having a desirable trait
comprising:
a) obtaining an initial population of organisms, b) generating a set of
mutagenized
organisms each having at least one genetic mutation, such that when all the
genetic
mutations in the set of mutagenized organisms are taken as a whole, there is
represented a
set of substantial genetic mutations c) detecting the manifestation of at
least two genetic
mutations, d) introducing at least two detected genetic mutations into one
organism, and e)
optionally repeating any of steps a), b), c), and d).
12. The method of claim 11, wherein step d) is comprised of a knocking out of
at least
15 different genes in one organism.
13. The method of claim 11, wherein step d) is comprised of a knocking out of
at least
50 different genes in one organism.
14. The method of claim 11, wherein step d) is comprised of a knocking out of
at least
100 different genes in one organism.
15. The method of claim 11, wherein step d) is comprised of an introduction of
at least
15 different genes into one organism.
16. The method of claim 11, wherein step d) is comprised of an introduction of
at least
50 different genes into one organism.
17. The method of claim 11, wherein step d) is comprised of an introduction of
at least
100 different genes into one organism.
18. The method of claim 11, wherein step d) is comprised of an alteration in
the
expression of at least 15 different genes in one organism.
19. The method of claim 11, wherein step d) is comprised of an alteration in
the
expression of at least 50 different genes in one organism.
20. The method of claim 11, wherein step d) is comprised of an alteration in
the
expression of at least 100 different genes in one organism.
21. A method for identifying a gene that alters a trait of an organism,
comprising: a)
obtaining an initial population of organisms, b) generating a set of
mutagenized organisms,
such that when all the genetic mutations in the set of mutagenized organisms
are taken as a
whole, there is represented a set of substantial genetic mutations, and c)
detecting the
presence of an organism having said altered trait, and d) determining the
nucleotide sequence
of a gene that has been mutagenized in the organism having the altered trait.
22. A method for producing an organism with an improved trait, comprising: a)
functionally knocking out an endogenous gene in a substantially clonal
population of
organisms; b) transferring a library of altered genes into the substantially
clonal population
of organisms, wherein each altered gene differs from the endogenous gene at
only one
codon; c) detecting a mutagenized organism having an improved trait; and
d) determining
the nucleotide sequence of a gene that has been transferred into the detected
organism.

Description

Note: Descriptions are shown in the official language in which they were submitted.


JUMBO APPLICATIONS/PATENTS
THIS SECTION OF THE APPLICATION/PATENT CONTAINS MORE THAN ONE VOLUME
THIS IS VOLUME 1 OF 4
CONTAINING PAGES 1 TO 267
NOTE: For additional volumes, please contact the Canadian Patent Office

CA 02413022 2002-12-13
WO 01/96551 PCT/US01/19367
WHOLE CELL ENGINEERING
BY MUTAGENIZING A SUBSTANTIAL PORTION OF A STARTING GENOME,
COMBINING MUTATIONS,
AND OPTIONALLY REPEATING
A - FIELD OF THE INVENTION
This invention relates to the field of cellular and whole organism
engineering.
Specifically, this invention relates to a cellular transformation, directed
evolution, and
screening method for creating novel transgenic organisms having desirable
properties.
Thus in one aspect, this invention relates to a method of generating a
transgenic organism,
such as a microbe or a plant, having a plurality of traits that are
differentially activatable.
This invention also relates to the field of protein engineering. Specifically,
this
invention relates to a directed evolution method for preparing a
polynucleotide encoding a
polypeptide. More specifically, this invention relates to a method of using
mutagenesis to
generate a novel polynucleotide encoding a novel polypeptide, which novel
polypeptide is
itself an improved biological molecule &/or contributes to the generation of
another
improved biological molecule. More specifically still, this invention relates
to a method of
performing both non-stochastic polynucleotide chimerization and
non-stochastic site-directed point mutagenesis.
Thus, in one aspect, this invention relates to a method of generating a
progeny set
of chimeric polynucleotide(s) by means that are synthetic and non-stochastic,
and where
the design of the progeny polynucleotide(s) is derived by analysis of a
parental set of
polynucleotides &/or of the polypeptides correspondingly encoded by the
parental
polynucleotides. In another aspect this invention relates to a method of
performing site-
directed mutagenesis using means that are exhaustive, systematic, and non-
stochastic.
Furthermore this invention relates to a step of selecting from among a
generated set
of progeny molecules a subset comprised of particularly desirable species,
including by a
process termed end-selection, which subset may then be screened further. This
invention
also relates to the step of screening a set of polynucleotides for the
production of a
polypeptide &/or of another expressed biological molecule having a useful
property.
Novel biological molecules whose manufacture is taught by this invention
include
genes, gene pathways, and any molecules whose expression is affected thereby,
including
directly encoded polypeptides &/or any molecules affected by such polypeptides.
Said
novel biological molecules include those that contain a carbohydrate, a lipid,
a nucleic
acid, &/or a protein component, and specific but non-limiting examples of
these include
antibiotics, antibodies, enzymes, and steroidal and non-steroidal hormones.
In a particular non-limiting aspect, the present invention relates to enzymes,
particularly to thermostable enzymes, and to their generation by directed
evolution. More
particularly, the present invention relates to thermostable enzymes which are
stable at high
temperatures and which have improved activity at lower temperatures.
B - BACKGROUND
General Overview of the Problem to Be Solved
Brief Summary: It is instantly appreciated that the process of performing a
genetic
manipulation on an organism to achieve a genetic alteration, whether it is on a
unicellular or
on a multi-cellular organism, can lead to harmful, toxic, noxious, or even
lethal effects on the
manipulated organism. This is particularly true when the genetic manipulation
becomes
sizable. From a technical point of view, this problem is seen as one of the
current obstacles
that hinder the creation of genetically altered organisms having a large
number of transgenic
traits.
On the marketing side, it is instantly appreciated that the purchase price of a
genetically
altered organism is often dictated by, or proportional to, the number of
transgenic traits that
have been introduced into the organism. Consequently, a genetically altered
organism
having a large number of stacked transgenic traits can be quite costly to
produce and
purchase and economically in low demand.
On the other hand, the generation of organisms having but a single genetically
introduced trait can also lead to the incurrence of undesirable costs,
although for other
reasons. It is thus appreciated that the separate production, marketing, &
storage of
genetically altered organisms each having a single transgenic trait can incur
costs, including
inventory costs, that are undesirable. For example, the storage of such
organisms may
require a separate bin to be used for each trait. Furthermore, the value of an
organism
having a single particular trait is often intimately tied to the marketability
of that particular
trait, and when that marketability diminishes, inventories of such organisms
cannot be sold in
other markets.
The instant invention solves these and other problems by providing a method of
producing genetically altered organisms having a large number of stacked
traits that are
differentially activatable. Upon purchasing such a genetically altered
organism (having a
large number of differentially activatable stacked traits), the purchasing
customer has the
option of selecting and paying for particular traits among the total that can
then be activated
differentially. One economic advantage provided by this invention is that the
storage of such
genetically altered organisms is simplified since, for example, one bin could
be used to store
a large number of traits. Moreover, a single organism of this type can satisfy
the demands
for a variety of traits; consequently, such an organism can be sold in a
variety of markets.
To achieve the production of genetically altered organisms having a large
number of
stacked traits that are differentially activatable, this invention provides -
in one specific
aspect - a process comprising the step of monitoring a cell or organism at a
holistic level. This
serves as a way of collecting holistic - rather than isolated - information
about a working cell
or organism that is being subjected to a substantial amount of genetic
manipulation. This
invention further provides that this type of holistic monitoring can include
the detection of all
morphological, behavioral, and physical parameters.
Accordingly, the holistic monitoring provided by this invention can include
the
identification &/or quantification of all the genetic material contained in a
working cell or
organism (e.g. all nucleic acids including the entire genome, messenger RNA's,
tRNA's,
rRNA's, and mitochondrial nucleic acids, plasmids, phages, phagemids, viruses,
as well as
all episomal nucleic acids and endosymbiont nucleic acids). Furthermore this
invention
provides that this type of holistic monitoring can include all gene products
produced by the
working cell or organism.
Furthermore, the holistic monitoring provided by this invention can include
the
identification &/or quantification of all molecules that are chemically at
least in part protein
in a working cell or organism. The holistic monitoring provided by this
invention can also
include the identification &/or quantification of all molecules that are
chemically at least in
part carbohydrate in a working cell or organism. The holistic monitoring
provided by this
invention can also include the identification &/or quantification of all
molecules that are
chemically at least in part proteoglycan in a working cell or organism. The
holistic
monitoring provided by this invention can also include the identification &/or
quantification
of all molecules that are chemically at least in part glycoprotein in a
working cell or
organism. The holistic monitoring provided by this invention can also include
the
identification &/or quantification of all molecules that are chemically at
least in part nucleic
acids in a working cell or organism. The holistic monitoring provided by this
invention can
also include the identification &/or quantification of all molecules that are
chemically at least
in part lipids in a working cell or organism.
In one aspect, this invention provides that the ability to differentially
activate a trait
from among many, such as an enzyme from among many enzymes, depends on the
enzyme(s)
to be activated having a unique activity profile (or activity fingerprint). An
enzyme's
activity profile includes the reaction(s) it catalyzes and its specificity.
Thus, an enzyme's
activity profile includes its:
  • Catalyzed reaction(s)
  • Reaction type
  • Natural substrate(s)
  • Substrate spectrum
  • Product spectrum
  • Inhibitor(s)
  • Cofactor(s)/prosthetic group(s)
  • Metal compounds/salts that affect it
  • Turnover number
  • Specific activity
  • Km value
  • pH optimum
  • pH range
  • Temperature optimum
  • Temperature range
It is also instantly appreciated that enzymes are differentially affected by
exposure to
varying degrees of processing (e.g. upon extraction &/or purification) and
exposure (e.g.
to suboptimal storage conditions). Accordingly, enzyme differences may surface
after
exposure to:
  • Isolation/Preparation
  • Purification
  • Crystallization
  • Renaturation
It is instantly appreciated that differences in molecular stability can also
be used
advantageously to differentially activate or inactivate selected enzymes, by
exposing the
enzymes for an appropriate time to variations in:
  • pH
  • Temperature
  • Oxidation
  • Organic solvent(s)
  • Miscellaneous storage conditions
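By way of a non-limiting illustration, the idea of differentially activating one trait among several by choosing the exposure conditions can be sketched in Python. This is a minimal sketch that assumes a deliberately simplified activity profile made up only of a pH range and a temperature range. The names `EnzymeProfile` and `active_traits`, and every numeric value, are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnzymeProfile:
    """Hypothetical, simplified activity profile (pH and temperature only)."""
    name: str
    ph_range: tuple[float, float]      # (min, max) pH at which the enzyme is active
    temp_range_c: tuple[float, float]  # (min, max) temperature in degrees Celsius

    def is_active(self, ph: float, temp_c: float) -> bool:
        lo_ph, hi_ph = self.ph_range
        lo_t, hi_t = self.temp_range_c
        return lo_ph <= ph <= hi_ph and lo_t <= temp_c <= hi_t


def active_traits(profiles, ph, temp_c):
    """Return the names of the enzymes that are active under one condition."""
    return [p.name for p in profiles if p.is_active(ph, temp_c)]


# Illustrative values only; they are assumptions, not data from the disclosure.
stacked = [
    EnzymeProfile("enzyme_A", ph_range=(4.0, 6.0), temp_range_c=(20.0, 40.0)),
    EnzymeProfile("enzyme_B", ph_range=(7.0, 9.0), temp_range_c=(20.0, 40.0)),
    EnzymeProfile("enzyme_C", ph_range=(7.0, 9.0), temp_range_c=(70.0, 95.0)),
]

print(active_traits(stacked, ph=5.0, temp_c=30.0))  # ['enzyme_A']
print(active_traits(stacked, ph=8.0, temp_c=80.0))  # ['enzyme_C']
```

In this sketch each stacked enzyme can be switched on alone only because no two profiles overlap in both pH and temperature. If two profiles overlapped, no single condition of this kind could separate them, which is one way to read the emphasis on unique activity fingerprints.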
It is thus appreciated that in order to be able to differentially activate
selected traits
among a plurality of stacked traits, it is desirable to introduce into a
working cell or organism
traits conferred by molecules (e.g. enzymes) having very unique profiles (e.g.
unique enzyme
fingerprints). Furthermore, it is appreciated that in order to obtain the
molecules having a
representation of a wide range of molecular fingerprints, it is advantageous
to harvest
molecules from the widest possible reaches of nature's diversity. Thus, it is
beneficial to
harvest molecules not only from cultured mesophilic organisms, but also from
extremophiles
that are largely uncultured.
In another aspect, it is instantly appreciated that harvesting the full
potential of
nature's diversity can include both the step of discovery and the step of
optimizing what is
discovered. For example, the step of discovery allows one to mine biological
molecules that
have commercial utility. It is instantly appreciated that the ability to
harvest the full richness
of biodiversity, i.e. to mine biological molecules from a wide range of
environmental
conditions, is critical to the ability to discover novel molecules adapted to
function under a
wide variety of conditions, including extremes of conditions, such as may be
found in a
commercial application.
However, it is also instantly appreciated that only occasionally are there
criteria for
selection &/or survival in nature that point in the exact direction of
particular commercial
needs. Instead, it is often the case that a naturally occurring molecule will
require a certain
amount of change - from fine tuning to sweeping modification - in order to
fulfill a
particular unmet commercial need. Thus, to meet certain commercial needs
(e.g., a need for
a molecule that is functional under a specific set of commercial processing
conditions) it is
sometimes advantageous to experimentally modify a naturally expressed molecule
to achieve
properties beyond what natural evolution has provided &/or is likely to
provide in the near
future.

The approach, termed directed evolution, of experimentally modifying a
biological
molecule towards a desirable property, can be achieved by mutagenizing one or
more
parental molecular templates and by identifying any desirable molecules among
the progeny
molecules. Currently available technologies in directed evolution include
methods for
achieving stochastic (i.e. random) mutagenesis and methods for achieving non-
stochastic
(non-random) mutagenesis. However, critical shortfalls in both types of
methods are
identified in the instant disclosure.
In prelude, it is noteworthy that it may be argued philosophically by some
that all
mutagenesis - if considered from an objective point of view - is
non-stochastic; and
furthermore that the entire universe is undergoing a process that - if
considered from an
objective point of view - is non-stochastic. Whether this is true is outside
of the scope of the
instant consideration. Accordingly, as used herein, the terms "randomness",
"uncertainty",
and "unpredictability" have subjective meanings, and the knowledge,
particularly the
predictive knowledge, of the designer of an experimental process is a
determinant of whether
the process is stochastic or non-stochastic.
By way of illustration, stochastic or random mutagenesis is exemplified by a
situation
in which a progenitor molecular template is mutated (modified or changed) to
yield a set of
progeny molecules having mutation(s) that are not predetermined. Thus, in an in
vitro
stochastic mutagenesis reaction, for example, there is not a particular
predetermined product
whose production is intended; rather there is an uncertainty - hence
randomness - regarding
the exact nature of the mutations achieved, and thus also regarding the
products generated.
In contrast, non-stochastic or non-random mutagenesis is exemplified by a
situation in which
a progenitor molecular template is mutated (modified or changed) to yield a
progeny
molecule having one or more predetermined mutations. It is appreciated that
the presence of
background products in some quantity is a reality in many reactions where
molecular
processing occurs, and the presence of these background products does not
detract from the
non-stochastic nature of a mutagenesis process having a predetermined product.
Thus, as used herein, stochastic mutagenesis is manifested in processes such
as error-prone PCR and stochastic shuffling, where the mutation(s) achieved are
random or not
predetermined. In contrast, as used herein, non-stochastic mutagenesis is
manifested in
instantly disclosed processes such as gene site-saturation mutagenesis and
synthetic ligation
reassembly, where the exact chemical structure(s) of the intended product(s) are
predetermined.
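The distinction drawn above between stochastic and non-stochastic mutagenesis can be illustrated with a short Python sketch. It is a conceptual model only and rests on simplifying assumptions: it works on amino acid strings, not on codons or oligonucleotides, and it is not the gene site-saturation mutagenesis procedure of this disclosure. The function names and the parental sequence `MKTA` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def stochastic_variants(parent, n_variants, seed=None):
    """Random point substitution: neither the position nor the new residue
    is predetermined, so the set of products is not known in advance."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        pos = rng.randrange(len(parent))
        new = rng.choice([a for a in AMINO_ACIDS if a != parent[pos]])
        variants.append(parent[:pos] + new + parent[pos + 1:])
    return variants


def site_saturation_variants(parent):
    """Non-stochastic enumeration: every single-residue substitution,
    each produced exactly once and in a predetermined order."""
    variants = []
    for pos, original in enumerate(parent):
        for new in AMINO_ACIDS:
            if new != original:
                variants.append(parent[:pos] + new + parent[pos + 1:])
    return variants


parent = "MKTA"  # hypothetical 4-residue parental sequence
systematic = site_saturation_variants(parent)
print(len(systematic))       # 76  (4 positions x 19 alternative residues)
print(len(set(systematic)))  # 76  (no variant is generated twice)
```

The enumerated set is exhaustive and non-repetitive by construction, whereas repeated calls to `stochastic_variants` with different seeds give different, possibly overlapping, sets.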
In brief, existing mutagenesis methods that are non-stochastic have been
serviceable
in generating from one to only a very small number of predetermined mutations
per method
application, and thus produce per method application from one to only a few
progeny
molecules that have predetermined molecular structures. Moreover, the types of
mutations
currently available by the application of these non-stochastic methods are
also limited, and
thus so are the types of progeny mutant molecules.
In contrast, existing methods for mutagenesis that are stochastic in nature
have been
serviceable for generating somewhat larger numbers of mutations per method
application -
though in a random fashion & usually with a large but unavoidable contingency
of
undesirable background products. Thus, these existing stochastic methods can
produce per
method application larger numbers of progeny molecules, but that have
undetermined
molecular structures. The types of mutations that can be achieved by
application of these
current stochastic methods are also limited, and thus so are the types of
progeny mutant
molecules.
It is instantly appreciated that there is a need for the development of non-
stochastic
mutagenesis methods that:
1) Can be used to generate large numbers of progeny molecules that have
predetermined molecular structures;
2) Can be used to readily generate more types of mutations;
3) Can produce a correspondingly larger variety of progeny mutant molecules;
4) Produce decreased unwanted background products;
5) Can be used in a manner that is exhaustive of all possibilities; and
6) Can produce progeny molecules in a systematic & non-repetitive way.
The instant invention satisfies all of these needs.
Directed Evolution Supplements Natural Evolution: Natural evolution has
been a springboard for directed or experimental evolution, serving both as a
reservoir of
methods to be mimicked and of molecular templates to be mutagenized. It is
appreciated
that, despite its intrinsic process-related limitations (in the types of
favored &/or allowed
mutagenesis processes) and in its speed, natural evolution has had the
advantage of having
been in process for millions of years and throughout a wide diversity of
environments.
Accordingly, natural evolution (molecular mutagenesis and selection in nature)
has resulted
in the generation of a wealth of biological compounds that have shown
usefulness in certain
commercial applications.
However, it is instantly appreciated that many unmet commercial needs are
discordant with any evolutionary pressure &/or direction that can be found in
nature.
Moreover, it is often the case that when commercially useful mutations would
otherwise be
favored at the molecular level in nature, natural evolution often overrides
the positive
selection of such mutations, e.g. when there is a concurrent detriment to an
organism as a
whole (such as when a favorable mutation is accompanied by a detrimental
mutation).
Additionally, natural evolution is often slow, and favors fidelity in many
types of replication.
Additionally still, natural evolution often favors a path paved mainly by
consecutive
beneficial mutations while tending to avoid a plurality of successive negative
mutations, even
though such negative mutations may prove beneficial when combined, or may lead
- through
a circuitous route - to a final state that is beneficial.
Moreover, natural evolution advances through specific steps (e.g. specific
mutagenesis and selection processes), with avoidance of less favored steps.
For example,
many nucleic acids do not reach close enough proximity to each other in an
operative
environment to undergo chimerization or incorporation or other types of
transfers from one
species to another. Thus, e.g., when sexual intercourse between 2 particular
species is
avoided in nature, the chimerization of nucleic acids from these 2 species is
likewise
unlikely, with parasites common to the two species serving as an example of a
very slow
passageway for inter-molecular encounters and exchanges of DNA. For another
example,
the generation of a molecule causing self toxicity or self lethality or sexual
sterility is
avoided in nature. For yet another example, the propagation of a molecule
having no
particular immediate benefit to an organism is prone to vanish in subsequent
generations of
the organism. Furthermore, e.g., there is no selection pressure for improving
the
performance of a molecule under conditions other than those to which it is
exposed in its
endogenous environment; e.g. a cytoplasmic molecule is not likely to acquire
functional
features extending beyond what is required of it in the cytoplasm. Furthermore
still, the
propagation of a biological molecule is susceptible to any global detrimental
effects -
whether caused by itself or not - on its ecosystem. These and other
characteristics greatly
limit the types of mutations that can be propagated in nature.
On the other hand, directed (or experimental) evolution - particularly as
provided
herein - can be performed much more rapidly and can be directed in a more
streamlined
manner at evolving a predetermined molecular property that is commercially
desirable where
nature does not provide one &/or is not likely to provide. Moreover, the
directed evolution
invention provided herein can provide more wide-ranging possibilities in the
types of steps
that can be used in mutagenesis and selection processes. Accordingly, using
templates
harvested from nature, the instant directed evolution invention provides more
wide-ranging
possibilities in the types of progeny molecules that can be generated and in
the speed at
which they can be generated than nature itself might often be expected to provide in
the same length
of time.
In a particular exemplification, the instantly disclosed directed evolution
methods can
be applied iteratively to produce a lineage of progeny molecules (e.g.
comprising successive
sets of progeny molecules) that would not likely be propagated (i.e.,
generated &/or selected
for) in nature, but that could lead to the generation of a desirable
downstream mutagenesis
product that is not achievable by natural evolution.
Previous Directed Evolution Methods Are Suboptimal:
Mutagenesis has been attempted in the past on many occasions, but by methods
that
are inadequate for the purpose of this invention. For example, previously
described non-
stochastic methods have been serviceable in the generation of only very small
sets of progeny
molecules (comprised often of merely a solitary progeny molecule). By way of
illustration, a
chimeric gene has been made by joining 2 polynucleotide fragments using
compatible sticky
ends generated by restriction enzyme(s), where each fragment is derived from a
separate
progenitor (or parental) molecule. Another example might be the mutagenesis of
a single
codon position (i.e. to achieve a codon substitution, addition, or deletion)
in a parental
polynucleotide to generate a single progeny polynucleotide encoding for a
single site-
mutagenized polypeptide.
Previous non-stochastic approaches have only been serviceable in the
generation of
but one to a few mutations per method application. Thus, these previously
described
non-stochastic methods fail to address one of the central goals of this
invention, namely the
exhaustive and non-stochastic chimerization of nucleic acids. Accordingly,
previous
non-stochastic methods leave untapped the vast majority of the possible point
mutations,
chimerizations, and combinations thereof, which may lead to the generation of
highly
desirable progeny molecules.
In contrast, stochastic methods have been used to achieve larger numbers of
point
mutations and/or chimerizations than non-stochastic methods; for this reason,
stochastic
CA 02413022 2002-12-13
WO 01/96551 PCT/USO1/19367
methods have comprised the predominant approach for generating a set of
progeny molecules
that can be subjected to screening, and amongst which a desirable molecular
species might
hopefully be found. However, a major drawback of these approaches is that -
because of
their stochastic nature - there is a randomness to the exact components in
each set of progeny
molecules that is produced. Accordingly, the experimentalist typically has
little or no idea
what exact progeny molecular species are represented in a particular reaction
vessel prior to
their generation. Thus, when a stochastic procedure is repeated (e.g. in a
continuation of a
search for a desirable progeny molecule), the re-generation and re-screening
of previously
discarded undesirable molecular species becomes a labor-intensive obstruction
to progress,
causing a circuitous - if not circular - path to be taken. The drawbacks of
such a highly
suboptimal path can be addressed by subjecting a stochastically generated set
of progeny
molecules to a labor-incurring process, such as sequencing, in order to
identify their
molecular structures, but even this is an incomplete remedy.
Moreover, current stochastic approaches are highly unsuitable for
comprehensively or
exhaustively generating all the molecular species within a particular grouping
of mutations,
for attributing functionality to specific structural groups in a template
molecule (e.g. a
specific single amino acid position or a sequence comprised of two or more
amino acids
positions), and for categorizing and comparing specific grouping of mutations.
Accordingly,
current stochastic approaches do not inherently enable the systematic
elimination of
unwanted mutagenesis results, and are, in sum, burdened by too many inherent
shortcomings to be optimal for directed evolution.
In a non-limiting aspect, the instant invention addresses these problems by
providing
non-stochastic means for comprehensively and exhaustively generating all
possible point
mutations in a parental template. In another non-limiting aspect, the instant
invention further
provides means for exhaustively generating all possible chimerizations within
a group of
chimerizations. Thus, the aforementioned problems are solved by the instant
invention.
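By way of non-limiting illustration only, the exhaustive enumeration of point
mutations in a parental template can be sketched in Python. This is a minimal
sketch under stated assumptions, not the method of the invention: the function
name single_site_variants, the 5-residue template, and the restriction to the
20 standard amino acids are hypothetical choices made for this example.

```python
# Illustrative sketch: enumerate every single-residue substitution of a template.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (assumption)


def single_site_variants(template):
    """Yield (position, original, replacement, variant) for each possible
    single amino acid substitution in the template (positions are 1-based)."""
    for index, original in enumerate(template):
        for replacement in AMINO_ACIDS:
            if replacement == original:
                continue  # the unchanged residue is not a mutation
            variant = template[:index] + replacement + template[index + 1:]
            yield index + 1, original, replacement, variant


template = "MKTAY"  # hypothetical 5-residue parental template
variants = list(single_site_variants(template))
print(len(variants))  # 5 positions x 19 substitutions each
```

Because the enumeration is deterministic, the composition of the resulting set
is known before any molecule is made, which is the contrast drawn above with
stochastic methods.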
Specific shortfalls in the technological landscape addressed by this invention
include:
1) Site-directed mutagenesis technologies, such as sloppy or low-fidelity PCR,
are
ineffective for systematically achieving at each position (site) along a
polypeptide sequence
the full (saturated) range of possible mutations (i.e. all possible amino acid
substitutions).
2) There is no relatively easy systematic means for rapidly analyzing the
large
amount of information that can be contained in a molecular sequence and in the
potentially colossal number of progeny molecules that could conceivably be
obtained by the directed evolution of one or more molecular templates.
3) There is no relatively easy systematic means for providing comprehensive
empirical information relating structure to function for molecular positions.
4) There is no easy systematic means for incorporating internal controls, such
as
positive controls, for key steps in certain mutagenesis (e.g. chimerization)
procedures.
5) There is no easy systematic means to select for a specific group of progeny
molecules, such as full-length chimeras, from among smaller partial sequences.
An exceedingly large number of possibilities exist for the purposeful and
random
combination of amino acids within a protein to produce useful hybrid proteins
and their
corresponding biological molecules encoding for these hybrid proteins, i.e.,
DNA, RNA.
Accordingly, there is a need to produce and screen a wide variety of such
hybrid proteins for
a desirable utility, particularly widely varying random proteins.
The complexity of an active sequence of a biological macromolecule (e.g.,
polynucleotides, polypeptides, and molecules that are comprised of both
polynucleotide and
polypeptide sequences) has been called its information content ("IC"), which
has been
defined as the resistance of the active protein to amino acid sequence
variation (calculated
from the minimum number of invariable amino acids (bits) required to describe
a family of
related sequences with the same function). Proteins that are more sensitive to
random
mutagenesis have a high information content.
Molecular biology developments, such as molecular libraries, have allowed the
identification of quite a large number of variable bases, and even provide
ways to select
functional sequences from random libraries. In such libraries, most residues
can be varied
(although typically not all at the same time) depending on compensating
changes in the
context. Thus, while a 100 amino acid protein can contain only 2,000 different
mutations, 20^100 sequence combinations are possible.
Information density is the IC per unit length of a sequence. Active sites of
enzymes tend to have a high information density. By contrast, flexible linkers
in enzymes have a low information density.
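As a rough, non-limiting illustration of these two quantities, the following
Python sketch counts the invariant positions in a family of aligned sequences
and divides by the sequence length. It is only a sketch under assumptions:
treating IC as the simple count of strictly invariant positions in an ungapped
alignment is a simplification, and the function names and the three-member
family are hypothetical.

```python
def invariant_positions(aligned_family):
    """Return the 0-based positions at which every sequence in the family
    carries the same residue. Sequences must have equal length."""
    length = len(aligned_family[0])
    if any(len(seq) != length for seq in aligned_family):
        raise ValueError("sequences must be aligned to equal length")
    return [
        i for i in range(length)
        if len({seq[i] for seq in aligned_family}) == 1
    ]


def information_density(aligned_family):
    """Invariant positions per unit length (a simplified stand-in for IC/length)."""
    return len(invariant_positions(aligned_family)) / len(aligned_family[0])


# Hypothetical family of related sequences assumed to share one function.
family = ["MKTAYG", "MKSAYG", "MRTAYG"]
print(invariant_positions(family))
print(information_density(family))
```

Applied to a sub-region such as an active site rather than to the whole
sequence, the same ratio would be expected to be higher than for a flexible
linker.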
Current methods in widespread use for creating alternative proteins in a
library
format are error-prone polymerase chain reactions and cassette mutagenesis, in
which the
specific region to be optimized is replaced with a synthetically mutagenized
oligonucleotide. In both cases, a substantial number of mutant sites are
generated around
certain sites in the original sequence.
Error-prone PCR uses low-fidelity polymerization conditions to introduce a low
level of point mutations randomly over a long sequence. In a mixture of
fragments of
unknown sequence, error-prone PCR can be used to mutagenize the mixture. The
published error-prone PCR protocols suffer from a low processivity of the
polymerase.
Therefore, the protocol is unable to result in the random mutagenesis of an
average-sized
gene. This inability limits the practical application of error-prone PCR. Some
computer
simulations have suggested that point mutagenesis alone may often be too
gradual to allow
the large-scale block changes that are required for continued and dramatic
sequence
evolution. Further, the published error-prone PCR protocols do not allow for
amplification of DNA fragments greater than 0.5 to 1.0 kb, limiting their
practical
application. In addition, repeated cycles of error-prone PCR can lead to an
accumulation
of neutral mutations with undesired results, such as affecting a protein's
immunogenicity
but not its binding affinity.
In oligonucleotide-directed mutagenesis, a short sequence is replaced with a
synthetically mutagenized oligonucleotide. This approach does not generate
combinations
of distant mutations and is thus not combinatorial. The limited library size
relative to the
vast sequence length means that many rounds of selection are unavoidable for
protein
optimization. Mutagenesis with synthetic oligonucleotides requires sequencing
of
individual clones after each selection round followed by grouping them into
families,
arbitrarily choosing a single family, and reducing it to a consensus motif.
Such a motif is resynthesized and reinserted into a single gene, followed by
additional selection. This stepwise process constitutes a statistical
bottleneck, is labor intensive, and is not practical for many rounds of
mutagenesis.
Error-prone PCR and oligonucleotide-directed mutagenesis are thus useful for
single cycles of sequence fine-tuning, but rapidly become too limiting when
they are
applied for multiple cycles.
Another limitation of error-prone PCR is that the rate of down-mutations grows
with the information content of the sequence. As the information content,
library size, and
mutagenesis rate increase, the balance of down-mutations to up-mutations will
statistically
prevent the selection of further improvements (statistical ceiling).
In cassette mutagenesis, a sequence block of a single template is typically
replaced
by a (partially) randomized sequence. Therefore, the maximum information
content that
can be obtained is statistically limited by the number of random sequences
(i.e., library
size). This eliminates other sequence families which are not currently best,
but which may
have greater long term potential.
Also, mutagenesis with synthetic oligonucleotides requires sequencing of
individual clones after each selection round. Thus, such an approach is
tedious and
impractical for many rounds of mutagenesis.
Thus, error-prone PCR and cassette mutagenesis are best suited, and have been
widely used, for fine-tuning areas of comparatively low information content.
One
apparent exception is the selection of an RNA ligase ribozyme from a random
library
using many rounds of amplification by error-prone PCR and selection.
In nature, the evolution of most organisms occurs by natural selection and
sexual
reproduction. Sexual reproduction ensures mixing and combining of the genes in
the
offspring of the selected individuals. During meiosis, homologous chromosomes
from the
parents line up with one another and cross-over part way along their length,
thus randomly
swapping genetic material. Such swapping or shuffling of the DNA allows
organisms to
evolve more rapidly.
In recombination, because the inserted sequences were of proven utility in a
homologous environment, the inserted sequences are likely to still have
substantial
information content once they are inserted into the new sequence.
Theoretically there are 2,000 different single mutants of a 100 amino acid
protein.
However, a protein of 100 amino acids has 20^100 possible sequence
combinations, a
number which is too large to exhaustively explore by conventional methods. It
would be
advantageous to develop a system which would allow generation and screening of
all of
these possible combination mutations.
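The scale of the two figures in the preceding paragraph can be checked with a
short calculation. The Python sketch below is illustrative only; the variable
names are hypothetical, and the count of 1,900 is an added observation that
applies if the unchanged residue at each position is excluded.

```python
# Illustrative arithmetic for a 100 amino acid protein.
NUM_AMINO_ACIDS = 20
PROTEIN_LENGTH = 100

# Figure used in the text: 20 choices at each of 100 positions.
single_mutants_as_stated = PROTEIN_LENGTH * NUM_AMINO_ACIDS

# If the original residue at each position is excluded, 19 choices remain.
single_substitutions = PROTEIN_LENGTH * (NUM_AMINO_ACIDS - 1)

# Every position varied independently: 20^100 sequences.
total_sequences = NUM_AMINO_ACIDS ** PROTEIN_LENGTH

print(single_mutants_as_stated)   # 2000
print(single_substitutions)       # 1900
print(len(str(total_sequences)))  # number of decimal digits in 20^100
```

The full sequence space has 131 decimal digits (about 1.27 x 10^130), which is
why it cannot be explored exhaustively by conventional methods.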
Some workers in the art have utilized an in vivo site specific recombination
system
to generate hybrids that combine light chain antibody genes with heavy chain
antibody genes for expression in a phage system. However, their system relies
on specific sites of recombination and is limited accordingly. Simultaneous
mutagenesis of antibody CDR regions in single chain antibodies (scFv) by
overlapping extension and PCR has been reported.
Others have described a method for generating a large population of multiple
hybrids using random in vivo recombination. This method requires the
recombination of
two different libraries of plasmids, each library having a different
selectable marker. The
method is limited to a finite number of recombinations equal to the number of
selectable
markers existing, and produces a concomitant linear increase in the number of
marker
genes linked to the selected sequence(s).
In vivo recombination between two homologous, but truncated, insect-toxin
genes
on a plasmid has been reported as a method of producing a hybrid gene. The in
vivo
recombination of substantially mismatched DNA sequences in a host cell having
defective
mismatch repair enzymes, resulting in hybrid molecule formation, has been
reported.
C - SUMMARY OF THE INVENTION
This invention relates generally to the field of cellular and whole organism
engineering. Specifically, this invention relates to a cellular
transformation, directed
evolution, and screening method for creating novel transgenic organisms having
desirable
properties. Thus in one aspect, this invention relates to a method of
generating a
transgenic organism, such as a microbe or a plant, having a plurality of
traits that are
differentially activatable.
In one embodiment, this invention is directed to a method of producing an
improved organism having a desirable trait by: a) obtaining an initial
population of
organisms, b) generating a set of mutagenized organisms, such that when all
the genetic
mutations in the set of mutagenized organisms are taken as a whole, there is
represented a
set of substantial genetic mutations, and c) detecting the presence of said
improved
organism. This invention provides that any of steps a), b), and c) can be
further repeated
in any particular order and any number of times; accordingly, this invention
specifically
provides methods comprised of any iterative combination of steps a), b), and
c), with a
number of iterations.
In another embodiment, this invention is directed to a method of producing an
improved organism having a desirable trait by: a) obtaining an initial
population of
organisms, which can be a clonal population or otherwise, b) generating a set
of
mutagenized organisms each having at least one genetic mutation, such that
when all the
genetic mutations in the set of mutagenized organisms are taken as a whole,
there is
represented a set of substantial genetic mutations, c) detecting the
manifestation of at least
two genetic mutations, and d) introducing at least two detected genetic
mutations into one
organism. Additionally, this invention provides that any of steps a), b), c),
and d) can be
further repeated in any particular order and any number of times; accordingly,
this
invention specifically provides methods comprised of any iterative combination
of steps
a), b), c), and d), where the total number of iterations can be from one up to
one million,
including specifically every integer value in between.
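By way of non-limiting illustration, the iterative combination of steps a),
b), c), and d) can be sketched as a simple loop. The Python sketch below is a
toy model under loudly stated assumptions and is not an implementation of the
claimed method: an organism is represented as a set of mutated locus numbers,
and the names mutagenize, detect, combine, evolve, and is_desirable are all
hypothetical.

```python
def mutagenize(parent, loci):
    """Step b): one mutant per locus, each carrying the parent's mutations
    plus one new mutation, so that the set as a whole covers every locus."""
    return [parent | {locus} for locus in loci if locus not in parent]


def detect(mutants, parent, is_desirable):
    """Step c): collect the new mutations whose manifestation is desirable."""
    return {m for org in mutants for m in org - parent if is_desirable(m)}


def combine(parent, detected):
    """Step d): introduce the detected mutations into one organism."""
    return parent | detected


def evolve(loci, is_desirable, rounds):
    parent = frozenset()  # step a): hypothetical unmutated starting organism
    for _ in range(rounds):
        mutants = mutagenize(parent, loci)
        detected = detect(mutants, parent, is_desirable)
        if len(detected) < 2:  # the embodiment combines at least two mutations
            break
        parent = combine(parent, detected)
    return parent


# Hypothetical screen: mutations at loci divisible by 5 are "desirable".
result = evolve(range(1, 21), lambda locus: locus % 5 == 0, rounds=3)
print(sorted(result))
```

In this toy model the loop stops after the first round, because no further
desirable mutations remain to be detected; a real screen would replace
is_desirable with an assay for the trait of interest.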
In a preferred aspect of embodiments specified herein the step of b)
generating a
second set of mutagenized organisms is comprised of generating a plurality of
organisms,
each of which organisms has a particular transgenic mutation.
As used herein, "generating a set of mutagenized organisms having genetic
mutations" can be achieved by any means known in the art to mutagenize,
including any radiation known to mutagenize, such as ionizing and ultraviolet
radiation. Further examples of
examples of
serviceable mutagenizing methods include site-saturation mutagenesis,
transposon-based
methods, and homologous recombination.
"Combining" means incorporating a plurality of different genetic mutations in
the
genetic makeup (e.g. the genome) of the same organism; methods to achieve this
"combining" step include sexual recombination, homologous recombination, and
transposon-based methods.
As used herein, an "initial population of organisms" means a "working
population of organisms", which refers simply to a population of organisms
with which
one is working, and which is comprised of at least one organism. An "initial
population of organisms" can be a clonal population or otherwise.
Accordingly, in step 1) an "initial population of organisms" may be a
population
of multicellular organisms or of unicellular organisms or of both. An "initial
population of
organisms" may be comprised of unicellular organisms or multicellular
organisms or both.
An "initial population of organisms" may be comprised of prokaryotic organisms
or
eukaryotic organisms or both. This invention provides that an "initial
population of
organisms" is comprised of at least one organism, and preferred embodiments
include at
least that .
By "organism" is meant any biological form or thing that is capable of self
replication or replication in a host. Examples of "organisms" include the
following kinds
of organisms (which kinds are not necessarily mutually-exclusive): animals,
plants,
insects, cyanobacteria, microorganisms, fungi, bacteria, eukaryotes,
prokaryotes,
mycoplasma, viral organisms (including DNA viruses, RNA viruses), and prions.
Non-limiting particularly preferred examples of kinds of "organisms" also
include
Archaea (archaebacteria) and Bacteria (eubacteria). Non-limiting examples of
Archaea
(archaebacteria) include Crenarchaeota, Euryarchaeota, and Korarchaeota.
Non-limiting examples of Bacteria (eubacteria) include Aquificales, CFB/Green
sulfur bacteria group, Chlamydiales/Verrucomicrobia group, Chrysiogenes group,
Coprothermobacter group, Cyanobacteria & chloroplasts,
Cytophaga/Flexibacter/Bacteroides group, Dictyoglomus
group, Fibrobacter/Acidobacteria group, Firmicutes, Flexistipes group,
Fusobacteria,
Green non-sulfur bacteria, Nitrospira group, Planctomycetales, Proteobacteria,
Spirochaetales, Synergistes group, Thermodesulfobacterium group,
Thermotogales,
Thermus/Deinococcus group. As non-limiting examples, particularly preferred
kinds of
organisms include Aquifex, Aspergillus, Bacillus, Clostridium, E. coli,
Lactobacillus,
Mycobacterium, Pseudomonas, Streptomyces, and Thermotoga. As additional non-
limiting examples, particularly preferred organisms include cultivated
organisms such as
CHO, VERO, BHK, HeLa, COS, MDCK, Jurkat, HEK-293, and WI38. Particularly
preferred non-limiting examples of organisms further include host organisms
that are
serviceable for the expression of recombinant molecules. Organisms further
include
primary cultures (e.g. cells from harvested mammalian tissues), immortalized
cells, all
cultivated and culturable cells and multicellular organisms, and all
uncultivated and
unculturable cells and multicellular organisms.
In a preferred embodiment, knowledge of genomic information is useful for
performing the claimed methods; thus, this invention provides the following as
preferred
but non-limiting examples of organisms that are particularly serviceable for
this invention,
because there is a significant amount of - if not complete - genomic sequence
information
(in terms of primary sequence &/or annotation) for these organisms: Human,
Insect (e.g.
Drosophila melanogaster), Higher plants (e.g. Arabidopsis thaliana), Protozoan
(e.g.
Plasmodium falciparum), Nematode (e.g. Caenorhabditis elegans), Fungi (e.g.
Saccharomyces cerevisiae), Proteobacteria gamma subdivision (e.g. Escherichia
coli K-12, Haemophilus influenzae Rd, Xylella fastidiosa 9a5c, Vibrio cholerae
El Tor N16961, Pseudomonas aeruginosa PAO1, Buchnera sp. APS), Proteobacteria
beta subdivision (e.g. Neisseria meningitidis MC58 (serogroup B), Neisseria
meningitidis Z2491 (serogroup A)), Proteobacteria other subdivisions (e.g.
Helicobacter pylori 26695, Helicobacter pylori J99, Campylobacter jejuni
NCTC 11168, Rickettsia prowazekii), Gram-positive bacteria (e.g. Bacillus
subtilis, Mycoplasma genitalium, Mycoplasma pneumoniae, Ureaplasma
urealyticum, Mycobacterium tuberculosis H37Rv), Chlamydia (e.g. Chlamydia
trachomatis serovar D, Chlamydia muridarum (Chlamydia trachomatis MoPn),
Chlamydia pneumoniae CWL029, Chlamydia pneumoniae AR39, Chlamydia pneumoniae
J138), Spirochete (e.g. Borrelia burgdorferi B31, Treponema pallidum),
Cyanobacteria (e.g. Synechocystis sp. PCC6803), Radioresistant bacteria (e.g.
Deinococcus radiodurans R1), Hyperthermophilic bacteria (e.g. Aquifex aeolicus
VF5, Thermotoga maritima MSB8), and Archaea (e.g. Methanococcus jannaschii,
Methanobacterium thermoautotrophicum deltaH, Archaeoglobus fulgidus,
Pyrococcus horikoshii OT3, Pyrococcus abyssi, Aeropyrum pernix K1).
Non-limiting particularly preferred examples of kinds of plant "organisms"
include those
listed in Table 1.
Table 1. Non-limiting examples of plant organisms and sources of transgenic
molecules
(e.g. nucleic acids & nucleic acid products)
 1. Alfalfa                         39. Pepper
 2. Amelanchier laevis              40. Persimmon
 3. Apple                           41. Petunia
 4. Arab. thaliana                  42. Pine
 5. Arabidopsis                     43. Pineapple
 6. Aspergillus flavus              44. Pink bollworm
 7. Barley                          45. Plum
 8. Beet                            46. Poplar
 9. Belladonna                      47. Potato
10. Brassica oleracea               48. Pseudomonas
11. Carrot                          49. Pseudomonas putida
12. Chrysanthemum                   50. Pseudomonas syringae
13. Cichorium intybus               51. Rapeseed
14. Clavibacter                     52. Rhizobium
15. Clavibacter xyli                53. Rhizobium etli
16. Coffee                          54. Rhizobium fredii
17. Corn                            55. Rhizobium leguminosarum
18. Cotton                          56. Rhizobium meliloti
19. Cranberry                       57. Rice
20. Creeping bentgrass              58. Rubus idaeus
21. Cryphonectria parasitica        59. Spruce
22. Eggplant                        60. Soybean
23. Festuca arundinacea             61. Squash
24. Fusarium graminearum            62. Squash-cucumber
25. Fusarium moniliforme            63. Squash-cucurbita texana
26. Fusarium sporotrichioides       64. Strawberry
27. Gladiolus                       65. Sugarcane
28. Grape                           66. Sunflower
29. Heterorhabditis bacteriophora   67. Sweet potato
30. Kentucky bluegrass              68. Sweetgum
31. Lettuce                         69. TMV
32. Melon                           70. Tobacco
33. Oat                             71. Tomato
34. Onion                           72. Walnut
35. Papaya                          73. Watermelon
36. Pea                             74. Wheat
37. Peanut                          75. Xanthomonas
38. Pelargonium                     76. Xanthomonas campestris
As used herein, the meaning of "generating a set of mutagenized organisms
having genetic mutations" includes the steps of substituting, deleting, as
well as
introducing a nucleotide sequence into an organism; and this invention
provides that a nucleotide sequence that is serviceable for this purpose may
be single-stranded or double-stranded, and that its length may be from one
nucleotide up to 10,000,000,000 nucleotides, including specifically every
integer value in between.
A mutation in an organism includes any alteration in the structure of one or
more
molecules that encode the organism. These molecules include nucleic acid, DNA,
RNA,
prionic molecules, and may be exemplified by a variety of molecules in an
organism such
as a DNA that is genomic, episomal, or nucleic, or by a nucleic acid that is
vectoral (e.g.
viral, cosmid, phage, phagemid).
In one aspect, as used herein, a "set of substantial genetic mutations" is
preferably a disruption (e.g. a functional knock-out) of at least about 15 to
about 150,000
genomic locations or nucleotide sequences (e.g. genes, promoters, regulatory
sequences,
codons etc.), including specifically every integer value in between. In
another aspect, as
used herein, a "set of substantial genetic mutations" is preferably an
alteration in an
expression level (e.g. decreased or increased expression level) or an
alteration in the
expression pattern (e.g. throughout a period of time) of at least about 15 to
about 150,000
genes, including specifically every integer value in between. Corresponding to
another
aspect, as used herein, a "set of substantial genetic mutations" is preferably
an alteration
in an expression level (e.g. decreased or increased expression level) or an
alteration in the
expression pattern (e.g. throughout a period of time) of at least about 15 to
about 150,000
gene products &/or phenotypes &/or traits, including specifically every
integer value in
between.
In another aspect, as used herein, a "set of substantial genetic mutations"
with
respect to an organism (or type of organism) is preferably a disruption (e.g.
a functional
knock-out) of at least about 1% to about 100% of genomic locations or
nucleotide
sequences (e.g. genes, promoters, regulatory sequences, codons etc.) in the
organism (or
type of organism), including specifically percentages of every integer value
in between. In
another aspect, as used herein, a "set of substantial genetic mutations" is
preferably an
alteration in an expression level (e.g. decreased or increased expression
level) or an
alteration in the expression pattern (e.g. throughout a period of time) of at
least about 1%
to about 100% of genes in an organism (or type of organism), including
specifically
percentages of every integer value in between. Corresponding to another
aspect, as used
herein, a "set of substantial genetic mutations" is preferably an alteration
in an
expression level (e.g. decreased or increased expression level) or an
alteration in the
expression pattern (e.g. throughout a period of time) of at least about 1% to
about 100%
of the gene products &/or phenotypes &/or traits of an organism (or type of
organism),
including specifically every integer value in between.
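The count-based and percentage-based aspects of a "set of substantial genetic
mutations" given above can be expressed as a simple check. The Python sketch
below is illustrative only and rests on assumptions: the function name
is_substantial and its parameters are hypothetical, the word "about" in the
text is ignored so that the bounds are treated as exact, and either aspect
alone is taken as sufficient.

```python
def is_substantial(n_altered, n_total=None):
    """Return True if the number of altered locations, genes, or traits falls
    in the count range (15 to 150,000) or, when the total for the organism is
    known, in the percentage range (1% to 100%)."""
    by_count = 15 <= n_altered <= 150_000
    by_fraction = (
        n_total is not None
        and n_total > 0
        and 0.01 <= n_altered / n_total <= 1.0
    )
    return by_count or by_fraction


# Hypothetical examples.
print(is_substantial(10))        # below the count range, total unknown
print(is_substantial(15))        # lower bound of the count range
print(is_substantial(10, 500))   # 2% of a 500-gene organism
print(is_substantial(3, 4000))   # 0.075%, below both ranges
```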
In yet another aspect, as used herein, a "set of substantial genetic
mutations" is
preferably an introduction or deletion of at least about 15 to 150,000 genes,
promoters, or other nucleotide sequences (where each sequence is from 1 base
to 10,000,000 bases), including specifically every integer value in between.
For example, one can introduce a library of at least about 15 to 150,000
nucleotides (genes or promoters) produced by "site-saturation mutagenesis"
&/or by "ligation reassembly" (including any specific aspect thereof provided
herein) into an "initial population of organisms".
It is provided that wherever the manipulation of a plurality of "genes" is
mentioned
herein, gene pathways (e.g. that ultimately lead to the production of small
molecules) are
also included. It is appreciated herein that knocking-out, altering expression
level, and
altering expression pattern can be achieved, by non-limiting exemplification,
by
mutagenizing a nucleotide sequence corresponding to a gene as well as a
corresponding promoter that affects the expression of the gene.
As used herein, a "mutagenized organism" includes any organism that has been
altered by a genetic mutation.
A "genetic mutation" can be, by way of non-limiting and non-mutually exclusive
exemplification, any change in the nucleotide sequence (DNA or RNA) with
respect to genomic, extra-genomic, episomal, mitochondrial, and any nucleotide
sequence associated with (e.g. contained within or considered part of) an
organism.
According to this invention, detecting the manifestation of a "genetic
mutation"
means "detecting the manifestation of a detectable parameter", including but
not
limited to a change in the genomic sequence. Accordingly, this invention
provides that a
step of sequencing (&/or annotating) of an organism's genomic DNA is
necessary for
some methods of this invention, and exemplary but non-limiting aspects of this
sequencing
(&/or annotating) step are provided herein.
A detectable "trait", as used herein, is any detectable parameter associated
with
the organism. Accordingly, such a detectable "parameter" includes, by way of
non-
limiting exemplification, any detectable "nucleotide knock-in", any detectable
"nucleotide
knock-outs", any detectable "phenotype", and any detectable "genotype". By way
of
further illustration, a "trait" includes any substance produced or not
produced by the
organism. Accordingly, a "trait" includes viability or non-viability,
behavior, growth rate,
size, morphology. "Trait" includes increased (or alternatively decreased)
expression of a
gene product or gene pathway product. "Trait" also includes small molecule
production (including vitamins, antibiotics), herbicide resistance, drought
resistance, pest resistance, production of any recombinant biomolecule (e.g.
vaccines, enzymes, protein therapeutics, chiral enzymes). Additional examples
of serviceable traits for this invention are shown in Table 2.
TABLE 2 - Non-limiting examples of serviceable genes, gene products,
phenotypes, or
traits according to the methods of this invention (e.g. knockouts, knockins,
increased
or decreased expression level, increased or decreased expression pattern)
Table 2 - Part 1. Non-limiting examples of genes or gene products
  1. 17 kDa protein                                    53. Cecropin
  2. 3-hydroxy-3-methylglutaryl Coenzyme A reductase   54. Cecropin B
  3. 4-Coumarate:CoA ligase knockout                   55. Cellulose binding protein
  4. 60 kDa protein                                    56. Chalcone synthase knockout
  5. Ac transposable element                           57. Chitinase
  6. ACC deaminase                                     58. Chitobiosidase
  7. ACC oxidase knockout                              59. Chloramphenicol acetyltransferase
  8. ACC synthase                                      60. Cholera toxin B
  9. ACC synthase knockout                             61. Choline oxidase
 10. Acetohydroxyacid synthase variant                 62. Cinnamate 4-hydroxylase
 11. Acetolactate synthase                             63. Cinnamate 4-hydroxylase knockout
 12. Acetyl CoA carboxylase                            64. Coat protein
 13. ACP acyl-ACP thioesterase                         65. Coat protein knockout
 14. ACP thioesterase                                  66. Conglycinin
 15. Acyl CoA reductase                                67. CryIA
 16. Acyl-ACP knockout                                 68. CryIAb
 17. Acyl-ACP desaturase                               69. CryIAc
 18. Acyl-ACP desaturase knockout                      70. CryIB
 19. Acyl-ACP thioesterase                             71. CryIIA
 20. ADP glucose pyrophosphorylase                     72. CryIIIA
 21. ADP glucose pyrophosphorylase knockout            73. CryVIA
 22. Agglutinin                                        74. Cyclin dependent kinase
 23. Aleurone 1                                        75. Cyclodextrin glycosyltransferase
 24. Alpha hordothionin                                76. Cylindrical inclusion protein
 25. Alpha-amylase                                     77. Cystathionine synthase
 26. Alpha-hemoglobin                                  78. Delta-12 desaturase
 27. Aminoglycoside 3'-adenylyltransferase             79. Delta-12 desaturase knockout
 28. Amylase                                           80. Delta-12 saturase
 29. Anionic peroxidase                                81. Delta-12 saturase knockout
 30. Antibody                                          82. Delta-15 desaturase
 31. Antifungal protein                                83. Delta-15 desaturase knockout
 32. Antithrombin                                      84. Delta-9 desaturase
 33. Antitrypsin                                       85. Delta-9 desaturase knockout
 34. Antiviral protein                                 86. Deoxyhypusine synthase (DHS)
 35. Aspartokinase                                     87. Deoxyhypusine synthase knockout
 36. Attacin E                                         88. Diacylglycerol acetyl transferase
 37. B1 regulatory gene                                89. Dihydrodipicolinate synthase
 38. B-1,3-glucanase knockout                          90. Dihydrofolate reductase
 39. B-1,4-endoglucanase knockout                      91. Diphtheria toxin A
 40. Bacteropsin                                       92. Disease resistance response gene 49
 41. Barnase                                           93. Double stranded ribonuclease
 42. Barstar                                           94. Ds transposable element
 43. Beta-hemoglobin                                   95. Elongase
 44. B-glucuronidase                                   96. EPSPS
 45. C1 knockout                                       97. Ethylene forming enzyme knockout
 46. C1 regulatory gene                                98. Ethylene receptor protein
 47. C2 knockout                                       99. Ethylene receptor protein knockout
 48. C3 knockout                                      100. Fatty acid elongase
 49. Caffeate O-methyltransferase                     101. Fluorescent protein
 50. Caffeate O-methyltransferase knockout            102. G glycoprotein
 51. Caffeoyl CoA O-methyltransferase knockout        103. Galactanase
 52. Casein                                           104. Galanthus nivalis agglutinin
Table 2 - Part 1.(continued) Non-limiting examples of transgenic genes & gene
knockouts
105.Genome-linked protein 157.Omega 3 desaturease knockout
_
~
t06.Glucanase I58.Omega 6 desaturase
107.Glucanase knockout 159.Omega 6 desaturase knockout
108.Glucose oxidase 160.O-methyltransferase
109.Glutamate dehydrogenase 161.Osmotin
i Glutamine binding protein 162.Oxalate oxidase
10.
11 Glutamine synthetase 163.Par locus
I.
112.Glutenin 164.Pathogenesis protein 1 a
113.Glycerol-3-phosphate acetyl165.Pectate lyase
transferase
I Glyphosate exidoreductase 166.Pectin esterase
14.
115.Glyphosate oxidoreductase 167.Pectin esterase knockout
116.Green fluorescent protein 168.Pectin methylesterase
I Helper component 169.Pectin methylesterase knockout
17.
118.Hemicellulase 170.Pentenlypyrophosphate isomerase
119.Hup locus 171.Phosphinothricin
120.Hygromycin phosphotransferase172.Phosphinothricin acetyl
transferase
121.Hyoscamine 6B-hydroxylase 173.Phytochrome A
122.IAA monooxygenase 174.Phytoene synthase
123.Invertase 175.Phleomycin binding protein
124.Invertase knockout 176.Polygalacturonase
125.Isopentenyl transferase 177.Polygalacturonase knockout
126.Ketoacyl-ACP synthase 178.PolygaIacturonase inhibitor
protein
127.Ketoacyl-ACP synthase knockout179.Prf regulatory gene
128.Larval serum protein I Prosystemin
80.
129.Leafy homeotic regulatory 181.Protease
gene
130.Lectin 182.Protein A
131.Lignin peroxidase 183.Protein kinase
132.Luciferase 184.Proteinase inhibitor I
133.Lysine-2 gene 185.PtiS transcription factor
134.Lysophosphatidic acid acetyl186.R regulatory gene
transferase
135.Lysozyme 187.Receptor kinase
136.Maliinlin 188.Recombinase
137.Male sterility protein 189.Reductase
138.Metallothionein 190.Replicase
139.Modifie ethylene receptor 191.Resveratrol synthase
protein
140.Modified ethylene receptor 192.Ribonuclease
protein knockout
141.Monooxygenase I ro 1 c
93.
142.Movement protein 194.Rol hormone gene
143.Movement protein nonfunctional195.S-adenosylmethione decarboxylase
144.N gene for TMV resistance 196.S-adenosylmethione hydrolase
145.N-acetyl glucosidase 197.S-adenosyimethionine transferase
146.Nitrilase 198.Salicylate hydroxylase
147.Nopaline synthase 199.Satellite RNA
148.Notch 200.Seed storage protein
149.NptII 201.Serine-threonine protein kinase
150.Nuclear inclusion protein a 202.Serum albumin
151.Nuclear inclusion protein b 203.Shrunken 2
152.Nucleocapsid 204.Sorbitol dehydrogenase
153.Nucleoprotein 205.Sorbitol synthase
CA 02413022 2002-12-13
WO 01/96551 PCT/US01/19367
Table 2 - Part 1.(continued) Non-limiting examples of transgenic genes & gene
knockouts
209.Systemic acquired resistance gene 8.2 219.Trichosanthin
210.Tetracycline binding protein 220.Trifolitoxin
211.Thioesterase (x2) 221.Trypsin inhibitor
212.Thiolase 222.T-URF13 mitochondrial
213.TobRB7 223.UDP glucose glucosyltransferase
214.Transcriptional activator 224.Violaxanthin de-epoxidase
215.Transposon Tn5 225.Violaxanthin de-epoxidase knockout
216.Trehalase 226.Wheat germ agglutinin
217.Trehalase knockout 227.Xanthosine-N7-methyltransferase knockout
218.Trichodiene synthase 228.Zein storage protein
Table 2 - Part 2. Non-limiting examples of input traits/phenotypes
1. 2,4-D tolerant 52. Flowering time altered
2. Alternaria resistant 53. Frogeye leaf spot resistant
3. Altered amino acid composition 54. Fruit ripening altered
4. Alternaria solani resistant 55. Fruit ripening delayed
5. Ammonium assimilation increased 56. Fruit rot resistant
6. AMV resistant 57. Fruit solids increased
7. Aphid resistant 58. Fruit sweetness increased
8. Apple scab resistant 59. Fungal post-harvest resistant
9. Aspergillus resistant 60. Fungal resistant
10. B-1,4-endoglucanase 61. Fungal resistant general
11. Bacterial leaf blight resistant 62. Fusarium resistant
12. Bacterial speck resistant 63. Glyphosate tolerant
13. BCTV resistant 64. Growth rate altered
14. Blackspot bruise resistant 65. Growth rate reduced
15. BLRV resistant 66. Heat stable glucanase produced
16. BNYVV resistant 67. Hordothionin produced
17. Botrytis cinerea resistant 68. Imidazolinone tolerant
18. Botrytis resistant 69. Insect resistant general
19. BPMV resistant 70. Kanamycin resistant
20. Bromoxynil tolerant 71. Lepidopteran resistant
21. BYDV resistant 72. Lesser cornstalk borer resistant
22. BYMV resistant 73. LMV resistant
23. Carbohydrate metabolism altered 74. Loss of systemic resistance
24. Cell wall altered 75. Male sterile
25. Chlorsulfuron tolerant 76. Marssonina resistant
26. Clavibacter resistant 77. MCDV resistant
27. CLRV resistant 78. MCMV resistant
28. CMV resistant 79. MDMV resistant
29. Cold tolerant 80. MDMV-B resistant
30. Coleopteran resistant 81. Mealybug wilt virus resistant
31. Colletotrichum resistant 82. Melampsora resistant
32. Colorado potato beetle resistant 83. Melodgyne resistant
33. Constitutive expression of glutamine synthetase 84. Methotrexate resistant
34. Corynebacterium sepedonicum resistant 85. Mexican Rice Borer resistant
35. Cottonwood leaf beetle resistant 86. Nucleocapsid protein produced
36. Crown gall resistant 87. Oblique banded leafroller resistant
37. Crown rot resistant 88. PEMV resistant
38. Cucumovirus resistant 89. PeSV resistant
39. Cutting rootability increased 90. Phoma resistant
40. Downy mildew resistant 91. Phosphinothricin tolerant
41. Drought tolerant 92. Phratora leaf beetle resistant
42. Erwinia carotovora resistant 93. Phytophthora resistant
43. Ethylene production reduced 94. PLRV resistant
44. European Corn Borer resistant 95. Polyamine metabolism altered
45. Female sterile 96. Potyvirus resistant
46. Fenthion susceptible 97. Powdery mildew resistant
47. Fertility altered 98. PPV resistant
48. Fire blight resistant 99. Pratylenchus vulnus resistant
49. Flower and fruit abscission reduced 100.Proteinase inhibitors level constitutive
50. Flower and fruit set altered 101.PRSV resistant
51. Flowering altered 102.PRV resistant
Table 2 - Part 2.(continued)Non-limiting examples of transgenic input
traits/phenotypes
103.PSbMV resistant 128.Streptomyces scabies resistant
104.Pseudomonas syringae resistant 129.Sulfonylurea tolerant
105.PStV resistant 130.Tetracycline binding protein produced
106.PVX resistant 131.TEV resistant
107.PVY resistant 132.Thielaviopsis resistant
108.RBDV resistant 133.TMV resistant
109.Rhizoctonia resistant 134.Tobamovirus resistant
110.Rhizoctonia solani resistant 135.ToMoV resistant
111.Ring rot resistance 136.ToMV resistant
112.Root-knot nematode resistant 137.Transposon activator
113.SbMV resistant 138.Transposon inserted
114.Sclerotinia resistant 139.TRV resistant
115.SCMV resistant 140.TSWV resistant
116.SCYLV resistant 141.TVMV resistant
117.Secondary metabolite increased 142.TYLCV resistant
118.Seed set reduced 143.Tyrosine level increased
119.Selectable marker 144.Venturia resistant
120.Senescence altered 145.Verticillium dahliae resistant
121.Septoria resistant 146.Verticillium resistant
122.Shorter stems 147.Visual marker
123.Soft rot fungal resistant 148.WMV2 resistant
124.Soft rot resistant 149.WSMV resistant
125.SqMV resistant 150.Yield increased
126.SrMV resistant 151.ZYMV resistant
127.Storage protein altered
Table 2 - Part 3. Non-limiting examples of output traits/phenotypes
1. ACC oxidase level decreased 36.Oil profile altered
2. Altered lignin biosynthesis 37.Pectin esterase level reduced
3. B-1,4-endoglucanase 38.Pharmaceutical proteins produced
4. Botrytis resistant 39.Phosphinothricin tolerant
5. Carbohydrate metabolism altered 40.Phytoene synthase activity increased
6. Carotenoid content altered 41.Pigment metabolism altered
7. Cell wall altered 42.Polygalacturonase level reduced
8. CMV resistant 43.Processing characteristics altered
9. Coleopteran resistant 44.Prolonged shelf life
10.Dry matter content increased 45.Protein altered
11.Ethylene production reduced 46.Protein quality altered
12.Ethylene synthesis reduced 47.PRSV resistant
13.Fatty acid metabolism altered 48.Root-knot nematode resistant
14.Fire blight resistant 49.Sclerotinia resistant
15.Flower and fruit abscission reduced 50.Seed composition altered
16.Flower and fruit set altered 51.Seed methionine storage increased
17.Flowering time altered 52.Seed set reduced
18.Fruit firmness increased 53.Seed storage protein
19.Fruit pectin esterase levels decreased 54.Senescence altered (e.g. shelf life increased)
20.Fruit ripening altered 55.Shorter stems
21.Fruit ripening delayed 56.Solids increased
22.Fruit solids increased 57.SqMV resistant
23.Fruit sugar profile altered 58.Starch level increased
24.Fruit sweetness increased 59.Starch metabolism altered
25.Glucuronidase expressing 60.Starch reduced
26.Heat stable glucanase produced 61.Sterols increased
27.Heavy metals sequestered 62.Storage protein altered
28.Hordothionin produced 63.Sugar alcohol levels increased
29.Improved fruit quality 64.Tetracycline binding protein produced
30.Industrial enzyme produced 65.Tyrosine level increased
31.Lepidopteran resistant 66.Verticillium resistant
32.Lysine level increased 67.Visual marker
33.Mealybug wilt virus resistant 68.WMV2 resistant
34.Methionine level increased 69.Yield increased
35.Nucleocapsid protein produced 70.ZYMV resistant
Table 2 - Part 4. Non-limiting examples of traits/phenotypes with agronomic
properties
1. ACC oxidase level decreased 53.Industrial enzyme produced
2. Altered amino acid composition 54.Lignin levels decreased
3. Altered lignin biosynthesis 55.Lipase expressed in seeds
4. Altered maturing 56.Lysine level increased
5. Altered plant development 57.Male sterile
6. Aluminum tolerant 58.Male sterile reversible
7. Ammonium assimilation increased 59.Methionine level increased
8. Anthocyanin produced in seed 60.Modified growth characteristics
9. B-1,4-endoglucanase 61.Mycotoxin degradation
10.Calmodulin level altered 62.Nitrogen metabolism altered
11.Carbohydrate metabolism altered 63.Nucleocapsid protein produced
12.Carotenoid content altered 64.Oil profile altered
13.Cell wall altered 65.Oil quality altered
14.Cold tolerant 66.Oxidative stress tolerant
15.Constitutive expression of glutamine synthetase 67.Pectin esterase level reduced
16.Cutting rootability increased 68.Pharmaceutical proteins produced
17.Development altered 69.Photosynthesis enhanced
18.Drought tolerant 70.Phytoene synthase activity increased
19.Dry matter content increased 71.Pigment metabolism altered
20.Environmental stress reduced 72.Polyamine metabolism altered
21.Ethylene metabolism altered 73.Polygalacturonase level reduced
22.Ethylene production reduced 74.Pratylenchus vulnus resistant
23.Ethylene synthesis reduced 75.Processing characteristics altered
24.Fatty acid metabolism altered 76.Prolonged shelf life
25.Female sterile 77.Protein altered
26.Fenthion susceptible 78.Protein lysine level increased
27.Fertility altered 79.Protein quality altered
28.Fiber quality altered 80.Proteinase inhibitors level constitutive
29.Flower and fruit abscission reduced 81.Salt tolerance increased
30.Flower and fruit set altered 82.Seed composition altered
31.Flowering altered 83.Seed methionine storage increased
32.Flower color altered 84.Seed set reduced
33.Flowering time altered 85.Selectable marker
34.Fruit firmness increased 86.Senescence altered
35.Fruit pectin esterase levels decreased 87.Shorter stems
36.Fruit polygalacturonase level decreased 88.Solids increased
37.Fruit ripening altered 89.Starch level increased
38.Fruit ripening delayed 90.Starch metabolism altered
39.Fruit solids increased 91.Starch reduced
40.Fruit sugar profile altered 92.Sterols increased
41.Fruit sweetness increased 93.Storage protein altered
42.Glucuronidase expressing 94.Stress tolerant
43.Growth rate altered 95.Sugar alcohol levels increased
44.Growth rate increased 96. Tetracycline binding protein produced
45.Growth rate reduced 97. Thermostable protein produced
46.Heat stable glucanase produced 98. Transposon activator
47.Heat tolerant 99. Transposon inserted
48.Heavy metals sequestered 100.Tyrosine level increased
49.Hordothionin produced 101.Visual marker
50.Improved fruit quality 102.Vivipary increased
51.Increased phosphorus 103.Yield increased
52.Increased stalk strength
Table 2 - Part 5. Non-limiting examples of traits/phenotypes with product
quality properties
1. 2,4-D tolerant 45.Melanin produced in cotton fibers
2. ACC oxidase level decreased 46.Metabolism altered
3. Altered amino acid composition 47.Methionine level increased
4. Altered lignin biosynthesis 48.Mycotoxin degradation
5. Anthocyanin produced in seed 49.Mycotoxin production inhibited
6. Antioxidant enzyme increased 50.Nicotine levels reduced
7. Auxin metabolism and increased tuber solids 51.Nitrogen metabolism altered
8. B-1,4-endoglucanase 52.Novel protein produced
9. Blackspot bruise resistant 53.Nutritional quality altered
10.Brown spot resistant 54.Oil profile altered
11.Bruising reduced 55.Oil quality altered
12.Caffeine levels reduced 56.Pectin esterase level reduced
13.Carbohydrate metabolism altered 57.Photosynthesis enhanced
14.Carotenoid content altered 58.Phytoene synthase activity increased
15.Cell wall altered 59.Pigment metabolism altered
16.Cold tolerant 60.Polyamine metabolism altered
17.Delayed softening 61.Polygalacturonase level reduced
18.Disulfides reduced in endosperm 62.Processing characteristics altered
19.Dry matter content increased 63.Prolonged shelf life
20.Ear mold resistant 64.Protein altered
21.Ethylene production reduced 65.Protein lysine level increased
22.Ethylene synthesis reduced 66.Protein quality altered
23.Extended flower life 67.Proteinase inhibitors level constitutive
24.Fatty acid metabolism altered 68.Rust resistant
25.Fiber quality altered 69.Seed composition altered
26.Fiber strength altered 70.Seed methionine storage increased
27.Flavor enhancer 71.Seed number increased
28.Flower and fruit abscission reduced 72.Seed quality altered
29.Fruit firmness increased 73.Seed set reduced
30.Fruit invertase level decreased 74.Seed weight increased
31.Fruit polygalacturonase level decreased 75.Senescence altered
32.Fruit ripening altered 76.Solids increased
33.Fruit ripening delayed 77.Starch level increased
34.Fruit solids increased 78.Starch metabolism altered
35.Fruit sugar profile altered 79.Starch reduced
36.Fruit sweetness increased 80.Steroidal glycoalkaloids reduced
37.Glyphosate tolerant 81.Sterols increased
38.Heat stable glucanase produced 82.Storage protein altered
39.Improved fruit quality 83.Sugar alcohol levels increased
40.Increased phosphorus 84.Thermostable protein produced
41.Increased protein levels 85.Tryptophan level increased
42.Lignin levels decreased 86.Tuber solids increased
43.Lysine level increased 87.Yield increased
44.Male sterile
Table 2 - Part 6. Non-limiting examples of traits/phenotypes with herbicide
tolerance properties
1. 2,4-D tolerant 11. Sulfonylurea tolerant
2. Chloroacetanilide tolerant 12. Northern corn leaf blight resistant
3.Fertility altered 13.Herbicide tolerant
4.Protein altered 14.Isoxazole tolerant
5.Lignin levels decreased 15.Chlorsulfuron tolerant
6.Methionine level increased 16.Glyphosate tolerant
7.Bromoxynil tolerant 17.Lepidopteran resistant
8.Metabolism altered 18.Phosphinothricin tolerant
9.Imidazole tolerant 19.Sulfonylurea tolerant
10.Imidazolinone tolerant
Table 2 - Part 7. Non-limiting examples of traits/phenotypes with pest
resistance properties
Legend
BR - Bacterial Resistant NR - Nematode Resistant
FR - Fungal Resistant VR - Viral Resistant
IR - Insect Resistant
1. Agrobacterium resistant - BR 44. Ear mold resistant - FR
2. Alternaria resistant - FR 45. Erwinia carotovora resistant - BR
3. Alternaria daucii resistant - FR 46. European Corn Borer resistant - IR
4. Alternaria solani resistant - FR 47. Eyespot resistant - FR
5. AMV resistant - VR 48. Fall armyworm resistant - IR
6. Anthracnose resistant - FR 49. Fire blight resistant - BR
7. Aphid resistant - IR 50. Frogeye leaf spot resistant - FR
8. Apple scab resistant - FR 51. Fruit rot resistant - FR
9. Aspergillus resistant - FR 52. Fungal post-harvest resistant - FR
10. Bacterial leaf blight resistant - BR 53. Fungal resistant - FR
11. Bacterial resistant - BR 54. Fungal resistant general - FR
12. Bacterial soft rot resistant - BR 55. Fusarium dehlae resistant - FR
13. Bacterial soft rot resistant - VR 56. Fusarium resistant - FR
14. Bacterial speck resistant - BR 57. Geminivirus resistant - VR
15. BCTV resistant - VR 58. Gray leaf spot resistant - FR
16. Black shank resistant - FR 59. Helminthosporium resistant - FR
17. BLRV resistant - VR 60. Hordothionin produced - BR
18. BNYVV resistant - VR 61. Insect predator resistant - IR
19. Botrytis cinerea resistant - FR 62. Insect resistant general - IR
20. Botrytis resistant - FR 63. Late blight resistant - FR
21. BPMV resistant - VR 64. Leaf blight resistant - FR
22. Brown spot resistant - FR 65. Leaf spot resistant - FR
23. BYDV resistant - VR 66. Lepidopteran resistant - IR
24. BYMV resistant - VR 67. Lesser cornstalk borer resistant - IR
25. CaMV resistant - VR 68. LMV resistant - VR
26. Cercospora resistant - FR 69. Loss of systemic resistance - VR
27. Clavibacter resistant - BR 70. Marssonina resistant - FR
28. Closterovirus resistant - BR 71. MCDV resistant - VR
29. CLRV resistant - VR 72. MCMV resistant - VR
30. CMV resistant - FR 73. MDMV resistant - VR
31. Coleopteran resistant - IR 74. MDMV-B resistant - VR
32. Colletotrichum resistant - FR 75. Mealybug wilt virus resistant - VR
33. Colorado potato beetle resistant - IR 76. Melampsora resistant - FR
34. Corn earworm resistant - IR 77. Melodgyne resistant - NR
35. Corynebacterium sepedonicum resistant - BR 78. Meloidogyne resistant - NR
36. Cottonwood leaf beetle resistant - IR 79. Mexican Rice Borer resistant - IR
37. Criconemella resistant - NR 80. Mycotoxin degradation - FR
38. Crown gall resistant - BR 81. Nepovirus resistant - VR
39. Cucumovirus resistant - VR 82. Northern corn leaf blight resistant - IR
40. Cylindrosporium resistant - FR 83. Nucleocapsid protein produced - VR
41. Disease resistant general - FR 84. Oblique banded leafroller resistant - IR
42. Dollar spot resistant - FR 85. Oomycete resistant - FR
43. Downy mildew resistant - FR 86. Pathogenesis related proteins level increased - FR
Table 2 - Part 7. (continued) Non-limiting examples of traits/phenotypes with
pest resistance properties
87. PEMV resistant - VR 116. SMV resistant - VR
88. PeSV resistant - VR 117. Sod web worm resistant - IR
89. Phratora leaf beetle resistant - IR 118. Soft rot fungal resistant - FR
90. Phoma resistant - FR 119. Soft rot resistant - BR
91. Phytophthora resistant - FR 120. Southwestern corn borer resistant - IR
92. PLRV resistant - VR 121. SPFMV resistant - VR
93. Potyvirus resistant - VR 122. Sphaeropsis fruit rot resistant - FR
94. Powdery mildew resistant - FR 123. SqMV resistant - VR
95. PPV resistant - VR 124. SrMV resistant - VR
96. Pratylenchus vulnus resistant - NR 125. Streptomyces scabies resistant - BR
97. PRSV resistant - VR 126. Sugar cane borer resistant - IR
98. PRV resistant - VR 127. TEV resistant - VR
99. PSbMV resistant - VR 128. Thielaviopsis resistant - FR
100. Pseudomonas syringae resistant - BR 129. TMV resistant - FR
101. PStV resistant - VR 130. Tobamovirus resistant - VR
102. PVX resistant - VR 131. ToMoV resistant - VR
103. PVY resistant - VR 132. ToMV resistant - VR
104. RBDV resistant - VR 133. TRV resistant - VR
105. Rhizoctonia resistant - FR 134. TSWV resistant - VR
106. Rhizoctonia solani resistant - FR 135. TVMV resistant - VR
107. Ring rot resistance - BR 136. TYLCV resistant - VR
108. Root-knot nematode resistant - NR 137. Venturia resistant - FR
109. Rust resistant - FR 138. Verticillium dahliae resistant - FR
110. SbMV resistant - VR 139. Verticillium resistant - FR
111. Sclerotinia resistant - FR 140. Western corn root worm resistant - IR
112. SCMV resistant - VR 141. WMV2 resistant - VR
113. SCYLV resistant - VR 142. WSMV resistant - VR
114. Septoria resistant - FR 143. ZYMV resistant - VR
115. Smut resistant - FR
Table 2 - Part 8. Non-limiting examples of miscellaneous traits/phenotypes
with properties
1. Antibiotic produced 31.Mycotoxin production inhibited
2. Antiprotease producing 32.Mycotoxin restored
3. Capable of growth on defined synthetic media 33.Non-lesion forming mutant
4. Carbohydrate metabolism altered 34.Novel protein produced
5. Cell wall altered 35.Oil quality altered
6. Cold tolerant 36.Peroxidase levels increased
7. Coleopteran resistant 37.Pharmaceutical proteins produced
8. Color altered 38.Phosphinothricin tolerant
9. Color sectors in seeds 39.Pigment metabolism altered
10.Colored sectors in leaves 40.Pollen visual marker
11.Constitutive expression of glutamine synthetase 41.Polyamine metabolism altered
12.Cre recombinase produced 42.Polymer produced
13.Dalapon tolerant 43.Recombinase produced
14.Development altered 44.Secondary metabolite increased
15.Disease resistant general 45.Seed color altered
16.Ethylene metabolism altered 46.Seed weight increased
17.Expression optimization 47.Selectable marker
18.Fenthion susceptible 48.Spectinomycin resistant
19.Glucuronidase expressing 49.Sterile
20.Glyphosate tolerant 50.Sterols increased
21.Growth rate reduced 51.Sulfonylurea susceptible
22.Heavy metals sequestered 52.Syringomycin deficient
23.Hygromycin tolerant 53.Transposon activator
24.Inducible DNA modification 54.Transposon elements inserted
25.Industrial enzyme produced 55.Transposon inserted
26.Kanamycin resistant 56.Trifolitoxin producing
27.Lipase expressed in seeds 57.Trifolitoxin resistant
In a particular exemplification, "producing an organism having a desirable
trait" includes producing an organism that is altered with respect to an organ
or a part of an organ but not necessarily altered anywhere else.
By "trait" is meant any detectable parameter associated with an organism under
a
set of conditions. Examples of "detectable parameters" include the ability to
produce a
substance, the ability to not produce a substance, an altered pattern of (such
as an
increased or a decreased) ability to produce a substance, viability, non-
viability,
behaviour, growth rate, size, and morphology or morphological characteristic.
In another embodiment, this invention is directed to a method of producing an
organism having a desirable trait or a desirable improvement in a trait by: a)
obtaining an
initial population of organisms comprised of at least one starting organism,
b)
mutagenizing the population such that mutations occur throughout a substantial
part of the
genome of at least one initial organism, c) selecting at least one mutagenized
organism
having a desirable trait or a desirable improvement in a trait, and d)
optionally repeating
the method by subjecting one or more mutagenized organisms to a repetition of
the
method. A mutagenized organism having a desirable trait or a desirable
improvement in a
trait can be referred to as an "up-mutant", and the associated mutation(s)
contained in an
up-mutant organism can be referred to as up-mutation(s).
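By way of non-limiting illustration only, steps a) to d) above can be sketched in Python as follows. The sketch is a toy under stated assumptions and is not the method of the invention itself: a genome is modelled as a list of numbers, the names `mutagenize` and `evolve` are hypothetical, and the `trait` function stands in for whatever screen or selection is actually used.

```python
import random


def mutagenize(genome, rate, rng):
    """Return a mutated copy; each gene is perturbed with probability `rate`."""
    return [g + rng.gauss(0.0, 1.0) if rng.random() < rate else g for g in genome]


def evolve(start_population, trait, rounds=5, progeny=40, rate=0.5, seed=0):
    """Toy version of steps a)-d): mutagenize, select up-mutants, optionally repeat."""
    rng = random.Random(seed)
    population = [list(g) for g in start_population]       # a) starting population
    best = max(population, key=trait)
    for _ in range(rounds):                                 # d) optional repetition
        mutants = [mutagenize(rng.choice(population), rate, rng)
                   for _ in range(progeny)]                 # b) mutagenize
        # c) keep only mutants that improve on the best organism seen so far
        up_mutants = [m for m in mutants if trait(m) > trait(best)]
        if up_mutants:
            best = max(up_mutants, key=trait)
            population = up_mutants                         # up-mutants seed the next round
    return best


if __name__ == "__main__":
    start = [[0.0] * 10]
    result = evolve(start, trait=sum)
    print(len(result), sum(result) >= 0.0)
```

With `rate=0.5` roughly half of the modelled genes are altered in each progeny organism; the value is an arbitrary choice for the sketch.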
In one embodiment, step c) is comprised of selecting at least two different
mutagenized organisms, each having a different mutagenized genome, and the
method of
producing an organism having a desirable trait or a desirable improvement in a
trait is
comprised of a) obtaining a starting population of organisms comprised of at
least one
starting organism, b) mutagenizing the population such that mutations occur
throughout a
substantial part of the genome of at least one starting organism, c) selecting
at least two
mutagenized organisms having a desirable trait or a desirable improvement in a
trait, d)
creating combinations of the mutations of the two or more mutagenized
organisms, e)
selecting at least one mutagenized organism having a desirable trait or a
desirable
improvement in a trait, and f) optionally repeating the method by subjecting
one or more
mutagenized organisms to a repetition of the method.
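Step d) above, creating combinations of the mutations of two or more mutagenized organisms, can be illustrated by the following hedged Python sketch. It assumes, purely for illustration, that genomes are equal-length strings, that a mutation is a changed character at a position, and that when two up-mutants alter the same position the later one in the group wins; the function names are hypothetical.

```python
from itertools import combinations


def mutations(start, mutant):
    """Map position -> new base for every position where `mutant` differs from `start`."""
    return {i: b for i, (a, b) in enumerate(zip(start, mutant)) if a != b}


def combine_up_mutants(start, up_mutants):
    """Return genomes carrying the pooled mutations of every group of two or more up-mutants."""
    combined = []
    for size in range(2, len(up_mutants) + 1):
        for group in combinations(up_mutants, size):
            pooled = {}
            for mutant in group:
                pooled.update(mutations(start, mutant))    # later mutant wins on a clash
            genome = list(start)
            for position, base in pooled.items():
                genome[position] = base
            combined.append("".join(genome))
    return combined


if __name__ == "__main__":
    print(combine_up_mutants("AAAA", ["ATAA", "AAAG"]))
```

The combined genomes would then be screened again in step e); the sketch stops at generating them.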
In one embodiment, the method is repeated. Thus, for example, an up-mutant
organism can serve as a starting organism for the above method. Also, for
example, an
up-mutant organism having a combination of two or more up-mutations in its genome
can
serve as a starting organism for the above method.
Thus, in one embodiment, this invention is directed to a method of producing
an
organism having a desirable trait or a desirable improvement in a trait by: a)
obtaining a
starting population of organisms comprised of at least one starting organism,
b)
mutagenizing the population such that mutations occur throughout a substantial
part of the
genome of at least one starting organism, c) selecting at least one
mutagenized organism
having a desirable trait or a desirable improvement in a trait, and d)
optionally repeating
the method by subjecting one or more mutagenized organisms to a repetition of
the
method. A mutagenized organism having a desirable trait or a desirable
improvement in a
trait can be referred to as an "up-mutant", and the associated mutation(s)
contained in an
up-mutant organism can be referred to as up-mutation(s).
Mutagenizing a starting population such that mutations occur throughout a
substantial part of the genome of at least one starting organism refers to
mutagenizing at
least approximately 1% of the genes of a genome, or at least approximately 10%
of the
genes of a genome, or at least approximately 20% of the genes of a genome, or
at least
approximately 30% of the genes of a genome, or at least approximately 40% of
the genes
of a genome, or at least approximately 50% of the genes of a genome, or at
least
approximately 60% of the genes of a genome, or at least approximately 70% of
the genes
of a genome, or at least approximately 80% of the genes of a genome, or at
least
approximately 90% of the genes of a genome, or at least approximately 95% of
the genes
of a genome, or at least approximately 98% of the genes of a genome.
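As a non-limiting illustration of the percentages recited above, the following Python sketch computes the share of genes that differ between a starting genome and a mutagenized genome and reports the highest recited threshold that is met. Representing a genome as a mapping from gene name to sequence, and the function names themselves, are assumptions made only for this example.

```python
# Thresholds recited in the text, in percent of the genes of a genome.
THRESHOLDS_PERCENT = (1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95, 98)


def percent_genes_mutagenized(start_genes, mutant_genes):
    """Percentage of genes whose sequence differs between the two genomes."""
    changed = sum(1 for gene, seq in start_genes.items()
                  if mutant_genes.get(gene) != seq)
    return 100.0 * changed / len(start_genes)


def highest_threshold_met(percent):
    """Largest recited threshold that `percent` reaches, or None if below 1%."""
    met = [t for t in THRESHOLDS_PERCENT if percent >= t]
    return max(met) if met else None


if __name__ == "__main__":
    start = {f"gene{i}": "ATG" for i in range(10)}
    mutant = dict(start, gene0="ATA", gene1="CTG", gene2="ATC")
    print(highest_threshold_met(percent_genes_mutagenized(start, mutant)))
```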
In a particular embodiment, this invention provides a method of producing an
organism having a desirable trait or a desirable improvement in a trait by: a)
obtaining
sequence information of a genome; b) annotating the genomic sequence obtained;
c) mutagenizing a substantial part of the genome; d) selecting at
least one
mutagenized genome having a desirable trait or a desirable improvement in a
trait; and e)
optionally repeating the method by subjecting one or more mutagenized genomes
to a
repetition of the method.
Thus, in one aspect, this invention provides a process comprised of:
1) Subjecting a working cell or organism to holistic monitoring (which can
include the
detection and/or measurement of all detectable functions and physical
parameters).
Examples of such parameters include morphology, behavior, growth,
responsiveness to
stimuli (e.g., antibiotics, different environment, etc.). Additional examples
include all
measurable molecules, including molecules that are chemically at least in part
nucleic
acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids. In a
particular
aspect, performing holistic monitoring is comprised of using a microarray-
based method.
In another aspect, performing holistic monitoring is comprised of sequencing a
substantial
portion of the genome, i.e. for example at least approximately 10% of the
genome, or for
example at least approximately 20% of the genome, or for example at least
approximately
30% of the genome, or for example at least approximately 40% of the genome, or
for
example at least approximately 50% of the genome, or for example at least
approximately
60% of the genome, or for example at least approximately 70% of the genome, or
for
example at least approximately 80% of the genome, or for example at least
approximately
90% of the genome, or for example at least approximately 95% of the genome, or
for
example at least approximately 98% of the genome.
2) Introducing into the working cell or organism a plurality of traits
(stacked traits),
including selectively and differentially activatable traits. Serviceable
traits for this
purpose include traits conferred by genes and traits conferred by gene
pathways.
3) Subjecting the working cell or organism to holistic monitoring.
4) Compiling the information obtained from steps 1) and 3), and processing
&/or
analyzing it to better understand the changes introduced into the working cell
or
organisms. Such data processing includes identifying correlations between
and/or among
the measured parameters.
5) Repeating any number or all of steps 2), 3), and 4).
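The data processing of step 4), identifying correlations between and/or among the measured parameters, can be sketched in Python as below. This is only an illustrative assumption about how such processing might look: the monitoring results of steps 1) and 3) are modelled as dictionaries mapping a parameter name to a list of per-sample values, Pearson correlation is chosen as one possible correlation measure, and every parameter is assumed to vary across samples.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)    # undefined if a parameter is constant


def changes(before, after):
    """Per-sample change of each parameter between step 1) and step 3)."""
    return {p: [a - b for b, a in zip(before[p], after[p])] for p in before}


def correlations(table):
    """Pairwise correlations among all measured parameters."""
    return {(p, q): pearson(table[p], table[q])
            for p, q in combinations(sorted(table), 2)}


if __name__ == "__main__":
    before = {"growth": [1, 1, 1], "lipid": [2, 2, 2]}
    after = {"growth": [2, 3, 4], "lipid": [4, 6, 8]}
    print(correlations(changes(before, after)))
```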
This invention provides that molecules serviceable for introducing transgenic
traits
into a plant include all known genes and nucleic acids. By way of non-limiting
exemplification, this invention specifically names any number &/or combination
of genes
listed herein or listed in any reference incorporated herein by reference.
Furthermore, by
way of non-limiting exemplification, this invention specifically names any
number &/or
combination of genes & gene pathways listed herein as well as in any reference
incorporated by reference herein. This invention provides that molecules
serviceable as
detectable parameters include any molecule, any enzyme, substrate thereof, product
thereof,
and any gene or gene pathway listed herein including in any figure or table
herein as well
as in any reference incorporated by reference herein.
This invention also relates generally to the field of nucleic acid engineering
and
correspondingly encoded recombinant protein engineering. More particularly,
the
invention relates to the directed evolution of nucleic acids and screening of
clones
containing the evolved nucleic acids for resultant activity(ies) of interest,
such as nucleic acid
activity(ies) &/or specified protein, particularly enzyme, activity(ies) of
interest.
Mutagenized molecules provided by this invention may include chimeric molecules
and
molecules with point mutations, including biological molecules that contain a
carbohydrate, a
lipid, a nucleic acid, &/or a protein component, and specific but non-limiting
examples of these
include antibiotics, antibodies, enzymes, and steroidal and non-steroidal
hormones.
This invention relates generally to a method of 1) preparing a progeny
generation of
molecule(s) (including a molecule that is comprised of a polynucleotide
sequence, a molecule
that is comprised of a polypeptide sequence, and a molecule that is comprised
in part of a
polynucleotide sequence and in part of a polypeptide sequence), that is
mutagenized to achieve
at least one point mutation, addition, deletion, &/or chimerization, from one
or more ancestral
or parental generation template(s); 2) screening the progeny generation
molecule(s) -
preferably using a high throughput method - for at least one property of
interest (such as an
improvement in an enzyme activity or an increase in stability or a novel
chemotherapeutic
effect); 3) optionally obtaining &/or cataloguing structural &/or
functional information
regarding the parental &/or progeny generation molecules; and 4) optionally
repeating any of
steps 1) to 3).
In a preferred embodiment, there is generated (e.g. from a parent
polynucleotide
template) - in what is termed "codon site-saturation mutagenesis" - a progeny
generation
of polynucleotides, each having at least one set of up to three contiguous
point mutations
(i.e. different bases comprising a new codon), such that every codon (or every
family of
degenerate codons encoding the same amino acid) is represented at each codon
position.
Corresponding to - and encoded by - this progeny generation of
polynucleotides, there is
also generated a set of progeny polypeptides, each having at least one single
amino acid
point mutation. In a preferred aspect, there is generated - in what is termed
"amino acid
site-saturation mutagenesis" - one such mutant polypeptide for each of the 19
naturally
encoded polypeptide-forming alpha-amino acid substitutions at each and every
amino acid
position along the polypeptide. This yields - for each and every amino acid
position along
the parental polypeptide - a total of 20 distinct progeny polypeptides
including the original
amino acid, or potentially more than 21 distinct progeny polypeptides if
additional amino
acids are used either instead of or in addition to the 20 naturally encoded
amino acids.
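The counting in the preceding paragraph can be checked with a short Python sketch, offered as an illustration only. It assumes the standard genetic code, a parent coding sequence whose length is a multiple of three and which contains no stop codon, and hypothetical function names; it enumerates the progeny in silico and says nothing about how the mutagenesis is carried out at the bench.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a coding sequence codon by codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def codon_site_saturation(parent_dna):
    """For each codon position, the progeny with each of the 64 codons placed there."""
    library = {}
    for position in range(len(parent_dna) // 3):
        start = 3 * position
        library[position] = [parent_dna[:start] + codon + parent_dna[start + 3:]
                             for codon in CODON_TABLE]
    return library


def amino_acid_site_saturation(parent_dna):
    """For each position, the distinct full-length progeny polypeptides (stops dropped)."""
    result = {}
    for position, variants in codon_site_saturation(parent_dna).items():
        proteins = {translate(v) for v in variants}
        result[position] = sorted(p for p in proteins if "*" not in p)
    return result


if __name__ == "__main__":
    progeny = amino_acid_site_saturation("ATGGCT")    # parent polypeptide "MA"
    print({position: len(p) for position, p in progeny.items()})
```

For each position the sketch yields 20 distinct polypeptides, that is, the parental sequence plus the 19 substitutions; progeny carrying a stop codon are discarded here as a simplifying choice.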
Thus, in another aspect, this approach is also serviceable for generating
mutants
containing - in addition to &/or in combination with the 20 naturally encoded
polypeptide-forming alpha-amino acids - other rare &/or not naturally-encoded amino
acids and amino
acid derivatives. In yet another aspect, this approach is also serviceable for
generating
mutants by the use of - in addition to &/or in combination with natural or
unaltered codon
recognition systems of suitable hosts - altered, mutagenized, &/or designer
codon
recognition systems (such as in a host cell with one or more altered tRNA
molecules).
In yet another aspect, this invention relates to recombination and more
specifically to a
method for preparing polynucleotides encoding a polypeptide by a method of in
vivo re-
assortment of polynucleotide sequences containing regions of partial homology,
assembling the
polynucleotides to form at least one polynucleotide and screening the
polynucleotides for the
production of polypeptide(s) having a useful property.
In yet another preferred embodiment, this invention is serviceable for
analyzing and
cataloguing - with respect to any molecular property (e.g. an enzymatic
activity) or
combination of properties allowed by current technology - the effects of any
mutational change
achieved (including particularly saturation mutagenesis). Thus, a
comprehensive method is
provided for determining the effect of changing each amino acid in a parental
polypeptide into
each of at least 19 possible substitutions. This allows each amino acid in a
parental
polypeptide to be characterized and catalogued according to its spectrum of
potential effects on
a measurable property of the polypeptide.
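The cataloguing described above can be pictured with a minimal sketch that enumerates every single-residue substitution of a parental polypeptide. The toy sequence "MKT" and the function name are hypothetical, chosen only for illustration; the sketch assumes the 20 naturally encoded amino acids and does not model how the variants are made or screened.

```python
# The 20 naturally encoded amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_substitution_variants(parent):
    """Yield (position, original, replacement, sequence) for every
    single-residue substitution of `parent` (positions are 1-based)."""
    for i, original in enumerate(parent):
        for replacement in AMINO_ACIDS:
            if replacement == original:
                continue  # skip the parental residue itself
            variant = parent[:i] + replacement + parent[i + 1:]
            yield (i + 1, original, replacement, variant)

parent = "MKT"  # hypothetical toy polypeptide, not from the source
variants = list(single_substitution_variants(parent))
# 19 substitutions at each of the 3 positions
print(len(variants))
```

Each tuple could then be paired with a measured property value to build the per-position catalogue.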
In another aspect, the method of the present invention utilizes the natural
property
of cells to recombine molecules and/or to mediate reductive processes that
reduce the
complexity of sequences and extent of repeated or consecutive sequences
possessing
regions of homology.
It is an object of the present invention to provide a method for generating
hybrid
polynucleotides encoding biologically active hybrid polypeptides with enhanced
activities.
In accomplishing these and other objects, there has been provided, in
accordance with one
aspect of the invention, a method for introducing polynucleotides into a
suitable host cell
and growing the host cell under conditions that produce a hybrid
polynucleotide.
In another aspect of the invention, the invention provides a method for
screening
for biologically active hybrid polypeptides encoded by hybrid polynucleotides.
The
present method allows for the identification of biologically active hybrid
polypeptides
with enhanced biological activities.
Other objects, features and advantages of the present invention will become
apparent from the following detailed description. It should be understood,
however, that
the detailed description and the specific examples, while indicating preferred
embodiments
of the invention, are given by way of illustration only, since various changes
and
modifications within the spirit and scope of the invention will become
apparent to those
skilled in the art from this detailed description.
In yet another aspect, this invention relates to a method of discovering which
phenotype corresponds to a gene by disrupting every gene in the organism.
Accordingly, this invention provides a method for determining a gene that
alters a
characteristic of an organism, comprising: a) obtaining an initial population
of organisms,
b) generating a set of mutagenized organisms, such that when all the genetic
mutations in
the set of mutagenized organisms are taken as a whole, there is represented a
set of
substantial genetic mutations, c) detecting the presence of an organism
having an altered
trait, and d) determining the nucleotide sequence of a gene that has been
mutagenized in
the organism having the altered trait.
In yet another aspect, this invention relates to a method of improving a trait
in an
organism by functionally knocking out a particular gene in the organism, and
then
transferring a library of genes, which only vary from the wild-type at one
codon position,
into the organism.
Accordingly, this invention provides a method for producing an organism
with an improved trait, comprising:
a) functionally knocking out an endogenous gene in a substantially clonal
population of organisms;
b) transferring the set of altered genes into the clonal population of
organisms,
wherein each altered gene differs from the endogenous gene at only one codon;
and
c) detecting a mutagenized organism having an improved trait; and
d) determining the nucleotide sequence of a gene that has been transferred
into
the detected organism.
D. BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1. Exonuclease Activity. Figure 1 shows the activity of the enzyme
exonuclease
III. This is an exemplary enzyme that can be used to shuffle, assemble,
reassemble,
recombine, and/or concatenate polynucleotide building blocks. The asterisk
indicates that
the enzyme acts from the 3' direction towards the 5' direction of the
polynucleotide
substrate.

Figure 2. Generation of A Nucleic Acid Building Block by Polymerase-Based
Amplification. Figure 2 illustrates a method of generating a double-stranded
nucleic acid
building block with two overhangs using a polymerase-based amplification
reaction (e.g.,
PCR). As illustrated, a first polymerase-based amplification reaction using a
first set of
primers, F2 and R1, is used to generate a blunt-ended product (labeled
Reaction 1, Product
1), which is essentially identical to Product A. A second polymerase-based
amplification
reaction using a second set of primers, F1 and R2, is used to generate a blunt-ended product
(labeled Reaction 2, Product 2), which is essentially identical to Product B.
These two
products are then mixed and allowed to melt and anneal, generating a
potentially useful
double-stranded nucleic acid building block with two overhangs. In the example
of Fig. 1,
the product with the 3' overhangs (Product C) is selected for by nuclease-
based
degradation of the other 3 products using a 3' acting exonuclease, such as
exonuclease III.
Alternate primers are shown in parentheses to illustrate that serviceable primers
may overlap,
and additionally that serviceable primers may be of different lengths, as
shown.
FIGURE 3. Unique Overhangs And Unique Couplings. Figure 3 illustrates the
point
that the number of unique overhangs of each size (e.g. the total number of
unique
overhangs composed of 1 or 2 or 3, etc. nucleotides) exceeds the number of
unique
couplings that can result from the use of all the unique overhangs of that
size. For
example, there are 4 unique 3' overhangs composed of a single nucleotide, and
4 unique 5'
overhangs composed of a single nucleotide. Yet the total number of unique
couplings that
can be made using all the 8 unique single-nucleotide 3' overhangs and single-nucleotide 5'
overhangs is 4.
FIGURE 4. Unique Overall Assembly Order Achieved by Sequentially Coupling the
Building Blocks.
Figure 4 illustrates the fact that in order to assemble a total of "n" nucleic
acid building
blocks, "n-1" couplings are needed. Yet it is sometimes the case that the
number of
unique couplings available for use is fewer than the "n-1" value. Under these,
and other,
circumstances a stringent non-stochastic overall assembly order can still be
achieved by
performing the assembly process in sequential steps. In this example, 2
sequential steps
are used to achieve a designed overall assembly order for five nucleic acid
building
blocks. In this illustration the designed overall assembly order for the five
nucleic acid
building blocks is: 5'-(#1-#2-#3-#4-#5)-3', where #1 represents building block
number 1,
etc.
FIGURE 5. Unique Couplings Available Using a Two-Nucleotide 3' Overhang.
Figure 5 further illustrates the point that the number of unique overhangs of
each size
(here, e.g. the total number of unique overhangs composed of 2 nucleotides)
exceeds the
number of unique couplings that can result from the use of all the unique
overhangs of that
size. For example, there are 16 unique 3' overhangs composed of two
nucleotides, and
another 16 unique 5' overhangs composed of two nucleotides, for a total of 32
as shown.
Yet the total number of couplings that are unique and not self binding that
can be made
using all the 32 unique double-nucleotide 3' overhangs and double-nucleotide
5'
overhangs is 12. Some apparently unique couplings have "identical twins"
(marked in the
same shading), which are visually obvious in this illustration. Still other
overhangs
contain nucleotide sequences that can self bind in a palindromic fashion, as
shown and
labeled in this figure; thus they do not contribute high stringency to the
overall assembly
order.
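The counts quoted for Figures 3 and 5 can be reproduced with a short sketch. It rests on one assumption that the text does not spell out: a "coupling" is taken to be an overhang annealed to its reverse complement of the same polarity (3' with 3', 5' with 5'), counted once per pair, with self-complementary (palindromic) overhangs excluded. All names are illustrative.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def coupling_counts(k):
    """Count overhangs and couplings for overhangs of k nucleotides.

    Assumption: a coupling pairs an overhang with its reverse complement
    of the same polarity; palindromic overhangs are set aside as self binding.
    """
    overhangs = ["".join(p) for p in product("ACGT", repeat=k)]
    palindromic = [o for o in overhangs if o == reverse_complement(o)]
    pairs = {
        frozenset((o, reverse_complement(o)))
        for o in overhangs
        if o != reverse_complement(o)
    }
    return {
        "overhangs_per_polarity": len(overhangs),
        "total_overhangs": 2 * len(overhangs),        # 3' plus 5'
        "self_binding_per_polarity": len(palindromic),
        "unique_couplings": 2 * len(pairs),           # 3' plus 5'
    }

for k in (1, 2):
    print(k, coupling_counts(k))
```

Under this reading, single-nucleotide overhangs give 8 overhangs and 4 couplings, and two-nucleotide overhangs give 32 overhangs and 12 non-self-binding couplings, in agreement with the figures' descriptions.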
Figure 6. Generation of an Exhaustive Set of Chimeric Combinations by
Synthetic
Ligation Reassembly. Figure 6 showcases the power of this invention in its
ability to
generate exhaustively and systematically all possible combinations of the
nucleic acid
building blocks designed in this example. Particularly large sets (or
libraries) of progeny
chimeric molecules can be generated. Because this method can be performed
exhaustively
and systematically, the method application can be repeated by choosing new
demarcation
points and with correspondingly newly designed nucleic acid building blocks,
bypassing
the burden of re-generating and re-screening previously examined and rejected
molecular
species. It is appreciated that codon wobble can be used to advantage to
increase the
frequency of a demarcation point. In other words, a particular base can often
be
substituted into a nucleic acid building block without altering the amino acid
encoded by
the progenitor codon (which is now an altered codon) because of codon degeneracy. As
illustrated,
demarcation points are chosen upon alignment of 8 progenitor templates.
Nucleic acid
building blocks including their overhangs (which are serviceable for the
formation of
ordered couplings) are then designed and synthesized. In this instance, 18
nucleic acid
building blocks are generated based on the sequence of each of the 8
progenitor templates,
for a total of 144 nucleic acid building blocks (or double-stranded oligos).
Performing the
ligation synthesis procedure will then produce a library of progeny molecules
comprised
of a yield of 8^18 (or over 1.8 x 10^16) chimeras.
Figure 7. Synthetic genes from oligos. According to one embodiment of this
invention, double-stranded nucleic acid building blocks are designed by
aligning a
plurality of progenitor nucleic acid templates. Preferably these templates
contain some
homology and some heterology. The nucleic acids may encode related proteins,
such as
related enzymes, which relationship may be based on function or structure or
both. Figure
7 shows the alignment of three polynucleotide progenitor templates and the
selection of
demarcation points (boxed) shared by all the progenitor molecules. In this
particular
example, the nucleic acid building blocks derived from each of the progenitor
templates
were chosen to be approximately 30 to 50 nucleotides in length.
Figure 8. Nucleic acid building blocks for synthetic ligation gene reassembly.
Figure 8 shows the nucleic acid building blocks from the example in Figure 7.
The
nucleic acid building blocks are shown here in generic cartoon form, with
their compatible
overhangs, including both 5' and 3' overhangs. There are 22 total nucleic acid
building
blocks derived from each of the 3 progenitor templates. Thus, the ligation
synthesis
procedure can produce a library of progeny molecules comprised of a yield of 3^22
(or over
3.1 x 10^10) chimeras.
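The library sizes stated for Figures 6 and 8 follow from simple counting, sketched below. The sketch assumes that each of the building-block positions in a progeny molecule can be filled independently by the corresponding block from any progenitor template; the function name is illustrative.

```python
def library_size(num_templates, blocks_per_template):
    """Return (total building blocks, number of possible chimeras).

    Assumption: every block position is filled independently from any
    one of the templates, so chimeras = templates ** positions.
    """
    total_blocks = num_templates * blocks_per_template
    chimeras = num_templates ** blocks_per_template
    return total_blocks, chimeras

# Figure 6: 8 progenitor templates, 18 building blocks each
print(library_size(8, 18))
# Figure 8: 3 progenitor templates, 22 building blocks each
print(library_size(3, 22))
```

The first call gives 144 building blocks and 8^18 chimeras (about 1.8 x 10^16); the second gives 3^22 chimeras (about 3.1 x 10^10).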
Figure 9. Addition of Introns by Synthetic Ligation Reassembly. Figure 9 shows
in
generic cartoon form that an intron may be introduced into a chimeric progeny
molecule
by way of a nucleic acid building block. It is appreciated that introns often
have
consensus sequences at both termini in order to render them operational. It is
also
appreciated that, in addition to enabling gene splicing, introns may serve an
additional
purpose by providing sites of homology to other nucleic acids to enable
homologous
recombination. For this purpose, and potentially others, it may be sometimes
desirable to
generate a large nucleic acid building block for introducing an intron. If the
size is overly
large for easy generation by direct chemical synthesis of two single stranded
oligos, such a
specialized nucleic acid building block may also be generated by direct
chemical synthesis
of more than two single stranded oligos or by using a polymerase-based
amplification
reaction as shown in Figure 2.
Figure 10. Ligation Reassembly Using Fewer Than All The Nucleotides Of An
Overhang. Figure 10 shows that coupling can occur in a manner that does not
make use
of every nucleotide in a participating overhang. The coupling is particularly
likely to
survive (e.g. in a transformed host) if the coupling is reinforced by treatment
with a ligase
enzyme to form what may be referred to as a "gap ligation" or a "gapped
ligation". It is
appreciated that, as shown, this type of coupling can contribute to generation
of unwanted
background product(s), but it can also be used advantageously to increase the
diversity of the
progeny library generated by the designed ligation reassembly.
Figure 11. Avoidance of unwanted self ligation in palindromic couplings. As
mentioned before and shown in Figure 5, certain overhangs are able to undergo
self
coupling to form a palindromic coupling. A coupling is strengthened
substantially if it is
reinforced by treatment with a ligase enzyme. Accordingly, it is appreciated
that the lack
of 5' phosphates on these overhangs, as shown, can be used advantageously to
prevent this
type of palindromic self ligation. Accordingly, this invention provides that
nucleic acid
building blocks can be chemically made (or ordered) that lack a 5' phosphate
group (or
alternatively it can be removed, e.g. by treatment with a phosphatase enzyme
such as a
calf intestinal alkaline phosphatase (CIAP)) in order to prevent palindromic
self ligations
in ligation reassembly processes.
Figure 12. Pathway Engineering. It is a goal of this invention to provide ways
of
making new gene pathways using ligation reassembly, optionally with other
directed
evolution methods such as saturation mutagenesis. Figure 12 illustrates a
preferred
approach that may be taken to achieve this goal. It is appreciated that
naturally-occurring
microbial gene pathways are linked more often than naturally-occurring
eukaryotic (e.g.
plant) gene pathways, which are sometimes only partially linked. In a
particular
embodiment, this invention provides that regulatory gene sequences (including
promoters)
can be introduced in the form of nucleic acid building blocks into progeny
gene pathways
generated by ligation reassembly processes. Thus, originally linked microbial
gene
pathways, as well as originally unlinked genes and gene pathways, can be thus
converted
to acquire operability in plants and other eukaryotes.
Figure 13. Avoidance of unwanted self ligation in palindromic couplings.
Figure 13
illustrates that another goal of this invention, in addition to the generation
of novel gene
pathways, is the subjection of gene pathways - both naturally occurring and
man-made -
to mutagenesis and selection in order to achieve improved progeny molecules
using the
instantly disclosed methods of directed evolution (including saturation
mutagenesis and
synthetic ligation reassembly). In a particular embodiment, as provided by the
instant
invention, both microbial and plant pathways can be improved by directed
evolution, and
as shown, the directed evolution process can be performed both on genes prior
to linking
them into pathways, and on gene pathways themselves.
Figure 14. Conversion of Microbial Pathways to Eukaryotic Pathways. In a
particular embodiment, this invention provides that microbial pathways can be
converted
to pathways operable in plants and other eukaryotic species by the
introduction of
regulatory sequences that function in those species. Preferred regulatory
sequences
include promoters, operators, and activator binding sites. As shown, a
preferred method of
achieving the introduction of such serviceable regulatory sequences is in the
form of
nucleic acid building blocks, particularly through the use of couplings in
ligation
reassembly processes. These couplings in Fig. 14 are marked with the letters
A, B, C, D
and F.
Fig.15. Holistic engineering of differentially activatable stacked traits in
novel transgenic plants using directed evolution and whole cell monitoring.
Fig.16. Differential Activation of Selected Traits Can Be Achieved by
Adjusting and Controlling the Environment of the Traits.
Fig.17. Harvesting, Processing, Storage.
Fig. 18. Processing.

Fig.19. Cellular Mutagenesis.
Figure 20. Differential Activation of Selected Precursor (Inactive) Gene
Products.
Figure 21. Starting population comprised of an organism strain to be subjected
to
improvement or evolution in order to produce a resultant population comprised
of
an improved organism strain that has a desired trait.
Figure 22. Starting population comprised of a genomic sequence to be subjected
to
improvement or evolution in order to produce a resultant population comprised
of
an improved genomic sequence that has a desired trait.
Fig. 23. Strain Improvement.
Fig. 24. Iterative Strain Improvement.
E. DEFINITIONS OF TERMS
In order to facilitate understanding of the examples provided herein, certain
frequently occurring methods and/or terms will be described.
The term "agent" is used herein to denote a chemical compound, a mixture of
chemical compounds, an array of spatially localized compounds (e.g., a VLSIPS
peptide
array, polynucleotide array, and/or combinatorial small molecule array),
biological
macromolecule, a bacteriophage peptide display library, a bacteriophage
antibody (e.g.,
scFv) display library, a polysome peptide display library, or an extract made
from
biological materials such as bacteria, plants, fungi, or animal (particularly
mammalian) cells
or tissues. Agents are evaluated for potential activity as anti-neoplastics,
anti-
inflammatories or apoptosis modulators by inclusion in screening assays
described
hereinbelow. Agents are evaluated for potential activity as specific protein
interaction
inhibitors (i.e., an agent which selectively inhibits a binding interaction
between two
predetermined polypeptides but which does not substantially interfere with
cell viability)
by inclusion in screening assays described hereinbelow.
An "ambiguous base requirement" in a restriction site refers to a nucleotide
base
requirement that is not specified to the fullest extent, i.e. that is not a
specific base (such
as, in a non-limiting exemplification, a specific base selected from A, C, G,
and T), but
rather may be any one of at least two or more bases. Commonly accepted
abbreviations
that are used in the art as well as herein to represent ambiguity in bases
include the
following: R = G or A; Y = C or T; M = A or C; K = G or T; S = G or C; W = A or T; H =
A or C or T; B = G or T or C; V = G or C or A; D = G or A or T; N = A or C or G or T.
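The abbreviations listed above lend themselves to a lookup table. The following is a minimal sketch that expands a site containing ambiguous bases into every specific sequence it can represent; the example site "GTYRAC" and the function name are used purely for illustration.

```python
from itertools import product

# Ambiguity codes as listed in the definition above, plus the four
# fully specified bases.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "CT", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT", "N": "ACGT",
}

def expand_site(site):
    """Return, sorted, every specific sequence matched by `site`."""
    choices = (AMBIGUITY[base] for base in site.upper())
    return sorted("".join(combo) for combo in product(*choices))

# Y = C or T and R = G or A, so two ambiguous positions give 4 sequences.
print(expand_site("GTYRAC"))
```

A site with no ambiguous base requirement expands to exactly one sequence, while each ambiguous position multiplies the count by the number of bases it allows.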
The term "amino acid" as used herein refers to any organic compound that
contains an amino group (-NH2) and a carboxyl group (-COOH); preferably either
as free
groups or alternatively after condensation as part of peptide bonds. The
"twenty
naturally encoded polypeptide-forming alpha-amino acids" are understood in the
art
and refer to: alanine (ala or A), arginine (arg or R), asparagine (asn or N),
aspartic acid
(asp or D), cysteine (cys or C), glutamic acid (glu or E), glutamine (gln or
Q), glycine
(gly or G), histidine (his or H), isoleucine (ile or I), leucine (leu or L),
lysine (lys or K),
methionine (met or M), phenylalanine (phe or F), proline (pro or P), serine
(ser or S),
threonine (thr or T), tryptophan (trp or W), tyrosine (tyr or Y), and valine
(val or V).
The term "amplification" means that the number of copies of a polynucleotide
is
increased.
The term "antibody", as used herein, refers to intact immunoglobulin
molecules,
as well as fragments of immunoglobulin molecules, such as Fab, Fab', (Fab')2,
Fv, and
SCA fragments, that are capable of binding to an epitope of an antigen. These
antibody
fragments, which retain some ability to selectively bind to an antigen (e.g.,
a polypeptide
antigen) of the antibody from which they are derived, can be made using well
known
methods in the art (see, e.g., Harlow and Lane, supra), and are described
further, as
follows.
(1) An Fab fragment consists of a monovalent antigen-binding fragment of an
antibody molecule, and can be produced by digestion of a whole antibody
molecule with the enzyme papain, to yield a fragment consisting of an intact
light chain and a portion of a heavy chain.
(2) An Fab' fragment of an antibody molecule can be obtained by treating a
whole
antibody molecule with pepsin, followed by reduction, to yield a molecule
consisting of an intact light chain and a portion of a heavy chain. Two Fab'
fragments are obtained per antibody molecule treated in this manner.
(3) An (Fab')2 fragment of an antibody can be obtained by treating a whole
antibody molecule with the enzyme pepsin, without subsequent reduction. A
(Fab')2 fragment is a dimer of two Fab' fragments, held together by two
disulfide bonds.
(4) An Fv fragment is defined as a genetically engineered fragment containing
the
variable region of a light chain and the variable region of a heavy chain
expressed as two chains.
(5) A single chain antibody ("SCA") is a genetically engineered single chain
molecule containing the variable region of a light chain and the variable
region of a heavy chain, linked by a suitable, flexible polypeptide linker.
The term "Applied Molecular Evolution" ("AME") means the application of an
evolutionary design algorithm to a specific, useful goal. While many different
library
formats for AME have been reported for polynucleotides, peptides and proteins
(phage,
lacI and polysomes), none of these formats have provided for recombination by
random
cross-overs to deliberately create a combinatorial library.
A molecule that has a "chimeric property" is a molecule that is: 1) in part
homologous and in part heterologous to a first reference molecule; while 2) at
the same
time being in part homologous and in part heterologous to a second reference
molecule;
without 3) precluding the possibility of being at the same time in part
homologous and in
part heterologous to still one or more additional reference molecules. In a
non-limiting
embodiment, a chimeric molecule may be prepared by assembling a reassortment
of
partial molecular sequences. In a non-limiting aspect, a chimeric
polynucleotide molecule
may be prepared by synthesizing the chimeric polynucleotide using a plurality of
molecular
templates, such that the resultant chimeric polynucleotide has properties of a
plurality of
templates.
The term "cognate" as used herein refers to a gene sequence that is
evolutionarily
and functionally related between species. For example, but not limitation, in
the human
genome the human CD4 gene is the cognate gene to the mouse 3d4 gene, since the
sequences and structures of these two genes indicate that they are highly
homologous and
both genes encode a protein which functions in signaling T cell activation
through MHC
class II-restricted antigen recognition.
A "comparison window," as used herein, refers to a conceptual segment of at
least
20 contiguous nucleotide positions wherein a polynucleotide sequence may be
compared
to a reference sequence of at least 20 contiguous nucleotides and wherein the
portion of
the polynucleotide sequence in the comparison window may comprise additions or
deletions (i.e., gaps) of 20 percent or less as compared to the reference
sequence (which
does not comprise additions or deletions) for optimal alignment of the two
sequences.
Optimal alignment of sequences for aligning a comparison window may be
conducted by
the local homology algorithm of Smith (Smith and Waterman, Adv Appl Math, 1981;
Smith and Waterman, J Theor Biol, 1981; Smith and Waterman, J Mol Biol, 1981;
Smith et al, J Mol Evol, 1981), by the homology alignment algorithm of
Needleman
(Needleman and Wunsch, 1970), by the search for similarity method of Pearson
(Pearson and Lipman, 1988), by computerized implementations of these
algorithms (GAP,
BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release
7.0,
Genetics Computer Group, 575 Science Dr., Madison, WI), or by inspection, and
the best
alignment (i.e., resulting in the highest percentage of homology over the
comparison
window) generated by the various methods is selected.
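As a rough illustration of the local homology approach named above, the following sketch computes a Smith-Waterman style local alignment score by dynamic programming. The scoring values (match +2, mismatch -1, linear gap -2) and the function name are assumptions chosen for the example; they are not taken from the cited references or from the software packages listed, and the sketch returns only the best score, not the alignment or a percentage of homology.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix. Scoring values are illustrative.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + step,  # extend diagonally (match/mismatch)
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Because every cell is floored at zero, the score reflects the best-matching local segment (here the shared "ACG") rather than a penalty for the dissimilar flanks.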
As used herein, the terms "complementarity-determining region" and "CDR"
refer to the art-recognized term as exemplified by the Kabat and Chothia CDR
definitions
also generally known as supervariable regions or hypervariable loops (Chothia
and Lesk,
1987; Chothia et al, 1989; Kabat et al, 1987; and Tramontano et al, 1990).
Variable
region domains typically comprise the amino-terminal approximately 105-115
amino acids
of a naturally-occurring immunoglobulin chain (e.g., amino acids 1-110),
although
variable domains somewhat shorter or longer are also suitable for forming
single-chain
antibodies.
"Conservative amino acid substitutions" refer to the interchangeability of
residues having similar side chains. For example, a group of amino acids
having aliphatic
side chains is glycine, alanine, valine, leucine, and isoleucine; a group of
amino acids
having aliphatic-hydroxyl side chains is serine and threonine; a group of
amino acids
having amide-containing side chains is asparagine and glutamine; a group of
amino acids
having aromatic side chains is phenylalanine, tyrosine, and tryptophan; a
group of amino
acids having basic side chains is lysine, arginine, and histidine; and a group
of amino acids
having sulfur-containing side chains is cysteine and methionine. Preferred
conservative
amino acid substitution groups are: valine-leucine-isoleucine, phenylalanine-tyrosine,
lysine-arginine, alanine-valine, and asparagine-glutamine.
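The side-chain groups listed above can be encoded directly. The sketch below, with illustrative names, reports whether two residues (one-letter codes) fall in a common group; it encodes only the six side-chain groups of the definition, not the narrower list of preferred substitution groups.

```python
# Side-chain groups as listed in the definition above (one-letter codes).
CONSERVATIVE_GROUPS = {
    "aliphatic": {"G", "A", "V", "L", "I"},
    "aliphatic-hydroxyl": {"S", "T"},
    "amide-containing": {"N", "Q"},
    "aromatic": {"F", "Y", "W"},
    "basic": {"K", "R", "H"},
    "sulfur-containing": {"C", "M"},
}

def shared_groups(res_a, res_b):
    """Names of the groups that contain both residues."""
    return [
        name
        for name, members in CONSERVATIVE_GROUPS.items()
        if res_a in members and res_b in members
    ]

def is_conservative(res_a, res_b):
    """True if two different residues share a side-chain group."""
    return res_a != res_b and bool(shared_groups(res_a, res_b))

print(is_conservative("V", "L"), is_conservative("K", "D"))
```

Note that aspartic acid, glutamic acid, and proline appear in none of the listed groups, so any substitution involving them is reported as non-conservative by this sketch.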
The term "corresponds to" is used herein to mean that a polynucleotide
sequence
is homologous (i.e., is identical, not strictly evolutionarily related) to all
or a portion of a
reference polynucleotide sequence, or that a polypeptide sequence is identical
to a
reference polypeptide sequence. In contradistinction, the term "complementary
to" is used
herein to mean that the complementary sequence is homologous to all or a
portion of a
reference polynucleotide sequence. For illustration, the nucleotide sequence
"TATAC"
corresponds to a reference "TATAC" and is complementary to a reference
sequence
"GTATA."
The term "degrading effective" amount refers to the amount of enzyme which is
required to process at least 50% of the substrate, as compared to substrate
not contacted
with the enzyme. Preferably, at least 80% of the substrate is degraded.
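The "TATAC" / "GTATA" illustration given earlier for the terms "corresponds to" and "complementary to" can be checked with a short sketch. It simplifies the definitions to exact matching against all or a portion of the reference, and the function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def corresponds_to(seq, reference):
    """True if seq is identical to all or a portion of the reference."""
    return seq in reference

def complementary_to(seq, reference):
    """True if the complement of seq (read 5' to 3') is identical to
    all or a portion of the reference."""
    return reverse_complement(seq) in reference

print(corresponds_to("TATAC", "TATAC"))    # same strand
print(complementary_to("TATAC", "GTATA"))  # opposite strand
```

Both checks succeed for the sequences in the illustration, since the reverse complement of "TATAC" is "GTATA".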
As used herein, the term "defined sequence framework" refers to a set of
defined
sequences that are selected on a non-random basis, generally on the basis of
experimental
data or structural data; for example, a defined sequence framework may
comprise a set of

amino acid sequences that are predicted to form a β-sheet structure or may
comprise a
leucine zipper heptad repeat motif, a zinc-finger domain, among other
variations. A
"defined sequence kernel" is a set of sequences which encompass a limited
scope of
variability. Whereas (1) a completely random 10-mer sequence of the 20
conventional
amino acids can be any of (20)^10 sequences, and (2) a pseudorandom 10-mer sequence of
the 20 conventional amino acids can be any of (20)^10 sequences but will
exhibit a bias for
certain residues at certain positions and/or overall, (3) a defined sequence
kernel is a
subset of sequences if each residue position was allowed to be any of the
allowable 20
conventional amino acids (and/or allowable unconventional amino/imino acids).
A
defined sequence kernel generally comprises variant and invariant residue
positions and/or
comprises variant residue positions which can comprise a residue selected from
a defined
subset of amino acid residues, and the like, either segmentally or over the
entire length of
the individual selected library member sequence. Defined sequence kernels can
refer to
either amino acid sequences or polynucleotide sequences. By way of illustration and
not
limitation, the sequences (NNK)10 and (NNM)10, wherein N
represents A, T, G, or C; K
represents G or T; and M represents A or C, are defined sequence kernels.
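To make the NNK and NNM kernels more concrete, the sketch below enumerates the codons each degenerate triplet allows and tallies what they encode. It assumes the standard genetic code, which the text does not specify and which would not hold for hosts with altered codon recognition; the table layout and function names are illustrative.

```python
from itertools import product

# Standard genetic code (an assumption), laid out in T, C, A, G order
# for the first, second, and third codon positions; "*" marks a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Degenerate symbols as defined in the text.
DEGENERATE = {"N": "ATGC", "K": "GT", "M": "AC"}

def codons(pattern):
    """All specific codons matched by a degenerate triplet such as 'NNK'."""
    return ["".join(p) for p in product(*(DEGENERATE[s] for s in pattern))]

def summarize(pattern):
    """Return (codon count, amino acids encoded, sorted stop codons)."""
    matched = codons(pattern)
    encoded = {CODON_TABLE[c] for c in matched} - {"*"}
    stops = sorted(c for c in matched if CODON_TABLE[c] == "*")
    return len(matched), len(encoded), stops

print(summarize("NNK"))
print(summarize("NNM"))
print(20 ** 10)  # number of possible 10-mers of the 20 amino acids
```

Under the standard code, each triplet allows 32 codons; NNK covers all 20 amino acids with a single stop codon, while NNM covers 18 (it cannot encode methionine or tryptophan) with two stop codons.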
"Digestion" of DNA refers to catalytic cleavage of the DNA with a restriction
enzyme that acts only at certain sequences in the DNA. The various restriction
enzymes
used herein are commercially available and their reaction conditions,
cofactors and other
requirements were used as would be known to the ordinarily skilled artisan.
For analytical
purposes, typically 1 µg of plasmid or DNA fragment is used with about 2 units
of enzyme
in about 20 µl of buffer solution. For the purpose of isolating DNA fragments
for plasmid
construction, typically 5 to 50 µg of DNA are digested with 20 to 250 units
of enzyme in a
larger volume. Appropriate buffers and substrate amounts for particular
restriction
enzymes are specified by the manufacturer. Incubation times of about 1 hour at
37°C are
ordinarily used, but may vary in accordance with the supplier's instructions.
After
digestion the reaction is electrophoresed directly on a gel to isolate the
desired fragment.
"Directional ligation" refers to a ligation in which a 5' end and a 3' end of
a
polynucleotide are different enough to specify a preferred ligation
orientation. For
example, an otherwise untreated and undigested PCR product that has two blunt
ends will
typically not have a preferred ligation orientation when ligated into a
cloning vector
digested to produce blunt ends in its multiple cloning site; thus, directional
ligation will
typically not be displayed under these circumstances. In contrast, directional
ligation will
typically be displayed when a digested PCR product having a 5' EcoR I-treated end
and a 3'
BamH I-treated end is ligated into a cloning vector that has a multiple cloning site
digested with EcoR
I and BamH I.
The term "DNA shuffling" is used herein to indicate recombination between
substantially homologous but non-identical sequences; in some embodiments DNA
shuffling may involve crossover via non-homologous recombination, such as via
cre/lox
and/or flp/frt systems and the like.
As used in this invention, the term "epitope" refers to an antigenic
determinant on
an antigen, such as a phytase polypeptide, to which the paratope of an
antibody, such as an
phytase-specific antibody, binds. Antigenic determinants usually consist of
chemically
active surface groupings of molecules, such as amino acids or sugar side
chains, and can
have specific three-dimensional structural characteristics, as well as
specific charge
characteristics. As used herein "epitope" refers to that portion of an antigen
or other
macromolecule capable of forming a binding interaction that interacts with the
variable
region binding body of an antibody. Typically, such binding interaction is
manifested as
an intermolecular contact with one or more amino acid residues of a CDR.
The terms "fragment", "derivative" and "analog" when referring to a reference
polypeptide comprise a polypeptide which retains at least one biological
function or
activity that is at least essentially the same as that of the reference
polypeptide. Furthermore,
the terms "fragment", "derivative" or "analog" are exemplified by a "pro-form"
molecule,
such as a low activity proprotein that can be modified by cleavage to produce
a mature
enzyme with significantly higher activity.
A method is provided herein for producing from a template polypeptide a set of
progeny polypeptides in which a "full range of single amino acid
substitutions" is
represented at each amino acid position. As used herein, "full range of single
amino acid
substitutions" is in reference to the 20 naturally encoded polypeptide-forming
alpha-amino acids, as described herein.
The term "gene" means the segment of DNA involved in producing a polypeptide
chain; it includes regions preceding and following the coding region (leader
and trailer) as
well as intervening sequences (introns) between individual coding segments
(exons).
"Genetic instability", as used herein, refers to the natural tendency of
highly
repetitive sequences to be lost through a process of reductive events
generally involving
sequence simplification through the loss of repeated sequences. Deletions tend
to involve
the loss of one copy of a repeat and everything between the repeats.
The term "heterologous" means that one single-stranded nucleic acid sequence
is
unable to hybridize to another single-stranded nucleic acid sequence or its
complement.
Thus "areas of heterology" means that polynucleotides have areas or regions
within their sequence which are unable to hybridize to another nucleic acid or
polynucleotide. Such regions or areas are, for example, areas of mutations.
The term "homologous" or "homeologous" means that one single-stranded nucleic
acid sequence may hybridize to a complementary single-stranded nucleic acid
sequence. The degree of hybridization may depend on a number of factors
including the
amount of identity between the sequences and the hybridization conditions such
as
temperature and salt concentrations as discussed later. Preferably the region
of identity is
greater than about 5 bp, more preferably the region of identity is greater
than 10 bp.
An immunoglobulin light or heavy chain variable region consists of a
"framework"
region interrupted by three hypervariable regions, also called CDR's. The
extent of the
framework region and CDR's have been precisely defined; see "Sequences of
Proteins of
Immunological Interest" (Kabat et al, 197). The sequences of the framework
regions of
different light or heavy chains are relatively conserved within a species. As
used herein, a
"human framework region" is a framework region that is substantially identical
(about 85% or more, usually 90-95% or more) to the framework region of a
naturally occurring human immunoglobulin. The framework region of an antibody,
that is the combined framework regions of the constituent light and heavy
chains, serves to position and align the CDR's. The CDR's are primarily
responsible for binding to an epitope of an antigen.
The benefits of this invention extend to "commercial applications" (or
commercial processes), which term is used to include applications in
commercial industry proper (or simply industry) as well as non-commercial
applications (e.g. biomedical research at a non-profit institution). Relevant
applications include those in
areas of diagnosis, medicine, agriculture, manufacturing, and academia.
The term "identical" or "identity" means that two nucleic acid sequences have
the
same sequence or a complementary sequence. Thus, "areas of identity" means
that regions
or areas of a polynucleotide or the overall polynucleotide are identical or
complementary
to areas of another polynucleotide or the polynucleotide.
The term "isolated" means that the material is removed from its original
environment (e.g., the natural environment if it is naturally occurring). For
example, a
naturally-occurring polynucleotide or enzyme present in a living animal is not
isolated, but
the same polynucleotide or enzyme, separated from some or all of the
coexisting materials
in the natural system, is isolated. Such polynucleotides could be part of a
vector and/or
such polynucleotides or enzymes could be part of a composition, and still be
isolated in
that such vector or composition is not part of its natural environment.
By "isolated nucleic acid" is meant a nucleic acid, e.g., a DNA or RNA
molecule,
that is not immediately contiguous with the 5' and 3' flanking sequences with
which it
normally is immediately contiguous when present in the naturally occurring
genome of the
organism from which it is derived. The term thus describes, for example, a
nucleic acid
that is incorporated into a vector, such as a plasmid or viral vector; a
nucleic acid that is
incorporated into the genome of a heterologous cell (or the genome of a
homologous cell,
but at a site different from that at which it naturally occurs); and a nucleic
acid that exists
as a separate molecule, e.g., a DNA fragment produced by PCR amplification or
restriction enzyme digestion, or an RNA molecule produced by in vitro
transcription. The
term also describes a recombinant nucleic acid that forms part of a hybrid
gene encoding
additional polypeptide sequences that can be used, for example, in the
production of a
fusion protein.
As used herein "ligand" refers to a molecule, such as a random peptide or
variable
segment sequence, that is recognized by a particular receptor. As one of skill
in the art
will recognize, a molecule (or macromolecular complex) can be both a receptor
and a
ligand. In general, the binding partner having a smaller molecular weight is
referred to as
the ligand and the binding partner having a greater molecular weight is
referred to as a
receptor.
"Ligation" refers to the process of forming phosphodiester bonds between two
double stranded nucleic acid fragments (Sambrook et al, 1982, p. 146;
Sambrook, 1989).
Unless otherwise provided, ligation may be accomplished using known buffers
and
conditions with 10 units of T4 DNA ligase ("ligase") per 0.5 µg of
approximately
equimolar amounts of the DNA fragments to be ligated.
As used herein, "linker" or "spacer" refers to a molecule or group of
molecules
that connects two molecules, such as a DNA binding protein and a random
peptide, and
serves to place the two molecules in a preferred configuration, e.g., so that
the random
peptide can bind to a receptor with minimal steric hindrance from the DNA
binding
protein.
As used herein, a "molecular property to be evolved" includes reference to
molecules comprised of a polynucleotide sequence, molecules comprised of a
polypeptide
sequence, and molecules comprised in part of a polynucleotide sequence and in
part of a
polypeptide sequence. Particularly relevant - but by no means limiting -
examples of
molecular properties to be evolved include enzymatic activities at specified
conditions,
such as related to temperature; salinity; pressure; pH; and concentration of
glycerol, DMSO, detergent, and/or any other molecular species with which
contact is made in a
reaction environment. Additional particularly relevant - but by no means
limiting -
examples of molecular properties to be evolved include stabilities - e.g. the
amount of a
residual molecular property that is present after a specified exposure time to
a specified
environment, such as may be encountered during storage.
The term "mutations" includes changes in the sequence of a wild-type or
parental
nucleic acid sequence or changes in the sequence of a peptide. Such mutations
may be
point mutations such as transitions or transversions. The mutations may be
deletions,
insertions or duplications. A mutation can also be a "chimerization", which is
exemplified in a progeny molecule that is generated to contain part or all of
a sequence of
one parental molecule as well as part or all of a sequence of at least one
other parental
molecule. This invention provides for both chimeric polynucleotides and
chimeric
polypeptides.
As used herein, the degenerate "N,N,G/T" nucleotide sequence represents 32
possible triplets, where "N" can be A, C, G or T.
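As an illustration of the count given above, the following minimal Python sketch enumerates the N,N,G/T triplets and translates them. It assumes the standard genetic code; the names CODON_TABLE and nng_t_triplets are hypothetical and do not appear in this specification.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# (an assumption of this sketch; other genetic codes exist).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def nng_t_triplets():
    """Return every triplet matched by the degenerate N,N,G/T pattern."""
    return ["".join(t) for t in product("ACGT", "ACGT", "GT")]


triplets = nng_t_triplets()
# Amino acids reachable through the degenerate triplets ("*" marks a stop).
amino_acids = sorted({CODON_TABLE[t] for t in triplets} - {"*"})
stop_codons = [t for t in triplets if CODON_TABLE[t] == "*"]

print(len(triplets), "triplets")
print(len(amino_acids), "amino acids:", "".join(amino_acids))
print("stop codons:", stop_codons)
```

Under the standard code, restricting the third position to G or T still reaches all 20 amino acids while leaving a single stop codon, which is consistent with the "full range of single amino acid substitutions" defined earlier.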
The term "naturally-occurring" as used herein as applied to an object refers
to the fact that an object can be found in nature. For example, a polypeptide or
polynucleotide sequence that is present in an organism (including viruses)
that can be isolated from a source in nature and which has not been
intentionally modified by man in the laboratory is naturally occurring.
Generally, the term naturally occurring refers to an
object as present in a non-pathological (un-diseased) individual, such as
would be typical
for the species.
As used herein, a "nucleic acid molecule" is comprised of at least one base or
one
base pair, depending on whether it is single-stranded or double-stranded,
respectively.
Furthermore, a nucleic acid molecule may belong exclusively or chimerically to
any group
of nucleotide-containing molecules, as exemplified by, but not limited to, the
following
groups of nucleic acid molecules: RNA, DNA, genomic nucleic acids, non-genomic
nucleic acids, naturally occurring and not naturally occurring nucleic acids,
and synthetic
nucleic acids. This includes, by way of non-limiting example, nucleic acids
associated
with any organelle, such as the mitochondria, ribosomal RNA, and nucleic acid
molecules
comprised chimerically of one or more components that are not naturally
occurring along
with naturally occurring components.
Additionally, a "nucleic acid molecule" may contain in part one or more non-
nucleotide-based components as exemplified by, but not limited to, amino acids
and
sugars. Thus, by way of example, but not limitation, a ribozyme that is in
part nucleotide-
based and in part protein-based is considered a "nucleic acid molecule".
In addition, by way of example, but not limitation, a nucleic acid molecule
that is
labeled with a detectable moiety, such as a radioactive or alternatively a
non-radioactive
label, is likewise considered a "nucleic acid molecule".
The terms "nucleic acid sequence coding for" or a "DNA coding sequence of" or
a "nucleotide sequence encoding" a particular enzyme - as well as other
synonymous terms - refer to a DNA sequence which is transcribed and translated
into an enzyme when placed under the control of appropriate regulatory
sequences. A "promoter sequence" is a DNA regulatory region capable of binding
RNA polymerase in a cell and initiating transcription of a downstream (3'
direction) coding sequence. The promoter is part of the DNA sequence. This
sequence region has a start codon at its 3' terminus. The promoter sequence
includes the minimum number of bases or elements necessary to initiate
transcription at levels detectable above background. However, after the RNA
polymerase binds the sequence and transcription is initiated at the start codon
(3' terminus with a promoter), transcription proceeds downstream in the 3'
direction. Within the promoter sequence will be found a transcription
initiation site (conveniently defined by mapping with nuclease S1) as well as
protein binding domains (consensus sequences) responsible for the binding of
RNA polymerase.
The terms "nucleic acid encoding an enzyme (protein)" or "DNA encoding an
enzyme (protein)" or "polynucleotide encoding an enzyme (protein)" and other
synonymous terms encompass a polynucleotide which includes only coding
sequence
for the enzyme as well as a polynucleotide which includes additional coding
and/or non-
coding sequence.
In one preferred embodiment, a "specific nucleic acid molecule species" is
defined by its chemical structure, as exemplified by, but not limited to, its
primary
sequence. In another preferred embodiment, a specific "nucleic acid molecule
species" is
defined by a function of the nucleic acid species or by a function of a
product derived from
the nucleic acid species. Thus, by way of non-limiting example, a "specific
nucleic acid
molecule species" may be defined by one or more activities or properties
attributable to it,
including activities or properties attributable to its expressed product.
The instant definition of "assembling a working nucleic acid sample into a
nucleic acid library" includes the process of incorporating a nucleic acid
sample into a
vector-based collection, such as by ligation into a vector and transformation
of a host. A
description of relevant vectors, hosts, and other reagents as well as specific
non-limiting
examples thereof are provided hereinafter. The instant definition of
"assembling a
working nucleic acid sample into a nucleic acid library" also includes the
process of
incorporating a nucleic acid sample into a non-vector-based collection, such
as by ligation
to adaptors. Preferably the adaptors can anneal to PCR primers to facilitate
amplification
by PCR.
Accordingly, in a non-limiting embodiment, a "nucleic acid library" is
comprised
of a vector-based collection of one or more nucleic acid molecules. In another
preferred
embodiment a "nucleic acid library" is comprised of a non-vector-based
collection of
nucleic acid molecules. In yet another preferred embodiment a "nucleic acid
library" is
comprised of a combined collection of nucleic acid molecules that is in part
vector-based
and in part non-vector-based. Preferably, the collection of molecules
comprising a library
is searchable and separable according to individual nucleic acid molecule
species.
The present invention provides a "nucleic acid construct" or alternatively a
"nucleotide construct" or alternatively a "DNA construct". The term
"construct" is
used herein to describe a molecule, such as a polynucleotide (e.g., a phytase
polynucleotide) may optionally be chemically bonded to one or more additional
molecular
moieties, such as a vector, or parts of a vector. In a specific - but by no
means limiting -
aspect, a nucleotide construct is exemplified by a DNA expression DNA
expression
constructs suitable for the transformation of a host cell.
An "oligonucleotide" (or synonymously an "oligo") refers to either a single
stranded polydeoxynucleotide or two complementary polydeoxynucleotide strands
which
may be chemically synthesized. Such synthetic oligonucleotides may or may not
have a 5'
phosphate. Those that do not will not ligate to another oligonucleotide
without adding a
phosphate with an ATP in the presence of a kinase. A synthetic oligonucleotide
will ligate
to a fragment that has not been dephosphorylated. To achieve polymerase-based
amplification (such as with PCR), a "32-fold degenerate oligonucleotide that
is
comprised of, in series, at least a first homologous sequence, a degenerate
N,N,G/T
sequence, and a second homologous sequence" is mentioned. As used in this
context,
"homologous" is in reference to homology between the oligo and the parental
polynucleotide that is subjected to the polymerase-based amplification.
As used herein, the term "operably linked" refers to a linkage of
polynucleotide
elements in a functional relationship. A nucleic acid is "operably linked"
when it is placed
into a functional relationship with another nucleic acid sequence. For
instance, a promoter
or enhancer is operably linked to a coding sequence if it affects the
transcription of the
coding sequence. Operably linked means that the DNA sequences being linked are
typically contiguous and, where necessary to join two protein coding regions,
contiguous
and in reading frame.
A coding sequence is "operably linked to" another coding sequence when RNA
polymerase will transcribe the two coding sequences into a single mRNA, which
is then
translated into a single polypeptide having amino acids derived from both
coding
sequences. The coding sequences need not be contiguous to one another so long
as the
expressed sequences are ultimately processed to produce the desired protein.
As used herein the term "parental polynucleotide set" is a set comprised of
one or
more distinct polynucleotide species. Usually this term is used in reference
to a progeny
polynucleotide set which is preferably obtained by mutagenization of the
parental set, in
which case the terms "parental", "starting" and "template" are used
interchangeably.
As used herein the term "physiological conditions" refers to temperature, pH,
ionic strength, viscosity, and like biochemical parameters which are
compatible with a
viable organism, and/or which typically exist intracellularly in a viable
cultured yeast cell
or mammalian cell. For example, the intracellular conditions in a yeast cell
grown under
typical laboratory culture conditions are physiological conditions. Suitable
in vitro
reaction conditions for in vitro transcription cocktails are generally
physiological
conditions. In general, in vitro physiological conditions comprise 50-200 mM
NaCl or KCl, pH 6.5-8.5, 20-45 °C and 0.001-10 mM divalent cation (e.g., Mg++,
Ca++); preferably
about 150 mM NaCl or KCl, pH 7.2-7.6, 5 mM divalent cation, and often include
0.01-1.0 percent nonspecific protein (e.g., BSA). A non-ionic detergent (Tween,
NP-40, Triton X-100) can often be present, usually at about 0.001 to 2%,
typically 0.05-0.2% (v/v).
Particular aqueous conditions may be selected by the practitioner according to
conventional methods. For general guidance, the following buffered aqueous
conditions
may be applicable: 10-250 mM NaCl, 5-50 mM Tris HCl, pH 5-8, with optional
addition
of divalent cation(s) and/or metal chelators and/or non-ionic detergents
and/or membrane
fractions and/or anti-foam agents and/or scintillants.
Standard convention (5' to 3') is used herein to describe the sequence of
double-stranded polynucleotides.
The term "population" as used herein means a collection of components such as
polynucleotides, portions of polynucleotides or proteins. A "mixed population"
means a
collection of components which belong to the same family of nucleic acids or
proteins
(i.e., are related) but which differ in their sequence (i.e., are not
identical) and hence in
their biological activity.
A molecule having a "pro-form" refers to a molecule that undergoes any
combination of one or more covalent and noncovalent chemical modifications
(e.g.
glycosylation, proteolytic cleavage, dimerization or oligomerization,
temperature-induced
or pH-induced conformational change, association with a co-factor, etc.) en
route to attain
a more mature molecular form having a property difference (e.g. an increase in
activity) in
comparison with the reference pro-form molecule. When two or more chemical
modifications (e.g. two proteolytic cleavages, or a proteolytic cleavage and a
deglycosylation) can be distinguished en route to the production of a mature
molecule, the reference precursor molecule may be termed a "pre-pro-form"
molecule.
As used herein, the term "pseudorandom" refers to a set of sequences that have
limited variability, such that, for example, the degree of residue variability
at one position is different than the degree of residue variability at another
position, but any pseudorandom position is allowed some degree of residue
variation, however circumscribed.

"Quasi-repeated units", as used herein, refers to the repeats to be re-
assorted and
are by definition not identical. Indeed the method is proposed not only for
practically
identical encoding units produced by mutagenesis of the identical starting
sequence, but
also the reassortment of similar or related sequences which may diverge
significantly in
some regions. Nevertheless, if the sequences contain sufficient homologies to
be
reassorted by this approach, they can be referred to as "quasi-repeated"
units.
As used herein "random peptide library" refers to a set of polynucleotide
sequences that encodes a set of random peptides, and to the set of random
peptides
encoded by those polynucleotide sequences, as well as the fusion proteins
containing those random peptides.
As used herein, "random peptide sequence" refers to an amino acid sequence
composed of two or more amino acid monomers and constructed by a stochastic or
random process. A random peptide can include framework or scaffolding motifs,
which
may comprise invariant sequences.
As used herein, "receptor" refers to a molecule that has an affinity for a
given
ligand. Receptors can be naturally occurring or synthetic molecules. Receptors
can be
employed in an unaltered state or as aggregates with other species. Receptors
can be
attached, covalently or non-covalently, to a binding member, either directly
or via a
specific binding substance. Examples of receptors include, but are not limited
to,
antibodies, including monoclonal antibodies and antisera reactive with
specific antigenic
determinants (such as on viruses, cells, or other materials), cell membrane
receptors,
complex carbohydrates and glycoproteins, enzymes, and hormone receptors.
"Recombinant" enzymes refer to enzymes produced by recombinant DNA
techniques, i.e., produced from cells transformed by an exogenous DNA
construct
encoding the desired enzyme. "Synthetic" enzymes are those prepared by
chemical
synthesis.
The term "related polynucleotides" means that regions or areas of the
polynucleotides are identical and regions or areas of the polynucleotides are
heterologous.
"Reductive reassortment", as used herein, refers to the increase in molecular
diversity that is accrued through deletion (and/or insertion) events that are
mediated by
repeated sequences.
The following terms are used to describe the sequence relationships between
two
or more polynucleotides: "reference sequence," "comparison window," "sequence
identity," "percentage of sequence identity," and "substantial identity."
A "reference sequence" is a defined sequence used as a basis for a sequence
comparison; a reference sequence may be a subset of a larger sequence, for
example, as a
segment of a full-length cDNA or gene sequence given in a sequence listing, or
may
comprise a complete cDNA or gene sequence. Generally, a reference sequence is
at least
20 nucleotides in length, frequently at least 25 nucleotides in length, and
often at least 50
nucleotides in length. Since two polynucleotides may each (1) comprise a
sequence (i.e., a
portion of the complete polynucleotide sequence) that is similar between the
two
polynucleotides and (2) may further comprise a sequence that is divergent
between the two
polynucleotides, sequence comparisons between two (or more) polynucleotides
are
typically performed by comparing sequences of the two polynucleotides over a
"comparison window" to identify and compare local regions of sequence
similarity.
"Repetitive Index (RI)", as used herein, is the average number of copies of
the
quasi-repeated units contained in the cloning vector.
The term "restriction site" refers to a recognition sequence that is necessary
for
the manifestation of the action of a restriction enzyme, and includes a site
of catalytic
cleavage. It is appreciated that a site of cleavage may or may not be
contained within a
portion of a restriction site that comprises a low ambiguity sequence (i.e. a
sequence
containing the principal determinant of the frequency of occurrence of the
restriction site).
Thus, in many cases, relevant restriction sites contain only a low ambiguity
sequence with
an internal cleavage site (e.g. G/AATTC in the EcoR I site) or an immediately
adjacent
cleavage site (e.g. /CCWGG in the EcoR II site). In other cases, relevant
restriction sites [e.g. the Eco57 I site or CTGAAG(16/14)] contain a low
ambiguity sequence
(e.g. the CTGAAG sequence in the Eco57 I site) with an external cleavage site
(e.g. in the
N16 portion of the Eco57 I site). When an enzyme (e.g. a restriction enzyme)
is said to
"cleave" a polynucleotide, it is understood to mean that the restriction
enzyme catalyzes or
facilitates a cleavage of a polynucleotide.
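The distinction between a low ambiguity recognition sequence and an external cleavage site can be sketched in Python as follows. This is an illustrative sketch only. It assumes the usual reading of the CTGAAG(16/14) notation, namely that the cut falls 16 nucleotides past the recognition sequence on the strand shown and 14 nucleotides past it on the complementary strand. It scans only the strand as written, and the function name eco57i_cut_positions is hypothetical.

```python
def eco57i_cut_positions(seq, site="CTGAAG", top_offset=16, bottom_offset=14):
    """Locate recognition sites and their external cleavage positions.

    Returns a list of (site_start, top_cut, bottom_cut) tuples using 0-based
    coordinates on the strand as written; a cut position p means the cut
    falls between seq[p - 1] and seq[p]. Sites too close to the end of the
    sequence to be cut on the top strand are skipped.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        end = start + len(site)  # first position after the recognition site
        top_cut = end + top_offset
        bottom_cut = end + bottom_offset
        if top_cut <= len(seq):
            cuts.append((start, top_cut, bottom_cut))
        start = seq.find(site, start + 1)
    return cuts


# Example: one recognition site followed by 20 arbitrary downstream bases.
example = "AA" + "CTGAAG" + "T" * 20
print(eco57i_cut_positions(example))
```

A complete search would also scan the reverse complement, since the recognition sequence is not palindromic; that step is omitted here for brevity.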
In a non-limiting aspect, a "selectable polynucleotide" is comprised of a 5'
terminal region (or end region), an intermediate region (i.e. an internal or
central region),
and a 3' terminal region (or end region). As used in this aspect, a 5'
terminal region is a
region that is located towards a 5' polynucleotide terminus (or a 5'
polynucleotide end);
thus it is either partially or entirely in a 5' half of a polynucleotide.
Likewise, a 3' terminal
region is a region that is located towards a 3' polynucleotide terminus (or a
3'
polynucleotide end); thus it is either partially or entirely in a 3' half of a
polynucleotide.
As used in this non-limiting exemplification, there may be sequence overlap
between any
two regions or even among all three regions.
The term "sequence identity" means that two polynucleotide sequences are
identical (i.e., on a nucleotide-by-nucleotide basis) over the window of
comparison. The
term "percentage of sequence identity" is calculated by comparing two
optimally aligned
sequences over the window of comparison, determining the number of positions
at which
the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both
sequences to yield
the number of matched positions, dividing the number of matched positions by
the total
number of positions in the window of comparison (i.e., the window size), and
multiplying
the result by 100 to yield the percentage of sequence identity. The term
"substantial identity", as used herein, denotes a characteristic of a
polynucleotide sequence, wherein the polynucleotide comprises a sequence having
at least 80 percent sequence identity, preferably at least 85 percent identity,
often 90 to 95 percent sequence identity, and most commonly at least 99 percent
sequence identity as compared to a reference sequence over a comparison window
of at least 25-50 nucleotides, wherein the percentage of
sequence
identity is calculated by comparing the reference sequence to the
polynucleotide sequence
which may include deletions or additions which total 20 percent or less of the
reference
sequence over the window of comparison.
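The percentage calculation described above can be sketched as follows. This minimal Python example assumes the two sequences have already been optimally aligned over the comparison window and padded with "-" at gap positions. The function names and the 80 percent default threshold argument are illustrative choices that follow the lowest figure quoted above.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of window positions carrying the identical base.

    Both inputs must be pre-aligned and of equal length; "-" marks a gap
    (a deletion or addition) and never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    # matched positions / window size * 100
    return 100.0 * matches / len(aligned_a)


def is_substantially_identical(aligned_a, aligned_b, threshold=80.0):
    """Apply a percent-identity threshold (80 percent by default)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
print(is_substantially_identical("ACG-T", "ACGAT"))
```

The sketch does not perform the alignment itself and does not enforce the minimum window length of 25-50 nucleotides; both would be needed in practice.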
As known in the art "similarity" between two enzymes is determined by
comparing the amino acid sequence and its conserved amino acid substitutes of
one
enzyme to the sequence of a second enzyme. Similarity may be determined by
procedures
which are well-known in the art, for example, a BLAST program (Basic Local
Alignment
Search Tool at the National Center for Biotechnology Information).
As used herein, the term "single-chain antibody" refers to a polypeptide
comprising a VH domain and a VL domain in polypeptide linkage, generally linked
via a spacer peptide (e.g., [Gly-Gly-Gly-Gly-Ser]x), and which may comprise
additional amino acid sequences at the amino- and/or carboxy- termini. For
example, a single-chain antibody may comprise a tether segment for linking to
the encoding polynucleotide. As an example, a scFv is a single-chain antibody.
Single-chain antibodies are generally proteins consisting of one or more
polypeptide segments of at least 10 contiguous amino acids substantially
encoded by genes of the immunoglobulin superfamily (e.g., see Williams and
Barclay, 1959, pp. 361-365, which is incorporated herein by reference), most
frequently encoded by a rodent, non-human primate, avian, porcine, bovine,
ovine, goat, or
human heavy chain or light chain gene sequence. A functional single-chain
antibody
generally contains a sufficient portion of an immunoglobulin superfamily gene
product so
as to retain the property of binding to a specific target molecule, typically
a receptor or
antigen (epitope).
The members of a pair of molecules (e.g., an antibody-antigen pair or a
nucleic
acid pair) are said to "specifically bind" to each other if they bind to each
other with
greater affinity than to other, non-specific molecules. For example, an
antibody raised
against an antigen to which it binds more efficiently than to a non-specific
protein can be
described as specifically binding to the antigen. (Similarly, a nucleic acid
probe can be
described as specifically binding to a nucleic acid target if it forms a
specific duplex with
the target by base pairing interactions (see above).)
"Specific hybridization" is defined herein as the formation of hybrids between
a
first polynucleotide and a second polynucleotide (e.g., a polynucleotide
having a distinct
but substantially identical sequence to the first polynucleotide), wherein
substantially
unrelated polynucleotide sequences do not form hybrids in the mixture.
The term "specific polynucleotide" means a polynucleotide having certain end
points and having a certain nucleic acid sequence. Two polynucleotides wherein
one polynucleotide has the identical sequence as a portion of the second
polynucleotide but different ends comprise two different specific
polynucleotides.
"Stringent hybridization conditions" means hybridization will occur only if
there
is at least 90% identity, preferably at least 95% identity and most preferably
at least 97%
identity between the sequences. See Sambrook et al, 1989, which is hereby
incorporated
by reference in its entirety.
Also included in the invention are polypeptides having sequences that are
"substantially identical" to the sequence of a phytase polypeptide, such as
one of SEQ ID
1. A "substantially identical" amino acid sequence is a sequence that differs
from a
reference sequence only by conservative amino acid substitutions, for example,
substitutions of one amino acid for another of the same class (e.g.,
substitution of one
hydrophobic amino acid, such as isoleucine, valine, leucine, or methionine,
for another, or
substitution of one polar amino acid for another, such as substitution of
arginine for lysine,
glutamic acid for aspartic acid, or glutamine for asparagine).
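The conservative-substitution test described above can be sketched in Python as follows. The grouping is deliberately limited to the examples named in this paragraph (isoleucine, valine, leucine and methionine; arginine and lysine; glutamic and aspartic acid; glutamine and asparagine), written in one-letter code. It is an assumption made for illustration, not a complete classification of amino acids, and the function names are hypothetical.

```python
# Substitution classes taken only from the examples in the paragraph above.
CONSERVATIVE_GROUPS = [
    set("IVLM"),  # hydrophobic: isoleucine, valine, leucine, methionine
    set("RK"),    # arginine / lysine
    set("DE"),    # aspartic acid / glutamic acid
    set("NQ"),    # asparagine / glutamine
]


def is_conservative(ref_residue, new_residue):
    """True if the residues are identical or share one of the groups."""
    if ref_residue == new_residue:
        return True
    return any(
        ref_residue in group and new_residue in group
        for group in CONSERVATIVE_GROUPS
    )


def differs_only_conservatively(reference, candidate):
    """True if every position is unchanged or conservatively substituted.

    Sequences of different length are rejected, since insertions and
    deletions are not conservative substitutions.
    """
    if len(reference) != len(candidate):
        return False
    return all(is_conservative(r, c) for r, c in zip(reference, candidate))


print(differs_only_conservatively("MKVLDE", "MRILEE"))
print(differs_only_conservatively("MKVLDE", "MKVLDW"))
```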
Additionally, a "substantially identical" amino acid sequence is a sequence
that differs from a reference sequence by one or more non-conservative
substitutions, deletions, or insertions, particularly when such a substitution
occurs at a site that is not the active site of the molecule, and provided that
the polypeptide essentially retains its
behavioural properties. For example, one or more amino acids can be deleted
from a
phytase polypeptide, resulting in modification of the structure of the
polypeptide, without
significantly altering its biological activity. For example, amino- or
carboxyl-terminal
amino acids that are not required for phytase biological activity can be
removed. Such
modifications can result in the development of smaller active phytase
polypeptides.
The present invention provides a "substantially pure enzyme". The term
"substantially pure enzyme" is used herein to describe a molecule, such as a
polypeptide
(e.g., a phytase polypeptide, or a fragment thereof) that is substantially
free of other

CA 02413022 2002-12-13
WO 01/96551 PCT/USO1/19367
proteins, lipids, carbohydrates, nucleic acids, and other biological materials
with which it
is naturally associated. For example, a substantially pure molecule, such as a
polypeptide,
can be at least 60%, by dry weight, the molecule of interest. The purity of
the
polypeptides can be determined using standard methods including, e.g.,
polyacrylamide
gel electrophoresis (e.g., SDS-PAGE), column chromatography (e.g., high
performance
liquid chromatography (HPLC)), and amino-terminal amino acid sequence
analysis.
As used herein, "substantially pure" means an object species is the
predominant
species present (i.e., on a molar basis it is more abundant than any other
individual
macromolecular species in the composition), and preferably a substantially
purified fraction
is a composition wherein the object species comprises at least about 50
percent (on a
molar basis) of all macromolecular species present. Generally, a substantially
pure
composition will comprise more than about 80 to 90 percent of all
macromolecular species
present in the composition. Most preferably, the object species is purified to
essential
homogeneity (contaminant species cannot be detected in the composition by
conventional
detection methods) wherein the composition consists essentially of a single
macromolecular species. Solvent species, small molecules (<500 Daltons), and
elemental
ion species are not considered macromolecular species.
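The molar-basis criteria above can be sketched in Python as follows. The data layout (a mapping from species name to molecular weight in Daltons and amount in moles) and the function name molar_purity are assumptions made for illustration. Species under 500 Daltons are excluded as small molecules, in line with the last sentence above.

```python
def molar_purity(species, target, small_molecule_cutoff=500.0):
    """Return (molar_fraction, is_predominant) for the target species.

    `species` maps a name to (molecular_weight_in_daltons, moles). Entries
    below the cutoff are ignored, since small molecules are not counted as
    macromolecular species.
    """
    macro = {
        name: moles
        for name, (weight, moles) in species.items()
        if weight >= small_molecule_cutoff
    }
    if target not in macro:
        raise ValueError("target is not a macromolecular species in the sample")
    total = sum(macro.values())
    fraction = macro[target] / total
    # Predominant: more abundant than any other individual macromolecule.
    is_predominant = all(
        macro[target] > moles for name, moles in macro.items() if name != target
    )
    return fraction, is_predominant


sample = {
    "enzyme": (45000.0, 8.0),
    "contaminant protein": (60000.0, 2.0),
    "glycerol": (92.0, 1000.0),  # small molecule, ignored
}
fraction, is_predominant = molar_purity(sample, "enzyme")
print(fraction, is_predominant)
```

In this made-up sample the enzyme accounts for 80 percent of the macromolecular species on a molar basis, so it would meet the "at least about 50 percent" criterion stated above.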
As used herein, the term "variable segment" refers to a portion of a nascent
peptide which comprises a random, pseudorandom, or defined kernel sequence. A
variable segment can comprise both variant
and invariant residue positions, and the degree of residue variation at a
variant residue
position may be limited: both options are selected at the discretion of the
practitioner.
Typically, variable segments are about 5 to 20 amino acid residues in length
(e.g., 8 to 10),
although variable segments may be longer and may comprise antibody portions or
receptor
proteins, such as an antibody fragment, a nucleic acid binding protein, a
receptor protein,
and the like.
The term "wild-type" means that the polynucleotide does not comprise any
mutations. A "wild type" protein means that the protein will be active at a
level of activity
found in nature and will comprise the amino acid sequence found in nature.
The term "working", as in "working sample", for example, is simply a sample
with which one is working. Likewise, a "working molecule", for example, is a
molecule
with which one is working.
F. DETAILED DESCRIPTION OF THE INVENTION
1. GENOMIC CHARACTERIZATION METHODS
In one aspect, this invention describes a new method to sequence DNA. The
improvements over the existing DNA sequencing technologies are high speed,
high
throughput, no electrophoresis and gel reading artifacts due to the complete
absence
of an electrophoretic step, and no costly reagents involving various
substitutions with
stable isotopes. The invention utilizes the Sanger sequencing strategy and
assembles
the sequence information by analysis of the nested fragments obtained by
base-specific
chain termination via their different molecular masses using mass
spectrometry, for
example, MALDI or ES mass spectrometry. A further increase in throughput can
be
obtained by introducing mass modifications in the oligonucleotide primer,
chain-terminating nucleoside triphosphates and/or in the chain-elongating nucleoside
triphosphates, as well as using integrated tag sequences which allow
multiplexing by
hybridization of tag specific probes with mass differentiated molecular
weights.
The present invention pertains to a method for sequencing genomes. The
method comprises the steps of obtaining nucleic acid material from a genome.
Then
there is the step of constructing a clone library and one or more probe
libraries from
the nucleic acid material. Next there is the step of comparing the libraries
to form
comparisons. Then there is the step of combining the comparisons to construct
a map
of the clones relative to the genome. Next there is the step of determining
the
sequence of the genome by means of the map.
The present invention also pertains to a system for sequencing a genome. The
system comprises a mechanism for obtaining nucleic acid material from a
genome.
The system also comprises a mechanism for constructing a clone library and one
or
more probe libraries. The constructing mechanism is in communication with the
nucleic acid material from a genome. Additionally, the system comprises a
mechanism for comparing said libraries to form comparisons. The comparing
mechanism is in communication with the said libraries. The system also
comprises a
mechanism for combining the comparisons to construct a map of the clones
relative to
the genome. The said combining mechanism is in communication with the
comparisons. Further, the system comprises a mechanism for determining the
sequence of the genome by means of said map. The said determining mechanism is
in
communication with said map. The present invention additionally pertains to a
method for producing a gene of a genome.
An efficient method for sequencing large fragments of DNA is described. A
subclone path through the fragment is first identified; the collection of
subclones that
define this path is then sequenced using transposon-mediated direct sequencing
techniques to an extent sufficient to provide the complete sequence of the
fragment.
Improved techniques are provided for DNA sequencing, and particularly for
sequencing of the entire human genome. Different base-specific reactions are
utilized
to produce different sets of DNA fragments from a piece of DNA of unknown
sequence.
Each of the different sets of DNA fragments has a common origin and terminates
at a
particular base along the unknown sequence. The molecular weight of the DNA
fragments in each of the different sets is detected by a matrix assisted laser
desorption
mass spectrometer to determine the sequence of the different bases in the DNA.
The
methods and apparatus of the present invention provide a relatively simple and
low
cost technique which may be automated to sequence thousands of gene bases per
hour, and eliminates the tedious and time consuming gel electrophoresis
separation
technique conventionally used to determine the masses of DNA fragments.
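By way of illustration only, the mass-based readout described above can be sketched as follows. The sketch assumes idealized, singly resolved peaks, uses approximate average residue masses, and ignores the slightly lighter dideoxy terminus and any primer mass; the function names are hypothetical and are not part of the described apparatus.

```python
# Approximate average mass (Da) added by each nucleotide residue in a DNA chain.
# Assumption: rounded textbook values, used here only for illustration.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def fragment_masses(sequence, primer_mass=0.0):
    """Return {base: [masses]} for the four base-specific nested fragment sets.

    Every fragment shares a common origin; the fragment ending at position i
    is filed under the base found at position i.
    """
    sets = {base: [] for base in RESIDUE_MASS}
    running = primer_mass
    for base in sequence:
        running += RESIDUE_MASS[base]
        sets[base].append(round(running, 2))
    return sets


def read_sequence(sets):
    """Merge the four sets, sort by mass, and read off each terminal base."""
    peaks = [(mass, base) for base, masses in sets.items() for mass in masses]
    return "".join(base for _, base in sorted(peaks))


spectra = fragment_masses("GATTACA")
print(read_sequence(spectra))
```

Because each fragment is heavier than the one before it by at least one residue mass, sorting the pooled peaks by mass restores the order of the bases without any electrophoretic separation.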
Processes and kits for simultaneously amplifying and sequencing nucleic acid
molecules, and performing high throughput DNA sequencing are described.
A new contiguous genome sequencing method is described which allows the
contiguous sequencing of a very long DNA without need to be subcloned. It uses
the
basic PCR technique but circumvents the usual need of this technique for
knowledge of two primers for contiguous sequencing, making knowledge of
only
one primer sufficient. The present invention makes it possible to PCR amplify
a DNA
adjacent to a known sequence with which one primer can be made without the
knowledge of the second primer binding site present in the unknown sequence.
The
present invention could thus be used to contiguously sequence a very long DNA
such
as that contained in a YAC clone or a cosmid clone, without the need for
subcloning
smaller fragments, using the standard PCR technique. It can also be used to
sequence
a whole chromosome or genome without any need to subclone it.
Methods and means are provided for the massively parallel characterization of
complex molecules and of molecular recognition phenomena with parallelism and
redundancy attained through single molecule examination methods. Applications
include ultra-rapid genome sequencing, affinity characterization, pathogen
characterization and detection means for clinical use and use in the
development and
construction of cybernetic immune systems. Novel methods for single molecule
examination and manipulation are provided, including scanned beam light
microscopic means and methods, and detection means availing of optoelectronic
array
devices. Various apparatus for rate control, including stepping control for
various
reactions are combined with molecular recognition, signal amplification and
single
molecule examination methods. Inclusion of internal control in samples,
algorithm-
based dynamically responsive manipulation controls, and sample redundancy, are
availed to provide an arbitrarily high degree of accuracy in final data.

1.1 SEQUENCING
The present invention relates to sequencing of DNA and is in the field of
determining the nucleotide sequence of large segments of DNA. More
specifically,
the invention provides an improved method to obtain the complete nucleotide
sequence of genomic DNA provided in fragments of over 30 kb.
The present invention pertains to a process for determining the DNA sequence
of the genome of an organism. And more particularly, the invention relates to
the
sequencing of the entire human genome.
More specifically, the present invention is related to constructing clone maps
of organisms, and then using these maps to direct the sequencing effort. The
invention
also pertains to systems that can effectively use this sequence and map
information.
The invention relates to the massively parallel single molecule examination of
associations or reactions between large numbers of first complex molecules,
which
may be diverse, and second single or plural probing molecules, which may or
may not
be diverse, with applications to biology, biotechnology, pharmacology,
immunology,
the novel field of cybernetic immunology, molecular evolution, cybernetic
molecular
evolution, genomics, comparative genomics, enzymology, clinical enzymology,
pathology, medical research, and clinical medicine.
The present invention has applications in the area of polynucleotide sequence
determination, including DNA sequencing.
1.2 SEQUENCING METHODS
1.2.1 Importance of DNA sequencing:
Current knowledge regarding gene structure, the control of gene activity and
the function of cells on a molecular level all arose based on the
determination of the
base sequence of millions of DNA molecules. DNA sequencing is still critically
important in research and for genetic therapies and diagnostics, (e.g., to
verify
recombinant clones and mutations).
DNA, a polymer of deoxyribonucleotides, is found in all living cells and some
viruses. DNA is the carrier of genetic information, which is passed from one
generation to the next by homologous replication of the DNA molecule.
Information
for the synthesis of all proteins is encoded in the sequence of bases in the
DNA. DNA
sequence information represents the information required for gene organization
and
regulation of most life forms. Accordingly, the development of reliable
methodology
for sequencing DNA has contributed significantly to an understanding of gene
structure and function.
Since the genetic information is represented by the sequence of the four DNA
building blocks deoxyadenosine- (dpA), deoxyguanosine- (dpG), deoxycytidine-
(dpC) and deoxythymidine-5'-phosphate (dpT), DNA sequencing is one of the
most
fundamental technologies in molecular biology and the life sciences in
general. The
ease and the rate by which DNA sequences can be obtained greatly affects
related
technologies such as development and production of new therapeutic agents and
new
and useful varieties of plants and microorganisms via recombinant DNA
technology.
In particular, unraveling the DNA sequence helps in understanding human
pathological conditions including genetic disorders, cancer and AIDS. In some
cases,
very subtle differences such as a one nucleotide deletion, addition or
substitution can
create serious, in some cases even fatal, consequences. Recently, DNA
sequencing
has become the core technology of the Human Genome Sequencing Project (e.g.,
J.E.
Bishop and M. Waldholz, 1991, Genome: The Story of the Most Astonishing
Scientific Adventure of Our Time - The Attempt to Map All the Genes in the
Human
Body, Simon & Schuster, New York). Knowledge of the complete human genome
DNA sequence will certainly help to understand, to diagnose, to prevent and to
treat
human diseases. To be able to tackle successfully the determination of the
approximately 3 billion base pairs of the human genome in a reasonable time
frame
and in an economical way, rapid, reliable, sensitive and inexpensive methods
need to
be developed, which also offer the possibility of automation. The present
invention
provides such a technology. The need for highly rapid, accurate, and
inexpensive
sequencing technology is nowhere more apparent than in a demanding sequencing
project such as the human genome project.
The present invention relates to the field of nucleic acid analysis,
detection,
and sequencing. More specifically, in one embodiment the invention provides
improved techniques for synthesizing arrays of nucleic acids, hybridizing
nucleic
acids, detecting mismatches in a double-stranded nucleic acid composed of a
single-
stranded probe and a target nucleic acid, and determining the sequence of DNA
or
RNA or other polymers.
A human being has 23 pairs of chromosomes consisting of a total of about
100,000 genes. The human genome consists of those genes. A single gene which
is
defective may cause an inheritable disease, such as Huntington's disease, Tay-
Sachs
disease or cystic fibrosis. The human chromosomes consist of large organic
linear
molecules of double-strand DNA (deoxyribonucleic acid) with a total length of
about
3.3 billion "base pairs". The base pairs are the chemicals that encode
information
along DNA. A typical gene may have about 30,000 base pairs. By correlating the
inheritance of a "marker" (a distinctive segment of DNA) with the inheritance
of a
disease, one can find a mutant (abnormal) gene to within one or two million
base
pairs. This opens the way to clone the DNA segment, test its activity, follow
its
inheritance, and diagnose carriers and future disease victims.
The goal of mapping the human genome is to determine accurately the location and
composition of each of the 3.3 billion bases. The complexity and large scale
of such a
mapping effort place it, in terms of cost, effort and scientific potential,
among the largest and most important projects of the 1990's and beyond.
Recent reviews of today's methods together with future directions and trends
are given by Barrell (The FASEB Journal 1, 40-45 (1991)), and Trainor (Anal.
Chem.
62, 418-26 (1990)).
1.2.2 Previously developed methods:
The problem of DNA sequence analysis is that of determining the order of the
four bases on the DNA strands. DNA sequencing is a technique by which the order of the four
DNA nucleotides (characters) in a linear DNA sequence is determined by chemical
and
biochemical means. Generally, strategies for determining the nucleotide
sequence of
DNA involve the generation of a DNA substrate, i.e., DNA fragments suitable for
sequencing a region of the DNA, enzymatic or chemical reactions, and analysis
of
DNA fragments that have been separated according to their lengths to yield
sequence
information. More specifically, to sequence a given region of DNA, labeled DNA
fragments are typically generated in four separate reactions. In each of the
four
reactions, the DNA fragments typically have one fixed end and one end that
terminates sequentially at each of the four nucleotide bases, respectively.
The
products of each reaction are fractionated by gel electrophoresis on adjacent
lanes of a
polyacrylamide gel. As all of the nucleotides are represented among the four
lanes,
the sequence of a given region of DNA can be determined from the four
"ladders" of
DNA fragments. The present status of techniques for determining such sequences
is
described in some detail in an article by Lloyd M. Smith published in the
American
Biotechnology Laboratory, Volume 7, Number 5, May 1989, pp 10-17. Since the
early 1970's, two methods have been developed for the determination of DNA
sequence: (1) the enzymatic chain-termination sequencing method, which relies
on
the template directed incorporation of nucleotides which themselves do not
supply the
necessary chemical functionalities required for subsequent enzymatic
polymerization
of a daughter strand polynucleotide, developed by Sanger and colleagues (F.
Sanger,
S. Nicklen, and A. R. Coulson, "DNA sequencing with chain-terminating
inhibitors."
Proc. Natl. Acad. Sci. USA, 74:5463-5467 (1977)), which is most commonly used
for
sequence determination; and (2) the base-specific chemical degradation
(modification
and cleavage) method, developed by Maxam and Gilbert (A. M. Maxam, and W.
Gilbert, "A new method of sequencing DNA." Proceedings of the National Academy
of Sciences, USA, 74:560-564 (1977)), which similarly yields polynucleotide
molecules terminated at sites containing a specific base according to the
chemical
treatment applied to the sample. Both of these techniques are based on similar
principles, and employ gel electrophoresis to separate DNA fragments of
different
lengths with high resolution. On these gels it is thus possible to separate a
DNA
fragment 600 bases in length from one 601 bases in length. No distinct method
preferable to these has yet been validated. Both methods require a large
number of
complex manipulations, such as isolation of homogeneous DNA fragments,
elaborate
and tedious preparation of samples, preparation of a separating gel,
application of
samples to the gel, electrophoresing the samples on the gel, working up of the
finished
gel, and analysis of the results of the procedure.
1.2.2.1 Chemical/Maxam and Gilbert method for sequencing:
In the chemical method, the DNA strand is isotopically labeled on one end,
broken down into smaller fragments at sequence locations ending with a
particular
nucleotide (A, T, C, or G) by chemical means, and the fragments ordered based
on
this information. Base specific modifications result in a base specific
cleavage of the
radioactive or fluorescently labeled DNA fragment. After the DNA substrate is
end
labeled, it is subjected to chemical reactions designed to cleave the DNA at
positions
adjacent to a given base or bases. The labeled DNA fragments will, therefore,
have a
common labeled terminus while the unlabeled termini will be defined by the
positions
of chemical cleavage. This results in the generation of DNA fragments (four
sets of
nested fragments) which can be separated according to length by polyacrylamide
gel
electrophoresis (PAGE) and identified. Alternatively, unlabeled DNA fragments
can
be separated after complete restriction digestion and partial chemical
cleavage of the
DNA, and hybridized with probes homologous to a region near the region of the
DNA
to be sequenced. See, Church et al., Proc. Natl. Acad. Sci., 81:1991 (1984).
After
autoradiography, the sequence can be read directly since each band (fragment)
in the
gel originates from a base specific cleavage event. Thus, the fragment lengths
in the
four "ladders" directly translate into a specific position in the DNA
sequence.
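The translation of fragment lengths into sequence positions can be sketched as below. This is a simplified illustration that assumes each lane is specific for exactly one base and that every fragment length has been resolved; the function name and the sample data are hypothetical.

```python
def read_ladders(ladders):
    """Read a sequence from four base-specific ladders.

    `ladders` maps each base to the fragment lengths observed in that base's
    lane. A fragment of length n marks a cleavage at position n, so sorting
    all lengths gives the bases in order from the labeled end.
    """
    position_to_base = {}
    for base, lengths in ladders.items():
        for length in lengths:
            if length in position_to_base:
                # The same length in two lanes would make the call ambiguous.
                raise ValueError(f"length {length} appears in two lanes")
            position_to_base[length] = base
    return "".join(position_to_base[n] for n in sorted(position_to_base))


lanes = {"A": [2, 5], "C": [1, 6], "G": [3], "T": [4, 7]}
print(read_ladders(lanes))  # prints CAGTACT
```

The check for a length shared by two lanes reflects the requirement that every band originate from a single base-specific cleavage event.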
1.2.2.2 Enzymatic/Sanger method for sequencing:
In the enzymatic method, the four base specific sets of DNA fragments are
formed by starting with a primer/template system elongating the primer into
the
unknown DNA sequence area and thereby copying the template and synthesizing
complementary strands using a DNA polymerase in the presence of chain-
terminating
reagents. The chain-terminating event is achieved by incorporating into the
four
separate reaction mixtures in addition to the four normal deoxynucleoside
triphosphates, dATP, dGTP, dTTP and dCTP, only one of the chain-terminating
dideoxynucleoside triphosphates, ddATP, ddGTP, ddTTP or ddCTP, respectively,
in a
limiting small concentration. The incorporation of a ddNTP lacking the 3'
hydroxyl
function into the growing DNA strand by the enzyme DNA polymerase leads to
chain
termination through preventing the formation of a 3'-5'-phosphodiester bond by
DNA
polymerase. Due to the random incorporation of the ddNTPs, each reaction leads
to a
population of base specific terminated fragments of different lengths, which
all
together represent the sequenced DNA-molecule. The four sets of resulting
fragments
produce, after electrophoresis, four base specific ladders from which the DNA
sequence can be determined.
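The chain-termination scheme just described can be illustrated with a minimal sketch. It enumerates, rather than randomly samples, every position at which each ddNTP could terminate the growing strand, and it assumes an error-free polymerase; the function name and the example template are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_ladders(template_3to5, primer_length=0):
    """Enumerate the fragment lengths each of the four ddNTP reactions can yield.

    `template_3to5` is the template downstream of the primer, read 3' to 5',
    so the complementary strand is synthesized 5' to 3' in the same order.
    The reaction containing ddXTP can terminate wherever X is incorporated.
    """
    ladders = {base: [] for base in "ACGT"}
    for offset, template_base in enumerate(template_3to5, start=1):
        incorporated = COMPLEMENT[template_base]
        ladders[incorporated].append(primer_length + offset)
    return ladders


# Template 3'-TACG-5' directs synthesis of 5'-ATGC-3'.
print(sanger_ladders("TACG", primer_length=20))
```

Each list corresponds to one lane of the gel; taken together the four lists contain every length from the end of the primer to the end of the template exactly once.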
In the enzymatic method, the following basic steps are involved:
(i) annealing an oligonucleotide primer to a suitable single or denatured
double stranded DNA template; (ii) extending the primer with DNA polymerase in
four separate reactions, each containing one α-labeled dNTP or ddNTP
(alternatively
a labeled primer can be used), a mixture of unlabeled dNTPs, and one
chain-terminating dideoxynucleoside-5'-triphosphate (ddNTP); (iii) resolving the
four sets
of reaction products, which include a distribution of DNA fragments having
primer-defined 5' termini and differing dideoxynucleotides at the 3' termini, on a
high
resolution polyacrylamide-urea gel; and (iv) producing an autoradiographic
image of
the gel that can be examined to infer the DNA sequence. Alternatively,
fluorescently
labeled primers or nucleotides can be used to identify the reaction products.
Known
dideoxy sequencing methods utilize a DNA polymerase such as the Klenow
fragment
of E. coli DNA polymerase, a DNA polymerase from Thermus aquaticus, reverse
transcriptase, a modified T7 DNA polymerase, or the Taq polymerase.
1.2.2.3 Similarities, differences and other details of the two methods:
The two sequencing methods differ in the techniques employed to produce the
DNA fragments, but are otherwise similar. In the Maxam-Gilbert method, four
different base-specific reactions are performed on portions of the DNA
molecules to
be sequenced, to produce four sets of radiolabeled DNA fragments. These four
fragment sets are each loaded in adjacent lanes of a polyacrylamide slab gel,
and are
separated by electrophoresis. Autoradiographic imaging of the pattern of the
radiolabeled DNA bands in the gel reveals the relative size, corresponding to
band
mobilities, of the fragments in each lane, and the DNA sequence is deduced
from this
pattern.
While numerous modifications and improvements to the strategies referred to
above have been developed, most sequencing techniques require the presence of
a
known primer binding site for every 300 to 500 nucleotides to be sequenced
either,
for example, for initiation of DNA synthesis or for hybridization to different
length
DNA fragments having a common end. However, as such approaches utilize a
"ladder" of DNA fragments containing the primer binding site (or its
complement),
the amount of sequence information that can be obtained is limited by the
present
inability to resolve DNA fragments greater than 500 nucleotides in length on
sequencing gels.
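The practical weight of this limitation can be seen from a short calculation. The sketch below assumes a single pass along one strand with no overlap between reads, and takes as its example the 30 kb fragment size mentioned earlier in this description; the function name is hypothetical.

```python
from math import ceil


def primer_sites_needed(region_length, read_length):
    """Minimum number of known primer binding sites for one non-overlapping pass."""
    return ceil(region_length / read_length)


# One primer site per 300 to 500 nucleotides, applied to a 30 kb fragment.
for read_length in (300, 500):
    print(read_length, primer_sites_needed(30_000, read_length))
```

Under these assumptions a 30 kb fragment calls for between 60 and 100 known primer binding sites, and more in practice once overlap between adjacent reads is allowed for.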
Both of these methods yield a population of molecules comprising a nested set
which together may be analyzed to determine the base sequence of the sample.
At
least one of these two techniques is employed in essentially every laboratory
concerned with molecular biology, and together they have been employed to
sequence
more than 26 million bases of DNA. Currently a skilled biologist can produce
about
30,000 bases of finished DNA sequence per year under ideal conditions.
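The scale of the problem follows directly from the figures quoted above. The sketch below is a back-of-the-envelope calculation that assumes a constant manual rate and a single pass over the genome; the names are hypothetical.

```python
# Figures quoted in the text above.
GENOME_BASE_PAIRS = 3.3e9        # approximate length of the human genome
BASES_PER_PERSON_YEAR = 30_000   # finished sequence per skilled biologist per year
BASES_SEQUENCED_TO_DATE = 26e6   # total sequenced so far by the two gel methods


def person_years_required(genome_bp, bases_per_year):
    """Person-years needed to sequence genome_bp once at a fixed manual rate."""
    return genome_bp / bases_per_year


def fraction_of_genome(bases, genome_bp):
    """Express a body of sequence as a fraction of the genome length."""
    return bases / genome_bp


years = person_years_required(GENOME_BASE_PAIRS, BASES_PER_PERSON_YEAR)
share = fraction_of_genome(BASES_SEQUENCED_TO_DATE, GENOME_BASE_PAIRS)
print(f"{years:,.0f} person-years")
print(f"{share:.2%} of one genome length")
```

At the stated rate, one pass over the genome corresponds to roughly 110,000 person-years, and the 26 million bases sequenced to date amount to less than one percent of one genome length.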
These methods and several variations thereupon, as well as their severe
limitations with respect to the economy and rapidity of accumulation of
sequence
data, are well known to those in the relevant arts. Various lower resolution
techniques,
generally falling within the category termed genome mapping, have been
developed
to circumvent these limitations for applications where more "broad spectrum"
examination of genetic material is required but less detailed information
about
sequence will suffice.
1.2.2.4 Cloning/Subcloning steps:
On the upfront end, the DNA to be sequenced has to be fragmented into
sequenceable pieces of currently not more than 500 to 1000 nucleotides.
Starting from
a genome, this is a multi-step process involving cloning and subcloning steps
using
different and appropriate cloning vectors such as YAC, cosmids, plasmids and
M13
vectors (Sambrook et al., Molecular Cloning: A Laboratory Manual. Cold Spring
Harbor Laboratory Press, 1989). Finally, for Sanger sequencing, the fragments
of
about 500 to 1000 base pairs are integrated into a specific restriction site
of the
replicative form I (RF I) of a derivative of the M13 bacteriophage (Vieira and
Messing, Gene 19, 259(1982)) and then the double-stranded form is transformed
to
the single-stranded circular form to serve as a template for the Sanger
sequencing
process having a binding site for a universal primer obtained by chemical DNA
synthesis (Sinha, Biernat, McManus and Koster, Nucleic Acids Res. 12, 4539-57
(1984); U.S. Patent No. 4725677) upstream of the restriction site into which
the
unknown DNA fragment has been inserted. Under specific conditions, unknown DNA
sequences integrated into supercoiled double-stranded plasmid DNA can be
sequenced directly by the Sanger method (Chen and Seeburg, DNA 4, 165-170
(1985); Lim et al., Gene Anal. Techn. 5, 32-39 (1988)), and, with the
Polymerase
Chain Reaction (PCR) (PCR Protocols- A Guide to Methods and Applications.
Innis
et al., editors, Academic Press, San Diego (1990)) cloning or subcloning steps
could
be omitted by directly sequencing off chromosomal DNA by first amplifying the
DNA segment by PCR and then applying the Sanger sequencing method (Innis et
al.,
Proc. Natl. Acad. Sci. USA 85, 9436-9440 (1988)). In this case, however, the
DNA
sequence in the region of interest must be known at least to the extent needed to bind
a
sequencing primer.
1.2.2.5 Methodology described by Guo and Wu
Methodology described by Guo and Wu, Nucleic Acids Res., 10:2065 (1982);
and Meth. Enz., 100:60 (1983), which is not dependent upon primer binding
sites, is
highly desirable for sequencing DNA greater than 500 nucleotides. This method
involves partially digesting linear double stranded DNA with E. coli
exonuclease III
to produce DNA fragments with 3' ends shortened to varying lengths, performing
the
dideoxy primer extension reactions of Sanger, supra, with the shortened 3'
ends as
primers for DNA synthesis, and digesting the DNA with a selected restriction
enzyme
that cleaves near one end of the molecule adjacent to, but not within, the
labeled
region of DNA. By digestion of the DNA with a selected restriction enzyme,
the
labeled DNA strands from one end of the molecule are made small enough to be
resolved on a sequencing gel. Each successive deletion in length, therefore,
brings
"new" regions of the target DNA into sequencing range.
However, certain disadvantages inherent in the methodology of Guo and Wu,
supra, limit its usefulness for the large scale sequencing of DNA. For
example, this
approach depends upon the selection of appropriate restriction enzymes which
cleave
at restriction sites in close proximity to particular E. coli exonuclease III
endpoints,
but not within the labeled DNA as this would result in two or more
superimposed
sequence ladders. The selection of appropriate restriction enzymes generally
requires,
therefore, the restriction mapping of DNA fragments to identify sites in close
proximity to the numerous exonuclease III endpoints. However, the
determination of
restriction maps tends to be both time consuming and labor intensive.
Specifically,
restriction mapping to the resolution needed for DNA sequencing involves the
digestion of each region of DNA with combinations of 20 or more enzymes to
uncover the relative position of restriction sites. This may require over 100
enzymatic
reactions followed by numerous electrophoretic separations. Further,
significant
amounts of DNA are consumed in the mapping process and interpretation of the
data
generally requires a substantial amount of time.
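The reaction count mentioned above can be made concrete with a short calculation. The sketch assumes, purely for illustration, that "combinations" means every single digest plus every pairwise double digest; the text does not state which combinations are actually run, and the function name is hypothetical.

```python
from math import comb


def digest_count(n_enzymes, max_combo=2):
    """Number of digests using every combination of 1..max_combo enzymes."""
    return sum(comb(n_enzymes, k) for k in range(1, max_combo + 1))


print(digest_count(20))  # prints 210
```

Twenty single digests plus 190 pairwise digests give 210 reactions per region, which is consistent with the figure of over 100 enzymatic reactions given above.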
1.2.3 3'-hydroxy-protected and labeled nucleotides:
A modified nucleotide compound possessing two properties particularly useful
for purposes of the present invention has been described by N. Williams and
P.S.
Coleman. This compound is 3'-O-(4-benzoyl)benzoyl adenosine 5'-triphosphate.
This
nucleotide bears a 3' protecting group linked via an ester function which
should be
susceptible to hydrolysis by appropriate chemical treatments. The protecting
moiety is
suitable for photoactivation, and this property was utilized by those
investigators to
probe the structure of mitochondrial F1-ATPase, indicating that this analog
will
interact properly with at least some enzymes. Under appropriate circumstances,
the
protecting moiety may also serve as a label.
Very recently, B. Canard and R.S. Sarfati have described similar nucleotides,
here comprising all four nucleobases, with chemically removable 3'-hydroxyl
protecting groups. Said protecting groups comprise various fluorescent dye
moieties.
These investigators have shown that these compounds may be added to
appropriately
primed polynucleotides by polymerases according to Watson-Crick base-pairing
rules, and serve to terminate chain elongation in a manner which may be
reversed by
removal of said protecting groups by appropriate chemical treatments,
admitting
resumption of polymerization. These workers propose that such compounds may
form
the basis of a novel sequencing methodology availing stepping control by means
of
said removable protecting groups and detection of labels following their
release from
the nascent strand by appropriate chemical treatment. Such a method, while a
potential advance over electrophoretic resolution methods, does not avail of
great
parallelism because only one molecule or an identical population of molecules
may be
sequenced at once (within a single vessel) by such a method, due to the
release of the
labeling moiety prior to detection, according to this proposed scheme.
Further, this
limitation requires that any attempt to avail of parallelism entail elaborate
parallel
fluidics. Low or no parallelism entails that stepping rate will be critical to
the
throughput attained with such a sequencing scheme. The results published by
these
authors suggest that the rate of chemical removal of 3'-hydroxy protecting
groups
(less than 90% removal after 10 minutes of treatment with 0.1 M NaOH) will be
unacceptably low for such an inherently serial sequencing scheme.
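The consequence of the reported deprotection rate can be estimated with a simple model. The sketch below assumes that a strand which fails to deprotect in a given cycle falls out of step with the rest of the population, and that deprotection dominates the cycle time; both assumptions, and the function names, are illustrative only. Because the reported removal is less than 90% after 10 minutes, the results are upper bounds.

```python
def in_phase_fraction(step_yield, cycles):
    """Fraction of strands that have been deprotected in every one of `cycles` steps."""
    return step_yield ** cycles


def bases_per_day(minutes_per_step):
    """Bases read per day when each base addition costs one deprotection step."""
    return 24 * 60 / minutes_per_step


for cycles in (10, 20, 50):
    print(cycles, round(in_phase_fraction(0.90, cycles), 3))
print(bases_per_day(10))
```

At 90% removal per step, about 35% of the strands remain in step after 10 cycles, about 12% after 20 and about 0.5% after 50, while a 10 minute step caps a single serial reaction at 144 bases per day.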
Additional references regarding such compounds and in most instances their
properties as substrates for various enzymes including polymerases have been
found
in the biological literature: Churchich, J.E.; 1995. Eur. J. Biochem.,
231:736. Metzker,
M.L.; Gibbs, R.A.; et al.; 1994. Nucleic Acids Research, 22:4259.
Beabealashvilli,
R.S.; Kukhanova, M.K.; et al.; 1986. Biochimica et Biophysica Acta, 868:136.
Chidgeavadze, Z.G.; Kukhanova, M.K.; et al.; 1986. Biochimica et Biophysica
Acta,
868:145. Hiratsuka, T; 1983. Biochimica et Biophysica Acta, 742:496. Jeng,
S.J.;
Guillory, R.J.; 1975. J. Supramolecular Structure, 3:448.
1.2.4 Related Base Addition Sequencing Schemes:
Various other investigators have also independently devised polynucleotide
sequencing methodologies which depend on the addition of a polymerization
terminating labeled nucleotide to a primed or elongated daughter strand on a
polynucleotide sample with template dependent polynucleotide polymerases.
Most,
but not all, of these methods (referred to herein as previously disclosed base-
addition
sequencing schemes) avail nucleotide triphosphate monomers with some base-
specific
label which may be removed by some deprotection treatment. It must be
emphasized
that all of these other previously disclosed base-addition sequencing schemes
examine
not single molecules individually but rather large homogeneous populations of
substantially identical molecules, wherein the observed signal used to
identify label
type originates from the totality of such a population of molecules rather
than an
individual molecule. It must be further emphasized that conventional usage
does not
generally reveal this distinction: phrases such as "a molecule" or "a sample
molecule"
refer not to an individual molecule considered separately or in isolation from
other
molecules including separately from other molecules of identical composition
and
structure, but to populations comprising millions or more molecules of
identical
structure. A careful reading of these prior disclosures reveals that these
investigators
are not working with samples consisting of single molecules but rather with
samples
comprising a plurality of identical molecules. In particular, even where these
investigators do not (as is consistent with conventional usage) explicitly
note this
point, they take measures which would apply only to samples of pluralities of
identical molecules, and do not take measures associated with working with
single
molecules.
1.2.5 Labeling
1.2.5.1 Sequencing from PAGE using radioisotopes:
In order to be able to read the sequence from PAGE, detectable labels have to
be used in either the primer (very often at the 5'-end) or in one of the
deoxynucleoside
triphosphates, dNTP. Using radioisotopes such as Sap, 33P, or 3sS is still the
most
frequently used technique. After PAGE, the gels are exposed to X-ray films and
silver
grain exposure is analyzed. The use of radioisotopic labeling creates several
problems. Most labels useful for autoradiographic detection of sequencing
fragements
have relatively short half lives which can limit the useful time of the
labels. The
emission high energy beta radiation, particularly from 3aP, can lead to
breakdown of
the products via radiolysis so that the sample should be used very quickly
after
labeling. In addition, high energy radiation can also cause a deterioration of
band
sharpness by scattering. Some of these problems can be reduced by using the
less
energetic isotopes such as 33P or 3sS (see, e.g., Ornstein et al.,
Biotechniques 2, 476
(1985)). Here, however, longer exposure times have to be tolerated. Above all,
the use
of radioisotopes poses significant health risks to the experimentalist and, in
heavy
sequencing projects, decontamination and handling the radioactive waste are
other
severe problems and burdens.
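As a rough illustration of why short half-lives limit the useful time of a label, the following sketch computes the fraction of activity remaining after a given storage time, assuming simple exponential decay. The half-life values are commonly quoted approximations and are not taken from this disclosure; the function name is hypothetical.

```python
# Approximate half-lives in days. These are commonly quoted values and are
# an assumption of this sketch, not figures from the text above.
HALF_LIFE_DAYS = {"32P": 14.3, "33P": 25.3, "35S": 87.4}


def fraction_remaining(isotope: str, days: float) -> float:
    """Fraction of the initial activity left after `days` of decay."""
    return 0.5 ** (days / HALF_LIFE_DAYS[isotope])


if __name__ == "__main__":
    # Compare the three labels after the same two-week storage period.
    for isotope in HALF_LIFE_DAYS:
        print(isotope, round(fraction_remaining(isotope, 14.0), 3))
```

Under these assumed values, a 32P label loses about half its activity in two weeks, whereas a 35S label retains most of it, which is consistent with the trade-off against longer exposure times noted above.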
1.2.5.2 Integration of non-radioactive labeling techniques into partly
automated DNA sequencing:
In response to the above mentioned problems related to the use of radioactive
labels, non-radioactive labeling techniques have been explored and, in recent
years,
integrated into partly automated DNA sequencing procedures. All these
improvements utilize the Sanger sequencing strategy. The fluorescent label can
be
tagged to the primer (Smith et al., Nature 321, 674-679 (1986) and EPO Patent
No. 873 00998.9; Du Pont De Nemours EPO Application No. 03 59225; Ansorge et
al., J. Biochem. Biophys. Methods 13, 325-32 (1986)) or to the
chain-terminating dideoxynucleoside triphosphates (Prober et al., Science 238,
336-41 (1987); Applied Biosystems, PCT Application WO 91/05060). Based on
either labeling the primer or the ddNTP, systems have been developed by
Applied Biosystems (Smith et al., Science 235, G89 (1987); U.S. Patent Nos.
570973 and 689013), Du Pont De Nemours (Prober et al., Science 238, 336-341
(1987); U.S. Patents Nos. 881372 and 57566), Pharmacia-LKB (Ansorge et al.,
Nucleic Acids Res. 15, 4593-4602 (1987) and EMBL Patent Application DE
P3724442 and P3805808.1) and Hitachi (JP 1-90844 and DE 4011991 A1). A
somewhat similar approach was developed by Brumbaugh et al. (Proc. Natl. Acad.
Sci. USA 85, 5610-14 (1988) and U.S. Patent No. 4,729,947). An improved method
for the Du Pont system using two electrophoretic lanes with two different
specific labels per lane is described (PCT Application WO 92/02635). A
different approach uses fluorescently labeled avidin and biotin labeled
primers. Here, the sequencing ladders ending with biotin are reacted during
electrophoresis with the labeled avidin which results in the detection of the
individual sequencing bands (Brumbaugh et al., U.S. Patent No. 594676).
More recently even more sensitive non-radioactive labeling techniques for
DNA using chemiluminescence triggerable and amplifiable by enzymes have been
developed (Beck, O'Keefe, Coull and Koster, Nucleic Acids Res. 17, 5115-5123
(1989) and Beck and Koster, Anal. Chem. 62, 2258-2270 (1990)). These labeling
methods were combined with multiplex DNA sequencing (Church et al., Science
240, 185-188 (1988)) to provide for a strategy aimed at high throughput DNA
sequencing (Koster et al., Nucleic Acids Res. Symposium Ser. No. 24, 318-321
(1991), University of Utah, PCT Application No. WO 90/15883); this strategy
still suffers from the disadvantage of being very laborious and difficult to
automate.
1.2.5.2.1 Fluorescent labeling in methods for automated DNA sequencing
Of particular interest in DNA sequencing are methods of automated
sequencing, in which fluorescent labels are employed to label the size
separated
fragments or primer extension products of the enzymatic method. Currently,
three
different methods are used for automated DNA sequencing. In the first method,
the
DNA fragments are labeled with one fluorophore and then run in adjacent
sequencing
lanes, one lane for each base. See Ansorge et al., Nucleic Acids Res. (1987)
15:4593-4602. In the second method, the DNA fragments are labeled with
oligonucleotide primers tagged with four fluorophores and all of the fragments
are run in one lane. See Smith et al., Nature (1986) 321:674-679. In the third
method, each of the
different chain terminating dideoxynucleotides is labeled with a different
fluorophore
and all of the fragments are run in one lane. See Prober et al., Science
(1987)
238:336-341. The first method has the potential problems of lane-to-lane
variations as
well as a low throughput. The second and third methods require that the four
dyes be
well excited by one laser source, and that they have distinctly different
emission
spectra. Otherwise, multiple lasers have to be used, increasing the complexity
and the
cost of the detection instrument.
With the development of Energy Transfer primers which offer strong
fluorescent signals upon excitation at a common wavelength, the second method
produces robust sequencing data in currently commercially available sequencers.
However, even with the use of Energy Transfer primers, the second method is
not
entirely satisfactory. In the second method, all of the false terminated or
false stop
fragments are detected resulting in high backgrounds. Furthermore, with the
second
method it is difficult to obtain accurate sequences for DNA templates with
long
repetitive sequences. See Robbins et al., Biotechniques (1996) 20: 862-868.
The third method has the advantage of only detecting DNA fragments
incorporated with a terminator. Therefore, backgrounds caused by false stops
are not detected. However, the fluorescence signals offered by the
dye-
labeled terminators are not very bright and it is still tedious to completely
clear up the
excess of dye-terminators even with AmpliTaq DNA Polymerase (FS enzyme).
Furthermore, non-sequencing fragments are detected, which contributes to
background signal. See Applied Biosystems Model 373A DNA Sequencing System
User Bulletin, November 17, p. 3, August 1990.
Thus, there is a need for the development of improved methodology which is
capable of providing for highly accurate sequencing data, even for long
repetitive
sequences. Such methodology would ideally include a means for isolating the
DNA
sequencing fragments from the remaining components of the sequencing reaction
mixtures such as salts, enzymes, excess primers, template and the like, as
well as false
stopped sequencing fragments and non-sequencing fragments resulting from
contaminated RNA and nicked DNA templates.

1.2.6 Simplifying DNA sequencing using solid supports:
In an attempt to simplify DNA sequencing, solid supports have been
introduced. In most cases published so far, the template strand for sequencing
(with or
without PCR amplification) is immobilized on a solid support most frequently
utilizing the strong biotin-avidin/streptavidin interaction (Orion-Yhtyma Oy,
U.S.
Patent No. 277643; M. Uhlen et al. Nucleic Acids Res. 16, 3025-38 (1988); Cemu
Bioteknik, PCT Application No. WO 89/09282 and Medical Research Council, GB,
PCT Application No. WO 92/03575). The primer extension products synthesized on
the immobilized template strand are purified of enzymes, other sequencing
reagents
and by-products by a washing step and then released under denaturing
conditions by
breaking the hydrogen bonds between the Watson-Crick base pairs and subjected
to
PAGE separation. In a different approach, the primer extension products (not
the
template) from a DNA sequencing reaction are bound to a solid support via
biotin/avidin (Du Pont De Nemours, PCT Application WO 91/11533). In contrast
to
the above mentioned methods, here, the interaction between biotin and avidin
is
overcome by employing denaturing conditions (formamide/EDTA) to release the
primer extension products of the sequencing reaction from the solid support
for PAGE
separation. As solid supports, beads (e.g., magnetic beads (Dynabeads) and
Sepharose beads), filters, capillaries, plastic dipsticks (e.g., polystyrene
strips) and
microtiter wells are being proposed.
1.2.7 Electrophoresis
1.2.7.1 Drawbacks and limitations of polyacrylamide gel electrophoresis
(PAGE):
All methods discussed so far have one central step in common:
polyacrylamide gel electrophoresis (PAGE). In many instances, this represents
a
major drawback and limitation for each of these methods. Preparing a
homogeneous
gel by polymerization, loading of the samples, the electrophoresis itself,
detection of
the sequence pattern (e.g., by autoradiography), removing the gel and cleaning
the
glass plates to prepare another gel are very laborious and time-consuming
procedures.
Moreover, the whole process is error-prone, difficult to automate, and, in
order
to improve reproducibility and reliability, highly trained and skilled
personnel are
required.
In the case of radioactive labeling, autoradiography itself can consume from
hours to days. In the case of fluorescent labeling, at least the detection of
the
sequencing bands is being performed automatically when using the laser-
scanning
devices integrated into commercially available DNA sequencers. One problem
related
to the fluorescent labeling is the influence of the four different base-
specific
fluorescent tags on the mobility of the fragments during electrophoresis and a
possible
overlap in the spectral bandwidth of the four specific dyes reducing the
discriminating
power between neighboring bands, hence, increasing the probability of sequence
ambiguities. Artifacts are also produced by base-specific interactions with
the polyacrylamide gel matrix (Frank and Koster, Nucleic Acids Res. 6, 2069
(1979))
and by the formation of secondary structures which result in "band
compressions" and
hence do not allow one to read the sequence. This problem has, in part, been
overcome by using 7-deazadeoxyguanosine triphosphates (Barr et al.,
Biotechniques
4, 428 (1986)). However, the reasons for some artifacts and conspicuous bands
are
still under investigation and need further improvement of the gel
electrophoretic
procedure.
1.2.7.2 Capillary zone electrophoresis (CZE):
A recent innovation in electrophoresis is capillary zone electrophoresis (CZE)
(Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al.,
Nucleic Acids Res. 18, 1415-1419 (1990)) which, compared to slab gel
electrophoresis
(PAGE),
significantly increases the resolution of the separation, reduces the time for
an
electrophoretic run and allows the analysis of very small samples. Here,
however,
other problems arise due to the miniaturization of the whole system such as
wall
effects and the necessity of highly sensitive on-line detection methods.
Compared to
PAGE, another drawback is created by the fact that CZE is only a "one-lane"
process,
whereas in PAGE samples in multiple lanes can be electrophoresed
simultaneously.
1.2.7.3 DNA sequencing without the electrophoretic step:
Analysis methods have heretofore relied on electrophoretic separation and
resolution of the products of Sanger or Maxam and Gilbert reactions according
to the
length of said products. Analysis thus suffers all of the limitations
associated with
electrophoresis including limited separation range (i.e. limited dynamic
range, where
separative resolution is related exponentially to fractional differences in
molecular
length), limitations on parallelism, time requirements, etc., despite much
effort in
improving these separative methodologies. With presently available equipment
and
trained personnel, sequencing the human genome would require about 100 years
of
total effort if no other sequencing projects were done. While very useful, the
present
sequencing methods are extremely tedious and expensive, yet require the
services of
highly skilled scientists. Moreover, these methods utilize hazardous chemicals
and
radioactive isotopes, which have inhibited their consideration and further
development. Large scale sequencing projects, as that of the human genome,
thus
appear to be impractical using these well-established techniques.
In addition to being slow, the present DNA sequencing techniques involve a
large number of cumbersome handling steps which are difficult to automate.
Recent
improvements include replacing the radioactive labels with fluorescent tags.
These
developments have improved the speed of the process and have removed some of
the
tedious manual steps, although present technology continues to employ the
relatively
slow gel electrophoresis technique for separating the DNA fragments.
Due to the severe limitations and problems related to having PAGE as an
integral and central part in the standard DNA sequencing protocol, several
methods
have been proposed to do DNA sequencing without an electrophoretic step. One
approach calls for hybridization or fragmentation sequencing (Bains,
Biotechnology
10, 757-58 (1992) and Mirzabekov et al., FEBS Letters 256, 118-122 (1989))
utilizing
the specific hybridization of known short oligonucleotides (e.g.,
octadeoxynucleotides
which gives 65,536 different sequences) to a complementary DNA sequence.
Positive
hybridization reveals a short stretch of the unknown sequence. Repeating this
process
by performing hybridizations with all possible octadeoxynucleotides should
theoretically determine the sequence. In a completely different approach,
rapid
sequencing of DNA is done by unilaterally degrading one single, immobilized
DNA
fragment by an exonuclease in a moving flow stream and detecting the cleaved
nucleotides by their specific fluorescent tag via laser excitation (Jett et
al., J. Biomolecular Structure & Dynamics 7, 301-309 (1989); United States
Department of Energy, PCT Application No. WO 89/03432). In another system
proposed by Hyman (Anal. Biochem. 174, 423-436 (1988)), the pyrophosphate
generated when
the
correct nucleotide is attached to the growing chain on a primer-template
system is
used to determine the DNA sequence. The enzymes used and the DNA are held in
place by solid phases (DEAE-Sepharose and Sepharose) either by ionic
interactions or by covalent attachment. In a continuous flow-through system,
the amount of
pyrophosphate is determined via bioluminescence (luciferase). A synthesis
approach
to DNA sequencing is also used by Tsien et al. (PCT Application No. WO
91/06678).
Here, the incoming dNTP's are protected at the 3'-end by various blocking
groups
such as acetyl or phosphate groups and are removed before the next elongation
step,
which makes this process very slow compared to standard sequencing methods.
The template DNA is immobilized on a polymer support. To detect
incorporation, a fluorescent or radioactive label is additionally incorporated
into the
modified dNTP's.
1.2.7.4 Apparatus to automate DNA sequencing without electrophoretic
step (mass spectrometry):
PCT Application No. WO 91/06678 also describes an apparatus designed to
automate the sequencing process.
Mass Spectrometry is a well known analytical technique which can provide
fast and accurate molecular weight information on relatively complex mixtures
of
organic molecules. Mass spectrometry has historically had neither the
sensitivity nor
resolution to be useful for analyzing mixtures at high mass. A series of
articles in
1988 by Hillenkamp and Karas do suggest that large organic molecules of about
10,000 to 100,000 Daltons may be analyzed in a time of flight mass
spectrometer, although resolution at lower molecular weights is not as sharp
as conventional magnetic field mass spectrometry. Moreover, the Hillenkamp and
Karas technique
is
very time-consuming, and requires complex and costly instrumentation.
Mass spectrometry, in general, provides a means of "weighing" individual
molecules by ionizing the molecules in vacuo and making them "fly" by
volatilization.
Under the influence of combinations of electric and magnetic fields, the ions
follow trajectories depending on their individual mass (m) and charge (z). In
the range
of molecules with low molecular weight, mass spectrometry has long been part
of the
routine physical-organic repertoire for analysis and characterization of
organic
molecules by the determination of the mass of the parent molecular ion. In
addition,
by arranging collisions of this parent molecular ion with other particles
(e.g., argon
atoms), the molecular ion is fragmented forming secondary ions by the so-
called
collision induced dissociation (CID). The fragmentation pattern/pathway very
often
allows the derivation of detailed structural information. Many applications of
mass
spectrometric methods are known in the art, particularly in biosciences, and
can be found summarized in Methods in Enzymology, Vol. 193: "Mass
Spectrometry" (J.A. McCloskey, editor), 1990, Academic Press, New York.
Due to the apparent analytical advantages of mass spectrometry in providing
high detection sensitivity, accuracy of mass measurements, detailed structural
information by CID in conjunction with an MS/MS configuration and speed, as
well
as on-line data transfer to a computer, there has been considerable interest
in the use of mass spectrometry for the structural analysis of nucleic acids.
Recent reviews summarizing this field include K. H. Schram, "Mass Spectrometry
of Nucleic Acid Components," Biomedical Applications of Mass Spectrometry 34,
203-287 (1990);
and P.F. Crain, "Mass Spectrometric Techniques in Nucleic Acid Research," Mass
Spectrometry Reviews 9, 505-554 (1990). The biggest hurdle to applying mass
spectrometry to nucleic acids is the difficulty of volatilizing these very
polar
biopolymers.

1.2.8 Mass Spectrometry
1.2.8.1 Limitation in applying mass spectrometry due to the difficulty of
volatilizing nucleic acids:
Therefore, "sequencing" has been limited to low molecular weight synthetic
oligonucleotides by determining the mass of the parent molecular ion and
through
this, confirming the already known sequence, or alternatively, confirming the
known
sequence through the generation of secondary ions (fragment ions) via CID in
an
MS/MS configuration utilizing, in particular, for the ionization and
volatilization, the
method of fast atom bombardment (FAB mass spectrometry) or plasma desorption
(PD mass spectrometry). As an example, the application of FAB to the analysis
of protected dimeric blocks for chemical synthesis of oligodeoxynucleotides
has been described (Koster et al., Biomedical Environmental Mass Spectrometry
14, 111-116 (1987)).
1.2.8.2 Two more ionization/desorption techniques (ES and MALDI):
Two more recent ionization/desorption techniques are electrospray/ionspray
(ES) and matrix-assisted laser desorption/ionization (MALDI). ES mass
spectrometry has been introduced by Fenn et al. (J. Phys. Chem. 88, 4451-59
(1984); PCT Application No. WO 90/14148) and current applications are
summarized in recent review articles (R.D. Smith et al., Anal. Chem. 62,
882-89 (1990) and B. Ardrey, Electrospray Mass Spectrometry, Spectroscopy
Europe 4, 10-18 (1992)). The molecular weights of the tetradecanucleotide
d(CATGCCATGGCATG) (Covey et al., "The Determination of Protein,
Oligonucleotide and Peptide Molecular Weights by Ionspray Mass Spectrometry,"
Rapid Communications in Mass Spectrometry, 2, 249-256 (1988)), of the 21-mer
d(AAATTGTGCACATCCTGCAGC) and, without giving details, of a tRNA with 76
nucleotides (Methods in Enzymology, 193, "Mass Spectrometry" (McCloskey,
editor), p. 425, 1990, Academic Press, New York) have been published. As a
mass analyzer, a quadrupole is most frequently used. The determination of
molecular weights in femtomole amounts of sample is very accurate due to the
presence of multiple ion peaks which all could be used for the mass
calculation.
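The way multiple ion peaks can each contribute to the mass calculation may be sketched as follows. The sketch assumes positive-ion mode with proton adducts, so that a species of neutral mass M carrying z charges appears at (M + z * m_H) / z; nucleic acids are often measured under other conditions, so the constant and function names here are illustrative assumptions only.

```python
from typing import Tuple

PROTON_MASS = 1.00728  # Da, approximate; an assumption of this sketch


def simulate_peak(neutral_mass: float, charge: int) -> float:
    """m/z at which a species with `charge` attached protons would appear."""
    return (neutral_mass + charge * PROTON_MASS) / charge


def deconvolve_adjacent_peaks(mz_high: float, mz_low: float) -> Tuple[int, float]:
    """Estimate charge and neutral mass from two adjacent charge-state peaks.

    `mz_high` is assumed to carry charge z and `mz_low` charge z + 1.
    Returns (z, neutral mass).
    """
    z = round((mz_low - PROTON_MASS) / (mz_high - mz_low))
    neutral_mass = z * (mz_high - PROTON_MASS)
    return z, neutral_mass


if __name__ == "__main__":
    # Two neighbouring peaks of a hypothetical 10,000 Da analyte.
    p10 = simulate_peak(10000.0, 10)
    p11 = simulate_peak(10000.0, 11)
    print(deconvolve_adjacent_peaks(p10, p11))
```

Each adjacent pair of peaks yields an independent estimate of the same neutral mass, and averaging those estimates is one reason the determination can be accurate.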
MALDI mass spectrometry, in contrast, can be particularly attractive when a
time-of-flight (TOF) configuration is used as a mass analyzer. The MALDI-TOF
mass spectrometry has been introduced by Hillenkamp et al. ("Matrix Assisted
UV-Laser Desorption/Ionization: A New Approach to Mass Spectrometry of Large
Biomolecules," Biological Mass Spectrometry (Burlingame and McCloskey,
editors), Elsevier Science Publishers, Amsterdam, pp. 49-60, 1990). Since, in
most cases, no
cases, no
multiple molecular ion peaks are produced with this technique, the mass
spectra, in
principle, look simpler compared to ES mass spectrometry. Although DNA
molecules
up to a molecular weight of 410,000 daltons could be desorbed and volatilized
(Williams et al., "Volatilization of High Molecular Weight DNA by Pulsed Laser
Ablation of Frozen Aqueous Solutions," Science, 246, 1585-87 (1989)), this
technique has so far only been used to determine the molecular weights of
relatively
small oligonucleotides of known sequence, e.g., oligothymidylic acids up to 18
nucleotides (Huth-Fehre et al., "Matrix-Assisted Laser Desorption Mass
Spectrometry of Oligodeoxythymidylic Acids," Rapid Communications in Mass
Spectrometry, 6, 209-13 (1992)) and a double-stranded DNA of 28 base pairs
(Williams et al., "Time-of-Flight Mass Spectrometry of Nucleic Acids by Laser
Ablation and Ionization from a Frozen Aqueous Matrix," Rapid Communications in
Mass Spectrometry, 4, 348-351 (1990)). In one publication (Huth-Fehre et al.,
1992, supra), it was shown that a mixture of all the oligothymidylic acids
from n=12 to n=18 nucleotides could be resolved.
1.2.8.3 Producing fragments, separating by electrophoresis and using matrix
method to sequence
In U.S. Patent No. 5,064,754, RNA transcripts extended by DNA both of
which are complementary to the DNA to be sequenced are prepared by
incorporating
NTP's, dNTP's and, as terminating nucleotides, ddNTP's which are substituted
at the
5'-position of the sugar moiety with one or a combination of the isotopes 12C,
13C, 14C, 1H, 2H, 3H, 16O, 17O and 18O. The polynucleotides obtained are
degraded to 3'-nucleotides, cleaved at the N-glycosidic linkage and the
isotopically labeled 5'-functionality removed by periodate oxidation and the
resulting formaldehyde
species
determined by mass spectrometry. A specific combination of isotopes serves to
discriminate base-specifically between internal nucleotides originating from
the
incorporation of NTPs and dNTP's and terminal nucleotides caused by linking
ddNTP's to the end of the polynucleotide chain. A series of RNA/DNA fragments
is
produced, and in one embodiment, separated by electrophoresis, and, with the
aid of
the so-called matrix method of analysis, the sequence is deduced.
1.2.8.4 Mass spectrometry using atoms which normally do not occur in DNA
In Japanese Patent No. 59-131909, an instrument is described which detects
nucleic acid fragments separated either by electrophoresis, liquid
chromatography or
high speed gel filtration. Mass spectrometric detection is achieved by
incorporating
into the nucleic acids atoms which normally do not occur in DNA such as S, Br,
I or
Ag, Au, Pt, Os, Hg. The method, however, is not applied to sequencing of DNA
using
the Sanger method. In particular, it does not propose a base-specific
correlation of
such elements to an individual ddNTP.
1.2.8.5 Sequencing with the Sanger method by using four stable isotopes to
label the ddNTP's
PCT Application No. WO 89/12694 (Brennan et al., Proc. SPIE-Int. Soc. Opt.
Eng. 1206, (New Technol. Cytom. Mol. Biol.), pp. 60-77 (1990); and Brennan,
U.S. Patent No. 5,003,059) employs the Sanger methodology for DNA sequencing
by using a combination of either the four stable isotopes 32S, 33S, 34S, 36S
or 35Cl, 37Cl, 79Br, 81Br to specifically label the chain-terminating ddNTP's.
The sulfur isotopes can
isotopes can
be located either in the base or at the alpha-position of the triphosphate
moiety
whereas the halogen isotopes are located either at the base or at the 3'-
position of the
sugar ring.
The sequencing reaction mixtures are separated by an electrophoretic
technique such as CZE, transferred to a combustion unit in which the sulfur
isotopes
of the incorporated ddNTP's are transformed at about 900°C in an oxygen
atmosphere. The SO2 generated, with masses of 64, 65, 66 or 68, is determined
on-line by mass spectrometry using, e.g., as mass analyzer, a quadrupole with
a single ion-multiplier to detect the ion current.
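The SO2 masses quoted above follow directly from the sulfur isotope used. A minimal sketch of the arithmetic, assuming nominal (integer) masses and that the combustion oxygen is entirely 16O; the function name is hypothetical:

```python
# Stable sulfur isotopes named in the text (mass numbers).
SULFUR_ISOTOPES = [32, 33, 34, 36]

# Assumption of this sketch: both oxygen atoms in SO2 are 16O.
OXYGEN_MASS_NUMBER = 16


def so2_nominal_mass(sulfur_mass_number: int) -> int:
    """Nominal mass of SO2 formed from one sulfur atom and two 16O atoms."""
    return sulfur_mass_number + 2 * OXYGEN_MASS_NUMBER


if __name__ == "__main__":
    for s in SULFUR_ISOTOPES:
        print(f"{s}S -> SO2 mass {so2_nominal_mass(s)}")
```

Because each of the four ddNTP's carries a different sulfur isotope, the detected SO2 mass identifies which terminator ended the fragment.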
1.2.8.6 Using resonance ionization spectroscopy in conjunction with a
magnetic sector mass analyzer
A similar approach is proposed in U.S. Patent No. 5,002,868 (Jacobson et al.,
Proc. SPIE-Int. Soc. Opt. Eng. 1435, Opt. Methods Ultrasensitive Detect. Anal.
Tech.
26-35 (1991)) using Sanger sequencing with four ddNTP's specifically
substituted at
the alpha-position of the triphosphate moiety with one of the four stable
sulfur
isotopes as described above and subsequent separation of the four sets of
nested
sequences by tube gel electrophoresis. The only difference is the use of
resonance
ionization spectroscopy (RIS) in conjunction with a magnetic sector mass
analyzer as
disclosed in U.S. Patent No. 4,442,354 to detect the sulfur isotopes
corresponding to the specific nucleotide terminators, and by this, allowing
the assignment of
the DNA
sequence.
1.2.8.7 Using tube gel electrophoresis, a nebulizer and a mass analyzer to
sequence
EPO Patent Applications No. 0360676 A1 and 0360677 A1 also describe
Sanger sequencing using stable isotope substitutions in the ddNTP's such as D,
13C, 15N, 17O, 18O, 32S, 33S, 34S, 36S, 19F, 35Cl, 37Cl, 79Br, 81Br and 127I
or functional groups
function groups
such as CF3 or Si(CH3)3 at the base, the sugar or the alpha position of the
triphosphate
moiety according to chemical functionality. The Sanger sequencing reaction
mixtures
are separated by tube gel electrophoresis. The effluent is converted into an
aerosol by
the electrospray/thermospray nebulizer method and then atomized and ionized by
a hot plasma (7000 to 8000 K) and analyzed by a simple mass analyzer. An
instrument is
proposed which enables one to automate the analysis of the Sanger sequencing
reaction mixture consisting of tube electrophoresis, a nebulizer and a mass
analyzer.
The application of mass spectrometry to perform DNA sequencing by the
hybridization/fragment method (see above) has been recently suggested (Bains,
"DNA Sequencing by Mass Spectrometry: Outline of a Potential Future
Application," Chimicaoggi 2, 13-16 (1991)).
1.2.9 Probes
1.2.9.1 Using large arrays of nucleic acid probes on a substrate
Alternative techniques have been proposed for sequencing a nucleic acid. PCT
patent Publication No. 92/10588, incorporated herein by reference for all
purposes,
describes one improved technique in which the sequence of a labeled, target
nucleic
acid is determined by hybridization to an array of nucleic acid probes on a
substrate.
Each probe is located at a positionally distinguishable location on the
substrate. When
the labeled target is exposed to the substrate, it binds at locations that
contain
complementary nucleotide sequences. Through knowledge of the sequence of the
probes at the binding locations, one can determine the nucleotide sequence of
the
target nucleic acid. The technique is particularly efficient when very large
arrays of
nucleic acid probes are utilized.
Such arrays can be formed according to the techniques described in U.S.
Patent No. 5,143,854 issued to Pirrung et al. See also U.S. application Serial
No.
07/805,727, both incorporated herein by reference for all purposes.
1.2.9.2 Employing sequencing by hybridization when the probes are shorter
than the target
When the nucleic acid probes are of a length shorter than the target, one can
employ a reconstruction technique to determine the sequence of the larger
target
based on affinity data from the shorter probes. See U.S. Patent No. 5,202,231
to
Drmanac et al., and PCT patent Publication No. 89/10977 to Southern. One
technique
for overcoming this difficulty has been termed sequencing by hybridization or
SBH.
For example, assume that a 12-mer target DNA 5'-AGCCTAGCTGAA is mixed with
an array of all octanucleotide probes. If the target binds only to those
probes having
an exactly complementary nucleotide sequence, only five of the 65,536 octamer
probes (3'-TCGGATCG, CGGATCGA, GGATCGAC, GATCGACT, and
ATCGACTT) will hybridize to the target. Alignment of the overlapping sequences
from the hybridizing probes reconstructs the complement of the original 12-mer
target:
TCGGATCG
 CGGATCGA
  GGATCGAC
   GATCGACT
    ATCGACTT
TCGGATCGACTT
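The reconstruction illustrated above can be sketched in a few lines. This is a minimal illustration only: it assumes perfect-match hybridization, that every overlapping octamer is observed, and that no 7-base overlap is repeated within the target. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement(seq: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in seq)


def hybridizing_probes(target: str, k: int = 8) -> list:
    """All k-mers exactly complementary to a window of the target."""
    comp = complement(target)
    return [comp[i:i + k] for i in range(len(comp) - k + 1)]


def reconstruct(probes: list) -> str:
    """Chain probes that overlap by k - 1 bases (assumes a unique chain)."""
    k = len(probes[0])
    suffixes = {p[1:] for p in probes}
    # The first probe is the one whose leading k - 1 bases follow no other probe.
    start = next(p for p in probes if p[:-1] not in suffixes)
    remaining = set(probes) - {start}
    sequence = start
    while remaining:
        nxt = next(p for p in remaining if p[:-1] == sequence[-(k - 1):])
        sequence += nxt[-1]
        remaining.discard(nxt)
    return sequence


if __name__ == "__main__":
    probes = hybridizing_probes("AGCCTAGCTGAA")
    print(len(probes), "of", 4 ** 8, "octamer probes hybridize")
    print(reconstruct(probes))
```

With repeated overlaps or mismatched hybridization the chain is no longer unique, which is the kind of difficulty discussed in the following paragraph.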
While meeting with much optimism, prior techniques have also met with
certain limitations. For example, practitioners have encountered
substantial
difficulty in analyzing probe arrays hybridized to a target nucleic acid due
to the
hybridization of partially mismatched sequences, among other difficulties. The
present invention provides significant advances in sequencing with such
arrays.
1.2.10 DNA Amplification
DNA can be amplified by a variety of procedures including cloning
(Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor
Laboratory Press, 1989), polymerase chain reaction (PCR) (C.R. Newton and A.
Graham, PCR, BIOS Publishers, 1994), ligase chain reaction (LCR) (F. Barany,
Proc. Natl. Acad. Sci. USA 88, 189-93 (1991)), strand displacement
amplification (SDA) (G. Terrance Walker et al., Nucleic Acids Res. 22, 2670-77
(1994)) and variations such as RT-PCR, allele-specific amplification (ASA)
etc.
The polymerase chain reaction (Mullis, K. et al., Methods Enzymol.,
155:335-350 (1987)) permits the selective in vitro amplification of a
particular DNA
region by
mimicking the phenomena of in vivo DNA replication. Required reaction
components
are single stranded DNA, primers (oligonucleotide sequences complementary to
the 5'
and 3' ends of a defined sequence of the DNA template),
deoxynucleotidetriphosphates and a DNA polymerase enzyme. Typically, the
single
stranded DNA is generated by heat denaturation of provided double strand DNA.
The
reaction buffers contain magnesium ions and co-solvents for optimum enzyme
stability and activity.
The amplification results from a repetition of such cycles in the following
manner: The two different primers, which bind selectively each to one of the
complementary strands, are extended in the first cycle of amplification. Each
newly
synthesized DNA then contains a binding site for the other primer. Therefore
each
new DNA strand becomes a template for any further cycle of amplification
enlarging
the template pool from cycle to cycle. Repeated cycles theoretically lead to
exponential synthesis of a DNA-fragment with a length defined by the 5'
termini of the primers.
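The theoretical exponential growth described above can be illustrated with a short calculation. This is a sketch only: the per-cycle `efficiency` parameter is an assumption introduced here to show how real reactions fall short of perfect doubling, and it is not a quantity defined in this disclosure.

```python
def theoretical_copies(initial_templates: float, cycles: int,
                       efficiency: float = 1.0) -> float:
    """Expected template count after `cycles` rounds of amplification.

    efficiency = 1.0 means every template is copied in every cycle
    (perfect doubling); lower values model incomplete extension.
    """
    return initial_templates * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        ideal = theoretical_copies(1, n)
        realistic = theoretical_copies(1, n, efficiency=0.9)
        print(f"{n} cycles: ideal {ideal:.3g}, at 90% efficiency {realistic:.3g}")
```

Because each newly synthesized strand serves as a template in later cycles, the template pool grows geometrically rather than linearly with the number of cycles.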
The PCR amplification procedure has been used to sequence the DNA being
amplified (e.g. "Introduction to the AmpliTaq Cycle Sequencing Kit Protocol",
a
booklet from Perkin Elmer Cetus Corporation). The DNA could be first amplified
and
then it could be sequenced using the two conventional DNA sequencing
techniques.
Modified methods for sequencing PCR-amplified DNA have also been developed
(e.g. Bevan et al., "Sequencing of PCR-Amplified DNA" PCR Meth. App. 4:222
(1992)).
1.2.11 Additional Sequencing Methods
1.2.11.1 Sanger sequencing using the degradation of
phosphorothioate-containing DNA fragments
A recent modification of the Sanger sequencing strategy involves the
degradation of phosphorothioate-containing DNA fragments obtained by using
alpha-
thio dNTP instead of the normally used ddNTPs during the primer extension
reaction
mediated by DNA polymerase (Labeit et al., DNA 5, 173-177 (1986); Amersham,
PCT-Application GB86/00349; Eckstein et al., Nucleic Acids Res. 16, 9947
(1988)).
Here, the four sets of base-specific sequencing ladders are obtained by
limited
digestion with exonuclease III or snake venom phosphodiesterase, subsequent
separation on PAGE and visualization by radioisotopic labeling of either the
primer or
one of the dNTPs. In a further modification, the base-specific cleavage is
achieved by
alkylating the sulphur atom in the modified phosphodiester bond followed by a
heat
treatment (Max-Planck-Gesellschaft, DE 3930312 A1). Both methods can be
combined with the amplification of the DNA via the Polymerase Chain Reaction
(PCR).
1. 2.11.2 Sanger sequencing using modified polymerization reaction (at high
temperature)
Initial PCR experiments used thermolabile DNA polymerase. However,
thermolabile DNA polymerase must be continually added to the reaction mixture
after
each denaturation cycle. Major advances in PCR practice were the development
of a
polymerase that is stable at near-boiling temperature (Saiki, R. et al.,
Science
239:487-491 (1988)) and the development of automated thermal cyclers.
The discovery of thermostable polymerases also allowed modification of the
Sanger sequencing reaction with significant advantages. The polymerization
reaction
could be carried out at high temperature with the use of thermostable DNA
polymerase in a cyclic manner (cycle sequencing). The conditions of the cycles
are
similar to those of the PCR technique and comprise denaturation, annealing,
and
extension steps. Depending on the length of the primers only one annealing
step at the
beginning of the reaction may be sufficient. Carrying out a sequencing
reaction at
high temperature in a cyclic manner provides the advantage that each DNA
strand can
serve as template in every new cycle of extension which reduces the amount of
DNA
necessary for sequencing, thereby providing access to minimal volumes of DNA,
as
well as resulting in improved specificity of primer hybridization at higher
temperature
and the reduction of secondary structures of the template strand.
1. 2.11.3 Semi-exponential cycle sequencing using a second reverse primer in
the
sequencing reaction
However, amplification of the terminated fragments is linear in conventional
cycle sequencing approaches. A recently developed method, called semi-
exponential
cycle sequencing shortens the time required and increases the extent of
amplification
obtained from conventional cycle sequencing by using a second reverse primer
in the
sequencing reaction. However, the reverse primer only generates additional
template
strands if it avoids being terminated prior to reaching the sequencing primer
binding
site. Needless to say, terminated fragments generated by the reverse primer
can not
serve as a sufficient template. Therefore, in practice, amplification by the
semi-exponential
approach is not entirely exponential (Sarkar, G. and Bolander,
Mark E.,
"Semi Exponential Cycle Sequencing," Nucleic Acids Research, 1995, Vol. 23, No.
7, pp.
1269-1270).
1. 2.11.4 Need to facilitate high-throughput sequencing
In addition to the foregoing limitations inherent in current sequencing
techniques, the generation of DNA substrate molecules for each 300 to 500
nucleotides to be sequenced is presently required. Assuming no overlapping
sequence
between substrate molecules, the sequencing of both strands of an entire
mammalian
genome would, therefore, require the generation of at least 20 million DNA
substrate
molecules.
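The estimate above can be reproduced with simple arithmetic. The sketch below
assumes a mammalian genome of roughly 3 x 10^9 base pairs; that figure, and the
function name substrates_needed, are assumptions introduced for illustration
and are not stated in the passage itself.

```python
def substrates_needed(genome_bp: int, read_length_nt: int, strands: int = 2) -> int:
    """Number of non-overlapping substrate molecules needed to read every
    nucleotide of `strands` strands once, at `read_length_nt` per substrate."""
    total_nt = genome_bp * strands
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_nt // read_length_nt)


if __name__ == "__main__":
    GENOME_BP = 3_000_000_000  # assumed mammalian genome size
    for read_length in (300, 500):
        print(read_length, substrates_needed(GENOME_BP, read_length))
```

Under these assumptions, reads of 300 nucleotides give 20 million substrates and
reads of 500 nucleotides give 12 million, consistent in order of magnitude with
the figure quoted above.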
As pointed out above, current nucleic acid sequencing methods require
relatively large amounts (typically about 1 µg) of highly purified DNA
template. Often,
however, only a small amount of template DNA is available. Although
amplifications
may be performed, amplification procedures are typically time consuming, can
be
limited in the amount of amplified template produced and the amplified DNA
must be
purified prior to sequencing. A streamlined process for amplifying and
sequencing
DNA is needed, particularly to facilitate high-throughput nucleic acid
sequencing.
1. 2.12 Strategies for obtaining the initial sequence
Methods currently used to sequence large segments of DNA do not lend
themselves to large-scale determination of genomic sequences. In general, the
initial
determination of a genomic clone sequence results in ambiguities and
discrepancies
that are resolved by assembling and editing the raw sequencing data into a
consensus
sequence. There are also, generally, holes in the sequence that need to be
filled in in
order to create a finished sequence. There are two general strategies for
obtaining the
initial sequence: shotgun sequencing and transposon-mediated directed
sequencing.
1. 2.12.1 Shotgun sequencing
In the currently existing methods for sequencing very long DNA of millions of
nucleotides, the DNA is fragmented into smaller, overlapping fragments, and
sub-
cloned to produce numerous clones containing overlapping DNA sequences. These
clones are sequenced randomly and the sequences assembled by "overlap sequence-
matching" to produce the contiguous sequence. In this shot-gun sequencing
method,
approx. ten times more sequencing than the length of the DNA being sequenced
is
required to assemble the contiguous sequence. Shotgun sequencing is reasonably
appropriate for generating the initial sequences of the genomic clone. In this
method,
the clone is digested with a multiplicity of restriction enzymes and the
individual
fragments are sequenced. When sufficient sequence is obtained to putatively
cover the
length of the genomic clone (1 x total sequence length) statistically 65% of
the
genomic clone sequence will have successfully been determined. The shotgun
strategy relies on assembly algorithms to piece together a final sequence by
determining relationships between a selected set of random templates. Although
this
assembly process is semiautomated, it remains labor-intensive, especially in
complex
regions that contain highly related tandem repeats. In addition, since the
selection of
subclones is not random, gaps of unknown distance are included between islands
of
known sequence. Linking up the islands requires either sequencing additional
subclones or ordering custom oligonucleotides to generate sequence into the
gaps.
The weaknesses of shotgun sequencing performed on substantial lengths of
nucleotide
sequence are thus 1) the difficulties involved in sequence assembly and 2) the
need
for hole-filling.
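The relationship between the amount of random sequence obtained and the
fraction of the clone actually determined can be illustrated with a simple
Poisson model. This model is an assumption introduced here (it treats reads as
uniformly and independently placed) and is not taken from the passage; the
function name expected_fraction_covered is likewise hypothetical.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of a target covered by at least one read when the
    total sequence obtained equals `coverage` times the target length,
    assuming reads land uniformly at random (Poisson approximation)."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    for c in (1, 2, 5, 10):
        print(c, round(expected_fraction_covered(c), 5))
```

Under this idealized model, 1 x coverage leaves roughly 63% of the clone
determined, close to the 65% figure quoted above, and approximately ten-fold
coverage is needed before the expected gaps become negligible.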
A non-ordered approach to sequencing, e.g., shotgun sequencing, would
require the generation of 100 to 200 million DNA templates. Although there has
been
effort directed to automating the steps presently involved in DNA substrate
generation, e.g., restriction mapping, preparation of subfragments for
subcloning,
identification of subclones, growing bacterial cultures, and purifying nucleic
acids, it
is unlikely that human intervention can be substantially eliminated from the
process.
Current approaches, therefore, are less than optimal for the large scale
sequencing of
DNA, particularly sequencing the human genome.
Although the problems enumerated above are not intended to be exhaustive,
the limitations inherent in methods presently available for sequencing DNA are
readily apparent. Accordingly, there exists a need for an improved method of
sequencing DNA that circumvents the need for primer binding sites as well as
the
need to determine restriction maps. Additionally, there exists a need for an
improved
method which extends the amount of sequence information obtainable from a DNA
substrate, thus substantially reducing the number of DNA substrate molecules
required to sequence a given region of DNA. The present invention meets these
needs.
1. 2.12.2 Transposon-mediated directed sequencing
On the other hand, the transposon-mediated sequencing method described by
Strathmann, M. et al. Proc Natl Acad Sci USA (1991) 88:1247- 1250, provides an
orderly approach to generating subclones for sequencing. The method uses a
.gamma..delta. bacterial transposable element bracketed by sequencing primers.
The
primer-flanked transposon permits the introduction of evenly spaced priming
sites across a fragment with an unknown DNA sequence. The number of template
sequences required to obtain the complete sequence information can be
calculated
from the length of the fragment. In the "directed" sequencing method, the
linear order
of the DNA clones has to be first determined by "physical mapping" of the
clones. As
the transposon insertions are random, the positions of the insertions are
mapped, for
example, using the polymerase chain reaction (PCR) using primers that amplify
the
intervening sequence between the transposon insertion site and the vector
sequences
at each end of the inserted fragment to be sequenced. The lengths of the
amplified
products thus define a map position for the transposon. Sequencing can be
conducted
based on the sequencing primers flanking the transposon, and since the
position of the
transposon has been mapped prior to sequencing, a fully automated assembly
process
is possible. There are no gaps since an ordered set of sequencing templates
which
cover the DNA fragment is produced.
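The mapping and selection steps described above can be sketched as follows.
This is a hedged illustration only: the function names, the primer offset
parameter, and the nearest-site selection rule are assumptions introduced here
and are not part of the cited method.

```python
import math


def insertion_position(pcr_product_bp: int, primer_offset_bp: int = 0) -> int:
    """Map position of a transposon, inferred from the length of the PCR
    product spanning from one vector end to the insertion site."""
    return pcr_product_bp - primer_offset_bp


def templates_required(fragment_bp: int, read_bp: int) -> int:
    """Minimum number of non-overlapping reads needed to span the fragment."""
    return math.ceil(fragment_bp / read_bp)


def choose_templates(positions, fragment_bp, spacing_bp):
    """For each ideal, evenly spaced site, pick the nearest mapped insertion."""
    chosen = []
    for target in range(spacing_bp // 2, fragment_bp, spacing_bp):
        nearest = min(positions, key=lambda p: abs(p - target))
        if nearest not in chosen:
            chosen.append(nearest)
    return sorted(chosen)


if __name__ == "__main__":
    # Hypothetical PCR product lengths (bp) for seven random insertions.
    products = [120, 480, 950, 1400, 1900, 2600, 2950]
    mapped = [insertion_position(length) for length in products]
    print(templates_required(3000, 400))
    print(choose_templates(mapped, fragment_bp=3000, spacing_bp=1000))
```

Because every insertion is positioned before sequencing, the chosen templates
form an ordered set and the resulting reads can be assembled by position rather
than by overlap matching.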
1. 2.12.3 Drawbacks of these two strategies, "primer-walking" method
However, transposon sequencing can only be used on fragments containing 2-
kb; preferably 3-4 kb. Thus, to use the transposon method on larger fragments,
smaller subclones of the original fragment must be generated and organized
into an
ordered overlapping set. The shotgun strategy is not completely appropriate
for this
purpose. Neither is an alternative strategy termed dog-tagging. Dog-tagging is
a
"walking" process, a contiguous DNA sequencing method called the "primer-
walking" method using the Sanger's DNA polymerase enzymatic sequencing
procedure, that scans through a 30-hit subclone library for sequences that are
near the
end of the last walking step. It is labor-intensive and does not always
succeed. In this
method, the DNA copying has to occur always from the template DNA during DNA
sequencing. In contrast, in the PCR procedure, the target DNA amplified in the
first
rounds from the original input template DNA will function as the template DNA
in
subsequent cycles of amplification. After a certain number of amplification cycles,
the DNA
sequencing reaction will be started by adding the sequencing "cocktail". Thus
in the
PCR reaction, only one copy of template DNA is theoretically sufficient to
amplify
into millions of copies, and therefore very little genomic (or template) DNA
is
sufficient for sequencing. The advantage of DNA amplification that exists in
PCR is
lacking in the conventional Sanger procedure. Thus, this primer-walking method
will
require a larger amount of template DNA compared to the PCR sequencing method.
Also, because the long DNA has a tendency to re-anneal back to duplex DNA, the
sequencing gel pattern may not be as clean as in a PCR procedure, when a very
long
DNA is being sequenced. This may limit the length of the DNA that could be
contiguously sequenced without breaking the DNA using the primer-walking
procedure. The PCR method also enables the reduction of non-specific binding
of the
primers to the template DNA because the enzymes used in these protocols
function at
high-temperatures, and thus allow "stringent" reaction conditions to be used
to
improve sequencing.
The present method of contiguous DNA sequencing using the basic PCR
technique has thus many advantages over the primer walking method. Also, so
far no
method exists for contiguously sequencing a very long DNA using PCR technique.
The present invention thus offers a unique and very advantageous procedure for
contiguous DNA sequencing.
1. 2.12.4 Amplification and sequencing a long genomic DNA without subcloning
into smaller fragments
In one embodiment, the present invention provides a method for contiguous
sequencing of very long DNA using a modification of the standard PCR technique
without the need for breaking down and subcloning the long DNA.
The PCR technique enables the amplification of DNA which lies between two
regions of known sequence (K. B. Mullis et al., U.S. Pat. Nos. 4,683,202;
7/1987;
435/91; and 4,683,195, 7/1987; 435/6). Oligonucleotides complementary to these
known sequences at both ends serve as "primers" in the PCR procedure. Double
stranded target DNA is first melted to separate the DNA strands, and then
oligonucleotide (oligo) primers complementary to the ends of the segment which
is
desired to be amplified are annealed to the template DNA. The oligos serve as
primers
for the synthesis of new complementary DNA strands, using a DNA polymerase
enzyme and a process known as primer extension. The orientation of the primers
with
respect to one another is such that the 5' to 3' extension product from each
primer
contains, when extended far enough, the sequence which is complementary to the
other oligo. Thus, each newly synthesized DNA strand becomes a template for
synthesis of another DNA strand beginning with the other oligo as primer.
Repeated
cycles of melting, annealing of oligo primers, and primer extension lead to a
(near)
doubling, with each cycle, of DNA strands containing the sequence of the
template
beginning with the sequence of one oligo and ending with the sequence of the
other
oligo.
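The primer orientation described above can be illustrated with a small
in-silico sketch that locates the segment bounded by two primers on a template.
The function names and the example sequences are hypothetical and introduced
only for illustration; exact string matching stands in for primer annealing.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def amplicon(template: str, forward_primer: str, reverse_primer: str):
    """Return the top-strand segment bounded by the two primers, or None.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its binding site appears on the top strand
    as its reverse complement, downstream of the forward primer.
    """
    start = template.find(forward_primer)
    if start < 0:
        return None
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end < 0:
        return None
    return template[start:end + len(site)]


if __name__ == "__main__":
    template = "TTG" + "ACCATGG" + "CTAGCTAGGATC" + "CGTTAAC" + "GGT"
    print(amplicon(template, "ACCATGG", reverse_complement("CGTTAAC")))
```

The returned segment begins with the sequence of one primer and ends with the
complement of the other, which is the product that is (nearly) doubled in each
cycle. If either primer site is unknown, no such segment can be defined.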
The key requirement for this exponential increase of template DNA is the two
oligo primers complementary to the ends of the sequence desired to be
amplified, and
oriented such that their 3' extension products proceed toward each other. If
the
sequence at both ends of the segment to be amplified is not known,
complementary
oligos cannot be made and standard PCR cannot be performed. The object of the
present invention is to overcome the need for sequence information at both
ends of the
segment to be amplified, i.e. to provide a method which allows PCR to be
performed
when sequence is known for only a single region, and to provide a method for
the
contiguous sequencing of a very long DNA without the need for subcloning of
the
DNA.
Amplifying and sequencing using the PCR procedure requires that the
sequences at the ends of the DNA (the two primer sequences) be known in
advance.
Thus, this procedure is limited in utility, and cannot be extended to
contiguously
sequence a long DNA strand. If the knowledge of only one primer is sufficient
without anything known about the other primer, it would be greatly
advantageous for
sequencing very long DNA molecules using the PCR procedure. It would then be
possible to use such a method for contiguously sequencing a long genomic DNA
without the need for subcloning it into smaller fragments, and knowing only
the very
first, beginning primer in the whole long DNA.
1. 2.12.5 Large-scale sequencing through the generation of a subclone path
In another embodiment, the present invention provides a large-scale
sequencing method which combines an efficient method for generating a subclone path
through a large original fragment, such as a genomic clone, wherein the
subclones
are accessible to transposon sequencing, with the sequencing of these
subclones using the transposon method.
1. 2.13 Constructing ordered clone maps of DNA sequences
A primary goal of the human genome project is to determine the entire DNA
sequence for the genomes of human, model, and other useful organisms. A
related
goal is to construct ordered clone maps of DNA sequences at 100 kilobase (kb)
resolution for these organisms (D. R. Cox, E. D. Green, E. S. Lander, D.
Cohen, and
R. M. Myers, "Assessing mapping progress in the Human Genome Project,"
Science,
vol. 265, no. 5181, pp. 2031-2, 1994), incorporated by reference. Integrated
maps
that localize clones together with polymorphic genetic markers (J. Weber and
P. May,
"Abundant class of human DNA polymorphisms which can be typed using the
polymerase chain reaction," Am. J. Hum. Genet., vol. 44, pp. 388-396, 1989),
incorporated by reference, are particularly useful for positionally cloning
human
disease genes (F. Collins, "Positional cloning: let's not call it reverse
anymore," Nature
Genet., vol. 1, no. 1, pp. 3-6, 1992), incorporated by reference. The greatest
need,
however, is for sequence-ready maps. Also useful are maps of expressed
sequences.
Mapping techniques include restriction enzyme analysis of genetic material,
and the
hybridization and detection of specific oligonucleotides which test for the
presence or
absence of particular alleles or loci, and may further be used to gain spatial
information about the occurrence of their targets when appropriate analytic
techniques
are subsequently applied. Note that such characterizations presently are
methodologically and operationally distinct from other processes comprehended
within the biotechnological and related arts. Human DNA sequences now exist as
genomic libraries in a variety of small- and large-insert capacity cloning
vectors, with
yeast artificial chromosomes (YACs) (D. T. Burke, G. F. Carle, and M. V.
Olson,
"Cloning of large exogenous DNA into yeast by means of artificial
chromosomes,"
Science, vol. 236, pp. 806-812, 1987), incorporated by reference, used
extensively in
mapping large regions. Efficient strategies for performing the requisite
experimentation are critical for sequencing and mapping chromosomes or entire
genomes.
1. 2.13.1 Sequence-tagged site
The starting point for an effective sequencing method is a complete ordered
clone map of a genome. Current strategies for ordering clones build contiguous
sequences (contigs) using short-range comparison data. Sequence-tagged site
(STS)
(M. Olson, L. Hood, C. Cantor, and D. Botstein, "A common language for
physical
mapping of the human genome," Science, vol. 245, pp. 1434-35, 1989),
incorporated
by reference, comparisons with clones are used in STS-content mapping (SCM)
(E.
D. Green and P. Green, "Sequence-tagged site (STS) content mapping of human
chromosomes: theoretical considerations and early experiences," PCR Methods
and
Applications, vol. 1, pp. 77-90, 1991), incorporated by reference. For
chromosomal or
genome-wide SCM, very large YACs (megaYACs) are required for the currently
available STS densities (R. Arratia, E. S. Lander, S. Tavare, and M. S.
Waterman,
"Genomic mapping by anchoring random clones: a mathematical analysis,"
Genomics, vol. 11, pp. 806-827, 1991; W. J. Ewens, C. J. Bell, P. J. Donnelly,
P.
Dunn, E. Matallana, and J. R. Ecker, "Genome mapping with anchored clones:
theoretical aspects," Genomics, vol. 11, pp. 799-805, 1991), incorporated by
reference; these large YACs are often chimeric or contain gaps. Restriction
fragment
fingerprint mapping has been done with hybridization (C. Bellanne-Chantelot,
B.
Lacroix, P. Ougen, A. Billault, S. Beaufils, S. Bertrand, S. Georges, F.
Glibert, I.
Gros, G. Lucotte, L. Susini, J.-J. Codani, P. Gesnouin, S. Pook, G. Vaysseix,
J. Lu-
Kuo, T. Ried, D. Ward, I. Chumakov, D. Le Paslier, E. Barillot, and D. Cohen,
"Mapping the whole genome by fingerprinting yeast artificial chromosomes,"
Cell,
vol. 70, pp. 1059-1068, 1992; R. L. Stallings, D. C. Torney, C. E. Hildebrand,
J. L.
Longmire, L. L. Deaven, J. H. Jett, N. A. Doggett, and R. K. Moyzis, "Physical
mapping of human chromosomes by repetitive sequence hybridization," Proc.
Natl.
Acad. Sci. USA, vol. 87, pp. 6218-6222, 1990), incorporated by reference, or
without
hybridization (A. Coulson, J. Sulston, S. Brenner, and J. Karn, "Toward a
physical
map of the genome of the nematode Caenorhabditis elegans," Proc. Natl. Acad.
Sci.
USA, vol. 83, pp. 7821-7825, 1986), incorporated by reference. With
hybridization
fingerprinting, path analysis of YAC fingerprints is not always reliable when
constructing contigs. Hybridizing an internal clone sequence (e.g., end-clone
sequence, Alu-PCR probes) against a library to determine neighboring
sequences
builds unpositioned YAC contigs (M. T. Ross and V. P. J. Stanton, "Screening
large-
insert libraries by hybridization," in Current Protocols in Human Genetics,
vol. 1, N.
J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G.
Seidman, D.
T. Moir, and D. Smith, ed. New York: John Wiley and Sons, 1995, pp. 5.6.1-
5.6.34),
incorporated by reference, although walking techniques are generally reserved
for
closing gaps.
1. 2.13.2 Gridding library onto nylon filters, and hybridizing with probes to
reduce cost, increase throughput
The number of experiments needed for these short-range clone mapping
approaches increases with the number of clones in the library. While
considerable
efficiency is gained by using multiplexed experiments with pooled reagents (G.
A.
Evans and K. A. Lewis, "Physical mapping of complex genomes by cosmid
multiplex
analysis," Proc. Natl. Acad. Sci. USA, vol. 86, no. 13, pp. 5030-4, 1989; E.
D. Green
and M. V. Olson, "Systematic screening of yeast artificial-chromosome
libraries by
use of the polymerase chain reaction," Proc. Natl. Acad. Sci. USA, vol. 87,
no. 3, pp.
1213-7, 1990), incorporated by reference, the experimental requirements are at
least
proportional to the number of clones. A useful goal is to significantly reduce
cost and
increase throughput by achieving a number of required experiments largely
independent of library size. One step toward this independence has been
achieved by
gridding an entire library onto nylon filters, and then hybridizing these
filters with a
set of probes (H. Lehrach, A. Drmanac, J. Hoheisel, Z. Larin, G. Lennon, A. P.
Monaco, D. Nizetic, G. Zehetner, and A. Poustka, "Hybridization fingerprinting
in
genome mapping and sequencing," in Genetic and Physical Mapping I: Genome
Analysis, K. E. Davies and S. M. Tilghman, ed. Cold Spring Harbor, N.Y.: Cold
Spring Harbor Laboratory, 1990, pp. 39-81; A. P. Monaco, V. M. S. Lam, G.
Zehetner, G. G. Lennon, C. Douglas, D. Nizetic, P. N. Goodfellow, and H.
Lehrach,
"Mapping irradiation hybrids to cosmid and yeast artificial chromosome
libraries by
direct hybridization of Alu-PCR products," Nucleic Acids Res., vol. 19, no.
12, pp.
3315-3318, 1991), incorporated by reference. For example, contigs of small
genomic
regions have been constructed by oligonucleotide fingerprinting of gridded
cosmid
filters (A. G. Craig, D. Nizetic, J. D. Hoheisel, G. Zehetner, and H. Lehrach,
"Ordering of cosmid clones covering the herpes simplex virus type I," Nucleic
Acids
Res., vol. 18, no. 9, pp. 2653-60, 1990; A. J. Cuticchia, J. Arnold, and W. E.
Timberlake, "ODS: ordering DNA sequences, a physical mapping algorithm based
on
simulated annealing," CABIOS, vol. 9, no. 2, pp. 215-219, 1992), incorporated
by
reference.
1. 2.13.3 Radiation hybrid mapping
To efficiently span larger genomic regions, radiation hybrid (RH) mapping (D.
R. Cox, M. Burmeister, E. R. Price, S. Kim, and R. M. Myers, "Radiation hybrid
mapping: a somatic cell genetic method for constructing high-resolution maps
of
mammalian chromosomes," Science, vol. 250, pp. 245-250, 1990), incorporated by
reference, has been used to localize small DNA sequences (though not clones)
into
high-resolution bins. Relatively few PCR experiments with one 96-well plate
library
of RHs generally suffice for mapping STSs or genes to unique bins having 250
kb to 1
Mb average resolution. The very large multiple fragments in each RH clone
efficiently cover much of a chromosome (or genome). Assaying a sequence for
intersection against a set of RHs provides long-range relational information
for
localization much akin to somatic cell hybrid (SCH) mapping (M. C. Weiss and
H.
Green, "Human-mouse hybrid cell lines containing partial complements of human
chromosomes and functioning human genes," Proc. Natl. Acad. Sci. USA, vol. 58,
pp.
1104-1111, 1967), incorporated by reference. However, RH mapping offers much
greater resolution than SCH or fluorescent in situ hybridization (FISH)
mapping.
1. 2.13.4 Combining RH mapping with filter hybridization techniques
For highly optimized experimentation, it would be desirable to combine high-
resolution long-range RH mapping with low-cost high-throughput filter
hybridization
techniques to map clones. One can serially probe a gridded clone library with
a set of
RHs (H. Lehrach, A. Drmanac, J. Hoheisel, Z. Larin, G. Lennon, A. P. Monaco,
D.
Nizetic, G. Zehetner, and A. Poustka, "Hybridization fingerprinting in genome
mapping and sequencing," in Genetic and Physical Mapping I: Genome Analysis,
K.
E. Davies and S. M. Tilghman, ed. Cold Spring Harbor, New York: Cold Spring
Harbor Laboratory, 1990, pp. 39-81), in principle requiring a number of
experiments
that is independent of the clone library size and logarithmically related to
the desired
map resolution. However, complex hybridization probes such as RHs (or their
Alu-
PCR products) generate data containing considerable noise. This inherent
uncertainty,
together with the large clone insert size (which complicates conventional RH
analysis), has thus far precluded high-resolution mapping of clones using RHs
(J.
Kumlien, T. Labella, G. Zehetner, R. Vatcheva, D. Nizetic, and H. Lehrach,
"Efficient
identification and regional positioning of YAC and cosmid clones to human
chromosome 21 by radiation fusion hybrids," Mammalian Genome, vol. 5, no. 6,
pp.
365-71, 1994), incorporated by reference.
1. 2.13.5 Inner product mapping
Inner product mapping (IPM) is a hybridization-based method for achieving
high-throughput, high-resolution RH mapping of clones (M. W. Perlin and A.
Chakravarti, "Efficient construction of high-resolution physical maps from
yeast
artificial chromosomes using radiation hybrids: inner product mapping,"
Genomics,
vol. 18, pp. 283-289, 1993), incorporated by reference, that overcomes this
barrier.
Experimental data have established that IPM is a highly rapid, inexpensive,
accurate,
and precise large-scale long-range mapping method, particularly when
preexisting RH
maps are available, and that IPM can replace or complement more conventional
short-
range mapping methods.
1. 2.13.6 Obtaining improved mapping results
Improved mapping results can be obtained incrementally by gradually
enlarging the data tables, a process which provides useful feedback to both
experimentation and analysis. With additional RHs, the signal-to-noise
characteristics
of the clone profiles improve. This incremental process, and the relatively
few RHs
required for accurate mapping, follows the logarithmic number of the probes
needed
for IPM. For best mapping results, as many STS-typed RHs as feasible are used:
with
currently available high-throughput, robotically-assisted hybridization
methods, the
localization benefits of performing many filter hybridizations outweigh the
relatively
low experimentation costs. The incremental construction also highlights IPM's
indirect inference of map location: STS-content mapping directly compares
clones
with STSs, and can not map small-insert clones against STSs which are
insufficiently
dense.
1. 2.13.7 Building accurate maps and partitioning data noise
IPM builds accurate maps from low-confidence data. IPM's partitioning of the
experiments into two data tables of (A) clones vs. RHs and (B) RHs vs. STSs
also
partitions the data noise. Table B is formed from relatively noiseless PCR-
based
comparisons of STSs against RH DNA, and can thus accurately order and position
the
STS bins using combinatorial mapping procedures (M. Boehnke, "Radiation hybrid
mapping by minimization of the number of obligate chromosome breaks," Genetic
Analysis Workshop 7: Issues in Gene Mapping and the Detection of Major Genes.
Cytogenet Cell Genet, vol. 59, pp. 96-98, 1992; M. Boehnke, K. Lange, and D.
R.
Cox, "Statistical methods for multipoint radiation hybrid mapping," Am. J.
Hum.
Genet., vol. 49, pp. 1174-1188, 1991), incorporated by reference. Table A is
formed
from inherently unreliable and inconsistently replicated hybridizations of
complex RH
probes against gridded filters. Inner product mapping uses the table B data
matrix to
ameliorate these data errors and robustly translate a clone's noisy RH
signature
vector (a row of table A) into a chromosomal profile, whose peak bins the
clone.
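The combination of the two tables can be sketched as a plain inner product.
This is a minimal illustration under stated assumptions: the tables are small
hypothetical 0/1 matrices, the profile is an unnormalized inner product, and
the function names are introduced here; the published IPM procedure may apply
additional normalization or statistical weighting.

```python
def inner_product_profile(clone_vs_rh, rh_vs_sts):
    """Translate one clone's RH signature (a row of table A) into a profile
    over STS bins by taking its inner product with each column of table B."""
    n_bins = len(rh_vs_sts[0])
    return [
        sum(a * row[j] for a, row in zip(clone_vs_rh, rh_vs_sts))
        for j in range(n_bins)
    ]


def bin_clone(clone_vs_rh, rh_vs_sts):
    """Return the index of the STS bin at the peak of the clone's profile."""
    profile = inner_product_profile(clone_vs_rh, rh_vs_sts)
    return max(range(len(profile)), key=profile.__getitem__)


if __name__ == "__main__":
    # Hypothetical table B: rows are RHs, columns are ordered STS bins.
    table_b = [
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
    ]
    # Hypothetical noisy table A row for a clone lying in bin 1; the true
    # signature would be [1, 1, 0, 0, 1], and RH 3 is a false positive.
    noisy_signature = [1, 1, 0, 1, 1]
    print(inner_product_profile(noisy_signature, table_b))
    print(bin_clone(noisy_signature, table_b))
```

In this toy example the false-positive hybridization raises the score of a
neighboring bin but does not displace the peak, which illustrates how the
relatively noiseless table B can absorb errors in table A.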
1. 2.13.8 Mapping YACs using IPM
IPM is a proven approach for mapping YACs (C. W. Richard III, D. J.
Duggan, K. Davis, J. E. Farr, M. J. Higgins, S. Qin, L. Zhang, T. B. Shows, M.
R.
James, and M. W. Perlin, "Rapid construction of physical maps using inner
product
mapping: YAC coverage of chromosome 11," in Fourth International Conference on
Human Chromosome 11, Sep. 22-24, Oxford, England, 1994), incorporated by
reference, and is a candidate method for mapping PACs (P. A. Ioannou, C. T.
Amemiya, J. Garnes, P. M. Kroisel, H. Shizuya, C. Chen, M. A. Batzer, and P. J.
de
Jong, "A new bacteriophage P1-derived vector for the propagation of large human
DNA fragments," Nature Genet., vol. 6, no. 1, pp. 84-89, 1994), incorporated
by
reference, cosmids, expressed sequences (M. D. Adams, J. M. Kelley, J. D.
Gocayne,
M. Dubnick, M. H. Polymeropoulos, H. Xiao, C. R. Merril, A. Wu, B. Olde, R. F.
Moreno, A. R. Kerlavage, W. R. McCombie, and J. C. Venter, "Complementary DNA
sequencing: Expressed sequence tags and human genome project," Science, vol.
252,
pp. 1651-1656, 1991), incorporated by reference, and other physical reagents
(J. D.
McPherson, C. Wagner-McPherson, M. Perlin, and J. J. Wasmuth, "A physical map
of human chromosome 5 (Abstract)," Amer. J. Hum. Genet., vol. 55, no. 3
Supplement, pp. A265, 1994), incorporated by reference. Hybridization
efficiency for
table A can be improved by using long and IRE-bubble PCR (D. J. Munroe, M.
Haas,
E. Bric, T. Whitton, H. Aburatani, K. Hunter, D. Ward, and D. E. Housman, "IRE-bubble
PCR: a rapid method for efficient and representative amplification of
human
genomic DNA sequences from complex sources," Genomics, vol. 19, no. 3, pp. 506-
14, 1994), incorporated by reference, to reduce false negative errors,
providing
controls and redundant DNA spotting for internal calibration, and directly
acquiring
signals (e.g., via a phosphorimager, Molecular Dynamics, Sunnyvale, Calif.) to
facilitate automated scoring. Current robotic technologies enable the high-
throughput
construction of gridded filters (A. Copeland and G. Lennon, "Rapid arrayed
filter
production using the 'ORCA' robot," Nature, vol. 369, no. 6479, pp. 421-422,
1994),
incorporated by reference; single use of these filters would reduce the time
and error
related to stripping and reprobing. Robots similarly provide high-throughput
PCR
comparisons for constructing table B. Alternatively, existing RH mapping data
can be
rapidly extended (at low cost) into inner product maps of libraries (U.
Francke, E.
Chang, K. Comeau, E.-M. Geigl, J. Giacalone, X. Li, J. Luna, A. Moon, S.
Welch,
and P. Wilgenbus, "A radiation hybrid map of human chromosome 18," Cytogenet.
Cell Genet., vol. 66, pp. 196-213, 1994), incorporated by reference.
1. 2.13.9 Whole genome RH libraries
Whole human genome RH (WG-RH) libraries of 0.5 and 1.0 Mb resolution
have been constructed (D. R. Cox, K. O'Connor, S. Hebert, M. Harris, R. Lee,
B.
Stewart, G. DiSibio, M. Boehnke, K. Large, R. Goold, and R. M. Myers,
"Construction and analysis of a panel of 'whole genome' radiation hybrids
(Abstract)," Amer. J. Hum. Genet., vol. 55, no. 3 Supplement, pp. A23, 1994;
M. A.
Walter, D. J. Spillett, P. Thomas, J. Weissenbach, and P. N. Goodfellow, "A
method
for constructing radiation hybrid maps of whole genomes," Nature Genet., vol.
7, no.
1, pp. 22-28, 1994), incorporated by reference, and have been characterized
for the
STSs used in the genome-wide CEPH megaYAC STS-content map (T. Hudson, S.
Foote, S. Gerety, J. Ma, S.-h. Xu, X. Hu, J. Bae, J. Silva, J. Valle, S.
Maitra, A.
Colbert, L. Horton, M. Anderson, M. P. Reeve, M. Daly, A. Kaufman, C. Rosenberg,
L. Stein, N. Goodman, J. Orlin, D. C. Page, and E. S. Lander, "Towards an
STS-content map of the human genome (Abstract)," Amer. J. Hum. Genet., vol. 55,
no. 3
Supplement, pp. A23, 1994), incorporated by reference. The availability of
this WG-
RH table B resource suggests that constructing table A by performing
hybridizations
between species specific (e.g., Alu-PCR) products of these RHs and gridded
clones or
expressed sequences, and then combining tables A and B to build a genome-wide
inner product map, is a fast, accurate, and inexpensive approach to whole
genome
physical mapping. IPM has localized the components of chimeric YACs as
distinct
multiple peaks. IPM is therefore useful in verifying and extending current
megaYAC
mapping projects, and in multiplexed experimental designs that pool sequences
from
well-separated bins.
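As an illustration only, the combination of tables A and B described above can
be sketched as a matrix inner product. The toy data, the 0/1 scoring, and the
names table_a, table_b, inner_product_map and peak_bins are assumptions made
for this sketch; they are not taken from the specification.

```python
def inner_product_map(table_a, table_b):
    """Return scores[clone][bin] = sum over hybrids of A[clone][h] * B[h][bin]."""
    n_hybrids = len(table_b)
    n_bins = len(table_b[0])
    return [
        [sum(a_row[h] * table_b[h][b] for h in range(n_hybrids))
         for b in range(n_bins)]
        for a_row in table_a
    ]


def peak_bins(scores):
    """Index of the highest-scoring bin for each clone (first one on ties)."""
    return [row.index(max(row)) for row in scores]


# Hypothetical toy data: 2 clones, 4 radiation hybrids, 3 bins.
# table_a[c][h] = 1 if clone c hybridizes to the products of hybrid h.
table_a = [[1, 1, 0, 0],
           [0, 0, 1, 1]]
# table_b[h][b] = 1 if hybrid h retains the marker that defines bin b.
table_b = [[1, 0, 0],
           [1, 1, 0],
           [0, 0, 1],
           [0, 1, 1]]

scores = inner_product_map(table_a, table_b)
best = peak_bins(scores)
```

In this toy case each clone has a single peak. A chimeric clone would be
expected to show more than one well-separated peak in its score row.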
1.2.13.10 Using short-range data to determine the orders and distances of
clone subsets in proximate bins
IPM provides long-range mapping information for DNA sequences relative to
RH bins through DNA hybridization. This binning information can be
complemented
with short-range mapping data, such as oligonucleotide fingerprint
hybridizations (H.
Lehrach, A. Drmanac, J. Hoheisel, Z. Larin, G. Lennon, A. P. Monaco, D.
Nizetic, G.
Zehetner, and A. Poustka, "Hybridization fingerprinting in genome mapping and
sequencing," in Genetic and Physical Mapping I: Genome Analysis, K. E. Davies
and
S. M. Tilghman, ed. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory,
1990,
pp. 39-81), incorporated by reference, and (R. Drmanac, Z. Strezoska, I.
Labat, S.
Drmanac, and R. Crkvenjakov, "Reliable hybridization of oligonucleotides as
short as
six nucleotides," DNA Cell Biol., vol. 9, no. 7, pp. 527-534, 1990),
incorporated by
reference. Combining the data from these two high-throughput hybridization
studies
enables a two-pass BIN-SORT (A. V. Aho, J. E. Hopcroft, and J. D. Ullman, Data
Structures and Algorithms. Reading, Mass.: Addison-Wesley, 1983), incorporated
by
reference, strategy for high-resolution mapping: first use IPM to bin the
clones, and
then use short-range data to determine the orders and distances of clone
subsets in
proximate bins. This strategy can rapidly construct minimum-length paths of
sequence-ready clones that tile the genome. Crucially, such IPM-derived
contigs
overcome the short-range limitations of all other known mapping methods, and
enable
the coordinated sequencing of the human genome, which is a well-recognized
goal (F.
Collins and D. Galas, "A new five-year plan for the U.S. Human Genome
Project,"
Science, vol. 262, pp. 43-46, 1993), incorporated by reference. Such
combination
approaches can be highly effective for other purposes, such as using short-
range
proximity data to sharpen long-range inner product map results. IPM's
experimental
efficiencies enable effective determination of genome-wide DNA sequences, and
the
construction of high-resolution integrated genome maps for human, model
organism,
and agricultural species.
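The two-pass strategy described above (bin first, then order within bins) can
be sketched as follows. This is a minimal sketch under stated assumptions: each
clone is assumed to already carry a long-range bin index from IPM and a
short-range offset derived from fingerprint data. The tuple layout and the name
two_pass_order are hypothetical.

```python
from collections import defaultdict


def two_pass_order(clones):
    """Order clones given (name, bin_index, offset_within_bin) tuples.

    Pass 1 groups clones by their long-range bin. Pass 2 orders the clones
    inside each bin by the short-range offset. Returns names in map order.
    """
    bins = defaultdict(list)
    for name, bin_index, offset in clones:       # pass 1: binning
        bins[bin_index].append((offset, name))
    ordered = []
    for bin_index in sorted(bins):               # pass 2: order within bins
        ordered.extend(name for _, name in sorted(bins[bin_index]))
    return ordered


# Hypothetical clones with made-up bins and offsets.
example = [("c3", 1, 0.2), ("c1", 0, 0.7), ("c2", 0, 0.1), ("c4", 1, 0.9)]
order = two_pass_order(example)
```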
In one embodiment, this invention pertains to determining the sequence of the
genome of an organism or species through the use of a novel, unobvious, and
highly
effective clone mapping strategy. Such sequence information can be used for
finding
genes of known utility, determining structure/function properties of genes and
their
products, elucidating metabolic networks, understanding the growth and
development
of humans and other organisms, and making comparisons of genetic information
between species. From these studies, diagnostic tests and pharmacological
agents of great utility can be developed for preventing and treating human
and other diseases.
Disclosures of this type, yielded in a search, are:

Patent Number    Inventor                     Issued*
US 5,302,509     Cheeseman, Peter C.          12 April 1994
WO 93/2134       Rosenthal, A.; et al.        28 October 1993
DE 41 41 178 A1  Ansorge, Wilhelm             16 June 1993
WO 93/01583      Gibbs, Richard A.; et al.    18 March 1993
WO 91/06678      Tsien, Roger Y.; et al.      16 May 1991
WO 90/13666      Garland, Peter B.; et al.    15 November 1990

Included in some of the above disclosures are descriptions of nucleotide
triphosphates comprising removable fluorescent 3' protecting groups.
1.3 ALTERNATIVE SEQUENCING METHODS
The present invention provides an improved method of determining the
nucleotide base sequence of DNA. In one embodiment, the method of the
invention
involves the preparation of a DNA substrate comprising a set of molecules,
each
having a template strand and a primer strand, wherein the 3' ends of the
primer strands
of the molecules terminate at about the same nucleotide position on the
template
strands of the molecules within each set. Preferably, the template and primer
strands
of the molecules are of unequal lengths wherein the 3' ends of the primer
strands of
the molecules terminate at about the same nucleotide position on the template
strands
of the molecules within each set. DNA synthesis is induced to obtain labeled
reaction
products comprising newly synthesized DNA complementary to the template strands
using the 3' ends of the primer strands to prime DNA synthesis, labeled
nucleoside
triphosphates, at least one modified nucleoside triphosphate, and preferably,
a suitable
chain terminator, wherein the modified nucleoside triphosphate is selected to
substantially protect newly synthesized DNA from cleavage. Thereafter, the
labeled
reaction products are cleaved at one or more selected sites to obtain labeled
DNA
fragments wherein newly synthesized DNA is substantially protected from
cleavage
by the incorporation of the modified nucleotide. The labeled DNA fragments
obtained
in the preceding step are separated and their nucleotide base sequence is
identified by
suitable means. The advantages of the present invention over prior art methods
will
become apparent after consideration of the accompanying drawings and the
following
detailed description of the invention.
1.3.1 One-step process for generating from a DNA template
According to one process of the invention, a combined amplification and
termination reaction is performed using at least two different polymerase
enzymes,
each having a different affinity for the chain terminating nucleotide, so that
polymerization by an enzyme with relatively low affinity for the chain
terminating
nucleotide leads to exponential amplification whereas an enzyme with
relatively high
affinity for the chain terminating nucleotide terminates the polymerization
and yields
sequencing products.
In another aspect, the invention features kits for directly amplifying nucleic
acid templates and generating base specifically terminated fragments. In one
embodiment, the kit can comprise an appropriate amount of: (i) a complete set
of chain-elongating nucleotides; (ii) at least one chain-terminating
nucleotide; (iii) a first DNA polymerase, which has a relatively low affinity
towards the chain terminating nucleotide; and (iv) a second DNA polymerase,
which has a relatively high affinity towards the chain terminating nucleotide.
The kit can also optionally include
an
appropriate primer or primers, appropriate buffers as well as instructions for
use.
The instant invention allows DNA amplification and termination to be
performed in one reaction vessel. Due to the use of two polymerases with
different
affinities for dideoxy nucleotide triphosphates, exponential amplification of
the target
sequence can be accomplished in combination with a termination reaction
nucleotide.
In addition, the process obviates the purification procedures, which are
required when
amplification is performed separately from base terminated fragment
generation.
Further, the instant process requires less time to accomplish than separate
amplification and base specific termination reactions.
When combined with a detection means, the process can be used to detect and/or
quantitate a particular nucleic acid sequence where only small amounts of
template
template
are available and fast and accurate sequence data acquisition is desirable.
For
example, when combined with a detection means, the process is useful for
sequencing
unknown genes or other nucleic acid sequences and for diagnosing or monitoring
certain diseases or conditions, such as genetic diseases, chromosomal
abnormalities,
genetic predispositions to certain diseases (e.g. cancer, obesity,
atherosclerosis) and pathogenic (e.g. bacterial, viral, fungal, protistal)
infections. Further, when double stranded DNA molecules are used as the
starting material, the instant process
provides an opportunity to simultaneously sequence both strands, thereby
providing
greater certainty of the sequence data obtained or acquiring sequence
information
from both ends of a longer template.
1.3.2 Base-specific Reactions Used on DNA fragments from a piece of an
unknown sequence
In accordance with the present invention, there is also provided a method and
apparatus for determining the sequence of the bases in DNA by measuring the
molecular mass of each of the DNA fragments in mixtures prepared by either the
Maxam-Gilbert or Sanger-Coulson techniques. The fragments are preferably
prepared
as in these standard techniques, although the fragments need not be tagged
with
radioactive tracers. These standard procedures produce from each section of
DNA to
be sequenced four separate collections of DNA fragments, each set containing
fragments terminating at only one or two of the four bases. In the Maxam-
Gilbert
method, the four separated collections contain fragments terminating at G,
both G and
A, both C and T, or C positions, respectively. Each of these collections is
sequentially
loaded into an ultraviolet laser desorption mass spectrometer, and the mass
spectrum
of each collection is recorded and stored in the memory of a computer. These
spectra
are recorded under conditions such that essentially no fragmentation occurs in
the
mass spectrometer, so that the mass of each ion measured corresponds to the
molecular weight of one of the DNA fragments in the collection, plus a proton
in the
positive ion spectrum, and minus a proton in the negative ion spectrum.
Spectra
obtained from the four collections are compared using a computer algorithm, and
the
location of each of the four bases in the sequence is unambiguously
determined.
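The comparison of the four collections can be illustrated with a short sketch.
It assumes that each measured mass has already been converted to a fragment
length, and that each of the four Maxam-Gilbert-style collections (G, G and A,
C and T, C) is given as a set of such lengths. The function name read_sequence
and the toy data are hypothetical and stand in for the computer algorithm
mentioned above, whose details are not given here.

```python
def read_sequence(g, g_a, c_t, c):
    """Assign one base per fragment length from four collections.

    g, g_a, c_t and c are sets of fragment lengths observed in the
    G, G+A, C+T and C collections respectively.
    """
    calls = []
    for n in sorted(g | g_a | c_t | c):
        if n in g:
            calls.append("G")      # present in the G-only collection
        elif n in g_a:
            calls.append("A")      # in G+A but not in G
        elif n in c:
            calls.append("C")      # present in the C-only collection
        else:
            calls.append("T")      # in C+T but not in C
    return "".join(calls)


# Toy collections consistent with the six-base sequence GATCCA.
g = {1}
g_a = {1, 2, 6}
c_t = {3, 4, 5}
c = {4, 5}
sequence = read_sequence(g, g_a, c_t, c)
```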
It is also possible, in principle, to obtain the DNA sequence from a single
mass
spectrum obtained from a more complex single mixture containing all possible
fragments, but both the resolution and mass accuracy required are much higher
than in
the preferred method described above. As a result the accuracy of the DNA
sequence
obtained from the single spectrum method will generally be inferior, and the
gain in
raw sequence speed will be counterbalanced by the need for more repetitions to
assure
accuracy of the sequence.
The DNA fragments to be analyzed are dissolved in a liquid solvent
containing a matrix material. Each sample is irradiated with a UV laser beam
at a wavelength of between 260 nm and 560 nm, with pulses of 1 to 20 ns
pulsewidth.
It is an objective of the present invention to provide a method and apparatus
for the rapid and accurate sequencing of human genome and other DNA material.
It is a further objective of the present invention to provide an instrument
and
method which are relatively simple to operate, relatively low in cost, and
which may
be automated to sequence thousands of gene bases per hour.
It is a further objective of the present invention to obtain much faster and
more
accurate DNA sequence data by eliminating the gel electrophoresis separation
technique used in conventional DNA sequencing methods to determine the masses
of
the DNA fragments in a mixture.
1.3.3 Sequencing Through Exposure To Immobilized Probes Of Shorter Length
According to one embodiment of the invention, a target oligonucleotide is
exposed to a large number of immobilized probes of shorter length. The probes
are
collectively referred to as an "array." In the method, one identifies whether
a target
nucleic acid is complementary to a probe in the array by identifying first a
core probe
having high affinity to the target, and then evaluating the binding
characteristics of all
probes with a single base mismatch as compared to the core probe. If the
single base
mismatch probes exhibit a characteristic binding or affinity pattern, then the
core
probe is exactly complementary to at least a portion of the target nucleic
acid.
The method can be extended to sequence a target nucleic acid larger than any
probe in the array by evaluating the binding affinity of probes that can be
termed
"left" and "right" extensions of the core probe. The correct left and right
extensions of
the core are those that exhibit the strongest binding affinity and/or a
specific
hybridization pattern of single base mismatch probes.
The binding affinity characteristics of single base mismatch probes follow a
characteristic pattern in which probe/target complexes with mismatches on the
3' or 5'
termini are more stable than probe/target complexes with internal mismatches.
The
process is then repeated to determine additional left and right extensions of
the core
probe to provide the sequence of a nucleic acid target.
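The repeated left and right extension of a core probe can be sketched as
follows. This is a simplified illustration: it treats binding as present or
absent and extends only when exactly one overlapping probe qualifies, whereas
the method above relies on binding affinities and mismatch patterns. The name
extend_core and the example probes are hypothetical.

```python
def extend_core(core, bound_probes):
    """Greedily extend `core` with probes that overlap it by k-1 bases.

    Extension in a direction stops when no probe, or more than one probe,
    could serve as the next extension.
    """
    k = len(core)
    probes = set(bound_probes)
    limit = len(probes) + k          # guard against cycling through repeats
    seq = core
    while len(seq) <= limit:         # right extensions
        nxt = [p for p in probes if p[:-1] == seq[-(k - 1):]]
        if len(nxt) != 1:
            break
        seq += nxt[0][-1]
    while len(seq) <= limit:         # left extensions
        prv = [p for p in probes if p[1:] == seq[:k - 1]]
        if len(prv) != 1:
            break
        seq = prv[0][0] + seq
    return seq


# Hypothetical 4-base probes that bind a target, with GTTG as the core probe.
target_read = extend_core("GTTG", ["ACGT", "CGTT", "GTTG", "TTGC", "TGCA"])
```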
In some embodiments, such as in diagnostics, a target is expected to have a
particular sequence. To determine if the target has the expected sequence, an
array of
probes is synthesized that includes a complementary probe and all or some
subset of
all single base mismatch probes. Through analysis of the hybridization pattern
of the
target to such probes, it can be determined if the target has the expected
sequence and,
if not, the sequence of the target may optionally be determined.
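For the diagnostic case, the probe set can be enumerated directly. The sketch
below builds the perfect complement of an expected target and every
single-base substitution of it. The function names are hypothetical, and the
sketch covers DNA bases only.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def perfect_complement(target):
    """Reverse complement of a DNA target, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(target))


def single_mismatch_probes(probe):
    """All probes that differ from `probe` at exactly one position."""
    return [
        probe[:i] + base + probe[i + 1:]
        for i in range(len(probe))
        for base in "ACGT"
        if base != probe[i]
    ]


probe = perfect_complement("AACG")
mismatches = single_mismatch_probes(probe)
```

A probe of length n has 3n single-base substitutions, so the array grows
linearly with probe length.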
Kits for analysis of nucleic acid targets are also provided by virtue of the
present invention. According to one embodiment, a kit includes an array of
nucleic
acid probes. The probes may include a perfect complement to a target nucleic
acid.
The probes also include probes that are single base substitutions of the
perfect
complement probe. The kit may include one or more of the A, C, T, G, and/or U
substitutions of the perfect complement. Such kits will have a variety of
uses,
including analysis of targets for a particular genetic sequence, such as in
analysis for
genetic diseases.
A further understanding of the nature and advantages of the inventions herein
may be realized by reference to the remaining portions of the specification
and the
attached drawings.
1.3.4 Sequencing Contiguously, Without The Need For Fragmenting And
Sub-Cloning The DNA
The present invention also enables the amplification of a DNA adjacent to a
known sequence using the PCR, without the knowledge of the sequence for a
second
primer.
The present primary invention also provides a new method for sequencing a
contiguously very long DNA sequence using the PCR technique, thereby enabling
contiguous genomic sequencing. It will avoid the need for mapping or
sub-cloning of
shorter DNA fragments from haploid genomes such as the bacterial genomes. This
method can be used on very large DNA inserts into vectors such as the YAC.
Thus,
diploid genomes can be sequenced without any further need to sub-clone from
the
YAC clones. The cloned inserts can be of any length, of several million
nucleotides.
Alternatively, wherever purified chromosomes are available, this method can be
directly applied to sequence the whole chromosome without any need to fragment
the
chromosome or obtain YAC clones from the chromosome. This method can also be
used on whole unpurified genomes with appropriate modifications to account for
the
allelic variations of the two alleles present on the two chromosomes. In
essence, using
the method of the present invention, one can generate contiguous genomic
sequence
information in a manner not possible with any other known protocol using PCR.
The extended invention that enables the sequencing of an unknown region of
very long DNA (e.g. genomic DNA) of totally unknown sequence would also find
many applications in biology and medicine. For instance, it can be used to
physically
"map" a chromosome or genome. It would, for example, enable the production of
an
inventory of many sequences, each about 500 nucleotides long, and the exact primer
associated with each of them. This method would also enable the cloning of the
amplified DNA sequences from arbitrary regions from a genomic DNA without the
need for breaking down the DNA. Using appropriately longer partly fixed
primers (as
the second primers), very long DNA pieces (several kilobases long) could be
amplified and cloned by using this method.
1.3.4.1 PCR Technique with 1 Primer
In one embodiment, the present invention enables the amplification of a DNA
stretch using the PCR procedure with the knowledge of only one primer. Using
this
basic method, the present invention describes a procedure by which a very long
DNA
of the order of millions of nucleotides can be sequenced contiguously, without
the
need for fragmenting and sub-cloning the DNA. In this method, the general PCR
technique is used, but the knowledge of only one primer is sufficient, and the
knowledge of the other primer is derived from the statistics of the
distributions of
oligonucleotide sequences of specified lengths.
Present DNA sequencing methods, which separate DNA fragments on a gel, can
resolve products only up to a length of about 1000 nucleotides.
Thus, in a single step, the sequence of a DNA fragment up to a length of only
about
1000 nucleotides can be obtained by the two conventional DNA sequencing
methods.
A DNA sequence of a few nucleotides up to many thousand nucleotides can be
amplified by the PCR procedure. Thus the PCR procedure can be combined with
the
DNA sequencing procedure successfully.
A primer is usually twelve nucleotides or longer. Suppose the sequence of one
primer is known in a long DNA whose sequence is to be worked out. From this
primer sequence, a specific sequence of four nucleotides occurs statistically
at an average distance of 256 nucleotides. It has been
worked out
by Senapathy that a particular sequence of four characters would occur
anywhere
from zero distance up to about 1500 characters with a 99.9% probability (P.
Senapathy, "Distribution and repetition of sequence elements in eukaryotic
DNA:
New insights by computer aided statistical analysis," Molecular Genetics (Life
Sciences Advances), 7:53-65 (1988)). The mean distance for such an occurrence
is
256 characters and the median is 180 characters. Similarly, a 5 nucleotide
long
specific sequence will occur at a mean distance of 1024 characters, with
99.99% of
them occurring within 6000 characters from the first primer. The median
distance for
the occurrence of a 5-nucleotide specific sequence is 730 nucleotides.
Similarly, a
particular 6 nucleotide long sequence will occur at a mean distance of 4096
nucleotides and a median distance of 2800 nucleotides. A primer of known
length, say 14, can be prepared with a known sequence of 6 characters, the rest
of the sequence being random. That is, any of the four nucleotides can occur
at the "random" sequence locations. With a fixed 5, 6 or 7 nucleotide
sequence
within the second primer, a primer of length 12-18 can be prepared with high
specificity of binding.
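The mean and median distances quoted above can be approximated with a simple
model. The sketch below assumes independent, equally frequent bases, so that
the distance to the next occurrence of one specific k-nucleotide sequence is
geometrically distributed. That model is an assumption of this sketch; the
figures cited from Senapathy come from statistical analysis of sequences and
need not match it exactly, particularly in the upper tail.

```python
import math


def kmer_distance_stats(k, quantile=0.999):
    """Mean, median and an upper quantile of the distance to the next
    occurrence of one specific k-mer under a geometric model."""
    p = 0.25 ** k                          # chance of a match at one position
    mean = 1 / p

    def distance_at(level):
        # Smallest distance n with P(distance <= n) >= level.
        return math.ceil(math.log(1 - level) / math.log(1 - p))

    return mean, distance_at(0.5), distance_at(quantile)


# Fixed sequences of 4, 5 and 6 nucleotides, as discussed in the text.
stats = {k: kmer_distance_stats(k) for k in (4, 5, 6)}
```

The means reproduce 256, 1024 and 4096 exactly, and the model medians fall
close to the quoted values of 180, 730 and 2800.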
1.3.4.2 Non-Random Primer (Partly Fixed Primer)
Such a partially non-random primer (hereafter called the partly fixed primer,
or partly non-random primer, meaning that part of its sequence is fixed) can
"anneal"
to only the sequence at which the fixed sequence exists. That is, from the
first primer,
the partly fixed primer will bind at an average distance of 1024 characters
(for a fixed
five nucleotide characters). This primer will bind specifically only at the
location of
the occurrence of the particular five nucleotide sequence with respect to the
first
primer. The average distance between the first primer and the second non-
random
primer is ideal for DNA amplification and DNA sequencing. In this situation,
the first
primer is labeled. Thus, although there would be many locations in the long
DNA
molecule at which the non-random primer can bind, it would not affect the DNA
sequencing because it is dependent only upon the labeled primer.
1.3.4.3 Partly Fixed 2nd Primer
Although the partly fixed second primer has a random sequence component in
it, a sub-population of the primer molecules will have the exact sequence that
would
bind with the exact target sequence. The proportion of the molecules with
exact
sequence that would bind with the exact target sequence will vary depending on
the
number of random characters in the partly fixed second primer. For example, in
a
second primer 11 nucleotides long with 6 characters fixed and 5 characters
random,
one in 1000 molecules will have the exact sequence complementary to the target
sequence on the template. By increasing the concentration of the partly fixed
second
primer appropriately, a comfortable level of PCR amplification required for
sequencing can be achieved. Increasing the primer concentration requires a
corresponding increase in the concentration of magnesium, which is required for
the function of the polymerase enzyme. The excess primers (and "primer-dimers"
formed due to the excess of primers) can be removed after the amplification
reaction by a gel-purification step.
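The proportion of exactly matching molecules follows from the number of random
positions. The sketch below assumes that each random position carries the four
nucleotides in equal proportion; the function names are hypothetical.

```python
def exact_fraction(n_random):
    """Fraction of partly fixed primer molecules whose random positions
    all match the target, assuming equal base frequencies."""
    return 0.25 ** n_random


def fold_excess_needed(n_random):
    """Fold increase in total primer needed to supply as much exactly
    matching primer as a fully specified primer would."""
    return 4 ** n_random


# The example in the text: 6 fixed and 5 random characters.
fraction = exact_fraction(5)
excess = fold_excess_needed(5)
```

With five random positions the fraction is 1 in 1024, which corresponds to the
"one in 1000" figure quoted above.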
Any non-specific binding by any population of the second primers to non-
target sequences could be avoided by adjusting (increasing) the temperature of
re-
annealing appropriately during DNA amplification. It is well known that the
change
of even one nucleotide due to point-mutation in some cancer genes can be
detected by
DNA-hybridization. This technique is routinely used for diagnosing particular
cancer
genes (e.g. John Lyons, "Analysis of ras gene point mutations by PCR and
oligonucleotide hybridization," in PCR Protocols: A guide to methods and
applications, edited by Michael A Innis et al., (1990), Academic Press, New
York).
This is done by adjusting the "re-annealing" or "melting temperature", and
fine-
tuning the reaction conditions. Thus the binding of non-specific sequences
even with
just one nucleotide difference compared to the target binding-site in the
template
sequence can be avoided.
It should also be noted that non-specific binding sites for the partly fixed
second primers could be expected to occur statistically on a long genomic DNA
at
many places other than the target site which is close to the first primer.
Amplification
of non-specific DNA between these primer binding sites that could occur on
opposite
strands of the template DNA could happen. However, this would not affect the
objective of the present invention of specific DNA sequencing of the target
sequence.
Because only the first primer is labeled radioactively or fluorescently, only
the
reaction products of the target DNA will be visualized on the sequencing gel
pattern.
The presence of such non-specific amplification products in the reaction
mixture will
also not affect the DNA sequencing reaction.
Amplification of DNA will occur not only between the first primer and the
partly fixed second primer that occurs closest downstream from the first
primer, but
also between the first primer and one or two subsequently occurring second
primers,
depending upon the distance at which they occur. However, these amplification
products will all start from the first primer and will proceed up to these
second
primers. Since the DNA sequencing products are visualized by labeling the
first
primer, and since the DNA synthesis during the sequencing reaction proceeds
from
the first primer, the presence of two or three amplification products that
start from the
first primer will not affect the DNA sequencing products and their
visualization on
gels. At the most, the intensity of the bands that are subsets of different
amplification
products will vary slightly on the gel, but not affect the gel pattern. In
fact, it is
expected that this phenomenon will enable the sequencing of a longer DNA
strand
where the closest downstream primer is too close to the first primer--thereby
avoiding
the need for sequencing from the first primer again using another partly fixed
second
primer.
The minimum length of primer for highly specific amplification between
primers on a template DNA is usually considered to be about 15 nucleotides.
However, in the present invention, this length can be reduced to 12-14
nucleotides by increasing the G/C content of the fixed sequence.
In essence, the basic procedure of the present invention is fully viable and
feasible, and any non-specificity can be avoided by fine-tuning the reaction
conditions
such as adjusting the annealing temperature and reaction temperature during
amplification, and/or adjusting the length and G/C content of the primers,
which are
routinely done in the standard PCR amplification protocol.
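The trade-off between primer length and G/C content can be illustrated with
the Wallace rule, a common rule of thumb for short oligonucleotides that
estimates melting temperature as 2 °C per A or T and 4 °C per G or C. The rule
is not part of this specification and is used here only as an assumption; the
two primers are made-up examples.

```python
def wallace_tm(primer):
    """Rule-of-thumb melting temperature (degrees C) for a short primer."""
    at = sum(primer.count(base) for base in "AT")
    gc = sum(primer.count(base) for base in "GC")
    return 2 * at + 4 * gc


# A G/C-rich 12-mer versus an A/T-only 15-mer (hypothetical sequences).
tm_short_gc_rich = wallace_tm("GCGCGCGCGCGC")
tm_long_at_rich = wallace_tm("ATATATATATATATA")
```

Under this rule the shorter G/C-rich primer has the higher estimated melting
temperature, which is consistent with the idea that raising G/C content can
offset a shorter primer.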
1.3.4.4 Sequence DNA of 2nd Primer
The primary advantage of the present invention is to provide an extremely
specific second primer that would bind precisely to a sequence at an
appropriate
distance from the first primer resulting in the ability to sequence a DNA
without the
prior knowledge of the second primer. From the newly worked out DNA sequence,
a
primer sequence can be made complementary to a sequence located close to the
downstream end. This can be used as the first primer in the next DNA
amplification-
sequencing reaction, and the unknown sequence downstream from it can be
obtained
by again using the same partly fixed primer that was used in the first round
of
sequencing as the second primer. Thus, knowing only one short sequence in a
contiguously long DNA molecule, the entire sequence can be worked out using
the
present invention.
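The round-by-round procedure described above can be sketched as a simulation
on a known string. This is only an illustration of the iteration logic: the
template is given in full so that each "read" is a slice of it, the second
primer site is found by exact matching of its fixed part, and the parameter
names primer_len and read_len are assumptions of this sketch.

```python
def walk(template, first_primer, fixed, primer_len=12, read_len=500):
    """Simulate contiguous sequencing from one known primer.

    Each round reads from the current first primer up to the nearest
    downstream occurrence of `fixed` (the fixed part of the second primer),
    or up to `read_len` bases, whichever comes first. The next first primer
    is then taken from the end of the newly known sequence.
    """
    pos = template.find(first_primer)
    if pos < 0:
        raise ValueError("first primer not found in template")
    known_end = pos + len(first_primer)      # sequence is known up to here
    rounds = 0
    while True:
        site = template.find(fixed, known_end)
        if site < 0:
            break                            # no further second primer site
        primer_start = max(pos, known_end - primer_len)
        known_end = min(site + len(fixed), primer_start + read_len)
        rounds += 1
    return template[pos:known_end], rounds


# Hypothetical template with two downstream GATC sites.
template = "AAAACCCC" + "T" * 10 + "GATC" + "A" * 10 + "GATC" + "TT"
known, n_rounds = walk(template, "AAAACCCC", "GATC", primer_len=8)
```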
When the length of the fixed sequence in the partly fixed second primer is
increased in the present invention, the distance from the first primer at
which the
second primer will bind on the template will also be correspondingly
increased. For a
6 nucleotide fixed sequence, the median length of DNA amplified will be 2800
nucleotides (mean 4096 nucleotides), and for a 7 nucleotide fixed sequence,
the
median length of amplified DNA will be ~11,000 nucleotides (mean = 16,000
nucleotides). However, even if the length of amplified DNA is several thousand
nucleotides, still this DNA can be used in DNA sequencing procedures.
Furthermore,
the present invention can be used to amplify a DNA of length which is limited
only by
the inherent ability of PCR amplification. A technique known as "long PCR" is
used
to amplify long DNA sequences (Kainz et al., "In vitro amplification of DNA
Fragments > 10 kb," Anal Biochem., 202:46 (1992); Ponce & Micol, "PCR
amplification of long DNA fragments" Nucleic Acids Research, 20:623 (1992)).
Existing genome sequencing methods employ the breaking down of a very
long genomic DNA into many small fragments, sub-cloning them, sequencing them,
and then assembling the sequence of the long DNA. Typically, a genomic DNA is
broken down and cloned into overlapping fragments of approx. one million
nucleotides in "YAC" (Yeast Artificial Chromosome) clones, each YAC clone is
again fragmented and sub-cloned into overlapping fragments of 25,000
nucleotides
in "cosmid" clones, and each cosmid clone in turn sub-cloned into overlapping
fragments of 1000 nucleotides in "M13 phage" or "plasmid" clones. These are
sequenced randomly to assemble the larger sequences in the hierarchy. The
present
invention circumvents the need for breaking down and sub-cloning steps, making
it
greatly advantageous for contiguously sequencing long genomic DNA.
1.3.4.5 The 2nd Partly Fixed Primer Enabling Sequencing
Extending the above invention, another invention is presented here. This
extended invention would enable the sequencing of a 500-nucleotide-long sequence
somewhere within a given long DNA with no prior information of any sequence at
all
within the long DNA. The probability that any specific primer of length 10
nucleotides would occur somewhere in a DNA of about one million nucleotides is
approximately 1. The probability that any primer of length 15 nucleotides
occurs
somewhere in a genome of about one billion nucleotides is approximately 1.
Thus,
use of any exact primer of about 15 nucleotide sequence on a genomic DNA in
the
present invention as the first primer, and the use of the second partly fixed
primer will
enable the sequencing of the DNA sequence bracketed by the two primers
somewhere
in the genome. Thus, this procedure can be used to obtain an exact
sequence of about
500 characters somewhere from a genome without the prior knowledge of any of
its
sequence at all. Thus, by using many different primers with arbitrary but
exact
sequences, one can obtain many 500-nucleotide sequences at random locations
within a genome. Using these sequences as the starting points for contiguous
genome
sequencing in the present invention, the whole genomic sequence can be closed
and
completed. Thus an advantage of the present invention is that without any
prior
knowledge of any sequence in a genome, the whole sequence of a genome can be
obtained.
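The occurrence figures for 10- and 15-nucleotide primers can be checked with
a back-of-the-envelope calculation. The sketch assumes a random sequence with
equal base frequencies, counts one strand only, and uses a Poisson
approximation; real genomes depart from all three assumptions, and the
function names are hypothetical.

```python
import math


def expected_hits(k, genome_len):
    """Expected occurrences of one specific k-mer on one strand of a
    random sequence with equal base frequencies."""
    return genome_len / 4 ** k


def p_at_least_one(k, genome_len):
    """Poisson approximation to the chance of one or more occurrences."""
    return 1 - math.exp(-expected_hits(k, genome_len))


# The two cases discussed in the text.
hits_10mer = expected_hits(10, 1_000_000)
hits_15mer = expected_hits(15, 1_000_000_000)
chance_10mer = p_at_least_one(10, 1_000_000)
```

Under this model the expected number of occurrences is close to 1 in both
cases, while the chance of at least one occurrence is somewhat below 1, so a
given arbitrary primer is not guaranteed to find a site.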
It must be noted that not every 15-nucleotide arbitrary primer will have a
complementary sequence in a genome (one billion nucleotides long).
However, most often it would be present and would be useful in performing the
above-mentioned sequencing. In some cases, there may be more than one
occurrence of the primer sequence in the genome, and so it may not be useful
in obtaining the
sequence. However, the frequency of successful single-hits can be extremely
high
(~90%) and can be further refined by using an appropriate length of the
arbitrary
primer. For genomes (or long DNAs) that are shorter than a billion
nucleotides,
shorter exact sequences in the first primers (say 10 characters) could be
used, and the
rest could be random or "degenerate" nucleotides. While this primer will still
bind at
the sequence complementary to the exact sequence, the longer primer will aid
in
avoiding non-specific DNA amplification. The length of the first primer can
thus be
increased using degenerate nucleotides at the ends to a desired extent,
without
affecting any specificity. Once a sequence is known in an unknown genomic DNA,
then the present method can be performed to extend a contiguous sequence in
both
directions of the DNA from this starting point.
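The dependence of primer usefulness on primer length and genome size can be illustrated with a rough calculation. The sketch below assumes an idealized genome of independent, equally frequent bases and treats the number of exact primer sites on one strand as Poisson distributed; both are simplifying assumptions that are not part of the method described above, and the function names are hypothetical. It is not intended to reproduce the single-hit frequency quoted above.

```python
import math


def expected_primer_sites(genome_len: int, primer_len: int) -> float:
    """Expected number of exact matches on one strand of a random genome."""
    positions = genome_len - primer_len + 1
    return positions / 4 ** primer_len


def poisson_prob(mean: float, count: int) -> float:
    """Poisson probability of observing exactly `count` sites."""
    return math.exp(-mean) * mean ** count / math.factorial(count)


# A 15-nucleotide exact primer in a genome of one billion nucleotides.
mean_15 = expected_primer_sites(10**9, 15)
p_absent = poisson_prob(mean_15, 0)       # primer has no site at all
p_single = poisson_prob(mean_15, 1)       # exactly one site
p_multiple = 1.0 - p_absent - p_single    # more than one site

# A 10-nucleotide exact core in a much shorter DNA of one million nucleotides.
mean_10 = expected_primer_sites(10**6, 10)
```

Under these assumptions the expected number of sites stays close to one in both cases, which is consistent with the suggestion above to shorten the exact portion of the primer for shorter genomes.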
The present invention can also be useful to amplify the DNA between the first
primer and the partly fixed second primer, with an aim to using this amplified
DNA
for purposes other than DNA sequencing, such as cloning. Although there would
be
sufficient quantity of the target specific amplified DNA in the reaction
products, the
reaction products will, however, contain the population of non-specific DNA
amplified between the non-specifically occurring second primer binding sites
on
opposite strands. However, by introducing a purification step from this
reaction
mixture, such as using an immobilized column containing only the first primer,
the
amplified target DNA can be purified and used for any other purposes.
1.3.5 Sequencing large fragments of DNA (end-sequencing-based method of
subclone pathway generation through the fragment with efficient transposon-
based sequencing of the identified subclones)
The invention also provides a systematic and efficient way to sequence large
fragments of DNA, in particular genomic DNA. It combines an end-sequencing-
based
method of subclone pathway generation through the fragment with efficient
transposon-based sequencing of the identified subclones.
Thus, in one aspect, the invention is directed to a method to sequence a
fragment of DNA, said fragment typically having a length of more than about 30
kb.
The method comprises the following steps.
First, the fragment is provided in a host cloning vector capable of
accommodating it. The size of the fragment that can be sequenced will depend
on the
nature of the host cloning vector. Cloning vectors are available that can
accommodate
large fragments of DNA; even the approximately 30-40 kb fragments that are
suitable
for insertion into cosmids are of sufficient length that the method of the
invention is
usefully applicable to them.
A composition comprising said vector containing the inserted fragment is then
randomly sheared, such as by sonication, to obtain subfragments of
approximately 3
kb. The length of the subfragments is appropriate to the transposon-mediated
directed
sequencing method that will ultimately be applied. The 3 kb length is an
approximation; it is intended only as an order of magnitude. Generally
speaking,
subfragments of 2-5 kb are susceptible to this approach.
The subfragments are then inserted into host cloning vectors to obtain a
library
of subclones. These host cloning vectors are ideally of minimal size,
containing only a
selectable marker, an origin of replication, and appropriate insertion sites
for the
subfragments. The desirability of minimizing the available plasmid DNA in the
performance of transposon-mediated sequencing is described by Strathmann, et
al.
(supra).
Sufficient subclones that contain subfragments derived from the original
fragment are then recovered to provide 1x coverage of the fragment when the
end of
each subfragment is sequenced. A stretch of about 400-450 bases can be
sequenced
with assurance using available automated sequencing techniques. Thus, the
sequencing can be conducted using the sequencing primers based on the vector
sequences adjacent the inserts to proceed into the insert to approximately
this
distance. For 1x coverage of the original fragment, the number of subclones
required can be calculated by dividing the length of the original fragment by
the
intended sequencing distance, i.e., by approximately 400-450.
There should also be sufficient subclones in the library so that when the
complete sequence of each is determined, the coverage of the original fragment
will
be about 7-8x. This provides, as described below, a high probability that
every
nucleotide present in the fragment will be present in the library. This number
can, of
course, be determined by multiplying the length of the fragment by 7 or 8 and
dividing by the length of the subfragments generated.
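The two library-size calculations described above can be written out as a short sketch. The 40 kb fragment, the 425-base read length (the midpoint of 400-450) and the 3 kb subfragment length are example values taken from the ranges in the text; the function names are hypothetical. The last function uses the standard Poisson approximation for random shotgun libraries, which is an added assumption rather than something stated in this description.

```python
import math


def subclones_for_end_coverage(fragment_len: int, read_len: int = 425) -> int:
    """Subclones needed so that one end read per subclone gives 1x coverage."""
    return math.ceil(fragment_len / read_len)


def subclones_for_full_coverage(fragment_len: int, fold: float,
                                subfragment_len: int = 3000) -> int:
    """Subclones needed so that their complete sequences give `fold` coverage."""
    return math.ceil(fold * fragment_len / subfragment_len)


def fraction_missed(fold: float) -> float:
    """Poisson estimate of the fraction of nucleotides absent from the library."""
    return math.exp(-fold)


cosmid_insert = 40_000  # example fragment length in base pairs
end_reads = subclones_for_end_coverage(cosmid_insert)
library_7x = subclones_for_full_coverage(cosmid_insert, 7)
library_8x = subclones_for_full_coverage(cosmid_insert, 8)
missed_7x = fraction_missed(7)
```

For this example the two requirements call for a similar number of subclones (about one hundred), and under the Poisson assumption fewer than one nucleotide in a thousand would be missing at 7x coverage.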
It is preferable to assure that all of the subclones in the library contain
pieces
of the original fragment. This can be done by recovering only those subclones
that
hybridize to the fragments.
A sufficient portion of one of the ends of each recovered subclone containing
fragment-derived DNA is then sequenced and this sequence information is placed
into
a searchable database. The database is searched for subclones that contain
subfragments with nucleotide sequences matching those that characterize the
host
vector that accommodated the original fragment. To the extent that these
subfragments also contain sequence from the original fragment, that sequence
must be
at one or the other end of the original fragment. This illustrates why the
efficiency of
the method is improved by introducing a prescreening step which eliminates any
subclones which do not contain portions of the original fragment. If the
prescreening
has been done, these subclones contain oligonucleotide sequence from either
end of
the original fragment. The identified subclones are recovered.
1.3.5.1 "Second End" Sequence
A partial sequence of each of the identified subclones is determined from the
opposite end of the subfragment insert from that originally placed in the
database.
This provides "second end" sequence information concerning sequence further
removed from the end of the original fragment. This information is then used
to
search the database in order to identify subclones containing nucleotide
sequence that
matches this second end sequence. Such subclones are likely to represent
regions of
the original fragment that are farther removed from the ends and provide
further
progress in constructing a path across the fragment. These subclones are
recovered as
well, and sequenced from the end opposite to that which was sequenced to
provide the
information for the database and this new information, in turn, used to search
the
database for a matching sequence. The steps of second end sequencing,
searching the
database with the resulting sequence information, and recovery of subclones
which
contain a match are repeated sequentially until subclones have been identified
that
represent the complete original fragment. The resulting collection of
subclones
consists of an ordered minimum set that collectively represent the original
fragment.
The appropriate sequence of such subclones to span the original fragment from
end to
end is also known.
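The repeated cycle of second-end sequencing and database searching can be sketched as a simple walk. The example below is a toy model under stated assumptions: all inserts are in the same orientation, reads are error-free, a "match" means sharing an exact run of k bases, and the first matching subclone is always taken. The function names, the vector tag sequence, the read length and the subclone coordinates are all hypothetical and serve only to illustrate the search loop.

```python
import random


def shares_kmer(a: str, b: str, k: int) -> bool:
    """True if the two reads have at least one exact k-base run in common."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def walk_subclones(inserts: dict, vector_tag: str,
                   read_len: int = 20, k: int = 12) -> list:
    """Order subclones from one end of the fragment by second-end walking."""
    # Database of first-end reads, one per subclone.
    first_end_db = {name: seq[:read_len] for name, seq in inserts.items()}
    # Start from a subclone whose first-end read contains host vector sequence.
    start = next(n for n, read in first_end_db.items() if vector_tag in read)
    path = [start]
    while True:
        # Sequence the opposite end of the most recently recovered subclone.
        second_end = inserts[path[-1]][-read_len:]
        hits = [n for n, read in first_end_db.items()
                if n not in path and shares_kmer(second_end, read, k)]
        if not hits:
            return path
        path.append(hits[0])


rng = random.Random(0)
vector_tag = "GGGGCCCCGGGG"  # stands in for host vector sequence at the junction
fragment = vector_tag + "".join(rng.choice("ACGT") for _ in range(288))
spans = {"s1": (0, 80), "s2": (65, 150), "s3": (135, 220), "s4": (205, 300)}
inserts = {name: fragment[a:b] for name, (a, b) in spans.items()}
path = walk_subclones(inserts, vector_tag)
```

A practical implementation would also need to handle both insert orientations, sequencing errors, and the choice among several matching subclones so as to arrive at a minimum set.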
It remains only to obtain sufficient portions of the complete nucleotide
sequence of each subclone from the subclone collection using transposon-
mediated
sequencing to provide the complete sequence of the original fragment.
In another aspect, the invention is directed to kits suitable for conducting
the
method of the invention.
1.3.6 Improvements in high speed, high throughput, no required electrophoresis
(and, thus, no gel reading artifacts due to the complete absence of an
electrophoretic step)
The invention also describes a new method to sequence DNA. The
improvements over the existing DNA sequencing technologies include high speed,
high throughput, no required electrophoresis (and, thus, no gel reading
artifacts due to
the complete absence of an electrophoretic step), and no costly reagents
involving
various substitutions with stable isotopes. The invention utilizes the Sanger
sequencing strategy and assembles the sequence information by analysis of the
nested
fragments obtained by base-specific chain termination via their different
molecular
masses using mass spectrometry, for example, MALDI or ES mass spectrometry. A
further increase in throughput can be obtained by introducing mass
modifications in
the oligonucleotide primer, the chain-terminating nucleoside triphosphates
and/or the
chain-elongating nucleoside triphosphates, as well as using integrated tag
sequences
which allow multiplexing by hybridization of tag specific probes with mass
differentiated molecular weights.
1.3.7 A method and a system for sequencing a genome
The present invention pertains to a method for sequencing genomes. The
method comprises the steps of obtaining nucleic acid material from a genome.
Then
there is the step of constructing a clone library and one or more probe
libraries from
the nucleic acid material. Next there is the step of comparing the libraries
to form
comparisons. Then there is the step of combining the comparisons to construct
a map
of the clones relative to the genome. Next there is the step of determining
the
sequence of the genome by means of the map.
The present invention pertains to a system for sequencing a genome. The
system comprises a mechanism for obtaining nucleic acid material from a
genome.
The system also comprises a mechanism for constructing a clone library and one
or
more probe libraries. The constructing mechanism is in communication with the
nucleic acid material from a genome. Additionally, the system comprises a
mechanism for comparing said libraries to form comparisons. The comparing
mechanism is in communication with the said libraries. The system also
comprises a
mechanism for combining the comparisons to construct a map of the clones
relative to
the genome. The said combining mechanism is in communication with the
comparisons. Further, the system comprises a mechanism for determining the
sequence of the genome by means of said map. The said determining mechanism is
in
communication with said map.
1.3.7.1 A method for producing a gene of a genome
The present invention additionally pertains to a method for producing a gene
of a genome. The method comprises the steps of obtaining nucleic acid material
from
a genome. Then there is the step of constructing libraries from the nucleic
acid
material. Next there is the step of comparing the libraries to form
comparisons. Then
there is the step of combining the comparisons to construct a map of the
clones
relative to the genome. Next there is the step of localizing a gene on the
map. Then
there is the step of cloning the gene from the map.
1.3.8 Methods and means for the massively parallel characterization of complex
molecules and of molecular recognition phenomena with parallelism and
redundancy attained through single molecule examination methods
In another embodiment, the present invention approaches the vastness of
biological complexity through massive parallelism, which may conveniently be
attained through various single molecule examination (SME) methods variously
referred to heretofore as single molecule detection (SMD), single molecule
visualization (SMV) and single molecule spectroscopy (SMS) techniques.
Used within appropriate procedures, single molecule examination methods can
enable molecular parallelism.
Molecular parallelism may be applied to the examination of the composition
of complex molecules (including co-polymers of natural or of synthetic origin)
or to
determinations of interactions between large numbers of molecules. The former
case
may be applied to genome-scale sequencing methods. The latter case may be
applied
to rapid determination of molecular complementarity, with applications in
(biological
or non-biological) affinity characterization, immunological study, clinical
pathology,
molecular evolution (e.g. in vitro evolution), and the construction of a
cybernetic
immune system as well as prostheses based thereupon. In both cases, molecular
recognition phenomena are observed with molecular parallelism.
Note that within said affinity characterization applications, the kinetics of
both binding association and dissociation, and binding equilibria, may be
examined.
Kinetics may be examined by observing the rates of occupation of appropriate
sites or
diverse populations thereof by some homogenous or heterogeneous sample, and
the
rates of vacancy formation from occupied sites. Equilibrium constants may be
determined by observing the proportion (number of occupied sites divided by
number
of total sites) of sites occupied under equilibrium conditions, with greater
quantitative
confidence yielded by, for example, examining more binding sites.
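The occupancy-based estimate described above can be made concrete with a short sketch. The fractional occupancy and its standard error follow directly from counting sites; the conversion to a dissociation constant assumes simple 1:1 binding at equilibrium with a known free ligand concentration, which is an added assumption and not a requirement stated in the text. The function names and the example counts are hypothetical.

```python
import math


def fractional_occupancy(occupied: int, total: int) -> float:
    """Number of occupied sites divided by the total number of sites."""
    return occupied / total


def standard_error(fraction: float, total: int) -> float:
    """Binomial standard error of an occupancy estimated from `total` sites."""
    return math.sqrt(fraction * (1.0 - fraction) / total)


def dissociation_constant(fraction: float, ligand_conc: float) -> float:
    """Kd under assumed 1:1 binding, where fraction = L / (Kd + L)."""
    return ligand_conc * (1.0 - fraction) / fraction


theta = fractional_occupancy(250, 1000)
kd = dissociation_constant(theta, ligand_conc=1e-9)  # ligand at 1 nM (example)
se_small = standard_error(theta, 1000)
se_large = standard_error(theta, 100_000)
```

The standard error shrinks with the square root of the number of sites examined, which is one way to express the greater quantitative confidence obtained by examining more binding sites.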
Sequencing of polynucleotide molecules may be effected by the (preferably
end-wise) immobilization of a library of such molecules to a surface at a
density
convenient for detection, which will vary according to the detection
methodology
availed. Several methods capable of effecting such immobilization will be
obvious to
those skilled in the arts of recombinant DNA technology and molecular biology,
among others. Priming, which may be random or non-random, is effected by any
of a
variety of methods, most of which are obvious to those skilled in the relevant
arts.
Genome sequencing applications availing of enzymatic polymerizations, and
corresponding embodiments of the present invention, rely upon control over
polymerization rate and nucleotide incorporation specificity, consistent with
the well-
known Watson-Crick base pairing rules which may be enforced (upon single
nucleotides in a processive manner, as conditions permit) by the use of DNA
polymerases or analogs thereof, in combination with repeatable single molecule
detection applied to a large population of diverse molecules. A sequencing
cycle
comprises the steps of: (1.) polymerizing one or fewer nucleotides, which carry
some
removable or neutralizable molecular label and may optionally be reversibly 3'
protected (or otherwise protected in any manner which modulates polymerization
rate),
onto each sample molecule at the primer or at subsequent extensions thereof
and in
opposition to (and pairing with) a single, unique, base of the template
polynucleotide
strand; (2.) optionally washing away any unreacted labeled nucleotides; (3.)
detecting,
by either direct or indirect methods, said labeled nucleotides incorporated
into said
sample molecules, in a manner which repeatably associates information obtained
about the type of label observed with the unique identity of the template
molecule
under observation, which may be uniquely distinguished by a variety of methods
(which include: a mappable location of immobilization of the sample template
molecule on a substrate surface; a mappable location of immobilization of the
sample
template molecule within some matrix volume element; microscopic labeling with
some readily identifiable, e.g. combinatorially or permutationally diverse and
readily
examined particle or molecule or group of molecules and detection of the thus
marked
identity of individual free molecules in solution; and, scanning of a liquid
sample

may serve to modulate monomer addition rate to the strand being copied from
the
template molecule) from the nucleotide added during the present cycle, if
these are
distinct from any cleavably linked labeling moieties; (6.) optionally checking
that the
removal or neutralization of said label in step (4) was successful for any
particular
molecule of the sample, by repeating a similar detection procedure. Said
sequencing
cycle comprising an appropriate subset of steps 1-6 may be repeated as many
times as
convenient, but must be repeated a sufficient number of times to obtain
sequence
information of sufficient complexity from each individual molecule to permit
unambiguous alignment of all such sequence information determined for all of
the
molecules of the sample. This minimum number of cycles will be approximately
related to the complexity C of the sample to be treated as part of the same
macroscopic reaction (i.e. a macroscopic sample preparation subjected to
unitary
macroscopic manipulations) by the formula C < 4^n, where n is the number of
cycles.
Beyond this minimum, there are tradeoffs between the number of cycles to be
performed and the number of molecules to be examined, and the confidence for
sequence data obtained.
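The relation C < 4^n gives the smallest number of cycles directly. The sketch below simply finds the smallest integer n for which 4^n exceeds a given complexity; the function name and the example complexities are illustrative choices, not values from this description.

```python
def min_cycles(complexity: int) -> int:
    """Smallest n such that complexity < 4**n."""
    n = 0
    while 4 ** n <= complexity:
        n += 1
    return n


cycles_1e9 = min_cycles(10**9)       # sample complexity of one billion
cycles_3e9 = min_cycles(3 * 10**9)   # sample complexity of three billion
```

Because 4^15 is slightly more than one billion, fifteen cycles is the lower bound for a complexity of one billion, and sixteen for three billion; as noted above, more cycles or more molecules may be preferred for confidence.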
Note that unused reagents and enzymes may be recovered from washes and
recycled.
1.3.8.1 Advantages of Parallelism
In contrast to the previously disclosed base-addition sequencing schemes, the
sequence determination applications of the present invention enjoy
substantial
advantages deriving from sample manipulation in the single-molecule-regime.
Working instead in the distinct single-molecule-regime rather than with
populations
of identical molecules provides substantial advantages of parallelism,
facility of use
and implementation (including automated implementation), and operability.
Among
these are unanticipated advantages: (1) because a single molecule is
necessarily
monodisperse, failure of a molecule to undergo addition in a cycle does not
cause a
loss of sample monodispersion (i.e. lead to uneven sample molecules dispersity
or
polydispersion); such addition failure is unproblematic when single molecules
are
examined individually because it is readily detected and accounted for in data
analysis; in contrast, samples comprising multiple identical molecules may
thus take
on non-identical lengths, complicating data collection and analysis; (2)
samples
comprising a plurality of individually distinct single molecules (species) may
be
handled unitarily without requiring any handling measures to keep distinct
molecules
apart, providing a large reduction in manipulations required on a per-species
basis and
not requiring the use of many separate, parallel fluid handling steps or
means; (3)
inadvertent multiple base additions are more readily detected and their extent
is more
readily quantified because these changes in quantity are large compared to the
signal
expected from the incorporation of a single base (i.e. single label) into a
single
molecular species; (4) deprotection or delabeling failures may also be readily
detected
and noted for the correct single molecule, such that addition failure, the
presence of a
label, or overlabeling in the subsequent cycle may be correctly interpreted
(according
to the unlabeling and single stepping methods used in a particular
embodiment.)
These advantages are expected to be important in the competitiveness of these
present
methods over conventional polynucleotide sequencing methods.
Various techniques are included to address any non-idealities encountered
which may arise because of deviations from conventional polymerization or
detection
methods. These generally take the form of different types of redundancy, which
may
be employed to either prevent or resolve any such errors. Prominent among
these
redundancies is oversampling, i.e. the examination of some multiple (j) of the
number
(m) of sample molecules suggested by combinatoric computations to be minimally
sufficient for full alignment of data from a sample of a given complexity.
Such oversampling redundancy will increase the confidence interval for
accuracy of collected data and reduce the likelihood of artifacts arising from
sequence
duplications which may occur in any given sample.
1.3.8.2 Oversampling Redundancy
Oversampling redundancy may be availed to increase data confidence by
providing the opportunity to score and match multiple occurrences of the same
sequence segment and thus detect and eliminate erroneous sequence segment
information by virtue of its less frequent occurrence. Erroneous sequence
segment
information may arise, for instance, by nucleotide incorporation errors which
are an
inevitable feature of polymerization with polymerases having a characteristic
fidelity,
i.e. displaying a characteristic nucleotide misincorporation rate. Such
methods will be
particularly useful where polynucleotide polymerase fidelity would otherwise
be
unacceptably low. It should be noted that an error rate of one percent or more
has
been deemed conventionally acceptable for genome informatic purposes.
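The use of redundancy to discard less frequent, erroneous observations can be illustrated with a per-position majority vote. This is a minimal sketch that assumes the redundant reads have already been aligned and trimmed to equal length; the function name and the example reads are hypothetical, and a real pipeline would need to handle alignment, ties and unequal lengths.

```python
from collections import Counter


def consensus(aligned_reads: list) -> str:
    """Most frequent base at each position across equal-length aligned reads."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_reads)
    )


# Five observations of the same segment; two carry one misincorporation each.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACGTTC"]
segment = consensus(reads)
```

With five-fold oversampling, a single misincorporation at any one position is outvoted by the four correct observations.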
1.3.8.3 Controls/Data
Further, known molecules having sequences that are highly unrelated to the
sample may be included as internal controls to monitor the efficiency and
accuracy of
a particular sequence collection process; such internal control sequences will
present
negligibly small overhead because molecular parallelism may easily accommodate
any such comparatively small increase in sample complexity, even though it
might be
considered large with respect to pre-existing methods.
After raw data have been collected for each molecule, these are all mutually
compared by some appropriate matching algorithm and aligned so as to
reconstruct
the full sequence of the sample. The computational complexity of completing
such an
alignment may be estimated as the multi-phase comparison and sorting of
(j)(m)
strings each of length n.
Alternatively, data alignment may be performed in tandem or parallel with
later cycles and may be monitored by appropriate computational algorithms for
data
quality and confidence of sequence information, and cycling may continue till
desired
criteria are satisfied. Computer, microprocessor, electronic or other
automated control
of instrumentation, including fluidics and robotics for the manipulation of
samples,
and the automated effectuation of the various methods of the present
invention, all
according to parameterized algorithms, may be accomplished by means obvious
from
the present disclosure to those skilled in the relevant arts (e.g. fluidics,
robotics,
electronics, microelectronics, computer science and engineering, and
mechanical
engineering). Concurrent data alignment and monitoring will permit
modifications of
the sequencing cycle described above, such as dynamic adjustment of
polymerization
reaction conditions and durations, label removal or neutralization procedure
parameters, polymerization deprotection conditions, and any other desired
parameter,
so as to permit optimization of procedures and results.
With appropriately flexible design, automated systems and instruments such as
those described above for genome applications may readily be adapted, with
appropriate changes in samples and labeling methods and reagents, to
cybernetic
molecular evolution, cybernetic immune system, broad spectrum pathogen
characterization and other applications of the present invention.
1.3.8.4 Double/Single Stranded Polynucleotide Sequencing Method
According to the embodiment availed, double or single stranded
polynucleotides may be examined. Where single stranded polynucleotide
molecules
are preferred, second strands may be removed by performing said immobilization
so
as to involve only one strand in covalent linkage with said surface and
then
performing a denaturation of the sample with washing. Priming means required
by
any particular enzyme must then be provided, usually by hybridization of a
complementary oligo- or polynucleotide to the sample template molecules,
though
other means are possible. Other methods which will be obvious to those skilled
in the
arts of recombinant DNA technology may also be employed to yield immobilized
or
otherwise uniquely identifiable single stranded polynucleotide samples.
Where double stranded molecules are preferred, said second strands may be
treated with an appropriate exonuclease under appropriate conditions and for
an
appropriate length of time to provide a good distribution of lengths of said
second
strands such that the termini of the undegraded portions of said second
strands
provide convenient priming for enzymatic nucleotide polymerization (i.e. DNA
directed DNA synthesis or DNA replication, DNA directed RNA synthesis or
transcription, RNA directed DNA synthesis or reverse transcription, or RNA
directed
RNA synthesis or RNA replication).
Note that the polynucleotide sequencing methods of the present invention
represent the converse of conventional enzymatic and chemical sequencing
methods
in that those conventional methods rely upon the production of multiple
homogeneous
sub-populations of DNA molecules which together comprise a nested set, and the
detection of each such sub-population (with deviant chain terminator
misincorporation molecules arising with significantly lower frequency and thus
constituting a poorly detected population), while the present invention relies
on
alignment of information from a highly inhomogeneous population of molecules and
repeatable detection of single molecules. Further note that by previous
methods, each
species yields information about only one base at one position within the
sample
sequence, while with the methods of the present invention, each individual
sample
template molecule may yield information about the identity of several bases.
Note
also that under conventional methods, some effort has been expended in
increasing
the number of bases yielding information per sample, i.e. lengthening the
linear
sequence information obtained from any one segment of a sample, which is
substantially frustrated by the inherent limitations of electrophoretic
separation and
particularly gel electrophoresis, while the present invention readily
increases the
information yielded per unitary manipulation through increases in the facility
and
practicable extent of parallelism.
There are several levels of parallelism and pipelining possible with the
methods of the present invention. An arbitrarily large number of molecules may
be
subjected to any given manipulation at once if they are part of the same
unitary
sample. Detection will have constraints entailed by the particular
instrumentation and
method used, but many degrees of freedom exist with regard to means of
providing
parallelism in detection instrumentation (e.g. multiple microscopy instruments
or
appropriately arranged objective lenses and controlled light paths for light
microscopic based detection, multiple optoelectronic device arrays [e.g. CCDs
or
SLMs] for the respective types of detection; multiple probes [i.e. in arrays
with
parallel detection provided] for scanning probe microscopic detection methods
with
various degrees of freedom with respect to each other during scanning, etc.).
Means for
pipelining the steps of the methods disclosed herein will be readily apparent
when one
considers that dedicated instrumentation or robotics may perform each relevant
step,
and that the ensemble of such instrumentation may readily be integrated to
form a
coordinated system, for example matching throughput at different stages by
adjusting
the parallelism of appropriate stages. Thus economy, throughput and data
accuracy
are tradeoffs, but may individually vastly exceed any such measures attainable
with
conventional methods.
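Matching throughput between pipelined stages by adjusting their parallelism can be sketched as a simple sizing calculation. The stage names follow the steps of the sequencing cycle, but the per-sample durations and the target rate are invented example numbers, and the function name is hypothetical; the sketch only shows the arithmetic of the balancing idea.

```python
import math


def units_per_stage(stage_seconds: dict, target_per_hour: float) -> dict:
    """Parallel units each stage needs to sustain the target sample rate."""
    return {
        stage: math.ceil(target_per_hour * seconds / 3600)
        for stage, seconds in stage_seconds.items()
    }


# Hypothetical time (seconds) one unit spends on one unitary sample per stage.
stage_seconds = {"polymerize": 120, "wash": 30, "detect": 600, "delabel": 90}
plan = units_per_stage(stage_seconds, target_per_hour=60)
```

In this example detection is the slowest stage, so it receives the most parallel units, while a fast stage such as washing needs only one.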
1.4 EXEMPLARY GENOMIC CHARACTERIZATION METHODS
1.4.1 Employing Mass Spectrometry To Analyze The Sanger Sequencing
Reaction Mixtures
In one embodiment, this invention describes an improved method of
sequencing DNA. In particular, this invention employs mass spectrometry to
analyze
the Sanger sequencing reaction mixtures.
In Sanger sequencing, four families of chain-terminated fragments are
obtained. The mass difference per nucleotide addition is 289.19 for dpC,
313.21 for
dpA, 329.21 for dpG and 304.2 for dpT, respectively.
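The mass differences listed above are enough to read a sequence from a ladder of fragment masses. The sketch below is an idealized illustration: it assumes a complete, sorted ladder with one peak per chain length, ignores adducts and calibration error, and uses a hypothetical tolerance and starting mass. The function and variable names are not taken from this description.

```python
# Mass added per nucleotide, using the values given in the text.
BASE_MASS = {"C": 289.19, "A": 313.21, "G": 329.21, "T": 304.20}


def read_ladder(peak_masses: list, tolerance: float = 2.0) -> str:
    """Assign a base to each step between consecutive fragment masses."""
    peaks = sorted(peak_masses)
    bases = []
    for lighter, heavier in zip(peaks, peaks[1:]):
        step = heavier - lighter
        # Pick the nucleotide whose mass is closest to the observed step.
        base = min(BASE_MASS, key=lambda b: abs(step - BASE_MASS[b]))
        if abs(step - BASE_MASS[base]) > tolerance:
            raise ValueError(f"unassignable mass step: {step:.2f}")
        bases.append(base)
    return "".join(bases)


# Build a toy ladder from an arbitrary starting mass, then read it back.
ladder = [5000.0]
for base in "GATTACA":
    ladder.append(ladder[-1] + BASE_MASS[base])
sequence = read_ladder(ladder)
```

The closest pair of values, A and T, differ by about 9 mass units, so the tolerance has to stay well below half of that for the assignment to be unambiguous.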
1.4.1.1 Mass Modified
In one embodiment, through the separate determination of the molecular
weights of the four base-specifically terminated fragment families, the DNA
sequence
can be assigned via superposition (e.g., interpolation) of the molecular
weight peaks
of the four individual experiments. In another embodiment, the molecular
weights of
the four specifically terminated fragment families can be determined
simultaneously
by MS, either by mixing the products of all four reactions run in at least two
separate
reaction vessels (i.e., all run separately, or two together, or three
together) or by
running one reaction having all four chain-terminating nucleotides (e.g., a
reaction
mixture comprising dTTP, ddTTP, dATP, ddATP, dCTP, ddCTP, dGTP, ddGTP) in
one reaction vessel. By simultaneously analyzing all four base-specifically
terminated
reaction products, the molecular weight values have been, in effect,
interpolated.
Comparison of the mass difference measured between fragments with the known
masses of each chain-terminating nucleotide allows the assignment of sequence
to be
carried out. In some instances, it may be desirable to mass modify, as
discussed
below, the chain-terminating nucleotides so as to expand the difference in
molecular
weight between each nucleotide. It will be apparent to those skilled in the
art when
mass-modification of the chain-terminating nucleotides is desirable and can
depend,
for instance, on the resolving ability of the particular spectrometer
employed. By way
of example, it may be desirable to produce four chain-terminating
nucleotides, ddTTP, ddCTP^1, ddATP^2 and ddGTP^3, where ddCTP^1, ddATP^2 and
ddGTP^3 have each been mass-modified so as to have molecular weights resolvable from
one
another by the particular spectrometer being used.
The terms chain-elongating nucleotides and chain-terminating nucleotides are
well known in the art. For DNA, chain-elongating nucleotides include 2'-
deoxyribonucleotides and chain-terminating nucleotides include 2', 3'-
dideoxyribonucleotides. For RNA, chain-elongating nucleotides include
ribonucleotides and chain-terminating nucleotides include 3'-
deoxyribonucleotides.
The term nucleotide is also well known in the art. For the purposes of this
invention,
nucleotides include nucleoside mono-, di-, and triphosphates. Nucleotides also
include
modified nucleotides such as phosphorothioate nucleotides.
Since mass spectrometry is a serial method, in contrast to currently used slab
gel electrophoresis which allows several samples to be processed in parallel,
in
another embodiment of this invention, a further improvement can be achieved by
multiplex mass spectrometric DNA sequencing to allow simultaneous sequencing
of
more than one DNA or RNA fragment. As described in more detail below, the
range
of about 300 mass units between one nucleotide addition can be utilized by
employing
either mass modified nucleic acid sequencing primers or chain-elongating
and/or
terminating nucleoside triphosphates so as to shift the molecular weight of
the base-
specifically terminated fragments of a particular DNA or RNA species being
sequenced in a predetermined manner. For the first time, several sequencing
reactions
can be mass spectrometrically analyzed in parallel. In yet another embodiment
of this
invention, multiplex mass spectrometric DNA sequencing can be performed by
mass
modifying the fragment families through specific oligonucleotides (tag probes)
which
hybridize to specific tag sequences within each of the fragment families. In
another
embodiment, the tag probe can be covalently attached to the individual and
specific
tag sequence prior to mass spectrometry.
1.4.1.2 Mass Spectrometer Formats Used (MALDI, ES, ICR, Fourier
Transform)
Preferred mass spectrometer formats for use in the invention are matrix
assisted laser desorption ionization (MALDI), electrospray (ES), ion cyclotron
resonance (ICR) and Fourier Transform. For ES, the samples, dissolved in water
or in
a volatile buffer, are injected either continuously or discontinuously into an
atmospheric pressure ionization interface (API) and then mass analyzed by a
quadrupole. The generation of multiple ion peaks which can be obtained using
ES
mass spectrometry can increase the accuracy of the mass determination. Even
more
detailed information on the specific structure can be obtained using an MS/MS
quadrupole configuration. In MALDI mass spectrometry, various mass analyzers
can
be used, e.g., magnetic sector/magnetic deflection instruments in single or
triple
quadrupole mode (MS/MS), Fourier transform and time-of-flight (TOF)
configurations as is known in the art of mass spectrometry. For the
desorption/ionization process, numerous matrix/laser combinations can be used.
Ion-
trap and reflectron configurations can also be employed.
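By way of illustration only, the remark above that multiple ion peaks obtained in ES mass spectrometry can increase the accuracy of the mass determination may be sketched as follows. The sketch assumes negative-ion mode, with each ion formed by the loss of z protons, and assumes that the charge state of every peak has already been assigned. The function names are hypothetical and are not taken from this specification.

```python
from statistics import mean

PROTON_MASS = 1.007276  # Da; used to convert m/z values into neutral masses


def neutral_mass(mz, charge):
    """Neutral mass M for a negative-mode ion [M - zH]^z- observed at m/z."""
    # m/z = (M - z * proton) / z, hence M = z * (m/z + proton)
    return charge * (mz + PROTON_MASS)


def averaged_mass(peaks):
    """Average the neutral-mass estimates from several (m/z, charge) peaks."""
    return mean(neutral_mass(mz, z) for mz, z in peaks)


# Hypothetical example: three charge states of the same 6117.0 Da fragment.
true_mass = 6117.0
observed = [((true_mass - z * PROTON_MASS) / z, z) for z in (3, 4, 5)]
estimate = averaged_mass(observed)
```

Each charge state gives an independent estimate of the same molecular weight, so averaging the estimates tends to reduce the random error of any single peak.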
In one embodiment of the invention, the molecular weight values of at least
two base-specifically terminated fragments are determined concurrently using
mass
spectrometry. The molecular weight values of preferably at least five and more
preferably at least ten base-specifically terminated fragments are determined
by mass
spectrometry. Also included in the invention are determinations of the
molecular
weight values of at least 20 base-specifically terminated fragments and at
least 30
base-specifically terminated fragments. Further, the nested base-specifically
terminated fragments in a specific set can be purified of all reactants and by-
products
but are not separated from one another. The entire set of nested base-
specifically
terminated fragments is analyzed concurrently and the molecular weight values
are
determined. At least two base-specifically terminated fragments are analyzed
concurrently by mass spectrometry when the fragments are contained in the same
sample.
1.4.1.3 Process of Mass Spectrometric DNA Sequencing
In general, the overall mass spectrometric DNA sequencing process will start
with a library of small genomic fragments obtained after first randomly or
specifically
cutting the genomic DNA into large pieces which then, in several subcloning
steps,
are reduced in size and inserted into vectors like derivatives of M13 or pUC
(e.g.,
M13mp18 or M13mp19). In a different approach, the fragments inserted in
vectors,
such as M13, are obtained via subcloning starting with a cDNA library. In yet
another approach, the DNA fragments to be sequenced are generated by the
polymerase chain reaction (e.g., Higuchi et al., "A General Method of in vitro
Preparation and Mutagenesis of DNA Fragments: Study of Protein and DNA
Interactions," Nucleic Acids Res., 16, 7351-67 (1988)). As is known in the
art, Sanger
sequencing can start from one nucleic acid primer (UP) binding to the plus-
strand or
from another nucleic acid primer binding to the opposite minus-strand. Thus,
either
the complementary sequence of both strands of a given unknown DNA sequence can
be obtained (providing for reduction of ambiguity in the sequence
determination) or
the length of the sequence information obtainable from one clone can be
extended by
generating sequence information from both ends of the unknown vector-inserted
DNA fragment.
The nucleic acid primer carries, preferentially at the 5'-end, a linking
functionality, L, which can include a spacer of sufficient length and which
can
interact with a suitable functionality, L', on a solid support to form a
reversible
linkage such as a photocleavable bond. Since each of the four Sanger sequencing
families starts with a nucleic acid primer, each fragment family can be bound to the
solid support by reacting with functional groups, L', on the surface of the solid
support and then intensively washed to remove all buffer salts, triphosphates, enzymes,
reaction by-products, etc. Furthermore, for mass spectrometric analysis, it can be of
importance at this stage to exchange the cation at the phosphate backbone of the DNA
fragments in order to eliminate peak broadening due to heterogeneity in the cations
bound per nucleotide unit. Since the L-L' linkage is only of a temporary nature
with
the purpose to capture the nested Sanger DNA or RNA fragments to properly
condition them for mass spectrometric analysis, there are different
chemistries which
can serve this purpose. In addition to the examples given in which the nested
fragments are coupled covalently to the solid support, washed, and cleaved off
the
support for mass spectrometric analysis, the temporary linkage can be such
that it is
cleaved under the conditions of mass spectrometry, i.e., a photocleavable bond
such
as a charge transfer complex or a stable organic radical. Furthermore, the
linkage can
be formed with L' being a quaternary ammonium group. In this case, preferably, the
surface of the solid support carries negative charges which repel the negatively
charged nucleic acid backbone and thus facilitate desorption. Desorption will take
place by the heat created by the laser pulse and/or, depending on L', by specific
absorption of laser energy which is in resonance with the L' chromophore. The
functionalities, L and L', can also form a charge transfer complex and thereby form
the temporary L-L' linkage. Various examples of appropriate functionalities with
either acceptor or donor properties are depicted without limitation herein. Since in
many cases the "charge-transfer band" can be determined by UV/vis
spectrometry
(see e.g. Organic Charge Transfer Complexes by R. Foster, Academic Press,
1969),
the laser energy can be tuned to the corresponding energy of the charge-
transfer
wavelength and, thus, a specific desorption off the solid support can be
initiated.
Those skilled in the art will recognize that several combinations can serve
this
purpose and that the donor functionality can be either on the solid support or
coupled
to the nested Sanger DNA/RNA fragments or vice versa.
In yet another approach, the temporary linkage L-L' can be generated by
homolytically forming relatively stable radicals. As described herein, a
combination
of the approaches using charge-transfer complexes and stable organic radicals
is
shown. Here, the nested Sanger DNA/RNA fragments are captured via the
formation
of a charge transfer complex. Under the influence of the laser pulse,
desorption (as
discussed above) as well as ionization will take place at the radical
position. In other
examples described herein, under the influence of the laser pulse, the L-L'
linkage will
be cleaved and the nested Sanger DNA/RNA fragments desorbed and subsequently
ionized at the radical position formed. Those skilled in the art will
recognize that
other organic radicals can be selected and that, in relation to the
dissociation energies
needed to homolytically cleave the bond between them, a corresponding laser
wavelength can be selected (see e.g. Reactive Molecules by C. Wentrup, John
Wiley
& Sons, 1984). In yet another approach, the nested Sanger DNA/RNA fragments
are
captured via Watson-Crick base pairing to a solid support-bound
oligonucleotide
complementary to either the sequence of the nucleic acid primer or the tag
oligonucleotide sequence. The duplex formed will be cleaved under the
influence of
the laser pulse and desorption can be initiated. The solid support-bound base
sequence can be presented through natural oligoribo- or
oligodeoxyribonucleotide as
well as analogs (e.g. thio-modified phosphodiester or phosphotriester
backbone) or
employing oligonucleotide mimetics such as PNA analogs (see e.g. Nielsen et
al.,
Science, 254, 1497 (1991)) which render the base sequence less susceptible to
enzymatic degradation and hence increase overall stability of the solid
support-bound
capture base sequence. With appropriate bonds, L-L', a cleavage can be
obtained
directly with a laser tuned to the energy necessary for bond cleavage. Thus,
the
immobilized nested Sanger fragments can be directly ablated during mass
spectrometric analysis.
1.4.1.3.1 Conditioning
Prior to mass spectrometric analysis, it may be useful to "condition" nucleic
acid molecules, for example to decrease the laser energy required for
volatilization, to
minimize fragmentation or to otherwise increase the sensitivity of mass
spectrometric detection. For example, nucleic acids can be "conditioned" by
adding
positive or negative charges, i.e. charge tags (CTs). CTs increase the mass
spectrometer detection sensitivity by increasing the degree of ionization
during the
mass spectrometric (e.g. MALDI) process. A CT can be linked either to the
external 3'
or 5' position or internally e.g. at the 2' position or at the base, e.g. at C-5
in uracil, C-methyl group of thymine, C-5 at cytosine, at C' or C$ guanine, adenine and
hypoxanthine or at the phosphate ester moiety. Charge tags, CTs, can function as
molecules with permanent (i.e. pH-independent) ionization, such as:
(Me)3N+-CH2CH2-O-
or molecules which generate a positive charge upon MALDI and which are
stabilized by delocalization of the positive charge by mesomeric effects in
unsaturated
and/or aromatic systems such as:
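[Structural formula not reproduced: an aromatic system bearing the substituents R, R'
and R'', linked to the oligonucleotide (OLIGO) through X]
wherein R, R', R'' = H, OA1 (wherein A1 = e.g. lower alkyl: methyl, ethyl, propyl),
NO2, CN, CO2H, CO2 active ester, or halogen; and X = -O-, -NH-, -S-, C=O, or OCO,
in either the para or meta position.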
For example, the positive charge of a trityl cation is produced during MALDI
by the removal of a moiety such as: -OR, where R = a lower alkyl, or an anion
such as
ClO4-, SbF6-, BF4- and the like.
In an alternative scheme, the trityl group is used to anchor the
oligonucleotide
to a solid support via the tertiary carbon and this bond is cleaved during
mass
spectrometry (e.g. MALDI), leaving a positive charge on the desorbing and high
vacuum flying oligonucleotide.
[Reaction scheme not reproduced: the support-bound -CH2-O-C(trityl) linkage, carrying
the substituents R, R', R'' and the X-OLIGO group, is cleaved under MALDI.]
One of skill in the art can readily appreciate several variations to the
schemes
described above. In addition to employing the charge tag array alone, one of
skill in
the art can employ a charge tag array in conjunction with another conditioning
means.
Particularly preferred means to be used in conjunction with the CT include
treating
the phosphodiester bond with trialkylsilyl halides or the
phosphomonothiodiester
bond with alkyliodides to render the polyanionic backbone neutral.
1.4.1.3.1.1 Modification of Phosphodiester Backbone of Nucleic Acid Molecule
Another example of conditioning is modification of the phosphodiester
backbone of the nucleic acid molecule (e.g. cation exchange), which can be
useful for
eliminating peak broadening due to a heterogeneity in the cations bound per
nucleotide unit. In addition, when a nucleic acid molecule is contacted with an
alkylating agent such as alkyliodide, iodoacetamide, β-iodoethanol, or
2,3-epoxy-1-propanol, the monothio phosphodiester bonds of the nucleic acid molecule
can be transformed into phosphotriester bonds. Likewise, phosphodiester bonds may be
transformed to uncharged derivatives employing trialkylsilyl chlorides.
Further
conditioning involves incorporating nucleotides which reduce sensitivity for
depurination (fragmentation during MS) such as N7- or N9-deazapurine
nucleotides,
or RNA building blocks or using oligonucleotide triesters or incorporating
phosphorothioate functions which are alkylated or employing oligonucleotide
mimetics such as PNA.
Modification of the phosphodiester backbone can be accomplished by, for
example, using alpha-thio modified nucleotides for chain elongation and
termination.
With alkylating agents such as alkyliodides, iodoacetamide, β-iodoethanol, and
2,3-epoxy-1-propanol, the monothio phosphodiester bonds of the nested Sanger
fragments are transformed into phosphotriester bonds. Multiplexing by mass
modification in this case is obtained by mass-modifying the nucleic acid
primer (UP)
or the nucleoside triphosphates at the sugar or the base moiety. To those
skilled in the
art, other modifications of the nested Sanger fragments can be envisioned. In
one
embodiment of the invention, the linking chemistry allows one to cleave off
the so-
purified nested DNA enzymatically, chemically or physically. By way of
example, the
L-L' chemistry can be of a type of disulfide bond (chemically cleavable, for
example,
by mercaptoethanol or dithioerythritol), a biotin/streptavidin system, a
heterobifunctional derivative of a trityl ether group (Köster et al., "A
Versatile Acid-
Labile Linker for Modification of Synthetic Biomolecules," Tetrahedron Letters
31,
7095 (1990)) which can be cleaved under mildly acidic conditions, a levulinyl
group
cleavable under almost neutral conditions with a hydrazinium/acetate buffer,
an
arginine-arginine or lysine-lysine bond cleavable by an endopeptidase enzyme
like
trypsin or a pyrophosphate bond cleavable by a pyrophosphatase, a
photocleavable
bond which can be, for example, physically cleaved and the like. Optionally,
another
cation exchange can be performed prior to mass spectrometric analysis. In the
instance that an enzyme-cleavable bond is utilized to immobilize the nested
fragments, the enzyme used to cleave the bond can serve as an internal mass
standard
during MS analysis.
1.4.1.3.2 Purification Process
The purification process and/or ion exchange process can be carried out by a
number of other methods instead of, or in conjunction with, immobilization on
a solid
support. For example, the base-specifically terminated products can be
separated from
the reactants by dialysis, filtration (including ultrafiltration), and
chromatography.
Likewise, these techniques can be used to exchange the cation of the
phosphate backbone with a counter-ion which reduces peak broadening.
The base-specifically terminated fragment families can be generated by
standard Sanger sequencing using the Large Klenow fragment of E. coli DNA
polymerase I, by Sequenase, Taq DNA polymerase and other DNA polymerases
suitable for this purpose, thus generating nested DNA fragments for the mass
spectrometric analysis. It is, however, part of this invention that
base-specifically
terminated RNA transcripts of the DNA fragments to be sequenced can also be
utilized for mass spectrometric sequence determination. In this case, various
RNA
polymerases such as the SP6 or the T7 RNA polymerase can be used on
appropriate
vectors containing, for example, the SP6 or the T7 promoters (e.g. Axelrod et
al.,
"Transcription from Bacteriophage T7 and SP6 RNA Polymerase Promoters in the
Presence of 3' Deoxyribonucleoside 5' triphosphate Chain Terminators,"
Biochemistry
24, 5716-23 (1985)). In this case, the unknown DNA sequence fragments are
inserted
downstream from such promoters. Transcription can also be initiated by a
nucleic acid
primer (Pitulle et al., "Initiator Oligonucleotides for the Combination of
Chemical and
Enzymatic RNA Synthesis," Gene 112, 101-105 (1992)) which carries, as one
embodiment of this invention, appropriate linking functionalities, L, which
allow the
immobilization of the nested RNA fragments, as outlined above, prior to mass
spectrometric analysis for purification and/or appropriate modification and/or
conditioning.
1.4.1.3.3 Immobilization Process
For this immobilization process of the DNA/RNA sequencing products for
mass spectrometric analysis, various solid supports can be used, e.g., beads
(silica gel,
controlled pore glass, magnetic beads, Sephadex/Sepharose beads, cellulose
beads,
etc.), capillaries, glass fiber filters, glass surfaces, metal surfaces or
plastic material.
Examples of useful plastic materials include membranes in filter or microtiter
plate
formats, the latter allowing the automation of the purification process by
employing
microtiter plates which, as one embodiment of the invention, carry a permeable
membrane in the bottom of the well functionalized with L'. Membranes can be
based
on polyethylene, polypropylene, polyamide, polyvinylidenedifluoride and the
like.
Examples of suitable metal surfaces include steel, gold, silver, aluminum, and
copper.
After purification, cation exchange, and/or modification of the phosphodiester
backbone of the L-L' bound nested Sanger fragments, they can be cleaved off the
solid support chemically, enzymatically or physically. Also, the L-L' bound
fragments
can be cleaved from the support when they are subjected to mass spectrometric
analysis by using appropriately chosen L-L' linkages and corresponding laser
energies/intensities as described above and herein.
1.4.1.4 Data Analysis (ES, MALDI)
The highly purified, four base-specifically terminated DNA or RNA fragment
families are then analyzed with regard to their fragment lengths via
determination of
their respective molecular weights by MALDI or ES mass spectrometry.
For ES, the samples, dissolved in water or in a volatile buffer, are injected
either continuously or discontinuously into an atmospheric pressure ionization
interface (API) and then mass analyzed by a quadrupole. With the aid of a
computer
program, the molecular weight peaks are searched for the known molecular
weight of
the nucleic acid primer (UP), and it is determined which of the four chain
terminating
nucleotides has been added to the UP. This represents the first nucleotide of
the
unknown sequence. Then, the second, the third, and the nth extension product can
be
identified in a similar manner and, by this, the nucleotide sequence is
assigned. The
generation of multiple ion peaks which can be obtained using ES mass
spectrometry
can increase the accuracy of the mass determination.
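The peak search described above can be sketched, purely for illustration, as a walk along the mass ladder in which each successive mass difference is matched to one nucleotide. The sketch assumes idealized neutral masses, approximate average residue masses, a terminal dideoxynucleotide that is one oxygen (about 16 mass units) lighter than the corresponding deoxynucleotide residue, and a single pooled list of peaks. The names and the tolerance are hypothetical and are not part of this specification.

```python
# Approximate average masses (Da) of deoxynucleotide residues within a chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
DIDEOXY_DELTA = -16.00  # assumption: terminal ddN lacks the 3'-OH oxygen


def call_base(delta, tol=1.0):
    """Return the base whose residue mass is closest to delta, or None."""
    base, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - delta))
    return base if abs(mass - delta) <= tol else None


def read_sequence(primer_mass, peak_masses, tol=1.0):
    """Assign a sequence from the masses of nested, terminated fragments."""
    sequence = []
    # Reference point: the primer mass carrying the same terminal offset.
    previous = primer_mass + DIDEOXY_DELTA
    for peak in sorted(peak_masses):
        base = call_base(peak - previous, tol)
        if base is None:
            break  # stop at the first step that matches no nucleotide
        sequence.append(base)
        previous = peak
    return "".join(sequence)


def simulate_ladder(primer_mass, sequence):
    """Hypothetical helper: ideal fragment masses for a known sequence."""
    masses, running = [], primer_mass
    for base in sequence:
        running += RESIDUE_MASS[base]
        masses.append(running + DIDEOXY_DELTA)
    return masses


ladder = simulate_ladder(5000.0, "GATTACA")
decoded = read_sequence(5000.0, ladder)
```

Because the terminal offset is the same for every fragment, it cancels in each difference between neighbouring peaks, which is why one table of residue masses suffices in this sketch.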
1.4.1.5 Process for Multiplex Mass Spectrometric DNA Sequencing Employing
Mass-Modified Reagents
As illustrative embodiments of this invention, three different basic processes
for multiplex mass spectrometric DNA sequencing employing the described mass-
modified reagents are described below:
A) Multiplexing by the use of mass-modified nucleic acid primers (UP) for
Sanger DNA or RNA sequencing,
B) Multiplexing by the use of mass-modified nucleoside triphosphates as
chain elongators and/or chain terminators for Sanger DNA or RNA sequencing,
and
C) Multiplexing by the use of tag probes which specifically hybridize to tag
sequences which are integrated into part of the four Sanger DNA/RNA base-
specifically terminated fragment families. Mass modification here can be
achieved as
described herein, or alternatively, by designing different oligonucleotide
sequences
having the same or different length with unmodified nucleotides which, in a
predetermined way, generate appropriately differentiated molecular weights.
The process of multiplexing by mass-modified nucleic acid primers (UP) is
illustrated by way of example herein for mass analyzing four different DNA
clones
simultaneously. The first reaction mixture is obtained by standard Sanger DNA
sequencing having unknown DNA fragment 1 (clone 1) integrated in an appropriate
vector (e.g., M13mp18), employing an unmodified nucleic acid primer UP⁰, a
standard mixture of the four unmodified deoxynucleoside triphosphates, dNTP⁰, and
1/10th of one of the four dideoxynucleoside triphosphates, ddNTP⁰. A second
reaction mixture for DNA fragment 2 (clone 2) is obtained by employing a
mass-modified nucleic acid primer UP¹ and, as before, the four unmodified nucleoside
triphosphates, dNTP⁰, containing in each separate Sanger reaction 1/10th of the
chain-terminating unmodified dideoxynucleoside triphosphates ddNTP⁰. In the other
two experiments, the four Sanger reactions have the following compositions: DNA
fragment 3 (clone 3), UP², dNTP⁰, ddNTP⁰ and DNA fragment 4 (clone 4), UP³,
dNTP⁰, ddNTP⁰. For mass spectrometric DNA sequencing, all base-specifically
terminated reactions of the four clones are pooled and mass analyzed. The various
mass peaks belonging to the four dideoxy-terminated (e.g., ddT-terminated) fragment
families are assigned to specifically elongated and ddT-terminated fragments by
searching (such as by a computer program) for the known molecular ion peaks of
UP⁰, UP¹, UP² and UP³ extended by one of the four dideoxynucleoside
triphosphates: UP⁰-ddN⁰, UP¹-ddN⁰, UP²-ddN⁰ and UP³-ddN⁰. In this way, the first
nucleotides of the four unknown DNA sequences of clones 1 to 4 are determined. The
process is repeated, having memorized the molecular masses of the four specific first
extension products, until the four sequences are assigned. Unambiguous
mass/sequence assignments are possible even in the worst case scenario in which the
four mass-modified nucleic acid primers are extended by the same dideoxynucleoside
triphosphate, the extension products then being, for example, UP⁰-ddT, UP¹-ddT,
UP²-ddT and UP³-ddT, which differ by the known mass increment differentiating the
four nucleic acid primers. In another embodiment of this invention, an analogous
technique is employed using different vectors containing, for example, the SP6 and/or
T7 promoter sequences, and performing transcription with the nucleic acid primers
UP⁰, UP¹, UP² and UP³ and an RNA polymerase (e.g., SP6 or T7 RNA
polymerase) with chain-elongating and terminating unmodified nucleoside
triphosphates NTP⁰ and 3'-dNTP⁰. Here, the DNA sequence is being determined by
Sanger RNA sequencing.
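As a non-limiting sketch of the first assignment step described above, the following code searches each pooled, base-specifically terminated reaction for a peak at the mass of a known primer plus one dideoxynucleotide. The primer masses, the 50-unit mass increment between primers, the approximate residue masses and all names are assumptions made for the illustration only. Real pooled spectra would also contain the longer fragments, which this sketch omits.

```python
# Approximate average residue masses (Da); a terminal ddN is taken to be
# one oxygen (about 16 Da) lighter than the corresponding dN residue.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
OXYGEN = 16.00
DDN_MASS = {base: mass - OXYGEN for base, mass in RESIDUE_MASS.items()}


def first_extension_calls(primer_masses, pooled_peaks, tol=1.0):
    """Assign the first nucleotide of each clone from pooled reactions.

    primer_masses: clone name -> mass of its (mass-modified) primer.
    pooled_peaks:  terminator base -> peak masses of the pooled reaction.
    """
    calls = {}
    for clone, primer in primer_masses.items():
        for base, peaks in pooled_peaks.items():
            expected = primer + DDN_MASS[base]
            if any(abs(peak - expected) <= tol for peak in peaks):
                calls[clone] = base
                break
    return calls


# Hypothetical primers differing by a known 50 Da increment.
primers = {"clone 1": 5000.0, "clone 2": 5050.0,
           "clone 3": 5100.0, "clone 4": 5150.0}
# Hypothetical first-extension peaks in the four pooled reactions.
pooled = {
    "A": [5150.0 + DDN_MASS["A"]],
    "C": [],
    "G": [5100.0 + DDN_MASS["G"]],
    "T": [5000.0 + DDN_MASS["T"], 5050.0 + DDN_MASS["T"]],
}
calls = first_extension_calls(primers, pooled)
```

Clones 1 and 2 are both terminated by ddT in this example, yet remain distinguishable because their primers differ by the known increment, which corresponds to the worst case discussed above.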
Illustrated herein is the process of multiplexing by mass-modified chain-
elongating or/and terminating nucleoside triphosphates in which three
different DNA
fragments (3 clones) are mass analyzed simultaneously. The first DNA Sanger
sequencing reaction (DNA fragment 1, clone 1) is the standard mixture employing
unmodified nucleic acid primer UP⁰, dNTP⁰ and in each of the four reactions one of
the four ddNTP⁰. The second (DNA fragment 2, clone 2) and the third (DNA
fragment 3, clone 3) have the following contents: UP⁰, dNTP⁰, ddNTP¹ and UP⁰,
dNTP⁰, ddNTP², respectively. In a variation of this process, an amplification of the
mass increment in mass-modifying the extended DNA fragments can be achieved by
using an equally mass-modified deoxynucleoside triphosphate (i.e., dNTP¹,
dNTP²) for chain elongation, either alone or in conjunction with the homologous
equally mass-modified dideoxynucleoside triphosphate. For the three clones depicted
above, the contents of the reaction mixtures can be as follows: either
UP⁰/dNTP⁰/ddNTP⁰, UP⁰/dNTP¹/ddNTP⁰ and UP⁰/dNTP²/ddNTP⁰ or
UP⁰/dNTP⁰/ddNTP⁰, UP⁰/dNTP¹/ddNTP¹ and UP⁰/dNTP²/ddNTP². As described above,
DNA sequencing can be performed by Sanger RNA sequencing employing unmodified
nucleic acid primers, UP⁰, and an appropriate mixture of chain-elongating and
terminating nucleoside triphosphates. The mass-modification can be again either in the
chain-terminating nucleoside triphosphate alone or in conjunction with mass-modified
chain-elongating nucleoside triphosphates. Multiplexing is achieved by pooling the
three base-specifically terminated sequencing reactions (e.g., the ddTTP terminated
products) and simultaneously analyzing the pooled products by mass spectrometry.
Again, the first extension products of the known nucleic acid primer sequence are
assigned, e.g., via a computer program. Mass/sequence assignments are possible even
in the worst case in which the nucleic acid primer is extended/terminated by the same
nucleotide, e.g., ddT, in all three clones. The following configurations thus obtained
can be well differentiated by their different mass modifications: UP⁰-ddT⁰, UP⁰-ddT¹,
UP⁰-ddT².
In yet another embodiment of this invention, DNA sequencing by multiplex
mass spectrometry can be achieved by cloning the DNA fragments to be sequenced
in
"plex-vectors" containing vector specific "tag sequences" as described (Koster
et al.,
"Oligonucleotide Synthesis and Multiplex DNA Sequencing Using Chemiluminescent
Detection," Nucleic Acids Res. Symposium Ser. No. 24, 318-321 (1991)); then
pooling
clones from different plex-vectors for DNA preparation and the four separate
Sanger
sequencing reactions using standard dNTP⁰/ddNTP⁰ and nucleic
acid primer UP⁰;
purifying the four multiplex fragment families via linking to a solid support
through
the linking group, L, at the 5'-end of UP⁰; washing out all
by-products, and cleaving
the purified multiplex DNA fragments off the support or using the L-L' bound
nested
Sanger fragments as such for mass spectrometric analysis as described above;
performing de-multiplexing by one-by-one hybridization of specific "tag
probes";
and subsequently analyzing by mass spectrometry. As a reference point, the
four base-
specifically terminated multiplex DNA fragment families are run by the mass
spectrometer and all ddT⁰, ddA⁰, ddC⁰ and ddG⁰ terminated
molecular ion peaks are
respectively detected and memorized. Assignment of, for example, ddT⁰
terminated
DNA fragments to a specific fragment family is accomplished by another mass
spectrometric analysis after hybridization of the specific tag probe (TP) to
the
corresponding tag sequence contained in the sequence of this specific fragment
family.
Only those molecular ion peaks which are capable of hybridizing to the
specific tag probe are shifted to a higher molecular mass by the same known
mass
increment (e.g. of the tag probe). These shifted ion peaks, by virtue of all
hybridizing
to a specific tag probe, belong to the same fragment family. For a given
fragment
family, this is repeated for the remaining chain terminated fragment families
with the
same tag probe to assign the complete DNA sequence. This process is repeated
i-1
times corresponding to i clones multiplexed (the i-th clone is identified by
default).
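The assignment by tag probe described above can be illustrated by comparing the peak list recorded before hybridization with the one recorded after it. Peaks that reappear shifted by the known mass of the tag probe are taken to belong to the fragment family carrying the matching tag sequence. This is only a sketch under simplifying assumptions (ideal neutral masses, complete hybridization, a hypothetical probe mass and hypothetical names).

```python
def family_peaks(before, after, probe_mass, tol=1.0):
    """Return the peaks of `before` that appear in `after` shifted upward
    by probe_mass, i.e. the fragments that hybridized to the tag probe."""
    return [mass for mass in sorted(before)
            if any(abs(shifted - (mass + probe_mass)) <= tol
                   for shifted in after)]


# Hypothetical reference spectrum of pooled ddT-terminated fragments.
reference = [5288.2, 5338.2, 5592.4, 5651.4]
# Hypothetical spectrum after hybridization of one tag probe (6200.0 Da):
# two peaks are shifted, two are unchanged.
probe = 6200.0
hybridized = [5338.2, 5651.4, 5288.2 + probe, 5592.4 + probe]

same_family = family_peaks(reference, hybridized, probe)
```

Repeating the comparison with a different tag probe for each of the remaining clones sorts the remaining peaks, so that i-1 probes suffice for i clones as noted above.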
The differentiation of the tag probes for the different multiplexed clones can
be obtained just by the DNA sequence and its ability to Watson-Crick base pair
to the
tag sequence. It is well known in the art how to calculate stringency
conditions to
provide for specific hybridization of a given tag probe with a given tag
sequence (see,
for example, Molecular Cloning: A Laboratory Manual, 2nd ed., ed. by Sambrook,
Fritsch
and Maniatis (Cold Spring Harbor Laboratory Press: NY, 1989), Chapter 11).
Furthermore, differentiation can be obtained by designing the tag sequence for
each
plex-vector to have a sufficient mass difference so as to be unique just by
changing
the length or base composition or by mass-modifications. In order to keep the
duplex
between the tag sequence and the tag probe intact during mass spectrometric
analysis,
it is another embodiment of the invention to provide for a covalent attachment
mediated by, for example, photoreactive groups such as psoralen and
ellipticine and
by other methods known to those skilled in the art (see, for example, Helene
et al.,
Nature 344, 358 (1990) and Thuong et al., "Oligonucleotides Attached to
Intercalators,
Photoreactive and Cleavage Agents" in F. Eckstein, Oligonucleotides and
Analogues:
A Practical Approach, IRL Press, Oxford 1991, 283-306).
The DNA sequence is unraveled again by searching for the lowest molecular
weight molecular ion peak corresponding to the known UP⁰-tag
sequence/tag probe
molecular weight plus the first extension product, e.g., ddT⁰, then
the second, the
third, etc.
In a combination of the latter approach with the previously described
multiplexing processes, a further increase in multiplexing can be achieved by
using, in
addition to the tag probe/tag sequence interaction, mass-modified nucleic acid
primers
and/or mass-modified deoxynucleoside, dNTP⁰⁻ⁱ, and/or
dideoxynucleoside
triphosphates, ddNTP⁰⁻ⁱ. Those skilled in the art will realize that
the tag sequence/tag
probe multiplexing approach is not limited to Sanger DNA sequencing generating
nested DNA fragments with DNA polymerases. The DNA sequence can also be
determined by transcribing the unknown DNA sequence from appropriate
promoter-containing vectors (see above) with various RNA polymerases and mixtures of
NTP⁰⁻ⁱ/3'-dNTP⁰⁻ⁱ, thus generating nested RNA fragments.
In yet another embodiment of this invention, the mass-modifying functionality
can be introduced by a two or multiple step process. In this case, the nucleic
acid
primer, the chain-elongating or terminating nucleoside triphosphates and/or
the tag
probes are, in a first step, modified by a precursor functionality such as
azido, -N3, or
modified with a functional group in which the R in XR is H thus providing
temporary
functions, e.g., but not limited to -OH, -NH2, -NHR, -SH, -NCS, -OCO(CH2)rCOOH
(r = 1-20), -NHCO(CH2)rCOOH (r = 1-20), -OSO2OH, -OCO(CH2)r' (r = 1-20),
-OP(O-Alkyl)N(Alkyl)2. These less bulky functionalities result in better substrate
properties
for the enzymatic DNA or RNA synthesis reactions of the DNA sequencing
process.
The appropriate mass-modifying functionality is then introduced after the
generation
of the nested base-specifically terminated DNA or RNA fragments prior to mass
spectrometry. Several examples of compounds which can serve as mass-modifying
functionalities are depicted herein without limiting the scope of this
invention.
1.4.1.6 Kits for Sequencing Nucleic Acid by Mass Spectrometry
Another aspect of this invention concerns kits for sequencing nucleic acids by
mass spectrometry which include combinations of the above-described sequencing
reactants. For instance, in one embodiment, the kit comprises reactants for
multiplex
mass spectrometric sequencing of several different species of nucleic acid.
The kit can
include a solid support having a linking functionality (L') for
immobilization of the
base-specifically terminated products; at least one nucleic acid primer
having a
linking group (L) for reversibly and temporarily linking the primer and solid
support
through, for example, a photocleavable bond; a set of chain-elongating
nucleotides
(e.g., dATP, dCTP, dGTP and dTTP, or ATP, CTP, GTP and UTP); a set of
chain-terminating nucleotides (such as 2',3'-dideoxynucleotides for DNA synthesis or
3'-deoxynucleotides for RNA synthesis); and an appropriate polymerase for
synthesizing
complementary nucleotides. Primers and/or terminating nucleotides can be mass-
modified so that the base-specifically terminated fragments generated from one
of the
species of nucleic acids to be sequenced can be distinguished by mass
spectrometry
from all of the others. As an alternative to the use of mass-modified synthesis
reactants, a
set of tag probes (as described above) can be included in the kit. The kit can
also
include appropriate buffers as well as instructions for performing multiplex
mass
spectrometry to concurrently sequence multiple species of nucleic acids.
In another embodiment, a nucleic acid sequencing kit can comprise a solid
support as described above, a primer for initiating synthesis of complementary
nucleic
acid fragments, a set of chain-elongating nucleotides and an appropriate
polymerase.
The mass-modified chain-terminating nucleotides are selected so that the
addition of
one of the chain terminators to a growing complementary nucleic acid can be
distinguished by mass spectrometry.
1.4.2 A Method And System For Determining The Sequence Of Genomes
1.4.2.1 A Process For Directly Amplifying And Base Specifically Terminating A
Nucleic Acid Molecule For Sequencing
In general, the invention features a process for directly amplifying and base
specifically terminating a nucleic acid molecule. According to the process of
the
invention, a combined amplification and termination reaction is performed on a
nucleic acid template using: (i) a complete set of chain-elongating
nucleotides; (ii) at
least one chain-terminating nucleotide; (iii) a first DNA polymerase,
which has a
relatively low affinity towards the chain terminating nucleotide; and (iv) a
second
DNA polymerase, which has a relatively high affinity towards the chain
terminating
nucleotide, so that polymerization by the enzyme with relatively low affinity
for the
chain terminating nucleotide leads to amplification of the template, whereas
the
enzyme with relatively high affinity for the chain terminating nucleotide
terminates
the polymerization and yields sequencing products.
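To illustrate what "base specifically terminated" means for the sequencing products, the following Python sketch lists the fragment lengths expected when chain extension stops at every occurrence of one base. It models only the termination sites, not the amplification carried out by the first polymerase. The function names, the example sequence and the primer length are hypothetical and are not taken from the specification.

```python
def termination_ladder(extension, terminator_base, primer_length=0):
    """Return the lengths of all fragments that end in `terminator_base`.

    `extension` is the newly synthesized strand (5'->3') downstream of the
    primer; every occurrence of the base is a possible termination site.
    """
    terminator_base = terminator_base.upper()
    return [
        primer_length + position + 1
        for position, base in enumerate(extension.upper())
        if base == terminator_base
    ]


def full_ladder(extension, primer_length=0):
    """Ladders for the four dideoxynucleotides (ddA, ddC, ddG, ddT)."""
    return {
        base: termination_ladder(extension, base, primer_length)
        for base in "ACGT"
    }


if __name__ == "__main__":
    # Hypothetical 7-base extension product behind a 20-base primer.
    for base, lengths in full_ladder("GATTACA", primer_length=20).items():
        print(f"dd{base}: {lengths}")
```

Reading the four ladders in order of increasing length recovers the extension sequence, which is the principle that gel-based or mass spectrometric detection relies on.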
The combined amplification and sequencing can be based on any
amplification procedure that employs an enzyme with polynucleotide synthetic
ability
(e.g. polymerase). One preferred process, based on the polymerase chain
reaction
(PCR), is comprised of the following three thermal steps: 1) denaturing a
double
stranded (ds) DNA molecule at an appropriate temperature and for an
appropriate
period of time to obtain the two single stranded (ss) DNA molecules (the
template:
sense and antisense strand); 2) contacting the template with at least one
primer that
hybridizes to at least one ss DNA template at an appropriate temperature and
for an
appropriate period of time to obtain a primer containing ss DNA template; 3)
contacting the primer containing template at an appropriate temperature and
for an
appropriate period of time with: (i) a complete set of chain elongating
nucleotides, (ii)
at least one chain terminating nucleotide, (iii) a first DNA polymerase,
which has a
relatively low affinity towards the chain terminating nucleotide; and (iv) a
second
DNA polymerase, which has a relatively high affinity towards the chain
terminating
nucleotide.
Steps 1)-3) can be sequentially performed for an appropriate number of times
(cycles) to obtain the desired amount of amplified sequencing ladders. The
quantity of
the base specifically terminated fragment desired dictates how many cycles are
performed. Although an increased number of cycles results in an increased
level of
amplification, it may also detract from the sensitivity of a subsequent
detection. It is
therefore generally undesirable to perform more than about 50 cycles, and is
more
preferable to perform fewer than about 40 cycles (e.g. about 20-30 cycles). In
a
preferred embodiment, the first denaturation step is performed at a
temperature in the
range of about 85°C to about 100°C (most preferably about
92°C to about 96°C) for
about 20 seconds (s) to about 2 minutes (most preferably about 30 s to 1 minute).
The
second hybridization step is preferably performed at a temperature, which is
in the
range of about 40°C to about 80°C (most preferably about
45°C to about 72°C) for
about 20s to about 2 minutes (most preferably about 30s-1 minute). The third,
primer
extension step is preferably performed at about 65°C to about
80°C (most preferably
about 70°C to about 74°C) for about 30 s to about 3 minutes
(most preferably about 1
to about 2 minutes).
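The temperature and time windows given above can be collected into a small table and used to check a proposed cycling program. The Python sketch below encodes only the broad "about" ranges and the upper limit of about 50 cycles stated in the text. The class, the function and the example program are hypothetical illustrations, not part of the described process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRange:
    name: str
    temp_c: tuple    # (min, max) temperature in degrees C
    seconds: tuple   # (min, max) duration in seconds


# Broad ranges quoted in the text (not the "most preferably" ranges).
BROAD_RANGES = (
    StepRange("denaturation", (85, 100), (20, 120)),
    StepRange("hybridization", (40, 80), (20, 120)),
    StepRange("extension", (65, 80), (30, 180)),
)
MAX_CYCLES = 50  # "generally undesirable to perform more than about 50 cycles"


def check_program(program, cycles):
    """Return a list of problems; an empty list means the program fits.

    `program` maps each step name to a (temperature_C, seconds) pair.
    """
    problems = []
    if cycles > MAX_CYCLES:
        problems.append(f"{cycles} cycles exceeds about {MAX_CYCLES}")
    for step in BROAD_RANGES:
        temp, secs = program[step.name]
        if not step.temp_c[0] <= temp <= step.temp_c[1]:
            problems.append(f"{step.name}: {temp} C outside {step.temp_c}")
        if not step.seconds[0] <= secs <= step.seconds[1]:
            problems.append(f"{step.name}: {secs} s outside {step.seconds}")
    return problems


if __name__ == "__main__":
    # Hypothetical program chosen inside the quoted ranges.
    example = {
        "denaturation": (94, 45),
        "hybridization": (55, 45),
        "extension": (72, 90),
    }
    print(check_program(example, cycles=25))
```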
In order to obtain sequence information on both the sense and antisense
strands of a DNA molecule simultaneously, each of the single stranded sense
and
antisense templates generated from the denaturing step can be contacted with
appropriate primers in step 2), so that amplified and chain terminated nucleic
acid
molecules generated in step 3), are complementary to both strands.
Another preferred process for simultaneously amplifying and chain
terminating a nucleic acid sequence is based on strand displacement
amplification
(SDA) (G. Terrance Walker et al., Nucleic Acids Res. 22, 2670-77 (1994);
European
Patent Publication Number 0 684 315 entitled Strand Displacement Amplification
Using Thermophilic Enzymes). In essence, this process involves the following
three
steps, which altogether comprise a cycle: 1) denaturing a double stranded (ds)
DNA
molecule containing the sequence to be amplified at an appropriate temperature
and
for an appropriate period of time to obtain the two single stranded (ss) DNA
molecules (the template: sense and antisense strand); 2) contacting the
template with
at least one primer (P), that contains a recognition/cleavage site for a
restriction
endonuclease (RE) and that hybridizes to at least one ss DNA template at an
appropriate temperature and for an appropriate period of time to obtain a
primer
containing ss DNA template; 3) contacting the primer containing template at an
appropriate temperature and for an appropriate period of time with: (i) a
complete set
of chain elongating nucleotides; (ii) at least one chain terminating
nucleotide, (iii) a
first DNA polymerase, which has a relatively low affinity towards the chain
terminating nucleotide; (iv) a second DNA polymerase, which has a relatively
high
affinity towards the chain terminating nucleotide; and (v) an RE that nicks
the primer
recognition/cleavage site.
Steps 1)-3) can be sequentially performed for an appropriate number of
times
(cycles) to obtain the desired amount of amplified sequencing ladders. As with
the
PCR based process, the quantity of the base specifically terminated fragment
desired
dictates how many cycles are performed. Preferably, less than 50 cycles, more
preferably less than about 40 cycles and most preferably about 20 to 30 cycles
are
performed.
The amplified sequencing ladders obtained as described above, can be
separated and detected and/or quantitated using well established methods, such
as
polyacrylamide gel electrophoresis (PAGE), or capillary zone electrophoresis
(CZE)
(Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al.,
Nucleic Acids
Res. 18, 1415-1419 (1990)); or direct blotting electrophoresis (DBE) (Beck and
Pohl,
EMBO J., vol. 3: pp. 2905-2909 (1984)) in conjunction with, for example,
colorimetry, fluorimetry, chemiluminescence and radioactivity.
Dye-terminator chemistry can be employed in the combined amplification and
sequencing reaction to enable the simultaneous generation of forward and
reverse
sequence ladders, which can be separated based on the streptavidin-biotin
system
when one biotinylated primer is provided.
Depicted herein is a scheme for the combined amplification and sequencing
using two polymerases and dye-labeled chain terminating nucleotide (ddNTP) for
detection and two reverse oriented primers. A means of separation for the
simultaneously generated forward and reverse sequence ladders is shown. Step A
represents the exponential amplification of a target sequence by the
polymerase with a
low affinity for ddNTPs. One of the sequence specific oligonucleotide primers
is
biotinylated. Step B represents the generation of a sequence ladder either
from the
original template or the simultaneously generated amplification product
carried out by
the polymerase with a high affinity for ddNTPs. After completion of the
reaction, the
products are incubated with a streptavidin coated solid support (Step C).
Biotinylated
forward sequencing products and reverse products hybridized to the forward
template
are immobilized. In order to obtain readable sequence information, the forward
and
reverse sequence ladders are separated in Step D. The immobilized strands are
washed and separated by denaturation with ammonium hydroxide at room
temperature. The non-biotinylated reverse sequencing products are removed from
the
beads with ammonium hydroxide supernatant during this procedure. The
biotinylated
forward sequencing products remain immobilized to the beads and are re-
solubilized
with ammonium hydroxide at 60°C. After ethanol precipitation, both
sequencing
species can be resuspended in loading dye and run on an automated sequencer,
for
example.
When mass spectrometry is used in conjunction with the direct amplification
and chain termination processes, the sequencing ladders can be directly
detected
without first being separated using several mass spectrometer formats.
Amenable formats for use in the invention include ionization techniques such
as matrix-assisted laser desorption (MALDI), continuous or pulsed electrospray
(ESI)
and related methods (e.g. Ionspray or Thermospray), and massive cluster impact
(MCI); these ion sources can be matched with a detection format, such as
linear or
reflectron time-of-flight (TOF), single or multiple quadrupole, single or
multiple
magnetic sector, Fourier Transform ion cyclotron resonance (FTICR), ion trap,
or
combinations of these to give a hybrid detector (e.g. ion trap-TOF). For
ionization,
numerous matrix/wavelength combinations (MALDI) or solvent combinations (ESI)
can be employed.
The above-described process can be performed using virtually any nucleic
acid molecule as the source of the DNA template. For example, the nucleic acid
molecule can be: a) single stranded or double stranded; b) linear or
covalently closed
circular in supercoiled or relaxed form; or c) RNA if combined with reverse
transcription to generate a cDNA. For example, reverse transcription can be
performed using a suitable reverse transcriptase (e.g. Moloney murine leukemia
virus
reverse transcriptase) using standard techniques (e.g. Kawasaki (1990) in PCR
Protocols: A Guide to Methods and Applications, Innis et al., eds., Academic
Press,
Berkeley, CA, pp. 21-27).
Sources of nucleic acid templates can include: a) plasmids (naturally
occurring
or recombinant); b) RNA- or DNA- viruses and bacteriophages (naturally
occurring
or recombinant); c) chromosomal or episomal replicating DNA (e.g. from
tissue, a
blood sample, or a biopsy); d) a nucleic acid fragment (e.g. derived by
exonuclease,
unspecific endonuclease or restriction endonuclease digestion or by physical
disruption (e.g. sonication or nebulization)); and e) RNA or RNA transcripts
like
mRNAs.
The nucleic acid to be amplified and sequenced can be obtained from virtually
any biological sample. As used herein, the term "biological sample" refers to
any
material obtained from any living source (e.g. human, animal, plant,
bacteria, fungi,
protist, virus). Examples of appropriate biological samples for use in the
instant
invention include: solid materials (e.g. tissue, cell pellets, biopsies) and
biological
fluids (e.g. urine, blood, saliva, amniotic fluid, mouth wash, spinal fluid).
The nucleic
acid to be amplified and sequenced can be provided by unpurified whole cells,
bacteria or virus.
Alternatively, the nucleic acid can first be purified from a sample using
standard techniques, such as: a) cesium chloride gradient centrifugation; b)
alkaline
lysis with or without RNAse treatment; c) ion exchange chromatography; d)
phenol/chloroform extraction; e) isolation by hybridization to bound
oligonucleotides;
f) gel electrophoresis and elution; g) alcohol precipitation; and h) combinations
of the
above.
As used herein, the phrases "chain-elongating nucleotides" and "chain-
terminating nucleotides" are used in accordance with their art recognized
meaning.
For example, for DNA, chain-elongating nucleotides include 2'-
deoxyribonucleotides
(e.g. dATP, dCTP, dGTP and dTTP) and chain-terminating nucleotides include 2',
3'-
dideoxyribonucleotides, (e.g. ddATP, ddCTP, ddGTP, ddTTP). For RNA, chain-
elongating nucleotides include ribonucleotides (e.g., ATP, CTP, GTP and UTP)
and
chain-terminating nucleotides include 3'-deoxyribonucleotides (e.g. 3'dA,
3'dC, 3'dG
and 3'dU). A complete set of chain elongating nucleotides refers to dATP,
dCTP,
dGTP and dTTP. The term "nucleotide" is also well known in the art. For the
purposes of this invention, nucleotides include nucleoside mono-, di-, and
triphosphates. Nucleotides also include modified nucleotides, such as
phosphorothioate nucleotides and deazapurine nucleotides. A complete set of
chain-
elongating nucleotides refers to four different nucleotides that can hybridize
to each of
the four different bases comprising the DNA template.
If the amplified sequencing ladders are to be detected by mass spectrometric
analysis, it may be useful to "condition" nucleic acid molecules, for example
to
decrease the laser energy required for volatilization and/or to minimize
fragmentation.
Conditioning is preferably performed while the sequencing ladders are
immobilized.
An example of conditioning is modification of the phosphodiester backbone of
the
nucleic acid molecule (e.g. cation exchange), which can be useful for
eliminating peak
broadening due to a heterogeneity in the cations bound per nucleotide unit.
By contacting a nucleic acid molecule, which contains an α-thio-nucleoside-
triphosphate during polymerization, with an alkylating agent such as
alkyliodide,
iodoacetamide, β-iodoethanol, or 2,3-epoxy-1-propanol, the monothio
phosphodiester
bonds of a nucleic acid molecule can be transformed into a phosphotriester
bond.
Further conditioning involves incorporating nucleotides which reduce
sensitivity for
depurination (fragmentation during MS), e.g. a purine analog such as N7- or N9-
deazapurine nucleotides, and partial RNA containing oligodeoxynucleotide to be
able
to remove the unmodified primer from the amplified and modified sequencing
ladders
by RNAse or alkaline treatment. In DNA sequencing using fluorescent detection
and
gel electrophoretic separation, the N7 deazapurine nucleotides reduce the
formation of
secondary structure resulting in band compression from which no sequencing
information can be generated.
1.4.2.2 The Use of Two Polymerase Enzymes Each Having Different Affinities
for the Chain Terminating Nucleotides
Critical to the novel process of the invention is the use of appropriate
amounts
of two different polymerase enzymes, each having a different affinity for the
particular chain terminating nucleotide, so that polymerization by the enzyme
with
relatively low affinity for the chain terminating nucleotide leads to
amplification
whereas the enzyme with relatively high affinity for the chain terminating
nucleotide
terminates the polymerization and yields sequencing products. Preferably about
0.5 to
about 3 units of polymerase is used in the combined amplification and chain
termination reaction. Most preferably about 1 to 2 units is used. Particularly
preferred
polymerases for use in conjunction with PCR or other thermal amplification
processes
are thermostable polymerases, such as Taq DNA polymerase (Boehringer
Mannheim), AmpliTaq FS DNA polymerase (Perkin-Elmer), Deep Vent (exo-), Vent,
Vent (exo-) and Deep Vent DNA polymerases (New England Biolabs), Thermo
Sequenase (Amersham) or exo(-) Pyrococcus furiosus (Pfu) DNA polymerase
(Stratagene, Heidelberg Germany). AmpliTaq, UlTma, 9°Nm, Tth, Hot Tub,
and Pyrococcus furiosus. In addition, preferably the polymerase does not have
5'-3'
exonuclease activity.
The process of the invention can be carried out using AmpliTaq FS DNA
polymerase (Perkin-Elmer), which has a relatively high affinity and Taq DNA
polymerase, which has a relatively low affinity for chain terminating
nucleotides.
Other appropriate polymerase pairs for use in the instant invention can be
determined
by one of skill in the art. (See e.g. S. Tabor and C.C. Richardson (1995)
Proc. Natl.
Acad. Sci. (USA), vol. 92: pp. 6339-6343.) In addition to polymerases, which
have a
relatively high and a relatively low affinity to the chain terminating
nucleotide, a third
polymerase, which has proofreading capacity (e.g. Pyrococcus woesei (Pwo) DNA
polymerase), may also be added to the amplification mixture to enhance the
fidelity of
amplification.
Oligonucleotide primers, for use in the invention, can be designed based on
knowledge of the 5' and/or 3' regions of the nucleotide sequence to be
amplified and
sequenced, e.g., insert flanking regions of cloning and sequencing vectors
(such as
M13, pUC, phagemid, cosmid). Optionally, at least one primer used in the
chain
extension and termination reaction can be linked to a solid support to
facilitate
purification of amplified product from primers and other reactants, thereby
increasing
yield or to separate the Sanger ladders from the sense and antisense template
strand
where simultaneous amplification-sequencing of both a sense and antisense
strand of
the template DNA has been performed.
Examples of appropriate solid supports include beads (silica gel, controlled
pore glass, magnetic beads, Sephadex/Sepharose beads, cellulose beads, etc.),
capillaries, flat supports such as glass fiber filters, glass surfaces, metal
surfaces
(steel, gold, silver, aluminum, and copper), plastic materials or membranes
(polyethylene, polypropylene, polyamide, polyvinylidenedifluoride) or beads in
pits
of flat surfaces such as wafers (e.g. silicon wafers), with or without filter
plates.
1.4.2.3 Immobilization Based on Hybridization
Immobilization can be accomplished, for example, based on hybridization
between a capture nucleic acid sequence, which has already been immobilized to
the
support and a complementary nucleic acid sequence, which is also contained
within
the nucleic acid molecule containing the nucleic acid sequence to be detected.
So that
hybridization between the complementary nucleic acid molecules is not hindered
by
the support, the capture nucleic acid can include a spacer region of at least
about five
nucleotides in length between the solid support and the capture nucleic acid
sequence.
The duplex formed will be cleaved under the influence of the laser pulse and
desorption can be initiated. The solid support-bound base sequence can be
presented
through natural oligoribo- or oligodeoxyribo- nucleotide as well as analogs
(e.g. thio-
modified phosphodiester or phosphotriester backbone) or employing
oligonucleotide
mimetics such as PNA analogs (see e.g. Nielsen et al., Science, 254, 1497
(1991))
which render the base sequence less susceptible to enzymatic degradation and
hence
increase the overall stability of the solid support-bound capture base sequence.
1.4.2.4 Linkage
Alternatively, a target detection site can be directly linked to a solid
support
via a reversible or irreversible bond between an appropriate functionality
(L') on the
target nucleic acid molecule and an appropriate functionality (L) on the
capture
molecule. A reversible linkage can be such that it is cleaved under the
conditions of
mass spectrometry (i.e., a photocleavable bond such as a trityl ether bond or
a charge
transfer complex or a labile bond being formed between relatively stable
organic
radicals). Furthermore, the linkage can be formed with L' being a quaternary
ammonium group, in which case, preferably, the surface of the solid support
carries
negative charges which repel the negatively charged nucleic acid backbone and
thus
facilitate the desorption required for analysis by a mass spectrometer.
Desorption can
occur either by the heat created by the laser pulse and/or, depending on L',
by specific
absorption of laser energy which is in resonance with the L' chromophore.
By way of example, the L-L' chemistry can be of the type of a disulfide bond
(chemically cleavable, for example, by mercaptoethanol or dithioerythritol), a
biotin/streptavidin system, a heterobifunctional derivative of a trityl ether
group
(Koster et al., "A Versatile Acid-Labile Linker for Modification of Synthetic
Biomolecules," Tetrahedron Letters 31, 7095 (1990)) which can be cleaved under
mildly acidic conditions as well as under conditions of mass spectrometry, a
levulinyl
group cleavable under almost neutral conditions with a hydrazinium/acetate
buffer, an
arginine-arginine or lysine-lysine bond cleavable by an endopeptidase enzyme
like
trypsin or a pyrophosphate bond cleavable by a pyrophosphatase or a
ribonucleotide
in between a deoxynucleotide sequence cleavable by an RNAse or alkali.
The functionalities, L and L', can also form a charge transfer complex and
thereby form the temporary L-L' linkage. Since in many cases the "charge-
transfer
band" can be determined by UV/vis spectrometry (see e.g. Organic Charge
Transfer
Complexes by R. Foster, Academic Press, 1969), the laser energy can be tuned
to the
corresponding energy of the charge-transfer wavelength and, thus, a specific
desorption off the solid support can be initiated. Those skilled in the art
will recognize
that several combinations can serve this purpose and that the donor
functionality can
be either on the solid support or coupled to the nucleic acid molecule to be
detected or
vice versa.
In yet another approach, a reversible L-L' linkage can be generated by
homolytically forming relatively stable radicals. Under the influence of the
laser
pulse, desorption (as discussed above) as well as ionization will take place
at the
radical position. Those skilled in the art will recognize that other organic
radicals can
be selected and that, in relation to the dissociation energies needed to
homolytically
cleave the bond between them, a corresponding laser wavelength can be selected
(see
e.g. Reactive Molecules by C. Wentrup, John Wiley & Sons, 1984). An anchoring
function L' can also be incorporated into a target capturing sequence by using
appropriate primers during an amplification procedure, such as PCR, LCR or
transcription amplification.
For certain applications, it may be useful to simultaneously amplify and chain
terminate more than one (mutated) loci on a particular captured nucleic acid
fragment
(on one spot of an array) or it may be useful to perform parallel processing
by using
oligonucleotide or oligonucleotide mimetic arrays on various solid supports.
"Multiplexing" can be achieved either by the sequence itself (composition or
length) or by the introduction of mass-modifying functionalities into the
primer
oligonucleotide. Such multiplexing is particularly useful in conjunction with
mass
spectrometric DNA sequencing or mobility modified gel based fluorescence
sequencing.
1.4.2.5 Mass or Mobility Modification
Without limiting the scope of the invention, the mass or mobility modification
can be introduced by using oligo/polyethylene glycol derivatives. The
oligo/polyethylene glycols can also be monoalkylated by a lower alkyl such as
methyl, ethyl, propyl, isopropyl, t-butyl and the like. Other chemistries can
be used in
the mass-modified compounds, as for example, those described recently in
Oligonucleotides and Analogues: A Practical Approach, F. Eckstein, editor, IRL
Press,
Oxford, 1991.
In yet another embodiment, various mass or mobility modifying
functionalities, other than oligo/polyethylene glycols, can be selected and
attached via
appropriate linking chemistries. A simple modification can be achieved by
using
different alkyl, aryl or aralkyl moieties such as methyl, ethyl, propyl,
isopropyl, t-
butyl, hexyl, phenyl, substituted phenyl or benzyl. Yet another modification
can be
obtained by attaching homo- or heteropeptides to the nucleic acid molecule
(e.g.,
primer) or nucleoside triphosphates. Simple oligoamides also can be used.
Numerous
other possibilities, in addition to those mentioned above, can be performed by
one
skilled in the art.
Different mass or mobility modified primers allow for multiplex sequencing
via simultaneous detection of primer-modified Sanger sequencing ladders.
Mass or mobility modifications can be incorporated during the amplification
process through nucleoside triphosphates or modified primers.
1.4.2.6 Kits for Amplified Base Specifically Terminated Fragments
Another aspect of this invention concerns kits for directly generating from a
nucleic acid template, amplified base specifically terminated fragments. Such
kits
include combinations of the above-described reactants. For instance, in one
embodiment, the kit can comprise: i) a set of chain-elongating nucleotides;
ii) a set of
chain-terminating nucleotides; and (iii) a first DNA polymerase, which has a
relatively low affinity towards the chain terminating nucleotide; and (iv) a
second
DNA polymerase, which has a relatively high affinity towards the chain
terminating
nucleotide. The kit can also include appropriate solid supports for
capture/purification
and buffers as well as instructions for use.
For use with certain detection means, such as polyacrylamide gel
electrophoresis (PAGE), detectable labels must be used in either the primer
(typically
at the 5'-end) or in one of the chain extending nucleotides, or chain
terminating
nucleotides.
Using radioisotopes such as ³²P, ³³P, or ³⁵S is still the most frequently
used
technique.
After PAGE, the gels are exposed to X-ray films and silver grain exposure is
analyzed.
1.4.3 Hybridization
Oligonucleotide arrays can be used in a wide variety of applications,
including
hybridization studies. In a hybridization study, the array can be exposed to a
receptor
(R) of interest. The receptor can be labelled with an appropriate label (*),
such as
fluorescein. The locations on the substrate where the receptor has bound are
determined and, through knowledge of the sequence of the oligonucleotide probe
at
that location one can then determine, if the receptor is an oligonucleotide,
the
sequence of the receptor.
Sequencing by hybridization (SBH) is most efficiently practiced by attaching
many probes to a surface to form an array in which the identity of the probe
at each
site is known. A labeled target DNA or RNA is then hybridized to the array,
and the
hybridization pattern is examined to determine the identity of all
complementary
probes in the array. Contrary to the teachings of the prior art, which teaches
that
mismatched probe/target complexes are not of interest, the present invention
provides
an analytical method in which the hybridization signal of mismatched
probe/target
complexes identifies or confirms the identity of the perfectly matched
probe/target
complexes on the array.
Arrays of oligonucleotides are efficiently generated for the hybridization
studies using light-directed synthesis techniques.
1.4.3.1 Light Directed Synthesis
As discussed below, an array of all tetranucleotides was produced in sixteen
cycles, which required only 4 hours to complete. Because combinatorial
strategies are
used, the number of different compounds on the array increases exponentially
during
synthesis, while the number of chemical coupling cycles increases only
linearly. For
example, expanding the synthesis to the complete set of 4⁸ (65,536)
octanucleotides
adds only 4 hours (or less) to the synthesis due to the 16 additional cycles
required.
Furthermore, combinatorial synthesis strategies can be implemented to generate
arrays of any desired probe composition. For example, because the entire set
of
dodecamers (4¹²) can be produced in 48 photolysis and coupling cycles or less
(bⁿ compounds require no more than b × n cycles), any subset of the dodecamers
(including any subset of shorter oligonucleotides) can be constructed in 48 or
fewer
chemical coupling steps. The number of compounds in an array is limited only
by the
density of synthesis sites and the overall array size. The present invention
has been
practiced with arrays with probes synthesized in square sites 25 microns on a
side. At
this resolution, the entire set of 65,536 octanucleotides can be placed in an
array
measuring only 0.64 cm square. The set of 1,048,576 decanucleotides requires only
a 2.56 cm square array at this individual probe site size.
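The arithmetic behind these figures can be checked with a short Python sketch. It assumes a square array of square sites, and the function names are hypothetical. Under that assumption the quoted values 0.64 and 2.56 correspond to the side length in centimetres of arrays with 256 and 1,024 sites per side, and 1,048,576 equals 4¹⁰ (the full set of 12-mers would be 4¹² = 16,777,216).

```python
import math


def probe_count(length, bases=4):
    """Number of distinct oligonucleotides of the given length (b**n)."""
    return bases ** length


def max_coupling_cycles(length, bases=4):
    """Upper bound on coupling cycles for the full set (b * n)."""
    return bases * length


def square_array_side_cm(n_probes, site_um=25.0):
    """Side length in cm of a square array holding n_probes square sites."""
    sites_per_side = math.isqrt(n_probes)
    if sites_per_side * sites_per_side != n_probes:
        raise ValueError("n_probes must be a perfect square for a square array")
    return sites_per_side * site_um / 10_000  # 1 cm = 10,000 micrometres


if __name__ == "__main__":
    for n in (4, 8, 10, 12):
        count = probe_count(n)
        print(n, count, max_coupling_cycles(n), square_array_side_cm(count))
```

The cycle count grows linearly (4, 8, 12 bases need at most 16, 32 and 48 cycles) while the number of probes grows exponentially, which is the point made in the paragraph above.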
The success of genome sequencing projects depends on efficient DNA
sequencing technologies. Current methods are highly reliant on complex
procedures
and require substantial manual effort. SBH offers the potential for automating
many
of the manual efforts in current practice. Light-directed synthesis offers an
efficient
means for large scale production of miniaturized arrays not only for SBH but
for
many other applications as well.
Although oligonucleotide arrays can be used for primary sequencing
applications, many diagnostic methods involve the analysis of only a few
nucleotide
positions in a target nucleic acid sequence. Because single base changes cause
multiple changes in the hybridization pattern of the target on a probe array,
the
oligonucleotide arrays and methods of the present invention enable one to
check the
accuracy of previously elucidated DNA sequences, or to scan for changes or
mutations in certain specific sequences within a target nucleic acid. The
latter as is
important, for example, for genetic disease, quality control, and forensic
analysis.
With an octanucleotide probe set, a single base change in a target nucleic
acid can be
detected by the loss of eight perfect hybrids, and the generation of eight new
perfect
hybrids. The single base change can also be detected through altered mismatch
probe/target complex formation on the array. Perhaps even more surprisingly,
such
single base changes in a complex nucleic acid dramatically alter the overall
hybridization pattern of the target to the array. According to the present
invention
such changes in the overall hybridization pattern are used to actually
simplify the
analysis.
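The statement that one base change removes eight perfect octanucleotide hybrids and creates eight new ones can be illustrated by comparing the sets of overlapping 8-mers in a reference and a variant target. In the Python sketch below, each target 8-mer stands for the one probe perfectly complementary to it. Both sequences are hypothetical, and the count of eight holds when the changed base lies at least eight positions from either end and the affected 8-mers do not recur elsewhere in the target.

```python
def kmers(seq, k=8):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hybrid_changes(reference, variant, k=8):
    """Return (lost, gained) k-mers when going from reference to variant."""
    ref, var = kmers(reference, k), kmers(variant, k)
    return ref - var, var - ref


if __name__ == "__main__":
    # Hypothetical 24-base target and a variant with one T->G change at index 12.
    reference = "ACGTTGCAAGCTTAGGCATCGATC"
    variant = "ACGTTGCAAGCTGAGGCATCGATC"
    lost, gained = hybrid_changes(reference, variant)
    print(len(lost), "perfect hybrids lost;", len(gained), "gained")
```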
The high information content of light-directed oligonucleotide arrays greatly
benefits genetic diagnostic testing. Sequence comparisons of hundreds to
thousands of
different mutations can be assayed simultaneously instead of in a one-at-a-
time
format.
1.4.3.2 Arrays Constructed to Contain Genetic Markers
Arrays can also be constructed to contain genetic markers for the rapid
identification of a wide variety of pathogenic organisms, and to study the
sequence
specificity of RNA/RNA, RNA/DNA, protein/RNA or protein/DNA interactions.
One can use non-Watson-Crick oligonucleotides and novel synthetic nucleoside
analogs for antisense, triple helix, or other applications. Suitably protected
RNA
monomers can be employed for RNA synthesis, and a wide variety of synthetic
and
non-naturally occurring nucleic acid analogues can be used, depending upon the
motivations of the practitioner. See, e.g., PCT patent Publication Nos.
91/19813,
92/05285, and 92/14843, incorporated herein by reference. In addition, the
oligonucleotide arrays can be used to deduce thermodynamic and kinetic rules
governing the formation and stability of oligonucleotide complexes.
1.4.3.2.1 Hybridization of Targets to Surface Oligonucleotides
The support bound octanucleotide probes discussed above were hybridized to
a target of 5'GCGTAGGC-fluorescein in the hybridization chamber by incubation
for 15 minutes at 15°C.
The array surface was then interrogated with an epifluorescence microscope
(488 nm argon ion excitation). The fluorescence intensity pattern matches the
800
× 1280 µm stripe used to direct the synthesis of the probe. Furthermore, the
signal
intensities are high (four times over the background of the glass substrate),
demonstrating specific binding of the target to the probe.
The behavior of the target-probe complex was investigated by increasing the
temperature of the hybridization solution. After a minute equilibration at
each
temperature, the substrate was scanned for signal. The duplex melted in the
temperature range expected for the sequence under study (Tm ≈ 28°C, obtained from
the rule Tm = [2°(A+T) + 4°(G+C)]). The probes in the array were stable to
temperature denaturation of the target-probe complex as demonstrated by
rehybridization of target DNA.
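The melting temperature rule quoted above is easy to verify for the target used in this experiment. The following Python sketch applies the rule as stated in the text (2° per A or T, 4° per G or C). The function name is a hypothetical choice, and the rule is only a rough estimate for short oligonucleotides.

```python
def rule_of_thumb_tm(seq):
    """Estimate Tm in degrees C as 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("sequence may contain only A, C, G and T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Target from the text: two A/T bases and six G/C bases.
    print(rule_of_thumb_tm("GCGTAGGC"))
```

For 5'-GCGTAGGC the rule gives 2 × 2 + 4 × 6 = 28, in agreement with the value of about 28°C stated above.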
1.4.3.2.2 Sequence Specificity of Target Hybridization
To demonstrate the sequence specificity of target hybridization, two different
probes were synthesized in 800 × 1280 µm stripes. The probe S-3'-CGCATCCG
was synthesized in stripes 1, 3 and 5. The probe S-3'-CGCTTCCG was
synthesized in stripes 2, 4 and 6. The results of hybridizing a 5'-GCGTAGGC-
fluorescein target to the substrate at 15°C are depicted herein.
Although the probes
differ by only one internal base, the target hybridizes specifically to its
complementary sequence (~500 counts above background in stripes 1, 3 and 5)
with little or no detectable signal in positions 2, 4 and 6 (~10 counts).
1.4.3.2.3 Combinatorial Synthesis of, and Hybridization of a Nucleic Acid
Target to, a Probe Matrix
In a light-directed synthesis, the location and composition of products
depend
on the pattern of illumination and the order of chemical coupling reagents
(see
Fodor et al., Science (1991) 251:767-773, for a complete description).
Consider
the synthesis of 256 tetranucleotides. Mask 1 activates one fourth of the
substrate
surface for coupling with the first of four nucleosides in the first round of
synthesis. In cycle 2, mask 2 activates a different quarter of the substrate
for
coupling with the second nucleoside. The process is continued to build four
regions of mononucleotides. The masks of round 2 are perpendicular to those of
round 1, and each cycle of round 2 generates four new dinucleotides. The
process
continues through round 2 to form sixteen dinucleotides. The masks of round 3
further subdivide the synthesis regions so that each coupling cycle generates
16
trimers. The subdivision of the substrate is continued through round 4 to form
the
tetranucleotides. The synthesis of this probe matrix can be compactly
represented
in polynomial notation as (A+C+G+T)^4. Expansion of this polynomial yields the
256 tetranucleotides.
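The polynomial expansion described above can be illustrated with a short Python sketch. It only enumerates the sequences that the four rounds of synthesis would produce; it does not model masks or substrate geometry, and the helper name `expand` is an assumption made for this example.

```python
from itertools import product

BASES = "ACGT"


def expand(rounds: int) -> list[str]:
    """Expand (A+C+G+T)^rounds into the list of all sequences of that length."""
    return ["".join(p) for p in product(BASES, repeat=rounds)]


# Number of distinct products after rounds 1 through 4.
print([len(expand(n)) for n in range(1, 5)])  # [4, 16, 64, 256]

tetranucleotides = expand(4)
print(tetranucleotides[:4])  # ['AAAA', 'AAAC', 'AAAG', 'AAAT']
```

Each round multiplies the number of distinct products by four, which is why four rounds of four coupling cycles (sixteen cycles in total) suffice for all 256 tetranucleotides.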
The application of an array of 256 probes synthesized by light-directed
combinatorial synthesis to generate a probe matrix is illustrated herein. The
polynomial for this synthesis is given by: 3'-CG(A+G+C+T)^4CG. All possible
tetranucleotides were synthesized flanked by CG at the 3'- and 5'-ends.
Hybridization of target 5'-GCGGCGGC-fluorescein to this array at 15°C
correctly
yielded the S-3'-CGCCGCCG complementary probe as the most intense position
(2,698 counts). Significant intensity was also observed for the following
mismatches: S-3'-CGCAGCCG (554 counts), S-3'-CGCCGACG (317 counts), S-
3'-CGCCGTCG (272 counts), S-3'-CGACGCCG (242 counts), S-3'-CGTCGCCG
(203 counts), S-3'-CGCCCCCG (180 counts), S-3'-CGCTGCCG (163 counts), S-
3'-CGCCACCG (125 counts), and S-3'-CGCCTCCG (78 counts).
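As a cross-check of the counts listed above, the following Python sketch derives the perfectly matched probe from the target and reports where each listed probe differs from it. The counts are copied from the text; the function names are illustrative assumptions.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def matched_probe(target_5to3: str) -> str:
    """Probe written 3'->5' that pairs base-for-base with a target written 5'->3'."""
    return "".join(COMPLEMENT[b] for b in target_5to3)


def mismatch_positions(probe: str, reference: str) -> list[int]:
    """1-based positions (counted from the 3'-end) where probe differs from reference."""
    return [i + 1 for i, (p, r) in enumerate(zip(probe, reference)) if p != r]


perfect = matched_probe("GCGGCGGC")

# Probe sequences (3'->5') and fluorescence counts as reported in the text.
observed = {
    "CGCCGCCG": 2698, "CGCAGCCG": 554, "CGCCGACG": 317, "CGCCGTCG": 272,
    "CGACGCCG": 242, "CGTCGCCG": 203, "CGCCCCCG": 180, "CGCTGCCG": 163,
    "CGCCACCG": 125, "CGCCTCCG": 78,
}

for probe, counts in sorted(observed.items(), key=lambda kv: -kv[1]):
    print(probe, counts, mismatch_positions(probe, perfect))
```

Under this reading, the most intense position is the exact complement, and every other listed probe differs from it at a single position within the variable tetranucleotide.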
1.4.3.3 Mismatch Analysis
1.4.3.3.1 Arrays Used to Determine the Gene Sequence of Oligos of Length "n"
Using Array of Probes of Shorter Length "k"
The arrays discussed herein can be utilized in the present method to determine
the nucleic acid sequence of an oligonucleotide of length n using an array of
probes of
shorter length k. In a simple example, the target has a sequence 5'-XXYXY-3',
where
X and Y are complementary nucleic acids such as A and T or C and G. For
discussion
purposes, the example is simplified by using only two bases and very short
sequences,
but the technique can easily be extended to larger nucleic acids with, for
example, all
4 RNA or DNA bases.
The sequence of the target is, generally, not known ab initio. One can
determine the sequence of the target using the present method with an array of
shorter
probes. In this example, an array of all possible X and Y 4-mers is
synthesized and
then used to determine the sequence of a 5-mer target.
Initially, a "core" probe is identified. The core probe is exactly
complementary
to a sequence in the target using the mismatch analysis method of the present
invention. The core probe is identified using one or both of the following
criteria:
1. The core probe exhibits stronger binding affinity to the target than other
probes, typically the strongest binding affinity of any probe in the array
(that has not
been identified as a core probe in a previous cycle of analysis).
2. Probes that are mismatched with the target, as compared to the core probe
sequence, exhibit a characteristic pattern, discussed in greater detail below,
in which
probes that mismatch at the 3'- and 5'-end of the probe bind more strongly to
the
target than probes that mismatch at interior positions.
In this particular example, selection criterion #1 identifies a core 4-mer
probe
with the strongest binding affinity to the target that has the sequence 3'-
YYXY. The
probe 3'-YYXY (corresponding to the 5'-XXYX position of the target) is,
therefore,
chosen as the "core" probe.
Selection criterion #2 is utilized as a "check" to ensure the core probe is
exactly
complementary to the target nucleic acid.
The second selection criterion evaluates hybridization data (such as the
fluorescence intensity of a labeled target hybridized to an array of probes on
a
substrate, although other techniques are well known to those of skill in the
art) of
probes that have single base mismatches as compared to the core probe. In this
particular case, the core probe has been selected as S-3'-YYXY. The single
base
mismatched probes of this core probe are: S-3'-XYXY, S-3'-YXXY, S-3'-YYYY, and
S-3'-YYXX. The binding affinity characteristics of these single base
mismatches are
utilized to ensure that a "correct" core has been selected, or to select the
core probe
from among a set of probes exhibiting similar binding affinities.
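The set of single base mismatched probes can be generated mechanically. The sketch below, in Python, assumes the simplified two-letter alphabet used in this example; the function name is hypothetical.

```python
def single_base_mismatches(core: str, alphabet: str = "XY") -> list[str]:
    """Return every probe that differs from the core probe at exactly one position."""
    variants = []
    for i, base in enumerate(core):
        for alt in alphabet:
            if alt != base:
                variants.append(core[:i] + alt + core[i + 1:])
    return variants


print(single_base_mismatches("YYXY"))  # ['XYXY', 'YXXY', 'YYYY', 'YYXX']
```

With a four-letter alphabet such as "ACGT", the same function would return three mismatched probes per position instead of one.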
1.4.3.3.2 Binding Affinity vs. Mismatch Position
An illustrative, hypothetical plot of expected binding affinity versus
mismatch
position is provided herein. The binding affinity values (typically
fluorescence
intensity of labeled target hybridized to probe, although many other factors
relating to
affinity may be utilized) are all normalized to the binding affinity of S-3'-
YYXY to
the target, which is plotted as a value of 1. Because only two nucleotides are
involved
in this example, the value plotted for a probe mismatched at position 1 (the
nucleotide
at the 3'-end of the probe) is the normalized binding affinity of S-3'-XYXY.
The value
plotted for mismatch at position 2 is the normalized affinity of S-3'-YXXY.
The value
plotted for mismatch at position 3 is the normalized affinity of S-3'-YYYY,
and the
value plotted for mismatch position 4 is the normalized affinity of S-3'-YYXX.
As
noted above, "affinity" may be measured in a number of ways including, for
example,
the number of photon counts from fluorescence markers on the target.
The affinity of all three mismatches is lower than the core in this
illustration.
Moreover, the affinity plot shows that a mismatch at the 3'-end of the probe
has less
impact than a mismatch at the 5'-end of the probe in this particular case,
although this
may not always be the case. Further, mismatches at the end of the probe result
in less
disturbance than mismatches at the center of the probe. These features, which
result in
a "smile" shaped graph when plotted, will be found in most plots of single
base
mismatch after selection of a "correct" core probe, or after accounting for a
mismatched probe that is a core probe with respect to another portion of the
target
sequence. This information will be utilized in either selecting the core probe
initially
or in checking to ensure that an exactly matched core probe has been selected.
Of course, in certain situations, as noted in the section above,
identification
of a core is all that is required such as in, for example, forensic or genetic
studies, and
the like.
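The "smile" check can be expressed as a simple test on the normalized affinities. The Python sketch below is one possible formalization, not a method stated in the text: it assumes that a smile means every mismatched probe binds more weakly than the core and that both end mismatches bind more strongly than any interior mismatch. The numeric profiles are hypothetical.

```python
def looks_like_smile(normalized: list[float]) -> bool:
    """normalized[i] is the affinity of the probe mismatched at position i+1,
    divided by the affinity of the core probe (so the core itself is 1.0)."""
    if len(normalized) < 3:
        return False  # no interior positions to compare against
    ends = (normalized[0], normalized[-1])
    interior = normalized[1:-1]
    all_below_core = max(normalized) < 1.0
    ends_above_interior = min(ends) > max(interior)
    return all_below_core and ends_above_interior


# Hypothetical profiles for mismatch positions 1 to 4 of a 4-mer probe.
smile_profile = [0.6, 0.2, 0.15, 0.4]
flat_profile = [0.3, 0.5, 0.6, 0.2]

print(looks_like_smile(smile_profile))  # True
print(looks_like_smile(flat_profile))   # False
```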
In sequencing studies, this process is then repeated for left and/or right
extensions of the core probe. In one example, only right extensions of the
core probe
are possible. The possible 4-mer extension probes of the core probe are
3'-YXYY and 3'-YXYX. Again, the same selection criteria are utilized. Between
3'-YXYY and 3'-YXYX, it would normally be found that 3'-YXYX would have the
strongest binding
affinity, and this probe is selected as the correct probe extension. This
selection may
be confirmed by again plotting the normalized binding affinity of probes with
single
base mismatches as compared to the core probe.
When a hypothetical plot is illustrated, again, the characteristic "smile"
pattern
is observed, indicating that the "correct" extension has been selected, i.e.,
3'-YXYX.
From this information, one would correctly conclude that the sequence of the
target is
5'-XXYXY.
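The core selection and right extension just described can be sketched in Python. The affinity values below are hypothetical and chosen only so that the example follows the narrative; the sketch handles right extensions only, as in the example, and ignores the mismatch check of criterion #2.

```python
COMPLEMENT = {"X": "Y", "Y": "X"}

# Hypothetical normalized affinities for some 4-mer probes (written 3'->5').
affinity = {
    "YYXY": 1.00,  # exact match to target positions 1-4
    "YXYX": 0.90,  # exact match to target positions 2-5
    "YXYY": 0.30,  # incorrect extension candidate
    "XYXY": 0.60,
    "YYXX": 0.40,
}


def extend_right(core: str, affinity: dict, target_length: int, alphabet: str = "XY") -> str:
    """Grow the probe strand one base at a time, keeping the stronger-binding candidate."""
    strand = core
    k = len(core)  # assumes probes of length 2 or more
    while len(strand) < target_length:
        tail = strand[-(k - 1):]
        candidates = [tail + b for b in alphabet]
        best = max(candidates, key=lambda p: affinity.get(p, 0.0))
        strand += best[-1]
    return strand


core = max(affinity, key=affinity.get)            # criterion #1: strongest binder
probe_strand = extend_right(core, affinity, 5)    # 3'-YYXYX
target = "".join(COMPLEMENT[b] for b in probe_strand)
print(core, probe_strand, target)  # YYXY YYXYX XXYXY
```

The deduced probe strand 3'-YYXYX is complementary to the target 5'-XXYXY, in agreement with the conclusion in the text.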
1.4.4 A Method for Sequencing Genomes
In one embodiment, a method is described for sequencing genomes that is
comprised of the steps:
(1) Obtaining a clone library to be sequenced and mapped;
(2) Preparing DNA from individual clones in the clone library for comparison
experiments;
(3) Obtaining a long-range probe library relative to the clone library;
(4) Preparing DNA from members of the long-range probe library for comparison
experiments;
(5) Comparing DNA from the clone library with DNA from the long-range probe
library;
(6) Producing a clone library characterized by long-range probes;
(7) Obtaining a bin probe library suitable for positioning the DNA sequences
of
long-range probes relative to the genome;
(8) Comparing DNA from the bin probe library with DNA from the long-range
probe library;
(9) Producing a long-range probe library whose DNA sequences have been
characterized by binning information relative to the genome;
(10) Combining the clone vs. long-range probe characterization from step 6,
together with the long-range probe vs. genome binning characterization from
step 9;
(11) Producing a binning of the clone library;
(12) Obtaining a short-range probe library relative to the clone library;
(13) Comparing DNA from the clone library with DNA from the short-range
probe library;
(14) Producing a clone library characterized by short-range probes;
(15) Combining the long-range binning of the clone library, together with the
short-range probing of the clone library from;
(16) Producing a contig of the clone library which bins and orders clones
relative
to the genome;
(17) Forming a tiling path of clones that span genome regions;
(18) Determining the sequence of said clones, and of the entire genome.
1.4.4.1 Obtaining a clone library to be sequenced and mapped.
The clones may be comprised of large-sized clones that have genomic inserts
greater than 250 kb (e.g., YACs), medium-sized clones that have genomic
inserts
greater than 50 kb, but less than 250 kb (e.g., PACs, BACs, P1s, or YACs), or
small-sized clones that have genomic inserts less than 50 kb (e.g., cosmids,
plasmids, phage,
phagemids, or cDNAs). In the preferred embodiment, the clone library has at
least
two-fold redundancy relative to the genome. The technology for constructing
these
clones is well described (F. M. Ausubel, R. Brent, R. E. Kingston, D. D.
Moore, J. G.
Seidman, J. A. Smith, and K. Struhl, ed., Current Protocols in Molecular
Biology.
New York, N.Y.: John Wiley and Sons, 1995; N. J. Dracopoli, J. L. Haines, B.
R.
Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D. Smith,
ed.,
Current Protocols in Human Genetics. New York: John Wiley and Sons, 1995; J.
Sambrook, E. F. Fritsch, and T. Maniatis, Molecular Cloning, Second Edition.
Plainview, N.Y.: Cold Spring Harbor Press, 1989), incorporated by reference.
Chromosome-specific cosmid clones are available from Los Alamos National
Laboratories (Los Alamos, N.Mex.), genome-wide PAC clones from Pieter de Jong
(Roswell Park, Buffalo, N.Y.), and the Genethon YAC libraries from the
national
genome center GESTECs, including the Whitehead Institute (Cambridge, Mass.).
Libraries are also provided by commercial vendors, including cDNA libraries
(ATCC,
Rockville, Md.), P1 libraries (DuPont/Merck Pharmaceuticals, Glenolden, Pa.),
BAC
libraries (Research Genetics, Huntsville, Ala.), and cDNAs and other genome-
wide
resources (BIOS Labs, New Haven, Conn.).
1.4.4.1.1 Preparing DNA from individual clones in the clone library for
comparison experiments.
In the preferred embodiment, DNA from the clones is prepared for DNA
hybridization experiments. For DNA derived from bacterial clones (cosmids,
PACs,
etc.), two straightforward protocols are: (a) growing up colonies for each
clone, and
then lysing the bacterial cells to expose the cloned insert DNA, or (b)
specifically
extracting the DNA material from the clone using DNA prep such as an ion
exchange
column (Qiagen, Chatsworth, Calif.). When using vectors with more complex
genomes (e.g., yeast cells), a species-specific DNA prep (e.g., Alu-PCR or IRE-
bubble PCR) is preferred. This DNA from each clone is then gridded onto nylon
membranes such as Hybond N+ (Amersham, Arlington Heights, Ill.) to prepare for
subsequent DNA hybridization experiments (Hybond N+ product protocol, ver. 2),
incorporated by reference.
1.4.4.1.2 Obtaining a long-range probe library relative to the clone library.
The preferred long-range multiplexed probe is the radiation hybrid (RH) (D.
R. Cox, M. Burmeister, E. R. Price, S. Kim, and R. M. Myers, "Radiation hybrid
mapping: a somatic cell genetic method for constructing high-resolution maps
of
mammalian chromosomes," Science, vol. 250, pp. 245-250, 1990; S. J. Goss and
H.
Harris, "New method for mapping genes in human chromosomes," Nature, vol. 255,
pp. 680-684, 1975; S. J. Goss and H. Harris, "Gene transfer by means of cell
fusion:
statistical mapping of the human X-chromosome by analysis of radiation-induced
gene segregation," J. Cell. Sci., vol. 25, pp. 17-37, 1977), incorporated by
reference.
Chromosome-specific RH libraries have been constructed for other human
chromosomes (M. R. James, C. W. Richard III, J.-J. Schott, C. Yousry, K.
Clark, J.
Bell, J. Hazan, C. Dubay, A. Vignal, M. Agrapart, T. Imai, Y. Nakamura, M.
Polymeropoulos, J. Weissenbach, D. R. Cox, and G. M. Lathrop, "A radiation
hybrid
map of 506 STS markers spanning human chromosome 11," Nature Genetics, vol. 8,
no. 1, pp. 70-76, 1994; S. H. Shaw, J. E. W. Farr, B. A. Thiel, T. C. Matise,
J.
Weissenbach, A. Chakravarti, and C. W. Richard, "A radiation hybrid map of 95
STSs spanning human chromosome 13q," Genomics, vol. 27, no. 3, pp. 502-510,
1995; U. Francke, E. Chang, K. Comeau, E.-M. Geigl, J. Giacalone, X. Li, J.
Luna, A.
Moon, S. Welch, and P. Wilgenbus, "A radiation hybrid map of human chromosome
18," Cytogenet. Cell Genet., vol. 66, pp. 196-213, 1994), incorporated by
reference.
Whole-genome RHs (WG-RHs) for humans and other mammalian genomes have also
been developed (M. A. Walter, D. J. Spillett, P. Thomas, J. Weissenbach, and
P. N.
Goodfellow, "A method for constructing radiation hybrid maps of whole
genomes,"
Nature Genet., vol. 7, no. 1, pp. 22-28, 1994), incorporated by reference,
including the
high-energy Stanford set (David Cox, Stanford, Calif.) and the low-energy
Genethon
set; the DNAs from both WG-RH sets are available (Research Genetics,
Huntsville,
Ala.).
There are alternative embodiments that can construct long-range multiplexed
probes. One alternative embodiment is the use of rare cutter restriction
enzymes (e.g.,
NotI partial digests) to develop large DNA sequences from genomes. These
fragments can be purified using pulsed-field gel electrophoresis (D. C.
Schwartz and
C. R. Cantor, "Separation of yeast chromosome-sized DNAs by pulsed field
gradient
gel electrophoresis," Cell, vol. 37, pp. 67-75, 1984), incorporated by
reference, and
then selectively pooled. A second alternative embodiment is the use of a
second clone
library that has a larger average insert size than the first clone library in
step 1.
Subsets of these larger insert clones can be pooled together to form a long-
range
probe library (relative to the first clone library). A third alternative
embodiment which
is particularly useful in animal models is the use of genetically inbred
strains. With an
F1 backcross between strains A and B, the meiotic events produce an
interleaving of
large chromosomal fragments of strains A and B. A subtractive hybridization
can
selectively remove the DNA from strain B, leaving behind just the large
chromosomal
regions of strain A for each backcross individual. This procedure constructs a
long-
range probe library (relative to the strain A clone library). The subtractive
hybridization can be performed by first digesting the backcross individual
genome
with restriction enzymes, and then using whole genome DNA from strain B bound
to
solid support to selectively remove the strain B DNA.
1.4.4.1.3 Preparing DNA from members of the long-range probe library for
comparison experiments.
The long-range probe DNA often resides in a complex background genome. In
the RH embodiment, the background is murine genome, while in the pooled YAC
embodiment, the background is the yeast genome. Therefore, the DNA
preparations
for these long-range probe embodiments preferably use a species-specific DNA
extraction and amplification. The particular assay often depends on the clone
library
used.
When the clonal inserts reside in a complex background genome, such as
YACs, inter-Alu hybridization is the preferred approach in step 5. In this
case, Alu-
PCR preparation of the long-range probes (M. T. Ross and V. P. J. Stanton,
"Screening large-insert libraries by hybridization," in Current Protocols in
Human
Genetics, vol. 1, N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C.
E.
Seidman, J. G. Seidman, D. T. Moir, and D. Smith, ed. New York: John Wiley and
Sons, 1995, pp. 5.6.1-5.6.34), incorporated by reference, is the preferred
embodiment.
An alternative embodiment when background hybridization noise may be greater
is
IRE-bubble PCR (D. J. Munroe, M. Haas, E. Bric, T. Whitton, H. Aburatani, K.
Hunter, D. Ward, and D. E. Housman, "IRE-bubble PCR: a rapid method for
efficient
and representative amplification of human genomic DNA sequences from complex
sources," Genomics, vol. 19, no. 3, pp. 506-14, 1994), incorporated by
reference.
When the clonal inserts are sufficiently large to contain inter-Alu regions,
and
the vector genome is not complex (e.g., bacterial), then IRE-bubble PCR is the
preferred embodiment. This situation applies to many clone libraries,
including
cosmids, PACs, BACs, and P1s.
When the clonal inserts are too small to contain inter-Alu subsequences
detectable by hybridization (such as cDNAs), an assay that provides for more
uniform
DNA expression from the long-range probes may be needed. The most preferred
embodiment is then to use a multiplicity of restriction enzyme digests, each
followed
by long PCR between Alu repeats, and to then pool the PCR products to
construct a
probe. A second approach is a variation on direct selection (M. Lovett, J.
Kere, and L.
M. Hinton, "Direct selection: a method for the isolation of cDNAs encoded by
large
genomic regions," Proc. Natl. Acad. Sci. U.S.A., vol. 88, pp. 9628-9632,
1991),
incorporated by reference. In this approach, Lovett's cDNAs are replaced by a
full
restriction digest with a frequent-cutter of the long-range probe DNA, and
Lovett's
genomic contig is replaced with repetitive DNA (e.g., Alu or Cot-1) that
selects for
the same genome as the species-specific long-range probe. The result is a PCR
amplification (via the end priming sites) of the long-range probe that is
species
specific (via the Alu selection).
The species-specific DNA is then amplified and labeled for use as a
hybridization probe. In the preferred embodiment, this amplification and
labeling is
performed using a labeled dNTP with the random primer method (A. P. Feinberg
and
B. Vogelstein, "A technique for radiolabeling DNA restriction endonuclease
fragments to high specific activity," Analyt. Biochem., vol. 132, pp. 6-13,
1983; N.J.
Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G.
Seidman, D.
T. Moir, and D. Smith, ed., Current Protocols in Human Genetics. New York:
John
Wiley and Sons, 1995), incorporated by reference. In one embodiment, 32P-dNTP
is
incorporated into a random primer PCR amplification, possibly using a kit such
as the
DECprime II DNA labeling kit (Ambion, Austin, Tex.). Other isotopes such as
35S or
33P can be used. In alternative embodiments, nonisotopic labeling is performed
(L. J.
Kricka, ed., Nonisotopic Probing, Blotting, and Sequencing, Second Edition.
San
Diego, Calif.: Academic Press, 1995), incorporated by reference.
1.4.4.1.4 Comparing DNA from the clone library with DNA from the long-range
probe library.
The labeled long-range probe DNA is hybridized against the gridded clone
library (A. P. Monaco, V. M. S. Lam, G. Zehetner, G. G. Lennon, C. Douglas, D.
Nizetic, P. N. Goodfellow, and H. Lehrach, "Mapping irradiation hybrids to
cosmid
and yeast artificial chromosome libraries by direct hybridization of Alu-PCR
products," Nucleic Acids Res., vol. 19, no. 12, pp. 3315-3318, 1991),
incorporated by
reference. In an alternative embodiment, the roles of the long-range probe
library and
the clone library are reversed, with the long-range probe immobilized on the
membrane and the label on the clone.
The hybridization comparison is done by preannealing the probe with 25 ng of
Cot-1 DNA (Gibco-BRL, Grand Island, N.Y.) for 2 hours at 37°C. before
adding to
the prehybridization mix. The nylon filters containing the spotted clone DNA
are then
prehybridized overnight per manufacturer's instructions (Amersham, Arlington
Heights, Ill.), except for the addition of sheared, denatured human placental
DNA at a
final concentration of 50 ng/ml. Filters are hybridized overnight at
68°C., washed
three times with a final wash of 0.1x SSPE/0.1% SDS at 72°C, before
exposing to
autoradiographic film for 1 to 8 days. The exposed film image is then
electronically
scanned into a computer with memory. A phosphorimager (Molecular Dynamics,
Sunnyvale, Calif.) or other electronic device can be used for imaging without
the use
of film.
For every RH hybridization probing, each of the clone positions on the
autoradiographs of the gridded filters is scored on a numerical scale, such
as 1-5,
with 1 negative, 2 equivocal, 3 weakly positive, 4 positive, and 5 strongly
positive.
When duplicate typings are available, the maximum of the two scores is used,
since
there is a very high false-negative rate in the hybridization data. This data
entry can be
facilitated by use of an interactive computer program that presents the
electronic
image of the filter on a computer display, or by automated computer
interpretation of
the scanned image.
1.4.4.2 Producing a clone library characterized by long-range probes.
The hybridization experiments construct a table of scores that compare the
DNA from clones against DNA from long-range probes for detectable sequence
similarity, and thus presumed genomic colocalization. The scores are rescaled
so that
the new scaling is approximately linear (C. C. Clogg and E. S. Shihadeh,
Statistical
Models for Ordinal Variables. Thousand Oaks, Calif.: Sage Press, 1994),
incorporated
by reference. That is, a unit increase in the scaling indicates a unit
increase in the
confidence one holds that the clone actually hybridized with the long-range
probe. An
equivocal event is scored as a 0, since it was equally likely to be negative
or positive.
A negative event is scored as -1, since there is high confidence that no
observable
hybridization has occurred; both positive and strongly positive events are
scored as 1,
since there is certainty that a hybridization event has occurred. A weakly
positive
event can be scored at 0.67 when a single typing is available, since there is
considerably more confidence that it is positive than negative, and is
considered
equivocal when duplicate typings were available. For any scale used, the data
is
scored in a manner determined by the laboratory investigator and data analyst.
This
rescaled clone vs. probe comparison table A is stored in the memory of a
computational device.
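The rescaling described above can be sketched in Python as follows. This is a minimal illustration of the example scale given in the text (raw scores 1 to 5, with the maximum of duplicate typings taken first); the function name and the small input table are hypothetical.

```python
def rescale(typings: list[int]) -> float:
    """Convert one or two raw 1-5 typings of a clone/probe pair to the rescaled score."""
    if not typings or any(t not in (1, 2, 3, 4, 5) for t in typings):
        raise ValueError("typings must be a non-empty list of scores from 1 to 5")
    raw = max(typings)              # maximum of duplicate typings
    duplicated = len(typings) > 1
    if raw == 1:
        return -1.0                 # negative
    if raw == 2:
        return 0.0                  # equivocal
    if raw == 3:
        # weakly positive: 0.67 for a single typing, equivocal when duplicated
        return 0.0 if duplicated else 0.67
    return 1.0                      # positive or strongly positive


# Hypothetical raw typings: rows are clones, columns are long-range probes.
raw_table = [
    [[1], [3], [5, 2]],
    [[3, 1], [2], [4]],
]
table_a = [[rescale(cell) for cell in row] for row in raw_table]
print(table_a)  # [[-1.0, 0.67, 1.0], [0.0, 0.0, 1.0]]
```

As the text notes, the actual scale is a choice made by the laboratory investigator and data analyst, so the constants above should be read as one example.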
With perfectly clean comparison data (i.e., very low false negative and false
positive rates), this table A might suffice for ordering the clones using
conventional
RH mapping methods. However, the high-throughput hybridization experiments
incur
a large noise cost. Therefore, some correction data is required to accurately
map the
clones. This correction stage is performed in the following steps.
1.4.4.2.1 Obtaining a bin probe library suitable for positioning the DNA
sequences of long-range probes relative to the genome.
In the preferred embodiment, the bin probe library is comprised of sequence-
tagged sites (STSs). For positional cloning applications, many of the STSs are
preferably made polymorphic. The genetic or physical markers to be used for
each
STS are obtained as PCR primer sequence pairs and PCR reaction conditions
from
available Internet databases (Genbank, Bethesda, Md.; GDB, Baltimore, Md.;
EMBL,
Cambridge, UK; Genethon, Evry, France; Stanford Genome Center, Stanford,
Calif.;
Whitehead Institute Genome Center, Cambridge, MA; G. Gyapay, J. Morissette, A.
Vignal, C. Dib, C. Fizames, P. Millasseau, S. Marc, G. Bernardi, M. Lathrop,
and J.
Weissenbach, "The 1993-94 Genethon Human Genetic Linkage Map," Nature
Genetics, vol. 7, no. 2, pp. 246-339, 1994; Hilliard, Davison, Doolittle, and
Roderick,
Jackson laboratory mouse genome database, Bar Harbor, Me.; MapPairs, Research
Genetics, Huntsville, Ala.), incorporated by reference. Alternatively, STSs
can be
constructed using existing techniques (Sambrook, J., Fritsch, E. F., and
Maniatis, T.
1989. Molecular Cloning, second edition. Plainview, N.Y.: Cold Spring Harbor
Press;
N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman, J. G.
Seidman,
D. T. Moir, and D. Smith, ed., Current Protocols in Human Genetics. New York:
John
Wiley and Sons, 1995), incorporated by reference.
In a first alternative embodiment, the locations of the long-range probe
fragments are localized on the genome by fluorescence in situ hybridization
(FISH)
studies. In these FISH studies, the nuclear DNA of the genome serves as the
bin
probe. In a second alternative embodiment, the binning is effected by
comparison
with previously positioned DNA probes, including mapped clone libraries, ESTs,
or
PCR primers.
1.4.4.2.2 Comparing DNA from the bin probe library with DNA from the long-
range probe library.
In the preferred embodiment, PCR amplifications are carried out between the
STSs in the bin probe library and the RH (or other) DNAs in the long-range
probe
library. Subsequent detection for presence or absence of PCR products (+/-
scores) is
carried out either by gel electrophoresis or by internal oligonucleotide
hybridizations.
The orders of the STSs relative to the genome are then determined using
computational or statistical methods (M. Boehnke, "Radiation hybrid mapping by
minimization of the number of obligate chromosome breaks," Genetic Analysis
Workshop 7: Issues in Gene Mapping and the Detection of Major Genes. Cytogenet
Cell Genet, vol. 59, pp. 96-98, 1992; M. Boehnke, K. Lange, and D. R. Cox,
"Statistical methods for multipoint radiation hybrid mapping," Am. J. Hum.
Genet.,
vol. 49, pp. 1174-1188, 1991; A. Chakravarti and J. E. Reefer, "A theory for
radiation
hybrid (Goss-Harris) mapping: application to proximal 21q markers," Genetic
Analysis Workshop 7: Issues in Gene Mapping and the Detection of Major Genes.
Cytogenet Cell Genet, vol. 59, pp. 99-101, 1992), incorporated by reference.
Physical
distances are then computed using maximum likelihood estimation.
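One of the ordering criteria cited above, minimization of obligate chromosome breaks, can be illustrated with a small Python sketch. It counts, for a candidate STS order, how often the +/- score changes between adjacent STSs within each hybrid, and picks the order with the fewest changes. The data are hypothetical, missing typings are not handled, and the exhaustive search is only practical for a handful of markers.

```python
from itertools import permutations


def obligate_breaks(order, hybrids) -> int:
    """Sum over hybrids of the number of +/- transitions along the given STS order."""
    total = 0
    for scores in hybrids:
        states = [scores[sts] for sts in order]
        total += sum(a != b for a, b in zip(states, states[1:]))
    return total


def best_order(markers, hybrids):
    """Exhaustively search all orders for one with the fewest obligate breaks."""
    return min(permutations(markers), key=lambda o: obligate_breaks(o, hybrids))


# Hypothetical PCR results: True means the STS was amplified from that hybrid.
hybrids = [
    {"A": True, "B": True, "C": False},
    {"A": False, "B": True, "C": True},
    {"A": True, "B": False, "C": False},
    {"A": False, "B": False, "C": True},
]

print(obligate_breaks(("A", "B", "C"), hybrids))  # 4
print(obligate_breaks(("A", "C", "B"), hybrids))  # 6
print(best_order("ABC", hybrids))                 # ('A', 'B', 'C')
```

An order and its reverse always give the same count, so the result is determined only up to orientation.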
In the first alternative FISH embodiment of step 7, DNA from the long-range
probes (e.g., species-specific PCR products) is fluorescently labeled, and
then
hybridized back onto the genome. The fragment positions on the genome of the
probes are then visualized using fluorescent microscopic imaging. Linear
fractional
length measurements on the metaphase spreads of chromosomes are then performed
to determine the bin positions of the fragments. In the second alternative
embodiment
of step 7, DNA from the previously positioned bin probes is hybridized to DNA
from
the long-range probes.
Detailed protocols for these methods have been described (F. M. Ausubel, R.
Brent, R. E. Kingston, D. D. Moore, J. G. Seidman, J. A. Smith, and K. Struhl,
ed.,
Current Protocols in Molecular Biology. New York, N.Y.: John Wiley and Sons,
1995; N. J. Dracopoli, J. L. Haines, B. R. Korf, C. C. Morton, C. E. Seidman,
J. G.
Seidman, D. T. Moir, and D. Smith, ed., Current Protocols in Human Genetics.
New
York: John Wiley and Sons, 1995), incorporated by reference.
1.4.4.2.3 Producing a long-range probe library whose DNA sequences have been
characterized by binning information relative to the genome.
The procedures produce a data table which compares the DNA content of the
long-range probes to bins on the genome. In the preferred embodiment, this is
a table
B of long-range probes (the rows of B) vs. ordered STSs (the columns of B).
The
pairwise distance information between the ordered STSs is also recorded. In
alternative embodiments, the table can be arranged similarly.
Knowledge of the genomic positions of the RH fragments enables the desired
correction of noisy RH hybridization data, as described next.
1.4.4.3 Producing a binning of the clone library.
The procedures of step 10 produce a table which bins each clone relative to
the
genome. In the preferred embodiment, this is a table C of clones (the rows of
C) vs.
ordered bins (the columns of C). Each entry in the table describes the
confidence that
the clone is located in the bin.
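This passage does not spell out how table A (clones vs. long-range probes) and table B (long-range probes vs. ordered STSs) are combined into table C. One simple possibility, assumed here purely for illustration, is a matrix product in which each clone/bin entry sums the agreement between the clone's hybridization scores and the probes' bin scores. The Python sketch below uses hypothetical scores of -1, 0, and 1.

```python
def combine_tables(table_a, table_b):
    """Illustrative combination C = A x B (an assumption, not stated in the text).

    table_a: rows are clones, columns are long-range probes.
    table_b: rows are long-range probes, columns are ordered bins (STSs).
    """
    n_probes = len(table_b)
    n_bins = len(table_b[0])
    return [
        [sum(row[p] * table_b[p][b] for p in range(n_probes)) for b in range(n_bins)]
        for row in table_a
    ]


# Hypothetical data: 2 clones, 3 long-range probes, 3 bins.
table_a = [[1, -1, 0],
           [-1, 1, 1]]
table_b = [[1, 1, -1],
           [-1, 1, 1],
           [1, -1, -1]]

table_c = combine_tables(table_a, table_b)
print(table_c)  # [[2, 0, -2], [-1, -1, 1]]
```

In this toy example the largest entry in each row suggests bin 1 for the first clone and bin 3 for the second; how such values are converted into confidences is left open here.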
Note that this result C is a binning of clones, not a contig. To form the
desired
set of mapped overlapping clones, a short-range probing is preferably
performed.
This probing and contig formation is performed in the following steps.
1.4.4.3.1 Obtaining a short-range probe library relative to the clone library.
Since current clone mapping technology is based on short-range probing, there
is a large number of workable approaches. The preferred embodiment uses
hybridization assays based on oligonucleotide probes. The design of such
experiments
has been described (A. J. Cuticchia, J. Arnold, and W. E. Timberlake, "PCAP:
probe
choice and analysis package, a set of programs to aid in choosing synthetic
oligomers
for contig mapping," CABIOS, vol. 9, no. 2, pp. 201-203, 1992; Y.-X. Fu, W. E.
Timberlake, and J. Arnold, "On the design of genome mapping experiments using
short synthetic oligonucleotides," Biometrics, vol. 48, pp. 337-359, 1992; H.
Lehrach,
R. Drmanac, J. Hoheisel, Z. Larin, G. Lennon, A. P. Monaco, D. Nizetic, G.
Zehetner,
and A. Poustka, "Hybridization fingerprinting in genome mapping and
sequencing,"
in Genetic and Physical Mapping I: Genome Analysis, K. E. Davies and S. M.
Tilghman, ed. Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory, 1990,
pp.
39-81; A. Poustka, T. Pohl, D. P. Barlow, G. Zehetner, A. Craig, F. Michiels,
E.
Erlich, A.-M. Frischauf, and H. Lehrach, "Molecular approaches to mammalian
genetics," in Cold Spring Harbor Symp. Quant. Biol., vol. 51. 1986, pp. 131-
139),
incorporated by reference.
An efficient design produces 25 to 200 small (preferably 5 bp-15 bp)
oligonucleotides which each hybridize, on average, to 5%-95% of the clones.
The
oligonucleotide sequences are generally designed to preferentially detect
sequences
that are related to the genes in the genome, rather than to repetitive
elements in the
genome or to the cloning vector. This selective bias can be achieved either by
experimental probings, or by examination of the sequences to be compared. Once
designed, these oligonucleotides are preferably ordered from a DNA synthesis
service (Research Genetics, Huntsville, Ala.). Alternatively, they can be
synthesized
on a DNA synthesizer (Applied Biosystems, Foster City, Calif.).
Alternative hybridization embodiments include using clones (or their PCR
products) to probe clone libraries, using pools of clones as hybridization
probes, and
using Southern blotting of digested clones with repetitive element
hybridization
probes. Enzymatic methods include gel electrophoresis of restriction
endonuclease
digests of clones, PCR-based STS comparisons, and hybrid methods such as Alu
fingerprinting. Other short-range probes can be formed by selective or random
retention of fragments produced by genome cutting.
For experimental efficiency, many of these short-range probes work in a
multiplexed way, and probe one or more genome regions simultaneously. These
probes include oligonucleotides, pooled clones, and repetitive-element
fingerprint
probes.
1.4.4.3.2 Comparing DNA from the clone library with DNA from the short-range
probe library.
This is done by comparison experiments using standard protocols. In the
preferred embodiment, DNA from the clones in the clone library is spotted onto
nylon
membranes. This DNA is comprised of lysed colonies, DNA preps, or species-
specific PCR products. The membranes are then prepared for hybridization. Each
oligonucleotide short-range probe is then labeled, preferably with 32P using
a kinase.
The labeled probe is then hybridized to the membranes, followed by rinsing,
stringent
washing, and autoradiography. The filters may be stripped for subsequent
reuse. The
autoradiograph spots are then scored on a binary or more continuous (e.g., 0-
255)
scale.
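The scoring step above can be illustrated with a minimal sketch that converts 0-255 spot intensities into binary hybridization calls. The function name `binarize`, the threshold of 128, and the sample intensities are assumptions chosen for illustration; the source does not specify a threshold.

```python
def binarize(scores, threshold=128):
    """Convert 0-255 spot intensities to binary hybridization calls."""
    # threshold is a hypothetical cutoff, not a value given in the text
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical intensities for one probe against five spotted clones.
spots = [12, 240, 131, 90, 255]
print(binarize(spots))
```

In practice the cutoff would presumably be calibrated per filter, since background varies between membranes.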
Specific oligonucleotide hybridization protocols for particular clone
libraries
and oligonucleotides have been described (A. G. Craig, D. Nizetic, J. D.
Hoheisel, G.
Zehetner, and H. Lehrach, "Ordering of cosmid clones covering the herpes
simplex
virus type I," Nucleic Acids Res., vol. 18, no. 9, 2653-60, 1990; R. Drmanac,
Z.
Strezoska, I. Labat, S. Drmanac, and R. Crkvenjakov, "Reliable hybridization
of
oligonucleotides as short as six nucleotides," DNA Cell Biol., vol. 9, no. 7,
pp. 527-
534, 1990; J. D. Hoheisel, G. G. Lennon, G. Zehetner, and J. Lehrach, "Use of
high
coverage reference libraries of Drosophila melanogaster for relational
analysis," J.
Mol. Biol., vol. 220, pp. 903- 914, 1991; F. Michiels, A. G. Craig, G.
Zehetner, G. P.
Smith, and H. Lehrach, "Molecular approaches to genome analysis: a strategy
for the
construction of ordered overlapping clone libraries," CABIOS, vol. 3, pp. 203-
210,
1987; D. Nizetic, R. Drmanac, and J. Lehrach, "An improved bacterial colony
lysis
procedure enables direct DNA hybridization using short (10, 11 bases)
oligonucleotides to cosmids," Nucleic Acids Res., vol. 19, pp. 182, 1991),
incorporated by reference.
For alternative short-range probes, the comparison protocols are described
(see
cited references above).
1.4.4.3.3 Producing a clone library characterized by short-range probes.
The comparison experiments of the previous step construct a table D of scores
that compare the DNA from clones against DNA from short-range probes. These
provide measures of genomic colocalization and distance.
In this step, or in the following step 15, contigs can be formed from the
short-
range characterization data of the clones. In the preferred embodiment, each
clone's
score signature relative to the oligonucleotides is compared against other
clones' score
signatures. Pairs of clones having similar score signatures are inferred to be
close, and
their distances can be estimated. The preferred ordering method is simulated
annealing (W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling,
Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge
University Press, 1988), incorporated by reference. Effective contiging
algorithms
have been described (A. J. Cuticchia, J. Arnold, and W. E. Timberlake, "ODS:
ordering DNA sequences, a physical mapping algorithm based on simulated
annealing," CABIOS, vol. 9, no. 2, pp. 215-219, 1992; A. J. Cuticchia, J.
Arnold, and
W. E. Timberlake, "The Use of Simulated Annealing in Chromosome Reconstruction
Experiments Based on Binary Scoring," Genetics, vol. 132, pp. 591-601, 1992;
A.
Milosavljevic, Z. Strezoska, M. Zeremski, D. Grujic, T. Paunesku, and R.
Crkvenjakov, "Clone clustering by hybridization," Genomics, vol. 27, no. 1,
pp. 83-
89, 1995), incorporated by reference.
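The comparison of score signatures can be sketched as follows. This is not the simulated annealing ordering method named above; it is a simpler stand-in that only counts disagreements (Hamming distance) between binary signatures and reports pairs below a cutoff. The table contents, the clone names, and the cutoff are hypothetical.

```python
from itertools import combinations

def hamming(sig_a, sig_b):
    """Count the probes on which two binary score signatures disagree."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

def close_pairs(table, max_distance):
    """Return clone pairs whose signatures differ on at most max_distance probes."""
    pairs = []
    for (name_a, sig_a), (name_b, sig_b) in combinations(sorted(table.items()), 2):
        d = hamming(sig_a, sig_b)
        if d <= max_distance:
            pairs.append((name_a, name_b, d))
    return pairs

# Hypothetical table D: rows are clones, columns are probes (1 = hybridized).
D = {
    "clone1": [1, 1, 0, 0, 1, 0],
    "clone2": [1, 1, 0, 1, 1, 0],
    "clone3": [0, 0, 1, 1, 0, 1],
}
print(close_pairs(D, max_distance=1))
```

A pairwise distance of this kind could serve as the cost term that an ordering search then tries to minimize along the map.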
For alternative short-range probes, the contiging analysis procedures use
analogous comparison data and search procedures, and have been described (D.
O.
Nelson and T. P. Speed, "Statistical issues in constructing high resolution
physical
maps," Statistical Science, vol. 9, no. 3, pp. 334-354, 1994; E. Branscomb, T.
Slezak,
R. Pae, D. Galas, et al., "Optimizing restriction-fragment fingerprinting
methods for
ordering large genomic libraries," Genomics, vol. 8, pp. 351-366, 1990; S. G.
Fisher,
E. Cayanis, J. J. Russo, I. Sunjevaric, B. Boukhgalter, P. Zhang, M.-T. Yu, R.
Rothstein, D. Warburton, I. S. Edelman, and A. Efstratiadis, "Assembly of
ordered
contigs of cosmids selected with YACs of human chromosome 13," Genomics, vol.
21, pp. 525-537, 1994; R. Mott, A. Grigoriev, E. Maier, J. Hoheisel, and H.
Lehrach,
"Algorithms and software tools for ordering clone libraries: application to
the
mapping of the genome of Schizosaccharomyces pombe," Nucleic Acids Research,
vol. 21, no. 8, pp. 1965-1974, 1993), incorporated by reference.
1.4.4.3.4 Forming a tiling path of clones that span genome regions.
From an accurate clone map of a genome, a (not necessarily unique) subset of
clones that cover the genome can be identified. This identification is done by
starting from a leftmost clone A, moving rightward by selecting a neighbor B
which overlaps A, and then iteratively continuing from B. A constraint
can be placed on this process to find tiling paths having small or minimal
length,
where length is defined as the sum of the insert sizes of the component
clones.
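The left-to-right walk described above can be sketched as a greedy selection over map intervals. The clone names and coordinates are hypothetical, and the rule of always taking the overlapping neighbor that reaches farthest right is an assumption: it keeps the number of clones small but is not guaranteed to minimize the summed insert size.

```python
def tiling_path(clones):
    """Greedy left-to-right walk; clones maps name -> (start, end) map coordinates."""
    # Start from the leftmost clone (ties broken in favor of the longer reach).
    current = min(clones, key=lambda n: (clones[n][0], -clones[n][1]))
    path = [current]
    genome_end = max(end for _, end in clones.values())
    while clones[current][1] < genome_end:
        cur_end = clones[current][1]
        # Neighbors B that overlap the current clone A and extend past its right end.
        candidates = [n for n, (s, e) in clones.items() if s < cur_end and e > cur_end]
        if not candidates:
            raise ValueError("gap in clone map after " + current)
        current = max(candidates, key=lambda n: clones[n][1])
        path.append(current)
    return path

# Hypothetical clone map (coordinates in kb).
clones = {"A": (0, 40), "B": (10, 50), "C": (30, 80), "D": (45, 90), "E": (75, 120)}
path = tiling_path(clones)
print(path, sum(clones[n][1] - clones[n][0] for n in path))
```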
In the preferred embodiment, (minimal) tiling paths have immediate utility
for finding genes. This is because the inner product map integrates genetic
markers
(polymorphic STSs) together with the clones that fully cover the genome region
containing the gene of interest. This considerably reduces the search effort
for cloning
the gene. Even greater utility for positional/candidate cloning (F. S.
Collins,
"Positional cloning moves from perditional to traditional," Nature Genet.,
vol. 9, no.
4, pp. 347-350, 1995), incorporated by reference, is present when a map of
ESTs,
expressed cDNAs, or exons is also integrated into the map.
1.4.4.3.5 Determining the sequence of said clones, and of the entire genome.
In the preferred embodiment, each mapped clone is selected in turn from a
minimum tiling path. This clone is then subcloned into M13 sequencing vectors.
For
each M13 subclone, nested deletions are constructed for use in DNA sequencing.
For
each deletion clone, a DNA sequencing template is prepared. This template is
then
sequenced by the dideoxy method, preferably using an automated DNA sequencer,
such as an A. L. F. (Pharmacia Biotech, Piscataway, N.J.) or an ABI 373 or
ABI 377
(Applied Biosystems, Foster City, Calif.), and 100-500 bp of sequence
determined. In
addition to this "shotgun" phase, in which an initial read is taken from each
subclone
using a universal primer, a "walking" phase takes additional reads from
selected
subclones by use of custom primers. Complete protocols for these and related
sequencing steps have been described (F. M. Ausubel, R. Brent, R. E. Kingston,
D. D.
Moore, J. G. Seidman, J. A. Smith, and K. Struhl, ed., Current Protocols in
Molecular
Biology. New York, N.Y.: John Wiley and Sons, 1995; N. J. Dracopoli, J. L.
Haines,
B. R. Korf, C. C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D.
Smith,
ed., Current Protocols in Human Genetics. New York: John Wiley and Sons,
1995).
The sequences of the nested deletion clones are assembled into the complete
sequence of the subclone by matching overlaps. The subclone sequences are then
assembled into the sequence of the mapped clone. The sequences of the mapped
clones are assembled into the complete sequence of the genome by matching
overlaps. Computer programs are available for these tasks (Rodger Staden
programs,
Cambridge, UK; DNAStar, Madison, Wis.). Following sequence assembly, current
analysis practice includes similarity and homology searches relative to
sequence
databases (Genbank, Bethesda, Md.; EMBL, Cambridge, UK; Phil Green's
GENEFINDER, Seattle, Wash.) to identify genes and repetitive elements, infer
function, and determine the sequence's relation to other parts of the genome
and cell.
1.4.4.4.6 Application of Strategies
Such strategies have been successfully applied to sequencing the genomes of
several bacteria (Human Genome Sciences, Gaithersburg, Md.), including E. coli
(G. Plunkett et al., "Analysis of the Escherichia coli genome. III. DNA sequence
of the region from 87.2 to 89.2 minutes," Nucl. Acids Res., vol. 21, pp.
3391-3398, 1993), incorporated by reference, and higher organisms, including
yeast (S. G. Oliver et al., "The complete sequence of yeast chromosome III,"
Nature, vol. 357, pp. 38-46, 1992), incorporated by reference, human (A.
Martin-Gallardo et al., "Automated DNA sequencing and analysis of 106 kilobases
from human chromosome 19q13.3," Nature Genet., vol. 1, pp. 34-39, 1992),
incorporated by reference, mouse (R. K. Wilson et al., "Nucleotide sequence
analysis of 95 kb near the 3' end of the murine T-cell receptor alpha/delta
chain locus: strategy and methodology," Genomics, vol. 13, pp. 1198-1208,
1992), incorporated by reference, and C. elegans (R. Wilson et al.,
"2.2 Mb of contiguous nucleotide sequence from chromosome III of C. elegans,"
Nature, vol. 368, pp. 32-38, 1994; J. Sulston, Z. Du, K. Thomas, R. Wilson, L.
Hillier,
R. Staden, N. Halloran, P. Green, J. Thierry-Mieg, L. Qiu, S. Dear, A.
Coulson, M.
Craxton, M. Durbin, M. Berks, M. Metzstein, T. Hawkins, R. Ainscough, and R.
Waterston, "The C. elegans genome sequencing project: a beginning," Nature,
vol.
356, pp. 37-41, 1992), incorporated by reference. The automated sequencing of
large
genome regions from mapped cosmid (or other) clones is now routine in several
centers (Sanger Center, Cambridge, UK; Washington University, St. Louis, Mo.),
with very low error at an average cost of $0.50 or less per base. Specific
strategies and
protocols for these efforts have been detailed (H. G. Griffin and A. M.
Griffin, ed.,
DNA Sequencing: Laboratory Protocols. New Jersey: Humana, 1992), incorporated
by reference.
The current best mode for sequencing is gel electrophoresis on polyacrylamide
gels, possibly using fluorescence detection. Newer technologies for DNA size
separation are being developed that are applicable to DNA sequencing,
including
ultrathin gel slabs (A. J. Kostichka, M. L. Marchbanks, R. L. Bromley Jr., H.
Drossman, and L. M. Smith, "High speed automated DNA sequencing in ultrathin
slab gels," Bio/Technology, vol. 10, pp. 78-81, 1992), incorporated by
reference,
capillary arrays (R. A. Mathies and X. C. Huang, "Capillary array
electrophoresis: an
approach to high-speed, high-throughput DNA sequencing," Nature, vol. 359, pp.
167-169, 1992), incorporated by reference, and mass spectrometry (K. J. Wu, A.
Stedding, and C. H. Becker, "Matrix-assisted laser desorption time-of-flight
mass
spectrometry of oligonucleotides using 3-hydroxypicolinic acid as an
ultraviolet-
sensitive matrix," Rapid Commun. Mass Spectrom., vol. 7, pp. 142-146, 1993),
incorporated by reference. DNA sequencing without the use of gel
electrophoresis has
also been done using sequencing by hybridization methodologies (R. Drmanac, S.
Drmanac, Z. Strezoska, T. Paunesku, I. Labat, M. Zeremski, J. Snoddy, W. K.
Funkhouser, B. Koop, and L. Hood, "DNA sequence determination by
hybridization:
a strategy for efficient large-scale sequencing," Science, vol. 260, pp. 1649-
1652,
1993; E. M. Southern, U. Maskos, and J. K. Elder, "Analyzing and comparing
nucleic
acid sequences by hybridization to arrays of oligonucleotides: evaluation using
experimental models," Genomics, vol. 13, pp. 1008-1017, 1991; S. P. A. Fodor,
J. L.
Read, M. C. Pirrung, L. Stryer, A. T. Lu, and D. Solas, "Light-directed
spatially
addressable parallel chemical synthesis," Science, vol. 251, pp. 767-773,
1991),
incorporated by reference. Another approach is base addition sequencing
strategy
(BASS), which uses synchronized DNA polymer construction to determine the
sequence of unknown DNA templates (P. C. Cheeseman, "Method for sequencing
polynucleotides," U.S. Pat. No. 5,302,509; filed Feb. 27, 1991, published Apr.
12,
1994; A. Rosenthal, K. Close, and S. Brenner, "DNA sequencing method," Patent
#PCT WO 93/21340; filed Apr. 22, 1992, published Oct. 28, 1993; R. Y. Tsien,
P.
Ross, M. Fahenstock, and A. J. Johnston, "DNA sequencing," Patent #PCT WO
91/06678; filed Oct. 26, 1990, published May 16, 1991), incorporated by
reference.
1.4.5 Insertion of a Genomic Fragment into an Appropriate Host Vector
In another embodiment, the process begins with a fragment of DNA, such as
a genomic fragment, which is inserted into an appropriate host vector capable
of
accommodating it. For example, a BAC vector can accommodate approximately 140
kb of DNA; a cosmid vector can accommodate approximately 40 kb. A composition
comprised of these insert-containing vectors is randomly sheared using
standard
methods, such as sonication, to obtain fragments suitable for transposon-based
sequencing--i.e., about 2-5 kb, preferably 3-4 kb, on the average.
The resulting subfragments are ligated into cloning vectors to create a first
library of subclones representing the original fragment. Because the subclones
in this
library will be used as target plasmids for transposon-mediated sequencing,
the size of
the cloning vector should be minimized; preferably it should contain only a
selectable
marker, an origin of replication, and an insertion site. A suitable host
plasmid is
pOT2; the subfragments obtained by shearing the original composition are end-
repaired, ligated to suitable restriction site containing adapters, and
inserted into the
host vector. Suitable adapters for the pOT2 vector contain BstXI sites.
The resulting cloning vectors with their inserts are then transfected into
bacteria, typically E. coli, for clonal growth. This first library should
contain a 15-20-
fold representation of the original fragment of DNA. For example, if the
original
fragment is approximately 40 kb, and the subclones contain inserts of
approximately 4
kb, 200 such clones would be required for a 20-fold representation of the
original
fragment.
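The library-size arithmetic in the example above can be reproduced with a short sketch. The function name `clones_needed` is hypothetical; the calculation simply divides the total insert length required for the chosen fold representation by the average insert size.

```python
import math

def clones_needed(fragment_kb, insert_kb, fold):
    """Number of subclones whose combined insert length is `fold` times the fragment."""
    return math.ceil(fragment_kb * fold / insert_kb)

# The 40 kb fragment with 4 kb inserts from the text, at 20-fold and 15-fold.
print(clones_needed(40, 4, 20))
print(clones_needed(40, 4, 15))
```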
1.4.5.1 Hybridization Screening
As pointed out above, this first library will contain subclones which do not
contain DNA derived from the original fragment to be sequenced. In order to
eliminate these subclones, a preliminary hybridization screen is conducted.
The
required number of subclones is prepared for hybridization screening, for
example, by
plating in 96-well plates and transferring to filters. The filters are then
probed with the
original fragment insert to weed out any colonies which do not contain DNA
which
represents portions of the original fragment. This checks the quality of the
library and
eliminates subclones that contain only host cloning vector for the original
fragment or
contaminating bacterial DNA.
1.4.5.2 2nd Library Formation by Subclones that Contain Inserts
The subclones confirmed to contain inserts derived from the fragment to be
sequenced form a second library. The number of subclones in this library
should be
sufficient to contain a 7-8x representation of the fragment. Each
subclone is
individually sequenced from one end of the insert. This is straightforward,
since the
sequence information in the cloning vector provides sufficient information to
design
appropriate primers. Typically, about 400-450 nucleotides into the insert are
read. In addition to the requirement for 7-8x coverage of the fragment when the
complete insert sequences of the subclones are obtained, there must be
sufficient sequence information available from this end sequencing to
represent a 1x coverage of the fragment. Thus, if the original fragment
contained 40 kb and 400 nucleotides into the insert are read, 100 clones would
be required. The
resulting
sequence information is organized into a computer-readable form for searching.
A
DNA sequence comparison algorithm can be used for subsequent comparisons, such
as the NCBI program BLASTN.
The criteria used to determine the number of subclones used to establish the
database in the method described above are that low sequencing redundancy must
be
maintained and a complete path must be available within the set of subclones
chosen
to provide complete coverage of the original fragment. In addition, the number
must
be chosen so that there is a high probability of finding the next subclone
when
searching with the newly sequenced end sequence.
A method similar to that employed by Chen, E. et al. Genomics (1993)
17:651-666, is used. Lander and Waterman (cite) conclude that the maximum
number of sequence islands occurs at $C = (1-\theta)^{-1}$, where $C$ is the
sequence coverage and $\theta$ is the ratio of the number of bases required to
detect the true overlap to the sequence read length. As $\theta$ approaches
zero, sequence coverage of 1 will produce the maximum number of sequence
islands. In order to achieve the highest efficiency database, enough
end-sequence data should be generated to obtain about 1x coverage.
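The location of the island maximum can be checked numerically. The sketch below assumes the standard Lander-Waterman expression for the expected number of islands, N·exp(-c(1-θ)) with N = c·G/L reads; that formula is recalled from the general literature rather than stated in this text, and the 40 kb fragment and 400-base read length are taken from the example above.

```python
import math

def expected_islands(coverage, theta, genome_len=40_000, read_len=400):
    """Assumed Lander-Waterman island count: N * exp(-c * (1 - theta)), N = c * G / L."""
    n_reads = coverage * genome_len / read_len
    return n_reads * math.exp(-coverage * (1.0 - theta))

def coverage_at_max(theta):
    """Coverage at which the island count peaks, C = (1 - theta)^-1."""
    return 1.0 / (1.0 - theta)

theta = 0.2
c_star = coverage_at_max(theta)
for c in (c_star - 0.2, c_star, c_star + 0.2):
    print(round(c, 2), round(expected_islands(c, theta), 2))
```

Setting the derivative of c·exp(-c(1-θ)) with respect to c to zero gives c = 1/(1-θ), which tends to 1 as θ approaches zero, in agreement with the statement above.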
In addition, the subclone coverage--i.e., the redundancy based on the complete
sequence contained in the number of subclones chosen--is important. A subclone
coverage factor of 7x-8x provides a 99.9% probability that each nucleotide in
nucleotide in
the fragment will actually reside in the library. This requires only about 100
subclones
averaging 3 kb in size for a 40 kb fragment.
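The 99.9% figure is consistent with a simple Poisson estimate, sketched below. The assumption that subclones are placed uniformly at random along the fragment (ignoring end effects) is mine and is not stated in the text; the function names are hypothetical.

```python
import math

def subclone_coverage(n_subclones, insert_kb, fragment_kb):
    """Fold coverage from the complete inserts of the chosen subclones."""
    return n_subclones * insert_kb / fragment_kb

def fraction_covered(coverage):
    """Poisson estimate of the probability that a base lies in at least one subclone."""
    return 1.0 - math.exp(-coverage)

c = subclone_coverage(100, 3, 40)   # 100 subclones of 3 kb on a 40 kb fragment
print(c, fraction_covered(7.0), fraction_covered(c))
```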
Sequence information from the host vector for the original fragment is used as
the first query and reveals which subclones in the library are hybrid
vector/fragment
insert subclones. These will identify the two ends of the original fragment.
One
subclone representing each end, preferably that containing the least amount of
vector
sequence, is selected for further sequencing. The insert of the identified
subclone will
be sequenced from the opposite end from that previously sequenced-- i.e.,
opposite
the end containing the vector sequence. The new sequence information (which is
now
derived from the fragment) is used as the next query. This identifies
additional
subclones which contain additional nucleotide sequence farther in from the end
of the
original fragment. The next identified subclone is then also sequenced from
the
opposite end of the insert from that used to place it in the database and the
new
sequence information used as the next query. The process is continued
sequentially
until a subclone path through the fragment is obtained. The subclone path will
represent the collection of subclones which completely define the fragment
from
which they originated, and their correct relative positions are known.
At any point in this process, if there are no responses to the query,
additional
sequence can be obtained from the subclones already identified and this
sequence
used as the query.
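The iterative query procedure can be sketched with intervals standing in for sequence searches. In this toy model each subclone's database entry is an end read covering the first `read_len` bases of its insert, a "hit" is any database read that overlaps the query interval, and the hit starting farthest along the fragment is chosen next. All names, coordinates, and the selection rule are assumptions; a real implementation would run a sequence comparison program such as BLASTN on actual reads.

```python
def walk_path(subclones, read_len):
    """Walk from one end of the fragment to the other.

    subclones maps name -> (start, end) on the original fragment; the database
    holds one end read per subclone, modelled as [start, start + read_len).
    """
    fragment_end = max(end for _, end in subclones.values())
    # First query: vector sequence, which identifies the subclone at the fragment end.
    current = next(n for n, (s, _) in subclones.items() if s == 0)
    path = [current]
    while subclones[current][1] < fragment_end:
        cur_end = subclones[current][1]
        # New query: read taken from the opposite end of the current insert.
        q_start, q_end = cur_end - read_len, cur_end
        hits = [n for n, (s, _) in subclones.items()
                if n not in path and s < q_end and s + read_len > q_start]
        if not hits:
            raise LookupError("no response to query; extend the read and search again")
        current = max(hits, key=lambda n: subclones[n][0])
        path.append(current)
    return path

# Hypothetical subclone positions (bases) on a 12 kb fragment.
subclones = {"s1": (0, 3000), "s2": (2800, 6000), "s3": (1000, 4200),
             "s4": (5700, 9000), "s5": (8800, 12000)}
print(walk_path(subclones, read_len=400))
```

The `LookupError` branch corresponds to the case described above in which a query returns no responses and additional sequence must be obtained before searching again.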
Once the subclone path is determined, it remains only to complete the
sequencing of the subclones involved in the path. According to the method of
the
invention, this is accomplished using the transposon-mediated method of
Strathmann
incorporated by reference hereinabove. Use of this method to complete the
sequence
information for the fragment has been designated "minimal assembled path"
(MAP)
sequencing. The name is apt because the information provided by the subclone
path
can be used to determine the minimal sequencing path through the identified
subclones. For example, if two subclones overlap over 1 kb, transposon
insertions can
be selected so that the overlap region is sequenced only once. Thus, although
theoretically each of the subclones obtained to define the path can be
completely
sequenced using the transposon-mediated method, only sufficient portions of
these
subclones need be sequenced to obtain the complete sequence of the original
fragment.
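The saving from sequencing each overlap only once can be estimated with a short interval calculation. The subclone coordinates are hypothetical, and the sketch only counts bases; it does not choose transposon insertion sites.

```python
def bases_to_sequence(path_intervals):
    """Length of the union of subclone intervals along the path."""
    total = 0
    covered_to = None
    for start, end in sorted(path_intervals):
        if covered_to is None or start > covered_to:
            total += end - start        # nothing here has been sequenced yet
            covered_to = end
        elif end > covered_to:
            total += end - covered_to   # skip the region already covered
            covered_to = end
    return total

# Two hypothetical subclones overlapping by 1 kb, then a third overlapping by 0.5 kb.
path = [(0, 3000), (2000, 5000), (4500, 8000)]
full = sum(end - start for start, end in path)
print(full, bases_to_sequence(path))
```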
1.4.6 Methods of Determining A Nucleic Acid Sequence through Enzymatic
Sequencing
In another embodiment, improved methods of determining a nucleic acid
sequence through enzymatic sequencing are provided. In the subject methods,
primers
are used in combination with capturable chain terminators to produce primer
extension products capable of being captured on a solid phase, where the
primer
extension products may be labeled, e.g. by employing labeled primers to
generate the
primer extension products. Following generation of the primer extension
products, the
primer extension products are isolated through capture on a solid phase. The
isolated
primer extension products are then released from the solid phase, size
separated and
detected to yield sequencing data from which the nucleic acid sequence is
determined.
Methods of determining the sequence of a nucleic acid, e.g. DNA, by
enzymatic sequencing are well known in the art and described in Sambrook et
al.,
Molecular Cloning: A Laboratory Manual (Cold Spring Harbor Laboratory Press,
1989) and Griffin and Griffin, "DNA Sequencing: Recent Innovations and Future
Trends," Applied Biochemistry and Biotechnology (1993) 38: 147-159, the
disclosures of which are herein incorporated by reference. The Sanger method
is
shown schematically herein. Generally, in enzymatic sequencing methods, which
are
also referred to as Sanger dideoxy or chain termination methods, differently
sized
oligonucleotide fragments representing termination at each of the bases of the
template DNA are enzymatically produced and then size separated yielding
sequencing data from which the sequence of the nucleic acid is determined. The
results of such size separations are shown herein. The first step in such
methods is to
produce a family of differently sized oligonucleotides for each of the
different bases
in the nucleic acid to be sequenced, e.g. for a strand of DNA comprising all
four bases
(A, G, C, and T) four families of differently sized oligonucleotides are
produced, one
for each base. To produce each family of differently sized oligonucleotides,
the nucleic acid to be sequenced, i.e. the template nucleic acid, is combined
with an oligonucleotide primer, a polymerase, nucleotides and a
dideoxynucleotide corresponding to one of the bases in the template nucleic
acid. Each of the families of oligonucleotides is then size separated, e.g. by
electrophoresis, and detected to
obtain sequencing data, e.g. a separation pattern or electropherogram, from
which the
nucleic acid sequence is determined.
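How a sequence is read from the four size-separated families can be shown with a small sketch. Each family is represented by the lengths of its chain-terminated products, counted in bases added beyond the primer; the function name and the example lengths are hypothetical, and real electropherogram data would require base calling rather than exact lengths.

```python
def read_sequence(families):
    """families maps a base to the lengths of extension products ending in that base."""
    by_length = {}
    for base, lengths in families.items():
        for n in lengths:
            if n in by_length:
                raise ValueError("two families claim length %d" % n)
            by_length[n] = base
    # Reading products from shortest to longest gives the synthesized strand 5' to 3'.
    return "".join(by_length[n] for n in sorted(by_length))

# Hypothetical product lengths for the A, C, G and T termination reactions.
families = {"A": [1, 5], "C": [3], "G": [2, 6], "T": [4]}
print(read_sequence(families))
```

The result is the sequence of the newly synthesized strand, which is complementary to the template.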
Before further describing the subject methods in greater detail, the critical
chain terminator reagents employed in the subject methods will be discussed.
Critical
to the subject methods is the use of capturable chain terminators to produce
the
families of different sized oligonucleotide fragments (hereinafter referred to
as primer
extension products) comprising a capture moiety at the 3' terminus. The primer
sequences employed to generate the primer extension products will be
sufficiently
long to hybridize the nucleic acid comprising the target or template nucleic
acid under
chain extension conditions, where the length of the primer will generally
range from 6
to 40, usually 15 to 30 nucleotides in length. The primer will generally be a
synthetic
oligonucleotide, analogue or mimetic thereof, e.g. a peptide nucleic acid.
Although
the primer may hybridize directly to the 3' terminus of the target nucleic
acid where a
sufficient portion of this terminus of the target nucleic acid is known,
conveniently a
universal primer may be employed which anneals to a known vector sequence
flanking the target sequence. Universal primers which are known in the art and
commercially available include pUC/M13, gt10, gt11 and the like.
1.4.6.1 Primers Comprise a Detectable Label
In one preferred embodiment of the subject invention, the primers employed in
the subject invention will comprise a detectable label. A variety of labels
are known
in the art and suitable for use in the subject invention, including
radioisotopic,
chemiluminescent and fluorescent labels. As the subject methods are
particularly
suited for use with methods employing automated detection of primer extension
products, fluorescent labels are preferred. Fluorescently labeled primers
employed in
the subject methods will generally comprise at least one fluorescent moiety
stably
attached to one of the bases of the oligonucleotide.
The primers employed in the subject invention may be labeled with a variety
of different fluorescent moieties, where the fluorescer or fluorophore should
have a
high molar absorbance, where the molar absorbance will generally be at least
$10^{4}$ cm$^{-1}$M$^{-1}$, usually at least $10^{4}$ cm$^{-1}$M$^{-1}$ and preferably at least $10^{5}$ cm$^{-1}$M$^{-1}$, and a
high
fluorescence quantum yield, where the fluorescence quantum yield will
generally be
at least about 0.1, usually at least about 0.2 and preferably at least about
0.5.
For primers labeled with a single fluorescer, the wavelength of light absorbed
by the fluorescer will generally range from about 300 to 900 nm, usually from
about
400 to 800 nm, where the absorbance maximum will typically occur at a
wavelength
ranging from about 500 to 800 nm. Specific fluorescers of interest for use in
singly
labeled primers include: fluorescein, rhodamine, BODIPY, cyanine dyes and the
like,
and are further described in Smith et al., Nature (1986) 321: 647-679, the
disclosure
of which is herein incorporated by reference.
Of particular interest for use in the subject methods are energy transfer
labeled
fluorescent primers, in which the primer comprises both a donor and acceptor
fluorescer component in energy transfer relationship. Energy transfer labeled
primers
are described in PCT/US95/01205 and PCT/US96/13134, as well as in Ju et al.,
Nature Medicine (1996) 2:246-249, the disclosures of which are herein
incorporated
by reference.
In an alternative embodiment of the subject invention, instead of using
labeled primers, labeled deoxynucleotides are employed, such as fluorescently
labeled dUTP,
which are incorporated into the primer extension product resulting in a
labeled primer
extension product.
The dideoxynucleotides employed as capturable chain terminators in the
subject methods will comprise a functionality capable of binding to a
functionality
present on a solid phase. The bond arising from reaction of the two
functionalities
should be sufficiently strong so as to be stable under washing conditions and
yet be
readily disruptable by specific chemical or physical means. Generally, the
chain
terminator dideoxynucleotide will comprise a member of a specific binding pair
which is capable of specifically binding to the other member of the specific
binding
pair present on the solid phase. Specific binding pairs of interest include
ligands and
receptors, such as antibodies and antigens, biotin and strept/avidin, sulfide
and gold
(Cheng & Brajter-Toth, Anal. Chem. (1996) 68:4180-4185), and the like, where
either
the ligand or the receptor, but usually the ligand, member of the pair will be
present
on the chain terminator. Of particular interest for use as chain terminators
are
biotinylated dideoxynucleotides, where such dideoxynucleotides are known in
the art
and available commercially, e.g. biotin-11-ddATP, biotin-11-ddGTP,
biotin-11-ddCTP and biotin-11-ddTTP, and the like.
1.4.6.2 Subject Methods
Turning now to the subject methods, the nucleic acids which are capable of
being sequenced by the subject methods are generally deoxyribonucleic acids
that
have been cloned in an appropriate vector, where a variety of vectors are known
in the
art and commercially available, and include M13mp18, pGEM, pSport and the
like.
The first step in the subject method is to prepare a reaction mixture for each
of the
four different bases of the sequence to be sequenced or target DNA. Each of
the
reaction mixtures comprises an enzymatically generated family of primer
extension
products, usually labeled primer extension products, terminating in the same
base. In
other words, in practicing the subject method, one will first generate an "A,"
"G," "C," and "T" family of differently sized primer extension products using
the target DNA
target DNA
as template. To generate the four families of differently sized primer
extension
products, template DNA, a DNA polymerase, primer (which may be labeled), the
four
different deoxynucleotides, and capturable dideoxynucleotides are combined in
a
primer extension reaction mixture. The components are reacted under conditions
sufficient to produce primer extension products which are differently sized
due to the
random incorporation of the capturable dideoxynucleotide and subsequent chain
termination. Thus, to generate the "A" family of differently sized primer
extension
products, the above listed reagents will be combined into a reaction mixture,
where
the dideoxynucleotide is ddATP modified to comprise a capturable moiety, e.g.
biotinylated ddATP, such as biotin-11-ddATP. The remaining "G," "C," and "T"
families of differently sized primer extension products will be generated in
an
analogous manner using the appropriate dideoxynucleotide.
Where labeled primers are employed to generate each of the families of primer
extension products, the labeled primers may be the same or different.
Preferably, the
labeled primer employed will be different for production of each of the four
families
of primer extension products, where the labels will be capable of being
excited at
substantially the same wavelength and yet will provide a distinguishable
signal. The
use of labels with distinguishable signals affords the opportunity of
separating the
differently sized primer extension products when such products are together in
the
same separation medium. This results in superior sequencing data and therefore
more
accurate sequence determination. For example, one can prepare the "A" family
of
primer extension products with a first fluorescent label capable of excitation
at a
wavelength from about 470 to 480 nm which fluoresces at 525 nm. The label used
in
production of "G," "C," and "T" families will be excitable at the same
wavelength as
that used in the "A" family, but will emit at 555 nm, 580 nm, and 605 nm
respectively. Accordingly, the primer extension labels are designed so that
all four of
the labels absorb at substantially the same wavelength but emit at different
wavelengths, where the wavelengths of the emitted light differ in detectable
and
differentiatable amounts, e.g. differ by at least 15 nm. The next step in the
subject
method is isolation of the primer extension products. The primer extension
products
are isolated by first capturing the primer extension products on a solid phase
through
the capture moiety at the 3' terminus of the primer extension product and then
separating the solid phase from the remaining components of the reaction
mixture.
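The label spacing described above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the emission maxima and the 15 nm threshold are taken from the text, while the function names are hypothetical.

```python
from itertools import combinations

# Emission maxima (nm) quoted in the text for the four label families.
EMISSION_NM = {"A": 525, "G": 555, "C": 580, "T": 605}
MIN_SEPARATION_NM = 15  # "differ by at least 15 nm"


def min_pairwise_gap(emission):
    """Return the smallest difference between any two emission maxima."""
    return min(abs(a - b) for a, b in combinations(emission.values(), 2))


def labels_distinguishable(emission, min_gap=MIN_SEPARATION_NM):
    """True when every pair of labels is separated by at least min_gap nm."""
    return min_pairwise_gap(emission) >= min_gap
```

For the example wavelengths the closest pair is 25 nm apart, which satisfies the stated 15 nm criterion.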
Capture of the primer extension products occurs by contacting the reaction
mixture comprising the family of primer extension products with a solid phase.
The
solid phase has a member of a specific binding pair on its surface. The other
member
of the specific binding pair is bonded to the primer extension products, as
described
above. Contact will occur under conditions sufficient to provide for stable
binding of
the specific binding pair members. A variety of different solid-phases are
suitable for
use in the subject methods, such phases being known in the art and
commercially
available. Specific solid phases of interest include polystyrene pegs, sheets,
beads,
magnetic beads, gold surface and the like. The surfaces of such solid phases
have
been modified to comprise the specific binding pair member, e.g. for
biotinylated
primer extension products, streptavidin-coated magnetic beads may be employed
as the
solid phase.
Following capture of the primer extension reaction products on the solid
phase, the solid phase is then separated from the remaining components of the
reaction mixture, such as template DNA, excess primer, excess deoxy- and
dideoxynucleotides, polymerase, salts, extension products which do not have
the
capture moiety, and the like. Separation can be accomplished using any
convenient
methodology. The methodology will typically comprise washing the solid phase,
where further steps can include centrifugation, and the like. The particular
method
employed to separate the solid-phase is not critical to the subject invention,
as long as
the method employed does not disrupt the bond linking the primer extension
reaction
product to the solid phase.
The primer extension products are then released from the solid phase. The
products may be released using any convenient means, including both chemical
and
physical means, depending on the nature of the bond between the specific
binding pair
members. For example, where the bond is a biotin-streptavidin bond, the bond
may be
disrupted by contacting the solid phase with a chemical disruption agent, such
as
formamide, and the like, which disrupts the biotin-streptavidin bond and
thereby
releases the primer extension product from the solid phase. The released
primer
extension products are then separated from the solid phase using any
convenient
means, including elution, centrifugation and the like.
The next step in the subject method is to size separate the primer extension
products. Size separation of the primer extension products will generally be
accomplished through electrophoresis, in which the primer extension products
are
moved through a separation medium under the influence of an electric field
applied to
the medium, as is known in the art. Alternatively, for sequencing with Mass
Spectrometry (MS) where unlabeled primer extension products are detected, the
sequencing fragments are separated in the time-of-flight chamber and
detected by
the mass of the fragments. See Roskey et al., Proc. Natl. Acad. Sci. USA
(1996) 93:
4724-4729. The subject methodology is especially important for obtaining
accurate
sequencing data with MS, because the subject methodology offers a means to
load
only the primer extension products terminated with the capturable chain
terminators,
eliminating all other masses, thereby producing accurate results.
In methods in which the fragments are size separated, the size separated
primer extension products are then detected, where detection of the size
separated
products yields sequencing data from which the sequence of the target or
template
DNA is determined. For example, where the families of fragments are separated
in a
traditional slab gel in four separate lanes, one corresponding to each base of
the target
DNA, sequencing data in the form of a separation pattern is obtained. From the
separation pattern, the target DNA sequence is then determined, e.g. by
reading up the
gel. Alternatively, where automated detectors are employed and all of the
reaction
products are separated in the same electrophoretic medium, the sequencing data
may
take the form of an electropherogram, as is known in the art, from which the
DNA
sequence is determined.
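As an illustration of reading a four-lane separation pattern, the sketch below assumes each lane has already been reduced to a list of fragment lengths. The data layout and the function name are assumptions made for this example only.

```python
def read_gel(lanes):
    """Reconstruct the extended-strand sequence from four lanes.

    lanes maps a base ("A", "C", "G", "T") to the lengths of the fragments
    that terminate in that base. Reading from the shortest fragment to the
    longest corresponds to reading up the gel.
    """
    by_length = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in by_length:
                # The same length in two lanes means an ambiguous band.
                raise ValueError(f"length {n} appears in two lanes")
            by_length[n] = base
    return "".join(by_length[n] for n in sorted(by_length))
```

For instance, `read_gel({"A": [1, 4], "C": [2], "G": [3, 6], "T": [5]})` yields the sequence of the extension products, which is complementary to the template.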
Where labeled primers are employed, the nature of the labeled primers will, in
part, determine whether the families of labeled primer extension products may
be
separated in the same electrophoretic medium, e.g. in a single lane of slab
gel or in the
same capillary, or in different electrophoretic media, e.g. in different lanes
of a slab gel
or in different capillaries. Where the same labeled primer generating the same
detectable signal is employed to generate the primer extension products in
each of the
different families, the families of primer extension products will be
electrophoretically separated in different electrophoretic media, so that the
families of
primer extension products corresponding to each base in the nucleic acid can
be
distinguished.
Where different labeled primers are used for generating each family of primer
extension products, the families of products may be grouped together and
electrophoretically separated in the same electrophoretic medium. In this
preferred
method, the families of primer extension products may be combined or pooled
together at any convenient point following the primer extension product
generation
step. Thus, the primer extension products can be pooled either prior to
contact with
the solid phase, while bound to the solid phase or after separation from the
solid phase
but prior to electrophoretic separation.
Kits for practicing the subject sequencing methods are also provided. At a
minimum such kits will comprise capturable chain terminators, e.g.
biotinylated-ddATP, -ddTTP, -ddCTP and -ddGTP. For embodiments in which the primer
extension products are labeled, the kits will further comprise a means for
generating
labeled primer extension products, such as labeled deoxynucleotides, or
preferably
labeled primers, where the labeled primers are preferably Energy Transfer
labeled
primers which absorb at the same wavelength and provide distinguishable
fluorescent
signals. Conveniently, the kits may further comprise one or more additional
reagents
useful in enzymatic sequencing, such as vector, polymerase, deoxynucleotides,
buffers, and the like. The kits may further comprise a plurality of
containers, wherein
each container may comprise one or more of the necessary reagents, such as
labeled
primer, unlabeled primer or degenerate primer, dNTPs, dNTPs containing a
fraction of
fluorescent dNTPs, capturable ddNTP, polymerase and the like. The kits may
also
further comprise solid phase comprising a moiety capable of binding with the
capturable ddNTP, such as streptavidin coated magnetic beads and the like.
1.4.7 Production of the DNA Fragments
In another embodiment, the DNA fragments are preferably prepared
according to either the enzymatic or chemical degradation sequencing
techniques
previously described, but the fragments are not tagged with radioactive
tracers. These
standard procedures produce, from each section of DNA to be sequenced, four
separate collections of DNA fragments, each set containing fragments
terminating at
only one of the four bases. These four samples, suitably identified, are
provided as a
few microliters of liquid solution.
1.4.7.1 Sample Preparation and Introduction
To obtain intact molecular ions from large molecules, such as DNA fragments,
by UV laser desorption mass spectrometry, the samples should be dispersed in a
solid
matrix that strongly absorbs light at the laser wavelength. Suitable matrices
for this
purpose include cinnamic acid derivatives such as (4-hydroxy, 3-methoxy)
cinnamic
acid (ferulic acid), (3,4-dihydroxy) cinnamic acid (caffeic acid) and
(3,5-dimethoxy, 4-hydroxy) cinnamic acid (sinapinic acid). These materials may
be dissolved in a
suitable solvent such as a 3:2 mixture of 0.1% aqueous trifluoroacetic acid and
acetonitrile at concentrations which are near saturation at room temperature.
One technique for introducing samples into the vacuum of the mass
spectrometer is to deposit each sample and matrix as a liquid solution at
specific spots
on a disk or other media having a planar surface. To prepare a sample for
deposit,
approximately 1 microliter of the sample solution is mixed with 5-10
microliters of
the matrix solution. An aliquot of this mixed solution for each DNA sample is
placed
on the disk at a specific location or spot, and the volatile solvents are
removed by
room temperature evaporation. When the solution containing the sample and a
thousand-fold or more excess of matrix is dried on the disk, the result should
be a
solid solution of samples each in the matrix at a specific site on the disk.
Each molecule of the sample should be fully encased in matrix molecules and
isolated from other sample molecules. Aggregation of sample molecules should
not
occur. The matrix need not be volatile, but it must be rapidly vaporized
following
absorption of photons. This can occur as the result of photochemical
conversion to
more volatile substances. In addition, the matrix must transfer ionization to
the
sample. To form protonated positive molecular ions from the sample, the proton
affinity of the matrix must be less than that of the basic sites on the molecule,
and to
form deprotonated negative ions, the gas phase acidity of the matrix must be
less than
that of acidic sites on the sample molecule. Although it is necessary for the
matrix to
strongly absorb photons at the laser wavelength, it is preferable that the
sample does
not absorb laser photons to avoid radiation damage and fragmentation of the
sample.
Therefore, matrices which have absorption bands at longer wavelengths are
preferred,
such as at 355 nm, since DNA fragment molecules do not absorb at the longer
wavelengths.
Depicted herein is a suitable automated DNA sample preparation and loading
technique. In this approach, a commercially available autosampler is used to
add
matrix solution from a container to the separated DNA samples. A large number of
DNA fragment samples, for example 120 samples, may be loaded into a sample
tray.
The matrix solution may be added automatically to each sample using procedures
available on such an autosampler, and the samples may then be spotted
sequentially as
sample spots on an appropriate surface, such as the planar surface of the disk
rotated
by a stepper motor. Sample spot identification is entered into the data storage
and
computing system which controls both the autosampler and the mass
spectrometer.
The location of each spot relative to a reference mark is thus recorded in the
computer. Sample preparation and loading onto the solid surface is done off
line from
the mass spectrometer, and multiple stations may be employed for each mass
spectrometer if the time required for sample preparation is longer than the
measurement time.
Once the samples in suitable matrix are deposited on the disk, the disk may be
inserted into the ion source of a mass spectrometer through the vacuum lock.
Any gas
introduced in this procedure must be removed prior to measuring the mass
spectrum.
Loading and pump down of the spectrometer typically requires two to three
minutes,
and the total time for measurement of each sample to obtain a spectrum is
typically
one minute or less. Thus 50 or more complete DNA spectra may be determined
per
hour according to the present invention. Even if the samples were manually
loaded,
less than one hour would be required to obtain sequence data on a particular
segment
of DNA, which might be from 400 to 600 bases in length. Even this latter
technique is
much faster than the conventional DNA sequencing techniques, and compares
favorably with the newer automated sequencers using fluorescence labeling. The
technique of the present invention does not, however, require the full-time
attention
of a dedicated, trained operator to prepare and load the samples, and
preferably is
automated to produce 50 or more spectra per hour.
Greater detail of the preferred technique for DNA sequencing is depicted
herein. Under the control of the computer, the disk may be rotated by another
stepper
motor relative to the reference mark to sequentially bring any selected sample
to the
position for measurement. If the disk contains 120 samples, operator
intervention is
only required approximately once every two hours to insert a new sample disk,
and
less than five minutes of each two hour period is required for loading and
pumpdown.
With this approach, a single operator can service several spectrometers. The
particular
disk geometry shown for the automated system is chosen for illustrative
purposes
only. Other geometries, employing for example linear translation of the planar
surface, could also be used.
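The throughput figures quoted above (120 samples per disk, about one minute per spectrum, under five minutes for loading and pumpdown) can be reproduced with the short calculation below. It is an illustrative sketch only, and the function name and default values are assumptions drawn from the numbers in the text.

```python
def disk_throughput(samples_per_disk=120, minutes_per_sample=1.0, load_minutes=5.0):
    """Return (total minutes per disk, spectra per hour) for one spectrometer."""
    total_minutes = load_minutes + samples_per_disk * minutes_per_sample
    spectra_per_hour = samples_per_disk / total_minutes * 60.0
    return total_minutes, spectra_per_hour
```

With the default values a disk occupies the instrument for about two hours, and the rate stays above the figure of 50 spectra per hour stated in the text.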
1.4.7.2 The Mass Spectrometer
The present invention preferably utilizes a laser desorption time of flight
(TOF) mass spectrometer. The disk has a planar face containing a plurality of
sample
spots, each being approximately equal to the laser beam diameter. The disk is
maintained at a voltage V1 and may be manually inserted and removed from the
spectrometer. Ions are formed by sequentially radiating each spot on the disk
with a
laser beam from source.
The ions extracted from the face of the disk are attracted and pass through
the
grid covered holes in the metal plates. The plates are at voltages V2 and V3.
Preferably
V3 is at ground, and V1 and V2 are varied to set the accelerating electrical
potential,
which typically is in the range of 15,000-50,000 volts. A suitable voltage V1 - V2
is
5000 volts and a suitable range of voltages V2 - V3 is 10,000 to 45,000 volts.
The low mass ions are almost entirely prevented from reaching the detector by
the deflection plates. The ions travel as a beam between the deflection plates
which
suitably are spaced 1 cm apart and are 3-10 cm long. The first plate is at
ground and a
second plate receives square wave pulses, for example, at 700 volts with a
pulse width
in the order of 1 microsecond after the laser strikes the tip. Such pulses
suppress the
unwanted low mass ions, for example, those under 1,000 Daltons, by deflecting
them,
so that the low weight ions do not reach the detector, while the higher weight
ions
pass between the plates after the pulse is off, so they are not deflected, and
are
detected by detector.
An ion detector is positioned at the end of the spectrometer tube and has its
front face maintained at voltage Vd. The gain of the ion detector is set by Vd
which
typically is in the range of -1500 to -2500 volts. The detector is a chevron-
type
tandem microchannel plate array with a front plate at about -2000 volts. The
spectrometer tube is straight and provides a linear flight path, for example,
1/2 to 4
meters in length, and preferably about two meters in length. The ions are
accelerated
in two stages and the total acceleration is in the range of about 15,000-
50,000 volts,
positive or negative. The spectrometer is held under high vacuum, typically 10
µPa,
which may be obtained, for example, after 2 minutes of introduction of the
samples.
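To give a sense of the time scale involved, the following sketch estimates the drift time of a singly charged ion in a linear flight tube. It assumes the ion has acquired its full kinetic energy before entering a field-free drift region and ignores the time spent in the two acceleration stages; the function name and the example values are illustrative and are not taken from the specification.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, coulombs
DALTON = 1.66053906660e-27   # atomic mass constant, kilograms


def flight_time(mass_da, accel_volts, drift_m=2.0, charge=1):
    """Approximate field-free drift time in seconds.

    Uses z*e*V = (1/2)*m*v**2, so the time grows with the square root of m/z.
    """
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2.0 * charge * E_CHARGE * accel_volts / mass_kg)
    return drift_m / velocity
```

Under these assumptions a 10,000 Dalton ion accelerated through 30,000 volts needs on the order of 80 microseconds to traverse a two meter tube, and an ion of four times the mass takes twice as long.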
The face of the disk is struck with a laser beam to form the ions. Preferably
the
laser beam is from a solid laser. A suitable laser is an HY-400 Nd-YAG laser
(available from Lumonics Inc., Kanata (Ottawa), Ontario, Canada), with a 2nd,
3rd
and 4th harmonic generation/selection option. The laser is tuned and operated
to
produce maximum temporal and energy stability. Typically, the laser is
operated with
an output pulse width of 10 ns and an energy of 15 mJ of UV per pulse. To
improve
the spatial homogeneity of the beam, the amplifier rod is removed from the
laser.
The output of the laser is attenuated with a 935-5 variable attenuator
(available
from Newport Corp., Fountain Valley, Calif.), and focused onto the sample on
the
face, using a 12-in. focal length fused-silica lens. The incident angle of the
laser beam,
with respect to the normal of the disk's sample surface, is 70°. The
spot illuminated on
the disk is not circular, but a stripe of approximate dimensions 100 x 300 µm or
larger.
The start time for the data system (i.e., the time the laser actually fired)
is determined
using a beam splitter and a PS-O1 fast pyroelectric detector (available from
Molectron
Detector Inc., Campbell, Calif.). The laser is operated in the Q-switched
mode,
internally triggering at 5 Hz, using the Pockels cell Q-switch to divide that
frequency
to a 2.5 Hz output.
The data system for recording the mass spectra produced is a combination of a
TR8828D transient recorder and a 6010 CAMAC crate controller (both
manufactured
by LeCroy, Chestnut Ridge, N.Y.). The transient recorder has a selectable time
resolution of 5-20 ns. Spectra may be accumulated for up to 256 laser shots in
131,000 channels, with the capability of running at up to 3 Hz, or with fewer
channels
up to 10 Hz. The data is read from the CAMAC crate using a Proteus IBM AT
compatible computer. During the operation of the spectrometer, the spectra
(shot-to-
shot) may be readily observed on a 2465A 350 MHz oscilloscope (available from
Tektronix, Inc., Beaverton, Oreg.). A suitable autosampler for mixing the
matrix
solution and each of the separated DNA samples and for depositing the mixture
on a
solid planar surface is the Model 738 Autosampler (available from Alcott Co.,
Norcross, Ga.).
This linear TOF system may be switched from positive to negative ions easily,
and both modes may be used to look at a single sample. The sample preparation
was
optimized for the production of homogeneous samples in order to produce
similar
signals from each DNA sample spot.
1.4.7.3 Data Analysis and Determination of Sequence
The raw data obtained from the laser desorption mass spectrometer consists
of ion current as a function of time after the laser pulse strikes the target
containing
the sample and matrix. This time delay corresponds to the "time-of-flight"
required
for an ion to travel from the point of formation in the ion source to the
detector, and is
proportional to the square root of the mass-to-charge ratio of the ion. By reference to results
obtained for
materials whose molecular weights are known, this time scale can be converted
to
mass with a precision of 0.01 % or better.
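A conversion from flight time to mass by reference to known materials might be sketched as follows. The sketch assumes the usual time-of-flight relation, in which flight time varies linearly with the square root of the mass of a singly charged ion, and fits that line through two reference peaks. The function name and the calibration points are hypothetical.

```python
import math


def make_calibration(t1, m1, t2, m2):
    """Build a time-to-mass converter from two reference peaks.

    Assumes sqrt(m) = a*t + b, where the offset b absorbs any fixed delay
    between the laser trigger and the start of the drift.
    """
    a = (math.sqrt(m2) - math.sqrt(m1)) / (t2 - t1)
    b = math.sqrt(m1) - a * t1

    def to_mass(t):
        return (a * t + b) ** 2

    return to_mass
```

For example, with hypothetical references of 2,500 Daltons at 50 microseconds and 10,000 Daltons at 100 microseconds, a peak arriving at 75 microseconds would be assigned a mass of about 5,625 Daltons.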
In a graph of intensity vs. time-of-flight of the pseudomolecular-ion region of
a
TOF mass spectrum of Not I Linker (DNA) in which the matrix is ferulic acid
and the
wavelength is 355 nm, four consecutive spectra can be obtained using the
present
invention by the successive measurement of the four collections of DNA
fragments
obtained from fragmentation of each sample of DNA. Each of these spectra will
correspond to the set of fragments ending in a particular base or bases: G, G
and A, C
and T, or C. To determine the order of the peaks in the four spectra, a simple
computer algorithm may be utilized.
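One possible form of such an algorithm is sketched below. It is not the algorithm of the specification, which is not given; it simply assumes that each of the four spectra has been reduced to a list of peak masses, that a peak found in both the "G" and "G and A" spectra is a G, that a peak found only in "G and A" is an A, and likewise for C and T. The dictionary keys, the tolerance and the function names are assumptions.

```python
def _has_match(mass, peaks, tol):
    """True when some peak lies within tol mass units of mass."""
    return any(abs(mass - p) <= tol for p in peaks)


def call_sequence(spectra, tol=1.0):
    """Order the peaks of four spectra by mass and assign a base to each.

    spectra has keys "G", "G+A", "C+T" and "C", each a list of peak masses.
    """
    calls = []
    for mass in spectra["G+A"]:
        base = "G" if _has_match(mass, spectra["G"], tol) else "A"
        calls.append((mass, base))
    for mass in spectra["C+T"]:
        base = "C" if _has_match(mass, spectra["C"], tol) else "T"
        calls.append((mass, base))
    calls.sort()  # increasing mass corresponds to increasing fragment length
    return "".join(base for _, base in calls)
```

The tolerance would in practice be chosen from the mass precision of the instrument.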
It should be noted that the data obtained from the mass spectra contains
significantly
more useful information than the corresponding traces from electrophoresis.
Not only
can the mass order of the peaks be determined with good accuracy and
precision, but
also the absolute mass differences between adjacent peaks, both in individual
spectra
and between spectra, can be determined with high accuracy and precision. This
information may be used to detect and correct sequence errors which might
otherwise
go undetected. For example, a common source of error which often occurs in
conventional sequencing results from variations in the amounts of the individual
fragments present in a mixture due to variations in the cleavage chemistry.
Because of
this variation it is possible for a small peak to go undetected using
conventional
sequencing techniques. With the present invention, such errors can be
immediately
detected by noting that the mass differences between detected peaks do not
match the
apparent sequence. In many cases, the error can be quickly corrected by
calculating
the apparent mass of the missing base from the observed mass differences
across the
gap. As a result, the present invention provides sequence data not only much
faster
than conventional techniques, but also data which is more accurate and
reliable. This
correction technique will reduce the number of extra runs which are required
to
establish the validity of the result.
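The correction by mass differences can be illustrated with the sketch below. The residue masses are commonly cited approximate average masses of the four deoxynucleotide monophosphate residues within a DNA strand and are an assumption of this example, as are the tolerance and the function name. When the difference between adjacent peaks matches no single residue, the sketch looks for a pair of residues whose combined mass spans the gap.

```python
from itertools import combinations_with_replacement

# Approximate average residue masses in Daltons (assumed values).
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}


def interpret_ladder(peak_masses, tol=0.4):
    """Assign a base to each step between adjacent peaks.

    A step matching no single residue is reported as "(XY)", the pair whose
    summed mass fits the gap; the order within the pair is not determined.
    A step that fits neither is reported as "?".
    """
    masses = sorted(peak_masses)
    steps = []
    for lo, hi in zip(masses, masses[1:]):
        delta = hi - lo
        single = [b for b, m in RESIDUE_MASS.items() if abs(delta - m) <= tol]
        if single:
            steps.append(single[0])
            continue
        pairs = [
            a + b
            for a, b in combinations_with_replacement(RESIDUE_MASS, 2)
            if abs(delta - RESIDUE_MASS[a] - RESIDUE_MASS[b]) <= tol
        ]
        steps.append("(" + "|".join(pairs) + ")" if pairs else "?")
    return steps
```

Some pairs, such as A plus T and C plus G, differ in combined mass by only about one Dalton, so the usable tolerance depends on the mass accuracy actually achieved.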
1.4.8 The Amplification Of A DNA Stretch Using The PCR Procedure With The
Knowledge Of Only One Primer
In another embodiment, the present invention enables the amplification of a
DNA stretch using the PCR procedure with the knowledge of only one primer.
Using
this basic method, the present invention describes a procedure by which a very
long
DNA of the order of millions of nucleotides can be sequenced contiguously,
without
the need for fragmenting and sub-cloning the DNA. In this method, the general
PCR
technique is used, but the knowledge of only one primer is sufficient, and the
knowledge of the other primer is derived from the statistics of the
distributions of
oligonucleotide sequences of specified lengths.
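The statistical basis referred to above can be illustrated under a deliberately simple model, assumed here for the example only, in which the four bases occur independently and with equal frequency. Under that model a fixed sequence of k nucleotides is expected about once every 4^k positions. The function names are hypothetical.

```python
def expected_spacing(k):
    """Mean distance in bases between occurrences of a fixed k-mer."""
    return 4 ** k


def prob_site_within(k, window):
    """Approximate chance that a fixed k-mer occurs at least once among
    `window` candidate positions, treating positions as independent."""
    p = 0.25 ** k
    return 1.0 - (1.0 - p) ** window
```

For fixed sequences of 4, 5 and 6 nucleotides the expected spacings are 256, 1,024 and 4,096 bases respectively, which is why a short fixed portion in the second primer can be expected to find a binding site within a sequenceable distance of the known primer. Real genomes depart from this uniform model.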
1.4.8.1 Method of Sequencing without the Need for Fragmenting or
Subcloning
The objects and advantages of the present invention are also achieved by a
method comprising:
a) synthesizing a partly fixed primer, with 4, 5, 6 nucleotide, or longer
sequence characters fixed within it. The fixed sequence can be any sequence,
with
some preferred sequences such as those containing many G-C pairs that
increase
binding affinity. The fixed position within the primer can be anywhere, with
some
preferred positions;
b) taking a very long genomic DNA, either uncloned or a cloned large insert
such as the YAC or cosmid in which a short sequence of about 20 characters
somewhere within the DNA is known;
c) synthesizing a primer from the sequence known from the DNA in step b;
d) radiolabeling the primer in step c;
e) annealing the primers (from step a, and step d or step g as appropriate) to
the DNA in step b, and amplifying the DNA between the attached primers;
f) performing DNA sequencing of the amplified DNA by the chemical
degradation method of Maxam and Gilbert, or carrying out DNA sequencing by the
Sanger method, or by modified PCR-sequencing method;
g) after obtaining the DNA sequence from step f, selecting an appropriate
first
primer towards the 3' end of the sequence, synthesizing it, and radiolabeling
it;
h) repeating the steps a through g with the two primers (the same partly fixed
unknown primer as the second primer and the newly synthesized primer from step
g
as the first primer);
i) if the sequence obtained in step f is too short to be of value, using
another
partly fixed primer with a different fixed sequence and the same first primer
to obtain
a longer DNA sequence.
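Steps a through i can be sketched as a loop over positions in a sequence string. This is a toy model with strong simplifying assumptions: it works on one strand only, ignores strand complementarity and mispriming, treats the partly fixed primer as binding at the first downstream occurrence of its fixed sequence, and tracks primers by position. All names and default values are hypothetical.

```python
def walk(genome, known_primer, fixed_motifs, primer_len=20, read_len=400, min_read=30):
    """Toy primer walk; returns (start, end) index pairs, one per round."""
    if min_read <= primer_len or read_len <= primer_len:
        raise ValueError("min_read and read_len must exceed primer_len")
    pos = genome.find(known_primer)      # step c: site of the known first primer
    if pos < 0:
        raise ValueError("known primer not found in genome")
    windows = []
    while True:
        amplicon_end = None
        for motif in fixed_motifs:       # steps a and i: try each fixed sequence
            hit = genome.find(motif, pos + primer_len)
            if hit < 0:
                continue
            end = hit + len(motif)
            if end - pos >= min_read:    # step i: reject products that are too short
                amplicon_end = end
                break
        if amplicon_end is None:
            return windows               # no usable second-primer site downstream
        read_end = min(amplicon_end, pos + read_len)  # step f: one sequencing read
        windows.append((pos, read_end))
        pos = read_end - primer_len      # step g: new first primer near the 3' end
```

Each window overlaps the previous one by the length of the new first primer, which is what allows the successive reads to be joined into one contiguous sequence.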
Unless defined otherwise, all technical and scientific terms used herein have
the same meaning as commonly understood by one of ordinary skill in the art to
which this invention belongs. All publications mentioned hereunder are
incorporated
herein by reference. Unless mentioned otherwise, the techniques employed
herein are
standard methodologies well known to one of ordinary skill in the art.
The partly fixed primers used to perform DNA amplification and sequencing
are, of course, not limited to those described under the examples. Further
modification
in the method may be made by varying the length, content and position of the
fixed
sequence and the length of the random sequence. Additional obvious
modifications
include using different DNA polymerases and altering the reaction conditions
of DNA
amplification and DNA sequencing. Furthermore, the basic technique can be used
for
sequencing RNA using appropriate enzymes.
Instead of preparing the first primer completely, it can also be prepared as
follows. Two or three shorter oligonucleotides that would comprise the
complete
primer could be ligated, by joining end-to-end after annealing to the template
DNA,
as described under another patent (Helmut Blocker, U.S. Pat. No. 5,114,839,
435/6,
5/1992) or as described in the publication (L. E. Kotler, et al., Proceedings
of the
National Academy of Science, USA, 90:4241-4245 (1993)). Alternatively, it can
be
synthesized using the single-stranded DNA binding protein, the subject of
another
invention (J. Kieleczawa, et al., Science, 258:1787-1791 (1992)). One of such
procedures, or an improved version thereof, can be used to make the first
primer in
the present invention. All in all, the first primer need not be synthesized at
every PCR
reaction while contiguously sequencing a long DNA, and can be directly
constructed
from an oligonucleotide bank. Based on the present invention, the second
primer also
can be chosen from a set of only a few pre-prepared primers. This enables the
direct
automation of sequencing the whole long DNA by incorporating the primer
elements
into the series of sequential PCR reactions.
1.4.8.2 Advantages of Method
An advantage of the present invention is that from a known sequence in a very
long DNA, sequencing can be performed in both directions on the DNA. Two first
primers can be prepared, one on each strand, running in the opposite
directions, and
the sequence can be extended on both directions until the two very ends of the
long
DNA are reached by the present invention, using a small set of pre-prepared
partly
fixed second primers.
One of the major advantages of the present invention is that it is highly
amenable to various kinds of automation. Instead of radiolabeling the first
known
primer, it can be fluorescently labeled, and with this the DNA sequencing can
be
performed in an automated procedure on machines such as that marketed by the
Applied Biosystems ("373 DNA Sequencer: Automated sequencing, sizing, and
quantitation", a pamphlet from the Applied Biosystems, A Division of Perkin-
Elmer
Corporation (1994)). In the present invention there is no need to newly
synthesize any
primers to sequence a very long DNA. Thus, with the pre-prepared set of
partly fixed
second primers, an oligonucleotide bank for the synthesis of the first primer,
and a
large supply of the template genomic DNA (or any long DNA), the sequencing of
the
whole long DNA can be automated using robots almost without any human
intervention, except for changing the sequencing gels.
1.4.8.3 Applications of Method
The following processes can be computer controlled: 1) the selection of the
appropriate sequence for constructing the first primer close to the 3' end of
the newly
worked out sequence, 2) determining whether the sequence obtained is too short
and
selection of a different partly fixed second primer, 3) assembling the
contiguous DNA
sequences from the various lanes and various gels and appending to a database,
and
other such processes. Thus the present invention enables the construction of a
fully
automated contiguous DNA sequencing system. Any such automations are obvious
modifications to the present invention.
The present invention is not limited to only unknown genomic DNA, and can
be used to sequence any DNA under any situations. DNAs or RNAs of many
different
origins (e.g. viral, cDNA, mRNA) can be sequenced, not only for research
or
information gathering purposes, but also for other purposes such as disease
diagnosis
and treatment, DNA testing, and forensic applications.
It is understood that the examples and embodiments described herein are for
illustrative purposes only and that various modifications or changes in light
thereof
will be suggested to persons skilled in the art and are to be included within
the spirit
and purview of this application and scope of the appended claims.
It should be noted that any kit or process used for research, diagnostic,
forensic, treatment, production or other purposes that uses the present
invention is
covered under these claims. Furthermore, the various sequences of the partly
fixed
second primers that can be used in the present invention are covered under
this patent.
Thus, any kit or process that uses this method and/or the DNA strands with the
sequences that would comprise the partly fixed second primers will also be
covered
under this.
In addition to contiguous DNA sequencing, the present invention will cover
the amplification of the DNA strands that are bounded between the known primer
and
the partly fixed second primer (either from claim 1 or from claim 2). The DNA
amplification can also be performed for long DNA strands using the long PCR
amplification protocols.
1.4.9 Polynucleotide Sequencing With Random Surface Immobilization And
Light Microscopic Detection Of Affinity Labels Coupled To Microscopic Beads
A DNA sample is prepared by shearing or digestion at a first sequence with a
first restriction enzyme producing a 3' overhang terminus, to some
appropriate, known
size distribution, and labeled with a digoxigenin bearing nucleotide by the
action of
terminal deoxynucleotidyl transferase.
After such digoxigenin labeling, said DNA sample is then subjected to random
internal cleavage, for example by shearing so as to produce a population of
molecules
with an average length half that produced in the previous sizing step, or
digestion with
a second restriction enzyme recognizing a distinct, second recognition
sequence.
Sample molecules of said sample are then bound at some convenient surface
density
to a transparent surface modified with a monolayer or a sub-monolayer density
of
anti-digoxigenin antibody. Said sample molecules, which will thus be bound to
said
transparent surface by the 3' termini of one strand, are then subjected to
treatment by a
3' to 5' exonuclease, which will only act at the 3' terminus which does not
bear the
digoxigenin moiety due to the hindrance of this latter 3' terminus by its
interaction
with the surface, preferably not to completion of digestion of susceptible
strands.
Thus primed DNA sample template molecules bound to a transparent surface in an
end-wise manner are prepared.
Using a single nucleotide labeling affinity moiety in a manner similar to the
example provided for one-bit binary labeling systems, utilizing for example
each of
the four nucleotides derivatized to effect communication of said nucleotides
with a
biotin moiety via a chemically cleavable linker, such as those described by
S.W. Ruby
et al.,34 polymerization directed by the template provided by each involved DNA
sample template molecule is effected with an appropriate DNA polymerase
lacking a
3' to 5' exonuclease activity, such as Sequenase 2.0,35 with only one
nucleotide type
present during each polymerization step sub-cycle, at sufficiently low
concentration to
effect equilibrium controlled stepping. Polymerization reagents are then
washed
away, and may favorably be recycled after quantitation and readjustment of
respective
labeled nucleotide content.
After each such polymerization sub-cycle step, which will add a biotin labeled
nucleotide to only a fraction of those sample template molecules having only
the base
complementary to the nucleotide of said sub-cycle located immediately 5' to
the base
opposite the 3' terminal base of the strand priming this nucleotide addition,
biotin
bearing molecules may be labeled with microscopic streptavidin coated beads.
Unbound beads are then washed away. Bead labeled molecules may then be
observed by a video microscope, and the position of said bead labeled
molecules
within a sample may be recorded by image analysis of digital images thus
obtained, in
a manner similar to that used by Finzi and Gelles. Dithiothreitol or other
reagents
capable of cleaving said linker holding said biotin in communication with said
nucleotide incorporated during the previous polymerization sub-cycle are then
used to
treat sample molecules to cleave said linkers and thus release said biotin
labeling
moieties and the beads which have bound to them. A wash step is then performed
to
remove said beads. The extent of bead removal may be checked with another
video
microscopy detection step if needed; and further cleavage treatment may be
performed if decoupling was not adequate. The same subcycle (comprising
polymerization, bead association, video microscopic examination, bead and
label
cleavage and removal by washing, and optionally a bead removal confirmation
video
microscopic examination step) is then repeated in succession for each of the
three
remaining nucleotide types, to complete a full base sequencing cycle (which as
noted
may yield information about more than one base location for some template
molecules according to the sequence composition and the order of sub-cycles,
and no
information for other sample template molecules). Multiple said base sequence
cycles
are repeated until enough data have been accumulated relative to the total
complexity
of the initial DNA sample. Recorded data are then used to reconstruct sequence
information for a segment of each sample template molecule, and segment
sequence
data are then aligned by appropriate computational algorithms.
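The sub-cycle logic described above may be illustrated by a minimal simulation. The following Python sketch is an illustration only and is not part of the disclosed method: the function names, the parameter `p_incorporate` (the fraction of eligible molecules extended in a sub-cycle), and the simplification that at most one nucleotide is added per sub-cycle are assumptions made for the example.

```python
import random

# Watson-Crick complement of each template base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def simulate_stepwise_sequencing(template, order="ACGT", cycles=1,
                                 p_incorporate=1.0, rng=None):
    """Simulate one template molecule through repeated sequencing cycles.

    `template` lists the template bases in the order they are copied.
    Each cycle consists of one sub-cycle per nucleotide type in `order`.
    Returns a list of (cycle, nucleotide) detection events.
    Assumption: at most one nucleotide is added per sub-cycle.
    """
    rng = rng or random.Random(0)
    position = 0  # index of the next template base to be copied
    events = []
    for cycle in range(cycles):
        for nucleotide in order:
            if position >= len(template):
                return events
            matches = COMPLEMENT[template[position]] == nucleotide
            # Equilibrium-controlled stepping: only a fraction of the
            # eligible molecules incorporate the offered nucleotide.
            if matches and rng.random() < p_incorporate:
                events.append((cycle, nucleotide))
                position += 1
    return events


def reconstruct(events):
    """Concatenate detected nucleotides into the synthesized-strand sequence."""
    return "".join(nucleotide for _, nucleotide in events)
```

Under these assumptions, and with `p_incorporate=1.0`, a template `TGCA` yields four bases within a single cycle, whereas a template `AAAA` yields one base per cycle. This is consistent with the remark above that a cycle may yield information about more than one base location for some template molecules, according to sequence composition and the order of sub-cycles.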
Note that this embodiment avails only existing and generally available
materials and devices, relies on relatively simple manipulations which are
known to
be highly reproducible according to their general use in the relevant fields,
but due to
the novel process of the present invention may yield genome sequence
information far
more rapidly and inexpensively than highly complex robotic instruments with
sequencing methods utilizing electrophoretic separation.
Note that microscopic detection may be performed with a computer controlled
steppable sample stage to effect the automated examination of large surface
areas and
hence very large numbers of sample molecules.
Alternatively, the transparent substrate providing the surface for
immobilization may be that of a spooled film, which may be advanced at an
appropriate rate before the objective of said video microscope of the present
embodiment. Further, with such a spooled sample arrangement, said film may be
circular, and continuously advanced through multiple video microscope apparata
and
wells effecting polymerization sub-cycles, all in appropriate order such that
benefit of
full pipelining of each step may be enjoyed. The construction of such
instrumentation
and rudimentary robotic actuation systems will be straightforward to those
skilled in
the relevant engineering arts.
Surface immobilization with single photon detection of plural fluorescent
labels coupled to photodetachable 3'-hydroxyl protecting groups. Sequence
determination may additionally be effected by the random immobilization at some
appropriate density of appropriately prepared and primed sample molecules on
the
surface of a transparent film, and stepwise polymerization with some
appropriate
polymerase, of all four nucleotides, all of which are protected at the 3'-
hydroxyl with
a photolabile (and hence photoremovable) protecting group in communication
with
labeling moieties which distinctly correspond to each nucleoside base type of
the
respective nucleotide. Label incorporation is detected, for example by the
scanned
beam light microscopic methods of the present invention, or with highly
sensitive
CCDs, and assigned to the spatial region occupied by a particular molecule.
Said film
is translated appropriately such that the full complexity of the sample may be
examined after each polymerization cycle.
Data are recorded electronically and according to the molecule for which they
are obtained. Illumination of the sample with an appropriate frequency and
intensity
of light to effect 3'-hydroxy deprotection and hence also labeling moiety
removal is
performed, and a wash step is performed to remove freed label. Such
polymerization,
detection and deprotection cycles are repeated until the sample is sufficiently
well
characterized.
1.4.9.1 Random And Non-Random Immobilization To Optical Detection Array
Devices With Optical Labels
1.4.9.1.1 Detection And Classification Of Pathogens In Clinical Samples
Methods of the present invention may be combined with the immobilization of
highly diverse libraries of binding specificities with either encoding labels
or
phenogenocouples, which may therefore be characterized dynamically and related
to any detected binding of particles of interest from a sample. Clinical
samples are
interacted with said libraries. All retained material is then interacted with
some
general label such as a polynucleotide binding dye (e.g. ethidium bromide,
DAPI)
or some chromophorigenic or photoemissive or labeled competitive inhibitor
analog reagent detecting some metabolically fundamental reaction such as ATP
hydrolysis, or the presence of enzymes catalyzing said metabolically fundamental
reaction. Pathogens containing polynucleotides or capable of said
metabolically
fundamental reaction may thus be detected.
The essential features of such a system are massively parallel screening for
affinity interactions, generalized labeling methodology, and automated sample
characterization. Because pathogen culturing is not required, and many types
of
highly specific information may be obtained in one assay procedure, without
any
previous knowledge of the state of the organism from which said clinical
sample
was obtained, this represents the basis for extremely powerful diagnostic
methods.
Note that various implementations may distribute binding specificities of
known composition in a spatially controlled manner, and thus rely on spatial
information to encode specificity type and hence, if known, composition of
each
specificity type. Note also that said libraries may comprise known mimetics or
small molecules of known binding specificity.
The profile of any sample type from an individual organism according to such
an assay may be monitored over time, and a profile is preferably obtained for
a
state of presumed health for comparison to samples correlated to states of
disease,
deficiency or degeneration or other states of ill health (i.e. longitudinal
tracking
of individuals stratified by sample type). Samples of similar type may also be
compared across populations and subpopulations, and the profile of these
samples
also correlated with state of health of the respective individuals (cross-
sectional
comparison).
For additional selectivity of detection, such a sample characterized as above
may be further characterized according to the immunocharacterization method
below.
1.4.9.1.2 Automated Immunocharacterization And Cyber-Immune Detection
Such a system resembles that used for the detection and characterization of
clinical samples, except that said highly diverse libraries of binding specificities
comprise a large number of immunoglobulin specificities. Libraries comprising
immunoglobulin specificities may include such specificities in the form of
immunoglobulins expressed on bacteriophages, viruses, or in the form of the
phenogenocouples of the present invention.
Banks comprising all of the specificities of a library may be maintained as
monoclones, and upon detection of a pathogen in association with one or more
binding specificity contained in some library, and the identification and/or
characterization of said one or more binding specificity, an alignment of the
respective said monoclone, from one of said banks, may be provided to the
organism. Such analysis and provision of one or more monoclones may be automated
and controlled by algorithms.
Similar rapidity and broad characterization advantages are attained as with
the
preceding method for the characterization of clinical samples.
1.4.9.1.3 Massively Parallel Enzymological Assays:
In a manner similar to the preceding embodiments, several enzymes contained
within some sample may be analyzed according to their binding probability,
binding duration or dissociation rate and conformational or phosphorylation or
other status. Such assays may favorably be performed by the methods of the
present invention, with immobilized libraries which may include competitive
inhibitors, and with pre- or post-binding labeling of sample enzymes by
encoded
label antibodies, to permit classification of sample enzyme type on a molecule
by
molecule basis, which classification data may be combined with the data
obtained
in this assay.
1.4.9.2 Hybridization Based Detection Of Polynucleotide Sequences.
Various methods have been developed to test for the presence of short
polynucleotide sequences and combinations of such sequences (according to
stringency) in polynucleotide samples by hybridizing oligonucleotides or
polynucleotides of known sequence to said polynucleotide samples. Such methods
are
sometimes termed "gene-probe" methods and often involve the use of
immobilized,
ordered arrays of oligonucleotides of known composition.
Said ordered arrays have been formed on the surfaces of integrated electronic
devices. It has been shown that, provided stringency can be made sufficiently
high to
prevent binding with even one base mismatch, such methods may be used to
obtain
sequence information about a sufficiently small sample.
The methods of the present invention provide a more rapid and convenient
method for testing for the binding of known oligonucleotides to a complex
polynucleotide sample, owing largely to the higher degree of parallelism which
may
be accomplished with single molecule methods. Here, each oligonucleotide, of
known
sequence, to be used as a specific gene probe, is synthesized with some
perceptible
encoded label, as described above, where the codes assigned to the sequence of
said
each oligonucleotide are known (due to the synthetic scheme by which they are
produced and concurrently labeled). These are then hybridized to sample
polynucleotide molecules, which either have previously been or will
subsequently be
immobilized, or may otherwise be separated from probe oligonucleotides, and
the
presence or absence of said each oligonucleotide in the sample polynucleotide
containing fraction, which is a direct result of the success or failure of
said each
oligonucleotide to bind said sample polynucleotide molecules, will be readily
ascertained through the detection and discrimination of the perceptible
encoding
labels corresponding to said each oligonucleotide. Contrary to the
conventional gene-
probe methodology, known probing molecules are generally unbound in this
variation
of the method as may be used with the present invention.
If the complexity of the polynucleotide sample is not too large, and the
population made up of said oligonucleotides is sufficiently large and complex,
preferably exhaustively enumerating all possible oligonucleotides of the
respective
and sufficiently long length, and provided hybridization may be sufficiently
stringent,
which stringency is affected by a large number of known factors but also has
sequence dependent components, information about the binding of said each
oligonucleotide, which may be related to the respective known sequence and by
Watson-Crick pairing rules to the respective sample polynucleotide sequence
segment
(or by identity with the strand complementary to the strand to which said each
oligonucleotide has bound) may thus be obtained. As with other methods,
alignment
of such data may yield information about the sequence of the sample. The
methods of
the present invention further provide for the quantitation of such
oligonucleotide
hybridization by way of counting the number of times a particular perceptible
encoded label is retained by a said polynucleotide sample, which may be
availed both
in the monitoring and correcting of errors and in the modulation of binding
(hybridization) conditions.
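The exhaustive probing and counting described above may be sketched computationally. The following Python example is a simplified illustration under stated assumptions: hybridization is idealized as perfect-match binding of a probe to its reverse complement within a single sample strand, and the function names are hypothetical. Real stringency is sequence dependent, as noted above, and is not modeled here.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def all_oligos(k):
    """Enumerate all 4**k oligonucleotides of length k."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def reverse_complement(seq):
    """Return the Watson-Crick reverse complement of a sequence."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def hybridization_counts(sample_strand, k):
    """Count perfect-match binding sites for every k-mer probe.

    A probe binds wherever the sample strand contains the probe's reverse
    complement (idealized stringency: no mismatches tolerated).
    Probes with no binding site are omitted from the result.
    """
    sites = Counter(sample_strand[i:i + k]
                    for i in range(len(sample_strand) - k + 1))
    counts = {}
    for probe in all_oligos(k):
        n = sites[reverse_complement(probe)]
        if n > 0:
            counts[probe] = n
    return counts
```

For example, `hybridization_counts("AAAA", 2)` reports that only the probe `TT` is retained, at three sites. The per-probe counts correspond to the number of times a given encoded label would be observed on the sample in this idealized model.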
Alternatively, probing may be accomplished by oligomeric sequences
immobilized in some known configuration, for example by spatially patterned
methods such as those of S.P.A. Fodor et al.37 or by the lattices produced
hierarchically by the method of N.C. Seeman noted above but comprising an
ordered
array (the order of which is predetermined by the incorporation or association
of
single stranded oligonucleotides or other single stranded termini of known
sequence
into or with modular components used to build up said lattices) of short
single
stranded regions of known sequence and preferably one free terminus (so as not
to
hinder conformational changes required for hybridization), but detected by the
methods of the present invention, where sample polynucleotides are labeled
with
some appropriate discernible label, such as the dye YOYO-1, to facilitate the
detection
of their presence in association with each of said oligomeric sequences.
A yet further variation for effecting the spatially predetermined distribution
of,
for example an exhaustively enumerated population of single stranded
oligonucleotides, may be effected by the use of the methods of N.C. Seeman to
produce a uniform two dimensional lattice with a repeating pattern of short
single
stranded sequences with photo protected termini, for example all of the 256
possible
4-mers. Such a lattice may have a periodicity substantially smaller than the
wavelength of visible light. Said short single stranded sequences may
comprise
some synthetic backbone so as to be resistant to enzymatic cleavage, which
backbone
preferably also is non-ionic (for example, of alkyl or beta-cyanoethyl
derivation,
peptide-nucleic-acid composition, or methylphosphonate composition) so as to
denature from a complementary sequence only at markedly elevated temperatures
relative to ordinary oligonucleotides. Thus, a pattern of oligonucleotide
complexity
may be distributed in a predetermined manner below the resolution of light
directed
patterning.
Light patterning techniques may then be availed to spatially direct the
photodeprotection of said short single stranded sequences at lower resolution.
Such
light directed syntheses are preferably terminated with some comonomer which
will
prevent exonucleolytic degradation of said short single stranded sequences, or
all of
said short single stranded sequences are of a polarity opposite to that
specified by the
exonuclease to be subsequently used. By this combination of methods,
patterning
resolution is not limited by the properties of light, but may avail of the
convenience of
light directed patterning at lower resolutions. After a known distribution of
all
possible single stranded sequences of sufficient complexity has thus been
produced, a
denatured, labeled polynucleotide sample produced by extensive nick
translation, with
fluorescent labeled nucleotides, of a naturally occurring polynucleotide
sample is
hybridized to said lattice. Hybridized molecules are treated mildly with a
single strand
specific nuclease, followed by an exonuclease, to degrade or by the same
process to
free those regions which are not bound to the probing said short single
stranded
sequences. Label incorporated into the nick translation products of said
polynucleotide sample is then detected and spatially mapped by the methods of
the
present invention, and binding is thus scored according to the known probing
said
short single stranded sequences. This method thus avails the molecular
parallelism
made possible by the molecular recognition, high density and high resolution
detection methods availed with the present invention.
Note, finally, that higher density patterning than attainable by conventional
light patterning methods may also be effected by scanning probe lithographic
methods, such as the use of NFSOM lithography with photodeprotectable groups.
1.4.9.2.1 Methods For Repeatable Detection And Identification Of Single
Molecules
Repeatable detection and identification of single molecules is achievable by
microscopic labeling with some readily identifiable, e.g. combinatorially or
permutationally diverse and readily examined particle or molecule or group of
molecules and detection of the thus marked identity of individual free
molecules in
solution, with removal of excess nucleotides (e.g. by filtration); and,
scanning of a
liquid sample volume where sample molecules and sample conditions are matched
to
ensure manageably slow free diffusion of sample molecules permitting tracking
of the
motions of free individual molecules in solution, as observed by T.T. Perkins
et al. for
reptation of DNA in solution, in which instance unreacted labeled monomers may
be
removed, for instance, according to their more rapid diffusion, possibly
through a
filter, and detection may favorably comprise observation of reduced mobility
of a
labeling moiety after it has become attached to a sample molecule.
According to the labeling methods employed, various detection methods may
satisfy the requirements of signal detection with repeatable assignability to
a
particular unique sample template molecule.
Prominent among these detection methods are microscopy methods such as
video microscopy including confocal fluorescence microscopy with or without
enhancement, and with or without variations incorporated into the present
invention;
near field scanning optical microscopy (NFSOM) and variations thereof; contact
and
non-contact varieties of scanning force microscopy (SFM; also termed atomic
force
microscopy (AFM)) and variations thereof; other scanning probe microscopies
including scanning tunneling microscopy (STM), scanning tunneling spectroscopy
(STS), and so-called field emission mode STM (which is more accurately
described
as microscopy by field emission from a scanned conductive probe, or scanning
field
emission microscopy, SFEM, because no tunneling actually occurs). Any
enhancements of scanning probe microscopy, including multiple probe
parallelism,
may readily be availed in the practice of the present invention.
Additionally, optical detection methods employing optoelectronic array
devices (OADs), such as spatial light modulators (SLMs), laser diode arrays
(LDAs),
light-emitting diode arrays, or charge coupled photo-diode arrays
(conventionally
termed CCDs), in combination with appropriately high sensitivity detection
methods,
may also be employed, particularly with samples immobilized such that the
maximal
proportion of pixel elements of said array will be involved with the detection
of a
signal from exactly one sample molecule. CCD and SLM array devices are
presently
available at pixel densities of approximately 10^5 to 10^6 per cm^2. LDAs of
comparable
density are currently under development. Device level constraints upon
parallelism
will thus be significant, but may be overcome by increasing the data obtained
per
molecule (i.e. processivity or sequence segment length). Such devices may be
employed remotely, i.e. in some arrangement where light passes through the
sample
under study and is detected by some apparatus involving said array devices, or
in
close or direct contact with said sample, as for instance, polynucleotides
have been
immobilized to integrated circuits for other applications. Appropriate
arrangements of
such devices for the appropriate detection scheme in which each device type is
appropriately used will be obvious to those skilled in the arts of optics and
optoelectronics.
Note that for purposes of those variations of the present invention involving
the immobilization of sample molecules, said immobilization may be
conveniently
effected in a random manner, relying upon some appropriate surface or volume
density which yields a corresponding random surface or volume distribution,
and
appropriate detection methods to permit repeatable resolution of most sample
molecules from each other. The length of the molecules in question will be an
important factor in the determination of a desirable said density. Generally
speaking,
for random surface immobilization and without the use of measures to orient or
order
sample molecules, for molecules of length L (which may additionally account
for any
labeling bead diameter), and detection methods relying on spatial resolution
R,
maximum practical molecule number density will generally be less than
1/(2L+R)^2. This assumes the worst case configuration of two end immobilized
molecules extending directly towards each other and both labeled near their
respective
termini. Similar calculations may be applied to three dimensional cases.
Alternatively,
one may consider (2L+R)^2 or (2L+R)^3 to be an average bin size, and determine
via the
Poisson distribution the optimal molecular number density corresponding to the
largest
number of bins being occupied by precisely one sample template molecule.
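The two density estimates above may be made concrete with a short calculation. The Python sketch below is illustrative only; the function names are hypothetical, and L and R are assumed to be expressed in the same length unit. It relies on the standard Poisson result that, for a mean occupancy m per bin, the probability of a bin holding exactly one molecule is m*exp(-m), which is largest at m = 1 (approximately 37% of bins singly occupied).

```python
import math


def max_surface_density(length, resolution):
    """Worst-case bound on molecules per unit area: 1 / (2L + R)**2."""
    return 1.0 / (2.0 * length + resolution) ** 2


def single_occupancy_fraction(mean_per_bin):
    """Poisson probability that a bin holds exactly one molecule."""
    return mean_per_bin * math.exp(-mean_per_bin)


def best_mean_per_bin(candidates):
    """Pick the candidate mean occupancy that maximizes single occupancy."""
    return max(candidates, key=single_occupancy_fraction)
```

For instance, with L = 1.0 and R = 0.5 (arbitrary units), the worst-case bound is 1/6.25 = 0.16 molecules per unit area, and a scan of candidate mean occupancies from 0.1 to 3.0 selects 1.0 molecule per bin of area (2L+R)^2.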
Alternatively, molecules may be labeled by a first label, for example with a
particular fluorescent dye incorporated by nick translation, in a manner
identifying a
portion of the molecule near the site of polymerization, and proximity of said
first
label to the perceptibly distinct labeling moieties used for nucleotide
incorporation
detection and discrimination will permit the detection of unacceptable
proximity of
two distinct sample molecules. Such a method is consistent with the tracking
methods
described below for free sample molecules. In such a case, the data collected
during
the cycle in which said unacceptable proximity is observed for the affected
molecules
may be ignored, and lack of information from this cycle noted for the
respective
molecules. Conditions, such as solution viscosity, sample molecule diffusion
rate,
sample molecule concentration, sample dimensions, etc., may be optimized to
reduce
the occurrence of such unacceptable proximity, and oversampling methods
described
in other portions of the present disclosure may be applied to preclude this
form of
error from degrading final data quality. These methods may be applied to
either
immobilized or unimmobilized sample molecules.
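The handling of unacceptable proximity may be sketched as a simple per-cycle filter. The following Python example is a hypothetical illustration: the function names, the use of planar coordinates, and the single distance threshold `min_separation` are assumptions and are not specified in the present disclosure.

```python
import math


def flag_close_molecules(positions, min_separation):
    """Return the indices of molecules lying closer than `min_separation`
    to at least one other molecule."""
    flagged = set()
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) < min_separation:
                flagged.update((i, j))
    return flagged


def usable_observations(observations, flagged):
    """Replace the observation of every flagged molecule with None,
    i.e. note a lack of information for this cycle."""
    return [None if i in flagged else obs
            for i, obs in enumerate(observations)]
```

Observations set to `None` mark cycles for which no information is recorded for the affected molecules; oversampling would then be relied upon to supply the missing data.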
1.4.9.2.1.1 Microscopy Based Detection
Light microscopic visualization represents a particularly convenient and
technically simple detection and unique molecule localization method. A
visualization
method of particular interest for purposes of the present invention in higher
performance or more demanding applications is video enhanced confocal
fluorescence microscopy (VECFM), preferably utilizing optics well matched to
the
refractive index of the reaction or detection medium.
As discussed above, various scanning probe microscopies may also be
advantageously used within the present invention according to labeling agents
and
methods used. Most prominent among these are NFSOM and variations thereof, and
both contact and non-contact SFM, and variations thereof.
Generally speaking, a microscopy based detection method must be sufficiently
convenient, capable of use with a stepper translated or otherwise translatable
sample,
not destructive of the sample, and capable of detection of any labeling
methodology to
be used with it. Thus, it is quite likely that many microscopy methodologies
not yet
developed may readily be employed with the present invention. Further,
microscopy
and corresponding apparata shall comprehend any miniaturized or
microfabricated
microscopy devices or other comparable integrated detection means.
1.4.9.2.1.2 HIGH SENSITIVITY AND SCANNED EXCITATION BEAM
FLUORESCENCE CONFOCAL MICROSCOPY
A modification of VECFM which is particularly suited for SMD and SMV
relies upon selective fluorescent excitation of an appropriate dye molecule
label (or of
molecules within a sample with appropriate fluorescent properties independent
of
labeling) in some sample by means of some tightly defined beam, with
dimensions at
or near the resolution limit of the apparatus, of an appropriate frequency, or
of
parametrically controllable frequency, where said beam is caused to scan in a
controlled manner through the sample region within the visual field. This
microscopy,
including numerous variations, may be termed either scanned beam confocal
microscopy or steered beam confocal microscopy (in either case, SBCM).
Scanning
of said beam through the sample within the visual field may be accomplished by
introducing said beam into the optical path of the VECFM via mobile mirrors
which
may effect said controlled scanning, or by first producing said beam with a
pinhole
which is itself scanned, before deflection towards the sample via said
mirrors, which
in the present case may be fixed in position, through the use of pinholes in a
rotating
disk arranged in one or more spiral arms to effect an approximately rastering
illumination of the sample as said disk rotates, or by other means which will
be
obvious to those skilled in the design of optical instrumentation and
microscopy. Said
beam will excite fluorescence in any appropriately responsive molecules which
occur
in its path. An optical splitter may then redirect a fraction of the light
transmitted from
the sample through the objective lens, and direct it through a narrow
bandwidth, high
transmissiveness filter, which may be specific for a fixed or for a
parametrically
controllable variable frequency, to uniquely select the appropriate
fluorescent
emission frequency, to a highly sensitive photodetector, which may record
either
intensity as intensity information or as the number of photons detected per
unit time,
as a function of the region being subjected to fluorescence-exciting
illumination or
being distinctly observed (see below). Thus a high resolution map of the
fluorescence
of the sample may be reconstructed, and further overlaid on images obtained for
the
same sample and sample location by conventional VECFM means.
Alternatively, the entire sample or visual field may be subjected to
illumination by an appropriate excitation frequency, and a pinhole scanned
through
the portion of the output of said optical splitter, such that light passing
through said
pinhole will reach said highly sensitive photodetector.
In yet a third, albeit technically more complex implementation, an SLM may
be used in place of said pinhole (in either configuration), and fluorescent
excitatory
illumination may be either broadly distributed or scanned.
In a fourth, albeit technically more complex implementation, sensitive
photodetection may be accomplished with a highly sensitive CCD, and
fluorescent
excitatory illumination may be either broadly distributed or scanned. At
present, CCD
sensitivity approaching single photon detection is technically possible though
is not
practical for high volume applications.
In a fifth implementation, said scanned beam may originate from a laser diode
array device or a light emitting diode array device, where only one of, or a
contiguous
group of elements of, such an array is active at any particular time so as to
produce a
particular beam, and the group of active elements of said array is changed as a
changed as a
function of time to effect scanning of the sample by the coordinated
activation and
deactivation of the plural beams thus produced.
In all of the above implementations, spatial information is gained about any
particular fluorescent emission, and this may then be combined with other
visual
information obtained via the same VECFM apparatus.
Note that for scanned beam methodologies, where beams are used for
excitation or detection, even where said beams may have inhomogeneous but
invariant distribution of internal flux density, known samples such as
individual dye
molecules may be imaged for calibration purposes and information useful for
algorithmic enhancement may be collected. This information represents the
characterization of the convolution of the beam and optics properties with the
signal
actually owing to the known sample, and thus localization of fluorescent
sample
features may be accomplished at better than optical resolution limitations.
For
example, a single, immobilized fluorescent molecule may be examined by such an
apparatus, and the intensity as a function of beam position may be recorded
for the
full duration of its presence within the beam's path as said beam scans the
sample, and
the data thus obtained may then be used to determine the change in observed
intensity
as the sample molecule enters the extremity of the beam, traverses the beam
and exits
the beam. This information may then be subjected, for instance to averaging or
other
computations to determine the relationship between the location of the
molecule
within the beam and the intensity observed, and finally that information used
to
estimate the intensity which would be observed when such a calibration sample
molecule is in the precise center of the beam. This information may then be
used in
image enhancement of unknown samples. Note, however, that localization to
below
optical resolution limitations is distinct from increasing the resolution
capability for
two nearby objects.
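The localization principle described above may be illustrated numerically. The Python sketch below is an illustration under stated assumptions: the recorded intensity profile of a single calibration molecule is modeled as a Gaussian function of beam position (real beams may be inhomogeneous, as noted above), the molecule position is estimated as the intensity-weighted mean of the scan positions, and all function names are hypothetical.

```python
import math


def centroid(positions, intensities):
    """Intensity-weighted mean position of a scanned profile."""
    total = sum(intensities)
    return sum(x * i for x, i in zip(positions, intensities)) / total


def gaussian_profile(positions, center, width, peak):
    """Hypothetical calibration profile: intensity versus beam position."""
    return [peak * math.exp(-((x - center) ** 2) / (2 * width ** 2))
            for x in positions]
```

In this model, a molecule located between two scan positions is recovered to well within one scan step, illustrating localization finer than the sampling interval. As stated above, this does not improve the ability to resolve two nearby objects from one another.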
Scanning beam microscopies will be of particular advantage where it is
desirable to use particular illumination frequencies to modify the sample. For
purposes of the present invention, a beam of predetermined frequency, for
instance
delimited and scanned by means of a pinhole as described above, may be used to
selectively modify a particular sample molecule. For example, a beam of
predetermined frequency may be used to effect the photobleaching of the
labeling
moiety on a particular sample molecule, to selectively remove a photocleavable
protecting group on a particular sample molecule, to selectively remove a
moiety
joined to a sample molecule by a photocleavable linker, or selectively control
any
photochemical reactions in a highly localized but non-invasive manner.
Note that implementations permitting variations of illumination frequency
and/or variations of the frequency or frequencies selected by filters for
detection
purposes constitute microspectroscopy or microfluorimetry, and may be applied
to
any of the various light microscopies.
1.4.9.2.1.3 REPEATABILITY BY IMMOBILIZATION WITH DISCERNIBLE
LOCATION
Surface Immobilization
A large number of methods presently exist to effect the immobilization of
macromolecules and other molecules to various surfaces including the surfaces
of
optically transparent materials. In general, such methods rely on the chemical
modification
of said surfaces such that they will be reactive with or have specific
affinity for
particular chemical functional groups placed on said macromolecules or
molecules.
Applicable methods include those described by S.P.A. Fodor et al. to effect
micropatterned surface immobilization and controlled synthesis of polypeptides
and
polynucleotides, those described by M. Hegner et al.14 to effect the end-wise
immobilization of terminally thiol modified double helical DNA molecules to a
gold
coated surface, or those methods recently used by L. Finzi and J. Gelles15 to
effect
end-wise attachment of DNA molecules to an antibody coated glass surface. Many
alternative methods will be obvious to those skilled in the relevant arts.
For purposes of genome sequencing applications of the present invention,
DNA from a cosmid library which may have been prepared from total genomic
material, from a cDNA library derived from a particular tissue type, from a
cosmid
library which may have been prepared for a single chromosome or group of
chromosomes or particular chromosome segments, or directly purified genomic
DNA
or directly purified RNA from a particular cell type, etc., may be subjected
to
fragmentation. Physical methods such as shearing with a hypodermic apparatus
may
be suitable. Where the sample is in the form of duplex DNA, it may be treated
with
restriction enzymes, which preferably restrict either 6- or 4-base recognition
sequences, so as to produce sample molecules of mean length of either 4
kilobases or
256 bases, respectively. Such lengths are sufficiently short to yield a high
number
density of sample molecules. Said sample molecules may then be appropriately
derivatized, for example by fill-in reactions at 5' overhang cohesive termini
produced
by said restriction enzymes with nucleotides bearing an affinity label or an
appropriately reactive chemical functional group.
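The mean fragment lengths quoted above follow from a simple idealization: if the four bases occur independently and with equal frequency, a recognition sequence of n bases is expected about once every 4^n bases. The following minimal sketch reproduces that arithmetic; the function name is illustrative only and is not part of the present disclosure.

```python
def mean_fragment_length(site_length: int, alphabet_size: int = 4) -> int:
    """Expected spacing, in bases, between recognition sites of the given
    length, assuming independent and equiprobable bases (an idealization)."""
    if site_length < 1:
        raise ValueError("site_length must be positive")
    return alphabet_size ** site_length


# 6-base and 4-base recognition sequences, as discussed in the text.
for n in (6, 4):
    print(n, mean_fragment_length(n))
```

Real genomes depart from this idealization (base composition bias, site clustering), so the values are best read as order-of-magnitude estimates.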
1.4.9.2.1.4 MATRIX IMMOBILIZATION
There has been increasing interest and progress in the field of affinity
chromatography which relies upon varyingly specific affinity interactions
between
molecules immobilized to a chromatographic matrix or polymeric matrix and the
molecules contained in some sample. Of particular relevance are matrices with
polynucleotides immobilized thereupon. An example which is widely known and
used
within the relevant fields is oligo-dT cellulose. Further, many chemistries
and
methods used to immobilize macromolecules to surfaces will be similarly
applicable
to immobilization to a polymeric matrix provided said matrix is chosen so as
to have
appropriate reactivities and not pose any difficulties associated with non-
specific
interactions. Most methods capable of effecting such matrix immobilization
will be
acceptable for purposes of the present invention. Note, however, that any
matrix used
in the present invention must admit the sufficiently rapid transport or
diffusion of
reagents, enzymes and buffers, as required by the particular embodiment.
1.4.9.2.1.5 FOCAL PLANE SCANNING
For detection and discrimination within a volume, whether for matrix
immobilized samples or diffusion constrained free molecules in solution,
especially
where fluorescent labeling of one form or another has been employed, a sample
may
be examined by microscopy with reconstruction of three-dimensional spatial
information by scanning the focal plane through the depth of the sample and
collecting image data at appropriate intervals. Such methods of three-
dimensional
reconstruction are well known within the art of microscopy.
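As a rough illustration of focal plane scanning, the sketch below stacks two-dimensional intensity frames recorded at successive focal depths into a volume and reports the brightest voxel. The frame layout (lists of rows), the uniform z step and both function names are assumptions made for this example only.

```python
def build_volume(frames, z_step):
    """Pair each 2-D frame (a list of rows of intensities) with the focal
    depth at which it is assumed to have been collected."""
    return [(index * z_step, frame) for index, frame in enumerate(frames)]


def brightest_voxel(volume):
    """Return (z, y, x, intensity) for the most intense voxel in the stack."""
    best = None
    for z, frame in volume:
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                if best is None or value > best[3]:
                    best = (z, y, x, value)
    return best
```

A practical reconstruction would also correct for out-of-focus light; that step is omitted here.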
1.4.9.2.1.6 PLANE EXCITATORY ILLUMINATION
Alternatively, optical means such as moving slits or SLMs or laser diode
arrays may be employed to selectively illuminate a particular region,
preferably a
single plane (of thickness similar to the wavelength of light employed or
feature size
of integrated device means employed), to examine a particular subset of sample
template molecules and labels associated with them, providing spatial
reconstructability of the data thus collected.
1.4.9.2.2 TWO BEAM METHODS INCLUDING PLANE ILLUMINATION
Volume distributed samples may also be examined with methods closely
analogous to those recommended for three dimensional optical mass data
storage, for
instance, by Sadik Esener in U.S. Patent Number 5,325,324. Here, labels
requiring
excitation by photons of two distinct frequencies for photoemission may be
employed.
Alternatively, the related methods of illuminating an entire plane of a sample
with one
of said distinct frequencies may be availed as a mechanism for imaging with
spatial
reconstructability.
1.4.9.2.3 Immobilization Via Concatenation
For the various applications of the present invention involving the
interaction
of enzymes with extended linear macromolecules such as polynucleotides, when
said
extended linear molecules may be conveniently circularized by appropriate
treatments
(which will generally be obvious to those skilled in the relevant arts),
immobilization
of said extended linear molecules may be conveniently effected by their
concatenation
with second extended linear molecules which are likewise conveniently
circularized
by appropriate treatments (which will again generally be obvious to those
skilled in
the relevant arts) bearing chemical properties (i.e. functional groups such as
thiols or
affinity moieties such as biotin) favorable for convenient, specific
immobilization to a
surface, matrix or other solid support. For purposes of, for example, certain
sequencing applications of the present invention, said second extended linear
molecules are favorably bound (with methods which will generally be obvious to
those skilled in the relevant arts) at a predetermined location along their
length, to
some protein, which may be an enzyme such as a polymerase, before
immobilization.
Said second extended linear molecules may have termini with reactive chemical
functional groups which may be bound together by the addition of some
appropriate
reagent such as a chemical cross-linking agent, or with some affinity moiety
such as
an oligo- or polynucleotide which may be bound together by an appropriately
complementary oligonucleotide or polynucleotide (with or without ligation
thereof),
or some appropriate multifunctional binding protein or receptor. Such an
arrangement
permits the following steps to be performed: said second extended linear
molecule is
bound to said enzyme; said protein is caused to bind to said first extended
linear
molecule (which may be circularized either in a prior or subsequent step);
said second
extended linear molecule to which said protein has been bound is caused to
circularize
by appropriate treatment; and if said first extended linear molecule is at
this stage
linear, it is caused to circularize. Without any special measures, there is a
fifty percent
chance that such a process will result in concatenation of the first extended
linear
molecule with the second extended linear molecule. Numerous methods, such as
size
separation followed by retention by immobilization, may be used to purify the
resulting desired concatenate. Where said second extended linear molecule was
chosen to be relatively short, such an assemblage will provide for the
retention of said
first extended linear molecule, now in concatenated circular form, in
proximity to said
protein, with specific immobilization or convenient immobilizability. Thus,
said
protein and said first extended linear molecule now in concatenated circular
form
have a high effective concentration with respect to each other upon
dissociation, and
said protein and said first extended linear molecule now in concatenated
circular form
will not interact with the molecules of other such assemblages when said
assemblages
are at sufficiently low density or said second extended linear molecule now in
concatenated circular form is particularly short (i.e. effectively shackles
said first
extended linear molecule now in concatenated circular form to said protein
whether or
not said first extended linear molecule now in concatenated circular form is
bound by
said protein.)
Such an immobilization scheme will be particularly desirable in, for example,
sequencing applications of the present invention where a polymerase must perform a
cycle, in which it binds, modifies and releases a sample molecule, at a high rate. A
particular instance in which such desirability obtains is for samples to be analyzed
with long sequence segments (e.g. hundreds or thousands of bases) where dissociation
of the polymerase is necessary to permit either 3' hydroxy deprotection (e.g. removal
of a photolabile protecting group) and/or labeling moiety removal by appropriate
means. Note that by immobilizing the enzyme, and hence the spatial location at
which
the labeling moiety first comes into physical communication with a sample
molecule,
the above stated limitation on sample molecule density may be overcome, with
the
new limit being that imposed by the detection method, thus increasing sample
density
and in some embodiments the parallelism that thence may readily be achieved
with
detection methods such as microscopy. It is therefore feasible, with such
assemblages,
to collect sequence data dynamically from each molecule at a rate approaching
the
limits imposed by the slower of the characteristic nucleotide incorporation
rate of the
polymerase; or, the diffusion rate limit of nucleotide association with the
nucleotide
binding site of the polymerase (divided by four) when nucleotides are at a
sufficiently
low concentration that their presence as labeled but free molecules in the
detection
field does not interfere with the detection (which may be time averaged
according to
the particular instrumentation used) of incorporated labeled nucleotides,
which
concentration will be dependent in part on the geometry of the liquid volume;
or, the
maximum rate of single label detection (but note that such a rate need not be
low
because detection rate will increase for multimeric labels, which may be
employed).
Such an immobilization method will favorably be employed for embodiments
locating
sample molecules on or near the surface of a CCD or SLM. Note that kinetic
control
of polymerization rate (and hence stepping rate, e.g. by adjusting nucleotide
concentration) is also enhanced by the use of such a concatenation
methodology.
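The rate ceiling described above is the slowest of three limits: the polymerase incorporation rate, the diffusion-limited nucleotide association rate divided by four, and the single-label detection rate. A minimal sketch of that comparison follows; the function name and the example figures are hypothetical, and the comment on the factor of four is an interpretation rather than a statement from the text.

```python
def max_sequencing_rate(incorporation_rate, diffusion_limited_rate, detection_rate):
    """Upper bound on bases read per second for one immobilized assemblage.

    The diffusion-limited association rate is divided by four as in the
    text; presumably this reflects that only about one encounter in four
    involves the nucleotide type required at the template position.
    """
    return min(incorporation_rate, diffusion_limited_rate / 4.0, detection_rate)


# Hypothetical figures, in events per second.
print(max_sequencing_rate(300.0, 400.0, 1000.0))
```

With these illustrative inputs the diffusion term is the bottleneck, which is the regime the text associates with low free-nucleotide concentrations.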
1.4.9.3 IMMOBILIZATION WITH NON-RANDOM DISTRIBUTION
While the above methods are convenient precisely because they require only
the simple optimization of sample molecule density, the resulting random
distribution
will less than fully utilize available substrate or matrix space and fewer
than all
sample molecules will be sufficiently well separated for unambiguous
resolution of
two adjacent sample molecules. Due to the inherent advantages provided by
molecular parallelism, this will not in general be a significant constraint.
For
applications in which a high degree of instrumentation miniaturization is
desired,
however, a better effective density of usable sample molecules, distributed in
either
two or three dimensions, may be effected as needed by non-random
immobilization
methods.
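The cost of random placement can be estimated under a standard assumption that is not stated in the text: that molecules land on a surface as a two-dimensional Poisson process. In that case the chance that a given molecule has no neighbor within a distance r is exp(-density * pi * r^2). The sketch below evaluates this expression; the function name is illustrative.

```python
import math


def fraction_resolvable(density, min_separation):
    """Fraction of molecules with no neighbor closer than min_separation,
    assuming a 2-D Poisson (uniformly random) distribution with the given
    density in molecules per unit area."""
    return math.exp(-density * math.pi * min_separation ** 2)


# Example: one molecule per (pi * r^2) of area leaves about 37% resolvable.
print(fraction_resolvable(1.0 / math.pi, 1.0))
```

The usable fraction falls off exponentially with density, which is why a regular lattice can carry more resolvable molecules per unit area than a random deposit.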
One such non-random immobilization method may avail of the invention of N.C.
Seeman, described in U.S. Patent Number 5,278,051, which provides a process
for the
construction of complex geometrical objects. These methods may be applied to
the
production of regular two- and three-dimensional molecular lattices from
polynucleotide compositions. The process of this invention may be extended by
the
incorporation of appropriate affinity groups at predetermined locations within
the
objects, which for present purposes may favorably be small ligands such as
biotin or
digoxigenin, which may then be used as the target for a sample molecule which
has
been terminally labeled by a similar small ligand which has subsequently been
bound
by (an excess of) an appropriate multimeric receptor. Said multimeric receptor
will
then recognize and bind the complementary small molecule ligand incorporated
into
the structure of said lattice, and thus effect sample molecule immobilization
according
to the non-random pattern predetermined by the precise structure of said
lattice and
the precise distribution of ligands thereupon. Note that because the objects
provided
by the invention of N. C. Seeman comprise polynucleotide structures, care must
be
taken in using such a sample substrate with the methods of the present
invention to
ensure that said objects will be stable to all treatments which are to be
applied to
sample molecules, including denaturation, exonucleolytic degradation, primer
hybridization, exposure to active polymerases, etc. Generally, these
constraints may
be met by effecting topological closure of all strands such that no free
polynucleotide
terminus is carried on such a lattice, and no denaturation procedures will
result in
matrix dissociation; the methods of the invention of N.C. Seeman may be
availed in a
manner meeting these constraints.
Note that to ensure complete regularity of lattices constructed by such means,
or any other molecular lattices which do not have complete internal rigidity,
the
extremities of these lattices may be bound to solid supports which are then
positioned
so as to apply tensile stresses to said molecular lattices which will enforce
constraints
limiting flexural internal degrees of freedom and enforcing substantial
spatial
regularity on sample molecule distributions.
Any other method which provides a regular array of binding sites to which
sample molecules may selectively be associated will also suffice for the
purpose of
non-random immobilization of sample molecules in two- or three-dimensions for
the
present invention.
Note also that said appropriate affinity groups incorporated (directly or, by
conjugation or other methods, indirectly) at appropriate sites in a lattice
may be
chosen so as to interact directly with polynucleotide sample molecules in a
sequence
dependent or independent manner. Sequence dependent affinity binding may be
effected with oligonucleotides or analogs thereof capable of forming double-,
triple-
or quadruple helices with said sample polynucleotides, ribozymes, or sequence
dependent binding proteins including but not limited to: transcriptional activators (e.g.
TATA-Binding Protein), enhancers and repressors; integrases; restriction enzymes;
replicator proteins (e.g. DnaA); DNA repair proteins; anti-polynucleotide antibodies;
RNA processing complexes (e.g. snRNPs); and RNA binding proteins, all under
conditions permitting desired selectivity, specificity or stringency but,
where
appropriate, preventing polynucleotide cleavage or degradation. Where sequence
specific binding is desired, and hierarchically prepared lattices are used,
the
distribution of particular specificities may be controlled by the staged
incorporation of
said affinity groups at various hierarchical levels of the synthetic procedure.
This will
permit classification of sequence data according to the location of the sample
template
molecule from which it is obtained in the lattice (i.e. on the surface or
within the
matrix). Sequence independent binding of polynucleotides may be effected by
the use
of proteins such as RecA, histones, U1, etc.
1.4.9.3.1 Repeatable Identification Of Unimmobilized Molecules:
Single molecule tracking with controlled diffusion: For samples under
continuous observation, e.g. continuously within the visual field of a video
microscope, molecules may be perceptibly labeled, for example by perceptible
microscopic beads or the incorporation of a first fluorescent label, and
tracked by the
use of image analysis algorithms. Said algorithms will recognize only the
appropriate
type of label and track the motions of the respective sample molecule as it
slowly
diffuses in solution, so as to permit the unambiguous direct correlation or
assignment
of the signal associated with the addition of a labeled nucleotide to said
respective
sample molecule. For these methods, nucleotide labeling does not necessitate
the use
of large beads or other complexes for detection. Instead, single or oligomeric
fluorescent labeling moieties, or enzymatic label affinity conjugation are
preferred,
such that labels may be removed without greatly disturbing the trajectory of
said
respective sample molecules. Either the direct colocalization (to within the
resolution
of the imaging method) of nucleotide label with said first fluorescent label
or
reductions in the Brownian motion of said nucleotide label sufficiently near
(e.g.
closest to) said first fluorescent label may be exploited in the detection of
nucleotide
label incorporation.
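The colocalization test described above can be sketched as a nearest-neighbor assignment: a detected nucleotide-label signal is attributed to the tracked molecule whose first fluorescent label lies closest to it, provided the two fall within the imaging resolution. The data layout (a dictionary of molecule identifiers to x, y positions) and the function name are assumptions made for illustration.

```python
import math


def assign_signal(signal_xy, tracked, resolution):
    """Return the id of the tracked molecule closest to the signal, or None
    if no tracked molecule lies within the imaging resolution."""
    best_id, best_distance = None, resolution
    for molecule_id, (x, y) in tracked.items():
        distance = math.hypot(signal_xy[0] - x, signal_xy[1] - y)
        if distance <= best_distance:
            best_id, best_distance = molecule_id, distance
    return best_id


tracked = {"a": (0.0, 0.0), "b": (5.0, 5.0)}
print(assign_signal((0.2, 0.1), tracked, 0.5))
```

In a tracking setting the positions in `tracked` would be refreshed for every video frame before each assignment is made.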
Note that manipulation with a laser trap, as for instance described by T.T.
Perkins et al. for reptation of DNA in solution, may be employed with such
free
molecules.
1.4.9.3.2 Unique Labeling Of Sample Molecules And Identification Methods
Various methods may be employed to uniquely label individual sample
molecules. The complexity of such unique labels must be greater than the
number of
sample molecules contained within a unitary sample preparation, such that any
label
is highly unlikely to occur more than once within said unitary sample
preparation.
Labels may be visually discriminatable, or may be diverse affinity labels or
combinations thereof. Labels of this type may conveniently be random
combinations
of some basis set of distinct labels, formed for example, by a random coupling
or
polymerization of such labeling moieties to a defined chemical site provided
by
chemical modification of sample molecules.
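How large the label complexity must be can be gauged with the classical birthday calculation, assuming (as an idealization) that every label in the basis set is equally likely. The sketch below computes the probability that all molecules in a preparation carry distinct labels; the function name is illustrative.

```python
import math


def prob_all_unique(num_labels, num_molecules):
    """Probability that num_molecules molecules, each assigned one of
    num_labels equally likely labels at random, all carry distinct labels."""
    if num_molecules > num_labels:
        return 0.0
    log_p = sum(math.log1p(-k / num_labels) for k in range(num_molecules))
    return math.exp(log_p)


# One million molecules drawn from 2**60 possible labels.
print(prob_all_unique(2 ** 60, 10 ** 6))
```

The calculation suggests that the number of possible labels should exceed roughly the square of the number of molecules before repeated labels become improbable, which is a considerably stronger condition than merely outnumbering the molecules.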
Visual labeling may be accomplished by the use of a sufficient number of
distinguishable fluorescent dye molecules, or other visual labels, such that
the
presence or absence of association of any one of said distinguishable
fluorescent dye
molecules may comprise the state of a bit in a binary code. Such labeling is
similar to
the combinatorial encoding described by S. Brenner and R.A. Lerner, but
differs in
that: perceptible labels may be used for encoding; labels need not be genetic
material
or linear copolymers; where only unique identifiability is required, the label
moiety
employed for encoding may be synthesized separately and possibly randomly, and
bound possibly randomly with sample molecules; the information contained by
each
labeling moiety need not depend on its precise spatial association with sample
molecules, or its location within a sequence, only its sufficient proximity;
and,
because of such modes of independence between the encoding, which serves here
only for purposes of unique labeling, difficulties which may arise for
particular
orthogonal polymerization chemistries of different copolymer types may be
avoided
by separate synthesis. Alternatively, for biopolymers and, possibly, for
specifically encoded libraries, the use of specific enzymes which may for
example
ligate polynucleotides or polypeptides, may be used to specifically control
reactions
and prevent polymerizations of one biopolymer from affecting a second, linked
biopolymer. Note that moieties different from biologically occurring
comonomers
may be used as encoding label moieties, via functionalization of appropriate
biopolymer segments with such moieties, in synthetic manners which will be obvious
to those skilled in the relevant arts, or may be used, similarly, as constituents of the
random library thus encoded. This latter case is, for example, accomplished with the
use of multiple distinct short double stranded DNA molecules with
appropriately
complementary cohesive termini which each carry some particular affinity or
photolabel type, and which may be ligated together in a manner stepped by the
addition of appropriate adaptor linkers, even in the presence of other
biopolymers
(such synthetic methods being further favorably facilitated by the use of
solid phase
synthetic methodologies). Depending on the sensitivity of the detection
methods used,
multimers of each single type of fluorescent dye moiety, or detectable
multiplications
of other photolabels, may be used to effect higher modulo coding of labels.
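The binary and higher modulo codes described above can be sketched as positional number encoding: each distinguishable dye is one digit, whose value is the number of copies detected (0 or 1 for a binary code, up to levels - 1 for a modulo code). The function names and the digit ordering are assumptions made for this example.

```python
def encode_label(dye_counts, levels=2):
    """Map per-dye copy numbers (each 0..levels-1) to one integer code.
    With levels=2 each dye contributes a single bit (absent or present)."""
    code = 0
    for count in reversed(dye_counts):
        if not 0 <= count < levels:
            raise ValueError("count out of range for this code")
        code = code * levels + count
    return code


def decode_label(code, num_dyes, levels=2):
    """Recover the per-dye copy numbers from an integer code."""
    counts = []
    for _ in range(num_dyes):
        code, digit = divmod(code, levels)
        counts.append(digit)
    return counts
```

With d dyes and m detectable levels per dye the code space holds m**d labels, so raising m through multimeric labels enlarges the space without adding new dye types.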
1.4.9.4 ENCODING BY SYNTHESIS WITH MULTIMACROMONOMERS
Note that the labeling methods of the present invention suggest a convenient
solution to the problem recognized by Brenner and Lerner, as limiting the
facility of
their encoding system, i.e. the requirement of separate distinct comonomer (or
co-
oligomer) type addition steps for each polymer type. This prevents the use of
highly
random (but step-controlled) synthetic preparation of such encoded libraries,
because
the information encoded is realized by individual preparative synthetic steps,
i.e. all of
the information content of the encoding is conferred upon these compounds by
the
intervention or agency of a chemist (or automated systems) at each step. Such
encoded libraries, of either the sequence encoded or modulo encoded types,
including
compounds comprising more than two polymer types, may be prepared with the
following stepped random method in one container (with or without the
favorable use
of solid phase synthetic methodologies). Note that the term random here refers
to the
mixture of two or more multimacromonomers in each addition step, such that
addition
to all compounds under preparation will occur in a random manner within the
reaction
mixture, in a manner weighted according to the relative concentration of each
such
multimacromonomer. Such multimacromonomers may also be used in more directly
controlled addition schemes with advantages which will be obvious to those
skilled in
the relevant arts.
Multimacromonomers comprising two or more monomer (or macromonomer)
types (e.g. comprising an amino acid monomer and a trinucleotide oligomer, or
an
amino acid monomer, a trinucleotide oligomer and a fluorescent or affinity
labeling
moiety) may be prepared by joining some or all of said two or more monomer (or
macromonomer) types by cleavable linkers such as those described in other
sections
of the present disclosure. Thus, each multimacromonomer may be added to
compounds under synthesis by addition of one of the monomer or macromonomer
types to the corresponding polymer or macropolymer types of said compounds
under
synthesis by appropriate polymer synthesis chemistry, followed by addition of
some
or all of each of the remaining monomer or macromonomer types to the
respective
corresponding polymer or macropolymer types of said compounds under synthesis
by
appropriate polymer synthesis chemistry. Control over the details of such
additions
may be effected by control over, for example, removal of distinct protecting
groups
from distinct polymer or macropolymer types of said compounds under synthesis
by
appropriate polymer synthesis chemistry. Linkers or specific linker branches
may be
cleaved at appropriate steps or after synthesis has otherwise been completed.
Thus,
correspondence between the composition of each polymer or macropolymer type
comprised within each molecule of the compound under synthesis (which final
composition may vary widely from molecule to molecule of the compound under
synthesis, but strictly observe the correspondence between composition of some
or all
of each of the polymers or macropolymers comprised within each molecule of the
compound under synthesis) is provided by the communication of the distinct
monomer or macromonomer types comprised within each multimacromonomer. The
first bond formed between a first monomer or first macromonomer of a
multimacromonomer and a molecule of the compound under synthesis will thus
ensure that other monomer or macromonomer types of the multimacromonomer
which will be added at the respective multimacromonomer addition stage will
correspond to the identity of the first monomer or first macromonomer thus
added.
Thus correspondence of some or all of each of the polymer or macropolymer
types of
final compounds is enforced (by the communication effected by, for example,
linkers)
even where the composition of some or all of the polymer or macropolymer types
is
respectively random.
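The stepped random method can be illustrated with a small simulation. Each multimacromonomer is modeled here as an amino acid linked to a trinucleotide; because the pair is added as a unit, the peptide and its polynucleotide tag remain in correspondence even though each addition is chosen at random, weighted by relative concentration. The basis set, the weights and the function name are hypothetical.

```python
import random


def synthesize(multimacromonomers, weights, steps, rng):
    """Grow one encoded compound by `steps` random additions.

    Each multimacromonomer is an (amino_acid, trinucleotide) pair, so every
    random addition extends both polymers in register.
    """
    peptide, tag = [], []
    for _ in range(steps):
        amino_acid, codon = rng.choices(multimacromonomers, weights=weights, k=1)[0]
        peptide.append(amino_acid)
        tag.append(codon)
    return "".join(peptide), "".join(tag)


# Hypothetical basis set and relative concentrations.
UNITS = [("G", "GGT"), ("A", "GCT"), ("K", "AAA")]
peptide, tag = synthesize(UNITS, [1.0, 1.0, 2.0], steps=5, rng=random.Random(0))
print(peptide, tag)
```

Reading the tag three bases at a time recovers the peptide composition, which is the correspondence the linkers are meant to enforce without step-by-step intervention.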
Preferably, such linkers (which may be multiply branched, each of such
branches possibly comprising cleavable groups susceptible to distinct cleaving
treatments) are held in communication with some or all of the two or more
distinct
monomer or macromonomer types (which are added to the compounds under
synthesis with distinct and mutually non-interfering addition or
polymerization,
deprotection and/or activation chemistries, termed "orthogonal" chemistries in
the
respective art) by attachment to the protecting groups used to effect the
stepping of
additions of each such multimacromonomer. Said diverse affinity labels may be
used
in conjunction with multiple affinity separation paths and nucleotide label
detection
that associates the detected said nucleotide label with the resolved location
of the
respective affinity labeled sample molecule, thus accomplishing the required
assignment of detection and discrimination of the appropriate nucleotide label
precisely to the correct respective sample molecule. Alternatively, said
diverse
affinity labels may be added to sample molecules so as to be independently
recognizable by appropriate receptor molecules or other affinity means, each
complementarity type of which is respectively labeled with some distinct
independently perceptible label.
Such labeling methods permit the processing of samples in fluid flow based
apparata without the loss of single molecule identifiability or assignability
of results.
Also note that manipulation with a laser trap, as for instance described by
T.T.
Perkins et al., may be employed with such uniquely labeled molecules.
Note that a case of encoding of particular interest is that of a functional
molecule coupled to an informational molecule which is sufficient to direct
the
synthesis of said functional molecule in an appropriate (e.g. biological or biologically
derived) system. Libraries of polypeptides expressed on the surface of, for
example,
bacteriophages carrying genetic material specifying said polypeptides, have
found
great use in the in vitro selection of binding specificities. Encoding which
may
additionally direct synthesis may be availed in the affinity characterization
and
molecular evolution applications of the present invention. The communication
of a
synthesis directing informational molecule (favorably DNA or RNA) with the
correspondingly synthesized one or more functional molecules (generally a
polypeptide) may be effected by the in vivo coupling or otherwise
compartmentally
enforced unique one-to-one corresponding coupling of said informational and said
functional molecule. A particularly convenient instance of such molecules
comprises the fused expression of said functional molecule or molecules as segments
of the terminal proteins of the informational molecules (i.e. DNA) of various virus
(e.g. adenovirus) or bacteriophage (e.g. PRD1 or phi29) genomes. Alternatively, said
functional molecules may be fused with some molecule which associates in a specific
manner with said terminal proteins, and which has sufficient opportunity during its in
vivo synthesis, without or preferably with concurrent viral or bacteriophage
replication, to associate with the terminal protein of the genomic material
which
determines the composition of said functional molecules, such that upon
purification
or lysis functional molecules remain in communication with the genetic
material that
determines their composition. Because biosynthesis of functional and
informational
moieties may favorably occur within the confines of a single cell, cross-
coupling of
inappropriate molecules may be readily avoided. Alternatively, the
communication
between polypeptide and polynucleotide moieties may be effected with some
intermediate snRNP or snRNP-like moiety, where such an intermediate moiety may
be targeted on the one hand by an appropriate affinity characteristic of one
or more
polypeptides to which said functional molecules are fused, and on the other
hand by a
polynucleotide sequence complementary (according to appropriate rules for
double-,
triple- or quadruple- helix formation) with the polynucleotide moiety of said
intermediate snRNP or snRNP-like moiety.
Such complexes comprising an intermediate snRNP or snRNP-like moiety
may also favorably be formed within the confines of a single cell.
1.4.9.5 CYBERNETIC MOLECULAR EVOLUTION AND ALGORITHM
MEDIATED CYBERNETIC MOLECULAR EVOLUTION OF
PHENOGENOCOUPLES
Such polynucleotide-polypeptide chimera, or other molecule types comprising
thus communicating and informationally corresponding chimera (e.g. where the
polypeptide moiety has further been subjected to post-translational
modification such
as specific glycosylation and has been associated by some method to the
respective
genetic material determining its composition, for example by the sorting of
individual
cells carrying said genetic material in the form of a DNA vector with terminal
proteins and expressing and processing said polypeptide, into distinct wells
or vessels
followed by disruption of membranes such that terminal proteins fused with
peptides
having affinity for the particular polypeptide of interest may come into
contact with
the processed polypeptide of interest, comprising a method for the molecular
evolution of multiple-biopolymer containing macromolecules), which may be
termed
phenogenocouples, may be used as sample molecules with the broad methods of
the
present invention to effect the affinity characterization (including either or
both
equilibrium and kinetic characterization of molecular recognition including
catalytic
recognition and catalysis) of functional moieties and then the
characterization and
transcription of informational moieties thus determined to be of interest.
Where
algorithms control such a process, cybernetic molecular evolution is embodied.
Selected informational molecules may be selectively replicated or transcribed
by activatable (e.g. photodeprotectable and especially 3' hydroxyl
photodeprotectable)
primers with appropriate complementarity to some region which bounds the
informational content specifying said functional molecule or molecules.
Alternatively,
immobilization of a sample to be subjected to such manipulations may be
effected so
as to comprise some photolabile linkage, which may then be subjected to
selective
photodegradation to effect specific release. For immobilized samples,
informational
molecules which carry the relevant genetic component of a phenogenocouple may
thus be released by either of these methods either singly, or as the
population of
multiple such molecules simultaneously copied or otherwise released according
to the
pattern of deprotection.
Alternatively, successive generations of molecules need only be related
informationally, by analysis of composition of one generation, by, for
example, the
massively parallel characterization methods of the present invention, followed
by de
novo synthesis of molecules carrying the desired complexity and diversity of
the
succeeding generation. This is a particular distinguishing feature of
cybernetic
molecular evolution; selection, amplification and mutation may be directed
strictly by
algorithms which manipulate data gathered about one generation to determine
the
composition of a succeeding generation.
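A purely informational generation step of this kind can be sketched as follows. Measured data about one generation (represented here by a hypothetical score per sequence) are used to choose parents, and the next generation is specified as a list of sequences for de novo synthesis. The function name, the scoring dictionary and the mutation scheme are all assumptions made for illustration.

```python
import random


def next_generation(population, scores, keep, mutation_rate, alphabet, rng):
    """Select the `keep` best-scoring sequences, then specify a new
    population of the same size by copying and mutating them. Only
    information, not material, passes between generations."""
    ranked = sorted(population, key=lambda seq: scores[seq], reverse=True)
    parents = ranked[:keep]
    children = []
    while len(children) < len(population):
        parent = parents[len(children) % keep]
        child = "".join(
            rng.choice(alphabet) if rng.random() < mutation_rate else base
            for base in parent
        )
        children.append(child)
    return children
```

Repeating this step, with the scores replaced each round by fresh measurements, gives an algorithm-directed loop of selection, amplification and mutation.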
Released molecules may then be recovered for subsequent amplification,
mutation and subsequent rounds of selection by similar or other methods, as
will be
obvious to those skilled in the art of in vitro molecular evolution.
Note that post-translationally modified polypeptide moieties or other
phenogenocouples may also be selected and otherwise subjected to in vitro
evolution
by conventional means as well as by the massively parallel examination and
modification methods of the present invention.
Because of the correspondence between the diversity generation and selection
aspects of molecular evolution, and immunological recognition and memory, all
of
these methods may be directly applied to cybernetic immune system applications
of
the present invention.
Labeled reagents and signal amplification and elimination techniques:
CA 02413022 2002-12-13
WO 01/96551 PCT/US01/19367
The categories enumerated below are included for description and not
limitation; other appropriate labeling methods will be obvious to those
skilled in the
arts of biotechnology, cell biology and cytology, microscopy, organic
chemistry,
biochemistry or recombinant DNA techniques.
Each category will comprehend a variety of specific variations, as will be
obvious to those skilled in the relevant arts. Various labeling methods will
generally
correspond best to various detection methods.
1.4.10 DETECTION METHODS FOR THE PRESENT INVENTION
Non-radioactive labeling techniques have been explored and, in recent years,
integrated into partly automated DNA sequencing procedures. These improvements
utilize the Sanger sequencing strategy. The label (e.g. fluorescent dye) can be
tagged to the primer (Smith et al., Nature 321, 674-679 (1986) and EPO Patent No.
87300998.9; Du Pont De Nemours EPO Application No. 0359225; Ansorge et al., J.
Biochem. Biophys. Methods 13, 325-32 (1986)) or to the chain-terminating
dideoxynucleoside triphosphates (Prober et al., Science 238, 336-41 (1987);
Applied Biosystems, PCT Application WO 91/05060). Based on either labeling the
primer or the ddNTP, systems have been developed by Applied Biosystems (Smith et
al., Science 235, G89 (1987); U.S. Patent Nos. 570973 and 689013), Du Pont De
Nemours (Prober et al., Science 238, 336-341 (1987); U.S. Patents Nos. 881372
and 57566), Pharmacia-LKB (Ansorge et al., Nucleic Acids Res. 15, 4593-4602
(1987) and EMBL Patent Application DE P3724442 and P3805808.1) and Hitachi
(JP 1-90844 and DE 4011991 A1). A somewhat similar approach was developed by
Brumbaugh et al. (Proc. Natl. Acad. Sci. USA 85, 5610-14 (1988) and U.S. Patent
No. 4,729,947). An improved method for the Du Pont system using two
electrophoretic lanes with two different specific labels per lane is described
(PCT Application WO92/02635). A different approach uses fluorescently labeled
avidin and biotin labeled primers. Here, the sequencing ladders ending with
biotin are reacted during electrophoresis with the labeled avidin, which results
in the detection of the individual sequencing bands (Brumbaugh et al., U.S.
Patent No. 594676).
More recently even more sensitive non-radioactive labeling techniques for
DNA using chemiluminescence triggerable and amplifiable by enzymes have been
developed (Beck, O'Keefe, Coull and Koster, Nucleic Acids Res. 17, 5115-5123
(1989) and Beck and Koster, Anal. Chem. 62, 2258-2270 (1990)). These labeling
methods were combined with multiplex DNA sequencing (Church et al., Science 240,
185-188 (1988)) and direct blotting electrophoresis (DBE) (Beck and Pohl, EMBO J.
Vol. 3: p 2905-2909 (1984)) to provide for a strategy aimed at high throughput
DNA sequencing (Koster et al., Nucleic Acids Res. Symposium Ser. No. 24, 318-321
(1991), University of Utah, PCT Application No. WO 90/15883). However, this
strategy still suffers from the disadvantage of being very laborious and
difficult to automate.
Multiple distinctly labeled primers can be used to discriminate sequencing
patterns. For example, four differently labeled sequencing primers, each specific
for one of the single termination reactions, can be used, e.g. with fluorescent
dyes and online detection using laser excitation in an automated sequencing
device. The use of eight differently labeled primers allows the discrimination of
the sequencing pattern from both strands. Instead of labeled primers, labeled
ddNTPs may be used for detection, if separation of the sequencing fragments
derived from both strands is provided. With one biotin labeled primer, sequencing
fragments from one strand can be isolated, for example via biotin-streptavidin
coated magnetic beads. Isolation is also possible via immunoaffinity
chromatography in the case of a digoxigenin labeled primer, or via affinity
chromatography in the case of complementary oligonucleotides bound to a solid
support.
1.4.10.1 Fluorescent labels
In automated sequencing, fluorescence labeled DNA fragments are detected
during migration through the sequencing gel by laser excitation. The fluorescence
label is incorporated during the sequencing reaction via labeled primers or chain
extending nucleotides (Smith, L. et al., Fluorescence detection in automated DNA
sequence analysis, Nature 321:674-679 (1986)), (Knight, P., Automated DNA
sequencers, Biotechnology 6:1095-96 (1988)).
Detection methods for the present invention may favorably exploit fluorescent
labeling techniques.
Genome sequencing applications of the present invention may thus avail of
established fluorescent modification and detection methods. Other applications
of the
present invention may also benefit from the application of fluorescence
modification
and detection methods.
Much effort has already been invested in the development of fluorescently
labeled nucleotide triphosphate compounds and analogs thereof. Many such
compounds are acceptable substrates for polynucleotide polymerase molecules.
These
compounds have therefore proven suitable for use in various electrophoresis
based
DNA sequencing methodologies utilizing fluorescence detection, as well as in
other
applications such as chromatin mapping. There are therefore various compounds
comprising a fluorescent dye moiety and a nucleotide triphosphate moiety
commercially available.
Fluorescent labels find use in a variety of different biological, chemical,
medical and biotechnological applications. One example of where such labels find
use is in polynucleotide sequencing, particularly in automated DNA sequencing,
which is becoming of critical importance to large scale DNA sequencing projects,
such as the Human Genome Project.
In methods of automated DNA sequencing, differently sized fluorescently
labeled DNA fragments which terminate at each base in the sequence are
enzymatically produced using the DNA to be sequenced as a template. Each group of
fragments corresponding to termination at one of the four labeled bases is labeled
with the same label. Thus, those fragments terminating in A are labeled with a
first
label, while those terminating in G, C and T are labeled with second, third
and fourth
labels respectively. The labeled fragments are then separated by size in an
electrophoretic medium and an electropherogram is generated, from which the
DNA
sequence is determined.
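The readout just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the label names, the label-to-base mapping and the fragment data are hypothetical and do not come from this disclosure.

```python
# Hypothetical mapping from the four labels to the base at which the
# labeled fragments terminate (first label = A, then G, C and T).
LABEL_TO_BASE = {"label1": "A", "label2": "G", "label3": "C", "label4": "T"}


def read_sequence(fragments):
    """Return the sequence implied by size-ordered, labeled fragments.

    `fragments` is an iterable of (length_in_nucleotides, label) pairs,
    one pair per terminated fragment size. Sorting by length stands in
    for the size separation in the electrophoretic medium.
    """
    ordered = sorted(fragments, key=lambda fragment: fragment[0])
    return "".join(LABEL_TO_BASE[label] for _, label in ordered)


if __name__ == "__main__":
    demo = [(3, "label3"), (1, "label1"), (4, "label4"), (2, "label2")]
    print(read_sequence(demo))
```

Each position in the returned string corresponds to one fragment length, so a missing length would show up as a gap in the input rather than in the output; a fuller treatment would check for that.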
As methods of automated DNA sequencing have become more advanced, of
increasing interest is the use of sets of fluorescent labels in which all of
the labels are
excited at a common wavelength and yet emit one of four different detectable
signals,
one for each of the four different bases. Such labels provide for a number of
advantages, including high fluorescence signals and the ability to
electrophoretically
separate all of the labeled fragments in a single lane of an electrophoretic
medium
which avoids problems associated with lane to lane mobility variation.
Although such sets of labels have been developed for use in automated DNA
sequencing applications, heretofore the differently labeled members of such
sets have
each emitted at a different wavelength. Thus, conventional automated detection
devices currently employed in methods in which all of the enzymatically
produced
fragments or primer extension products are separated in the same lane must be
able to
detect emitted fluorescent light at four different wavelengths. This
requirement can
prove to be an undesirable limitation. More specifically, carrying out
sequencing on
vast numbers of different DNA templates simultaneously increases the number of
different fragments and corresponding labels required. At the same time, there
is a
need for a reduction in the complexity of the detection device, e.g. a device
which can
operate with light detection at only two wavelengths is preferable.
Sets of fluorescent labels, particularly sets of fluorescently labeled primers,
and methods for their use in multi-component analysis applications, particularly
nucleic acid enzymatic sequencing applications, are provided. At least two of the
label members of the set are energy transfer labels having a common donor and
acceptor fluorophore separated by sufficiently different distances so that the
two labels provide distinguishable fluorescent signals upon excitation at a
common wavelength. In further describing the subject invention, the subject sets
will first be described in greater detail followed by a discussion of methods for
their use in multi-component analysis applications.
Before the subject invention is further described, it is to be understood that
the
invention is not limited to the particular embodiments of the invention
described
below, as variations of the particular embodiments may be made and still fall
within
the scope of the appended claims. It is also to be understood that the
terminology
employed is for the purpose of describing particular embodiments, and is not
intended
to be limiting. Instead, the scope of the present invention will be
established by the
appended claims.
It must be noted that as used in this specification and the appended claims,
the
singular forms "a," "an" and "the" include plural reference unless the context
clearly
dictates otherwise. Unless defined otherwise, all technical and scientific terms
used herein have the same meaning as commonly understood by one of ordinary skill
in the art to which this invention belongs.
The subject sets of fluorescent labels comprise a plurality of different types
of
labels, wherein each type of label in a given set is capable of producing a
distinguishable fluorescent signal from that of the other types of labels in
different
sets. Labels in the different sets generate different signals, preferably,
though not
necessarily upon excitation at a common excitation wavelength. For DNA
sequencing
applications, the subject sets will comprise at least 2 different types of
labels, and may
comprise 8 or more different types of labels, where for many applications the
number
of different types of labels in the set will not exceed 6, and will usually
not exceed
four, where at least two of the different types of labels are energy transfer
labels
sharing a common donor and acceptor fluorescer, as described in greater detail
below.
For other applications, such as fluorescence in situ hybridization (FISH),
substantially
more than 8 labels are ideal so that multiple targets can be analyzed.
The distinguishable signals generated by the "at least two energy transfer
labels" will at least comprise the intensity of emitted light at one to two
wavelengths.
Preferably, the distinguishable signals produced by the "at least two energy
transfer
labels" will comprise distinguishable fluorescence emission patterns, which
patterns
are generated by plotting the intensity of emitted light from differently
sized
fragments at two wavelengths with respect to time as differently labeled
fragments
move relative to a detector, which patterns are known in the art as
electropherograms.
For analyses not based on electrophoresis, such as micro-array chip based
assays,
different targets tagged with a specific label can be differentiated from each
other by
the unique fluorescence patterns. For example, in one type of label of a set
the
intensity of emitted light at a first wavelength may be twice that of the
intensity of
emitted light at a second wavelength and in the second label the magnitude of
the
intensities of light emitted at the two wavelengths may be reversed, or light
may be
emitted at only one intensity. The different patterns are generated by varying
the
distance between the donor and acceptor. These patterns emitted from each of
these
labels are thus distinguishable.
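As an illustration of how such two-wavelength patterns could be told apart, the following Python sketch classifies a measurement by the share of total emission seen at the first wavelength. The three reference patterns mirror the example above (2:1, 1:2, and emission at one wavelength only); their names and the nearest-pattern rule are assumptions made for this sketch, not part of the disclosure.

```python
# Hypothetical reference patterns, expressed as the fraction of the total
# emitted intensity that is observed at the first wavelength.
REFERENCE_FRACTIONS = {
    "first_label": 2.0 / 3.0,   # intensity at wavelength 1 twice that at wavelength 2
    "second_label": 1.0 / 3.0,  # magnitudes reversed
    "single_wavelength": 1.0,   # emission at the first wavelength only
}


def classify(intensity_1, intensity_2):
    """Return the name of the reference pattern closest to the measurement."""
    total = intensity_1 + intensity_2
    if total <= 0:
        raise ValueError("no fluorescence signal to classify")
    fraction = intensity_1 / total
    return min(
        REFERENCE_FRACTIONS,
        key=lambda name: abs(REFERENCE_FRACTIONS[name] - fraction),
    )


if __name__ == "__main__":
    for pair in [(200, 100), (90, 210), (500, 10)]:
        print(pair, classify(*pair))
```

Working with the intensity fraction rather than the raw intensities makes the classification independent of how much labeled material is present, which is the property that lets one donor and acceptor pair serve several labels.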
The subject sets will comprise a plurality of different types of fluorescent
labels, where at least two of the labels and usually all of the labels are
energy transfer
labels which comprise at least one acceptor fluorophore and at least one donor
fluorophore in energy transfer relationship, where such labels may have more
complex configurations, such as multiple donors and/or multiple acceptors,
e.g. donor 1, acceptor 1 and acceptor 2. Critical to the subject sets is that at
least two of the labels
of the sets have common donor and acceptor fluorophores, where the only
difference
between the labels is the distance between these common acceptor and donor
fluorophores. Thus, for sets of labels in which each label comprises a single
donor and
a single acceptor, at least one of the energy transfer labels will have a
donor
fluorophore and acceptor fluorophore in energy transfer relationship separated
by a
distance x and at least one of the energy transfer labels will comprise the
same donor
and acceptor fluorophores in energy transfer relationship separated by a
different
distance y, where the distances x and y are sufficiently different to provide
for
distinguishable fluorescence emission patterns upon excitation at a common
wavelength, as described above.
In those sets comprising a third label having the same donor and acceptor
fluorophores as the first and second label, the distance z between the donor
and
acceptor fluorophore will be sufficiently different from x and y to ensure
that the third
label is capable of providing a distinguishable fluorescence emission pattern
from the
first and second labels. Thus, in a particular set of labels, one may have a
plurality of
labels having the same donor and acceptor fluorophores, where the only
difference
among the labels is the distance between the donor and acceptor fluorophores.
To
ensure that different types of labels of a set having common donor and
acceptor
fluorophores yield distinguishable fluorescence emission patterns, the
distances
between the donor and acceptor fluorophores will differ by at least about 5%,
usually by at least about 10% and more usually by at least about 20%, and will
generally range from about 4 to 200 Å, usually from about 12 to 100 Å and more
usually from about 15 to 80 Å, where the minimums in such distances are determined
based on currently available detection devices and may be reduced as detection
technology becomes more sensitive, so that more distinct labels can be generated.
In one preferred embodiment, at least a portion of, up to and including all of,
the labels of the subject sets will comprise a donor and acceptor fluorescer
component in energy transfer relationship and covalently bonded to a spacer
component, i.e. energy transfer labels. Thus, one could have a set of a plurality
of labels in which only two of the labels comprise the above mentioned donor and
acceptor fluorescer components and the remainder of the labels comprise a single
fluorescer component. Preferably, however, all of the labels will comprise a
donor and acceptor fluorescer component. Generally, for one donor and one
acceptor ET systems, if a set comprises n types of energy transfer labels, the
number of different types of acceptor fluorophores present in the energy transfer
labels of the set will not exceed n-1. Thus, if the number of different types of
energy transfer labels in the set is four, the number of different acceptor
fluorophores in the set will not exceed 3, and will usually not exceed 2.
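The n-1 bound on acceptor types follows from the requirement that at least two labels share the same acceptor. A small Python sketch of the check is given below; the tuple layout and the fluorophore names in the demonstration are hypothetical placeholders.

```python
def acceptor_types_within_bound(labels):
    """Check that a set of n energy transfer labels uses at most n-1 acceptors.

    `labels` is a list of (donor, acceptor, distance) tuples, one per label
    type, for labels with a single donor and a single acceptor.
    """
    distinct_acceptors = {acceptor for _, acceptor, _ in labels}
    return len(distinct_acceptors) <= len(labels) - 1


if __name__ == "__main__":
    # Placeholder names only; three labels share one acceptor.
    demo = [
        ("donor", "acceptor_1", 20.0),
        ("donor", "acceptor_1", 40.0),
        ("donor", "acceptor_1", 60.0),
        ("donor", "acceptor_2", 60.0),
    ]
    print(acceptor_types_within_bound(demo))
```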
In other preferred embodiments, additional combinations of labels are
possible. Thus, in a set of labels, two of the labels could be energy transfer
labels
sharing common donor and acceptor fluorophores separated by different
distances and
the remaining labels could be additional energy transfer labels with different
donor
and/or acceptor fluorophores, non-energy transfer fluorescent labels, and the
like.
In the energy transfer labels of the subject sets, the spacer component to
which
the fluorescer components are covalently bound will typically be a polymeric
chain or
other chemical moiety capable of acting as a spacer for the donor and acceptor
fluorophore components, such as a rigid chemical moiety, such as chemicals
with
cyclic ring or chain structures which can separate the donor and acceptor and
which
also can be incorporated with an active group for attaching to the targets to
be
analyzed, where the spacer component will generally be a polymeric chain,
where the
fluorescer components are covalently bonded through linking groups to
monomeric
units of the chain, where these monomeric units of the chain are separated by
a
plurality of monomeric units sufficient so that energy transfer can occur from
the
donor to acceptor fluorescer components. The polymeric chains will generally
be
either polynucleotides, analogues or mimetics thereof, or peptides, peptide
analogues or mimetics thereof, e.g. peptoids. For polynucleotides, polynucleotide
analogues or mimetics thereof, the polymeric chain will generally comprise sugar
moieties which may or may not be covalently bonded to a heterocyclic nitrogenous
base, e.g. adenine, guanine, cytosine, thymine, uracil etc., and are linked by a
linking group. The sugar moieties will generally be five membered rings, e.g.
ribose, or six membered rings, e.g. hexose, with five membered rings such as
ribose being preferred. A number of different sugar linking groups may be
employed, where illustrative linking groups include phosphodiester,
phosphorothioate, methylene(methyl imino) (MMI), methylphosphonate,
phosphoramidate, guanidine, and the like. See Matteucci & Wagner, Nature (1996)
Supp 84: 20-22. Peptides, peptide analogues and mimetics thereof suitable for use
as the polymeric spacer include peptoids as described in WO 91/19735, the
disclosure of which is herein incorporated by reference, where the individual
monomeric units which are joined through amide bonds may or may not be bonded to
a heterocyclic nitrogenous base, e.g. peptide nucleic acids. See Matteucci &
Wagner, supra. Generally, the polymeric spacer components of the subject labels
will be peptide nucleic acid, polysugarphosphate as found in energy transfer
cassettes as described in PCT/US96/13134, the disclosure of which is herein
incorporated by reference, and polynucleotides as described in PCT/US95/01205,
the disclosure of which is herein incorporated by reference.
Both the donor and acceptor fluorescer components of the subject labels will
be covalently bonded to the spacer component, e.g. the polymeric spacer chain,
through a linking group. The linking group can be varied widely and is not
critical to
this invention. The linking groups may be aliphatic, alicyclic, aromatic or
heterocyclic, or combinations thereof. Functionalities or heteroatoms which
may be
present in the linking group include oxygen, nitrogen, sulfur, or the like,
where the
heteroatom functionality which may be present is oxy, oxo, thio, thiono,
amino,
amido and the like. Any of a variety of linking groups may be employed which do
not interfere with the energy transfer and gel electrophoresis, which may include
purines or pyrimidines, particularly uridine, thymidine, cytosine, where
substitution will be at an annular member, particularly carbon, or a side chain,
e.g. methyl in thymidine. The donor and/or acceptor fluorescer component may be
bonded directly to a base or through a linking group of from 1 to 6, more usually
from 1 to 3 atoms, particularly carbon atoms. The linking group may be saturated
or unsaturated, usually having not more than about one site of aliphatic
unsaturation.
Though not absolutely necessary, generally for DNA sequencing applications
at least one of the donor and acceptor fluorescer components will be linked to
a
terminus of the polymeric spacer chain, where usually the donor fluorescer
component will be bonded to the terminus of the chain, and the acceptor
fluorescer
component bonded to a monomeric unit internal to the chain. For labels
comprising
polynucleotides, analogues or mimetics thereof as the polymeric chain, the
donor
fluorescer component will generally be at the 5' terminus of the polymeric
chain and
the acceptor fluorescer component will be bonded to the polymeric chain at a
position 3' to the 5' terminus of the chain. For other applications, such as
FISH, a
variety of labeling approaches are possible.
The donor fluorescer components will generally be compounds which absorb
in the range of about 300 to 900 nm, usually in the range of about 350 to 800 nm,
and are capable of transferring energy to the acceptor fluorescer component. The
donor component will have a strong molar absorbance coefficient at the desired
excitation wavelength, desirably greater than about 10^4, preferably greater than
about 10^5 cm^-1 M^-1. The molecular weight of the donor component will usually
be less than about 2.0 kD, more usually less than about 1.5 kD. A variety of
compounds may be employed as donor fluorescer components, including fluorescein,
phycoerythrin, BODIPY, DAPI, Indo-1, coumarin, dansyl, cyanine dyes, and the
like. Specific donor compounds of interest include fluorescein, rhodamine,
cyanine dyes and the like.
Although the donor and acceptor fluorescer component may be the same, e.g.
both may be FAM, where they are different the acceptor fluorescer moiety will
generally absorb light at a wavelength which is usually at least 10 nm higher,
more usually at least 20 nm or higher, than the maximum absorbance wavelength of
the donor, and will have a fluorescence emission maximum at a wavelength ranging
from about 400 to 900 nm. As with the donor component, the acceptor fluorescer
component will have a molecular weight of less than about 2.0 kD, usually less
than about 1.5 kD. Acceptor fluorescer moieties may be rhodamines, fluorescein
derivatives, BODIPY and cyanine dyes and the like. Specific acceptor fluorescer
moieties include FAM, JOE, TAM, ROX, BODIPY and cyanine dyes.
The distance between the donor and acceptor fluorescer components will be
chosen to provide for energy transfer from the donor to acceptor fluorescer,
where the efficiency of energy transfer will be from 20 to 100%. Depending on the
donor and acceptor fluorescer components, the distance between the two will
generally range from 4 to 200 Å, usually from 12 to 100 Å and more usually from
15 to 80 Å, as described above.
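The link between donor-acceptor distance and transfer efficiency can be made concrete with the standard Förster relation, E = 1 / (1 + (r/R0)^6). That relation is not stated in this specification; it is used here as an assumed model, and the Förster radius R0 of 50 Å in the demonstration is a hypothetical value chosen only for illustration.

```python
def transfer_efficiency(distance, forster_radius):
    """Förster energy transfer efficiency for a donor-acceptor distance.

    Both arguments are in the same length unit (e.g. angstroms). The
    Förster relation is an assumed model, not taken from the specification.
    """
    if distance < 0 or forster_radius <= 0:
        raise ValueError("distance must be >= 0 and forster_radius > 0")
    return 1.0 / (1.0 + (distance / forster_radius) ** 6)


def max_distance_for_efficiency(min_efficiency, forster_radius):
    """Largest distance that still gives at least `min_efficiency`."""
    if not 0.0 < min_efficiency <= 1.0:
        raise ValueError("min_efficiency must be in (0, 1]")
    return forster_radius * (1.0 / min_efficiency - 1.0) ** (1.0 / 6.0)


if __name__ == "__main__":
    r0 = 50.0  # hypothetical Förster radius in angstroms
    for r in (15.0, 50.0, 80.0):
        print(r, round(transfer_efficiency(r, r0), 3))
    print(round(max_distance_for_efficiency(0.20, r0), 1))
```

Under this model the efficiency falls steeply around R0, so modest changes in spacer length near R0 give clearly different emission patterns, while the 20% lower bound on efficiency caps the usable spacing for a given dye pair.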
For the most part the labels of the subject sets will be described by the
following formula:
    D-N-X
      |
      A
wherein: D is the donor fluorescer component, which may consist of more than two
different donors separated by a spacer;
N is the spacer component, which may be a polymeric chain or rigid chemical
moiety, where when N is a polymeric spacer that comprises nucleotides, analogues
or mimetics thereof, the number of monomeric units in N will generally range from
about 1 to 50, usually from about 4 to 20 and more usually from about 4 to 16;
A is the acceptor fluorescer component, which may consist of more than two
different acceptors separated by a spacer; and
X is optional and is generally present when the labels are incorporated into
oligonucleotide primers, where X is a functionality, e.g. an activated phosphate
group, for linking to a mono- or
polynucleotide, analogue or mimetic thereof, particularly a
deoxyribonucleotide,
generally of from 1 to 50, more usually from 1 to 25 nucleotides.
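For readers who prefer a data-structure view, the formula can be modelled as a small record type. The Python sketch below is illustrative only: the class and field names are hypothetical, and the validation encodes just the general range of about 1 to 50 monomeric units stated for N.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnergyTransferLabel:
    """One label of the general form D-N-X with acceptor A attached to N."""

    donor: str                    # D, the donor fluorescer component
    acceptor: str                 # A, the acceptor fluorescer component
    spacer_units: int             # number of monomeric units in N
    linker: Optional[str] = None  # X, optional linking functionality

    def __post_init__(self):
        if not 1 <= self.spacer_units <= 50:
            raise ValueError("spacer length outside the general range of 1 to 50 units")


def differ_only_in_spacing(first, second):
    """True if two labels share donor and acceptor but differ in spacer length."""
    return (
        first.donor == second.donor
        and first.acceptor == second.acceptor
        and first.spacer_units != second.spacer_units
    )


if __name__ == "__main__":
    a = EnergyTransferLabel("FAM", "ROX", 6)
    b = EnergyTransferLabel("FAM", "ROX", 13)
    print(differ_only_in_spacing(a, b))
```

The helper `differ_only_in_spacing` expresses the defining property of the subject sets: at least two labels with common donor and acceptor fluorophores that differ only in the distance between them.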
For sets to be employed in nucleic acid enzymatic sequencing in which the
labels are to be employed as primers, the labels of the subject sets will
comprise either the donor and acceptor fluorescer components attached directly to
a hybridizing polymeric backbone, e.g. a polynucleotide, peptide nucleic acid and
the like, or the donor and acceptor fluorescer components will be present in an
energy transfer cassette attached to a hybridizable component, where the energy
transfer cassette comprises the fluorescer components attached to a
non-hybridizing polymeric backbone, e.g. a universal spacer. See PCT/US96/13134
and Ju et al., Nat. Med. (1996) supra, the disclosures of which are herein
incorporated by reference. The hybridizable component will typically comprise
from about 8 to 40, more usually from about 8 to 25 nucleotides, where the
hybridizable component will generally be complementary to various commercially
available vector sequences such that during use, synthesis proceeds from the
vector into the cloned sequence. The vectors may include single-stranded
filamentous bacteriophage vectors, the bacteriophage lambda vector, pUC vectors,
pGEM vectors, or the like. Conveniently, the primer may be derived from a
universal primer, such as pUC/M13, λgt10, λgt11, and the like (see Sambrook et
al., Molecular Cloning: A Laboratory Manual, 2nd ed., CSHL, 1989, Section 13),
where the universal primer will have been modified as described above, e.g. by
either directly attaching the donor and acceptor fluorescer components to bases
of the primer or by attaching an energy transfer cassette comprising the
fluorescer components to the primer.
Sets of preferred energy transfer labels comprising donor and acceptor
fluorescers covalently attached to a polynucleotide backbone in the above D-N-A
format include: (1) F6R, F13R, F16R and F16F; where different formats can be
employed as long as the four primers display distinct fluorescence emission
patterns.
The fluorescent labels of the subject sets can be readily synthesized
according
to known methods, where the subject labels will generally be synthesized by
oligomerizing monomeric units of the polymeric chain of the label, where
certain of
the monomeric units will be covalently attached to a fluorescer component.
The subject sets of fluorescent labels find use in applications where at least
two components of a sample or mixture of components are to be distinguishably
detected. In such applications, the set will be combined with the sample
comprising the components to be detected under conditions in which at least two
of the
components of the sample if present at all will be labeled with first and
second labels
of the set, where the first and second labels of the set comprise the same
donor and
acceptor fluorescer components which are separated by different distances.
Thus, a
first component of the sample is labeled with a first label of the set
comprising donor
and acceptor fluorescer components separated by a first distance X. A second
component of the sample is labeled with a second label comprising the same donor
and acceptor fluorescer components separated by a second distance Y, where X and
Y are
as
described above. The labeled first and second components, which may or may not
have been separated from the remaining components of the sample, are then
irradiated
by light at a wavelength capable of being absorbed by the donor fluorescer
components, generally at a wavelength which is maximally absorbed by the donor
fluorescer components. Irradiation of the labeled components results in the
generation
of distinguishable fluorescence emission patterns from the labeled components,
a first
fluorescence emission pattern generated by the first label and second pattern
being
attributable to the second label. The distinguishable fluorescence emission
patterns
are then detected. Applications in which the subject labels find use include a
variety
of multicomponent analysis applications in which fluorescent labels are
employed,
including FISH, micro-array chip based assays where the labels may be used as
probes which specifically bind to target components, DNA sequencing where the
labels may be present as primers, and the like.
The subject sets of labels find particular use in polynucleotide enzymatic
sequencing applications, where four different sets of differently sized
polynucleotide
fragments terminating at a different base are generated (with the members of
each set
terminating at the same base) and one wishes to distinguish the sets of
fragments from
each other. In such applications, the sets will generally comprise four
different labels
which are capable of acting as primers for enzymatic extension, where at least
two of
the labels will be energy transfer labels comprising differently spaced common
donor
and acceptor fluorescer components that are capable of generating
distinguishable
fluorescence emission patterns upon excitation at a common wavelength of
light.
Using methods known in the art, a first set of primer extension products all
ending in
A will be generated by using a first of the labels of the set as a primer.
Second, third
and fourth sets of primer extension products terminating in G, C and T will also
be
enzymatically produced. The four different sets of primer extension products
will then
be combined and size separated, usually in an electrophoretic medium. The
separated
fragments will then be moved relative to a detector (where usually either the
fragments or the detector will be stationary). The intensity of emitted light
from each
labeled fragment as it passes relative to the detector will be plotted as a
function of
time, i.e. an electropherogram will be produced. Since the labels of the
subject sets
will generally emit light in only two wavelengths, the plotted
electropherogram will
comprise light emitted in two wavelengths. Each peak in the electropherogram
will
correspond to a particular type of primer extension product (i.e. A, G, C or
T), where
each peak will comprise one of four different fluorescence emission patterns.
To
determine the DNA sequence, the electropherogram will be read, with each
different
fluorescence emission pattern related to one of the four different bases in
the DNA
chain.
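Reading such a two-wavelength electropherogram can be sketched as follows. In this Python illustration, each peak is reduced to its detection time and its intensities at the two wavelengths, and each base is assigned a reference emission pattern. The four reference values and the nearest-pattern rule are hypothetical assumptions; the specification does not give numerical patterns.

```python
# Hypothetical reference patterns: fraction of the total emission observed at
# the first of the two detection wavelengths, one pattern per terminating base.
BASE_PATTERNS = {"A": 0.8, "G": 0.6, "C": 0.4, "T": 0.2}


def call_bases(peaks):
    """Read a sequence from (time, intensity_1, intensity_2) peaks.

    Peaks are ordered by detection time, since shorter fragments pass the
    detector first, and each peak is assigned the base whose reference
    pattern is closest to the measured intensity fraction.
    """
    calls = []
    for _, i1, i2 in sorted(peaks, key=lambda peak: peak[0]):
        total = i1 + i2
        if total <= 0:
            raise ValueError("peak without fluorescence signal")
        fraction = i1 / total
        calls.append(
            min(BASE_PATTERNS, key=lambda base: abs(BASE_PATTERNS[base] - fraction))
        )
    return "".join(calls)


if __name__ == "__main__":
    demo = [(2.0, 60, 40), (1.0, 80, 20), (3.0, 20, 80), (4.0, 40, 60)]
    print(call_bases(demo))
```

A practical reader would also need peak detection, mobility correction and a confidence measure for ambiguous fractions; those steps are outside the scope of this sketch.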
Where desired, two sets of labels according to the subject invention may be
employed, where the distinguishable fluorescence emission patterns produced by
the
labels in the first set will comprise emissions at a first and second
wavelength and the
patterns produced by the second set of labels will comprise emissions at a
third and
fourth wavelength. By using two such sets in conjunction with one another, one
could
detect primer extension products produced from two different template DNA
strands
at essentially the same time in a conventional four color detector, thereby
doubling the
throughput of the detector.
The subject sets of labels may be sold in kits, where the kits may or may not
comprise additional reagents or components necessary for the particular
application in
which the label set is to be employed. Thus, for sequencing applications, the
subject
sets may be sold in a kit which further comprises one or more of the
additional
requisite sequencing reagents, such as polymerase, nucleotides,
dideoxynucleotides
and the like.
The following examples are offered by way of illustration and not by way of
limitation. The following examples are put forth so as to provide those of
ordinary
skill in the art with a complete disclosure and description of how to make and
use the
subject sets of fluorescent labels.
1.4.10.2 Affinity labels
Other single molecule detection methods have availed of compounds having
well studied affinity interactions with other molecules, such as receptor-
ligand
interactions.
Genome sequencing applications of the present invention may thus avail of
established affinity labeling and detection methods. Other applications of the
present
invention may also benefit from the application of affinity labeling and
detection
methods.
Various compounds comprising a nucleotide triphosphate moiety and a small
molecule affinity moiety are commercially available and suitable as substrates
for
DNA polymerases. Said compounds have been used, in conjunction with DNA
polymerases, to effect the affinity labeling of various polynucleotide
molecules, and
thus labeled polynucleotides are routinely subjected to manipulations
comprising the
formation of an affinity association with an appropriate receptor molecule.
Two
common examples are the use of biotin as said affinity moiety and streptavidin
as said
receptor molecule, and digoxigenin as said affinity moiety and anti-
digoxigenin
antibodies or fragments thereof as the respective said receptor molecule. It
will be
obvious to those skilled in the relevant arts that there are numerous other
possible
ligand-receptor interactions which may be exploited for affinity labeling
purposes as
well as immobilization purposes of the present invention, and that multiple
distinct
affinity interactions may be employed simultaneously.
For detection purposes, said affinity labels may be used to bind a microscopic
colloid or bead which has been modified with an appropriate complementary
affinity
group such as a receptor.
1.4.10.1 Affinity Label Detection With Microscopic Beads
In recent years a number of different methods and materials have been
developed to permit the affinity binding of beads to molecules. Such binding
is
commonly accomplished by coating said beads with receptor molecules, such as
streptavidin or Protein A (also known as Staph A, to which immunoglobulin G
antibodies may subsequently be bound). Bead types include polymeric spheres of
micron or submicron dimensions, metallic colloids such as colloidal gold,
silica beads
and magnetic beads. As will be obvious to those skilled in the art of polymer
chemistry, polymer beads including dendrimers may incorporate dyes or liquid
crystal
molecules as side chains or within polymeric backbones, and these may
facilitate
optical detection methods. Attachment of appropriate receptor or affinity
molecules to
the surfaces of such beads yields a reagent suitable for the detection of an
affinity
labeled molecule. One such detection scheme was utilized by Finzi and Gelles,
albeit
for different purposes.
1.4.10.3 Multimeric labels:
Where sensitivity to a single labeling moiety is insufficient, labeled
reagents
may comprise multiple occurrences of said labeling moiety in a manner that
does not
interfere with the corresponding molecular recognition and monomer addition
processes, to increase the likelihood of correct signal amplification of any
labeled
molecule. For example, the ordinary single biotin moiety attached to a
nucleotide by a
linker may be replaced with a polymer having multiple biotin moieties as side
chains,
such that the likelihood of a streptavidin molecule interacting with this
multimeric
affinity label is increased. Fluorescent labels may similarly be multiplied, as
may any
other labeling moieties. Measures must be taken in the design and synthesis of
such
multimerically labeled reagents to ensure that solubility is retained. This
may be
accomplished by choosing a highly soluble polymer as the backbone carrying
said
labeling moieties comprising the multimeric label.
1.4.10.4 Polymerization nucleating labels:
Any compound capable of serving as an initiator for some aqueous
polymerization may also serve as a labeling moiety. This initiator nucleates
the
formation of a perceptible polymer attached to the sample molecule. Such a
polymer
may, for example, comprise multiple fluorescent moieties, or simply effect a
local
change in transmittance of light or a local change of refractive index. After
detection
has been accomplished, said perceptible polymer is degraded or otherwise
removed
from the sample molecule. Such polymerizations may be self limiting, as is the
case
for some dendrimeric polymers.
For this label detection methodology, polymerization is caused to occur in a
step after the labeled nucleotide is added to the sample molecule, and must
proceed
via a chemistry that leaves the sample molecule intact. Degradation or
removal of
said perceptible polymer must also leave the sample molecule intact. Subject
to the
above stated limitation, any polymer and respective detection method may be
employed.
1.4.10.5 Enzymatic labels and conjugates thereof
1.4.10.5.1 Photochemical labeling
Various methods have been developed for the photochemical labeling of
molecules and especially biological macromolecules. These include detection of
affinity labels such as biotin with conjugates of streptavidin and an
appropriate
enzyme capable of catalysing the formation of a chromophore from a
chromophorigenic substrate, or capable of catalysing a photon liberating
chemical
reaction, as with the enzyme luciferase. Such photochemical labeling methods
will be
readily applicable as detection methods for various embodiments of the present
invention.
Note that multimeric affinity labels accessible for simultaneous association
with multiple such enzymes will enable greater signal amplification, as will
secondary
enzyme amplification techniques and other techniques known within the
molecular
biological and microscopic arts.
1.4.10.5.2 Cleavable linkers
Labeling moieties are favorably in communication with or coupled to
nucleotides via a linker of sufficient length to ensure that the presence of
said labeling
moieties on said nucleotides will not interfere with the action of a
polymerase enzyme
on said nucleotides. Linkers will also necessarily be of some minimal length
when
stepping control is effected through the use of various preformed enzyme-
nucleotide
complexes (as described below). Once a nucleotide has been added by
polymerization
to (the daughter strand of) a sample molecule, and the accompanying label has
been
detected, proper detection and discrimination of subsequent nucleotides
requires the
elimination of said accompanying label. This may favorably be accomplished
through
the cleavage of said linkers which have been designed and synthesized to admit
of
cleavage by treatments which will not degrade or otherwise modify the relevant
state
or information content of sample molecules.
Cleavability may be provided for in a number of ways which will be obvious
to those skilled in the arts of organic and synthetic chemistry. For example,
said
linker may include along its length one or more ester linkages, which will be
susceptible to hydrolysis, which may be sufficiently mild for various ester
functional
groups. Amide linkages may similarly be employed. Linkages comprising
disulfide
bonds within their length have been developed to provide for cleavability;
reagents
comprising such linkages are commercially available and have been used to
modify
nucleotides in a manner which may be conveniently reversed by treatment with
mild
reducing agents such as dithiothreitol. Cleavable linkages may be provided so
as to
minimize the portion of the linker which remains on the sample molecule.
Because
polymerases are relatively tolerant of linkers which may extend from various
atoms of
nucleotide molecules, it is not, however, critical that all of said linkers be
cleaved
away from the nucleotides incorporated into said sample molecules in the
process of
label removal.
Note that commercially available biotin derived nucleotides frequently
contain, along the linker joining said biotin moiety to said nucleotide
moiety, one or
more ester or amide bonds, which are susceptible to cleavage by various
chemical
treatments.
Note also that for linkers comprising appropriate bonds along their length,
enzymatic cleavage may be performed.
1.4.10.5.3 Dissociative cleavage:
Note that cleavage of a labeling moiety may also be effected by the disruption
of some affinity interaction which effects the communication between said
labeling
moiety and the nucleotide moiety. In such cases, moieties joined by non-
covalent
associations may, for example, be dissociated by physical or chemical changes
which
do not necessarily cleave covalent bonds.
Photocleavable moieties may also comprise an intermediate portion of linkers
joining labeling moieties to nucleotide moieties, such that upon photocleavage
of said
photocleavable moieties, communication between the termini of said linker is
disrupted and the label moiety is liberated from the nucleotide moiety.
Because
photodeprotection or photocleavage reactions generally proceed quite rapidly,
with
appropriate detection and photoexcitation means, detection, label removal and
nucleotide incorporation rates per sample molecule may approach the limit
imposed
by any particular polymerase enzyme and the processivity of said enzyme. Long
linkers with photocleavable termini have been synthesized.
Similarly, compounds which thermally degrade into two or more portions
may comprise an intermediate portion of such linkers, such that thermal
cycling may
be employed to effect linker cleavage. Thermostable polymerases may be
conveniently employed in embodiments availing thermolabile linkers.
1.4.10.5.4 Photomodification
Single dye molecule photobleaching has been directly observed. Fluorescent
labels of nucleotides, particularly when only one or a small number of such
moieties
are used for labeling, may be neutralized by photobleaching, such that while
some
product of said fluorescent label may remain in communication with the sample
molecule (e.g. the daughter strand of a polynucleotide being sequenced) it
will no
longer provide a signal sufficiently strong to interfere with the detection
and
discrimination of subsequently added labels.
Beyond photobleaching of fluorescent labels, affinity labels with appropriate
photochemical properties may be subjected to photochemical modification
rendering
them inert to binding, generally subsequent to dissociation of the
corresponding
receptor by appropriate means.
For affinity labels, fluorescent labels or any other labeling moieties,
chemical
modification appropriate to the chemistries of said labels which effects a
change or
reduction in the detectable signal provided by said label may be availed to
prevent
interference of said labels with similar or distinct labels subsequently added
to sample
molecules or complexes thereof.
1.4.10.5.5 Labeling with activation and thermodynamic decay:
Compounds such as spirobenzopyran, which have labile, structurally and
photochemically distinct but interconvertible isomers, may be used as labeling
moieties. Here, an excited state of such a moiety may be used as a means of
detection.
After said detection has been successfully effected, chemical modification of
one or
another state of such interconvertible molecules may then neutralize it.
Alternatively,
activation may cause such a label to convert to some unstable but discernible
state,
which then irreversibly degrades according to characterizable kinetics. Such
molecules must be chosen so as to remain in said discernible state for a
sufFcient time
period to permit detection, but reliably degrade (to completion for a
population of
such molecules) within a practical time period.
1.4.10.5.6 Binding reaction inhibition detection methods:
Agents which specifically inhibit binding reactions may be identified rapidly
through the detection of molecules, of a diverse library each molecular
species of
which is uniquely labeled, not bound by particles some sample which may
comprise
many different species, in the presence of a test reagent, which is labeled, and
permitted to
associate with said sample (preferably during a preincubation step before the
addition
of said diverse library to said sample,) in analogy to blocking antibody
assays.
Results are compared to those obtained with an aliquot of said diverse library
and
another portion of the same said sample. Such an assay may be performed for
increasing concentrations of said test reagent.
1.4.10.5.7 Enzymatically enforced associations at defined molecular sites:
Methods are provided to enforce highly specific associations and reactions,
including molecular recognition processes, on individual sample molecules or
on
populations and subpopulations of sample molecules. These are described for
genome
sequencing applications, but the methods included thereunder have broad
applicability, including to any molecular affinity interaction.
1.4.10.5.8 Enzymatically enforced template directed copolymer addition at
defined site:
Controlled comonomer addition: Various methods may be used to accomplish
the controlled addition of monomers, including nucleotides and especially
labeled or
protected nucleotides, to the daughter strand of a sample template molecule.
1.4.10.6 Rate control or accommodation:
Means of increasing the time required for the addition of a single nucleotide to
a
sample molecule will circumvent the requirement of stepping control. This will
be
particularly applicable for detection mechanisms not requiring separate
manipulation
steps (such as the separate association of beads to affinity labeled sample
molecules).
For example, the four nucleotides, each respectively labeled with unique,
removable
or neutralizable fluorescent labels, may be added to appropriately primed
sample
template molecules in the presence of polymerases, at low concentrations. Said
concentrations must be sufficiently low that two nucleotides are not added to
the
sample molecule in less than the time required to accomplish the detection of
the first
such addition. Because all labels are present in the observation field,
detection is
accomplished through the observation of the reduction of the Brownian motion
of a
fluorescent moiety due to its addition to the sample molecule, in close
analogy to the
experiments of Finzi and Gelles, but it will be noted that the change in
mobility is
much larger in the present case. Alternatively detection may be understood to
depend
on an increase in the net residence of some fluorescent moiety within a
defined region
or the occupancy of said region, above the occupancy arising from the
background
of unbound labeled nucleotides.
Such detection is preferably conducted with a scanning excitation beam
fluorescence confocal microscopic method as described above, or with a
scanning
detection light path, as also described above. Conditions (particularly
nucleotide
concentration) are chosen such that on average less than one labeled
nucleotide will
be present within the area illuminated by such a beam or thus observed, so
that a light
pulse of appropriate frequency passing through, for example, the pinhole which
effects the scanning of the excitation beam, may be used to photobleach or
photocleave the fluorescent label from the sample molecule after it has been
detected
to have been added to the sample molecule, without the appreciable
accumulation of
incidentally unlabeled nucleotides. Alternatively, an SLM may be used to
spatially
control illumination of the sample by an appropriate frequency of light to
effect
photochemical unlabeling, and thus permit the simultaneous unlabeling of
multiple
sample molecules.
This method may be understood as concentration modulated control of the
kinetics of polymerization processivity, which is used to facilitate direct
observation
of successive addition of individual (labeled) nucleotides, with controlled
unlabeling.
Scanning rate and other instrumentation dependent parameters will influence
optimal
conditions and concentrations. Thus, the addition of
comonomers is directly and dynamically observed, and sequence information for the
respective
sample molecule may be reconstructed accordingly.
1.4.10.7 Stepping control by equilibrium means:
A simple method to effect adequate stepping control for sequencing
applications of the present invention relies on equilibrium control. In this
method,
nucleotides (which are labeled) are limiting, and there is a relative
excess of sample
molecules. Exonuclease activity intrinsic to most polynucleotide polymerases
is
circumvented by the use of alpha-phosphorothioate nucleotides (which are
appropriately labeled) which are resistant to such degradation, in this
method. Other
nucleotide derivatives or analogs suitable as substrates for polymerases and
yielding
exonuclease resistant polynucleotides may likewise be employed.
As an example of equilibrium controlled stepping, a thirty-three-fold excess
of
sample molecules relative to labeled complementary nucleotides per cycle may
be
chosen. Polymerase molecules are preferably provided in excess of sample
template
molecules. Each sample molecule has a three percent chance of undergoing a
single
nucleotide addition. Nucleotides are rapidly depleted. Any sample molecule
which
has undergone one nucleotide addition has a further three percent chance, or
in total
approximately a 0.1 % chance of undergoing a second nucleotide addition. For a
sequencing segment run of 20 bases per sample molecule, each segment will
experience an error contribution of (20)(0.1 %) or 2% from multiple additions
within a
cycle. Such erroneous segment data will be conspicuous when oversampling is
performed due to the correspondingly low frequency with which it occurs.
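The arithmetic of this example can be checked with a short sketch. The function name and its parameters are illustrative assumptions and do not appear in the disclosure; the sketch follows the text's approximation that the chance of a second addition is the square of the chance of a first, and that segment error grows linearly with run length.

```python
def equilibrium_stepping_error(excess_fold, run_length):
    """Return (p_single, p_double, segment_error) for equilibrium stepping.

    excess_fold: fold excess of sample molecules over labeled nucleotides
    run_length:  number of bases sequenced per sample molecule
    """
    p_single = 1.0 / excess_fold            # chance a molecule adds one nucleotide
    p_double = p_single ** 2                # chance it adds a second in the same cycle
    segment_error = run_length * p_double   # linear approximation used in the text
    return p_single, p_double, segment_error


p1, p2, err = equilibrium_stepping_error(excess_fold=33, run_length=20)
print(f"single addition: {p1:.1%}, double addition: {p2:.2%}, segment error: {err:.1%}")
```

For a thirty-three-fold excess and a 20 base run this gives roughly 3%, 0.09% and 1.8%, which the text rounds to 3%, 0.1% and 2%.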
Alternatively, for tenfold excess of sample molecules with respect to labeled
complementary nucleotides, there is a 1 % chance per base of multiple
additions to the
same molecule, or, again for sequencing runs of 20 bases, a 20% chance that a
segment
experiences at least one duplicate addition event. For five-fold oversampling,
the
binomial distribution indicates that there is approximately a 94.2% chance
that three
or more segments including a particular base contain correct data regarding
that base.
Any specific individual data error is highly unlikely to occur more than once
for
fivefold oversampling. Note that in practice such calculations will also have
to
account for label amplification error and label detection error, but these
error
contributions should be susceptible to reduction to manageably low levels.
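The binomial figure quoted above can be reproduced as follows. This is a minimal sketch: the helper name prob_at_least is an assumption introduced for illustration, and segments are treated as independent, each correct with probability 80% (that is, 100% less the 20% chance of a duplicate addition).

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials of probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Tenfold excess: 1% chance per base of a multiple addition, 20 base run.
p_segment_bad = 20 * 0.01           # chance a segment carries a duplicate addition
p_segment_ok = 1 - p_segment_bad    # chance a segment is correct

# Five-fold oversampling: at least three of the five segments are correct.
majority_correct = prob_at_least(3, 5, p_segment_ok)
print(f"{majority_correct:.1%}")
```

The result agrees with the approximately 94.2% stated in the text.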
More generally, for a ratio x of nucleotide molecules to sample template
molecules with a complementary base properly located relative to the primer,
for
x<1 there is a probability p equal to x that a particular sample molecule will
experience the addition of at least one nucleotide and a probability p^k that
any sample
molecule will experience at least k nucleotide additions within the same
sequencing
cycle. Multiple nucleotide additions to a sample molecule within the same
sequencing cycle will result in erroneous sequence information being obtained
from
said sample molecule. The probability (d) of such a multiple incorporation
error
occurring within the sequence segment data obtained from a particular sample
molecule in a sequencing run of n bases will be less than 2(n)(p^2). The net
sequence
information per sample molecule obtained per sequencing cycle will be x bases,
and
the net sequence information for a sample with N molecules will be (x)(N)
bases,
which will be large for large N. For example, with x=.03 and N=3.3x10^10,
there will be a net raw data accumulation of approximately 10^9 bases per
cycle, which, with one-hundred-fold oversampling (i.e. due to each sequence
being represented 100 times in the sample) will yield 10^7 bases of data per
cycle; for a desired segment length of n=15 bases, n/x=(15)/(.03) or
approximately 500 sequencing cycles will be required per run, and the run
will yield 1.5x10^8 bases of information. For polymerase fidelity of 95% (an
extremely low value chosen for purposes of illustration) there will be a 5%
error rate (e) per base or a segment error rate of (n)(e)=75% per molecule,
but the probability of two erroneous sequence segments having identical
sequences will be (e^2)(1-e)^(n-1) for segments with a single base error,
which will be the most frequent error species. For this example, this yields
a 0.12% frequency. Methods similar to those used to determine consensus
sequences may thus be employed to obtain highly accurate data in spite of
less than perfect polymerization fidelity. Thus, fidelity error components
will be negligible compared to multiple base incorporation errors. For this
example, multiple base incorporation error components will yield an error
rate of less than (2)(15)(.03)^2 or about 3% per molecule. Again,
oversampling will readily detect such errors, which will occur identically
for two molecules with only d^2=(.03)^2 or less than 0.1% probability,
yielding a far lower error rate for oversampled data.
1.4.10.8 Stepping control by removable protecting groups:
Stepping control may favorably be applied to any polymerization process
useful within the scope of the present invention, including both genome
sequencing
and affinity characterization applications.
Template directed polymerization depends on the processive addition of
comonomers at the terminus of a growing daughter strand as specified by the
respective complementary base of the parent template strand. Complementarity
may
be enforced through molecular recognition of said complementarity of protected
analogs of said comonomers with the appropriate base of a template molecule,
by the
action upon such protected comonomers of appropriate polymerase enzymes.
Numerous monomers which may thus be added but do not provide an
appropriate chemical functional group for subsequent elongation of the
polynucleotide strand to which they have been enzymatically added are known
within
the relevant arts, and are generally referred to as chain terminators. Any
such
terminators which may be chemically or photochemically modified, particularly
in a
manner not disrupting the sample molecule, to a form which may support
subsequent
addition of comonomers in the usual manner, may be employed to effect
controlled
stepping of polymerization addition.
Removable protecting groups are particularly advantageous for the genome
sequencing applications of the present invention because they may be utilized
to
permit and ensure that exactly one nucleotide is added to a sample molecule
per
sequencing cycle. This will permit an even greater rate of data accumulation
than may
be achieved by equilibrium control methods, with which only a fraction of the
sample
molecule population per cycle yields data.
Photoremovable protecting groups may be used to gain similar advantage but
further permit controlled spatial localization of deprotection. Examples of
such
nucleosides have been prepared.31 Because photodeprotection reactions
generally
proceed rapidly, with appropriate detection and photoexcitation means,
processivity
and nucleotide incorporation rates per sample molecule may approach the limit
imposed by any particular polymerase enzyme.
Nucleotide analogs comprising such removable protecting groups preferably
further comprise labeling moieties. A particularly convenient category of such
compounds comprises a labeling moiety or multimer thereof in communication
with
the nucleotide moiety exclusively through said removable protecting group. For
such
compounds, removal of said removable protecting group will simultaneously
effect
removal of said labeling moiety. Simultaneous removal of both protecting
moiety and
labeling moiety will conveniently prepare a sample molecule for the next
sequencing
cycle in a single step.
Enzymological evidence concerning binding of 3' acetate esterified
nucleotides and 5'-triphosphate-3'-(nucleoside-5'-monophosphate) to the
triphosphate
binding site of E. coli Polymerase I supports the acceptability of 3' modified
nucleotides as substrates for this enzyme. Such protecting groups should
therefore be
compatible with either naturally occurring or genetically modified
polymerases.
Note that in other applications of the present invention, primers comprising a
photodeprotectable 3' hydroxyl terminus (which may be synthesized by the
polymerization of an appropriate 3' protected nucleotide onto the unprotected
3'
hydroxyl of an oligo- or poly-nucleotide, for instance, by the action of
terminal
deoxynucleotidyl transferase) may provide for the selective polymerization of
a
polynucleotide moiety selectable by control over illumination of the
appropriate
region of the sample. A polynucleotide moiety to which such a primer is
hybridized
and then selectively deprotected may thus be subjected to amplification
techniques
such as PCR in a selectable manner. Such modified primers shall simply be
referred
to as photoactivatable primers.
The 3' deprotectable nucleotides employed in some variations on the present
invention may also find other uses in molecular biology and biotechnology.
They may
be used as chain terminators in conventional enzymatic sequencing methods. If
such
manipulations are performed, any species terminating in a particular base may
be
extracted from the resolution medium (conventionally polyacrylamide gel),
deprotected and then subjected to other manipulations requiring an active 3'
hydroxyl
group, such as ligation.
1.4.10.9 Enzyme adaptation to specific substrates:
The emergence of resistance to chain terminating nucleotide analogs by
various viral polynucleotide polymerases suggests a convenient method for the
in
vitro evolution of polymerases capable of using reversibly 3' protected
nucleotide
analogs, or nucleotide analogs which otherwise serve as chain terminators
which may
be reactively modified to form an elongation competent molecule after
incorporation
into a polynucleotide. Further selection constraints may be concurrently or
subsequently applied to fidelity, such as the inclusion of non-sense codons in the
coding
region of a dominant lethal protein coding gene which is carried by the same
genetic
material carrying the polymerase gene under selection, such that misreading of
the
non-sense codon, by the polymerase under selection, will effect lethality to
the host
and thus select against low-fidelity polymerases.
As stated above, such deprotectable compounds may serve as a convenient
stepping control means for polymerization. Included among such deprotectable
nucleotides are nucleotides with photocleavable protecting groups, including
those
which reside on the 3' hydroxyl of a nucleotide.
1.4.10.10 Label encoding and labeling methods for data collection:
Various systems may be used to represent the data corresponding to the
occurrence of an affinity interaction. The complexity required of such a
representational system will be determined by the types of molecules and
associations
being examined and the extent to which manipulative steps are to be minimized.
The most rudimentary encoding system will be a one-bit binary labeling
system, consisting of only one label moiety type, indicating whether or not an
association of only one resolvable type occurred during the preceding
association
step.
For example, consider a sequencing application employing only a single
nucleotide labeling moiety. Such a system may avail each of the four
nucleotides
modified with a biotin moiety attached by a sufficiently long, cleavable
linker arm. In
such a case, a polymerization sub-cycle comprises: the incubation of sample
template
molecules bearing appropriate primers with an appropriate polymerase and
limiting
quantities of only one labeled nucleotide (and no unlabeled nucleotides) such
that this
monomer will be added only if the template molecule has the complementary base
in
the template position immediately 5' to the base opposite the 3' terminal base
of the
primer, and no monomers will be added otherwise; sample molecules are then
washed
to remove any remaining free nucleotides; the sample is then exposed to excess
quantities of streptavidin modified fluorescently labeled beads for a
sufficient length
of time to ensure that all biotin moieties are bound by said labeled beads,
and then all
unbound beads are washed away; detection is then performed and data recorded;
linkers are then cleaved. Said sub-cycle is repeated for the remaining three
nucleotides, to constitute a cycle which successively tests for the presence
in the
sample template molecule of each type of base immediately 5' to the base opposite
the 3'
terminal base of the primer. If a sample molecule does not bind any label
through
such a cycle, then it was most likely "missed" due to the limiting
concentration of
nucleotides used to effect stepping of polymerization. If a sample molecule is
labeled
multiply during such a cycle, then the respective subsequent bases are
detected as
occurring in the template according to the pattern of labeling.
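The one-label cycle described above can be illustrated with an idealized simulation. This is a sketch under stated assumptions, not the disclosed procedure: the function names, the sub-cycle order "ACGT" and the example template are hypothetical, every complementary nucleotide is assumed to be incorporated (so no molecule is "missed"), and at most one nucleotide is added per sub-cycle.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def run_cycles(template, n_cycles, order="ACGT"):
    """Simulate idealized one-label sequencing cycles.

    template: template bases in the order the polymerase encounters them.
    Returns one list per cycle holding four booleans, one per sub-cycle,
    telling whether a label was detected after that sub-cycle.
    """
    pos = 0
    observations = []
    for _ in range(n_cycles):
        cycle = []
        for nucleotide in order:
            labeled = pos < len(template) and COMPLEMENT[template[pos]] == nucleotide
            if labeled:
                pos += 1    # idealization: exactly one addition per positive sub-cycle
            cycle.append(labeled)
        observations.append(cycle)
    return observations


def decode(observations, order="ACGT"):
    """Recover the template bases from the per-sub-cycle labeling pattern."""
    return "".join(
        COMPLEMENT[nucleotide]
        for cycle in observations
        for nucleotide, labeled in zip(order, cycle)
        if labeled
    )


template = "TGCATTG"    # hypothetical template, read in order of extension
observations = run_cycles(template, n_cycles=len(template))
print(decode(observations))
```

A cycle in which several sub-cycles give a signal corresponds to the multiply labeled case in the text: the successive template bases are read off in the order in which the sub-cycles were positive.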
A somewhat more efficient encoding system is provided if two distinct
labeling moieties may be availed. Each nucleotide will be indicated by the
presence or
absence of each of the two moiety types, as a binary code. The moieties may,
for
instance be biotin (B) and digoxigenin (D). For example, the representation
may be:
A=B+D; T=B; G=D. These three nucleotides are added for a first polymerization
sub-
cycle, and all unbound reagents then washed away. Either two perceptibly
distinct
bead types may be used for simultaneous detection, provided distinct affinity
labels
are sufficiently well separated by extended linkers for simultaneous binding,
or a
single bead type with two distinct receptor molecules may be used in two
separate
binding and release cycles, in which case the release of one bead type will
have to
leave the remaining affinity moiety bound to sample molecules.
After detection of bead labels, all remaining beads are removed and, in a second
sub-cycle, C nucleotides affinity labeled with only one moiety are then
polymerized onto sample molecules and appropriate detection is performed.
Where
protecting groups are used to effect stepping control, only one sub-cycle is
needed and
C may be unlabeled. In such cases unlabeled molecules will be detected as
having
added a cytidine.
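The two-label binary code can be written as a small lookup table. The sketch below covers only the variant in which protecting groups enforce stepping, so that an unlabeled molecule is read as having added C; the names TWO_LABEL_CODE and decode_base are assumptions introduced for illustration.

```python
# (biotin detected, digoxigenin detected) -> nucleotide added in this cycle,
# following the example representation A=B+D, T=B, G=D.
TWO_LABEL_CODE = {
    (True, True): "A",
    (True, False): "T",
    (False, True): "G",
    # Valid only when protecting groups guarantee one addition per cycle:
    # a molecule carrying neither label is taken to have added unlabeled C.
    (False, False): "C",
}


def decode_base(biotin_detected, digoxigenin_detected):
    """Map the presence or absence of the two affinity labels to a nucleotide."""
    return TWO_LABEL_CODE[(biotin_detected, digoxigenin_detected)]


print(decode_base(True, True), decode_base(True, False),
      decode_base(False, True), decode_base(False, False))
```

Without protecting groups the (False, False) entry is ambiguous, since it may also mean that no nucleotide was added, which is why the text calls for a separate sub-cycle with labeled C in that case.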
More conveniently, nucleotides of each of the four types distinctly labeled
with a fluorescent dye moiety may be used with fluorescence detection means,
and a
sequencing cycle consisting of only one sub-cycle. Alternatively, four
antibodies (or
four other appropriate receptor molecules or affinity reagents) which each
bind each
of the four distinct dye moieties may be bound to each of four perceptibly
distinct
beads. In another arrangement, nucleotides may each be labeled with some
distinct
combination of multiple dye moieties, again encoding a unique binary label.
1.4.11 UTILITY OF THE SEQUENCE OF A GENOME
The present invention provides methods of detection and discrimination which
address the complexity found in biological systems, though they may further be
applied to non-natural systems including but not limited to mimetics. Much of
this
complexity derives from combinations or permutations of simple units such as
the
four nucleotide bases of polydeoxyribonucleic acids and polyribonucleic acids,
or the
twenty common amino acids found in polypeptides and proteins.
This complexity, which underlies the most diverse and nuanced of biological
processes, has presented both the promise that ultimately much mechanistic
knowledge of biological processes may be gained through the accumulation of
greater
information about underlying structures and biopolymer sequences, and the
correspondingly motivated challenge of full enumeration and determination of
these
structures and sequences.
Because typical eukaryotic genomes contain between 10^7 and 10^10
DNA base
pairs, and because there are several well studied organisms of particular
interest,
economical and technically simple methods capable of determining the full
genome
sequence of an individual organism over a conveniently short period of time
would be
particularly desirable.
The present invention can find applications in many fields, for instance,
medical, diagnostic, forensic, genetics, biotechnology, and genome research.
It should
be noted that this technique would be applicable in many other fields and
instances,
and such applications would be discernible by people of ordinary skills in the
respective fields.
The availability of such sequencing methods would enable greater clinical
applications of molecular medicine, would facilitate greater and safer
application of
gene therapy, would permit timely completion of the several genome projects
within
fiscal constraints, and would enable facile gathering of genome information on
populations of individuals, which would have applications in such areas as the
study
of polygenic diseases, epidemiology and field ecology. Such applications are
presently limited by the cost and cumbersome nature of existing sequencing
methodologies.
Combinatorial chemistry, affinity characterization, therapeutic synthetic
immunochemistry, pharmacology and drug development, in vitro evolution and
other
fields concerned with the elaboration of a diverse population of molecules,
their
characterization according to desired properties, and recovery or
identification of
molecules displaying suitable characteristics may be favorably improved by the
availability of methods which permit the introduction of and both qualitative
and
quantitative characterization of kinetic and equilibrium properties of
molecular
recognition and binding phenomena, particularly where such parameters may be
used
as selective constraints.
There has further been some interest in rebuilding or supplementing the
immune systems of immunocompromised individuals, and in the development of
highly specific antibiotic agents targeted to antibiotic, antifungal or
antiviral resistant
or otherwise poorly treatable pathogens. Both of these goals may be furthered
by the
use of the methods of the present invention as they may readily be applied to
the
determination of pathogen specificity and antigenicity.
1.4.11.1. Application: Gene finding
An integrated clone map is constructed by the method described herein. When
the bin probes include polymorphic genetic markers, and these markers are
typed
against the DNAs of members of families carrying a genetic trait, that trait
can be
genetically localized on the map relative to one or more bin probes. Depending
on the
study design, this genetic localization can be carried out using one of a
variety of
methods (G. M. Lathrop and J.-M. Lalouel, "Efficient computations in
multilocus
linkage analysis," Amer. J. Hum. Genet., vol. 42, pp. 498-505, 1988; T. C.
Matise, M.
W. Perlin, and A. Chakravarti, "Automated construction of genetic linkage maps
using an expert system (MultiMap): application to 1268 human microsatellite
markers," Nature Genetics, vol. 6, no. 4, pp. 384-390, 1994; E. S. Lander and
D.
Botstein, "Mapping Complex Genetic Traits in Humans: New Methods Using a
Complete RFLP Linkage Map," in Cold Spring Harbor Symposia on Quantitative
Biology, vol. LI, Cold Spring Harbor, Cold Spring Harbor Laboratory, 1986, pp.
49-
62; L. Penrose, Ann. Eugenics, vol. 18, pp. 120-124, 1953; N. E. Morton, Am.
J.
Hum. Genet., vol. 35, pp. 201-213, 1983; N. Risch, Am. J. Hum. Genet., vol.
40, pp.
1-14, 1987; E. Lander and D. Botstein, Genetics, vol. 121, pp. 185-199, 1989;
N.
Risch, "Linkage strategies for genetically complex traits," in three parts,
Am. J. Hum.
Genet., vol. 46, pp. 222-253, 1990; N. Risch, Genet. Epidemiol., vol. 7, pp. 3-
16,
1990; N. Risch, Am. J. Hum. Genet., vol. 48, pp. 1058-1064, 1991; P. Holmans,
"Asymptotic Properties of Affected-Sib-Pair Linkage Analysis," Am. J. Hum.
Genet.,
vol. 52, pp. 362-374, 1993; N. Risch, S. Ghosh, and J. A. Todd, "Statistical
Evaluation of Multiple-Locus Linkage Data in Experimental Species and Its
Relevance to Human Studies: Application to Nonobese Diabetic (NOD) Mouse and
Human Insulin-dependent Diabetes Mellitus (IDDM)," Am. J. Hum. Genet., vol.
53,
pp. 702-714, 1993; R. C. Elston, in Genetic Approaches to Mental Disorders, E.
S.
Gershon and C. R. Cloninger, ed. Washington DC: American Psychiatric Press,
1994,
pp. 3-21), incorporated by reference.
Following genetic localization relative to the bin probes, the integrated
contiged clone map provides an immediate means to proceed with positional
cloning
procedures. (D. Cohen, I. Chumakov, and J. Weissenbach, Nature, vol. 366, pp.
698-
701, 1993; B.-S. Kerem, J. M. Rommens, J. A. Buchanan, D. Markiewicz, T. K.
Cox,
A. Chakravarti, M. Buchwald, and L.-C. Tsui, "Identification of the cystic
fibrosis
gene: genetic analysis," Science, vol. 245, pp. 1073-1080, 1989; J. R.
Riordan, J. M.
Rommens, B.-S. Kerem, N. Alon, R. Rozmahel, Z. Grzelczak, J. Zielenski, S.
Lok, N.
Plavsic, J.-L. Chou, M. L. Drumm, M. C. Iannuzzi, F. S. Collins, and L.-C.
Tsui,
"Identification of the cystic fibrosis gene: cloning and characterization of
complementary DNA," Science, vol. 245, pp. 1066-1073, 1989), incorporated by
reference. When an expression of candidate genes is included in the mapping
resource
(e.g., ESTs, cDNAs), the search may proceed more rapidly. When the genome
sequences of the clones in the region have been determined, the gene search
may be
done in part using computer searches for candidate genes.
1.4.11.2 Application: Structure/function relation
The sequence of a genome is determined by the method described herein.
From this genome sequence, the relation of a gene or its promoters to other
known
functions may be determined using similarity or homology searches. Protocols
for
these determinations are well described (N. J. Dracopoli, J. L. Haines, B. R.
Korf, C.
C. Morton, C. E. Seidman, J. G. Seidman, D. T. Moir, and D. Smith, ed.,
Current
Protocols in Human Genetics. New York: John Wiley and Sons, 1995),
incorporated
by reference. The use of expressed sequence tag (EST) databases (Merck Gene
Index,
St. Louis, Mo.; Human Genome Sciences, Gaithersburg, Md.) together with the
genome sequence provides a highly effective means for rapidly correlating a
gene's
sequence with the structure and function of its protein products.
1.4.11.3 Application: Metabolic network determination
The sequence of a genome is determined by the method described herein.
Using the RT-PCR technique of differential display, perturbations on the cell
state can
be assayed in terms of DNA expression. Select perturbations can elucidate the
metabolic networks of coupled enzyme systems in the cell. Reference back to
the
DNA sequence of the genome provides information about local control and
gene/promoter interactions. This information can be used to understand disease
mechanisms and to develop new pharmaceutical agents to alleviate said
diseases.
1.4.11.4 Application: Growth and development
The sequence of a genome is determined by the method described herein. A
method is described for constructing an integrated genetic-physical-expression
map
that includes the genome sequence and cDNAs. It is currently impractical to
map very
large numbers of cDNAs at high resolution, due to the currently used
technology of
sequencing each cDNA, constructing PCR primers for it, and then performing
multiple PCR amplifications and detections relative to a panel of RHs to
accurately
map even a single cDNA. However unobvious it may currently seem to those
skilled
in the art, it would nonetheless be extremely desirable for elucidating the
mechanism
of cell growth and organism development to construct and map tissue-specific
cDNA
expression libraries at numerous points (e.g., at least every 24 hours) early
in
organism development. Further, the mapping of these expressed sequences back
to
their genomic locations would provide information on candidate genes, local
gene
expression, the coordination of normal and diseased cellular function under
genetic
control, and the time course of development in different tissues that would be
highly
useful in developing new diagnostic tests and therapeutic treatments for human
disease. The method of the said examples provides such a novel means for
practical
rapid and high-resolution mapping of many expression libraries that would
otherwise
be neither constructed nor mapped.
1.4.11.5 Application: Drug development
A sequence and map of a genome is determined by the method described
herein. The sequence of the human genome or integrated clone maps can be used
to
identify genes that are causative for human disease. From such genes, and
their DNA
promoters and protein products, mechanisms of diseases related to said genes
can be
determined. Pharmacological agents that intervene at key junctures in gene-
related
functions can then be devised to specifically circumvent and treat diseases
related to
these genes.
1.4.11.6 Application: Diagnostic testing
A sequence and map of a genome is determined by the method described
herein. The sequence of the human genome or integrated clone maps can be used
to
identify genes that are causative for human disease. From such genes, and
their DNA
promoters and protein products, mechanisms of diseases related to said genes
can be
determined. Diagnostic tests that detect key junctures in gene-related
structures and
functions can then be devised to diagnose diseases related to these genes, and
develop
kits.
1.4.11.7 Application: Animal models
The sequence of a genome is determined by the method described herein. In
the current art, sequencing even one complete mammalian genome is a highly
debated
and very expensive proposition (estimated to cost around one billion dollars)
which is
not likely to be performed more than once. However, the novel sequencing
method
described renders sequencing more practical, since it produces a high-resolution
clone map which can be used to cost-effectively direct the sequencing effort
and to
practically assemble the resulting sequences. Given the pressing medical need
for
sequencing a mammalian genome, and the absence of any such useful coordinating
map, clearly the described invention is highly nonobvious.
By constructing a map as described in the method described herein, the
upfront burden of building maps for mammalian species other than humans is
considerably reduced. Further, since the cost per base of sequencing is
expected to
diminish, particularly as newer sequencing technologies become established,
the
described method provides the first useful starting point for beginning (and
eventually
completing) the DNA sequence determination of model animal genomes. Comparison
of the DNA sequences and genes between human and model organisms is a
well-established route for understanding and treating human disease.
1.4.11.8 Application: Somatic cell hybrids
The method described herein describes an inner product mapping analysis
mechanism. Localization profiles are produced that can localize DNA sequences
to
high resolution. This inner product operation can be applied to somatic cell
hybrid
deletion panel data, thereby increasing the utility of such data by providing
more
confident and higher resolution localizations.
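
As a rough illustration of how an inner product can turn deletion panel data into a localization profile, consider the following minimal sketch. The panel contents, the +1/-1 coding of presence and absence, and the function names are all assumptions introduced here; the specification's own analysis may differ in its details. Each hybrid in the panel retains some bins of the chromosome, and a DNA sequence is scored against each bin by taking the inner product of its observed presence/absence vector with that bin's retention pattern.

```python
def to_signed(bits):
    """Recode presence (1) / absence (0) as +1 / -1 so that agreement adds and disagreement subtracts."""
    return [1 if b else -1 for b in bits]


def localization_profile(observed, panel):
    """Score each bin by the inner product of the observation with the bin's retention pattern.

    observed: one 0/1 entry per hybrid (was the sequence detected in that hybrid?)
    panel:    rows are hybrids, columns are bins (does the hybrid retain that bin?)
    """
    obs = to_signed(observed)
    n_bins = len(panel[0])
    profile = []
    for j in range(n_bins):
        column = to_signed([row[j] for row in panel])
        profile.append(sum(o * c for o, c in zip(obs, column)))
    return profile


def best_bin(profile):
    """Index of the highest-scoring bin."""
    return max(range(len(profile)), key=profile.__getitem__)


# Hypothetical panel: 5 hybrids (rows) by 4 bins (columns).
PANEL = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]

# A sequence that truly lies in bin 2 would give [0, 1, 1, 0, 1];
# here the last hybrid is scored incorrectly to mimic a data error.
OBSERVED = [0, 1, 1, 0, 0]
PROFILE = localization_profile(OBSERVED, PANEL)
```

In this toy case the profile still peaks at bin 2 despite the erroneous observation, which is the sense in which the inner product can give a more confident localization than an exact pattern match would.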
1.4.11.9 Application: Genome mismatch scanning
Genome mismatch scanning (GMS) (S. F. Nelson, J. H. McCusker, M. A.
Sander, Y. Kee, P. Modrich, and P. O. Brown, "Genomic mismatch scanning: a new
approach to genetic linkage mapping," Nature Genetics, vol. 4, no. May, pp. 11-
18,
1993), incorporated by reference, has been described as a powerful
hybridization-based
approach to genetic linkage mapping. GMS has applications both in the mapping
of
genetic traits and in the diagnosis and prevention of disease. What is
currently
impeding practical application of the GMS method is the lack of a sequence or
map of
the human (or animal model) genome that would provide densely spaced (e.g., 1
Mb) hybridization probes for the genome sampling step that scans the
mismatched
genome DNAs. Applicant's invention discloses a practical method for
constructing
such a sequence or map of a genome using the method described in the
specification.
In a preferred embodiment, densely spaced subsequences from the constructed
sequence of a genome are used as hybridization probes in GMS. In an
alternative
embodiment, densely spaced clones (or subsequences therefrom) from the
constructed
map of a genome are used as hybridization probes in GMS.
1.4.11.10 Application: Reliable maps from unreliable data
A sequence and map of a genome is determined by the method described
herein. It is generally believed that such maps can be reliably constructed
only from
highly reliable and relatively complete data. This belief adds considerably to
the time,
expense, and effort currently expended in constructing genome maps. However,
the
method described herein discloses a novel mechanism for constructing highly
reliable
maps from unreliable and incomplete data (J. von Neumann, "Probabilistic
logics and
the synthesis of reliable organisms from unreliable components," in Automata
Studies, C. E. Shannon and J. McCarthy, ed. Princeton, N.J.: Princeton
University
Press, 1956, pp. 43-98), incorporated by reference.
Specifically:
In step 6, table A's long-range characterization of the clone library can be
comprised of very noisy, highly unreliable hybridization data exhibiting large
error rates.
In step 9, table B's characterization of the long-range probe library can
be sparsely sampled. In some embodiments, a ≥1 Mb average inter-bin
distance suffices for accurate mapping and contig construction.
In step 14, table D's short-range characterization of the clone library has
a high tolerance for data errors.
This unobvious result is due to the considerable redundancy in the three
data tables, and to the noise filtering and consistency cross-checking
capabilities
of the analysis methods:
In step 11, table C is a highly reliable binning because the clean PCR-
based data table B is used as a global corrective for the noisy complex
hybridization-based data table A. This has been empirically demonstrated for
human chromosome 11.
In step 16, table E is a highly reliable contiging because every clone has
been probed with both long-range and short-range data. Therefore, the global
binning information relaxes the requirements on the short-range probings:
useful comparisons can be made within a relatively small bin region using
imperfect data.
1.4.11.11 Application: Mutation Detection
The techniques described herein will have a wide range of applications,
particularly wherever desired to determine if a target nucleic acid has a
particular
nucleotide sequence or some other sequence differing from a known sequence.
For
example, one application of the inventions herein is found in mutation
detection.
These techniques may be applied in a wide variety of fields including
diagnostics,
forensics, bioanalytics, and others.
For example, assume a "wild-type" nucleic acid has the sequence 5'-N1N2N3N4
where, again, N refers to a monomer such as a nucleotide in a nucleic acid and
the
subscript refers to position number. Assume that a target nucleic acid is to
be
evaluated to determine if it is the same as 5'-N1N2N3N4 or if it differs from
this
sequence, and so contains a mutation or mutant sequence. The target nucleic
acid is
initially exposed to an array of typically shorter probes, as discussed above.
Thereafter, one or more "core" sequences are identified, each of which would
be
expected to have a high binding affinity to the target, if the target does not
contain a
mutant sequence or mutation. In this particular example, one probe that would
be
expected to exhibit high binding affinity would be the complement to 5'-N1N2N3,
3'-P1P2P3, assuming a 3-mer array is utilized. Again, it will be recognized that
the probes
and/or the target may be part of a longer nucleic acid molecule.
As an initial screening tool, the absolute binding affinity of the target to
the 3'-
P1P2P3 probe will be utilized to determine if the first three positions of the
target are
of the expected sequence. If the complement to 5'-N1N2N3 does not exhibit
strong
binding to the target, it can be properly concluded that the target is not of
the wild-
type.
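
The initial screening step can be sketched as follows. This is a simplified illustration rather than the measurement itself: binding affinity is idealized as the number of non-complementary positions between a probe and the target, standard Watson-Crick DNA pairing is assumed, and the function names and example sequences are hypothetical.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_probe(segment_5to3):
    """Return the probe 3'-P1P2P3... that pairs position by position with 5'-N1N2N3..."""
    return "".join(COMPLEMENT[base] for base in segment_5to3)


def mismatch_count(probe_3to5, target_5to3):
    """Count positions where the probe base is not the complement of the target base."""
    return sum(1 for p, n in zip(probe_3to5, target_5to3) if COMPLEMENT[n] != p)


def passes_initial_screen(target, wild_type, k=3):
    """Idealized screen: the core probe built from the wild type binds the target
    strongly only when the first k target positions are fully complementary to it."""
    core_probe = complement_probe(wild_type[:k])
    return mismatch_count(core_probe, target[:k]) == 0


WILD_TYPE = "ACGT"  # stands in for 5'-N1N2N3N4
CORE_PROBE = complement_probe(WILD_TYPE[:3])
```

A target that differs from the wild type within the first three positions fails this screen, which corresponds to the weak binding case described above. A real assay would compare measured intensities against a threshold rather than count mismatches directly.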
The single base mismatch profile can also be utilized according to the present
invention to determine if the target contains a mutant or wild-type sequence.
As
shown herein, the single base mismatch plots for wild-type targets generally
follow
the typical, smile-shaped plot. Conversely, when the target has a mutation at
a
particular position, not only will the absolute binding affinity of the target
to a
particular core probe be less, but the single base mismatch characteristics
will deviate
from expected behavior.
According to one aspect of the invention, a substrate having a selected group
of nucleic acids (otherwise referred to herein as a "library" of nucleic
acids) is used
in the determination of whether a particular nucleic acid is the same or
different than a
wild-type or other expected nucleic acid. Libraries of nucleic acids will
normally be
provided as an array of probes or "probe array." Such probe arrays are
preferably
formed on a single substrate in which the identity of a probe is determined by
way of
its location on the substrate. Optionally, such substrates will not only
determine if the
nucleotide sequence of a target is the same as the wild-type, but it will also
provide
sequence information regarding the target. Such substrates will find use in
fields
noted above such as in forensics, diagnostics, and others. Merely by way of
specific
example, the invention may be utilized in diagnostics associated with sickle
cell
anemia detection, detection of any of the large number of P-53 mutations, for
any of
the large number of cystic fibrosis mutations, for any particular variant
sequence
associated with the highly polymorphic HLA class 1 or class 2 genes
(particularly
class 2 DP, DQ and DR beta genes), as well as many other sequences associated
with
genetic diseases, genetic predisposition, and genetic evaluation.
When a substrate is to be used in such applications, it is not necessary to
provide all of the possible nucleic acids of a particular length on the
substrate. Instead,
it will be necessary using the present invention to provide only a relatively
small
subset of all the possible sequences. For example, suppose a target nucleic
acid
comprises a 5-base sequence of particular interest and that one wishes to
develop a
substrate that may be used to detect a single substitution in the 5-base
sequence.
According to one aspect of the invention, the substrate will be formed with
the
expected 5-base sequence formed on a surface thereof, along with all or most
of the
single base mismatch probes of the 5-base sequence. Accordingly, it will not
be
necessary to include all possible 5-base sequences on the substrate, although
larger
arrays will often be preferred. Typically, the length of the nucleic acid
probes on the
substrate according to the present invention will be between about 5 and 100
bases,
between about 5 and 50 bases, between about 8 and 30 bases, or between about 8
and
15 bases.
By selection of the single base mismatch probes among all possible probes of
a certain length, the number of probes on the substrate can be greatly
limited. For
example, in a 3-base sequence there are 64 possible DNA base sequences, but
there
will be only one exact complement to an expected sequence and 9 possible
single base
mismatch probes. By selecting only these probes, the diversity necessary for
screening will be reduced. Preferably, but not necessarily, all of such single
base
mismatch probes are synthesized on a single substrate. While substrates will
often be
formed including other probes of interest in addition to the single base
mismatches,
such substrates will normally still have less than 50% of all the possible
probes of n-bases, often less than 30% of all the possible probes of n-bases,
often less than 20% of all the possible probes of n-bases, often less than 10%
of the possible probes of n-bases, and often less than 5% of the possible
probes of n-bases.
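
The counting argument can be checked with a short sketch. The function names and example sequences are illustrative assumptions; for simplicity the enumeration is carried out on the expected sequence itself, since taking complements maps sequences one to one and leaves the counts unchanged. For a sequence of n bases there are 4^n possible sequences, one exact match, and 3n single base mismatches.

```python
from itertools import product

BASES = "ACGT"


def all_sequences(n):
    """Every possible DNA sequence of length n (4**n of them)."""
    return ["".join(p) for p in product(BASES, repeat=n)]


def single_base_mismatches(seq):
    """Every sequence that differs from seq at exactly one position (3 * len(seq) of them)."""
    variants = []
    for i, base in enumerate(seq):
        for alt in BASES:
            if alt != base:
                variants.append(seq[:i] + alt + seq[i + 1:])
    return variants


def probe_set(seq):
    """The reduced set: the expected sequence plus all of its single base mismatches."""
    return [seq] + single_base_mismatches(seq)


def fraction_of_all(n):
    """Share of all n-base sequences covered by the reduced probe set."""
    return (1 + 3 * n) / 4 ** n
```

For a 3-base sequence this gives 64 possible sequences and 9 single base mismatches. For a 5-base sequence the reduced set holds 16 of 1024 possible sequences, or about 1.6%, which is well under the 5% figure mentioned above.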
Nucleic acid probes will often be provided in a kit for analysis of a specific
genetic sequence. According to one embodiment the kits will include a probe
complementary to a target nucleic acid of interest. In addition, the kit will
include
single base mismatches of the target. The kit will normally include one or
more of C,
G, T, A and/or U single base mismatches of such probe. Such kits will often be
provided with appropriate instructions for use of the complementary probe and
single
base mismatches in determining the sequence of a particular nucleic acid
sample in
accordance with the teachings herein. According to one aspect of the
invention, the kit
provides for the complement to the target, along with only the single base
mismatches. Such kits will often be utilized in assessing a particular sample
of genetic
material to determine if it indicates a particular genetic characteristic. For
example,
such kits may be utilized in the evaluation of a sample as mentioned above in
the
detection of sickle cell anemia, detection of any of the large number of P-53
mutations, detection of the large number of cystic fibrosis mutations,
detection of
particular variant sequence associated with the highly polymorphic HLA class 1
or
class 2 genes (particularly class 2 DP, DQ and DR beta genes), as well as
detection of
many other sequences associated with genetic diseases, genetic predisposition,
and
genetic evaluation.
Accordingly, it is seen that substrates with probes selected according to the
present invention will be capable of performing many mutation detection and
other
functions, but will need only a limited number of probes to perform such
functions.

JUMBO APPLICATIONS/PATENTS
THIS SECTION OF THE APPLICATION/PATENT CONTAINS MORE THAN ONE
VOLUME
THIS IS VOLUME 1 OF 4
CONTAINING PAGES 1 TO 267
NOTE: For additional volumes, please contact the Canadian Patent Office

Event History

Description Date
Inactive: IPC expired 2018-01-01
Application Not Reinstated by Deadline 2007-06-14
Time Limit for Reversal Expired 2007-06-14
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice 2006-06-14
Inactive: Abandon-RFE+Late fee unpaid-Correspondence sent 2006-06-14
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: IPC from MCD 2006-03-12
Inactive: Office letter 2005-12-06
Revocation of Agent Requirements Determined Compliant 2005-12-06
Appointment of Agent Requirements Determined Compliant 2005-12-06
Revocation of Agent Request 2005-11-25
Appointment of Agent Request 2005-11-25
Letter Sent 2003-07-28
Letter Sent 2003-07-28
Inactive: Single transfer 2003-06-23
Inactive: Correspondence - Formalities 2003-06-12
Inactive: Incomplete PCT application letter 2003-05-07
Inactive: Courtesy letter - Evidence 2003-02-18
Inactive: Cover page published 2003-02-13
Inactive: IPC assigned 2003-02-12
Inactive: IPC removed 2003-02-12
Inactive: IPC assigned 2003-02-12
Inactive: First IPC assigned 2003-02-12
Inactive: First IPC assigned 2003-02-11
Inactive: Notice - National entry - No RFE 2003-02-11
Application Received - PCT 2003-01-21
National Entry Requirements Determined Compliant 2002-12-13
Application Published (Open to Public Inspection) 2001-12-20

Abandonment History

Abandonment Date Reason Reinstatement Date
2006-06-14

Maintenance Fee

The last payment was received on 2005-05-30

Note : If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Fee History

Fee Type Anniversary Year Due Date Paid Date
Basic national fee - standard 2002-12-13
Registration of a document 2002-12-13
MF (application, 2nd anniv.) - standard 02 2003-06-16 2003-05-21
Registration of a document 2003-06-23
MF (application, 3rd anniv.) - standard 03 2004-06-14 2004-06-02
MF (application, 4th anniv.) - standard 04 2005-06-14 2005-05-30
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
DIVERSA CORPORATION
Past Owners on Record
JAY M. SHORT
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description  Date (yyyy-mm-dd)  Number of pages  Size of Image (KB)
Description 2002-12-13 267 15,313
Description 2002-12-13 170 9,639
Description 2002-12-13 293 15,336
Description 2002-12-13 269 15,317
Drawings 2002-12-13 28 890
Claims 2002-12-13 3 118
Abstract 2002-12-13 1 64
Cover Page 2003-02-13 1 40
Description 2003-06-12 185 10,222
Reminder of maintenance fee due 2003-02-17 1 106
Notice of National Entry 2003-02-11 1 189
Courtesy - Certificate of registration (related document(s)) 2003-07-28 1 106
Courtesy - Certificate of registration (related document(s)) 2003-07-28 1 106
Reminder - Request for Examination 2006-02-15 1 117
Courtesy - Abandonment Letter (Request for Examination) 2006-08-23 1 167
Courtesy - Abandonment Letter (Maintenance Fee) 2006-08-09 1 175
PCT 2002-12-13 7 291
Correspondence 2003-02-11 1 26
Correspondence 2003-05-07 1 34
Correspondence 2003-06-12 17 642
Correspondence 2005-11-25 1 31
Correspondence 2005-12-06 1 14
