Note: Descriptions are shown in the official language in which they were submitted.
CA 02235860 1998-04-24
WO 97/15690 PCT/US96/17159
METHOD AND APPARATUS FOR IDENTIFYING,
CLASSIFYING, OR QUANTIFYING DNA SEQUENCES
IN A SAMPLE WITHOUT SEQUENCING
This application is a continuation-in-part of
copending U.S. Patent Application serial number 08/547,214,
filed on October 24, 1995, which is hereby incorporated by
reference in its entirety.
This invention was made with United States
Government support under award number 70NANB5H1036 awarded by
the National Institute of Standards and Technology. The
United States Government has certain rights in the invention.
1. FIELD OF THE INVENTION
The field of this invention is DNA sequence
classification, identification or determination, and
quantification; more particularly it is the quantitative
classification, comparison of expression, or identification
of preferably all DNA sequences or genes in a sample without
performing any sequencing.
2. BACKGROUND
Over the past ten years, as biological and genomic
research have revolutionized our understanding of the
molecular basis of life, it has become increasingly clear
that the temporal and spatial expression of genes is
responsible for all life's processes, processes occurring in
both health and disease. Science has progressed from an
understanding of how single genetic defects cause the
traditionally recognized hereditary disorders, such as the
thalassemias, to a realization of the importance of the
interaction of multiple genetic defects along with
environmental factors in the etiology of the majority of more
complex disorders, such as cancer. In the case of cancer,
current scientific evidence demonstrates the key causative
roles of altered expression of and multiple defects in
several pivotal genes. Other complex diseases have similar
etiology. Thus the more complete and reliable a correlation
that can be established between gene expression and health or
disease states, the better diseases can be recognized,
diagnosed and treated.
This important correlation is established by the
quantitative determination and classification of DNA
expression in tissue samples, and a rapid and economical
method for doing so would be of considerable value. Genomic
DNA ("gDNA") sequences are those naturally occurring DNA
sequences constituting the genome of a cell. The state of
gene, or gDNA, expression at any time is represented by the
composition of total cellular messenger RNA ("mRNA"), which
is synthesized by the regulated transcription of gDNA.
Complementary DNA ("cDNA") sequences are synthesized by
reverse transcription from mRNA. cDNA from total cellular
mRNA also represents, albeit approximately, gDNA expression
in a cell at a given time. Consequently, detection of all
the DNA sequences in particular cDNA or gDNA samples is
desired, particularly if such detection is rapid,
economical, precise, and quantitative.
Heretofore, gene specific DNA analysis techniques
have not been directed to the determination or classification
of substantially all genes in a DNA sample representing total
cellular mRNA and have required some degree of sequencing.
Generally, existing cDNA, and also gDNA, analysis techniques
have been directed to the determination and analysis of one
or two known or unknown genetic sequences at one time. These
techniques have used probes synthesized to specifically
recognize by hybridization only one particular DNA sequence
or gene. (See, e.g., Watson et al., 1992, Recombinant DNA,
chap 7, W. H. Freeman, New York.) Further, adaptation of
these methods to the problem of recognizing all sequences in
a sample would be cumbersome and uneconomical.
One existing method for finding and sequencing
unknown genes starts from an arrayed cDNA library. From a
particular tissue or specimen, mRNA is isolated and cloned
into an appropriate vector, which is then plated in a manner
so that the progeny of individual vectors bearing the clone
of one cDNA sequence can be separately identified. A replica
of such a plate is then probed, often with a labeled DNA
oligomer selected to hybridize with the cDNA representing the
gene of interest. Thereby, those colonies bearing the cDNA
of interest are found and isolated, and the cDNA is harvested
and subjected to sequencing. Sequencing can then be done by the
Sanger dideoxy chain termination method (Sanger et al., 1977,
"DNA sequencing with chain terminating inhibitors", Proc.
Natl. Acad. Sci. USA 74(12):5463-5467) applied to inserts so
isolated.
The DNA oligomer probes for the unknown gene used
for colony selection are synthesized to hybridize,
preferably, only with the cDNA for the gene of interest. One
manner of achieving this specificity is to start with the
protein product of the gene of interest. If a partial
sequence of a 5- to 10-mer peptide fragment from an active
region of this protein can be determined, corresponding 15- to
30-mer degenerate oligonucleotides can be synthesized which
code for this peptide. This collection of degenerate
oligonucleotides will typically be sufficient to uniquely
identify the corresponding gene. Similarly, any information
leading to nucleotide subsequences 15 to 30 bases long can be
used to create a single gene probe.
Another existing method, which searches for a known
gene in a cDNA or gDNA prepared from a tissue sample, also
uses single gene or single sequence probes which are
complementary to unique subsequences of the already known
gene sequences. For example, the expression of a particular
oncogene in a sample can be determined by probing tissue
derived cDNA with a probe derived from a subsequence of the
oncogene's expressed sequence tag. Similarly the presence of
a rare or difficult to culture pathogen, such as the TB
bacillus or the HIV, can be determined by probing gDNA with a
hybridization probe specific to a gene of the pathogen. The
heterozygous presence of a mutant allele in a phenotypically
normal individual, or its homozygous presence in a fetus, can
be determined by probing with an allele specific probe
complementary only to the mutant allele (See, e.g., Guo et
al., 1994, Nucleic Acid Research, 22:5456-65).
All existing methods using single gene probes, of
which the preceding examples are typical, if applied to
determine all genes expressed in a given tissue sample, would
require many thousands to tens of thousands of individual
probes. It is estimated that a single human cell typically
expresses approximately 15,000 genes
simultaneously and that the most complex tissue, e.g., the
brain, can express up to half the human genome (Liang et al.,
1992, "Differential Display of Eukaryotic Messenger RNA by
Means of the Polymerase Chain Reaction", Science 257:967-971).
An application requiring that many probes
is clearly too cumbersome to be economical or even practical.
Another class of existing methods, known as
sequencing by hybridization ("SBH"), in contrast, uses
combinatorial probes which are not gene specific (Drmanac et
al., 1993, Science 260:1649-52; U.S. Patent No. 5,202,231,
Apr 13, 1993, to Drmanac et al). An exemplary implementation
of SBH to determine an unknown gene requires that a single
cDNA clone be probed with all DNA oligomers of a given
length, say, for example, all 6-mers. Such a set of all
oligomers of a given length synthesized without any selection
is called a combinatorial probe library. From knowledge of
all hybridization results for a combinatorial library, say
all the 4096 6-mer probe results, a partial DNA sequence for
the cDNA clone can be reconstructed by algorithmic
manipulations. Complete sequences are not determinable
because, at least, repeated subsequences cannot be fully
determined. SBH adapted to the classification of known genes
is called oligomer sequence signatures ("OSS") (Lennon et
al., 1991, Trends in Genetics 7(10):314-317). This technique
classifies a single clone based on the pattern of probe hits
against an entire combinatorial library, or a significant
sub-library. It requires that the tissue sample library be
arrayed into clones, each clone comprising only one pure
sequence from the library. It cannot be applied to mixtures.
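The size of a combinatorial probe library follows directly from
the oligomer length: there are 4^k oligomers of length k, hence
4096 6-mers. The following is a minimal Python sketch, not taken
from the specification, that enumerates such a library and
records which probes hit a clone. Hybridization is idealized
here as an exact substring match on one strand, and the clone
sequence is a made-up example.

```python
from itertools import product

def combinatorial_library(k):
    """Return every DNA oligomer of length k (4**k sequences in total)."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]

def hit_pattern(clone, probes):
    """Return the set of probes found in the clone.

    Assumption: a 'hit' is an exact substring match on the given strand;
    real hybridization also tolerates mismatches and involves both strands.
    """
    return {probe for probe in probes if probe in clone}

library = combinatorial_library(6)

# Hypothetical single-clone insert, used only for illustration.
clone = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
hits = hit_pattern(clone, library)

print(len(library))  # number of probes in the 6-mer library
print(len(hits))     # number of distinct 6-mers present in the clone
```

A clone of length n can contain at most n - 5 distinct 6-mers, so
most of the library gives no signal for any one clone, and a
6-mer that occurs twice is still reported only once, which is one
way to see why repeated subsequences cannot be fully determined.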
These exemplary existing methods are all directed
to finding one sequence in an array of clones each expressing
a single sequence from a tissue sample. They are not
directed to rapid, economical, quantitative, and precise
characterization of all the DNA sequences in a mixture of
sequences, such as a particular total cellular cDNA or gDNA
sample. Their adaptation to such a task would be
prohibitive. Determination by sequencing the DNA of a clone,
much less an entire sample of thousands of sequences, is not
rapid or inexpensive enough for economical and useful
diagnostics. Existing probe-based techniques of gene
determination or classification, whether the genes are known
or unknown, require many thousands of probes, each specific
to one possible gene to be observed, or at least thousands or
even tens of thousands of probes in a combinatorial library.
Further, all of these methods require the sample be arrayed
into clones each expressing a single gene of the sample.
In contrast to the prior exemplary existing gene
determination and classification techniques, another existing
technique, known as differential display, attempts to
fingerprint a mixture of expressed genes, as is found in a
pooled cDNA library. This fingerprint, however, seeks merely
to establish whether two samples are the same or different.
No attempt is made to determine the quantitative, or even
qualitative, expression of particular, determined genes
(Liang et al., 1995, Current Opinions in Immunology 7:274-280;
Liang et al., 1992, Science 257:967-71; Welsh et al.,
1992, Nucleic Acid Res. 20:4965-70; McClelland et al., 1993,
Exs 67:103-15; Lisitsyn, 1993, Science 259:946-50).
Differential display uses the polymerase chain reaction
("PCR") to amplify DNA subsequences of various lengths, which
are defined by being between the hybridization sites of
arbitrarily selected primers. Ideally, the pattern of
lengths observed is characteristic of the tissue from which
the library was prepared. Typically, one primer used in
differential display is oligo(dT) and the other is one or
more arbitrary oligonucleotides designed to hybridize within
a few hundred base pairs of the poly-dA tail of a cDNA in the
library. Thereby, on electrophoretic separation, the
amplified fragments of lengths up to a few hundred base pairs
should generate bands characteristic and distinctive of the
sample. Changes in tissue gene expression may be observed as
changes in one or more bands.
Although characteristic banding patterns develop,
no attempt is made to link these patterns to the expression
of particular genes. The second arbitrary primer cannot be
traced to a particular gene. First, the PCR process is less
than ideally specific. One to a few base pair ("bp")
mismatches ("bubbles") are permitted by the lower stringency
annealing step typically used and are tolerated well enough
so that a new chain can be initiated by the Taq polymerase
often used in PCR reactions. Second, the location of a
single subsequence or its absence is insufficient information
to distinguish all expressed genes. Third, length
information from the arbitrary primer to the poly-dA tail is
generally not found to be characteristic of a sequence due to
variations in the processing of the 3' untranslated regions
of genes, the variation in the poly-adenylation process and
variability in priming to the repetitive sequence at a
precise point. Thus, even the bands that are produced often
are smeared by the non-specific background sequences present.
Also, known PCR biases to high G+C content and short sequences
further limit the specificity of this method. Thus this
technique is generally limited to "fingerprinting" samples
for a similarity or dissimilarity determination and is
precluded from use in quantitative determination of the
differential expression of identifiable genes.
Existing methods for gene or DNA sequence
classification or determination are in need of improvement in
their ability to perform rapid and economical as well as
quantitative and specific determination of the components of
a cDNA mixture prepared from a tissue sample. The preceding
background review identifies the deficiencies of several
exemplary existing methods.
3. SUMMARY OF THE INVENTION
It is an object of this invention to provide
methods for rapid, economical, quantitative, and precise
determination or classification of DNA sequences, in
particular genomic or complementary DNA sequences, in either
arrays of single sequence clones or mixtures of sequences
such as can be derived from tissue samples, without actually
sequencing the DNA. Thereby, the deficiencies in the
background arts just identified are solved. This object is
realized by generating a plurality of distinctive and
detectable signals from the DNA sequences in the sample being
analyzed. Preferably, all the signals taken together have
sufficient discrimination and resolution so that each
particular DNA sequence in a sample may be individually
classified by the particular signals it generates, and with
reference to a database of DNA sequences possible in the
sample, individually determined. The intensity of the
signals indicative of a particular DNA sequence depends
quantitatively on the amount of that DNA present.
Alternatively, the signals together can classify a
predominant fraction of the DNA sequences into a plurality of
sets of approximately no more than two to four individual
sequences.
It is a further object that the numerous signals be
generated from measurements of the results of as few
recognition reactions as possible, preferably no more than
approximately 5-400 reactions, and most preferably no more
than approximately 20-50 reactions. Rapid and economical
determinations would not be achieved if each DNA sequence in
a sample containing a complex mixture required a separate
reaction with a unique probe. Preferably, each recognition
reaction generates a large number of or a distinctive pattern
of distinguishable signals, which are quantitatively
proportional to the amount of the particular DNA sequences
present. Further, the signals are preferably detected and
measured with a minimum number of observations, which are
preferably capable of simultaneous performance.
The signals are preferably optical, generated by
fluorochrome labels and detected by automated optical
detection technologies. Using these methods, multiple
individually labeled moieties can be discriminated even
though they are in the same filter spot or gel band. This
permits multiplexing reactions and parallelizing signal
detection. Alternatively, the invention is easily adaptable
to other labeling systems, for example, silver staining of
gels. In particular, any single molecule detection system,
whether optical or by some other technology such as scanning
or tunneling microscopy, would be highly advantageous for use
according to this invention as it would greatly improve
quantitative characteristics.
According to this invention, signals are generated
by detecting the presence (hereinafter called "hits") or
absence of short DNA subsequences (hereinafter called
"target" subsequences) within a nucleic acid sequence of the
sample to be analyzed. The presence or absence of a
subsequence is detected by use of recognition means, or
probes, for the subsequence. The subsequences are recognized
by recognition means of several sorts, including but not
limited to restriction endonucleases ("REs"), DNA oligomers,
and PNA oligomers. REs recognize their specific subsequences
by cleavage thereof; DNA and PNA oligomers recognize their
specific subsequences by hybridization methods. The
preferred embodiment detects not only the presence of pairs
of hits in a sample sequence but also includes a
representation of the length in base pairs between adjacent
hits. This length representation can be corrected to true
physical length in base pairs upon removing experimental
biases and errors of the length separation and detection
means. An alternative embodiment detects only the pattern of
hits in an array of clones, each containing a single sequence
("single sequence clones").
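The notion of hits and of the length between adjacent hits can be
illustrated with a short Python sketch. It is an illustration
under simplifying assumptions, not the procedure of the
specification: recognition is modeled as an exact match to the
recognition site, and length is measured between site start
positions, ignoring the cut position within each site and any
adapter or primer length. The recognition sites used for EcoRI,
BamHI and HindIII are the standard ones; the example sequence and
the function names are hypothetical.

```python
# Standard recognition sites of three common restriction endonucleases.
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_hits(seq, sites):
    """Return a position-sorted list of (position, enzyme) hits in seq."""
    hits = []
    for name, site in sites.items():
        start = seq.find(site)
        while start != -1:
            hits.append((start, name))
            start = seq.find(site, start + 1)
    return sorted(hits)

def adjacent_signals(seq, sites):
    """Return (enzyme_1, enzyme_2, length) for each pair of adjacent hits.

    Assumption: length is the distance between site start positions.
    """
    hits = find_hits(seq, sites)
    return [(first_name, second_name, second_pos - first_pos)
            for (first_pos, first_name), (second_pos, second_name)
            in zip(hits, hits[1:])]

# Hypothetical sequence built so that the three sites occur once each.
example = "CC" + "GAATTC" + "A" * 20 + "GGATCC" + "T" * 10 + "AAGCTT" + "GG"
print(adjacent_signals(example, SITES))
# [('EcoRI', 'BamHI', 26), ('BamHI', 'HindIII', 16)]
```

Each tuple pairs the identities of two adjacent target
subsequences with the distance between them, which is the kind of
information a length-resolved signal is described as carrying.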
The generated signals are then analyzed together
with DNA sequence information stored in sequence databases in
computer implemented experimental analysis methods of this
invention to identify individual genes and their quantitative
presence in the sample.
The target subsequences are chosen by further
computer implemented experimental design methods of this
invention such that their presence or absence and their
relative distances when present yield a maximum amount of
information for classifying or determining the DNA sequences
to be analyzed. Thereby it is possible to have orders of
magnitude fewer probes than there are DNA sequences to be
analyzed, and it is further possible to have considerably
fewer probes than would be present in combinatorial libraries
of the same length as the probes used in this invention. For
each embodiment, target subsequences have a preferred
probability of occurrence in a sequence, typically between 5%
and 50%. In all embodiments, it is preferred that the
presence of one probe in a DNA sequence to be analyzed is
independent of the presence of any other probe.
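As a rough illustration of the occurrence criterion, the
following Python sketch estimates, for each candidate target
subsequence, the fraction of database sequences that contain it,
and keeps the candidates falling in the 5% to 50% range stated
above. This is only a filter under stated assumptions, not the
design method of the invention: the function names and the toy
database are hypothetical, occurrence is an exact substring
match, and the independence between probes that the text prefers
is not checked.

```python
def occurrence_fraction(target, database):
    """Fraction of database sequences containing the target subsequence."""
    return sum(target in seq for seq in database) / len(database)

def select_targets(candidates, database, low=0.05, high=0.50):
    """Keep candidates whose occurrence fraction lies within [low, high].

    Assumption: the 5%-50% window is applied as a hard filter; a fuller
    design method would also weigh independence between targets.
    """
    return [t for t in candidates
            if low <= occurrence_fraction(t, database) <= high]

# Toy database of four made-up sequences, for illustration only.
database = ["GAATTCAA", "CCGGATCC", "TTGAATTC", "AAGCTTGG"]
candidates = ["GAATTC", "GGATCC", "A", "TTTTTT"]

for t in candidates:
    print(t, occurrence_fraction(t, database))
print(select_targets(candidates, database))
# ['GAATTC', 'GGATCC']
```

A target present in every sequence ("A" here) or in none
("TTTTTT") carries no information for telling sequences apart,
which is the intuition behind preferring intermediate occurrence
probabilities.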
Preferably, target subsequences are chosen based on
information in relevant DNA sequence databases that
characterize the sample. A minimum number of target
subsequences may be chosen to determine the expression of all
genes in a tissue sample ("tissue mode"). Alternatively, a
smaller number of target subsequences may be chosen to
quantitatively classify or determine only one or a few
sequences of genes of interest, for example oncogenes, tumor
suppressor genes, growth factors, cell cycle genes,
cytoskeletal genes, etc. ("query mode").
A preferred embodiment of the invention, named
quantitative expression analysis ("QEA"), produces signals
comprising target subsequence presence and a representation
of the length in base pairs along a gene between adjacent
target subsequences by measuring the results of recognition
reactions on cDNA (or gDNA) mixtures. Of great importance,
this method does not require the cDNA be inserted into a
vector to create individual clones in a library. Creation of
these libraries is time consuming, costly, and introduces
bias into the process, as it requires the cDNA in the vector
to be transformed into bacteria, the bacteria to be arrayed as
clonal colonies, and finally the individual transformed
colonies to be grown.
Three exemplary experimental methods are described
herein for performing QEA: a preferred method utilizing a
novel RE/ligase/amplification procedure; a PCR based method;
and a method utilizing a removal means, preferably biotin,
for removal of unwanted DNA fragments. The preferred method
generates precise, reproducible, noise free signatures for
determining individual gene expression from DNA in mixtures
or libraries and is uniquely adaptable to automation, since
it does not require intermediate extractions or buffer
exchanges. A computer implemented gene calling step uses the
hit and length information measured in conjunction with a
database of DNA sequences to determine which genes are
present in the sample and the relative levels of expression.
Signal intensities are used to determine relative amounts of
sequences in the sample. Computer implemented design methods
optimize the choice of the target subsequences.
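A gene calling step of this kind can be pictured as a database
lookup keyed on the pair of target subsequences and the length
between them. The Python sketch below is a simplified
illustration, not the computer implemented method of the
invention. It assumes exact site matching, lengths measured
between site start positions, an unordered enzyme pair, and a
hypothetical length tolerance of one base pair; the gene names
and sequences are made up.

```python
from collections import defaultdict

SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}  # standard recognition sites

def predicted_signals(seq, sites):
    """Predict (enzyme_1, enzyme_2, length) for adjacent sites in seq."""
    hits = sorted((pos, name) for name, site in sites.items()
                  for pos in range(len(seq) - len(site) + 1)
                  if seq.startswith(site, pos))
    return [(a[1], b[1], b[0] - a[0]) for a, b in zip(hits, hits[1:])]

def build_index(database, sites):
    """Map (unordered enzyme pair, length) to the genes predicted to give it."""
    index = defaultdict(set)
    for gene, seq in database.items():
        for end1, end2, length in predicted_signals(seq, sites):
            index[(frozenset((end1, end2)), length)].add(gene)
    return index

def call_genes(observed, index, tolerance=1):
    """Return candidate genes for each observed signal.

    Assumption: a measured length may differ from the predicted length by
    up to `tolerance` base pairs. An empty list means no database match.
    """
    calls = {}
    for end1, end2, length in observed:
        genes = set()
        for candidate in range(length - tolerance, length + tolerance + 1):
            genes |= index.get((frozenset((end1, end2)), candidate), set())
        calls[(end1, end2, length)] = sorted(genes)
    return calls

# Hypothetical two-gene database.
database = {
    "geneA": "GAATTC" + "A" * 30 + "GGATCC",
    "geneB": "GAATTC" + "C" * 50 + "GGATCC",
}
index = build_index(database, SITES)
observed = [("EcoRI", "BamHI", 37), ("BamHI", "EcoRI", 56), ("EcoRI", "BamHI", 100)]
calls = call_genes(observed, index)
print(calls)
```

In this toy run the first signal is attributed to geneA, the
second to geneB, and the third matches nothing in the database,
which is the situation the text associates with a sequence not
yet in the database.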
A second specific embodiment of the invention,
termed colony calling ("CC"), gathers only target subsequence
presence information for all target subsequences for arrayed,
individual single sequence clones in a library, with cDNA
libraries being preferred. The target subsequences are
carefully chosen according to computer implemented design
methods of this invention to have a maximum information
content and to be minimum in number. Preferably from 10-20
subsequences are sufficient to characterize the expressed
cDNA in a tissue. In order to increase the specificity and
reliability of hybridization to the typically short DNA
subsequences, preferable recognition means are PNAs.
Degenerate sets of longer DNA oligomers having a common,
short, shared, target sequence can also be used as a
recognition means. A computer implemented gene calling step
uses the pattern of hits in conjunction with a database of
DNA sequences to determine which genes are present in the
sample and the relative levels of expression.
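In colony calling, each clone yields one presence/absence result
per target subsequence, so n targets can distinguish at most 2^n
patterns: 1,024 for 10 targets and 1,048,576 for 20. The Python
sketch below illustrates the lookup of such a pattern in a table
built from a sequence database. It is a hypothetical illustration
that assumes error-free, exact-match hybridization; the gene names,
sequences and function names are made up.

```python
def hit_code(seq, targets):
    """Presence (1) or absence (0) of each target subsequence in seq."""
    return tuple(int(t in seq) for t in targets)

def build_code_table(database, targets):
    """Map each hit pattern to the database genes that would produce it."""
    table = {}
    for gene, seq in database.items():
        table.setdefault(hit_code(seq, targets), []).append(gene)
    return table

# Hypothetical targets and database, for illustration only.
targets = ["GAATTC", "GGATCC", "AAGCTT"]
database = {
    "g1": "ACGTGGATCCAA",
    "g2": "GAATTCGGATCC",
    "g3": "TTTTAAGCTTCC",
}
table = build_code_table(database, targets)

# A colony whose probes report hits for the first two targets only.
observed_code = (1, 1, 0)
print(table.get(observed_code, []))  # ['g2']

# Upper bound on distinguishable patterns for 10 and 20 targets.
print(2 ** 10, 2 ** 20)  # 1024 1048576
```

Several genes may share one pattern, in which case the table
returns a small set of candidates rather than a single gene.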
The embodiments of this invention preferably
generate measurements that are precise, reproducible, and
free of noise. Measurement noise in QEA is typically
created by generation or amplification of unwanted DNA
fragments, and special steps are preferably taken to avoid
any such unwanted fragments. Measurement noise in colony
calling is typically created by mis-hybridization of probes,
or recognition means, to colonies. High stringency reaction
conditions and DNA mimics with increased hybridization
specificity may be used to minimize this noise. DNA mimics
are polymers composed of subunits capable of specific,
Watson-Crick-like hybridization with DNA. Also useful to
minimize noise in colony calling are improved hybridization
detection methods. Instead of the conventional detection
methods based on probe labeling with fluorochromes, new
methods are based on light scattering by small 100-200 µm
particles that are aggregated upon probe hybridization
(Stimpson et al., 1995, "Real-time detection of DNA
hybridization and melting on oligonucleotide arrays by using
optical wave guides", Proc. Natl. Acad. Sci. USA 92:6379-6383).
In this method, the hybridization surface forms one
surface of a light pipe or optical wave guide, and the
scattering induced by these aggregated particles causes light
to leak from the light pipe. In this manner hybridization is
revealed as an illuminated spot of leaking light on a dark
background. This latter method makes hybridization detection
more rapid by eliminating the need for a washing step between
the hybridization and detection steps. Further, by using
variously sized and shaped particles with different light
scattering properties, multiple probe hybridizations can be
detected from one colony.
Further, the embodiments of the invention can be
adapted to automation by eliminating non-automatable steps,
such as extractions or buffer exchanges. The embodiments of
the invention facilitate efficient analysis by permitting
multiple recognition means to be tested in one reaction and
by utilizing multiple, distinguishable labeling of the
recognition means, so that signals may be simultaneously
detected and measured. Preferably, for QEA embodiments,
this labeling is by multiple fluorochromes. For the CC
embodiments, detection is preferably done by the light
scattering methods with variously sized and shaped particles.
An increase in sensitivity as well as an increase
in the number of resolvable fluorescent labels can be
achieved by the use of fluorescent, energy transfer,
dye-labeled primers. Other detection methods, preferable when
the genes being identified will be physically isolated from
the gel for later sequencing or use as experimental probes,
include the use of silver staining gels or of radioactive
labeling. Since these methods do not allow for multiple
samples to be run in a single lane, they are less preferable
when high throughput is needed.
Because this invention achieves rapid and
economical determination of quantitative gene expression in
tissue or other samples, it has considerable medical and
research utility. In medicine, as more and more diseases are
recognized to have important genetic components to their
etiology and development, it is becoming increasingly useful
to be able to assay the genetic makeup and expression of a
tissue sample. For example, the presence and expression of
certain genes or their particular alleles are prognostic or
risk factors for disease (including disorders). Several
examples of such diseases are found among the
neurodegenerative diseases, such as Huntington's disease and
ataxia-telangiectasia. Several cancers, such as
neuroblastoma, can now be linked to specific genetic defects.
Finally, gene expression can also determine the presence and
classification of those foreign pathogens that are difficult
or impossible to culture in vitro but which nevertheless
express their own unique genes.
Disease progression is reflected in changes in
genetic expression of an affected tissue. For example,
expression of particular tumor promoter genes and lack of
expression of particular tumor suppressor genes is now known
to correlate with the progression of certain tumors from
normal tissue, to hyperplasia, to cancer in situ, and to
metastatic cancer. Return of a cell population to a normal
pattern of gene expression, such as by using anti-sense
technology, can correlate with tumor regression. Therefore,
knowledge of gene expression in a cancerous tissue can assist
in staging and classifying this disease.
Expression information can also be used to choose
and guide therapy. Accurate disease classification and
staging or grading using gene expression information can
assist in choosing initial therapies that are increasingly
more precisely tailored to the precise disease process
occurring in the particular patient. Gene expression
information can then track disease progression or regression,
and such information can assist in monitoring the success or
changing the course of an initial therapy. A therapy is
favored that results in a regression towards normal of an
abnormal pattern of gene expression in an individual, while
therapy which has little effect on gene expression or its
progression may need modification. Such monitoring is now
useful for cancers and will become useful for an increasing
number of other diseases, such as diabetes and obesity.
Finally, in the case of direct gene therapy, expression
analysis directly monitors the success of treatment.
In biological research, rapid and economical assay
for gene expression in tissue or other samples has numerous
applications. Such applications include, but are not limited
to, for example, in pathology examining tissue specific
genetic response to disease, in embryology determining
developmental changes in gene expression, in pharmacology
assessing direct and indirect effects of drugs on gene
expression. In these applications, this invention can be
applied, e.g., to in vitro cell populations or cell lines, to
in vivo animal models of disease or other processes, to human
samples, to purified cell populations perhaps drawn from
actual wild-type occurrences, and to tissue samples
containing mixed cell populations. The cell or tissue
sources can advantageously be a plant, a single celled
animal, a multicellular animal, a bacterium, a virus, a
fungus, or a yeast, etc. The animal can advantageously be
laboratory animals used in research, such as mice engineered
or bred to have certain genomes or disease conditions or
tendencies. The in vitro cell populations or cell lines can
be exposed to various exogenous factors to determine the
effect of such factors on gene expression. Further, since an
unknown signal pattern is indicative of an as yet unknown
gene, this invention has important use for the discovery of
new genes. In medical research, by way of further example,
use of the methods of this invention allows correlating gene
expression with the presence and progress of a disease and
thereby provides new methods of diagnosis and new avenues of
therapy which seek to directly alter gene expression.
This invention includes various embodiments and
aspects, several of which are described below.
In a first embodiment, the invention provides a
method for identifying, classifying, or quantifying one or
more nucleic acids in a sample comprising a plurality of
nucleic acids having different nucleotide sequences, said
method comprising probing said sample with one or more
recognition means, each recognition means recognizing a
different target nucleotide subsequence or a different set of
target nucleotide subsequences; generating one or more
signals from said sample probed by said recognition means,
each generated signal arising from a nucleic acid in said
sample and comprising a representation of (i) the length
between occurrences of target subsequences in said nucleic
acid and (ii) the identities of said target subsequences in
said nucleic acid or the identities of said sets of target
subsequences among which is included the target subsequences
in said nucleic acid; and searching a nucleotide sequence
database to determine sequences that match or the absence of
any sequences that match said one or more generated signals,
said database comprising a plurality of known nucleotide
sequences of nucleic acids that may be present in the sample,
a sequence from said database matching a generated signal
when the sequence from said database has both (i) the same
length between occurrences of target subsequences as is
represented by the generated signal and (ii) the same target
subsequences as is represented by the generated signal, or
target subsequences that are members of the same sets of
target subsequences represented by the generated signal,
whereby said one or more nucleic acids in said sample are
identified, classified, or quantified.
This invention further provides in the first
embodiment additional methods wherein each recognition means
recognizes one target subsequence, and wherein a sequence
from said database matches a generated signal when the
sequence from said database has both the same length between
occurrences of target subsequences as is represented by the
generated signal and the same target subsequences as
represented by the generated signal, or optionally wherein
each recognition means recognizes a set of target
subsequences, and wherein a sequence from said database
matches a generated signal when the sequence from said
database has both the same length between occurrences of
target subsequences as is represented by the generated
signal, and target subsequences that are members of the sets
of target subsequences represented by the generated signal.
This invention further provides in the first
embodiment additional methods further comprising dividing
said sample of nucleic acids into a plurality of portions and
performing the methods of this object individually on a
plurality of said portions, wherein a different one or more
recognition means are used with each portion.
This invention further provides in the first
~ embodiment additional methods wherein the quantitative
abundance of a nucleic acid comprising a particular
- 15 -
CA 0223~860 1998-04-24
W O 97/15690 PCT~US96/17159
nucleotide sequence in the sample is determined from the
quantitative level of the one or more signals generated by
said nucleic acid that are determined to match said
particular nucleotide sequence.
This invention further provides in the first
embodiment additional methods wherein said plurality of
nucleic acids are DNA, and optionally wherein the DNA is
cDNA, and optionally wherein the cDNA is prepared from a
plant, a single celled animal, a multicellular animal, a
bacterium, a virus, a fungus, or a yeast, and optionally
wherein the cDNA is of total cellular RNA or total cellular
poly(A) RNA.
This invention further provides in the first
embodiment additional methods wherein said database comprises
substantially all the known expressed sequences of said
plant, single celled animal, multicellular animal, bacterium,
or yeast.
This invention further provides in the first
embodiment additional methods wherein the recognition means
are one or more restriction endonucleases whose recognition
sites are said target subsequences, and wherein the step of
probing comprises digesting said sample with said one or more
restriction endonucleases into fragments and ligating double
stranded adapter DNA molecules to said fragments to produce
ligated fragments, each said adapter DNA molecule comprising
(i) a shorter strand having no 5' terminal phosphates and
consisting of a first and second portion, said first portion
at the 5' end of the shorter strand being complementary to
the overhang produced by one of said restriction
endonucleases and (ii) a longer strand having a 3' end
subsequence complementary to said second portion of the
shorter strand; and wherein the step of generating further
comprises melting the shorter strand from the ligated
fragments, contacting the sample with a DNA polymerase,
extending the ligated fragments by synthesis with the DNA
polymerase to produce blunt-ended double stranded DNA
fragments, and amplifying the blunt-ended fragments by a
method comprising contacting said blunt-ended fragments with
a DNA polymerase and primer oligodeoxynucleotides, said
primer oligodeoxynucleotides comprising the longer adapter
strand, and said contacting being at a temperature not
greater than the melting temperature of the primer
oligodeoxynucleotide from a strand of the blunt-ended
fragments complementary to the primer oligodeoxynucleotide
and not less than the melting temperature of the shorter
strand of the adapter nucleic acid from the blunt-ended
fragments.
This invention further provides in the first
embodiment additional methods wherein the recognition means
are one or more restriction endonucleases whose recognition
sites are said target subsequences, and wherein the step of
probing further comprises digesting the sample with said one
or more restriction endonucleases.
This invention further provides in the first
embodiment additional methods further comprising identifying
a fragment of a nucleic acid in the sample which generates
said one or more signals; and recovering said fragment, and
optionally wherein the signals generated by said recovered
fragment do not match a sequence in said nucleotide sequence
database, and optionally further comprising using at least a
hybridizable portion of said fragment as a hybridization
probe to bind to a nucleic acid that can generate said
fragment upon digestion by said one or more restriction
endonucleases.
This invention further provides in the first
embodiment additional methods wherein the step of generating
further comprises after said digesting removing from the
sample both nucleic acids which have not been digested and
nucleic acid fragments resulting from digestion at only a
single terminus of the fragments, and optionally wherein
prior to digesting, the nucleic acids in the sample are each
bound at one terminus to a biotin molecule or to a hapten
molecule, and said removing is carried out by a method which
comprises contacting the nucleic acids in the sample with
streptavidin or avidin or with an anti-hapten antibody,
respectively, affixed to a solid support.
This invention further provides in the first
embodiment additional methods wherein said digesting with
said one or more restriction endonucleases leaves single-
stranded nucleotide overhangs on the digested ends.
This invention further provides in the first
embodiment additional methods wherein the step of probing
further comprises hybridizing double-stranded adapter nucleic
acids with the digested sample fragments, each said adapter
nucleic acid having an end complementary to said overhang
generated by a particular one of the one or more restriction
endonucleases, and ligating with a ligase a strand of said
adapter nucleic acids to the 5' end of a strand of the
digested sample fragments to form ligated nucleic acid
fragments.
This invention further provides in the first
embodiment additional methods wherein said digesting with
said one or more restriction endonucleases and said ligating
are carried out in the same reaction medium, and optionally
wherein said digesting and said ligating comprise incubating
said reaction medium at a first temperature and then at a
second temperature, in which said one or more restriction
endonucleases are more active at the first temperature than
the second temperature and said ligase is more active at the
second temperature than the first temperature, or wherein
said incubating at said first temperature and said incubating
at said second temperature are performed repetitively.
This invention further provides in the first
embodiment additional methods wherein the step of probing
further comprises prior to said digesting removing terminal
phosphates from DNA in said sample by incubation with an
alkaline phosphatase, and optionally wherein said alkaline
phosphatase is heat labile and is heat inactivated prior to
said digesting.
This invention further provides in the first
embodiment additional methods wherein said generating step
comprises amplifying the ligated nucleic acid fragments, and
optionally wherein said amplifying is carried out by use of a
nucleic acid polymerase and primer nucleic acid strands, said
primer nucleic acid strands being capable of priming nucleic
acid synthesis by said polymerase, and optionally wherein the
primer nucleic acid strands have a G+C content of between 40%
and 60%.
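As a minimal sketch only, the optional G+C content limit on the primer strands could be checked as follows. The function names gc_fraction and gc_in_range are hypothetical, and treating the 40% and 60% bounds as inclusive is an assumption.

```python
def gc_fraction(oligo: str) -> float:
    """Return the fraction of G and C bases in an oligonucleotide."""
    oligo = oligo.upper()
    if not oligo:
        raise ValueError("empty oligonucleotide")
    return (oligo.count("G") + oligo.count("C")) / len(oligo)


def gc_in_range(oligo: str, low: float = 0.40, high: float = 0.60) -> bool:
    """True when the G+C content lies between low and high (inclusive, by assumption)."""
    return low <= gc_fraction(oligo) <= high
```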
This invention further provides in the first
embodiment additional methods wherein each said adapter
nucleic acid has a shorter strand and a longer strand, the
longer strand being ligated to the digested sample fragments,
and said generating step comprises prior to said amplifying
step the melting of the shorter strand from the ligated
fragments, contacting the ligated fragments with a DNA
polymerase, extending the ligated fragments by synthesis with
the DNA polymerase to produce blunt-ended double stranded DNA
fragments, and wherein the primer nucleic acid strands
comprise a hybridizable portion of the sequence of said longer
strands, or optionally comprise the sequence of said longer
strands, each different primer nucleic acid strand priming
amplification only of blunt ended double stranded DNA
fragments that are produced after digestion by a particular
restriction endonuclease.
This invention further provides in the first
embodiment additional methods wherein each primer nucleic
acid strand is specific for a particular restriction
endonuclease, and further comprises at the 3' end of and
contiguous with the longer strand sequence the portion of the
restriction endonuclease recognition site remaining on a
nucleic acid fragment terminus after digestion by the
restriction endonuclease, or optionally wherein each said
primer specific for a particular restriction endonuclease
further comprises at its 3' end one or more nucleotides 3' to
and contiguous with the remaining portion of the restriction
endonuclease recognition site, whereby the ligated nucleic
acid fragment amplified is that comprising said remaining
portion of said restriction endonuclease recognition site
contiguous to said one or more additional nucleotides, and
optionally such that said primers comprising a particular
said one or more additional nucleotides can be
distinguishably detected from said primers comprising a
different said one or more additional nucleotides.
This invention further provides in the first
embodiment additional methods wherein during said amplifying
step the primer nucleic acid strands are annealed to the
ligated nucleic acid fragments at a temperature that is less
than the melting temperature of the primer nucleic acid
strands from strands complementary to the primer nucleic acid
strands but greater than the melting temperature of the
shorter adapter strands from the blunt-ended fragments.
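The annealing condition recited above amounts to a temperature window bounded below by the melting temperature of the shorter adapter strand and above by that of the primer. A minimal sketch follows; the function name and the example temperatures in the comment are hypothetical, and the melting temperatures are assumed to be supplied by the caller rather than computed.

```python
def valid_annealing_temperature(anneal_c: float,
                                primer_tm_c: float,
                                short_strand_tm_c: float) -> bool:
    """True when the annealing temperature lies strictly between the two melting
    temperatures, so that the primer anneals while the shorter strand stays melted."""
    return short_strand_tm_c < anneal_c < primer_tm_c


# Hypothetical values: primer Tm 72 C, shorter strand Tm 50 C, annealing at 65 C.
example_ok = valid_annealing_temperature(65.0, 72.0, 50.0)
```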
This invention further provides in the first
embodiment additional methods wherein the recognition means
are oligomers of nucleotides, nucleotide-mimics, or a
combination of nucleotides and nucleotide-mimics, which are
specifically hybridizable with the target subsequences, and
optionally further provides additional methods wherein the
step of generating comprises amplifying with a nucleic acid
polymerase and with primers comprising said oligomers,
whereby fragments of nucleic acids in the sample between
hybridized oligomers are amplified.
This invention further provides in the first
embodiment additional methods wherein said signals further
comprise a representation of whether an additional target
subsequence is present on said nucleic acid in the sample
between said occurrences of target subsequences, and
optionally wherein said additional target subsequence is
recognized by a method comprising contacting nucleic acids in
the sample with oligomers of nucleotides, nucleotide-mimics,
or mixed nucleotides and nucleotide-mimics, which are
hybridizable with said additional target subsequence.
This invention further provides in the first
embodiment additional methods wherein the step of generating
comprises suppressing said signals when an additional target
subsequence is present on said nucleic acid in the sample
between said occurrences of target subsequences, and
optionally wherein, when the step of generating comprises
amplifying nucleic acids in the sample, said additional
target subsequence is recognized by a method comprising
contacting nucleic acids in the sample with (a) oligomers of
nucleotides, nucleotide-mimics, or mixed nucleotides and
nucleotide-mimics, which hybridize with said additional
target subsequence and disrupt the amplifying step; or (b)
restriction endonucleases which have said additional target
subsequence as a recognition site and digest the nucleic
acids in the sample at the recognition site.
This invention further provides in the first
embodiment additional methods wherein the step of generating
further comprises separating nucleic acid fragments by
length, and optionally wherein the step of generating further
comprises detecting said separated nucleic acid fragments,
and optionally wherein said detecting is carried out by a
method comprising staining said fragments with silver,
labeling said fragments with a DNA intercalating dye, or
detecting light emission from a fluorochrome label on said
fragments.
This invention further provides in the first
embodiment additional methods wherein said representation of
the length between occurrences of target subsequences is the
length of fragments determined by said separating and
detecting steps.
This invention further provides in the first
embodiment additional methods wherein said separating is
carried out by use of liquid chromatography, mass
spectrometry, or electrophoresis, and optionally wherein said
electrophoresis is carried out in a slab gel or capillary
configuration using a denaturing or non-denaturing medium.
This invention further provides in the first
embodiment additional methods wherein a predetermined one or
more nucleotide sequences in said database are of interest,
and wherein the target subsequences are such that said
sequences of interest generate at least one signal that is
not generated by any other sequence likely to be present in
the sample, and optionally wherein the nucleotide sequences
of interest are a majority of sequences in said database.
This invention further provides in the first
embodiment additional methods wherein the target subsequences
have a probability of occurrence in the nucleotide sequences
in said database of from approximately 0.01 to approximately
0.30.
This invention further provides in the first
embodiment additional methods wherein the target subsequences
are such that the majority of sequences in said database
contain on average a sufficient number of occurrences of
target subsequences in order to on average generate a signal
that is not generated by any other nucleotide sequence in
said database, and optionally wherein the number of pairs of
target subsequences present on average in the majority of
sequences in said database is no less than 3, and wherein the
average number of signals generated from the sequences in
said database is such that the average difference between
lengths represented by the generated signals is greater than
or equal to 1 base pair.
This invention further provides in the first
embodiment additional methods wherein the target subsequences
have a probability of occurrence, p, approximately given by
the solution of
$$R(R + 1)\,p^{2} = A$$
and
$$\frac{L}{N p^{2}} = B$$
wherein N = the number of different nucleotide sequences in
said database; L = the average length of said different
nucleotide sequences in said database; R = the number of
recognition means; A = the number of pairs of target
subsequences present on average in said different nucleotide
sequences in said database; and B = the average difference
between lengths represented by the signals generated from the
nucleic acids in the sample, and optionally wherein A is
greater than or equal to 3 and wherein B is greater than or
equal to 1.
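Reading the two relations above as R(R + 1)p^2 = A and L/(N p^2) = B, which is an interpretation of the printed formulas, the second relation gives p = sqrt(L/(N B)) and the first then gives R as the positive root of R^2 + R - A/p^2 = 0. The following sketch works this through numerically; the function name and the example values of N and L are hypothetical.

```python
import math


def solve_design(n_sequences: float, mean_length: float,
                 pairs_per_sequence: float = 3.0,
                 mean_length_gap: float = 1.0):
    """Solve L/(N p^2) = B for p, then R(R + 1) p^2 = A for R (positive root)."""
    p = math.sqrt(mean_length / (n_sequences * mean_length_gap))
    ratio = pairs_per_sequence / p ** 2          # equals R(R + 1)
    r = (-1.0 + math.sqrt(1.0 + 4.0 * ratio)) / 2.0
    return p, r


# Hypothetical database: N = 100,000 sequences of average length L = 2,000.
p, r = solve_design(100_000, 2_000.0, pairs_per_sequence=3.0, mean_length_gap=1.0)
```

With these assumed inputs p is about 0.14, and R comes out non-integral, so in practice it would be rounded to a whole number of recognition means.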
This invention further provides in the first
embodiment additional methods wherein the target subsequences
are selected according to the further steps comprising
determining a pattern of signals that can be generated and
the sequences capable of generating each such signal by
simulating the steps of probing and generating applied to
each sequence in said database of nucleotide sequences;
ascertaining the value of said determined pattern according
to an information measure; and choosing the target
subsequences in order to generate a new pattern that
optimizes the information measure, and optionally wherein
said choosing step selects target subsequences which comprise
the recognition sites of the one or more restriction
endonucleases, and optionally wherein said choosing step
selects target subsequences which comprise the recognition
sites of the one or more restriction endonucleases contiguous
with one or more additional nucleotides.
This invention further provides in the first
embodiment additional methods wherein a predetermined one or
more of the nucleotide sequences present in said database of
nucleotide sequences are of interest, and the information
measure optimized is the number of such said sequences of
interest which generate at least one signal that is not
generated by any other nucleotide sequence present in said
database, and optionally wherein said nucleotide sequences of
interest are a majority of the nucleotide sequences present
in said database.
This invention further provides in the first
embodiment additional methods wherein said choosing step is
by exhaustive search of all combinations of target
subsequences of length less than approximately 10, or wherein
said step of choosing target subsequences is by a method
comprising simulated annealing.
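For illustration, the exhaustive variant of the choosing step can be sketched as below, using as the information measure the number of database sequences that generate at least one signal generated by no other sequence. The sketch assumes a caller-supplied list of candidate target subsequences and a fixed number of targets to choose; all function names and the example data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def find_all(seq, sub):
    """Start indices of every (possibly overlapping) occurrence of sub in seq."""
    return [i for i in range(len(seq) - len(sub) + 1) if seq.startswith(sub, i)]


def predicted_signals(seq, targets):
    """Simulated signals: (end target, end target, length) for adjacent occurrences."""
    hits = sorted((i, t) for t in targets for i in find_all(seq, t))
    return {(t1, t2, j - i) for (i, t1), (j, t2) in zip(hits, hits[1:])}


def uniquely_identified(database, targets):
    """Information measure: sequences with at least one signal no other sequence has."""
    per_seq = {name: predicted_signals(seq, targets) for name, seq in database.items()}
    counts = Counter(sig for sigs in per_seq.values() for sig in sigs)
    return sum(any(counts[s] == 1 for s in sigs) for sigs in per_seq.values())


def exhaustive_choice(database, candidates, n_targets):
    """Try every combination of n_targets candidates and keep the best scoring one."""
    return max(combinations(candidates, n_targets),
               key=lambda ts: uniquely_identified(database, ts))


database = {
    "geneA": "TTGAATTCAAAAAAGGATCCTT",
    "geneB": "GAATTCAAGGATCCTT",
    "geneC": "CCCCAAGCTTCCCC",
}
candidates = ["GAATTC", "GGATCC", "AAGCTT"]
best = exhaustive_choice(database, candidates, 2)
```

The cost grows combinatorially with the number of candidates, which is consistent with the text offering simulated annealing as the alternative.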
This invention further provides in the first
embodiment additional methods wherein the step of searching
further comprises determining a pattern of signals that can
be generated and the sequences capable of generating each
such signal by simulating the steps of probing and generating
applied to each sequence in said database of nucleotide
sequences; and finding the one or more nucleotide sequences
in said database that are able to generate said one or more
generated signals by finding in said pattern those signals
that comprise a representation of (i) the same lengths
between occurrences of target subsequences as is represented
by the generated signal and (ii) the same target subsequences
as is represented by the generated signal, or target
subsequences that are members of the same sets of target
subsequences represented by the generated signal.
This invention further provides in the first
embodiment additional methods wherein the step of determining
further comprises searching for occurrences of said target
subsequences or sets of target subsequences in nucleotide
sequences in said database of nucleotide sequences; finding
the lengths between occurrences of said target subsequences
or sets of target subsequences in the nucleotide sequences of
said database; and forming the pattern of signals that can be
generated from the sequences of said database in which the
target subsequences were found to occur.
This invention further provides in the first
embodiment additional methods wherein said restriction
endonucleases generate 5' overhangs at the terminus of
digested fragments and wherein each double stranded adapter
nucleic acid comprises a shorter nucleic acid strand
consisting of a first and second contiguous portion, said
first portion being a 5' end subsequence complementary to the
overhang produced by one of said restriction endonucleases;
and a longer nucleic acid strand having a 3' end subsequence
complementary to said second portion of the shorter strand.
This invention further provides in the first
embodiment additional methods wherein said shorter strand has
a melting temperature from a complementary strand of less
than approximately 68°C, and has no terminal phosphate, and
optionally wherein said shorter strand is approximately 12
nucleotides long.
This invention further provides in the first
embodiment additional methods wherein said longer strand has
a melting temperature from a complementary strand of greater
than approximately 68°C, is not complementary to any
nucleotide sequence in said database, and has no terminal
phosphate, and optionally wherein said ligated nucleic acid
fragments do not contain a recognition site for any of said
restriction endonucleases, and optionally wherein said longer
strand is approximately 24 nucleotides long and has a G+C
content between 40% and 60%.
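One way to screen a candidate longer strand against the database, offered only as a sketch, is to test whether the strand or its reverse complement occurs anywhere in a database sequence. Treating "complementary" as an exact full-length match on either strand is an assumption made for the example, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def hybridizes_to_database(strand: str, database) -> bool:
    """True when the strand or its reverse complement occurs in any database sequence
    (exact full-length match; a simplifying assumption)."""
    strand = strand.upper()
    rc = reverse_complement(strand)
    return any(strand in seq or rc in seq for seq in database)
```

A candidate strand for which hybridizes_to_database returns True would be rejected under this simplified criterion.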
This invention further provides in the first
embodiment additional methods wherein said one or more
restriction endonucleases are heat inactivated before said
ligating.
This invention further provides in the first
embodiment additional methods wherein said restriction
endonucleases generate 3' overhangs at the terminus of the
digested fragments and wherein each double stranded adapter
nucleic acid comprises a longer nucleic acid strand
consisting of a first and second contiguous portion, said
first portion being a 3' end subsequence complementary to the
overhang produced by one of said restriction endonucleases;
and a shorter nucleic acid strand complementary to the 3' end
of said second portion of the longer nucleic acid strand.
This invention further provides in the first
embodiment additional methods wherein said shorter strand has
a melting temperature from said longer strand of less than
approximately 68°C, and has no terminal phosphates, and
optionally wherein said shorter strand is 12 base pairs long.
This invention further provides in the first
embodiment additional methods wherein said longer strand has
a melting temperature from a complementary strand of greater
than approximately 68°C, is not complementary to any
nucleotide sequence in said database, has no terminal
phosphate, and wherein said ligated nucleic acid fragments do
not contain a recognition site for any of said restriction
endonucleases, and optionally wherein said longer strand is
24 base pairs long and has a G+C content between 40% and 60%.
In a second embodiment, the invention provides a
method for identifying or classifying a nucleic acid
comprising probing said nucleic acid with a plurality of
recognition means, each recognition means recognizing a
target nucleotide subsequence or a set of target nucleotide
subsequences, in order to generate a set of signals, each
signal representing whether said target subsequence or one of
said set of target subsequences is present or absent in said
nucleic acid; and searching a nucleotide sequence database,
said database comprising a plurality of known nucleotide
sequences of nucleic acids that may be present in the sample,
for sequences matching said generated set of signals, a
sequence from said database matching a set of signals when
the sequence from said database (i) comprises the same target
subsequences as are represented as present, or comprises
target subsequences that are members of the sets of target
subsequences represented as present by the generated sets of
signals and (ii) does not comprise the target subsequences
represented as absent or that are members of the sets of
target subsequences represented as absent by the generated
sets of signals, whereby the nucleic acid is identified or
classified, and optionally wherein the set of signals are
represented by a hash code which is a binary number.
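The presence/absence matching of the second embodiment, with the set of signals held as a binary hash code, can be sketched as follows. The bit ordering (bit i set when the i-th target subsequence is present), the function names, and the example database are assumptions made for illustration.

```python
def hash_code(seq: str, targets) -> int:
    """Binary code in which bit i is 1 when targets[i] occurs in seq."""
    code = 0
    for i, target in enumerate(targets):
        if target in seq:
            code |= 1 << i
    return code


def match_code(observed: int, database, targets):
    """Names of database sequences whose presence/absence code equals the observed one."""
    return [name for name, seq in database.items()
            if hash_code(seq, targets) == observed]


database = {
    "geneA": "ACGTTGCA",   # contains both targets
    "geneB": "ACGTAAAA",   # contains only ACGT
    "geneC": "TTTTTGCA",   # contains only TGCA
}
targets = ["ACGT", "TGCA"]
```

Equality of codes enforces both conditions at once: every target represented as present must occur, and every target represented as absent must not.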
This invention further provides in the second
embodiment additional methods wherein the step of probing
generates quantitative signals of the numbers of occurrences
of said target subsequences or of members of said set of
target subsequences in said nucleic acid, and optionally
wherein a sequence matches said generated set of signals when
the sequence from said database comprises the same target
subsequences with the same number of occurrences in said
sequence as in the quantitative signals and does not comprise
the target subsequences represented as absent or target
subsequences within the sets of target subsequences
represented as absent.
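Where the signals are quantitative, the comparison can be made on occurrence counts rather than on presence alone. The sketch below assumes that overlapping occurrences are counted separately and that a count of zero stands for "absent"; the function names and data are hypothetical.

```python
def count_occurrences(seq: str, target: str) -> int:
    """Number of (possibly overlapping) occurrences of target in seq."""
    return sum(seq.startswith(target, i)
               for i in range(len(seq) - len(target) + 1))


def count_signature(seq: str, targets):
    """Tuple of occurrence counts, one per target subsequence."""
    return tuple(count_occurrences(seq, t) for t in targets)


def match_counts(observed, database, targets):
    """Names of database sequences whose counts equal the observed counts."""
    return [name for name, seq in database.items()
            if count_signature(seq, targets) == tuple(observed)]


database = {"geneA": "ACGTACGT", "geneB": "ACGTTTTT"}
targets = ("ACGT", "TTTT")
```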
This invention further provides in the second
embodiment additional methods wherein said plurality of
nucleic acids are DNA.
This invention further provides in the second
embodiment additional methods wherein the recognition means
are detectably labeled oligomers of nucleotides, nucleotide-
mimics, or combinations of nucleotides and nucleotide-mimics,
and the step of probing comprises hybridizing said nucleic
acid with said oligomers, and optionally wherein said
detectably labeled oligomers are detected by a method
comprising detecting light emission from a fluorochrome label
on said oligomers or arranging said labeled oligomers to
cause light to scatter from a light pipe and detecting said
scattering, and optionally wherein the recognition means are
oligomers of peptido-nucleic acids, and optionally wherein
the recognition means are DNA oligomers, DNA oligomers
comprising universal nucleotides, or sets of partially
degenerate DNA oligomers.
This invention further provides in the second
embodiment additional methods wherein the step of searching
further comprises determining a pattern of sets of signals of
the presence or absence of said target subsequences or said
sets of target subsequences that can be generated and the
sequences capable of generating each set of signals in said
pattern by simulating the step of probing as applied to each
sequence in said database of nucleotide sequences; and
finding one or more nucleotide sequences that are capable of
generating said generated set of signals by finding in said
pattern those sets that match said generated set, where a set
of signals from said pattern matches a generated set of
signals when the set from said pattern (i) represents as
present the same target subsequences as are represented as
present or target subsequences that are members of the sets
of target subsequences represented as present by the
generated sets of signals and (ii) represents as absent the
target subsequences represented as absent or that are members
of the sets of target subsequences represented as absent by
the generated sets of signals.
This invention further provides in the second
embodiment additional methods wherein the target subsequences
are selected according to the further steps comprising
determining (i) a pattern of sets of signals representing the
presence or absence of said target subsequences or of said
sets of target subsequences that can be generated, and (ii)
the sequences capable of generating each set of signals in
said pattern by simulating the step of probing as applied to
each sequence in said database of nucleotide sequences;
ascertaining the value of said pattern generated according to
an information measure; and choosing the target subsequences
in order to generate a new pattern that optimizes the
information measure.
This invention further provides in the second
embodiment additional methods wherein the information measure
is the number of sets of signals in the pattern which are
capable of being generated by one or more sequences in said
database, or optionally wherein the information measure is
the number of sets of signals in the pattern which are
capable of being generated by only one sequence in said
database.
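Both information measures named above can be computed from a tally of the presence/absence patterns produced by simulating the probing step over the database. The sketch below is illustrative only; it represents each set of signals as a tuple of booleans, and the function name and example sequences are hypothetical.

```python
from collections import Counter


def information_measures(database, targets):
    """Return (number of distinct signal sets generated by one or more sequences,
    number of signal sets generated by exactly one sequence)."""
    tally = Counter(tuple(t in seq for t in targets) for seq in database)
    distinct = len(tally)
    unique = sum(1 for n in tally.values() if n == 1)
    return distinct, unique


database = ["ACGTTGCA", "ACGTAAAA", "TTTTTGCA", "ACGTCCCC"]
targets = ["ACGT", "TGCA"]
distinct, unique = information_measures(database, targets)
```

In this example the second and fourth sequences share a pattern, so they count once toward the first measure and not at all toward the second.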
This invention further provides in the second
embodiment additional methods wherein said choosing step is
by a method comprising exhaustive search of all combinations
of target subsequences of length less than approximately 10,
or optionally wherein said choosing step is by a method
comprising simulated annealing.
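A generic simulated annealing loop for the choosing step might look like the sketch below. It is not the procedure of the disclosure: the linear cooling schedule, the single-swap move, the fixed number k of targets, the seed, and the choice of "signal sets generated by only one sequence" as the score are all assumptions, and the names are hypothetical.

```python
import math
import random
from collections import Counter


def unique_codes(database, targets):
    """Score: number of presence/absence patterns generated by exactly one sequence."""
    tally = Counter(tuple(t in seq for t in targets) for seq in database)
    return sum(1 for n in tally.values() if n == 1)


def anneal_targets(database, candidates, k, steps=500, t0=1.0, seed=0):
    """Search for k target subsequences maximizing unique_codes by simulated annealing."""
    rng = random.Random(seed)
    current = rng.sample(candidates, k)
    cur_score = unique_codes(database, current)
    best, best_score = list(current), cur_score
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9      # linear cooling (assumed schedule)
        unused = [c for c in candidates if c not in current]
        if not unused:
            break
        proposal = list(current)
        proposal[rng.randrange(k)] = rng.choice(unused)   # swap one target
        score = unique_codes(database, proposal)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if score >= cur_score or rng.random() < math.exp((score - cur_score) / temp):
            current, cur_score = proposal, score
            if cur_score > best_score:
                best, best_score = list(current), cur_score
    return best, best_score


database = ["ACGTTGCA", "ACGTAAAA", "TTTTTGCA", "GGGGCCCC"]
candidates = ["ACGT", "TGCA", "GGGG", "AAAA", "CCCC", "TTTT"]
best, best_score = anneal_targets(database, candidates, k=2)
```

The best set seen so far is tracked separately, so the returned score is never lower than that of the random starting set.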
This invention further provides in the second
embodiment additional methods wherein the step of determining
by simulating further comprises searching for the presence or
absence of said target subsequences or sets of target
subsequences in each nucleotide sequence in said database of
nucleotide sequences; and forming the pattern of sets of
signals that can be generated from said sequences in said
database, and optionally wherein the step of searching is
carried out by a string search, and optionally wherein the
step of searching comprises counting the number of
occurrences of said target subsequences in each nucleotide
sequence.
This invention further provides in the second
embodiment additional methods wherein the target subsequences
have a probability of occurrence in a nucleotide sequence in
said database of nucleotide sequences of from 0.01 to 0.6, or
optionally wherein the target subsequences are such that the
presence of one target subsequence in a nucleotide sequence
in said database of nucleotide sequences is substantially
independent of the presence of any other target subsequence
in the nucleotide sequence, or optionally wherein fewer than
approximately 50 target subsequences are selected.
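The probability of occurrence and the independence condition can both be estimated empirically from the database. In the sketch below, the probability is taken as the fraction of database sequences containing the target subsequence, and independence is examined by comparing the observed joint frequency with the product of the individual frequencies; these estimators, the function names, and the example data are assumptions.

```python
def occurrence_probability(target: str, database) -> float:
    """Fraction of database sequences that contain the target subsequence."""
    return sum(target in seq for seq in database) / len(database)


def joint_and_expected(t1: str, t2: str, database):
    """Observed joint frequency of t1 and t2, and the value expected if independent."""
    joint = sum(t1 in seq and t2 in seq for seq in database) / len(database)
    expected = occurrence_probability(t1, database) * occurrence_probability(t2, database)
    return joint, expected


database = ["ACGTTGCA", "ACGTAAAA", "TTTTTGCA", "GGGGCCCC"]
joint, expected = joint_and_expected("ACGT", "TGCA", database)
```

Target subsequences whose observed joint frequency stays close to the expected value would be treated as substantially independent under this reading.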
In a third embodiment, the invention provides a
programmable apparatus for analyzing signals comprising an
inputting device for inputting one or more actual signals
generated by probing a sample comprising a plurality of
nucleic acids with recognition means, each recognition means
recognizing a target nucleotide subsequence or a set of
target nucleotide subsequences, said signals comprising a
representation of (i) the length between occurrences of said
target subsequences in a nucleic acid of said sample, and
(ii) the identities of said target subsequences in said
nucleic acid, or the identities of said sets of target
subsequences among which is included the target subsequences
in said nucleic acid; a searching device operatively coupled
to said accepting device for searching a sequence in a
nucleotide sequence database for occurrences of said target
subsequences or target subsequences that are members of said
sets of target subsequences, and for the length between such
occurrences, said database comprising a plurality of known
nucleotide sequences that may be present in said sample; a
comparing device operatively coupled to said accepting device
and to said searching device for finding a match between said
one or more actual signals and a sequence in said database,
said one or more actual signals matching a sequence from said
database when the sequence from said database has both (i)
the same length between occurrences of target subsequences as
is represented by said one or more actual signals and (ii)
the same target subsequences as is represented by said one or
more actual signals or target subsequences that are members
of the same sets of target subsequences represented by said
one or more actual signals; and a control device operatively
coupled to said comparing device for causing said comparing
to be done for sequences in the database and for outputting
those database sequences that match said one or more actual
signals, and optionally wherein said searching device
searches for said target subsequences or a set of target
nucleotide subsequences in said database sequences by
performing a string comparison of the nucleotides in said
subsequences with those in said database sequence.
This invention further provides in the third
embodiment that said control device further comprises causing
said searching device to search substantially all sequences
in said database in order to determine a pattern of signals
that can be generated by probing said sample with said
recognition means, and wherein said control device further
causes said comparing device to find any matches between said
one or more actual signals and said pattern of signals, said
one or more actual signals matching a signal in said pattern
of signals when the signal from said pattern represents (i)
the same length between occurrences of target subsequences as
is represented by said one or more actual signals and (ii)
the same target subsequences as is represented by said one or
more actual signals or target subsequences that are members
of the same sets of target subsequences represented by said
one or more actual signals.
This invention further provides in the third
embodiment that said sample of nucleic acids comprises cDNA
from RNA of a cell or tissue type, and said database comprises
DNA sequences that are likely to be expressed by said cell or
tissue type.
This invention further provides in the third
embodiment a computer readable memory that can be used to
direct a programmable apparatus to function for analyzing
signals according to steps comprising inputting one or more
actual signals generated by probing a sample comprising a
plurality of nucleic acids with recognition means, each
recognition means recognizing a target nucleotide subsequence
or a set of target nucleotide subsequences, said signals
comprising a representation of (i) the length between
occurrences of said target subsequences in a nucleic acid of
said sample, and (ii) the identities of said target
15 subsequences in said nucleic acid, or the identities of said
set~ of target subsequences among which is included the
.arget subsequences in sa_d nucleic acid; searching a
sequenc.e in a nucleotide sequence database for occurrences o~
said target subsequences or target subsequences that zre
20 members of said sets of target subsequences, and for the
'ength between such occurrences, said databasç comprising a
plurality of known nucleotide sequences that may be present
in said sample; matching said one or more actual signals and
a sequence in said database when the sequence in said
25 database has both (i) the same length between occurrences of
target subsequences as is represented by said one or more
actual signals and (ii) the same target subsequences as is
represented by said one or more actual signals, or target
subsequences that are members of the same sets of target
30 subsequences as is represented by said one or more actual
signals; and repetitively performing said searching and
matching steps for the majority of sequences in the database
and outputting those database sequences that match said one
or more actual signals, or alternatively a computer readable
35 memory for directing a programmable apparatus to function in
~ the manner o~ the third object.
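By way of illustration only, the searching and matching steps
recited above might be sketched in Python as follows. The sketch
assumes that a signal is a triple of (length, first target
subsequence, second target subsequence), that the length is
measured between the start positions of adjacent occurrences on a
single strand, and that each recognition means recognizes a single
subsequence rather than a set. The names Signal, find_all,
signals_from_sequence, and match_signals are hypothetical and are
not taken from this description.

```python
from typing import NamedTuple


class Signal(NamedTuple):
    length: int   # distance between two adjacent target occurrences
    first: str    # identity of the upstream target subsequence
    second: str   # identity of the downstream target subsequence


def find_all(sequence: str, target: str) -> list[int]:
    """Return every start position of target in sequence (overlaps allowed)."""
    positions = []
    start = sequence.find(target)
    while start != -1:
        positions.append(start)
        start = sequence.find(target, start + 1)
    return positions


def signals_from_sequence(sequence: str, targets: list[str]) -> set[Signal]:
    """Signals a database sequence could generate (assumed: adjacent hits only)."""
    hits = sorted((pos, t) for t in targets for pos in find_all(sequence, t))
    return {Signal(p2 - p1, t1, t2) for (p1, t1), (p2, t2) in zip(hits, hits[1:])}


def match_signals(actual: list[Signal], database: dict[str, str],
                  targets: list[str]) -> dict[str, set[Signal]]:
    """Repeat the search for every database sequence and keep those that
    reproduce at least one actual signal (same length, same subsequences)."""
    wanted = set(actual)
    matches = {}
    for name, sequence in database.items():
        shared = signals_from_sequence(sequence, targets) & wanted
        if shared:
            matches[name] = shared
    return matches
```

A caller would pass the observed signals together with a mapping of
sequence names to sequences; database sequences that reproduce none
of the observed signals are simply absent from the result.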
In a fourth embodiment, the invention provides a
programmable apparatus for selecting target subsequences
comprising an initial selection device for selecting initial
target subsequences or initial sets of target subsequences; a
first control device; a search device operatively coupled to
said initial selection device and to said first control
device (i) for searching sequences in a nucleotide sequence
database for occurrences of said initial target subsequences
or occurrences of target subsequences that are members of
said initial sets of target subsequences and for the length
between such occurrences and (ii) for determining an initial
pattern of signals that can be generated from said selected
initial target subsequences or said initial sets of target
subsequences, said database comprising a plurality of known
nucleotide sequences, said signals comprising a
representation of (i) the length between said occurrences in
a sequence in said database, and (ii) the identities of said
initial target subsequences that occur in said sequence in
said database, or the identities of target subsequences that
are members of the same initial sets of target subsequences
that occur in said sequence in said database; and an
ascertaining device operatively coupled to said searching
device and to said first control device for ascertaining the
value of said determined initial pattern according to an
information measure; and wherein said first control device
causes further target subsequences to be selected and causes
the search device to determine a further pattern of signals
and the ascertaining device to ascertain a further value of
said information measure and accepts the further target
subsequences when said further pattern optimizes said further
value of said information measure.
This invention further provides in the fourth
object that a predetermined one or more of the sequences in
said database are of interest, and wherein said ascertaining
device ascertains the value of an information measure by
counting the number of such sequences of interest which
generate in said determined pattern at least one signal that
is not generated by any other sequence in said database, and
optionally that said one or more of the sequences of interest
comprise substantially all the sequences in said database.
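As a non-limiting sketch, the counting form of the information
measure described above might be computed as follows. The sketch
assumes the determined pattern is held as a mapping from a
sequence name to the set of signals that sequence can generate;
the function name information_measure and its parameters are
hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Mapping, Optional


def information_measure(pattern: Mapping[str, set[Hashable]],
                        of_interest: Optional[Iterable[str]] = None) -> int:
    """Count sequences of interest that generate at least one signal
    that no other sequence in the database generates."""
    # Each sequence contributes a set, so this counts how many distinct
    # database sequences generate each signal.
    generators = Counter(sig for sigs in pattern.values() for sig in sigs)
    if of_interest is None:
        # Optional case: substantially all database sequences are of interest.
        of_interest = pattern.keys()
    return sum(
        1
        for name in of_interest
        if any(generators[sig] == 1 for sig in pattern.get(name, ()))
    )
```

A higher value indicates that more sequences of interest can be
told apart from every other database sequence by at least one
signal.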
This invention further provides in the fourth
embodiment that said first control device optimizes the value
of said information measure according to a method of
exhaustive search, wherein said first control device selects
further target subsequences of length less than approximately
10 and accepts the further target subsequences if said
further value of said information measure is greater than the
previous value.
This invention further provides in the fourth
embodiment that said first control device optimizes the value
of said information measure according to a method comprising
simulated annealing, wherein said first control device
repeatedly selects further target subsequences and accepts
the further target subsequences if said further value of said
information measure is not decreased by greater than a
probabilistic factor dependent on a simulated-temperature,
and wherein said programmable apparatus further comprises a
second control device operatively coupled to said first
control device for decreasing said simulated-temperature as
said first control device selects further target
subsequences, and optionally wherein said probabilistic
factor is an exponential function of the negative of the
decrease in the information measure divided by said
simulated-temperature.
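The simulated annealing acceptance rule described above may be
illustrated, without limitation, by the following Python sketch.
It reads the probabilistic factor as the usual Metropolis
criterion, in which a candidate that lowers the information
measure is accepted with probability exp(-decrease / temperature);
that reading, the geometric cooling schedule, and the names
acceptance_probability, anneal, propose, and measure are
assumptions made for the example only.

```python
import math
import random


def acceptance_probability(old_value: float, new_value: float,
                           temperature: float) -> float:
    """Exponential of the negative decrease divided by the temperature."""
    decrease = old_value - new_value
    if decrease <= 0:
        return 1.0  # no decrease: always accept
    return math.exp(-decrease / temperature)


def anneal(initial, propose, measure, t_start=1.0, cooling=0.95,
           steps=200, rng=None):
    """Maximize measure(candidate); propose(current, rng) yields a neighbor."""
    rng = rng or random.Random(0)
    current, current_value = initial, measure(initial)
    best, best_value = current, current_value
    temperature = t_start
    for _ in range(steps):
        candidate = propose(current, rng)
        value = measure(candidate)
        if rng.random() < acceptance_probability(current_value, value, temperature):
            current, current_value = candidate, value
            if value > best_value:
                best, best_value = candidate, value
        # Assumed schedule: the temperature is lowered after every selection.
        temperature *= cooling
    return best, best_value
```

In use, measure would be an information measure evaluated on a
candidate choice of target subsequences, and propose would swap or
alter one target subsequence of the current choice.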
This invention further provides in the fourth
embodiment that the database comprises a majority of known
DNA sequences that are likely to be expressed by one or more
cell types.
This invention further provides in the fourth
embodiment a computer readable memory that can be used to
direct a programmable apparatus to function for selecting
target subsequences according to steps comprising selecting
initial target subsequences or initial sets of target
subsequences; searching a sequence in a nucleotide sequence
database for occurrences of said initial target subsequences
or occurrences of target subsequences that are members of
said initial sets of target subsequences and for the length
between such occurrences, said database comprising a
plurality of known nucleotide sequences that may be present
in said sample; determining an initial pattern of signals
that can be generated from said selected initial target
subsequences or said initial sets of target subsequences,
said signals comprising a representation of (i) the length
between said occurrences in a sequence in said database, and
(ii) the identities of said initial target subsequences that
occur in said sequence in said database, or the identities of
target subsequences that are members of the initial sets of
target subsequences that occur in said sequence in said
database; ascertaining the value of said determined initial
pattern according to an information measure; and repetitively
performing said selecting, searching, determining, and
ascertaining steps to determine a further pattern of signals
and a further value of said information measure, and
accepting the further target subsequences when said further
pattern optimizes said further value of said information
measure, or alternatively a computer readable memory for
directing a programmable apparatus to function in the manner
of the fourth object.
In a fifth embodiment, the invention provides a
programmable apparatus for displaying data comprising a
selecting device for selecting target subsequences or sets of
target subsequences, such that recognition means for
recognizing said target subsequences or said sets of target
subsequences can be used to generate signals by probing a
sample comprising a plurality of nucleic acids, said signals
comprising a representation of (i) the length between
occurrences of said target subsequences in a nucleic acid of
said sample and (ii) the identities of said target
subsequences in said nucleic acid or the identities of said
sets of target subsequences among which are included the
target subsequences in said nucleic acid; an inputting device
for inputting one or more actual signals generated by probing
said sample with said recognition means; an analyzing device
for analyzing signals operatively coupled to said selecting
and inputting devices that determines which sequences in a
nucleotide sequence database can generate said actual signals
when subject to said recognition means, said database
comprising a plurality of known nucleotide sequences that may
be present in said sample; an input/output device operatively
coupled to said selecting, inputting, and analyzing devices
that inputs user requests and controls the selecting device
to select target subsequences or sets of target subsequences,
controls the inputting device to accept actual signals,
controls the analyzing device to find the sequences in said
database that can generate said actual signals, and displays
output comprising said actual signals and said sequences in
said database that can generate said actual signals.
This invention further provides in the fifth
embodiment that said sample is a cDNA sample prepared from
a tissue specimen, and the apparatus further comprises a
storage device operatively coupled to the input/output device
for storing indications of the origin of said tissue specimen
and information concerning said tissue specimen, and wherein
said indications can be displayed upon user input, and
optionally that the indications and information concerning
said tissue specimen comprises histological information
comprising tissue images.
This invention further provides in the fifth
embodiment additional apparatus further comprising one or
more instrument devices for probing said sample with said
recognition means and for generating said actual signals; and
a control device operatively coupled to said one or more
instrument devices and to said input/output device for
controlling the operation of said instrument devices, wherein
said user can input control commands for control of said
instrument devices and receive output concerning the status
of said instrument devices, and optionally wherein one or
more of said selecting, inputting, analyzing, and
input/output devices are physically collocated with each
other, or are physically spaced apart from each other and are
connected by a communication medium for exchanges of commands
and information.
This invention further provides in the fifth
embodiment a computer readable memory that can be used to
direct a programmable apparatus to function for displaying
data according to steps comprising selecting target
subsequences or sets of target subsequences, such that
recognition means for recognizing said target subsequences or
said sets of target subsequences can be used to generate
signals by probing a sample comprising a plurality of nucleic
acids, said signals comprising a representation of (i) the
length between occurrences of said target subsequences in a
nucleic acid of said sample and (ii) the identities of said
target subsequences in said nucleic acid or the identities of
said sets of target subsequences among which are included the
target subsequences in said nucleic acid; inputting one or
more actual signals generated by probing said sample with
said recognition means; analyzing said one or more actual
signals to determine which sequences in a nucleotide sequence
database can generate said actual signals when subject to
said recognition means, said database comprising a plurality
of known nucleotide sequences that may be present in said
sample; and inputting user requests to control said selecting
step to select target subsequences or sets of target
subsequences, said inputting step to input actual signals,
and said analyzing step to find the sequences in said
database that can generate said actual signals, and
outputting in response to further user requests information
comprising said actual signals and said sequences in said
database that can generate said actual signals, or
alternatively a computer readable memory for directing a
programmable apparatus to function in the manner of the fifth
object.
In a sixth embodiment, the invention provides a
method for identifying, classifying, or quantifying DNA
molecules in a sample of DNA molecules having a plurality of
different nucleotide sequences, the method comprising the
steps of digesting said sample with one or more restriction
endonucleases, each said restriction endonuclease recognizing
a subsequence recognition site and digesting DNA at said
recognition site to produce fragments with 5' overhangs;
contacting said fragments with shorter and longer
oligodeoxynucleotides, each said shorter oligodeoxynucleotide
hybridizable with a said 5' overhang and having no terminal
phosphates, each said longer oligodeoxynucleotide
hybridizable with a said shorter oligodeoxynucleotide;
ligating said longer oligodeoxynucleotides to said 5'
overhangs on said DNA fragments to produce ligated DNA
fragments; extending said ligated DNA fragments by synthesis
with a DNA polymerase to produce blunt-ended double stranded
DNA fragments; amplifying said blunt-ended double stranded
DNA fragments by a method comprising contacting said DNA
fragments with a DNA polymerase and primer
oligodeoxynucleotides, each said primer oligodeoxynucleotide
having a sequence comprising that of one of the longer
oligodeoxynucleotides; determining the length of the
amplified DNA fragments; and searching a DNA sequence
database, said database comprising a plurality of known DNA
sequences that may be present in the sample, for sequences
matching one or more of said fragments of determined length,
a sequence from said database matching a fragment of
determined length when the sequence from said database
comprises recognition sites of said one or more restriction
endonucleases spaced apart by the determined length, whereby
DNA molecules in said sample are identified, classified, or
quantified.
This invention further provides in the sixth
embodiment additional methods wherein the sequence of each
primer oligodeoxynucleotide further comprises 3' to and
contiguous with the sequence of the longer
oligodeoxynucleotide the portion of the recognition site of
said one or more restriction endonucleases remaining on a DNA
fragment terminus after digestion, said remaining portion
being 5' to and contiguous with one or more additional
nucleotides, and wherein a sequence from said database
matches a fragment of determined length when the sequence
from said database comprises subsequences that are the
recognition sites of said one or more restriction
endonucleases contiguous with said one or more additional
nucleotides and when the subsequences are spaced apart by the
determined length.
This invention further provides in the sixth
embodiment additional methods wherein said determining step
further comprises detecting the amplified DNA fragments by a
method comprising staining said fragments with silver.
This invention further provides in the sixth
embodiment additional methods wherein said
oligodeoxynucleotide primers are detectably labeled, wherein
the determining step further comprises detection of said
detectable labels, and wherein a sequence from said database
matches a fragment of determined length when the sequence
from said database comprises recognition sites of the one or
more restriction endonucleases, said recognition sites being
identified by the detectable labels of said
oligodeoxynucleotide primers, said recognition sites being
spaced apart by the determined length, and optionally wherein
said determining step further comprises detecting the
amplified DNA fragments by a method comprising labeling said
fragments with a DNA intercalating dye or detecting light
emission from a fluorochrome label on said fragments.
This invention further provides in the sixth
embodiment additional steps further comprising, prior to said
determining step, the step of hybridizing the amplified DNA
fragments with a detectably labeled oligodeoxynucleotide
complementary to a subsequence, said subsequence differing
from said recognition sites of said one or more restriction
endonucleases, wherein the determining step further comprises
detecting said detectable label of said oligodeoxynucleotide,
and wherein a sequence from said database matches a fragment
of determined length when the sequence from said database
further comprises said subsequence between the recognition
sites of said one or more restriction endonucleases.
This invention further provides in the sixth
embodiment additional methods wherein the one or more
restriction endonucleases are pairs of restriction
endonucleases, the pairs being selected from the group
consisting of Acc65I and HindIII, Acc65I and NgoMI, BamHI and
EcoRI, BglII and HindIII, BglII and NgoMI, BsiWI and BspHI,
BspHI and BstYI, BspHI and NgoMI, BsrGI and EcoRI, EagI and
EcoRI, EagI and HindIII, EagI and NcoI, HindIII and NgoMI,
NgoMI and NheI, NgoMI and SpeI, BglII and BspHI, Bsp120I and
NcoI, BssHII and NgoMI, EcoRI and HindIII, and NgoMI and
XbaI, or wherein the step of ligating is performed with T4
DNA ligase.
This invention further provides in the sixth
embodiment additional methods wherein the steps of digesting,
contacting, and ligating are performed simultaneously in the
same reaction vessel, or optionally wherein the steps of
digesting, contacting, ligating, extending, and amplifying
are performed in the same reaction vessel.
This invention further provides in the sixth
embodiment additional methods wherein the step of determining
the length is performed by electrophoresis.
This invention further provides in the sixth
embodiment additional methods wherein the step of searching
said DNA database further comprises determining a pattern of
fragments that can be generated and for each fragment in said
pattern those sequences in said DNA database that are capable
of generating the fragment by simulating the steps of
digesting with said one or more restriction endonucleases,
contacting, ligating, extending, amplifying, and determining
applied to each sequence in said DNA database; and finding
the sequences that are capable of generating said one or more
fragments of determined length by finding in said pattern one
or more fragments that have the same length and recognition
sites as said one or more fragments of determined length.
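For illustration only, the simulated digestion and pattern lookup
described above might be sketched in Python as follows. The
recognition sequences shown for EcoRI and HindIII are the commonly
published ones; everything else is an assumption of the sketch,
including the simplification that fragment length is the distance
between the start positions of adjacent recognition sites (adapter
and primer lengths are ignored) and the hypothetical names SITES,
digest, build_pattern, and candidates.

```python
import re
from collections import defaultdict

# Recognition sites of two enzymes named in this description.
SITES = {"EcoRI": "GAATTC", "HindIII": "AAGCTT"}


def digest(sequence: str, enzymes: list[str]) -> list[tuple[int, str, str]]:
    """Simulate digestion: one (length, left enzyme, right enzyme) tuple
    for each pair of adjacent recognition sites in the sequence."""
    cuts = sorted(
        (m.start(), name)
        for name in enzymes
        for m in re.finditer(SITES[name], sequence)
    )
    return [(p2 - p1, e1, e2) for (p1, e1), (p2, e2) in zip(cuts, cuts[1:])]


def build_pattern(database: dict[str, str], enzymes: list[str]):
    """Map every fragment that can be generated to the database
    sequences capable of generating it."""
    pattern = defaultdict(set)
    for name, sequence in database.items():
        for fragment in digest(sequence, enzymes):
            pattern[fragment].add(name)
    return pattern


def candidates(pattern, length: int, end_a: str, end_b: str) -> set[str]:
    """Database sequences whose simulated fragments have the observed
    length and recognition sites, in either orientation."""
    return (pattern.get((length, end_a, end_b), set())
            | pattern.get((length, end_b, end_a), set()))
```

Building the pattern once and then looking up each observed
fragment avoids rescanning the whole database for every fragment of
determined length.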
This invention further provides in the sixth
embodiment additional methods wherein the steps of digesting
and ligating go substantially to completion.
This invention further provides in the sixth
embodiment additional methods wherein the DNA sample is cDNA
prepared from mRNA, and optionally wherein the DNA is of RNA
from a tissue or a cell type derived from a plant, a single
celled animal, a multicellular animal, a bacterium, a virus,
a fungus, a yeast, or a mammal, and optionally wherein the
mammal is a human, and optionally wherein the mammal is a
human having or suspected of having a diseased condition, and
optionally wherein the diseased condition is a malignancy.
In a seventh embodiment, this invention provides
additional methods for identifying, classifying, or
quantifying DNA molecules in a sample of DNA molecules with a
plurality of nucleotide sequences, the method comprising the
steps of digesting said sample with one or more restriction
endonucleases, each said restriction endonuclease recognizing
a subsequence recognition site and digesting DNA to produce
fragments with 3' overhangs; contacting said fragments with
shorter and longer oligodeoxynucleotides, each said longer
oligodeoxynucleotide consisting of a first and second
contiguous portion, said first portion being a 3' end
subsequence complementary to the overhang produced by one of
said restriction endonucleases, each said shorter
oligodeoxynucleotide complementary to the 3' end of said
second portion of said longer oligodeoxynucleotide strand;
ligating said longer oligodeoxynucleotide to said DNA
fragments to produce a ligated fragment; extending said
ligated DNA fragments by synthesis with a DNA polymerase to
form blunt-ended double stranded DNA fragments; amplifying
said double stranded DNA fragments by use of a DNA polymerase
and primer oligodeoxynucleotides to produce amplified DNA
fragments, each said primer oligodeoxynucleotide having a
sequence comprising that of a longer oligodeoxynucleotide;
determining the length of the amplified DNA fragments; and
searching a DNA sequence database, said database comprising a
plurality of known DNA sequences that may be present in the
sample, for sequences matching one or more of said fragments
of determined length, a sequence from said database matching
a fragment of determined length when the sequence from said
database comprises recognition sites of said one or more
restriction endonucleases spaced apart by the determined
length, whereby DNA sequences in said sample are identified,
classified, or quantified.
In an eighth embodiment, this invention provides
additional methods of detecting one or more differentially
expressed genes in an in vitro cell exposed to an exogenous
factor relative to an in vitro cell not exposed to said
exogenous factor comprising performing the methods of the first
embodiment of this invention wherein said plurality of
nucleic acids comprises cDNA of RNA of said in vitro cell
exposed to said exogenous factor; performing the methods of
the first embodiment of this invention wherein said plurality
of nucleic acids comprises cDNA of RNA of said in vitro cell
not exposed to said exogenous factor; and comparing the
identified, classified, or quantified cDNA of said in vitro
cell exposed to said exogenous factor with the identified,
classified, or quantified cDNA of said in vitro cell not
exposed to said exogenous factor, whereby differentially
expressed genes are identified, classified, or quantified.
In a ninth embodiment, this invention provides
additional methods of detecting one or more differentially
expressed genes in a diseased tissue relative to a tissue not
having said disease comprising performing the methods of the
first embodiment of this invention wherein said plurality of
nucleic acids comprises cDNA of RNA of said diseased tissue
such that one or more cDNA molecules are identified,
classified, and/or quantified; performing the methods of the
first embodiment of this invention wherein said plurality of
nucleic acids comprises cDNA of RNA of said tissue not having
said disease such that one or more cDNA molecules are
identified, classified, and/or quantified; and comparing said
identified, classified, and/or quantified cDNA molecules of
said diseased tissue with said identified, classified, and/or
quantified cDNA molecules of said tissue not having the
disease, whereby differentially expressed cDNA molecules are
detected.
This invention further provides in the ninth
embodiment additional methods wherein the step of comparing
further comprises finding cDNA molecules which are
reproducibly expressed in said diseased tissue or in said
tissue not having the disease and further finding which of
said reproducibly expressed cDNA molecules have significant
differences in expression between the tissue having said
disease and the tissue not having said disease, and
optionally wherein said finding cDNA molecules which are
reproducibly expressed and said significant differences in
expression of said cDNA molecules in said diseased tissue and
in said tissue not having the disease are determined by
a method comprising applying statistical measures, and
optionally wherein said statistical measures comprise
determining reproducible expression if the standard deviation
of the level of quantified expression of a cDNA molecule in
said diseased tissue or said tissue not having the disease is
less than the average level of quantified expression of said
cDNA molecule in said diseased tissue or said tissue not
having the disease, respectively, and wherein a cDNA molecule
has significant differences in expression if the sum of the
standard deviation of the level of quantified expression of
said cDNA molecule in said diseased tissue plus the standard
deviation of the level of quantified expression of said cDNA
molecule in said tissue not having the disease is less than
the absolute value of the difference of the level of
quantified expression of said cDNA molecule in said diseased
tissue minus the level of quantified expression of said cDNA
molecule in said tissue not having the disease.
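The two statistical measures described above can be written out as
a short Python sketch, offered for illustration only. It assumes
that each tissue is represented by at least two replicate
expression levels for a given cDNA molecule, that the sample
standard deviation is intended (the description does not say
whether sample or population), and that the difference is taken
between average levels; the function names are hypothetical.

```python
from statistics import mean, stdev


def is_reproducible(levels: list[float]) -> bool:
    """Reproducible expression: standard deviation below the average level."""
    return stdev(levels) < mean(levels)


def is_significant(diseased: list[float], normal: list[float]) -> bool:
    """Significant difference: the summed standard deviations are smaller
    than the absolute difference between the average levels."""
    return stdev(diseased) + stdev(normal) < abs(mean(diseased) - mean(normal))


def is_differentially_expressed(diseased: list[float],
                                normal: list[float]) -> bool:
    """Reproducible in either tissue and significantly different between them."""
    reproducible = is_reproducible(diseased) or is_reproducible(normal)
    return reproducible and is_significant(diseased, normal)
```

For replicate levels of 10, 12, and 11 in the diseased tissue and
3, 4, and 5 in the tissue not having the disease, each standard
deviation is 1 and the averages differ by 7, so both tests are
satisfied.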
This invention further provides in the ninth
embodiment additional methods wherein the diseased tissue and
the tissue not having the disease are from one or more
mammals, and optionally wherein the disease is a malignancy,
and optionally wherein the disease is a malignancy selected
from the group consisting of prostate cancer, breast cancer,
colon cancer, lung cancer, skin cancer, lymphoma, and
leukemia.
This invention further provides in the ninth
embodiment additional methods wherein the disease is a
malignancy and the tissue not having the disease has a
premalignant character.
In a tenth embodiment, this invention provides
methods of staging or grading a disease in a human individual
comprising performing the methods of the first embodiment of
this invention in which said plurality of nucleic acids
comprises cDNA of RNA prepared from a tissue from said human
individual, said tissue having or suspected of having said
disease, whereby one or more said cDNA molecules are
identified, classified, and/or quantified; and comparing said
one or more identified, classified, and/or quantified cDNA
molecules in said tissue to the one or more identified,
classified, and/or quantified cDNA molecules expected at a
particular stage or grade of said disease.
In an eleventh embodiment, this invention provides
additional methods for predicting a human patient's response
to therapy for a disease, comprising performing the methods
of the first embodiment of this invention in which said
plurality of nucleic acids comprises cDNA of RNA prepared
from a tissue from said human patient, said tissue having or
suspected of having said disease, whereby one or more cDNA
molecules in said sample are identified, classified, and/or
quantified; and ascertaining if the one or more cDNA
molecules thereby identified, classified, and/or quantified
correlates with a poor or a favorable response to one or more
therapies, and optionally which further comprises selecting
one or more therapies for said patient for which said
identified, classified, and/or quantified cDNA molecules
correlates with a favorable response.
In a twelfth embodiment, this invention provides
additional methods for evaluating the efficacy of a therapy
in a mammal having a disease, the method comprising
performing the methods of the first embodiment of this
invention wherein said plurality of nucleic acids comprises
cDNA of RNA of said mammal prior to a therapy; performing the
method of the first embodiment of this invention wherein said
plurality of nucleic acids comprises cDNA of RNA of said
mammal subsequent to said therapy; comparing one or more
identified, classified, and/or quantified cDNA molecules in
said mammal prior to said therapy with one or more
identified, classified, and/or quantified cDNA molecules of
said mammal subsequent to therapy; and determining whether
the response to therapy is favorable or unfavorable according
to whether any differences in the one or more identified,
classified, and/or quantified cDNA molecules after therapy
are correlated with regression or progression, respectively,
of the disease, and optionally wherein the mammal is a human.
In a thirteenth embodiment, this invention provides
a kit comprising one or more containers having one or more
restriction endonucleases; one or more containers having one
or more shorter oligodeoxynucleotide strands; one or more
containers having one or more longer oligodeoxynucleotide
strands hybridizable with said shorter strands, wherein
either the longer or the shorter oligodeoxynucleotide strands
each comprise a sequence complementary to an overhang
produced by at least one of said one or more restriction
endonucleases; and instructions packaged in association with
said one or more containers for use of said restriction
endonucleases, shorter strands, and longer strands for
identifying, classifying, or quantifying one or more DNA
molecules in a DNA sample, said instructions comprising (i)
digest said sample with said restriction endonucleases into
fragments, each fragment being terminated on each end by a
recognition site of said one or more restriction
endonucleases; (ii) contact said shorter and longer strands
and said digested fragments to form double stranded DNA
adapters annealed to said digested fragments; (iii) ligate
said longer strand to said fragments; (iv) generate one or
more signals by separating and detecting such of said
fragments that are digested on each end, each signal
comprising a representation of the length of the fragment and
the identity of the recognition sites on both termini of the
fragments; and (v) search a nucleotide sequence database to
determine sequences that match or the absence of any
sequences that match said one or more generated signals, said
database comprising a plurality of known nucleotide sequences
of nucleic acids that may be present in the sample, a
sequence from said database matching a generated signal when
the sequence from said database has both (i) the same length
between occurrences of said recognition sites of said one or
more restriction endonucleases as is represented by the
generated signal and (ii) the same recognition sites of said
one or more restriction endonucleases as is represented by
the generated signal.
This invention further provides in the thirteenth
embodiment a kit wherein said one or more restriction
endonucleases generate 5' overhangs at the terminus of
digested fragments, wherein each said shorter
oligodeoxynucleotide strand consists of a first and second
contiguous portion, said first portion being a 5' end
subsequence complementary to the overhang produced by one of
said restriction endonucleases, and wherein each said longer
oligodeoxynucleotide strand comprises a 3' end subsequence
complementary to said second portion of said shorter
oligodeoxynucleotide strand, or optionally wherein said one
or more restriction endonucleases generate 3' overhangs at
the terminus of the digested fragments, wherein each said
longer oligodeoxynucleotide strand consists of a first and
second contiguous portion, said first portion being a 3' end
subsequence complementary to the overhang produced by one of
said restriction endonucleases, and wherein each said shorter
oligodeoxynucleotide strand is complementary to the 3' end of
said second portion of said longer oligodeoxynucleotide
strand.
This invention further provides in the thirteenth
embodiment a kit wherein said instructions further comprise
those signals expected from one or more DNA molecules of
interest when said sample is digested with a particular one
or more restriction endonucleases selected from among said
one or more restriction endonucleases in said kit, and
optionally wherein said one or more DNA molecules of interest
are cDNA molecules differentially expressed in a disease
condition.
This invention further provides in the thirteenth
embodiment a kit wherein the restriction endonucleases are
selected from the group consisting of Acc65I, AflII, AgeI,
ApaLI, ApoI, AscI, AvrI, BamHI, BclI, BglII, BsiWI, Bsp120I,
BspEI, BspHI, BsrGI, BssHII, BstYI, EagI, EcoRI, HindIII,
MluI, NcoI, NgoMI, NheI, NotI, SpeI, and XbaI.
This invention further provides in the thirteenth
embodiment a kit further comprising one or more containers
having one or more double stranded adapter DNA molecules
formed by annealing said longer and said shorter
oligonucleotide strands.
This invention further provides in the thirteenth
embodiment a kit further comprising the computer readable
memory of claim 106, or optionally further comprising the
computer readable memory of claim 114, or optionally further
comprising the computer readable memory of claim 122.
This invention further provides in the thirteenth
embodiment a kit further comprising in a container a DNA
ligase, or optionally further comprising in a container a
phosphatase capable of removing terminal phosphates from a
DNA sequence.
This invention further provides in the thirteenth
embodiment a kit further comprising one or more primers, each
said primer consisting of a single stranded
oligodeoxynucleotide comprising the sequence of one of said
longer strands; and a DNA polymerase, and optionally wherein
each of said one or more primers further comprises (a) a
first subsequence that is the portion of the recognition site
of one of said one or more restriction endonucleases
remaining at the terminus of a fragment after digestion, and
(b) a second subsequence of one or two additional nucleotides
contiguous with and 3' to said first subsequence, wherein
said primer is detectably labeled such that primers with
differing said one or two additional nucleotides have
different labels that can be distinguishably detected.
This invention further provides in the thirteenth
embodiment a kit wherein said instructions further comprise:
detect such of said fragments digested on each end by a
method comprising staining said fragments with silver,
labeling said fragments with a DNA intercalating dye, or
detecting light emission from a fluorochrome label on said
fragments.
This invention further provides in the thirteenth
embodiment a kit further comprising reagents for performing a
cDNA sample preparation step; reagents for performing a step
of digestion by one or more restriction endonucleases;
reagents for performing a ligation step; and reagents for
performing a PCR amplification step.
4. BRIEF DESCRIPTION OF THE DRAWINGS
These and other features, aspects, and advantages
of the present invention will become better understood by
reference to the accompanying drawings, following
description, and appended claims, where:
Fig. 1 illustrates exemplary results of the signals
generated by QEA™ methods of this invention;
Figs. 2A, 2B, and 2C illustrate DNA adapters for an
RE/ligation implementation of QEA™ methods of this invention,
where the restriction endonucleases generate 5' overhangs,
open blocks indicating strands of DNA;
Figs. 3A and 3B illustrate the DNA adapters for an
RE/ligation implementation of QEA™ methods of this invention,
where the restriction endonucleases generate 3' overhangs;
Figs. 4A, 4B, and 4C illustrate an exemplary biotin
alternative embodiment of QEA™ methods;
Fig. 5 illustrates the DNA primers for a PCR embodiment
of QEA™ methods;
Figs. 6A and 6B illustrate a method for DNA sequence
database selection according to this invention;
Fig. 7 illustrates an exemplary experimental description
for QEA™ embodiments of this invention;
Figs. 8A and 8B illustrate an overview of a method for
determining a simulated database of experimental results for
QEA™ embodiments of this invention;
Fig. 9 illustrates the detail of a method for simulating
a QEA™ reaction;
Figs. 10A-F illustrate exemplary results of the action
of the method of Fig. 9;
Fig. 11 illustrates the detail of a method for
determining a simulated database of experimental results for
a QEA™ embodiment of this invention;
Figs. 12A, 12B, and 12C illustrate an exemplary computer
system apparatus, and an alternative embodiment, implementing
methods of this invention;
Fig. 13A illustrates exemplary detail of an experimental
design method for QEA™ and CC embodiments of this invention
and Fig. 13B illustrates exemplary detail of an experimental
design method for a QEA™ embodiment of this invention;
Fig. 14 illustrates an exemplary method for ordering the
DNA sequences found to be likely causes of a QEA™ signal in
the order of their likely presence in the sample;
Fig. 15 illustrates the detail of a method for
determining a simulated database of experimental results for
a CC embodiment of this invention;
Figs. 16A, 16B, 16C, and 16D illustrate exemplary
reaction temperature profiles for preferred manual and
automated implementations of a preferred RE embodiment of a
QEA™ method; and
Figs. 17A-F illustrate the SEQ-QEA™ alternative
embodiment of the RE/ligase embodiment of QEA™.
5. DETAILED DESCRIPTION
According to the present invention, to uniquely
identify an expressed nucleotide or gene sequence, full or
partial, as well as many components of genomic DNA, it is not
necessary to determine the actual, complete nucleotide
sequences. Full sequences provide far more information than
is needed to merely classify or determine a sequence
according to this invention. For example, in the human
genome, it is known that there are approximately 10^5 expressed
genes. Since the average length of a coding sequence is
approximately 2000 nucleotides, the total number of possible
sequences is approximately 4^2000, or about 10^1200. The actual
number of expressed human genes is an unimaginably small
fraction (10^-1195) of the total number of possible DNA
sequences. Even sequencing a 50 bp fragment of a cDNA
sequence generates about 10^25 times more information than is
needed for classification of that sequence. Use of the
present invention allows direct determination of sequences in
a sample with far less information than either a complete or
a partial sequence determination of a sample by making use of
a database of sequences likely to be present in the sample.
If such a database is not available, sequences in the sample
can nevertheless be separately classified.
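The orders of magnitude in this argument can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the constant and function names are hypothetical, the inputs are the approximate figures quoted above, and the computed exponents can differ by a few units from the rounded values in the text.

```python
import math

EXPRESSED_GENES = 1e5      # approximate count of expressed human genes (from the text)
CODING_LENGTH_NT = 2000    # approximate average coding-sequence length (from the text)
FRAGMENT_LENGTH_BP = 50    # partial-sequence length used in the text's example

def log10_sequence_count(length_nt: int) -> float:
    """log10 of the number of distinct DNA sequences of the given length (4 bases)."""
    return length_nt * math.log10(4)

# Exponent of the number of possible 2000-nucleotide sequences.
log10_possible = log10_sequence_count(CODING_LENGTH_NT)
# Exponent of the fraction of those sequences that are expressed genes.
log10_gene_fraction = math.log10(EXPRESSED_GENES) - log10_possible
# Exponent of the excess of 50 bp sequences over the number of genes.
log10_excess_50bp = log10_sequence_count(FRAGMENT_LENGTH_BP) - math.log10(EXPRESSED_GENES)

print(f"possible sequences: about 10^{log10_possible:.0f}")
print(f"fraction that are genes: about 10^{log10_gene_fraction:.0f}")
print(f"excess of 50 bp sequences over genes: about 10^{log10_excess_50bp:.0f}")
```

Working in logarithms avoids handling the very large integers directly and keeps the comparison with the exponents in the text straightforward.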
More generally, the invention is adaptable to
analyzing the sequences of any biopolymer, built of a small
number of repeating units, whose naturally occurring
representatives are far fewer than the number of possible,
physical polymers and in which small subsequences can be
recognized. Thus it is applicable to not only naturally
occurring DNA polymers but also to naturally occurring RNA
polymers, proteins, glycans, etc.
In computer science, codes which compactly identify
a few members from among a large set of possibilities are
called hash codes. An object of this invention is to
construct hash codes for expressed DNA sequences, or
alternatively for any other existing set of DNA sequences.
In a fully populated hash code without any unassigned code
words, all human genes could be coded by an approximately 17
bit binary number (2^17 = 1.3 x 10^5). A 20 bit code would be
about 10% filled or 90% sparse (2^20 = 1.0 x 10^6).
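As a quick check of the code-size figures, the sketch below computes the smallest number of bits that can label about 10^5 genes and how full a 20 bit code would be. The variable names are illustrative only, and the gene count is the approximate figure used in the text.

```python
import math

GENE_COUNT = 100_000   # approximate number of expressed human genes (from the text)

# Smallest code length whose 2**bits code words cover every gene.
bits_needed = math.ceil(math.log2(GENE_COUNT))

# Fraction of the code words of a 20 bit code that would be occupied.
fill_fraction_20_bit = GENE_COUNT / 2 ** 20

print(bits_needed)                      # minimum code length in bits
print(f"{fill_fraction_20_bit:.1%}")    # occupancy of a 20 bit code
```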
In this invention codes are constructed from one or
more signals which represent the presence of short nucleic
acid (preferably DNA) subsequences (hereinafter called
"target subsequences") in the sample sequence and,
preferably, in a QEA™ embodiment, include a representation of
the length along the sample sequence between adjacent target
subsequences. In some embodiments, the presence of target
subsequences is directly recognized by direct subsequence
recognition means, including, but not limited to, REs and
other DNA binding proteins, which bind and/or react with
target subsequences, and oligomers of, for example, PNAs or
DNAs, which hybridize to target subsequences. In other
embodiments, the presence of effective target subsequences is
recognized indirectly as a result of applying protocols,
perhaps involving multiple DNA binding proteins together with
hybridizing oligomers. In this latter case, each of the
multiple proteins or oligomers can recognize a separate
subsequence and the effective target subsequence can be the
combination of the separate subsequences. A preferable
combination is subsequence concatenation in the situation
where all the separately recognized subsequences are
adjacent. Such effective target subsequences can have
advantageous properties not achievable by, for example, REs
or PNA oligomers alone. However, this invention, and
particularly its computer methods, is adaptable to any
acceptable subsequence recognition means available in the
art. The computer implemented analysis and design methods
treat target subsequences and effective target subsequences
in the same manner. Such acceptable subsequence recognition
means preferably precisely and reproducibly recognize target
subsequences and generate a recognition signal with adequate
signal to noise ratio and further preferably provide
information on the length between target subsequences.
The signals of this invention, which contain
representations of target subsequence occurrences and,
preferably, representations of the length between target
subsequence occurrences, can differ in various embodiments of
this invention. In some embodiments, target subsequences are
exactly recognized, for example, where REs are the
recognition means, and subsequence representation can be the
unique identity of the subsequences. In other embodiments,
target subsequence recognition is less exact, for example,
where short oligomers are used, and this representation can
be "fuzzy". In the case of short oligomers, a fuzzy
representation can consist of all subsequences which differ
by one nucleotide from a target subsequence, each such
subsequence perhaps weighted by the probability that it is
the target subsequence. Further, length
representation may depend on the separation and detection
means used to generate the signals. In the case of
electrophoretic separation, the length observed
electrophoretically may need to be corrected, perhaps by up
to 5 to 10%, for mobility differences due to average base
composition differences or due to effects of labeling
moieties used for detection. As these corrections often are
not known until the total sample sequence is determined,
the length representation of the signal can use the
electrophoretic length in bp and not the physical length in
bp. For simplicity and without limitation, in the following
description unless otherwise noted the signals are presumed
to represent physically correct lengths, as if generated by
precise recognition means with a length determined by error
or bias free separation and detection means. However, in
particular embodiments, target subsequences can be
represented in a fuzzy manner and length, if present, can
include separation and detection bias.
Target subsequences recognized are typically
contiguous. This is typical for REs adaptable to this
invention. However, this invention is adaptable to means
recognizing discontiguous target subsequences or
discontiguous effective target subsequences. For example,
oligomers recognizing discontinuous subsequences can be
constructed by inserting degenerate nucleotides in a
discontinuous region. A set of 16 oligomers recognizing
AGC--TAT, with a two nucleotide discontiguous region, can be
constructed according to the schema TCGNNATA, where N is any
nucleotide. Alternately, such discontiguous subsequences can
be recognized by one oligomer of the form TCGiiATA, where "i"
is inosine, or any other "universal" nucleotide, capable of
hybridizing with any naturally occurring base.
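The set of 16 oligomers implied by the schema TCGNNATA can be enumerated mechanically. The following is a minimal sketch; the function name `expand_degenerate` is hypothetical and only the degenerate symbol N is handled.

```python
from itertools import product

def expand_degenerate(schema: str, alphabet: str = "ACGT") -> list:
    """Expand every 'N' in the schema into each nucleotide of the alphabet."""
    # Each position contributes either its fixed base or all four bases.
    slots = [alphabet if base == "N" else base for base in schema]
    return ["".join(combination) for combination in product(*slots)]

oligomers = expand_degenerate("TCGNNATA")
print(len(oligomers))   # number of oligomers in the degenerate set
print(oligomers[:4])    # first few members of the set
```

Two degenerate positions with four choices each give 4 x 4 = 16 oligomers, all sharing the fixed TCG and ATA flanks.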
Typically and without limitation, however, the
invention is applied to the analysis of cDNA samples
synthesized from any in vivo or in vitro sources of RNA.
cDNA can be synthesized either from total cellular RNA, from
poly(A)+ RNA, or from specific sub-pools of RNA. Such RNA
sub-pools can be produced by RNA pre-purification; for
example, separation of mRNA of the endoplasmic reticulum from
cytoplasmic mRNA enriches mRNA primarily encoding for cell
surface or extracellular proteins (Celis et al., 199~, Cell
Biology, Academic Press, New York, NY). Such enriched mRNAs
have increased diagnostic or therapeutic utility due, for
example, to their encoded protein's cell-surface or
extracellular roles, such as being a receptor. Such
pre-purified RNA pools can be used in all embodiments of this
invention. First strand cDNA synthesis can be performed by
any method known in the art and can use any priming method
known in the art. For example, first strand synthesis
primers can be oligo(dT) primers, random hexamer primers,
phasing primers, mixtures thereof, etc. In particular,
phasing primers, containing either an A, C, or G at the 3'
end, can be used in separate cDNA synthesis reactions to
split the cDNA first strands into 3 pools, each generated
from poly(A)+ mRNA having a T, G, or C, respectively, 5' to
the poly(A)+ tail. Twelve pools can be synthesized by using
the 12 possible oligo(dT) phasing primers not containing a
3'-terminal thymidine. Further, cDNA can be synthesized by
methods biased to producing full-length cDNAs, e.g., by
requiring presence of the 5'-cap in the source mRNA.
Two specific embodiments of the invention are
respectively termed "quantitative expression analysis"
("QEA™") and "colony calling" ("CC"). The specific
embodiment known as QEA™ probes a sample with recognition
means generating signals that preferably comprise an
indication of the presence of a first target subsequence, an
indication of the presence of a second target subsequence,
and a representation of the length between the target
subsequences in the sample nucleic acid sequence. If the
first and second target subsequences occur more than once in a
single nucleic acid in the sample, more than one signal is
generated, each signal comprising the length between adjacent
occurrences of the target subsequences.
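The structure of such a signal can be illustrated with a short sketch. It is not the method of this invention: the function names are hypothetical, the two target subsequences are assumed to be distinct, and the "length" is simplified to the distance between the start positions of adjacent target occurrences, ignoring adapters, overhangs, and separation bias.

```python
import re
from typing import List, Tuple

Signal = Tuple[str, str, int]  # (left target, right target, length between them)

def find_sites(sequence: str, target: str) -> List[int]:
    """Start positions of every occurrence of target, overlaps included."""
    return [m.start() for m in re.finditer(f"(?={re.escape(target)})", sequence)]

def qea_signals(sequence: str, first: str, second: str) -> List[Signal]:
    """One signal per pair of adjacent target occurrences along the sequence."""
    hits = sorted(
        [(pos, first) for pos in find_sites(sequence, first)]
        + [(pos, second) for pos in find_sites(sequence, second)]
    )
    return [
        (left_target, right_target, right_pos - left_pos)
        for (left_pos, left_target), (right_pos, right_target) in zip(hits, hits[1:])
    ]

ECORI_SITE, BAMHI_SITE = "GAATTC", "GGATCC"
# Toy sequence with two EcoRI sites flanking one BamHI site.
sample = "AA" + ECORI_SITE + "TTTT" + BAMHI_SITE + "AAAA" + ECORI_SITE + "TT"
signals = qea_signals(sample, ECORI_SITE, BAMHI_SITE)
print(signals)
```

Because the first target occurs twice in the toy sequence, two signals are produced from the one molecule, as the paragraph above describes.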
QEA™ embodiments are preferred for classifying and
determining sequences in mixtures of cDNAs, but are also
adaptable to samples with only one cDNA. They afford the
relative advantage over prior art methods that cloning of
sample nucleic acids is not required. Typically, enough
pairs of target subsequences can be chosen so that sufficient
distinguishable signals can be generated to determine one to
all the sequences in the sample mixture. For example, first,
any pair of target subsequences may occur more than once in a
single DNA molecule to be analyzed, thereby generating
several signals with differing lengths from one DNA molecule.
Second, even if a pair of target subsequences occurs only
once in two different DNA molecules to be analyzed, the
lengths between the hits may differ and thus distinguishable
signals may be generated.
The target subsequences used in QEA™ are preferably
optimally chosen by the computer implemented methods of this
invention in view of DNA sequence databases containing
sequences likely to occur in the sample to be analyzed. In
the case of human cDNA, efforts of the Human Genome Project
in the United States, efforts abroad, and efforts of private
companies in the sequencing of the human genome sequences,
both expressed and genetic, are being collected in several
available databases (listed in Sec. 5.1).
Typically, QEA™ can be performed in a "query mode"
or in a "tissue mode." A query mode experiment focuses on
determining the expression of a limited number of genes,
perhaps 1 - 100, of interest and of known sequence. A
minimal number of target subsequences are chosen to generate
signals, with the goal that each of the limited number of
genes is discriminated from all the other genes likely to
occur in the sample by at least one unique signal. In other
words, such a QEA™ experiment is designed so that each gene
of interest generates at least one signal unique to it (a
"good" gene, see infra). A QEA™ tissue mode experiment
focuses on determining the expression of as many as possible,
preferably a majority, of the genes expressed in a tissue or
other sample, without the need for any prior knowledge or
interest in their expression. Target subsequences are
optimally chosen to discriminate the maximum number of sample
DNA sequences into classes comprising one or preferably at
most a few sequences. Preferably, enough signals are
produced and detected so that the computer methods of this
invention can uniquely determine the expression of a
majority, or more preferably most, of the genes expressed in
a tissue. In both modes, signals are generated and detected
as determined by the threshold and sensitivity of a
particular experiment. Some important determinants of
threshold and sensitivity are the initial amount of mRNA and
thus of cDNA, the amount of molecular amplification performed
during the experiment, and the sensitivity of the detection
means.
QEA™ signals are generated by methods comprising a
recognition means for target subsequences that include, but
are not limited to one or more REs in a preferred RE/ligase
embodiment or nucleotide oligomer primers in an alternative
PCR embodiment. In both embodiments, this invention
contemplates embodiments which select certain classes of QEA™
reaction products and remove unwanted products. These
embodiments advantageously increase the signal to noise
("s/n") ratio of the resulting signals.
In general, the RE/ligase method proceeds according
to the following steps. The method employs recognition
reactions with one, a pair, or more REs which recognize
target subsequences with high specificity and cut the
sequence at the recognition sites leaving fragments with
sticky overhangs characteristic of the particular RE. To
each sticky overhang, specially constructed, labeled
amplification primers are ligated with the aid of shorter
linkers in a manner so that the particular RE making the cut,
and thus the particular target subsequence, can be later
identified. A DNA polymerase then forms blunt-ended DNA
fragments. These fragments are then PCR amplified using the
same special labeled primers for a number of cycles
preferably just sufficient to detect signals from all
fragments of interest and just sufficient to make signals
from fragments not of interest, e.g., the linearly amplifying
singly cut fragments, relatively insignificant. The
amplified labeled fragments are then separated by length
using gel electrophoresis in either denaturing or
non-denaturing conditions and the length and labeling of the
fragments are optically detected. Optionally, single stranded
fragments can be removed by binding to a hydroxyapatite, or
other single strand specific, column or by digestion by a
single strand specific nuclease. Also, this invention is
adaptable to other functionally equivalent amplification and
length separation means. In this manner, the identity of the
REs cutting a fragment, and thereby the subsequences present,
as well as the length between the cuts is determined.
The RE/ligase embodiment is adaptable to several
embodiments which enhance quantitative characteristics of
QEA™ signals or which increase sample sequence
discrimination. Certain embodiments use a removal means to
improve such quantitative characteristics as sensitivity and
linear responsiveness. One or more of the special, labeled
amplification primers described above and used in the PCR
amplification step can have attached removal means comprising
a capture moiety attached to the primer and a binding partner
attached to a solid support, e.g., biotin and streptavidin
beads. In this manner certain products of the PCR reactions,
e.g., fragments cut with different REs at each end, can be
separated and purified from background fragments. Such
purified fragments can thereby be detected with increased
sensitivity. For example, fragments cut with pairs of
different REs on both ends are preferably separated since
such fragments contain the majority of signals. With N REs,
there are N(N-1)/2 pairs with different REs but only N pairs
with the same RE.
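The pair counts can be confirmed by direct enumeration. In the sketch below the function name is hypothetical and the four enzymes are chosen arbitrarily from the kit list above; unordered pairs of distinct enzymes number N(N-1)/2, while same-enzyme pairs number N.

```python
from itertools import combinations

def count_re_pairs(enzymes):
    """Return (pairs of different REs, pairs of the same RE) for a set of REs."""
    different = list(combinations(enzymes, 2))     # unordered pairs of distinct REs
    same = [(enzyme, enzyme) for enzyme in enzymes]
    return len(different), len(same)

enzymes = ["EcoRI", "BamHI", "HindIII", "BglII"]
different_pairs, same_pairs = count_re_pairs(enzymes)
print(different_pairs, same_pairs)
```

For N greater than 3 the different-RE pairs outnumber the same-RE pairs, which is consistent with such fragments carrying the majority of signals.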
Alternatively, cDNA is synthesized from an mRNA
sample with synthesis primers at least one of which is
biotinylated. In the case where only one synthesis primer is
biotinylated, the cDNA is then cyclized. In any case, the
cDNA is then cut with one or a pair of REs, and the
special, labeled amplification primers are ligated to the cut
ends with the aid of shorter linkers as previously discussed.
The singly cut ends attached to the biotinylated cDNA
synthesis primers are removed with streptavidin or avidin
beads leaving highly pure double cut cDNA fragments with
ligated amplification primers, but with minimal singly cut
and labeled background fragments. With sufficiently
sensitive detection means, these pure doubly cut and labeled
fragments can be directly detected, after separation by
length (e.g., by electrophoresis or column chromatography),
without amplification. If amplification is needed, absence
of the DNA singly cut background fragments improves signal to
noise ratio resulting in fewer necessary amplification
cycles. Thereby, PCR amplification bias is decreased or
eliminated and linear responsiveness of QEA™ signals to input
mRNA amounts is improved.
Other RE/ligase embodiments increase sample
sequence discrimination in QEA™ experiments, for example, by
recognizing target subsequences longer or less limited than
those recognized by REs, or by recognizing third subsequences
interior to cut fragments. This added information can often
discriminate two sample sequences producing fragments having
identical original end subsequences and lengths. It is used
in the computer implemented database lookup methods of this
invention in a manner similar to the use of target
subsequences. In one embodiment, the target subsequences
recognized can be effectively lengthened by using an
amplification primer with an internal Type IIS RE recognition
site so positioned that the Type IIS RE cuts the amplified
fragments in a manner producing a second overhang contiguous
with the recognition site of the initial RE. The sequence of
the second overhang concatenated with the initial target end
subsequence produces an effectively longer target
subsequence. Alternatively, an effectively longer target
subsequence can be recognized by using phasing primers during
PCR amplification. The PCR amplification step can be divided
into several pools with each pool using one phasing
amplification primer constructed so as to recognize one or
more additional nucleotides beyond the original RE
recognition site. These additional nucleotides then
contribute to an effectively longer target subsequence.
A third subsequence internal to a fragment can be
recognized by a distinctively labeled probe binding or
hybridizing with the third subsequence. Such a probe added
before detection generates unique signals from the fragment
containing that subsequence. Alternatively, a probe can
suppress signals from fragments with the third subsequence.
For example, a probe added before the PCR amplification step
and which prevents amplification of a fragment with the third
subsequence thereby removes and suppresses any signal from
such fragments. Such a probe can be without limitation
either an RE for recognizing and cutting the fragment with
the third subsequence or a PNA or modified DNA oligomer,
which cannot serve as a PCR primer, for hybridizing with the
third subsequence. Also, a third subsequence can be the
sequence of the overhang produced by a Type IIS RE cutting
the amplification primers sufficiently close to their 3' ends
so that the resulting overhang is not contiguous with the
recognition sequence of the initial RE.
Further, various embodiments for improving the
quantitative characteristics of QEA™ experiments and for
improving the discrimination of sample sequences can be
combined in advantageous fashions to achieve both
improvements in the same experiment. For example, removal
means to increase the s/n ratio is combined with a Type IIS
RE cutting the amplification primers to increase sample
sequence discrimination in an embodiment called SEQ-QEA™.
In a preferred PCR method for QEA™, a suitable
collection of target subsequences is chosen by the computer
implemented QEA™ experimental design methods, and PCR primers
distinctively labeled with fluorochromes are synthesized to
hybridize with these target subsequences. The primers are
designed as described in Sec. 5.3 to reliably recognize short
subsequences while achieving a high specificity in PCR
amplification. Using these primers, a minimum number of PCR
amplification steps amplifies those fragments between the
primed subsequences existing in DNA sequences in the sample,
thereby recognizing the target subsequences. The labeled,
amplified fragments are then separated by gel electrophoresis
and detected. Further, the PCR embodiment is adaptable to
the same embodiments previously discussed with respect to the
RE/ligase embodiment.
The signals generated from the recognition
reactions of a QEA™ experiment are analyzed by computer
methods of this invention. The analysis methods simulate a
QEA™ experiment using a database either of substantially all
known DNA sequences or of substantially all, or at least a
majority of, the DNA sequences likely to be present in a
sample to be analyzed and a description of the reactions to
be performed. The simulation results in a digest database
which contains for each possible signal that can be generated
the database sequences responsible for that signal. Thereby,
finding the sequences that can generate a signal involves a
look-up in the simulated digest database. Computer
implemented design methods optimize the choice of target
subsequences in QEA™ reactions in order to maximize the
information produced in an experiment. For the tissue mode,
the methods maximize the number of sequences having unique
signals by which their quantitative presence can be
unambiguously determined. For the query mode, the methods
maximize only the number of sequences of interest having
unique signals, ignoring recognition of other sequences that
might be present in a sample.
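The simulated digest database and the count of uniquely identified sequences can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation described in this specification: all function names and the three-entry database are hypothetical, and a signal is simplified to a pair of adjacent target subsequences plus the distance between their start positions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Signal = Tuple[str, str, int]  # (left target, right target, length between them)

def find_sites(sequence: str, target: str) -> List[int]:
    """Start positions of every occurrence of target, overlaps included."""
    return [m.start() for m in re.finditer(f"(?={re.escape(target)})", sequence)]

def simulate_signals(sequence: str, targets: List[str]) -> Set[Signal]:
    """Signals from each pair of adjacent target occurrences in one sequence."""
    hits = sorted((pos, t) for t in targets for pos in find_sites(sequence, t))
    return {(lt, rt, rp - lp) for (lp, lt), (rp, rt) in zip(hits, hits[1:])}

def build_digest_database(database: Dict[str, str],
                          targets: List[str]) -> Dict[Signal, Set[str]]:
    """Map every simulated signal to the database sequences that can produce it."""
    digest = defaultdict(set)
    for name, sequence in database.items():
        for signal in simulate_signals(sequence, targets):
            digest[signal].add(name)
    return dict(digest)

def uniquely_determined(digest: Dict[Signal, Set[str]]) -> Set[str]:
    """Sequences having at least one signal that no other sequence produces."""
    return {next(iter(names)) for names in digest.values() if len(names) == 1}

targets = ["GAATTC", "GGATCC"]
database = {
    "geneA": "AA" + "GAATTC" + "TTTT" + "GGATCC" + "AA",
    "geneB": "GAATTC" + "T" * 7 + "GGATCC" + "A",
    "geneC": "CC" + "GAATTC" + "AAAA" + "GGATCC" + "TT",
}
digest = build_digest_database(database, targets)
unique = uniquely_determined(digest)
print(digest)
print(unique)
```

In this toy database geneA and geneC yield the same signal and so cannot be told apart by this pair of targets, whereas geneB has a signal of its own. A design method in the tissue mode would compare candidate target sets by the size of the uniquely determined set; in the query mode it would count only the sequences of interest within that set.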
The second specific embodiment known as colony
calling ("CC") generates subsequence occurrence data without
length information. Since this method requires only
hybridizations, it is preferred for gene identification in
arrayed single-sequence clones constructed from a tissue
library. This embodiment constructs a binary code in which
each bit of the code represents the presence or absence of
one target subsequence. By probing four to eight target
subsequences in parallel, such as by using distinguishable
fluorescent labeling of the multiple probes, in view of the
adequacy of a 20 bit code, the presence or absence of any
expressed human gene should be determinable in just three to
five separate probe steps. Such a compact method with such
economy in signal generation is highly useful.
Alternatively, recent real time hybridization detection
methods (Stimson et al., 1995, Proc. Natl. Acad. Sci. USA,
92:6379-6383) based on optical wave guides can be used for
detection. These methods make hybridization detection more
efficient both by eliminating the washing step otherwise
needed between hybridization and detection and by speeding up
the detection step.
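The number of probe steps follows from dividing the code length by the number of probes that can be distinguished in one step and rounding up. A small sketch, with illustrative names only:

```python
import math

CODE_BITS = 20   # code length taken as adequate in the text

def probe_steps(code_bits: int, probes_per_step: int) -> int:
    """Hybridization steps needed when several probes are read in parallel."""
    return math.ceil(code_bits / probes_per_step)

steps = {k: probe_steps(CODE_BITS, k) for k in range(4, 9)}
print(steps)   # probes per step -> number of steps
```

Four probes per step require five steps and eight probes per step require three, which spans the range of three to five steps stated above.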
The hash code generated by the probe hybridization
reactions is interpreted by computer implemented methods of
this invention. The analysis methods simulate a CC
experiment using a list of the target subsequences and a
database of the DNA sequences likely to be present in a
sample to be analyzed. The simulation results in a hash code
table which contains for each hash code all possible
sequences that can generate that code. Thereby,
interpretation of a detected hash code requires a look-up in
the table to find the possible sequences.
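A hash code table of this kind can be sketched in a few lines. The names, the three target subsequences, and the toy clone sequences below are all hypothetical, and hybridization is idealized as exact substring presence.

```python
from collections import defaultdict
from typing import Dict, List

def hash_code(sequence: str, targets: List[str]) -> str:
    """One bit per target subsequence: '1' if present in the sequence, else '0'."""
    return "".join("1" if target in sequence else "0" for target in targets)

def build_hash_table(database: Dict[str, str],
                     targets: List[str]) -> Dict[str, List[str]]:
    """Map each hash code to every database sequence that would generate it."""
    table = defaultdict(list)
    for name, sequence in database.items():
        table[hash_code(sequence, targets)].append(name)
    return dict(table)

targets = ["ACGT", "GGCC", "TTAA"]
database = {
    "clone1": "AACGTTGGCCA",
    "clone2": "TTAACCCC",
    "clone3": "GGCCACGTAAA",
}
table = build_hash_table(database, targets)
print(table)
# Interpreting an observed code is then a dictionary look-up.
print(table.get("001", []))
```

Here clone1 and clone3 collide on one code, so an observed code of that value would indicate either sequence, while the code for clone2 is unambiguous.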
It is preferable that subsequences be carefully
chosen in order that a minimum set of targets be obtained,
preferably no more than approximately 20, that produce the
maximum amount of information. Computer implemented methods
of this invention determine optimum sets of target
subsequences for a given database of sequences likely to
occur in the sample by optimizing the number of non-empty
hash codes in the simulated hash code table.
Maximum information is obtained when the target
subsequences occur completely randomly in the possible sample
sequences, that is, when their likelihood of occurrence is
approximately 50% and the presence of one subsequence is
independent of the presence of any other subsequence.
Therefore, target subsequences chosen to generate a signal
should preferably occur in the DNA sequence sample to be
analyzed less often than about 50% and at least more often
than 5-10%, preferably more often than 10-15%. The most
preferable occurrence probability is from 25-50%. Also the
presence of one target subsequence is preferably
probabilistically independent of the presence of any other
subsequence.
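The preference for occurrence probabilities near 50% can be illustrated with the standard binary entropy function, which gives the information in bits carried by one presence/absence observation. This is offered only as an illustrative sketch; the function name is hypothetical and the calculation assumes each target subsequence is an independent yes/no observation.

```python
import math

def bit_information(p: float) -> float:
    """Shannon entropy, in bits, of a presence/absence observation with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0   # a certain outcome carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.05, 0.10, 0.25, 0.50):
    print(f"occurrence probability {p:.2f}: {bit_information(p):.3f} bits")
```

A target present in half of the sequences yields a full bit per observation, while one present in only 5% yields well under a third of a bit, so more targets would be needed to reach the same discrimination.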
Using data on expressed RNA from human DNA sequence
databases, this means that subsequences are preferably less
than about 5 to 8 bp long for cDNA classification.
Typically, the resulting preferable target subsequences are 4
to 6 bp long. Longer sequences occur too infrequently to be
preferred for use. However, for classifying gDNA, longer
subsequences, up to 20 to 40 bp, are preferably used, because
gDNA fragments are normally of much greater length, from at
least 5 kilobases ("kb") for plasmid inserts to more than 100
kb for P1 inserts, and thus would typically have more
sequence variability, requiring longer target subsequences.
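The dependence of occurrence probability on subsequence length can be approximated with a crude random-sequence model. The sketch below assumes independent, equally likely bases and treats each start position as an independent trial, which real cDNA does not satisfy; the function name and the 2000 nucleotide default (the average coding length mentioned earlier) are illustrative, and the preferences stated in the text rest on actual database statistics rather than on this model.

```python
def occurrence_probability(k: int, sequence_length: int = 2000) -> float:
    """Approximate chance that a given k-mer occurs at least once in a random sequence."""
    positions = sequence_length - k + 1          # possible start positions
    miss_per_position = 1.0 - 4.0 ** -k          # chance a position does not match
    return 1.0 - miss_per_position ** positions

for k in (4, 5, 6, 7, 8):
    print(f"{k}-mer: {occurrence_probability(k):.3f}")
```

Under these assumptions a given 8-mer is expected in only a few percent of 2000 nucleotide sequences, too rare to be informative, while a 6-mer falls within the preferred 25-50% range.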
The preferred hybridization probes for short target
subsequences are labeled peptido-nucleic acids (PNAs).
Alternatively sets of degenerate, longer DNA oligonucleotides
are used which include as a common subsequence the target
subsequence. These degenerate sets achieve improved
hybridization specificity as compared to 4 to 6-mers. Sets
of probes, each probe distinctively and distinguishably
labeled with a fluorochrome, are hybridized in conditions of
high stringency to arrayed DNA sequence clones and optically
detected to detect the presence of target subsequences. For
example, in an embodiment wherein five fluorochromes are
simultaneously distinguished and 20 subsequence observations
are required for gene identification (a 20 bit code), any
gene in a colony can be identified in only four hybridization
steps. Alternately, efficient hybridization detection means
based on optical wave guide detection of DNA hybridization
can be used. By using differently sized and shaped particles
associated with different probes, the resultant differences
in light scattering can be used to detect hybridization of
multiple probes simultaneously with these wave guide methods.
Target subsequences can be chosen to discriminate
not only single genes but also, more coarsely, sets of genes.
Fewer target subsequences can be chosen so that a particular
pattern of hits will indicate the presence of a gene of a
particular type. Types of genes of interest might be
oncogenes, tumor suppressor genes, growth factors, cell cycle
genes, or cytoskeletal genes, etc.
In embodiments of this invention where high
stringency hybridization is specified, such conditions
generally comprise a low salt concentration, equivalent to a
concentration of SSC (173.5 g. NaCl, 88.2 g. Na Citrate, H2O
to 1 l.) of less than approximately 1 mM, and a temperature
near or above the Tm of the hybridizing DNA. In contrast,
conditions of low stringency generally comprise a high salt
concentration, equivalent to a concentration of SSC of
greater than approximately 150 mM, and a temperature below
the Tm of the hybridizing DNA.
In embodiments of this invention where DNA
oligomers are specified for performing functions, including
hybridization and chain elongation priming, alternatively
oligomers can be used that comprise those of the following
nucleotide mimics which perform similar functions.
Nucleotide mimics are subunits (other than classical
nucleotides) which can be polymerized to form molecules
capable of specific, Watson-Crick-like base pairing with DNA.
The oligomers can be DNA or RNA or chimeric mixtures or
derivatives or modified versions thereof. The oligomers can
be modified at the base moiety, sugar moiety, or phosphate
backbone. The oligomers may include other appending groups
such as peptides, hybridization-triggered cleavage agents
(see, e.g., Krol et al., 1988, BioTechniques 6:958-976), or
intercalating agents (see, e.g., Zon, 1988, Pharm. Res.
5:539-549). The oligomers may be conjugated to another
molecule, e.g., a peptide, hybridization-triggered
cross-linking agent, transport agent, hybridization-triggered
cleavage agent, etc.
The oligomers may also comprise at least one
nucleotide mimic that is a modified base moiety which is
selected from the group including, but not limited to,
5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil,
hypoxanthine, xanthine, 4-acetylcytosine,
5-(carboxyhydroxylmethyl) uracil, 5-carboxymethylaminomethyl-
2-thiouridine, 5-carboxymethylaminomethyluracil,
dihydrouracil, beta-D-galactosylqueosine, inosine,
N6-isopentenyladenine, 1-methylguanine, 1-methylinosine,
2,2-dimethylguanine, 2-methyladenine, 2-methylguanine,
3-methylcytosine, 5-methylcytosine, N6-adenine,
7-methylguanine, 5-methylaminomethyluracil,
5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine,
5'-methoxycarboxymethyluracil, 5-methoxyuracil,
2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid
(v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine,
5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil,
5-methyluracil, uracil-5-oxyacetic acid methylester,
3-(3-amino-3-N-2-carboxypropyl) uracil, and
2,6-diaminopurine. The oligomers may comprise at least one
modified sugar moiety selected from the group including but
not limited to arabinose, 2-fluoroarabinose, xylulose, and
hexose. The oligomers may comprise at least one modified
phosphate backbone selected from the group consisting of a
phosphorothioate, a phosphorodithioate, a
phosphoramidothioate, a phosphoramidate, a phosphordiamidate,
a methylphosphonate, an alkyl phosphotriester, and a
formacetal or analog thereof.
The oligomer may be an α-anomeric oligomer. An
α-anomeric oligomer forms specific double-stranded hybrids with
complementary RNA in which, contrary to the usual β-units,
the strands run parallel to each other (Gautier et al., 1987,
Nucl. Acids Res. 15:6625-6641).
Oligomers of the invention may be synthesized by
standard methods known in the art, e.g., by use of an
automated DNA synthesizer (such as are commercially available
from Biosearch, Applied Biosystems, etc.). As examples,
phosphorothioate oligos may be synthesized by the method of
Stein et al. (1988, Nucl. Acids Res. 16:3209), and
methylphosphonate oligos can be prepared by use of controlled
pore glass polymer supports (Sarin et al., 1988, Proc. Natl.
Acad. Sci. U.S.A. 85:7448-7451), etc.
In specific embodiments of this invention it is
preferable to use oligomers that can specifically hybridize
to subsequences of a DNA sequence too short to achieve
reliably specific recognition, such that a set of target
subsequences is recognized. Further, where PCR is used, as
Taq polymerase tolerates hybridization mismatches, PCR
specificity is generally less than hybridization specificity.
Where such oligomers recognizing short subsequences are
preferable, they may be constructed in manners including but
not limited to the following. To achieve reliable
hybridization to shorter DNA subsequences, degenerate sets of
DNA oligomers may be used which are constructed of a total
length sufficient to achieve specific hybridization, with each
member of the set containing a shorter sequence complementary
to the common subsequence to be recognized. Alternatively, a
longer DNA oligomer may be constructed with a shorter
sequence complementary to the subsequence to be recognized
and with additional universal nucleotides or nucleotide
mimics, which are capable of hybridizing to any naturally
occurring nucleotide. Nucleotide mimics are subunits which
can be polymerized to form molecules capable of specific,
Watson-Crick-like base pairing with DNA. Alternatively, the
oligomers may be constructed from DNA mimics which have
improved hybridization energetics compared to naturally
occurring nucleotides.
A preferred mimic is a peptido-nucleic acid ("PNA")
based on a linked N-(2-aminoethyl)glycine backbone to which
normal DNA bases have been attached (Egholm et al., 1993,
Nature 365:566-67). This PNA obeys specific Watson-Crick
base pairing, but with greater free energy of binding and
correspondingly higher melting temperatures. Suitable
oligomers may be constructed entirely from PNAs or from mixed
PNA and DNA oligomers.
In embodiments of this invention where DNA
fragments are separated by length, any length separation
means known in the art can be used. One alternative
separation means employs a sieving medium for separation by
fragment length coupled with a force for propelling the DNA
fragments through the sieving medium. The sieving medium can
be a polymer or gel, such as polyacrylamide or agarose in
suitable concentrations to separate 10-1000 bp DNA fragments.
In this case the propelling force is a voltage applied across
the medium. The gel can be disposed in electrophoretic
configurations comprising thick or thin plates or
capillaries. The gel can be non-denaturing or denaturing.
Alternatively, the sieving medium can be such as is used for
chromatographic separation, in which case a pressure is the
propelling force. Standard or high performance liquid
chromatographic ("HPLC") length separation means may be used.
An alternative separation means employs molecular
characteristics such as charge, mass, or charge to mass
ratio. Mass spectrographic means capable of separating
10-1000 bp fragments may be used.
DNA fragment lengths determined by such a
separation means represent the physical length in base pairs
between target subsequences, after adjustment for biases or
errors introduced by the separation means and length changes
due to experimental variables (e.g., presence of a detectable
label, ligation to an adapter molecule). A represented
length is the same as the physical length between occurrences
of target subsequences in a sequence from said database when
both said lengths are equal after applying corrections for
biases and errors in said separation means and corrections
based on experimental variables. For example, represented
lengths determined by electrophoresis can be adjusted for
mobility biases due to average base composition or mobility
changes due to an attached labeling moiety and/or adapter
strand by conventional software programs, such as Gene Scan
Software from Applied Biosystems, Inc. (Foster City, CA).
In embodiments of this invention where DNA
fragments must be labeled and detected, any compatible
labeling and detection means known in the art can be used.
Advances in fluorochromes, in optics, and in optical sensing
now permit multiply labeled DNA fragments to be distinguished
even if they completely overlap in space, as in a spot on a
filter or a band in a gel. Results of several recognition
reactions or hybridizations can be multiplexed in the same
gel lane or filter spot. Fluorochromes are available for DNA
labeling which permit distinguishing 6-8 separate products
simultaneously (Ju et al., 1995, Proc. Natl. Acad. Sci. USA
92:4347-4351).
Exemplary fluorochromes adaptable to this invention
and methods of using such fluorochromes to label DNA are
described in Sec. 6.11.
Single molecule detection by fluorescence is now
becoming possible (Eigen et al., 1994, Proc. Natl. Acad. Sci.
USA 91:5740-5747), and can be adapted for use.
In embodiments of this invention where
intercalating DNA dyes are utilized to detect DNA, any such
dye known in the art is adaptable. In particular such dyes
include but are not limited to ethidium bromide, propidium
iodide, Hoechst 33258, Hoechst 33342, acridine orange, and
ethidium bromide homodimers. Such dyes also include POPO,
BOBO, YOYO, and TOTO from Molecular Probes (Eugene, OR).
Finally alternative sensitive detection means
available include silver staining of polyacrylamide gels
(Bassam et al., 1991, Analytical Biochemistry 196:80-83), and
the use of intercalating dyes. In this case the gel can be
photographed and the photograph scanned by scanner devices
conventional in the computer art to produce a computer record
of the separated and detected fragments. A further
alternative is to blot an electrophoretic separating gel onto
a filter (e.g., nitrocellulose) and then to apply any
visualization means known in the art to visualize adherent
DNA. See, e.g., Kricka et al., 1995, Molecular Probing,
Blotting, and Sequencing, Academic Press, New York. In
particular, visualization means requiring secondary reactions
with one or more reagents or enzymes can be used, as can any
means employed in the CC embodiment.
A preferred separation and detection apparatus for
use in this invention is found in copending U.S. Patent
Application Serial No. 08/438,231 filed May 9, 1995, which is
hereby incorporated by reference in its entirety. Other
detection means adaptable to this invention include the
commercial electrophoresis machines from Applied Biosystems
Inc. (Foster City, CA), Pharmacia (ALF), Hitachi, and Licor. The
Applied Biosystems machine is preferred among these as it is
the only machine capable of simultaneous 4 dye resolution.
In the following subsections and the accompanying
examples sections, the QEA™ and the CC embodiments are described
in detail.
5.1. QUANTITATIVE EXPRESSION ANALYSIS
This embodiment of this invention in the tissue
35 mode preferably generates one or more signals unique to each
cDNA sequence in a mixture of cDNAs, such as may be derived
from total~cellular RNA or total cellular mRNA from a tissue
sample, and quantitatively relates the strength of such a
signal or signals to the relative amount of that cDNA
sequence in the sample or library. In the query mode, this
embodiment preferably generates signals uniquely
discriminating only a few sample sequences of interest in a
quantitative manner. Less preferably, the signals uniquely
determine only sets of a small number of sequences, typically
2 - 10 sequences. QEA™ signals comprise an indication of the
presence of pairs of target subsequences and the length
between pairs of adjacent subsequences in a DNA sample.
Alternatives include recognizing the presence of third
subsequences between the pairs of target subsequences. In a
further embodiment ("5'-QEA™"), one of the subsequences is
the true end of the protein coding sequence, in a defined
relation to the 5' cap of the source mRNA. Signals are
preferably generated in a manner permitting straightforward
automation with existing laboratory robots. For simplicity
of disclosure, and not by way of limitation, the detailed
description of this method is directed to the analysis of
samples comprising a plurality of cDNA sequences. It is
equally applicable to samples comprising a single sequence or
samples comprising sequences of other types of DNA or nucleic
acids generally.
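As an illustration of what such a signal contains, the following minimal sketch lists, for one sequence and two target subsequences, each pair of adjacent hits together with the length between them. The function names are hypothetical, the toy sequence is invented, and measuring length from the start of one hit to the start of the next is an assumption made for simplicity, not a statement of how the method defines fragment length.

```python
import re


def site_positions(seq: str, site: str) -> list[int]:
    """Start positions of every (possibly overlapping) occurrence of `site`."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", seq)]


def pair_signals(seq: str, site_a: str, site_b: str) -> list[tuple[str, str, int]]:
    """(left site, right site, length) for each pair of adjacent hits in `seq`."""
    hits = sorted(
        [(pos, site_a) for pos in site_positions(seq, site_a)]
        + [(pos, site_b) for pos in site_positions(seq, site_b)]
    )
    # Adjacent hits bound one fragment; length is start-to-start (an assumption).
    return [(s1, s2, p2 - p1) for (p1, s1), (p2, s2) in zip(hits, hits[1:])]


# Toy sequence containing GAATTC (EcoRI site) twice and GGATCC (BamHI site) once.
toy = "AAGAATTCAAAAAGGATCCTTTGAATTCAA"
for left, right, length in pair_signals(toy, "GAATTC", "GGATCC"):
    print(left, right, length)
```

A real implementation would also correct lengths for adapters, labels, and mobility biases, as discussed above for the separation means.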
While described in terms of cDNA hereinbelow, it
will be understood that the DNA sample can be cDNA and/or
genomic DNA, and preferably comprises a mixture of DNA
sequences. In specific embodiments, the DNA sample is an
aliquot of cDNA of total cellular RNA or total cellular mRNA,
most preferably derived from human tissue. The human tissue
can be diseased or normal. In one embodiment, the human
tissue is malignant tissue, e.g., from prostate cancer,
breast cancer, colon cancer, lung cancer, lymphatic or
hematopoietic cancers, etc. In another embodiment, the
tissue may be derived from in vivo animal models of disease
or other biologic processes. In these cases the diseases
modeled can usefully include, as well as cancers, diabetes,
obesity, the rheumatoid or autoimmune diseases, etc. In yet
another embodiment, the samples can be derived from in vitro
cultures and models. This invention can also be
advantageously applied to examine gene expression in plants,
yeasts, fungi, etc.
The cDNA, or the mRNA from which it is synthesized,
must be present at some threshold level in order to generate
signals, this level being determined to some degree by the
conditions of a particular QEA™ experiment. For example,
such a threshold is that preferably at least 1000, and more
preferably at least 10,000, mRNA molecules of the sequence to
be detected be present in a sample. In the case where one or
only a few mRNAs of a type of interest are present in each
cell of a tissue from which it is desired to derive the
sample mRNA, at least a corresponding number of such cells
should be present in the initial tissue sample. In a
specific embodiment, the mRNA detected is present in a ratio
to total sample RNA of 1:10^5 to 1:10^6. With a lower ratio,
more molecular amplification can be performed during a QEA™
experiment.
The cDNA sequences occurring in a tissue derived
pool include short untranslated sequences and translated
protein coding sequences, which, in turn, may be a complete
protein coding sequence or some initial portion of a coding
sequence, such as an expressed sequence tag. A coding
sequence may represent an as yet unknown sequence or gene or
an already known sequence or gene entered into a DNA sequence
database. Exemplary sequence databases include those made
available by the National Center for Biotechnology
Information ("NCBI") (Bethesda, MD) (GenBank) and by the
European Bioinformatics Institute ("EMBL") (Hinxton Hall,
UK).
A QEA™ method is also applicable to samples of
genomic DNA in a manner similar to its application to cDNA.
In gDNA samples, information of interest includes occurrence
and identity of translocations, gene amplifications, loss of
heterozygosity for an allele, etc. This information is of
interest in cancer diagnosis and staging. In cancer
patients, amplified sequences might reflect an oncogene,
while loss of heterozygosity might reflect a tumor suppressor
gene. Such sequences of interest can be used to select
target subsequences and to predict signals generated by a
QEA™ experiment. Even without prior knowledge of the
sequences of interest, detection and classification of QEA™
signal patterns is useful for the comparison of normal and
diseased states or for observing the progression of a disease
state. Gene expression information concerning the
progression of a disease state is useful in order to
elucidate the genetic mechanisms behind disease, to find
useful diagnostic markers, to guide the selection and observe
the results of therapies, etc. Signal differences identify
the gene or genes involved, whether already known or yet to
be sequenced.
Classification of QEA™ signal patterns, in an
exemplary embodiment, can involve statistical analysis to
determine significant differences between patterns of
interest. This can involve first grouping samples that are
similar in one or more characteristics, such characteristics
including, for example, epidemiological history,
histopathological state, treatment history, etc. Signal
patterns from similar samples are then compared, e.g., by
finding the average and standard deviation of each individual
signal. Individual signals which are of limited
variability, for which the standard deviation is less than
the average, then represent genetic constants of samples of
this particular characteristic. Such limited variability
signals from one set of tissue samples can then be compared
to limited variability signals from another set of tissue
samples. Signals which differ in this comparison then
represent significant differences in the genetic expression
between the tissue samples and are of interest in reflecting
the biological differences between the samples, such as the
differences caused by the progression of a disease. For
example, a significant difference in expression is detected
when the difference in the genetic expression between two
tissues exceeds the sum of the standard deviations of the
expressions in the tissues. Other standard statistical
comparisons can also be used to establish levels of expression
and the significance of differences in levels of expression.
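The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: each sample is represented as a dictionary from a signal identifier to a measured intensity, the identifiers and values are invented, and the function names are hypothetical. The two rules applied are the ones given in the text: keep signals whose standard deviation is less than their average, then flag signals whose difference in averages exceeds the sum of the standard deviations.

```python
from statistics import mean, stdev


def summarize(samples):
    """Per-signal (average, sample standard deviation) over one group of samples."""
    keys = samples[0].keys()
    return {k: (mean([s[k] for s in samples]), stdev([s[k] for s in samples])) for k in keys}


def limited_variability(summary):
    """Keep only signals whose standard deviation is less than their average."""
    return {k: (m, sd) for k, (m, sd) in summary.items() if sd < m}


def significant_differences(group_a, group_b):
    """Signals whose averages differ by more than the sum of their standard deviations."""
    a = limited_variability(summarize(group_a))
    b = limited_variability(summarize(group_b))
    result = {}
    for key in a.keys() & b.keys():
        (mean_a, sd_a), (mean_b, sd_b) = a[key], b[key]
        if abs(mean_a - mean_b) > sd_a + sd_b:
            result[key] = (mean_a, mean_b)
    return result


# Invented intensities for three signals in two groups of three samples each.
normal = [
    {"f120": 10.0, "f245": 50.0, "f300": 5.0},
    {"f120": 12.0, "f245": 48.0, "f300": 30.0},
    {"f120": 11.0, "f245": 52.0, "f300": 1.0},
]
diseased = [
    {"f120": 40.0, "f245": 49.0, "f300": 2.0},
    {"f120": 44.0, "f245": 51.0, "f300": 3.0},
    {"f120": 42.0, "f245": 50.0, "f300": 4.0},
]
print(significant_differences(normal, diseased))
```

In this toy data the signal "f300" is too variable in the first group to be treated as a constant, and "f245" does not differ between groups, so only "f120" is reported.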
Target subsequence choice is important in the
practice of this invention. The two primary considerations
for selecting subsequences are, first, redundancy, that is,
that there be enough target subsequence pair occurrences
(also known as "hits") per gene that a unique signal is
likely to be generated for each sample sequence, and second,
resolution, that is, that there not be so many target subsequence
pair occurrences with very similar lengths in a sample that
the signals cannot be resolved. For sufficient redundancy,
it is preferable that there be, on average, approximately
three target subsequence pair hits per gene or DNA sequence
in the sample. It is highly preferable that there be a
minimum of at least one pair hit per gene. In tests of a
database of eukaryotic expressed sequences, it has been found
that an average value of three pair hits per gene appears to
be generally a sufficient guarantee of this minimum
criterion.
Sufficient resolution depends on the separation and
detection means chosen. For a particular choice of
separation and detection means, a recognition reaction
preferably should not generate more fragments than can be
separated and distinguishably detected. In a preferred
embodiment, gel electrophoresis is the separation means used
to separate DNA fragments by length. Existing
electrophoretic techniques allow an effective resolution of
three base pair ("bp") length differences in sequences of up
to 1000 bp length. Given knowledge of fragment base
composition, effective resolution down to 1 bp is possible by
predicting and correcting for the small differences in
mobility due to differing base composition. However, and
without limitation, an easily achievable three bp resolution
is assumed by way of example in the description of the
invention herein. It is preferable for increased detection
efficiency that the distinguishably labeled products from as
many recognition reactions as possible be combined for
separation in one gel lane. This combination is limited by
the number of labels distinguishable by the employed
detection means. Any alternative means for separation and
detection of DNA fragments by length, preferably with
resolution of three bp or better, can be employed. For
example, such separation means can be thick or thin plate or
column electrophoresis, column chromatography or HPLC, or
physical means such as mass spectroscopy.
The redundancy and resolution criteria are
probabilistically expressed in Eqns. 1 and 2 in an
approximation adequate to guide subsequence choice. In these
equations the number of genes in the cDNA sequence mixture is
N, the average gene length is L, the number of target
subsequence pairs is M (the number of pairs of recognition
means), and the probability of each target subsequence
occurring in, or hitting, a typical sample sequence is p.
Since each target subsequence is preferably selected to
occur independently in each sample sequence, the probability
of occurrence of an arbitrary subsequence pair is then p^2.
Eqn. 1 expresses the redundancy condition of three pair
occurrences per sample sequence, assuming the probability of
occurrence of each target subsequence is independent.
$$M p^{2} = 3 \qquad (1)$$
Eqn. 2 expresses the resolution condition of having fragments
with lengths no closer on average than 3 base pairs. Each
subsequence pair hits on average N p^2 sample sequences, and
the resulting fragments are spread over lengths up to L. This
equation approximates the actual fragment length distribution
with a uniform distribution.
$$\frac{L}{N p^{2}} = 3 \qquad (2)$$
Given expected values of N, the number of sequences in the
library or sample to analyze (library complexity), and L, the
average expressed sequence (or gene) length, Eqns. 1 and 2 are
solved for the subsequence occurrence probability and number
of subsequences required. This solution depends on the
particular redundancy and resolution criteria dictated by the
particular experimental method chosen to implement QEA™.
Alternative values may be required for other implementations
of this embodiment.
For example, it is estimated that the entire human
genome contains approximately 10^5 protein coding sequences
with an average length of 2000. The solution of Eqns. 1 and 2
for these parameters is p = 0.082 and M = 450. Thereby the
expression of all genes in all human tissues can be analyzed
with 450 target subsequence pairs, each subsequence having an
independent probability of occurrence of 8.2%. In an
embodiment in which eight fluorescently labeled subsequence
pairs can be optically distinguished and detected per
electrophoresis lane, such as is possible when using the
separation and detection apparatus described in copending
U.S. Patent Application Serial No. 08/438,231 filed May 9,
1995, 450 reactions can be analyzed in only 57 lanes.
Thereby only one electrophoresis plate is needed in order to
completely determine all human genome expression levels.
Since the best commercial machines known to the applicants
can discriminate only four fluorescent labels in one lane, a
corresponding increase in the number of lanes is required to
perform a complete genome analysis with such machines.
As a further example, it is estimated that a typically
complex human tissue expresses approximately 15,000 genes.
The solution for N = 15000 and L = 2000 is p = 0.21 and M =
68. Thus expression in a typical tissue can be analyzed with
68 target subsequence pairs, each subsequence having an
independent probability of occurrence of 21%. Assuming 4
subsequence pairs can be run per gel electrophoresis lane,
the 68 reactions can be analyzed in 17 lanes in order to
determine the gene expression frequencies in any human
tissue. Thus it is clear that this method leads to greatly
simplified quantitative gene expression analysis within the
capabilities of existing electrophoretic systems.
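The two worked examples can be reproduced with a short calculation. The sketch below assumes the redundancy condition M p^2 = 3 and the resolution condition L / (N p^2) = 3, which together give p = sqrt(L / (3 N)) and M = 9 N / L; the function names and default arguments are hypothetical and are exposed only so that other redundancy or resolution targets can be tried.

```python
import math


def design_parameters(n_sequences, mean_length, hits_per_sequence=3.0, resolution_bp=3.0):
    """Solve M * p**2 = hits and L / (N * p**2) = resolution for (p, M)."""
    # Resolution condition fixes p**2; redundancy condition then fixes M.
    p = math.sqrt(mean_length / (resolution_bp * n_sequences))
    m = hits_per_sequence * resolution_bp * n_sequences / mean_length
    return p, m


def lanes_needed(n_pairs, labels_per_lane):
    """Electrophoresis lanes needed when each lane resolves `labels_per_lane` reactions."""
    return math.ceil(n_pairs / labels_per_lane)


# Whole-genome example: N = 100000 sequences, L = 2000, eight labels per lane.
p_genome, m_genome = design_parameters(100_000, 2000)
print(round(p_genome, 3), math.ceil(m_genome), lanes_needed(math.ceil(m_genome), 8))

# Single-tissue example: N = 15000 sequences, L = 2000, four labels per lane.
p_tissue, m_tissue = design_parameters(15_000, 2000)
print(round(p_tissue, 2), math.ceil(m_tissue), lanes_needed(math.ceil(m_tissue), 4))
```

M is rounded up to a whole number of subsequence pairs before the lane count is computed, which is how 67.5 becomes 68 pairs in the tissue example.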
These equations provide an adequate guide to
picking subsequence pairs. Typically, preferred
probabilities of target subsequence occurrence are from
approximately 0.01 to 0.30. Probabilities of occurrence of
specified subsequences and RE recognition sites can be
determined from databases of DNA sample sequences. Example
6.2 lists these probabilities for exemplary RE recognition
sites. Appropriate target subsequences can be selected from
these tables. Computer implemented QEA™ experimental design
methods can then optimize this initial selection.
Another use of QEA™ is to compare directly the
expression of only a few genes or sample sequences, typically
1 to 10, between two different tissues, the query mode,
instead of seeking to determine the expression of all genes
in a tissue, the tissue mode. In this query mode, a few
target subsequences are selected to discriminate the genes of
interest both among themselves and from all other sequences
possibly present. The computer design methods described
hereinbelow can make this selection. If 4 subsequence pairs
are sufficient for identification, then the fragments from
the 4 recognition reactions performed on each tissue are
preferably separated and detected on two separate lanes in
the same gel. If 2 subsequence pairs are sufficient for
identification, the two tissues are preferably analyzed in
the same gel lane. Such comparison of signals from the same
gel improves quantitative results by eliminating measurement
variability due to differences between separate
electrophoretic runs. For example, expression of a few
target genes in diseased and normal tissue samples can be
rapidly and reliably analyzed.
The query mode of QEA™ is also useful even if the
sequences of the particular genes of interest are not yet
known. Differentially expressed features can be identified
by comparing the results of QEA™ reactions applied to two
different samples. In the case where the separation and
detection of reaction products is by gel electrophoresis,
such a comparison can be done by comparing gel bands or
fluorescent traces of exiting fragments. Such differentially
expressed features can then be retrieved from the gel by methods
known in the art (e.g., electro-elution from the gel) and the
DNA fragments analyzed by conventional techniques, such as by
sequencing. Such sequences, which are typically partial, can
then be used as probes (e.g., in PCR or Southern blot
hybridization) to recover full-length sequences. In this
manner, QEA™ techniques can guide the discovery of new
differentially expressed cDNA or of changes of the state of
gDNA. The sequences of the newly identified genes, once
determined, can then be used to guide QEA™ target subsequence
choice for further analysis of the differential expression of
the new genes.
Alternative embodiments of QEA™ are described
herein, differing primarily in how the recognition means
recognize the target subsequences. Associated with these
primary differences are secondary differences in how signals
are generated from the recognition means. In the PCR
embodiment, target subsequences are recognized by oligomers
which hybridize to the DNA target subsequences and act as PCR
primers for the amplification of the segments between
adjacent primer pairs. Amplified fragments from a sample are
separated preferably by electrophoresis. Selection of target
subsequences, the primer hybridizing sites, meeting the
probability of occurrence and independence criteria is
preferably made from a database containing sequences expected
to be present in the samples to be analyzed, for example
human GenBank sequences, and optimized by the computer
implemented experimental design methods. In a preferred
embodiment, subsequence selection begins by compiling
oligomer frequency tables containing the frequencies of,
preferably, all 4 to 8-mers by using a sequence database.
From these tables, target subsequences with the necessary
probabilities of occurrence according to Eqns. 1 and 2 are
selected and checked for independence, by, for example,
checking that the conditional probability for occurrence by
any selected pair of subsequences is the product of the
probabilities of occurrences of the individual subsequences
of the pair. An initial selection can be optimized to
determine target subsequence sets producing unique fragments
from the greatest number of sample sequences. PCR primers
are synthesized with a 3' end complementary to the chosen
subsequences and used in the PCR embodiment. Example 6.1
illustrates the signals output by this method in a specific
example.
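The frequency table and independence check described above can be sketched as follows. This is an illustration under stated assumptions only: the database is a toy list of invented sequences, the function names are hypothetical, the probability of occurrence is taken as the fraction of database sequences containing the subsequence at least once, and the independence check compares the joint probability of two subsequences occurring in the same sequence with the product of their individual probabilities.

```python
def occurrence_probability(sequences, kmer):
    """Fraction of sequences that contain `kmer` at least once."""
    return sum(kmer in s for s in sequences) / len(sequences)


def joint_probability(sequences, kmer_a, kmer_b):
    """Fraction of sequences that contain both `kmer_a` and `kmer_b`."""
    return sum((kmer_a in s) and (kmer_b in s) for s in sequences) / len(sequences)


def independence_ratio(sequences, kmer_a, kmer_b):
    """Joint probability divided by the product of marginals; near 1.0 suggests independence."""
    pa = occurrence_probability(sequences, kmer_a)
    pb = occurrence_probability(sequences, kmer_b)
    if pa == 0 or pb == 0:
        return None  # ratio undefined when either subsequence never occurs
    return joint_probability(sequences, kmer_a, kmer_b) / (pa * pb)


# Toy database of four invented sequences.
database = ["ACGTAA", "ACGTTT", "GGGTAA", "GGGCCC"]
print(occurrence_probability(database, "ACGT"))
print(occurrence_probability(database, "GTAA"))
print(independence_ratio(database, "ACGT", "GTAA"))
```

On a real database the ratio would rarely be exactly 1.0, so a tolerance around 1.0 would have to be chosen; the text does not specify one.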
The preferred embodiment uses DNA binding proteins,
specifically REs, including Type IIS REs, to recognize and
cleave sample sequences at the target subsequences. Desired
fragments, with lengths dependent only on source cDNA
sequence, are amplified by an amplification means in order to
dilute remaining, unwanted fragments with indefinite lengths.
Typically, but without limitation, desired fragments are
doubly cut by REs whereas unwanted fragments are singly cut.
But in 5'-QEA™, singly cut fragments have a definite length
and are of interest. Unwanted singly cut fragments can be
removed by affinity means (e.g., biotin labeling), physical
means (e.g., hydroxyapatite column separation), or enzymatic
means (e.g., single strand specific nucleases). Sufficient
removal of the unwanted singly cut ends from the desired
doubly cut fragments can permit fragment detection without an
amplification step. For the RE alternative embodiments, the
possible target subsequences, although limited to recognition
sites of available REs, can be selected in a manner similar
to the above in order to meet the previous probability of
occurrence and independence criteria as closely as possible.
For example, the probabilities of occurrence of various RE
recognition sites can be determined from a database of
potential sample sequences, and those REs chosen with
recognition subsequences whose probabilities of occurrence
meet the criterion of Eqns. 1 and 2 as closely as possible.
If multiple REs satisfy the selection criteria, a subset is
selected by including only those REs with independently
occurring recognition subsequences, determined, for example,
in the previous manner using conditional probabilities of
occurrence. An initial choice can be optionally optimized by
the computer implemented experimental design methods.
A number, R_e, of REs are preferably selected so that
the number of RE pairs is approximately M, as determined from
Eqn. 1, where the relation between M and R_e is given by Eqn.
3.
$$M = \frac{R_e (R_e + 1)}{2} \qquad (3)$$
For example, a set of 20 acceptable REs results in 210
subsequence pairs.
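The relation between the number of REs and the number of RE pairs can be checked numerically. The sketch below assumes that pairs are counted with an enzyme allowed to pair with itself, i.e. R_e (R_e + 1) / 2, which is consistent with 20 REs giving 210 pairs; the function names are hypothetical.

```python
def pair_count(n_enzymes: int) -> int:
    """Number of unordered RE pairs, counting an enzyme paired with itself."""
    return n_enzymes * (n_enzymes + 1) // 2


def enzymes_needed(n_pairs: int) -> int:
    """Smallest number of REs whose pair count reaches `n_pairs`."""
    n_enzymes = 0
    while pair_count(n_enzymes) < n_pairs:
        n_enzymes += 1
    return n_enzymes


print(pair_count(20))       # the example from the text
print(enzymes_needed(450))  # whole-genome estimate of M
print(enzymes_needed(68))   # single-tissue estimate of M
```

Because the pair count grows quadratically, a modest panel of enzymes covers the values of M estimated earlier from Eqns. 1 and 2.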
There are numerous REs currently available, whose
recognition sequences have a wide range of occurrence
probabilities, from which REs can be selected for the present
invention. Exemplary REs are listed in Sec. 6.2.
The PCR and the RE embodiments have different
accuracy and flexibility characteristics. RE embodiments are
generally more accurate, with fewer false positive and false
negative identifications, since the enzymatic recognition and
subsequent ligation reactions are generally more specific
than the hybridization of short PCR primers to their
subsequence targets, even under stringent hybridization
conditions.
Restriction endonucleases ("RE") generally bind
with specificity only to their four to eight bp recognition
sites, cleaving the DNA preferably with an at least 2 bp
overhang. Although it is preferable that REs used produce
overhangs of known sequence and characteristic of the
particular RE, other REs, such as those known as class IIS
restriction enzymes, which produce overhangs of unknown
sequence can be used to extend initial target subsequences
into longer effective target subsequences. Phasing primers
can also be used to recognize longer effective target
subsequences. Overhangs of the initial REs can be
specifically recognized by hybridization of an adapter
followed by ligation of one strand of this adapter, the
CA 0223~860 1998-04-24
W O 97/15690 PCTrUS96/17159
amplification primer. The ligase enzymes, which are used in
this alternative embodiment of this invention to ligate the
amplification primer, are highly specific in their
~hybridization requirements; even one bp mismatch near the
-5 ligation site will prevent ligation (U.S. Patent 5,366,877,
Nov. 22, 1994, to Keith et al.; U.S. Patent 5,093,245, Mar.
3, l99Z, to Keith et al . ) . On the other hand, PCR and the
preferred Taq polymerase used therein tolerates hybridization
mis-matches of elongation primers. Thus, PCR embodiments can
10 generate false positive signals which arise from mis-matches
in the hybridization of the oligomer probes to the target
subsequences. However, the PCR embodiments are more flexible
since any desired subsequence can oe a target subsequence.
The RE embodiment is limited to the recognition se~uences of
15 acceptable REs. However, more than 150 to 200 REs are now
com~ercially available recognizing a wide variety of
nucleotide sequences.
QEA™ experiments are also adaptable to distinguish
sample sequences into small sets, typically comprising 2 to
10 sequences. Such coarser grain analysis requires fewer
subsequence pairs, fewer recognition reactions, and less
analysis time. Alternatively, smaller numbers of target
subsequence pairs can be optimally chosen to distinguish
individually a specific set of sequences of interest from all
the other sequences in a sample. These target subsequences
can be chosen either from REs that produce fragments from the
specific sample sequences or, in the case of the PCR
embodiment, from a set of subsequences optimized for this
specific set of sequences.
Detailed descriptions of exemplary implementations
for practicing QEA™ recognition reactions and the computer
implemented experimental analysis and design methods are
presented in the following subsections. Detailed
experimental protocols appear in Sec. 6. These
implementations are illustrative and not limiting, as this
embodiment of the invention may be practiced by any method
generating the previously described QEA™ signals.
5.2. RE EMBODIMENTS OF QEA™
The preferred restriction endonuclease ("RE")
embodiments of QEA™ use novel simultaneous RE and ligase
enzymatic reactions, known as recognition reactions, for
generating labeled fragments of the sample sequences to be
analyzed. These labeled fragments are then optionally
amplified by an amplification means, separated according to
length by a separation means, and detected by a detection
means to yield QEA™ signals comprising the identity of the
REs cutting each fragment together with each fragment's
length. The RE/ligase subsequence recognition reactions can
specifically and reproducibly generate QEA™ signals with good
signal to noise ratios. Preferred protocols for this
reaction perform all steps, including amplification, in a
single tube without any intermediate extractions or buffer
exchanges. This protocol is preferably automatically
performed by standard laboratory robots.
REs bind with specificity to short DNA target
subsequences, usually 4 to 8 bp long, that are termed
"recognition sites" and are characteristic of each RE. REs
that are used cut the sequence at (or near) these recognition
sites, preferably producing characteristic ("sticky") ends
with single-stranded overhangs, which usually incorporate
part of the recognition site. Type IIS REs, which cut
outside of their recognition site, can be used to extend the
initial target subsequence to a longer effective target
subsequence for use in the computer implemented database
lookup.
Preferred REs have a 6 bp recognition site and
generate a 4 bp 5' overhang. Less preferred REs generate a 2
bp 5' overhang. These are less preferred since 2 bp
overhangs have a lower ligase substrate activity than 4 bp
overhangs. All RE embodiments can be adapted to 3' overhangs
of two and four bp. In order that an amplification primer
hybridization site can be presented on each of the two
strands of the product of the RE/ligase recognition reaction,
as is necessary for exponential amplification, REs
generating 5' and 3' overhangs are preferably not used in the
same recognition reaction. Further, preferred REs have the
following additional properties. Their recognition sites and
overhang sequences are preferably such that an amplification
primer can be designed whose ligation to a cut end does
not recreate the recognition site. They preferably have
sufficient activity below 37°C, and particularly at 16°C, the
optimal ligase temperature, to cut unwanted ligation
products, and are heat inactivated at 65°C and above so that
PCR amplification can be performed by simply adding PCR
reagents to the RE/ligase reaction mix. They preferably have
low non-specific cutting and nuclease activities and cut to
completion. The REs selected for a particular experiment
preferably have recognition sites meeting the previously
described occurrence and independence criteria. Preferred
pairs of REs for analyzing human and mouse cDNA are listed in
Sec. 6.10.
Only cDNA fragments with definite and reproducible
lengths dependent only on the source cDNA sequence and
independent of cDNA synthesis conditions are of interest.
Only such fragments of definite length are adaptable to the
experimental analysis methods in order to determine their
originating sample sequence. cDNA fragments doubly cut, on
each end, by REs have a length dependent only on the
sequence of the originating cDNA and are, therefore, of
interest. cDNA fragments singly cut on their 5' end by an RE
and terminated on their 3' end by the poly(A) tail have
variable and non-reproducible lengths that depend strongly on
cDNA synthesis conditions. Such fragments singly cut on one
end by an RE and with a variable length tail on the other are
not of interest. To separate signals from doubly cut
fragments from the unwanted signals from singly cut
fragments, certain RE embodiments of QEA™ exponentially
amplify doubly cut fragments, while only linearly amplifying
singly cut fragments. This amplification is preferably done
by the PCR method. Other RE embodiments separate singly and
doubly cut fragments with a removal means targeted at either
type of fragment. The preferred removal means comprises a
biotin capture moiety and a streptavidin binding partner.
The removal means can either supplement or replace
differential amplification. On the other hand, cDNA
fragments singly cut on their 3' end by an RE and terminated
on their 5' end by a sequence in a fixed relation to the 5'
cap of the source mRNA also have definite lengths and are of
interest. Such fragments can be generated according to a
method herein called 5'-QEA™, which comprises synthesizing
cDNA according to the protocol of Sec. 6.3.3, performing
recognition reactions, and separating the fragments of
interest by a removal means. Alternatively, fragments are
also of interest if they have a definite, sequence dependent
length by being singly cut on their 5' end and by being
terminated in a fixed relation with respect to the beginning
of the 3' poly(A)+ tail.
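The differential amplification described above, exponential for
doubly cut fragments and linear for singly cut fragments, can be
illustrated with an idealized copy-number model. This is a
sketch under simplifying assumptions that are not taken from the
source: a constant per-cycle efficiency, no reagent depletion,
and a hypothetical function name.

```python
def copies_after_pcr(cycles, doubly_cut, efficiency=1.0):
    """Idealized copy number per starting template after a number of PCR cycles.

    Doubly cut fragments carry a primer site on both strands, so every
    product is itself a template and the count grows geometrically.
    Singly cut fragments are primed on one end only, so only the original
    template is copied each cycle and the count grows linearly.
    """
    if doubly_cut:
        return (1.0 + efficiency) ** cycles
    return 1.0 + efficiency * cycles


for n in (10, 20):
    doubly = copies_after_pcr(n, doubly_cut=True)
    singly = copies_after_pcr(n, doubly_cut=False)
    print(n, doubly, singly, doubly / singly)
```

Even at modest cycle numbers the ratio between the two classes is
large, which is why signals from singly cut fragments are
relatively suppressed without any physical removal step.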
This invention is adaptable to alternative
amplification means known in the art. If a removal means for
unwanted singly cut fragments is not utilized, alternative
amplification means must preferentially amplify doubly cut
fragments with respect to singly cut fragments, in order that
signals from singly cut fragments be relatively suppressed.
On the other hand, if a removal means for singly cut
fragments is utilized in an embodiment, then alternative
amplification means can less preferably have no amplification
preference. In RE embodiments using a removal means, this
means can be used either to remove the singly or the doubly
cut fragments. Known alternative amplification means are
listed in Kricka et al., 1995, Molecular Probing, Blotting,
and Sequencing, chap. 1 and table IX, Academic Press, New
York. Of these alternative means, those employing the T7 RNA
polymerase are preferred.
Certain other embodiments use a physical removal
means to directly remove unwanted singly cut fragments,
preferably before amplification. Singly cut fragment removal
can be accomplished, e.g., by labeling DNA termini with a
capture moiety prior to digestion, as by synthesizing the
cDNA with biotinylated primers. After digestion, the singly
cut fragments are then removed by contacting the sample with
a binding partner of the capture moiety, affixed to a solid
phase. Alternatively, the doubly cut fragments can be
labeled with a capture moiety, as by amplifying the fragments
with primers one of which is labeled with a capture moiety.
The amplification products are contacted with a binding
partner affixed to a solid support, washed, and then
denatured. Thereby, only doubly cut fragments, one end of
which is labeled with a capture moiety, are separated.
Alternately, single stranded fragments can be removed by
single strand specific column separation or single strand
specific nucleases.
This invention is applicable to any removal means
meeting the following minimal requirements. The removal
means includes a capture moiety and a binding partner. The
capture moiety is capable of conjugation to DNA oligomers
without disruption of hybridization or chain elongation
reactions. The binding partner is capable of attachment to a
solid phase support and can bind the capture moiety to such a
support in DNA denaturing conditions. The preferred removal
means is biotin-streptavidin. Other removal means adaptable
to this invention include various haptens, which are removed
by their corresponding antibodies. Exemplary haptens include
digoxigenin, DNP, and fluorescein (Holtke et al., 1992,
Sensitive chemiluminescent detection of digoxigenin labeled
nucleic acids: a fast and simple protocol for applications,
Biotechniques 12(1):104-113 and Olesen et al., 1993,
Chemiluminescent DNA sequencing with multiple labeling,
Biotechniques 15(3):480-485).
RE/ligase embodiments of QEA™ use recognition
moieties. In any one recognition reaction, each recognition
moiety is capable of hybridizing with and being ligated to
overhangs cut by only one RE. Thereby, the recognition
sequence of that RE is identified. Recognition moieties
typically comprise partially double stranded DNA oligomers,
each oligomer capable of specifically hybridizing with only
one RE generated sticky end in one recognition reaction. In
the RE/ligase embodiment using PCR amplification, the
recognition moieties also provide primer means for the PCR
and thereby also provide for labeling and recognition of RE
cut ends. For example, using a pair of REs in one
recognition reaction generates doubly cut fragments, some with
the recognition sequence of the first RE on both ends, some
with the recognition sequence of the second RE on both ends,
and the remainder with one recognition sequence of each RE on
either end. Using more REs generates doubly cut fragments
with all pair-wise combinations of RE cut ends from adjacent
RE recognition sites along the sample sequences. All these
cutting combinations need preferably to be distinguished,
since each provides unique information on the presence of
different subsequence pairs, the RE recognition sites,
present in the original cDNA sequence. Thus the recognition
moieties preferably have unique labels which label
specifically each RE cut made in a reaction. As many REs can
be used in a single reaction as labeled recognition moieties
are available to uniquely label each RE cut. If the
detectable labeling in a particular system is, for example,
by fluorochromes, then fragments cut with one RE have a
single fluorescent signal from the one fluorochrome
associated with that RE, while fragments cut with two REs
have mixed signals, one from the fluorochrome associated with
each RE. Thus all possible pairs of fluorochrome labels are
preferably distinguishable. Alternatively, if certain target
subsequence information is not needed, the recognition
moieties need not be distinctively labeled. In embodiments
using PCR amplification, corresponding primers would not be
labeled. If silver staining is used to recognize fragments
separated on an electrophoresis gel, no recognition moiety
need be labeled, as fragments cut by the various RE
combinations are not distinguishable.
The recognition reaction conditions are preferably
selected, as described in Sec. 6.4, so that RE cutting and
recognition moiety ligation go to full completion: all
recognition sites of all REs in the reaction are cut and
ligated to a recognition moiety. In this manner, the
fragments generated from a sequence analyzed lie only between
adjacent recognition sites of any RE in that reaction. No
fragments remain which include an internal RE recognition
site. Multiple REs can be used in one recognition reaction.
Too many REs in one reaction can cut the sequences too
frequently, generating a compressed length distribution with
many short fragments of lengths between 10 and a few hundred
base pairs that are not clearly resolvable by the
separation means. For example, for gel electrophoresis,
fragments should not be closer in length than 3 bp on the
average. Too many REs also can
generate fragments of the same length and end subsequences
from different sample sequences. Finally, where fragment
labels are to be distinguished, no more REs can be used than
can have distinguishably labeled sticky ends. These
considerations limit the number of REs optimally useable in
one recognition reaction. Preferably two REs are used, with
one, three and four REs less preferable. Preferable pairs of
REs for the analysis of human cDNA samples are listed in Sec.
6.10.
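The consequence of complete digestion, namely that every fragment
lies between adjacent recognition sites of any RE in the
reaction, can be illustrated with a small in-silico digest. This
is a sketch only. The function names and the test sequence are
hypothetical, each enzyme is described by its recognition site
and an assumed top-strand cut offset, and fragment length is taken
as the distance between adjacent cut positions, ignoring overhang
geometry and the length added by ligated adapters.

```python
def find_cuts(sequence, enzymes):
    """Return sorted (cut_position, enzyme_name) pairs for every site occurrence."""
    cuts = []
    for name, (site, offset) in enzymes.items():
        start = sequence.find(site)
        while start != -1:
            cuts.append((start + offset, name))
            start = sequence.find(site, start + 1)
    return sorted(cuts)


def doubly_cut_fragments(sequence, enzymes):
    """Fragments between adjacent cuts as (left_enzyme, right_enzyme, length)."""
    cuts = find_cuts(sequence, enzymes)
    return [
        (left_name, right_name, right_pos - left_pos)
        for (left_pos, left_name), (right_pos, right_name) in zip(cuts, cuts[1:])
    ]


# Recognition site and assumed top-strand cut offset for each enzyme.
enzymes = {"BamHI": ("GGATCC", 1), "HindIII": ("AAGCTT", 1)}

# Hypothetical sample sequence with three recognition sites.
sample = "TTTT" + "GGATCC" + "C" * 10 + "AAGCTT" + "T" * 20 + "GGATCC" + "TTTT"

for fragment in doubly_cut_fragments(sample, enzymes):
    print(fragment)
```

The two terminal pieces of the sequence are cut on one side only
and are therefore not reported, mirroring the exclusion of singly
cut fragments.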
An additional level of sample sequence
discrimination is possible by detecting occurrences of
internal subsequences (here called "third target
subsequences"). The presence or absence of a third internal
subsequence can be used in the computer implemented
experimental analysis methods of this invention along with
identification of the two end subsequences and the fragment
length to further discriminate the origin of otherwise
identical fragment signals.
Fragments with specific third internal subsequences
can be detected by either labeling or suppressing such
fragments or with Type IIS REs. To label fragments with a
third internal subsequence, probes with distinguishable
labels which bind to this target subsequence are added to the
fragments prior to detection, and alternatively prior to
separation and detection. On detection, fragments with this
third subsequence present will generate a signal, preferably
fluorescent, from the probe. Such a probe could be a labeled
PNA or DNA oligomer. Short DNA oligomers may need to be
extended with a universal nucleotide or degenerate sets of
natural nucleotides in order to provide for specific
hybridization. Fragments with a third subsequence can be
suppressed in various manners. The absence of such fragments
is determined by comparing a recognition reaction without the
suppressing factors with a reaction with the suppressing
factors. First, in embodiments using PCR amplification, a
probe hybridizing with this third subsequence which prevents
polymerase elongation in PCR can be added prior to
amplification. Then sequences with this subsequence will be
at most linearly amplified and their signal thereby
suppressed. Such a probe could be a PNA or modified DNA
oligomer (with the 3' nucleotide being a ddNTP). Second, if
the third subsequence is recognized by an RE, this RE can be
added to the RE/ligase reaction without any corresponding
specific primer. Fragments with the third subsequence
thereby have primers on one end only and are at most linearly
amplified. Both these embodiments can be extended to
multiple internal sequences by using multiple probes to
recognize the sequences or to disrupt exponential PCR
amplification. Type IIS REs which cut a primer close to its
junction with the original cDNA fragment sequence generate
overhangs which are not contiguous with the initial RE
recognition sequence. The sequence of such an overhang can
be used as a third internal subsequence.
5.2.1. RECOGNITION MOIETY STRUCTURE
Construction of the recognition moieties, also
herein called adapters or linker-primer oligomers, is
important and is described here in advance of further details
of the individual recognition reaction steps. Their basic
structure is first described, followed second by descriptions
of several enhancements adaptable to QEA™ variations. In the
preferred embodiment, the adapters are partially double
stranded DNA ("dsDNA"). Alternatively, the adapters can be
constructed as oligomers of any nucleic acid having
properties corresponding to those of the preferred DNA
polymers. In an embodiment employing an alternative
amplification means, the adapters preferably serve as a
primer for that amplification means, if needed.
Turning first to basic adapter structure, Fig. 2A
illustrates the DNA molecules involved in the ligation
reaction as conventionally indicated with the 5' ends of the
top strands and the 3' ends of the bottom strands at left.
dsDNA 201 is a fragment of a sample cDNA sequence with an RE
cut at the left end generating, preferably, four bp 5'
overhang 202. Adapter dsDNA 209 is a synthetic substrate
provided by this invention. The structure of adapter 209 is
selected to ensure that RE digestion and adapter ligation
preferably go to completion, that generation of unwanted
products and amplification biases are minimized, and that
unique labels are attached to cut ends (if needed). Adapter
209 comprises strand 203, called a primer, and a partially
complementary strand 205, called a linker. The primer is
also known as the longer strand of the adapter, and the
linker is also known as the shorter strand of the adapter.
The linker, or shorter strand, links the cDNA cut
by an RE to the primer, or longer strand, by hybridizing to
the overhang generated by the RE and to the primer such that
the 3' end of the primer is adjacent to the 5' end of the
overhang. In this configuration, the primer can be
effectively ligated to the cut dsDNA. Therefore, linker 205
comprises subsequence 206 complementary to RE overhang 202
and subsequence 207 complementary to 3' end 204 of primer
203. Subsequence 206 is most preferably of the same length
as the RE overhang. Subsequence 207 is preferably eight
nucleotides long, less preferably from 4 to 12 nucleotides
long, but can be of any length as long as the linker reliably
hybridizes with only one primer in any one recognition
reaction at an appropriate Tm. The appropriate Tm should
preferably be less than the self-annealing Tm of primer 203.
This ensures that subsequent PCR amplification conditions can
be controlled so that linkers present in the reaction mixture
will not hybridize and act as PCR primers, and, thereby,
generate spurious fragment lengths. The preferable Tm is
less than approximately 68°C. Also, linker 205 preferably
lacks a 5' terminal phosphate to prevent its ligation to the
3' bottom strand of dsDNA 201. More importantly, lack of a
terminal phosphate also prevents self-annealed adapters from
ligating and forming dimers. Adapter self-ligation is
disadvantageous in that it would compete with adapter
ligation to cut cDNA fragments. Further, adapter dimers
would be amplified in a subsequent amplification step
generating unwanted fragments, termed amplification noise.
Terminal phosphates can be removed from linkers using
phosphatases known in the art, followed by separation of the
enzyme. An exemplary protocol for an alkaline phosphatase
reaction is found in Sec. 6.3.
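The relation between linker, primer, and overhang described above
can be sketched in Python. The helper names and the 24-nucleotide
primer sequence are hypothetical and are not adapters disclosed in
this document. The sketch assumes a 5' overhang read 5' to 3' on
the top strand, so that the linker, read 5' to 3', is the reverse
complement of the primer's 3' end followed by the overhang; that
is, subsequence 206 followed by subsequence 207.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_linker(primer, overhang, anneal_length=8):
    """Linker (5'->3') pairing with the overhang and the primer's 3' end.

    The linker's 5' part (subsequence 206) is complementary to the RE
    overhang; its 3' part (subsequence 207) is complementary to the last
    `anneal_length` nucleotides of the primer.
    """
    if not 4 <= anneal_length <= 12:
        raise ValueError("subsequence 207 is described as 4 to 12 nucleotides")
    return reverse_complement(primer[-anneal_length:] + overhang)


# Hypothetical 24-nt primer and the GATC 5' overhang left by BamHI.
hypothetical_primer = "CTGACGTAGCTAGGCATCCAGTCA"
linker = build_linker(hypothetical_primer, "GATC")
print(linker, len(linker))
```

With an eight-nucleotide annealing region and a four-base
overhang the linker is twelve nucleotides long, much shorter than
the primer, which is consistent with its lower Tm.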
Primer, or longer strand, 203 has a 3' end
subsequence 204 complementary to 3' end subsequence 207 of
linker 205. It is preferable that each RE generated overhang
is ligated to a unique primer, in each recognition reaction,
in order that the overhangs generated by each RE can be
detected. Consequently, in each recognition reaction primers
and linkers are preferably chosen so that each primer is
complementary to and hybridizes with only one linker 205 and
that each linker which hybridizes with an RE has a unique
sequence 207 for hybridizing with a unique primer. In order
that the primer/cDNA overhang ligation reaction go to
completion, primer 203 preferably does not recreate the
recognition sequence of any RE in one recognition reaction
when it is ligated with cDNA end 202. Further, primer 203
preferably has no 5' terminal phosphate in order to prevent
primer self-ligations. To minimize amplification noise, it
is preferred that primer 203 not hybridize with any sequence
present in the original sample mixture. If such
hybridization occurred, a subsequent PCR step can amplify
unwanted fragments not cut by the initial REs. The Tm of
primer 203 is preferably high, in the range from 50° to 80°C,
and more preferably above 68°C. This permits the
subsequent PCR amplification to be controlled so that only
primers and not linkers initiate new chains, the linkers
remaining melted through the PCR cycle. In the case of gel
electrophoretic fragment separation and detection with, e.g.,
Ag staining or an intercalating dye, the primer is optionally
unlabeled. For example, this Tm can be achieved by use of a
primer having a combination of a G+C content preferably from
40-60%, most preferably from 55-60%, and a length most
preferably 24 nucleotides, and preferably from 18 to 30
nucleotides. Primer 203 is optionally labeled with
fluorochrome 208, although any DNA labeling system that
preferably allows multiple labels to be simultaneously
distinguished is usable in this invention. Generally, the
primer, or longer strand, is constructed so that, preferably,
it is highly specific, free of dimers and hairpins, and
capable of forming stable duplexes under the conditions
specified, in particular at the desired Tm. Software
packages are available for primer construction according to
these principles, an example being OLIGO™ Version 4.0 For
Macintosh from National Biosciences, Inc. (Plymouth, MN). In
particular, a formula for Tm can be found in the OLIGO™
Reference Manual at Eqn. I, page 2.
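Several of the primer preferences above lend themselves to a
simple screening sketch. The function names and the primer
sequence are hypothetical. The Tm estimate uses a common rough
approximation, 64.9 + 41 (G+C - 16.4) / N, which is an assumption
of this sketch and is not the OLIGO formula cited above. The
site-recreation check assumes an enzyme that cuts its top strand
`cut_offset` bases into the recognition site, so that a ligated
primer restores the site only if its 3' end supplies the bases
removed by the cut.

```python
def gc_fraction(seq):
    """Fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def rough_tm(seq):
    """Rough Tm estimate in degrees C (assumed approximation, not the OLIGO formula)."""
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc_count - 16.4) / len(seq)


def recreates_site(primer, site, cut_offset):
    """True if ligating the primer's 3' end to the cut end restores the site."""
    return primer.upper().endswith(site[:cut_offset])


def check_primer(primer, site, cut_offset):
    """Screen a primer against the length, G+C, and site-recreation preferences."""
    return {
        "length_ok": 18 <= len(primer) <= 30,
        "gc_ok": 0.40 <= gc_fraction(primer) <= 0.60,
        "rough_tm": rough_tm(primer),
        "site_recreated": recreates_site(primer, site, cut_offset),
    }


# Hypothetical 24-nt primer screened against BamHI (G^GATCC).
hypothetical_primer = "CTGACGTAGCTAGGCATCCAGTCA"
report = check_primer(hypothetical_primer, "GGATCC", 1)
print(report)
```

The same primer would restore a HindIII site (A^AGCTT) because it
ends in A, so it would be a poor choice for a reaction containing
that enzyme.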
Fig. 2B illustrates two exemplary adapters and
their component primers and linkers constructed according to
the above description. Adapter 250 is specific for the RE
BamHI, as it has a 3' end complementary to the 5' overhang
generated by BamHI. Adapter 251 is similarly specific for
the RE HindIII. Sec. 6.10 contains a more comprehensive,
non-limiting list of adapters that can be used according to
the invention. All synthetic oligonucleotides of this
invention are preferably as short as possible for their
functional roles in order to minimize synthesis costs. A
further alternative illustrated in Fig. 2C is to construct an
adapter by self hybridization of single stranded DNA in
hairpin loop configuration 212. Subsequences of loop 212 are
constructed with similar structure to the corresponding
subsequences of linker 205 and primer 203. Exemplary hairpin
loop 211 sequences are C4 to C10.
REs generating 3' overhangs are less preferred and
require different adapter structures. A preferable basic
adapter structure for 3' overhangs is illustrated in Fig. 3A.
dsDNA 301 is a fragment of a sample cDNA cut with an RE
generating 3' overhang 302. Adapter 309 comprises primer, or
longer strand, 304 and linker, or shorter strand, 305.
Primer, or longer strand, 304 includes subsequence 306
complementary to and of the same length as 3' overhang 302
and subsequence 307 complementary to linker 305. It also
optionally has label 308 which distinctively labels primer
304. As in the case of adapters for 5' overhangs, in order
that the RE digestion and ligation reactions go to
completion, primer 304 preferably has no 5' terminal
phosphate, in order to prevent self-ligations, and preferably
has a sequence such that no recognition site for any RE in
one recognition reaction is created upon ligation of the
primer with dsDNA 301. To minimize amplification noise,
primer 304 should preferably not hybridize with any sequence
in the initial sample mixture. The Tm of primer 304 is
preferably high, in the range from 50° to 80°C, and more
preferably above 68°C. This ensures the subsequent PCR
amplification can be controlled so that only primers and not
linkers initiate new chains. For example, this Tm can be
achieved by using a primer having a G+C content preferably
from 40-60%, most preferably from 55-60%, and a primer length
most preferably of 24 nucleotides and less preferably of 18-30
nucleotides. Each primer 304 in a reaction can optionally
have a distinguishable label 308, which is preferably a
fluorochrome.
Linker, or shorter strand, 305 is complementary to
and hybridizes with subsequence 307 of primer 304 in a
position adjacent to 3' overhang 302. Linker 305 is most
preferably 8 nucleotides long, less preferably from 4-16
nucleotides, and has no terminal phosphates to prevent self-
ligation. This linker only promotes ligation specificity and
activity and does not link primer 304 to the cut dsDNA, as in
the 5' case. Further, linker 305 Tm should preferably be
less than primer 304 self-annealing Tm. This ensures that
subsequent PCR amplification conditions can be controlled so
that linkers present in the reaction mixture will not
hybridize and act as PCR primers, and, thereby, generate
spurious fragment lengths. Fig. 3B illustrates an exemplary
adapter with its primer and linker for the case of the RE
NlaIII. As in the 5' overhang case, a 3' adapter can also be
constructed from a hairpin loop configuration.
Next, several adapter structural enhancements are
described. The use of these enhancements is detailed in the
subsequent protocol descriptions. In one alternative, the
adapter primer strand can have a conjugated capture moiety in
addition to or in place of a conjugated label moiety. Such a
capture moiety is advantageous in separating various classes of
RE/ligase reaction products by binding the capture moiety to
its binding partners. Acceptable and preferred capture
moieties and binding partners have been previously described.
Further, when a primer has a conjugated capture moiety,
particularly biotin which forms a streptavidin complex that is
difficult to dissociate, it can be advantageous to include a
release means in the primer in order to achieve controlled
release from the bound capture moiety. Release means can
involve including subsequences in the primer which can be
cleaved in a controlled manner. One exemplary such
subsequence is one or more uracil nucleotides. In this case
digestion with uracil DNA glycosylase (UDG) and subsequent
hydrolysis of the sugar backbone at an alkaline pH effects
release. Another exemplary such subsequence is the
recognition subsequence of an RE which cuts extremely rarely
if at all in the sequences of the sample. A preferred RE of
this sort for human cDNA sequences is AscI, which has an 8 bp
recognition sequence that rarely, if ever, occurs in
mammalian DNA. AscI is further advantageously active at the
ends of DNA molecules. In this case, digestion with this RE,
i.e., AscI, will release strand 2351.
In another enhancement, adapters can be constructed
from hybrid primers which are designed to facilitate the
direct sequencing of a fragment or the direct generation of
RNA probes for in situ hybridization with the tissue of
origin of the DNA sample analyzed. Hybrid primers for direct
sequencing are constructed by ligating onto the 5' end of
existing primers the M13-21 primer, the M13 reverse primer,
or equivalent sequences. Fragments generated with such
hybrid adapters can be removed from the separation means and
amplified and sequenced with conventional systems. Such
sequence information can be used both for a previously known
sequence to confirm the sequence determination and for a
previously unknown sequence to isolate the putative new gene.
Hybrid primers for direct generation of RNA hybridization
probes are constructed by ligating onto the 5' end of
existing primers the phage T7 promoter. Fragments generated
with such hybrid adapters can be removed using the separation
means and transcribed into anti-sense RNA with conventional
systems. Such probes can be used for in situ hybridization
with the tissue of origin of the DNA sample to determine in
precisely what cell types a signal of interest is expressed.
Such hybrid adapters are illustrated in Sec. 6.8.
2S In a further enhancement, the previously described
adapters are used, but the PCR primer strands have an extra
subsequence 3' to the adapter primer strands in order to act
as phasing primers. That is, the PCR amplification reaction
is used to recognize additional nucleotides beyond the
initial RE target recognition subsequence. Fig. 2D
illustrates such alternative phasing primers. In that
figure, sample dsDNA 201 is illustrated after blunt-ending of
the RE/ligase reaction products but just prior to a PCR
amplification cycle. dsDNA 201 has been cleaved at position
221, producing overhang 202, by an RE recognizing target
recognition subsequence 227, has been ligated to adapter
primer strand 203, and has been completed to a blunt-ended
double strand by strand 220 by incubation at 72 °C for 10
minutes. For definiteness and without limitation, the RE
recognition subsequence 227 typically extends 1 bp beyond
overhang 202. Other relative positions depend on the lengths
of the overhang and the recognition sequence. Alternative
PCR phasing primer 222, illustrated with its 5' end at the
left, comprises subsequence 223, with the same sequence as
strand 203; subsequence 224, with the same sequence as the RE
overhang 202; subsequence 225, with a sequence consisting of
a remaining portion of RE recognition subsequence 227, if
any; and subsequence 226 of P nucleotides. Length P is
preferably from 1 to 6 and more preferably either 1 or 2.
Subsequences 223 and 224 hybridize for PCR priming with
corresponding subsequences of dsDNA 201. Subsequence 225
hybridizes with any remaining portion of recognition
subsequence 227, typically 1 bp. Subsequence 226 hybridizes
only with fragments 201 having complementary nucleotides in
corresponding positions 228. When P is 1, PCR primer 222
selects for PCR amplification 1 of 4 possible fragments 201;
when P is 2, 1 of 16 is selected. Using 4 (or 16) primers
222, each with one of the possible (pairs of) nucleotides, in
4 (16) aliquots of RE/ligase reaction products selects for
amplification one of the possible fragments 201. These
primers are similar to phasing primers (European Patent
Application No. 0 534 858 A1, published Mar. 31, 1993).
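
The selection performed by phasing primers can be sketched in a few lines of Python. This is only an illustration, not part of the disclosed method: the function names and the example flanking sequence are hypothetical, and a fragment is modeled simply by the nucleotides in positions 228, read 5' to 3' on the strand having the same sense as the primer.

```python
from itertools import product

BASES = "ACGT"

def phasing_extensions(p):
    """All 4**p possible P-nucleotide extensions (subsequence 226)."""
    return ["".join(t) for t in product(BASES, repeat=p)]

def selecting_extension(flank, p):
    """Extension of the one primer expected to amplify a fragment.

    `flank` holds the nucleotides in positions 228, read 5' to 3' on the
    strand with the same sense as the primer (an illustrative assumption).
    """
    if len(flank) < p or any(b not in BASES for b in flank[:p]):
        raise ValueError("flank must supply at least p unambiguous bases")
    return flank[:p]

one = phasing_extensions(1)   # 4 primers, one per aliquot
two = phasing_extensions(2)   # 16 primers, one per aliquot
chosen = selecting_extension("GATTC", 2)  # hypothetical flanking sequence
```

Each fragment therefore falls into exactly one of the 4 (or 16) aliquots, which is what divides the fragment population among the reactions.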
The effect of using PCR primers 222, having
subsequences 226 of length P bp, is to extend the initially
recognized RE target subsequence into an effective target
subsequence, which is the initial RE target subsequence
concatenated to a subsequence complementary to subsequence
226. Thereby, many additional target subsequences can be
recognized while retaining the specificity and exactness
characteristic of the RE embodiment. For example, REs
recognizing 4 bp subsequences can be used in such a combined
reaction with an effective 5 or 6 bp target subsequence,
which need not be palindromic. REs recognizing 6 bp
sequences can be used in a combined reaction to recognize 7
or 8 bp sequences. Such effective recognition sequences are
input to the computer implemented design and analysis methods
subsequently described.
In a further enhancement, additional subsequence
information can be generated from adapters comprising primers
with a specially placed Type IIS RE recognition subsequence,
followed by digestion with the Type IIS RE and sequencing of
the generated overhang. In a preferred embodiment, the Type
IIS recognition subsequence is placed so that the generated
overhang is contiguous with the original recognition
subsequence of the RE that cut the end to which the adapter
hybridizes. In this embodiment, an effective target
subsequence is formed by concatenating the sequence of the
Type IIS overhang and the original recognition sequence. In
another embodiment, the Type IIS recognition sequence is
placed so that the sequence of the generated overhang is not
contiguous with the original recognition sequence. Here, the
sequence of the overhang is used as a third internal
subsequence in the fragment. In both cases, the additionally
recognized subsequence is used in the computer implemented
experimental analysis methods to increase the capability of
determining the source sequence of a fragment. This
enhancement is illustrated in Figs. 17A-E and is described in
detail in Sec. 5.2.3 ("The SEQ-QEA™ Embodiment"). The
primers used in the SEQ-QEA™ embodiment advantageously
include combined enhancements, including label moieties,
capture moieties, and release means.
It will be apparent to those of skill in the art
that the previously described primers and linkers can be
enhanced with combinations of the previously described
embodiments and with other alternatives known in the art to
practice further embodiments and refinements of the RE/ligase
embodiment of QEA™. This invention comprises these
substantially similar variations of the embodiments described
herein.
5.2.2. RE/LIGASE METHOD STEPS
The steps of the preferred RE/ligase embodiment of
QEA™ comprise: first, in one reaction, cutting a cDNA sample
with one or more REs, hybridizing adapters corresponding to
the REs, and ligating the primers of the adapters onto the cut
ends; second, amplifying the cut fragments, if necessary; and
third, separating the fragments according to length and
detecting fragment lengths and fragment target end
subsequences. If necessary, prior to the first step, the
cDNA sample can be synthesized by methods commonly known in
the art, such as those described in Sec. 6.3. Optionally,
following the amplification step, additional steps to remove
unwanted DNA fragments or RE/ligase reaction products prior
to separation and detection can increase the QEA™ signal to
noise ratio or simplify interpretation of the resulting
signals. Additional RE/ligase embodiments are described,
including those known as 5'-QEA™ and SEQ-QEA™.
In more detail, the RE/ligase embodiment can begin
with pre-synthesized cDNA, or with a tissue sample or mRNA
from which cDNA is to be synthesized. When cDNA is to be
synthesized, the exemplary methods and procedures of Sec. 6.3
can be used. QEA™ does not require cloning into a vector.
In the case of a tissue sample, a first step is the largely
conventional separation of RNA from the tissue sample.
Separated RNA is preferably poly(A)+ purified RNA, mRNA
separated from particular cellular fractions, or less
preferably total cellular RNA. The steps of separation
involve RNase extraction, DNase treatment and mRNA
purification according to protocols, e.g., of Sec. 6.3.1.
First and second strand cDNA synthesis from mRNA can be
performed according to the protocols of Sec. 6.3.2, or the
less preferred protocols of Sec. 6.3.4. In the case of small
quantities of mRNA or where it is advantageous to have
full-length cDNA including complementary sequences out to the
5' cap of the source mRNA, the preferred synthesis protocols
of Sec. 6.3.3, or functionally equivalent protocols, can be
used.
However obtained, it is important that cDNA used in
the RE/ligase embodiment of QEA™ not have any terminal
phosphates. This is to minimize noise in subsequent fragment
length separation and detection caused by exponential
amplification of unwanted fragments singly cut on one end by
an RE and terminated on the other by a variable length
oligo(dT) tail. Significant background noise can arise from
exponential amplification of singly cut fragments whose blunt
ends have ligated to form a single dsDNA with two cut ends
having ligated primers, an apparently doubly cut fragment.
The lengths of such fragments vary depending on cDNA synthesis
conditions and produce diffuse background noise on gel
electrophoresis, which obscures sharp bands from the normally
doubly cut fragments. This background can be eliminated by
preventing blunt end ligation of such singly cut cDNA
fragments by initially removing all terminal phosphates from
the cDNA sample, without otherwise disrupting the integrity
of the cDNA.
Thus the final preparation step of a cDNA sample is
removal of terminal phosphates, if needed. Terminal
phosphate removal is preferably done with a heat-inactivatable
phosphatase. Phosphatase activity is preferably removed
prior to the RE digestion and adapter ligation steps in order
to prevent interference with the intended ligation of adapters
to doubly cut fragments. Heat inactivation allows
phosphatase removal without a separation or extraction step.
A preferred phosphatase comes from cold-living Barents Sea
(arctic) shrimp (U.S. Biochemical Corp.) ("shrimp alkaline
phosphatase" or "SAP"). Terminal phosphate removal need be
done only once for each population of cDNA being analyzed.
In other embodiments alternative phosphatases can be used for
terminal phosphate removal, such as calf intestinal
phosphatase-alkaline from Boehringer Mannheim (Indianapolis,
IN). Those that are not heat inactivated require a step to
separate the phosphatase from the cDNA sample before the
RE/ligase reactions, such as by phenol-chloroform extraction.
The prepared cDNA is then separated into batches of
from 1 picogram ("pg") to 200 nanograms ("ng") of cDNA each,
and each batch is separately processed by the further steps
of the method. A number of batches sufficient for whichever
QEA™ mode is to be practiced are made. For a tissue mode
experiment, to analyze gene expression, preferably, from a
majority of expressed genes in a human tissue, the presence
of about 15,000 distinct cDNA sequences needs to be
determined. By way of example, one sample is divided into
approximately 50 batches, and each batch is then subject to an
RE/ligase recognition reaction to generate approximately 200-
500 fragments, and more preferably 250 to 350 fragments, of 10
to 1000 bp in length, the majority of fragments preferably
having a distinct length and being uniquely derived from one
cDNA sequence. A preferable tissue mode analysis entails
approximately 50 batches generating approximately 300 bands
each. For query mode experiments, fewer recognition
reactions are employed since only a subset of the expressed
genes are of interest, perhaps approximately from 1 to 100.
The number of recognition reactions in an experiment can then
number approximately from 1 to 10, and approximately from 1
to 10 cDNA batches are prepared.
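
The batch count for a tissue mode experiment follows from simple arithmetic, sketched below in Python. The sketch assumes, for illustration only, that each expressed sequence contributes about one distinct band; the function name is hypothetical.

```python
import math

def batches_needed(n_sequences, bands_per_batch):
    """Batches required if each sequence yields roughly one band."""
    return math.ceil(n_sequences / bands_per_batch)

tissue_mode = batches_needed(15_000, 300)   # preferred ~300 bands per batch
fewest = batches_needed(15_000, 500)        # upper end of 200-500 fragments
most = batches_needed(15_000, 200)          # lower end of 200-500 fragments
```

With approximately 300 bands per reaction, about 50 batches cover the approximately 15,000 sequences, in agreement with the figures given above.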
Following cDNA preparation is the important step of
simultaneous RE cutting of and adapter ligation to the sample
cDNA sequences. The prepared sample is cut with one or more
REs. The number of REs and associated adapters preferably
is limited so that a compressed length distribution
consisting of shorter fragments is avoided and enough
distinguishable labels are available for all the REs used.
Alternatively, REs can be used without associated adapters in
order that the amplified fragments not have the associated
recognition sequences. Absence of these sequences can be
used to additionally differentiate genes that happen to
produce fragments of identical length with particular REs.
In the same reaction mix, herein called the Qlig
mix, REs, adapters and ligase enzyme are simultaneously
present for concurrent adapter ligation and RE cutting. The
amount of RE enzyme in the reaction is preferably
approximately a 10 fold unit excess. Substantially greater
quantities are less preferred because they can lead to star
activity (non-specific cutting), while substantially lower
quantities are less preferred because they will result in
less rapid and only partial digestion and hence incomplete
and inaccurate characterization of the subsequence
distribution. REs and corresponding adapters are chosen
according to the previous description. Table 10 in Sec. 6.10
lists exemplary REs and corresponding primers and linkers.
Table 11 in Sec. 6.10.1 lists exemplary combinations for
biotin labeled primers. The method is adaptable to any
ligase enzyme that is active in the temperature range 10 to
37 °C. T4 DNA ligase is the preferred ligase. In other
embodiments, cloned T4 DNA ligase or T4 RNA ligase can also
be used. In a further embodiment, thermostable ligases can
be used, such as Ampligase™ Thermostable DNA Ligase from
Epicentre (Madison, WI), which has a low blunt end ligation
activity. These ligases, in conjunction with the repetitive
cycling of the basic thermal profile for the RE/ligase
reaction, described in the following, permit more complete RE
cutting and adapter ligation.
Also present in the Qlig mix are necessary buffers,
as known in the art, and ATP. An excess of primers is
preferably present in the Qlig mix in order that subsequent
amplification can be performed in an automated manner.
Preferably primers and linkers are present approximately in
the ratio of 20:1 and to an adequate total primer amount of
approximately 20 pm where 1 ng of cDNA is used. Less
preferably the ratio is 10:1. Also, Betaine (Sigma
Chemicals) is preferably present in the Qlig reaction mix.
Betaine has been found to improve the uniformity of signals
from fragments that are at approximately the same original
concentration by aiding ligation activity. Betaine also
improves the PCR amplification of hard to amplify products.
RE/ligase reaction conditions are optimized to
minimize unwanted products. As previously explained,
terminal phosphate removal from cDNA samples prevents
unwanted ligation of cDNA blunt ends together and subsequent
exponential amplification of the resulting dimers. Another
class of unwanted products is fragment concatamers, formed
when the sticky ends of cut cDNA fragments hybridize and
ligate together. Fragment concatamers are removed by
maintaining restriction enzyme activity during ligation in
order to cut any unwanted concatamers. Further, ligated
primers terminate further RE cutting, since primers do not
recreate RE recognition subsequences. A high molar excess of
adapters is, therefore, preferable to limit concatamer
formation by driving the RE and ligase reactions toward
complete digestion and adapter ligation. Finally, unwanted
adapter self-ligation is prevented since primers and linkers
lack terminal phosphates (preferably due to synthesis without
phosphates or less preferably due to pretreatment thereof
with phosphatases).
The temperature profile of the RE/ligase reaction
is important for complete cutting and ligation. The
preferred protocol has several steps. The first step is at
the optimum RE temperature for a time sufficient to achieve
substantially complete cutting, for example 37 °C for 30
minutes. The ligase used is preferably active during the
first step. The second step is a ramp at -1 °C/min down to
an optimum temperature for adapter annealing and primer
ligation, for example, 16 °C. The third step achieves
substantially complete primer ligation of cut products, and
is, for example, at 16 °C for 60 minutes. The REs used are
preferably active during this third step. The fourth step is
again at the temperature for optimum RE activity to achieve
complete cutting of recognition sites and unwanted ligation
products, for example at 37 °C for 15 minutes. The fifth
step is to heat inactivate the Qlig enzymes and is, for
example, above 65 °C. If the PCR amplification is to be
performed immediately, as in the preferred single tube
protocol of Sec. 6.4.1, this fifth step is at 72 °C for 20
minutes and performs additional reactions to be subsequently
described. If the PCR amplification is not to be immediately
performed, the Qlig reaction results are held at 4 °C, as in
the much less preferred multi-tube protocol of Sec. 6.4.5.
This temperature profile, together with the subsequent PCR
profile, is illustrated in Fig. 16D.
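
For programming a thermocycler, the exemplary five-step profile can be written as a small table, sketched below in Python. The tuple layout and names are illustrative assumptions; the text gives a rate only for the downward ramp, so heating between steps is treated here as instantaneous.

```python
RAMP_RATE_C_PER_MIN = 1.0  # magnitude of the -1 °C/min cooling ramp

def ramp_minutes(start_c, end_c, rate=RAMP_RATE_C_PER_MIN):
    """Duration of a linear ramp between two temperatures."""
    return abs(start_c - end_c) / rate

# (description, start °C, end °C, minutes); single tube protocol values
PROFILE = [
    ("RE cutting", 37, 37, 30),
    ("ramp down", 37, 16, ramp_minutes(37, 16)),
    ("adapter annealing and primer ligation", 16, 16, 60),
    ("second RE cutting", 37, 37, 15),
    ("heat inactivation", 72, 72, 20),
]

# Programmed time, excluding the unspecified heating ramps
total_minutes = sum(step[3] for step in PROFILE)
```

Under these assumptions the cooling ramp takes 21 minutes and the whole Qlig profile occupies a little under two and a half hours.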
A less preferred profile involves repetitive
cycling of the first four steps of the temperature protocol
described above, that is from an optimum RE temperature to an
optimum annealing and ligation temperature, and back to an
optimum RE temperature. The additional temperature cycles
act to further drive the RE/ligase reactions to completion.
With this profile, it is preferred to use thermostable ligase
enzymes. The majority of restriction enzymes are active at
the conventional 16 °C ligation temperature and hence prevent
unwanted ligations without thermal cycling. However,
temperature profiles comprising alternating optimum ligation
conditions and optimum RE conditions can cause both enzymatic
reactions to proceed more rapidly than if at one constant
temperature. An exemplary profile comprises periodically
cycling from a 37 °C optimum RE temperature to a 16 °C
optimum annealing and ligation temperature at a ramp of
-1 °C/min, then holding at the 16 °C optimum ligation
temperature, and then returning to the 37 °C optimum RE
temperature. Following completion of approximately 2 to 4 of
these temperature cycles, the RE and ligase enzymes are heat
inactivated by a final stage above 65 °C for 10 minutes.
These thermal profiles are easily controlled and
automated by the use of commercially available computer
controlled thermocyclers, for example from MJ Research
(Watertown, MA) or Perkin Elmer (Norwalk, CT).
The Qlig mix and reaction temperature profile are
designed to achieve the substantially complete cutting of all
RE recognition sites present in the analyzed sequence mixture
and the substantially complete ligation of primers to cut
ends, each primer being unique in one reaction for one
particular RE cut end. The fragments generated are limited
by adjacent RE recognition sites, with substantially no
fragments having internal undigested sites. Further, a
minimum of unwanted self-ligation products and concatamers is
formed. This invention is adaptable to other temperature
profiles which achieve the same effect of substantially
complete cutting and ligation. Exemplary alternative
profiles are described in the accompanying examples in Sec.
6.4.
Following the RE/ligase step is a step for
amplifying the doubly cut cDNA fragments. Although PCR
protocols are described in the exemplary embodiment of this
invention, any amplification method that selects fragments to
be amplified based on end sequences is adaptable to this
invention (see above). With high enough sensitivity of
detection means, or even single molecule detection means, the
amplification step can be dispensed with entirely. This is
preferable as molecular amplification often distorts the
quantitative response of this method.
PCR amplification protocols used in this invention
are designed to have maximum specificity and reproducibility.
First, PCR amplification produces fewer unwanted products if
the linkers remain substantially melted and unable to
initiate DNA strands, such as by performing all amplification
steps at a temperature near or above the Tm of the linker.
Second, amplification primers, typically strand 203 of Fig.
2A (and 304 of Fig. 3A), are preferably designed for high
amplification specificity by having a high Tm, preferably
above 50 °C and most preferably above 68 °C, to ensure
specific hybridization with a minimum of mismatches. They
are further chosen not to hybridize with any native cDNA
species to be analyzed. The previously described phasing
primers, which are alternatively used for PCR amplification,
have similar properties. Third, the PCR temperature profile
is preferably designed for specificity and reproducibility.
High annealing temperatures minimize primer
mis-hybridizations. Longer extension times reduce PCR bias
related to smaller fragments. Longer melting times reduce
PCR amplification bias related to high G+C content. A
preferred PCR temperature cycle is 95 °C for 30 sec., then
57 °C for 1 min., then 72 °C for 2 min. This preferred PCR
temperature profile is illustrated in Fig. 16D. Fourth, it
is preferable to include Betaine in the PCR reaction mix, as
this has been found to improve amplification of hard to
amplify products. To further reduce bias, large
amplification volumes and a minimum number of amplification
cycles, typically between 10 and 30 cycles, are preferred.
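
As a rough planning aid, the programmed hold time of the preferred cycle can be totaled as in the Python sketch below. The dictionary keys and function name are hypothetical, and time spent ramping between temperatures is ignored because the text does not specify it.

```python
# Hold times, in seconds, of the preferred PCR temperature cycle
CYCLE_SECONDS = {"melt_95C": 30, "anneal_57C": 60, "extend_72C": 120}

def pcr_hold_minutes(n_cycles):
    """Total programmed hold time, ignoring ramps between temperatures."""
    return n_cycles * sum(CYCLE_SECONDS.values()) / 60

shortest = pcr_hold_minutes(10)  # lower end of the typical cycle range
longest = pcr_hold_minutes(30)   # upper end of the typical cycle range
```

Each cycle holds for 3.5 minutes, so the typical range of 10 to 30 cycles corresponds to roughly 35 to 105 minutes of programmed time.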
Any other techniques designed to raise specificity,
yield, or reproducibility of amplification are applicable to
this method. For example, one such technique is the use of
7-deaza-2'-dGTP in the PCR reaction in place of dGTP. This
has been shown to increase PCR efficiency for G+C rich
targets (Mutter et al., 1995, Nuc. Acid Res. 23:1411-1418).
For a further example, another such technique is the addition
of tetramethylammonium chloride to the reaction mixture,
which has the effect of raising the Tm (Chevet et al., 1995,
Nucleic Acids Research 23(16):3343-3344).
It can be advantageous to process multiple
identical samples of RE/ligase reaction products, e.g. the
processed Qlig mix, with multiple PCR amplifications.
Amplification of multiple identical samples with the same
number of cycles serves to check reliability and quantitative
response by comparing signals from each of the separately
amplified aliquots. Amplifications of multiple identical
samples with an increasing number of amplification cycles,
for example 10, 15, and 20 cycles, are preferable in that
amplifications with a lower number of cycles can detect more
prevalent fragments in a more quantitative manner, while
amplifications with a higher number of cycles can detect less
prevalent fragments but less quantitatively.
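
The trade-off between cycle number and sensitivity can be illustrated with an idealized amplification model, sketched below in Python. Perfect doubling per cycle is an assumption made only for illustration; real reactions fall short of it, increasingly so at higher cycle numbers, which is one reason the later samples are less quantitative.

```python
def ideal_fold_amplification(n_cycles, efficiency=1.0):
    """Fold increase after n cycles; efficiency 1.0 means perfect doubling."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie between 0 and 1")
    return (1.0 + efficiency) ** n_cycles

# Exemplary cycle numbers for multiple identical samples
folds = {n: ideal_fold_amplification(n) for n in (10, 15, 20)}
```

In this idealized model, going from 10 to 20 cycles raises the amplification by a further factor of about one thousand, which is why rarer fragments become detectable only in the higher cycle samples.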
It is preferable to perform PCR amplification in
the same reaction tube as the RE/ligase reaction, as this
promotes automation. First, a PCR reaction mix, herein
called the QPCR mix, is made from appropriate DNA
polymerases, dNTPs, and PCR buffer, but without any primer
strands. Exemplary QPCR mix compositions can be found in the
examples of Sec. 6.4. The QPCR mix is placed in a reaction
tube, and a layer of wax melting near but below 72 °C is
layered above the QPCR mix. The Qlig mix is placed above the
wax layer and processed according to the previously described
temperature profile, which does not melt the wax. When the
RE/ligase reactions are complete, the tube is incubated at
72 °C for 20 min. This incubation melts the linkers from the
fragments, melts the wax layer and allows the processed Qlig
mix and the QPCR mix to combine, and finally, permits the DNA
polymerase to complete the fragments to blunt-ended dsDNA.
After this incubation, the PCR temperature profile is
performed according to the preferred protocol for a certain
number of cycles.
It is important in the preferred single tube
embodiment that the Qlig and QPCR mixes do not intermingle
before the intended step. Even slight mixing due to hairline
cracks in the wax layer can contaminate the reactions. The
preferred wax to prevent such intermingling is a mixture of
Paraffin wax and Chillout™ 14 wax in a 90:10 ratio,
respectively. The paraffin is a highly purified paraffin wax
melting between 58 °C and 60 °C such as can be obtained from
Fluka Chemical, Inc. (Ronkonkoma, N.Y.) as Paraffin Wax cat.
no. 76243. Chillout 14 Liquid Wax is a low melting, purified
paraffin oil available from MJ Research. This wax layer is
created in the following manner. The reaction tubes are
pre-waxed by melting the preferred wax onto the upper half of
the sides of the tubes. The QPCR mix is added carefully,
avoiding this wax layer. Then the wax layer is melted onto the
surface of the QPCR mix by incubating the tubes at 75 °C for 2
min. The wax layer is then carefully solidified by
decreasing the temperature of the tubes by 5 °C every 2 min.
until a final temperature of 25 °C is reached. The Qlig mix
is then gently added on top of this wax surface. This single
tube protocol is adaptable to other less preferable waxes
that melt at approximately 72 °C, such as Ampliwax beads
(Perkin-Elmer, Norwalk, CT). Further, other so called PCR
"hot-start" procedures can be used, such as those employing
heat sensitive antibodies (Invitrogen, CA) to initially block
the activity of the polymerase.
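
The stepwise cooling used to solidify the wax layer can be tabulated as in the following Python sketch. The function name and defaults are illustrative, and the sketch assumes that the first 5 °C decrement occurs 2 minutes after the end of the 75 °C melting incubation.

```python
def cooling_schedule(start_c=75, final_c=25, step_c=5, minutes_per_step=2):
    """Return (elapsed minutes, temperature °C) pairs for stepwise cooling."""
    schedule = [(0, start_c)]
    temp, elapsed = start_c, 0
    while temp > final_c:
        temp = max(temp - step_c, final_c)  # never overshoot the final value
        elapsed += minutes_per_step
        schedule.append((elapsed, temp))
    return schedule

schedule = cooling_schedule()
```

Under these assumptions the wax reaches 25 °C after ten decrements, that is, about 20 minutes after melting.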
Alternatively, PCR amplification can be performed
in a separate tube. In this case the QPCR mix is prepared in
a second tube. The first tube with the processed Qlig mix is
incubated at 72 °C for approximately 10 min. in order to melt
the linker from the fragments. An aliquot of the Qlig mix is
then combined with the QPCR mix in the second tube, and a
further incubation at 72 °C for 10 minutes completes the
fragments to blunt-ended dsDNA. After this incubation, the
PCR temperature profile is performed according to the
preferred protocol for a certain number of cycles.
Following the amplification step, optional cleanup
and separation steps prior to length separation and fragment
detection can be advantageous to substantially eliminate
certain unwanted DNA strands and thereby to improve the
signal to noise ratio of QEA™ signals, or to substantially
separate the reaction products into various classes and
thereby to simplify interpretation of detected fragment
patterns by removing signal ambiguities. For example, unused
primer strands and single strands produced by linear
amplification are unwanted in later steps. These steps are
based on previously described primer enhancements including
conjugated capture moieties and release means.
In one embodiment of these optional steps where one
of the two primers used has a conjugated capture moiety, QEA™
reaction products fall into certain categories. These
categories, described without limitation in the case where
the capture moiety is biotin, are:
a) dsDNA fragments neither strand of which has a biotin
moiety;
b) dsDNA fragments having only one strand with a conjugated
biotin moiety;
c) dsDNA fragments having biotin moieties conjugated to
both strands; and
d) unwanted ssDNA strands with and without conjugated
biotin.
The additional method steps comprise contacting the amplified
fragments with streptavidin affixed to a solid support,
preferably streptavidin magnetic beads, washing the beads
in a non-denaturing wash buffer to remove unbound DNA, and
then resuspending the beads in a denaturing loading buffer
and separating the beads from this buffer. The denatured
single strands are then passed to the separation and
detection steps.
As a result of these steps, only the strand of
category "b" without biotin is released into the loading
buffer for separation and detection. Thereby, only fragments
cut on either end by different REs and freed from single
stranded contaminants are separated and detected with
minimized noise. Category "a" products are not bound to the
beads and are washed away in the non-denaturing wash buffer.
Similarly, category "d" products without biotin moieties are
washed away. All products with a conjugated biotin are
retained by the streptavidin beads after washing. The
denaturing loading buffer denatures category "b" and "c"
products attached to the beads, but both strands of category
"c" products have conjugated biotin and remain attached to
the beads. Similarly, category "d" products with conjugated
biotin are retained by the beads.
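
The fate of each product category in the streptavidin capture steps can be summarized by the following Python sketch. It is an illustrative model only: a product is represented by one Boolean flag per strand indicating whether that strand carries biotin, and the function name and fate labels are hypothetical.

```python
def strand_fates(biotin_flags):
    """Fate of each strand; two flags model dsDNA, one flag models ssDNA."""
    if not any(biotin_flags):
        # Nothing binds streptavidin: lost in the non-denaturing wash
        return ["washed away"] * len(biotin_flags)
    # Product is captured; denaturation frees only the unlabeled strands
    return ["retained on beads" if has_biotin else "released for detection"
            for has_biotin in biotin_flags]

category_a = strand_fates((False, False))
category_b = strand_fates((True, False))
category_c = strand_fates((True, True))
category_d_plain = strand_fates((False,))
category_d_biotin = strand_fates((True,))
```

Only category "b" yields a strand that reaches the loading buffer, matching the outcome described above.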
In another embodiment, the biotinylated primer can
include a release means in order to recover fragments of
category "c". After the step of suspension in a denaturing
buffer, the release means, e.g. UDG or AscI, can be applied
to release the biotinylated strands for separation and
detection. Fragments detected at this second separation in
addition to those previously detected then represent category
"c" products.
Further embodiments will be apparent to those of
skill in the art. For example, two or more types of capture
moieties can be used in a single reaction to separate
different classes of products. Capture moieties can be
combined with release means to achieve similar separation.
Label moieties can be combined with capture moieties to
verify separations or to run reactions in parallel.
This invention is adaptable to other less preferred
means for single strand separation and product concentration
that are known in the art. For example, single strands can
be removed by the use of single strand specific exonucleases.
Mung Bean exonuclease, Exo I or S1 nuclease can be used, with
Exo I preferred because of its higher specificity for single
strands while S1 is least preferred. Other methods to remove
unwanted strands include the affinity based methods of gel
filtration and affinity column separation. Amplified
products can be concentrated by ethanol precipitation or
column separation.
The last QEA™ step is separation according to
length of the amplified fragments followed by detection of the
fragment lengths and end labels (if any). Lengths of the
fragments cut from a cDNA sample typically span a range from
a few tens of bp to perhaps 1000 bp. Any separation method
with adequate length resolution, preferably at least to three
base pairs in a 1000 base pair sequence, can be used. It is
preferred to use gel electrophoresis in any adequate
configuration known in the art.
Gel electrophoresis is capable of resolving
separate fragments which differ by three or more base pairs
and, with knowledge of average fragment composition and with
correction of composition induced mobility differences, of
achieving a length precision down to 1 bp. A preferable
electrophoresis apparatus is an ABI 377 (Applied Biosystems,
Inc.) automated sequencer using the Gene Scan software (ABI)
for analysis. The electrophoresis can be done by suspending
the reaction products in a loading buffer, which can be
non-denaturing, in which the dsDNA remains hybridized and
carries the labels (if any) of both primers. The buffer can
also be denaturing, in which the dsDNA separates into single
strands that typically are expected to migrate together (in
the absence of large average differences in strand composition
or significant strand secondary structure). The length
distribution is detected with various detection means. If no
labels are used, means such as Ag staining and intercalating
dyes can be used. Here, it can be advantageous to separate
reaction products into classes, according to the previously
described protocols, in order that each band can be
unambiguously identified as to its target end subsequences.
In the case of fluorochrome labels, since multiple
fluorochrome labels can typically be resolved from a
single band in a gel, the products of one recognition
reaction with several REs or other recognition means or of
several separate recognition reactions can be analyzed in a
single lane. However, where one band reveals signals from
multiple fluorochrome labels, interpretation can be
ambiguous: is such a band due to one fragment cut with
multiple REs or to multiple fragments each cut by one RE? In
this case, it can also be advantageous to separate reaction
products into classes.
Preferred protocols for the specific RE embodiments
are described in detail in Sec. 6.4.
5.2.3. THE SEQ-QEA™ EMBODIMENT
SEQ-QEA™ is an alternative embodiment of the
preferred method of practicing the RE/ligase embodiment of the
QEA™ method as previously described in Sec. 5.2.2. By the use
of adapters comprising specially constructed primers bearing a
recognition site for a Type IIS RE, a SEQ-QEA™ method is able
to identify an additional 4-6 terminal nucleotides adjacent
to the recognition subsequence of the RE initially cutting a
fragment. Thereby, the effective target subsequence is the
concatenation of the initial RE recognition subsequence and
the additional 4-6 terminal nucleotides, and has, therefore,
a length of at least from 8 to 12 nucleotides and preferably
has a length of at least 10 nucleotides. This longer
effective target subsequence is then used in QEA™ analysis
methods as described in Sec. 5.4 ("QEA™ Analysis and Design
Methods"), which involve searching a database of sequences to
identify the sequence or gene from which the fragment
derived. The longer effective target subsequence increases
the capability of these methods to determine a unique source
sequence for a fragment.
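
Why a longer effective target subsequence is more discriminating can be illustrated with a simple chance-match estimate, sketched below in Python. The model of independent, equally likely bases and the database size are assumptions chosen only for illustration and do not describe the analysis methods of Sec. 5.4.

```python
def expected_chance_matches(target_length, database_bases):
    """Expected chance occurrences of one specific target subsequence,
    assuming independent, equally likely bases (a simplification)."""
    return database_bases / 4 ** target_length

DATABASE_BASES = 30_000_000  # hypothetical database size, illustration only

by_length = {n: expected_chance_matches(n, DATABASE_BASES)
             for n in (6, 10, 12)}
```

Each added nucleotide reduces the expected number of chance matches fourfold, so extending a 6 bp recognition subsequence by 4 to 6 nucleotides reduces it by a factor of 256 to 4096.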
In this section, for ease of description and not
limitation, first shall be described Type IIS REs, next the
specially constructed primers, and then the additional method
steps of a SEQ-QEA™ method used to recognize the additional
nucleotides.
which cuts a dsDNA molecule at locations outside of the
recognition sequence of the Type IIS RE (Szybalski et al.,
lS 1991, Gene lOO:13--26). Fig. 17C illustrates Type IIS RE 1731
cutting dsDNA 1730 outside of its recogr.ition subsequence
l720 at locations 1708 ~nd 1709. The Type IIS RE preferably
gener2tss an overhang by cutting the .wo dsDNA strands at
locations differently displaced away on ihe two str;inds from
20 the recognition sequence. Although the recognition
subsequence and the displacement(sj to the cutting site(s)
are determined by the RE and are ~cnown, the sequence of the
~ene-ated overhang is determined by the dsDNA cut, in
particular by its nucleotide sequence outside of the Type IIS
25 recognition region, and is, at first, unknown. Thus in a
SEQ--QEA~ embodiment the overhangs generated by the Type IIS
REs are sequenced. Table 17 in Sec. 6.lO.l lists several
Type IIS REs adaptable for use in the ~;EQ--QEA~ method and
their relevant characteristics, including their recognition
30 subsequences on both DNA strands and the displacements from
these recognition subsequences to the respective cutting
sites. It is preferable to use REs of high specificity and
generating an overhang of at least 4 bp displaced at least 4
or 5 bp beyond the recognition subsequence in order to span
35 the remaining recognition subsequence of the RE that
initially cut the fragment. FokI and BbvI are most preferred
Type IIS REs for the SEQ-QEA~ method.
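As a minimal sketch of how a Type IIS overhang depends on the sequence outside the recognition site, the following Python function reads the nucleotides lying between the two displaced cut positions on one strand. The function name, the single-strand representation, and the test molecules are assumptions made for illustration. The displacements used in the example (9 and 13 for FokI at GGATG, 8 and 12 for BbvI at GCAGC) are the commonly cited values and should be checked against Table 17.

```python
from typing import Optional


def type_iis_overhang(top_strand: str, site: str, near: int, far: int) -> Optional[str]:
    """Return the top-strand nucleotides lying between the two cut
    positions, i.e. the stretch left single stranded by the cut.

    near and far are the displacements, in nucleotides, from the end of
    the recognition site to the cuts on the two strands.
    """
    start = top_strand.find(site)
    if start < 0:
        return None  # no recognition site on this strand
    site_end = start + len(site)
    if site_end + far > len(top_strand):
        return None  # the far cut would fall beyond the end of the molecule
    return top_strand[site_end + near:site_end + far]


# Made-up molecule: 2 nt, FokI site, 9 spacer nt, 4 nt overhang, 4 nt tail.
molecule = "AC" + "GGATG" + "ACGTACGTA" + "TTGC" + "AAAA"
fok_overhang = type_iis_overhang(molecule, "GGATG", near=9, far=13)
```

The recognition site and displacements are fixed by the enzyme, but the returned overhang changes with whatever sequence happens to lie 9 to 13 nucleotides downstream, which is why it has to be sequenced.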
Next, the special primers, and the special linkers
if needed, which hybridize to form the adapters for SEQ-QEA™,
have, in addition to the structure previously described in
Sec. 5.2.1, a Type IIS recognition subsequence whose
placement is important in order that the overhang generated
by the Type IIS enzyme be contiguous to the initial target
end subsequence. The placement of this additional
subsequence is described with reference to Figs. 17A-E, which
illustrate steps in a SEQ-QEA™ alternative embodiment. Fig.
17B schematically illustrates dsDNA 1702, which is a fragment
cut from an original sample sequence on one end by a first RE
and on the other end by a different second initial RE, with
adapters fully hybridized but prior to primer ligation.
Thus, linker strand 1711 has hybridized to primer strand 1712
and to the 5' overhang generated by the first RE, and now
fixes primer 1712 adjacent to fragment 1702 for subsequent
ligation. Primer 1712 has recognition subsequence 1720 for
Type IIS RE 1731. Linker 1711, to the extent it overlaps and
hybridizes with recognition subsequence 1720, has
complementary recognition subsequence 1721. Additionally,
primer 1712 preferably has a conjugated label moiety 1734,
e.g. a fluorescent FAM moiety. Similarly, linker strand 1713
has hybridized to primer strand 1714 and to the 5' overhang
generated by the second RE. Primer 1714 preferably has a
conjugated capture moiety 1732, e.g. a biotin moiety, and a
release means represented by subsequence 1723.
Subsequence 1704 terminating at nucleotide 1707 in
Fig. 17B is the portion of the recognition subsequence of the
first RE remaining after its cutting of the original sample
sequence. The placement of the Type IIS RE recognition
subsequence is determined by the length of this subsequence.
Fig. 17A schematically illustrates how the length of
subsequence 1704 is determined by properties of the first RE.
The first initial RE is chosen to be of a type that
recognizes subsequence 1703, terminating with nucleotide
1707, of sample dsDNA 1701, and that cuts the two strands of
dsDNA 1701 at locations 1705 that are located within
recognition subsequence 1703. In order that the first RE
recognize a known target subsequence, it is highly preferable
that subsequence 1703 be entirely determined by the first RE
and be without indeterminate nucleotides. As a result of
this cutting, overhang subsequence 1706 is generated and has
a known sequence, since it is entirely within the determined
recognition subsequence 1703. Thereby, subsequence 1704, the
portion of the recognition subsequence 1703 remaining on a
fragment cut by the first RE, has a length not less than the
length of overhang 1706 and is typically longer. Typically
and preferably, subsequence 1703 is of length 6 and is
palindromic; locations 1705 are symmetrically placed in
subsequence 1703; and overhang 1706 is of length 4.
Therefore, the typical length of the remaining portion 1704
of the recognition subsequence 1703 is 5, namely the 4
overhang nucleotides plus the one flanking nucleotide
retained on the fragment.
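The arithmetic behind the typical length of 5 can be written out as a small Python helper. This is a sketch under the assumptions stated in the text (a palindromic site cut symmetrically on the two strands); the function name is hypothetical.

```python
def remaining_portion_length(site_length: int, overhang_length: int) -> int:
    """Length of the recognition subsequence left on a fragment when a
    site is cut symmetrically on both strands, leaving the given overhang."""
    if overhang_length > site_length or (site_length - overhang_length) % 2 != 0:
        raise ValueError("cuts cannot be symmetric for these lengths")
    # Number of site nucleotides on the far side of the cut, lost to the
    # neighbouring fragment.
    lost = (site_length - overhang_length) // 2
    return site_length - lost


typical = remaining_portion_length(site_length=6, overhang_length=4)
```

For a 6 nt site with a 4 nt overhang one nucleotide is lost and 5 remain; the result is never shorter than the overhang itself, as the text notes.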
The preferred placement of Type IIS recognition
sequence 1720 is now described with reference to Fig. 17C,
which schematically illustrates dsDNA 1730, which derives
from dsDNA 1702 of Fig. 17B after the further steps of primer
ligation, PCR amplification with primers 1712 and 1714,
binding of capture moiety 1732 to binding partner 1733
affixed to a solid-phase substrate, and binding of Type IIS
RE 1731 to its recognition subsequence 1720. Subsequence
1722 is the subsequence between recognition subsequence 1720
and the end of primer 1712 at location 1705. Type IIS RE
1731 is illustrated cutting dsDNA 1730 at nucleotide
locations 1708 and 1709 and, thereby, generating an exemplary
5' overhang 1724 between these locations. For this overhang
to be contiguous with the remaining portion 1704 of initial
target end subsequence 1703, nucleotide 1709 is adjacent to
nucleotide 1707 terminating subsequence 1704. Therefore,
Type IIS recognition sequence 1720 is preferably placed on
primer 1712 such that the length of subsequence 1704 plus the
length of subsequence 1722 equals the distance of closest
cutting of Type IIS RE 1731. For example, in the case of
FokI, since the closest cutting distance is 9 and the typical
length of subsequence 1704 is 5, its recognition sequence is
preferably placed 4 bp from the end of primer 1712. In the
case of BbvI, since the closest cutting distance is 8, its
recognition sequence is preferably placed 3 bp from the end
of primer 1712.
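The placement rule stated above (length of subsequence 1704 plus length of subsequence 1722 equals the closest cutting distance) can be checked with a short Python sketch. The function name and the default remaining length of 5 are assumptions for illustration.

```python
def recognition_site_offset(closest_cut: int, remaining_length: int = 5) -> int:
    """Number of primer nucleotides (subsequence 1722) between the Type IIS
    recognition sequence and the end of the primer, chosen so that the
    overhang starts right after the remaining first-RE subsequence (1704)."""
    offset = closest_cut - remaining_length
    if offset < 0:
        raise ValueError("closest cut falls inside the remaining subsequence")
    return offset


fok_offset = recognition_site_offset(closest_cut=9)  # FokI
bbv_offset = recognition_site_offset(closest_cut=8)  # BbvI
```

With a remaining length of 5, the rule gives 9 - 5 = 4 for a closest cutting distance of 9 and 8 - 5 = 3 for a closest cutting distance of 8.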
Finally, Fig. 17D schematically illustrates dsDNA
1730 after cutting by Type IIS RE 1731. dsDNA 1730 has 5'
overhang 1724 between and including nucleotides 1708 and
1709, where the Type IIS RE cut dsDNA 1730 of Fig. 17C. This
overhang is contiguous with former subsequence 1704, the
remaining portion of the recognition sequence of the first
RE, which has been cut off. The shorter strand has primer
1714 including release means represented by subsequence 1723.
dsDNA 1730 remains bound to the solid-phase support through
capture moiety 1732 and binding partner 1733. The absence of
label moiety 1734 can be used to monitor the completeness of
cutting by Type IIS RE 1731.
This invention is also adaptable to other less
preferable placements of recognition sequence 1720. If
recognition sequence 1720 is placed closer to the 3' end of
primer 1712 than the optimal and preferable distance, the
overhang produced by Type IIS RE 1731 is not contiguous with
recognition subsequence 1703 of the first RE, and a
contiguous effective target subsequence is not generated. In
this case, optionally, the determined sequence of the Type
IIS RE generated overhang can be used as third internal
subsequence information in QEA™ experimental analysis methods
in order to further resolve the source sequence of fragment
1702, if necessary. If recognition sequence 1720 is placed
further from the 3' end of the cut primer than the optimal
and preferable distance, the overhang produced by Type IIS RE
1731 overlaps with recognition subsequence 1703 of the first
RE. In this case, the length of the now contiguous effective
target subsequence is less than the sum of the lengths of the
Type IIS overhang and the first RE recognition subsequence.
Effective target end subsequence information is, thereby,
lost. In case recognition sequence 1720 is placed further
from the 3' end than the distance of furthest cutting, no
additional information is obtained.
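The placement cases just described can be summarized in a small Python sketch. The model is an assumption made for illustration: distances are counted in nucleotides from the end of the Type IIS recognition sequence, `offset` is the length of subsequence 1722, and the overhang is taken to span the nucleotides between the closest and furthest cuts. The function name and return format are hypothetical.

```python
def placement_outcome(offset, closest_cut, furthest_cut, remaining_length=5):
    """Classify a placement of the Type IIS recognition sequence.

    Returns (case, gap, new_nucleotides), where gap is the number of
    unsequenced nucleotides between subsequence 1704 and the overhang,
    and new_nucleotides is the number of overhang nucleotides lying
    beyond subsequence 1704.
    """
    # Distance from the recognition sequence to nucleotide 1707.
    reach = offset + remaining_length
    overhang_length = furthest_cut - closest_cut
    if reach == closest_cut:
        return ("contiguous", 0, overhang_length)
    if reach < closest_cut:
        # Site too close to the primer's 3' end: overhang starts past 1707.
        return ("gap", closest_cut - reach, overhang_length)
    if reach < furthest_cut:
        # Site too far from the 3' end: overhang partly re-reads 1704.
        return ("overlap", 0, furthest_cut - reach)
    return ("no new information", 0, 0)
```

For cuts at distances 9 and 13, an offset of 4 is contiguous with 4 new nucleotides, an offset of 2 leaves a 2 nt gap, an offset of 6 yields only 2 new nucleotides, and an offset of 8 or more yields none.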
Primer 1714 also has certain additional structure.
First, primer 1714 has capture moiety 1732 conjugated near or
to its 5' end. Biotin/streptavidin are the preferred capture
moiety/binding partner pair, which are used in the following
description without limitation to this invention. Second,
primer 1714 has release means represented as subsequence
1723. As previously described, the release means allows
controlled release of strand 1735 of Fig. 17D from the
capture moiety/binding partner complex. This alternative is
adaptable to any such controlled release means, including the
cases where subsequence 1723 is one or more uracil
nucleotides and where it is the recognition subsequence of an
RE which cuts extremely rarely if at all in the sequences of
the sample, e.g. AscI. Release means are particularly useful
in the case of biotin-streptavidin, which form a complex that
is difficult to dissociate.
Table 18 of Sec. 6.10.1 lists exemplary primers,
linkers, and associated REs, for the preferred implementation
of SEQ-QEA™ in which contiguous effective target end
subsequences are formed. This description has illustrated
the generation of a 5' Type IIS generated overhang. Primers
can equally be constructed to generate a less preferable 3'
overhang by using a Type IIS RE whose closest cutting
distance is on the 3' strand, rather than on the 5' strand.
Finally, the method steps of SEQ-QEA™ are now
described. SEQ-QEA™ comprises, first, practicing the
RE/ligase embodiment of QEA™ using the special primers and
linkers previously described followed, second, by certain
additional steps unique to SEQ-QEA™. Figs. 17B-E illustrate
various steps in a SEQ-QEA™ method. Fig. 17B illustrates a
fragment from a sample sequence digested by two different REs
and just prior to primer ligation. Fig. 17C illustrates a
sample sequence after primer ligation, chain blunt-ending,
and PCR amplification. These QEA™ steps are preferably
performed according to the embodiments described in Sec.
5.2.2, but can alternatively be performed by any RE/ligase
embodiment. The additional steps unique to SEQ-QEA™ include,
first, binding the amplified fragments to a solid-phase
support, also illustrated in Fig. 17C, second, washing the
bound fragments, and third, digesting the bound fragments by
the Type IIS RE corresponding to primer 1712 used. The Type
IIS digestion is preferably performed with reaction
conditions suitable to achieve complete digestion, which can
be checked by ensuring the absence of optional label moiety
1734 after washing the bound, digested sequences. Fig. 17D
illustrates dsDNA fragments 1730 remaining after complete
digestion by the Type IIS RE. Before Type IIS digestion, an
aliquot of the bound, amplified RE/ligase reaction products
is denatured and the supernatant, containing the labeled 5'
strands, is separated according to length by, e.g., gel
electrophoresis, in order to determine the length of each
fragment doubly cut by different REs.
The subsequent additional SEQ-QEA™ step is
sequencing of overhang 1724. This can be done in any manner
known in the art. In a preferred embodiment suitable for
lower fragment quantities, an alternative, herein called a
phasing QEA™ method, can be used to sequence this overhang.
Phasing QEA™ depends on the precise sequence specificity with
which RE/ligase reactions recognize short overhangs, in this
case the Type IIS generated overhang. Fig. 17E illustrates a
first step of this embodiment in which a QEA™ method adapter,
which is comprised of primer 1751 with label moiety 1753 and
linker 1750, has hybridized to overhang 1724 in Type IIS
digested fragment 1730 bound to a solid-phase support. By
way of example only, overhang 1724 is here illustrated as
being 4 bp long. In this embodiment, special phasing linkers
are used. For each nucleotide position of overhang 1724,
e.g. position 1754, 4 pools of linkers 1750 are prepared.
All linkers in each pool have one fixed nucleotide, i.e. one
of either A, T, C, or G, at that position, e.g. position
1755, while random nucleotides in all combinations are
present at the other three positions. For each nucleotide
position of the overhang, four RE/ligase reactions are
performed according to QEA™ protocols, each reaction using
linkers from one of the four corresponding pools. Linkers
from only one pool, that having a nucleotide complementary to
overhang 1724 at position 1754, hybridize without error, and
only these linkers can cause ligation of primer 1751 to the
5' strand of fragment 1730. When the results of the four
RE/ligase reactions are denatured and separated according to
length, only one reaction of the four can produce labeled
products at a length corresponding to the length of fragment
1730, namely the reaction with linkers complementary to
position 1754 of overhang 1724. Thereby, by performing four
RE/ligase reactions for each nucleotide position of overhang
1724, this overhang can be sequenced. Optionally, the
products of these four RE/ligase reactions can be further PCR
amplified. In a further option, if linkers 1750 comprise
subsequence 1756 that is uniquely related to the fixed
nucleotide in subsequence 1752 and if four separately and
distinguishably labeled primers 1751 complementary to these
unique subsequences are used, all four RE/ligase reactions
for one overhang position can be simultaneously performed in
one reaction tube. With this overhang sequencing alternative
embodiment, release means 1723 can be omitted from primer
1714.
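The logic of the phasing pools can be simulated in Python. This is a simplified sketch, not the laboratory protocol: ligation is modeled as requiring a perfectly complementary linker end, linker ends are written position by position against the overhang (the antiparallel orientation is implied rather than modeled), and all function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def linker_pool(position: int, base: str, length: int = 4) -> list[str]:
    """All linker ends with `base` fixed at `position` and every
    combination of nucleotides at the remaining positions."""
    pool = []
    for others in product("ACGT", repeat=length - 1):
        end = list(others)
        end.insert(position, base)
        pool.append("".join(end))
    return pool


def ligates(pool: list[str], overhang: str) -> bool:
    """Model a reaction as positive only if the pool contains the linker
    end that is complementary to the overhang at every position."""
    perfect = "".join(COMPLEMENT[n] for n in overhang)
    return perfect in pool


def sequence_overhang(overhang: str) -> str:
    """Run four simulated reactions per position and read each overhang
    nucleotide as the complement of the one fixed base that ligated."""
    called = []
    for position in range(len(overhang)):
        positive = [b for b in "ACGT"
                    if ligates(linker_pool(position, b, len(overhang)), overhang)]
        assert len(positive) == 1  # exactly one pool ligates per position
        called.append(COMPLEMENT[positive[0]])
    return "".join(called)
```

Each pool for a 4 nt overhang holds 4^3 = 64 linker ends, and a 4 nt overhang is read with 16 reactions in total, or 4 if the four pools for a position are distinguishably labeled and combined.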
In an alternate embodiment, sequencing of a 5'
overhang can be done by standard Sanger reactions. Thus
strand 1735 is elongated by a DNA polymerase in the presence
of labeled ddNTPs at a relatively high concentration to dNTPs
in order to achieve frequent incorporation in the short 4-6
bp elongation. Partially elongated strands 1735 are released
by denaturing fragment 1730, washing, and then by causing
release means 1723 to release strands 1735 from the capture
moiety bound to the solid phase support. The released,
partially elongated strands are then separated by length,
e.g., by gel electrophoresis, and the chain terminating ddNTP
is observed at the length previously observed for that
fragment. In this manner, the 4-6 bp overhang 1724 of each
fragment can be quickly sequenced.
The effective target subsequence information,
formed by concatenating the sequence of the Type IIS overhang
to the sequence of the recognition subsequence of the first
RE, is then input into QEA™ Experimental Analysis methods,
and is used as a longer target subsequence in order to
determine the source of the fragment in question. This
longer effective target subsequence information preferably
permits exact and unique sample sequence identification.
5.2.4. 5'-QEA™ ALTERNATIVE RE EMBODIMENT
In QEA™ embodiments of this invention, it is
important that the one or more fragments of a nucleic acid
from a sample which are generated by the recognition
reactions be of definite length, that is that the length of
each fragment depends only on the sequence of the nucleic
acid and not on experimental conditions, e.g., the synthesis
conditions of the nucleic acid. Further, it is important for
the experimental analysis and design methods of Sec. 5.4 that
the length of a fragment be precisely predictable from the
nucleotide sequence of the sample nucleic acid. In the
preferred RE/ligase embodiments of QEA™, these goals are
accomplished primarily by selecting signals from fragments
doubly cut on both ends by one or more REs. The nucleotide
distance between adjacent RE recognition subsequences is
determined only by the sequence of nucleic acid from the
sample. Also the described alternatives and extensions
generate additional signal information dependent only on the
nucleic acid sequence. In these embodiments, nucleic acid,
e.g. cDNA, synthesis conditions are then only of indirect
importance, in that they preferably adequately represent
input mRNA.
Other RE/ligase embodiments utilize signals from
fragments of a nucleic acid that, although only singly cut by
an RE on one end, nevertheless have a definite length,
dependent only on nucleotide sequence, because of particular
cDNA synthesis conditions that fix the other end. For these
embodiments, therefore, the cDNA synthesis conditions are of
direct importance, in that these embodiments can only be used
with cDNA synthesized according to the particular conditions.
In general, these conditions ensure that the cDNA begins or
ends in a known relation, herein called "anchored," to
general landmarks on the input mRNA. In particular,
preferable anchoring landmarks include the 5' end of the
poly(A)+ tail present on the 3' end of the input mRNA, or the
cap on the 5' end of the input mRNA. For example, cDNA
fragments terminated on their 5' end in a fixed relation to
the 5' cap of the source mRNA and cut on their 3' end at the
nearest recognition subsequence of a single RE have a
definite length and generate QEA™ signals that can be used to
determine the source nucleic acid in the sample. Similarly,
cDNA fragments terminated on their 3' end in a fixed relation
to the 5' end of the poly(A)+ tail present on the source mRNA
and cut on their 5' end at the nearest recognition sequence
of a single RE also have a definite length and generate QEA™
signals that can also be used to determine the source nucleic
acid in the sample.
Turning first to the case of 5' anchored cDNA, such
cDNA can be synthesized by a protocol which requires the
presence of an intact 5' cap on the input mRNA. One such
exemplary preferred protocol is described in Sec. 6.3.3.
This protocol depends upon using an RNA ligase to ligate to a
source mRNA at the nucleotide adjacent to the 5' cap a DNA-
RNA chimera comprising a first DNA subsequence 5' to the
ribonucleotide triplet GGA at the 3' end of the chimera. The
RNA component of the DNA-RNA chimera is preferably GGA, but
any RNA subsequence can be used that promotes effective
ligation by the ligase chosen of the chimera to the source
mRNA. The DNA oligonucleotide component is later used as a
primer and is herein called a "5'-cap-primer"
oligonucleotide. This ligation is accomplished by
dephosphorylating input mRNA with an alkaline phosphatase and
then cleaving the 5' cap with an acid pyrophosphatase,
preferably tobacco acid pyrophosphatase, leaving a 5'
phosphate needed for ligation only on mRNAs having a 5' cap.
During the ligation step, an excess of primer is used to
prevent self-ligations of the input mRNA. The preferred RNA
ligase is T4 RNA ligase. First strand synthesis is then
performed with a first DNA primer comprising the first DNA
subsequence. Thereby, all cDNAs originate from input mRNAs
having their 5' cap. Second strand synthesis is then
performed with such second strand primers as are known in the
art. Preferably, second strand primers are three second
strand primers mixed or in separate pools, each of which
comprises a second DNA subsequence 5' to one of three
oligo(dT) one-nucleotide phasing primers, as known in the art
(Liang et al., 1994, Nuc. Acid Res. 22:5763-5764).
Alternatively, other primers known in the art could be used,
including a single oligo(dT) primer, a sequence specific
primer, or random primers. For small amounts of input mRNA,
the first DNA primer and a second DNA primer comprising the
second DNA subsequence can be used in a PCR reaction to
amplify the synthesized cDNA. This QEA™ embodiment is
adaptable to other methods known in the art to produce cDNAs
with a 5' end anchored in a fixed relation to the 5' mRNA
cap, for example the CapFinder™ PCR cDNA Library Construction
Kit, Clontech (Palo Alto, CA). See also Schmidt et al.,
1996, Nuc. Acids. Res. 24:1789-1791.
The first and second DNA primer sequences are
preferably chosen according to certain guidelines. First,
they are chosen not to generate by themselves any PCR
products from the cDNA sample nucleic acids. Second, they
are of a sufficient length and average base content
(approximately 60% G+C) to hybridize in high stringency
conditions. Third, they have no significant secondary
structure. Finally, they can have included RE recognition
sites, initiators, etc. to promote later cloning or
expression. Exemplary first and second primers are described
in Sec. 6.3.3. Software packages are available for primer
construction according to such guidelines, an example being
OLIGO™ Version 4.0 For Macintosh from National Bioscience
Inc. (Plymouth, MN).
Having cDNA synthesized according to the exemplary
5' anchoring protocol, the 5'-QEA™ embodiment is performed
according to the general methods of Sec. 5.2.2, including the
optional cleanup and separation steps. In particular, the
QPCR mix is prepared as previously described. The Qlig mix
includes the one RE chosen to cut the fragment and an
associated adapter with primer excess. These primers are
preferably labeled and most preferably do not have a
conjugated capture moiety. Also included in the Qlig mix in
a quantity sufficient for PCR amplification is an extra
primer, which is the first DNA primer, that is the DNA
portion of the chimera now appearing on the 5' end of the
synthesized cDNA, together with a conjugated biotin moiety or
other capture moiety. The RE/ligase reactions and the
subsequent PCR amplification are performed as previously
described and result in the following classes of fragments.
First, there are fragments singly cut by the chosen RE which
are exponentially amplified because of the presence of the
first DNA primer and which have on their 5' ends the biotin
labeled first DNA primer. Second, there are exponentially
amplified fragments doubly cut by the chosen RE which have no
biotin labels. Third, there can be linearly amplified, non-
labeled, singly cut fragments. After contacting these
reaction products with streptavidin beads and washing, only
the first class of fragments is retained, that is fragments
singly cut adjacent to the 5' end. Upon resuspending the
beads in a denaturing loading buffer, only the denatured
single strands from such fragments generate signals after the
separation and detection steps. These signals have a
definite length, because the RE recognition site nearest the
5' end is determined only by the sequence of the nucleic
acid.
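The three fragment classes and the effect of the streptavidin capture can be sketched in Python. The model is deliberately simplified and rests on assumptions: the cDNA is a plain string, each site is cut at its first nucleotide rather than at the enzyme's true cut position, primer and adapter lengths are ignored, and the function names and example sequence are invented.

```python
def find_sites(seq: str, site: str) -> list[int]:
    """Start positions of every occurrence of the recognition site."""
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)
    return positions


def classify_fragments(cdna: str, site: str) -> list[dict]:
    """Split a 5' anchored cDNA at each site and label each piece with
    its class and whether it carries the biotin labeled first primer."""
    cuts = find_sites(cdna, site)
    if not cuts:
        return []  # an uncut cDNA yields no RE/ligase product
    bounds = [0] + cuts + [len(cdna)]
    fragments = []
    for k in range(len(bounds) - 1):
        if k == 0:
            kind = "singly cut, 5' anchored"
        elif k == len(bounds) - 2:
            kind = "singly cut, 3' end"
        else:
            kind = "doubly cut"
        fragments.append({
            "length": bounds[k + 1] - bounds[k],
            "kind": kind,
            "biotin": k == 0,  # only the 5' piece carries the first primer
        })
    return fragments


cdna = "ACGTAC" + "GAATTC" + "TTTT" + "GAATTC" + "GG"
fragments = classify_fragments(cdna, "GAATTC")
retained = [f["length"] for f in fragments if f["biotin"]]
```

Only the piece between the 5' end and the nearest site survives the capture step, so each cDNA contributes a single signal whose length depends only on its sequence.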
Turning to the less preferred case of 3' anchored
cDNA, such cDNA can be synthesized by protocols known in the
art which utilize phasing primers. Such phasing primers can
comprise a first DNA subsequence, which is constructed
according to the previously described primer guidelines, 5'
to one of three oligo(dT) one nucleotide phasing primer
subsequences (Liang et al. 1994). Sequences MBTA, MBTC, and
MBTG of Sec. 6.3.3 are exemplary of such primers. The
RE/ligase and PCR amplification reactions are carried out
according to the protocol of the 5'-QEA™ embodiment with the
exception that the extra primer used in the Qlig mix is the
first DNA subsequence used in the prior cDNA synthesis with a
conjugated biotin or other capture moiety. After completion
of the protocol, signals are only generated from fragments
cut by the chosen RE adjacent to the 3' end. These signals
have a definite length, because the RE recognition site
nearest the 3' end is determined only by the sequence of the
nucleic acid.
The signals generated from the singly cut fragments
according to the protocols of this section can be used in the
computer implemented experimental analysis methods of Sec.
5.4 in order to determine the sample nucleic source of a
particular signal. The analysis methods need minimal
adaptation in a manner that will be apparent to one of skill
in the computer arts in order that the 5' or 3' end cDNA
sequence is one of the target end sequences. This adaptation
can be done in several ways, including simply specially
marking in the signals that one target end subsequence is the
3' or 5' end as needed or by including in the generated
signal an artificial and not naturally occurring target
subsequence that represents the 3' or the 5' end as
appropriate and concatenating these artificial subsequences
to nucleic acid sequences input from a database prior to
computer processing. Similar minimal adaptations to the
computer implemented experimental design methods can be made
in order to create and optimize experiments generating singly
cut fragments.
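One way to picture the second adaptation, an artificial end marker concatenated to each database sequence, is the following Python sketch. Everything here is an assumption made for illustration: the marker character, the function names, the toy database, and the definition of fragment length as the number of nucleotides strictly between the two targets. It is not the analysis method of Sec. 5.4.

```python
from typing import Optional

# Hypothetical artificial target; it cannot occur in a nucleotide sequence.
FIVE_PRIME_END = "^"


def fragment_length(seq: str, left: str, right: str) -> Optional[int]:
    """Nucleotides between the first `left` target and the next `right`
    target, or None if the pair does not occur in that order."""
    i = seq.find(left)
    if i < 0:
        return None
    j = seq.find(right, i + len(left))
    if j < 0:
        return None
    return j - i - len(left)


def match_signal(database: dict[str, str], left: str, right: str,
                 observed: int) -> list[str]:
    """Names of sequences predicted to give a fragment of the observed
    length, after prepending the end marker to every sequence."""
    marked = {name: FIVE_PRIME_END + seq for name, seq in database.items()}
    return [name for name, seq in marked.items()
            if fragment_length(seq, left, right) == observed]


# Made-up sequences for illustration only.
database = {
    "geneA": "ACGTACGAATTCTT",
    "geneB": "ACGAATTCGGGG",
    "geneC": "ACGTACGTTT",
}
hits = match_signal(database, FIVE_PRIME_END, "GAATTC", observed=6)
```

Because the marker behaves like any other target subsequence, the same lookup serves both doubly cut fragments and fragments anchored at the 5' end, which is the sense in which the adaptation is minimal.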
The embodiments described in this section, in
particular 5'-QEA™, can be practiced in combination with QEA™
embodiments herein described. It will be apparent to one of
skill in the art how such combinations can be performed.
Specifically, it is advantageous to combine 5'-QEA™ with SEQ-
QEA™ to obtain signals which include longer effective target
subsequence information on the singly cut end along with
information on the distance of the effective target
subsequence from the end of the cDNA.
5.2.5. FURTHER ALTERNATIVE RE EMBODIMENTS
The embodiments of this section remove unwanted
RE/ligase reaction products at least partially by utilizing
cDNA with conjugated capture moieties, obtained perhaps from
either first and second strand synthesis with primers having
conjugated capture moieties or from PCR amplification of cDNA
with such primers. The preferred capture moiety is biotin
for which the corresponding binding partner is streptavidin
attached to a solid support, preferably magnetic beads.
These embodiments are adaptable to other capture moieties and
corresponding binding partners.
A first QEA™ embodiment in conjunction with
sufficiently sensitive detection means can advantageously
minimize or eliminate altogether the PCR amplification step.
PCR amplification disadvantageously has a non-linear response
well known in the arts, depending on such factors as fragment
length, average base composition, and secondary structure.
To improve quantitative response, it is preferred to
eliminate the PCR amplification step or at least to minimize
the number of PCR cycles. Then output signal intensity is
more nearly linearly responsive to the abundance of the input
nucleic acids generating that signal.
In the previously described RE/ligase embodiments
the amplification step serves both to amplify the signals
from fragments of interest and simultaneously to dilute the
signals from unwanted fragments without a definite sequence-
dependent length. For example, in the protocol of Sec.
5.2.2, fragments doubly cut with REs and ligated to adapters
are exponentially amplified, while unwanted fragments singly
cut by an RE are at best linearly amplified. After ten
cycles of amplification, since doubly cut fragments are
amplified 1000X while singly cut fragments are amplified 10X,
fragments from sample nucleic acids with a relative abundance
of 1% or more can be detected above the background noise
while fragments from sample nucleic acids with a relative
abundance of 1% or less can be lost in the unwanted
background. More amplification cycles permit both greater
sensitivity and greater ability to observe rare fragments
from rare sequences.
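The ten-cycle figures above follow from simple arithmetic, sketched here in Python. The model assumes ideal amplification (perfect doubling per cycle for doubly cut fragments, about one additional copy per cycle for singly cut fragments); the function names are hypothetical.

```python
def amplification_factors(cycles: int) -> tuple[int, int]:
    """Return (exponential, linear) amplification after the given number
    of ideal PCR cycles."""
    exponential = 2 ** cycles  # doubly cut fragments, primed at both ends
    linear = cycles            # singly cut fragments, primed at one end
    return exponential, linear


def background_ratio(cycles: int) -> float:
    """Singly cut background relative to doubly cut signal for templates
    of equal starting abundance."""
    exponential, linear = amplification_factors(cycles)
    return linear / exponential


ten_cycles = amplification_factors(10)
ratio_10 = background_ratio(10)
```

After ten cycles the factors are 1024 and 10, a ratio of roughly 1%, which is why signals from sequences near 1% relative abundance sit at the edge of the background. The ratio falls quickly with further cycles.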
More sensitive detection means decrease the need
for amplification in order to generate observable signals.
In the case of standard fluorescent detection means, a
minimum of 6 x 10-l~ moles of fluorochrome (approximately 10^5
molecules) is required for detection. Since one gram of cDNA
contains about 10~ moles of transcripts, it is possible to
detect transcripts to at least a 1% relative level from
microgram quantities of mRNA. With greater mRNA quantities,
proportionately rarer transcripts are detectable. Labeling
and detection schemes of increased sensitivity permit use of
less mRNA. Such a scheme of increased sensitivity is
described in Ju et al., 1995, Fluorescent energy transfer
dye-labeled primers for DNA sequencing and analysis, Proc.
Natl. Acad. Sci. USA 92:4347-4351. Possible single molecule
detection means are about 10^5 times more sensitive than
existing fluorescent means (Eigen et al., 1994, Proc. Natl.
Acad. Sci. USA 91:5740-5747).
To minimize or eliminate amplification steps, the
first embodiment described in this section minimizes the need
for amplification in order to dilute unwanted signals by
using a capture moiety to remove unwanted singly cut
fragments from the doubly cut fragments of interest. In the
protocols of Sec. 5.2.2, only the doubly cut fragments have
definite lengths dependent only on the sequence of the input
nucleic acids. Singly cut fragments have non-diagnostic
lengths depending also on cDNA synthesis conditions. In this
protocol, PCR amplification can be optionally employed to
generate sufficient signal intensity for detection. It is
not necessary to minimize the background noise generated in
the previously described protocols. The steps of this
protocol comprise synthesis of cDNA using a primer labeled
with a capture moiety, circularization of the cDNA, cutting
with REs, and ligation to adapters. Singly cut ends are then
removed by contacting the reaction products with a solid
phase to which the binding partner of the capture moiety is
affixed.
Figs. 4A, 4B, and 4C illustrate this alternative
protocol, which preferably uses biotin as a capture moiety
for direct removal of the singly cut 3' and 5' cDNA ends from
the RE/ligase reaction products. cDNA first strands are
synthesized according to the method of Sec. 6.3.3 using, for
example, an oligo(dT) primer with a biotin molecule linked to
a thymidine nucleotide. For example, such a primer is
TnT(biotin)Tm, with n approximately equal to m, and with n + m
sufficiently large, approximately 12 to 2~, so that the
primer will reliably hybridize to the poly(A) tail of mRNA.
Other biotin labeled primers may also be used, such as random
hexamers. Double stranded cDNA is then synthesized, also
according to Sec. 6.3.3. In this embodiment, terminal
phosphates are retained. Fig. 4A illustrates such a cDNA 401
with ends 407 and 408, poly(dA) subsequence 402, oligo(dT)
primer 403 with biotin 404 attached. Subsequence 405 is the
recognition sequence for RE1; subsequence 406 is the
recognition sequence for RE2. Fragment 409 is the cDNA
sequence defined by these adjacent RE recognition sequences.
Fragments 423 and 424 are singly cut fragments resulting from
RE cleavages at subsequences 405 and 406.
Next, the cDNA is ligated into a circle. A
ligation reaction using, for example, T4 DNA ligase is
performed under sufficiently dilute conditions so that
predominantly intramolecular ligations occur circularizing
the cDNA, with only a minimum of intermolecular, concatamer
forming ligations. Reaction conditions favoring
circularization versus concatamer formation are described in
Maniatis, 1982, Molecular Cloning A Laboratory Manual, pp.
124-125, 286-288, Cold Spring Harbor, NY. A DNA
concentration of less than approximately 1 µg/ml has been
found adequate to favor circularization. Concatamers can be
separated from circularized single molecules by size
separation using gel electrophoresis, if necessary. Fig. 4B
illustrates the circularized cDNA. Blunt end ligation
occurred between ends 407 and 408.
Then the circularized, biotin labeled, cDNA is cut
with REs and ligated to adapters uniquely recognizing and
perhaps uniquely labeled for each particular RE cut. The
RE/ligase step is performed by procedures described in the
sections hereinabove, for example in Sec. 5.2.2, so that RE
digestion and primer ligation proceed to completion with
minimal formation of concatamers and other unwanted ligation
products. Next, unwanted singly cut ends are removed by
contacting the reaction products with streptavidin or avidin
magnetic beads, leaving only doubly cut fragments that have
RE-specific recognition sequences ligated to each end. Fig.
4C illustrates these steps. Sequences 405 and 406 are cut by
RE1 and RE2, respectively, and adapters 421 and 422 specific
for cuts by RE1 and RE2, respectively, are ligated onto the
overhangs. Thereby, fragment 409 is freed from the
circularized cDNA and adapters 421 and 422 are ligated to it.
The remaining segment of the circularized cDNA comprises
singly cut ends 423 and 424 with ligated adapters 421 and
422. Both singly cut ends are joined to the primer sequence
403 with attached biotin 404. Removal is accomplished by
contact with streptavidin or avidin 420 which is fixed to
substrate 425, perhaps comprising magnetic beads. Doubly cut
labeled fragment 409, now separated from the singly cut ends,
can be separated according to length and detected with
minimized background noise signals.
Thereby, signals from the labeled doubly cut ends
of interest can be directly detected with minimal
contamination from signals from unwanted labeled singly cut
ends. Importantly, the detected signals more quantitatively
reflect the relative abundance of the source cDNA, and thus
gene expression levels. Optionally, if the signal levels are
too low for direct detection, the reaction products can be
subjected to just the minimum number of amplification cycles,
for example according to the methods of Sec. 5.2.2, to detect
the gene or sequence of interest. For example, the number of
cycles can be as small as four to eight without any concern
of background contamination or noise. Thus, in this
embodiment, amplification is not needed to suppress signals
from singly cut ends, and preferred more quantitative
response signal intensities result.
Another QEA™ embodiment amplifies the cDNA sample
prior to the RE/ligase reactions, removes unwanted fragments
with a removal means, and then separates and detects the
reaction products. Alternately, further amplification of the
fragments of interest can be performed after the RE/ligase
step.
In this embodiment, first, double stranded cDNA,
perhaps prepared from a tissue sample according to Sec.
6.3.1, is PCR amplified using primers with a conjugated
capture moiety, preferably biotin. Any suitable primers known
in the art, all biotin-labeled, can be used. For example, a
set of arbitrary primers with no net sequence preference can
be used. For a further example, where the cDNA is synthesized
according to the protocol of Sec. 6.3.3, the method of step 6
of that protocol can be used, except that both the MA24 and
MB24 have a conjugated biotin. The resulting cDNA with
biotin linked to both ends is then cut with one or more REs
and ligated to adapters corresponding to the REs used. The
adapter primers can be optionally labeled but cannot have a
conjugated biotin. The RE/ligase reaction is preferably
performed according to the protocols of Sec. 5.2.2 in order
that the RE digestion and adapter ligation proceed to
completion with minimal formation of concatamers and other
unwanted ligation products. The reaction products comprise
fragments of interest that are doubly cut by REs and without
any conjugated biotin, and unwanted fragments with a biotin
conjugated to one end that are singly cut and derive from the
ends of cDNAs. Next, the unwanted singly cut fragments are
removed by contacting the reaction products with streptavidin
beads. Optionally, the purified fragments of interest can be
blunt-ended and subjected to further PCR amplification for a
minimum number of cycles to observe the signals of interest.
Finally, the products are then analyzed, also as in the prior
sections, by separation according to length and by detection
of the DNA and of the optionally labeled adapter primers,
which indicate the RE cutting each fragment.
Other direct removal means may alternatively be
used in this invention. Such removal means include but are
not limited to digestion by single strand specific nucleases
or passage through a single strand specific chromatographic
column, for example, containing hydroxyapatite.
It will be apparent to those of skill in the art
that these alternative protocols using cDNAs with a
conjugated capture moiety can be combined with the other QEA™
embodiments in various manners. This invention encompasses
all such insubstantially different variations.
5.3. PCR EMBODIMENT OF QEA™
An alternative implementation of QEA™ methods not
using REs is based on PCR, or alternative amplification
means, to select and amplify cDNA fragments between chosen
target subsequences recognized by amplification primers.
See, generally, Innis et al., 1989, PCR Protocols: A Guide to
Methods and Applications, Academic Press, New York, and Innis
et al., 1995, PCR Strategies, Academic Press, New York.
Typically target subsequences between four and
eight base pairs long chosen by the methods previously
described are preferred because of their greater probability
of occurrence, and hence information content, as compared to
longer subsequences. However, DNA oligomers this short may
not hybridize reliably and reproducibly to their
complementary subsequences to be effectively used as PCR
primers. Hybridization reliability depends strongly on
several variables, including primer composition and length,
stringency conditions such as annealing temperature and salt
concentration, and cDNA mixture complexity. For the hash
code to be effective for gene calling, it is highly preferred
that subsequence recognition be as specific and reproducible
as possible so that well resolved bands representative only
of the underlying sample sequence are produced. Thus,
instead of directly using single short oligonucleotides
complementary to the selected, target subsequences as
primers, it is preferable to use carefully designed primers.
The RE embodiments of QEA™ have been verified to
produce reproducible signal patterns over a 10^3 range of
input DNA concentrations. The PCR embodiment is less
preferred because the input DNA concentration, as well as the
initial hybridization temperature, must be closely controlled
to yield reproducible results.
The preferred primers are constructed according to
the model in Fig. 5. Primer 501 is constructed of three
components, which, listed 5' to 3', are 504, 503, and 502.
Component 503, described infra, is optional. Component 502
is a sequence which is complementary to the subsequence which
primer 501 is designed to recognize. Component 502 is
typically 4-8 bp long. Component 504 is a 10-20 bp sequence
chosen so the final primer does not hybridize with any native
sequence in the cDNA sample to be analyzed; that is, primer
501 does not anneal with any sequence known to be present in
the sample to be analyzed. The sequence of component 504 is
also chosen so that the final primer has a melting point
above 50°C, and preferably above 68°C. The method for
controlling melting temperature by selecting average primer
composition and primer length is described above.
Use of primer 501 in the PCR embodiment involves a
first annealing step, which allows the 3' end component 502
to anneal to its target subsequence in the presence of end
component 504, which may not hybridize. Preferably, this
annealing step is at a temperature between 36 and 44°C that
is empirically determined to maximize reproducibility of the
resulting signal pattern. The DNA concentration is
approximately 10 ng/50 ml and is similarly determined to
maximize reproducibility. Other PCR conditions are standard
and are described in Sec. 6.6. Once annealed, the 3' end
serves as the primer elongation point for the subsequent
first elongation step. The first elongation step is
preferably at 72°C for 1 minute.
If stringency conditions are such that exact
complementarity is not required for hybridization, false
positive signals can be generated, that is, signals resulting
from inexact recognition of the target subsequence. The
generation of these false positive bands can be accounted for
in the experimental analysis methods in order that DNA sample
sequences can still be recognized, but, perhaps, with some
increased recognition ambiguity that may need resolution.
These bands are accounted for by allowing inexact
hybridization matches of the target subsequence, the degree
of inexactness depending on the stringency of the
hybridization conditions. In this case the signals generated
contain only a fuzzy representation of the actual subsequence
in the sample, the degree of fuzziness being a function of
subsequence length and the stringency condition, that is
binding free energy, and the temperature of the
hybridization. Given the free energy and temperature, the
various possible actual subsequences can be approximately
determined by well known thermodynamic equilibrium
calculations.
Subsequent PCR cycles then use high temperature,
high stringency annealing steps. The high stringency
annealing steps ensure exact hybridization of the entire
primer. No further false positive bands are generated.
Preferably, these PCR cycles alternate between a 65°C
annealing step and 95°C melting step, each for 1 minute.
Optional component 503 can be used to improve the
specificity of the first low stringency annealing step and
thereby minimize false positive bands generated then.
Component 503 can be -(N)j-, where N is any nucleotide and j
is typically between 2 and 4, preferably 2. Use of all
possible components 503 results in a degenerate set of
primers, 16 primers if j=2, which have a 3' end subsequence
effectively j bases longer than the target subsequence.
These longer complementary end sequences have improved
hybridization specificity. Alternately, component 503 can be
-(U)j-, where U is a "universal" nucleotide and j is typically
between 2 and 4, preferably 3 or 4. A universal nucleotide,
such as inosine, is capable of forming base pairs with any
other naturally occurring nucleotide. In this alternative,
a single primer 501 has a 3' end subsequence effectively j
bases longer than the target, and thus also has improved
hybridization specificity.
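
The degenerate primer set described above can be enumerated
mechanically. The following is a minimal sketch only, assuming
a primer laid out 5' to 3' as components 504, 503, and 502;
the function name and the example component sequences are
hypothetical and are not taken from this description.

```python
from itertools import product

BASES = "ACGT"


def degenerate_primers(component_504: str, component_502: str, j: int = 2) -> list[str]:
    """Return every primer 5'-[504][(N)j][502]-3' for all choices of the j N bases."""
    return [
        component_504 + "".join(middle) + component_502
        for middle in product(BASES, repeat=j)
    ]


# Hypothetical components: a 12-base 5' tail (504) and a 4-base
# recognition component (502); neither is specified by the text.
primers = degenerate_primers("GCTAGCTAGGCA", "AATC", j=2)
print(len(primers))  # 4**2 = 16 primers when j = 2
```

With j = 3 or 4 the set grows to 64 or 256 members, which is
one reason a universal nucleotide may be attractive in place
of full degeneracy.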
A less preferred primer design comprises sets of
degenerate oligonucleotides of sufficient length to achieve
specific and reproducible hybridization, where each member of
a set includes a shared subsequence complementary to one
selected, target sequence. For example, if a subsequence to
be recognized is GATT, the set of primers used may be all
sequences of the form NNAATCNN, where N is any nucleotide.
Also sets of degenerate primers permit the recognition of
discontinuous subsequences. For example, GA--TT may be
recognized by all sequences of the form NAANNTCNN.
Alternately, a universal nucleotide can be used in place of
the degenerate nucleotides represented by 'N'.
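
The relation between a target subsequence and its degenerate
primer pattern can be checked with a short sketch. It assumes
that the shared core of each primer is the reverse complement
of the target, which is consistent with the GATT and NNAATCNN
example above; the helper names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the order; N stays N."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def expand(pattern: str) -> list[str]:
    """Expand every N in the pattern to each of A, C, G, and T."""
    choices = ["ACGT" if base == "N" else base for base in pattern]
    return ["".join(combo) for combo in product(*choices)]


core = reverse_complement("GATT")          # shared core of the primer set
primer_set = expand("NN" + core + "NN")    # all sequences of form NN<core>NN
gapped_core = reverse_complement("GANNTT") # discontinuous target GA--TT
```

Here the four degenerate positions give 4**4 = 256 members in
the set for GATT.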
Each primer or primer set used in a single reaction
is preferably distinctively labeled for detection. In the
preferred embodiment using electrophoretic fragment
separation, labeling is by fluorochromes that can be
simultaneously distinguished with optical detection means.
An exemplary experimental protocol is summarized
here, with details presented in Sec. 6.6. Total cellular
mRNA or purified sub-pools of cellular mRNA are used for cDNA
synthesis. First strand cDNA synthesis is performed
according to Sec. 6.3 using, for example, an oligo(dT) primer
or alternatively phasing primers. Alternatively, cDNA
samples can be prepared from any source or be directly
obtained.
Next, using a first strand cDNA sample, the primers
of the selected primer sets are used in a conventional PCR
amplification protocol. A high molar excess of primers is
preferably used to ensure only fragments between primer sites
that are adjacent on a target cDNA sequence or gene are
amplified. With a high molar excess of primers binding to
all available primer binding sites, no amplified fragment
should include internally any primer recognition site. As
many primers can be used in one reaction as can be labeled
for concurrent separation and detection and which generate an
adequately resolved length distribution, as in the RE
embodiments. For example, if fluorochrome labeling is used,
each pair of fluorochromes preferably is distinguishable in
one band and separate pairs preferably are distinguishable in
separate bands. After amplification, the fragments are
separated, re-suspended for gel electrophoresis,
electrophoretically separated, and optically detected.
Thereby the length distribution of fragments having
particular pairs of target subsequences at their ends is
ascertained.
Preferred protocols for the specific PCR
embodiments are described in detail in Sec. 5.6.
5.4. QEA™ ANALYSIS AND DESIGN METHODS
This invention provides two groups of methods for
the Quantitative Expression Analysis embodiment of this
invention: first, methods for QEA™ experimental design; and
second, methods for QEA™ experimental analysis. Although,
logically, design precedes analysis, the methods of
experimental design depend on basic methods described herein
as part of experimental analysis. Consequently, experimental
analysis methods are described first.
In the following, descriptions are often cast in
terms of the preferred QEA™ embodiment, in which REs are used
to recognize target subsequences. However, such description
is not limiting, as all the methods to be described are
equally adaptable to all QEA™ embodiments, including those in
which target subsequences are recognized by nucleic acid, or
nucleic acid mimic, probes which recognize target
subsequences by hybridization.
Further, the following descriptions are directed to
the currently preferred embodiments of these methods.
However, it will be readily apparent to those skilled in the
computer and simulation arts that many other embodiments of
these methods are substantially equivalent to those described
and can be used to achieve substantially the same results.
This invention comprises such alternative implementations as
well as its currently preferred implementation.
5.4.1. QEA™ EXPERIMENTAL ANALYSIS METHODS
The analysis methods comprise: first, selecting a
database of DNA sequences representative of the DNA sample to
be analyzed; second, using this database and a description of
the experiment to derive the pattern of simulated signals,
contained in a database of simulated signals, which will be
produced by DNA fragments generated in the experiment; and
third, for any particular detected signal, using the pattern
or database of simulated signals to predict the sequences in
the original sample likely to cause this signal. Further
analysis methods present an easy to use user interface and
permit determination of the sequences actually causing a
signal in cases where the signal may arise from multiple
sequences, and perform statistical correlations to quickly
determine signals of interest in multiple samples.
The first analysis method is selecting a database
of DNA sequences representative of the sample to be analyzed.
In the preferred use of this invention, the DNA sequences to
be analyzed will be derived from a tissue sample, typically a
human sample examined for diagnostic or research purposes.
In this use, database selection begins with one or more
publicly available databases which comprehensively record all
observed DNA sequences. Such databases are GenBank from the
National Center for Biotechnology Information (Bethesda, MD),
the EMBL Data Library at the European Bioinformatics
Institute (Hinxton Hall, UK) and databases from the National
Center for Genome Research (Santa Fe, NM). However, as any
sample of a plurality of DNA sequences of any provenance can
be analyzed by the methods of this invention, any database
containing entries for the sequences likely to be present in
such a sample to be analyzed is usable in the further steps
of the computer methods.
Fig. 6A illustrates the preferred database
selection method starting from a comprehensive tissue derived
database. Database 1001 is the comprehensive input database,
having the exemplary flat-file or relational structure 1010
shown in Fig. 6B, with one row, or record, 1014 for each
entered DNA sequence. Column, or field, 1011 is the
accession number field, which uniquely identifies each
sequence in database 1001. Most such databases contain
redundant entries, that is, multiple sequence records are
present that are derived from one biological sequence.
Column 1013 is the actual nucleotide sequence of the entry.
The plurality of columns, or fields, represented by 1012
contain other data identifying this entry including, for
example, whether this is a cDNA or gDNA sequence, if cDNA
whether this is a full length coding sequence or a fragment,
the species origin of the sequence or its product, the name
of the gene containing the sequence, if known, etc. Although
shown as one file, DNA sequence databases often exist in
divisions and selection from all relevant divisions is
contemplated by this invention. For example, GenBank has 15
different divisions, of which the EST division and the
separate database, dbEST, that contain expressed sequence
tags ("EST") are of particular interest, since they contain
expressed sequences.
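
The record structure 1010 can be pictured with a small data
type. This is an illustrative sketch only: the class name, the
annotation keys, and the example sequences are assumptions
chosen for the illustration, not the layout of any actual
sequence database.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    accession: str                                   # field 1011, unique per record
    annotations: dict = field(default_factory=dict)  # fields 1012 (type, species, gene, ...)
    sequence: str = ""                               # field 1013, nucleotide sequence


def index_by_accession(records):
    """Build an accession -> record index, rejecting duplicate accession numbers."""
    index = {}
    for record in records:
        if record.accession in index:
            raise ValueError(f"duplicate accession number: {record.accession}")
        index[record.accession] = record
    return index


# Hypothetical records; the sequences are arbitrary placeholders.
records = [
    SequenceRecord("A01", {"type": "cDNA", "species": "human"}, "GAATTCAAGGATCC"),
    SequenceRecord("S003", {"type": "EST", "species": "human"}, "GGATCCTTGAATTC"),
]
by_accession = index_by_accession(records)
```

Note that uniqueness of the accession number does not remove
biological redundancy: two records with different accession
numbers may still derive from one biological sequence.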
From the comprehensive database, all records are
selected which meet criteria for representing particular
experiments on particular tissue types. This is accomplished
by conventional techniques of sequentially scanning all
records in the comprehensive database, selecting those that
match the criteria, and storing the selected records in a
selected database.
The following are exemplary selection methods. To
analyze a genomic DNA sample, database 1001 is scanned
against criteria 1002 for human gDNA to create selected
database 1003. To analyze expressed genes (cDNA sequences),
several selection alternatives are available. First, a
genomic sequence can be scanned in order to predict which
subsequences (exons) will be expressed. Thus selected
database 1005 is created by making selections according to
expression predictions 1004. Second, observed expressed
sequences, such as cDNA sequences, coding domain sequences
("CDS"), and ESTs, can be selected 1006 to create selected
database 1007 of expressed sequences. Additionally,
predicted and observed expressed sequences can be combined
into another, perhaps more comprehensive, selected database
of expressed sequences. Third, expressed sequences
determined by either of the prior methods may be further
selected by any available indication of interest 1008 in the
database records to create more targeted selected database
1009. Without limitation, selected databases can be composed
of sequences that can be selected according to any available
relevant field, indication, or combination present in
sequence databases.
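
The sequential scan described above can be sketched as a
simple filter. The field names ("species", "type") and the
example records are hypothetical; the sketch assumes only that
each record exposes its annotation fields for comparison with
the selection criteria.

```python
def select_records(database, criteria):
    """Scan every record in order and keep those matching all criteria.

    Each criterion maps a field name to a set of acceptable values.
    """
    selected = []
    for record in database:
        if all(record.get(name) in allowed for name, allowed in criteria.items()):
            selected.append(record)
    return selected


# Hypothetical comprehensive database (stands in for database 1001).
database_1001 = [
    {"accession": "A01", "species": "human", "type": "cDNA"},
    {"accession": "G17", "species": "human", "type": "gDNA"},
    {"accession": "S003", "species": "human", "type": "EST"},
    {"accession": "M42", "species": "mouse", "type": "cDNA"},
]

# Criteria in the spirit of 1002 (human gDNA) and 1006 (observed expressed sequences).
human_gdna = select_records(database_1001, {"species": {"human"}, "type": {"gDNA"}})
human_expressed = select_records(
    database_1001, {"species": {"human"}, "type": {"cDNA", "CDS", "EST"}}
)
```

Expressing each criterion as a set of allowed values lets one
scan cover combined selections, such as observed cDNA, CDS,
and EST records together.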
The second analysis method uses the previously
selected database of sequences likely to be present in a
sample and a description of an intended experiment to derive
a pattern of the signals which will be produced by DNA
fragments generated in the experiment. This pattern can be
stored in a computer implementation in any convenient manner.
In the following, without limitation, it is described as
being stored as a table of information. This table may be
stored as individual records or by using a database system,
such as any conventionally available relational database.
Alternatively, the pattern may simply be stored as the image
of the in-memory structures which represent the pattern.
A QEA™ experiment comprises several independent
recognition reactions applied to the DNA sample sequences,
where in each of the reactions labeled DNA fragments are
produced from sample sequences, the fragments lying between
certain target subsequences in a sample sequence. The target
subsequences can be recognized and the fragments generated by
the preferred RE embodiments of QEA™ methods or by the PCR
embodiment of QEA™. The following description is focused on
the RE embodiments.
Fig. 7 illustrates an exemplary description 1100 of
a preferred QEA™ embodiment. Field 1101 contains a
description of the tissue sample which is the source of the
DNA sample. For example, one experiment could analyze a
normal prostate sample; a second otherwise identical
experiment could analyze a prostate sample with premalignant
changes; and a third experiment could analyze a cancerous
prostate sample. Differences in gene expression between
these samples then relate to the progress of the cancer
disease state. Such samples could be drawn from any other
human cancer or malignancy.
Major rows 1102, 1105, and 1109 describe the
separate individual recognition reactions to which the DNA
from tissue sample 1101 is subjected. Any number of
reactions may be assembled into an experiment, from as few as
one to as many as there are pairs of available recognition
means to recognize subsequences. Fig. 7 illustrates 15
reactions. For example, reaction 1 specified by major row
1102 generates fragments between target subsequences which
are the recognition sites of restriction endonucleases 1 and
2 described in minor rows 1103 and 1104. Further, the RE1
cut end is recognized by a labeling moiety labeled with
LABEL1, and the RE2 end is recognized by LABEL2. Similarly,
reaction 15, 1109, utilizes restriction endonucleases 36 and
37 labeled with labels 3 and 4, minor rows 1110 and 1111,
respectively.
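
An experimental definition of this kind might be held in
memory as nested records: a tissue field, then one entry per
reaction (major row), each listing its recognition means and
labels (minor rows). The sketch below is one possible layout
under that assumption. The class names are hypothetical, and
the recognition sites GAATTC and GGATCC are assigned to RE1 and
RE2 purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recognition:      # one minor row, e.g. 1103 or 1104
    name: str           # recognition means, e.g. "RE1"
    site: str           # target subsequence recognized
    label: str          # label on the corresponding adapter


@dataclass(frozen=True)
class Reaction:         # one major row, e.g. 1102
    number: int
    recognitions: tuple


@dataclass(frozen=True)
class Experiment:       # description 1100
    tissue: str         # field 1101
    reactions: tuple

    def labels(self, reaction_number, ends):
        """Return the set of labels expected for a fragment with the given ends."""
        reaction = next(r for r in self.reactions if r.number == reaction_number)
        label_of = {rec.name: rec.label for rec in reaction.recognitions}
        return {label_of[end] for end in ends}


experiment = Experiment(
    tissue="normal prostate",
    reactions=(
        Reaction(1, (Recognition("RE1", "GAATTC", "LABEL1"),
                     Recognition("RE2", "GGATCC", "LABEL2"))),
    ),
)
expected = experiment.labels(1, ("RE1", "RE2"))
```

Keeping the label with each recognition means makes it direct
to translate detected label combinations back into end
subsequences during lookup.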
Major row 1105 describes a variant QEA™ reaction
using three REs and a separate probe. As described, many REs
can be used in a single recognition reaction as long as a
useful fragment distribution results. Too many REs results
in a compressed length distribution. Further, probes for
target subsequences that are not intended to be labeled
fragment ends, but rather occur within a fragment, can be
used. For example, a labeled probe added after the QEA™ PCR
amplification step (if present in a given embodiment), a post
PCR probe, can recognize subsequences internal to a fragment
and thereby provide an additional signal which can be used to
discriminate between two sample sequences which produce
fragments of the same length and end sequence but which
otherwise have differing internal sequences. For another
example, a probe added before the QEA™ PCR step and which
cannot be extended by DNA polymerase will prevent PCR
amplification of those fragments containing the probe's
target subsequences. If PCR amplification is necessary to
generate detectable signals (in a given embodiment), such a
probe will prevent the detection of such a fragment. The
absence of a fragment may make a previously ambiguous
detected band now unambiguous. Such PCR disruption probes can
be PNA oligomers or degenerate sets of DNA oligomers,
modified to prevent polymerase extension (e.g., by
incorporation of a dideoxynucleotide at the 3' end).
Where alternative phasing PCR primers are used,
their extra recognition subsequences and labeling are
described in rows dependent on the RE/ligase reaction whose
products they are used to amplify.
Next, Fig. 8A illustrates, in general, that from the
database selected to best represent the likely DNA sequences
in the sample analyzed, 1201, and the description of the QEA™
experiment, 1202, the simulation methods, 1203, determine a
pattern of simulated signals stored in a simulated database,
1204, that represents the results of QEA™ experiments. The
experimental simulation generates the same fragment lengths
and end subsequences from the input database that will be
generated in an actual experiment performed on the same
sample of DNA sequences.
Alternately, the simulated pattern or database may
not be needed, in which case the DNA database is searched
sequence by sequence, mock digestions are performed and
compared against the input signals. A simulated database is
preferable if several signals need to be searched or if the
same QEA™ experiment is run several times. Conversely, the
simulated database can be dispensed with when few signals
from a few experiments need to be searched. A quantitative
statement of when the simulated database is more efficient
depends upon an analysis of the costs of the various
operations and the size of the DNA database, and can be
performed as is well known in the computer arts. Without
limitation, in the following the simulated database is
described.
Fig. 8B illustrates an exemplary structure for the
simulated database. Here, the simulated results of all the
individual recognition reactions defined for the experiment
are gathered into rectangular table 1210. The invention is
equally adaptable to other database structures containing
equivalent information; such an equivalent structure would be
one, for example, where each reaction was placed in a
separate table. The rows of table 1210 are indexed by the
lengths of possible fragments. For example, row 1211
contains fragments of length 52. The columns of table 1210
are indexed by the possible end subsequences and probe hits,
if any, in a particular experimental reaction. For example,
columns 1212, 1213, and 1214 contain all fragments generated
in reaction 1, R1, which have both end subsequences
recognized by RE1, one end subsequence recognized by RE1 and
the other by RE2, and both end subsequences recognized by
RE2, respectively. Other columns relate to other reactions
of the experiment. Finally, the entries in table 1210
contain lists of the accession numbers of sequences in the
database that give rise to a fragment with particular length
and end subsequences. For example, entry 1215 indicates that
only accession number A01 generates a fragment of length 52
with both end subsequences recognized by RE1 in R1.
Similarly, entry 1216 indicates that accession numbers A01
and S003 generate a fragment of length 151 with both end
subsequences recognized by RE3 in reaction 2.
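
Table 1210 can be represented compactly as a mapping from
(length, reaction, end pair) to a list of accession numbers.
The sketch below is one such representation, not the only one;
the helper names are hypothetical, and it assumes the order of
the two ends is irrelevant, so each end pair is stored sorted.

```python
from collections import defaultdict


def ends_key(end_a, end_b):
    """Order-independent key for a pair of end subsequences."""
    return tuple(sorted((end_a, end_b)))


def add_fragment(table, length, reaction, end_a, end_b, accession):
    """Record that `accession` yields a fragment of `length` in `reaction`."""
    table[(length, reaction, ends_key(end_a, end_b))].append(accession)


# (length, reaction number, sorted end pair) -> accession numbers
simulated = defaultdict(list)

add_fragment(simulated, 52, 1, "RE1", "RE1", "A01")    # entry 1215
add_fragment(simulated, 151, 2, "RE3", "RE3", "A01")   # entry 1216
add_fragment(simulated, 151, 2, "RE3", "RE3", "S003")  # entry 1216

row_52 = {key: acc for key, acc in simulated.items() if key[0] == 52}
```

A sparse mapping of this kind stores only the cells of the
rectangular table that are actually occupied.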
In alternative embodiments, the contents of the
table can be supplemented with various information. In one
aspect, this information can aid in the interpretation of
results produced by the separation and detection means used.
For example, if separation is by electrophoresis, then the
detected electrophoretic DNA length can be corrected to
obtain the true physical DNA length. Such corrections are
well known in the electrophoretic arts and depend on such
factors as average base composition and fluorochrome labels.
One commercially available package for making these
corrections is Gene Scan Software from Applied Biosystems,
Inc. (Foster City, CA). In this case, each table entry for a
fragment can contain additionally average base composition,
perhaps expressed as percent G+C content, and the
experimental definition can include primer average base
composition and fluorochrome label used. For a further
example, if separation is by mass spectroscopy or similar
method, the additional information can be the molecular
weight of each fragment and perhaps a typical fragmentation
pattern. Use of other separation and detection means can
suggest the use of other appropriate supplemental data.
Where alternative phasing primers, the SEQ-QEA™
embodiment, or other means generating effective target
subsequences are used, supplemental columns are used with the
RE pair in order to further identify such effective target
subsequences.
Before describing how this simulated database is
generated, it is useful first to describe how this database
is used to predict experimental results. Returning to Fig.
7, labels are used to detect binding reaction events by
subsequence recognition means to the target DNA, to allow
detection after separation of the fragments by length. In an
embodiment using fluorescent detection means, these labels
are fluorochromes covalently attached to the primer strands
of the adapters, as previously described, or to hybridization
probes, if any. Typically, all the fluorochrome labels used
in one reaction are simultaneously distinguishable so that
fragments with all possible combinations of target
subsequences can be fluorescently distinguished. For
example, fragments at entry 1217 in table 1210 (Fig. 8B)
occur at length 175 and present simultaneous fluorescent
signals LABEL1 and LABEL2 upon stimulation, since these are
the labels used with adapters which recognize ends cut by
RE1 and RE2 respectively. For a further example, in reaction
2, major row 1105 of experimental definition 1100 (Fig. 7), a
fragment with ends cut by RE2 and RE3 and hybridizing with
probe P will present simultaneous signals LABEL2, LABEL3, and
LABEL~. Where effective target subsequences are constructed,
e.g. by SEQ-QEA™ or alternative phasing primers, this lookup
is appropriately modified.
Other labelings are within the scope of this
invention. For example, a certain group of target
subsequences can be identically labeled or not labeled at
all, in which case the corresponding group of fragments are
not distinguishable. In this case, if RE1 and RE3 end
subsequences were identically labeled in table 1210 (Fig.
8B), a fragment of length 151 may be generated by sequence
T;6~, A01, or S003, or any combination of these sequences.
In the extreme, if silver (Ag) staining of an electrophoresis
gel is used in an embodiment to detect separated fragments,
then all bands will be identically labeled and only band
lengths can be distinguished within one electrophoresis lane.
Thus the simulated database together with the
experimental definition can be used to predict experimental
results. If a signal is detected in a recognition reaction,
say Rn, whose end labelings are LABEL1 and LABEL2 and whose
representation of length is corrected to physical length in
base pairs of L, the length L row of the simulated database
is retrieved and it is scanned for Rn entries with the
detected subsequence labeling, by using the column headings
indicating observed subsequences and the experimental
definition indicating how each subsequence is labeled. If no
match is found, this fragment represents a new gene or
sequence not present in the selected database. If a match is
found, then this fragment, in addition to possibly being a
new gene or sequence, can also have been generated by those
candidate sequences present in the table entry(ies) found.
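
The lookup just described can be sketched as follows, assuming
the simulated database is held as a mapping keyed by (length,
reaction, end pair) and that a separate mapping gives the label
used for each end subsequence. The function name, both
structures, and the accession "H001" are hypothetical; the text
does not state which accession numbers occupy entry 1217.

```python
def lookup(table, label_of, reaction, length, observed_labels):
    """Return candidate accession numbers for a detected signal.

    Scans the row for `length`, keeps entries of `reaction` whose end
    subsequences carry exactly the observed set of labels.
    """
    observed = set(observed_labels)
    hits = []
    for (row_length, rxn, ends), accessions in table.items():
        if row_length == length and rxn == reaction:
            if {label_of[end] for end in ends} == observed:
                hits.extend(accessions)
    return hits


label_of = {"RE1": "LABEL1", "RE2": "LABEL2", "RE3": "LABEL3"}
table = {
    (52, 1, ("RE1", "RE1")): ["A01"],
    (175, 1, ("RE1", "RE2")): ["H001"],          # hypothetical accession
    (151, 2, ("RE3", "RE3")): ["A01", "S003"],
}

candidates = lookup(table, label_of, 1, 175, ["LABEL1", "LABEL2"])
unmatched = lookup(table, label_of, 1, 90, ["LABEL1", "LABEL2"])
```

An empty result corresponds to the case in which the detected
fragment may represent a sequence absent from the selected
database.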
The simulated database lookup is described herein
as using the physical length of a detected fragment. In
cases where the separation and detection means returns an
approximation to the true physical fragment length, lookup is
augmented to account for such an approximation. For example,
electrophoresis, when used as the separation means, returns
the electrophoretic length, which depending on average base
composition and labeling moiety is typically within 10% of
the physical length. In this case database lookup can search
all relevant entries whose physical length is within 10% of
the reported electrophoretic length, perform corrections to
obtain electrophoretic length, and then check for a match
with the detected signal. Alternative lookup implementations
are apparent, one being to precompute the electrophoretic
length for all predicted fragments, construct an alternate
table index over the electrophoretic length, and then
directly look up the electrophoretic length. Other separation
and detection means can require corresponding augmentations
to lookup to correct for their particular experimental biases
and inaccuracies. It is understood that where database
lookup is referred to subsequently, either simple physical
lookup or augmented lookup is meant as appropriate.
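The augmented lookup described above can be sketched as follows. This is a minimal illustration only: the table layout (a mapping from length and end labeling to accession numbers), the function names build_index and lookup, and the sample entries are assumptions made for the example, and the correction from physical to electrophoretic length is omitted.

```python
import math
from collections import defaultdict


def build_index(entries):
    """Index hypothetical (accession, length, labeling) rows by (length, labeling)."""
    index = defaultdict(list)
    for accession, length, labeling in entries:
        index[(length, labeling)].append(accession)
    return index


def lookup(index, observed_length, labeling, tolerance=0.10):
    """Return (accession, physical_length) candidates for a detected signal.

    Every physical length within `tolerance` of the observed (for example
    electrophoretic) length is searched. An empty result suggests a fragment
    that is not predicted from the selected database.
    """
    low = math.ceil(observed_length * (1 - tolerance))
    high = math.floor(observed_length * (1 + tolerance))
    hits = []
    for length in range(low, high + 1):
        for accession in index.get((length, labeling), []):
            hits.append((accession, length))
    return hits


# Hypothetical entries, not taken from the figures.
entries = [
    ("A01", 52, ("RE1", "RE1")),
    ("S003", 151, ("RE2", "RE2")),
    ("Q012", 175, ("RE1", "RE2")),
]
index = build_index(entries)
candidates = lookup(index, 160, ("RE2", "RE2"))
```

Setting tolerance to 0 reduces this to the simple physical lookup.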
If matched candidate database sequences are found,
then the selected database can be consulted to determine
other information concerning these sequences, for example,
gene name, tissue origin, chromosomal location, etc. If an
unpredicted fragment is found, this fragment can be
optionally retrieved from the length separation means, cloned
or sequenced, and used to search for homologues in a DNA
sequence database or to isolate or characterize the
previously unknown gene or sequence. In this manner this
invention can be used to rapidly discover and identify new
genes.
The computer methods of this invention are also
adaptable to other formats of an experimental definition.
For example, the labeling of the target subsequence
recognition moieties can be stored in a table separate from
the table defining the experimental reactions.
Now turning to the methods by which the simulated
database is generated, Fig. 9 illustrates a basic method,
termed herein mock fragmentation, which takes one sequence
and the definition of one reaction of an experiment and
produces the predicted results of the reaction on that
sequence. Generation of the entire simulated database
requires repetitive execution of this basic method.
Turning first to a description of mock
fragmentation, the method commences at 1301 and at 1302 it
inputs the sequence to be fragmented and the definition of
the fragmentation reaction, in the following terms: the
target end subsequences RE1 ... REn, where n is typically 2
or 3, and the subsequences to be recognized by post PCR
probes, P1 ... Pn, where n is typically 0 or 1. Note that
PCR disruption probes act as unlabeled end subsequences and
are so treated for input to this method. The operation of
the method is illustrated by example in Figs. 10A-F for the
case RE1, RE2 and P1.
At step 1303, for each target end subsequence, the
method makes a "vector of ends", which has elements which are
pairs of nucleotide positions along the sequence, each pair
being labeled by the corresponding end subsequence. For
embodiments where end subsequences are recognized by
hybridizing oligonucleotides, the first member of each pair
is the beginning of a target end subsequence and the second
member is the end of a target end subsequence. For
embodiments where target end subsequences are recognized by
restriction endonucleases, the first member of each pair is
the beginning of the overhang region that corresponds to the
RE recognition subsequence and the second member is the end
of that overhang region. It is preferred to use REs that
generate 4 bp overhangs. The actual target end subsequences
are the RE recognition sequences, which are preferably 4-8 bp
long.
This vector is generated by a string operation
which compares the target end subsequence in a 5' to 3'
direction against the input sequence and seeks string
matches, that is the nucleotides match exactly. Where
effective target subsequences are formed by using, e.g., SEQ-QEA™
or alternative phasing primers, it is the effective
subsequences that are compared. This can be done by simply
comparing the end subsequence against the input sequence
starting at one end and proceeding along the sequence one
base at a time. However, it is preferable to use a more
efficient string matching algorithm, such as the
Knuth-Morris-Pratt or the Boyer-Moore algorithms. These are
described with sample code in Sedgewick, 1990, Algorithms in
C, chap. 19, Addison-Wesley, Reading, MA.
In QEA™ embodiments where target subsequences are
recognized with accuracy, such as the RE embodiments, the
comparison of target subsequence against input sequence
should be exact, that is the bases should match in a
one-to-one manner. In embodiments where target subsequences are
less accurately recognized, the string match should be done
in a less exact, or fuzzy, manner. For example, in the PCR
embodiments, a target subsequence of length T can
inaccurately recognize an input sequence, also of length T,
by matching only T-n bases exactly, where n is typically 1 or
2 and is adjustable depending on experimental conditions. In
this case the string operation, which generates the vector of
ends, should accept partial T-n matches as well as exact
matches. In this way, the string operations generate the false
positive matches expected from the experiments and permit
these fragments to be identified. Ambiguity in the simulated
database, however, increases, since more fragments lead to a
greater chance of fragments of identical length and end
labels.
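A minimal sketch of the string operation that builds a vector of ends is given below, covering both the exact and the fuzzy (T-n) case. It uses the simple base-by-base scan rather than Knuth-Morris-Pratt or Boyer-Moore. The function name find_ends, the 0-based half-open positions, and the max_mismatches parameter (playing the role of n) are assumptions made for this illustration.

```python
def find_ends(sequence, target, label, max_mismatches=0):
    """Return (start, end, label) for each window of `sequence` that matches
    `target` with at most `max_mismatches` mismatched bases.

    Positions are 0-based and half-open. With max_mismatches=0 the match is
    exact, as suggested for RE embodiments; a value of 1 or 2 mimics the
    less exact recognition described for PCR embodiments.
    """
    ends = []
    t = len(target)
    for start in range(len(sequence) - t + 1):
        window = sequence[start:start + t]
        mismatches = sum(1 for a, b in zip(window, target) if a != b)
        if mismatches <= max_mismatches:
            ends.append((start, start + t, label))
    return ends


exact = find_ends("AAGATCTTGAATTCAGATCA", "GATC", "RE1")
fuzzy = find_ends("TTGAACTT", "GATC", "P", max_mismatches=1)
```

Raising max_mismatches yields more ends and therefore more predicted fragments, which is the source of the added ambiguity noted above.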
Fig. 10A illustrates end vectors 1401 and 1402,
comprising three and two ends, respectively, generated by RE1
and RE2, which are for this example assumed to be REs with a
4 bp overhang. The first overhang in vector 1401 occurs
between nucleotides 10 and 14 in the input sequence.
Step 1304 of Fig. 9 merges all the end vectors for
all the end subsequences and sorts the elements on the
position of the end. Vector 1404 of Fig. 10B illustrates the
result of this step for example end vectors 1401 and 1402.
Step 1305 of Fig. 9 then creates the fragments
generated by the reaction by selecting the parts of the full
input sequence that are delimited by adjacent ends in the
merged and sorted end vector. Since the experimental
conditions in conducting QEA™ should be selected such that
target end subsequence recognition is allowed to go to
completion, all possible ends are recognized. For the
restriction endonuclease embodiments, the cutting and ligase
reactions should be conducted such that all possible RE cuts
are made and to each cut end a labeled primer is ligated.
These conditions ensure that no fragments contain internal
unrecognized target end subsequences and that only adjacent
ends in the merged and sorted vector define generated
fragments.
Where additional information is needed for
simulated database entries to adapt to inaccuracies in
particular separation and detection means, such information
can be collected at this step. For example, in the case of
electrophoretic separation, fragment sequence can be
determined and percent G+C content computed and entered in
the database along with the fragment accession number.
For the PCR embodiments, the fragment length is the
difference between the end position of the second end
subsequence and the start position of the first end
subsequence. For RE embodiments, the fragment length is the
difference between the start position of the second end
subsequence and the start position of the first end
subsequence plus twice the primer length (48 in the preferred
primer embodiment).
Fig. 10C illustrates the exemplary fragments
generated, each fragment being represented by a 4 member
tuple comprising: the two end subsequences, the length, and
an indicator whether the probe binds to this fragment. In
Fig. 10C the position of this indicator is indicated by a
'~'. Fragment 1408 is defined by ends 1405 and 1406, and
fragment 1409 by ends 1406 and 1407. There is no fragment
defined by ends 1405 and 1407 because the intermediate end
subsequence is recognized and either fully cut in an RE
embodiment or used as a fragment end priming position in a
PCR embodiment. For simplicity, the fragment lengths are
illustrated for the RE embodiment without the primer length
addition.
Step 1306 of Fig. 9 checks if a hybridization probe
is involved in the experiment. If not, the method skips to
step 1309. If so, step 1307 determines the sequence of the
fragment defined in step 1305. Fig. 10D illustrates that the
fragment sequences for this example are the nucleotide
sequences within the input sequence that are between the
indicated nucleotide positions. For example, the first
fragment sequence is the part of the input sequence between
positions 10 and 62. Step 1308 then checks each probe
subsequence against each fragment sequence to determine
whether there is any match (i.e., whether the probe has a
sequence sufficiently complementary to the fragment sequence
for it to hybridize thereon). If a match is
found, an indication is made in the fragment 4 member tuple.
This match is done by string searching in a similar manner to
that described for generation of the end vectors.
Next at step 1309 of Fig. 9, all the fragments are
sorted on length and assembled into a vector of sorted
fragments, which is output from the mock fragmentation method
at step 1310. This vector contains the complete list of all
fragments, with probe information, defined by their end
subsequences and lengths that the input reaction will
generate from the input sequence.
Fig. 10E illustrates the fragment vector of the
example sorted according to length. For illustrative
purposes, probe P1 was found to hybridize only to the third
fragment 1412, where a 'Y' is marked. 'N' is marked in all
the other fragments, indicating no probe binding.
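Steps 1303 through 1309 can be sketched for the RE embodiment as follows. This is an illustration under stated assumptions, not the implementation of Fig. 9: the names find_exact and mock_fragmentation are hypothetical, positions are 0-based starts of the recognition subsequence, matching is exact, and the probe check is a literal substring test standing in for a complementarity test.

```python
def find_exact(sequence, target):
    """Start positions of every exact occurrence of `target` (overlaps allowed)."""
    positions = []
    start = sequence.find(target)
    while start != -1:
        positions.append(start)
        start = sequence.find(target, start + 1)
    return positions


def mock_fragmentation(sequence, end_subsequences, probes=(), primer_length=0):
    """Return 4-member tuples (left end, right end, length, probe flag),
    sorted on length.

    `end_subsequences` maps a label such as "RE1" to its recognition sequence.
    """
    # Step 1303: build one vector of ends per target end subsequence.
    ends = []
    for label, target in end_subsequences.items():
        ends.extend((pos, label) for pos in find_exact(sequence, target))
    # Step 1304: merge the vectors and sort on position.
    ends.sort()
    fragments = []
    # Step 1305: only adjacent ends delimit fragments.
    for (start, left), (stop, right) in zip(ends, ends[1:]):
        length = stop - start + 2 * primer_length
        # Steps 1306-1308: flag fragments that contain a probe subsequence.
        body = sequence[start:stop]
        flag = "Y" if any(probe in body for probe in probes) else "N"
        fragments.append((left, right, length, flag))
    # Step 1309: sort the fragments on length.
    fragments.sort(key=lambda fragment: fragment[2])
    return fragments


sequence = "TTGATCAAAAGAATTCAACCCAAGATCTT"
reaction = {"RE1": "GATC", "RE2": "GAATTC"}
fragments = mock_fragmentation(sequence, reaction, probes=("CCC",))
```

Passing primer_length=24 adds the 48 bases of the preferred primer embodiment to every length.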
The simulated database is generated by iteratively
applying the basic mock fragmentation method for each
sequence in the selected database and each reaction in the
experimental definition. Fig. 11 illustrates a simulated
database generation method. The method starts at 1501 and at
1502 inputs the selected representative database and the
experimental definition with, in particular, the list of
reactions and their related subsequences. Step 1503
initializes the digest database table so that lists of
accession numbers may be inserted for all possible
combinations of fragment length and target end subsequences.
Step 1504, a DO loop, causes the iterative execution of steps
1505, 1506, and 1507 for all sequences in the input selected
database.
Step 1505 takes the next sequence in the database,
as selected by the enclosing DO loop, and the next reaction
of the experiment and performs the mock fragmentation method
of Fig. 9 on these inputs. Step 1506 adds the sorted
fragment vector to the simulated database by taking each
fragment from the vector and adding the sequence accession
number to the list in the database entry indexed by the
fragment length and end subsequences and probe (if any).
Fig. 10F represents the simulated database entry list
additions that would result for the example mock
fragmentation reaction of Figs. 10A-E. For example,
accession number A01 is added to the accession number list in
the entry 1412 at length 151 and with both end subsequences
RE2.
Finally, step 1507 tests whether there is another
reaction in the input experiment that should be simulated
against this sequence. If so, step 1505 is repeated with
this reaction. If not, the DO loop is repeated to select
another database sequence. If all the database sequences
have been selected, step 1508 outputs the simulated
database and the method ends at 1509.
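The generation loop of Fig. 11 can be sketched as below. The helper digest is a compact stand-in for mock fragmentation (exact matching, RE-style lengths, no probes or primers), and the key layout of the table, with the two end labels stored as a sorted pair, is an assumption of this example rather than the layout of Fig. 10F.

```python
from collections import defaultdict


def digest(sequence, end_subsequences):
    """Simplified mock fragmentation: list of (left, right, length)."""
    ends = sorted(
        (pos, label)
        for label, target in end_subsequences.items()
        for pos in range(len(sequence) - len(target) + 1)
        if sequence.startswith(target, pos)
    )
    return [(l, r, b - a) for (a, l), (b, r) in zip(ends, ends[1:])]


def build_simulated_database(selected_database, experiment):
    """Map (reaction, length, end labels) to the accession numbers that
    are predicted to produce that signal."""
    table = defaultdict(list)                                  # step 1503
    for accession, sequence in selected_database.items():      # step 1504
        for reaction, ends in experiment.items():              # steps 1505, 1507
            for left, right, length in digest(sequence, ends):
                key = (reaction, length, tuple(sorted((left, right))))
                table[key].append(accession)                   # step 1506
    return dict(table)                                         # step 1508


# Hypothetical toy database and one-reaction experiment.
selected_database = {
    "A01": "GATCAAAAGATC",
    "B02": "GATCTTTTGATC",
    "C03": "GATCAAGAATTC",
}
experiment = {"R1": {"RE1": "GATC", "RE2": "GAATTC"}}
simulated = build_simulated_database(selected_database, experiment)
```

In this toy case A01 and B02 share one entry, so that signal would be ambiguous, while C03 has an entry to itself.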
5.4.2. QEA™ EXPERIMENTAL DESIGN METHODS
The goal of the experimental design methods is to
optimize each experiment in order to obtain the maximum
amount of quantitative information. An experiment is defined
by its component recognition reactions, which are in turn
defined by the target end subsequences recognized, probes
used, if any, and labels assigned. If alternative phasing
primers, SEQ-QEA™, or other similar means are used, effective
target subsequences are used. Any of several criteria can be
used to ascertain the amount of information obtained, and any
of several algorithms can be used to perform the reaction
optimization.
A preferred criterion for ascertaining the amount of
information uses the concept of "good sequence." A good
sequence for an experiment is a sequence for which there is
at least one reaction in the experiment that produces a
unique signal from that sequence, that is, a fragment is
produced from that good sequence, by at least one recognition
reaction, that has a unique combination of length and
labeling. For example, returning to Fig. 8B, the sequence
with accession number A01 is a good sequence because reaction
1 produces signal 1215, with length 52 and with both target
end subsequences recognized by RE1, uniquely from sequence
A01. However, sequence S003 is not a good sequence because
there are no unique signals produced only from S003: reaction
R2 produces signal 1216 from both A01 and S003 and signal
1219 from both Q012 and S003. Using the number of good
sequences as an information measure, the greater the number
of good sequences in an experiment the better is the
experimental design. Ideally, all possible sequences in a
sample would be good sequences.
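Counting good sequences from a simulated database can be sketched as follows. The table below is hypothetical: its keys are arbitrary signal names and its contents only loosely echo the A01, S003 and Q012 example, since the figure itself is not reproduced here. The function name good_sequences is likewise an assumption.

```python
def good_sequences(simulated_database):
    """Return the accession numbers that are the sole contributor to at
    least one signal (entry) of the simulated database."""
    good = set()
    for accessions in simulated_database.values():
        contributors = set(accessions)
        if len(contributors) == 1:
            good |= contributors
    return good


# Hypothetical signals mapped to the sequences predicted to produce them.
table = {
    "signal_a": ["A01"],
    "signal_b": ["A01", "S003"],
    "signal_c": ["Q012", "S003"],
    "signal_d": ["Q012"],
    "signal_e": ["A01", "Q012"],
}
good = good_sequences(table)
information_measure = len(good)
```

Here S003 is not good because every signal it contributes to is shared with another sequence.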
Further, a quantitative measure of the expression
of a good sequence can simply be determined from the detected
signal intensity of the fragment uniquely produced from the
good sequence. Relative quantitative measures of the
expression of different good sequences can be obtained by
comparing the relative intensities of the signal uniquely
produced from the good sequences. An absolute quantitative
measure of the expression of a good sequence can be obtained
by including a concentration standard in the original sample.
Such a standard for a particular experiment can consist of
several different good sequences known not to occur in the
original sample and which are introduced at known
concentrations. For example, exogenous good sequence 1 is
added at a 1:10^3 concentration in molar terms; exogenous good
sequence 2 at a 1:10^4 in molar terms; etc. Then comparison of
the relative intensity of the unique signal of a good
sequence in the sample with the intensities of the unique
signal of the standards allows determination of the molar
concentrations of the sample sequence. For example, if the
good sequence has a unique signal intensity halfway between
the unique signal intensities of good sequences 1 and 2, then
it is present at a concentration halfway between the
concentrations of good sequences 1 and 2.
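The "halfway between" example corresponds to linear interpolation between the two bracketing standards, which can be sketched as below. The function name, the intensity values, and the assumption that signal intensity varies linearly with concentration between neighboring standards are all illustrative.

```python
def estimate_concentration(intensity, standards):
    """Linearly interpolate a molar concentration from `standards`, a list
    of (unique signal intensity, known concentration) pairs with distinct
    intensities. Raises ValueError outside the range of the standards."""
    points = sorted(standards)
    for (i_low, c_low), (i_high, c_high) in zip(points, points[1:]):
        if i_low <= intensity <= i_high:
            fraction = (intensity - i_low) / (i_high - i_low)
            return c_low + fraction * (c_high - c_low)
    raise ValueError("intensity lies outside the range of the standards")


# Hypothetical intensities for standards added at 1:10^4 and 1:10^3.
standards = [(100.0, 1e-4), (1000.0, 1e-3)]
estimate = estimate_concentration(550.0, standards)
```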
Another preferred measure for ascertaining the
amount of information produced by an experiment is derived by
limiting attention to a particular set of sequences of
interest, for example a set of known oncogenes or a set of
receptors known or expected to be present in a particular
tissue sample. An experiment is designed according to this
measure to maximize the number of sequences of interest that
are good sequences. Whether other sequences possibly present
in the sample are good sequences is not considered. These
other sequences are of interest only to the extent that the
sequences of interest produce uniquely labeled fragments
without any contribution from these other sequences.
This invention is adaptable to other measures for
ascertaining information from an experiment. For example,
another measure is to minimize on average the number of
sequences contributing to each detected signal. A further
measure is, for example, to minimize for each possible
sequence the number of other sequences that occur in common
in the same signals. In that case each sequence is linked by
common occurrences in fragment labelings to a minimum number
of other sequences. This can simplify making unambiguous
signal peaks of interest (see infra).
Having chosen an information measure, for example
the number of good sequences, for an experiment, the
optimization methods choose target subsequences, and possibly
probes, which optimize the chosen measure. One possible
optimization method is exhaustive search, in which all
subsequences in lengths less than approximately 10 are tested
in all combinations for that combination which is optimum.
This method requires considerable computing power, and the
upper bound is determined by the computational facilities
available and the average probability of occurrence of
subsequences of a given length. With adequate resources, it
is preferable to search all sequences down to a probability
of occurrence of about 0.005 to 0.01. Upper bounds may range
from 8 to 11 or 12.
A preferred optimization method is known as
simulated annealing. See Press et al., 1986, Numerical
Recipes: The Art of Scientific Computing, Sec. 10.9,
Cambridge University Press, Cambridge, U.K. Simulated
annealing attempts to find the minimum of an "energy"
function of the "state" of a system by generating small
changes in the state and accepting such changes according to
a probabilistic factor to create a "better" new state. While
the method progresses, a simulated "temperature", on which
the probabilistic factor depends and which limits acceptance
of new states of higher energy, is slowly lowered.
In the application to the methods of this
invention, a "state", denoted by S, is the experimental
definition, that is the target end subsequences and
hybridization probes, if any, in each recognition reaction of
the experiment. The "energy", denoted E, is taken to be 1.0
divided by the information measure, so that when the energy
is minimized, the information is maximized. Alternatively,
the energy can be any monotonically decreasing function of
the information measure. The computation of the energy is
denoted by applying the function E( ) to a state.
The preferred method of generating a new
experiment, or state, from an existing experiment, or state,
is to make the following changes, also called moves, to the
experimental definition: (1) randomly change a target end
subsequence in a randomly chosen recognition reaction; (2)
add a randomly chosen target end subsequence to a randomly
chosen reaction; (3) remove a randomly chosen target end
subsequence from a randomly chosen reaction with three or
more target subsequences; (4) add a new reaction with two
randomly chosen target end subsequences; and (5) remove a
randomly chosen reaction. If an RE embodiment of QEA™ is
being designed, all target end subsequences are limited to
available RE recognition sequences. If alternative phasing
primers, SEQ-QEA™, or other means are used to generate
effective target subsequences, all subsequences must be
chosen from among such effective target subsequences that can
be generated from available REs. To generate a new
experimental definition, one of these moves is randomly
selected and carried out on the existing experimental
definition. Alternatively, the various moves can be
unequally weighted. In particular, if the number of
reactions is to be fixed, moves (4) and (5) are skipped. The
invention is further adaptable to other moves for generating
new experiments. Preferable generation methods will generate
all possible experiments.
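The five moves can be sketched as follows, assuming an experiment is represented as a list of reactions and each reaction as a list of target end subsequences drawn from a list of available recognition sequences. The function name propose_move, the equal weighting, and the guard that never removes the last remaining reaction are assumptions of this example.

```python
import random


def propose_move(experiment, available, rng, fixed_reactions=False):
    """Return a new experiment produced by one randomly chosen move.

    The input experiment is left unmodified. When `fixed_reactions` is
    true, moves (4) and (5) are skipped.
    """
    new = [list(reaction) for reaction in experiment]
    moves = ["change", "add_end", "remove_end"]
    if not fixed_reactions:
        moves += ["add_reaction", "remove_reaction"]
    move = rng.choice(moves)
    if move == "change":                                  # move (1)
        reaction = rng.choice(new)
        reaction[rng.randrange(len(reaction))] = rng.choice(available)
    elif move == "add_end":                               # move (2)
        rng.choice(new).append(rng.choice(available))
    elif move == "remove_end":                            # move (3)
        candidates = [r for r in new if len(r) >= 3]
        if candidates:
            reaction = rng.choice(candidates)
            reaction.pop(rng.randrange(len(reaction)))
    elif move == "add_reaction":                          # move (4)
        new.append([rng.choice(available), rng.choice(available)])
    elif len(new) > 1:                                    # move (5)
        new.pop(rng.randrange(len(new)))
    return new


rng = random.Random(0)
available = ["GATC", "GAATTC", "AAGCTT", "GGATCC"]
start = [["GATC", "GAATTC"]]
candidate = propose_move(start, available, rng)
```

Unequal weighting could be obtained by replacing rng.choice(moves) with rng.choices(moves, weights=...).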
Several additional subsidiary choices are needed in
order to apply simulated annealing. The "Boltzmann constant"
is taken to be 1.0, so that the energy equals the
temperature. The minimum of the energy and temperature,
denoted E0 and T0, respectively, are defined by the maximum of
the information measure. For example, if the number of good
sequences of interest is G and is used as the information
measure, then E0, which equals T0, equals 1/G. An initial
temperature, denoted T1, is preferably chosen to be 1. An
initial experimental definition, or state, is chosen, either
randomly or guided by prior knowledge of previous
experimental optimizations. Finally, two execution
parameters are chosen. These parameters define the
"annealing schedule", that is the manner in which the
temperature is decreased during the execution of the
simulated annealing method. They are the number of
iterations in an epoch, denoted by N, which is preferably
taken to be 100 and the temperature decay factor, denoted by
f, which is preferably taken to be 0.95. Both N and f may be
systematically varied case-by-case to achieve a better
optimization of the experiment definition with a lower energy
and a higher information measure.
With choices for the information measure or energy
function, the moves for generating new experiments, an
initial state or experiment, and the execution parameters
made as above, the general application of simulated annealing
to optimize an experimental definition is illustrated in Fig.
13A. The information measure used in this description is the
number of good sequences of interest. Any information
measure, such as those previously described, may be used
alternately.
The method begins at step 1701. At step 1702 the
temperature is set to the initial temperature; the state to
the initial state or experimental definition; and the energy
is set to the energy of the initial state. At step 1703 the
temperature and energy are checked to determine whether
either is less than or equal to the minimum for the
information measure chosen, as the result of either a
fortuitous initial choice or subsequent computation steps.
If the energy is less than or equal to the minimum energy, no
further optimization is possible, and the final experimental
definition and its energy are output. If the temperature is
less than or equal to the minimum temperature, the
optimization is stopped. Then the inverse of the energy is
the number of good sequences of interest for this
experimental definition.
Step 1706 is a DO loop which executes an epoch, or
N iterations, of the simulated annealing algorithm. Each
iteration consists of steps 1707 through 1711. Step 1707
generates a new experimental definition, or state, S_new,
according to the described generation moves. Step 1708
ascertains or determines the information content, or energy,
of S_new. Step 1709 tests the energy of the new state, and, if
it is lower than the energy of the current state, at step
1711, the new state and new energy are accepted and replace
the current state and current energy. If the energy of the
new state is higher than the energy of the current state,
step 1710 computes the following function.
        EXP[-(E_new - E) / T]                              (4)
This function defines the probabilistic factor controlling
acceptance. If this function is greater than a randomly chosen
number uniformly distributed between 0 and 1, then the new
state is accepted at step 1711. If not, then the newly
generated state is discarded. These steps are equivalent to
accepting a new state if the energy is not increased by an
amount greater than that determined by function (4) in
conjunction with the selection of a random number. Or in
other words, a new state is accepted if the new information
measure is not decreased by an amount greater than indirectly
determined by function (4).
Finally, after an epoch of the algorithm, at step
1712 the temperature is reduced by the multiplicative factor
f and the method loops back to the test at step 1703.
Using this algorithm, starting from an initial
experimental definition which has certain information
content, the algorithm produces a final experimental
definition with a higher information content, or lower
energy, by repetitively and randomly altering the
experimental definition in order to search for a definition
with a higher information content.
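The annealing loop of Fig. 13A can be sketched generically as below. The function name anneal and its parameters are assumptions of this example; the energy and move generator are passed in as callables. The sketch accepts an uphill move when a uniform random draw falls below EXP[-(E_new - E)/T], which is the conventional Metropolis rule and is consistent with the statement that lowering the temperature limits acceptance of higher-energy states. Equal-energy moves are accepted, which is a further assumption.

```python
import math
import random


def anneal(initial_state, energy, propose, e_min, t_min,
           t_initial=1.0, epoch=100, decay=0.95, rng=None):
    """Return (final state, final energy).

    `energy(state)` plays the role of E( ) and `propose(state, rng)`
    generates a new state by one move. `epoch` is N and `decay` is f.
    """
    rng = rng or random.Random()
    state, e = initial_state, energy(initial_state)      # step 1702
    temperature = t_initial
    while e > e_min and temperature > t_min:              # step 1703
        for _ in range(epoch):                            # step 1706
            candidate = propose(state, rng)               # step 1707
            e_new = energy(candidate)                     # step 1708
            if e_new <= e:                                # step 1709
                state, e = candidate, e_new               # step 1711
            elif rng.random() < math.exp(-(e_new - e) / temperature):
                state, e = candidate, e_new               # steps 1710, 1711
        temperature *= decay                              # step 1712
    return state, e


# Toy problem: the state is an integer and the "information measure"
# is state + 1, capped at G = 11, so E0 = T0 = 1/G.
def toy_energy(state):
    return 1 / (1 + state)


def toy_propose(state, rng):
    return min(state + 1, 10)


final_state, final_energy = anneal(0, toy_energy, toy_propose,
                                   e_min=1 / 11, t_min=1 / 11)
```

With the preferred values T1 = 1 and f = 0.95, the temperature falls below 1/11 after roughly 47 epochs, so the toy run is bounded even if the minimum energy is never reached.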
The computation of the energy of an experimental
definition, or state, in step 1708 is illustrated in more detail
in Fig. 13B. This method starts at step 1720. Step 1721
inputs the current experimental definition. Step 1722
determines a complete digest database from this definition
and a particular selected database by the method of Fig. 11.
Step 1723 scans the entire digest database and counts the
number of good sequences of interest. If the total number of
good sequences is the measure used, the total number of good
sequences can be counted. Alternatively, other information
measures may be applied to the digest database. Step 1724
computes the energy as the inverse of the information
measure. Alternatively, another decreasing function of the
information content may be used as the energy. Step 1725
outputs the energy, and the method ends at step 1726.
5.4.3. QEA™ AMBIGUITY RESOLUTION
In one utilization of this invention two related
tissue samples can be subjected to the same experiment, perhaps
consisting of only one recognition reaction, and the outcomes
compared. The two tissue samples may be otherwise identical
except for one being normal and the other diseased, perhaps
by infection or a proliferative process, such as hyperplasia
or cancer. One or more signals may be detected in one sample
and not in the other sample. Such signals might represent
genetic aspects of the pathological process in one tissue.
These signals are of particular interest.
The candidate sequences that can produce a signal
of interest are determined, as previously described, by
lookup in the digest database. The signal may be produced by
only one sequence, in which case it is unambiguously
identified. However, even if the experiment has been
optimized, the signal may be ambiguous in that it may be
produced by several candidate sequences from the selected
database. A signal of interest may be made unambiguous in
several manners which are described herein.
In a first manner of making unambiguous assume the
signal of interest is produced by several candidate sequences
all of which are good sequences for the particular
experiment. Then which sequences are present in the signal
of interest can be ascertained by determining the
quantitative presence of the good sequences from their unique
signals. For example, referring to Fig. 8B, if the signal
1217 of length 175 with the labeling 1213 is of interest, the
sequences actually present in the signal can be determined
from the quantitative determination of the presence of
signals 1215 and 1218. Here, both the possible sequences
contributing to this signal are good sequences for this
experiment.
The first manner of making unambiguous can be
extended to the case where one of the sequences possibly
contributing to a signal is not a good sequence. The
quantitative presence of all the possible good sequences can
be determined from the quantitative strength of their unique
signals. The presence of the remaining sequence which is not
a good sequence can be determined by subtracting from the
quantitative presence of the signal of interest the
quantitative presences of all the good sequences.
Further extensions of the first manner can be made
to cases where more than one of the possible sequences is not
a good sequence if the sequences which are not good appear
as contributors to further signals involving good sequences
in a manner which allows their quantitative presences to be
determined. For example, suppose signal 1219 is of interest,
where both possible sequences are not good sequences. The
quantitative presence of sequence Q012 can be determined from
signals 1220 and 1218 in the manner previously outlined. The
quantitative presence of sequence S003 can be determined from
signals 1216 and 1215. Thereby, the sequences contributing
to signal 1219 can be determined. More complex combinations
can be similarly made unambiguous.
An alternative extension of the first manner of
making unambiguous is by designing a further experiment in
which the possible sequences contributing to a signal of
interest are good sequences even if they were not originally
so. Since there are approximately 50 suitable REs that can
be used in the RE embodiment of QEA™ (Section 6.2), there are
approximately 600 RE reaction pairs that can be performed,
assuming that half of the theoretical maximum of 1,250 (50 X
50 / 2 = 1,250) are not useable. Since most RE pairs produce
on the average 200 fragments and standard electrophoretic
techniques can resolve at least approximately 500 fragment
lengths per lane, the RE QEA™ embodiment has the potential of
generating over 100,000 signals (500 X 200 = 100,000). The
number of possible signals is further increased by the use of
reactions with three or more REs and by the use of labeled
probes. Further, since the average complex human tissue, for
example brain, is estimated to express no more than
approximately 25,000 genes, there is a 4 fold excess of
possible signals over the number of possible sequences in a
sample. Thus it is highly likely that for any signal of
interest, a further experiment can be designed and optimized
for which all possible candidates of the signal of interest
are good sequences. This design can be made by using the
prior optimization methods with an information measure of the
sequences of interest in the signal of interest and starting
with an extensive initial experimental definition including
many additional reactions. In that manner, any signal of
interest can be made unambiguous.
A second manner of making unambiguous is by
automatically ranking the likelihood that the sequences
possibly present in a signal of interest are actually present
using information from the remainder of the experimental
reactions. Fig. 14 illustrates a preferred ranking method.
The method begins at step 1801 and at step 1802 inputs the
list of possible accession numbers in a signal of interest,
the experimental definition, and the actual experimental
results. DO-loop 1803 iterates once for each possible
accession number. Step 1804 performs a simulated experiment
by the method illustrated in Fig. 11 in which, however, only
the current accession number is acted on. The output is a
single sequence digest table, such as illustrated in Fig.
10F.
Step 1805 determines a numerical score for ranking
the similarity of this digest table to the experimental
results. One possible scoring metric comprises scanning the
digest table for all fragment signals and adding 1 to the
score if such a signal appears also in the experimental
results and subtracting 1 from the score if such signal does
not appear in the experimental results. Alternate scoring
metrics are possible. For example, the subtraction of 1 may
be omitted.
Step 1806 sorts the numerical scores of the
likelihood that each possible accession number is actually
present in the sample. Step 1807 outputs the sorted list and
the method ends at step 1808.
By this method likelihood estimates of the presence
of the various possible sequences in a signal of interest can
be determined.
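The scoring and sorting of steps 1805 and 1806 can be sketched as follows. The function name rank_candidates, the representation of a signal as a (reaction, length) pair, and the sample data are assumptions of this example; the per-sequence digest tables are taken as already computed.

```python
def rank_candidates(predicted, observed, penalize_missing=True):
    """Rank candidate accession numbers by agreement with the results.

    `predicted` maps each accession number to the set of signals in its
    single sequence digest table; `observed` is the set of signals found
    experimentally. Each predicted signal scores +1 if observed and -1
    otherwise (the -1 is dropped when `penalize_missing` is false).
    """
    scores = {}
    for accession, signals in predicted.items():
        score = 0
        for signal in signals:
            if signal in observed:
                score += 1
            elif penalize_missing:
                score -= 1
        scores[accession] = score
    # Step 1806: highest score, that is most likely present, first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical digest tables and experimental results.
predicted = {
    "A01": {("R1", 52), ("R2", 175)},
    "S003": {("R2", 175), ("R2", 88), ("R1", 300)},
}
observed = {("R1", 52), ("R2", 175)}
ranking = rank_candidates(predicted, observed)
```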
5.5. COLONY CALLING
The colony calling embodiment recognizes and
classifies single, individual genes or DNA sequences by
determining the presence or absence of target subsequences.
No length information is determined. This embodiment is
directed to gene determination and classification of arrayed
samples or colonies, where each sample or colony contains or
expresses only one sequence or gene of interest and is
perhaps prepared from a tissue cDNA library. The presence or
absence of target subsequences in a colony is determined by
use of labeled hybridization recognition means, each of which
uniquely binds to one target subsequence. It is preferable
that this binding be highly specific and reproducible. Each
sample or colony, or an array of samples or colonies, is
assayed for the contained sequence by determining which of
the set of probes recognizes and thus hybridizes to target
subsequences in the sample(s) or colony(ies). Each sample is
then characterized by a hash code, each bit of which
indicates which probes recognized subsequences, or hits, in a
particular sample. The sequence or gene in a sample is
determined from the hash code by computer implemented
methods.
The choice of the target subsequences is important.
For economical and rapid assay, the size of the set of
recognition means should be as small as possible, preferably
less than 50 elements and more preferably from 15 to 25
elements. Further, it is most preferable that all possible
sequences or genes are recognized and uniquely determined.
It is preferable that 90 to 95% of all possible sequences be
recognized, with each sequence being indistinguishable from,
or ambiguous with, at most one or two other sequences.
Therefore, each target subsequence preferably occurs
frequently enough to minimize the number of different
recognition means needed. For example, it is not practical
for this invention, directed to rapid gene classification, if
each probe recognized only a few genes and therefore
thousands of probes were needed. However, each target
subsequence preferably does not occur so frequently that its
presence conveys little information. For example, a probe
recognizing every gene conveys no information.
The optimal choice is for each target subsequence
to have a probability of occurrence in all the genes or
sequences that can appear in a sample or colony of
approximately 50%; a preferable choice is a probability of
occurrence between 10 and 50%. Typically for human cDNA
libraries, target subsequences of length 4 to 6 meet this
condition, as longer sequences occur too infrequently to make
useful hash codes. Additionally, the presence of one target
subsequence is preferably independent of the presence of any
other target subsequence in the same sequence or gene. These
two criteria ensure that a hash code for a sample, consisting
of indications of which target subsequences are present, is
maximally likely to represent a unique gene or DNA sequence
with a minimum of wasted code words not specifying any gene.
Such a hash code is an efficient representation of sequences
or genes.
The maximal number of genes or sequences that can
be represented by a hash code is $2^n$, where $n$ is the number of
target subsequences. A simple test to determine whether the
target subsequences occur frequently enough in the expected
gene library is made by comparing the actual probabilities of
the two hash codes that have all target subsequences either
present or absent to the ideal probabilities of these codes.
If $p$ is the probability that any target subsequence occurs in
a given sequence in the library, then the probability that none
of the target subsequences occurs in a random gene is $(1-p)^n$.
The closer the ratio $(1-p)^n / 2^{-n}$ is to 1, the more efficient is
the code. Similarly, the closer $p^n / 2^{-n}$, the ratio of the
probability that all the target subsequences are present to
the ideal probability conveying maximum information, is to 1,
the more efficient is the code. We see the optimal $p$ is
close to $2^{-1}$.
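The two efficiency ratios can be evaluated numerically to see how quickly a code degrades as the occurrence probability moves away from one half. The sketch below is illustrative only; the function name and the choice of 20 target subsequences are assumptions for the example.

```python
def code_efficiency(p, n):
    """Return the two efficiency ratios for occurrence probability p and n targets.

    First value:  (1 - p)**n / 2**-n, for the code with every target absent.
    Second value: p**n / 2**-n, for the code with every target present.
    Both equal 1 for an ideal code.
    """
    ideal = 2.0 ** (-n)
    all_absent = (1.0 - p) ** n
    all_present = p ** n
    return all_absent / ideal, all_present / ideal


# Compare a few occurrence probabilities for a 20-probe code.
for p in (0.10, 0.25, 0.50):
    absent_ratio, present_ratio = code_efficiency(p, 20)
    print(f"p={p:.2f}  all-absent ratio={absent_ratio:.3g}  "
          f"all-present ratio={present_ratio:.3g}")
```

At p = 0.5 both ratios are exactly 1. For smaller p the all-absent code becomes far more likely than the ideal and the all-present code far less likely, which is the inefficiency the test is meant to expose.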
The preferred method of selecting target
subsequences meeting the probability of occurrence and
independence criteria is to use a database containing
sequences generally expected to be present in the samples to
be analyzed, for example human GenBank sequences for human
tissue derived samples. From a sequence database, oligomer
frequency tables are compiled containing the frequencies of,
preferably, all 4 to 8-mers. From these tables, candidate
subsequences with the desired probability of occurrence are
selected. Each candidate target subsequence is then checked
for independent occurrence, by, for example, checking that
the joint probability of a hit by both members of any selected
pair of candidates is approximately the product of the
individual candidate hit probabilities. Candidate
target subsequences meeting both occurrence and independence
criteria are possible target subsequences. A sufficient
number, typically 20, of any of these subsequences can be
selected as target subsequences for a hash code.
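The selection procedure can be sketched as follows, assuming the sequence database is simply a list of strings. The function names, the 10 to 50% window defaults, and in particular the numerical tolerance used to decide that two candidates are "approximately" independent are assumptions for illustration; the text does not specify a tolerance.

```python
from itertools import combinations


def occurrence_probability(kmer, sequences):
    """Fraction of database sequences containing kmer at least once."""
    return sum(kmer in seq for seq in sequences) / len(sequences)


def joint_probability(a, b, sequences):
    """Fraction of database sequences containing both a and b."""
    return sum((a in seq) and (b in seq) for seq in sequences) / len(sequences)


def select_candidates(kmers, sequences, p_min=0.10, p_max=0.50):
    """Keep k-mers whose occurrence probability lies in the desired window."""
    probs = {k: occurrence_probability(k, sequences) for k in kmers}
    return {k: p for k, p in probs.items() if p_min <= p <= p_max}


def independent_pairs(candidates, sequences, tolerance=0.05):
    """Return candidate pairs whose joint hit probability is close to the product.

    `tolerance` is a hypothetical threshold chosen for this example.
    """
    pairs = []
    for a, b in combinations(sorted(candidates), 2):
        expected = candidates[a] * candidates[b]
        observed = joint_probability(a, b, sequences)
        if abs(observed - expected) <= tolerance:
            pairs.append((a, b))
    return pairs


# Hypothetical four-sequence database.
sequences = ["ACGTAC", "ACGTTT", "TTGTAC", "TTTTTT"]
candidates = select_candidates(["ACGT", "GTAC", "TTTT"], sequences)
pairs = independent_pairs(candidates, sequences)
```

In practice the candidate k-mers would be enumerated from oligomer frequency tables compiled over the whole database rather than listed by hand.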
Preferably, but optionally, the initial set of
target subsequences can be optimized, using information on
the actual occurrences of the initially selected target
subsequences in the sequence database, resulting in a set of
target subsequences selected which recognizes a maximum
number of genes with a minimum number of sequences and with a
minimal amount of recognition ambiguity. Alternatively, this
optimization can also be performed on a sub-set of the
database comprised of sequences or genes of particular
biological or medical interest, for example, the set of all
oncogenes or growth factors. In this manner, fewer target
subsequences can be chosen which distinguish more efficiently
among a set of sequences or genes of particular interest and
distinguish that set of genes from the sequences of the
remainder of the sample.
This combinatorial optimization problem is
computationally intensive to solve exactly. A number of
approximate techniques can be used to obtain efficient nearly
optimal solutions. The preferred but not limiting technique
is to use simulated annealing (Press et al., 1986, Numerical
Recipes - The Art of Scientific Computing, Sec. 10.9,
Cambridge University Press, Cambridge, U.K.). The
experimental design and optimization are described in detail
in the following section.
Example 6.6 illustrates the results of the
simulated annealing optimization method. Simulated annealing
generally produces a choice of subsequences that achieve the
same resolution while using approximately 20% fewer total
sequences than a selection guided only by the probability
principles previously described. This level of optimization
is likely to improve with larger and less redundant databases
that represent longer genes.
An alternative to using single target subsequences
is to use sets of target subsequences, recognized by sets of
identically labeled hybridization probes, to generate one
presence or absence indication for the hash code. In this
alternative, sets of longer target subsequences would be
chosen such that the presence of any target subsequence in
the set is a presence indication. Absence means no element
of the set is present. If the sets are chosen so that their
probability of presence in a single sequence is near 50%,
preferably from 10 to 50%, and the presence or absence of one
set is independent of the presence or absence of any other
set, such sets can be used to construct codes equally well as
single subsequences. A resulting code will be efficient and
can be further optimized by simulated annealing, as for
single target subsequence codes. Target sets of longer
subsequences are preferable where experimental recognition of
shorter subsequences is less specific and reproducible, as
for example is true where short DNA oligomers are used as
hybridization probes for recognition. As a further
alternative, a code can consist of presence or absence
indications of mixed target sets of subsequences and single
target subsequences.
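A code built from target sets, or from a mixture of sets and single subsequences, differs from a single-subsequence code only in how each bit is computed. The following is a small sketch under the assumption that exact string containment stands in for hybridization; the function names and example subsequences are hypothetical.

```python
def set_bit(target_set, sequence):
    """Return 1 if any member of the target set is present in the sequence, else 0."""
    return int(any(sub in sequence for sub in target_set))


def mixed_hash_code(code_elements, sequence):
    """Build a hash code from a mix of single targets (str) and target sets (tuples)."""
    bits = []
    for element in code_elements:
        # A single subsequence is treated as a set with one member.
        members = [element] if isinstance(element, str) else element
        bits.append(str(set_bit(members, sequence)))
    return "".join(bits)


# One single target followed by two sets of longer targets (illustrative values).
elements = ["ACGT", ("GGATCC", "GAATTC"), ("TTTAAA", "CCCGGG")]
code = mixed_hash_code(elements, "TTACGTGAATTCAA")
```

Each set contributes exactly one bit, so the code length equals the number of elements rather than the number of individual probes.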
Probes for a target subsequence are preferably PNA
oligomers, or less preferably DNA oligomers, which hybridize
to the subsequence of interest. Use of sets of degenerate
DNA oligomers to more specifically and reliably hybridize to
short DNA subsequences has been described in relation to the
PCR implementation of QEA™ methods. The use of PNAs is
preferred in the colony calling embodiment since PNA
oligomers, due to their more favorable hybridization
energetics, more specifically and reliably hybridize to
shorter complementary DNA subsequences than do DNA oligomers.
Reliable hybridization occurs for PNA 6 to 8-mers and longer.
Probing shorter subsequences preferably uses fully degenerate
sets of PNA oligomers, as is the case for DNA oligomers.
PNAs are even more preferable when, in the
alternative, the hash code comprises presence or absence
indication of target sets of longer subsequences. In this
case, many more DNA probes are generally required than PNA
probes. As PNA 6 to 8-mers reliably hybridize, target sets
can consist of subsequences of length 6 to 8. Since DNA
oligomers of this length may not reliably hybridize, each
subsequence in the set must in turn be represented by a
further degenerate set of DNA oligomers, requiring thereby a
set of sets.
The experimental method of colony calling comprises
three principal steps: first, arraying cDNA libraries on
filters or other suitable substrates; second, PNA
hybridization and detection, alternatively DNA hybridization
can be used; and third, interpreting the resulting hash code
to determine the sequence in the sample.
The first step, which can be omitted if arrayed
cDNA libraries are already available, is constructing and
arraying cDNA libraries. Any methods known in the art may be
used. For example, cDNA libraries from normal or diseased
tissues can be constructed according to Example 6.3.
Alternatively, the human cDNA libraries constructed by M.B.
Soares and colleagues are available as high density arrays on
filters and can be used for the practice of this method. See
Soares et al., 1994, Proc. Natl. Acad. Sci. USA 91:9228-32.
The ability to spot up to thousands of cDNA clones or
colonies on filters suitable for hybridization is an
established technology. This service is now provided by
several companies, including the preferred supplier Research
Genetics (Huntsville, AL). The protocol of Example 6.7 can
be used to generate these arrays from cDNA libraries.
The second step is probe (e.g., PNA) hybridization
and detection. Fluorescently labeled PNA oligomers are
available from PerSeptive Biosystems (Bedford, MA) or can be
synthesized. PNAs are designed to be complementary to the
chosen target subsequences and to have a maximum number of
distinguishable labels for simultaneous hybridization with
multiple oligomers. PNA hybridization is performed according
to standard protocols developed by the manufacturer and
detailed in Example 6.7. Detection of the PNA signals uses
optical spectrographic means to distinguish fluorochrome
emissions similar to those used in DNA analysis instruments,
but appropriately modified to recognize spots on filters as
opposed to linearly arrayed bands.
The third step, interpretation of the hash code, is
done by the computer implemented method described in the
following section.
In an alternative embodiment, the intensity of the
detected hybridization signal indicates the number of times
the probe binds to the sample sequence. In this manner the
number of recognized target subsequences present in the
sample can be determined. This information can be used to
more precisely classify or identify a sample.
5.6. CC ANALYSIS AND DESIGN METHODS
The colony calling ("CC") computer implemented
methods are similar to QEA™ computer methods. As for QEA™,
the experimental analysis methods are described before the
experimental design methods.
5.6.1. CC EXPERIMENTAL ANALYSIS METHODS
The analysis methods make use of a mock experiment
concept. First, a database is selected to represent possible
sequences in the sample by the same methods as described for
QEA™ analysis. These are illustrated and described with
reference to Fig. 6A. For CC, an experimental definition is
simply a list of Np target subsequences, where Np is
preferably between 16 and 20. Next, a mock experiment
generates one hash code for each sequence in the selected
database, each hash code being a string of Np binary digits
wherein the n'th digit is a 1 (0) if the n'th target
subsequence does (does not) hybridize with the sequence. The
results of all the mock experiments determine the pattern of
hash codes expected. This pattern is output in a code table
of all possible hash codes in which, for each hash code,
there is a list of all accession numbers of sequences with
this code.
This method is illustrated in more detail in Fig.
15. The method starts at step 1901 and at step 1902 it
inputs a selected database and an experimental definition
consisting of Np target subsequences. Step 1903 initializes a
table which for each of the $2^{N_p}$ hash codes can contain a list
of possible accession numbers which have this hash code.
Step 1904 is a DO loop which iterates through all sequences
in the database. For a particular sequence, step 1905 checks
for each target subsequence whether that subsequence
hybridizes to the sequence. This is implemented by string
matching in a manner similar to step 1303 of Fig. 9. A
binary hash code is constructed from this hybridization
information, and step 1906 adds the accession number of the
sequence to the list of accession numbers associated with
this hash code in the code table. Step 1907 outputs the code
table and the method ends at step 1908.
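The mock experiment of steps 1901 to 1908 can be sketched in a few lines. The version below is a simplified illustration, not the patented implementation: hybridization is modeled as exact substring matching, the database is assumed to be a mapping from accession number to sequence string, and the table is stored sparsely (only occupied codes) rather than preallocated for all $2^{N_p}$ codes.

```python
from collections import defaultdict


def hash_code(sequence, targets):
    """Binary string whose n'th digit is 1 if the n'th target occurs in the sequence."""
    return "".join("1" if target in sequence else "0" for target in targets)


def build_code_table(database, targets):
    """Map each occupied hash code to the accession numbers producing it."""
    table = defaultdict(list)
    for accession, sequence in database.items():            # step 1904: loop over sequences
        code = hash_code(sequence, targets)                  # step 1905: string matching
        table[code].append(accession)                        # step 1906: record accession
    return dict(table)                                       # step 1907: output code table


# Hypothetical experimental definition and database.
targets = ["ACG", "GGA", "TTC"]
database = {
    "ACC001": "TTACGGATT",
    "ACC002": "GGATTCAA",
    "ACC003": "CCACGAAGGA",
    "ACC004": "AAAAAA",
}
code_table = build_code_table(database, targets)
```

A sparse dictionary avoids allocating roughly a million empty lists when Np is 20; hash codes absent from the dictionary correspond to empty table entries.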
Having built a pattern of simulated hash codes in a
code table, analysis of an experiment requires only simple
table look-up. A colony is hybridized with each of the Np
recognition means for the target subsequences. The results
of the hybridization are used to construct a resulting hash
code. The code table entry for this hash code then contains
a list of sequence accession numbers that are possible
candidates for the sample sequence. If the list contains
only one element, then the sample has been uniquely
identified. If the list contains more than one element, the
identification is ambiguous. If the list is empty, the
sample is not in the selected database and may possibly be a
previously unknown sequence.
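The table look-up and its three possible outcomes can be illustrated as follows. The function name, the status strings, and the toy code table are assumptions for the example; the hybridization results are assumed to arrive as one true/false value per probe.

```python
def call_colony(hybridization_results, code_table):
    """Classify one colony from its per-probe hybridization results.

    hybridization_results: sequence of booleans, one per recognition means
    code_table: mapping from hash code string to list of accession numbers
    """
    code = "".join("1" if hit else "0" for hit in hybridization_results)
    candidates = code_table.get(code, [])
    if len(candidates) == 1:
        status = "unique"            # sample uniquely identified
    elif candidates:
        status = "ambiguous"         # several sequences share this code
    else:
        status = "not in database"   # possibly a previously unknown sequence
    return code, status, candidates


# Hypothetical code table from a three-probe mock experiment.
code_table = {"110": ["ACC001", "ACC003"], "011": ["ACC002"], "000": ["ACC004"]}
result_unique = call_colony([False, True, True], code_table)
result_ambiguous = call_colony([True, True, False], code_table)
result_unknown = call_colony([True, True, True], code_table)
```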
Alternately, as for QEA™ experimental analysis, a
code table can be dispensed with if only a few hash codes
need to be looked up from only a few experiments. Then the
DNA database is scanned sequence by sequence for those
sequences generating the hash code of interest. If many hash
codes from many experiments need to be analyzed, a code table
is more efficient. The quantitative decision of when to
build a code table depends on the costs of the various
operations and the size of the DNA database, and can be performed
as is well known in the computer arts. Without limitation,
this description is built on the use of a code table.
For those embodiments where the recognition means
can each recognize a subset of target subsequences, code
table construction must be modified accordingly. Such
embodiments, for example, can involve DNA oligomer probes
which due to their length can hybridize with an intended
target subsequence and those subsequences which differ by 1
base pair from the intended target. In such embodiments,
step 1905 checks whether each member of such a set of target
subsequences is found in the sample sequence. If any member
is found in the sequence, then this information is used to
construct the hash code.
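For the one-base-mismatch case, the modified check of step 1905 might look like the sketch below. It assumes, for illustration only, that a probe hybridizes to its intended target and to every subsequence of the same length differing at exactly one position; the function names are hypothetical.

```python
def one_mismatch_set(target, alphabet="ACGT"):
    """Return the intended target plus all subsequences differing from it by one base."""
    variants = {target}
    for i, original in enumerate(target):
        for base in alphabet:
            if base != original:
                variants.add(target[:i] + base + target[i + 1:])
    return variants


def probe_bit(target, sequence):
    """Return 1 if the target or any one-mismatch variant occurs in the sequence."""
    return int(any(variant in sequence for variant in one_mismatch_set(target)))


def hash_code_with_mismatch(sequence, targets):
    """Hash code in which each probe tolerates a single mismatch."""
    return "".join(str(probe_bit(target, sequence)) for target in targets)


# "AGGT" differs from the intended target "ACGT" at one position.
code = hash_code_with_mismatch("TTAGGTTT", ["ACGT", "CCCC"])
```

A target of length L has 3L one-mismatch variants, so the effective hit probability of each probe rises accordingly and should be re-estimated when choosing targets.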
5.6.2. CC EXPERIMENTAL DESIGN METHODS
As for QEA™, the goal of CC experimental design is
to maximize the amount of information from a CC hybridization
experiment. This is also performed by defining an
information measure and choosing an optimization method which
maximizes this measure.
The preferred information measure is the number of
occupied hash codes. This is equivalent to minimizing the
number of accession numbers which can result in a given hash
code. In fact for Np greater than about 17 to 18, that is for
$2^{N_p}$ greater than the number of expressed human genes (about
100,000), maximizing the number of occupied hash codes can
result in each hash code representing a single sequence.
Such a unique code contains the maximum amount of
information. The invention is adaptable to other CC
information measures. For example, if only a subset of the
possible sequences are of interest, an appropriate measure
would be the number of such sequences which are uniquely
represented by a hash code. As for QEA™, these are sequences
of interest.
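Both information measures reduce to a single pass over the code table, and the lower bound on Np follows from requiring at least as many codes as sequences. The sketch below assumes a code table represented as a mapping from hash code to accession list; the function names and toy data are illustrative.

```python
def occupied_hash_codes(code_table):
    """Number of hash codes with a non-empty accession number list."""
    return sum(1 for accessions in code_table.values() if accessions)


def uniquely_coded_sequences_of_interest(code_table, of_interest):
    """Number of sequences of interest that are the sole occupant of their hash code."""
    return sum(
        1
        for accessions in code_table.values()
        if len(accessions) == 1 and accessions[0] in of_interest
    )


def min_probes_for(num_sequences):
    """Smallest Np such that 2**Np is at least num_sequences."""
    return max(1, (num_sequences - 1).bit_length())


# Hypothetical code table, including one explicitly empty entry.
code_table = {
    "110": ["ACC001", "ACC003"],
    "011": ["ACC002"],
    "000": ["ACC004"],
    "111": [],
}
n_occupied = occupied_hash_codes(code_table)
n_unique = uniquely_coded_sequences_of_interest(code_table, {"ACC002", "ACC003"})
np_needed = min_probes_for(100_000)
```

For about 100,000 sequences the bound is 17 probes, since 2 to the 17th power is 131,072; this is only a necessary condition, because real target subsequences are never perfectly balanced and independent.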
One optimization algorithm is exhaustive search.
In exhaustive search, all subsequences of length less than
approximately 10 are tried in all combinations in order to
find the optimum combination producing the best hash code
according to the chosen information measure. This method is
inefficient. The preferred algorithm for optimizing the
information from an experiment is simulated annealing. This
is performed by the method illustrated and described with
respect to Fig. 13A. For CC, the following preferred choices
are made.
The energy is taken to be 1.0 divided by the
information content; alternatively, any monotonically
decreasing function of the information content can be used.
The energy is determined by performing the mock experiment of
Fig. 15 using a particular experimental definition and then
applying the measure to the resulting code table. For
example, if the number of occupied hash codes is the
information measure, this number can be computed by simply
scanning the code table and counting the number of table
entries with non-empty accession number lists. The Boltzmann
constant is again taken to be 1 so that the temperature
equals the energy. The initial temperature is preferably
1.0. The minimum energy and temperature, E0 and T0,
respectively, are determined by the information measure. For
example, with the prior choices for energy function and
information measure, E0, which equals T0, is 1.0 divided by
the number of sequences in the selected database.
The method of generating a new experimental
definition from an existing definition is to pick randomly
one target subsequence and to perform one of the following
moves: (1) randomly modifying one or more nucleotides; (2)
adding a random nucleotide; and (3) removing a random
nucleotide. A modification is discarded if it results in two
identical target subsequences. Further, it is desirable to
discard a modification if the resulting subsequence has an
extreme probability of binding to sequences in the database.
For example, if the modified subsequence binds with a
probability less than approximately 0.1 or more than
approximately 0.5 to sequences in the selected database, it
should be discarded. To generate a new experiment, one of
these moves is randomly selected and carried out on the
existing experimental definition. Alternatively, the various
moves can be unequally weighted. The invention is further
adaptable to other methods of generating new experiments.
Preferably, generation methods used will randomly generate
all possible experiments. An initial experimental definition
can be picked by taking Np randomly chosen subsequences or by
using subsequences from prior optimization.
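The three moves and the two discard rules can be sketched as a single proposal function. This is an illustrative simplification: the modify move changes exactly one nucleotide, the moves are equally weighted, and the binding probability is supplied by the caller as an optional function. All names are assumptions for the example.

```python
import random

BASES = "ACGT"


def propose_move(definition, rng, binding_probability=None, p_min=0.1, p_max=0.5):
    """Return a modified copy of the experimental definition, or None if discarded.

    definition: list of target subsequences
    rng: a random.Random instance
    binding_probability: optional function mapping a subsequence to its
        probability of binding to sequences in the selected database
    """
    new_definition = list(definition)
    index = rng.randrange(len(new_definition))        # pick one target at random
    target = new_definition[index]
    move = rng.choice(("modify", "add", "remove"))    # equally weighted moves

    if move == "modify":
        pos = rng.randrange(len(target))
        base = rng.choice([b for b in BASES if b != target[pos]])
        candidate = target[:pos] + base + target[pos + 1:]
    elif move == "add":
        pos = rng.randrange(len(target) + 1)
        candidate = target[:pos] + rng.choice(BASES) + target[pos:]
    else:
        if len(target) <= 1:
            return None                               # nothing left to remove
        pos = rng.randrange(len(target))
        candidate = target[:pos] + target[pos + 1:]

    if candidate in definition:
        return None                                   # would duplicate a target
    if binding_probability is not None:
        if not (p_min <= binding_probability(candidate) <= p_max):
            return None                               # extreme binding probability
    new_definition[index] = candidate
    return new_definition


rng = random.Random(0)
proposal = propose_move(["ACGT", "GGCC", "TTAA"], rng)
```

Returning None for a discarded move lets the caller simply draw another proposal, which keeps the discard rules separate from the acceptance rule of the annealing loop.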
Finally, the two execution parameters defining the
"annealing schedule", that is the manner in which the
temperature is decreased during the execution of the
simulated annealing method, are defined and chosen as in
QEA™. The number of iterations in an epoch, denoted by N, is
preferably taken to be 100 and the temperature decay factor,
denoted by f, is preferably taken to be 0.95. Both N and f
may be systematically varied case-by-case to achieve a better
experimental definition with lower energy and a higher
information measure.
With these choices the simulated annealing
optimization method of Fig. 13A can be performed to obtain an
optimized set of target subsequences. To determine an
optimum Np, different initial Np can be selected, the prior
design optimization performed, and the results compared. The
Np with the maximum information measure is optimum for the
selected database.
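Putting the preferred choices together, an annealing run might be sketched as below. The details of Fig. 13A are not reproduced here, so this is a generic Metropolis-style loop offered as an assumption: energy is 1.0 divided by the number of occupied hash codes, the Boltzmann constant is 1, the initial temperature is 1.0, each epoch has 100 iterations, and the temperature decays by 0.95 per epoch. For brevity the move generator only substitutes single nucleotides, and exact substring matching stands in for hybridization.

```python
import math
import random


def occupied_codes(database, targets):
    """Number of distinct hash codes produced by the database for these targets."""
    return len({
        "".join("1" if t in seq else "0" for t in targets)
        for seq in database.values()
    })


def energy(database, targets):
    """Energy = 1.0 / information content (number of occupied hash codes)."""
    return 1.0 / occupied_codes(database, targets)


def mutate(targets, rng):
    """Substitute one random nucleotide; keep the old set if a duplicate results."""
    new = list(targets)
    i = rng.randrange(len(new))
    pos = rng.randrange(len(new[i]))
    new[i] = new[i][:pos] + rng.choice("ACGT") + new[i][pos + 1:]
    return new if len(set(new)) == len(new) else list(targets)


def anneal(database, targets, iterations_per_epoch=100, decay=0.95,
           t_initial=1.0, seed=0):
    """Return the best experimental definition found and its energy."""
    rng = random.Random(seed)
    t_min = 1.0 / len(database)              # minimum energy and temperature
    current, current_e = list(targets), energy(database, targets)
    best, best_e = current, current_e
    temperature = t_initial
    while temperature > t_min and best_e > t_min:
        for _ in range(iterations_per_epoch):
            candidate = mutate(current, rng)
            candidate_e = energy(database, candidate)
            delta = candidate_e - current_e
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_e = candidate, candidate_e
                if current_e < best_e:
                    best, best_e = current, current_e
        temperature *= decay
    return best, best_e


# Hypothetical six-sequence database and a deliberately poor starting definition.
database = {
    "ACC001": "TTACGGATT", "ACC002": "GGATTCAA", "ACC003": "CCACGAAGGA",
    "ACC004": "AAAAAAGT", "ACC005": "CGCGTTAA", "ACC006": "GTGTCACA",
}
start = ["AAAA", "AAAC"]
best, best_e = anneal(database, start)
```

To compare different Np, the same routine can be run with starting definitions of different lengths and the resulting numbers of occupied hash codes compared.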
5.6.3. CC QUANTITATIVE EMBODIMENT
To make use of quantitative detection information
the pattern of simulated hash codes stored in the code table
is augmented with additional information. For each hash code
in the table and each sequence giving rise to that hash code,
this additional information comprises recording the number of
times each target subsequence is found in such a sequence.
These numbers are simply determined by scanning the entire
sequence and counting the number of occurrences of each
target subsequence.
An exemplary method to perform hash code look up in
this augmented table is to first find the sequences giving
rise to a particular hash code as a binary number, and second
to pick from these the most likely sequence as that sequence
having the most similar pattern of subsequence counts to the
detected quantitative hybridization signal. An exemplary
method to determine such similarity is to linearly normalize
the detected signal so that the smallest hybridization signal
is 1.0 and then to find the closest sequence by using a
Euclidean metric in an n-dimensional code space.
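The quantitative look-up can be sketched as follows. Two points are assumptions made for the example: occurrences are counted with overlaps allowed, and the normalization divides by the smallest non-zero signal (probes with no signal are left at zero). Function names and data are hypothetical.

```python
import math


def count_occurrences(target, sequence):
    """Count occurrences of target in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(target)
    while start != -1:
        count += 1
        start = sequence.find(target, start + 1)
    return count


def closest_sequence(signal, candidates):
    """Pick the candidate whose subsequence counts best match the detected signal.

    signal: detected hybridization intensities, one per target subsequence
    candidates: mapping from accession number to its list of subsequence counts,
                restricted to sequences sharing the observed binary hash code
    """
    positive = [s for s in signal if s > 0]
    scale = min(positive) if positive else 1.0
    normalized = [s / scale for s in signal]      # smallest non-zero signal -> 1.0
    return min(candidates, key=lambda acc: math.dist(normalized, candidates[acc]))


# Two candidates share the hash code "11" but differ in subsequence counts.
candidates = {"ACC001": [1, 2], "ACC003": [3, 1]}
best_match = closest_sequence([410.0, 790.0], candidates)
```

Here the normalized signal is roughly (1.0, 1.93), which lies much closer to the count pattern (1, 2) than to (3, 1).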
For CC experimental design, each pattern of
subsequence counts may alternatively be considered as a
distinct code entry for evaluation of an information measure.
This is instead of considering each hash code alone as a
distinct entry.
APPARATUS FOR PERFORMING THE METHODS OF THE INVENTION
The apparatus of this invention includes means for
performing the recognition reactions of this invention in a
preferably automated fashion, for example by the protocols of
§ 6.4.3, and means for performing the computer implemented
experimental analysis and design methods of this invention.
Although the subsequent discussion is directed to embodiments
of apparatus for QEA™ embodiments of this invention, similar
apparatus is adaptable to the CC embodiments. Such adaption
includes using, in place of the corresponding components for
QEA™ embodiments, automatic laboratory instruments
appropriate for making and hybridizing arrays of clones and
for reading the results of the hybridizations, and using
programs implementing the computer analysis and design
methods for the CC embodiments described in Sec. 5.6.
Fig. 12A illustrates an exemplary apparatus for
QEA™ embodiments of this invention, and with the described
adaption, also for the CC embodiments of this invention.
Computer 1601 can be, alternatively, a UNIX based work
station type computer, an MS-DOS or Windows based personal
computer, a Macintosh personal computer, or another
equivalent computer. In a preferred embodiment, computer
1601 is a PowerPC™ based Macintosh computer with software
systems capable of running both Macintosh and MS-DOS/Windows
programs.
Fig. 12B illustrates the general software structure
in RAM memory 1650 of computer 1601 in a preferred
embodiment. At the lowest software level is Macintosh
operating system 1655. This system contains features 1656
and 1657 for permitting execution of UNIX programs and MS-DOS
or Windows programs alongside Macintosh programs in computer
1601. At the next higher software level are the preferred
languages in which the computer methods of this invention are
implemented. LabView 1658, from National Instruments
(Dallas, TX), is preferred for implementing control routines
1661 for the laboratory instruments, exemplified by 1651 and
1652, which perform the recognition reactions and fragment
separation and detection. C or C++ languages 1659 are
preferred for implementing experimental routines 1662, which
are described in Sec. 5.4 and 5.6. Less preferred, but
useful for rapid prototyping, are various scripting languages
known in the art. PowerBuilder 1660, from Sybase (Denver,
CO), is preferred for implementing the user interfaces to the
computer implemented routines and methods. Finally, at the
highest software level are the programs implementing the
described computer methods. These programs are divided into
instrument control routines 1661 and experimental analysis
and design routines 1662. Control routines 1661 interact
with laboratory instruments, exemplified by 1651 and 1652,
which physically perform QEA™ and CC protocols. Experimental
routines 1662 interact with storage devices, exemplified by
devices 1654 and 1653, which store DNA sequence databases and
experimental results.
Returning to Fig. 12A, although only one processor
is illustrated, alternatively, the computer methods and
instrument control interface can be performed on a
multiprocessor or on several separate but linked processors,
such that instrument control methods 1661, computational
experimental methods 1662, and the graphical interface
methods can be on different processors in any combination or
sub-combination.
Input/output devices include color display device
1602 controlled by a keyboard and standard mouse 1603 for
output display of instrument control information and
experimental results and input of user requests and commands.
Input and output data are preferably stored on disk devices
such as 1604, 1605, 1624, and 1625 connected to computer 1601
through links 1606. The data can be stored on any
combination of disk devices as is convenient. Thereby, links
1606 can be either local attachments, whereby all the disks
can be in the computer cabinet(s), LAN attachments, whereby
the data can be on other local server computers, or remote
links, whereby the data can be on distant servers.
Instruments 1630 and 1631 exemplify laboratory
devices for performing, in a partly or wholly automatic
manner, QEA™ recognition reactions. These instruments can
be, for example, automatic thermal cyclers, laboratory
robots, and controllable separation and detection apparatus,
such as is found in the applicants' copending U.S. Patent
Application 08/4i8,231 filed May 9, 1995. Links 1632
exemplify control and data links between computer 1601 and
controlled devices 1630 and 1631. They can be special buses,
standard LANs, or any suitable link known in the art. These
links can alternatively be computer readable medium or even
manual input exchanged between the instruments and computer
1601. Outline arrows 1634 and 1635 exemplify the physical
flow of samples through the apparatus for performing
experiments 1607 and 1613. Sample flow can be either
automatic, manual, or any combination as appropriate. In
alternative embodiments there may be fewer or more laboratory
devices, as dictated by the current state of the laboratory
automation art.
On this complete apparatus, a QEA™ experiment is
designed, performed, and analyzed, preferably in a manner as
automatic as possible. First, a QEA™ experiment is designed,
according to the methods specified in Sec. 5.4.2 as
implemented by experimental routines 1662 on computer 1601.
Input to the design routines are databases of DNA sequences,
which are typically representative selected database 1605
obtained by selection from input comprehensive sequence
database 1604, as described in Sec. 5.4.1. Alternatively,
comprehensive DNA databases 1604 can be used as input.
Database 1604 can be local to or remote from computer 1601.
Database selection performed by processor 1601 executing the
described methods generates one or more representative
selected databases 1605. Output from the experimental design
methods are tables, exemplified by 1609 and 1615, which, for
a QEA™ RE embodiment, specify the recognition reaction and
the REs used for each recognition reaction.
Second, the apparatus performs the designed
experiment. Exemplary experiment 1607 is defined by tissue
sample 1608, which may be normal or diseased, experimental
definition 1609, and physical recognition reactions 1610 as
defined by 1609. Where instrument 1630 is a laboratory robot
for automating reactions, computer 1601 commands and controls
robot 1630 to perform reactions 1610 on cDNA samples prepared
from tissue 1608. Where instrument 1631 is a separation and
detection instrument, the results of these reactions are then
transferred, automatically or manually, to 1631 for
separation and detection. Computer 1601 commands and
controls performance of the separation and receives detection
information. The detection information is input to computer
1601 over links 1632 and is stored on storage device 1624,
along with the experimental design tables and information on
the tissue sample source for processing. Since this
experiment uses, for example, fluorescent labels, detection
results are stored as fluorescent traces 1611.
Experiment 1613 is processed similarly along sample
pathway 1633, with robot 1630 performing recognition
reactions 1616 on cDNA from tissue 1608 as defined by
definition 1615, and device 1631 performing fragment
separation and detection. Fragment detection data is input
by computer 1601 and stored on storage device 1625. In this
case, for example, silver staining is used, and detection
data is image 1617 of the stained bands.
During experimental performance, instrument control
routines 1661 provide the detailed control signals needed by
instruments 1630 and 1631. These routines also allow
operator monitoring and control by displaying the progress of
the experiment in process, instrument status, instrument
exceptions or malfunctions, and such other data that can be
of use to a laboratory operator.
Third, interactive experimental analysis is
performed using the database of simulated signals generated
by analysis and design routines 1662 as described in Sec.
5.4.2 and 5.4.3. Simulated database 1612 for experiment 1607
is generated by the analysis methods executing on processor
1601 using as input the appropriate selected database 1605
and experimental definition 1609, and is output in table
1612. Similarly table 1618 is the corresponding simulated
database of signals for experiment 1613, and is generated
from appropriate selected database 1605 and experimental
definition 1615. A signal is made unambiguous by
experimental routines 1662 that implement the methods
described in Sec. 5.4.3.
Display device 1602 presents an exemplary user
interface for the data generated by the methods of this
invention. This user interface is programmed preferably by
using the PowerBuilder display front end. At 1620 are
selection buttons which can be used to select the particular
experiment and the particular reaction of the experiment
whose results are to be displayed. Once the experiment is
selected, histological images of the tissue source of the
sample are presented for selection and display in window
1621. These images are typically observed, digitized, and
stored on computer 1601 as part of sample preparation. The
results of the selected reaction of the selected experiment
are displayed in window 1622. Here, a fluorescent trace
output of a particular labeling is made available. Window
1622 is indexed by marks 1626 representing the possible
locations of DNA fragments of successive integer lengths.
Window 1623 displays contents from simulated
database 1612. Using, for example, mouse 1603, a particular
fragment length index 1626 is selected. The processor then
retrieves from the simulated database the list of accession
numbers that could generate a peak of that length with the
displayed end labeling. This window can also contain further
information about these sequences, such as gene name,
bibliographic data, etc. This further information may be
available in selected databases 1605 or may require queries
to the complete sequence database 1604 based on the accession
numbers. In this manner, a user can interactively inquire
into the possible sequences causing particular results and
can then scan to other reactions of the experiment by using
buttons 1620 to seek other evidence of the presence of these
sequences.
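The lookup performed for window 1623 can be sketched as follows. This is only an illustrative sketch: the row layout, the reaction names, the label names, and the accession numbers are hypothetical and are not taken from the simulated database described above.

```python
from collections import defaultdict

# Hypothetical simulated-database rows:
# (reaction, end label, fragment length in bp, accession number)
SIMULATED_ROWS = [
    ("R1", "FAM", 212, "ACC0001"),
    ("R1", "FAM", 212, "ACC0002"),
    ("R1", "FAM", 305, "ACC0003"),
    ("R2", "TAMRA", 212, "ACC0001"),
]


def build_index(rows):
    """Index accession numbers by (reaction, end label, fragment length)."""
    index = defaultdict(list)
    for reaction, label, length, accession in rows:
        index[(reaction, label, length)].append(accession)
    return index


def candidates(index, reaction, label, length):
    """Accession numbers that could generate a peak of this length."""
    return sorted(index.get((reaction, label, length), []))


INDEX = build_index(SIMULATED_ROWS)
print(candidates(INDEX, "R1", "FAM", 212))
```

Selecting a fragment length index then amounts to one dictionary lookup; the further information (gene name, bibliographic data) would be fetched separately by accession number.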
It is apparent that this interactive interface has
further alternative embodiments specialized for classes of
users of differing interests and goals. For a user
interested in determining tissue gene expression, in one
alternative, a particular accession number is selected from
window 1623 with mouse 1603, and processor 1601 scans the
simulated database for all other fragment lengths and their
recognition reactions that could be produced by this
accession number. In a further window, these lengths and
reactions are displayed, and the user is allowed to select
further reactions for display in order to confirm or refute
the presence of this accession number in the tissue sample.
If one of these other fragments is generated uniquely by
this sequence (a "good sequence", see supra), that fragment
can be highlighted as of particular interest. By displaying
the results of the generating reaction of that unique
fragment, a user can quickly and unambiguously determine
whether or not that particular accession number is actually
present in the sample.
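The reverse scan, from an accession number to all of its fragments with the uniquely generated ones flagged, can be sketched as below. The rows and accession numbers are hypothetical, and treating a fragment as unique when no other row shares its reaction and length is an assumption made for this illustration.

```python
from collections import Counter

# Hypothetical simulated-database rows: (reaction, fragment length, accession)
ROWS = [
    ("R1", 212, "ACC0001"),
    ("R1", 212, "ACC0002"),
    ("R2", 148, "ACC0001"),
    ("R2", 305, "ACC0003"),
]


def fragments_for(rows, accession):
    """List (reaction, length, is_unique) for every fragment of one accession.

    A fragment is flagged unique when no other row in the simulated
    database has the same reaction and the same length.
    """
    counts = Counter((reaction, length) for reaction, length, _ in rows)
    return [
        (reaction, length, counts[(reaction, length)] == 1)
        for reaction, length, acc in rows
        if acc == accession
    ]


print(fragments_for(ROWS, "ACC0001"))
```

In this toy data the 148 bp fragment of reaction R2 would be the one to highlight, since only one sequence can produce it.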
In another interface alternative, the system
displays two experiments side by side, displaying two
histological images 1621 and two experimental results 1622.
This allows the user to determine by inspection signals
present in one sample and not present in the other. If the
two samples were diseased and normal specimens of the same
tissue, such signals would be of considerable interest as
perhaps reflecting differences due to the pathological
process. Having a signal of interest, preferably repeatable
and reproducible, a user can then determine the likely
accession numbers causing it by invoking the previously
described interface facilities. In a further elaboration of
this embodiment, system 1601 can aid the determination of
signals of interest by automating the visual comparison by
performing statistical analysis of signals from samples of
the same tissue in different states. First, signals
reproducibly present in tissue samples in the same state are
determined, and second, differences in these reproducible
signals across samples from the several states are compared.
Display 1602 then shows which reproducible signals vary
across the states, thereby guiding the user in the selection
of signals of interest.
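The two-step comparison just described can be sketched in a deliberately simplified form. The sketch assumes each sample is reduced to a set of (reaction, fragment length) signals and replaces the statistical analysis with a plain presence test across replicates; the state names and signal values are hypothetical.

```python
def reproducible(replicates):
    """Signals present in every replicate sample of one tissue state."""
    sets = [set(r) for r in replicates]
    return set.intersection(*sets) if sets else set()


def differing(states):
    """Map each reproducible signal that varies across states to the
    states in which it is reproducibly present."""
    per_state = {name: reproducible(reps) for name, reps in states.items()}
    union = set().union(*per_state.values())
    common = set.intersection(*per_state.values()) if per_state else set()
    return {
        signal: sorted(n for n, s in per_state.items() if signal in s)
        for signal in sorted(union - common)
    }


# Hypothetical signals: (reaction, fragment length)
STATES = {
    "normal": [
        {("R1", 212), ("R1", 305)},
        {("R1", 212), ("R1", 305), ("R1", 400)},
    ],
    "diseased": [
        {("R1", 212), ("R1", 150)},
        {("R1", 212), ("R1", 150)},
    ],
}
print(differing(STATES))
```

A signal seen in only one replicate (here the 400 bp peak) is dropped in the first step, so only reproducible differences reach the display.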
The apparatus of this invention has been described
above in an embodiment adapted to a single site
implementation, where the various devices are substantially
local to computer 1601 of Fig. 2A, although the various
links shown could also represent remote attachments. An
alternative, explicitly distributed embodiment of this
apparatus is illustrated in Fig. 12C. Shown here are
laboratory instruments 1670, DNA sequence database systems
1684, and computer systems 1671 and 1673, all of which
cooperate to perform the methods of this invention as
described above.
These systems are interconnected by communication
medium 1674 and its local attachments 1675, 1676, and 1677 to
the various systems. This medium may be any dedicated or
shared or local or remote communication medium known in the
art. For example, it can be a "campus" LAN network extending
perhaps a few kilometers, a dedicated wide area communication
system, or a shared network, such as the Internet. The
system local attachments are adapted to the nature of medium
1674.
Laboratory instruments 1670 are commanded by
computer system 1671 to perform the automatable steps of the
recognition reactions, separation of the reaction results,
and detection and transmission of resulting signals through
link 1672. Link 1672 can be any local or remote link known
in the art that is adapted to instrument control, and may
even be routed through communication medium 1674.
DNA sequence database systems 1684 with various
sequence databases 1685 may be remote from the other systems,
for example, by being directly accessed at their sites of
origin, such as GenBank at Bethesda, MD. Alternatively,
parts or all of these databases may be periodically
downloaded for local access by computer systems 1671 and 1673
onto such storage devices as discs or CD-ROMs.
Computer system 1671, including computer 1681,
storage 1682, and display 1683, can perform various methods
of this invention. For example, it can perform solely the
control routine for control and monitoring of instrument
system 1670, whereby experimental design and analysis are
performed elsewhere, as at computer system 1673. In this
case, system 1671 would typically be operated by
laboratory technicians. Alternatively, system 1671 can also
perform experimental designs, which meet the requirements of
remote users of sample analysis information. In another
embodiment, system 1671 can carry out all the computer
implemented methods of this invention, including final data
display, in which case it would be operated by the final
users of the analysis information.
Computer system 1673, including computer 1678,
storage 1679, and display 1680, can perform a corresponding
range of functions. However, typically system 1673 is
remotely located and would be used by final users of the DNA
sample information. Such users can include clinicians
seeking information to make a diagnosis, grade or stage a
disease, or guide therapy. Other users can include
pharmacologists seeking information useful for the design or
improvement of drugs. Finally, other users can include
researchers seeking information useful to basic studies in
cell biology, developmental biology, etc. It is also
possible that a plurality of computer systems 1673 can be
linked to laboratory system 1670 and control system 1671 in
order to provide for the analysis needs of a plurality of
classes of users by designing and causing the performance of
appropriate experiments.
It will be readily apparent to those of skill in
the computer arts that alternative distributed embodiments of
the apparatus of this invention, along with alternative
functional allocations of the computer implemented methods to
the various distributed systems, are equally possible.
All the computer implemented methods of this
invention can be recorded for storage and transport on any
computer readable memory devices known in the art. For
example, these include, but are not limited to, semiconductor
memories - such as ROMs, PROMs, EPROMs, EEPROMs, etc. of
whatever technology or configuration - magnetic memories -
such as tapes, cards, disks, etc. of whatever density or size
- optical memories - such as optical read-only memories,
CD-ROM, or optical writeable memories - and any other computer
readable memory technologies.
Also, although this apparatus has been described
primarily with reference to QEA™ analysis of human tissue
samples, the laboratory instruments and associated control,
design, and analysis computer systems are not so limited.
They are also adaptable to performing the CC embodiment of
this invention and to the analysis of other samples, such as
from animals or in vitro cultures.
The invention is further described in the following
examples which are in no way intended to limit the scope of
the invention.
6. EXAMPLES
6.1. SUBSEQUENCE HIT AND LENGTH INFORMATION
This example illustrates QEA™ signals generated by
a PCR embodiment. From the October 1994 GenBank database,
12,000 human first continuous coding domain sequences ("CDS")
were selected. This selection resulted in a selected
database of sequences with a bias toward shorter genes, the
average length of the selected CDSs being 1000 bp instead of
the typical coding sequence length of 1800-2000 bp, and with
no guarantee that sequences were not repeated in the
selected database. From this database, tables containing the
probability of occurrence of all 4 to 6-mer sequences were
constructed.
Then Eqns. 1 and 2 were solved for N = 12,000 and L
= 1,000 resulting in p = 0.17 and M ≈ 108. Five 6-mer target
subsequences with this probability of occurrence were chosen
from the 6-mer tables and grouped into four pairs:
CAGATA-TCTCAC, CAGATA-GGTCTG, CAGATA-GCTCAA, CAGATA-~ACACC.
Analyses comprising mock digestions (see Sec. 5.4.1) of the
selected database of CDSs were then performed for these four
pairs of target subsequences.
The histogram of Fig. 1 presents the results of
these analyses. Along axis 102 is the length of fragments,
as would be observed in a gel separation of the amplified
fragments of a QEA™ reaction recognizing these target
subsequences. Along axis 101 is the number of fragments at a
given length. For example, spike 103 at a length of
approximately 800 base pairs represents three fragments of
the same length. Multiple fragments at one length may occur
either because several CDSs have one target subsequence pair
spaced this length, because one CDS has several target
subsequence pairs spaced this length, because of redundancy
in the selected CDSs, or because signals of this length were
generated by more than one pair of target subsequences.
Spike 104 at a slightly longer length represents a single
fragment. This fragment is generated from a unique sequence
and provides a unique indication of its presence in a cDNA
mixture, that is, this is a good sequence.
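The mock digestion and length histogram of this example can be illustrated with a minimal sketch. It is not the procedure of Sec. 5.4.1: it assumes, for simplicity, that both target subsequences are read on the same strand, that each occurrence of the first subsequence pairs with the nearest downstream occurrence of the second, and that fragment length runs from the start of the first to the end of the second. The function names and the toy sequences are hypothetical.

```python
from collections import Counter


def find_all(seq, sub):
    """Start positions of every (possibly overlapping) occurrence of sub."""
    hits, pos = [], seq.find(sub)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(sub, pos + 1)
    return hits


def mock_digest(seq, first, second):
    """Fragment lengths for one target subsequence pair in one sequence."""
    lengths = []
    for start in find_all(seq, first):
        end = seq.find(second, start + len(first))
        if end != -1:
            lengths.append(end + len(second) - start)
    return lengths


def length_histogram(database, pairs):
    """Count fragments per length over all sequences and all pairs."""
    hist = Counter()
    for seq in database.values():
        for first, second in pairs:
            hist.update(mock_digest(seq, first, second))
    return hist


# Toy database with hypothetical identifiers
DB = {"cds1": "AACAGATAGGGGTCTCACTT", "cds2": "AACAGATAGGGGTCTCACTT"}
print(length_histogram(DB, [("CAGATA", "TCTCAC")]))
```

A length whose count is 1 in such a histogram corresponds to a spike like 104, while a count above 1 corresponds to a spike like 103.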
6.2. RESTRICTION ENDONUCLEASES
Tables 1-4 list all palindromic 4-mer and 6-mer
potential RE recognition sequences. RE enzymes recognizing
each site, where known, are also listed, along with an
exemplary commercial supplier. Over 85% of possible
sequences spanning a wide range of occurrence probabilities
have a known RE recognizing and cutting within the sequence.
The frequency of these sequences was determined, as
in example 6.1, in 12,000 human first continuous coding
domain sequences selected from the October 1994 GenBank
database. The tables are sorted in order of increasing
recognition occurrence probability. The bar in the
recognition sequence indicates the site in the recognition
sequence where the RE cuts.
The following vendor abbreviations are used: New England
Biolabs (Beverly, MA) ("NEB"), Stratagene (La Jolla, CA),
Boehringer Mannheim (Indianapolis, IN) ("BM"), and Gibco BRL
division of Life Technologies (Gaithersburg, MD) ("BRL").
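How such tables can be assembled is sketched below. The sketch rests on two labeled assumptions: that "CDS Frequency" means the fraction of selected sequences containing the site at least once, and that the overhang length follows from the cut position as |n - 2c| for a site of n bases cut after c bases on each strand. The helper names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_sites(k):
    """All k-mers (k even) that equal their own reverse complement."""
    halves = ("".join(p) for p in product("ACGT", repeat=k // 2))
    return sorted(h + reverse_complement(h) for h in halves)


def cds_frequency(site, sequences):
    """Fraction of sequences containing the site (assumed definition)."""
    return sum(site in s for s in sequences) / len(sequences)


def overhang(site_with_bar):
    """Overhang length implied by a cut bar, e.g. 'G|AATTC' gives 4."""
    cut = site_with_bar.index("|")
    site_length = len(site_with_bar) - 1
    return abs(site_length - 2 * cut)


print(len(palindromic_sites(4)), len(palindromic_sites(6)))
print(overhang("G|AATTC"), overhang("TG|CA"))
```

There are 4^2 = 16 palindromic 4-mers and 4^3 = 64 palindromic 6-mers, which matches the 16 rows of Table 1 and the 20 + 20 + 24 rows of Tables 2-4.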
TABLE 1: THE 4-MER RESTRICTION SITES
Recognition   CDS         RE         Overhang   Vendor
Sequence      Frequency
C|GCG         0.36        SelI       2
C|TAG         0.44        MaeI       2          NEB
T|TAA         0.45        MseI       2          NEB
TATA          0.45
GCG|C         0.50        HhaI       2          NEB
ATAT          0.50
A|CGT         0.52        MaeII      2          BM
T|CGA         0.53        TaqI       2          NEB
|AATT         0.53        Tsp509I    4          NEB
C|CGG         0.61        MspI       2          NEB
G|TAC         0.64        Csp6I                 NEB
|GATC         0.67        Sau3AI     4          NEB
CATG|         0.68        NlaIII     4          NEB
TG|CA         0.78        CviRI      0
AG|CT         0.78        AluI       0          NEB
GG|CC         0.79        HaeIII     0          NEB
TABLE 2: THE FIRST 20 6-MER RESTRICTION SITES
Sequence   CDS Frequency   RE          Overhang   Vendor
TCG|CGA    0.01            NruI        0          NEB
TAC|GTA    0.02            SnaBI       0          NEB
C|GTACG    0.02            BsiWI       4          NEB
CGAT|CG    0.02            PvuI        2          NEB
A|CGCGT    0.03            MluI        4          NEB
A|CTAGT    0.03            SpeI        4          NEB
G|TCGAC    0.04            SalI        4          NEB
AA|CGTT    0.04            Psp1406I    2          NEB
A|CCGGT    0.04            AgeI        4          NEB
G|CTAGC    0.04            NheI        4          NEB
TATATA     0.04
GTT|AAC    0.05            HpaI        0          NEB
TAGCTA     0.05
TAATTA     0.05
GTA|TAC    0.05            Bst1107I    0          NEB
CTATAG     0.05
CGCGCG     0.05
C|CTAGG    0.06            AvrII       4          NEB
TT|CGAA    0.06            SfuI        2          BM
AT|CGAT    0.06            ClaI        2          NEB
TABLE 3: THE MIDDLE 20 6-MER RESTRICTION SITES
Sequence   CDS Frequency   RE          Overhang   Vendor
C|TTAAG    0.06            AflII       4          NEB
T|CTAGA    0.06            XbaI        4          NEB
ATATAT     0.07
AT|TAAT    0.07            VspI        2          BRL
G|CGCGC    0.08            BssHII      4          NEB
C|AATTG    0.08            MunI        4          NEB
GACGT|C    0.08            AatII       4          NEB
TTATAA     0.09
TGC|GCA    0.10            FspI        0          NEB
C|TCGAG    0.10            XhoI        4          NEB
GAT|ATC    0.10            EcoRV       0          NEB
CA|TATG    0.10            NdeI        2          NEB
ATGCA|T    0.10            NsiI        4          NEB
AGC|GCT    0.11            Eco47III    0          NEB
AAT|ATT    0.11            SspI        0          NEB
T|CCGGA    0.11            AccIII      4          Stratagene
TTT|AAA    0.12            DraI        0          NEB
A|CATGT    0.12            BspLU11I    4
CAC|GTG    0.12            Eco72I      0          Stratagene
CCGC|GG    0.12            SacII       2          NEB
TABLE 4: THE LAST 24 6-MER RESTRICTION SITES
Sequence   CDS Frequency   RE          Overhang   Vendor
GCATG|C    0.13            SphI        4          NEB
TTGCAA     0.13
A|AGCTT    0.13            HindIII     4          NEB
G|TGCAC    0.13            ApaLI       4          NEB
AAATTT     0.14
AGT|ACT    0.15            ScaI        0          NEB
G|AATTC    0.15            EcoRI       4          NEB
GGTAC|C    0.15            KpnI        4          NEB
T|GTACA    0.15            Bsp1407I    4          NEB
C|GGCCG    0.15            EagI        4          NEB
G|CCGGC    0.16            NgoMI       4          NEB
GGC|GCC    0.16            NarI        0          NEB
T|GATCA    0.16            BclI        4          NEB
T|CATGA    0.17            BspHI       4          NEB
C|CCGGG    0.19            SmaI        4          NEB
G|GATCC    0.19            BamHI       4          NEB
A|GATCT    0.20            BglII       4          NEB
AGG|CCT    0.22            StuI        0          NEB
GGGCC|C    0.24            ApaI        4          NEB
C|CATGG    0.24            NcoI        4          NEB
GAGCT|C    0.25            SacI        4          NEB
TGG|CCA    0.33            MscI        0          NEB
CAG|CTG    0.42            PvuII       0          NEB
CTGCA|G    0.43            PstI        4          NEB
6.3. RNA EXTRACTION AND cDNA SYNTHESIS
These protocols describe preferred methods for
extraction of RNA from tissue samples and for synthesis of
de-phosphorylated cDNA from the extracted RNA.
6.3.1. RNA EXTRACTION
In general, RNA extraction is done using Triazol
reagent from Life Technologies (Gaithersburg, MD) following
the protocol of Chomczynski et al., 1987, Anal. Biochem.
162:156-59 and Chomczynski et al., 1993, Biotechniques
15:532-34,535-37. Total RNA is first extracted from tissues,
treated with RNase-free DNase I from Pharmacia Biotech
(Uppsala, Sweden) to remove contaminating genomic DNA,
followed by messenger RNA purification using oligo(dT)
magnetic beads from Dynal Corporation (Oslo, Norway), and
then used for cDNA synthesis.
If desired, total cellular RNA can be separated
into sub-pools prior to cDNA synthesis. For example, a
sub-pool of endoplasmic reticulum associated RNA is enriched
for RNA producing proteins having an extra-cellular or
receptor function.
In more detail, the following protocol is preferred
for RNA extraction from tissue samples.
Tissue Homogenization and Total RNA Extraction:
A voxel is used to describe the specific piece of
tissue to be analyzed. Most frequently it will refer to grid
punches corresponding to pathologically characterized tissue
sections.
1. It is important that tissue voxels be quick frozen in
liquid nitrogen immediately after dissection, and stored at
-70°C until processed.
2. The weight of the frozen tissue voxel is measured and
recorded.
3. Tissue voxels are pulverized and ground in liquid
nitrogen, either with a porcelain mortar and pestle, or by
stainless steel pulverizers, or alternative means. This
tissue is ground to a fine powder and is kept on liquid
nitrogen.
4. The tissue powder is transferred to a tube containing
Triazol reagent (Life Technologies, Gaithersburg, MD) with 1
ml of reagent per 100 mg of tissue and is dispersed in the
Triazol using a Polytron homogenizer from Brinkmann
Instruments (Westbury, NY). For small tissue voxels less
than 100 mg, a minimum of 1 ml of Triazol reagent should be
used for efficient homogenization.
5. Add 0.1 volumes BCP (1-bromo-3-chloropropane) (Molecular
Research, Cincinnati, OH) and mix by vortexing for 30
seconds. Let the mixture stand at room temperature for 15
minutes.
6. Centrifuge for 15 minutes at 4°C at 12,000 x g.
7. Remove the aqueous phase to a fresh tube and add 0.5
volumes isopropanol per original amount of Triazol reagent
used and mix by vortexing for 30 seconds. Let the mixture
stand at room temperature for 10 minutes.
8. Centrifuge at room temperature for 10 minutes at 12,000 x
g.
9. Wash with 70% ethanol and centrifuge at room temperature
for 5 minutes at 12,000 x g.
10. Remove the supernatant and let the centrifuge tube stand
to dry in an inverted position.
11. Resuspend the RNA pellet in water (1 µl per mg of
original tissue weight) and heat to 55°C until completely
dissolved.
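The reagent volumes implied by steps 4, 5, 7, and 11 can be worked out from the tissue weight, as in the following sketch. It assumes that the "0.1 volumes" of BCP in step 5 is taken relative to the Triazol volume, as the isopropanol of step 7 explicitly is; the function and key names are hypothetical.

```python
def extraction_volumes(tissue_mg):
    """Reagent volumes for one tissue voxel of the given weight in mg."""
    # Step 4: 1 ml Triazol per 100 mg tissue, never less than 1 ml
    triazol_ml = max(1.0, tissue_mg / 100.0)
    return {
        "triazol_ml": triazol_ml,
        # Step 5: 0.1 volumes BCP (assumed relative to the Triazol volume)
        "bcp_ml": 0.1 * triazol_ml,
        # Step 7: 0.5 volumes isopropanol per original Triazol volume
        "isopropanol_ml": 0.5 * triazol_ml,
        # Step 11: 1 microliter of water per mg of original tissue
        "resuspension_water_ul": 1.0 * tissue_mg,
    }


print(extraction_volumes(250))
print(extraction_volumes(40))
```

For a 40 mg voxel the 1 ml minimum applies, so the BCP and isopropanol volumes follow from 1 ml rather than from 0.4 ml.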
DNase treatment:
1. Add 0.2 volume of 5X reverse transcriptase buffer (Life
Technologies, Gaithersburg, MD), 0.1 volumes of 0.1 M DTT,
and 5 units RNAguard per 100 mg starting tissue from
Pharmacia Biotech (Uppsala, Sweden).
2. Add 1 unit RNase-free DNase I, Pharmacia Biotech, per
100 mg starting tissue. Incubate at 37°C for 20 minutes.
The following additional steps are optional,
Opt 1. Repeat RNA extraction by adding 10 volumes of
Triazol reagent.
Opt 2. Repeat steps 5 through 11.
3. Quantify the total RNA (from the RNA concentration
obtained by measuring OD260 of a 100 fold dilution). Store at
-20°C.
Isolation of Poly A+ Messenger RNA:
Poly-adenylated mRNA is isolated from total RNA
preparations using magnetic bead mediated oligo-dT detection.
Kits that can be used include Dynabeads mRNA Direct Kit from
Dynal (Oslo, Norway) or MPG Direct mRNA Purification Kit from
CPG (Lincoln Park, NJ). Protocols are used as directed by
the manufacturer.
Less preferably, the following procedure can be
used. The Dynal oligo(dT) magnetic beads have a capacity of
1 µg poly(A+) per 100 µg of beads (1 mg/ml concentration),
assuming 2% of the total RNA has poly(A+) tails.
1. Add 5 volumes of Lysis/Binding buffer (Dynal) and
sufficient beads to bind the estimated poly(A+) RNA.
2. Incubate at 65°C for 2 minutes, then at room temperature
for 5 minutes.
3. Wash beads with 1 ml Washing buffer/LiDS (Dynal).
4. Wash beads with 1 ml Washing buffer (Dynal) 2 times.
5. Elute poly(A+) RNA with 1 µl water/µg beads 2 times.
For both methods, the poly-adenylated RNA is
harvested in a small volume of water, quantified as above,
and stored at -20°C. Typical yields of poly-adenylated RNA
range from 1% to 4% of the input total RNA.
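The quantity of beads that is "sufficient" in step 1 follows from the stated capacity and the 2% assumption. The sketch below is an illustrative calculation only, with hypothetical function names; it takes the capacity as 1 µg poly(A+) RNA per 100 µg of beads and the bead suspension as 1 mg/ml (that is, 1 µg/µl).

```python
def beads_needed_ug(total_rna_ug, polya_fraction=0.02,
                    capacity_ug_per_100ug_beads=1.0):
    """Micrograms of beads needed to bind the estimated poly(A+) RNA."""
    polya_ug = total_rna_ug * polya_fraction
    return polya_ug / capacity_ug_per_100ug_beads * 100.0


def bead_suspension_ul(beads_ug, beads_ug_per_ul=1.0):
    """Volume of bead suspension at 1 mg/ml (1 microgram per microliter)."""
    return beads_ug / beads_ug_per_ul


beads = beads_needed_ug(100.0)           # 100 micrograms of total RNA
print(beads, bead_suspension_ul(beads))  # bead mass and suspension volume
```

With the typical yield range of 1% to 4%, the same calculation run with `polya_fraction=0.04` gives an upper bound on the beads required.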
6.3.2. cDNA SYNTHESIS
This protocol for the synthesis of de-phosphorylated
cDNA from poly(A)+ RNA is preferred when the
quantities of input RNA are approximately 1 µg, or at least
200 ng or greater.
Reagents Used:
• Random Hexamers (50 ng/µl)
• 5X First strand buffer (BRL)
• 10 mM dNTP mix
• 100 mM DTT
• SuperScript II reverse transcriptase (BRL) (200 U/µl)
• E. coli DNA ligase (BRL) 10 U/µl
• E. coli DNA polymerase (BRL) 10 U/µl
• T4 DNA polymerase 2.5 U/µl
• E. coli RNaseH (BRL) 3.5 U/µl
• Arctic Shrimp Alkaline Phosphatase (SAP; USB), and
10X SAP buffer (USB)
• 5X Second strand buffer (BRL)
• 3 M Na-Acetate
• Phenol:Chloroform (phenol:chloroform:isoamyl
alcohol 25:24:1)
• Chloroform:isoamyl alcohol (24:1)
• Absolute and 75% ethanol
• 20 µg/µl glycogen (BM)
cDNA Synthesis Protocol:
1. Mix 0.25-1.0 µg of poly A+ RNA with 50 ng of random
hexamers in 10 µl of water. Heat the mixture to 70°C
for 10 min. and quick chill in ice-water slurry. Keep
on ice for 1-2 min. Spin in microfuge for 10 secs. to
collect condensate.
2. Prepare first strand reaction mix with 4 µl 5X First
strand buffer, 2 µl 100 mM DTT, 1 µl 10 mM dNTP mix, and
2 µl water. Add this mix to the primer-annealed RNA
from step 1. Place mixture at 37°C for 2 mins. Add 1
µl of SuperScript II (BRL) (following manufacturer's
recommendations). Incubate at 37°C for 1 hr.
3. Place tubes on ice, add 30 µl of 5X Second strand
buffer, 90 µl of cold water, 3 µl of 10 mM dNTP, 1 µl
(10 units) of E. coli DNA ligase, 4 µl (40 units) of E.
coli DNA polymerase, and 1 µl (3.5 units) of E. coli
RNaseH. Incubate for 2 hr. at 16°C.
4. Add 2 µl of T4 DNA polymerase (5 units) and incubate at
16°C for 5 min.
5. Add 20 µl 10X SAP buffer, 25 µl of water, and 5 µl (5
units) of SAP. Incubate at 37°C for 30 min.
6. Extract cDNA with phenol-chloroform, chloroform-isoamyl
alcohol. To the aqueous layer add Na-acetate to 0.3 M,
20 µg glycogen, and 2 vol of ethanol. Incubate at -20°C
for 10 min., spin at 14,000 g for 10 min. Wash pellet
with 75% ethanol. Dissolve pellet in 50 µl TE.
7. Estimate the yield of cDNA using fluorometer.
8. For subsequent QEA™ processing, transfer 7~ ng cDNA to a
separate tube, add TE to make the concentration 600
ng/ml and put that tube in the specified box at -20°C.
For storage, add Na-acetate to 0.3 M and 2 vol of ethanol
to the rest of cDNA and store at -80°C.
Alternative primers known in the art can also be used
for first strand synthesis. Such primers include oligo(dT)
primers, phasing primers, etc.
6.3.3. cDNA SYNTHESIS FOR SMALL QUANTITIES OF RNA
The cDNA synthesis protocol previously described is
based primarily on the method of Gubler and Hoffman (Gubler
et al., 1983, "A simple and very efficient method for
generating cDNA libraries," Gene 25:263-9) and is robust and
well-proven for quantities of RNA in the 1 µg range (200 ng
and up). A more preferred protocol for RNA quantities below
200 ng takes advantage of the 5' CAP structure of RNAs (Edery
et al., 1995, "An efficient strategy to isolate full-length
cDNAs based on an mRNA cap retention procedure (CAPture),"
Mol. Cell Biol. 15:3363-71; Kato et al., 1994, "Construction
of a human full-length cDNA bank," Gene 150:243-50). This
protocol has a number of advantages including:
• broad scalability of RNA input quantities, making
it ideal for biopsies and for other small and variable
sized samples;
• capability of doing a pre-QEA™ amplification of the
cDNA when very small amounts of cDNA are available;
• cDNA synthesis biased toward full-length RNAs;
• capability of introducing specific primer sites at
both ends of the full-length cDNAs;
• option to eliminate the poly(A)+ RNA purification
step and use total RNA.
cDNA Synthesis Protocol
1. The poly(A)+, or total, RNA (10 µg) is dephosphorylated
with bacterial alkaline phosphatase (20 µl rxn; 100 mM
Tris-HCl pH 7.5, 2 mM DTT; (,~.~ U bacterial alkaline
phosphatase, 20 U RNase inhibitor; 37°C for 30 minutes).
2. After phenol extraction and ethanol precipitation, the
RNA is treated with tobacco acid pyrophosphatase (20
µl rxn; ~O mM Na-OAc pH 6.0, 1 mM EDTA, 2 mM DTT; 0.1 U
tobacco acid pyrophosphatase, 20 U RNase inhibitor; 37°C
for 3~ minutes).
3. Phenol extract and ethanol precipitate the decapped RNA.
The following DNA-RNA primer named MA24R (3 nm) is
ligated to the 5-prime end using T4 RNA Ligase (20 µl
rxn; Tris-HCl pH 7.5, 5 mM MgCl2, 0.5 mM ATP, 2 mM DTT,
25% ethylene glycol; 100 U T4 RNA Ligase, 20 U RNase
inhibitor; 20°C for 12 hours):
MA24R: dCdAdGdTdAdGdCdGdAdTdTdGdCdCdGdCdCdGdTdCdAdGdGdTGGA
(SEQ ID NO:??)
4. First strand synthesis is performed identically to steps
1 and 2 of the protocol previously described in Sec.
6.3.2 except that the following biotinylated primers are
used to prime the cDNA:
MBTA: CGGTGGGTTGCCGTAGTAGCGGAT(T)25A
(SEQ ID NO:??)
MBTC: CGGTGGGTTGCCGTAGTAGCGGAT(T)25C
(SEQ ID NO:??)
MBTG: CGGTGGGTTGCCGTAGTAGCGGAT(T)25G
(SEQ ID NO:??)
These reactions can occur in separate tubes or in one
tube. The phasing effect of doing the reaction in
separate tubes has the advantage of dividing the cDNA
into three separate pools. 0.2 µg of each primer is
used in the reaction.
5. Second strand synthesis is performed identically to
steps 3 and 4 of the protocol previously described in
Sec. 6.3.2 except that a DNA-only version of the DNA-RNA
chimera is used to prime synthesis:
MA24: CAGTAGCGATTGCCGCCGTCAGGT
(SEQ ID NO:??)
Because the primers at both 5' ends lack phosphate
groups, dephosphorylation of the resulting cDNA, e.g.,
by shrimp alkaline phosphatase, is no longer necessary.
6. In cases where exceedingly small amounts of cDNA are
synthesized (1-10 ng yields), the sample can be
amplified using the following primer pair:
MA24: CAGTAGCGATTGCCGCCGTCAGGT
(SEQ ID NO:??)
MB24: CGGTGGGTTGCCGTAGTAGCGGAT
(SEQ ID NO:??)
For 1 ng quantities, 500-fold amplification by 8 to 10
PCR cycles (96°C 30 seconds, 57°C 1 minute, 72°C 3
minutes) provides adequate cDNA for comprehensive
analysis.
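Two details of this protocol lend themselves to a short worked sketch: how the three phased first-strand primers are built from the MB24 sequence, and why roughly nine cycles give 500-fold amplification. The sketch assumes a run of 25 T residues in all three phased primers and ideal doubling per cycle (efficiency 1.0); the function names are hypothetical.

```python
import math

MA24 = "CAGTAGCGATTGCCGCCGTCAGGT"
MB24 = "CGGTGGGTTGCCGTAGTAGCGGAT"


def phased_primers(anchor=MB24, t_run=25, phasing_bases="ACG"):
    """Build MBTA, MBTC, MBTG: anchor + (T)n + one non-T phasing base."""
    return {"MBT" + base: anchor + "T" * t_run + base
            for base in phasing_bases}


def cycles_for_fold(fold, efficiency=1.0):
    """Smallest whole number of PCR cycles reaching the requested fold,
    assuming each cycle multiplies the product by (1 + efficiency)."""
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))


primers = phased_primers()
print(len(primers["MBTA"]))   # 24 + 25 + 1 bases
print(cycles_for_fold(500))   # ideal doubling: 2**9 = 512
```

With less than perfect efficiency the required cycle count rises, which is consistent with the stated range of 8 to 10 cycles rather than a single value.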
6.3.4. ALTERNATIVE cDNA SYNTHESIS
cDNA Synthesis
Alternately, cDNA can be synthesized using the
Superscript™ Choice system from Life Technologies, Inc.
(Gaithersburg, MD). If tissue voxels are the source for the
RNA, the polyadenylated RNA is not quantified, and the entire
yield of polyadenylated RNA is concentrated by precipitation
with ethanol. The polyadenylated RNA is resuspended in 10 µl
of water, and 5 to 10 µl are used for cDNA synthesis. The
manufacturer's protocols are followed for RNA amounts of less
than 1 µg, using 100 ng of random hexamers as
primers. If greater than 1 µg of polyadenylated RNA is used,
the manufacturer's protocols are followed, using 50 ng of
random hexamer primers per microgram of polyadenylated RNA.
The resulting volume of the cDNA solution is 150 µl. If the
amount is not quantified, QEA™ test reactions can be run
using 1 µl or 0.1 µl of cDNA solution in order to determine
the appropriate amount of cDNA to use for subsequent QEA™
reactions.
cDNA De-phosphorylation
Where cDNA is synthesized with terminal phosphates,
they are preferably removed before the RE/Ligase reactions.
Terminal phosphate removal from cDNA is illustrated with the
use of Barents sea shrimp alkaline phosphatase ("SAP") (U.S.
Biochemical Corp.) and 2.5 µg of cDNA. Substantially less
(<10 ng) or more (>20 µg) of cDNA can be prepared at a time
with proportionally adjusted amounts of enzymes. Volumes are
maintained to preserve ease of handling. The quantities
necessary are consistent with using the method to analyze
small tissue samples from normal or diseased specimens.
1. Mix the following reagents:
2.5 µl 200 mM Tris-HCl
23 µl cDNA
2 µl 2 units/µl Shrimp alkaline phosphatase
The final resulting cDNA concentration is 100 ng/µl.
2. Incubate at 37°C for 1 hour.
3. Incubate at 80°C for 15 minutes to inactivate the SAP.
6.4. QEA™ PREFERRED RE METHOD
Protocols for the RE embodiment are designed to
minimize the number of individual manipulations, and
thereby to maximize the reproducibility of QEA™ procedures.
In preferred protocols, no buffer changes, precipitations, or
organic (phenol/chloroform) extractions are used, all of
which lower the overall efficiency of the process and reduce
its utility for general use and more specifically for its use
in automated or robotic procedures.
Once the cDNA has been prepared, including terminal
phosphate removal, it is separated into batches of from 1 ng
to at least 50 ng each and of a number equal to the desired
number of individual samples that need to be analyzed. For
example, if six RE/ligase reactions and six analyses are
needed to generate all necessary signals, six batches are
made. Advantageously, QEA™ reactions can be duplicated or
triplicated in order to increase precision of and confidence
in the results.
RE/ligase reactions are performed as digestions by,
preferably, a pair of REs; alternatively, one or three or
more REs can be used provided that the four base pair
overhangs generated by each RE differ and can each be ligated
to a unique adapter and that a sufficiently resolved length
distribution results. The preferred amount of RE enzyme
specified in the protocols is sufficient for complete
digestion while minimizing any other exo- or endo-nuclease
activity that may be present in the enzyme. Preferred and
alternate RE combinations can be found in Tables 11 to 14.
Adapters are chosen that are unique to each RE in a
reaction. They are comprised of a linker complementary to
each unique RE sticky overhang and a primer which uniquely
hybridizes with that linker. The hybridized primer/linker
combination is called an adapter.
The primer/linker combination for a given RE is
chosen according to the several embodiments of QEA™ reactions
selected. Generally, sample primer/linker combinations are
chosen according to the combinations illustrated in Table 10
for any particular RE. The primers can be labeled when the
detection means so require. Where one or more, or preferably
all, primers have label moieties, these moieties are
preferably distinguishable and can be advantageously chosen
from the fluorescent labels described in Sec. 6.11. In a
QEA™ embodiment using post-PCR cleanup, one primer has a
capture moiety, e.g., biotin. The capture moiety is
preferably bound to one of the R-series primers, RA24 or
RC24, and the other primer is preferably labeled. Pairs of
labeled and biotinylated primers are preferably chosen
according to Table 11 for the RE pairs therein listed.
Finally, in the case of an SEQ-QEA™ embodiment, primers and
linkers are preferably chosen according to Sec. 6.10.1.
6.4.1. PREFERRED RE/LIGASE & AMPLIFICATION REACTIONS
This section describes the preferred protocol for
performing the RE/ligase and PCR amplification reactions with
a minimum of intervention.
Primer-excess Adapter Set Annealing
In the preferred protocol, a primer/linker
combination in the form of an adapter set specific for each
RE is chosen as above. The adapter set comprises sufficient
adapters, hybridized primer/linker, for the RE/ligase
reaction and also sufficient excess primers for the
subsequent PCR amplification. Accordingly, primers do not
have to be separately added to the PCR reaction mix. Adapter
sets are constructed from linkers and primers according to
the following protocol:
1. Add to water linker and primer in a 1:20 concentration
ratio (12-mer : 24-mer) with the primer at a total
concentration of 50 pmol per µl.
2. Incubate at 50°C for 10 minutes.
3. Cool slowly to room temperature and store at -20°C.
RE/ligase & Amplification Protocol
1. Combine the following components for the QPCR mix as
shown:
Reagent                Concentration              1 rxn      96 rxns
10X TB 2.0             500 mM Tris pH 9.15,       5 µl       525 µl
                       160 mM (NH4)2SO4,
                       20 mM MgCl2
dNTP                   10 mM                      2 µl       210 µl
(equimolar mixture)
Klentaq:PFU (16:1)     25 U/ml                    0.25 µl    26.25 µl
water                                             32.75 µl   3438.75 µl
wax                    90:10
                       Paraffin:Chillout™ 14
2. Pre-wax PCR tubes by melting the 90:10
Paraffin:Chillout™ 14 wax and adding the melted wax to
the tubes in such a way that the wax solidifies on the
sides of the upper half of the tubes.
3. Mix solutions by tapping and/or inverting the tubes (do
not vortex). Add 40 µl QPCR mix to the pre-waxed PCR
tubes. Add the solution one tube at a time, carefully
avoiding the sides and wax in the tubes. Note that it
is important to keep the QPCR and the Qlig mixes
separate, as the ligation reaction will not work if any
QPCR mix is present in it.
4. The tubes are placed in a thermal cycler without lids
and the wax is melted onto the liquid layer by
incubating at 75°C for 2 min, followed by decreasing
increments of 5°C for every 2 min until 25°C is reached.
5. Combine the following components for the Qlig mix as
shown:
Reagent          Concentration     1 rxn     24 rxn
RE 1             depends on RE     0.2 µl    5.2 µl
RE 2             depends on RE     0.2 µl    5.2 µl
Adapter set 1    20 pmole/ml       1 µl      26 µl
                 for primer
Adapter set 2    20 pmole/ml       1 µl      26 µl
                 for primer
ATP              10 mM             0.8 µl    20.8 µl
NEB 2            10X               1 µl      26 µl
Betaine          5 M               2 µl      52 µl
Ligase           1 U/ml            0.2 µl    5.2 µl
H2O                                2.6 µl    67.6 µl
The amount for 24 rxns is advantageous for 8 cDNA
reactions done in triplicate.
6. After the Qlig mixes are complete for each set of
enzymes the mix can be split up into tubes before adding
the cDNAs. 24 reactions can be split up into 8 tubes
each with 3 reaction volumes (approximately 27 µl).
7. Add the cDNA to the tubes and mix:
Reagent        Concentration   1 rxn   3 rxns
cDNA sample    1 ng/µl         1 µl    3 µl
The cDNA is prediluted to the appropriate concentration
of 1 ng/µl.
8. Add 10 µl of the Qlig mix to the top of the wax being
careful not to disturb the wax. In the case where 24
Qlig reactions are triplicated, the products can be
split into 24 individual QPCR reactions.
9. Gently add the caps to the tubes. Excess pressure can
disturb the wax.
10. Place the tubes in a thermal cycler and perform the
following thermal protocol.
Temp      Time        Reaction
(in °C)   (in min.)
37        30          Optimal RE digestion temperature
Ramp down to 37°C at -1°C/min.
16        60          Optimal ligation temperature
37        15          Optimal RE digestion temperature
72        20          Melt wax; mix solutions in tube;
                      blunt-end chains
Cycle the following steps for the number of PCR
cycles, preferably 20
96        30 sec.     Denaturing
57        1           Hybridizing
72        2           Chain elongation
End of the PCR cycles
72        10
~         hold
11. After the program is finished, heat the tubes to 75°C for ~
minutes. Pull out the tubes and immediately turn them
upside down till the wax hardens.
12. Place finished reactions in freezer or proceed directly
to further processing.
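The batch columns of the QPCR and Qlig tables can be reproduced by simple scaling. The following is a minimal sketch, assuming the batch amounts are plain multiples of the per-reaction volumes; the multipliers 105 and 26 are inferred from the tables (525 µl / 5 µl and 5.2 µl / 0.2 µl) and are not stated in the protocol, and the function names are hypothetical.

```python
# Per-reaction volumes in microliters, copied from the QPCR and Qlig tables.
QPCR_MIX_UL = {"10X TB 2.0": 5.0, "dNTP": 2.0, "Klentaq:PFU": 0.25, "water": 32.75}
QLIG_MIX_UL = {
    "RE 1": 0.2, "RE 2": 0.2, "Adapter set 1": 1.0, "Adapter set 2": 1.0,
    "ATP": 0.8, "NEB 2": 1.0, "Betaine": 2.0, "Ligase": 0.2, "H2O": 2.6,
}


def scale_mix(per_rxn_ul, n_volumes):
    """Multiply every per-reaction volume by the number of reaction volumes prepared."""
    return {name: round(vol * n_volumes, 2) for name, vol in per_rxn_ul.items()}


def total_volume(mix_ul):
    """Sum of all component volumes in microliters."""
    return round(sum(mix_ul.values()), 2)


# Assumed multipliers: 105 volumes for the "96 rxns" column, 26 for "24 rxn".
qpcr_batch = scale_mix(QPCR_MIX_UL, 105)
qlig_batch = scale_mix(QLIG_MIX_UL, 26)

print(total_volume(QPCR_MIX_UL))  # 40.0, the volume added per pre-waxed tube
print(total_volume(QLIG_MIX_UL))  # 9.0, leaving 1 µl for the cDNA sample
```

Preparing a few more volumes than reactions gives a margin for pipetting losses, which is one plausible reading of why the batch columns exceed 96 and 24 volumes.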
The following are the preferred vendors for the
various reagents used in this protocol.
Reagents                  Vendor                      Catalog #
Enzymes                   NEB (Beverly, MA)
Adapters                  Amitof/NBI (Allston, MA)    (see Table 10 for sequences)
Fluorescent Primers       Genosys (The Woodlands,     (see Table 10 for sequences)
                          TX)
ATP                       Pharmacia (Newark, NJ)      27-1006-02
dNTP                      Pharmacia                   27-2035-02
Klentaq                   Ab peptides                 1001
                          (St. Louis, MO)
PFU                       Stratagene                  600154
                          (Los Angeles, CA)
Betaine                   Sigma (St. Louis, MO)       B-2754
Paraffin wax              Fluka Chemical, Inc.        76243
                          (Ronkonkoma, N.Y.)
Chillout™ 14 liquid wax   MJ Research
Ligase                    BRL (Baltimore, MD)         15224-025
6.4.2. POST AMPLIFICATION CLEANUP PROTOCOL AND OTHER STEPS
Different post-amplification steps are appropriate
for the various methods of the QEA™/RE embodiment. In one
case, QEA™ reactions are performed with labeled primers having
no conjugated capture moieties. In this case, QEA™ reaction
products are simply separated by length. When separation is
by electrophoresis, the reaction products are suspended in a
loading buffer and then loaded into an electrophoresis gel.
A preferable electrophoresis apparatus is an ABI 377 (Applied
Biosystems, Inc.) automated sequencer using the Gene Scan
software (ABI) for analysis. The electrophoresis can be done
under non-denaturing conditions, in which the dsDNA remains
together and carries the labels (if any) of both primers. It
can also be done under denaturing conditions, in which each
ssDNA is separately labeled but the strands are typically
expected to migrate together.
In another case, one of the primers has a
conjugated capture moiety, e.g., biotin, either for
post-amplification cleanup prior to separation or as part of the
SEQ-QEA™ embodiment. In this case, QEA™ reaction products
are first subject to a cleanup protocol for removing excess
reagents and certain reaction products.
The following buffers are used in the post-PCR
cleanup protocol.
Binding Buffer (H2O solution)
I. 5 M NaCl
II. 10 mM Tris, pH 8.0
III. 1 mM EDTA
Wash Buffer (H2O solution)
I. 10 mM Tris, pH 8.
II. 10 mM EDTA
Loading Buffer (denaturing)
I. 80% deionized formamide
II. 20% 25 mM EDTA (pH 8.0), 50 mg/mL Blue
dextran
Ladder Loading Buffer
I. 100 µL Gene Scan 500 ROX with 900 µL Loading
Buffer
Post-PCR Cleanup Protocol:
1. Prepare enough streptavidin magnetic beads for purifying
QEA™ products (Catalog No. MSTR05~0 of CPG, Lincoln
Park, N.J.). Use 3 µL of beads for every 5 µL of QEA™
reaction product. Pre-wash beads in final suspension
volume with binding buffer.
          1 Reaction               96 Reactions
Sample    Bead      Suspension     Bead       Suspension
Volume    Volume    Volume         Volume     Volume
5 µl      3 µl      10 µl          300 µl     1 ml
10 µl     6 µl      10 µl          600 µl     1 ml
15 µl     9 µl      10 µl          900 µl     1 ml
20 µl     12 µl     10 µl          1200 µl    1 ml
2. Dispense 10 µL of washed beads for every QEA™ sample to
be processed. Purifications are done in a 96 well
Falcon TC plate.
3. Add QEA™ product to beads. Mix well and incubate 30
minutes at 50°C.
4. Bring volume of sample up to 100 µl with binding buffer.
Place plate on 96 well magnetic particle concentrator.
Allow beads to migrate for 5 minutes.
5. Remove liquid, add 200 µL of washing buffer (TE pH 7.4).
6. Repeat the washing step 5.
7. In the case of a SEQ-QEA™ embodiment, the washed beads
are now passed to the further steps of this embodiment
as described in Sec. 6.5. In the other case of an
embodiment using post-amplification cleanup alone, the
washed beads are passed to the analysis step 9.
Optionally, the beads may be stored by passing to step
10.
8. For analysis the beads are resuspended in loading buffer
(5 µl for 5 µl of beads). Gene Scan 500 ROX ladder can
be mixed in a one-tenth dilution. The supernatant is
then analyzed by electrophoresis under denaturing
conditions.
9. In case the beads are to be stored, remove liquid and
air dry the beads.
10. Store plate dry at -20°C.
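The bead table follows directly from the rule in step 1 (3 µL of beads per 5 µL of reaction product, dispensed as 10 µL of washed suspension per sample). A minimal sketch of that arithmetic is given below; the function name is hypothetical, and the multiplier of 100 used to reproduce the "96 Reactions" columns is inferred from the table rather than stated in the protocol.

```python
def bead_plan(sample_ul, n_volumes=1):
    """Return bead and suspension volumes (in microliters) for a given sample volume.

    Uses the protocol's ratio of 3 µl beads per 5 µl reaction product and a
    fixed 10 µl of washed bead suspension dispensed per sample.
    """
    bead_ul = 3.0 * sample_ul / 5.0
    return {
        "bead_ul": bead_ul * n_volumes,
        "suspension_ul": 10.0 * n_volumes,
    }


# One reaction with 15 µl of product, then a batch of 100 such volumes
# (assumed multiplier behind the "96 Reactions" columns).
print(bead_plan(15))
print(bead_plan(15, n_volumes=100))
```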
In the case where one of the primers has a conjugated
biotin moiety, QEA™ reaction products fall into the following
three categories:
a) A dsDNA molecule of which neither strand has a biotin
moiety;
b) A dsDNA molecule having only one strand with a
conjugated biotin moiety;
c) A dsDNA molecule having biotin moieties conjugated to
both strands.
Category "a" products are not bound to the beads, and after
the washing steps 5 and 6 of the previous protocol, they are
washed from the beads, leaving only categories "b" and "c"
attached to the beads. After step 9 in which the beads are
resuspended in denaturing loading buffer, for category "b"
products, the strand not having the biotin moiety is released
while the other strand with the biotin moiety is retained by
the beads. For category "c" products, both strands are
retained. Consequently, the electrophoresis of step 9
separates single strands deriving from those reaction
products having only one conjugated biotin moiety.
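The fate of the three product categories can be written as a small decision table. The sketch below is illustrative only: the function names are hypothetical, and it encodes nothing beyond the behavior described in the preceding paragraph.

```python
def classify(biotin_on_strand_1, biotin_on_strand_2):
    """Map the biotin status of the two strands to category 'a', 'b', or 'c'."""
    n_biotin = int(bool(biotin_on_strand_1)) + int(bool(biotin_on_strand_2))
    return "abc"[n_biotin]


def fate(category):
    """Return (bound_to_beads, strands_released_by_denaturing_buffer)."""
    if category == "a":
        return (False, 0)  # no biotin: removed by the washing steps
    if category == "b":
        return (True, 1)   # the unbiotinylated strand is released and detected
    if category == "c":
        return (True, 0)   # both strands stay on the beads
    raise ValueError(f"unknown category: {category!r}")


for strands in [(False, False), (True, False), (True, True)]:
    cat = classify(*strands)
    print(cat, fate(cat))
```

Only category "b" contributes strands to the denaturing electrophoresis, which is the point made at the end of the paragraph above.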
6.4.3. THE 5'-QEA™ EMBODIMENT
This subsection describes an exemplary protocol for
QEA™ embodiments which generates cDNA fragments which on the
5' end are fixed with respect to the 5' cap of the source
mRNA and which on the 3' end are singly cut by a chosen RE.
First, input cDNA is synthesized according to the protocol of
Sec. 6.3.3, or an equivalent protocol. Second, the protocols
in Sec. 6.4.1 and 6.4.2 previously described, differing
only in the composition of the Qlig mix, are
performed.
1. cDNA is synthesized according to the protocol of Sec.
6.3.3.
2. The QPCR mix is prepared according to steps 1 through 4
of the protocol of Sec. 6.4.1.
3. Combine the following components for the Qlig mix as
shown:
Reagent          Concentration             1 rxn     24 rxn
RE 1             depends on RE             0.2 µl    5.2 µl
Adapter set 1    20 pmole/ml for primer    1 µl      26 µl
MA24 primer      20 pmole/ml for primer    1 µl      26 µl
biotin-labeled
ATP              10 mM                     0.8 µl    20.8 µl
NEB 2            10X                       1 µl      26 µl
Betaine          5 M                       2 µl      52 µl
Ligase           1 U/ml                    0.2 µl    5.2 µl
H2O                                        4.6 µl    119.6 µl
The amount for 24 reactions is advantageous for 8
reactions performed in triplicate.
4. The RE/ligase and PCR amplifications are processed
according to steps 6 through 12 of the protocol of Sec.
6.4.1.
5. The reaction products are processed according to steps
1-6 and 8-10 of the cleanup protocol of Sec. 6.4.2.
After the washing step of the cleanup protocol,
step 6, attached to the streptavidin beads are only products
which are singly cut on the 3' end and are terminated at the
5' end by the biotin-labeled primer, which is ligated in a
fixed relation to the 5' cap of the source mRNA. Thus, upon
denaturing electrophoresis, step 9 of the cleanup protocol,
subsequent detection finds only signals from the desired
singly cut end fragments of definite length.
6.4.4. FIRST ALTERNATIVE RE/LIGASE & AMPLIFICATION REACTIONS
This section describes less preferred protocols
suitable for either manual or automated execution in two
tubes and suitable for labeled primers without a conjugated
capture moiety. Otherwise the REs and the primer/linker
components are chosen as previously described.
Adapter Annealing
Adapters are formed by annealing 12-mer linkers and 24-mer
primers with some linker excess according to the
following protocol:
1. Add to water linker and primer in a 2:1 concentration
ratio (12-mer : 24-mer) with the primer at a total
concentration of 5 pm per µl.
2. Incubate at 50°C for 10 minutes.
3. Cool slowly to room temperature and store at -20°C.
Because there is no primer excess, primers must be
separately added to the PCR reaction mix.
Manual RE/Ligase & Amplification Reactions
This protocol is advantageously applied to separate
manually performed RE/ligase and amplification reactions.
First, the RE/ligase reaction is prepared for use in a 96
well thermal cycler. Add per reaction:
1. 1 U of chosen REs (New England Biolabs, Beverly, MA)
(preferred RE pair listing in Sec. 6.10)
2. 1 µl of pre-annealed adapters appropriate for the chosen
REs, prepared as above
3. 1 µl of Ligase/ATP (0.2 µl T4 DNA ligase [1 U/µl]/0.8 µl
10 mM ATP from Life Technologies (Gaithersburg, MD))
4. 0.5 µl 50 mM MgCl2
5. 10 ng of subject prepared cDNA
6. 1 µl 10X NEB 2 buffer from New England Biolabs (Beverly,
MA)
7. Water to bring total volume to 10 µl
Then perform the RE/ligation reaction by following
the thermal profile in Fig. 16A using a PTC-100 Thermal
Cycler from MJ Research (Watertown, MA).
Next form the PCR amplification reaction mix by
combining:
1. 10 µl 5X E-Mg (300 mM Tris-HCl pH 9.0, 75 mM (NH4)2SO4, no
Mg ions)
2. 100 pm of appropriate fluorescently labeled 24-mer
primers
3. 1 µl 10 mM dNTP mix (Life Technologies, Gaithersburg,
MD)
4. 2.5 U of 50:1 Taq polymerase (Life Technologies,
Gaithersburg, MD) : Pfu polymerase (Stratagene, La
Jolla, CA)
5. Water to bring volume to 40 µl per PCR reaction
Then perform the following steps:
1. Add 40 µl of the PCR reaction mix to each RE/ligation
reaction
2. Perform the PCR temperature profile of Fig. 16B using a
PTC-100 thermal cycler (MJ Research, Watertown, MA)
Automated RE/Ligase & Amplification Reactions
The preceding protocol can be advantageously
automated according to the current protocol which requires
intermediate reagent additions. Reactions are performed in a
standard 96 well thermal cycler format using a Beckman Biomek
2000 robot (Beckman, Sunnyvale, CA). Typically 4 cDNA
samples are analyzed in duplicate with 12 different RE pairs,
for a total of 96 reactions. All steps are performed by the
robot, including solution mixing, from user provided stock
reagents, and temperature profile control.
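The 96-reaction design (4 cDNA samples, in duplicate, with 12 RE pairs) fills an 8 x 12 plate exactly. The following sketch enumerates one possible well assignment; the layout (one row per sample replicate, one column per RE pair) and all names are assumptions for illustration, not the arrangement used by the robot.

```python
from itertools import product
import string


def plate_layout(samples, re_pairs, replicates=2):
    """Assign (sample, replicate, RE pair) to wells A1..H12 of a 96 well plate.

    Hypothetical layout: each row holds one sample replicate, each column
    holds one RE pair.
    """
    rows = string.ascii_uppercase[:8]
    conditions = list(product(samples, range(1, replicates + 1)))
    if len(conditions) > len(rows) or len(re_pairs) > 12:
        raise ValueError("design does not fit on a single 96 well plate")
    layout = {}
    for row, (sample, rep) in zip(rows, conditions):
        for col, pair in enumerate(re_pairs, start=1):
            layout[f"{row}{col}"] = (sample, rep, pair)
    return layout


samples = [f"cDNA{i}" for i in range(1, 5)]
re_pairs = [f"pair{i}" for i in range(1, 13)]
layout = plate_layout(samples, re_pairs)
print(len(layout))  # 96
```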
Pre-annealed adapters are prepared as in the
preceding section. Mix per RE/ligase reaction:
1. 1 U of appropriate RE (New England Biolabs, Beverly, MA)
2. 1 µl of appropriate annealed adapter prepared as above
(10 pm)
3. 0.1 µl T4 DNA ligase [1 U/µl] (Life Technologies,
Gaithersburg, MD)
4. 1 µl ATP (Life Technologies, Gaithersburg, MD)
5. 5 ng of subject prepared cDNA
6. 1.5 µl 10X NEB 2 buffer from New England Biolabs
(Beverly, MA)
7. 0.5 µl of 50 mM MgCl2
8. Water to bring total volume to 10 µl and transfer to
thermal cycler
The robot requires 23 minutes total time to set up
the reactions. Then it performs the RE/ligation reaction by
following the temperature profile of Fig. 16C using a PTC-100
Thermal Cycler equipped with a mechanized lid from MJ
Research (Watertown, MA).
Next, prepare the PCR reaction mix by combining:
1. 10 µl 5X E-Mg (300 mM Tris-HCl pH 9.0, 75 mM (NH4)2SO4)
2. 100 pm of appropriate fluorescently labeled 24-mer
primer
3. 1 µl 10 mM dNTP mix (Life Technologies, Gaithersburg,
MD)
4. 2.5 U of 50:1 Taq polymerase (Life Technologies,
Gaithersburg, MD) : Pfu polymerase (Stratagene, La
Jolla, CA)
5. Water to bring volume to 35 µl per PCR reaction
Preheat the PCR mix to 72°C and transfer 35 µl of
the PCR mix to each digestion/ligation reaction and mix. The
robot requires 6 minutes for the transfer and mixing.
Perform the RE/ligase and PCR amplification reactions
according to the temperature profile of Fig. 16B using a
PTC-100 thermal cycler equipped with a mechanized lid (MJ
Research, Watertown, MA).
The total elapsed time for the digestion/ligation
and PCR amplification reactions is 179 minutes. No user
intervention is required after initial experimental design
and reagent positioning.
6.4.5. SECOND ALTERNATIVE RE/LIG. & AMPLIFICATION REACTIONS
This section describes a much less preferred fully
manual protocol in which the RE, the ligation, and the PCR
amplification reactions are all separately performed in three
tubes. It is suitable for labeled primers without a
conjugated capture moiety, with the REs and the primer/linker
components otherwise chosen as previously described. It is a
less preferred protocol.
RE Digestion Reaction
1. Mix the following reagents
0.5 µl prepared cDNA (100 ng/µl) mixture (total 50 ng
of cDNA)
10 µl New England Biolabs Buffer No. 2
3 Units RE enzyme
2. Incubate for ~ hours at 37°C.
Larger size digests with higher concentrations of
cDNA can be used and fractions of the digest saved for
additional sets of experiments.
Adapter Ligation
Since it is important to remove unwanted ligation
products, such as concatamers of fragments from different
cDNAs resulting from hybridization of RE sticky ends, the
restriction enzyme is left active during ligation. This
leads to a continuing cutting of unwanted concatamers and end
ligation of the desired end adapters. The majority of
restriction enzymes are active at the 16°C ligation
temperature. Ligation profiles consisting of optimum
ligation conditions interspersed with optimum digestion
conditions can also be used to increase efficiency of this
process. An exemplary profile comprises periodically cycling
between 37°C for 2 hours and 16°C for 2 hours at a ramp of
1°C/min.
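The exemplary profile can be laid out as a list of hold and ramp segments to estimate how long each cycle takes. This is a minimal sketch under two assumptions not stated in the text: the 1°C/min ramp applies in both directions, and each cycle ends by ramping back to 37°C. The function names are hypothetical.

```python
def ligation_profile(n_cycles, hi_c=37.0, lo_c=16.0, hold_min=120.0,
                     ramp_c_per_min=1.0):
    """Build (kind, target_temp_C, minutes) segments for the digestion/ligation cycling."""
    ramp_min = abs(hi_c - lo_c) / ramp_c_per_min
    segments = []
    for _ in range(n_cycles):
        segments.append(("hold", hi_c, hold_min))  # optimum digestion
        segments.append(("ramp", lo_c, ramp_min))  # cool toward ligation temperature
        segments.append(("hold", lo_c, hold_min))  # optimum ligation
        segments.append(("ramp", hi_c, ramp_min))  # warm back up (assumed)
    return segments


def total_minutes(segments):
    """Total duration of all segments in minutes."""
    return sum(minutes for _, _, minutes in segments)


print(total_minutes(ligation_profile(1)))  # 282.0
```

Under these assumptions each ramp takes 21 minutes, so one full cycle lasts 4 hours 42 minutes.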
One linker complementary to each 5' overhang
generated by each RE is required. 100 picomoles ("pm") is a
sufficient molar excess for the protocol described. For each
linker a complementary uniquely labeled primer is added for
ligation to the cut ends of cDNAs. 100 pm is a sufficient
molar excess for the protocol described. If the amount of
RE-digested cDNA is changed, the linker and primer amounts
should be changed proportionately.
Ligation Reaction (per 10 µl and 50 ng cDNA)
1. Mix the following reagents
Component                Volume
digested cDNA mixture    10 µl
100 pm/µl each primer    1 µl
100 pm/µl each linker    1 µl
2. Thermally cycle from 50°C (a temperature at which the
primers and linkers hybridize) to 10°C (-1°C/minute)
then back to 16°C
3. Add 2 µl 10 mM ATP with 0.2 µl T4 DNA ligase (Premix
0.1 µl ligase 1 U/µl per 1 µl ATP) (E. coli ligase is a
less preferred alternative ligase.)
4. Incubate 12 hours at 16°C. This step can be shortened
to less than 2 hours with proportionately higher ligase
concentration. Alternately the thermal cycling protocol
described can be used here.
5. Incubate 2 hours at 37°C
6. Incubate 20 minutes at 65°C to heat inactivate the
ligase (last step should be RE cutting).
7. Hold at 4°C
Amplification Of Fragments With Ligated Adapters
This step amplifies the fragments that have been
cut twice and ligated with adapters unique for each RE cut
end. It is designed for a high amplification specificity.
Multiple amplifications are performed, with an increasing
number of amplification cycles. Use of the minimum number of
cycles to get the desired signal is preferred.
Amplifications above 20 cycles are not generally reliably
quantitative.
Mix the following to form the ligation mix:
Component                 Volume
RE/Ligase cDNA mixture    5 µl
10X PCR Buffer            5 µl
25 mM MgCl2               3 µl
10 mM dNTPs               1 µl
~00 pm/µl each primer     1 µl
Mix the following to form 150 µl PCR-Premix
Volume    Component
30 µl     Buffer E (ligation mix will contribute 0.3 mM MgCl)
1 µl      (300 pm/µl Rbuni24 Flour) 24 mer primer strand (50
          pm/µl NBuni24 Tamra)
0.6 µl    Taq polymerase (per 150 µl)
3 µl      dNTP (10 mM)
106 µl    H2O
Amplification of fragments is more specific if the
small linker dissociates from the ligated primer-cDNA complex
prior to amplification. The following is an exemplary method
for amplification of the results of six RE/ligase reactions.
1. Place three strips of six PCR tubes, marked 10, 15, and
20 cycles, into three rows on ice as shown.
20 cycles   1 2 3 4 5 6   - Add 140 µl PCR premix
15 cycles   1 2 3 4 5 6
10 cycles   1 2 3 4 5 6   - Add 10 µl ligation mix
2. Place 10 µl ligation mix in each tube in 10 cycle row
3. Place 140 µl PCR premix in each tube in 20 cycle row
4. Place into cycler and incubate for 5 minutes at 72°C.
This melts linker which was not covalently ligated to the
second strand of a cDNA fragment and allows the PCR
premix to come to temperature.
5. Move the 140 µl PCR premix into the tubes in the 10 cycle
row containing the 10 µl ligation mix, then place 50 µl
of result into corresponding tubes each in other rows.
6. Incubate for 5 minutes at 72°C. This finishes
incompletely double stranded cDNA ends into complete
dsDNA, the top primer being used as template for second
strand completion.
The amplification cycle is designed to raise
specificity and reproducibility of the reaction. High
temperature and long melting times are used to reduce bias of
amplification due to high G+C content. Long extension times
are used to reduce bias in favor of smaller fragments. Long
denaturing times reduce PCR bias due to melting rates of
fragments, and long extension time reduces PCR bias on
fragment sizes.
1. Thermally cycle 95°C for 1 minute followed by 68°C for 3
minutes.
2. Incubate at 72°C for 10 minutes at end of reaction.
6.4.6. OPTIONAL POST-AMPLIFICATION STEPS
Several optional steps can improve the signal from
the detected bands. First, single strands produced as a
result of linear amplification from singly cut fragments can
be removed by the use of single strand specific exonuclease.
Exo I is the preferred single strand specific nuclease, and
is used by incubating 2 units of nuclease with the product of
each PCR reaction for 60 minutes at 37°C.
Second, the amplified products can be concentrated
prior to detection either by ethanol precipitation or column
separation with a hydroxyapatite column.
Several labeling methods are usable, including
fluorescent labeling as has been described, silver staining,
radiolabelled end primers, and intercalating dyes.
Fluorescent end labeling is preferred for high throughput
analysis with silver staining preferred if the individual
bands are to be removed from the gel for further processing,
such as sequencing.
Finally, fourth, use of two primers allows direct
sequencing of separated strands by standard techniques. Also
separated strands can be directly cloned into vectors for use
in RNA assays such as in situ analysis. In that case, it is
more preferred to use primers containing T7 or other
polymerase signals.
6.5. PREFERRED SEQ-QEA™ METHOD
The SEQ-QEA™ embodiment is practiced with the
special SEQ-QEA™ primers. One SEQ-QEA™ primer in a reaction
has a Type IIS restriction enzyme (e.g., FokI) recognition
site and a fluorescent tag (e.g., FAM (carboxy-fluorescein)
(ABI)) attached at the 5' end. The other primer used has a
biotin capture moiety ("Bio") and comprises either a uracil
residue or a site for a rare-cutting restriction enzyme like
AscI. Sec. 6.10.1 and Table 18 have a list of exemplary
primers and linkers for the SEQ-QEA™ methods.
Using these primers with corresponding linkers and
appropriate REs, the preferred QEA™ protocol of Sec. 6.4.1 is
performed and is followed by the post-PCR cleanup protocol of
Sec. 6.4.2 through the step 6 washing. As noted in step 7,
the products of step 6 are input to the further steps of the
SEQ-QEA™ embodiment.
The following are preferable primers and linkers to
be used together with an RE1 of BglII and an RE2 of
BspHI.
SEQ-QEA™                                   Type-IIS   Method of
method     Primer pairs                    Enzyme     Bead Release
1)         KA5/KA24-FAM + RC9/UC24-Bio     FokI       UDG
2)         BA5/BA24-FAM + RC9/UC24-Bio     BbvI       UDG
3)         KA5/KA24-FAM + RC9/SC24-Bio     FokI       AscI
4)         BA5/BA24-FAM + RC9/SC24-Bio     BbvI       AscI
Using the above REs and primer pairs, QEA™ method
reaction products obtained fall into the following three
categories:
a) A double-stranded DNA with a 5' FAM label with nearby
sequence containing a recognition site for FokI or BbvI
on one strand, and a 3' biotin label with nearby sequence
containing a uracil residue or an AscI recognition site
on the other strand (in the case where different REs cut
at each end)
b) A double-stranded DNA with a 5' biotin label with nearby
sequence containing a uracil residue or an AscI
recognition site on one strand, and a 3' biotin label
with nearby sequence containing a uracil residue or an
AscI recognition site on the other strand (in the case
where same RE cuts at both ends)
c) A double-stranded DNA with a 5' FAM label with nearby
sequence containing a recognition site for FokI or BbvI
on one strand, and a 3' FAM label with nearby sequence
containing a recognition site for FokI or BbvI on the
other strand (in the case where same RE cuts at both
ends)
Typically, after the QEA™ reactions according to the
protocol of Sec. 6.4.1 are completed, 45 µl out of 50 µl is
processed (the rest is saved). During the post-PCR cleanup
according to the protocol of Sec. 6.4.2, these 45 µl of the
reaction products are bound to the magnetic streptavidin
beads and washed at step 6 of this protocol. After this
step, only category "a" and "b" products are retained by the
magnetic streptavidin beads, the category "c" products having
no biotin moieties. Subsequently, the DNA bound to the beads
is digested with the Type IIS restriction enzyme in a volume
of 100 µl of a suitable 1X RE buffer, e.g. NEB 4 for FokI,
with about 10 units of the enzyme for 3 hours at 37°C. After
Type IIS RE digestion and washing only category "a" products
are retained by the beads, the category "b" products having
been cut at both ends and released from the beads. The
supernatant is then removed and the beads are washed three
times with the wash buffer. Type IIS restriction enzymes
cleave DNA at a location outside their recognition sites,
thus producing overhangs of unknown sequences (Szybalski et
al., 1991, Gene 100:13-26). The Type IIS digestion thus
releases the FAM label of the category "a" products and
creates a fragment-specific overhang that acts as a template
for sequencing. Complete Type IIS digestion can be checked
for by the absence of the FAM label.
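How a Type IIS enzyme produces a fragment-specific overhang can be illustrated with the published cut offsets of the two enzymes named above: FokI cuts 9 and 13 nucleotides past GGATG, and BbvI cuts 8 and 12 nucleotides past GCAGC, each leaving a 4-nucleotide 5' overhang. The sketch below is a simplification with a hypothetical function name: it searches only the given (top) strand for the first occurrence of the site and ignores sites on the opposite strand.

```python
# enzyme -> (recognition site, top-strand cut offset, bottom-strand cut offset),
# offsets counted in nucleotides past the 3' end of the site on the top strand.
TYPE_IIS = {
    "FokI": ("GGATG", 9, 13),
    "BbvI": ("GCAGC", 8, 12),
}


def type_iis_overhang(top_strand, enzyme):
    """Return the top-strand bases left single-stranded by the staggered cut.

    Returns None if the site is absent or the cut would fall past the end
    of the sequence.
    """
    site, top_cut, bottom_cut = TYPE_IIS[enzyme]
    start = top_strand.find(site)
    if start < 0:
        return None
    site_end = start + len(site)
    if site_end + bottom_cut > len(top_strand):
        return None
    return top_strand[site_end + top_cut:site_end + bottom_cut]


# Illustrative sequence (not from the specification).
print(type_iis_overhang("AAGGATGACGTACGTACCTTGGGG", "FokI"))  # CCTT
```

Because the overhang lies outside the recognition site, its sequence depends on the cDNA fragment rather than on the primer, which is why it can serve as the sequencing template described above.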
The end-sequencing reaction is essentially a chain
fill-in reaction using the overhang generated by the Type-IIS
restriction enzyme as a template. Dideoxy chain terminators
labeled with different ABI fluorescent dyes are mixed at high
ratios with dNTPs to ensure high frequency of incorporation,
and the DNA polymerase enzyme used (e.g., Sequenase (T7 DNA
polymerase), Taquenase (Taq polymerase)) has high affinity
for the labeled dideoxynucleotides. A sequencing mix
totalling 20 µl containing the appropriate 1X buffer, 1 µl
dNTPs diluted 1/200 from stock (3 mM dATP, 1.2 mM dCTP,
4.5 mM dGTP, 1.2 mM dTTP), 0.5 µl each ABI dye-labeled
terminator solution (containing ddATP, ddCTP, ddGTP and
ddTTP, respectively), (and 1 µl 0.1 M DTT for Sequenase) is
made. The beads are resuspended in the sequencing mix and
0.1 µl Taquenase is added and the reaction is incubated at
65°C for 15 minutes. If Sequenase is to be used, 0.1 µl
Sequenase is added instead of Taquenase and the reaction is
incubated at 37°C for 15 minutes. After this, the reaction
mix is transferred to a magnet and the supernatant is
removed. The beads are washed twice with wash buffer.
The above-described end-sequencing reaction
incorporates dye labeled nucleotides into the strand that
contains biotin. Since biotin-streptavidin binding is nearly
irreversible, the labeled strands must be cleaved for
analysis by electrophoresis. This is achieved by treating
UMP-containing fragments with Uracil DNA Glycosylase (UDG),
or cleaving AscI-site-containing fragments with AscI. UDG
removes the Uracil residue from dsDNA; the phosphate backbone
is subsequently hydrolyzed at temperatures above room
temperature and at pH>8.3.
For UDG treatment, the beads are resuspended in 20
µl UDG buffer (30 mM Tris-HCl pH 7.5, 50 mM KCl, 5 mM MgCl2),
0.2 units of UDG are added and the reaction is incubated at
room temperature for 30 minutes. The reaction is then
transferred to a magnet and the supernatant removed. The
biotinylated strand, which is the strand that is being filled
in during end-sequencing, is still attached to the beads as
UDG does not destroy the backbone, but makes it very
susceptible to hydrolysis.
The beads are resuspended in 5 µl formamide loading
buffer. These are then split into 2 tubes of 2.5 µl each.
Another 2.5 µl formamide loading buffer is added to one and
2.5 µl formamide loading buffer with 20% GS500 ROX ladder
(ABI) is added to the other. These are heated at 95°C for 5
minutes to effect hydrolysis and denaturation. Then they are
electrophoretically separated.
In case of the biotinylated primer having an AscI
site, the following is performed. The beads are resuspended
in 20 µl of AscI buffer (15 mM KOAc, 20 mM Tris, 10 mM MgOAc,
1 mM DTT, pH 7.9) and 5 units of AscI is added and incubated
at 37°C for 1 hour. The beads are separated on a magnet and
the supernatant that contains the digestion products is
precipitated with three volumes of ethanol after the addition
of 5 µg of glycogen. The pellet is resuspended in 5 µl
formamide loading buffer and split into 2 tubes of 2.5 µl
each. Another 2.5 µl formamide loading buffer is added to
one and 2.5 µl formamide loading buffer with 20% GS500 ROX
ladder is added to the other. These are heated at 95°C for 5
minutes and analyzed by electrophoretic separation.
Sequencing is completed by gel electrophoretic
separation of released and sequenced strands. The overhang
sequence is given by the order of partially filled in
fragments observed.
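Reading the overhang from the partially filled-in fragments amounts to sorting the terminated fragments by length. The sketch below assumes, for illustration, that each observed band has already been reduced to a (length, terminator base) pair via its dye color and that successive bands differ by one incorporated base; the function name and the input format are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_overhang(fragments):
    """Return (synthesized, template) sequences, both written 5' to 3'.

    fragments: iterable of (length_in_nt, terminator_base) pairs, one per band.
    The shortest fragment carries the first base added during fill-in.
    """
    ordered = sorted(fragments)
    synthesized = "".join(base for _, base in ordered)
    # The template overhang is the reverse complement of the filled-in strand.
    template = synthesized.translate(COMPLEMENT)[::-1]
    return synthesized, template


# Four bands in arbitrary order, as they might be listed after peak calling.
bands = [(103, "G"), (101, "A"), (104, "G"), (102, "A")]
print(read_overhang(bands))  # ('AAGG', 'CCTT')
```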
6.6. QEA™ BY THE PCR EMBODIMENT
This is an alternative QEA™ embodiment based on PCR
amplification of fragments between target subsequences
recognized by PCR primers or sets of PCR primers. It is
designed for the preferred primers described with reference
to Fig. 5. If other primers are used, such as simple sets of
degenerate oligonucleotides, step 5, the first low stringency
PCR cycle, is omitted.
First strand cDNA synthesis is carried out
according to Sec. 6.3. PCR amplification with defined sets
of primers is performed according to the following protocol.
1. RNase treat the 1st strand mix with 1 µl of RNase
Cocktail from Ambion, Inc. (Austin, TX) at 37°C for 30
minutes.
2. Phenol/CHCl3 extract the mixture 2 times, and purify it on
a Centricon 100, Millipore Corporation (Bedford, MA) using
water as the filtrate.
3. Bring the end volume of the cDNA to 50 µl (starting with
10 ng RNA/µl).
4. Set up the following PCR Reaction:
Component              Volume
cDNA (~10 ng/µl)       1 µl
10X PCR Buffer         2.5 µl
25 mM MgCl2            1.5 µl
10 mM dNTPs            0.5 µl
20 pm/µl primer 1      2.5 µl
20 pm/µl primer 2      2.5 µl
Taq Poly. (5 U/µl)     0.2 µl
water                  14.3 µl
5. One low stringency cycle with the profile
40°C for 3 minutes (annealing)
72°C for 1 minute (extension)
6. Cycle using the following profile:
95°C for 1 minute
15-30 times:
95°C for 30 seconds
50°C for 1 minute
72°C for 1 minute
72°C for 5 minutes
7. 4°C hold.
8. Samples are precipitated, resuspended in denaturing
loading buffer, and analyzed by electrophoresis (either
under denaturing or non-denaturing conditions).
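The reaction table of step 4 can be checked with the usual dilution relation (final concentration = stock concentration x added volume / total volume). The sketch below uses only the volumes listed in the table; the function name is hypothetical.

```python
# Component volumes in microliters, from the PCR reaction table of step 4.
PCR_REACTION_UL = {
    "cDNA": 1.0, "10X PCR buffer": 2.5, "25 mM MgCl2": 1.5, "10 mM dNTPs": 0.5,
    "primer 1": 2.5, "primer 2": 2.5, "Taq": 0.2, "water": 14.3,
}


def final_concentration(stock_conc, volume_ul, total_ul):
    """Concentration after dilution, in the same units as stock_conc."""
    return stock_conc * volume_ul / total_ul


total_ul = round(sum(PCR_REACTION_UL.values()), 2)
mgcl2_mM = final_concentration(25.0, PCR_REACTION_UL["25 mM MgCl2"], total_ul)
dntp_mM = final_concentration(10.0, PCR_REACTION_UL["10 mM dNTPs"], total_ul)
primer_pm = 20.0 * PCR_REACTION_UL["primer 1"]  # picomoles of each primer added

print(total_ul, mgcl2_mM, dntp_mM, primer_pm)
```

With these volumes the reaction totals 25 µl, so the 2.5 µl of 10X buffer is diluted to 1X, MgCl2 ends at 1.5 mM, dNTPs at 0.2 mM, and each primer contributes 50 pm.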
6.7. EXAMPLE OF SIMULATED ANNEALING
From the October 1994 GenBank database containing
human coding sequences, 12,000 of the first continuous coding
domain sequences ("CDS") were selected as in Sec. 6.1. This
selection resulted in a selected database of sequences biased
towards short sequences. Frequency tables were then created
that listed the occurrence frequency of each nucleotide
subsequence of lengths 4, 5, 6, 7, and 8 in this selected
database. Test target subsequences were initially selected
whose probability of occurrence was near to 50%. This was
feasible for the 4-mers, as they bind relatively frequently,
but as the occurrence probability decreases with length, for
longer sequences, the occurrence probability was often
substantially less than 50%. These initially selected target
subsequences were then optimized, using the simulated
annealing CC experimental design methods, to pick the best 16
subsequences.
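The frequency tables and the initial "near 50%" selection can be sketched as follows. This is an illustrative reading, not the procedure of Sec. 6.1: occurrence frequency is taken here as the fraction of sequences containing a subsequence at least once on the given strand, and the function names and toy sequences are hypothetical.

```python
from collections import Counter


def occurrence_fraction(sequences, k):
    """Fraction of sequences that contain each k-mer at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    n = len(sequences)
    return {kmer: c / n for kmer, c in counts.items()}


def closest_to_half(fractions, n_pick):
    """Pick the k-mers whose occurrence fraction is nearest to 50%."""
    ranked = sorted(fractions, key=lambda kmer: (abs(fractions[kmer] - 0.5), kmer))
    return ranked[:n_pick]


toy_cds = ["ACGTAC", "ACGGGG", "TTTTAC", "GGGGGG"]
fractions = occurrence_fraction(toy_cds, 2)
print(closest_to_half(fractions, 2))  # ['CG', 'GG']
```

A subsequence present in about half of the sequences splits the database most evenly, which is why such candidates make a reasonable starting point before optimization.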
Tables 5, 6 and 7 present the results for target
subsequences of lengths 4, 5 and 6, respectively. Table 8
presents the results for optimizing target subsequences of
length 4 through 6 together. Simulated annealing generally
produced an approximately 20% improvement over a target
subsequence selection guided only by the occurrence and
independence probability criteria. This level of
optimization is likely to improve with larger and less
redundant databases that represent longer genes. Longer
sequences bind too infrequently in this database to make
useful hash codes.
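A generic simulated annealing loop for choosing a fixed-size set of target subsequences might look like the sketch below. It is not the experimental design method of this specification: the objective (number of distinct non-empty hash codes), the geometric cooling schedule, the single-swap move, and all names and default parameters are assumptions made for illustration.

```python
import math
import random


def hash_code(seq, targets):
    """Presence/absence pattern of each target subsequence in seq."""
    return tuple(t in seq for t in targets)


def score(sequences, targets):
    """Number of distinct non-empty hash codes (higher is better)."""
    codes = {hash_code(s, targets) for s in sequences}
    codes.discard(tuple(False for _ in targets))
    return len(codes)


def anneal(sequences, candidates, n_targets, steps=2000,
           t_start=1.0, t_end=0.01, seed=0):
    """Search for n_targets candidates that maximize score()."""
    rng = random.Random(seed)
    current = rng.sample(candidates, n_targets)
    cur_score = score(sequences, current)
    best, best_score = list(current), cur_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        unused = [c for c in candidates if c not in current]
        if not unused:
            break
        proposal = list(current)
        proposal[rng.randrange(n_targets)] = rng.choice(unused)
        new_score = score(sequences, proposal)
        delta = new_score - cur_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_score = proposal, new_score
            if cur_score > best_score:
                best, best_score = list(current), cur_score
    return best, best_score


toy_cds = ["ACGTACGT", "TTGACCAT", "GGGTACCA", "CATGCATG", "ACCATTGA"]
toy_candidates = ["ACG", "TAC", "CAT", "TGA", "GGG", "CCA", "ATG", "TTG"]
best, best_score = anneal(toy_cds, toy_candidates, n_targets=3, steps=200)
print(best, best_score)
```

Accepting occasional worse swaps at high temperature lets the search leave selections that look good locally, which is the property that distinguishes annealing from a purely greedy choice based on occurrence probabilities.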
TABLE 5: AN OPTIMIZED SET OF 4-MER SUBSEQUENCES
CGTC GTTA ACTA CTAG
TTTT TGTA AATC GTTG
TACC TTGT TTCG GATA
CGGT CTCG AACG GGTA
The target subsequences in Table 5 were chosen from
all possible 256 4-mers. There are 2.41 CDSs per hash code
on average. There were 692 CDSs (out of 12000) which are not
complementary to any of these subsequences.
TABLE 6: AN OPTIMIZED SET OF 5-MER SUBSEQUENCES
AGGCA ACTGT GTCTC TGTGC
CAACT GCCCC ACTAC GTGAC
GCACC GTCTG GCCTC CAGGT
AGGGG GGAAC GCTCC GCTCT
The target subsequences in Table 6 were chosen from
the 300 most frequently occurring 5-mers. There are 2.33
CDSs per hash code on average. There were 829 CDSs (out of
12000) which are not complementary to any of these
subsequences.
TABLE 7: AN OPTIMIZED SET OF 6-MER SUBSEQUENCES
TCCTCA CCAGGC AGCAGC CTCCTG
AGCTGG CTCTGG CCAGGG CAGAGA
GCCTGG ACTGGA CACCAT GCTGTG
ACTGT~ TCTGTG CCAAGG CCTGGA
The target subsequences in Table 7 were chosen from
the 200 most frequently occurring 6-mers. There were 2.63
CDSs per hash code on average. There are 1530 CDSs (out of
12000) which are not complementary to any of these
subsequences.
TABLE 8: AN OPTIMIZED SET OF 4-, 5-, AND 6-MER SUBSEQUENCES
CTCG TTCG GATA TTTT
CTAG GGTA ACTGT ACTAC
CAACT GTCTG AGGCA GCACC
TGTGC GGAAC AGGGG CTCCTG
The target subsequences in Table 8 were chosen from
sets in Tables 1-3. There were 2.22 CDSs per hash code on
average. There are 715 CDSs (out of 12000) which are not
complementary to any of these subsequences.
The bias of the selected CDSs toward short
sequences, on average less than the length of a typical
gene, partially explains the 5-10% of CDSs that were not
complementary to any selected target subsequence. Longer
sequences would be expected to have more hits as they have
more variability. Also, more target subsequences can be
chosen to improve coverage. The 2.2 to 2.6 CDSs per
individual hash code is partially explained by replication in
the selected database. No attempt was made to ensure each
CDS is unique in the selected database.
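The two statistics quoted for Tables 5-8 (CDSs per hash code, and CDSs matching no target subsequence) can be illustrated with a minimal sketch. It assumes that a hash code is the presence/absence pattern of the target subsequences in a sequence, that "complementary to" can be modeled as a plain substring match, and that the all-zero pattern is excluded from the average. The function names and the toy sequences are hypothetical and are not taken from the CDS database of Sec. 6.1.

```python
from collections import defaultdict

def hash_code(sequence, targets):
    """Return one 0/1 flag per target subsequence (1 = subsequence present)."""
    return tuple(int(t in sequence) for t in targets)

def coverage_stats(sequences, targets):
    """Group sequences by hash code; return (average group size, uncovered names)."""
    groups = defaultdict(list)
    missed = []
    for name, seq in sequences.items():
        code = hash_code(seq, targets)
        if any(code):
            groups[code].append(name)
        else:
            missed.append(name)  # no target subsequence found in this sequence
    per_code = sum(len(g) for g in groups.values()) / len(groups) if groups else 0.0
    return per_code, missed

# Targets: first row of Table 5. Sequences: made-up toy data for illustration.
targets = ["CGTC", "GTTA", "ACTA", "CTAG"]
toy_cds = {
    "cds1": "AACGTCAA",  # contains CGTC
    "cds2": "TTCGTCTT",  # contains CGTC, so it shares a hash code with cds1
    "cds3": "GGCTAGGG",  # contains CTAG
    "cds4": "GGGGGGGG",  # contains none of the targets
}
per_code, missed = coverage_stats(toy_cds, targets)
```

In this toy case two distinct codes cover three sequences, and one sequence is left uncovered, which mirrors on a small scale the kind of figures reported above.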
6.8. QEA™ RESULTS
This subsection presents results from QEA™
experiments directed primarily to the query and tissue modes.
6.8.1. QUERY MODE QEA™ RESULTS
The pattern of gene expression differs from tissue
to tissue, and is modulated both during normal development
and during the progression of many diseases, including
cancer. Query mode QEA™ experiments were used to investigate
differences in gene expression between normal, hyperplastic,
and adenocarcinomatous glandular tissues. We had at our
disposal voxels containing all three types of tissue,
preserved in such a way that the adjacent tissue sections
were available for later in situ hybridization. The
following experiments were carried out with normal,
hyperplastic, and adenocarcinomatous tissue, respectively, of
a particular gland.
RNA Extraction and cDNA Synthesis
Isolation of total RNA and poly(A)+ RNA from
homogenized glandular tissue voxels was performed
substantially as described in Sec. 6.3.1. cDNA was prepared
substantially as described in Sec. 6.3.4.
Quantitative Expression Analysis
QEA™ reactions were performed by the preferred RE
embodiment substantially as described in Sec. 6.4.4. This
included the following steps.
Adapter Annealing
Pairs of 12-base and 24-base primers were pre-annealed
at a ratio of 2:1 (12 mer : 24 mer) at a
concentration of 5 picomoles 24 mer per microliter in 1X NEB
2 buffer. For linker/primer hybridization, the
oligonucleotide mixture was heated to 50°C for 10 minutes,
and allowed to cool slowly to room temperature. For this
experiment, 10 picomoles of JC3 and 5 picomoles of JC24, and
10 picomoles of RC6 and 5 picomoles of RC24 were separately
pre-annealed. The sequences of JC3, JC24, RC6, and RC24 are
listed in Table 10 of Sec. 6.10, infra.
Restriction-Digestion/Ligation Reaction
Reactions were prepared for use in a 8-well
thermal cycler format. Glandular cDNA isolated from 10
separate voxels of tissue was cut with HindIII and NgoMI, and
pre-annealed linkers were ligated onto the 4 base 5'
overhangs that these enzymes generated. Added per each QEA™
reaction were:
1 Unit of HindIII (New England Biolabs, Beverly MA)
1 Unit of NgoMI (New England Biolabs, Beverly MA)
1 µl of pre-annealed JC3/JC24
1 µl of pre-annealed RC6/RC24
1 µl Ligase/ATP (0.2 µl T4 DNA Ligase (1 Unit/µl) and
0.8 µl 10 mM ATP - Life Technologies, Gaithersburg
MD)
0.5 µl 50 mM MgCl2
10 ng of glandular cDNA
1 µl 10X NEB 2 Buffer (New England Biolabs, Beverly MA)
Total volume of 10 µl with H2O
The temperature profile of Fig. 16A was performed
using a PTC-100 Thermal Cycler (MJ Research, Watertown MA).
Amplification Reaction
The products of the RE/ligation reaction were then
amplified using RC24 and JC24 primers. The PCR reaction mix
included:
10 µl 5X E-Mg (300 mM Tris-HCl pH 9.0, 75 mM (NH4)2SO4)
100 pm RC24
100 pm JC24
1 µl 10 mM dNTP mix (Life Technologies, Gaithersburg MD)
2.5 Units 50:1 Taq polymerase (Life Technologies,
Gaithersburg MD): Pfu polymerase (Stratagene, La
Jolla CA) mix
Total volume of 40 µl with H2O.
40 µl preheated PCR reaction mix was added to each
restriction-digestion/ligation reaction. The temperature
profile of Fig. 16B was performed using a PTC-100 Thermal
Cycler (MJ Research, Watertown MA).
QEA™ Analysis
The reaction products were separated on a 5%
acrylamide sequencing gel, and detected by silver staining.
Lane-to-lane comparisons were made both by visual inspection
of the gel, and by comparing computer enhanced images
obtained from scanning the gel using standard computer
scanner equipment. One particular band of length X bp was
differentially expressed, being prominent in some samples but
absent in others. This band was picked from the gel, PCR
re-amplified, and sequenced.
QEA™ analysis was performed substantially as
described in Sec. 5.4.1 using the CDS database constructed as
described in Sec. 6.1. Four possible sequences in that
database were found to be possible contributors to a fragment
of Y bp (note that Y bp = X - 46 bp, where PCR primers add 46
bp to the fragment length), sequences A, B, C, and D.
Analysis of the sequencing of the picked band confirmed that
this DNA fragment was produced by sequence C, which is
presently entered in GenBank. This result confirms the
correct functioning of the integrated experimental and
analysis methods.
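The relation Y = X - 46 bp and the lookup of candidate sequences can be sketched as follows. This is only an illustration under stated assumptions: the table of predicted fragment lengths is hypothetical toy data standing in for the output of the mock digest of Sec. 5.4.1, and the function names are not taken from the specification.

```python
ADAPTER_BP = 46  # length added to each fragment by the ligated primers (from the text)

def fragment_length(band_bp):
    """Convert an observed gel band length X into the cDNA fragment length Y."""
    return band_bp - ADAPTER_BP

def candidates(band_bp, predicted):
    """Return names of sequences predicted to yield a fragment of matching length.

    predicted maps a sequence name to its list of mock-digest fragment lengths.
    """
    target = fragment_length(band_bp)
    return sorted(name for name, lengths in predicted.items() if target in lengths)

# Hypothetical predictions: four sequences share a 120 bp fragment, one does not.
predicted = {
    "A": [120, 300],
    "B": [120],
    "C": [88, 120],
    "D": [120, 415],
    "E": [97],
}
hits = candidates(166, predicted)  # 166 bp band corresponds to a 120 bp fragment
```

As in the experiment described above, a length match alone can leave several candidates, and sequencing of the picked band is what resolves the ambiguity.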
Further, analysis of sequence C predicted that a
second double-digest, using REs BspHI and BstYI, would yield
a second, non-overlapping restriction fragment of Z bp in
length (plus the 46 bp of ligated primers). A second QEA™
reaction was performed using these glandular cDNAs. The
previously described experimental conditions were used, with
the exception of substituting BspHI, BstYI, RA5/RA24 and
JC9/JC24 for HindIII, NgoMI, JC3/JC24 and RC6/RC24 during the
RE/ligation reaction and of substituting RA24 and JC24 during
the amplification reaction. Analysis of the results of this
second QEA™ experiment on silver-stained acrylamide gels, as
above, revealed the presence of a band of the predicted size,
Z + 46 bp, that was also differentially expressed in the same
tissue samples as the X bp fragment. This result confirms
the correct functioning of the mock digest prediction methods
coupled with a subsequent actual experimental digest.
Additional hybrid primers were designed to
facilitate direct sequencing of QEA™ products and the direct
generation of RNA probes for the in situ hybridization to the
original tissue sample. The M13-21 primer or the M13 reverse
primer (in italics) were fused to the first 23 nucleotides of
JC24 and RC24 (in bold), respectively, to allow direct
sequencing of the double-digested QEA™ products.
M13-21J + JA24: 5' GGC GCG CCT GTA AAA CGA CGG CCA GTA
CCG ACG TCG ACT ATC CAT GAA G 3' (SEQ ID NO:56)
M13revR + RA24: 5' AAA ACT GCA GGA AAC AGC TAT GAC CAG
CAC TCT CCA GCC TCT CAC CGA 3' (SEQ ID NO:57)
In order to enable direct generation of anti-sense RNA probes
for in situ hybridization, the phage T7 promoter (in
italics) was fused to the first 23 nucleotides of JA24/JC24
and RA24/RC24 (in bold).
T7 + JA24: 5' ACT TCG AAA TTA ATA CGA CTC ACT ATA GGG ACC
GAC GTC GAC TAT CCA TGA AG 3' (SEQ ID NO:58)
T7 + RA24: 5' ACT TCG AAA TTA ATA CGA CTC ACT ATA GGG AGC
ACT CTC CAG CCT CTC ACC GA 3' (SEQ ID NO:59)
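The construction of these hybrid primers (a 5' tail fused to the first 23 nucleotides of a 24-mer adapter primer) can be checked with a short sketch. The 30-nucleotide tail used here is read off the listed T7 + JA24 sequence, and JC24 and RC24 are taken from Table 10; the helper names are hypothetical.

```python
def strip_spaces(seq):
    """Remove the triplet spacing used in the sequence listings."""
    return seq.replace(" ", "")

JC24 = "ACC GAC GTC GAC TAT CCA TGA AGC"  # Table 10
RC24 = "AGC ACT CTC CAG CCT CTC ACC GAC"  # Table 10
T7_TAIL = "ACT TCG AAA TTA ATA CGA CTC ACT ATA GGG"  # tail of the listed T7 primers

def hybrid_primer(tail, adapter_primer, n=23):
    """Fuse a 5' tail to the first n nucleotides of an adapter primer."""
    return strip_spaces(tail) + strip_spaces(adapter_primer)[:n]

t7_j = hybrid_primer(T7_TAIL, JC24)
t7_r = hybrid_primer(T7_TAIL, RC24)

# The listed T7 + JA24 sequence, for comparison with the constructed primer.
LISTED_T7_J = strip_spaces(
    "ACT TCG AAA TTA ATA CGA CTC ACT ATA GGG ACC GAC GTC GAC TAT CCA TGA AG"
)
```

Because the A and C variants of each 24-mer differ only in the final base, dropping that base gives one hybrid primer that serves both JA24 and JC24 (or both RA24 and RC24).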
6.8.2. TISSUE MODE QEA™ RESULTS
Isolation of Human Placental Lactogen using QEA™
Lactogen is one of the most highly expressed genes
in the human placenta and has a known sequence. The sequence
of lactogen was retrieved from GenBank and mock digestion
reactions were performed, substantially as described in
Sec. 5.4.1, with a wide selection of possible RE pairs. These
mock digestions showed that digesting placental cDNA with the
restriction enzymes BssHII and XbaI yields a lactogen
fragment of 166 bp in length.
RNA Extraction and cDNA Synthesis
Isolation of total RNA and poly(A)+ RNA from
homogenized human placenta tissue was performed substantially
as described in Sec. 6.3.1. cDNA was prepared substantially
as described in Sec. 6.3.4.
Quantitative Expression Analysis
QEA™ reactions were performed by the preferred RE
embodiment substantially as described in Sec. 6.4.3. This
included the following steps.
Adapter Annealing
Pairs of 12-base and 24-base primers were pre-annealed
at a ratio of 2:1 (12 mer : 24 mer) at a
concentration of 5 picomoles 24 mer per microliter in 1X NEB
2 buffer. The oligonucleotide mixture was heated to 50°C for
10 minutes, and allowed to cool slowly to room temperature.
For this experiment, 10 picomoles of RC8 and 5 picomoles of
RC24, and 10 picomoles of JC7 and 5 picomoles of JC24 were
separately pre-annealed. The sequences of RC8, RC24, JC7,
and JC24 are set forth in Table 10 of Sec. 6.10, infra.
Restriction-Digestion/Ligation Reaction
Reactions were prepared for use in a 8-well thermal
cycler format. Placental cDNA was cut with BssHII and XbaI,
and pre-annealed adapters were ligated onto the 4 base 5'
overhangs that these enzymes generated. Added per reaction
were:
1 Unit of BssHII (New England Biolabs, Beverly MA)
1 Unit of XbaI (New England Biolabs, Beverly MA)
1 µl of pre-annealed RC8/RC24
1 µl of pre-annealed JC7/JC24
1 µl Ligase/ATP (0.2 µl T4 DNA Ligase (1 Unit/µl) and
0.8 µl 10 mM ATP - Life Technologies, Gaithersburg
MD)
0.5 µl 50 mM MgCl2
10 ng of placental cDNA
1 µl 10X NEB 2 Buffer (New England Biolabs, Beverly MA)
Total volume of 10 µl with H2O.
The temperature profile of Fig. 16A was performed
using a PTC-100 Thermal Cycler (MJ Research, Watertown MA).
Amplification Reaction
The products of the RE/ligation reaction were then
amplified using RC24 and JC24 primers (see Table 10, infra).
The PCR reaction mix included:
10 µl 5X E-Mg (300 mM Tris-HCl pH 9.0, 75 mM (NH4)2SO4)
100 pm RC24
100 pm JC24
1 µl 10 mM dNTP mix (Life Technologies, Gaithersburg MD)
2.5 Units 50:1 Taq polymerase (Life Technologies,
Gaithersburg MD): Pfu polymerase (Stratagene, La
Jolla CA) mix.
Total volume of 40 µl with H2O.
40 µl preheated PCR reaction mix was added to each
restriction-digestion/ligation reaction. The temperature
profile of Fig. 16B was performed using a PTC-100 Thermal
Cycler (MJ Research, Watertown MA).
QEA™ Analysis
The reaction products were separated on a 5%
acrylamide sequencing gel and detected by silver staining. A
prominent band of size 212 bp was seen. This was predicted
to correspond to the 166 bp lactogen BssHII-XbaI fragment,
with JC24 ligated to the BssHII site, and RC24 ligated to the
XbaI site. To prove that this band did indeed correspond to
lactogen, the 212 bp band was excised from the gel,
re-amplified using JC24 and RC24, and the fragment was
sequenced. Analysis of these sequencing results proved that
the fragment was from lactogen. Moreover, the lactogen
sequence ended at the expected 4 base remnant of the
restriction site, immediately followed by either JC24 (at the
BssHII end) or RC24 (at the XbaI end).
This result confirmed the experimental design
methods of Sec. 5.4.2 applied to selection of a QEA™
experiment to identify certain sequences of interest, in this
case the human placental lactogen sequence, in a tissue cDNA
sample. These design methods resulted in the selection of an
experiment which successfully identified the gene intended.
Further QEA™ experiments were done according to the
protocols of this section on human placental derived cDNA
with differing enzyme combinations. One unit of each enzyme
of the enzyme combinations listed in the first column of
Table 9 was used in the restriction-digestion/ligation
reaction protocol. Primers and linkers for each RE were
chosen according to Table 10, with one appropriate "J" series
linker and primer and one appropriate "R" series linker and
primer used in each reaction. The reaction products were
separated by electrophoresis on a 5% acrylamide gel and the
bands detected by silver staining. Fragments from bands with
the lengths listed in the second column of Table 9 were
removed from the gel and sequenced. Sequencing identified
the subsequences on the ends of the fragments and the precise
lengths of each fragment. Each subsequence was
characteristic of one of the REs used, confirming correct
action of the ligation and amplification protocols. The
third column of Table 9 lists end subsequences, with a "1"
indicating the recognition subsequence of RE "Enz1" and a "2"
indicating the recognition subsequence of RE "Enz2".
Multiple fragments with the same length but differing
recognition subsequence are placed in separate sub-rows in
Table 9.
Mock digest reactions, as described in Sec. 5.4.1,
were performed using the CDS database selected according to
Sec. 6.1. These mock digestion reactions searched this CDS
database for sequences having recognition subsequences for
the REs and such that the recognition subsequences are spaced
apart in order to produce the fragments with the lengths
listed. This search identified the database accession
numbers listed in the fourth column of Table 9. The gene
responsible for each accession number was determined from a
GenBank lookup and is listed in the fifth column of Table 9.
Each such gene and its accompanying accession
numbers is listed in a further sub-row. Multiple accession
numbers associated with one gene reflect the redundancy
present in current CDS DNA sequence databases.
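A mock digest of the kind described can be sketched in a few lines. This is a simplified illustration, not the method of Sec. 5.4.1: it assumes that fragment length is measured between top-strand cut positions of adjacent recognition sites, it ignores overlapping or degenerate sites, and the test sequence is made up. The recognition sites and cut offsets for HindIII (A^AGCTT) and NgoMI (G^CCGGC) are the standard ones; the function names are hypothetical.

```python
import re

# Recognition site and top-strand cut offset within the site.
ENZYMES = {
    "HindIII": ("AAGCTT", 1),
    "NgoMI": ("GCCGGC", 1),
}

def cut_sites(seq, enzyme_names):
    """Return sorted (cut position, enzyme name) pairs for the given enzymes."""
    sites = []
    for name in enzyme_names:
        site, offset = ENZYMES[name]
        # Lookahead so that every occurrence is reported, even overlapping ones.
        for match in re.finditer(f"(?={site})", seq):
            sites.append((match.start() + offset, name))
    return sorted(sites)

def mock_digest(seq, enz1, enz2):
    """Return (fragment length, end labels) for each pair of adjacent cut sites.

    End labels follow the Table 9 convention: 1 for Enz1, 2 for Enz2.
    """
    sites = cut_sites(seq, (enz1, enz2))
    fragments = []
    for (pos_a, enz_a), (pos_b, enz_b) in zip(sites, sites[1:]):
        ends = tuple(sorted((1 if enz_a == enz1 else 2, 1 if enz_b == enz1 else 2)))
        fragments.append((pos_b - pos_a, ends))
    return fragments

# Toy sequence: one HindIII site followed 16 bp downstream by one NgoMI site.
toy = "TT" + "AAGCTT" + "C" * 10 + "GCCGGC" + "TT"
fragments = mock_digest(toy, "HindIII", "NgoMI")
```

Running such a function over every entry of a sequence database and indexing the results by fragment length and end labels gives a lookup of the same shape as the fourth column of Table 9.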
For all but one of the fragments recovered from the
gel, the sequence for the fragment corresponded to one of the
genes identified by the mock digestion reaction as causing
that fragment. This particular gene is indicated by
displaying the gene name in underscore and bold in the fifth
column of Table 9. That the gene determined by sequencing
the separated fragment matched the prediction of the database
search confirms the efficacy of the experimental protocols
and the computer implemented experimental analysis and
ambiguity resolution methods of Sec. 5.4.1 and Sec. 5.4.3 for
tissue mode QEA™. In fact, the mock digestion reactions
provide a simple way of identifying possible ambiguities in
DNA sequence databases.
TABLE 9: PLACENTA GENE CALLS
RE Combination    Fragment  End      Database Acc.     Gene Causing Fragment
(Enz1 & Enz2)     Length    Subseq.  Numbers
BglII & BspEI     97        1,1      X07767            cAMP-Dependent Protein Kinase
                            1,2      J03278, M21616    PDGF Receptor
                                     D23660, L20868,   Ribosomal Protein L4
                                     X73974
                            2,2      M74096            Long Chain Acyl-Dehydrogenase
BamHI & BspEI     112       1,2      L26914, M93718,   Nitric Oxide Synthase
                                     M95296
                                     L22453, M90054,   Ribosomal Protein L3a
                                     X73460
BglII & BspEI     115       1,2      M20496, X05256    Cathepsin L
BglII & NgoMI     137       2,2      X55740            5'-Nucleotidase
                  137       1,2      L18967            TRP2 Dopachrome Tautomerase
                                     L10386            Transglutaminase E3
                                     S69231            Tyrosinase-Related Protein
                                     X56998, X5699s    Ubiquitin
EcoRI & BclI      139       1,2      U14967            Ribosomal Protein L21
BclI & NgoMI      144       1,2      J02984            Ribosomal Protein S15
                                     U04683, X30391    Olfactory Receptor OR17-40
                  144       2,2      L12700            Engrailed-2
BamHI & BspEI     144       1,2      X97234            Ribosomal Protein Ll~
                                     X14362            C3B/C4B Receptor
EcoRI & HindIII   146       1,2      M13932            Ribosomal Protein S17
BssHII & XbaI     166       1,2      J00118, V00573    Lactogen
BclI & NgoMI      168       1,2      S56985, X63527    Ribosomal Protein L19
BamHI & BspEI     173       1,1      S59493, U10323    Nuclear Factor NF45
                            1,2      M20882, M23575,   Pregnancy Sp. Glycoprotein
                                     M31125, M33666,   beta 1
                                     M34420, M37399,
                                     M69245, M93061
BglII & NgoMI     192       1,1      D26350            Inositol Triphosphatase Receptor
                                     L27711, L25876    Protein Phosphatase CIP2/KAP1
                            1,2      D29992, L27624    Tissue Factor Pathway Inhibitor 2
BglII & AgeI      215       1,2      M11353, M11354    Histone H3.3
6.9. COLONY CALLING
The colony calling embodiment comprises the
principal steps of cDNA library filter construction, PNA
hybridization, and detection of hybridization. Determination
of the sequence in a sample is done by the previously
described computer implemented CC experimental analysis
methods.
cDNA library filter construction
This protocol comprises three steps: first, robotic
picking of colonies into microtiter plates, second, PCR
amplification of inserts, and third, spotting of amplified
cDNA inserts onto filters.
1. Colony picking -
a) Libraries are plated out at a density of 1,000-
10,000 colonies per 100 mm Petri dish and are picked
using a robot into 384 well microtiter plates containing
50 µl of TB medium with the appropriate antibiotic. There
are several commercially available robots to do this
task. The preferable robot is from the Washington
University Human Genome Sequencing Center (St. Louis,
MO).
b) The picked colonies are grown for 8 hours at 37°C,
and are frozen for archiving.
2. PCR amplification -
PCR primer pairs designed for insert amplification are
dispensed with a standard 25 µl PCR mix into 96 well
microtiter plates. A 96 prong transfer tool picks and
transfers samples to provide amplification templates from
the 384 well colony plates into the 96 well PCR mixes. A
standard 25 cycle amplification protocol generates 100-
500 ng of insert DNA.
3. Spotting on filters -
The PCR products are pooled back into 384 well format
microtiter plates identical to the colony plates above.
Spotting onto filters is a service performed by Research
Genetics (Huntsville, AL).
Alternatively, cDNA library filters may be obtained
from commercial sources in certain cases.
PNA hybridization and detection
PNAs are commercially available from Perseptive
Biosystems (Bedford, MA). The protocol below uses 8 dyes on
16 different degenerate sets of PNA 8-mers containing as
common subsequences the optimized 6-mer subsequences from
Table 7. Thereby, complete classification and determination
of expressed genes in a human tissue can be done with only 4
hybridizations generating a code of length 32. Actual
conditions for stringency may vary depending on the PNA set
used.
1. Hybridization -
A pool of 8 PNAs is used, labeled with 8 different
fluorochromes made up at a concentration of 0.1 µg/ml in
10 mM Phosphate buffer, pH 7.0, 1X Denhardt's solution
(20 mg/ml Ficoll 400, polyvinylpyrrolidone, and BSA). The
arrayed filters are hybridized for 16 hrs at 25°C, and
washed 3 times in the above buffer without PNAs at a
temperature which maximizes signal/noise.
2. Visualization -
A fluorescent detection system, such as used for DNA
analysis, can be used to distinguish the dyes, and thus
the PNAs, present at each filter hybridization position.
PNA presence or absence defines a code for each
hybridization position on the filter.
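How dye presence or absence over successive hybridizations yields a code for one filter position can be sketched as follows. The sketch assumes 8 distinguishable dyes per hybridization and 4 hybridizations, giving the 32-position code mentioned above; the representation of a reading as a set of dye indices and the function name are hypothetical.

```python
def hybridization_code(observations, dyes_per_round=8):
    """Build a presence/absence code for one filter position.

    observations is a list with one entry per hybridization; each entry is the
    set of dye indices (0 to dyes_per_round - 1) detected at that position.
    """
    bits = []
    for seen in observations:
        bits.extend("1" if dye in seen else "0" for dye in range(dyes_per_round))
    return "".join(bits)

# Toy readings for one spot over 4 hybridizations (made-up data).
code = hybridization_code([{0, 3}, set(), {7}, {1, 2}])
```

Each spot on the filter would receive one such code, which can then be compared with the codes predicted for database sequences.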
6.10. PREFERRED QEA™ ADAPTERS AND RE PAIRS
Table 10 lists preferred primer-linker pairs that
may be used as adapters for the preferred RE embodiment of
QEA™. The primers listed cover possible double-digest RE
combinations involving the approximately 56 available REs
generating a 5' 4 bp overhang. There are 40 such REs
available from New England Biolabs. For each QEA™ double
digest reaction, one primer and one linker from the "R"
series corresponding to one of the pair of REs and one primer
and one linker from the "J" series corresponding to the other
of the pair of REs are used together. This choice satisfies
the adapter characteristics previously described. Two pairs
from the same series are not compatible during amplification.
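The rule of taking one adapter from the "J" series and one from the "R" series can be sketched as a simple lookup. The linker names below are read off Table 10 for a handful of enzymes and cover the C series only; HindIII is mapped to JC3/RC3 as in the experiments of Sec. 6.8. The function name and the dictionary layout are hypothetical.

```python
# Linker for each enzyme, C series only (subset of Table 10).
J_LINKER = {"HindIII": "JC3", "NgoMI": "JC6", "BssHII": "JC7", "XbaI": "JC8", "BstYI": "JC9"}
R_LINKER = {"HindIII": "RC3", "NgoMI": "RC6", "BssHII": "RC7", "XbaI": "RC8", "BstYI": "RC9"}

def choose_adapters(j_enzyme, r_enzyme):
    """Return ((J linker, J primer), (R linker, R primer)) for a double digest.

    One enzyme is served by the J series and the other by the R series, since
    two pairs from the same series are not compatible during amplification.
    """
    if j_enzyme == r_enzyme:
        raise ValueError("a double digest needs two different enzymes")
    return (J_LINKER[j_enzyme], "JC24"), (R_LINKER[r_enzyme], "RC24")

query_mode = choose_adapters("HindIII", "NgoMI")   # glandular cDNA experiment
tissue_mode = choose_adapters("BssHII", "XbaI")    # placental lactogen experiment
```

The two example calls reproduce the adapter pairs used in Sec. 6.8.1 (JC3/JC24 with RC6/RC24) and Sec. 6.8.2 (JC7/JC24 with RC8/RC24).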
TABLE 10: SAMPLE ADAPTERS
Adapter   Primer (longer strand) or                                RE
Series    Linker (shorter strand)
Notes: 'm' signifies an optional label or capture moiety.
RA24      5' m-AGC ACT CTC CAG CCT CTC ACC GAA 3' (SEQ ID NO:1)
RA1       3' AG TGG CTT TTAA (SEQ ID NO:2)                         Tsp509I, MfeI, EcoRI
RA5       3' AG TGG CTT GTAC (SEQ ID NO:3)                         NcoI, BspHI
RA6       3' AG TGG CTT GGCC (SEQ ID NO:4)                         XmaI, NgoMI, BspEI
RA7       3' AG TGG CTT GCGC (SEQ ID NO:5)                         BssHII, AscI
RA8       3' AG TGG CTT GATC (SEQ ID NO:6)                         AvrII, NheI, XbaI
RA9       3' AG TGG CTT CTAG (SEQ ID NO:7)                         DpnII, BamHI, BclI
RA10      3' AG TGG CTT CGCG (SEQ ID NO:8)                         KasI
RA11      3' AG TGG CTT CCGG (SEQ ID NO:9)                         EagI, Bsp120I, NotI, EaeI
RA12      3' AG TGG CTT CATG (SEQ ID NO:10)                        BsiWI, Acc65I, BsrGI
RA14      3' AG TGG CTT AGCT (SEQ ID NO:11)                        XhoI, SalI
RA15      3' AG TGG CTT ACGT (SEQ ID NO:12)                        ApaLI
RA16      3' AG TGG CTT AATT (SEQ ID NO:13)                        AflII
RA17      3' AG TGG CTT AGCA (SEQ ID NO:14)                        BssSI
RC24      5' m-AGC ACT CTC CAG CCT CTC ACC GAC 3' (SEQ ID NO:15)
RC1       3' AG TCG CTG TTAA (SEQ ID NO:16)                        Tsp509I, EcoRI, ApoI
RC3       3' AG TCG CTG TCGA (SEQ ID NO:17)                        HindIII
RC5       3' AG TCG CTG GTAC (SEQ ID NO:18)                        BspHI
RC6       3' AG TCG CTG GGCC (SEQ ID NO:19)                        AgeI, NgoMI, BspEI, SgrAI, BsaWI
RC7       3' AG TCG CTG GCGC (SEQ ID NO:20)                        MluI, BssHII, AscI
RC8       3' AG TCG CTG GATC (SEQ ID NO:21)                        SpeI, NheI, XbaI
RC9       3' AG TCG CTG CTAG (SEQ ID NO:22)                        DpnII, BglII, BamHI, BclI, BstYI
RC10      3' AG TCG CTG CGCG (SEQ ID NO:23)                        KasI
RC11      3' AG TCG CTG CCGG (SEQ ID NO:24)                        Bsp120I, NotI
RC12      3' AG TCG CTG CATG (SEQ ID NO:25)                        Acc65I, BsrGI
RC14      3' AG TCG CTG AGCT (SEQ ID NO:26)                        SalI
RC15      3' AG TCG CTG ACGT (SEQ ID NO:27)                        Ppu10I, ApaLI
JA24      5' m-ACC GAC GTC GAC TAT CCA TGA AGA 3' (SEQ ID NO:28)
JA1       3' GT ACT TCT TTAA (SEQ ID NO:29)                        Tsp509I, MfeI, EcoRI
JA5       3' GT ACT TCT GTAC (SEQ ID NO:30)                        NcoI, BspHI
JA6       3' GT ACT TCT GGCC (SEQ ID NO:31)                        XmaI, NgoMI, BspEI
JA7       3' GT ACT TCT GCGC (SEQ ID NO:32)                        BssHII, AscI
JA8       3' GT ACT TCT GATC (SEQ ID NO:33)                        AvrII, NheI, XbaI
JA9       3' GT ACT TCT CTAG (SEQ ID NO:34)                        DpnII, BamHI, BclI
JA10      3' GT ACT TCT CGCG (SEQ ID NO:35)                        KasI
JA11      3' GT ACT TCT CCGG (SEQ ID NO:36)                        EagI, Bsp120I, NotI, EaeI
JA12      3' GT ACT TCT CATG (SEQ ID NO:37)                        BsiWI, Acc65I, BsrGI
JA14      3' GT ACT TCT AGCT (SEQ ID NO:38)                        XhoI, SalI
JA15      3' GT ACT TCT ACGT (SEQ ID NO:39)                        ApaLI
JA16      3' GT ACT TCT AATT (SEQ ID NO:40)                        AflII
JA17      3' GT ACT TCT AGCA (SEQ ID NO:41)                        BssSI
JC24      5' m-ACC GAC GTC GAC TAT CCA TGA AGC 3' (SEQ ID NO:42)
JC1       3' GT ACT TCG TTAA (SEQ ID NO:43)                        Tsp509I, EcoRI, ApoI
JC3       3' GT ACT TCG TCGA (SEQ ID NO:44)                        HindIII
JC5       3' GT ACT TCG GTAC (SEQ ID NO:45)                        BspHI
JC6       3' GT ACT TCG GGCC (SEQ ID NO:46)                        AgeI, NgoMI, BspEI, SgrAI, BsrFI, BsaWI
JC7       3' GT ACT TCG GCGC (SEQ ID NO:47)                        MluI, BssHII, AscI
JC8       3' GT ACT TCG GTAC (SEQ ID NO:48)                        SpeI, NheI, XbaI
JC9       3' GT ACT TCG CTAG (SEQ ID NO:49)                        DpnII, BglII, BamHI, BclI, BstYI
JC10      3' GT ACT TCG CGCG (SEQ ID NO:50)                        KasI
JC11      3' GT ACT TCG CCGG (SEQ ID NO:51)                        Bsp120I, NotI
JC12      3' GT ACT TCG CATG (SEQ ID NO:52)                        Acc65I, BsrGI
JC14      3' GT ACT TCG AGCT (SEQ ID NO:53)                        SalI
JC15      3' GT ACT TCG ACGT (SEQ ID NO:54)                        Ppu10I, ApaLI
In the case where one of the primers is conjugated
to a capture moiety, Table 11 lists RE pairs and the corresponding
primer/linker combinations that have been tested. This table
supplements Table 10. Biotin can be conjugated to primers by
using standard phosphoramidate chemistry.
TABLE 11: TESTED RE PAIRS AND BIOTINYLATED ADAPTERS
RE 1       RE 2       Adapter 1   Adapter 2
Adapter 1: Choose labeled primer JA24 or JC24 to match the
linker according to Table 10.
Adapter 2: Choose biotinylated primer RA24 or RC24 to match the
linker according to Table 10.
BamHI      BspHI      JC9         RA5
BglII      BspHI      JA5         RC9
BglII      EcoRI      JC1         RC9
BglII      HindIII    JC3         RC9
BglII      BspEI      JC6         RC9
BglII      NcoI       JC9         RA5
BspEI      BstYI      JC6         RC9
BspEI      HindIII    JC6         RC3
BspHI      EcoRI      JA5         RA1
BspHI      HindIII    JC3         RA5
BstYI      EcoRI      JC1         RC9
EcoRI      HindIII    JC3         RA1
BamHI      HindIII    JC9         RC3
BspEI      BspHI      JC6         RA5
BspEI      EcoRI      JC6         RA1
BspHI      BstYI      JA5         RC9
BspHI      NgoMI      JA5         RC6
BstYI      HindIII    JC3         RC9
HindIII    NcoI       JC3         RA5
HindIII    NgoMI      JC3         RC6
Tables 12 and 13 list the RE combinations that have
been tested in QEA™ experiments on human placental and
glandular cDNA samples. The preferred double digests are
those that give more than approximately 50 bands in the range
of 100 to 700 bp. Table 12 lists the preferred RE
combinations for human cDNA analyses.
TABLE 12: PREFERRED RE COMBINATIONS FOR
HUMAN cDNA ANALYSIS
Acc65I & HindIII    Acc65I & NgoMI    BamHI & EcoRI
BglII & HindIII     BglII & NgoMI     BsiWI & BspHI
BspPI & BstYI       BspHI & NgoMI     BsrGI & EcoRI
EagI & EcoRI        EagI & HindIII    EagI & NcoI
HindIII & NgoMI     NgoMI & NheI      NgoMI & SpeI
BglII & BspHI       Bsp120I & NcoI    BssHII & NgoMI
EcoRI & HindIII     NgoMI & XbaI
Table 13 lists other RE combinations tested and
that can be used for human cDNA analyses.
TABLE 13: OTHER RE COMBINATIONS FOR HUMAN cDNA ANALYSIS
AvrII & NgoMI       BamHI & Bsp120I   BamHI & BspHI
BamHI & NcoI        BclI & BspHI      BclI & NcoI
BglII & BspEI       BglII & EcoRI     BglII & NcoI
BssHII & BsrGI      BstYI & NcoI      BamHI & HindIII
BglII & Bsp120I     BspHI & HindIII
Tables 14 and 15 list the RE combinations that have
been tested in QEA™ experiments on mouse cDNA samples. The
preferred double digests are those that give more than
approximately 50 bands in the range of 100 to 700 bp. Table
14 lists the preferred RE combinations for mouse cDNA
analyses.
TABLE 14: PREFERRED RE COMBINATIONS FOR
MOUSE cDNA ANALYSIS
Acc65I & HindIII    Acc65I & NgoMI     AscI & HindIII
AvrII & NgoMI       BamHI & BspHI      BamHI & HindIII
BamHI & NcoI        BclI & NcoI        BglII & BspHI
BglII & HindIII     BglII & NcoI       BglII & NgoMI
Bsp120I & NcoI      Acc65I & BspHI     BspHI & Bsp120I
BspHI & BsrGI       BspHI & EagI       BspHI & NgoMI
BspHI & NotI        BssHII & HindIII   BstYI & HindIII
HindIII & NcoI      HindIII & NgoMI    NcoI & NotI
NgoMI & NheI        NgoMI & SpeI       NgoMI & XbaI
BclI & HindIII
Table 15 lists other RE combinations tested and
that can be used for mouse cDNA analyses.
TABLE 15: OTHER RE COMBINATIONS FOR MOUSE cDNA ANALYSIS
Acc65I & NcoI       BclI & BspHI       BsiWI & BspHI
BsiWI & NcoI        BspHI & HindIII    BsrGI & NcoI
BssHII & NgoMI      BstYI & BspHI      EagI & NcoI
HindIII & MluI
Table 16 lists the data obtained from various RE
combinations using mouse cDNA samples. The number of bands
was observed from silver stained acrylamide separation gels.
TABLE 16: MOUSE cDNA RE DIGESTION RESULTS
RE Combination      Number of Bands
Acc65I & HindIII    200
Acc65I & NgoMI      150
AscI & HindIII      100
AvrII & NgoMI       50
BamHI & BspHI       200
BamHI & HindIII     150
BamHI & NcoI        150
BclI & BspHI        5
BclI & HindIII      150
BclI & NcoI         50
BglII & BspHI       50
BglII & HindIII     150
BglII & NcoI        50
BglII & NgoMI       50
Bsp120I & NcoI      50
BspHI & Acc65I      150
BspHI & Bsp120I     50
BspHI & BsrGI       200
BspHI & EagI        150
BspHI & HindIII     0
BspHI & NgoMI       150
BspHI & NotI        150
BsrGI & NcoI        10
BssHII & HindIII    100
BssHII & NgoMI      20
BstYI & BspHI       20
BstYI & HindIII     200
EagI & NcoI         10
HindIII & MluI      25
HindIII & NcoI      50
HindIII & NgoMI     150
NcoI & NotI         200
NgoMI & NheI        50
NgoMI & SpeI        200
NgoMI & XbaI        50
TOTAL # BANDS       3490
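The split between preferred and other RE combinations, and the stated total, can be checked from the Table 16 band counts with a short sketch. It assumes that "more than approximately 50 bands" is applied as a threshold of 50 or more, which is a reading of the text and not a stated rule; the variable and function names are hypothetical.

```python
# Band counts transcribed from Table 16 (mouse cDNA, silver stained gels).
BANDS = {
    "Acc65I & HindIII": 200, "Acc65I & NgoMI": 150, "AscI & HindIII": 100,
    "AvrII & NgoMI": 50, "BamHI & BspHI": 200, "BamHI & HindIII": 150,
    "BamHI & NcoI": 150, "BclI & BspHI": 5, "BclI & HindIII": 150,
    "BclI & NcoI": 50, "BglII & BspHI": 50, "BglII & HindIII": 150,
    "BglII & NcoI": 50, "BglII & NgoMI": 50, "Bsp120I & NcoI": 50,
    "BspHI & Acc65I": 150, "BspHI & Bsp120I": 50, "BspHI & BsrGI": 200,
    "BspHI & EagI": 150, "BspHI & HindIII": 0, "BspHI & NgoMI": 150,
    "BspHI & NotI": 150, "BsrGI & NcoI": 10, "BssHII & HindIII": 100,
    "BssHII & NgoMI": 20, "BstYI & BspHI": 20, "BstYI & HindIII": 200,
    "EagI & NcoI": 10, "HindIII & MluI": 25, "HindIII & NcoI": 50,
    "HindIII & NgoMI": 150, "NcoI & NotI": 200, "NgoMI & NheI": 50,
    "NgoMI & SpeI": 200, "NgoMI & XbaI": 50,
}

PREFERRED_MIN = 50  # assumption: "more than approximately 50" read as >= 50

def split_by_yield(bands, minimum=PREFERRED_MIN):
    """Return (preferred, other) combination names, each sorted alphabetically."""
    preferred = sorted(name for name, n in bands.items() if n >= minimum)
    other = sorted(name for name, n in bands.items() if n < minimum)
    return preferred, other

preferred, other = split_by_yield(BANDS)
total_bands = sum(BANDS.values())
```

Under this threshold the 35 tested combinations split into 28 preferred and 7 other, and the 28 preferred combinations are the ones listed in Table 14.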
31 available REs that recognize a 6 bp recognition
sequence and generate a 4 bp 5' overhang are: Acc65I, AflII,
AgeI, ApaLI, ApoI, AscI, AvrII, BamHI, BclI, BglII, BsiWI,
Bsp120I, BspEI, BspHI, BsrGI, BssHII, BstYI, EagI, EcoRI,
HindIII, MfeI, MluI, NcoI, NgoMI, NheI, NotI, Ppu10I, SalI,
SpeI, XbaI, and XhoI.
All of these enzymes have been tested in QEA™
protocols according to Sec. 6.4.4 with the exception of
AflII. All were useable except for MfeI, Ppu10I, SalI, and
XhoI. All the other 26 enzymes have been tested and are
usable in the RE implementation of QEA™.
However, certain pairs of these enzymes are less
informative due to the fact that they produce identical
overhangs, and thus their recognition sequences cannot be
distinguished by QEA™ adapters. These pairs are Acc65I and
(BsiWI or BsrGI); AgeI and (BspEI or NgoMI); ApoI and EcoRI;
AscI and (BssHII or MluI); AvrII and (NheI, SpeI, or XbaI);
BamHI and (BclI, BglII, or BstYI); BclI and (BglII or BstYI);
BglII and BstYI; BsiWI and BsrGI; Bsp120I and EagI; BspEI and
NgoMI; BspHI and NcoI; BssHII and MluI; NheI and (SpeI or
XbaI); and SpeI and XbaI.
Thus 301 RE pairs have been tested and are useable
in the RE embodiments of QEA™.
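The figure of 301 pairs follows from the 26 usable enzymes and the listed same-overhang pairs, as the following sketch shows. Enzyme names use the Table 10 spellings (AvrII, NgoMI); the variable names are hypothetical.

```python
from itertools import combinations

# The 26 usable enzymes: the 31 listed, minus AflII, MfeI, Ppu10I, SalI, XhoI.
USABLE = [
    "Acc65I", "AgeI", "ApaLI", "ApoI", "AscI", "AvrII", "BamHI", "BclI",
    "BglII", "BsiWI", "Bsp120I", "BspEI", "BspHI", "BsrGI", "BssHII",
    "BstYI", "EagI", "EcoRI", "HindIII", "MluI", "NcoI", "NgoMI", "NheI",
    "NotI", "SpeI", "XbaI",
]

# Pairs listed in the text as producing identical overhangs.
SAME_OVERHANG = {
    "Acc65I": ["BsiWI", "BsrGI"],
    "AgeI": ["BspEI", "NgoMI"],
    "ApoI": ["EcoRI"],
    "AscI": ["BssHII", "MluI"],
    "AvrII": ["NheI", "SpeI", "XbaI"],
    "BamHI": ["BclI", "BglII", "BstYI"],
    "BclI": ["BglII", "BstYI"],
    "BglII": ["BstYI"],
    "BsiWI": ["BsrGI"],
    "Bsp120I": ["EagI"],
    "BspEI": ["NgoMI"],
    "BspHI": ["NcoI"],
    "BssHII": ["MluI"],
    "NheI": ["SpeI", "XbaI"],
    "SpeI": ["XbaI"],
}

excluded = {frozenset((a, b)) for a, partners in SAME_OVERHANG.items() for b in partners}
informative = [p for p in combinations(USABLE, 2) if frozenset(p) not in excluded]
```

There are 325 unordered pairs of 26 enzymes; removing the 24 same-overhang pairs leaves 301.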
6.10.1. PREFERRED SEQ-QEA™ ENZYMES AND ADAPTERS
Table 17 lists exemplary Type IIS REs adaptable to a
SEQ-QEA™ embodiment and their important characteristics. For
each RE, the table lists the recognition sequence on each
strand of a dsDNA molecule and the distance in bp from that
recognition sequence to the location of strand cutting. Also
listed is the net overhang generated.
TABLE 17: SAMPLE TYPE IIS REs
Type IIS   Recog.       Dist. to     Over-    Comment
RE         Seqs.        cutting      hang
                        site (bp)    (bp)
FokI       5'-GGATG     9            4
              CCTAC     13
HgaI       5'-GACGC     5            5
              CTGCG     10
BbvI       5'-GCAGC     8            4
              CGTCG     12
BsmFI      5'-GGGAC     10           4        Lower recognition
              CCCTG     14                    site specificity
BspMI      5'-ACCTGC    4            4
              TGGACG    8
SfaNI      5'-GCATC     5            4
              CGTAG     9
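The overhang column of Table 17 is the difference between the two cutting distances, which a short sketch can confirm. The tuple layout (recognition sequence, top-strand distance, bottom-strand distance) and the function names are hypothetical; the numbers are those of Table 17.

```python
# (top-strand recognition sequence, top-strand cut distance, bottom-strand cut distance)
TYPE_IIS = {
    "FokI": ("GGATG", 9, 13),
    "HgaI": ("GACGC", 5, 10),
    "BbvI": ("GCAGC", 8, 12),
    "BsmFI": ("GGGAC", 10, 14),
    "BspMI": ("ACCTGC", 4, 8),
    "SfaNI": ("GCATC", 5, 9),
}

def overhang(enzyme):
    """Net overhang in bp: bottom-strand distance minus top-strand distance."""
    _, top, bottom = TYPE_IIS[enzyme]
    return bottom - top

def top_strand_cut(seq, enzyme):
    """Index after which the top strand is cut, for the first site found in seq."""
    site, top, _ = TYPE_IIS[enzyme]
    start = seq.index(site)  # raises ValueError if the site is absent
    return start + len(site) + top

overhangs = {name: overhang(name) for name in TYPE_IIS}
```

For example, a FokI site starting at index 2 of a sequence gives a top-strand cut after index 2 + 5 + 9 = 16, that is, 9 bp beyond the end of the recognition sequence.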
Table 18 lists exemplary primer and linker
combinations adaptable to a SEQ-QEA™ method. They satisfy
the previously described requirements on primers and linkers.
Except for the indicated differences, they are the same as
the primers and linkers of similar names in Table 10. RA24-U
and RC24-U have a 5' biotin capture moiety and a uracil
release means as indicated, and are adaptable to the same
linkers and REs as are RA24 and RC24 of Table 10. RA24-S and
RC24-S also have a 5' biotin capture moiety with an AscI
recognition site release means as indicated in bold and
underlining, and are adaptable to the same linkers and REs as
are RA24 and RC24 of Table 10. JA24-K has an internal FokI
recognition site as indicated and a 5' FAM label moiety (see
Table 19). The FokI recognition site is optimally placed to
be used with a RE producing a 4 bp overhang. Linkers KA5,
KA6, and KA9 corresponding to the indicated REs function with
this primer. JC24-B has an internal BbvI recognition site, a
5' FAM label, and functions with linkers BA5 and BA9. The
BbvI recognition site is also optimally placed to be used
with a RE producing a 4 bp overhang.
TABLE 18: SAMPLE ADAPTERS
Adapter   Primer (longer strand) or                                RE
Series    Linker (shorter strand)
Notes: 'b' signifies a biotin moiety
       'f' signifies a FAM label moiety
RA24-U    5' b-AGC ACT CTC CAG CCU CTC ACC GAA 3' (SEQ ID NO:??)
RA24-S    5' b-AGC ACT CTG GCG CGC CTC ACC GAA 3' (SEQ ID NO:??)
RC24-U    5' b-AGC ACT CTC CAG CCU CTC ACC GAC 3' (SEQ ID NO:??)
RC24-S    5' b-AGC ACT CTG GCG CGC CTC ACC GAC 3' (SEQ ID NO:??)
JA24-K    5' f-ACC GAC GTC GAC TAT GGA TGA AGA 3' (SEQ ID NO:??)  FokI (9)
KA9       3' CT ACT TCT CTAG (SEQ ID NO:??)                        DpnII, BglII, BamHI, BclI, BstYI
KA5       3' CT ACT TCT GTAC (SEQ ID NO:??)                        NcoI, BspHI
KA6       3' CT ACT TCT GGCC (SEQ ID NO:??)                        AgeI, NgoMI, BspEI, SgrAI, BsrFI, BsaWI
JC24-B    5' f-ACC GAC GTC GAC TAT CGC AGC AGA 3' (SEQ ID NO:??)  BbvI (8)
BA9       3' CG TCG TCT CTAG (SEQ ID NO:??)                        DpnII, BglII, BamHI, BclI, BstYI
BA5       3' CG TCG TCT GTAC (SEQ ID NO:??)                        NcoI, BspHI
6.11. FLUORESCENT LABELS
Fluorochrome labels that can be used in the
methods of the present invention include the classic
fluorochromes as well as more specialized fluorochromes. The
classic fluorochromes include bimane, ethidium, europium
(III) citrate, fluorescein, La Jolla blue, methylcoumarin,
nitrobenzofuran, pyrene butyrate, rhodamine, terbium chelate,
and tetramethylrhodamine. More specialized fluorochromes are
listed in Table 19 along with their suppliers.
TABLE 19: FLUORESCENT LABELS

Fluorochrome        Vendor              Absorption      Emission
                                        Maximum (nm)    Maximum (nm)

Bodipy 493/503      Molecular Probes    493             503
Cy2                 BDS                 489             505
Bodipy FL           Molecular Probes    508             516
FITC                Molecular Probes    494             518
FluorX              BDS                 494             520
FAM                 Perkin-Elmer        495             535
Carboxyrhodamine    Molecular Probes    519             543
EITC                Molecular Probes    522             543
Bodipy 530/550      Molecular Probes    530             550
JOE                 Perkin-Elmer        525             557
HEX                 Perkin-Elmer        529             560
Bodipy 542/563      Molecular Probes    542             563
Cy3                 BDS                 552             565
TRITC               Molecular Probes    547             572
LRB                 Molecular Probes    556             576
Bodipy LMR          Molecular Probes    545             577
Tamra               Perkin-Elmer        552             580
Bodipy 576/589      Molecular Probes    576             589
Bodipy 581/591      Molecular Probes    581             591
Cy3.5               BDS                 581             596
XRITC               Molecular Probes    570             596
ROX                 Perkin-Elmer        550             610
Texas Red           Molecular Probes    589             615
Bodipy TR           Molecular Probes    596             625 (618?)
Cy5                 BDS                 650             667
Cy5.5               BDS                 678             703
DdCy5               Beckman             680             710
Cy7                 BDS                 443             767
DbCy7               Beckman             790             820
The suppliers listed in Table 19 are Molecular Probes
(Eugene, OR), Biological Detection Systems ("BDS")
(Pittsburgh, PA) and Perkin-Elmer (Norwalk, CT).
Means of utilizing these fluorochromes by attaching
them to particular nucleotide groups are described in Kricka
et al., 1995, Molecular Probing, Blotting, and Sequencing,
chap. 1, Academic Press, New York. Preferred methods of
attachment are by an amino linker or phosphoramidite
chemistry.
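When several labels are to be detected together, the emission maxima in Table 19 give a first indication of which fluorochromes can be told apart. The following is a minimal sketch under stated assumptions: the function name `pick_separated` and the 20 nm minimum spacing are hypothetical choices made for illustration and are not values given in this description. It greedily keeps fluorochromes whose emission maxima are at least the chosen spacing apart, using the Perkin-Elmer rows of Table 19 as input.

```python
# (name, absorption maximum in nm, emission maximum in nm):
# the Perkin-Elmer rows of Table 19.
DYES = [
    ("FAM", 495, 535),
    ("JOE", 525, 557),
    ("HEX", 529, 560),
    ("Tamra", 552, 580),
    ("ROX", 550, 610),
]


def pick_separated(dyes, min_gap_nm=20):
    """Greedily keep dyes whose emission maxima are at least min_gap_nm apart.

    min_gap_nm is an assumed threshold, not a value from the description.
    """
    chosen = []
    for dye in sorted(dyes, key=lambda d: d[2]):
        # Keep a dye only if it is far enough from the last one kept.
        if not chosen or dye[2] - chosen[-1][2] >= min_gap_nm:
            chosen.append(dye)
    return chosen


print([name for name, _, _ in pick_separated(DYES)])
```

With the assumed 20 nm spacing, HEX is dropped because its emission maximum lies only 3 nm from that of JOE, leaving FAM, JOE, Tamra, and ROX. A real selection would also have to consider the excitation source and the detector's filter set, which this sketch ignores.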
7. SPECIFIC EMBODIMENTS, CITATION OF REFERENCES
The present invention is not to be limited in scope
by the specific embodiments described herein. Indeed,
various modifications of the invention in addition to those
described herein will become apparent to those skilled in the
art from the foregoing description and accompanying figures.
Such modifications are intended to fall within the scope of
the appended claims.
Various publications are cited herein, the
disclosures of which are incorporated by reference in their
entireties.
SEQUENCE LISTING
(1) GENERAL INFORMATION:
(i) APPLICANT: Rothberg, Jonathan
Deem, Michael
Simpson, John
(ii) TITLE OF INVENTION: Method and Apparatus for Identifying,
Classifying, or Quantifying DNA Sequences in a Sample
Without Sequencing
(iii) NUMBER OF SEQUENCES: 70
(iv) CORRESPONDENCE ADDRESS:
(A) ADDRESSEE: Pennie and Edmonds
(B) STREET: 1155 Avenue of the Americas
(C) CITY: New York
(D) STATE: New York
(E) COUNTRY: USA
(F) ZIP: 10036-2711
(v) COMPUTER READABLE FORM:
(A) MEDIUM TYPE: Floppy disk
(B) COMPUTER: IBM PC compatible
(C) OPERATING SYSTEM: PC-DOS/MS-DOS
(D) SOFTWARE: PatentIn Release #1.0, Version #1.30
(vi) CURRENT APPLICATION DATA:
(A) APPLICATION NUMBER: To be assigned
(B) FILING DATE: 14-June-1995
(C) CLASSIFICATION:
(viii) ATTORNEY/AGENT INFORMATION:
(A) NAME: Misrock, S. Leslie
(B) REGISTRATION NUMBER: 18,872
(C) REFERENCE/DOCKET NUMBER: 7934-033-999
(ix) TELECOMMUNICATION INFORMATION:
(A) TELEPHONE: (212) 790-9090
(B) TELEFAX: (212) 869-8864
(C) TELEX: 66441 PENNIE
(2) INFORMATION FOR SEQ ID NO:1:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:1:
AGCACTCTCC AGC~ AC CGAA 24
(2) INFORMATION FOR SEQ ID NO:2:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:2:
AGTGGCTTTT AA 12
(2) INFORMATION FOR SEQ ID NO:3:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:3:
AGTGGCTTGT AC 12
(2) INFORMATION FOR SEQ ID NO:4:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:4:
A~~GG~.~G CC 12
(2) INFORMATION FOR SEQ ID NO:5:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:5:
A~G~llGC GC 12
(2) INFORMATION FOR SEQ ID NO:6:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:6:
AGTGGCTTGA TC 12
(2) INFORMATION FOR SEQ ID NO:7:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:7:
.GTC~,~ L AG 12
(2) INFORMATION FOR SEQ ID NO:8:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:8:
AGTGGcTTca ca 12
(2) INFORMATION FOR SEQ ID NO:9:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:9:
AGTGGCTTCC GG 12
(2) INFORMATION FOR SEQ ID NO:10:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:10:
AC-IGGCTT Q TG 12
(2) INFORMATION FOR SEQ ID NO:11:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:11:
AGTG~CTT~G CT 12
(2) INFORMATION FOR SEQ ID NO:12:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:12:
A~LGG~AC GT 12
(2) INFORMATION FOR SEQ ID NO:13:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:13:
AGTGGCTTAA TT 12
(2) INFORMATION FOR SEQ ID NO:14:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:14:
AGTGGCTTAG CA 12
(2) INFORMATION FOR SEQ ID NO:15:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:15:
~G AGCAC-CTCC AGC~1~CAC C~AC 2
(2) INFORMATION FOR SEQ ID NO:16:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:16:
AGTCGCTGTT AA 12
(2) INFORMATION FOR SEQ ID NO:17:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:17:
AGTCGCTGTC GA 12
(2) INFORMATION FOR SEQ ID NO:18:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:18:
AGTCGCTGGT AC 12
(2) INFORMATION FOR SEQ ID NO:19:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:19:
AGFCGCTGC;G ~C 12
(2) INFORMATION FOR SEQ ID NO:20:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:20:
AGTCGCTGGC GC 12
(2) INFORMATION FOR SEQ ID NO:21:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:21:
AGTCGCTGGA TC 12
(2) INFORMATION FOR SEQ ID NO:22:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:22:
AGTCGCTGCT AG 12
(2) INFORMATION FOR SEQ ID NO:23:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:23:
AGTCGCTGCG CG 12
(2) INFORMATION FOR SEQ ID NO:24:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:24:
AGTCGCTGCC GG 12
(2) INFORMATION FOR SEQ ID NO:25:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:25:
AGTCGCTGCA TG 12
(2) INFORMATION FOR SEQ ID NO:26:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:26:
AGTCGCTGAG CT 12
(2) INFORMATION FOR SEQ ID NO:27:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:27:
AGTCGCTGAC GT 12
(2) INFORMATION FOR SEQ ID NO:28:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:28:
ACCGACGTCG ACTATCCATG AAGA 24
(2) INFORMATION FOR SEQ ID NO:29:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:29:
GTA~l~L~L AA 12
(2) INFORMATION FOR SEQ ID NO:30:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:30:
GTA~L~ AC 12
(2) INFORMATION FOR SEQ ID NO:31:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:31:
GTACTTCTGa CC 12
(2) INFORMATION FOR SEQ ID NO:32:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:32:
GTA~L~GC GC 12
(2) INFORMATION FOR SEQ ID NO:33:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:33:
GTACTTCTGA TC 12
(2) INFORMATION FOR SEQ ID NO:34:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:34:
GTA~L-L ~. aG 12
(2) INFORMATION FOR SEQ ID NO:35:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:35:
GTACTTCTCG CG 12
(2) INFORMATION FOR SEQ ID NO:36:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:36:
GTA~~ C GG 12
(2) INFORMATION FOR SEQ ID NO:37:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:37:
GTA~lL~A TG 12
(2) INFORMATION FOR SEQ ID NO:38:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:38:
GTACTTCTAG CT 12
(2) INFORMATION FOR SEQ ID NO:39:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:39:
GTACTTCTAC GT 12
(2) INFORMATION FOR SEQ ID NO:40:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:40:
GTACTTCTAA TT 12
(2) INFORMATION FOR SEQ ID NO:41:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:41:
GTACTTCTAG CA 12
(2) INFORMATION FOR SEQ ID NO:42:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:42:
ACCGACGTCG ACTATCCATG AAGC 24
(2) INFORMATION FOR SEQ ID NO:43:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:43:
I TA~L~ ~LL AA 12
(2) INFORMATION FOR SEQ ID NO:44:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:44:
GTA~L,~ GA 12
(2) INFORMATION FOR SEQ ID NO:45:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:45:
GTACTTCGGT AC 12
(2) INFORMATION FOR SEQ ID NO:46:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:46:
GT~CTTCGGr CC - 12
(2) INFORMATION FOR SEQ ID NO:47:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:47:
GTACTTCGGC GC 12
(2) INFORMATION FOR SEQ ID NO:48:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:48:
GTA~... ~. AC 12
(2) INFORMATION FOR SEQ ID NO:49:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:49:
GTACTTCGCT AG 12
(2) INFORMATION FOR SEQ ID NO:50:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:50:
C-TA''T'rCG"G CC- 12
(2) INFORMATION FOR SEQ ID NO:51:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:51:
GTACTTCGCC GG 12
(2) INFORMATION FOR SEQ ID NO:52:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:52:
GTACTTCGCA TG 12
(2) INFORMATION FOR SEQ ID NO:53:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:53:
GTACTTCGAG CT 12
(2) INFORMATION FOR SEQ ID NO:54:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:54:
GTA~T~GAC &T 12
(2) INFORMATION FOR SEQ ID NO:55:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 28 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:55:
AGCACTCTCC AGC~L~'-AC CGAGCATG 28
(2) INFORMATION FOR SEQ ID NO:56:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 49 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:56:
GGCGCGCCTG TAA~A~-~G GCCAGTACCG ACGTCGACTA TCCATGAAG 49
(2) INFORMATION FOR SEQ ID NO:57:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 48 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:57:
AA~ACTGCAG GAA~r~GCTA TGACCAGCAC TCI~CCAGCCT CTCACCGA 48
(2) INFORMATION FOR SEQ ID NO:58:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 53 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:58:
ACTTCGAAAT TAATA~rA~T CACTATA~-GG A~C~-ACGTCG ACTATCCATG AAG S~
(2) INFORMATION FOR SEQ ID NO:59:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 53 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:59:
ACTTCGAAAT T~ATA~t~-A~T CACTATAGGG AGCACTCTCC AGC~- ~AC CGA 53
(2) INFORMATION FOR SEQ ID NO:60:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:60:
AGCACTCTCC AGCCUCTCAC CGAA 24
(2) INFORMATION FOR SEQ ID NO:61:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:61:
AGCACTCTGG CGCGCCTCAC CGAA 24
(2) INFORMATION FOR SEQ ID NO:62:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:62:
AGCACTCTCC AGCCUCTCAC CGAC 24
(2) INFORMATION FOR SEQ ID NO:63:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:63:
AGCACTCTGG CGCGCCTCAC CGAC 24
(2) INFORMATION FOR SEQ ID NO:64:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:64:
ACCGACGTCG ACTATGGATG AAGA 24
(2) INFORMATION FOR SEQ ID NO:65:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:65:
GATCTCTTCA TC 12
(2) INFORMATION FOR SEQ ID NO:66:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:66:
CATGTCTTCA TC 12
(2) INFORMATION FOR SEQ ID NO:67:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:67:
C~l~l~A TC 12
(2) INFORMATION FOR SEQ ID NO:68:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 24 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:68:
ACCGACGTCG ACTATCGCAG CAGA 24
(2) INFORMATION FOR SEQ ID NO:69:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:69:
GATCTCTGCT GC 12
(2) INFORMATION FOR SEQ ID NO:70:
(i) SEQUENCE CHARACTERISTICS:
(A) LENGTH: 12 base pairs
(B) TYPE: nucleic acid
(C) STRANDEDNESS: single
(D) TOPOLOGY: linear
(ii) MOLECULE TYPE: DNA
(xi) SEQUENCE DESCRIPTION: SEQ ID NO:70:
CATGTCTGCT GC 12