Patent 2416611 Summary

(12) Patent Application:	(11) CA 2416611
(54) English Title:	METHODS AND SYSTEMS FOR EVALUATING A MATCHING BETWEEN MEASURED DATA AND STANDARDS
(54) French Title:	PROCEDES ET SYSTEMES POUR EVALUER UNE CORRESPONDANCE ENTRE DONNEES MESUREES ET ETALONS
Status:	Deemed Abandoned and Beyond the Period of Reinstatement - Pending Response to Notice of Disregarded Communication

Bibliographic Data

(51) International Patent Classification (IPC):	G06F 17/10 (2006.01) C12N 15/09 (2006.01) G01N 27/447 (2006.01) G01N 33/48 (2006.01) G01N 33/483 (2006.01) G01N 33/50 (2006.01) G06F 17/15 (2006.01) G06F 17/18 (2006.01)
(72) Inventors :	BREU, HEINZ (United States of America) PASIKA, HUGH J. (United States of America)
(73) Owners :	APPLERA CORPORATION
(71) Applicants :	APPLERA CORPORATION (United States of America)
(74) Agent:	MARKS & CLERK
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date:	2001-07-23
(87) Open to Public Inspection:	2002-01-31
Examination requested:	2003-01-17
Availability of licence:	N/A
Dedicated to the Public:	N/A
(25) Language of filing:	English

Patent Cooperation Treaty (PCT):	Yes
(86) PCT Filing Number:	PCT/US2001/023630
(87) International Publication Number:	WO 2002008949
(85) National Entry:	2003-01-17

(30) Application Priority Data:

Application No.	Country/Territory	Date
60/219,697	(United States of America)	2000-07-21

Abstracts

English Abstract

The present invention proposes an approach to matching measured data points
representative of a physical property (for instance the measured peak location
or positions of an in-lane size standard which represent DNA fragment length)
to a set of standard values of that property (for instance a size standard
definition). The matching is evaluated by comparing ratios of intervals
between selected pairs of the measured data points with ratios of the
intervals between adjacent pairs of the standard values, to find the measured
data points that best match the standard values.

French Abstract

L'invention concerne des procédés permettant d'établir une correspondance entre des points de données mesurées représentant une propriété physique (p. ex. le pic mesuré ou les positions d'un étalon de masse moléculaire dans le tracé qui représente une longueur de fragment d'ADN) et un ensemble de valeurs étalons de cette propriété (p. ex. définition d'étalon de masse moléculaire). On évalue la correspondance en comparant des rapports d'intervalles entre des paires sélectionnées de points de données mesurées avec des rapports d'intervalles entre des paires adjacentes de valeurs étalons, afin de trouver les points de données mesurées qui correspondent le mieux aux valeurs étalons.

Claims

Note: Claims are shown in the official language in which they were submitted.

Claims
1. A computer implemented method for evaluating a matching between a
plurality of measured data points representative of a physical
property and a standard comprising a sequence of pre-defined
values of said property, the method comprising:
(a) selecting a sequence of the measured data points
corresponding in number to the number of pre-defined values of
said property comprised in said standard and matching them in
sequence to the values of the standard;
(b) defining a sequence of measured edges as the set of
intervals between sequential ones of the measured data points
selected in step (a);
(c) calculating the ratio of each pair of adjacent measured
edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of
intervals between sequential ones of the pre-defined values in the
standard, each standard edge corresponding to a respective
different measured edge;
(e) calculating the ratio of each pair of adjacent standard
edges defined in step (d); and
(f) evaluating the matching by comparing the ratios of the
measured edges with the corresponding ratios of the standard
edges.
2. A method according to claim 1, wherein the step of comparing
ratios comprises calculating the ratios of corresponding ones of
the measured edge ratios and the standard edge ratios to generate
a ratio cost for each corresponding pair of ratios.
3. A method according to claim 2, comprising summing the ratio costs
to generate a total cost for the matching.
4. A method according to claim 2, wherein each of said measured data
points has an associated magnitude, the method further comprising
calculating a magnitude cost for each of the selected points as a
ratio of the magnitude of the point to the maximum magnitude
across all of the measured data points.
19

5. A method according to claim 4, comprising generating a total cost
for the matching as a weighted sum of the ratio costs and
magnitude costs for the selected points.
6. A method according to claim 3 or claim 5, comprising determining a
quality for the matching by comparing the total cost of the
matching with a maximum possible total cost.
7. A method according to claim 6, wherein a matching is rejected if
the quality is not sufficiently high.
8. A method according to any one of the preceding claims, wherein
there are more measured data points than there are values in the
standard, and the evaluation is repeated for one of more different
sub-sequences of measured data points.
9. A method according to claim 8, further comprising calculating
total costs for matching each of said sub-sequences of measured
data points to said standard, and identifying the matching having
the highest cost as the best matching.
10. A method according to any one of the preceding claims, wherein
the plurality of measured data points from which the sequence of
matched points are selected is itself a sub-set of a larger set of
measured data points.
11. A method according to claim 10, wherein the number of data
points selected from said larger set equals the number of values
in said standard plus a pre-defined number of additional data
points.
12. A method according to claim 11 or claim 12, wherein the
measured data points of said larger set each have a magnitude, and
the plurality of measured points selected from said larger set of
data points is selected on the basis of said magnitude.
13. A method according to any one of the preceding claims, wherein
the evaluation of the matching comprises employing an algorithm
for which the execution time increases no more than substantially
linearly with the number of pre-defined values in the standard.
20

14. A method according to any one of the preceding claims, wherein
said plurality of measured data points exceeds the number of pre-
determined values in the standard and the evaluation of the
matching comprises employing an algorithm for which the execution
time increase no more than as the cube of the number of measured
data points in excess of the number of values in the standard.
15. A method according to any one of the preceding claims, wherein
the matching is evaluated using a dynamic programming algorithm.
16. A computer implemented method for evaluating a matching between
a plurality of measured peak locations, the locations representing
DNA fragment size, and a size standard definition comprising a
sequence of pre-defined DNA fragment sizes, the method comprising:
(a) selecting a sequence of the measured peak locations
corresponding in number to the number of pre-defined sizes in the
size standard definition;
(b) defining a sequence of measured edges as the set of
intervals between sequential ones of the peak locations selected
in step (a);
(c) calculating the ratio of each pair of adjacent measured
edges defined in step (b);
(d) defining a sequence of standard edges as the sequence of
intervals between sequential ones of the sizes of the size
standard definition, each standard edge corresponding to a
respective different measured edge;
(e) calculating the ratio of each pair of adjacent standard
edges defined in step (d); and
(f) evaluating the matching by comparing the ratios of the
measured edges with the corresponding ratios of the standard
edges.
17. A method according to claim 16, wherein the step of comparing
ratios comprises calculating the ratios of corresponding ones of
the measured edge ratios and the standard edge ratios to generate
a ratio cost for each corresponding pair of ratios.
18. A method according to claim 17, comprising summing the ratio
costs to generate a total cost for the matching.
21

19. A method according to claim 17, wherein each of said peak
locations has an associated peak height, the method further
comprising calculating a height cost for each of the selected
peaks as a ratio of the height of the peak to the maximum height
across all of the peaks.
20. A method according to claim 19, comprising generating a total
cost for the matching as a weighted sum of the ratio costs and
height costs for the selected points.
21. A method according to claim 18 or claim 20, comprising
determining a quality for the matching by comparing the total cost
of the matching with a maximum possible total cost.
22. A method according to claim 21, wherein a matching is rejected
if the quality is not sufficiently high.
23. A method according to any one of claims 16 to 22, wherein there
axe more measured peak locations than there are sizes in the size
standard definition, and the evaluation is repeated for one of
more different sub-sequences of measured peak locations.
24. A method according to claim 23, further comprising calculating
total costs for matching each of said sub-sequences of measured
peak locations to said size standard definition, and identifying
the matching having the highest cost as the best matching.
25. A method according to any one of claims 16 to 24, wherein the
plurality of peak locations from which the sequence of matched
peaks are selected is itself a sub-set of a larger set of peak
locations.
26. A method according to claim 25, wherein the number of peak
locations selected from said larger set equals the number of sizes
in said size standard definition plus a pre-defined number of
additional peak locations.
27. A method according to claim 25 or claim 26, wherein the peak
locations of said larger set each have an associated height, and
the plurality of peak locations selected from said larger set of
data points is selected on the basis of said height.
22

28. A method according to any one of claims 16 to 27, wherein the
evaluation of the matching comprises employing an algorithm for
which the execution time increases no more than substantially
linearly with the number of sizes in the size standard definition.
29. A method according to any one of claims 16 to 27, wherein said
plurality of peak locations exceeds the number of sizes in the
size standard definition and the evaluation of the matching
comprises employing an algorithm for which the execution time
increase no more than as the cube of the number of peak locations
in excess of the number of sizes in the size standard definition.
30. A method according to any one of claims 16 to 29, wherein the
matching is evaluated using a dynamic programming algorithm.
31. A computer system for evaluating a matching between a plurality
of measured data points representative of a physical property and
a standard comprising a sequence of pre-defined values of said
property, the system comprising:
(i) memory for storing values of said measured data points and
said pre-defined values of said standard; and
(ii) processing means for:
(a) selecting a sequence of the measured data points
corresponding in number to the number of pre-defined values of
said property comprised in said standard and matching them in
sequence to the values of the standard;
(b) defining a sequence of measured edges as the set of
intervals between sequential ones of the measured data points
selected in step (a);
(c) calculating the ratio of each pair of adjacent
measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence
of intervals between sequential ones of the pre-defined values in
the standard, each standard edge corresponding to a respective
different measured edge;
(e) calculating the ratio of each pair of adjacent
standard edges defined in step (d); and
(f) evaluating the matching by comparing the ratios of
the measured edges with the corresponding ratios of the standard
edges; and
23

(iii) output means for outputting the result of said
evaluation.
32. A system according to claim 31, wherein said processing means
is operable in accordance with any one of the methods of claims 1
to 15.
33. A.computer system for evaluating a matching between a plurality
of measured peak locations, the locations representing DNA
fragment size, and a size standard definition comprising a
sequence of pre-defined DNA fragment sizes, the system comprising:
(i) memory for storing values of said peak locations and said
sizes of said size standard definition; and
(ii) processing means for:
(a) selecting a sequence of the measured peak locations
corresponding in number to the number of pre-defined sizes in the
size standard definition;
(b) defining a sequence of measured edges as the set of
intervals between sequential ones of the peak locations selected
in step (a);
(c) calculating the ratio of each pair of adjacent
measured edges defined in step (b);
(d) defining a sequence of standard edges as the sequence
of intervals between sequential ones of the sizes of the size
standard definition, each standard edge corresponding to a
respective different measured edge;
(e) calculating the ratio of each pair of adjacent
standard edges defined in step (d); and
(f) evaluating the matching by comparing the ratios of
the measured edges with the corresponding ratios of the standard
edges; and
(iii) output means for outputting the result of said
evaluation.
34. A system according to claim 33, wherein said processing means
is operable in accordance with any one of the methods of claims 16
to 30.
35. A computer program which is executable on a computer or a
distributed network of computers to cause the computer or network
24

of computers to operate in accordance with the method of any one
of claims 1 to 30.
36. A computer program according to claim 35 stored on a computer-
readable medium.
25

Description

Note: Descriptions are shown in the official language in which they were submitted.

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Methods and Systems for Evaluating a Matching Between Measured Data
and Standards
Field of the Invention
The present invention relates to methods and systems for
evaluating a matching between data, in particular a sequence of
measured data points, and a pre-defined standard. It has specific,
although not necessarily exclusive application in the field of
evaluating biological data.
Background
When studying a substance or device for example, it is often
the case that one or more physical properties of the sample or device
are not directly measurable or at least cannot be measured easily.
In order to evaluate such a property an alternative, indirect measure
representative of the property must be found.
This problem can be illustrated by taking the biological
example of evaluating the size (i.e. length) of DNA fragments
(measured in base pairs for instance). This is not a property that
can be measured directly. The alternative, indirect measure of
fragment size most commonly used is a measure of the rate at which
the fragments are transported through a medium in which their
mobility varies with size. Typically this is achieved by
electrophoresis through Lanes in a polyacrylmide gel, the details of
which will be well known to the skilled person and need not be
repeated here.
With this approach, the location of the fragments along the
lane, or the point in time at which a particular fragment passes a
set point along the lane (typically identified by peaks on an
electropherogram), provides an indirect measure of the fragment size.
However, since in any particular run the migration of the fragments
through the gel will be affected by factors other than just the size
of the fragment, on its own this provides only a measure of the
relative sizes of the fragments to one another. What it does not
provide is any absolute values for fragment size.

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
To enable the determination of absolute fragment size, it is
possible to run an in-lane size standard along with the fragments
being evaluated. This in-lane size standard is a kind of "ruler"
made up of a set of fragments of known size (i.e. length). If the
location of these size standard fragments can be accurately
determined, it is then possible to evaluate the size of an unknown
_ fragment by comparing its location with those of the size standard
and interpolating from the known sizes of the size standard
fragments.
For this approach to work, it is important that the location of
the size standard fragments can be accurately determined. That is to
say, one has to match peak locations in the measured results to each
of the known fragment sizes of the size standard definition. The
presence of artefacts in the results, giving false peaks, often means
that this is not a straightforward task. Embodiments of the present
invention are concerned with providing efficient approaches to
carrying out and evaluating such matching processes.
Although the background to the invention has been described
above with reference to evaluating the size of DNA fragments, it will
be appreciated that similar problems can be encountered when
evaluating the properties of other biological or non-biological
samples or devices. The invention has applicability more generally,
therefore, in any case where there is a need to match specific points
within a measured data set representing a set of standard values of a
property to the standard values themselves. The skilled person will
be well aware of many exemplary cases where the approach set out
below can be usefully applied, including but not limited to
applications in the filed of telecommunications, image processing,
pattern recognition and the like.
Summary of the Invention
The present invention proposes an approach to matching measured
data points representative of a physical property (for instance the
measured peak location or positions of an in-lane size standard which
represent DNA fragment length) to a set of standard values of that
property (for instance a size standard definition). The matching is
evaluated by comparing ratios of intervals between selected.pairs of
2

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
the measured data points with ratios of the intervals between
adjacent pairs of the standard values, to find the measured data
points that best match the standard values.
Other preferred features and aspects of the invention are set
out in the appended claims.
Brief Description of the Drawing
The invention is described below, by way of example, with
reference to the accompanying drawings in which:
Fig. 1 is a flow diagram illustrating the basis matching
process according to an embodiment of the present invention;
Fig. 2 Illustrates Example peaks, sizes and a matching;
Fig. 3 shows execution time for the matching process against
number of values in the size standard and additional peaks;
Fig. 4 shows execution time versus the number of values in the
size standard definition for different numbers of additional peaks;
Fig. 5 shows execution time versus the number of extra peaks
for a series of size standard definitions having differing numbers of
size values;
Fig. 6, 7, 8 and 9 show exemplary results from matchings;
Fig. 10 is a table of execution times versus the number of size
value in the size standard definition and the number of additional
peaks; and
Fig. 11 is a block diagram illustrating a suitable system
architecture for operating embodiments of the invention.
Description of Embodiments
Embodiments of the invention are described below in the context
of the exemplary application of the described approach to the problem
of matching an in-line size standard to a size standard definition
for the evaluation of nucleic acid information obtained from an
3

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
electrophoresis procedure. It will be appreciated that the approach
has wider applicability
In overview, the method operates by matching data generated
with a standard sample to actual sizes that should exist in the
standard sample. For example, one may use a standard sample with
nucleotide lengths 110, 114, 117, 120, and 125. One runs the
standard sample and obtains several data peaks. The size standard
matching component predicts the peaks that correspond to the five
known nucleotide lengths. Thus, one can subsequently compare data in
a sample to those peaks to determine the nucleotide lengths of
fragments in a sample.
Referring to Fig. 1, an embodiment of the process operates by
first selecting a sequence of peaks from those obtained by running
the standard sample, the number of peaks selected being equal to the
number of values in the size standard,definition. One peak is
matched to each of the size standard sizes in sequence. The matching
this produces is then evaluated to determine the total cost for the
matching. This process is repeated for a number of different
sequences of peaks to establish which sequence of the possible
variations from the peaks obtained from running the standard sample
provides the best matching.
In certain embodiments, a matching component uses an algorithm
that has three parameters: ratio factor (the importance of peak
height vs. the importance of the local linearity), min acceptable
quality (used for ending dynamic programming iteration), and number
of extra peaks (the number of peaks participated in size matching is
the number of size standard definition fragments plus the number of
extra peaks). In certain embodiments, the component fixes the ratio
factor to 0.6 and min acceptable quality to 0.75. In certain
embodiments, the component fixes the number of extra peaks to 10 for
Applied Biosystems 310/377 instrument data and 25 for Applied
Biosystems 3700 instrument data.
In certain embodiments, a statistically based quality value is
generated for the matching result.
In certain embodiments, one skilled in the art will be able to
adjust the number of extra peaks that may be used with a given
instrument.
4

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
In certain embodiments, the component ignores the peaks located
within the offseale regions in the sample.
Tn certain embodiments, the component fails the size matching
process if the size standard definitions are not fully matched in the
matching process.
In certain embodiments, the component implements two primer
peak detection methods. The first is the primer-peak-height-
suppression method. This method replaces the peak heights of the
highest peaks with the peak height of the middle peak, assuming that
the primer peaks are among the highest. The second is to find the
primer peak location. The method assumes that the primer peak
locates within the first half of the signal and the size standard
fragments locate in the second half of the signal. For example, one
takes the mean peak height of all the peaks in the second half and
multiples that mean by five to get the potential primer peak height.
The method works backwards in the first half of the signal to find
the last primer peak.
In certain embodiments, a size-standard matcher takes as input
a list of peaks (e.g., from an electropherogram) and a list of
fragment sizes (e.g., in nucleotides). It produces as output a
matching, that is, a list of pairs of the form <peak,size>, where
each peak and each fragment size appears at most once. In certain
embodiments, a size-standard mateher evaluates a matching, and uses
an algorithm for finding good matchings.
Certain embodiments employ an algorithm that evaluates a
matching by treating its two constituent sequences as sequences of
edges between points. A matching is also a correspondence between
edges. Two edges, e1 and e2, that share an endpoint define a ratio
of lengths r = ~e2~/~el~. Again, a matching is also a correspondence
between ratios. Under the assumption that the relation between peak
position and fragment size is "more or less" linear, corresponding
ratios typically should be equal. In certain embodiments, the
component derives a ratio cost to measure this property. In certain
embodiments, the component also concentrates on big peaks by deriving
a height cost. The total cost of a matching is a weighted sum of
these constituent costs,
In certain embodiments, the component formulates the size
standard matching problem as finding a matching with maximum cost.
In such embodiments, the cost is separable. That is, with some
5

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
additional mathematics, the component can maximise subsequences
independently. In certain embodiments, the cost also enjoys the
advantage of being local, thereby compensating for global deviations
from linearity. This cost also leads to a quality value between 0
and 1.
A size standard is a set of DNA fragments, each of known size.
The definition of a size standard is simply a list of these sizes.
Note that a size-standard definition typically does not depend on the
instrument on which the size standard is run, and therefore not on
any particular set of run conditions either.
An in-lane size standard is a set of peaks resulting from
running a size standard on an instrument. One determines the
positions and the heights of the peaks.
A size-standard matcher takes as input an in-lane size standard
and a size standard definition. It produces as output a matching,
that is, a list of pairs of the form (peak,size), where each peak and
each fragment size appears at most once. A peak has a position
(e. g., in scan numbers) and a height (e. g., in fluorescent units).
Fragment sizes are given in nucleotides.
Assume that there are at least as many peaks as sizes.
Furthermore, assume that every size has a corresponding peak, except
possibly for some small number from the end of this list. This
exception is meant to model the situation where a user may have
stopped electrophoresis early, before the larger fragments have had a
chance to elute.
In certain embodiments, one employs the following components.
Let P=lpo,p"...,j7~r_~) be a list of nP peak locations, given by
increasing scan number, for example. Let H=~j2o,12~...,ltn -~ J be a list
v
of the corresponding nP peak heights, given in fluorescent units, for
example. The size standard definition S=LSO,s~,...,s,~-~, is a list of
s
ns fragment sizes in increasing nucleotides. By assumption, nP Z ns.
A size-standard matchin is a set of airs f
g p M =~~io~0)~~ii~l)~..., (i»,~n,.)
where the i~ are in increasing order, that is, where subscript j < k
implies i~ < ik.
3S
6

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Example 1 Peaks, Sizes, and a Matching
Consider the peaks, sizes, and matching displayed in Figure 2.
The list P contains nP = 11 peak positions:
P = [968, 1029, 1203, 1259, 1412, 1535, 1714, 1751, 1785, 1837,
1928] .
These np peaks have heights H:
H = [2722, 6219, 1060, 5380, 7726, 1082, 7424, 1263, 7335, 7937,
1562] .
The size standard definition has ns = 5 sizes S:
S = [75, 100, 139, 150, 160] .
Finally, M is the matching shown in the figure:
M = { (3, 0) , (4, 1) , (6, 2) , (8, 3) , (9, 4) } .
In certain embodiments, big-Oh notation is used to express
algorithm complexity. This notation is ubiquitous in worst-case and
average-case resource analysis. Briefly, a function f is said to be
on the order of another function g (written f(x) = O(g(x)) if there
exists positive constants c and N such that ~f(x)~ _< c~g(x)~ for all
x >- N.
Evaluating a Matching
Assume there is a matching. In certain embodiments,, the
matcher component evaluates a matching by examining its two
constituent sequences. It treats the sequence of peaks as a sequence
of edges between peaks, and similarly for the sizes. For example, M
- ~(3, 0), (4, 1), (6, 2), (8, 3), (9, 4)} is the matching from
Example 1. Its peak (index) sequence is [3, 4, 6, 8, 9], which has
four edges, (3, 4), (4, 6), (6, 8), and (8, 9). Similarly, its
fragment size definition (index) sequence is [0, 1, 2, 3, 4], which
also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4).
A matching is also a correspondence between edges. In this
example, peak edge (6, 8) corresponds to definition edge (2, 3).
Assume that two edges are adjacent if they share an endpoint. In
this example, (4, 6) and (6, 8) are adjacent since they share peak 6.
Two adjacent edges (i,j) and (j, k) define a ratio risk of lengths:
1'J~ _ ~k p~ . (1)
h; - hr
7

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
In certain embodiments, one can employ a more economical
notation for size ratios for matching all sizes:
r = S,f+z -Sf+~ .
I
s f+~ - s.f ( ~ )
Again, a matching is also a correspondence between ratios. In this
example, peak ratio r689 corresponds to size ratio rZ.
Under the assumption that the relation between peak position
and fragment size is "more or less" lineax, corresponding ratios
typically should be equal. More formally, assume that the fragment
of size si occurs at position pi. ~If there are coefficients a and b
such that pi = asi + b for all i, then
p~ -pj - ~as~. +b)-~asj +b~ a~s~ -sj~- s~ -sj
(3)
pj-p ~ -~ r ~ ~- ,
as . + b as. + b a s . s. s . s.
To measure the similarity of a corresponding pair of ratios ri~x
and rf, one may define their ratio cost cr(i, j, k, f) to be
min rt.~ , r
~,.~i~J~k~.f~= ~ ~ jv_ (4)
max ~;~~,rf
Note that 0 <- cr(i,j,k,f) <_ 1 for all 0 <- i < j < k < np and 0 _<< f <
n5 - 2. Note also that cr(i,j,k,f) = 1 indicates the ideal of equal
ratios. The ratio cost of a matching is the sum of its individual
costs .
In certain embodiments, one has the matchings concentrate on
the big peaks. To this end, one may define the height cost ch(i) of
a matched peak i to be its height divided by the maximum peak height
1i. More formally,
3o h= max h. (5)
0_<j<rrr,
and
8

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
cn~i~= h . (6>
Again, 0 < ch(i) <- 1 for all peaks 0 _<< i < nP, and ch(i) = 1, in
certain embodiments, corresponds to the ideal of a maximally tall
peak.
In order to combine these two types of costs, one may weight
and sum them. Since there are only two costs, a single weighting
parameter a, where 0 <- a <_ l, will suffice. The total cost c(M) of a
matching M is the weighted sum:
c~M~=a ~ c,.~i, j, k, f~+~1-a> ~ c,,~i~ . (7)
r>.f )EM O,.f )EM
j,f+1 eM
~h,.f+Z~M
One may now formulate the size standard matching problem as
finding a matching with maximum cost. Note that the cost is local in
the sense that each element of the summation depends on at most three
adjacent points. This property allows the matcher component in
certain embodiments. to compensate for global deviations from
linearity.
Quality Measures
If one divides the cost of a matching by the maximum possible
cost over all matchings, one would have a number between 0 and 1 that
indicates its quality. What is this maximum possible cost? Every
pair of ratios in such a matching would contribute its maximum value,
namely axl. There are a total of n - 2 ratio pairs. Similarly,
every matched peak would be at the maximum height, so that all n
matched peaks (one for each definition size) contribute (1 - a) x 1.
The maximum possible cost c, therefore, is:
c=~n-2~a,+n~l-a~=n-2a (
The quality of a matching M is therefore given by c(M)/c.
Other possible quality measures include the sum of just the
ratio costs, and the worst ratio cost in the matching.
An Efficient Algorithm
In certain embodiments, an advantage of the above formulation
9

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
is that the cost is separable. That is, with some additional
mathematics, one can maximise subsequences independently. This
property leads to an efficient dynamic programming algorithm. In
certain embodiments, the algorithm is efficient (runs in low-order
polynomial time and space) and guarantees an optimal solution.
Let c : ~~ -. ~. denote the maximum cost of a matching
subproblem. In particular, let c(j, k, f) denote the maximum cost of
matching the peaks from 0 to k with the definition fragments from 0
to f + 1 in such a way that peak j matches size f and peak k matches
size f + 1. The cost of matching all sizes is therefore
c~M*~= max c~j,k,ns. -2~ (9)
i<h~n,~
where M* is the optimal matching. Note that every definition
fragment matches some peak, but only n5 of the peaks need match in
this embodiment.
One can now express a maximum cost recursively. For f = 0
there are no ratios to compute, so one need only be concerned with
the height cost:
c~j,k,0~= ~1-cz~c,,~j~+ch~k)) . (lo)
For f > 0, one can compute the cost recursively by adding the height
cost for the newly matched peak k to a new ratio cost and a previous
subproblem cost:
c~j,k, ~=(1-a~~c,,~k~ (11)
+max~(i, j, f -1~+a~c,.(i, j,k, f -l~~
i<~
Converting these equations to algorithms is straightforward.
In certain embodiments, the matcher component computes the individual
elements in a consistent order. Furthermore, one may exploit the
fact that one can match every size in the definition by limiting the
computation. In certain embodiments, the matcher component only
needs to compute c(j, k, f) for k > j Z f since j peaks cannot be
made to fit all f sizes if j < f. Similarly, in certain embodiments,
the matcher component needs to examine only subproblems c(i,j,f-1)

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
where i ? f - 1 since i peaks could not fit all f-1 sizes if i < f -
1.
To this end, Algorithm 2 solves Equation 10 and Algorithm 3
solves Equation 11.
Algorithm 2 Basis of the Recursion (when f = 0)
for j .- 0, 1,... , nP - 2 do
for k .- j + 1, j + 2, ... , np - 1 do
c(j~k~ 0) ~- (l - a) (ch(j) + ch(k) )
Algorithm 3 Compute Cost of Matching (when f > 0)
1. for f ,.- 1,2,... ,ns - 2 do
2. for k .- f + 1, f + 2,... , f + np - ns + l do
3. for j .- f, f + 1,... , k - 1 do
4 . c * .-
5. for i .-- f - 1, f,... , j - 1 do
6. , ci .- e(i, j, f - 1) + a ~ cr(i, j,k, f - 1)
7. if ci > c* then
8 , c * .- ci
9. c(j,k,f) ~- (1 - a) ~ ch(k) + c*
As stated, these algorithms compute only the cost of an optimal
matching. One will still retrieve a matching from this calculation.
This is often a standard part of dynamic programming algorithms.
When memory requirements are very high, it is often the practice to
recompute the path to the optimal cost from the cost matrix. Since
certain embodiments have relatively small sequences, one can trade
time for memory by keeping an array of backpointers, or predecessors
p. It is easy to maintain this array by adding the line p(j,k,f) .-- i
after line 8 in Algorithm 3. This assignment indicates that the
predecessor to cost c(j,k,f) is c(i,j,f-1). Then, the matcher
component may reconstruct an optimal matching from Equation 9 by
tracing backwards.
Computational Resources
Theoretical Run-time Analysis
In certain embodiments, the algorithm's run time complexity is

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
dominated by the number of times it executes Lines 6 and 7. The
lines themselves execute in constant time. The inner (the i) loop
executes them ~i'r-~ 1 times. Therefore the j loop executes them
H_~ i-~
~.-.r-1 1 times. The k loop terminates at k = f + nP - ns + 1 =
f + m + 1, where m = nP - ns is the number of extra peaks.
Continuing in this way, one sees that the lines are executed a total
of T(m, ns) times, where
..,.-z n~+.r+i k-~ j
T~nZ,hs.~= ~ ~ ~ 1 . (12)
.r=~ x=.f+~ i=I r=I-1
to
This expression is not as formidable as it looks since the inner
three summations are independent of the value f. By a judicious
substitution of variables, one sees that:
n,+_r+1 k-1 ;-1 m k 1
IS ~ ~ ~ 1=~ ~ ~ 1. (13)
=.r+~ j=.r 1=.r-1 k=~ J=~ i=0
A straightforward calculation shows that:
~~ ~ ~ 1=6~nZ3+6naz+llm+6~=6~m+3~m+2~m+1~.
=o j=o r=o
It follows that
n -2 m+r+1 k-1 j-1
T ~m~ n~s~ ~ _
' .r=1 ~=.r+1 j=f i=f-1
= 6(m+3~m+2~m+l~n,s -2) (14)
= O~msns
That is, the execution time increases only linearly with the number
of definition fragments, but it increases as the cube of the number
of extra peaks. Note that, when the number of peaks equals the
number of definition fragments (that is, when m = 0), Lines 6 and 7
are executed only ns -2 times, which is exactly the number of ratios
that need to be compared to evaluate any matching.
12

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Empirical Measurements
The theoretical analysis in the previous subsection allows one
to understand the asymptotic behaviour of the algorithm. That is, it
allows one to predict the trend in the run-time when the inputs are
large. For smaller inputs, in certain embodiments, various overhead
factors influence the run time.
One can construct several sets of synthetic data and time a C++
implementation of the algorithm. The data includes size standard
definitions with ns = 5 to ns = 40 fragment sizes. In every case the
ith fragment has size 20a., where i >- 1. The in-lane peaks have
positions equal to the definition sizes, but they also have m = 0 to
m = 20 additional peaks, where the ith additional peak has position
10+20i, for i >- 0. For each combination of ns and m, a test program
executes the matcher component 20 times and divides the elapsed time
by 20 also, to give the time for each execution in milliseconds.
Figures 3 to 5 show the results. The execution times
themselves, rounded to the nearest millisecond, are provided in Fig.
10.
Memory
Full arrays for holding the costs and predecessors may
use ~n2 -1- t2s.~2 t2~. = m2ns. -I- 2mYts -1-Yts real values . Initialising
these
arrays therefore takes asymptotically more time, when YI23 =O(ns ),
than the optimisation algorithm. If this is a problem, the arrays
can be implemented as sparse arrays, so that they occupy O(m3ns)
space as well as time. Another solution is to use full arrays, but
to index them not with the peak indices, but rather with the
substituted variables in Equation 13. A third possibility is to use
and allocate full matrices, but to not initialise them.
Practical Considerations
In certain embodiments, one may want to determine a set of
candidates peaks that the matcher component should size. One may
choose to allow a parameter m specifying the number of extra peaks to
consider. The size standard matcher then extracts the np = ns + m
tallest peaks from all peaks detected, for example by a previous
sizecalling step. One may use m = 4.
One may use a weighting factor a between 1/2 and 3/4.
13

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
An analyst typically should choose a size standard definition
that corresponds accurately to the in-lane size standard. However,
it may be that an analyst terminated a run early, before the longer
fragments have had a chance to elute. In this case, the definition
is not accurate, strictly speaking. To provide some robustness in
this situation, one may test if the optimal matching satisfies a
minimum acceptable quality parameter. If not, one may remove the
last definition size and try again, repeating this process until the
quality is acceptable. Alternatively, if the quality is
unacceptable, one may simply report this without returning a
matching.
Results
Figure 6 shows an example of matching a run of the Applied
Biosystems GeneScan 500 size standard on a Manhattan breadboard. The
top panel shows the matched electropherogram. Again, peak positions
are in green, and peak bases are in red. The middle panel shows the
"errors" obtained by fitting this matching to a straight line and
subtracting the assigned size from the size predicted from the linear
fit. Finally, the bottom panel shows the definition points(in green
circles) and the predicted peak sizes (in red circles). The
correspondence between the two is shown in blue line segments.
Prior to analysis, the primer peak is removed from the peaks
used for the matching, although it is still indicated in the diagram.
The electropherogram contains an extra peak at its end (around scan
4400). In this example, the described approach correctly did not
match this peak. However that peak was often assigned 500 nucleotides
by prior art matchers.
The electropherogram also shows a spurious "spike" around scan
2200. A peak detector rejected this as a peak since it did not meet
the minimum peak width criterion.
The 250-nucleotide fragment of the size standard is known to
migrate anomalously on the 310, and seems to do so on Manhattan,
also. The 340-nucleotide fragment is also known to migrate
anomalously; these two fragments typically are not used fox sizing on
the 310.
14

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Figure 7 shows the run from Figure 6 matched with GeneScan 500
less sizes 250 and 340. Note that the matcher correctly ignored the
peaks corresponding to these two fragments.
Figure 8 shows the result of matching GeneScan 500, less sizes
35 and 50 with a run that was terminated early. Tl~e matcher correctly
ignores the corresponding peaks. The matcher does initially match all
sizes, including the unrepresented large ones, to the available
peaks, but the matching quality was not high enough. Consequently,
The matcher then strips one size at a time off the end of the
definition until the resulting match is of sufficient quality.
Figure 9 shows a matching with the GeneScan 400HD size
standard. This electropherogram also contains a spike (between size
90 and 100). A peak detector ignored this peak due to a minimum peak
width parameter. The primer peak is not visible since the analyst set
an analysis range that excludes it. Note from the middle panel that
the extremes(sizes 50, 60, 380, and 400) of the standard did not run
"as linearly" as the rest. These extremes would cause previous
matchers some problems, though their errors are quite small here.
Conclusion
It can be seen that embodiments of the present invention
provide a simple formulation of the size-standard matching problem
which leads to a simple algorithm for its solution. In some
embodiments the formulation emphasises corresponding ratios of edge
lengths combined with the heights of the matched peaks. In preferred
embodiments, the corresponding algorithm employs dynamic programming
directly on the input sequences. It runs in O(m3 ns) time, where ns is
the number of fragment sizes in the size-standard definition, and m =
nP - ns is the number of extra peaks in the in-lane size standard.
Preferred embodiments of the invention employ an algorithm that
requires just a very few (three) parameters:
1. the weight for the ratio cost,
2. the minimum acceptable quality, and
3. the number m of extra peaks to consider.
l5

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
System Architecture
Figure 11 is a block diagram that illustrates a computer system
500, according to certain embodiments, upon which embodiments of the
invention may be implemented. Computer system 500 includes a bus 502
or other communication mechanism for communicating information, and a
processor 504 coupled with bus 502 for processing information.
Computer system 500 also includes a memory 506, which can be a random
access memory (RAM) or other dynamic storage device, coupled to bus
502 for determining size matches, and instructions to be executed by
processor 504. Memory 506 also may be used for storing temporary
variables or other intermediate information during execution of
instructions to be executed by processor 504. Computer system 500
further includes a read only memory (ROM) 508 or other static storage
device coupled to bus 502 for storing static information and
instructions for processor 504. A storage device 510, such as a
magnetic disk or optical disk, is provided and coupled to bus 502 for
storing information and instructions.
Computer system 500 may be coupled via bus 502 to a display
512, such as a cathode ray tube (CRT) or liquid crystal display
(LCD), for displaying information to a computer user. An input
device 514, including alphanumeric and other keys, is coupled to bus
502 for communicating information and command selections to processor
504. Another type of user input device is cursor control 516, such
as a mouse, a trackball or cursor direction keys for communicating
direction information and command selections to processor 504 and for
controlling cursor movement on display 512. This input device
typically has two degrees of freedom in two axes, a first axis (e. g.,
x) and a second axis (e. g., y), that allows the device to specify
positions in a plane.
A computer system 500 evaluates size matchings and provides a
quality level fox each matching. Consistent with certain
implementations of the invention, a matching and associated quality
is provided by computer system 500 in response to processor 504
l6

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
executing one or more sequences of one or more instructions contained
in memory 506. Such instructions may be read into memory 506 from
another computer-readable medium, such as storage device 510.
Execution of the sequences of instructions contained in memory 506
causes processor 504 to perform the process states described herein.
Alternatively hard-wired circuitry may be used in place of or in
combination with software instructions to implement the invention.
Thus implementations of the invention are not limited to any specific
combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to
any media that participates in providing instructions to processor
504 for execution. Such a medium may take many forms, including but
not limited to, non-volatile media, volatile media, and transmission
media. Non-volatile media includes, for example, optical or magnetic
disks, such as storage device 510_ Volatile media includes dynamic
memory, such as memory 506. Transmission media includes coaxial
cables, copper wire, and fiber optics, including the wires that
comprise bus 502. Transmission media can also take the form of
acoustic or light waves, such as those generated during radio-wave
and infra-red data communications.
Common forms of computer-readable media include, for example, a
floppy disk, a flexible disk, hard disk, magnetic tape, or any other
magnetic medium, a CD-ROM, any other optical medium, punch cards,
papertape, any other physical medium with patterns of holes, a RAM,
PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a
carrier wave as described hereinafter, or any other medium from which
a computer can read.
Various forms of computer readable media may be involved in
carrying one or more sequences of one or more instructions to
processor 504 for execution. For example, the instructions may
initially be carried on magnetic disk of a remote computer. The
remote computer can load the instructions into its dynamic memory and
send the instructions over a telephone line using a modem. A modem
local to computer system 500 can receive the data on the telephone
line and use an infra-red transmitter to convert the data to an
infra-red signal. An infra-red detector coupled to bus 502 can
receive the data carried in the infra-red signal and place the data
17

CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
on bus 502. Bus 502 carries the data to memory 506, from which
processor 504 retrieves and executes the instructions. The
instructions received by memory 506 may optionally be stored on
storage device 510 either before or after execution by processor 504.
It will be appreciated that the above embodiments and results
are exemplary only and various modifications to that which have been
specifically described are possible within the scope of the
invention. In particular it is to be noted that the applicability of
the invention is not limited to the described biological data, but is
also applicable to~other biological and non-biological data.
18

Representative Drawing

Sorry, the representative drawing for patent document number 2416611 was not found.

Administrative Status

2024-08-01:As part of the Next Generation Patents (NGP) transition, the Canadian Patents Database (CPD) now contains a more detailed Event History, which replicates the Event Log of our new back-office solution.

Please note that "Inactive:" events refers to events no longer in use in our new back-office solution.

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer , as well as the definitions for Patent , Event History , Maintenance Fee and Payment History should be consulted.

Event History

Description	Date
Inactive: IPC expired	2011-01-01
Application Not Reinstated by Deadline	2007-07-23
Time Limit for Reversal Expired	2007-07-23
Inactive: Correspondence - Formalities	2006-08-18
Deemed Abandoned - Failure to Respond to Maintenance Fee Notice	2006-07-24
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Inactive: IPC from MCD	2006-03-12
Amendment Received - Voluntary Amendment	2005-08-02
Inactive: IPRP received	2003-07-28
Letter Sent	2003-05-13
Inactive: Single transfer	2003-03-24
Amendment Received - Voluntary Amendment	2003-03-24
Inactive: First IPC assigned	2003-03-18
Inactive: Courtesy letter - Evidence	2003-03-18
Inactive: Cover page published	2003-03-13
Inactive: First IPC assigned	2003-03-11
Letter Sent	2003-03-11
Inactive: Acknowledgment of national entry - RFE	2003-03-11
Application Received - PCT	2003-02-20
National Entry Requirements Determined Compliant	2003-01-17
Request for Examination Requirements Determined Compliant	2003-01-17
All Requirements for Examination Determined Compliant	2003-01-17
Application Published (Open to Public Inspection)	2002-01-31

Abandonment History

Abandonment Date	Reason	Reinstatement Date
2006-07-24

Maintenance Fee

The last payment was received on 2005-07-07

Note : If the full payment has not been received on or before the date indicated, a further fee may be required which may be one of the following

the reinstatement fee;
the late payment fee; or
additional fee to reverse deemed expiry.

Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Fee History

Fee Type	Anniversary Year	Due Date	Paid Date
Request for examination - standard			2003-01-17
MF (application, 2nd anniv.) - standard	02	2003-07-23	2003-01-17
Registration of a document			2003-01-17
Basic national fee - standard			2003-01-17
MF (application, 3rd anniv.) - standard	03	2004-07-23	2004-07-07
MF (application, 4th anniv.) - standard	04	2005-07-25	2005-07-07

Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
APPLERA CORPORATION

Past Owners on Record
HEINZ BREU
HUGH J. PASIKA

Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.

Documents

To view selected files, please enter reCAPTCHA code :

To view images, click a link in the Document Description column. To download the documents, select one or more checkboxes in the first column and then click the "Download Selected in PDF format (Zip Archive)" or the "Download Selected as Single PDF" button.

List of published and non-published patent-specific documents on the CPD .

If you have any difficulty accessing content, you can call the Client Service Centre at 1-866-997-1936 or send them an e-mail at CIPO Client Service Centre.

Filter

Download Selected in PDF format (Zip Archive)

Download Selected as Single PDF

Document Description	Date (yyyy-mm-dd)	Number of pages	Size of Image (KB)
Description	2003-01-17	18	723
Claims	2003-01-17	7	259
Drawings	2003-01-17	9	259
Abstract	2003-01-17	1	51
Cover Page	2003-03-13	1	32
Acknowledgement of Request for Examination	2003-03-11	1	185
Notice of National Entry	2003-03-11	1	225
Courtesy - Certificate of registration (related document(s))	2003-05-13	1	107
Courtesy - Abandonment Letter (Maintenance Fee)	2006-09-18	1	175
PCT	2003-01-17	5	240
Correspondence	2003-03-11	1	25
PCT	2003-01-18	2	67
Correspondence	2006-08-18	1	36
Prosecution correspondence	2005-08-02	1	26

Language selection

Menus

English Abstract

French Abstract

Event History

Abandonment History

Maintenance Fee

Fee History

Your request is in progress.

Requested information will be available
in a moment.

Thank you for waiting.

Patent 2416611 Summary

English Abstract

French Abstract

Event History

Abandonment History

Maintenance Fee

Fee History

Your request is in progress.Requested information will be availablein a moment.Thank you for waiting.

Your request is in progress.

Requested information will be available
in a moment.

Thank you for waiting.