Note : Les descriptions sont présentées dans la langue officielle dans laquelle elles ont été soumises.
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Methods and Systems for Evaluating a Matching Between Measured Data
and Standards
Field of the Invention
The present invention relates to methods and systems for
evaluating a matching between data, in particular a sequence of
measured data points, and a pre-defined standard. It has specific,
although not necessarily exclusive application in the field of
evaluating biological data.
Background
When studying a substance or device for example, it is often
the case that one or more physical properties of the sample or device
are not directly measurable or at least cannot be measured easily.
In order to evaluate such a property an alternative, indirect measure
representative of the property must be found.
This problem can be illustrated by taking the biological
example of evaluating the size (i.e. length) of DNA fragments
(measured in base pairs for instance). This is not a property that
can be measured directly. The alternative, indirect measure of
fragment size most commonly used is a measure of the rate at which
the fragments are transported through a medium in which their
mobility varies with size. Typically this is achieved by
electrophoresis through Lanes in a polyacrylmide gel, the details of
which will be well known to the skilled person and need not be
repeated here.
With this approach, the location of the fragments along the
lane, or the point in time at which a particular fragment passes a
set point along the lane (typically identified by peaks on an
electropherogram), provides an indirect measure of the fragment size.
However, since in any particular run the migration of the fragments
through the gel will be affected by factors other than just the size
of the fragment, on its own this provides only a measure of the
relative sizes of the fragments to one another. What it does not
provide is any absolute values for fragment size.
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
To enable the determination of absolute fragment size, it is
possible to run an in-lane size standard along with the fragments
being evaluated. This in-lane size standard is a kind of "ruler"
made up of a set of fragments of known size (i.e. length). If the
location of these size standard fragments can be accurately
determined, it is then possible to evaluate the size of an unknown
_ fragment by comparing its location with those of the size standard
and interpolating from the known sizes of the size standard
fragments.
For this approach to work, it is important that the location of
the size standard fragments can be accurately determined. That is to
say, one has to match peak locations in the measured results to each
of the known fragment sizes of the size standard definition. The
presence of artefacts in the results, giving false peaks, often means
that this is not a straightforward task. Embodiments of the present
invention are concerned with providing efficient approaches to
carrying out and evaluating such matching processes.
Although the background to the invention has been described
above with reference to evaluating the size of DNA fragments, it will
be appreciated that similar problems can be encountered when
evaluating the properties of other biological or non-biological
samples or devices. The invention has applicability more generally,
therefore, in any case where there is a need to match specific points
within a measured data set representing a set of standard values of a
property to the standard values themselves. The skilled person will
be well aware of many exemplary cases where the approach set out
below can be usefully applied, including but not limited to
applications in the filed of telecommunications, image processing,
pattern recognition and the like.
Summary of the Invention
The present invention proposes an approach to matching measured
data points representative of a physical property (for instance the
measured peak location or positions of an in-lane size standard which
represent DNA fragment length) to a set of standard values of that
property (for instance a size standard definition). The matching is
evaluated by comparing ratios of intervals between selected.pairs of
2
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
the measured data points with ratios of the intervals between
adjacent pairs of the standard values, to find the measured data
points that best match the standard values.
Other preferred features and aspects of the invention are set
out in the appended claims.
Brief Description of the Drawing
The invention is described below, by way of example, with
reference to the accompanying drawings in which:
Fig. 1 is a flow diagram illustrating the basis matching
process according to an embodiment of the present invention;
Fig. 2 Illustrates Example peaks, sizes and a matching;
Fig. 3 shows execution time for the matching process against
number of values in the size standard and additional peaks;
Fig. 4 shows execution time versus the number of values in the
size standard definition for different numbers of additional peaks;
Fig. 5 shows execution time versus the number of extra peaks
for a series of size standard definitions having differing numbers of
size values;
Fig. 6, 7, 8 and 9 show exemplary results from matchings;
Fig. 10 is a table of execution times versus the number of size
value in the size standard definition and the number of additional
peaks; and
Fig. 11 is a block diagram illustrating a suitable system
architecture for operating embodiments of the invention.
Description of Embodiments
Embodiments of the invention are described below in the context
of the exemplary application of the described approach to the problem
of matching an in-line size standard to a size standard definition
for the evaluation of nucleic acid information obtained from an
3
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
electrophoresis procedure. It will be appreciated that the approach
has wider applicability
In overview, the method operates by matching data generated
with a standard sample to actual sizes that should exist in the
standard sample. For example, one may use a standard sample with
nucleotide lengths 110, 114, 117, 120, and 125. One runs the
standard sample and obtains several data peaks. The size standard
matching component predicts the peaks that correspond to the five
known nucleotide lengths. Thus, one can subsequently compare data in
a sample to those peaks to determine the nucleotide lengths of
fragments in a sample.
Referring to Fig. 1, an embodiment of the process operates by
first selecting a sequence of peaks from those obtained by running
the standard sample, the number of peaks selected being equal to the
number of values in the size standard,definition. One peak is
matched to each of the size standard sizes in sequence. The matching
this produces is then evaluated to determine the total cost for the
matching. This process is repeated for a number of different
sequences of peaks to establish which sequence of the possible
variations from the peaks obtained from running the standard sample
provides the best matching.
In certain embodiments, a matching component uses an algorithm
that has three parameters: ratio factor (the importance of peak
height vs. the importance of the local linearity), min acceptable
quality (used for ending dynamic programming iteration), and number
of extra peaks (the number of peaks participated in size matching is
the number of size standard definition fragments plus the number of
extra peaks). In certain embodiments, the component fixes the ratio
factor to 0.6 and min acceptable quality to 0.75. In certain
embodiments, the component fixes the number of extra peaks to 10 for
Applied Biosystems 310/377 instrument data and 25 for Applied
Biosystems 3700 instrument data.
In certain embodiments, a statistically based quality value is
generated for the matching result.
In certain embodiments, one skilled in the art will be able to
adjust the number of extra peaks that may be used with a given
instrument.
4
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
In certain embodiments, the component ignores the peaks located
within the offseale regions in the sample.
Tn certain embodiments, the component fails the size matching
process if the size standard definitions are not fully matched in the
matching process.
In certain embodiments, the component implements two primer
peak detection methods. The first is the primer-peak-height-
suppression method. This method replaces the peak heights of the
highest peaks with the peak height of the middle peak, assuming that
the primer peaks are among the highest. The second is to find the
primer peak location. The method assumes that the primer peak
locates within the first half of the signal and the size standard
fragments locate in the second half of the signal. For example, one
takes the mean peak height of all the peaks in the second half and
multiples that mean by five to get the potential primer peak height.
The method works backwards in the first half of the signal to find
the last primer peak.
In certain embodiments, a size-standard matcher takes as input
a list of peaks (e.g., from an electropherogram) and a list of
fragment sizes (e.g., in nucleotides). It produces as output a
matching, that is, a list of pairs of the form <peak,size>, where
each peak and each fragment size appears at most once. In certain
embodiments, a size-standard mateher evaluates a matching, and uses
an algorithm for finding good matchings.
Certain embodiments employ an algorithm that evaluates a
matching by treating its two constituent sequences as sequences of
edges between points. A matching is also a correspondence between
edges. Two edges, e1 and e2, that share an endpoint define a ratio
of lengths r = ~e2~/~el~. Again, a matching is also a correspondence
between ratios. Under the assumption that the relation between peak
position and fragment size is "more or less" linear, corresponding
ratios typically should be equal. In certain embodiments, the
component derives a ratio cost to measure this property. In certain
embodiments, the component also concentrates on big peaks by deriving
a height cost. The total cost of a matching is a weighted sum of
these constituent costs,
In certain embodiments, the component formulates the size
standard matching problem as finding a matching with maximum cost.
In such embodiments, the cost is separable. That is, with some
5
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
additional mathematics, the component can maximise subsequences
independently. In certain embodiments, the cost also enjoys the
advantage of being local, thereby compensating for global deviations
from linearity. This cost also leads to a quality value between 0
and 1.
A size standard is a set of DNA fragments, each of known size.
The definition of a size standard is simply a list of these sizes.
Note that a size-standard definition typically does not depend on the
instrument on which the size standard is run, and therefore not on
any particular set of run conditions either.
An in-lane size standard is a set of peaks resulting from
running a size standard on an instrument. One determines the
positions and the heights of the peaks.
A size-standard matcher takes as input an in-lane size standard
and a size standard definition. It produces as output a matching,
that is, a list of pairs of the form (peak,size), where each peak and
each fragment size appears at most once. A peak has a position
(e. g., in scan numbers) and a height (e. g., in fluorescent units).
Fragment sizes are given in nucleotides.
Assume that there are at least as many peaks as sizes.
Furthermore, assume that every size has a corresponding peak, except
possibly for some small number from the end of this list. This
exception is meant to model the situation where a user may have
stopped electrophoresis early, before the larger fragments have had a
chance to elute.
In certain embodiments, one employs the following components.
Let P=lpo,p"...,j7~r_~) be a list of nP peak locations, given by
increasing scan number, for example. Let H=~j2o,12~...,ltn -~ J be a list
v
of the corresponding nP peak heights, given in fluorescent units, for
example. The size standard definition S=LSO,s~,...,s,~-~, is a list of
s
ns fragment sizes in increasing nucleotides. By assumption, nP Z ns.
A size-standard matchin is a set of airs f
g p M =~~io~0)~~ii~l)~..., (i»,~n,.)
where the i~ are in increasing order, that is, where subscript j < k
implies i~ < ik.
3S
6
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Example 1 Peaks, Sizes, and a Matching
Consider the peaks, sizes, and matching displayed in Figure 2.
The list P contains nP = 11 peak positions:
P = [968, 1029, 1203, 1259, 1412, 1535, 1714, 1751, 1785, 1837,
1928] .
These np peaks have heights H:
H = [2722, 6219, 1060, 5380, 7726, 1082, 7424, 1263, 7335, 7937,
1562] .
The size standard definition has ns = 5 sizes S:
S = [75, 100, 139, 150, 160] .
Finally, M is the matching shown in the figure:
M = { (3, 0) , (4, 1) , (6, 2) , (8, 3) , (9, 4) } .
In certain embodiments, big-Oh notation is used to express
algorithm complexity. This notation is ubiquitous in worst-case and
average-case resource analysis. Briefly, a function f is said to be
on the order of another function g (written f(x) = O(g(x)) if there
exists positive constants c and N such that ~f(x)~ _< c~g(x)~ for all
x >- N.
Evaluating a Matching
Assume there is a matching. In certain embodiments,, the
matcher component evaluates a matching by examining its two
constituent sequences. It treats the sequence of peaks as a sequence
of edges between peaks, and similarly for the sizes. For example, M
- ~(3, 0), (4, 1), (6, 2), (8, 3), (9, 4)} is the matching from
Example 1. Its peak (index) sequence is [3, 4, 6, 8, 9], which has
four edges, (3, 4), (4, 6), (6, 8), and (8, 9). Similarly, its
fragment size definition (index) sequence is [0, 1, 2, 3, 4], which
also has four edges: (0, 1), (1, 2), (2, 3), and (3, 4).
A matching is also a correspondence between edges. In this
example, peak edge (6, 8) corresponds to definition edge (2, 3).
Assume that two edges are adjacent if they share an endpoint. In
this example, (4, 6) and (6, 8) are adjacent since they share peak 6.
Two adjacent edges (i,j) and (j, k) define a ratio risk of lengths:
1'J~ _ ~k p~ . (1)
h; - hr
7
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
In certain embodiments, one can employ a more economical
notation for size ratios for matching all sizes:
r = S,f+z -Sf+~ .
I
s f+~ - s.f ( ~ )
Again, a matching is also a correspondence between ratios. In this
example, peak ratio r689 corresponds to size ratio rZ.
Under the assumption that the relation between peak position
and fragment size is "more or less" lineax, corresponding ratios
typically should be equal. More formally, assume that the fragment
of size si occurs at position pi. ~If there are coefficients a and b
such that pi = asi + b for all i, then
p~ -pj - ~as~. +b)-~asj +b~ a~s~ -sj~- s~ -sj
(3)
pj-p ~ -~ r ~ ~- ,
as . + b as. + b a s . s. s . s.
To measure the similarity of a corresponding pair of ratios ri~x
and rf, one may define their ratio cost cr(i, j, k, f) to be
min rt.~ , r
~,.~i~J~k~.f~= ~ ~ jv_ (4)
max ~;~~,rf
Note that 0 <- cr(i,j,k,f) <_ 1 for all 0 <- i < j < k < np and 0 _<< f <
n5 - 2. Note also that cr(i,j,k,f) = 1 indicates the ideal of equal
ratios. The ratio cost of a matching is the sum of its individual
costs .
In certain embodiments, one has the matchings concentrate on
the big peaks. To this end, one may define the height cost ch(i) of
a matched peak i to be its height divided by the maximum peak height
1i. More formally,
3o h= max h. (5)
0_<j<rrr,
and
8
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
cn~i~= h . (6>
Again, 0 < ch(i) <- 1 for all peaks 0 _<< i < nP, and ch(i) = 1, in
certain embodiments, corresponds to the ideal of a maximally tall
peak.
In order to combine these two types of costs, one may weight
and sum them. Since there are only two costs, a single weighting
parameter a, where 0 <- a <_ l, will suffice. The total cost c(M) of a
matching M is the weighted sum:
c~M~=a ~ c,.~i, j, k, f~+~1-a> ~ c,,~i~ . (7)
r>.f )EM O,.f )EM
j,f+1 eM
~h,.f+Z~M
One may now formulate the size standard matching problem as
finding a matching with maximum cost. Note that the cost is local in
the sense that each element of the summation depends on at most three
adjacent points. This property allows the matcher component in
certain embodiments. to compensate for global deviations from
linearity.
Quality Measures
If one divides the cost of a matching by the maximum possible
cost over all matchings, one would have a number between 0 and 1 that
indicates its quality. What is this maximum possible cost? Every
pair of ratios in such a matching would contribute its maximum value,
namely axl. There are a total of n - 2 ratio pairs. Similarly,
every matched peak would be at the maximum height, so that all n
matched peaks (one for each definition size) contribute (1 - a) x 1.
The maximum possible cost c, therefore, is:
c=~n-2~a,+n~l-a~=n-2a (
The quality of a matching M is therefore given by c(M)/c.
Other possible quality measures include the sum of just the
ratio costs, and the worst ratio cost in the matching.
An Efficient Algorithm
In certain embodiments, an advantage of the above formulation
9
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
is that the cost is separable. That is, with some additional
mathematics, one can maximise subsequences independently. This
property leads to an efficient dynamic programming algorithm. In
certain embodiments, the algorithm is efficient (runs in low-order
polynomial time and space) and guarantees an optimal solution.
Let c : ~~ -. ~. denote the maximum cost of a matching
subproblem. In particular, let c(j, k, f) denote the maximum cost of
matching the peaks from 0 to k with the definition fragments from 0
to f + 1 in such a way that peak j matches size f and peak k matches
size f + 1. The cost of matching all sizes is therefore
c~M*~= max c~j,k,ns. -2~ (9)
i<h~n,~
where M* is the optimal matching. Note that every definition
fragment matches some peak, but only n5 of the peaks need match in
this embodiment.
One can now express a maximum cost recursively. For f = 0
there are no ratios to compute, so one need only be concerned with
the height cost:
c~j,k,0~= ~1-cz~c,,~j~+ch~k)) . (lo)
For f > 0, one can compute the cost recursively by adding the height
cost for the newly matched peak k to a new ratio cost and a previous
subproblem cost:
c~j,k, ~=(1-a~~c,,~k~ (11)
+max~(i, j, f -1~+a~c,.(i, j,k, f -l~~
i<~
Converting these equations to algorithms is straightforward.
In certain embodiments, the matcher component computes the individual
elements in a consistent order. Furthermore, one may exploit the
fact that one can match every size in the definition by limiting the
computation. In certain embodiments, the matcher component only
needs to compute c(j, k, f) for k > j Z f since j peaks cannot be
made to fit all f sizes if j < f. Similarly, in certain embodiments,
the matcher component needs to examine only subproblems c(i,j,f-1)
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
where i ? f - 1 since i peaks could not fit all f-1 sizes if i < f -
1.
To this end, Algorithm 2 solves Equation 10 and Algorithm 3
solves Equation 11.
Algorithm 2 Basis of the Recursion (when f = 0)
for j .- 0, 1,... , nP - 2 do
for k .- j + 1, j + 2, ... , np - 1 do
c(j~k~ 0) ~- (l - a) (ch(j) + ch(k) )
Algorithm 3 Compute Cost of Matching (when f > 0)
1. for f ,.- 1,2,... ,ns - 2 do
2. for k .- f + 1, f + 2,... , f + np - ns + l do
3. for j .- f, f + 1,... , k - 1 do
4 . c * .-
5. for i .-- f - 1, f,... , j - 1 do
6. , ci .- e(i, j, f - 1) + a ~ cr(i, j,k, f - 1)
7. if ci > c* then
8 , c * .- ci
9. c(j,k,f) ~- (1 - a) ~ ch(k) + c*
As stated, these algorithms compute only the cost of an optimal
matching. One will still retrieve a matching from this calculation.
This is often a standard part of dynamic programming algorithms.
When memory requirements are very high, it is often the practice to
recompute the path to the optimal cost from the cost matrix. Since
certain embodiments have relatively small sequences, one can trade
time for memory by keeping an array of backpointers, or predecessors
p. It is easy to maintain this array by adding the line p(j,k,f) .-- i
after line 8 in Algorithm 3. This assignment indicates that the
predecessor to cost c(j,k,f) is c(i,j,f-1). Then, the matcher
component may reconstruct an optimal matching from Equation 9 by
tracing backwards.
Computational Resources
Theoretical Run-time Analysis
In certain embodiments, the algorithm's run time complexity is
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
dominated by the number of times it executes Lines 6 and 7. The
lines themselves execute in constant time. The inner (the i) loop
executes them ~i'r-~ 1 times. Therefore the j loop executes them
H_~ i-~
~.-.r-1 1 times. The k loop terminates at k = f + nP - ns + 1 =
f + m + 1, where m = nP - ns is the number of extra peaks.
Continuing in this way, one sees that the lines are executed a total
of T(m, ns) times, where
..,.-z n~+.r+i k-~ j
T~nZ,hs.~= ~ ~ ~ 1 . (12)
.r=~ x=.f+~ i=I r=I-1
to
This expression is not as formidable as it looks since the inner
three summations are independent of the value f. By a judicious
substitution of variables, one sees that:
n,+_r+1 k-1 ;-1 m k 1
IS ~ ~ ~ 1=~ ~ ~ 1. (13)
=.r+~ j=.r 1=.r-1 k=~ J=~ i=0
A straightforward calculation shows that:
~~ ~ ~ 1=6~nZ3+6naz+llm+6~=6~m+3~m+2~m+1~.
=o j=o r=o
It follows that
n -2 m+r+1 k-1 j-1
T ~m~ n~s~ ~ _
' .r=1 ~=.r+1 j=f i=f-1
= 6(m+3~m+2~m+l~n,s -2) (14)
= O~msns
That is, the execution time increases only linearly with the number
of definition fragments, but it increases as the cube of the number
of extra peaks. Note that, when the number of peaks equals the
number of definition fragments (that is, when m = 0), Lines 6 and 7
are executed only ns -2 times, which is exactly the number of ratios
that need to be compared to evaluate any matching.
12
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Empirical Measurements
The theoretical analysis in the previous subsection allows one
to understand the asymptotic behaviour of the algorithm. That is, it
allows one to predict the trend in the run-time when the inputs are
large. For smaller inputs, in certain embodiments, various overhead
factors influence the run time.
One can construct several sets of synthetic data and time a C++
implementation of the algorithm. The data includes size standard
definitions with ns = 5 to ns = 40 fragment sizes. In every case the
ith fragment has size 20a., where i >- 1. The in-lane peaks have
positions equal to the definition sizes, but they also have m = 0 to
m = 20 additional peaks, where the ith additional peak has position
10+20i, for i >- 0. For each combination of ns and m, a test program
executes the matcher component 20 times and divides the elapsed time
by 20 also, to give the time for each execution in milliseconds.
Figures 3 to 5 show the results. The execution times
themselves, rounded to the nearest millisecond, are provided in Fig.
10.
Memory
Full arrays for holding the costs and predecessors may
use ~n2 -1- t2s.~2 t2~. = m2ns. -I- 2mYts -1-Yts real values . Initialising
these
arrays therefore takes asymptotically more time, when YI23 =O(ns ),
than the optimisation algorithm. If this is a problem, the arrays
can be implemented as sparse arrays, so that they occupy O(m3ns)
space as well as time. Another solution is to use full arrays, but
to index them not with the peak indices, but rather with the
substituted variables in Equation 13. A third possibility is to use
and allocate full matrices, but to not initialise them.
Practical Considerations
In certain embodiments, one may want to determine a set of
candidates peaks that the matcher component should size. One may
choose to allow a parameter m specifying the number of extra peaks to
consider. The size standard matcher then extracts the np = ns + m
tallest peaks from all peaks detected, for example by a previous
sizecalling step. One may use m = 4.
One may use a weighting factor a between 1/2 and 3/4.
13
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
An analyst typically should choose a size standard definition
that corresponds accurately to the in-lane size standard. However,
it may be that an analyst terminated a run early, before the longer
fragments have had a chance to elute. In this case, the definition
is not accurate, strictly speaking. To provide some robustness in
this situation, one may test if the optimal matching satisfies a
minimum acceptable quality parameter. If not, one may remove the
last definition size and try again, repeating this process until the
quality is acceptable. Alternatively, if the quality is
unacceptable, one may simply report this without returning a
matching.
Results
Figure 6 shows an example of matching a run of the Applied
Biosystems GeneScan 500 size standard on a Manhattan breadboard. The
top panel shows the matched electropherogram. Again, peak positions
are in green, and peak bases are in red. The middle panel shows the
"errors" obtained by fitting this matching to a straight line and
subtracting the assigned size from the size predicted from the linear
fit. Finally, the bottom panel shows the definition points(in green
circles) and the predicted peak sizes (in red circles). The
correspondence between the two is shown in blue line segments.
Prior to analysis, the primer peak is removed from the peaks
used for the matching, although it is still indicated in the diagram.
The electropherogram contains an extra peak at its end (around scan
4400). In this example, the described approach correctly did not
match this peak. However that peak was often assigned 500 nucleotides
by prior art matchers.
The electropherogram also shows a spurious "spike" around scan
2200. A peak detector rejected this as a peak since it did not meet
the minimum peak width criterion.
The 250-nucleotide fragment of the size standard is known to
migrate anomalously on the 310, and seems to do so on Manhattan,
also. The 340-nucleotide fragment is also known to migrate
anomalously; these two fragments typically are not used fox sizing on
the 310.
14
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
Figure 7 shows the run from Figure 6 matched with GeneScan 500
less sizes 250 and 340. Note that the matcher correctly ignored the
peaks corresponding to these two fragments.
Figure 8 shows the result of matching GeneScan 500, less sizes
35 and 50 with a run that was terminated early. Tl~e matcher correctly
ignores the corresponding peaks. The matcher does initially match all
sizes, including the unrepresented large ones, to the available
peaks, but the matching quality was not high enough. Consequently,
The matcher then strips one size at a time off the end of the
definition until the resulting match is of sufficient quality.
Figure 9 shows a matching with the GeneScan 400HD size
standard. This electropherogram also contains a spike (between size
90 and 100). A peak detector ignored this peak due to a minimum peak
width parameter. The primer peak is not visible since the analyst set
an analysis range that excludes it. Note from the middle panel that
the extremes(sizes 50, 60, 380, and 400) of the standard did not run
"as linearly" as the rest. These extremes would cause previous
matchers some problems, though their errors are quite small here.
Conclusion
It can be seen that embodiments of the present invention
provide a simple formulation of the size-standard matching problem
which leads to a simple algorithm for its solution. In some
embodiments the formulation emphasises corresponding ratios of edge
lengths combined with the heights of the matched peaks. In preferred
embodiments, the corresponding algorithm employs dynamic programming
directly on the input sequences. It runs in O(m3 ns) time, where ns is
the number of fragment sizes in the size-standard definition, and m =
nP - ns is the number of extra peaks in the in-lane size standard.
Preferred embodiments of the invention employ an algorithm that
requires just a very few (three) parameters:
1. the weight for the ratio cost,
2. the minimum acceptable quality, and
3. the number m of extra peaks to consider.
l5
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
System Architecture
Figure 11 is a block diagram that illustrates a computer system
500, according to certain embodiments, upon which embodiments of the
invention may be implemented. Computer system 500 includes a bus 502
or other communication mechanism for communicating information, and a
processor 504 coupled with bus 502 for processing information.
Computer system 500 also includes a memory 506, which can be a random
access memory (RAM) or other dynamic storage device, coupled to bus
502 for determining size matches, and instructions to be executed by
processor 504. Memory 506 also may be used for storing temporary
variables or other intermediate information during execution of
instructions to be executed by processor 504. Computer system 500
further includes a read only memory (ROM) 508 or other static storage
device coupled to bus 502 for storing static information and
instructions for processor 504. A storage device 510, such as a
magnetic disk or optical disk, is provided and coupled to bus 502 for
storing information and instructions.
Computer system 500 may be coupled via bus 502 to a display
512, such as a cathode ray tube (CRT) or liquid crystal display
(LCD), for displaying information to a computer user. An input
device 514, including alphanumeric and other keys, is coupled to bus
502 for communicating information and command selections to processor
504. Another type of user input device is cursor control 516, such
as a mouse, a trackball or cursor direction keys for communicating
direction information and command selections to processor 504 and for
controlling cursor movement on display 512. This input device
typically has two degrees of freedom in two axes, a first axis (e. g.,
x) and a second axis (e. g., y), that allows the device to specify
positions in a plane.
A computer system 500 evaluates size matchings and provides a
quality level fox each matching. Consistent with certain
implementations of the invention, a matching and associated quality
is provided by computer system 500 in response to processor 504
l6
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
executing one or more sequences of one or more instructions contained
in memory 506. Such instructions may be read into memory 506 from
another computer-readable medium, such as storage device 510.
Execution of the sequences of instructions contained in memory 506
causes processor 504 to perform the process states described herein.
Alternatively hard-wired circuitry may be used in place of or in
combination with software instructions to implement the invention.
Thus implementations of the invention are not limited to any specific
combination of hardware circuitry and software.
The term "computer-readable medium" as used herein refers to
any media that participates in providing instructions to processor
504 for execution. Such a medium may take many forms, including but
not limited to, non-volatile media, volatile media, and transmission
media. Non-volatile media includes, for example, optical or magnetic
disks, such as storage device 510_ Volatile media includes dynamic
memory, such as memory 506. Transmission media includes coaxial
cables, copper wire, and fiber optics, including the wires that
comprise bus 502. Transmission media can also take the form of
acoustic or light waves, such as those generated during radio-wave
and infra-red data communications.
Common forms of computer-readable media include, for example, a
floppy disk, a flexible disk, hard disk, magnetic tape, or any other
magnetic medium, a CD-ROM, any other optical medium, punch cards,
papertape, any other physical medium with patterns of holes, a RAM,
PROM, and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a
carrier wave as described hereinafter, or any other medium from which
a computer can read.
Various forms of computer readable media may be involved in
carrying one or more sequences of one or more instructions to
processor 504 for execution. For example, the instructions may
initially be carried on magnetic disk of a remote computer. The
remote computer can load the instructions into its dynamic memory and
send the instructions over a telephone line using a modem. A modem
local to computer system 500 can receive the data on the telephone
line and use an infra-red transmitter to convert the data to an
infra-red signal. An infra-red detector coupled to bus 502 can
receive the data carried in the infra-red signal and place the data
17
CA 02416611 2003-O1-17
WO 02/08949 PCT/USO1/23630
on bus 502. Bus 502 carries the data to memory 506, from which
processor 504 retrieves and executes the instructions. The
instructions received by memory 506 may optionally be stored on
storage device 510 either before or after execution by processor 504.
It will be appreciated that the above embodiments and results
are exemplary only and various modifications to that which have been
specifically described are possible within the scope of the
invention. In particular it is to be noted that the applicability of
the invention is not limited to the described biological data, but is
also applicable to~other biological and non-biological data.
18