Patent 3167253 Summary

(12) Patent Application: (11) CA 3167253
(54) English Title: METHODS AND SYSTEMS FOR A LIQUID BIOPSY ASSAY
(54) French Title: PROCEDES ET SYSTEMES DE DOSAGE DE BIOPSIE DE LIQUIDE
Status: Compliant
Bibliographic Data
(51) International Patent Classification (IPC):
  • G16B 20/10 (2019.01)
(72) Inventors :
  • TELL, ROBERT (United States of America)
  • ZHU, WEI (United States of America)
  • FINKLE, JUSTIN DAVID (United States of America)
  • LO, CHRISTINE (United States of America)
  • DRIESSEN, TERRI M. (United States of America)
(73) Owners :
  • TEMPUS AI, INC. (United States of America)
(71) Applicants :
  • TEMPUS LABS, INC. (United States of America)
(74) Agent: FASKEN MARTINEAU DUMOULIN LLP
(74) Associate agent:
(45) Issued:
(86) PCT Filing Date: 2021-02-18
(87) Open to Public Inspection: 2021-08-26
Availability of licence: N/A
(25) Language of filing: English

Patent Cooperation Treaty (PCT): Yes
(86) PCT Filing Number: PCT/US2021/018622
(87) International Publication Number: WO2021/168146
(85) National Entry: 2022-08-05

(30) Application Priority Data:
Application No.    Country/Territory           Date
62/978,130         United States of America    2020-02-18
63/041,424         United States of America    2020-06-19
63/041,293         United States of America    2020-06-19
63/041,272         United States of America    2020-06-19

Abstracts

English Abstract

Methods, systems, and software are provided for validating a copy number variation, validating a somatic sequence variant, and/or determining circulating tumor fraction estimates using on-target and off-target sequence reads in a test subject. A copy number status annotation for a genomic segment is validated by applying a first dataset to a plurality of filters comprising a measure of central tendency bin-level sequence ratio filter, a confidence filter, and a measure of central tendency-plus-deviation bin-level sequence ratio filter. A somatic sequence variant is validated by comparing a variant allele fragment count for a candidate somatic sequence variant for a respective locus against a dynamic variant count threshold for the locus in a respective reference sequence. A circulating tumor fraction is estimated based on a measure of fit between genomic segment-level coverage ratios and integer copy states across a plurality of simulated circulating tumor fractions.
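The last sentence of the abstract can be illustrated with a minimal sketch. It assumes a diploid baseline, log2-scaled segment-level coverage ratios, a small set of candidate integer copy states, and a sum of squared errors as the measure of fit. None of these choices is stated in the abstract, and the function names are hypothetical.

```python
import math


def expected_log2_ratio(copies, tumor_fraction, ploidy=2):
    """Expected log2 coverage ratio for a segment with `copies` tumor copies.

    Assumes (for illustration only) that non-tumor DNA is diploid and that
    coverage scales linearly with the mixture of tumor and non-tumor copies.
    """
    mixed = tumor_fraction * copies + (1 - tumor_fraction) * ploidy
    return math.log2(mixed / ploidy)


def estimate_tumor_fraction(segment_log2_ratios, candidate_fractions,
                            copy_states=(1, 2, 3, 4)):
    """Return the simulated tumor fraction with the best fit, and its error.

    For each candidate fraction, every segment is matched to the integer copy
    state whose expected ratio is closest; the squared distances are summed.
    """
    best_fraction, best_error = None, math.inf
    for fraction in candidate_fractions:
        error = sum(
            min((ratio - expected_log2_ratio(c, fraction)) ** 2
                for c in copy_states)
            for ratio in segment_log2_ratios
        )
        if error < best_error:
            best_fraction, best_error = fraction, error
    return best_fraction, best_error
```

A grid search like this is only one way to compare simulated fractions; very low fractions tend to fit near-zero ratios well, so a real estimator would likely need additional constraints that are not shown here.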


French Abstract

L'invention concerne des procédés, des systèmes et un logiciel permettant de valider une variation d'un nombre de copies, de valider une variante de séquence somatique, et/ou de déterminer des estimations de fractions tumorales circulantes à l'aide de lectures de séquences sur cible et hors cible chez un sujet de test. Une annotation d'état de nombre de copies d'un segment génomique est validée par application d'un premier ensemble de données à une pluralité de filtres comprenant un filtre de mesure de rapport de séquence de niveau binaire de tendance centrale, un filtre de confiance, et un filtre de mesure de rapport de séquence de niveau binaire de tendance centrale-plus-écart. Une variante de séquence somatique est validée par comparaison d'un compte de fragments d'allèle variant d'une variante de séquence somatique candidate d'un locus respectif, par rapport à un seuil de compte de variantes dynamiques du locus dans une séquence de référence respective. Une fraction tumorale circulante est estimée sur la base d'une mesure d'ajustement entre des rapports de segment génomique-couverture de niveau et des états de copie entière sur une pluralité de fractions tumorales en circulation simulées.

Claims

Note: Claims are shown in the official language in which they were submitted.


WO 2021/168146
PCT/US2021/018622
WHAT IS CLAIMED IS:
1. A method of validating a copy number variation in a test
subject, the method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads, wherein the
first plurality of sequence reads comprises at least 100,000 sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of a reference genome for the species of the subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a comparison of the first plurality of
sequence reads to
sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference genome for the species of the subject
encompassing a
subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters, the plurality of
filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds;
(2) a confidence filter that is fired when the segment-level measure of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds,
wherein the one or more measure of central tendency-plus-deviation bin-level
copy ratio
thresholds are derived from (i) a measure of central tendency of the bin-level
sequence ratios
corresponding to the plurality of bins that map to the same chromosome of the
reference
genome for the species of the subject as the respective segment, and (ii) a
measure of
dispersion across the bin-level sequence ratios corresponding to the plurality
of bins that map
to the respective chromosome;
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of
each of the filters in
the plurality of filters.
2. A method of validating a copy number variation in a test
subject, the method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject, wherein the reference
sequence for the
species represents at least 1 Mb of the genome for the species;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of a reference genome for the species of the subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a comparison of the first plurality of
sequence reads to
sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference genome for the species of the subject
encompassing a
subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters, the plurality of
filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds;
(2) a confidence filter that is fired when the segment-level measure of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds,
wherein the one or more measure of central tendency-plus-deviation bin-level
copy ratio
thresholds are derived from (i) a measure of central tendency of the bin-level
sequence ratios
corresponding to the plurality of bins that map to the same chromosome of the
reference
genome for the species of the subject as the respective segment, and (ii) a
measure of
dispersion across the bin-level sequence ratios corresponding to the plurality
of bins that map
to the respective chromosome;
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of
each of the filters in
the plurality of filters.
3. The method of claim 1 or 2, wherein the liquid biopsy sample is a blood
sample.
4. The method of any one of claims 1-3, wherein the liquid biopsy sample
comprises
blood, whole blood, peripheral blood, plasma, serum, or lymph of the test
subject.
5. The method of any one of claims 1-4, further comprising obtaining the
liquid biopsy
sample from a sample repository or database.
6. The method of any one of claims 1-5, wherein the test subject is a
patient in a clinical
trial.
7. The method of any one of claims 1-6, wherein the test subject is a
patient with a
cancer.
8. The method of claim 7, wherein the cancer is a solid tumor cancer.
9. The method of claim 7 or 8, wherein the cancer is Ovarian Cancer,
Cervical Cancer,
Uveal Melanoma, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver
Cancer,
Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal
cancer,
Neural, Neuroblastoma, Basal Cell Carcinoma, Brain Cancer, Breast Cancer,
Melanoma,
Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Glioma, Tumor of Unknown
Origin,
Kidney Cancer, Gastrointestinal Stromal Tumor, Medulloblastoma, Bladder
Cancer, Gastric
Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Low Grade Glioma,
Prostate
Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma,
Testicular
cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma,
Meningioma,
Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Mesothelioma,
Esophageal
Cancer, Small Cell Lung Cancer, Her2 Negative Breast Cancer, Solid Tumor,
Ovarian Serous
Carcinoma, HR+ Breast Cancer, Uterine Serous Carcinoma, Endometrial Cancer,
Uterine
Corpus Endometrial Carcinoma, Gastroesophageal Junction Adenocarcinoma,
Gallbladder
Cancer, Chordoma, or Papillary Renal Cell Carcinoma.
10. The method of any one of claims 1-9, wherein the plurality of cell-free
nucleic acids comprise circulating tumor DNA (ctDNA).
11. The method of any one of claims 1-10, wherein the liquid biopsy sample
corresponds to a matched tumor sample.
12. The method of any one of claims 1-11, wherein the one or more reference
samples are non-cancerous samples.
13. The method of any one of claims 1-11, wherein the one or more reference
samples is a sample of a non-cancerous tissue from the subject.
14. The method of claim 13, wherein the liquid biopsy sample is a cell-free
fraction of a whole blood sample from the subject and the one or more
reference samples is a cell-containing fraction of the whole blood sample
from the subject.
15. The method of any one of claims 1-12, further comprising obtaining the
one or more reference samples from a sample repository or a database.
16. The method of any one of claims 1-15, further comprising performing the
first sequencing reaction.
17. The method of any one of claims 1-16, wherein the sequencing is
multiplexed sequencing.
18. The method of claim 16, further comprising isolating the plurality of
cell-free nucleic
acids from the liquid biopsy sample of the test subject prior to the
sequencing.
19. The method of any one of claims 1-18, wherein the sequencing is short-
read
sequencing or long-read sequencing.
20. The method of any one of claims 1-19, wherein the sequencing is a panel-
enriched
sequencing reaction.
21. The method of claim 20, wherein the sequencing reaction is performed at
a read depth
of at least 1,00.
22. The method of claim 20 or 21, wherein the panel-enriched sequencing
reaction uses a
sequencing panel that enriches for at least 50 genes.
23. The method of any one of claims 20-22, wherein the panel-enriched
sequencing
reaction uses a sequencing panel that enriches for at least 10 genes listed in
Table 1.
24. The method of any one of claims 20-23, wherein the panel-enriched
sequencing
reaction uses a sequencing panel that enriches for at least 10 genes listed in
List 1.
25. The method of any one of claims 20-24, wherein the panel-enriched
sequencing
reaction uses a sequencing panel that enriches for at least 10 genes listed in
List 2.
26. The method of any one of claims 20-25, wherein the panel-enriched
sequencing
reaction uses a sequencing panel that enriches for one or more genes selected
from the group
consisting of MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2.
27. The method of any one of claims 1-26, wherein the aligning (B)
comprises aligning
each respective sequence read in the first plurality of sequence reads to a
reference human
genome.
28. The method of any one of claims 1-27, wherein the sequencing of the
plurality of cell-
free nucleic acids in the first liquid biopsy sample of the test subject is
performed at a central
laboratory or sequencing facility.
29. The method of claim 28, wherein the obtaining (A) comprises accessing
the first
dataset, in electronic form, through a cloud-based interface.
30. The method of any one of claims 1-29, wherein each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios is derived from a
comparison of (a) a test
read depth of sequence reads, in the plurality of sequence reads, that map to
the
corresponding bin in the plurality of bins, to (b) a measure of central
tendency of reference
read depths of sequence reads, for each of a plurality of reference samples,
that map to the
corresponding bin.
31. The method of claim 30, wherein the test read depth and the reference
read depths are
centered and corrected.
32. The method of claim 31, wherein the test read depth and the reference
read depths are
median-centered.
33. The method of any one of claims 30-32, wherein the measure of central
tendency of
read depths for the corresponding bin, across one or more reference samples is
an arithmetic
mean, a weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a
mean, a
median, or a mode of read depths for the corresponding bin, across one or more
reference
samples.
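As a rough illustration of claims 30-33, the sketch below median-centers the read depths of the test sample and of each reference sample, takes the median of the centered reference depths per bin, and compares the two. The function names, the pseudocount, and the log2 scale are assumptions made for this example only; the claims do not specify them, and the "corrected" step of claim 31 is not modeled.

```python
import math
from statistics import median


def median_center(depths):
    """Divide each bin's read depth by the sample-wide median depth."""
    center = median(depths)
    if center <= 0:
        raise ValueError("median read depth must be positive")
    return [depth / center for depth in depths]


def bin_level_log2_ratios(test_depths, reference_depths, pseudocount=1e-6):
    """Compare test depths to the per-bin median of the reference depths.

    test_depths: one read depth per bin for the test sample.
    reference_depths: a list of per-bin depth lists, one per reference sample.
    The pseudocount (an assumption) avoids taking the log of zero.
    """
    test = median_center(test_depths)
    references = [median_center(sample) for sample in reference_depths]
    ratios = []
    for i, test_depth in enumerate(test):
        reference_center = median(sample[i] for sample in references)
        ratios.append(
            math.log2((test_depth + pseudocount)
                      / (reference_center + pseudocount))
        )
    return ratios
```

Claim 33 lists other measures of central tendency (for example a weighted or Winsorized mean); any of them could stand in for the median used here.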
34. The method of any one of claims 30-33, wherein the read depths for each
respective
bin, in the plurality of bins, are determined by binning sequence reads
obtained for the
plurality of cell-free nucleic acids in a panel-enriched sequencing reaction.
35. The method of claim 34, wherein the plurality of bin-level sequence
ratios comprises:
a first sub-plurality of bin-level sequence ratios corresponding to bins that
map to the
same position of the human reference genome as an enriched locus in the panel-
enriched
sequencing reaction; and
a second sub-plurality of bin-level sequence ratios corresponding to bins that
do not
map to the same position of the human reference genome as any enriched locus
in the panel-
enriched sequencing reaction.
36. The method of any one of claims 1-35, wherein each bin-level sequence
ratio in the
plurality of bin-level sequence ratios is a copy ratio.
37. The method of any one of claims 1-36, wherein each respective segment,
in the
plurality of segments that represents a corresponding region of the human
reference genome,
consists of a subset of adjacent bins that are grouped together based on a
similarity between
the respective sequence ratios of the subset of adjacent bins.
38. The method of claim 37, wherein the grouping is performed using
circular binary
segmentation (CBS).
39. The method of any one of claims 1-38, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment in the (A) obtaining is an arithmetic mean, a weighted
mean, a midrange,
a midhinge, a trimean, a Winsorized mean, a mean, a median, or a mode of the
bin-level
sequence ratios for all the respective bins encompassed by the respective
segment.
40. The method of any one of claims 1-39, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios in the (A) obtaining is a weighted
mean.
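A weighted mean, as recited in claim 40, could be computed as in the minimal sketch below. The claims do not say what the weights are; here they are passed in by the caller, and the function name is hypothetical.

```python
def weighted_mean(bin_ratios, weights):
    """Weighted mean of the bin-level ratios in one segment.

    The weighting scheme is an assumption left to the caller, for example
    one weight per bin reflecting how reliable that bin is judged to be.
    """
    if len(bin_ratios) != len(weights):
        raise ValueError("one weight is required per bin-level ratio")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(r * w for r, w in zip(bin_ratios, weights)) / total_weight
```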
41. The method of any one of claims 1-40, wherein each respective segment-
level
measure of dispersion in the plurality of segment-level measures of dispersion
is a confidence
interval, a standard deviation, a standard error, a variance, or a range.
42. The method of any one of claims 1-41, wherein
each respective segment-level measure of dispersion in the plurality of
segment-level
measures of dispersion is a confidence interval, and
determining each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion comprises bootstrapping the plurality of
bin-level
sequence ratios corresponding to the subset of bins encompassed by the
respective segment.
43. The method of any one of claims 1-42, wherein a respective segment in
the plurality
of segments is annotated with a copy number status annotation when the
corresponding
segment-level sequence ratio satisfies one or more segment-level sequence
ratio thresholds.
44. The method of any one of claims 1-43, wherein a copy number status
annotation is
selected from the group consisting of amplified, deleted, or neutral.
45. The method of any one of claims 1-44, wherein the plurality of filters
further
comprises:
(4) a segment-level sequence ratio filter that is fired when the segment-level
sequence ratio corresponding to the respective segment fails to satisfy one or
more segment-level sequence ratio thresholds.
46. The method of claim 45, wherein the segment-level sequence ratio
corresponding to
the respective segment fails to satisfy one or more segment-level sequence
ratio thresholds
when the segment-level sequence ratio is lower than a segment-level sequence
ratio
amplification threshold.
47. The method of claim 45, wherein the segment-level sequence ratio
corresponding to
the respective segment fails to satisfy one or more segment-level sequence
ratio thresholds
when the segment-level sequence ratio is higher than a segment-level sequence
ratio deletion
threshold.
48. The method of claim 46, wherein a segment-level sequence ratio
amplification
threshold is between 0 and 0.5.
49. The method of claim 47, wherein a segment-level sequence ratio deletion
threshold is
between -0.75 and -0.25.
50. The method of any one of claims 1-49, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment fails to satisfy one or more bin-level sequence ratio
thresholds when the
measure of central tendency is lower than a bin-level sequence ratio
amplification threshold.
51. The method of any one of claims 1-49, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment fails to satisfy one or more bin-level sequence ratio
thresholds when the
measure of central tendency is higher than a bin-level sequence ratio deletion
threshold.
52. The method of any one of claims 1-51, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment in the (B) validating is an arithmetic mean, a weighted
mean, a midrange,
a midhinge, a trimean, a Winsorized mean, a mean, a median or a mode of the
bin-level
sequence ratios for all the respective bins encompassed by the respective
segment.
53. The method of any one of claims 1-52, wherein the measure of central
tendency of the
plurality of bin-level sequence ratios in the (B) validating is a median.
54. The method of claim 50, wherein a bin-level sequence ratio
amplification threshold is
between 0 and 0.5.
55. The method of claim 51, wherein a bin-level sequence ratio deletion
threshold is
between -0.75 and -0.25.
56. The method of any one of claims 1-55, wherein the segment-level measure
of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold
when the lower bound of the measure of dispersion is lower than the confidence
threshold.
57. The method of any one of claims 1-55, wherein the segment-level measure
of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold
when the upper bound of the measure of dispersion is higher than the
confidence threshold.
58. The method of any one of claims 1-57, wherein the confidence threshold
is a measure
of central tendency of the segment-level sequence ratios corresponding to all
other segments
that map to the same chromosome of the human reference genome as the
respective segment.
59. The method of claim 58, wherein the measure of central tendency of the
segment-level sequence ratios is an arithmetic mean, a weighted mean, a
midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median or a
mode.
60. The method of any one of claims 1-59, wherein:
the one or more measure of central tendency-plus-deviation bin-level sequence
ratio
thresholds is a sum of:
(i) a measure of central tendency value of the bin-level sequence ratios
corresponding to the plurality of bins that map to the same chromosome, and
(ii) the measure of central tendency value of a plurality of absolute
dispersions, wherein:
each absolute dispersion is determined using a comparison between
each respective bin-level sequence ratio corresponding to each respective bin
in the plurality
of bins that map to the same chromosome as the respective segment, and the
measure of
central tendency value of the bin-level sequence ratios measured in (i); and
the measure of central tendency of the plurality of bin-level sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy the
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds
when the measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment is
lower than the
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds.
61. The method of any one of claims 1-59, wherein:
the one or more measure of central tendency-plus-deviation bin-level sequence
ratio
thresholds comprises:
(i) a measure of central tendency value of the bin-level sequence ratios
corresponding to the plurality of bins that map to the same chromosome, minus
(ii) the measure of central tendency value of a plurality of absolute
dispersions, wherein:
each absolute dispersion is determined using a comparison between
each respective bin-level sequence ratio corresponding to each respective bin
in the plurality
of bins that map to the same chromosome as the respective segment, and the
measure of
central tendency value of the bin-level sequence ratios measured in (i); and
the measure of central tendency of the plurality of bin-level sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy the
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds
when the measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment is
higher than the
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds.
62. The method of claim 61, wherein the one or more measure of
central tendency-plus-
deviation bin-level sequence ratio thresholds is:
(i) the measure of central tendency value of the bin-level sequence ratios
corresponding to the plurality of bins that map to the same chromosome, minus
(ii) the measure of central tendency value of the plurality of absolute
dispersions multiplied by a factor k.
63. The method of claim 62, wherein k is between 0.1 and 0.95,
between 0.3 and 0.9,
between 0.5 and 0.85, between 0.65 and 0.8, or between 0.73 and 0.77.
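Claims 60-63 derive the threshold from the bin-level ratios of the whole chromosome: a measure of central tendency plus (claim 60) or minus (claims 61-62) a measure of central tendency of the absolute dispersions, with the subtracted term scaled by a factor k in claim 62. The sketch below assumes the median for both measures, so the dispersion term is a median absolute deviation. Applying k to the upper threshold as well, the default k = 0.75, and the function name are assumptions of this example.

```python
from statistics import median


def central_tendency_plus_deviation_thresholds(chromosome_bin_ratios, k=0.75):
    """Return (upper, lower) thresholds from all bins on one chromosome.

    upper = median + k * MAD, in the spirit of claim 60 (k is an assumption
    there, since claim 60 recites a plain sum).
    lower = median - k * MAD, in the spirit of claims 61 and 62.
    """
    center = median(chromosome_bin_ratios)
    # Absolute dispersion of each bin from the chromosome-wide center.
    mad = median(abs(ratio - center) for ratio in chromosome_bin_ratios)
    return center + k * mad, center - k * mad
```

Under claim 60 a segment's central tendency that falls below the upper value fires the filter; under claim 61 one that rises above the lower value fires it.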
64. The method of any one of claims 1-63, wherein the measure of
dispersion across the
bin-level sequence ratios in the (B) validating is a variance, standard
deviation, or
interquartile range across the bin-level copy ratios.
65. The method of any one of claims 1-64, comprising:
validating an amplification status of a respective segment in the plurality of
segments,
by applying the first dataset to an algorithm having a plurality of filters,
the plurality of filters
comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment is
lower than a
bin-level sequence ratio amplification threshold;
(2) a confidence filter that is fired when the lower bound of the segment-
level
measure of dispersion corresponding to the respective segment is lower than
the confidence
threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment is
lower than the
measure of central tendency-plus-deviation bin-level sequence ratio threshold;
wherein:
when a filter in the plurality of filters is fired, the amplification status
of the
respective segment is rejected; and
when no filter in the plurality of filters is fired, the amplification status
of the
respective segment is validated.
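The amplification logic of claim 65 can be sketched as three boolean filters, where any fired filter rejects the annotation. The sketch assumes the median as the measure of central tendency and takes the thresholds and the lower confidence bound as precomputed inputs; the function name, parameter names, and dictionary keys are hypothetical.

```python
from statistics import median


def validate_amplification(segment_bin_ratios, ci_lower, confidence_threshold,
                           bin_amplification_threshold,
                           plus_deviation_threshold):
    """Apply the three filters of claim 65 to one candidate amplification.

    Returns (validated, fired), where `fired` records which filters fired.
    The annotation is validated only when no filter fires.
    """
    segment_median = median(segment_bin_ratios)
    fired = {
        # (1) central tendency of the segment's bins is below the threshold
        "bin_ratio": segment_median < bin_amplification_threshold,
        # (2) lower bound of the dispersion is below the confidence threshold
        "confidence": ci_lower < confidence_threshold,
        # (3) central tendency is below the chromosome-derived threshold
        "plus_deviation": segment_median < plus_deviation_threshold,
    }
    return not any(fired.values()), fired
```

The deletion case of claim 66 mirrors this with the comparisons reversed and the upper confidence bound in place of the lower one.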
66. The method of any one of claims 1-65, comprising:
validating a deletion status of a respective segment in the plurality of
segments, by
applying the first dataset to an algorithm having a plurality of filters, the
plurality of filters
comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment is
higher than a
bin-level sequence ratio deletion threshold;
(2) a confidence filter that is fired when the upper bound of the segment-
level
measure of dispersion corresponding to the respective segment is higher than
the confidence
threshold; and
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment is
higher than the
measure of central tendency-plus-deviation bin-level sequence ratio threshold:
wherein:
when a filter in the plurality of filters is fired, the deletion status of the
respective segment is rejected; and
when no filter in the plurality of filters is fired, the deletion status of
the
respective segment is validated.
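For illustration only, the three deletion filters recited in claim 66 can be sketched as follows. The sketch assumes log2-scale bin-level sequence ratios, the median as the measure of central tendency, and a precomputed upper bound of the segment-level measure of dispersion. The function name, argument names, and all numeric thresholds are hypothetical and are not taken from the specification.

```python
from statistics import median


def validate_deletion(bin_ratios, dispersion_upper_bound,
                      deletion_threshold, confidence_threshold,
                      plus_deviation_threshold):
    """Return (validated, fired_filters) for one segment annotated as a deletion.

    Each filter fires when its value is HIGHER than its threshold, and
    the deletion status is validated only when no filter fires.
    """
    center = median(bin_ratios)  # assumed measure of central tendency
    fired = []
    if center > deletion_threshold:
        fired.append("central_tendency")
    if dispersion_upper_bound > confidence_threshold:
        fired.append("confidence")
    if center > plus_deviation_threshold:
        fired.append("central_tendency_plus_deviation")
    return (not fired, fired)


# Hypothetical segment with a clearly reduced ratio: no filter fires.
print(validate_deletion([-1.1, -0.9, -1.0, -1.2], -0.8, -0.5, -0.3, -0.6))
# Hypothetical segment near neutral: all three filters fire.
print(validate_deletion([-0.2, -0.1, -0.3], -0.05, -0.5, -0.3, -0.6))
```

Returning the names of the fired filters alongside the decision keeps the reason for a rejection visible; an amplification check would mirror this with the comparisons reversed.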
67. The method of any one of claims 1-66, further comprising, after the
validating (B),
applying the validated copy number variation of the respective segment to a
diagnostic assay.
68. The method of any one of claims 1-67, wherein one or more respective
segments in
the plurality of segments that represents a corresponding region of the human
reference
genome encodes a target gene.
69. The method of claim 68, wherein the target gene is MET, EGFR, ERBB2,
CD274,
CCNE1, MYC, BRCA1 or BRCA2.
70. The method of claim 68, wherein the target gene is any of the genes
listed in Table 1.
71. The method of any one of claims 1-70, further comprising treating a
patient with a
cancer containing a copy number variation of a target gene by:
determining whether the copy number variation of the target gene is a focal
copy
number variation by validating the copy number variation in the patient,
thereby determining
whether the patient has an aggressive form of the cancer associated with a
focal copy number
variation of the target gene;
when the patient has the aggressive form of cancer associated with focal copy
number
variation of the target gene, administering a first therapy for the aggressive
form of the cancer
to the patient; and
when the patient does not have the aggressive form of cancer associated with
focal
copy number variation of the target gene, administering a second therapy for a
less aggressive
form of the cancer to the patient.
72. The method of claim 71, wherein the first therapy is selected from
Table 2.
73. The method of claims 71 or 72, wherein the first therapy is
trastuzumab, lapatinib, or
crizotinib.
74. The method of any one of claims 1-73, further comprising generating a
report
comprising the validated copy number status of the respective segment for the
biological
sample of the respective test subject.
75. The method of claim 74, wherein the generated report further comprises
matched
therapies based on the copy number status of the respective segment.
76. The method of any one of claims 1-75, further comprising:
(E) obtaining a second dataset that comprises:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of a human reference genome, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a sequencing of a plurality of cell-free
nucleic acids in a
second liquid biopsy sample of the test subject and one or more reference
samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the human reference genome encompassing a subset of
adjacent bins
in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(F) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the second
dataset to an algorithm having a plurality of filters.
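As a non-limiting sketch, the segment-level portion of the dataset described in claim 76 could be derived from bin-level sequence ratios as shown below. The median is assumed as the measure of central tendency and the median absolute deviation (MAD) as the measure of dispersion; both are illustrative choices, and the names Segment and summarize_segments are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Segment:
    start: int         # index of the first bin in the segment (inclusive)
    end: int           # index one past the last bin in the segment
    ratio: float       # segment-level sequence ratio (median of its bins)
    dispersion: float  # segment-level dispersion (MAD of its bins)


def summarize_segments(bin_ratios, boundaries):
    """Build segment-level ratios and dispersions from adjacent bins.

    boundaries is a list of (start, end) bin-index pairs, one per segment.
    """
    segments = []
    for start, end in boundaries:
        values = bin_ratios[start:end]
        center = median(values)
        mad = median(abs(v - center) for v in values)
        segments.append(Segment(start, end, center, mad))
    return segments


# Six hypothetical bins split into two segments of three adjacent bins.
ratios = [0.0, 0.1, -0.1, 1.0, 1.2, 0.8]
for seg in summarize_segments(ratios, [(0, 3), (3, 6)]):
    print(seg)
```

Segment boundaries are taken as given here; how bins are grouped into segments is outside the scope of this sketch.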
77. The method of claim 76, wherein the first liquid biopsy sample is
obtained at a first
time point and the second liquid biopsy sample of the test subject is obtained
at a second time
point.
78. The method of claim 77, wherein the second time point is at least 1
day, at least 1
week, at least 1 month, at least 2 months, at least 3 months, at least 6
months, or at least 1
year after the first time point.
79. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 1-78.
80. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 1-78.
81. A method of validating a copy number variation in a test subject, the
method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads, wherein the
first plurality of sequence reads comprises at least 100,000 sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of the reference genome for the species of the subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a comparison of the first plurality of
sequence reads to
sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference genome for the species of the subject
encompassing a
subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters,
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of each
of the filters in
the plurality of filters.
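By way of example only, the comparison of sequence reads from the test sample to sequence reads from one or more reference samples in step (C)(1) of claim 81 might be sketched as below. The sketch assumes per-bin read counts are already available after alignment, normalizes each sample by its total depth, and reports a log2 ratio against the per-bin median of the reference samples. The function name, the pseudocount, and the log2 scale are assumptions rather than features recited in the claims.

```python
import math
from statistics import median


def bin_level_ratios(test_counts, reference_counts, pseudocount=0.5):
    """Return one log2 sequence ratio per bin.

    test_counts: read count per bin for the liquid biopsy sample.
    reference_counts: list of per-bin read-count lists, one per reference sample.
    """
    def normalize(counts):
        # Depth-normalize so samples sequenced to different depths are comparable;
        # the pseudocount avoids division by zero for empty bins.
        total = sum(counts) + pseudocount * len(counts)
        return [(c + pseudocount) / total for c in counts]

    test = normalize(test_counts)
    refs = [normalize(counts) for counts in reference_counts]
    ratios = []
    for i, value in enumerate(test):
        baseline = median(ref[i] for ref in refs)
        ratios.append(math.log2(value / baseline))
    return ratios


# Hypothetical counts: the middle bin has twice the coverage of its neighbours.
print(bin_level_ratios([100, 200, 100], [[100, 100, 100], [100, 100, 100]]))
```

A positive value indicates more coverage in the bin than the reference baseline and a negative value indicates less.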
82. A method of validating a copy number variation in a test
subject, the method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject, wherein the reference
sequence for the
species represents at least 1 Mb of the genome for the species;
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of the reference genome for the species of the subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a comparison of the first plurality of
sequence reads to
sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference genome for the species of the subject
encompassing a
subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters,
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of each
of the filters in
the plurality of filters.
83. The method of claim 81, wherein the plurality of filters
comprises:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds.
84. The method of any one of claims 81-83, wherein the plurality
of filters further
comprises:
(2) a confidence filter that is fired when the segment-level measure of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold.
85. The method of any one of claims 81-84, wherein the plurality
of filters further
comprises:
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds,
wherein the one or more measure of central tendency-plus-deviation bin-level
copy ratio
thresholds are derived from (i) a measure of central tendency of the bin-level
sequence ratios
corresponding to the plurality of bins that map to the same chromosome of the
human
reference genome as the respective segment, and (ii) a measure of dispersion
across the bin-
level sequence ratios corresponding to the plurality of bins that map to the
respective
chromosome.
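One possible reading of the threshold derivation in claim 85 is sketched below for an amplification call. The threshold is assumed to be the chromosome-level median of the bin-level sequence ratios plus a multiple k of their median absolute deviation. The multiplier k, the choice of median and MAD, and the function names are hypothetical.

```python
from statistics import median


def plus_deviation_threshold(chrom_bin_ratios, k=2.0):
    """Threshold from (i) the chromosome-level median and (ii) the chromosome-level MAD."""
    center = median(chrom_bin_ratios)
    mad = median(abs(r - center) for r in chrom_bin_ratios)
    return center + k * mad


def plus_deviation_filter_fires(segment_bin_ratios, chrom_bin_ratios, k=2.0):
    """Fire when the segment's median ratio is lower than the threshold."""
    threshold = plus_deviation_threshold(chrom_bin_ratios, k)
    return median(segment_bin_ratios) < threshold


# Hypothetical chromosome of nine bins; the last three form a candidate segment.
chrom = [0.0, 0.2, -0.2, 0.1, -0.1, 0.0, 1.0, 1.2, 1.1]
print(plus_deviation_filter_fires(chrom[6:9], chrom))  # elevated segment
print(plus_deviation_filter_fires(chrom[0:3], chrom))  # near-neutral segment
```

Deriving the threshold from the same chromosome means a segment must stand out from its own chromosomal background, not just from a fixed genome-wide cutoff.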
86. The method of any one of claims 81-85, wherein the plurality of filters
further
comprises:
(4) a segment-level sequence ratio filter that is fired when the segment-level
sequence ratio corresponding to the respective segment fails to satisfy one or
more segment-
level sequence ratio thresholds.
87. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 81-86.
88. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 81-86.
89. A method for treating a patient with a cancer containing a copy number
variation of a
target gene, the method comprising:
determining whether the patient has an aggressive form of cancer associated
with a
focal copy number variation of the target gene by:
obtaining a first liquid biopsy sample from the patient,
performing copy number variation analysis on the first biological sample to
identify the copy number status of the target gene in the cancer, the copy
number variation
analysis comprising:
(A) obtaining, from a first sequencing reaction, a corresponding
sequence of each cell-free DNA fragment in a first plurality of cell-free DNA
fragments in
the liquid biopsy sample of the patient, thereby obtaining a first plurality
of sequence reads;
(B) aligning each respective sequence read in the first plurality of
sequence reads to a reference sequence for the species of the patient, wherein
the reference
sequence for the species represents at least 1 Mb of the genome for the
species; and
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-
level sequence ratio in the plurality of bin-level sequence ratios
corresponding to a respective
bin in a plurality of bins, wherein:
each respective bin in the plurality of bins represents a
corresponding region of a reference genome for the species of the patient, and
each respective bin-level sequence ratio in the plurality
of bin-level sequence ratios is determined from a comparison of the first
plurality of sequence
reads to sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective
segment-level sequence ratio in the plurality of segment-level sequence ratios
corresponding
to a segment in a plurality of segments, wherein:
each respective segment in the plurality of segments
represents a corresponding region of the reference genome for the species of
the patient
encompassing a subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the
plurality of segment-level sequence ratios is determined from a measure of
central tendency
of the plurality of bin-level sequence ratios corresponding to the subset of
adjacent bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each
respective segment-level measure of dispersion in the plurality of segment-
level measures of
dispersion (i) corresponding to a respective segment in the plurality of
segments and (ii)
determined using the plurality of bin-level sequence ratios corresponding to
the subset of
adjacent bins encompassed by the respective segment; and
(D) determining whether the copy number variation of the target gene
is a focal copy number variation by applying the first dataset to an algorithm
having a
plurality of copy number variation filters; and
when the patient has the aggressive form of cancer associated with focal copy
number variation of the target gene, administering a first therapy for the
aggressive form of
the cancer to the patient; and
when the patient does not have the aggressive form of cancer associated with
focal copy number variation of the target gene, administering a second therapy
for a less
aggressive form of the cancer to the patient.
90. A method for treating a patient with a cancer containing a
copy number variation of a
target gene, the method comprising:
determining whether the patient has an aggressive form of cancer associated
with a
focal copy number variation of the target gene by:
obtaining a first liquid biopsy sample from the patient,
performing copy number variation analysis on the first biological sample to
identify the copy number status of the target gene in the cancer, the copy
number variation
analysis comprising:
(A) obtaining, from a first sequencing reaction, a corresponding
sequence of each cell-free DNA fragment in a first plurality of cell-free DNA
fragments in
the liquid biopsy sample of the patient, thereby obtaining a first plurality
of sequence reads,
wherein the first plurality of sequence reads comprises at least 100,000
sequence reads;
(B) aligning each respective sequence read in the first plurality of
sequence reads to a reference sequence for the species of the patient; and
(C) determining:
(1) a plurality of bin-level sequence ratios, each respective bin-
level sequence ratio in the plurality of bin-level sequence ratios
corresponding to a respective
bin in a plurality of bins, wherein:
each respective bin in the plurality of bins represents a
corresponding region of a reference genome for the species of the patient, and
each respective bin-level sequence ratio in the plurality
of bin-level sequence ratios is determined from a comparison of the first
plurality of sequence
reads to sequence reads from one or more reference samples;
(2) a plurality of segment-level sequence ratios, each respective
segment-level sequence ratio in the plurality of segment-level sequence ratios
corresponding
to a segment in a plurality of segments, wherein:
each respective segment in the plurality of segments
represents a corresponding region of the reference genome for the species of
the patient
encompassing a subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the
plurality of segment-level sequence ratios is determined from a measure of
central tendency
of the plurality of bin-level sequence ratios corresponding to the subset of
adjacent bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each
respective segment-level measure of dispersion in the plurality of segment-
level measures of
dispersion (i) corresponding to a respective segment in the plurality of
segments and (ii)
determined using the plurality of bin-level sequence ratios corresponding to
the subset of
adjacent bins encompassed by the respective segment; and
(D) determining whether the copy number variation of the target gene
is a focal copy number variation by applying the first dataset to an algorithm
having a
plurality of copy number variation filters; and
when the patient has the aggressive form of cancer associated with focal copy
number variation of the target gene, administering a first therapy for the
aggressive form of
the cancer to the patient; and
when the patient does not have the aggressive form of cancer associated with
focal copy number variation of the target gene, administering a second therapy
for a less
aggressive form of the cancer to the patient.
91. The method of claim 89 or 90, wherein the plurality of copy
number variation filters
comprises:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds, thereby determining that the
copy number
variation of the target gene is not a focal copy number variation when fired.
92. The method of any one of claims 89-91, wherein the plurality
of copy number
variation filters further comprises:
(2) a confidence filter that is fired when the segment-level measure of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold,
thereby determining that the copy number variation of the target gene is not a
focal copy
number variation when fired.
93. The method of any one of claims 89-92, wherein the plurality
of copy number
variation filters further comprises:
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds,
wherein the one or more measure of central tendency-plus-deviation bin-level
copy ratio
thresholds are derived from (i) a measure of the bin-level sequence ratios
corresponding to
the plurality of bins that map to the same chromosome of the human reference
genome as the
respective segment, and (ii) a measure of dispersion across the bin-level
sequence ratios
corresponding to the plurality of bins that map to the respective chromosome,
thereby
determining that the copy number variation of the target gene is not a focal
copy number
variation when fired.
94. The method of any one of claims 89-93, wherein the plurality
of copy number
variation filters further comprises:
(4) a segment-level sequence ratio filter that is fired when the segment-level
sequence ratio corresponding to the respective segment fails to satisfy one or
more segment-
level sequence ratio thresholds, thereby determining that the copy number
variation of the
target gene is not a focal copy number variation when fired.
95. The method of any one of claims 89-94, wherein the target
gene is any of the genes
listed in Table 1.
96. The method of any one of claims 89-95, wherein the target
gene is MET, EGFR,
ERBB2, CD274, CCNE1, MYC, BRCA1 or BRCA2.
97. The method of any one of claims 89-96, wherein the first
therapy is selected from
Table 2.
98. The method of any one of claims 89-97, wherein the first
therapy is trastuzumab,
lapatinib, or crizotinib.
99. The method of any one of claims 89-98, further comprising generating a
report
comprising the copy number status of the target gene.
100. The method of claim 99, wherein the generated report further comprises
matched
therapies based on the copy number status of the respective segment, wherein:
when the patient has the aggressive form of cancer associated with focal copy
number
variation of the target gene, matching a first therapy for the aggressive form
of the cancer to
the patient; and
when the patient does not have the aggressive form of cancer associated with
focal
copy number variation of the target gene, matching a second therapy for a less
aggressive
form of the cancer to the patient.
101. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 89-100.
102. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 89-100.
103. A method of validating a copy number variation in a test subject, the
method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads, wherein the
first plurality of sequence reads comprises at least 100,000 sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject;
(C) forming a first dataset, comprising:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of a reference sequence for the species of the test subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a sequencing of a plurality of cell-free
nucleic acids in a
first liquid biopsy sample of the test subject and one or more reference
samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference sequence for the species of the test
subject
encompassing a subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters, the plurality of
filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds;
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of each
of the filters in
the plurality of filters.
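Merely as an illustration, rejecting or validating an annotation based on a predetermined pattern of firing or lack of firing, as recited in claim 103, could be expressed as a comparison of the observed firing pattern against a set of accepted patterns. The filter names, the segment fields (median_ratio, dispersion_lower_bound), and the thresholds below are hypothetical.

```python
def evaluate_filters(segment, filters):
    """Return {filter_name: fired} for one segment."""
    return {name: bool(check(segment)) for name, check in filters.items()}


def validate_by_pattern(segment, filters, accepted_patterns):
    """Validate when the observed firing pattern is one of the accepted patterns."""
    observed = evaluate_filters(segment, filters)
    return observed in accepted_patterns, observed


# Hypothetical amplification filters; each returns True when it fires.
filters = {
    "central_tendency": lambda s: s["median_ratio"] < 0.5,
    "confidence": lambda s: s["dispersion_lower_bound"] < 0.3,
}
# Assumed predetermined pattern: validate only when no filter fires.
accepted = [{"central_tendency": False, "confidence": False}]

print(validate_by_pattern({"median_ratio": 1.0, "dispersion_lower_bound": 0.7},
                          filters, accepted))
print(validate_by_pattern({"median_ratio": 1.0, "dispersion_lower_bound": 0.1},
                          filters, accepted))
```

Listing the accepted patterns explicitly allows rules other than "no filter fired" to be encoded without changing the evaluation code.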
104. A method of validating a copy number variation in a test subject, the
method
comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject, wherein the reference
sequence for the
species represents at least 1 Mb of the genome for the species;
(C) forming a first dataset, comprising:
(1) a plurality of bin-level sequence ratios, each respective bin-level
sequence
ratio in the plurality of bin-level sequence ratios corresponding to a
respective bin in a
plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding
region of a reference sequence for the species of the test subject, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is determined from a sequencing of a plurality of cell-free
nucleic acids in a
first liquid biopsy sample of the test subject and one or more reference
samples;
(2) a plurality of segment-level sequence ratios, each respective segment-
level
sequence ratio in the plurality of segment-level sequence ratios corresponding
to a segment in
a plurality of segments, wherein:
each respective segment in the plurality of segments represents a
corresponding region of the reference sequence for the species of the test
subject
encompassing a subset of adjacent bins in the plurality of bins, and
each respective segment-level sequence ratio in the plurality of
segment-level sequence ratios is determined from a measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of adjacent
bins
encompassed by the respective segment; and
(3) a plurality of segment-level measures of dispersion, each respective
segment-level measure of dispersion in the plurality of segment-level measures
of dispersion
(i) corresponding to a respective segment in the plurality of segments and
(ii) determined
using the plurality of bin-level sequence ratios corresponding to the subset
of adjacent bins
encompassed by the respective segment; and
(D) validating a copy number status annotation of a respective segment in the
plurality of segments that is annotated with a copy number variation by
applying the first
dataset to an algorithm having a plurality of filters, the plurality of
filters comprising:
(1) a measure of central tendency bin-level sequence ratio filter that is
fired
when a measure of central tendency of the plurality of bin-level sequence
ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds;
wherein rejecting or validating the copy number status annotation of the
respective
segment is based on a predetermined pattern of firing or lack of firing of each
of the filters in
the plurality of filters.
105. The method of claim 103, wherein the plurality of filters further
comprises:
(2) a confidence filter that is fired when the segment-level measure of
dispersion corresponding to the respective segment fails to satisfy a
confidence threshold.
106. The method of claim 103 or 105, wherein the plurality of filters further
comprises:
(3) a measure of central tendency-plus-deviation bin-level sequence ratio
filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds,
wherein the one or more measure of central tendency-plus-deviation bin-level
copy ratio
thresholds are derived from (i) a measure of the bin-level sequence ratios
corresponding to
the plurality of bins that map to the same chromosome of the human reference
genome as the
respective segment, and (ii) a measure of dispersion across the bin-level
sequence ratios
corresponding to the plurality of bins that map to the respective chromosome.
107. The method of any one of claims 103-106, wherein the plurality of
amplification
filters further comprises:
(4) a segment-level sequence ratio filter that is fired when the segment-level
sequence ratio corresponding to the respective segment fails to satisfy one or
more segment-level sequence ratio thresholds.
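The four filters recited in the preceding claims can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the `Segment` container, the threshold values, the use of the median and population standard deviation as the measures of central tendency and dispersion, and the "predetermined pattern" (the annotation is validated only when no filter fires). The sketch treats the segment as a candidate amplification, so "fails to satisfy" is read as "falls below" for the ratio thresholds.

```python
from dataclasses import dataclass
from statistics import median, pstdev
from typing import Sequence, Set


@dataclass
class Segment:
    """Hypothetical container for one segment annotated as an amplification."""
    bin_ratios: Sequence[float]   # bin-level sequence ratios (log2) in the segment
    segment_ratio: float          # segment-level sequence ratio (log2)
    dispersion: float             # segment-level measure of dispersion


def fired_filters(seg: Segment,
                  chrom_bin_ratios: Sequence[float],
                  min_central: float = 0.3,        # assumed threshold
                  max_dispersion: float = 0.5,     # assumed threshold
                  n_deviations: float = 2.0,       # assumed multiplier
                  min_segment_ratio: float = 0.3   # assumed threshold
                  ) -> Set[str]:
    """Return the names of the filters that fire for this segment."""
    fired = set()
    centre = median(seg.bin_ratios)

    # (1) central-tendency filter on the segment's bin-level ratios
    if centre < min_central:
        fired.add("central_tendency")

    # (2) confidence filter on the segment-level dispersion
    if seg.dispersion > max_dispersion:
        fired.add("confidence")

    # (3) central-tendency-plus-deviation filter: the threshold is derived
    # from all bins that map to the same chromosome as the segment
    chrom_threshold = median(chrom_bin_ratios) + n_deviations * pstdev(chrom_bin_ratios)
    if centre < chrom_threshold:
        fired.add("central_tendency_plus_deviation")

    # (4) segment-level ratio filter
    if seg.segment_ratio < min_segment_ratio:
        fired.add("segment_ratio")

    return fired


def validate_amplification(seg: Segment, chrom_bin_ratios: Sequence[float]) -> bool:
    """Assumed pattern: validate the annotation only if no filter fires."""
    return not fired_filters(seg, chrom_bin_ratios)
```

Other firing patterns (for example, tolerating one fired filter when the remaining ones pass) would slot into `validate_amplification` without changing the individual filters.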
108. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 103-107.
109. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 103-107.
110. A method of validating a somatic sequence variant in a cancerous tissue
of a test
subject having a cancer condition, the method comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads, wherein the
first plurality of sequence reads comprises at least 100,000 sequence reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject thereby identifying a
candidate somatic
sequence variant mapping to a respective locus in the reference sequence;
(C) determining for the candidate somatic sequence variant, (i) a respective
variant
allele fragment count for the first sequencing reaction, and (ii) a respective
locus fragment
count for the first sequencing reaction; and
(D) comparing the respective variant allele fragment count for the candidate
somatic
sequence variant against a dynamic variant count threshold for the respective
locus in the
reference sequence that the candidate variant maps to, wherein the dynamic
variant count
threshold is based upon at least a pre-test odds of a positive variant call
for the respective
locus based upon a prevalence of variants in a genomic region that includes
the respective
locus in a cohort of training subjects having the cancer condition, and:
when the variant allele fragment count for the candidate somatic sequence
variant satisfies the dynamic variant count threshold for the respective
locus, not rejecting the
presence of the candidate somatic sequence variant in the test subject, or
when the variant allele fragment count for the candidate somatic sequence
variant does not satisfy the dynamic variant count threshold for the locus,
rejecting the
presence of the candidate somatic sequence variant in the test subject.
111. A method of validating a somatic sequence variant in a test subject
having a cancer
condition, the method comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
(A) obtaining, from a first sequencing reaction, a corresponding sequence of
each
cell-free DNA fragment in a first plurality of cell-free DNA fragments in a
liquid biopsy
sample of the test subject, thereby obtaining a first plurality of sequence
reads;
(B) aligning each respective sequence read in the first plurality of sequence
reads to a
reference sequence for the species of the subject thereby identifying a
candidate somatic
sequence variant mapping to a respective locus in the reference sequence,
wherein the
reference sequence for the species represents at least 1 Mb of the genome for
the species;
(C) determining for the candidate somatic sequence variant, (i) a respective
variant
allele fragment count for the first sequencing reaction, and (ii) a respective
locus fragment
count for the first sequencing reaction; and
(D) comparing the respective variant allele fragment count for the candidate
somatic
sequence variant against a dynamic variant count threshold for the respective
locus in the
reference sequence that the candidate variant maps to, wherein the dynamic
variant count
threshold is based upon at least a pre-test odds of a positive variant call
for the respective
locus based upon a prevalence of variants in a genomic region that includes
the respective
locus in a cohort of training subjects having the cancer condition, and:
when the variant allele fragment count for the candidate somatic sequence
variant satisfies the dynamic variant count threshold for the respective
locus, not rejecting the
presence of the candidate somatic sequence variant in the test subject, or
when the variant allele fragment count for the candidate somatic sequence
variant does not satisfy the dynamic variant count threshold for the locus,
rejecting the
presence of the candidate somatic sequence variant in the test subject.
112. The method of claim 110 or 111, wherein the dynamic variant count
threshold is also
based upon a sequencing error rate for the sequencing reaction.
113. The method of claim 112, wherein the sequencing error rate for the
sequencing
reaction is a trinucleotide sequencing error rate.
114. The method of any one of claims 110-113, wherein the dynamic variant
count
threshold is also based upon a background sequencing error rate determined for
the locus.
115. The method of any one of claims 110-114, the method further comprising:
determining the dynamic variant count threshold based upon a variant detection
specificity determined according to the relationship:
\[ \text{specificity} = 1 - (\text{sensitivity}) \times \left( \frac{\text{odds(pre-test)}}{\text{odds(post-test)}} \right) \]
wherein,
sensitivity is a variant detection sensitivity selected from a distribution of
variant detection sensitivities based on an estimated circulating variant
fraction for the
candidate variant,
odds(post- test) is a post-test odds of a positive variant call for the locus,
and
odds(pre- test) is the pre-test odds of the positive variant call for the
locus.
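As a worked illustration, the relationship in claim 115 reads as specificity = 1 - sensitivity x (pre-test odds / post-test odds), which follows from post-test odds = pre-test odds x sensitivity / (1 - specificity). The sketch below evaluates it; the function name and the example numbers are assumptions chosen only to show the arithmetic.

```python
def required_specificity(sensitivity: float,
                         pre_test_odds: float,
                         post_test_odds: float) -> float:
    """Specificity needed to reach a target post-test odds.

    Rearranges post_odds = pre_odds * sensitivity / (1 - specificity).
    """
    if not 0.0 < sensitivity <= 1.0:
        raise ValueError("sensitivity must be in (0, 1]")
    if pre_test_odds <= 0.0 or post_test_odds <= 0.0:
        raise ValueError("odds must be positive")
    return 1.0 - sensitivity * (pre_test_odds / post_test_odds)


# Example: a locus with pre-test odds of 0.01, a desired post-test odds of 9
# (90% post-test probability), and an assumed sensitivity of 0.9.
spec = required_specificity(0.9, 0.01, 9.0)
print(round(spec, 6))
```

A rarer variant (lower pre-test odds) lowers the product term and so demands a specificity closer to 1, which in turn pushes the count threshold up.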
116. The method of claim 115, wherein the distribution of variant detection
sensitivities is
based on a correlation between (i) the detection rate of a reference variant
allele, in one or
more sequencing reactions that are process-matched with the first sequencing
reaction, for a
plurality of cancer samples, and (ii) the variant allele fractions for the
reference variant allele
in the cancer samples.
117. The method of claim 116, wherein the correlation is established by
determining, for
each respective bin in a plurality of bins collectively representing a span of
variant allele
fractions represented in the cancer samples, wherein each respective bin
corresponds to a
contiguous span of variant allele fractions that does not overlap with any
other respective bin
in the plurality of bins, a corresponding sensitivity for detection of the
reference variant
alleles for the corresponding subset of cancer samples.
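The binned correlation described in claims 116 and 117 might be tabulated as in the following sketch. The input layout (pairs of known variant allele fraction and a detected flag from process-matched runs), the bin edges, and the function name are all hypothetical.

```python
from bisect import bisect_right
from typing import Iterable, List, Optional, Sequence, Tuple


def sensitivity_by_vaf_bin(observations: Iterable[Tuple[float, bool]],
                           bin_edges: Sequence[float]) -> List[Optional[float]]:
    """Per-bin detection sensitivity for reference variants.

    observations: (variant allele fraction, detected?) per reference variant.
    bin_edges: ascending edges; bin i is the half-open interval
               [bin_edges[i], bin_edges[i + 1]), so bins never overlap.
    Returns None for bins with no observations.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for vaf, detected in observations:
        idx = bisect_right(bin_edges, vaf) - 1
        if 0 <= idx < n_bins:          # ignore values outside the binned span
            totals[idx] += 1
            hits[idx] += 1 if detected else 0
    return [h / t if t else None for h, t in zip(hits, totals)]


obs = [(0.005, False), (0.005, True), (0.02, True),
       (0.02, True), (0.02, False), (0.2, True)]
print(sensitivity_by_vaf_bin(obs, [0.0, 0.01, 0.05, 1.0]))
```

The sensitivity used for a candidate variant would then be looked up from the bin containing its estimated circulating variant fraction.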
118. The method of any one of claims 115-117, wherein the estimated
circulating variant
fraction for the candidate variant is a variant allele fraction determined
from a comparison of
(i) the respective variant allele fragment count for the first sequencing
reaction, to (ii) the
respective locus fragment count for the first sequencing reaction.
119. The method of any one of claims 115-118, wherein the specificity is used
to select a
quantile of a beta-binomial distribution of the minimal variant allele
fragment count required
to support a positive variant call for the locus, thereby defining the dynamic
threshold for the
locus, wherein the beta-binomial distribution is defined by a sequencing error
rate for the
sequencing reaction and a background sequencing error rate determined for the
locus.
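One way to picture the quantile selection in claim 119 is sketched below. It assumes the noise at a locus is modelled as a beta-binomial with shape parameters `alpha` and `beta`; how those parameters are derived from the sequencing error rate and the locus background error rate is not spelled out here, so they are treated as given inputs. The threshold is taken as the smallest count whose false-positive probability under the noise model does not exceed 1 - specificity.

```python
from math import exp, lgamma


def _log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta), computed in log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + _log_beta(k + alpha, n - k + beta) - _log_beta(alpha, beta))


def dynamic_threshold(depth: int, alpha: float, beta: float, specificity: float) -> int:
    """Smallest variant fragment count t with P(noise count >= t) <= 1 - specificity."""
    cumulative = 0.0
    for k in range(depth + 1):
        cumulative += betabinom_pmf(k, depth, alpha, beta)
        if cumulative >= specificity:
            return k + 1   # counts of k or fewer are explained by noise
    return depth + 1       # no achievable count satisfies the specificity


# Example with assumed parameters: mean error rate alpha / (alpha + beta) = 0.001
print(dynamic_threshold(depth=1000, alpha=1.0, beta=999.0, specificity=0.999))
```

Because the specificity comes from the locus-specific pre-test odds, two loci with identical depth and error profile can still receive different thresholds.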
120. The method of any one of claims 115-119, wherein the pre-test odds of a
positive
variant call for the locus is based on the prevalence of variants in a genomic
region that
includes the locus from the first set of nucleic acids obtained from the
cohort of subjects
having the cancer condition.
121. The method of claim 120, wherein, when the genomic region that includes
the locus is
associated with a mutation known to confer resistance against a therapy used
to treat the
cancer condition, the pre-test odds are boosted based on a pre-test-odds
multiplier specific for
the genomic region.
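The pre-test odds of claims 120 and 121 can be sketched as a prevalence converted to odds and then scaled by a region-specific multiplier. The pseudocount (which avoids zero or infinite odds in small cohorts), the function name, and the example multiplier are assumptions and are not taken from the claims.

```python
def pre_test_odds(n_subjects_with_variant: int,
                  n_subjects: int,
                  multiplier: float = 1.0,
                  pseudocount: float = 0.5) -> float:
    """Odds that a variant call in a genomic region is a true positive.

    Prevalence is estimated from a training cohort with the same cancer
    condition; `multiplier` boosts the odds for regions associated with
    known therapy-resistance mutations.
    """
    prevalence = (n_subjects_with_variant + pseudocount) / (n_subjects + 2 * pseudocount)
    return multiplier * prevalence / (1.0 - prevalence)


# 10 of 100 cohort subjects carry a variant in the region.
print(pre_test_odds(10, 100, pseudocount=0.0))                  # base odds
print(pre_test_odds(10, 100, multiplier=3.0, pseudocount=0.0))  # boosted region
```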
122. The method of claim 120 or 121, wherein the pre-test odds of a positive
variant call
for the locus is further based on a known or inferred effect of the variants,
wherein:
when the known or inferred effect of a variant is loss-of-function of a gene
that
includes the locus, the genomic region used to compute the pre-test
probability is the entire
gene, and
when the known or inferred effect of a variant is gain-of-function of the gene
that
includes the locus, the genomic region used to compute the pre-test
probability is the exon, of
the gene, that includes the locus.
123. The method of claim 122, wherein the effect of the variants is inferred
by:
binning each respective variant of the variants in the genomic region that
includes the
locus from the first set of nucleic acids obtained from the cohort of subjects
having the cancer
condition into a respective bin, in a plurality of bins for the gene that
include the locus,
corresponding to the exon encompassing the respective variant in the gene,
wherein each bin
in the plurality of bins corresponds to a different exon of the respective
gene; and
determining whether any bin in the plurality of bins contains significantly
more
variants than the other bins in the plurality of bins, wherein:
when a bin contains significantly more variants than the other bins in the
plurality of bins, the effect of the sequence variant is inferred to be a gain-
of-function of the
gene, and
when no bin in the plurality of bins contains significantly more sequence
variants than the other bins in the plurality of bins, the effect of the
sequence variant is
inferred to be a loss-of-function of the gene.
124. The method of claim 123, wherein determining whether any bin in the
plurality of
bins contains significantly more variants than the other bins in the plurality
of bins comprises
applying a rolling Poisson test of difference between bin counts corresponding
to adjacent
exons in the gene.
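A rolling test over adjacent exons, in the spirit of claims 123 and 124, might look like the sketch below. The specific test is an assumption: two Poisson counts with equal rates are compared with the exact conditional binomial test (given the total a + b, a is Binomial(a + b, 0.5)), exon lengths are assumed equal, and an exon is called a hotspot when it significantly exceeds each of its immediate neighbours at an assumed level of 0.01.

```python
from math import comb
from typing import Sequence


def poisson_pair_pvalue(a: int, b: int) -> float:
    """Two-sided exact p-value for equal Poisson rates given counts a and b."""
    n = a + b
    if n == 0:
        return 1.0
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[a]
    # sum the probabilities of all outcomes at least as unlikely as the observed one
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


def infer_effect(exon_counts: Sequence[int], alpha: float = 0.01) -> str:
    """Infer gain- vs loss-of-function from per-exon variant counts."""
    for i, count in enumerate(exon_counts):
        neighbours = [exon_counts[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(exon_counts)]
        if neighbours and all(count > nb and poisson_pair_pvalue(count, nb) < alpha
                              for nb in neighbours):
            return "gain-of-function"   # variants cluster in one exon
    return "loss-of-function"           # variants spread across the gene


print(infer_effect([3, 4, 40, 5, 2]))
print(infer_effect([5, 6, 4, 5, 7]))
```

Under claim 122, a gain-of-function inference would restrict the region used for the pre-test probability to the hotspot exon, while a loss-of-function inference would use the entire gene.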
125. The method of any one of claims 110-124, wherein the liquid biopsy sample
is blood.
126. The method of any one of claims 110-124, wherein the liquid biopsy sample
comprises blood, whole blood, peripheral blood, plasma, serum, or lymph of the
test subject.
127. The method of any one of claims 110-126, wherein:
the first sequencing reaction is a panel-enriched sequencing reaction of a
first plurality
of enriched loci, and
each respective locus in the plurality of enriched loci are sequenced at an
average
unique sequence depth of at least 250x.
128. The method of claim 127, wherein each respective locus in the plurality
of enriched
loci are sequenced at an average unique sequence depth of at least 1000x.
129. The method of any one of claims 110-126, wherein:
the first sequencing reaction is a whole genome sequencing reaction, and
the average sequencing depth of the reaction across the genome is at least
25x.
130. The method of any one of claims 110-129, wherein the first plurality of
sequence
reads comprises at least 50,000 sequence reads.
131. The method of any one of claims 110-129, wherein the first plurality of
sequence
reads comprises at least 250,000 sequence reads.
132. The method of any one of claims 110-131, wherein the cancer condition is
a particular
type and stage of cancer.
133. The method of any one of claims 110-132, wherein the cohort of subjects
having the
cancer condition are matched to at least one personal characteristic of the
subject.
134. The method of any one of claims 110-133, wherein, when the variant allele
fragment
count for the candidate variant satisfies the dynamic variant count threshold
for the locus and
all other variant calling
135. The method of any one of claims 110-134, further comprising generating a
report for
the test subject comprising the identity of variant alleles having variant
allele counts, in the
first sequencing reaction, that satisfy the dynamic variant count threshold.
136. The method of claim 135, wherein the generated report further comprises
therapeutic
recommendations for the test subject based on the identity of one or more of
the reported
variant alleles.
137. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 1-136.
138. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 1-136.
139. A method of estimating a circulating tumor fraction for a test subject
from panel-
enriched sequencing data for a plurality of sequences, the method comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
A) obtaining, from a first panel-enriched sequencing reaction, a first
plurality of
sequence reads, wherein the first plurality of sequence reads comprises at
least 100,000
sequence reads and wherein the plurality of sequence reads comprises:
(i) a corresponding sequence for each cell-free DNA fragment in a first
plurality of cell-free DNA fragments obtained from a liquid biopsy sample from
the test
subject, wherein each respective cell-free DNA fragment in the first plurality
of cell-free
DNA fragments corresponds to a respective probe sequence in a plurality of
probe sequences
used to enrich cell-free DNA fragments in the liquid biopsy sample in the
first panel-enriched
sequencing reaction; and
(ii) a corresponding sequence for each cell-free DNA fragment in a second
plurality of cell-free DNA fragments obtained from the liquid biopsy sample,
wherein each
respective cell-free DNA fragment in the second plurality of DNA fragments
does not
correspond to any probe sequence in the plurality of probe sequences;
B) determining a plurality of bin-level coverage ratios from the plurality of
sequence
reads, each respective bin-level coverage ratio in the plurality of bin-level
coverage ratios
corresponding to a respective bin in a plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding region
of a reference sequence for the species of the test subject, and
each respective bin-level coverage ratio in the plurality of bin-level
coverage
ratios is determined from a comparison of (i) a number of sequence reads in
the plurality of
sequence reads that map to the corresponding bin and (ii) a number of sequence
reads from
one or more reference samples that map to the corresponding bin;
C) determining a plurality of segment-level coverage ratios by:
forming a plurality of segments by grouping respective subsets of adjacent
bins in the plurality of bins based on a similarity between the respective
coverage ratios of the
subset of adjacent bins, and
determining, for each respective segment in the plurality of segments, a
segment-level coverage ratio based on the corresponding bin-level coverage
ratios for each
bin in the respective segment;
D) fitting, for each respective simulated circulating tumor fraction in a
plurality of
simulated circulating tumor fractions, each respective segment in the
plurality of segments to
a respective integer copy state in a plurality of integer copy states, by
identifying the
respective integer copy state in the plurality of integer copy states that
best matches the
segment-level coverage ratio, thereby generating, for each respective
simulated circulating
tumor fraction in the plurality of simulated tumor fractions, a respective set
of integer copy
states for the plurality of segments; and
E) estimating the circulating tumor fraction for the test subject based on a
measure of
fit between corresponding segment-level coverage ratios and integer copy
states across the
plurality of simulated circulated tumor fractions.
140. A method of estimating a circulating tumor fraction for a test subject
from panel-
enriched sequencing data for a plurality of sequences, the method comprising:
at a computer system having one or more processors, and memory storing one or
more programs for execution by the one or more processors:
A) obtaining, from a first panel-enriched sequencing reaction, a first
plurality of
sequence reads, wherein the plurality of sequence reads comprises:
(i) a corresponding sequence for each cell-free DNA fragment in a first
plurality of cell-free DNA fragments obtained from a liquid biopsy sample from
the test
subject, wherein each respective cell-free DNA fragment in the first plurality
of cell-free
DNA fragments corresponds to a respective probe sequence in a plurality of
probe sequences
used to enrich cell-free DNA fragments in the liquid biopsy sample in the
first panel-enriched
sequencing reaction; and
(ii) a corresponding sequence for each cell-free DNA fragment in a second
plurality of cell-free DNA fragments obtained from the liquid biopsy sample,
wherein each
respective cell-free DNA fragment in the second plurality of DNA fragments
does not
correspond to any probe sequence in the plurality of probe sequences;
B) determining a plurality of bin-level coverage ratios from the plurality of
sequence
reads, each respective bin-level coverage ratio in the plurality of bin-level
coverage ratios
corresponding to a respective bin in a plurality of bins, wherein:
each respective bin in the plurality of bins represents a corresponding region
of a reference sequence for the species of the test subject, and
each respective bin-level coverage ratio in the plurality of bin-level
coverage
ratios is determined from a comparison of (i) a number of sequence reads in
the plurality of
sequence reads that map to the corresponding bin and (ii) a number of sequence
reads from
one or more reference samples that map to the corresponding bin;
C) determining a plurality of segment-level coverage ratios by:
forming a plurality of segments by grouping respective subsets of adjacent
bins in the plurality of bins based on a similarity between the respective
coverage ratios of the
subset of adjacent bins, and
determining, for each respective segment in the plurality of segments, a
segment-level coverage ratio based on the corresponding bin-level coverage
ratios for each
bin in the respective segment;
D) fitting, for each respective simulated circulating tumor fraction in a
plurality of
simulated circulating tumor fractions, each respective segment in the
plurality of segments to
a respective integer copy state in a plurality of integer copy states, by
identifying the
respective integer copy state in the plurality of integer copy states that
best matches the
segment-level coverage ratio, thereby generating, for each respective
simulated circulating
tumor fraction in the plurality of simulated tumor fractions, a respective set
of integer copy
states for the plurality of segments; and
E) estimating the circulating tumor fraction for the test subject based on a
measure of
fit between corresponding segment-level coverage ratios and integer copy
states across the
plurality of simulated circulated tumor fractions.
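Steps B) and C) of claims 139 and 140 can be pictured with the following simplified sketch. It is not the segmentation method of the disclosure: the library-size normalisation, the pseudocount, and the greedy merge rule (start a new segment when a bin departs from the running segment mean by more than `max_jump`) are assumptions standing in for whatever comparison and similarity criteria are actually used.

```python
from math import log2
from typing import List, Sequence, Tuple


def bin_coverage_ratios(sample_counts: Sequence[int],
                        reference_counts: Sequence[int],
                        pseudocount: float = 1.0) -> List[float]:
    """log2 ratio of library-size-normalised read counts per bin (step B)."""
    s_total = sum(sample_counts)
    r_total = sum(reference_counts)
    return [log2(((s + pseudocount) / s_total) / ((r + pseudocount) / r_total))
            for s, r in zip(sample_counts, reference_counts)]


def segment_bins(ratios: Sequence[float],
                 max_jump: float = 0.2) -> List[Tuple[int, int, float]]:
    """Greedy grouping of adjacent bins with similar ratios (step C).

    Returns (start, end_exclusive, segment_level_ratio) per segment, where
    the segment-level ratio is the mean of its bin-level ratios.
    """
    if not ratios:
        return []
    segments = []
    start, total = 0, ratios[0]
    for i in range(1, len(ratios)):
        mean = total / (i - start)
        if abs(ratios[i] - mean) > max_jump:
            segments.append((start, i, mean))
            start, total = i, ratios[i]
        else:
            total += ratios[i]
    segments.append((start, len(ratios), total / (len(ratios) - start)))
    return segments


print(segment_bins([0.0, 0.02, -0.01, 0.5, 0.52, 0.49]))
```

With panel-enriched data, both on-target reads (fragments matching a probe) and off-target reads (fragments matching no probe) contribute counts to the bins.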
141. The method of claim 139 or 140, wherein estimating the circulating tumor
fraction
comprises minimization of an error between corresponding segment-level
coverage ratios and
integer copy states across the plurality of simulated circulated tumor
fractions.
142. The method of claim 139 or 140, wherein estimating the circulating tumor
fraction
comprises:
identifying a plurality of local minima for the error between corresponding
segment-
level coverage ratios and integer copy states across the plurality of
simulated circulated tumor
fractions, and
selecting the local minima that is closest to a second estimate of circulating
tumor
fraction determined by a different methodology.
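The selection rule of claim 142 can be sketched as follows: find every local minimum of the error curve over the simulated tumor fractions, then keep the one nearest an independent estimate. The function names and the treatment of plateaus and end points are illustrative assumptions.

```python
from typing import List, Sequence


def local_minima(errors: Sequence[float]) -> List[int]:
    """Indices whose error is below the left neighbour and not above the right."""
    inf = float("inf")
    minima = []
    for i, err in enumerate(errors):
        left = errors[i - 1] if i > 0 else inf
        right = errors[i + 1] if i + 1 < len(errors) else inf
        if err < left and err <= right:
            minima.append(i)
    return minima


def pick_tumor_fraction(tumor_fractions: Sequence[float],
                        errors: Sequence[float],
                        second_estimate: float) -> float:
    """Local minimum of the error curve closest to a second, independent estimate."""
    candidates = local_minima(errors)
    if not candidates:
        raise ValueError("error curve has no local minimum")
    best = min(candidates, key=lambda i: abs(tumor_fractions[i] - second_estimate))
    return tumor_fractions[best]


tfs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
errs = [0.9, 0.4, 0.7, 0.8, 0.3, 0.6]
print(pick_tumor_fraction(tfs, errs, second_estimate=0.12))
```

In this example the global minimum sits at 0.25, but an orthogonal estimate of 0.12 steers the choice to the local minimum at 0.10, which is the disambiguation the claim describes.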
143. The method of claim 142, wherein the second estimate of circulating tumor
fraction is
generated by:
(i) detecting a plurality of germline variants in the liquid biopsy sample
based on the
first plurality of sequence reads;
(ii) determining, for each respective germline variant in the plurality of
germline
variants, a corresponding germline variant allele frequency for the liquid
biopsy sample,
thereby determining a plurality of germline variant allele frequencies for the
liquid biopsy
sample;
(iii) determining, for each respective germline variant in the plurality of
germline
variants, an absolute value of the difference between the corresponding
germline variant
allele frequency for the liquid biopsy sample and a germline variant allele
frequency for the
respective germline variant allele in a non-cancerous tissue of the subject,
thereby
determining a plurality of germline variant allele deltas for the liquid
biopsy sample; and
(iv) estimating the circulating tumor fraction for the liquid biopsy sample as
twice the
value of the maximum germline variant allele delta in the plurality of
germline variant allele
deltas.
144. The method of claim 143, wherein, for each respective germline variant in
the
plurality of germline variants, the corresponding germline variant allele
frequency for the
respective germline variant allele in a non-cancerous tissue of the subject is
defined as 0.5.
145. The method of claim 143, wherein, for each respective germline variant in
the
plurality of germline variants, the corresponding germline variant allele
frequency for the
respective germline variant allele in a non-cancerous tissue of the subject is
determined based
on a second sequencing reaction of nucleic acids from a non-cancerous sample
of the subject.
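The germline-based estimate of claims 143 to 145 reduces to a few lines. In this sketch the function name is hypothetical; when no matched normal frequencies are supplied, each germline variant is assumed heterozygous with an expected allele frequency of 0.5, as in claim 144.

```python
from typing import Optional, Sequence


def tumor_fraction_from_germline(liquid_vafs: Sequence[float],
                                 normal_vafs: Optional[Sequence[float]] = None) -> float:
    """Twice the largest absolute deviation of germline VAFs from their normal value."""
    if normal_vafs is None:
        normal_vafs = [0.5] * len(liquid_vafs)   # claim 144: defined as 0.5
    deltas = [abs(liquid - normal)
              for liquid, normal in zip(liquid_vafs, normal_vafs)]
    return 2 * max(deltas)


# Three heterozygous germline variants observed in the liquid biopsy sample.
print(tumor_fraction_from_germline([0.50, 0.56, 0.47]))
```

Passing `normal_vafs` measured from a second sequencing reaction on a non-cancerous sample corresponds to the variant in claim 145.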
146. The method of claim 142, wherein the second estimate of circulating tumor
fraction is
generated by:
(i) detecting a plurality of somatic variants in the liquid biopsy sample
based on the
first plurality of sequence reads;
(ii) determining, for each respective somatic variant in the plurality of
somatic
variants, a corresponding somatic variant allele frequency for the liquid
biopsy sample,
thereby determining a plurality of somatic variant allele frequencies for the
liquid biopsy
sample; and
(iii) estimating the circulating tumor fraction for the liquid biopsy sample
as twice the
value of the largest somatic variant allele frequency in the plurality of
somatic variant allele
frequencies.
147. The method of claim 142, wherein the second estimate of circulating tumor
fraction is
generated by:
(i) detecting a plurality of somatic variants in the liquid biopsy sample
based on the
first plurality of sequence reads;
(ii) determining, for each respective somatic variant in the plurality of
somatic
variants, a corresponding somatic variant allele frequency for the liquid
biopsy sample,
thereby determining a plurality of somatic variant allele frequencies for the
liquid biopsy
sample; and
(iii) estimating the circulating tumor fraction for the liquid biopsy sample
as the value
of the largest somatic variant allele frequency in the plurality of somatic
variant allele
frequencies.
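The two somatic-variant estimates in claims 146 and 147 differ only by a factor of two. A minimal sketch, with a hypothetical function name and flag:

```python
from typing import Sequence


def tumor_fraction_from_somatic(somatic_vafs: Sequence[float],
                                double: bool = True) -> float:
    """Largest somatic VAF, doubled (claim 146) or taken as-is (claim 147)."""
    top = max(somatic_vafs)
    return 2 * top if double else top


vafs = [0.02, 0.08, 0.05]
print(tumor_fraction_from_somatic(vafs))                # doubled estimate
print(tumor_fraction_from_somatic(vafs, double=False))  # undoubled estimate
```

Doubling is consistent with treating the highest-frequency somatic variant as clonal and present on one of two copies; that interpretation is an inference and is not stated in the claims.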
148. The method of any one of claims 139-147, wherein the plurality of probe
sequences
used to enrich cell-free DNA fragments in the liquid biopsy sample in the
first panel-enriched
sequencing reaction collectively map to at least 25 different genes in human
reference
genome.
149. The method of any one of claims 139-148, wherein the plurality of integer
copy states
comprise a 1-copy state, a 2-copy state, a 3-copy state, and a 4-copy state.
150. The method of any one of claims 139-149, wherein the fitting D) includes
using a
maximum likelihood estimation method to fit each respective segment in the
plurality of
segments to the respective integer copy state.
151. The method of claim 150, wherein the maximum likelihood estimation method
is an
expectation maximization algorithm that considers the error between each of
the plurality of
copy states and the segment-level coverage ratio at each of the plurality of
simulated
circulating tumor fractions.
152. The method of any one of claims 139-151, wherein the fitting D) includes
for each
respective simulated tumor fraction in the plurality of simulated tumor
fractions:
determining, for each respective integer copy state in the plurality of
integer copy
states, a corresponding expected coverage ratio;
comparing, for each respective segment in the plurality of segments, the
corresponding segment-level coverage ratio to the expected
coverage ratio for
each respective integer copy state in the plurality of integer copy states;
and
assigning, for each respective segment in the plurality of segments, a
corresponding
integer copy state based on the comparison.
153. The method of claim 152, wherein, for each respective integer copy state
in the
plurality of integer copy states, the corresponding expected coverage ratio is
determined
according to the relationship:
\[ \log_2(CR) = \log_2\left( \frac{2(1 - TF_i) + CN_j \cdot TF_i}{2} \right), \]
wherein:
CR is the expected coverage ratio;
TF_i is the respective simulated circulating tumor fraction, and
CN_j is the respective integer copy state.
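Reading the relationship in claim 153 as log2(CR) = log2((2(1 - TF) + CN x TF) / 2), the expected coverage ratio is a mixture of diploid normal DNA (weight 1 - TF) and tumor DNA at copy number CN (weight TF), relative to a diploid baseline. The function name below is hypothetical.

```python
from math import log2


def expected_log2_coverage_ratio(tumor_fraction: float, copy_number: int) -> float:
    """Expected log2 coverage ratio for a segment at a given integer copy state."""
    mixed_copies = 2 * (1 - tumor_fraction) + copy_number * tumor_fraction
    return log2(mixed_copies / 2)


# At a 20% tumor fraction, each copy state predicts a distinct ratio.
for cn in (1, 2, 3, 4):
    print(cn, round(expected_log2_coverage_ratio(0.20, cn), 4))
```

A 2-copy segment always predicts a ratio of 0 regardless of tumor fraction, so only segments with altered copy number carry information about the tumor fraction.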
154. The method of any one of claims 139-153, wherein the plurality of
simulated
circulating tumor fractions comprises at least 10 simulated circulating tumor
fractions.
155. The method of any one of claims 139-154, wherein the plurality of
simulated
circulating tumor fractions spans a range of at least from 5% to 25%.
156. The method of any one of claims 139-154, wherein the plurality of
simulated
circulating tumor fractions spans a range of at least from 1% to 50%.
157. The method of any one of claims 139-156, wherein the span between each
consecutive pair of simulated tumor fractions is no more than 5%.
158. The method of any one of claims 139-157, wherein the determining E)
comprises:
determining a measure of fit, for each respective simulated tumor fraction in
the
plurality of simulated tumor fractions, based on the aggregate of a
difference, for each
respective segment in the plurality of segments, between the respective
segment-level
coverage ratio and the expected coverage ratio for the corresponding copy
state fit to the
respective segment; and
selecting the simulated tumor fraction, in the plurality of tumor fractions,
with the
best measure of fit.
159. The method of claim 158, wherein the measure of fit for each respective
segment, in
the plurality of segments, is defined by the relationship:
\[ w_i = \sum_k \delta_k l_k, \]
wherein:
w_i is the measure of fit for simulated tumor fraction i,
δ_k is the square of the difference between the respective segment-level
coverage ratio and expected coverage ratio for the copy state k at tumor
fraction i, and
l_k is the number of probe sequences, in the plurality of probe sequences, that
fall within the respective segment.
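Claims 152, 158 and 159 together describe a grid search, sketched below under stated assumptions: copy states 1 to 4 (claim 149), a grid of simulated tumor fractions from 1% to 50% in 1% steps, a nearest-expected-ratio assignment standing in for the fitting step, and a measure of fit equal to the probe-weighted sum of squared differences. Segments are given as hypothetical (log2 coverage ratio, probe count) pairs.

```python
from math import log2
from typing import List, Optional, Sequence, Tuple

COPY_STATES = (1, 2, 3, 4)


def expected_ratio(tf: float, cn: int) -> float:
    """Expected log2 coverage ratio at tumor fraction tf and copy state cn."""
    return log2((2 * (1 - tf) + cn * tf) / 2)


def fit_score(segments: Sequence[Tuple[float, int]], tf: float) -> float:
    """Probe-weighted squared error after assigning each segment its best copy state."""
    score = 0.0
    for ratio, n_probes in segments:
        squared_error = min((ratio - expected_ratio(tf, cn)) ** 2
                            for cn in COPY_STATES)
        score += squared_error * n_probes
    return score


def estimate_tumor_fraction(segments: Sequence[Tuple[float, int]],
                            grid: Optional[Sequence[float]] = None
                            ) -> Tuple[float, List[float]]:
    """Return the simulated tumor fraction with the best (lowest) fit score."""
    if grid is None:
        grid = [i / 100 for i in range(1, 51)]
    scores = [fit_score(segments, tf) for tf in grid]
    best = min(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores


# Synthetic segments generated at a true tumor fraction of 0.30.
truth = 0.30
segments = [(expected_ratio(truth, cn), probes)
            for cn, probes in [(2, 400), (3, 120), (1, 80), (4, 30)]]
estimate, _ = estimate_tumor_fraction(segments)
print(estimate)
```

The returned score list is the error curve over the grid, which is what a local-minimum selection such as the one in claim 142 would operate on. A sample with only 2-copy segments scores zero at every tumor fraction, so this approach needs at least some copy number alterations to be informative.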
160. The method of any one of claims 139-159, further comprising generating a
report for
the test subject comprising the circulating tumor fraction for the test
subject.
161. The method of claim 160, wherein the generated report further comprises
therapeutic
recommendations for the test subject based on the reported circulating tumor
fraction for the
test subject.
162. A computer system comprising:
one or more processors; and
a non-transitory computer-readable medium including computer-executable
instructions that, when executed by the one or more processors, cause the
processors to
perform a method according to any one of claims 139-161.
163. A non-transitory computer-readable storage medium having stored thereon
program
code instructions that, when executed by a processor, cause the processor to
perform the
method according to any one of claims 139-161.
Description

Note: Descriptions are shown in the official language in which they were submitted.


WO 2021/168146
PCT/US2021/018622
METHODS AND SYSTEMS FOR A LIQUID BIOPSY ASSAY
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional
Patent Application No.
63/041,272, filed June 19, 2020, U.S. Provisional Patent Application No.
63/041,293, filed
June 19, 2020, U.S. Provisional Patent Application No. 63/041,424, filed June
19, 2020, and
U.S. Provisional Patent Application No. 62/978,130, filed February 18, 2020,
the contents of
which are hereby incorporated by reference, in their entireties, for all
purposes.
FIELD OF THE INVENTION
[0002] The present disclosure relates generally to the use of
cell-free DNA sequencing
data to provide clinical support for personalized treatment of cancer.
BACKGROUND
[0003] Precision oncology is the practice of tailoring cancer
therapy to the unique
genomic, epigenetic, and/or transcriptomic profile of an individual's cancer.
Personalized
cancer treatment builds upon conventional therapeutic regimens used to treat
cancer based
only on the gross classification of the cancer, e.g., treating all breast
cancer patients with a
first therapy and all lung cancer patients with a second therapy. This field
was born out of
many observations that different patients diagnosed with the same type of
cancer, e.g., breast
cancer, responded very differently to common treatment regimens. Over time,
researchers
have identified genomic, epigenetic, and transcriptomic markers that improve
predictions as
to how an individual cancer will respond to a particular treatment modality.
[0004] There is growing evidence that cancer patients who receive
therapy guided by
their genetics have better outcomes. For example, studies have shown that
targeted therapies
result in significantly improved progression-free cancer survival. See, e.g.,
Radovich M. et al., Oncotarget, 7(35):56491-500 (2016). Similarly, reports from the
IMPACT trial, a large (n = 1307) retrospective analysis of consecutive, prospectively
molecularly profiled patients with advanced cancer who participated in a large,
personalized medicine trial, indicate that
patients receiving targeted therapies matched to their tumor biology had a
response rate of
16.2%, as opposed to a response rate of 5.2% for patients receiving non-
matched therapy.
Tsimberidou AM et al., ASCO 2018, Abstract LBA2553 (2018).
[0005] In fact, therapy targeted to specific genomic alterations
is already the standard of
care in several tumor types, e.g., as suggested in the National Comprehensive
Cancer
Network (NCCN) guidelines for melanoma, colorectal cancer, and non-small cell
lung
cancer. In practice, implementation of these targeted therapies requires
determining the
status of the diagnostic marker in each eligible cancer patient. While this
can be
accomplished for the few, well known mutations associated with treatment
recommendations
in the NCCN guidelines using individual assays or small next generation
sequencing (NGS)
panels, the growing number of actionable genomic alterations and increasing
complexity of
diagnostic classifiers necessitates a more comprehensive evaluation of each
patient's cancer
genome, epigenome, and/or transcriptome.
[0006] For instance, some evidence suggests that use of
combination therapies where
each component is matched to an actionable genomic alteration holds the
greatest potential
for treating individual cancers. To this point, a retrospective study of cancer
patients treated
with one or more therapeutic regimens revealed that patients who received
therapies matched
to a higher percentage of their genomic alterations experienced a greater
frequency of stable
disease (e.g., a longer time to recurrence), longer time to treatment failure,
and greater overall
survival. Wheler JJ et al., Cancer Res., 76:3690-701 (2016). Thus,
comprehensive
evaluation of each cancer patient's genome, epigenome, and/or transcriptome
should
maximize the benefits provided by precision oncology, by facilitating more
fine-tuned
combination therapies, use of novel off-label drug indications, and/or tissue
agnostic
immunotherapy. See, for example, Schwaederle M. et al., J Clin Oncol.,
33(32):3817-25
(2015); Schwaederle M. et al., JAMA Oncol., 2(11):1452-59 (2016); and Wheler JJ
et al.,
Cancer Res., 76(13):3690-701 (2016). Further, the use of comprehensive next
generation
sequencing analysis of cancer genomes facilitates better access and a larger
patient pool for
clinical trial enrollment. Coyne GO et al., Curr. Probl. Cancer, 41(3):182-93
(2017); and
Markman M., Oncology, 31(3):158, 168.
[0007] The use of large NGS genomic analysis is growing in order
to address the need for
more comprehensive characterization of an individual's cancer genome. See, for
example,
Fernandes GS et al., Clinics, 72(10):588-94. Recent studies indicate that of
the patients for
which large NGS genomic analysis is performed, 30-40% then receive clinical
care based on
the assay results, which is limited by at least the identification of
actionable genomic
alterations, the availability of medication for treatment of identified
actionable genomic
alterations, and the clinical condition of the subject. See, Ross JS et al.,
JAMA Oncol.,
1(1):40-49 (2015); Ross JS et al., Arch. Pathol. Lab Med., 139:642-49 (2015);
Hirshfield KM
et al., Oncologist, 21(11):1315-25 (2016); and Groisberg R. et al.,
Oncotarget, 8:39254-67
(2017).
[0008] However, these large NGS genomic analyses are
conventionally performed on
solid tumor samples. For instance, each of the studies referenced in the
paragraph above
performed NGS analysis of FFPE tumor blocks from patients. Solid tissue
biopsies remain
the gold standard for diagnosis and identification of predictive biomarkers
because they
represent well-known and validated methodologies that provide a high degree of
accuracy.
Nevertheless, there are significant limitations to the use of solid tissue
material for large NGS
genomic analyses of cancers. For example, tumor biopsies are subject to
sampling bias
caused by spatial and/or temporal genetic heterogeneity, e.g., between two
regions of a single
tumor and/or between different cancerous tissues (such as between primary and
metastatic
tumor sites or between two different primary tumor sites). Such intertumor or
intratumor
heterogeneity can cause sub-clonal or emerging mutations to be overlooked when
using
localized tissue biopsies, with the potential for sampling bias to be
exacerbated over time as
sub-clonal populations further evolve and/or shift in predominance.
[0009] Additionally, the acquisition of solid tissue biopsies
often requires invasive
surgical procedures, e.g., when the primary tumor site is located at an
internal organ. These
procedures can be expensive, time consuming, and carry a significant risk to
the patient, e.g.,
when the patient's health is poor and may not be able to tolerate invasive
medical procedures
and/or the tumor is located in a particularly sensitive or inoperable
location, such as in the
brain or heart. Further, the amount of tissue, if any, that can be procured
depends on multiple
factors, including the location of the tumor, the size of the tumor, the
fragility of the patient,
and the risk of comorbidities related to biopsies, such as bleeding and
infections. For
instance, recent studies report that tissue samples in a majority of advanced
non-small cell
lung cancer patients are limited to small biopsies and cannot be obtained at
all in up to 31%
of patients. Ilie and Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
Even when a
tissue biopsy is obtained, the sample may be too scant for comprehensive
testing.
[0010] Further, the method of tissue collection, preservation
(e.g., formalin fixation),
and/or storage of tissue biopsies can result in sample degradation and
variable quality DNA.
This, in turn, leads to inaccuracies in downstream assays and analysis,
including next-generation
sequencing (NGS) for the identification of biomarkers. Ilie and
Hofman, Transl.
Lung Cancer Res., 5(4):420-23 (2016).
[0011] In addition, the invasive nature of the biopsy procedure,
the time and cost
associated with obtaining the sample, and the compromised state of cancer
patients receiving
therapy render repeat testing of cancerous tissues impracticable, if not
impossible. As a
result, solid tissue biopsy analysis is not amenable to many monitoring
schemes that would
benefit cancer patients, such as disease progression analysis, treatment
efficacy evaluation,
disease recurrence monitoring, and other techniques that require data from
several time
points.
[0012] Cell-free DNA (cfDNA) has been identified in various
bodily fluids, e.g., blood
serum, plasma, urine, etc. Chan et al., Ann. Clin. Biochem., 40(Pt 2):122-30
(2003). This
cfDNA originates from necrotic or apoptotic cells of all types, including
germline cells,
hematopoietic cells, and diseased (e.g., cancerous) cells. Advantageously,
genomic
alterations in cancerous tissues can be identified from cfDNA isolated from
cancer patients.
See, e.g., Stroun et al., Oncology, 46(5):318-22 (1989); Goessl et al., Cancer
Res.,
60(21):5941-45 (2000); and Frenel et al., Clin. Cancer Res. 21(20):4586-96
(2015). Thus,
one approach to overcoming the problems presented by the use of solid tissue
biopsies
described above is to analyze cell-free nucleic acids (e.g., cfDNA) and/or
nucleic acids in
circulating tumor cells present in biological fluids, e.g., via a liquid
biopsy.
[0013] Specifically, liquid biopsies offer several advantages
over conventional solid
tissue biopsy analysis. For instance, because bodily fluids can be collected
in a minimally
invasive or non-invasive fashion, sample collection is simpler, faster, safer,
and less
expensive than solid tumor biopsies. Such methods require only small amounts
of sample
(e.g., 10 mL or less of whole blood per biopsy) and reduce the discomfort and
risk of
complications experienced by patients during conventional tissue biopsies. In
fact, liquid
biopsy samples can be collected with limited or no assistance from medical
professionals and
can be performed at almost any location. Further, liquid biopsy samples can be
collected
from any patient, regardless of the location of their cancer, their overall
health, and any
previous biopsy collection. This allows for analysis of the cancer genome of
patients from
which a solid tumor sample cannot be easily and/or safely obtained. In
addition, because
cell-free DNA in the bodily fluids arises from many different types of tissues
in the patient,
the genomic alterations present in the pool of cell-free DNA are
representative of various
different clonal sub-populations of the cancerous tissue of the subject,
facilitating a more
comprehensive analysis of the cancerous genome of the subject than is possible
from one or
more sections of a single solid tumor sample.
[0014] Liquid biopsies also enable serial genetic testing prior
to cancer detection, during
the early stages of cancer progression, throughout the course of treatment,
and during
remission, e.g., to monitor for disease recurrence. The ability to conduct
serial testing via
non-invasive liquid biopsies throughout the course of disease could prove
beneficial for many
patients, e.g., through monitoring patient response to therapies, the
emergence of new
actionable genomic alterations, and/or drug-resistance alterations. These
types of information
allow medical professionals to more quickly tailor and update therapeutic
regimens, e.g.,
facilitating more timely intervention in the case of disease progression. See,
e.g., Ilie and
Hofman, Transl. Lung Cancer Res., 5(4):420-23 (2016).
[0015] Nevertheless, while liquid biopsies are promising tools
for improving outcomes
using precision oncology, there are significant challenges specific to the use
of cell-free DNA
for evaluation of a subject's cancer genome. For instance, there is a highly
variable signal-to-
noise ratio from one liquid biopsy sample to the next. This occurs because
cfDNA originates
from a variety of different cells in a subject, both healthy and diseased.
Depending on the
stage and type of cancer in any particular subject, the fraction of cfDNA
fragments
originating from cancerous cells (the "tumor fraction" or "ctDNA fraction" of
the
sample/subject) can range from almost 0% to well over 50%. Other factors,
including tumor
type and mutation profile, can also impact the amount of DNA released from
cancerous
tissues. For instance, cfDNA clearance through the liver and kidneys is
affected by a variety
of factors, including renal dysfunction or other tissue damaging factors
(e.g., chemotherapy,
surgery, and/or radiotherapy).
[0016] This, in turn, leads to problems detecting and/or
validating cancer-specific
genomic alterations in a liquid sample. This is particularly true during early
stages of the
disease, when cancer therapies have much higher success rates,
because the tumor fraction
in the patient is lowest at this point. Thus, early stage cancer patients can
have ctDNA
fractions below the limit of detection (LOD) for one or more informative
genomic alterations,
limiting clinical utility because of the risk of false negatives and/or
providing an incomplete
picture of the cancer genome of the patient. Further, because cancers, and
even individual
tumors, can be clonally diverse, actionable genomic alterations that arise in
only a subset of
clonal populations are diluted below the overall tumor fraction of the sample,
further
frustrating attempts to tailor combination therapies to the various actionable
mutations in the
patient's cancer genome. Consequently, most studies using liquid biopsy
samples to date
have focused on late stage patients for assay validation and research.
[0017] Another challenge associated with liquid biopsies is the
accurate determination of
tumor fraction in a sample. This difficulty arises from at least the
heterogeneity of cancers
and the increased frequency of large chromosomal duplications and deletions
found in
cancers. As a result, the frequency of genomic alterations from cancerous
tissues varies from
locus to locus based on at least (i) their prevalence in different sub-clonal
populations of the
subject's cancer, and (ii) their location within the genome, relative to large
chromosomal
copy number variations. The difficulty in accurately determining the tumor
fraction of liquid
biopsy samples affects accurate measurement of various cancer features shown
to have
diagnostic value for the analysis of solid tumor biopsies. These include
allelic ratios, copy
number variations, overall mutational burden, frequency of abnormal
methylation patterns,
etc., all of which are correlated with the percentage of DNA fragments that
arise from
cancerous tissue, as opposed to healthy tissue.
[0018] Altogether, these factors result in highly variable
concentrations of ctDNA, from
patient to patient and possibly from locus to locus, that confound accurate
measurement of
disease indicators and actionable genomic alterations. Further, the quantity
and quality of
cfDNA obtained from liquid biopsy samples are highly dependent on the
particular
methodology for collecting the samples, storing the samples, sequencing the
samples, and
standardizing the sequencing data.
[0019] While validation studies of existing liquid biopsy assays
have shown high
sensitivity and specificity, few studies have corroborated results with
orthogonal methods, or
between particular testing platforms, e.g., different NGS technologies and/or
targeted panel
sequencing versus whole genome/exome sequencing. Reports of liquid biopsy-based
studies
are limited by comparison to non-comprehensive tissue testing algorithms
including Sanger
sequencing, small NGS hotspot panels, polymerase chain reaction (PCR), and
fluorescent in
situ hybridization (FISH), which may not contain all NCCN guideline genes in
their
reportable range, thus suffering in comparison to a more comprehensive liquid
biopsy assay.
[0020] As an example, conventional liquid biopsy assays do not
provide accurate
classifications of copy number variations (CNVs) for genomic targets (e.g.,
biomarkers),
where CNVs are a form of genomic alteration with known relevance to cancer.
Conventional
methodologies typically assign a genomic target to an integer copy number
and/or one of
three copy number states (e.g., amplified, neutral, or deleted) using a copy
ratio cutoff above
or below which an amplified or deleted status is called, respectively, or in
which a neutral
status is otherwise called. Such methodologies make these assignments based on
the fact that
at a given tumor fraction and a known ploidy, the copy number in a segment is
positively
correlated with its copy ratio and thus the copy ratio can be mathematically
converted to an
integer copy number. For example, one conventional method, ichorCNA, utilizes
software
that estimates tumor fraction in circulating cfDNA from ultra-low-pass whole
genome
sequencing, which is then used to determine genomic alterations such as copy
number
alterations. See, Adalsteinsson et al., Nat Commun., 8:1324 (2017).
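The conversion from copy ratio to integer copy number described above can be sketched as follows. This is a minimal illustration, assuming a two-component mixture of tumor-derived and normal cfDNA and a linear (not log2) copy ratio. The function names, the default ploidy of 2, and the mixture formula are assumptions made for illustration; they are not taken from ichorCNA or from the present disclosure.

```python
def expected_copy_ratio(copy_number: int, tumor_fraction: float,
                        tumor_ploidy: float = 2.0,
                        normal_ploidy: float = 2.0) -> float:
    """Expected linear copy ratio of a segment in a tumor/normal mixture.

    Assumed model: the numerator is the average copies of the segment per
    cell in the mixture; the denominator is the average genome-wide ploidy.
    """
    mixed = tumor_fraction * copy_number + (1.0 - tumor_fraction) * normal_ploidy
    baseline = tumor_fraction * tumor_ploidy + (1.0 - tumor_fraction) * normal_ploidy
    return mixed / baseline


def integer_copy_number(copy_ratio: float, tumor_fraction: float,
                        tumor_ploidy: float = 2.0,
                        normal_ploidy: float = 2.0) -> int:
    """Invert the mixture model to recover an integer tumor copy number."""
    if not 0.0 < tumor_fraction <= 1.0:
        raise ValueError("tumor_fraction must be in (0, 1]")
    baseline = tumor_fraction * tumor_ploidy + (1.0 - tumor_fraction) * normal_ploidy
    # Remove the contribution of normal cfDNA, then rescale by tumor fraction.
    raw = (copy_ratio * baseline - (1.0 - tumor_fraction) * normal_ploidy) / tumor_fraction
    return max(0, round(raw))


if __name__ == "__main__":
    # The same observed ratio implies very different copy numbers
    # depending on the tumor fraction that is assumed.
    for tf in (0.2, 0.05, 0.01):
        print(tf, integer_copy_number(1.4, tf))
```

Under this assumed model the inferred copy number depends strongly on the tumor fraction supplied, which is why an inaccurate tumor fraction estimate propagates directly into the copy number call.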
[0021] However, this approach can be problematic due to the
current challenges in
accurately determining tumor fraction in liquid biopsy samples. For example,
estimating the
ctDNA fraction of total cell-free DNA in plasma can be difficult due to highly
variable tumor
fractions that can range from 0 to approximately 90%, and in many cases can be
below 1%
and/or below the limit of detection. See, Shigematsu and Koyama, Nihon Jinzo
Gakkai Shi.,
30(9):1115-22 (1988). Methods based on mean, median, maximum or other point
estimates
of somatic variant allele fractions (VAFs) require the difficult task of
accurate quantification
and classification of somatic and germline variants in liquid biopsy samples,
which can be
further complicated by the absence of a matched normal sample or the presence
of artifactual
variants and/or clonal heterogeneity. In addition to the reliance on
potentially inaccurate
tumor fraction estimations, methods that utilize ultra-low-pass whole genome
sequencing
assays may be inappropriate for analyzing copy number variations from capture-
based deep
sequencing assays.
[0022] Additional challenges arise in cases where non-focal copy
number variations are
identified (e.g., where an entire chromosome or a large portion of a
chromosome is amplified
or deleted). Non-focal copy number variations are often difficult to
interpret, as these large-
scale copy number changes may represent real copy number variations or may be
artifacts
resulting from incorrect normalization due to low sample quality, capture
failures, or other
unknown issues during library preparation or sequencing. Because such large-
scale copy
number changes are unlikely to be associated with therapeutically actionable
genomic
alterations, the ability to differentiate between real and artifactual copy
number variations is
an important and unmet need in precision oncology applications. For example,
two
conventional methods that are insufficient to distinguish focal copy number
variations from
non-focal copy number variations include CNVkit and AVENIO. See, for example,
Talevich
et al., PLoS Comput Biol, 12:1004873 (2016), and Roche, "AVENIO ctDNA
Expanded Kit,"
(2018), the contents of which are incorporated herein by reference, in their
entireties, for all
purposes.
[0023] As another example, conventional liquid biopsy assays do
not provide a method
for accurately detecting variants (e.g., variant alleles) in ctDNA NGS assays.
As described
above, many patients may not have abundant ctDNA in early stage disease and
may shed
variants below the limit of detection (LOD) for ctDNA assays, resulting in
false negatives.
Detecting these variants at low circulating fractions is also technically
challenging due to
constraints of sequencing by synthesis. Additionally, differentiating between
germline and
somatic variants in ctDNA is difficult, as is differentiating between
mutations derived from
clonal hematopoiesis (CH) and the solid tumor being assayed. In such cases,
mutations in
hematopoietic lineage cells may be mistaken for tumor-derived mutations.
Indeed,
researchers have identified several genes frequently mutated in CH with
potential importance
in cancer, such as JAK2, TP53, GNAS, IDH2, and KRAS. Mayrhofer et al., 2018,
"Cell-free
DNA profiling of metastatic prostate cancer reveals microsatellite
instability, structural
rearrangements and clonal hematopoiesis," Genome Med, (10), pg. 85; Hu et al.,
2018,
"False-Positive Plasma Genotyping Due to Clonal Hematopoiesis," Clin Cancer
Res, (24),
pg. 4437.
[0024] Additionally, conventional liquid biopsy
assays do not provide
accurate circulating tumor fraction estimates (ctFEs). Accurate ctFEs provide
several
benefits to liquid biopsy applications, including classification of variants
as somatic or
germline, detection of clinically relevant copy number variations, and/or use
of ctFEs as
biomarkers.
[0025] For example, because up to 30% of breast cancer patients
and up to 55% of lung
cancer patients relapse after initial treatment, as well as a significant
portion of patients in
other cancer cohorts, the ability to detect metastasis and disease recurrence
earlier in these
patients could significantly improve patient outcomes. See, Colleoni et al.,
2016, "Annual
Hazard Rates of Recurrence for Breast Cancer During 24 Years of Follow-Up:
Results From
the International Breast Cancer Study Group Trials I to V," J Clin Oncol, (34),
pg. 927; Yates
et al., 2017, "Genomic Evolution of Breast Cancer Metastasis and Relapse,"
Cancer Cell,
(32), pg. 169; Uramoto et al., 2014, "Recurrence after surgery in patients with
NSCLC,"
Transl Lung Cancer Res, (3), pg. 242; Taunk et al., 2017, "Immunotherapy and
radiation
therapy for operable early stage and locally advanced non-small cell lung
cancer," Transl
Lung Cancer Res, (6), pg. 178. Indeed, recent retrospective and prospective
studies have
shown ctDNA after completion of treatment or surgery can act as a biomarker
for disease
recurrence in many cancer types, including breast cancer, lung cancer,
melanoma, bladder
cancer, and colon cancer. See, Coombes et al., 2019, "Personalized Detection of
Circulating
Tumor DNA Antedates Breast Cancer Metastatic Recurrence," Clin Cancer Res,
(25), pg.
4255; Tie et al., 2019, "Circulating Tumor DNA Analyses as Markers of
Recurrence Risk
and Benefit of Adjuvant Therapy for Stage III Colon Cancer," JAMA Oncol,
print; McEvoy
et al., 2019, "Monitoring melanoma recurrence with circulating tumor DNA: a
proof of
concept from three case studies," Oncotarget, (10), pg. 113; Christensen
et al., 2019, "Early
Detection of Metastatic Relapse and Monitoring of Therapeutic Efficacy by
Ultra-Deep
Sequencing of Plasma Cell-Free DNA in Patients With Urothelial Bladder
Carcinoma," J
Clin Oncol, (37), pg. 1547; Isaksson et al., 2019, "Pre-operative plasma
cell-free circulating
tumor DNA and serum protein tumor markers as predictors of lung adenocarcinoma
recurrence," Acta Oncol, (58), pg. 1079. Higher ctFEs are associated with
disease
progression at radiographic evaluation and an increased metastatic lesion
count.
[0026] Furthermore, ctFEs correlate with important clinical
outcomes, and provide a
minimally invasive method to monitor patients for response to therapy, disease
relapse, and
disease progression. However, conventional methodologies used for determining
ctFEs in
liquid biopsy samples rely on low-pass, whole-genome sequencing, which cannot
also be
used for variant detection (see, for example, Adalsteinsson et al., "Scalable
whole-exome
sequencing of cell-free DNA reveals high concordance with metastatic tumors,"
(2017)
Nature Communications Nov 6;8(1):1324, doi:10.1038/s41467-017-00965-y; and
ichorCNA,
the Broad Institute, available on the internet at
github.com/broadinstitute/ichorCNA). Other
traditional approaches use variant allele fractions (VAFs) to estimate tumor
fraction, but such
approaches are confounded by variant tissue source and capture bias resulting
in high levels
of noise. Additionally, conventional methodologies for determining tumor
purity estimates in
solid tumor biopsy samples rely solely on on-target probe regions, which
cannot be used in
conjunction with targeted gene panels containing small numbers of genes.
[0027] The information disclosed in this Background section is
only for enhancement of
understanding of the general background of the invention and should not be
taken as an
acknowledgement or any form of suggestion that this information forms the
prior art already
known to a person skilled in the art.
SUMMARY
[0028] Given the above background, there is a need in the art for
improved methods and
systems for supporting clinical decisions in precision oncology using liquid
biopsy assays. In
particular, there is a need in the art for improved methods and systems for
identifying focal
copy number variations in liquid biopsy assays. The present disclosure solves
this and other
needs in the art by providing improvements in validating copy number variation
annotations,
thus identifying focal copy number variations in genomic segments obtained
from liquid
biopsy assays. For example, by applying a plurality of amplification and/or
deletion filters to
a dataset comprising bin-level copy ratios, segment-level copy ratios, and
segment-level
confidence intervals for a plurality of bins and segments, respectively, the
systems and
methods described herein reject or validate a focal copy number status
annotation at a
locus that is potentially actionable using precision oncology.
[0029] For example, in one aspect, the present disclosure
provides a method of validating
a copy number variation in a test subject, at a computer system having one or
more
processors, and memory storing one or more programs for execution by the one
or more
processors. The method comprises obtaining a first dataset that comprises a
plurality of bin-
level sequence ratios, each respective bin-level sequence ratio in the
plurality of bin-level
sequence ratios corresponding to a respective bin in a plurality of bins. Each
respective bin
in the plurality of bins represents a corresponding region of a human
reference genome, and
each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is
determined from a sequencing of a plurality of cell-free nucleic acids in a
first liquid biopsy
sample of the test subject and one or more reference samples.
[0030] The first dataset also comprises a plurality of segment-
level sequence ratios, each
respective segment-level sequence ratio in the plurality of segment-level
sequence ratios
corresponding to a segment in a plurality of segments. Each respective segment
in the
plurality of segments represents a corresponding region of the human reference
genome
encompassing a subset of adjacent bins in the plurality of bins, and each
respective segment-
level sequence ratio in the plurality of segment-level sequence ratios is
determined from a
measure of central tendency of the plurality of bin-level sequence ratios
corresponding to the
subset of adjacent bins encompassed by the respective segment.
[0031] The first dataset further comprises a plurality of segment-
level measures of
dispersion, where each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion (i) corresponds to a respective segment
in the plurality
of segments and (ii) is determined using the plurality of bin-level sequence
ratios
corresponding to the subset of adjacent bins encompassed by the respective
segment.
[0032] In this aspect, the method comprises validating a copy
number status annotation of
a respective segment in the plurality of segments that is annotated with a
copy number
variation by applying the first dataset to an algorithm having a plurality of
filters. A first
filter in the plurality of filters is a measure of central tendency bin-level
sequence ratio filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds. A second filter in the
plurality of filters is a
confidence filter that is fired when the segment-level measure of dispersion
corresponding to
the respective segment fails to satisfy a confidence threshold. A third filter
in the plurality of
filters is a measure of central tendency-plus-deviation bin-level sequence
ratio filter that is
fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more measure of central tendency-plus-deviation bin-level sequence
ratio thresholds.
In the third filter, the one or more measure of central tendency-plus-
deviation bin-level copy
ratio thresholds are derived from (i) a measure of the bin-level sequence
ratios corresponding
to the plurality of bins that map to the same chromosome of the human
reference genome as
the respective segment, and (ii) a measure of dispersion across the bin-level
sequence ratios
corresponding to the plurality of bins that map to the respective chromosome.
[0033] When a filter in the plurality of filters is fired, the
copy number status annotation
of the respective segment is rejected; and when no filter in the plurality of
filters is fired, the
copy number status annotation of the respective segment is validated.
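The three-filter logic described above can be sketched for an amplification call as follows. This is a minimal illustration under stated assumptions: ratios are treated as log2 copy ratios, the median and sample standard deviation stand in for the measures of central tendency and dispersion, the confidence filter is modeled as a cap on confidence interval width, and every threshold value in `Thresholds` is hypothetical. None of these specifics are taken from the disclosure.

```python
from dataclasses import dataclass
from statistics import median, stdev
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Thresholds:
    min_median_ratio: float = 0.5  # hypothetical amplification cutoff
    max_ci_width: float = 0.4      # hypothetical confidence cutoff
    n_deviations: float = 2.0      # hypothetical multiplier for filter 3


def validate_amplification(segment_bins: Sequence[float],
                           segment_ci: Tuple[float, float],
                           chromosome_bins: Sequence[float],
                           t: Thresholds = Thresholds()) -> Tuple[bool, List[str]]:
    """Return (validated, names of fired filters) for one annotated segment."""
    fired: List[str] = []
    seg_center = median(segment_bins)

    # Filter 1: central tendency of the segment's bins vs. a fixed threshold.
    if seg_center < t.min_median_ratio:
        fired.append("central_tendency")

    # Filter 2: segment-level dispersion vs. a confidence threshold.
    low, high = segment_ci
    if high - low > t.max_ci_width:
        fired.append("confidence")

    # Filter 3: the segment must stand out from its own chromosome, i.e.
    # exceed the chromosome's center plus a multiple of its dispersion.
    cutoff = median(chromosome_bins) + t.n_deviations * stdev(chromosome_bins)
    if seg_center < cutoff:
        fired.append("central_tendency_plus_deviation")

    # Any fired filter rejects the annotation; none fired validates it.
    return (not fired), fired


if __name__ == "__main__":
    segment = [1.0, 1.1, 0.9, 1.05]
    focal_chrom = [0.05, -0.05] * 20 + segment   # neutral background
    broad_chrom = segment * 11                   # whole chromosome gained
    print(validate_amplification(segment, (0.9, 1.1), focal_chrom))
    print(validate_amplification(segment, (0.9, 1.1), broad_chrom))
```

In this sketch the third filter is what separates a focal event from a chromosome-wide shift: the same segment passes against a neutral chromosome but is rejected when every bin on the chromosome is elevated.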
[0034] In another aspect, the present disclosure provides a
method for treating a patient
with a cancer containing a copy number variation of a target gene. The method
comprises
determining whether the patient has an aggressive form of cancer associated
with a focal
copy number variation of the target gene by obtaining a first biological
sample of the cancer
from the patient and performing copy number variation analysis on the first
biological sample
to identify the copy number status of the target gene in the cancer.
[0035] The copy number variation analysis generates a first
dataset comprising a plurality
of bin-level sequence ratios, each respective bin-level sequence ratio in the
plurality of bin-
level sequence ratios corresponding to a respective bin in a plurality of
bins. Each respective
bin in the plurality of bins represents a corresponding region of a human
reference genome,
and each respective bin-level sequence ratio in the plurality of bin-level
sequence ratios is
determined from a sequencing of a plurality of nucleic acids in the first
biological sample of
the cancer from the patient and one or more reference samples.
[0036] The first dataset also comprises a plurality of segment-
level sequence ratios, each
respective segment-level sequence ratio in the plurality of segment-level
sequence ratios
corresponding to a segment in a plurality of segments. Each respective segment
in the
plurality of segments represents a corresponding region of the human reference
genome
encompassing a subset of adjacent bins in the plurality of bins, and the
plurality of segment-
level sequence ratios is determined from a measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of adjacent bins encompassed
by the
respective segment.
[0037] The first dataset further comprises a plurality of segment-
level measures of
dispersion, where each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion (i) corresponds to a respective segment
in the plurality
of segments and (ii) is determined using the plurality of bin-level sequence
ratios
corresponding to the subset of adjacent bins encompassed by the respective
segment.
[0038] The method further comprises determining whether the copy
number variation of
the target gene is a focal copy number variation by applying the first dataset
to an algorithm
having a plurality of copy number variation filters. When the patient has the
aggressive form
of cancer associated with focal copy number variation of the target gene, a
first therapy for
the aggressive form of the cancer is administered to the patient, and when the
patient does not
have the aggressive form of cancer associated with focal copy number variation
of the target
gene, a second therapy for a less aggressive form of the cancer is administered
to the patient.
[0039] Additionally, there is a need in the art for improved
methods and systems for
identifying somatic tumor mutations in cell-free DNA, particularly where the
sample has low
tumor fractions. Advantageously, the present disclosure solves this and other
needs in the art
by providing improved somatic variant identification methodology that better
accounts for
locus-specific and/or sample-specific considerations to more accurately
identify true somatic
mutations in a liquid biopsy sample. For example, by using an application of
Bayes' theorem
to account for one or more of (i) the prevalence of variants at a specific
locus in a specific
cancer type, (ii) the variant allele fraction for the variant being evaluated,
(iii) the prevalence
of sequencing errors at a particular locus, and (iv) the actual sequencing
error rate of a
particular reaction, the variant filter methodologies described herein tune
the specificity and
sensitivity of variant count thresholds in a locus-specific fashion to achieve
higher accuracy
of true somatic variant calling in a liquid biopsy assay.
[0040] For example, in one aspect, the present disclosure
provides a method of validating
a somatic sequence variant in a test subject having a cancer condition. The
method is
performed at a computer system having one or more processors, and memory
storing one or
more programs for execution by the one or more processors. The method includes
obtaining,
from a first sequencing reaction, a corresponding sequence of each cell-free
DNA fragment in
a first plurality of cell-free DNA fragments in a liquid biopsy sample of the
test subject, thus
obtaining a first plurality of sequence reads. Each respective sequence read
in the first
plurality of sequence reads is aligned to a reference sequence for the species
of the subject,
thus identifying a variant allele fragment count for a candidate variant that
maps to a locus in
the reference sequence, and a locus fragment count for the locus encompassing
the candidate
variant.
[0041] The method further includes comparing the variant allele
fragment count for the
candidate variant against a dynamic variant count threshold for the locus in
the reference
sequence that the candidate variant maps to. The dynamic variant count
threshold is based
upon a pre-test odds of a positive variant call for the locus based on the
prevalence of variants
in a genomic region that includes the locus from a first set of nucleic acids
obtained from a
cohort of subjects having the cancer condition.
[0042] The method then includes rejecting or validating the
variant as a true somatic
variant based upon the dynamic variant count threshold. For instance, when the
variant allele
fragment count for the candidate variant satisfies the dynamic variant count
threshold for the
locus, the presence of the somatic sequence variant in the test subject is
validated. And when
the variant allele fragment count for the candidate variant does not satisfy
the dynamic
variant count threshold for the locus, the presence of the somatic sequence
variant in the test
subject is rejected.
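By way of non-limiting illustration only, the comparison described above can be sketched in Python as follows. The sketch assumes a simple binomial model in which the pre-test odds for a locus are multiplied by a likelihood ratio (variant at an expected allele fraction versus sequencing error) to obtain post-test odds. The function names, the parameters, the binomial model, and the 0.99 posterior cutoff are assumptions made for this example and are not taken from the disclosure.

```python
from math import log


def dynamic_variant_count_threshold(locus_count, prevalence, expected_vaf,
                                    error_rate, min_posterior=0.99):
    """Smallest variant fragment count whose post-test probability of being
    a true variant reaches min_posterior (hypothetical binomial model)."""
    # Pre-test odds from the prevalence of variants at this locus in a cohort.
    pretest_log_odds = log(prevalence / (1.0 - prevalence))
    required_log_odds = log(min_posterior / (1.0 - min_posterior))
    for k in range(1, locus_count + 1):
        # Log likelihood ratio of k variant fragments out of locus_count under
        # "true variant" versus "sequencing error"; the binomial coefficient
        # is the same in both likelihoods and cancels.
        log_lr = (k * log(expected_vaf / error_rate)
                  + (locus_count - k)
                  * log((1.0 - expected_vaf) / (1.0 - error_rate)))
        if pretest_log_odds + log_lr >= required_log_odds:
            return k
    return locus_count + 1  # no count at this depth is sufficient


def validate_variant(variant_count, locus_count, prevalence, expected_vaf,
                     error_rate):
    """True when the variant allele fragment count satisfies the threshold."""
    threshold = dynamic_variant_count_threshold(
        locus_count, prevalence, expected_vaf, error_rate)
    return variant_count >= threshold


# A frequently mutated locus needs fewer supporting fragments than a rare one.
hotspot = dynamic_variant_count_threshold(1000, 0.1, 0.01, 0.001)
rare = dynamic_variant_count_threshold(1000, 0.0001, 0.01, 0.001)
```

Under these assumptions, a higher cohort prevalence raises the pre-test odds and therefore lowers the fragment count required for validation at that locus.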
[0043] Additionally, there is a need in the art for improved
methods and systems for
determining accurate circulating tumor fraction estimates (ctFEs) in liquid
biopsy assays.
The present disclosure solves this and other needs in the art by providing
methods and
systems for estimating the circulating tumor fraction of a liquid biopsy
sample from a
targeted-panel sequencing reaction. For example, by fitting segment-level
coverage ratios for
on-target and off-target sequence reads distributed relatively uniformly along
the genome to
integer copy states across a range of simulated tumor fractions (e.g., using
maximum
likelihood estimation, for example, with an expectation-maximization
algorithm), the systems
and methods described herein can generate an accurate estimate of the
circulating tumor
fraction of a liquid biopsy sample. This is achieved, in some embodiments, by
identifying the
expected coverage ratios, given the fitted integer copy states, that best
match the
experimental coverage ratios. Such an accurate estimate of the circulating
tumor fraction can
be used in conjunction with on-target sequencing results to improve variant
detection
identification, as well as serve as an informative biomarker itself.
[0044] For example, in one aspect, the present disclosure
provides a method of estimating
a circulating tumor fraction for a test subject from panel-enriched sequencing
data for a
plurality of sequences, at a computer system having one or more processors,
and memory
storing one or more programs for execution by the one or more processors.
[0045] The method includes obtaining, from a first panel-enriched
sequencing reaction, a
first plurality of sequences. The plurality of sequences includes a
corresponding sequence for
each cell-free DNA fragment in a first plurality of cell-free DNA fragments
obtained from a
liquid biopsy sample from the test subject, wherein each respective cell-free
DNA fragment
in the first plurality of cell-free DNA fragments corresponds to a respective
probe sequence
in a plurality of probe sequences used to enrich cell-free DNA fragments in
the liquid biopsy
sample in the first panel-enriched sequencing reaction.
[0046] The first plurality of sequences also includes a
corresponding sequence for each
cell-free DNA fragment in a second plurality of cell-free DNA fragments
obtained from the
liquid biopsy sample, wherein each respective cell-free DNA fragment in the
second plurality
of DNA fragments does not correspond to any probe sequence in the plurality of
probe
sequences.
[0047] The method includes determining a plurality of bin-level
coverage ratios from the
plurality of sequences, each respective bin-level coverage ratio in the
plurality of bin-level
coverage ratios corresponding to a respective bin in a plurality of bins. Each
respective bin in
the plurality of bins represents a corresponding region of a human reference
genome. Each
respective bin-level sequence ratio in the plurality of bin-level sequence
ratios is determined
from a comparison of (i) a number of sequence reads in the plurality of
sequences that map to
the corresponding bin and (ii) a number of sequence reads from one or more
reference
samples that map to the corresponding bin.
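By way of non-limiting illustration only, a bin-level comparison of this kind can be sketched in Python as follows. The sketch assumes that the comparison is a log2 ratio of library-size-normalized read counts, with a small pseudocount to avoid division by zero. The function name, the normalization, the log2 transform, and the pseudocount are assumptions made for this example and are not taken from the disclosure.

```python
from math import log2


def bin_level_coverage_ratios(sample_counts, reference_counts, pseudocount=0.5):
    """Return one log2 coverage ratio per bin.

    sample_counts[i] and reference_counts[i] are the numbers of sequence
    reads that map to bin i in the test sample and in the reference
    sample(s), respectively.
    """
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)
    ratios = []
    for sample, reference in zip(sample_counts, reference_counts):
        # Normalize each count by total depth so that samples sequenced to
        # different depths remain comparable.
        sample_norm = (sample + pseudocount) / sample_total
        reference_norm = (reference + pseudocount) / reference_total
        ratios.append(log2(sample_norm / reference_norm))
    return ratios


# The third bin has relatively more coverage in the sample than the others.
example = bin_level_coverage_ratios([100, 100, 200, 100], [50, 50, 50, 50])
```

A positive value in this sketch indicates relatively more coverage in the test sample than in the reference for that bin, and a negative value indicates relatively less.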
[0048] The method further includes determining a plurality of
segment-level coverage
ratios by forming a plurality of segments by grouping respective subsets of
adjacent bins in
the plurality of bins based on a similarity between the respective coverage
ratios of the subset
of adjacent bins, and determining, for each respective segment in the
plurality of segments, a
segment-level coverage ratio based on the corresponding bin-level coverage
ratios for each
bin in the respective segment.
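By way of non-limiting illustration only, the grouping of adjacent bins can be sketched in Python as follows. The sketch uses a simple greedy rule that starts a new segment whenever a bin differs from the running mean of the current segment by more than a tolerance. This rule is a simplified stand-in chosen for brevity and is not circular binary segmentation. The function name and the tolerance value are assumptions made for this example.

```python
def segment_bins(bin_ratios, tolerance=0.2):
    """Group adjacent bins with similar coverage ratios into segments.

    Returns a list of (start, end, segment_ratio) tuples, where start and end
    are half-open bin indices and segment_ratio is the mean bin-level ratio.
    """
    segments = []
    start = 0
    total = 0.0
    for i, ratio in enumerate(bin_ratios):
        # Close the current segment when the next bin is too dissimilar.
        if i > start and abs(ratio - total / (i - start)) > tolerance:
            segments.append((start, i, total / (i - start)))
            start, total = i, 0.0
        total += ratio
    if bin_ratios:
        end = len(bin_ratios)
        segments.append((start, end, total / (end - start)))
    return segments


segments = segment_bins([0.0, 0.05, -0.05, 1.0, 1.1, 0.0])
```

In this example the six bins fall into three segments: a neutral run, a two-bin gain, and a final neutral bin.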
[0049] For each respective simulated circulating tumor fraction
in a plurality of simulated
circulating tumor fractions, the method includes fitting each respective
segment in the
plurality of segments to a respective integer copy state in a plurality of
integer copy states, by
identifying the respective integer copy state in the plurality of integer copy
states that best
matches the segment-level coverage ratio, thus generating, for each respective
simulated
circulating tumor fraction in the plurality of simulated tumor fractions, a
respective set of
integer copy states for the plurality of segments.
[0050] The method further includes determining the circulating
tumor fraction for the test
subject based on a comparison between the corresponding segment-level coverage
ratios and
integer copy states across the plurality of simulated circulated tumor
fractions. In some
embodiments, the comparison includes optimization of an error between
corresponding
segment-level coverage ratios and integer copy states across the plurality of
simulated
circulated tumor fractions. In some embodiments, the comparison includes
finding two or
more local optima for fit (e.g., local minima for an error between
corresponding segment-
level coverage ratios and integer copy states across the plurality of
simulated circulated tumor
fractions) and choosing the local optimum (e.g., minimum) that is most
consistent with one or
more alternative estimations of the tumor fraction.
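By way of non-limiting illustration only, the fitting and comparison described above can be sketched in Python as follows. The sketch assumes log2 segment-level coverage ratios, a diploid non-tumor background, a maximum copy state of 5, a grid of simulated tumor fractions from 0.01 to 0.99, and a sum-of-squares error. All function names and these modeling choices are assumptions made for this example and are not taken from the disclosure.

```python
from math import log2


def expected_log2_ratio(copy_state, tumor_fraction, normal_ploidy=2):
    """Expected log2 coverage ratio of a segment with the given tumor copy
    state when tumor DNA makes up tumor_fraction of the sample."""
    mixed = tumor_fraction * copy_state + (1 - tumor_fraction) * normal_ploidy
    return log2(mixed / normal_ploidy)


def fit_copy_states(segment_ratios, tumor_fraction, max_copy=5):
    """Assign each segment its best-matching integer copy state and return
    the states together with the summed squared error of the fit."""
    states, error = [], 0.0
    for observed in segment_ratios:
        best = min(
            range(max_copy + 1),
            key=lambda c: abs(observed - expected_log2_ratio(c, tumor_fraction)),
        )
        states.append(best)
        error += (observed - expected_log2_ratio(best, tumor_fraction)) ** 2
    return states, error


def estimate_tumor_fraction(segment_ratios, grid=None):
    """Return the simulated tumor fraction with the lowest fit error, plus
    the error at every simulated tumor fraction."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    errors = {f: fit_copy_states(segment_ratios, f)[1] for f in grid}
    return min(errors, key=errors.get), errors


# Simulated segments at copy states 2, 3, 1 and 4 with a tumor fraction of 0.3.
simulated = [expected_log2_ratio(c, 0.3) for c in (2, 3, 1, 4)]
estimate, error_curve = estimate_tumor_fraction(simulated)
```

This sketch simply takes the global minimum of the error curve. Where the curve has two or more local minima, the returned errors could instead be inspected and the minimum most consistent with an alternative tumor fraction estimate selected, as described above.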
[0051] Additional aspects and advantages of the present
disclosure will become readily
apparent to those skilled in this art from the following detailed description,
wherein only
illustrative embodiments of the present disclosure are shown and described. As
will be
realized, the present disclosure is capable of other and different
embodiments, and its several
details are capable of modifications in various obvious respects, all without
departing from
the disclosure. Accordingly, the drawings and description are to be regarded
as illustrative in
nature, and not as restrictive.
BRIEF DESCRIPTION OF THE DRAWINGS
[0052] Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3
collectively
illustrate a block diagram of an example computing device for supporting
clinical decisions
in precision oncology using liquid biopsy assays (e.g., by validating a copy
number variation,
validating a somatic sequence variant in a test subject having a cancer
condition, estimating
the circulating tumor fraction of a liquid biopsy sample based on on-target
and off-target
sequence reads from targeted-panel sequencing data etc.), in which dashed
boxes represent
optional portions of the method, in accordance with some embodiments of the
present
disclosure.
[0053] Figure 2A illustrates an example workflow for generating a
clinical report based
on information generated from analysis of one or more patient specimens, in
accordance with
some embodiments of the present disclosure.
[0054] Figure 2B illustrates an example of a distributed
diagnostic environment for
collecting and evaluating patient data for the purpose of precision oncology,
in accordance
with some embodiments of the present disclosure.
[0055] Figure 3 provides an example flow chart of processes and
features for liquid
biopsy sample collection and analysis for use in precision oncology, in which
dashed boxes
represent optional portions of the method, in accordance with some embodiments
of the
present disclosure.
[0056] Figures 4A, 4B, 4C, 4D, 4E, 4F1, 4F2, 4G1, 4G2, 4G3, and
4F3 collectively
illustrate an example bioinformatics pipeline for precision oncology. Figure
4A provides an
overview flow chart of processes and features in a bioinformatics pipeline, in
accordance
with some embodiments of the present disclosure. Figure 4B provides an
overview of a
bioinformatics pipeline executed with either a liquid biopsy sample alone or a
liquid biopsy
sample and a matched normal sample. Figure 4C illustrates that paired end
reads from tumor
and normal isolates are zipped and stored separately under the same order
identifier, in
accordance with some embodiments of the present disclosure. Figure 4D
illustrates quality
correction for FASTQ files, in accordance with some embodiments of the present
disclosure.
Figure 4E illustrates processes for obtaining tumor and normal BAM alignment
files, in
accordance with some embodiments of the present disclosure. Figure 4F1
provides a flow
chart of a method for validating a copy number variation, in which dashed
boxes represent
optional portions of the method, in accordance with some embodiments of the
present
disclosure. Figure 4F2 provides a flow chart of a method for validating a
somatic sequence
variant in a test subject having a cancer condition, in which dashed boxes
represent optional
portions of the method, in accordance with some embodiments of the present
disclosure.
Figures 4G1, 4G2, and 4G3 illustrate a method of variant detection, in which
dashed boxes
represent optional portions of the method, in accordance with some embodiments
of the
present disclosure. Figure 4F3 provides an overview of a method for estimating
the
circulating tumor fraction for a liquid biopsy sample, based on targeted panel
sequencing
data, in which dashed boxes represent optional portions of the method, in
accordance with
some embodiments of the present disclosure.
[0057] Figures 5A1, 5B1, 5C1, 5D1, and 5E1 collectively provide a
flow chart of
processes and features for validating a copy number variation in a test
subject, in which
dashed boxes represent optional portions of the method, in accordance with
some
embodiments of the present disclosure.
[0058] Figures 5A2 and 5B2 collectively provide a flow chart of
processes and features
for validating a somatic sequence variant in a test subject, in which dashed
boxes represent
optional portions of the method, in accordance with some embodiments of the
present
disclosure.
[0059] Figures 5A3 and 5B3 collectively provide a flow chart of
processes and features
for estimating the circulating tumor fraction of a liquid biopsy sample based
on on-target and
off-target sequence reads from targeted-panel sequencing data, in which
dashed boxes
represent optional portions of the method, in accordance with some embodiments
of the
present disclosure.
[0060] Figures 6A1, 6B1, and 6C1 collectively provide a flow
chart of processes and
features for treating a patient with a cancer containing a copy number
variation of a target
gene, in which dashed boxes represent optional portions of the method, in
accordance with
some embodiments of the present disclosure.
[0061] Figure 6A2 illustrates a flow chart of a method for
obtaining a distribution of
variant detection sensitivities as a function of circulating variant allele
fraction from a cohort
of subjects, in accordance with some embodiments of the present disclosure.
[0062] Figures 6A3, 6B3, and 6C3 collectively illustrate a
process for fitting segment-
level coverage ratios to an integer copy number (6A3 and 6B3) and subsequently
determining
the error associated with the fit (6C3) at a particular simulated circulating
tumor fraction, in
accordance with some embodiments of the present disclosure.
[0063] Figures 7A1 and 7B1 illustrate a non-focal amplified
segment and a focal
amplified segment comprising the MYC gene, in accordance with some embodiments
of the
present disclosure.
[0064] Figure 7C1 illustrates a focal deleted segment comprising
the BRCA2 gene, in
accordance with some embodiments of the present disclosure.
[0065] Figures 7A2 and 7B2 collectively illustrate a method of
inferring an effect of a
sequence variant as a gain-of-function or a loss-of-function of a gene, in
accordance with
some embodiments of the present disclosure.
[0066] Figure 7A3 illustrates an overview of an experimental and
analytical workflow
used for validation of the performance of a method for estimating the
circulating tumor
fraction of a liquid biopsy sample based on on-target and off-target sequence
reads from
targeted-panel sequencing data, in accordance with some embodiments of the
present
disclosure.
[0067] Figures 8A, 8B, 8C, and 8D collectively illustrate results
of an inter-assay
comparison between a liquid biopsy assay, a digital droplet polymerase chain
reaction
(ddPCR), and a solid-tumor biopsy assay, in accordance with various
embodiments of the
present disclosure.
[0068] Figures 9A, 9B, 9C, 9D, 9E, 9F, 9G, and 9H collectively
illustrate results of a
comparison between circulating tumor fraction estimate (ctFE) and variant
allele fraction
(VAF) using an Off-Target Tumor Estimation Routine (OTTER) method, in
accordance with
various embodiments of the present disclosure.
[0069] Figures 10A and 10B collectively illustrate results of
evaluating ctFE and
mutational landscape according to cancer type, in accordance with various
embodiments of
the present disclosure.
[0070] Figures 11A, 11B, and 11C collectively illustrate results
of evaluating associations
between ctFE and advanced disease states, in accordance with various
embodiments of the
present disclosure.
[0071] Figures 12A, 12B, and 12C collectively illustrate results
of comparing ctFE with
recent clinical response outcomes, in accordance with various embodiments of
the present
disclosure.
[0072] Figure 13 illustrates a first table describing sensitivity
for all SNVs, indels, CNVs,
and rearrangements targeted in reference samples, in accordance with various
embodiments
of the present disclosure.
[0073] Figure 14 illustrates a second table describing
sensitivity for all SNVs, indels,
CNVs, and rearrangements targeted in reference samples, in accordance with
various
embodiments of the present disclosure.
[0074] Figure 15 illustrates a third table describing comparisons
between the presently
disclosed liquid biopsy assay and a commercial liquid biopsy kit, in
accordance with various
embodiments of the present disclosure.
[0075] Figures 16A, 16B, and 16C collectively illustrate a fourth
table describing variants
detected by a liquid biopsy assay, in accordance with various embodiments of
the present
disclosure.
[0076] Figure 17 illustrates a fifth table describing dynamic
filtering methodology to
further reduce discordance, in accordance with various embodiments of the
present
disclosure.
[0077] Figure 18 illustrates a sixth table describing cancer
groups included in clinical
profiling analysis, in accordance with various embodiments of the present
disclosure.
[0078] Figure 19 illustrates an example plot of the errors
between corresponding
segment-level coverage ratios and integer copy states determined across a
plurality of
simulated circulated tumor fractions ranging from about 0 to about 1, in
accordance with
some embodiments of the disclosure.
[0079] Like reference numerals refer to corresponding parts
throughout the several views
of the drawings.
DETAILED DESCRIPTION
Introduction
[0080] As described above, conventional liquid biopsy assays do
not provide accurate
determination of copy number variations (CNVs) for actionable genomic targets,
particularly
focal amplifications. For example, some conventional methodologies determine
copy
number variations by mathematically converting copy ratios (e.g., of
experimental samples
compared to reference samples) to integer copy numbers based on tumor fraction
estimates
and known ploidy. These approaches have disadvantages due to the presence of
artifactual
variants and/or clonal heterogeneity in liquid biopsy samples, leading to
unreliable tumor
fraction estimates and, subsequently, unreliable copy number annotations.
Furthermore, the
identification of therapeutically actionable copy number variations is limited
when using
conventional methods because many large-scale (e.g., non-focal) copy number
variations
contain artifactual variants due to errors in normalization, poor sample
quality, and/or other
technical issues.
[0081] Thus, there is a need in the art for improved methods of
validating CNV calls in
order to distinguish between real and artifactual copy number variations.
Specifically, there
is a need in the art for a method of detecting focal copy number variations,
e.g., in order to
identify therapeutically actionable genomic alterations.
[0082] Advantageously, disclosed herein are methods and systems
that do provide
accurate determination of copy number variations by detecting actionable,
focal copy number
variations in circulating tumor DNA (ctDNA) with high confidence without the
need for
tumor fraction estimation. For example, in some embodiments, the methods and
systems
described herein utilize annotation and filtering that applies a statistical
method to bin-level
copy ratios, segment-level copy ratios and corresponding segment-level
confidence intervals
of binned and segmented sequence reads aligned to a reference genome. The
statistical
method filters out segments with non-focal copy number variations, which are
either non-
actionable, e.g., in the case of a copy number variation spanning a
significant portion of a
chromosome, or artifactual, e.g., due to incorrect data normalization.
[0083] As an example, Figure 4F1 illustrates a workflow of a
method 400-1 for validating
copy number variation, e.g., to identify therapeutically actionable genomic
alterations, in
accordance with some embodiments of the present disclosure.
[0084] In some embodiments, the methods described herein utilize
conventional
methodologies to putatively identify copy number variations, which are then
validated using
the methodologies described herein. For instance, in some embodiments, copy
number
variations (CNVs) are analyzed using a combination of an open-source tool,
e.g., CNVkit, to
putatively identify copy number variations, and a script, e.g., a Python
script, to validate or
reject the putative copy number variations, using the validation methodologies
described
herein. In other embodiments, the validation methodologies described herein
are used to
identify focal copy number variations independently of conventional
bioinformatics tools,
e.g., CNVkit.
[0085] As described herein, in some embodiments, the methods
described herein include
one or more data collection steps, in addition to data analysis and downstream
steps. For
example, as described below, e.g., with reference to Figures 2 and 3, in some
embodiments,
the methods include collection of a liquid biopsy sample and, optionally, one
or more
matching biological samples from the subject (e.g., a matched cancerous and/or
matched non-
cancerous sample from the subject). Likewise, as described below, e.g., with
reference to
Figures 2 and 3, in some embodiments, the methods include extraction of DNA
from the
liquid biopsy sample and, optionally, one or more matching biological samples
from the
subject (e.g., a matched cancerous and/or matched non-cancerous sample from
the subject).
Similarly, as described below, e.g., with reference to Figures 2 and 3, in
some embodiments,
the methods include nucleic acid sequencing of DNA from the liquid biopsy
sample and,
optionally, one or more matching biological samples from the subject (e.g., a
matched
cancerous and/or matched non-cancerous sample from the subject).
[0086] However, in other embodiments, the methods described
herein begin with
obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence
reads of DNA from
a liquid biopsy sample and, optionally, one or more matching biological
samples from the
subject (e.g., a matched cancerous and/or matched non-cancerous sample from
the subject),
from which the statistics needed for focal CNV validation (e.g., bin-level
sequence ratios,
segment-level sequence ratios, and segment-level measures of dispersion) can
be determined.
For example, in some embodiments, sequencing data 122 for a patient 121 is
accessed and/or
downloaded over network 105 by system 100.
[0087] Likewise, in some embodiments, the methods described
herein begin with
obtaining genomic bin values (e.g., bin counts or bin coverages) for a
sequencing of a liquid
biopsy sample and, optionally, one or more matching biological samples from
the subject
(e.g., a matched cancerous and/or matched non-cancerous sample from the
subject), from
which the statistics needed for focal CNV validation (e.g., bin-level sequence
ratios, segment-
level sequence ratios, and segment-level measures of dispersion) can be
determined. For
example, in some embodiments, genomic bin values 135-cf-bv for a patient 121
are accessed
and/or downloaded over network 105 by system 100.
[0088] Similarly, in some embodiments, the methods described
herein begin with
obtaining the statistics needed for focal CNV validation (e.g., bin-level
sequence ratios,
segment-level sequence ratios, and segment-level measures of dispersion) for a
sequencing of
a liquid biopsy sample and, optionally, one or more matching biological
samples from the
subject (e.g., a matched cancerous and/or matched non-cancerous sample from
the subject),
e.g., as an output of a conventional bioinformatics tool (such as CNVkit). For
example, in
some embodiments, bin-level sequence ratios 135-cf-br, segment-level sequence
ratios 135-cf-sr,
and segment-level measures of dispersion for a patient 121 are accessed
and/or
downloaded over network 105 by system 100.
[0089] Referring again to method 400-1 in Figure 4F1, in some
embodiments, the method
includes obtaining a dataset including cell-free DNA sequencing data (Block
402-1), and
determining the statistics needed for focal CNV validation (e.g., bin-level
sequence ratios,
segment-level sequence ratios, and segment-level measures of dispersion). For
instance, in
some embodiments, system 100 obtains sequencing data 122 (e.g., sequence reads
123 and/or
aligned sequences 124) and applies a copy number segmentation algorithm 153-b
(e.g.,
CNVkit) to the sequencing data.
[0090] For example, in some embodiments, sequence reads 123
obtained from the
sequencing dataset 122 are aligned to a reference human construct (Block 404-1),
generating
a plurality of aligned reads 124 (Block 406-1). Aligned cfDNA sequence reads
are then
optionally processed (e.g., using normalization, filtering, and/or quality
control) (Block 408-1).
[0091] A copy number segmentation algorithm 153-b is then used
for genomic region
binning, coverage calculation, bias correction, normalization to a reference
pool,
segmentation, and/or visualization (Block 410-1). For example, in some
embodiments,
aligned sequence reads are sorted into bins (e.g., on-target bins 153-b-1-a
and off-target bins
153-b-1-b) of pre-specified bin sizes (e.g., 100-150 base pairs) based on
their genomic
location using binning subroutine 153-b-1. For example, in some embodiments,
binning
subroutine 153-b-1 reads in mapped sequences 124 and pre-selected bins (e.g.,
target bins
153-b-1-a and off-target bins 153-b-1-b for target panel sequencing analysis)
and assigns
respective sequences to the bins based on their mapping within the reference
genome. Bin
values 135-bv (e.g., liquid biopsy genomic bin values 135-cf-bv) for each of
the bins, e.g.,
bin counts or bin coverages, can be read out from binning subroutine 153-b-1.
Bin values
135-bv are optionally pre-processed, e.g., normalized, standardized,
corrected, etc., as
described in further detail herein.
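By way of non-limiting illustration only, a binning step of this kind can be sketched in Python as follows. The sketch assumes that each aligned read is represented by a single mapped position and that bins are non-overlapping, half-open intervals labeled as on-target or off-target. The function name, the data layout, and the "on"/"off" labels are assumptions made for this example and are not taken from the disclosure.

```python
from bisect import bisect_right


def count_reads_per_bin(reads, bins):
    """Count reads per bin.

    reads: iterable of (chromosome, position) tuples.
    bins:  list of (chromosome, start, end, kind) tuples, where [start, end)
           is half-open and kind is a label such as "on" or "off".
    Returns a list of counts in the same order as bins.
    """
    # Index bins by chromosome and sort them by start coordinate.
    index = {}
    for i, (chrom, start, end, _kind) in enumerate(bins):
        index.setdefault(chrom, []).append((start, end, i))
    starts = {}
    for chrom, intervals in index.items():
        intervals.sort()
        starts[chrom] = [s for s, _e, _i in intervals]

    counts = [0] * len(bins)
    for chrom, position in reads:
        if chrom not in index:
            continue
        # Locate the last bin that starts at or before the read position.
        j = bisect_right(starts[chrom], position) - 1
        if j >= 0:
            start, end, i = index[chrom][j]
            if start <= position < end:
                counts[i] += 1
    return counts


example_bins = [("chr1", 0, 100, "on"), ("chr1", 100, 200, "off"),
                ("chr2", 0, 150, "on")]
example_reads = [("chr1", 10), ("chr1", 99), ("chr1", 100),
                 ("chr1", 250), ("chr2", 5), ("chr3", 1)]
bin_counts = count_reads_per_bin(example_reads, example_bins)
```

Reads that fall outside every bin, or on a chromosome with no bins, are ignored in this sketch.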
[0092] Bin values 135-bv are then used to determine bin-level
sequence ratios 135-br
(e.g., liquid biopsy bin-level sequence ratios 135-cf-br). Briefly, a copy
ratio subroutine 153-b-2
reads in bin values 135-bv and reference bin coverages 153-b-2-a
determined for one or
more reference samples (e.g., a matched non-cancerous sample of the subject or
an average
from a plurality of non-cancerous reference samples), and compares bin values
for
corresponding bins, thereby generating bin-level sequence ratios 135-br.
[0093] These bin-level sequence ratios 135-br are then used to
group adjacent bins,
having similar sequence ratios, into segments, e.g., using circular binary
segmentation. For
example, in some embodiments, segmentation subroutine 153-a-3 reads in and
applies a
segmentation model (e.g., a circular binary segmentation model) to bin-level
sequence ratios
135-br, thereby generating a plurality of genomic segments, each corresponding
to one or
more contiguous bins.
[0094] Segment-level sequence ratios 135-sr (e.g., liquid biopsy
segment-level sequence
ratios 135-cf-sr) and segment-level measures of dispersion 135-sd (e.g.,
liquid biopsy
segment-level measures of dispersion 135-cf-sd) can be determined using a
statistics
subroutine 153-a-4, which may be read out from the copy number segmentation
algorithm
153-b, as illustrated in Figure 1D1, or may be separately implemented, e.g.,
by reading-in
segment annotations (e.g., including bin assignments to each segment)
generated by the
segmentation subroutine 153-a-3 and bin-level sequence ratios 135-br from the
copy ratio
subroutine 153-b-2.
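By way of non-limiting illustration only, a separately implemented statistics step can be sketched in Python as follows. The sketch assumes that the segment-level sequence ratio is the mean of the bin-level ratios assigned to the segment and that the measure of dispersion is a percentile bootstrap confidence interval of that mean. The function name, the choice of the mean, the bootstrap procedure, and the default parameters are assumptions made for this example and are not taken from the disclosure.

```python
import random
from statistics import mean


def segment_statistics(bin_ratios, n_boot=200, alpha=0.05, seed=0):
    """Return (segment_ratio, (ci_lower, ci_upper)) for one segment.

    bin_ratios holds the bin-level sequence ratios of the bins assigned to
    the segment. The interval is a percentile bootstrap interval of the mean.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    segment_ratio = mean(bin_ratios)
    # Resample the bins with replacement and recompute the mean each time.
    boot_means = sorted(
        mean(rng.choices(bin_ratios, k=len(bin_ratios))) for _ in range(n_boot)
    )
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return segment_ratio, (lower, upper)


ratio, (ci_low, ci_high) = segment_statistics(
    [0.9, 1.0, 1.1, 1.05, 0.95, 1.0, 1.02, 0.98])
```

A narrow interval in this sketch indicates that the bins within the segment agree closely with one another, while a wide interval indicates a noisy segment.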
[0095] Optionally, a copy number annotation subroutine 153-a-5
reads in one or both
segment-level sequence ratios 135-sr (e.g., liquid biopsy segment-level
sequence ratios 135-
cf-sr) and segment-level measures of dispersion 135-sd, to provide copy number
status
annotations (e.g., amplified, neutral, or deleted) 135-cn (e.g., liquid biopsy
copy number
annotations 135-cf-cn) for one or more of the identified segments.
[0096] In some embodiments, the process above is also performed
for a matched tumor
tissue biopsy of the subject, e.g., thereby generating one or more tumor
segment copy number
annotations 135-t-cn.
[0097] The bin-level copy ratios, segment-level copy ratios and
the corresponding
segment-level confidence intervals statistics obtained from the copy number
segmentation
algorithm 153 (e.g., CNVkit) output are used as inputs for a focal
amplification / deletion
validation algorithm, to determine whether putative segment amplifications
and/or deletions
can be validated. The copy number segmentation algorithm 153 applies a
plurality of filters
to statistics for one or more identified segment (Block 412-1). In some
embodiments, these
filters include one or more of:
• a bin-level measure of central tendency sequence ratio filter 153-a-1, e.g., a median
bin-level copy ratio filter (Block 414-1);
• a segment-level measure of dispersion confidence filter 153-a-2, e.g., a segment-level
confidence interval filter (Block 416-1);
• a bin-level measure of central tendency plus deviation filter 153-a-3, e.g., a
median-plus-median absolute deviation (MAD) bin-level copy ratio filter (Block 418-1); and
• a segment-level sequence ratio filter 153-a-4, e.g., a segment-level copy ratio filter
(Block 419-1).
In some embodiments, the plurality of filters includes at least two of the
above filters. In
some embodiments, the plurality of filters includes at least three of the
above filters. In some
embodiments, the plurality of filters includes all four of the above filters.
[0098] The copy number status annotation (e.g., amplified,
neutral, deleted) for each
segment is validated if it passes, or rejected if it fails, the plurality of copy
number status
annotation validation filters (Block 420-1). Specifically, when a filter in
the plurality of
filters is fired, the copy number annotation of the segment is rejected, and
the copy number
variation is determined to be a non-focal copy number variation. When no
filter in the
plurality of filters is fired, the copy number annotation of the segment is
validated, and the
copy number variation is determined to be a focal copy number variation (Block
422-1).
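By way of non-limiting illustration only, the validate-or-reject logic described above can be sketched in Python for a putative amplification as follows. The sketch fires a filter whenever its statistic falls short of a cutoff and validates the annotation only when no filter fires. The function names, the cutoff values, the use of a background set of bins for the median-plus-MAD comparison, and the exact firing conditions are assumptions made for this example and are not taken from the disclosure.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def validate_amplification(segment_bin_ratios, background_bin_ratios,
                           segment_ratio, ci_lower,
                           min_median=0.5, min_ci_lower=0.3,
                           min_segment_ratio=0.5):
    """Return ("focal" or "non-focal", list of fired filter names)."""
    fired = []
    segment_median = median(segment_bin_ratios)
    # Median bin-level copy ratio filter.
    if segment_median < min_median:
        fired.append("median_bin_ratio")
    # Segment-level confidence interval filter.
    if ci_lower < min_ci_lower:
        fired.append("segment_confidence_interval")
    # Median-plus-MAD filter: the segment must stand out from its background.
    background_cutoff = (median(background_bin_ratios)
                         + median_absolute_deviation(background_bin_ratios))
    if segment_median <= background_cutoff:
        fired.append("median_plus_mad")
    # Segment-level copy ratio filter.
    if segment_ratio < min_segment_ratio:
        fired.append("segment_ratio")
    return ("non-focal" if fired else "focal"), fired


# A gain that stands out from a neutral background passes every filter.
focal_call = validate_amplification(
    [1.0, 1.1, 0.9], [0.0, 0.1, -0.1, 0.05, -0.05], 1.0, 0.8)
# The same gain on an equally elevated background is rejected as non-focal.
broad_call = validate_amplification(
    [1.0, 1.1, 0.9], [0.9, 1.0, 1.1, 1.0, 0.95], 1.0, 0.8)
```

A corresponding check for putative deletions could mirror these comparisons with the inequalities reversed.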
[0099] Validated copy number variations (e.g., focal
amplifications and/or focal deletions
of target genes) can then be used for variant analysis and clinical report
generation. For
example, focal copy number variations can be matched to the appropriate
therapies and/or
clinical trials (Block 426-1). A patient report indicating the validated copy
number variations
and any matched therapies and/or clinical trials can then be generated for use
in precision
oncology applications (Block 426-1).
[0100] Additional embodiments of the presently disclosed systems
and methods are
described in further detail below with reference to Figures 2A and 4F1 (see,
Example
Workflow for Precision Oncology: Copy Number Variation Analysis) and Example 2 -
Identification of Focal Copy Number Variation (see, Examples).
[0101] Copy number variations are considered a biomarker for
cancer diagnosis and
certain copy number variations are targets of treatment. For example, a subset
of copy
number variations that can be investigated using the methods disclosed herein
include
amplifications in MET, EGFR, ERBB2, CD274, CCNE1, and MYC, and deletions in
BRCA1 and BRCA2. However, the analysis is not limited to these reportable
genes. The
method utilizes bin-level copy ratios, in addition to segment-level copy
ratios, to validate the
copy number variations of target genomic segments, thus allowing a highly
sensitive
characterization of local (both internal and external) changes in copy number
to detect true
copy number variations with greater accuracy. The presently disclosed systems
and methods
enable an automatic and reliable way to detect actionable, focal copy number
variations via a
liquid biopsy assay that is not achieved by conventional methods and is
considerably less
invasive than a tissue biopsy. The combination of liquid biopsy and copy
number variation
detection benefits physicians, clinicians, and medical institutions by
providing a powerful
tool for diagnosing cancer conditions and administering treatments.
Furthermore, the
methods disclosed herein can be performed alone or alongside traditional solid
tumor biopsy
methods as a validation method for detecting copy number variations.
[0102] Specifically, the annotation and filtering algorithm can
be used to distinguish
between actionable and non-actionable copy number variations of target
biomarkers that are
informative for precision oncology. For example, as reported in Example 2
(Identification of
Focal Copy Number Variation; see Examples, below), when applied to two
experimental
samples both containing a conventionally obtained amplification status for the
MYC gene,
the method rejected the amplification in a first sample as a non-focal
amplification, and
validated the amplification in a second sample as a focal, and likely
actionable, amplification.
[0103] The identification of actionable genomic alterations in a
patient's cancer genome
is a difficult and computationally demanding problem. For instance, the
determination of
various prognostic metrics useful for precision oncology, such as variant
allelic ratio, copy
number variation, tumor mutational burden, microsatellite instability status,
etc., requires
analysis of hundreds of millions to billions of sequenced nucleic acid bases.
An example of
a typical bioinformatics pipeline established for this purpose includes at
least five stages of
analysis: assessment of the quality of raw next generation sequencing data,
generation of
collapsed nucleic acid fragment sequences and alignment of such sequences to a
reference
genome, detection of structural variants in the aligned sequence data,
annotation of identified
variants, and visualization of the data. See, Wadapurkar and Vyas, Informatics
in Medicine
Unlocked, 11:75-82 (2018), the content of which is hereby incorporated by
reference, in its
entirety, for all purposes. Each one of these procedures is computationally
taxing in its own
right.
[0104] For instance, the overall temporal and spatial computational
complexities of simple
global and local pairwise sequence alignment algorithms are quadratic in
nature (e.g., second-order
problems) and increase rapidly as a function of the size of the
nucleic acid sequences
(n and m) being compared. Specifically, the temporal and spatial complexities
of these
sequence alignment algorithms can be estimated as O(mn), where O is the upper
bound on
the asymptotic growth rate of the algorithm, n is the number of bases in the
first nucleic acid
sequence, and m is the number of bases in the second nucleic acid sequence.
See, Baichoo
and Ouzounis, BioSystems, 156-157:72-85 (2017), the content of which is hereby
incorporated by reference, in its entirety, for all purposes. Given that the
human genome
contains more than 3 billion bases, these alignment algorithms are extremely
computationally
taxing, especially when used to analyze next generation sequencing (NGS) data,
which can
generate more than 3 billion sequence reads per reaction.
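By way of a non-limiting illustration of the O(mn) behavior noted above, the following minimal sketch fills the dynamic programming table of a simple global pairwise alignment. The function name, the scoring values, and the returned cell count are assumptions introduced for illustration only and are not part of the disclosed methods.

```python
from __future__ import annotations


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> tuple[int, int]:
    """Return (optimal global alignment score, number of inner cells filled).

    Hypothetical helper: a textbook dynamic programming recurrence with
    illustrative scoring values.
    """
    n, m = len(a), len(b)
    # The (n + 1) x (m + 1) table is why memory grows with n * m.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap

    cells = 0
    # The nested loops are why run time grows with n * m.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            cells += 1
    return table[n][m], cells
```

In this sketch the number of filled cells equals the product of the two sequence lengths, so doubling both lengths quadruples the work.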
[0105] This is particularly true when performed in the context of
a liquid biopsy assay,
because liquid biopsy samples contain a complex mixture of short DNA fragments
originating from many different germline (e.g., healthy) and diseased (e.g.,
cancerous)
tissues. Thus, the cellular origins of the sequence reads are unknown, and the
sequence
signals originating from cancerous cells, which may constitute multiple sub-
clonal
populations, must be computationally deconvoluted from signals originating
from germline
and hematopoietic origins, in order to provide relevant information about the
subject's
cancer. Thus, in addition to the computationally taxing processes required to
align sequence
reads to a human genome, there is a computation problem of determining whether
a particular
abnormal signal, e.g., one or more sequence reads corresponding to a genomic
alteration, (i)
is not an artifact, and (ii) originated from a cancerous source in the
subject. This is
increasingly difficult during the early stages of cancer (when treatment is
presumably most
effective), when only small amounts of ctDNA are diluted by germline and
hematopoietic
DNA.
[0106] In addition to the computationally demanding problem of
aligning sequencing
data to a human reference genome, the method comprises dividing the plurality
of aligned
sequence reads into "bins" (e.g., regions of a predefined span of base pairs
corresponding to a
reference genome), determining the copy ratio of each bin by calculating the
differential read
depths between experimental and reference samples, and grouping subsets of
adjacent bins
with shared copy ratios into segments. Grouping bins into segments divides
each
chromosome into regions of equal copy number that minimizes noise in the data.
Such
methods essentially perform a change-point or edge detection algorithm, which
are either
temporally limited or computationally intense. For example, in some
embodiments, the
segmentation is performed using circular binary segmentation. Circular binary
segmentation
calculates a statistic for each genomic position, where the statistic
comprises a likelihood
ratio for the null hypothesis (no change in copy ratio at the respective
position) against the
alternative (one change in copy ratio at the respective position), and where
the null
hypothesis is rejected if the statistic is greater than a predefined
distribution threshold.
Notably, in circular binary segmentation, the chromosome is assumed to be
circularized, such
that the calculation is performed recursively for each position (e.g., each
bin) around the
circumference of the circle to identify all change-points across the length of
the chromosome.
Furthermore, for each position (e.g., bin) under investigation, a reference
distribution is
generated using a permutation approach, where the copy ratios for the
plurality of bins are
randomized (typically 10,000 times). For some embodiments that utilize bins of
approximately 100-150 bases long spanning a human reference genome of several
billion
bases, the number of permutations required to perform this recursive method
contributes to a
computationally intense procedure. See, for example, Olshen et al.,
Biostatistics 5, 4, 557-
572 (2004), doi:10.1093/biostatistics/kxh008, which is hereby incorporated
herein by
reference in its entirety.
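The binning, copy ratio, and permutation steps described above can be sketched as follows. This is a deliberately simplified, non-circular, single change-point test and is not the circular binary segmentation of Olshen et al.; the function names, the pseudocount, and the scaled difference-in-means statistic are assumptions chosen only to illustrate why a permutation-derived reference distribution is computationally costly.

```python
import math
import random
from statistics import mean


def log2_copy_ratios(test_depths, ref_depths, pseudocount=1.0):
    """Per-bin log2 ratio of test to reference read depth (hypothetical helper)."""
    return [math.log2((t + pseudocount) / (r + pseudocount))
            for t, r in zip(test_depths, ref_depths)]


def best_split(ratios):
    """Return (index, statistic) of the single split that best separates the bins.

    The statistic is the absolute difference in means on either side of the
    split, scaled by the segment sizes (an illustrative choice).
    """
    n = len(ratios)
    best_index, best_stat = None, 0.0
    for i in range(1, n):
        left, right = ratios[:i], ratios[i:]
        stat = abs(mean(left) - mean(right)) * math.sqrt(i * (n - i) / n)
        if stat > best_stat:
            best_index, best_stat = i, stat
    return best_index, best_stat


def permutation_p_value(ratios, n_perm=1000, seed=0):
    """Compare the observed statistic to a reference built by shuffling bins."""
    rng = random.Random(seed)
    _, observed = best_split(ratios)
    shuffled = list(ratios)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_split(shuffled)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Each permutation repeats the full scan over candidate split points, so the cost scales with the number of permutations multiplied by the cost of one scan, which is the burden the paragraph above describes.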
[0107] Advantageously, the present disclosure provides various systems and
methods that
improve the computational elucidation of actionable genomic alterations from a
liquid biopsy
sample of a cancer patient. Specifically, the present disclosure improves a
computer-
implemented method for identifying focal copy number variations by validating
copy number
status annotations assigned to genomic segments. As a further example, the
application of
the plurality of filters to the bin-level copy ratios, segment-level copy
ratios, and
corresponding segment-level confidence intervals is iterated, on a computer
system, over
each segment in the plurality of segments, and in some embodiments requires
calculations
using the copy ratios of each bin in the plurality of bins for each
chromosome, for each
segment in the plurality of segments. Taken together, the methods disclosed
herein are a
computational process designed to solve a computational problem.
[0108] Advantageously, the methods and systems described herein
provide an
improvement to the abovementioned technical problem (e.g., performing complex
computer-
implemented methods for analyzing a plurality of sequence reads for detection
and validation
of copy number variations in human genetic targets). The methods described
herein therefore
solve a problem in the computing art by improving upon conventional methods
for
identifying copy number variations for cancer diagnosis and treatment. For
example, the
application of a plurality of filters to the bin-level copy ratios, segment-
level copy ratios, and
corresponding segment-level confidence intervals provides a means for
detecting true copy
number variations for clinically relevant biomarkers and filtering out
artifactual variations
that are not therapeutically actionable, thus improving the accuracy and
precision of genomic
alteration detection in precision oncology.
[0109] The methods and systems described herein also improve
precision oncology
methods for assigning and/or administering treatment because of the improved
accuracy of
copy number variation detection. The identification of therapeutically
actionable, focal copy
number variations that can be included in a clinical report for patient and/or
clinician review,
and/or matched with appropriate therapies and/or clinical trials for treatment
and/or
monitoring, allows for more accurate assignment of treatments. Furthermore,
the removal of
non-therapeutically actionable, non-focal copy number variations reduces the
risk of patients
undergoing unnecessary or potentially harmful regimens due to misdiagnoses.
[0110] As described above, conventional liquid biopsy assays also
do not provide
accurate determination of variants (e.g., somatic variants), particularly at
low circulating
variant fractions. This is due, in large part, to the use of static variant
count filters that
require a common amount of support to call a variant positively as a somatic
variant in
sequencing data, regardless of the identity of the variant and its position
within the genome.
That is, conventional methods require that at least X number of unique
sequence reads (e.g., 8
sequence reads) provide support for (e.g., encompass) a particular variant in
order for that
variant to be confirmed as a true somatic variant. While this may be fine for
liquid biopsy
samples having a high tumor fraction, where more copies of each somatic
variant would be
expected to be found, it results in a high number of false negatives when
samples with lower
tumor fractions are analyzed. On the other hand, simply lowering the threshold
to allow
calling of variants with lower support for a particular variant will increase
the number of false
positives, that is the number of untrue positive somatic variant calls, which
are actually
sequencing errors.
[0111] While there are many methods of performing noise
suppression on ultra-high
depth sequencing data commonly generated for liquid biopsy assays, there
remains the
fundamental fidelity boundary of sequencing by synthesis that cannot be
overcome. Along
with this, there are a variety of complexities and non-linearities within the
ability to map
reads across complex sets of genomic features and from these data,
successfully call a
variant. While it is possible to filter very stringently, one of the goals of
liquid biopsy assays
is to detect alterations at very low circulating fractions. This requires that
low levels of
support be sufficient to make a positive alteration call given that at 0.1%
circulating fraction
and an average depth of 5000x, only 5 reads containing alternate alleles will
be present.
Because of this, it is impossible to have a consistent set of thresholds that
will be used to
filter variants as any filter will either be too stringent or too permissive
depending on the
variant context and local sequence specific error generation models.
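The arithmetic in the preceding paragraph can be checked with a short sketch. It assumes, purely for illustration, that the number of reads carrying the alternate allele follows a binomial distribution with the sequencing depth as the number of trials and the circulating fraction as the per-read probability; the function names are hypothetical.

```python
from math import comb


def expected_alt_reads(depth, fraction):
    """Expected number of reads carrying the alternate allele."""
    return depth * fraction


def prob_at_least(k, depth, fraction):
    """P(X >= k) for X ~ Binomial(depth, fraction), under the stated assumption."""
    below = sum(comb(depth, i) * fraction ** i * (1 - fraction) ** (depth - i)
                for i in range(k))
    return 1.0 - below
```

For example, expected_alt_reads(5000, 0.001) corresponds to the 5 reads stated above, and prob_at_least(8, 5000, 0.001) estimates how often such a variant would clear a static 8-read filter under this simple model, which is a small minority of the time.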
[0112] Advantageously, the present disclosure provides methods
and systems that more
accurately call somatic variants by adjusting the variant count threshold in a
locus-by-locus
fashion, e.g., by lowering the variant count threshold when there is an
increased likelihood
(orthogonal to the variant count in the sequencing reaction) that a variant at
a particular locus
is a true somatic variant and/or by raising the variant count threshold when
there is an
increased likelihood (orthogonal to the variant count in the sequencing
reaction) that a variant
at a particular locus is a result of a sequencing error, rather than a true
somatic variant.
[0113] For example, in some embodiments, the methods and systems
described herein
employ a generalized application of Bayes' Theorem through the likelihood
ratio test that
allows dynamic calibration of filtering threshold for diagnostic assays. These
thresholds are
based on one or more of a sample-specific error rate, a methodology-specific
sequencing
error rate (e.g., from a pool of process matched healthy control samples), an
estimate of the
variant allele fraction for the variant being evaluated, and a historical
likelihood that a variant
would be present at a particular locus in a particular cancer (e.g., derived
from an extensive
cohort of human solid tumor tissue samples to inform probability models). This
results in
high sensitivity and specificity in variant detection, allowing identification
of actionable
oncologic targets, as well as determination of a precise limit of detection to
reduce the
occurrence of false negatives.
[0114] For instance, in some embodiments, the dynamic variant
filtering methodology
described herein uses an application of Bayes' theorem to dynamically tune a
variant count
threshold for calling a somatic variant at a particular genomic region based
on the prevalence
of similar mutations within that genomic region in similar cancers. For
instance, where
there is a high prevalence of a somatic variant in a given gene for a
particular cancer, (e.g.,
BRCA1 mutations are common in breast cancers), the dynamic filtering method
accounts for
this prior (e.g., the prior knowledge that BRCA mutations are commonly found
in breast
cancers) by setting a lower variant count threshold to call somatic variants
in the BRCA1
gene for a breast cancer. That is, the dynamic filtering methodology requires
less evidence in
order to call a variant in the BRCA1 gene when the subject has breast cancer
than when the
subject has a different cancer that is not associated with a high prevalence
of BRCA1
mutations.
[0115] In some embodiments, the dynamic variant filtering
methodology described herein
uses an application of Bayes' theorem to dynamically tune a variant count
threshold for
calling a somatic variant based on an estimated variant allele fraction for
the variant being
evaluated. That is, the dynamic filtering methodology takes into account the
fact that in a
sample having a lower tumor fraction, and therefore a lower variant allele
fraction, fewer
sequences encompassing a somatic variant would be expected than in a
sample
having a higher tumor fraction, and therefore a higher variant allele
fraction. Accordingly,
the sensitivity and specificity of the dynamic filter are tuned to account for
the expectation
that a higher percentage of variant sequences with low sequence counts (e.g.,
lower support)
represent true somatic variants in a sample with a low tumor fraction than in
a sample with a
high tumor fraction, for which a higher percentage of variant sequences with
low sequence
counts represent sequencing errors.
[0116] In some embodiments, the dynamic variant filtering
methodology described herein
uses an application of Bayes' theorem to dynamically tune a variant count
threshold for
calling a somatic variant at a particular genomic locus based on a historical
sequencing error
rate for the locus. That is, the dynamic filtering methodology takes into
account the fact that
at genomic loci that are more prone to sequencing errors, such as loci with
short nucleotide
repeat sequences (e.g., di-nucleotide or tri-nucleotide repeats), there is a
higher likelihood
that a particular variant is a product of a sequencing error, rather than a
true somatic
mutation, than at a locus that is not prone to sequencing errors.
[0117] Similarly, in some embodiments, the dynamic variant
filtering methodology
described herein uses an application of Bayes' theorem to dynamically tune a
variant count
threshold for calling a somatic variant at a particular genomic locus based on
a reaction-
specific sequencing error rate. That is, the dynamic filtering methodology
takes into account
the fact that in reactions with higher sequencing error rates there is a higher
likelihood that a
particular variant is a product of a sequencing error, rather than a true
somatic mutation.
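The locus-by-locus tuning described in the preceding paragraphs can be sketched as a posterior-odds calculation. This is an illustrative model under stated assumptions, not the disclosed implementation: alternate-allele counts are treated as binomial under both hypotheses (true variant at an estimated variant allele fraction versus sequencing error at a given error rate), the prior is a single probability for the locus and cancer type, and the 0.99 posterior cutoff and all function names are hypothetical. The sketch further assumes that the variant allele fraction exceeds the error rate.

```python
import math


def log_likelihood_ratio(k, depth, vaf, error_rate):
    """log[P(k alt reads | true variant) / P(k alt reads | sequencing error)].

    Binomial coefficients cancel in the ratio, so only the probability
    terms remain.
    """
    return (k * math.log(vaf / error_rate)
            + (depth - k) * math.log((1 - vaf) / (1 - error_rate)))


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def posterior_probability(k, depth, vaf, error_rate, prior):
    """Posterior probability of a true variant given k supporting reads."""
    log_prior_odds = math.log(prior / (1 - prior))
    return _sigmoid(log_prior_odds + log_likelihood_ratio(k, depth, vaf, error_rate))


def min_supporting_reads(depth, vaf, error_rate, prior, min_posterior=0.99):
    """Smallest read count whose posterior meets the (hypothetical) cutoff."""
    for k in range(depth + 1):
        if posterior_probability(k, depth, vaf, error_rate, prior) >= min_posterior:
            return k
    return None
```

In this toy model, raising the prior (e.g., a gene frequently mutated in the subject's cancer type) lowers the number of supporting reads required, while raising the locus-specific or reaction-specific error rate increases it, mirroring the behavior described above.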
[0118] The present disclosure provides improved systems and
methods for precision
oncology based on improved variant calling in liquid biopsy data. The various
improvements
described herein, e.g., improved variant detection at low circulating
fractions, are embodied
in an example liquid biopsy workflow described in Examples 2 and 3. These
examples
describe an example liquid biopsy assay employing a 105-gene hybrid-capture
next-
generation sequencing (NGS) panel spanning 270 kb of the human genome,
configured to
detect targets in four variant classes, including single nucleotide variants
(SNVs), insertions
and/or deletions (indels), copy number variants (CNVs), and gene
rearrangements. To
establish robust clinical performance, extensive validation studies were
conducted that
demonstrated high sensitivity and specificity. Accordingly, the example liquid
biopsy assay
detected actionable variants with high accuracy in comparison to a commercial
ctDNA NGS
kit, commercial solid tumor biopsy-based assays, such as a solid tumor biopsy
NGS tissue
assay, and digital droplet PCR (ddPCR). As shown in the results of Figure 17,
the methods
and systems disclosed herein reduced false positive variant calling by 11.45%
compared to
conventional variant detection methods.
[0119] As described in detail above, the identification of
actionable genomic alterations
in a patient's cancer genome is a difficult and computationally demanding
problem.
[0120] Advantageously, the present disclosure provides various
systems and methods that
improve the computational elucidation of actionable genomic alterations from a
liquid biopsy
sample of a cancer patient. Specifically, the present disclosure improves a
method for
identifying variants in ctDNA using a dynamic thresholding approach. As
described above,
the disclosed methods and systems are necessarily computer-implemented due to
their
complexity and heavy computational requirements, and thus solve a problem in
the
computing art.
[0121] Advantageously, the methods and systems described herein
provide an
improvement to the abovementioned technical problem (e.g., performing complex
computer-
implemented methods for identifying variants in ctDNA using a dynamic
thresholding
approach). The methods described herein therefore solve a problem in the
computing art by
improving upon conventional methods for identifying variants (e.g., actionable
oncologic
targets) for cancer diagnosis and treatment. For example, the application of
Bayes' Theorem
through the likelihood ratio test provides a means for improving detection of
true positive
variants and reducing detection of false positive variants for clinically
relevant biomarkers,
thus improving the accuracy and precision of genomic alteration detection in
precision
oncology.
[0122] The methods and systems described herein also improve
precision oncology
methods for assigning and/or administering treatment because of the improved
accuracy of
variation detection. The identification of therapeutically actionable variants
that can be
included in a clinical report for patient and/or clinician review, and/or
matched with
appropriate therapies and/or clinical trials for treatment and/or monitoring,
allows for more
accurate assignment of treatments. Furthermore, the removal of false positive
variant
detection reduces the risk of patients undergoing unnecessary or potentially
harmful regimens
due to misdiagnoses.
[0123] Additionally, as described above, conventional liquid
biopsy assays do not
provide accurate determination of circulating tumor fraction estimates
(ctFEs). For example,
while low-pass, whole-genome sequencing can be used to estimate tumor
fractions, somatic
variant sequences are poorly identified from low-pass, whole genome sequencing
data,
particularly from samples having low tumor fractions. Accordingly,
conventional liquid
biopsy assays typically use targeted-panel sequencing in order to achieve
higher sequence
coverage required to identify somatic variants present at low levels within
the sample.
However, targeted-panel sequencing data does not span a large enough portion
of the genome
to accurately estimate tumor fraction. Rather, tumor fraction estimates
obtained using variant
allele fractions (VAFs) in targeted-panel sequencing data are noisy, due to
variant tissue
source and capture bias.
[0124] Advantageously, the present disclosure provides methods
and systems that do
provide accurate determination of circulating tumor fraction estimates by
using on-target and
off-target sequence reads from targeted-panel sequencing data. For example, in
some
embodiments, the methods and systems described herein fit experimental
coverage ratios for
segmented sequence reads across the genome to integer copy numbers across a
range of
simulated tumor fractions. These fitted copy numbers can then be used to
determine the
expected coverage ratio for the segment, at the given simulated tumor
fraction. The
aggregate difference between the experimental coverage ratios for all segments
and the
expected coverage ratios based on the fitted copy number at the given
simulated tumor
fraction is used as a measure of the accuracy of the fit. That is, where the
experimental
coverage ratios closely match the expected coverage ratios, the simulated
tumor fraction is a
good estimate of the actual tumor fraction of the sample. Likewise, where the
experimental
coverage ratios do not closely match the expected coverage ratios, the
simulated tumor
fraction is a poor estimate of the actual tumor fraction of the sample.
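The fitting procedure described in the preceding paragraph can be sketched as a grid search over simulated tumor fractions. The mixing formula for the expected coverage ratio, the diploid baseline, the maximum copy number, the absolute-difference error, and all function names are assumptions made for illustration and are not taken from the disclosure.

```python
def expected_ratio(copy_number, tumor_fraction, ploidy=2):
    """Expected coverage ratio for a segment with the given tumor copy number.

    Assumes a simple mixture of tumor DNA and diploid background DNA.
    """
    return (tumor_fraction * copy_number + (1 - tumor_fraction) * ploidy) / ploidy


def fit_error(observed_ratios, tumor_fraction, max_copy=8):
    """Aggregate distance between observed ratios and best-fitting integer copies."""
    total = 0.0
    for observed in observed_ratios:
        total += min(abs(observed - expected_ratio(cn, tumor_fraction))
                     for cn in range(max_copy + 1))
    return total


def estimate_tumor_fraction(observed_ratios, grid=None):
    """Return the simulated tumor fraction with the smallest aggregate error."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda tf: fit_error(observed_ratios, tf))
```

As a usage example, segment ratios generated with expected_ratio at a tumor fraction of 0.3 for copy numbers 2, 3, 1, 4, and 0 are recovered by estimate_tumor_fraction. This toy places no penalty on implausible copy states, so other inputs can be fit equally well by more than one tumor fraction; a practical fit would likely need to address that ambiguity.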
[0125] By using on-target and off-target sequence reads, the
systems and methods
described herein leverage data collected across a majority of the human
genome, which
allows for more accurate estimation of circulating tumor fraction than data
that is limited to
on-target probe regions. Advantageously, this method allows for both accurate
tumor
fraction estimation and robust variant identification from a single, low-cost
sequencing
reaction. Previously, in order to generate suitable data for both accurate
tumor fraction
estimation and robust variant identification, two sequencing reactions would need
to be
performed: a low-pass whole genome sequencing reaction to generate data across
the genome
for estimating circulating tumor fraction and a targeted-panel sequencing
reaction to generate
sufficiently deep sequencing data to identify variants.
[0126] Accordingly, the systems and methods described herein can
be used in
conjunction with variant detection methods that rely on targeted panel
sequencing, such as
high-depth sequencing reactions. By ensuring uniform distribution of sequence
reads across
a genome (e.g., by a process of binning sequencing reads and correcting bins
for size, GC
content, sequencing depth, etc.), the systems and methods described herein
ensure that any
variation detected in regions of the genome are representative of the
reference genome. This
approach reduces noise resulting from capture bias, which can result in
unreliable circulating
tumor fraction estimates.
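One way the bin correction mentioned above could look is sketched below. The stratified-median approach, the stratum width, and the function name are assumptions chosen for illustration; the disclosure does not specify this particular procedure. The sketch assumes nonzero bin sizes and nonzero median depths.

```python
from collections import defaultdict
from statistics import median


def normalize_bins(read_counts, bin_sizes, gc_fractions, gc_step=0.05):
    """Correct per-bin read counts for bin size, overall depth, and GC content."""
    # Bin size: convert counts to depth per base.
    per_base = [count / size for count, size in zip(read_counts, bin_sizes)]
    # Sequencing depth: scale by the sample-wide median.
    overall = median(per_base)
    scaled = [depth / overall for depth in per_base]
    # GC content: divide each bin by the median of bins with similar GC.
    strata = defaultdict(list)
    for value, gc in zip(scaled, gc_fractions):
        strata[int(gc / gc_step)].append(value)
    stratum_median = {key: median(values) for key, values in strata.items()}
    return [value / stratum_median[int(gc / gc_step)]
            for value, gc in zip(scaled, gc_fractions)]
```

In this sketch, bins that differ only because of their GC content end up with the same corrected value, so the remaining variation is more likely to reflect copy number than capture bias.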
[0127] By using a maximum likelihood estimation (e.g., an
expectation-maximization
algorithm) to fit on-target and off-target sequence reads to genomic
variations (e.g., integer
copy states), the systems and methods described herein further improve the
accuracy and
reliability of circulating tumor fraction estimates. For example, in some
embodiments, the
sequencing coverage of on-target and off-target sequence reads are used to
determine a test
coverage ratio for regions of the genome in a test liquid biopsy sample. The
test coverage
ratio is compared to a set of expected coverage ratios obtained using
assumptions for
expected copy states and expected tumor fractions, which gives a distance
(e.g., an error) of
the test coverage ratio from the expected copy state. Using this model, by
minimizing the
distance (e.g., the error) between test parameters and expected parameters, it
is possible to
estimate the test tumor fraction with high confidence.
[0128] An improved method for obtaining accurate circulating
tumor fraction estimates
provides several benefits to liquid biopsies. Advantageously, more reliable
ctFEs improve
the classification accuracy of detected variants as somatic or germline
variants (e.g., any
variant detected at or below the ctFE can be classified as a somatic variant
with high
confidence). In addition, accurate ctFEs can greatly improve the sensitivity
of detection of
clinically relevant copy number variations, including integer copy number
calling.
Furthermore, in some embodiments, ctFEs are used as biomarkers for tumor
burden,
metastases, disease progression, or treatment resistance. For example, ctFEs
have been
shown to correlate with tumor volumes and vary in response to treatment.
[0129] As a result, the methods and systems disclosed herein
provide a sensitive, cost-
effective, and minimally invasive method to monitor patients for response to
therapy, disease
burden, relapse, progression, and/or emerging resistance mutations, which can
translate into
better care for patients. When used as part of the course of care, serial ctFE
monitoring can
predict objective measures of progression in at-risk individuals. Due to cost
and convenience
of sampling, the methods and systems disclosed herein can be applied at
shorter time
intervals than radiographic methods and can allow for more timely intervention
in the case of
disease progression.
[0130] Additionally, the methods and systems disclosed herein
provide benefits to
clinicians by generating more accurate variant calls and/or informative ctFE
biomarkers that
can aid in the prediction of clinical outcomes in patients and/or the
selection of appropriate
treatment plans.
[0131] Specifically, a validation of the performance of a method
for on-target and off-
target tumor estimation, in accordance with some embodiments of the present
disclosure,
revealed a correlation between ctFEs and metastases and disease progression.
For example,
as reported in Examples 2 and 3, when the method is applied to matched, de-
identified
clinical data for a cohort of 1,000 patients, high ctFEs were found to (i)
correlate well with
estimates derived from low-pass, whole genome sequencing, (ii) be a highly
specific
predictor of metastases, (iii) be positively correlated with reported
"progressive disease" and
(iv) be negatively correlated with better clinical outcomes. Figure 7A3
provides an overview
of an experimental and analytical workflow used for validation of the off-
target tumor
estimation routine (OTTER).
[0132] As described in detail above, the identification of
actionable genomic alterations
in a patient's cancer genome is a difficult and computationally demanding
problem.
[0133] Advantageously, the present disclosure provides various
systems and methods that
improve the computational elucidation of actionable genomic alterations from a
liquid biopsy
sample of a cancer patient. Specifically, the present disclosure improves upon
the accuracy
of circulating tumor fractions estimated from targeted-panel sequencing.
Moreover, because
the methods described herein eliminate the need to process data from two
different
sequencing reactions, the disclosure lowers the computational budget for
accurately
estimating circulating tumor fractions and identifying actionable variants. As
described
above, the disclosed methods and systems are necessarily computer-implemented
due to their
complexity and heavy computational requirements, and thus solve a problem in
the
computing art.
[0134] Advantageously, the methods and systems described herein
provide an
improvement to the abovementioned technical problem (e.g., performing complex
computer-
implemented methods for determining accurate circulating tumor fraction
estimates). The
methods described herein therefore solve a problem in the computing art by
improving upon
conventional methods for determining tumor fraction estimates for cancer
diagnosis,
monitoring, and treatment. For example, the application of a maximum
likelihood estimation
(e.g., an expectation-maximization algorithm) to estimate genomic alterations
using on-target
and off-target sequence reads in liquid biopsy samples improves upon
conventional
approaches for precision oncology by providing highly reliable circulating
tumor fraction
estimates, while allowing concurrent variant detection in targeted panel
sequencing of liquid
biopsy samples. This in turn lowers the computational budget required for
these processes,
thereby improving the speed and lowering the power requirements of the
computer.
[0135] The methods and systems described herein also improve
precision oncology
methods for assigning and/or administering treatment because of the improved
accuracy of
circulating tumor fraction estimations. Accurate ctFEs can be reported as
biomarkers and/or
used in downstream analysis for identification of therapeutically actionable
variants to be
included in a clinical report for patient and/or clinician review.
Additionally, ctFEs and any
therapeutically actionable variants identified using ctFEs can be matched with
appropriate
therapies and/or clinical trials, allowing for more accurate assignment of
treatments. The
improved accuracy of biomarker detection increases the chance of efficacy and
reduces the
risk of patients undergoing unnecessary or potentially harmful regimens due to
misdiagnoses.
Definitions
[0136] As used herein, the term "subject" refers to any living or
non-living organism
including, but not limited to, a human (e.g., a male human, female human,
fetus, pregnant
female, child, or the like), a non-human mammal, or a non-human animal. Any
human or
non-human animal can serve as a subject, including but not limited to mammal,
reptile, avian,
amphibian, fish, ungulate, ruminant, bovine (e.g., cattle), equine (e.g.,
horse), caprine and
ovine (e.g., sheep, goat), swine (e.g., pig), camelid (e.g., camel, llama,
alpaca), monkey, ape
(e.g., gorilla, chimpanzee), ursid (e.g., bear), poultry, dog, cat, mouse,
rat, fish, dolphin,
whale and shark. In some embodiments, a subject is a male or female of any age
(e.g., a man,
a woman, or a child).
[0137] As used herein, the terms "control," "control sample,"
"reference," "reference
sample," "normal," and "normal sample" describe a sample from a non-diseased
tissue. In
some embodiments, such a sample is from a subject that does not have a
particular condition
(e.g., cancer). In other embodiments, such a sample is an internal control
from a subject, e.g.,
who may or may not have the particular disease (e.g., cancer), but is from a
healthy tissue of
the subject. For example, where a liquid or solid tumor sample is obtained
from a subject
with cancer, an internal control sample may be obtained from a healthy tissue
of the subject,
e.g., a white blood cell sample from a subject without a blood cancer or a
solid germline
tissue sample from the subject. Accordingly, a reference sample can be
obtained from the
subject or from a database, e.g., from a second subject who does not have the
particular
disease (e.g., cancer).
[0138] As used herein, the term "cancer," "cancerous tissue," or
"tumor" refers to an
abnormal mass of tissue in which the growth of the mass surpasses, and is not
coordinated
with, the growth of normal tissue, including both solid masses (e.g., as in a
solid tumor) or
fluid masses (e.g., as in a hematological cancer). A cancer or tumor can be
defined as
"benign" or "malignant" depending on the following characteristics: degree of
cellular
differentiation including morphology and functionality, rate of growth, local
invasion and
metastasis. A "benign" tumor can be well differentiated, have
characteristically slower
growth than a malignant tumor and remain localized to the site of origin. In
addition, in some
cases a benign tumor does not have the capacity to infiltrate, invade or
metastasize to distant
sites. A "malignant" tumor can be poorly differentiated (anaplasia) and have
characteristically
rapid growth accompanied by progressive infiltration, invasion, and
destruction of the
surrounding tissue. Furthermore, a malignant tumor can have the capacity to
metastasize to
distant sites. Accordingly, a cancer cell is a cell found within the abnormal
mass of tissue
whose growth is not coordinated with the growth of normal tissue. Accordingly,
a "tumor
sample" refers to a biological sample obtained or derived from a tumor of a
subject, as
described herein.
[0139] Non-limiting examples of cancer types include ovarian
cancer, cervical cancer,
uveal melanoma, colorectal cancer, chromophobe renal cell carcinoma, liver
cancer,
endocrine tumor, oropharyngeal cancer, retinoblastoma, biliary cancer, adrenal
cancer, neural
cancer, neuroblastoma, basal cell carcinoma, brain cancer, breast cancer, non-
clear cell renal
cell carcinoma, glioblastoma, glioma, kidney cancer, gastrointestinal stromal
tumor,
medulloblastoma, bladder cancer, gastric cancer, bone cancer, non-small cell
lung cancer,
thymoma, prostate cancer, clear cell renal cell carcinoma, skin cancer,
thyroid cancer,
sarcoma, testicular cancer, head and neck cancer (e.g., head and neck squamous
cell
carcinoma), meningioma, peritoneal cancer, endometrial cancer, pancreatic
cancer,
mesothelioma, esophageal cancer, small cell lung cancer, Her2 negative breast
cancer,
ovarian serous carcinoma, HR+ breast cancer, uterine serous carcinoma,
uterine corpus
endometrial carcinoma, gastroesophageal junction adenocarcinoma, gallbladder
cancer,
chordoma, and papillary renal cell carcinoma.
[0140] As used herein, the terms "cancer state" or "cancer
condition" refer to a
characteristic of a cancer patient's condition, e.g., a diagnostic status, a
type of cancer, a
location of cancer, a primary origin of a cancer, a cancer stage, a cancer
prognosis, and/or one
or more additional characteristics of a cancer (e.g., tumor characteristics
such as morphology,
heterogeneity, size, etc.). In some embodiments, one or more additional
personal
characteristics of the subject are used to further describe the cancer state or
cancer condition of
the subject, e.g., age, gender, weight, race, personal habits (e.g., smoking,
drinking, diet),
other pertinent medical conditions (e.g., high blood pressure, dry skin, other
diseases), current
medications, allergies, pertinent medical history, current side effects of
cancer treatments and
other medications, etc.
[0141] As used herein, the term "liquid biopsy" sample refers to
a liquid sample obtained
from a subject that includes cell-free DNA. Examples of liquid biopsy samples
include, but
are not limited to, blood, whole blood, plasma, serum, urine, cerebrospinal
fluid, fecal
material, saliva, sweat, tears, pleural fluid, pericardial fluid, or
peritoneal fluid of the subject.
In some embodiments, a liquid biopsy sample is a cell-free sample, e.g., a
cell-free blood
sample. In some embodiments, a liquid biopsy sample is obtained from a subject
with
cancer. In some embodiments, a liquid biopsy sample is collected from a
subject with an
unknown cancer status, e.g., for use in determining a cancer status of the
subject. Likewise,
in some embodiments, a liquid biopsy is collected from a subject with a non-
cancerous
disorder, e.g., a cardiovascular disease. In some embodiments, a liquid biopsy
is collected
from a subject with an unknown status for a non-cancerous disorder, e.g., for
use in
determining a non-cancerous disorder status of the subject.
[0142] As used herein, the terms "cell-free DNA" and "cfDNA"
interchangeably refer to
DNA fragments that circulate in a subject's body (e.g., bloodstream) and
originate from one
or more healthy cells and/or from one or more cancer cells. These DNA
molecules are found
outside cells, in bodily fluids such as blood, whole blood, plasma, serum,
urine, cerebrospinal
fluid, fecal material, saliva, sweat, tears, pleural fluid, pericardial
fluid, or peritoneal
fluid of a subject, and are believed to be fragments of genomic DNA expelled
from healthy
and/or cancerous cells, e.g., upon apoptosis and lysis of the cellular
envelope.
[0143] As used herein, the term "locus" refers to a position
(e.g., a site) within a genome,
e.g., on a particular chromosome. In some embodiments, a locus refers to a
single nucleotide
position, on a particular chromosome, within a genome. In some embodiments, a
locus refers
to a group of nucleotide positions within a genome. In some instances, a locus
is defined by a
mutation (e.g., substitution, insertion, deletion, inversion, or
translocation) of consecutive
nucleotides within a cancer genome. In some instances, a locus is defined by a
gene, a sub-
genic structure (e.g., a regulatory element, exon, intron, or combination
thereof), or a
predefined span of a chromosome. Because normal mammalian cells have diploid
genomes,
a normal mammalian genome (e.g., a human genome) will generally have two
copies of
every locus in the genome, or at least two copies of every locus located on
the autosomal
chromosomes, e.g., one copy on the maternal autosomal chromosome and one copy
on the
paternal autosomal chromosome.
[0144] As used herein, the term "allele" refers to a particular
sequence of one or more
nucleotides at a chromosomal locus. In a haploid organism, the subject has one
allele at
every chromosomal locus. In a diploid organism, the subject has two alleles at
every
chromosomal locus.
[0145] As used herein, the term "base pair" or "bp" refers to a
unit consisting of two
nucleobases bound to each other by hydrogen bonds. Generally, the size of an
organism's
genome is measured in base pairs because DNA is typically double stranded.
However, some
viruses have single-stranded DNA or RNA genomes.
[0146] As used herein, the terms "genomic alteration,"
"mutation," and "variant" refer to
a detectable change in the genetic material of one or more cells. A genomic
alteration,
mutation, or variant can refer to various types of changes in the genetic
material of a cell,
including changes in the primary genome sequence at single or multiple
nucleotide positions,
e.g., a single nucleotide variant (SNV), a multi-nucleotide variant (MNV), an
indel (e.g., an
insertion or deletion of nucleotides), a DNA rearrangement (e.g., an inversion
or translocation
of a portion of a chromosome or chromosomes), a variation in the copy number
of a locus
(e.g., an exon, gene, or a large span of a chromosome) (CNV), a partial or
complete change in
the ploidy of the cell, as well as in changes in the epigenetic information of
a genome, such as
altered DNA methylation patterns. In some embodiments, a mutation is a change
in the
genetic information of the cell relative to a particular reference genome, or
one or more
'normal' alleles found in the population of the species of the subject. For
instance, mutations
can be found in both germline cells (e.g., non-cancerous, 'normal' cells) of a
subject and in
abnormal cells (e.g., pre-cancerous or cancerous cells) of the subject. As
such, a mutation in
a germline of the subject (e.g., which is found in substantially all 'normal
cells' in the
subject) is identified relative to a reference genome for the species of the
subject. However,
many loci of a reference genome of a species are associated with several
variant alleles that
are significantly represented in the population of the subject and are not
associated with a
diseased state, e.g., such that they would not be considered 'mutations.' By
contrast, in some
embodiments, a mutation in a cancerous cell of a subject can be identified
relative to either a
reference genome of the subject or to the subject's own germline genome. In
certain
instances, identification of both types of variants can be informative. For
instance, in some
instances, a mutation that is present in both the cancer genome of the subject
and the
germline of the subject is informative for precision oncology when the
mutation is a so-called
'driver mutation,' which contributes to the initiation and/or development of a
cancer.
However, in other instances, a mutation that is present in both the cancer
genome of the
subject and the germline of the subject is not informative for precision
oncology, e.g., when
the mutation is a so-called 'passenger mutation,' which does not contribute to
the initiation
and/or development of the cancer. Likewise, in some instances, a mutation that
is present in
the cancer genome of the subject but not the germline of the subject is
informative for
precision oncology, e.g., where the mutation is a driver mutation and/or the
mutation
facilitates a therapeutic approach, e.g., by differentiating cancer cells from
normal cells in a
therapeutically actionable way. However, in some instances, a mutation that is
present in the
cancer genome but not the germline of a subject is not informative for
precision oncology,
e.g., where the mutation is a passenger mutation and/or where the mutation
fails to
differentiate the cancer cell from a germline cell in a therapeutically
actionable way.
[0147] As used herein, the terms "focal copy number variation,"
"focal copy number
alteration," "focal copy number variant," and the like interchangeably refer
to a genomic
variation, relative to a reference genome, in the copy number of a small
genomic segment.
Unless otherwise specified, a small genomic segment is less than 30 Mb.
However, in some
embodiments, a small genomic segment is less than 25 Mb, less than 20 Mb, less
than 15 Mb, less
than 10 Mb, less than 5 Mb, less than 4 Mb, less than 3 Mb, less than 2 Mb,
less than 1 Mb,
or smaller. Generally, focal copy number variations range from several hundred
bases to tens
of Mb. In some embodiments, a focal copy number variation consists of one or a
few exons
of a gene or several genes. For more information on focal copy number
variations see, for
example, Nord et al., Int. J. Cancer, 126, 1390-1402 (2010), which is hereby
incorporated
herein by reference in its entirety.
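The size criterion above can be illustrated with a minimal Python sketch. The function name `is_focal`, its parameters, and the example coordinates are hypothetical; only the default 30 Mb limit and the alternative smaller limits are taken from the definition above.

```python
MB = 1_000_000  # base pairs per megabase


def is_focal(segment_start: int, segment_end: int, max_size_mb: float = 30.0) -> bool:
    """Return True if the segment [start, end) is smaller than the focal-size limit."""
    length_bp = segment_end - segment_start
    if length_bp <= 0:
        raise ValueError("segment_end must be greater than segment_start")
    return length_bp < max_size_mb * MB


# Hypothetical segments, given as half-open coordinate intervals in base pairs.
print(is_focal(10_000_000, 12_500_000))                  # 2.5 Mb segment, default 30 Mb limit
print(is_focal(0, 45_000_000))                           # 45 Mb segment, default 30 Mb limit
print(is_focal(0, 25_000_000, max_size_mb=20.0))         # 25 Mb segment, stricter 20 Mb limit
```

Passing a smaller `max_size_mb` corresponds to the embodiments in which a small genomic segment is defined as less than 25 Mb, 20 Mb, and so on.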
[0148] As used herein, the term "reference allele" refers to the
sequence of one or more
nucleotides at a chromosomal locus that is either the predominant allele
represented at that
chromosomal locus within the population of the species (e.g., the "wild-type"
sequence), or
an allele that is predefined within a reference genome for the species.
[0149] As used herein, the term "variant allele" refers to a
sequence of one or more
nucleotides at a chromosomal locus that is either not the predominant allele
represented at
that chromosomal locus within the population of the species (e.g., not the
"wild-type"
sequence), or not an allele that is predefined within a reference sequence
construct (e.g., a
reference genome or set of reference genomes) for the species. In some
instances, sequence
isoforms found within the population of a species that do not affect a change
in a protein
encoded by the genome, or that result in an amino acid substitution that does
not substantially
affect the function of an encoded protein, are not variant alleles.
[0150] As used herein, the term "variant allele fraction," "VAF,"
"allelic fraction," or
"AF" refers to the number of times a variant or mutant allele was observed
(e.g., a number of
reads supporting a candidate variant allele) divided by the total number of
times the position
was sequenced (e.g., a total number of reads covering a candidate locus).
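As a minimal illustration of this definition, the following Python sketch computes a variant allele fraction from read counts. The function name `variant_allele_fraction` and the example counts are hypothetical and are not part of the disclosure.

```python
def variant_allele_fraction(variant_reads: int, total_reads: int) -> float:
    """Reads supporting the variant allele divided by reads covering the locus."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must be between 0 and total_reads")
    return variant_reads / total_reads


# Hypothetical counts: 12 of the 800 reads covering the locus support the variant.
vaf = variant_allele_fraction(12, 800)
print(f"VAF = {vaf:.3f}")
```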
[0151] As used herein, the terms "variant fragment count" and
"variant allele fragment
count" interchangeably refer to a quantification, e.g., a raw or normalized
count, of the
number of sequences representing unique cell-free DNA fragments encompassing a
variant
allele in a sequencing reaction. That is, a variant fragment count represents
a count of
sequence reads representing unique molecules in the liquid biopsy sample,
after duplicate
sequence reads in the raw sequencing data have been collapsed, e.g., through
the use of
unique molecular indices (UMI) and bagging, etc., as described herein.
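The collapsing step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Read` record, the use of the UMI plus alignment coordinates as the fragment key, and the rule that a fragment counts as variant if any of its duplicate reads carries the variant are all hypothetical choices, and the sketch does not model UMI sequencing errors or the bagging approach referenced in the text.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Read(NamedTuple):
    umi: str           # unique molecular index attached to the original fragment
    start: int         # alignment start of the fragment
    end: int           # alignment end of the fragment
    has_variant: bool  # whether this read carries the variant allele


def variant_fragment_count(reads: Iterable[Read]) -> int:
    """Collapse duplicate reads per fragment, then count fragments with the variant."""
    fragments: Dict[Tuple[str, int, int], bool] = {}
    for read in reads:
        key = (read.umi, read.start, read.end)
        # Duplicate reads of one molecule share a key and collapse to one entry.
        fragments[key] = fragments.get(key, False) or read.has_variant
    return sum(fragments.values())


# Hypothetical raw reads: the first two are duplicates of the same molecule.
raw_reads = [
    Read("AAC", 100, 260, True),
    Read("AAC", 100, 260, True),
    Read("GTT", 105, 270, True),
    Read("CCA", 90, 250, False),
]
print(variant_fragment_count(raw_reads))
```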
[0152] As used herein, the term "germline variants" refers to
genetic variants inherited
from maternal and paternal DNA. Germline variants may be determined through a
matched
tumor-normal calling pipeline.
[0153] As used herein, the term "somatic variants" refers to
variants arising as a result of
dysregulated cellular processes associated with neoplastic cells, e.g., a
mutation. Somatic
variants may be detected via subtraction from a matched normal sample.
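Subtraction from a matched normal sample can be pictured as a set difference, as in the following Python sketch. The representation of a variant as a (chromosome, position, reference allele, alternate allele) tuple, the function name `subtract_matched_normal`, and the example calls are hypothetical; a production pipeline would also account for coverage and allele fraction in the normal sample.

```python
from typing import Set, Tuple

# (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def subtract_matched_normal(tumor_calls: Set[Variant],
                            normal_calls: Set[Variant]) -> Set[Variant]:
    """Keep variants called in the tumor sample but absent from the matched normal."""
    return tumor_calls - normal_calls


# Hypothetical variant calls from a tumor sample and its matched normal sample.
tumor = {("chr1", 1000, "C", "T"), ("chr2", 5000, "G", "A"), ("chr3", 750, "A", "G")}
normal = {("chr2", 5000, "G", "A")}

somatic_candidates = subtract_matched_normal(tumor, normal)
germline_candidates = tumor & normal
print(sorted(somatic_candidates))
print(sorted(germline_candidates))
```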
[0154] As used herein, the term "single nucleotide variant" or
"SNV" refers to a
substitution of one nucleotide to a different nucleotide at a position (e.g.,
site) of a nucleotide
sequence, e.g., a sequence read from an individual. A substitution from a
first nucleobase X
to a second nucleobase Y may be denoted as "X>Y." For example, a cytosine to
thymine
SNV may be denoted as "C>T."
[0155] As used herein, the term "insertions and deletions" or
"indels" refers to a variant
resulting from the gain or loss of DNA base pairs within an analyzed region.
[0156] As used herein, the term "copy number variation" or "CNV"
refers to the process
by which large structural changes in a genome associated with tumor aneuploidy
and other
dysregulated repair systems are detected. These processes are used to detect
large scale
insertions or deletions of entire genomic regions. CNV is defined as
structural insertions or
deletions greater than a certain base pair ("bp") in size, such as 500 bp.
[0157] As used herein, the term "gene fusion" refers to the
product of large-scale
chromosomal aberrations resulting in the creation of a chimeric protein. These
expressed
products can be non-functional, or they can be highly over- or underactive.
This can cause
deleterious effects in cancer such as hyper-proliferative or anti-apoptotic
phenotypes.
[0158] As used herein, the term "loss of heterozygosity" refers
to the loss of one copy of
a segment (e.g., including part or all of one or more genes) of the genome of
a diploid subject
(e.g., a human) or loss of one copy of a sequence encoding a functional gene
product in the
genome of the diploid subject, in a tissue, e.g., a cancerous tissue, of the
subject. As used
herein, when referring to a metric representing loss of heterozygosity across
the entire
genome of the subject, loss of heterozygosity is caused by the loss of one
copy of various
segments in the genome of the subject. Loss of heterozygosity across the
entire genome may
be estimated without sequencing the entire genome of a subject, and such
methods for such
estimations based on gene panel targeting-based sequencing methodologies are
described in
the art. Accordingly, in some embodiments, a metric representing loss of
heterozygosity
across the entire genome of a tissue of a subject is represented as a single
value, e.g., a
percentage or fraction of the genome. In some cases, a tumor is composed of
various sub-
clonal populations, each of which may have a different degree of loss of
heterozygosity
across their respective genomes. Accordingly, in some embodiments, loss of
heterozygosity
across the entire genome of a cancerous tissue refers to an average loss of
heterozygosity
across a heterogeneous tumor population. As used herein, when referring to a
metric for loss
of heterozygosity in a particular gene, e.g., a DNA repair protein such as a
protein involved in
the homologous DNA recombination pathway (e.g., BRCA1 or BRCA2), loss of
heterozygosity refers to complete or partial loss of one copy of the gene
encoding the protein
in the genome of the tissue and/or a mutation in one copy of the gene that
prevents translation
of a full-length gene product, e.g., a frameshift or truncating (creating a
premature stop codon
in the gene) mutation in the gene of interest. In some cases, a tumor is
composed of various
sub-clonal populations, each of which may have a different mutational status
in a gene of
interest. Accordingly, in some embodiments, loss of heterozygosity for a
particular gene of
interest is represented by an average value for loss of heterozygosity for the
gene across all
sequenced sub-clonal populations of the cancerous tissue. In other
embodiments, loss of
heterozygosity for a particular gene of interest is represented by a count of
the number of
unique incidences of loss of heterozygosity in the gene of interest across all
sequenced sub-
clonal populations of the cancerous tissue (e.g., the number of unique frame-
shift and/or
truncating mutations in the gene identified in the sequencing data).
[0159] As used herein, the term "microsatellites" refers to
short, repeated sequences of
DNA. The smallest nucleotide repeated unit of a microsatellite is referred to
as the "repeated
unit" or "repeat unit." In some embodiments, the stability of a microsatellite
locus is
evaluated by comparing some metric of the distribution of the number of
repeated units at a
microsatellite locus to a reference number or distribution.
[0160] As used herein, the term "microsatellite instability" or
"MSI" refers to a genetic
hypermutability condition associated with various cancers that results from
impaired DNA
mismatch repair (MMR) in a subject. Among other phenotypes, MSI causes changes
in the
size of microsatellite loci, e.g., a change in the number of repeated units at
microsatellite loci,
during DNA replication. Accordingly, the size of microsatellite repeats is
varied in MSI
cancers as compared to the size of the corresponding microsatellite repeats in
the germline of
a cancer subject. The term "Microsatellite Instability-High" or "MSI-H" refers
to a state of a
cancer (e.g., a tumor) that has a significant MMR defect, resulting in
microsatellite loci with
significantly different lengths than the corresponding microsatellite loci in
normal cells of the
same individual. The term "Microsatellite Stable" or "MSS" refers to a state
of a cancer
(e.g., a tumor) without significant MMR defects, such that there is no
significant difference
between the lengths of the microsatellite loci in cancerous cells and the
lengths of the
corresponding microsatellite loci in normal (e.g., non-cancerous) cells in the
same individual.
The term "Microsatellite Equivocal" or "MSE" refers to a state of a cancer
(e.g., a tumor)
having an intermediate microsatellite length phenotype, that cannot be clearly
classified as
MSI-H or MSS based on statistical cutoffs used to define those two categories.
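The three-way classification above can be sketched in Python as follows. This is only an illustration: the summary metric (the fraction of evaluated microsatellite loci judged unstable), the function name `classify_msi`, and both cutoff values are hypothetical placeholders and are not taken from the disclosure, which leaves the statistical cutoffs unspecified here.

```python
def classify_msi(unstable_loci: int, evaluated_loci: int,
                 mss_max: float = 0.10, msi_h_min: float = 0.30) -> str:
    """Assign MSS, MSE, or MSI-H from the fraction of unstable microsatellite loci.

    mss_max and msi_h_min are placeholder cutoffs chosen only for illustration.
    """
    if evaluated_loci <= 0:
        raise ValueError("evaluated_loci must be positive")
    if not 0 <= unstable_loci <= evaluated_loci:
        raise ValueError("unstable_loci must be between 0 and evaluated_loci")
    fraction_unstable = unstable_loci / evaluated_loci
    if fraction_unstable >= msi_h_min:
        return "MSI-H"
    if fraction_unstable <= mss_max:
        return "MSS"
    return "MSE"  # intermediate phenotype between the two cutoffs


# Hypothetical results for three samples, each with 50 evaluated loci.
for unstable in (1, 10, 20):
    print(unstable, classify_msi(unstable, 50))
```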
[0161] As used herein, the term "gene product" refers to an RNA
(e.g., mRNA or
miRNA) or protein molecule transcribed or translated from a particular genomic
locus, e.g., a
particular gene. The genomic locus can be identified using a gene name, a
chromosomal
location, or any other genetic mapping metric.
[0162] As used herein, the terms "expression level," "abundance
level," or simply
"abundance" refer to an amount of a gene product (an RNA species, e.g., mRNA
or
miRNA, or protein molecule) transcribed or translated by a cell, or an average
amount of a
gene product transcribed or translated across multiple cells. When referring
to mRNA or
protein expression, the term generally refers to the amount of any RNA or
protein species
corresponding to a particular genomic locus, e.g., a particular gene. However,
in some
embodiments, an expression level can refer to the amount of a particular
isoform of an
mRNA or protein corresponding to a particular gene that gives rise to multiple
mRNA or
protein isoforms. The genomic locus can be identified using a gene name, a
chromosomal
location, or any other genetic mapping metric.
[0163] As used herein, the term "ratio" refers to any comparison
of a first metric X, or a
first mathematical transformation thereof X' (e.g., measurement of a number of
units of a
genomic sequence in a first one or more biological samples or a first
mathematical
transformation thereof) to another metric Y or a second mathematical
transformation thereof
Y' (e.g., the number of units of a respective genomic sequence in a second one
or more
biological samples or a second mathematical transformation thereof) expressed
as X/Y, Y/X,
logN(X/Y), logN(Y/X), X'/Y, Y/X', logN(X'/Y), logN(Y/X'), X/Y', Y'/X,
logN(X/Y'),
logN(Y'/X), X'/Y', Y'/X', logN(X'/Y'), or logN(Y'/X'), where N is any real
number greater
than 1 and where example mathematical transformations of X and Y include, but
are not
limited to, raising X or Y to a power Z, multiplying X or Y by a constant Q,
where Z and Q
are any real numbers, and/or taking an M based logarithm of X and/or Y, where
M is a real
number greater than 1. In one non-limiting example, X is transformed to X'
prior to ratio
calculation by raising X by the power of two (X^2) and Y is transformed to Y'
prior to ratio
calculation by raising Y by the power of 3.2 (Y^3.2) and the ratio of X and Y
is computed as
log2(X'/Y').
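The non-limiting example above can be worked through in a short Python sketch. The function name `transformed_log_ratio`, its parameter names, and the input values X = 8 and Y = 2 are hypothetical; the exponents 2 and 3.2 and the base-2 logarithm follow the example in the text.

```python
import math


def transformed_log_ratio(x: float, y: float, x_power: float = 2.0,
                          y_power: float = 3.2, log_base: float = 2.0) -> float:
    """Compute logN(X'/Y') where X' = X**x_power and Y' = Y**y_power."""
    if x <= 0 or y <= 0:
        raise ValueError("x and y must be positive to take a logarithm of the ratio")
    if log_base <= 1:
        raise ValueError("log_base must be a real number greater than 1")
    x_prime = x ** x_power   # first mathematical transformation
    y_prime = y ** y_power   # second mathematical transformation
    return math.log(x_prime / y_prime, log_base)


# Hypothetical metrics: X = 8 and Y = 2, so log2(8**2 / 2**3.2) = 6 - 3.2.
print(round(transformed_log_ratio(8.0, 2.0), 6))
```

Because the logarithm of a quotient is a difference of logarithms, the same value can be checked by hand as 2·log2(8) − 3.2·log2(2).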
[0164] As used herein, the term "relative abundance" refers to a
ratio of a first amount of
a compound measured in a sample, e.g., a gene product (an RNA species, e.g.,
mRNA or
miRNA, or protein molecule) or nucleic acid fragments having a particular
characteristic
(e.g., aligning to a particular locus or encompassing a particular allele), to
a second amount of
a compound measured in a second sample. In some embodiments, relative
abundance refers
to a ratio of an amount of species of a compound to a total amount of the
compound in the
same sample. For instance, a ratio of the amount of mRNA transcripts encoding
a particular
gene in a sample (e.g., aligning to a particular region of the exome) to the
total amount of
mRNA transcripts in the sample. In other embodiments, relative abundance
refers to a ratio
of an amount of a compound or species of a compound in a first sample to an
amount of the
compound or the species of the compound in a second sample. For instance, a
ratio of a
normalized amount of mRNA transcripts encoding a particular gene in a first
sample to a
normalized amount of mRNA transcripts encoding the particular gene in a second
and/or
reference sample.
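The two senses of relative abundance described above can be contrasted in a brief Python sketch. The function names and the example amounts are hypothetical and serve only to illustrate the two ratios.

```python
def within_sample_abundance(species_amount: float, total_amount: float) -> float:
    """Amount of one species relative to the total amount in the same sample."""
    if total_amount <= 0:
        raise ValueError("total_amount must be positive")
    return species_amount / total_amount


def between_sample_abundance(first_sample_amount: float,
                             second_sample_amount: float) -> float:
    """Normalized amount in a first sample relative to a second or reference sample."""
    if second_sample_amount <= 0:
        raise ValueError("second_sample_amount must be positive")
    return first_sample_amount / second_sample_amount


# Hypothetical: 250 transcripts of one gene among 1,000,000 transcripts in a sample.
print(within_sample_abundance(250, 1_000_000))
# Hypothetical: normalized amounts of 40.0 in a first sample and 10.0 in a reference.
print(between_sample_abundance(40.0, 10.0))
```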
[0165] As used herein, the terms "sequencing," "sequence
determination," and the like
refer to any biochemical processes that may be used to determine the order of
biological
macromolecules such as nucleic acids or proteins. For example, sequencing data
can include
all or a portion of the nucleotide bases in a nucleic acid molecule such as an
mRNA transcript
or a genomic locus.
[0166] As used herein, the term "genetic sequence" refers to a
recordation of a series of
nucleotides present in a subject's RNA or DNA as determined by sequencing of
nucleic acids
from the subject.
[0167] As used herein, the term "sequence reads" or "reads"
refers to nucleotide
sequences produced by any nucleic acid sequencing process described herein or
known in the
art. Reads can be generated from one end of nucleic acid fragments ("single-
end reads") or
from both ends of nucleic acid fragments (e.g., paired-end reads, double-end
reads). The
length of the sequence read is often associated with the particular sequencing
technology.
High-throughput methods, for example, provide sequence reads that can vary in
size from
tens to hundreds of base pairs (bp). In some embodiments, the sequence reads
are of a mean,
median or average length of about 15 bp to 900 bp long (e.g., about 20 bp,
about 25 bp, about
30 bp, about 35 bp, about 40 bp, about 45 bp, about 50 bp, about 55 bp, about
60 bp, about 65
bp, about 70 bp, about 75 bp, about 80 bp, about 85 bp, about 90 bp, about 95
bp, about 100
bp, about 110 bp, about 120 bp, about 130 bp, about 140 bp, about 150 bp, about
200 bp, about
250 bp, about 300 bp, about 350 bp, about 400 bp, about 450 bp, or about 500
bp). In some
embodiments, the sequence reads are of a mean, median or average length of
about 1000 bp,
2000 bp, 5000 bp, 10,000 bp, or 50,000 bp or more. Nanopore sequencing, for
example,
can provide sequence reads that can vary in size from tens to hundreds to
thousands of base
pairs. Illumina® parallel sequencing, for example, can provide sequence reads
that do not
vary as much, for example, most of the sequence reads can be smaller than 200
bp. A
sequence read (or sequencing read) can refer to sequence information
corresponding to a
nucleic acid molecule (e.g., a string of nucleotides). For example, a sequence
read can
correspond to a string of nucleotides (e.g., about 20 to about 150) from part
of a nucleic acid
fragment, can correspond to a string of nucleotides at one or both ends of a
nucleic acid
fragment, or can correspond to nucleotides of the entire nucleic acid
fragment. A sequence
read can be obtained in a variety of ways, e.g., using sequencing techniques
or using probes,
e.g., in hybridization arrays or capture probes, or amplification techniques,
such as the
polymerase chain reaction (PCR) or linear amplification using a single primer
or isothermal
amplification.
[0168] As used herein, the term "read segment" refers to any form
of nucleotide sequence
read including the raw sequence reads obtained directly from a nucleic acid
sequencing
technique or from a sequence derived therefrom, e.g., an aligned sequence
read, a collapsed
sequence read, or a stitched sequence read.
[0169] As used herein, the term "read count" refers to the total
number of nucleic acid
reads generated, which may or may not be equivalent to the number of nucleic
acid molecules
generated, during a nucleic acid sequencing reaction.
[0170] As used herein, the term "read-depth," "sequencing depth,"
or "depth" can refer to
a total number of unique nucleic acid fragments encompassing a particular
locus or region of
the genome of a subject that are sequenced in a particular sequencing
reaction. Sequencing
depth can be expressed as "Yx", e.g., 50x, 100x, etc., where "Y" refers to the
number of
unique nucleic acid fragments encompassing a particular locus that are
sequenced in a
sequencing reaction. In such a case, Y is necessarily an integer, because it
represents the
actual sequencing depth for a particular locus. Alternatively, read-depth,
sequencing depth,
or depth can refer to a measure of central tendency (e.g., a mean or mode) of
the number of
unique nucleic acid fragments that encompass one of a plurality of loci or
regions of the
genome of a subject that are sequenced in a particular sequencing reaction.
For example, in
some embodiments, sequencing depth refers to the average depth of every locus
across an
arm of a chromosome, a targeted sequencing panel, an exome, or an entire
genome. In such a
case, Y may be expressed as a fraction or a decimal, because it refers to an
average coverage
across a plurality of loci. When a mean depth is recited, the actual depth for
any particular
locus may be different than the overall recited depth. Metrics can be
determined that provide
a range of sequencing depths in which a defined percentage of the total number
of loci fall.
For instance, a range of sequencing depths within which 90%, 95%, or 99% of
the loci fall.
As understood by the skilled artisan, different sequencing technologies
provide different
sequencing depths. For instance, low-pass whole genome sequencing can refer to
technologies that provide a sequencing depth of less than 5x, less than 4x,
less than 3x, or less
than 2x, e.g., from about 0.5x to about 3x.
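The distinction between an integer per-locus depth, a possibly fractional mean depth, and a range containing a defined percentage of loci can be illustrated with the following Python sketch. The function names and the example depths are hypothetical, and reporting the narrowest range that contains the requested fraction of loci is only one assumed way of choosing such a range.

```python
import math
from typing import Sequence, Tuple


def mean_depth(depths: Sequence[int]) -> float:
    """Average of integer per-locus depths; the result may be fractional."""
    if not depths:
        raise ValueError("depths must not be empty")
    return sum(depths) / len(depths)


def narrowest_depth_range(depths: Sequence[int], fraction: float = 0.90) -> Tuple[int, int]:
    """Narrowest (low, high) depth range containing at least `fraction` of the loci."""
    if not depths:
        raise ValueError("depths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(depths)
    n = len(ordered)
    k = max(1, math.ceil(fraction * n - 1e-9))  # loci the range must contain
    # Slide a window of k consecutive sorted depths and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[best], ordered[best + k - 1]


# Hypothetical unique-fragment depths at ten loci; one locus is an outlier.
locus_depths = [48, 50, 52, 51, 49, 50, 47, 53, 50, 120]
print(f"mean depth: {mean_depth(locus_depths)}x")
print("range containing 90% of loci:", narrowest_depth_range(locus_depths, 0.90))
```

In this example the mean depth differs from the depth at every individual locus, which is the situation the definition describes when a mean depth is recited.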
[0171] As used herein, the term "sequencing breadth" refers to
what fraction of a
particular reference exome (e.g., human reference exome), a particular
reference genome
(e.g., human reference genome), or part of the exome or genome has been
analyzed.
Sequencing breadth can be expressed as a fraction, a decimal, or a percentage,
and is
generally calculated as (the number of loci analyzed / the total number of
loci in a reference
exome or reference genome). The denominator of the fraction can be a repeat-
masked
genome, and thus 100% can correspond to all of the reference genome minus the
masked
parts. A repeat-masked exome or genome can refer to an exome or genome in
which
sequence repeats are masked (e.g., sequence reads align to unmasked portions
of the exome
or genome). In some embodiments, any part of an exome or genome can be masked
and,
thus, sequencing breadth can be evaluated for any desired portion of a
reference exome or
genome. In some embodiments, "broad sequencing" refers to sequencing/analysis
of at least
0.1% of an exome or genome.
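The breadth calculation above, including the effect of masking on the denominator, can be sketched in Python as follows. The function names, parameter names, and example locus counts are hypothetical; the 0.1% threshold for broad sequencing is the one stated in the definition.

```python
def sequencing_breadth(loci_analyzed: int, total_loci: int, masked_loci: int = 0) -> float:
    """Fraction of the unmasked reference (exome or genome) that was analyzed."""
    unmasked_loci = total_loci - masked_loci
    if unmasked_loci <= 0:
        raise ValueError("the reference must contain at least one unmasked locus")
    if not 0 <= loci_analyzed <= unmasked_loci:
        raise ValueError("loci_analyzed must be between 0 and the unmasked locus count")
    return loci_analyzed / unmasked_loci


def is_broad_sequencing(breadth: float, minimum: float = 0.001) -> bool:
    """Broad sequencing covers at least 0.1% of the exome or genome."""
    return breadth >= minimum


# Hypothetical counts: 2,000 of 1,000,000 reference loci analyzed, none masked.
breadth = sequencing_breadth(2_000, 1_000_000)
print(f"breadth = {breadth:.2%}, broad = {is_broad_sequencing(breadth)}")
# Hypothetical repeat-masked reference: 100 of 1,000 loci masked, 450 analyzed.
print(sequencing_breadth(450, 1_000, masked_loci=100))
```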
[0172] As used herein, the terms "sequence ratio" and "coverage
ratio" interchangeably
refer to any measurement of a number of units of a genomic sequence in a first
one or more
biological samples (e.g., a test and/or tumor sample) compared to the number
of units of the
respective genomic sequence in a second one or more biological samples (e.g.,
a reference
and/or control sample). In some embodiments, a sequence ratio is a copy ratio,
a log2-
transformed copy ratio (e.g., log2 copy ratio), a coverage ratio, a base
fraction, an allele
fraction (e.g., a variant allele fraction), and/or a tumor ploidy. In some
embodiments,
a sequence ratio is a logN-transformed copy ratio, where N is any real number
greater than 1.
[0173] As used herein, the term "sequencing probe" refers to a
molecule that binds to a
nucleic acid with affinity that is based on the expected nucleotide sequence
of the RNA or
DNA present at that locus.
[0174] As used herein, the term "targeted panel" or "targeted
gene panel" refers to a
combination of probes for sequencing (e.g., by next-generation sequencing)
nucleic acids
present in a biological sample from a subject (e.g., a tumor sample, liquid
biopsy sample,
germline tissue sample, white blood cell sample, or tumor or tissue organoid
sample),
selected to map to one or more loci of interest on one or more chromosomes. An
example set
of loci/genes useful for precision oncology, e.g., via solid or liquid biopsy
assay, that can be
analyzed using a targeted panel is described in Table 1. In some embodiments,
in addition to
loci that are informative for precision oncology, a targeted panel includes
one or more probes
for sequencing one or more of a locus associated with a different medical
condition, a locus used
for internal control purposes, or a locus from a pathogenic organism (e.g., an
oncogenic
pathogen).
[0175] As used herein, the term "reference exome" refers to any
sequenced or otherwise
characterized exome, whether partial or complete, of any tissue from any
organism or
pathogen that may be used to reference identified sequences from a subject.
Typically, a
reference exome will be derived from a subject of the same species as the
subject whose
sequences are being evaluated. Example reference exomes used for human
subjects as well
as many other organisms are provided in the on-line genome browser hosted by
the National
Center for Biotechnology Information ("NCBI"). An "exome" refers to the
complete
transcriptional profile of an organism or pathogen, expressed in nucleic acid
sequences. As
used herein, a reference sequence or reference exome often is an assembled or
partially
assembled exomic sequence from an individual or multiple individuals. In some
embodiments, a reference exome is an assembled or partially assembled exomic
sequence
from one or more human individuals. The reference exome can be viewed as a
representative
example of a species' set of expressed genes. In some embodiments, a reference
exome
comprises sequences assigned to chromosomes.
[0176] As used herein, the term "reference genome" refers to any
sequenced or otherwise
characterized genome, whether partial or complete, of any organism or pathogen
that may be
used to reference identified sequences from a subject. Typically, a reference
genome will be
derived from a subject of the same species as the subject whose sequences are
being
evaluated. Exemplary reference genomes used for human subjects as well as many
other
organisms are provided in the on-line genome browser hosted by the National
Center for
Biotechnology Information ("NCBI") or the University of California, Santa Cruz
(UCSC). A
"genome" refers to the complete genetic information of an organism or
pathogen, expressed
in nucleic acid sequences. As used herein, a reference sequence or reference
genome often is
an assembled or partially assembled genomic sequence from an individual or
multiple
individuals. In some embodiments, a reference genome is an assembled or
partially
assembled genomic sequence from one or more human individuals. The reference
genome
can be viewed as a representative example of a species' set of genes. In some
embodiments,
a reference genome comprises sequences assigned to chromosomes. Exemplary
human
reference genomes include but are not limited to NCBI build 34 (UCSC
equivalent: hg16),
NCBI build 35 (UCSC equivalent: hg17), NCBI build 36.1 (UCSC equivalent:
hg18),
GRCh37 (UCSC equivalent: hg19), and GRCh38 (UCSC equivalent: hg38). For a
haploid
genome, there can be only one nucleotide at each locus. For a diploid genome,
heterozygous
loci can be identified; each heterozygous locus can have two alleles, where
either allele can
allow a match for alignment to the locus.
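The build equivalences and the heterozygous-locus matching rule described above can be sketched as follows. This is a minimal illustration only: the function name `base_matches_locus` and the set-based representation of a locus are assumptions made for the example and are not part of the disclosure.

```python
from typing import FrozenSet

# Build-name equivalences exactly as listed in the paragraph above.
NCBI_TO_UCSC = {
    "NCBI build 34": "hg16",
    "NCBI build 35": "hg17",
    "NCBI build 36.1": "hg18",
    "GRCh37": "hg19",
    "GRCh38": "hg38",
}


def base_matches_locus(read_base: str, locus_alleles: FrozenSet[str]) -> bool:
    """Return True if the read base matches any allele recorded at the locus.

    A haploid locus carries a single allele. A heterozygous diploid locus
    carries two alleles, and either one is accepted as a match.
    """
    return read_base.upper() in locus_alleles


# Hypothetical loci used only for illustration.
haploid_locus = frozenset({"A"})
heterozygous_locus = frozenset({"A", "G"})
```

Under this sketch, a read carrying G matches the heterozygous locus but not the haploid one.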
[0177] As used herein, the term "bioinformatics pipeline" refers
to a series of processing
stages used to determine characteristics of a subject's genome or exome based
on sequencing
data of the subject's genome or exome. A bioinformatics pipeline may be used
to determine
characteristics of a germline genome or exome of a subject and/or a cancer
genome or exome
of a subject. In some embodiments, the pipeline extracts information related
to genomic
alterations in the cancer genome of a subject, which is useful for guiding
clinical decisions
for precision oncology, from sequencing results of a biological sample, e.g.,
a tumor sample,
liquid biopsy sample, reference normal sample, etc., from the subject. Certain
processing
stages in a bioinformatics pipeline may be 'connected,' meaning that the results of a
first respective
processing stage are informative and/or essential for execution of a second,
downstream
processing stage. For instance, in some embodiments, a bioinformatics pipeline
includes a
first respective processing stage for identifying genomic alterations that are
unique to the
cancer genome of a subject and a second respective processing stage that uses
the quantity
and/or identity of the identified genomic alterations to determine a metric
that is informative
for precision oncology, e.g., a tumor mutational burden. In some embodiments,
the
bioinformatics pipeline includes a reporting stage that generates a report of
relevant and/or
actionable information identified by upstream stages of the pipeline, which
may or may not
further include recommendations for aiding clinical therapy decisions.
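The idea of 'connected' stages can be sketched as a list of functions in which a downstream stage reads what an upstream stage produced. Everything here is a hypothetical illustration: the stage names, the dictionary keys, the variant strings, and the choice to express tumor mutational burden as mutations per megabase of panel are assumptions, not the pipeline of the disclosure.

```python
from typing import Callable, Dict, List

# Each stage takes the accumulated results and returns them with additions.
Stage = Callable[[Dict], Dict]


def identify_somatic_alterations(results: Dict) -> Dict:
    """First stage: keep alterations seen in the tumor but not the germline."""
    somatic = set(results["tumor_variants"]) - set(results["germline_variants"])
    return {**results, "somatic_variants": sorted(somatic)}


def compute_tumor_mutational_burden(results: Dict) -> Dict:
    """Second, connected stage: needs the first stage's somatic variant list."""
    count = len(results["somatic_variants"])
    # Assumption: TMB is expressed as mutations per megabase of panel.
    return {**results, "tmb": count / results["panel_size_mb"]}


def run_pipeline(stages: List[Stage], results: Dict) -> Dict:
    """Run stages in order so downstream stages can use upstream outputs."""
    for stage in stages:
        results = stage(results)
    return results


# Made-up inputs for illustration only.
report = run_pipeline(
    [identify_somatic_alterations, compute_tumor_mutational_burden],
    {
        "tumor_variants": ["chr1:100A>T", "chr2:200G>C", "chr3:5C>T"],
        "germline_variants": ["chr2:200G>C"],
        "panel_size_mb": 2.0,
    },
)
```

Swapping the order of the two stages would raise a `KeyError`, which is one simple way to see that the second stage depends on the first.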
[0178] As used herein, the term "limit of detection" or "LOD"
refers to the minimal
quantity of a feature that can be identified with a particular level of
confidence. Accordingly,
a limit of detection can be used to describe an amount of a substance that must
be present in
order for a particular assay to reliably detect the substance. A limit of
detection can also be
used to describe a level of support needed for an algorithm to reliably
identify a genomic
alteration based on sequencing data, for example, a minimal number of unique
sequence
reads needed to support identification of a sequence variant such as a SNV.
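A read-support requirement of this kind can be sketched as a simple count of unique supporting reads compared against a minimum. The function names and the use of read identifiers to decide uniqueness are assumptions for illustration; the disclosure does not specify a particular minimum in this passage.

```python
from typing import Iterable


def count_unique_support(read_ids: Iterable[str]) -> int:
    """Count distinct reads supporting a candidate variant."""
    return len(set(read_ids))


def meets_support_threshold(read_ids: Iterable[str], min_unique_reads: int) -> bool:
    """Return True if enough unique reads support the candidate variant."""
    return count_unique_support(read_ids) >= min_unique_reads


# Hypothetical supporting reads; "r1" appears twice and is counted once.
supporting_reads = ["r1", "r1", "r2", "r3"]
```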
[0179] As used herein, the term "BAM File" or "Binary file
containing Alignment Maps"
refers to a file storing sequencing data aligned to a reference sequence
(e.g., a reference
genome or exome). In some embodiments, a BAM file is a compressed binary
version of a
SAM (Sequence Alignment Map) file that includes, for each of a plurality of
unique sequence
reads, an identifier for the sequence read, information about the nucleotide
sequence,
information about the alignment of the sequence to a reference sequence, and
optionally
metrics relating to the quality of the sequence read and/or the quality of the
sequence
alignment. While BAM files generally relate to files having a particular
format, for
simplicity they are used herein to simply refer to a file, of any format,
containing information
about a sequence alignment, unless specifically stated otherwise.
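For orientation, the per-read information listed above corresponds to columns of a text SAM record, which has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The sketch below parses one such line using only the standard library. The `AlignedRead` class and the example record are hypothetical, and reading an actual compressed BAM file would require a dedicated library that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class AlignedRead:
    read_id: str          # QNAME: identifier for the sequence read
    reference: str        # RNAME: reference sequence the read aligned to
    position: int         # POS: 1-based leftmost aligned position
    mapping_quality: int  # MAPQ: quality of the alignment
    cigar: str            # CIGAR: how the read aligns to the reference
    sequence: str         # SEQ: nucleotide sequence of the read
    base_qualities: str   # QUAL: per-base quality string


def parse_sam_line(line: str) -> AlignedRead:
    """Parse the mandatory fields of a single SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return AlignedRead(
        read_id=fields[0],
        reference=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
        base_qualities=fields[10],
    )


# Made-up record for illustration only.
example = parse_sam_line("read1\t0\tchr7\t1000\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF")
```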
[0180] As used herein, the term "measure of central tendency"
refers to a central or
representative value for a distribution of values. Non-limiting examples of
measures of
central tendency include an arithmetic mean, weighted mean, midrange,
midhinge, trimean,
geometric mean, geometric median, Winsorized mean, median, and mode of the
distribution
of values.
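Several of the listed measures can be computed with Python's standard `statistics` module, as sketched below. The quartile convention (the "inclusive" method) is an assumption made for the example; other conventions give slightly different midhinge and trimean values. The function name and the input values are illustrative only.

```python
import statistics
from typing import Dict, Sequence


def central_tendency_summary(values: Sequence[float]) -> Dict[str, float]:
    """Compute a few measures of central tendency for positive values."""
    data = sorted(values)
    # Quartiles under the "inclusive" convention (an assumption of this sketch).
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "geometric_mean": statistics.geometric_mean(data),
        "midrange": (data[0] + data[-1]) / 2,
        "midhinge": (q1 + q3) / 2,
        "trimean": (q1 + 2 * q2 + q3) / 4,
    }


# Made-up values for illustration only.
summary = central_tendency_summary([1, 2, 2, 3, 4, 7, 9])
```

The measures generally disagree for a skewed distribution such as this one, which is why the choice of measure is stated explicitly wherever one is used.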
[0181] As used herein, the term "Positive Predictive Value" or
"PPV" means the
likelihood that a variant is properly called given that a variant has been
called by an assay.
PPV can be expressed as (number of true positives)/ (number of false positives
+ number of
true positives).
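The PPV expression above translates directly into code. In this sketch the function name and the choice to raise an error when no variants were called are assumptions.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = true positives / (false positives + true positives)."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("PPV is undefined when no variants were called")
    return true_positives / called
```

For example, three true positive calls and one false positive call give a PPV of 3 / 4.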
[0182] As used herein, the term "assay" refers to a technique for
determining a property
of a substance, e.g., a nucleic acid, a protein, a cell, a tissue, or an
organ. An assay (e.g., a
first assay or a second assay) can comprise a technique for determining the
copy number
variation of nucleic acids in a sample, the methylation status of nucleic
acids in a sample, the
fragment size distribution of nucleic acids in a sample, the mutational status
of nucleic acids
in a sample, or the fragmentation pattern of nucleic acids in a sample. Any
assay known to a
person having ordinary skill in the art can be used to detect any of the
properties of nucleic
acids mentioned herein. Properties of a nucleic acid can include a sequence,
genomic
identity, copy number, methylation state at one or more nucleotide positions,
size of the
nucleic acid, presence or absence of a mutation in the nucleic acid at one or
more nucleotide
positions, and pattern of fragmentation of a nucleic acid (e.g., the
nucleotide position(s) at
which a nucleic acid fragments). An assay or method can have a particular
sensitivity and/or
specificity, and its relative usefulness as a diagnostic tool can be
measured using ROC-AUC
statistics.
[0183] As used herein, the term "classification" can refer to any
number(s) or other
character(s) that are associated with a particular property of a sample. For
example, in some
embodiments, the term "classification" can refer to a type of cancer in a
subject, a stage of
cancer in a subject, a prognosis for a cancer in a subject, a tumor load, a
presence of tumor
metastasis in a subject, and the like. The classification can be binary (e.g.,
positive or
negative) or have more levels of classification (e.g., a scale from 1 to 10 or
0 to 1). The terms
"cutoff" and "threshold" can refer to predetermined numbers used in an
operation. For
example, a cutoff size can refer to a size above which fragments are excluded.
A threshold
value can be a value above or below which a particular classification applies.
Either of these
terms can be used in either of these contexts.
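The two uses described above, a cutoff that excludes items and a threshold that assigns a classification, can be sketched as follows. The function names, the labels, and the convention that a score at or above the threshold is classified as positive are assumptions for illustration.

```python
from typing import List


def exclude_large_fragments(fragment_sizes: List[int], cutoff_size: int) -> List[int]:
    """Keep fragments at or below the cutoff; larger fragments are excluded."""
    return [size for size in fragment_sizes if size <= cutoff_size]


def classify(score: float, threshold: float) -> str:
    """Binary classification: positive at or above the threshold (assumed)."""
    return "positive" if score >= threshold else "negative"


# Made-up fragment sizes (in base pairs) for illustration only.
kept = exclude_large_fragments([120, 166, 340, 700], cutoff_size=340)
```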
[0184] As used herein, the term "sensitivity" or "true positive
rate" (TPR) refers to the
number of true positives divided by the sum of the number of true positives
and false
negatives. Sensitivity can characterize the ability of an assay or method to
correctly identify
a proportion of the population that truly has a condition. For example,
sensitivity can
characterize the ability of a method to correctly identify the number of
subjects within a
population having cancer. In another example, sensitivity can characterize the
ability of a
method to correctly identify the one or more markers indicative of cancer.
[0185] As used herein, the term "specificity" or "true negative
rate" (TNR) refers to the
number of true negatives divided by the sum of the number of true negatives
and false
positives. Specificity can characterize the ability of an assay or method to
correctly identify a
proportion of the population that truly does not have a condition. For
example, specificity
can characterize the ability of a method to correctly identify the number of
subjects within a
population not having cancer. In another example, specificity characterizes
the ability of a
method to correctly identify one or more markers indicative of cancer.
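The sensitivity and specificity definitions in the two preceding paragraphs can be written as a pair of small functions. The function names and the error raised for an empty denominator are assumptions of this sketch.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    denominator = true_positives + false_negatives
    if denominator == 0:
        raise ValueError("sensitivity is undefined with no condition-positive cases")
    return true_positives / denominator


def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    denominator = true_negatives + false_positives
    if denominator == 0:
        raise ValueError("specificity is undefined with no condition-negative cases")
    return true_negatives / denominator
```

Note that sensitivity uses only the condition-positive cases and specificity uses only the condition-negative cases, so the two can be computed independently.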
[0186] As used herein, an "actionable genomic alteration" or
"actionable variant" refers
to a genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number
variation, or
ploidy variation), or value of another cancer metric derived from nucleic acid
sequencing data
(e.g., a tumor mutational burden, MSI status, or tumor fraction), that is
known or believed to
be associated with a therapeutic course of action that is more likely to
produce a positive
effect in a cancer patient that has the actionable variant than in a similarly
situated cancer
patient that does not have the actionable variant. For instance,
administration of EGFR
inhibitors (e.g., afatinib, erlotinib, gefitinib) is more effective for
treating non-small cell lung
cancer in patients with an EGFR mutation in exons 19/21 than for treating
non-small cell
lung cancer in patients that do not have an EGFR mutation in exons 19/21.
Accordingly, an
EGFR mutation in exon 19/21 is an actionable variant. In some instances, an
actionable
variant is only associated with an improved treatment outcome in one or a
group of specific
cancer types. In other instances, an actionable variant is associated with an
improved
treatment outcome in substantially all cancer types.
[0187] As used herein, a "variant of uncertain significance" or
"VUS" refers to a genomic
alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or
ploidy
variation), or value of another cancer metric derived from nucleic acid
sequencing data (e.g.,
a tumor mutational burden, MSI status, or tumor fraction), whose impact on
disease
development/progression is unknown.
[0188] As used herein, a "benign variant" or "likely benign
variant" refers to a genomic
alteration (e.g., a SNV, MNV, indel, rearrangement, copy number variation, or
ploidy
variation), or value of another cancer metric derived from nucleic acid
sequencing data (e.g.,
a tumor mutational burden, MSI status, or tumor fraction), that is known or
believed to not
contribute to disease development/progression.
[0189] As used herein, a "pathogenic variant" or "likely
pathogenic variant" refers to a
genomic alteration (e.g., a SNV, MNV, indel, rearrangement, copy number
variation, or
ploidy variation), or value of another cancer metric derived from nucleic acid
sequencing data
(e.g., a tumor mutational burden, MSI status, or tumor fraction), that is
known or believed to
contribute to disease development/progression.
[0190] As used herein, an "effective amount" or "therapeutically
effective amount" is an
amount sufficient to effect a beneficial or desired clinical result upon
treatment. An effective
amount can be administered to a subject in one or more doses. In terms of
treatment, an
effective amount is an amount that is sufficient to palliate, ameliorate,
stabilize, reverse or
slow the progression of the disease, or otherwise reduce the pathological
consequences of the
disease. The effective amount is generally determined by the physician on a
case-by-case
basis and is within the skill of one in the art. Several factors are typically
taken into account
when determining an appropriate dosage to achieve an effective amount. These
factors
include age, sex and weight of the subject, the condition being treated, the
severity of the
condition and the form and effective concentration of the therapeutic agent
being
administered.
[0191] The terminology used in the present disclosure is for the
purpose of describing
particular embodiments only and is not intended to be limiting of the
invention. As used in
the description of the invention and the appended claims, the singular forms
"a", "an" and
"the" are intended to include the plural forms as well, unless the context
clearly indicates
otherwise. It will also be understood that the term "and/or" as used herein
refers to and
encompasses any and all possible combinations of one or more of the associated
listed items.
It will be further understood that the terms "comprises" and/or "comprising,"
when used in
this specification, specify the presence of stated features, integers, steps,
operations,
elements, and/or components, but do not preclude the presence or addition of
one or more
other features, integers, steps, operations, elements, components, and/or
groups thereof.
Furthermore, to the extent that the terms "including," "includes," "having,"
"has," "with," or
variants thereof are used in either the detailed description and/or the
claims, such terms are
intended to be inclusive in a manner similar to the term "comprising."
[0192] As used herein, the term "if" may be construed to mean
"when" or "upon" or "in
response to determining" or "in response to detecting," depending on the
context. Similarly,
the phrase "if it is determined" or "if [a stated condition or event] is
detected" may be
construed to mean "upon determining" or "in response to determining" or "upon
detecting
[the stated condition or event]" or "in response to detecting [the stated
condition or event],"
depending on the context.
[0193] It will also be understood that, although the terms first,
second, etc. may be used
herein to describe various elements, these elements should not be limited by
these terms.
These terms are only used to distinguish one element from another. For
example, a first
subject could be termed a second subject, and, similarly, a second subject
could be termed a
first subject, without departing from the scope of the present disclosure. The
first subject and
the second subject are both subjects, but they are not the same subject.
Furthermore, the
terms "subject," "user," and "patient" are used interchangeably herein.
[0194] Reference will now be made in detail to embodiments,
examples of which are
illustrated in the accompanying drawings. In the following detailed
description, numerous
specific details are set forth in order to provide a thorough understanding of
the present
disclosure, including example systems, methods, techniques, instruction
sequences, and
computing machine program products that embody illustrative implementations.
However,
the illustrative discussions below are not intended to be exhaustive or to
limit the
implementations to the precise forms disclosed. Many modifications and
variations are
possible in view of the above teachings. The features described herein are not
limited by the
illustrated ordering of acts or events, as some acts can occur in different
orders and/or
concurrently with other acts or events.
[0195] The implementations provided herein are chosen and
described in order to best
explain the principles and their practical applications, to thereby enable
others skilled in the
art to best utilize the various embodiments with various modifications as are
suited to the
particular use contemplated. In some instances, well-known methods,
procedures,
components, circuits, and networks have not been described in detail so as not
to
unnecessarily obscure aspects of the embodiments. In other instances, it will
be apparent to
one of ordinary skill in the art that the present disclosure may be practiced
without one or
more of the specific details.
[0196] It will be appreciated that, in the development of any
such actual implementation,
numerous implementation-specific decisions are made in order to achieve the
designer's
specific goals, such as compliance with use case- and business-related
constraints, and that
these specific goals will vary from one implementation to another and from one
designer to
another. Moreover, it will be appreciated that though such a design effort
might be complex
and time-consuming, it will nevertheless be a routine undertaking of
engineering for those of
ordinary skill in the art having the benefit of the present disclosure.
Example System Embodiments
[0197] Now that an overview of some aspects of the present
disclosure and some
definitions used in the present disclosure have been provided, details of an
exemplary system
for providing clinical support for personalized cancer therapy using a liquid
biopsy assay are
now described in conjunction with Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2,
1F2, 1C3, and
1D3. Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2, 1C3, and 1D3 collectively
illustrate the
topology of an example system for providing clinical support for personalized
cancer therapy
using a liquid biopsy assay, in accordance with some embodiments of the
present disclosure.
Advantageously, the example system illustrated in Figures 1A, 1B, 1C1, 1D1,
1C2, 1D2,
1E2, 1F2, 1C3, and 1D3 improves upon conventional methods for providing
clinical support
for personalized cancer therapy by validating copy number variations, thus
identifying focal
copy number variations for actionable treatment, validating a somatic sequence
variant in a
test subject having a cancer condition, and/or determining circulating tumor
fraction
estimates using on-target and off-target sequence reads.
[0198] Figure 1A is a block diagram illustrating a system in
accordance with some
implementations. The device 100 in some implementations includes one or more
processing
units CPU(s) 102 (also referred to as processors), one or more network
interfaces 104, a user
interface 106, e.g., including a display 108 and/or an input 110 (e.g., a
mouse, touchpad,
keyboard, etc.), a non-persistent memory 111, a persistent memory 112, and one
or more
communication buses 114 for interconnecting these components. The one or more
communication buses 114 optionally include circuitry (sometimes called a
chipset) that
interconnects and controls communications between system components. The non-
persistent
memory 111 typically includes high-speed random access memory, such as DRAM,
SRAM,
DDR RAM, ROM, EEPROM, flash memory, whereas the persistent memory 112
typically
includes CD-ROM, digital versatile disks (DVD) or other optical storage,
magnetic cassettes,
magnetic tape, magnetic disk storage or other magnetic storage devices,
magnetic disk
storage devices, optical disk storage devices, flash memory devices, or other
non-volatile
solid state storage devices. The persistent memory 112 optionally includes one
or more
storage devices remotely located from the CPU(s) 102. The persistent memory
112, and the
non-volatile memory device(s) within the non-persistent memory 111, comprise
non-transitory
computer readable storage medium. In some implementations, the
non-persistent
memory 111 or alternatively the non-transitory computer readable storage
medium stores the
following programs, modules and data structures, or a subset thereof,
sometimes in
conjunction with the persistent memory 112:
• an operating system 116, which includes procedures for handling various
basic system
services and for performing hardware dependent tasks;
• a network communication module (or instructions) 118 for connecting the
system 100
with other devices and/or a communication network 105;
• a test patient data store 120 for storing one or more collections of
features from
patients (e.g., subjects);
• a bioinformatics module 140 for processing sequencing data and extracting
features
from sequencing data, e.g., from liquid biopsy sequencing assays;
• a feature analysis module 160 for evaluating patient features, e.g.,
genomic
alterations, compound genomic features, and clinical features; and
• a reporting module 180 for generating and transmitting reports that
provide clinical
support for personalized cancer therapy.
[0199] Although Figures 1A, 1B, 1C1, 1D1, 1C2, 1D2, 1E2, 1F2,
1C3, and 1D3 depict a
"system 100," the figures are intended more as a functional description of the
various features
that may be present in computer systems than as a structural schematic of the
implementations described herein. In practice, and as recognized by those of
ordinary skill in
the art, items shown separately could be combined and some items could be
separated.
Moreover, although Figure 1 depicts certain data and modules in non-persistent
memory 111,
some or all of these data and modules may be in persistent memory 112. For
example, in
various implementations, one or more of the above identified elements are
stored in one or
more of the previously mentioned memory devices and correspond to a set of
instructions for
performing a function described above. The above identified modules, data, or
programs
(e.g., sets of instructions) need not be implemented as separate software
programs,
procedures, datasets, or modules, and thus various subsets of these modules
and data may be
combined or otherwise re-arranged in various implementations.
[0200] In some implementations, the non-persistent memory 111
optionally stores a
subset of the modules and data structures identified above. Furthermore, in
some
embodiments, the memory stores additional modules and data structures not
described above.
In some embodiments, one or more of the above-identified elements is stored in
a computer
system, other than that of system 100, that is addressable by system 100 so
that system 100
may retrieve all or a portion of such data when needed.
[0201] For purposes of illustration in Figure 1A, system 100 is
represented as a single
computer that includes all of the functionality for providing clinical support
for personalized
cancer therapy. However, while a single machine is illustrated, the term
"system" shall also
be taken to include any collection of machines that individually or jointly
execute a set (or
multiple sets) of instructions to perform any one or more of the methodologies
discussed
herein.
[0202] For example, in some embodiments, system 100 includes one
or more computers.
In some embodiments, the functionality for providing clinical support for
personalized cancer
therapy is spread across any number of networked computers and/or resides on
each of
several networked computers and/or is hosted on one or more virtual machines
at a remote
location accessible across the communications network 105. For example,
different portions
of the various modules and data stores illustrated in Figures 1A, 1B, 1C1,
1D1, 1C2, 1D2,
1E2, 1F2, 1C3, and 1D3 can be stored and/or executed on the various instances
of a
processing device and/or processing server/database in the distributed
diagnostic environment
210 illustrated in Figure 2B (e.g., processing devices 224, 234, 244, and 254,
processing
server 262, and database 264).
[0203] The system may operate in the capacity of a server or a
client machine in a
client-server network environment, as a peer machine in a peer-to-peer (or
distributed) network
environment, or as a server or a client machine in a cloud computing
infrastructure or
environment. The system may be a personal computer (PC), a tablet PC, a set-
top box (STB),
a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a
server, a network
router, a switch or bridge, or any machine capable of executing a set of
instructions
(sequential or otherwise) that specify actions to be taken by that machine.
[0204] In another implementation, the system comprises a virtual
machine that includes a
module for executing instructions for performing any one or more of the
methodologies
disclosed herein. In computing, a virtual machine (VM) is an emulation of a
computer
system that is based on computer architectures and provides functionality of a
physical
computer. Some such implementations may involve specialized hardware,
software, or a
combination of hardware and software.
[0205] One of skill in the art will appreciate that any of a wide
array of different
computer topologies can be used for the application and all such topologies are
within the scope
of the present disclosure.
Test Patient Data Store (120)
[0206] Referring to Figure 1B, in some embodiments, the system
(e.g., system 100)
includes a patient data store 120 that stores data for patients 121-1 to 121-M
(e.g., cancer
patients or patients being tested for cancer) including one or more sequencing
data 122,
feature data 125, and clinical assessments 139. These data are used and/or
generated by the
various processes stored in the bioinformatics module 140 and feature analysis
module 160 of
system 100, to ultimately generate a report providing clinical support for
personalized cancer
therapy of a patient. While the feature scope of patient data 121 across all
patients may be
informationally dense, an individual patient's feature set may be sparsely
populated across
the entirety of the collective feature scope of all features across all
patients. That is to say,
the data stored for one patient may include a different set of features that
the data stored for
another patient. Further, while illustrated as a single data construct in
Figure 1B, different
sets of patient data may be stored in different databases or modules spread
across one or more
system memories.
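The relationship between a dense collective feature scope and a sparsely populated individual record can be sketched with a small data structure. The class name, the field names, and the example feature keys and values are hypothetical; the reference numerals in the comments only point back to the elements described above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Set


@dataclass
class PatientRecord:
    patient_id: str
    sequencing_data: List[str] = field(default_factory=list)       # cf. sequencing data 122
    feature_data: Dict[str, Any] = field(default_factory=dict)     # cf. feature data 125
    clinical_assessments: List[str] = field(default_factory=list)  # cf. clinical assessments 139


def collective_feature_scope(records: Iterable[PatientRecord]) -> Set[str]:
    """Union of every feature name stored for any patient."""
    scope: Set[str] = set()
    for record in records:
        scope.update(record.feature_data)
    return scope


def missing_features(record: PatientRecord, scope: Set[str]) -> Set[str]:
    """Features in the collective scope that this patient does not have."""
    return scope - set(record.feature_data)


# Made-up records for illustration only.
records = [
    PatientRecord("121-1", feature_data={"smoking_status": "never", "tmb": 4.2}),
    PatientRecord("121-2", feature_data={"msi_status": "stable"}),
]
scope = collective_feature_scope(records)
```

Storing features in a per-patient mapping, rather than a fixed table with one column per feature, is one way to avoid carrying empty entries for features a patient lacks.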
[0207] In some embodiments, sequencing data 122 from one or more
sequencing
reactions 122-i, including a plurality of sequence reads 123-1 to 123-K, is
stored in the test
patient data store 120. The data store may include different sets of
sequencing data from a
single subject, corresponding to different samples from the patient, e.g., a
tumor sample,
liquid biopsy sample, tumor organoid derived from a patient tumor, and/or a
normal sample,
and/or to samples acquired at different times, e.g., while monitoring the
progression,
regression, remission, and/or recurrence of a cancer in a subject. The
sequence reads may be
in any suitable file format, e.g., BCL, FASTA, FASTQ, etc. In some
embodiments,
sequencing data 122 is accessed by a sequencing data processing module 141,
which
performs various pre-processing, genome alignment, and demultiplexing
operations, as
described in detail below with reference to bioinformatics module 140. In some
embodiments, sequence data that has been aligned to a reference construct,
e.g., BAM file
124, is stored in test patient data store 120.
[0208] In some embodiments, the test patient data store 120
includes feature data 125,
e.g., that is useful for identifying clinical support for personalized cancer
therapy. In some
embodiments, the feature data 125 includes personal characteristics 126 of the
patient, such
as patient name, date of birth, gender, ethnicity, physical address, smoking
status, alcohol
consumption characteristic, anthropometric data, etc.
[0209] In some embodiments, the feature data 125 includes medical
history data 127 for
the patient, such as cancer diagnosis information (e.g., date of initial
diagnosis, date of
metastatic diagnosis, cancer staging, tumor characterization, tissue of
origin, previous
treatments and outcomes, adverse effects of therapy, therapy group history,
clinical trial
history, previous and current medications, surgical history, etc.), previous
or current
symptoms, previous or current therapies, previous treatment outcomes, previous
disease
diagnoses, diabetes status, diagnoses of depression, diagnoses of other
physical or mental
maladies, and family medical history. In some embodiments, the feature data
125 includes
clinical features 128, such as pathology data 128-1, medical imaging data
128-2, and tissue
culture and/or tissue organoid culture data 128-3.
[0210] In some embodiments, yet other clinical features, such as
previous laboratory
testing results, are stored in the test patient data store 120. Medical
history data 127 and
clinical features may be collected from various sources, including at intake
directly from the
patient, from an electronic medical record (EMR) or electronic health record
(EHR) for the
patient, or curated from other sources, such as fields from various testing
records (e.g.,
genetic sequencing reports).
[0211] In some embodiments, the feature data 125 includes genomic
features 131 for the
patient. Non-limiting examples of genomic features include allelic states 132
(e.g., the
identity of alleles at one or more loci, support for wild type or variant
alleles at one or more
loci, support for SNVs/MNVs at one or more loci, support for indels at one or
more loci,
and/or support for gene rearrangements at one or more loci), allelic fractions
133 (e.g., ratios
of variant to reference alleles (or vice versa)), methylation states 134 (e.g.,
a distribution of
methylation patterns at one or more loci and/or support for aberrant
methylation patterns at
one or more loci), genomic copy numbers 135 (e.g., a copy number value at one
or more loci
and/or support for an aberrant (increased or decreased) copy number at one or
more loci),
tumor mutational burden 136 (e.g., a measure of the number of mutations in the
cancer
genome of the subject), and microsatellite instability status 137 (e.g., a
measure of the
repeated unit length at one or more microsatellite loci and/or a
classification of the MSI status
for the patient's cancer). In some embodiments, one or more of the genomic
features 131 are
determined by a nucleic acid bioinformatics pipeline, e.g., as described in
detail below with
reference to Figure 4 (e.g., Figures 4A-E, 4F1, 4F2, and 4F3). In particular,
in some
embodiments, the feature data 125 include genomic copy numbers 135 (e.g.,
135-1 for
Patient 1 121-1), variant allele fractions 133, and/or circulating tumor
fraction estimates 131-i,
as determined using the improved methods for analyzing copy number variations
(CNVs)
using the copy number variation analysis module 153, validating somatic
sequence variants,
and/or determining circulating tumor fraction estimates, and as described in
further detail
below with reference to Figures 1 and 4 (e.g., Figures 1C1, 1D1, 4F1; Figures
1C2, 1D2, and
4F2; and/or Figures 1C3, 1D3, and 4F3). In some embodiments, one or more of
the genomic
features 131 are obtained from an external testing source, e.g., not connected
to the
bioinformatics pipeline as described below.
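As one illustration of the allelic fractions 133 mentioned above, the sketch below computes both a variant-to-reference ratio and a variant allele fraction from read counts at a single locus. Defining the variant allele fraction as variant reads divided by the sum of variant and reference reads is a common convention and an assumption here; the function names and counts are hypothetical.

```python
def variant_to_reference_ratio(variant_reads: int, reference_reads: int) -> float:
    """Ratio of variant-supporting reads to reference-supporting reads."""
    if reference_reads == 0:
        raise ValueError("ratio is undefined with no reference-supporting reads")
    return variant_reads / reference_reads


def variant_allele_fraction(variant_reads: int, reference_reads: int) -> float:
    """Fraction of reads at the locus that support the variant (assumed convention)."""
    total = variant_reads + reference_reads
    if total == 0:
        raise ValueError("fraction is undefined with no reads at the locus")
    return variant_reads / total
```

With 5 variant reads and 15 reference reads, the fraction is 5 / 20 while the ratio is 5 / 15, so the two quantities should not be used interchangeably.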
[0212] For example, referring to Figure 1C1, the one or more
genomic features 131
include genomic copy numbers 135 comprising liquid biopsy genomic copy numbers
135-cf
and optional tumor biopsy genomic copy numbers 135-t, in accordance with some
embodiments of the present disclosure. In some embodiments, the liquid biopsy
genomic
copy numbers 135-cf are determined by a nucleic acid bioinformatics pipeline
(e.g., as
described in detail below with reference to Figures 4A-E and 4F1) using a
plurality of
sequence reads 123 obtained from a sequencing of cell-free nucleic acids from
a liquid biopsy
sample. In some embodiments, the liquid biopsy genomic copy numbers comprise
a plurality
of copy number annotations (e.g., 135-cf-1, 135-cf-2, ..., 135-cf-N), where
each copy number
annotation corresponds to a genomic target (e.g., a gene or a region of a
genome). In some
embodiments, a copy number annotation comprises a qualitative status and/or a
quantitative
copy number. In some alternative embodiments, the optional tumor biopsy
genomic copy
numbers 135-t are determined by a nucleic acid bioinformatics pipeline using a
plurality of
sequence reads 123 obtained from a sequencing of nucleic acids from a tumor
(e.g., tissue)
biopsy. In some embodiments, the optional tumor biopsy genomic copy numbers
comprise a
plurality of optional copy number annotations (e.g., 135-t-1, 135-t-2,...,
135-t-O),
where each copy number annotation corresponds to a genomic target (e.g., a
gene or a region
of a genome).
[0213] Referring again to Figure 1B, in some embodiments, the
feature data 125 further
includes data 138 from other -omics fields of study. Non-limiting examples of
-omics fields
of study that may yield feature data useful for providing clinical support for
personalized
cancer therapy include transcriptomics, epigenomics, proteomics, metabolomics,
metabonomics, microbiomics, lipidomics, glycomics, cellomics, and
organoidomics.
[0214] In some embodiments, yet other features may include
features derived from
machine learning approaches, e.g., based at least in part on evaluation of any
relevant
molecular or clinical features, considered alone or in combination, not
limited to those listed
above. For instance, in some embodiments, one or more latent features learned
from
evaluation of cancer patient training datasets improve the diagnostic and
prognostic power of
the various analysis algorithms in the feature analysis module 160.
[0215] The skilled artisan will know of other types of features
useful for providing
clinical support for personalized cancer therapy. The listing of features
above is merely
representative and should not be construed to be limiting.
[0216] In some embodiments, a test patient data store 120
includes clinical assessment
data 139 for patients, e.g., based on the feature data 125 collected for the
subject. In some
embodiments, the clinical assessment data 139 includes a catalogue of
actionable variants and
characteristics 139-1 (e.g., genomic alterations and compound metrics based on
genomic
features known or believed to be targetable by one or more specific cancer
therapies),
matched therapies 139-2 (e.g., the therapies known or believed to be
particularly beneficial
for treatment of subjects having actionable variants), and/or clinical reports
139-3 generated
for the subject, e.g., based on identified actionable variants and
characteristics 139-1 and/or
matched therapies 139-2.
[0217] In some embodiments, clinical assessment data 139 is
generated by analysis of
feature data 125 using the various algorithms of feature analysis module 160,
as described in
further detail below. In some embodiments, clinical assessment data 139 is
generated,
modified, and/or validated by evaluation of feature data 125 by a clinician,
e.g., an
oncologist. For instance, in some embodiments, a clinician (e.g., at clinical
environment 220)
uses feature analysis module 160, or accesses test patient data store 120
directly, to evaluate
feature data 125 to make recommendations for personalized cancer treatment of
a patient.
Similarly, in some embodiments, a clinician (e.g., at clinical environment
220) reviews
recommendations determined using feature analysis module 160 and approves,
rejects, or
modifies the recommendations, e.g., prior to the recommendations being sent to
a medical
professional treating the cancer patient.
Bioinformatics Module (140)
[0218] Referring again to Figure 1A, the system (e.g., system 100)
includes a
bioinformatics module 140 that includes a feature extraction module 145 and
optional
ancillary data processing constructs, such as a sequence data processing
module 141 and/or
one or more reference sequence constructs 158 (e.g., a reference genome,
exome, or targeted-
panel construct that includes reference sequences for a plurality of loci
targeted by a
sequencing panel).
[0219] In some embodiments, bioinformatics module 140 includes a
sequence data
processing module 141 that includes instructions for processing sequence
reads, e.g., raw
sequence reads 123 from one or more sequencing reactions 122-i, prior to
analysis by the
various feature extraction algorithms, as described in detail below. In some
embodiments,
sequence data processing module 141 includes one or more pre-processing
algorithms 142
that prepare the data for analysis. In some embodiments, the pre-processing
algorithms 142
include instructions for converting the file format of the sequence reads from
the output of
the sequencer (e.g., a BCL file format) into a file format compatible with
downstream
analysis of the sequences (e.g., a FASTQ or FASTA file format). In some
embodiments, the
pre-processing algorithms 142 include instructions for evaluating the quality
of the sequence
reads (e.g., by interrogating quality metrics like Phred score, base-calling
error probabilities,
Quality (Q) scores, and the like) and/or removing sequence reads that do not
satisfy a
threshold quality (e.g., an inferred base call accuracy of at least 80%, at
least 90%, at least
95%, at least 99%, at least 99.5%, at least 99.9%, or higher). In some
embodiments, the pre-
processing algorithms 142 include instructions for filtering the sequence
reads for one or
more properties, e.g., removing sequences failing to satisfy a lower or upper
size threshold or
removing duplicate sequence reads.
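By way of illustration only, the quality and property filters described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function names, the Phred+33 quality encoding, the use of a mean per-read accuracy, and the removal of duplicates by exact sequence identity are all hypothetical choices made for the example.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, str]  # (sequence, quality string); hypothetical minimal record


def phred_to_accuracy(q: int) -> float:
    """Inferred base-call accuracy for Phred score Q: 1 - 10^(-Q/10)."""
    return 1.0 - 10.0 ** (-q / 10.0)


def mean_accuracy(quality: str, offset: int = 33) -> float:
    """Mean inferred accuracy of a read, assuming Phred+33 encoded qualities."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(phred_to_accuracy(q) for q in scores) / len(scores)


def filter_reads(
    reads: Iterable[Read],
    min_accuracy: float = 0.99,  # assumed threshold, e.g. "at least 99%"
    min_len: int = 1,            # assumed lower size threshold
    max_len: int = 10_000,       # assumed upper size threshold
) -> List[Read]:
    """Drop reads that fail the size or quality thresholds, then duplicates."""
    seen = set()
    kept: List[Read] = []
    for seq, qual in reads:
        if not (min_len <= len(seq) <= max_len):
            continue  # fails the size thresholds
        if mean_accuracy(qual) < min_accuracy:
            continue  # fails the quality threshold
        if seq in seen:
            continue  # simplistic duplicate removal by exact sequence
        seen.add(seq)
        kept.append((seq, qual))
    return kept
```

In practice, duplicate reads are typically identified from alignment coordinates and/or molecular tags rather than from raw sequence identity; the exact-match rule here only keeps the example short.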
[0220] In some embodiments, sequence data processing module 141
includes one or more
alignment algorithms 143, for aligning pre-processed sequence reads 123 to a
reference
sequence construct 158, e.g., a reference genome, exome, or targeted-panel
construct. Many
algorithms for aligning sequencing data to a reference construct are known in
the art, for
example, BWA, Blat, SHRiMP, LastZ, and MAQ. One example of a sequence read
alignment package is the Burrows-Wheeler Alignment tool (BWA), which uses a
Burrows-
Wheeler Transform (BWT) to align short sequence reads against a large
reference construct,
allowing for mismatches and gaps. Li and Durbin, Bioinformatics,
25(14):1754-60 (2009),
the content of which is incorporated herein by reference, in its entirety, for
all purposes.
Sequence read alignment packages import raw or pre-processed sequence reads
122, e.g., in
BCL, FASTA, or FASTQ file formats, and output aligned sequence reads 124,
e.g., in SAM
or BAM file formats.
[0221] In some embodiments, sequence data processing module 141
includes one or more
demultiplexing algorithms 144, for dividing sequence read or sequence
alignment files
generated from sequencing reactions of pooled nucleic acids into separate
sequence read or
sequence alignment files, each of which corresponds to a different source of
nucleic acids in
the nucleic acid sequencing pool. For instance, because of the cost of
sequencing reactions, it
is common practice to pool nucleic acids from a plurality of samples into a
single sequencing
reaction. The nucleic acids from each sample are tagged with a sample-specific
and/or
molecule-specific sequence tag (e.g., a UMI), which is sequenced along with
the molecule.
In some embodiments, demultiplexing algorithms 144 sort these sequence tags in
the
sequence read or sequence alignment files to demultiplex the sequencing data
into separate
files for each of the samples included in the sequencing reaction.
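As a non-limiting illustration, the demultiplexing step described above can be sketched in Python as follows. The sketch assumes, purely for the example, that each read carries a fixed-length sample tag at its 5' end and that tags must match exactly; the function name, the record layout, and the "undetermined" bucket are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (read identifier, sequence); hypothetical record


def demultiplex(
    reads: Iterable[Read],
    tag_to_sample: Dict[str, str],
    tag_len: int,
) -> Dict[str, List[Read]]:
    """Split pooled reads into per-sample lists using an inline sequence tag."""
    by_sample: Dict[str, List[Read]] = defaultdict(list)
    for read_id, seq in reads:
        tag, insert = seq[:tag_len], seq[tag_len:]
        # Reads whose tag is not recognized are set aside, not discarded.
        sample = tag_to_sample.get(tag, "undetermined")
        by_sample[sample].append((read_id, insert))
    return dict(by_sample)
```

Each value of the returned dictionary corresponds to one of the separate per-sample files described above. A production demultiplexer would usually also tolerate a small number of mismatches in the tag.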
[0222] Bioinformatics module 140 includes a feature extraction
module 145, which
includes instructions for identifying diagnostic features, e.g., genomic
features 131, from
sequencing data 122 of biological samples from a subject, e.g., one or more of
a solid tumor
sample, a liquid biopsy sample, or a normal tissue (e.g., control) sample. For
instance, in
some embodiments, a feature extraction algorithm compares the identity of one
or more
nucleotides at a locus from the sequencing data 122 to the identity of the
nucleotides at that
locus in a reference sequence construct (e.g., a reference genome, exome, or
targeted-panel
construct) to determine whether the subject has a variant at that locus. In
some embodiments,
a feature extraction algorithm evaluates data other than the raw sequence, to
identify a
genomic alteration in the subject, e.g., an allelic ratio, a relative copy
number, a repeat unit
distribution, etc.
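For illustration, the comparison of observed nucleotides against the reference at a single locus can be sketched in Python as follows. This is a simplified sketch only: the function name, the minimum depth, and the minimum alternate-allele fraction are assumed values chosen for the example and are not taken from the disclosure.

```python
from collections import Counter
from typing import Any, Dict, Optional, Sequence


def call_locus(
    reference_base: str,
    observed_bases: Sequence[str],
    min_depth: int = 10,             # assumed minimum coverage
    min_alt_fraction: float = 0.05,  # assumed minimum allelic ratio
) -> Optional[Dict[str, Any]]:
    """Return the most frequent non-reference base at a locus, or None."""
    depth = len(observed_bases)
    if depth < min_depth:
        return None  # too few reads to evaluate the locus
    counts = Counter(observed_bases)
    alt_counts = {b: n for b, n in counts.items() if b != reference_base}
    if not alt_counts:
        return None  # every read matches the reference
    alt_base, alt_count = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = alt_count / depth
    if fraction < min_alt_fraction:
        return None  # alternate allele too rare to report
    return {
        "ref": reference_base,
        "alt": alt_base,
        "alt_count": alt_count,
        "depth": depth,
        "allele_fraction": fraction,
    }
```

The returned allele fraction is one example of a feature, other than the raw sequence, that a downstream algorithm may evaluate.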
[0223] For instance, in some embodiments, feature extraction
module 145 includes one or
more variant identification modules that include instructions for various
variant calling
processes. In some embodiments, variants in the germline of the subject are
identified, e.g.,
using a germline variant identification module 146. In some embodiments,
variants in the
cancer genome, e.g., somatic variants, are identified, e.g., using a somatic
variant
identification module 150. While separate germline and somatic variant
identification
modules are illustrated in Figure 1A, in some embodiments they are integrated
into a single
module. In some embodiments, the variant identification module includes
instructions for
identifying one or more of nucleotide variants (e.g., single nucleotide
variants (SNV) and
multi-nucleotide variants (MNV)) using one or more SNV/MNV calling algorithms
(e.g.,
algorithms 147 and/or 151), indels (e.g., insertions or deletions of
nucleotides) using one or
more indel calling algorithms (e.g., algorithms 148 and/or 152), and genomic
rearrangements
(e.g., inversions, translocation, and fusions of nucleotide sequences) using
one or more
genomic rearrangement calling algorithms (e.g., algorithms 149 and/or 153).
[0224] For example, referring to Figures 1C2 and 1D2, in some
embodiments, feature
extraction module 145 comprises, in the variant identification module 146, a
variant
thresholding module 146-a, a sequence variant data store 146-r, and a variant
validation
module 146-o. In some such embodiments, the sequence variant data store 146-r
comprises
one or more candidate variants for a test subject identified by aligning to a
reference
sequence a plurality of sequence reads obtained from sequencing a liquid
biopsy sample of
the test subject, the one or more candidate variants corresponding to a
respective one or more
loci in the reference sequence. The plurality of sequence reads aligned to the
reference
sequence is used to identify a variant allele fragment count for each
candidate variant. The
sequence variant data store 146-r further comprises, in some embodiments, a
plurality of
variants from a first set of nucleic acids obtained from a cohort of subjects
(e.g., from a tumor
tissue biopsy for each subject in a baseline cohort of subjects). The variant
thresholding
module 146-a performs a function for each candidate variant in the one or more
candidate
variants where, for each corresponding locus 146-b (e.g., 146-b-1,...,
146-b-P), a dynamic
variant count threshold 146-d (e.g., 146-d-1) is obtained based on a pre-test
odds of a positive
variant call for the locus, based on the prevalence of variants in the genomic
region that
includes the locus, using the plurality of variants for the baseline cohort.
The variant
thresholding module 146-a compares the variant allele fragment count 146-c
(e.g., 146-c-1)
for the candidate variant against the dynamic variant count threshold 146-d
for the locus
corresponding to the candidate variant. In some embodiments, the variant
validation module
146-o determines whether the candidate variant is validated or rejected as a
somatic sequence
variant based on the comparison. For example, when the variant allele fragment
count for the
candidate variant satisfies the dynamic variant count threshold for the locus,
the somatic
sequence variant is validated, and when the variant allele fragment count for
the candidate
variant does not satisfy the dynamic variant count threshold for the locus,
the somatic
sequence variant is rejected.
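The validation logic described above can be illustrated with the following Python sketch. It is a hypothetical simplification and not the disclosed method: the mapping from pre-test odds to a count threshold (a lower threshold where variants are prevalent in the baseline cohort, a higher one elsewhere), the specific counts, the odds cutoff, and the reading of "satisfies" as "greater than or equal to" are all assumptions made for the example.

```python
def pretest_odds(prevalence: float) -> float:
    """Convert a regional variant prevalence in [0, 1) to odds p / (1 - p)."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return prevalence / (1.0 - prevalence)


def dynamic_threshold(
    prevalence: float,
    low_count: int = 3,         # assumed threshold where variants are common
    high_count: int = 5,        # assumed threshold where variants are rare
    odds_cutoff: float = 0.01,  # assumed cutoff on the pre-test odds
) -> int:
    """Pick a variant allele fragment count threshold for a locus."""
    return low_count if pretest_odds(prevalence) >= odds_cutoff else high_count


def validate_variant(variant_fragment_count: int, prevalence: float) -> bool:
    """True if the candidate is validated, False if it is rejected."""
    return variant_fragment_count >= dynamic_threshold(prevalence)
```

Under these assumed values, a candidate supported by four variant fragments would be validated at a locus in a frequently mutated region and rejected at a locus in a rarely mutated one.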
[0225] In some embodiments, the dynamic variant count threshold
is determined based
on a distribution of variant detection sensitivities as a function of
circulating variant allele
fraction from the cohort of subjects (e.g., the baseline cohort). For example,
referring to
Figure 1C2, in some such embodiments, the variant thresholding module 146-a
takes as input
one or more variant allele fractions 133 from the genomic features module 131.
In some such
embodiments, the variant allele fractions 133 comprise a plurality of variant
allele fractions
obtained from tumor tissue biopsies 133-t (e.g., 133-t-1, 133-t-2,..., 133-t-O)
for the cohort of
subjects. In some embodiments, the variant allele fractions comprise a
plurality of variant
allele fractions obtained from liquid biopsy samples 133-cf (e.g., 133-cf-1,
133-cf-2,..., 133-cf-N)
for the cohort of subjects. In some embodiments, the circulating variant
allele fraction
is obtained by comparing the liquid biopsy variant allele fractions 133-cf to
the tumor biopsy
variant allele fraction 133-t.
[0226] Additional embodiments for using variant allele fractions
(e.g., variant allele
frequencies) to identify somatic variants are detailed below (see, Example
Methods: Variant
Identification).
[0227] A SNV/MNV algorithm 147 may identify a substitution of a
single nucleotide that
occurs at a specific position in the genome. For example, at a specific base
position, or locus,
in the human genome, the C nucleotide may appear in most individuals, but in a
minority of
individuals, the position is occupied by an A. This means that there is a SNP
at this specific
position and the two possible nucleotide variations, C or A, are said to be
alleles for this
position. SNPs underlie differences in human susceptibility to a wide range of
diseases (e.g.,
sickle-cell anemia, β-thalassemia and cystic fibrosis result from SNPs). The
severity of
illness and the way the body responds to treatments are also manifestations of
genetic
variations. For example, a single-base mutation in the APOE (apolipoprotein E)
gene is
associated with a lower risk for Alzheimer's disease. A single-nucleotide
variant (SNV) is a
variation in a single nucleotide without any limitations of frequency and may
arise in somatic
cells. A somatic single-nucleotide variation (e.g., caused by cancer) may also
be called a
single-nucleotide alteration. An MNP (multiple-nucleotide polymorphism)
module may
identify the substitution of consecutive nucleotides at a specific position in
the genome.
[0228] An indel calling algorithm 148 may identify an insertion
or deletion of bases in
the genome of an organism classified among small genetic variations. While
indels usually
measure from 1 to 10,000 base pairs in length, a microindel is defined as an
indel that results
in a net change of 1 to 50 nucleotides. Indels can be contrasted with a SNP or
point mutation.
An indel inserts and/or deletes nucleotides from a sequence, while a point
mutation is a form
of substitution that replaces one of the nucleotides without changing the
overall number in the
DNA. Indels, being insertions and/or deletions, can be used as genetic markers
in natural
populations, especially in phylogenetic studies. Indel frequency tends to be
markedly lower
than that of single nucleotide polymorphisms (SNP), except near highly
repetitive regions,
including homopolymers and microsatellites.
[0229] A genomic rearrangement algorithm 149 may identify hybrid
genes formed from
two previously separate genes. It can occur as a result of translocation,
interstitial deletion, or
chromosomal inversion. Gene fusion can play an important role in
tumorigenesis. Fusion
genes can contribute to tumor formation because fusion genes can produce much
more active
abnormal protein than non-fusion genes. Often, fusion genes are oncogenes that
cause
cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12;21)), AML1-ETO (M2 AML
with t(8;21)), and TMPRSS2-ERG with an interstitial deletion on chromosome
21, often
occurring in prostate cancer. In the case of TMPRSS2-ERG, by disrupting
androgen receptor
(AR) signaling and inhibiting AR expression by oncogenic ETS transcription
factor, the
fusion product regulates prostate cancer. Most fusion genes are found from
hematological
cancers, sarcomas, and prostate cancer. BCAM-AKT2 is a fusion gene that is
specific and
unique to high-grade serous ovarian cancer. Oncogenic fusion genes may lead to
a gene
product with a new or different function from the two fusion partners.
Alternatively, a proto-
oncogene is fused to a strong promoter, and thereby the oncogenic function is
set to function
by an upregulation caused by the strong promoter of the upstream fusion
partner. The latter
is common in lymphomas, where oncogenes are juxtaposed to the promoters of the
immunoglobulin genes. Oncogenic fusion transcripts may also be caused by
trans-splicing or
read-through events. Since chromosomal translocations play such a significant
role in
neoplasia, a specialized database of chromosomal aberrations and gene fusions
in cancer has
been created. This database is called Mitelman Database of Chromosome
Aberrations and
Gene Fusions in Cancer.
[0230] In some embodiments, feature extraction module 145
includes instructions for
identifying one or more complex genomic alterations (e.g., features that
incorporate more
than a change in the primary sequence of the genome) in the cancer genome of
the subject.
For instance, in some embodiments, feature extraction module 145 includes
modules for
identifying one or more of copy number variation (e.g., copy number variation
analysis
module 153), microsatellite instability status (e.g., microsatellite
instability analysis module
154), tumor mutational burden (e.g., tumor mutational burden analysis module
155), tumor
ploidy (e.g., tumor ploidy analysis module 156), and homologous
recombination pathway
deficiencies (e.g., homologous recombination pathway analysis module 157).
[0231] For example, referring to Figure 1D1, the copy number
variation analysis module
153 performs a method that validates a copy number annotation of a genomic
segment in a
test subject, in accordance with some embodiments of the present disclosure.
The method
comprises obtaining an input data store 153-r (e.g., a dataset), where the
input data store
includes a bin-level sequence ratio data structure 153-r-1 containing a
plurality of bin-level
sequence ratios; a segment-level sequence ratio data structure 153-r-2
containing a plurality
of segment-level sequence ratios; and a segment-level dispersion measure data
structure 153-r-3
containing a plurality of segment-level measures of dispersion. In some
embodiments,
the method further comprises passing the data in the input data store 153-r to
an
amplification/deletion filter construct 153-a, thus applying the dataset to a
plurality of filters.
The amplification/deletion filter construct 153-a comprises a plurality of
filters, including an
optional measure of central tendency bin-level sequence ratio filter 153-a-1;
an optional
segment-level measure of dispersion confidence filter 153-a-2; an optional
measure of central
tendency-plus-deviation bin-level sequence ratio filter 153-a-3; and/or an
optional segment-
level sequence ratio filter 153-a-4. In some embodiments, the copy number
variation analysis
module further provides an output via the validation construct 153-o, where,
when a filter in
the amplification/deletion filter construct 153-a is fired, the copy number
annotation of the
genomic segment is rejected, and when no filter in the amplification/deletion
filter construct
153-a is fired, the copy number annotation of the genomic segment is
validated. In some
embodiments, copy number annotations validated using the copy number variation
analysis
module 153 in the feature extraction module 145 are used to populate the
plurality of
genomic copy numbers 135 in the one or more genomic features 131 of the test
patient data
store 120.
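As a non-limiting illustration, the rule that a copy number annotation is rejected when any filter fires, and validated when none fires, can be sketched in Python for a candidate amplification as follows. The data layout, the filter definitions, and every numeric threshold are hypothetical; in particular, treating the sequence ratios as log2 coverage ratios and using the median and population standard deviation as the measures of central tendency and deviation are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median, pstdev
from typing import Callable, List, Sequence, Tuple


@dataclass
class SegmentData:
    bin_ratios: Sequence[float]  # bin-level sequence ratios (assumed log2)
    segment_ratio: float         # segment-level sequence ratio
    segment_dispersion: float    # segment-level measure of dispersion


# Each filter returns True when it "fires", i.e. the annotation looks unreliable.
def median_bin_ratio_filter(d: SegmentData, min_median: float = 0.5) -> bool:
    return median(d.bin_ratios) < min_median


def segment_dispersion_filter(d: SegmentData, max_dispersion: float = 0.3) -> bool:
    return d.segment_dispersion > max_dispersion


def median_plus_deviation_filter(d: SegmentData, min_lower: float = 0.3) -> bool:
    return median(d.bin_ratios) - pstdev(d.bin_ratios) < min_lower


def segment_ratio_filter(d: SegmentData, min_ratio: float = 0.5) -> bool:
    return d.segment_ratio < min_ratio


DEFAULT_FILTERS: Tuple[Callable[[SegmentData], bool], ...] = (
    median_bin_ratio_filter,
    segment_dispersion_filter,
    median_plus_deviation_filter,
    segment_ratio_filter,
)


def validate_amplification(
    d: SegmentData,
    filters: Sequence[Callable[[SegmentData], bool]] = DEFAULT_FILTERS,
) -> Tuple[str, List[str]]:
    """Return ("validated", []) or ("rejected", [names of fired filters])."""
    fired = [f.__name__ for f in filters if f(d)]
    return ("rejected" if fired else "validated", fired)
```

Because every filter is optional in the description above, the sketch accepts any subset of filters through the `filters` argument.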
[0232] As another example, referring to Figure 1D3, in some
embodiments, feature
extraction module 145 comprises a tumor fraction estimation module 145-tf. In
some
embodiments, the tumor fraction estimation module 145-tf comprises a sequence
ratio data
structure 145-tf-r including a plurality of sequence ratios (e.g., coverage
ratios) obtained from
a sequencing of a test liquid biopsy sample of a subject. In some embodiments,
the sequence
ratio data structure 145-tf-r includes the sequence ratios that are used as
input to determine
tumor fraction estimates for the test liquid biopsy sample. In some
embodiments, the tumor
fraction estimation module 145-tf also comprises a tumor purity algorithm
construct 145-tf-a
that executes, for example, a maximum likelihood estimation (e.g., an
expectation-
maximization algorithm) to calculate an estimate of the circulating tumor
fraction. The tumor
purity algorithm construct 145-tf-a comprises an optional input data
filtration construct 145-tf-k
(e.g., for removing one or more inputs passed from the sequence ratio
data structure
based on a minimum probe threshold or a position on a sex chromosome) and a
plurality of
model parameters 145-tf-d (e.g., 145-tf-d-1, 145-tf-d-2,... ) used for
executing the algorithm.
In some embodiments, model parameters include expected sequence ratios for a
set of copy
states at a given tumor purity; a distance (e.g., an error) from a test
sequence ratio to the
closest expected sequence ratio at the given tumor purity; a minimum distance
(e.g., a
minimum error) from a test sequence ratio to the closest expected sequence
ratio at the given
tumor purity (e.g., an assigned test copy state selected from a minimal
distance expected copy
state); and/or a tumor purity score (e.g., a sum of weighted errors).
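The model parameters listed above can be illustrated with the following Python sketch, which scores each candidate tumor purity by a sum of weighted errors and keeps the purity with the lowest score. It is a sketch under stated assumptions only: it substitutes a simple grid search for the expectation-maximization algorithm mentioned above, it assumes linear (not log-transformed) coverage ratios relative to a diploid normal, and the function names, copy states, and purity grid are hypothetical.

```python
from typing import Optional, Sequence


def expected_ratio(purity: float, copies: int, normal_copies: int = 2) -> float:
    """Expected coverage ratio for a tumor copy state at a given tumor purity."""
    return (purity * copies + (1.0 - purity) * normal_copies) / normal_copies


def purity_score(
    ratios: Sequence[float],
    purity: float,
    copy_states: Sequence[int],
    weights: Sequence[float],
) -> float:
    """Sum of weighted minimum distances to the closest expected ratio."""
    expected = [expected_ratio(purity, c) for c in copy_states]
    return sum(
        w * min(abs(r - e) for e in expected)
        for r, w in zip(ratios, weights)
    )


def estimate_tumor_fraction(
    ratios: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    copy_states: Sequence[int] = (0, 1, 2, 3, 4),  # assumed set of copy states
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Return the candidate purity with the lowest score (grid search)."""
    if weights is None:
        weights = [1.0] * len(ratios)  # equal weights unless provided
    if grid is None:
        grid = [i / 100 for i in range(1, 101)]  # purities 0.01 to 1.00
    return min(grid, key=lambda p: purity_score(ratios, p, copy_states, weights))
```

The weights could, for example, reflect the number of probes supporting each ratio, so that well-covered segments contribute more to the score.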
[0233] In some embodiments, referring to Figure 1C3, the tumor
fraction estimation
module 145-tf is used to obtain one or more circulating tumor fraction
estimates 131-i that
are included as feature data 125 in a test patient data store 120. For
example, in some
embodiments, a plurality of circulating tumor fraction estimates is obtained
from a test liquid
biopsy sample of a subject 131-i-cf (e.g., 131-i-cf-1, 131-i-cf-2,...,
131-i-cf-N). In some
embodiments, the plurality of circulating tumor fraction estimates is obtained
from a single
patient at different collection times.
[0234] Further details and specific embodiments regarding methods
for analysis and
validation of copy number variation, validation of a somatic sequence variant,
and/or
determination of a circulating tumor fraction estimate are provided below with
reference to
Figures 4, 5, and 6 (e.g., Figures 4F1, 5A1-5E1, and 6A1-6C1; Figures 4F2,
5A2-5B2, and 6A2; and/or Figures 4F3, 5A3-5B3, and 6A3-6C3).
Feature Analysis Module (160)
[0235] Referring again to Figure 1A, the system (e.g., system 100)
includes a feature
analysis module 160 that includes one or more genomic alteration
interpretation algorithms
161, one or more optional clinical data analysis algorithms 165, an optional
therapeutic
curation algorithm 165, and an optional recommendation validation module 167.
In some
embodiments, feature analysis module 160 identifies actionable variants and
characteristics
139-1 and corresponding matched therapies 139-2 and/or clinical trials using
one or more
analysis algorithms (e.g., algorithms 162, 163, 164, and 165) to evaluate
feature data 125.
The identified actionable variants and characteristics 139-1 and corresponding
matched
therapies 139-2, which are optionally stored in test patient data store 120,
are then curated by
feature analysis module 160 to generate a clinical report 139-3, which is
optionally validated
by a user, e.g., a clinician, before being transmitted to a medical
professional, e.g., an
oncologist, treating the patient.
[0236] In some embodiments, the genomic alteration interpretation
algorithms 161
include instructions for evaluating the effect that one or more genomic
features 131 of the
subject, e.g., as identified by feature extraction module 145, have on the
characteristics of the
patient's cancer and/or whether one or more targeted cancer therapies may
improve the
clinical outcome for the patient. For example, in some embodiments, one or
more genomic
variant analysis algorithms 163 evaluate various genomic features 131 by
querying a
database, e.g., a look-up-table ("LUT") of actionable genomic alterations,
targeted therapies
associated with the actionable genomic alterations, and any other conditions
that should be
met before administering the targeted therapy to a subject having the
actionable genomic
alteration. For instance, evidence suggests that depatuxizumab mafodotin (an
anti-EGFR
mAb conjugated to monomethyl auristatin F) has improved efficacy for the
treatment of
recurrent glioblastomas having EGFR focal amplifications. van den Bent M. et
al., Cancer
Chemother Pharmacol., 80(6):1209-17 (2017). Accordingly, the actionable
genomic
alteration LUT would have an entry for the focal amplification of the EGFR
gene indicating
that depatuxizumab mafodotin is a targeted therapy for glioblastomas (e.g.,
recurrent
glioblastomas) having a focal gene amplification. In some instances, the LUT
may also
include counter indications for the associated targeted therapy, e.g., adverse
drug interactions
or personal characteristics that are counter-indicated for administration of
the particular
targeted therapy.
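For illustration, a look-up-table query of the kind described above can be sketched in Python as follows. The table layout, the key format, the field names, and the function name are hypothetical; the single entry merely mirrors the EGFR example given above and is not clinical guidance.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Alteration = Tuple[str, str]  # (gene, alteration type); hypothetical key format

# Hypothetical LUT: each actionable alteration maps to candidate therapies,
# the cancer types they apply to, and any counter-indicating patient flags.
ACTIONABLE_LUT: Dict[Alteration, List[dict]] = {
    ("EGFR", "focal amplification"): [
        {
            "therapy": "depatuxizumab mafodotin",
            "cancer_types": {"glioblastoma"},
            "counter_indications": set(),
        },
    ],
}


def match_therapies(
    alterations: Iterable[Alteration],
    cancer_type: str,
    patient_flags: FrozenSet[str] = frozenset(),
    lut: Dict[Alteration, List[dict]] = ACTIONABLE_LUT,
) -> List[Tuple[str, str]]:
    """Return (gene, therapy) pairs whose LUT conditions are all met."""
    matches: List[Tuple[str, str]] = []
    for key in alterations:
        for entry in lut.get(key, []):
            if cancer_type not in entry["cancer_types"]:
                continue  # additional condition on cancer type not met
            if entry["counter_indications"] & set(patient_flags):
                continue  # a counter-indication applies to this patient
            matches.append((key[0], entry["therapy"]))
    return matches
```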
[0237] In some embodiments, a genomic alteration interpretation
algorithm 161
determines whether a particular genomic feature 131 should be reported to a
medical
professional treating the cancer patient. In some embodiments, genomic
features 131 (e.g.,
genomic alterations and compound features) are reported when there is clinical
evidence that
the feature significantly impacts the biology of the cancer, impacts the
prognosis for the
cancer, and/or impacts pharmacogenomics, e.g., by indicating or counter-
indicating particular
therapeutic approaches. For instance, a genomic alteration interpretation
algorithm 161 may
classify a particular CNV feature 135 as "Reportable," e.g., meaning that the
CNV has been
identified as influencing the character of the cancer, the overall disease
state, and/or
pharmacogenomics, as "Not Reportable," e.g., meaning that the CNV has not been
identified
as influencing the character of the cancer, the overall disease state, and/or
pharmacogenomics, as "No Evidence," e.g., meaning that no evidence exists
supporting that
the CNV is "Reportable" or "Not Reportable," or as "Conflicting Evidence,"
e.g., meaning
that evidence exists supporting both that the CNV is "Reportable" and that the
CNV is "Not
Reportable."
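The four reporting categories described above can be illustrated with the following Python sketch. Reducing the available evidence to two boolean inputs is an assumption made for the example, and the type and function names are hypothetical.

```python
from enum import Enum


class Reportability(Enum):
    REPORTABLE = "Reportable"
    NOT_REPORTABLE = "Not Reportable"
    NO_EVIDENCE = "No Evidence"
    CONFLICTING_EVIDENCE = "Conflicting Evidence"


def classify_cnv(evidence_reportable: bool, evidence_not_reportable: bool) -> Reportability:
    """Map the presence of supporting evidence to a reporting category."""
    if evidence_reportable and evidence_not_reportable:
        return Reportability.CONFLICTING_EVIDENCE
    if evidence_reportable:
        return Reportability.REPORTABLE
    if evidence_not_reportable:
        return Reportability.NOT_REPORTABLE
    return Reportability.NO_EVIDENCE
```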
[0238] In some embodiments, the genomic alteration interpretation
algorithms 161
include one or more pathogenic variant analysis algorithms 162, which evaluate
various
genomic features to identify the presence of an oncogenic pathogen associated
with the
patient's cancer and/or targeted therapies associated with an oncogenic
pathogen infection in
the cancer. For instance, RNA expression patterns of some cancers are
associated with the
presence of an oncogenic pathogen that is helping to drive the cancer. See,
for example, U.S.
Patent Application Serial No. 16/802,126, filed February 26, 2020, the content
of which is
hereby incorporated by reference, in its entirety, for all purposes. In some
instances, the
recommended therapy for the cancer is different when the cancer is associated
with the
oncogenic pathogen infection than when it is not. Accordingly, in some
embodiments, e.g.,
where feature data 125 includes RNA abundance data for the cancer of the
patient, one or
more pathogenic variant analysis algorithms 162 evaluate the RNA abundance
data for the
patient's cancer to determine whether a signature exists in the data that
indicates the presence
of the oncogenic pathogen in the cancer. Similarly, in some embodiments,
bioinformatics
module 140 includes an algorithm that searches for the presence of pathogenic
nucleic acid
sequences in sequencing data 122. See, for example, U.S. Provisional Patent
Application
Serial No. 62/978,067, filed February 18, 2020, the content of which is hereby
incorporated
by reference, in its entirety, for all purposes. Accordingly, in some
embodiments, one or
more pathogenic variant analysis algorithms 162 evaluate whether the presence
of an
oncogenic pathogen in a subject is associated with an actionable therapy for
the infection. In
some embodiments, system 100 queries a database, e.g., a look-up-table
("LUT"), of
actionable oncogenic pathogen infections, targeted therapies associated with
the actionable
infections, and any other conditions that should be met before administering
the targeted
therapy to a subject that is infected with the oncogenic pathogen. In some
instances, the LUT
may also include counter indications for the associated targeted therapy,
e.g., adverse drug
interactions or personal characteristics that are counter-indicated for
administration of the
particular targeted therapy.
[0239] In some embodiments, the genomic alteration interpretation
algorithms 161
include one or more multi-feature analysis algorithms 164 that evaluate a
plurality of features
to classify a cancer with respect to the effects of one or more targeted
therapies. For instance,
in some embodiments, feature analysis module 160 includes one or more
classifiers trained
against feature data, one or more clinical therapies, and their associated
clinical outcomes for
a plurality of training subjects to classify cancers based on their predicted
clinical outcomes
following one or more therapies.
[0240] In some embodiments, the classifier is implemented as an
artificial intelligence
engine and may include gradient boosting models, random forest models, neural
networks
(NN), regression models, Naive Bayes models, and/or machine learning
algorithms (MLA).
An MLA or a NN may be trained from a training data set that includes one or
more features
125, including personal characteristics 126, medical history 127, clinical
features 128,
genomic features 131, and/or other -omic features 138. MLAs include supervised
algorithms
(such as algorithms where the features/classifications in the data set are
annotated) using
linear regression, logistic regression, decision trees, classification and
regression trees, naïve
Bayes, nearest neighbor clustering; unsupervised algorithms (such as
algorithms where no
features/classification in the data set are annotated) using Apriori, means
clustering, principal
component analysis, random forest, adaptive boosting; and semi-supervised
algorithms (such
as algorithms where an incomplete number of features/classifications in the
data set are
annotated) using generative approach (such as a mixture of Gaussian
distributions, mixture of
multinomial distributions, hidden Markov models), low density separation,
graph-based
approaches (such as mincut, harmonic function, manifold regularization),
heuristic
approaches, or support vector machines.
[0241] NNs include conditional random fields, convolutional
neural networks, attention
based neural networks, deep learning, long short term memory networks, or
other neural
models where the training data set includes a plurality of tumor samples, RNA
expression
data for each sample, and pathology reports covering imaging data for each
sample.
[0242] While MLA and neural networks identify distinct approaches
to machine learning,
the terms may be used interchangeably herein. Thus, a mention of MLA may
include a
corresponding NN or a mention of NN may include a corresponding MLA unless
explicitly
stated otherwise. Training may include providing optimized datasets, labeling
these traits as
they occur in patient records, and training the MLA to predict or classify
based on new
inputs. Artificial NNs are efficient computing models which have shown their
strengths in
solving hard problems in artificial intelligence. They have also been shown to
be universal
approximators, that is, they can represent a wide variety of functions when
given appropriate
parameters.
[0243] In some embodiments, system 100 includes a classifier
training module that
includes instructions for training one or more untrained or partially trained
classifiers based
on feature data from a training dataset. In some embodiments, system 100 also
includes a
database of training data for use in training the one or more classifiers. In
other
embodiments, the classifier training module accesses a remote storage device
hosting training
data. In some embodiments, the training data includes a set of training
features, including but
not limited to, various types of the feature data 125 illustrated in Figure 1B.
In some
embodiments, the classifier training module uses patient data 121, e.g., when
test patient data
store 120 also stores a record of treatments administered to the patient and
patient outcomes
following therapy.
[0244] In some embodiments, feature analysis module 160 includes
one or more clinical
data analysis algorithms 165, which evaluate clinical features 128 of a cancer
to identify
targeted therapies which may benefit the subject. For example, in some
embodiments, e.g.,
where feature data 125 includes pathology data 128-1, one or more clinical
data analysis
algorithms 165 evaluate the data to determine whether an actionable therapy is
indicated
based on the histopathology of a tumor biopsy from the subject, e.g., which is
indicative of a
particular cancer type and/or stage of cancer. In some embodiments, system 100
queries a
database, e.g., a look-up-table ("LUT"), of actionable clinical features
(e.g., pathology
features), targeted therapies associated with the actionable features, and any
other conditions
that should be met before administering the targeted therapy to a subject
associated with the
actionable clinical features 128 (e.g., pathology features 128-1). In some
embodiments,
system 100 evaluates the clinical features 128 (e.g., pathology features 128-1)
directly to
determine whether the patient's cancer is sensitive to a particular
therapeutic agent. Further
details on example methods, systems, and algorithms for classifying cancer and
identifying
targeted therapies based on clinical data, such as pathology data 128-1,
imaging data 128-2,
and/or tissue culture/organoid data 128-3 are discussed, for example, in U.S.
Patent
Application No. 16/830,186, filed on March 25, 2020, U.S. Patent Application
No.
16/789,363, filed on Feb. 12, 2020, and U.S. Provisional Application No.
63/007,874, filed
on April 9, 2020, the contents of which are hereby incorporated by reference,
in their
entireties, for all purposes.
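By way of illustration only, the look-up-table query described above can be sketched as a mapping from an actionable clinical feature to candidate therapies, each gated by conditions that should be met before administration. The feature names, therapy names, condition names, and the helper find_actionable_therapies below are hypothetical placeholders chosen for this sketch; they are not taken from the disclosure or from the applications incorporated by reference.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LutEntry:
    """One hypothetical LUT row: a therapy and the conditions gating it."""
    therapy: str
    required_conditions: frozenset = field(default_factory=frozenset)


# Hypothetical table keyed by an actionable clinical (e.g., pathology) feature.
ACTIONABLE_LUT = {
    "feature_A": [
        LutEntry("therapy_1"),
        LutEntry("therapy_2", frozenset({"condition_X"})),
    ],
    "feature_B": [LutEntry("therapy_3", frozenset({"condition_Y"}))],
}


def find_actionable_therapies(features, conditions_met, lut=ACTIONABLE_LUT):
    """Return therapies whose feature is present and whose conditions are all met."""
    met = set(conditions_met)
    matches = []
    for feature in features:
        for entry in lut.get(feature, []):
            # A therapy is reported only if every gating condition is satisfied.
            if entry.required_conditions <= met:
                matches.append(entry.therapy)
    return matches
```

In this sketch a feature absent from the table simply yields no therapies, so an unrecognized feature is never treated as actionable.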
[0245] In some embodiments, feature analysis module 160 includes
a clinical trials
module that evaluates test patient data 121 to determine whether the patient
is eligible for
inclusion in a clinical trial for a cancer therapy, e.g., a clinical trial
that is currently recruiting
patients, a clinical trial that has not yet begun recruiting patients, and/or
an ongoing clinical
trial that may recruit additional patients in the future. In some embodiments,
a clinical trial
module evaluates test patient data 121 to determine whether the results of a
clinical trial are
relevant for the patient, e.g., the results of an ongoing clinical trial
and/or the results of a
completed clinical trial. For instance, in some embodiments, system 100
queries a database,
e.g., a look-up-table ("LUT") of clinical trials, e.g., active and/or
completed clinical trials,
and compares patient data 121 with inclusion criteria for the clinical trials,
stored in the
database, to identify clinical trials with inclusion criteria that closely
match and/or exactly
match the patient's data 121. In some embodiments, a record of matching
clinical trials, e.g.,
those clinical trials that the patient may be eligible for and/or that may
inform personalized
treatment decisions for the patient, is stored in clinical assessment
database 139.
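As a minimal sketch of the comparison described above, each trial's inclusion criteria can be represented as a set of allowed values per patient attribute, and a trial can be scored by the fraction of criteria the patient satisfies. An exact match scores 1.0, and a lower threshold admits close matches. The attribute names, trial identifiers, scoring rule, and threshold are assumptions made for illustration; the disclosure does not specify how closeness of a match is measured.

```python
def match_fraction(patient, criteria):
    """Fraction of a trial's inclusion criteria satisfied by the patient record."""
    if not criteria:
        return 1.0
    met = sum(1 for key, allowed in criteria.items() if patient.get(key) in allowed)
    return met / len(criteria)


def matching_trials(patient, trials, threshold=1.0):
    """Return (trial_id, fraction) pairs at or above threshold, best match first."""
    scored = [(trial_id, match_fraction(patient, crit)) for trial_id, crit in trials.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical patient record and trial table.
patient = {"cancer_type": "lung", "stage": "IV"}
trials = {
    "T1": {"cancer_type": {"lung"}, "stage": {"III", "IV"}},
    "T2": {"cancer_type": {"lung"}, "stage": {"I"}},
}
exact = matching_trials(patient, trials)                 # exact matches only
close = matching_trials(patient, trials, threshold=0.5)  # also admits close matches
```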
[0246] In some embodiments, feature analysis module 160 includes
a therapeutic curation
algorithm 166 that assembles actionable variants and characteristics 139-1,
matched therapies
139-2, and/or relevant clinical trials identified for the patient, as
described above. In some
embodiments, a therapeutic curation algorithm 166 evaluates certain criteria
related to which
actionable variants and characteristics 139-1, matched therapies 139-2, and/or
relevant
clinical trials should be reported and/or whether certain matched therapies,
considered alone
or in combination, may be contraindicated for the patient, e.g., based on
personal
characteristics 126 of the patient and/or known drug-drug interactions. In
some
embodiments, the therapeutic curation algorithm then generates one or more
clinical reports
139-3 for the patient. In some embodiments, the therapeutic curation algorithm
generates a
first clinical report 139-3-1 that is to be reported to a medical professional
treating the patient
and a second clinical report 139-3-2 that will not be communicated to the
medical
professional, but may be used to improve various algorithms within the system.
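The filtering step described above can be illustrated with a short sketch that withholds a matched therapy when it conflicts with a personal characteristic of the patient or with a drug the patient already receives. The function curate_therapies, the trait and drug names, and the representation of interactions as unordered pairs are hypothetical. For brevity the sketch checks interactions only against current medications and does not evaluate combinations among the matched therapies themselves.

```python
def curate_therapies(matched, patient_traits, contraindications, interactions, current_drugs=()):
    """Split matched therapies into (reportable, withheld) lists.

    contraindications: therapy -> set of patient traits that rule it out.
    interactions: set of frozenset({drug_a, drug_b}) pairs known to interact.
    """
    traits = set(patient_traits)
    reportable, withheld = [], []
    for therapy in matched:
        trait_conflict = bool(contraindications.get(therapy, set()) & traits)
        drug_conflict = any(
            frozenset((therapy, drug)) in interactions for drug in current_drugs
        )
        if trait_conflict or drug_conflict:
            withheld.append(therapy)
        else:
            reportable.append(therapy)
    return reportable, withheld
```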
[0247] In some embodiments, feature analysis module 160 includes
a recommendation
validation module 167 that includes an interface allowing a clinician to
review, modify, and
approve a clinical report 139-3 prior to the report being sent to a medical
professional, e.g.,
an oncologist, treating the patient.
[0248] In some embodiments, each of the one or more feature
collections, sequencing
modules, bioinformatics modules (including, e.g., alteration module(s),
structural variant
calling and data processing modules), classification modules and outcome
modules are
communicatively coupled to a data bus to transfer data between each module for
processing
and/or storage. In some alternative embodiments, each of the feature
collection, alteration
module(s), structural variant and feature store are communicatively coupled to
each other for
independent communication without sharing the data bus.
[0249] Further details on systems and exemplary embodiments of
modules and feature
collections are discussed in PCT Application PCT/US19/69149, titled "A METHOD
AND
PROCESS FOR PREDICTING AND ANALYZING PATIENT COHORT RESPONSE,
PROGRESSION, AND SURVIVAL," filed December 31, 2019, which is hereby
incorporated herein by reference in its entirety.
Example Methods
[0250] Now that details of a system 100 for providing clinical
support for personalized
cancer therapy, e.g., with improved validation of copy number variation,
improved validation
of somatic sequence variants, and/or improved determination of circulating
tumor fraction
estimates have been disclosed, details regarding processes and features of the
system, in
accordance with various embodiments of the present disclosure, are disclosed
below.
Specifically, example processes are described below with reference to Figures
2A, 3, 4, 5, 6
and 7 (e.g., Figures 2A, 3, 4A-E; Figures 4F1, 5A1-5E1, 6A1-6C1, and 7A1-7C1;
Figures
4F2, 5A2-5B2, 6A2, and 7A2-7B2; and/or Figures 4F3, 5A3-5B3, 6A3-6C3, and
7A3). In
some embodiments, such processes and features of the system are carried out by
modules
118, 120, 140, 160, and/or 170, as illustrated in Figure 1A. Referring to
these methods, the
systems described herein (e.g., system 100) include instructions for
determining and
validating focal copy number variations that are improved compared to
conventional methods
for copy number analysis, instructions for validating somatic variants that
are improved
compared to conventional methods for somatic variant detection, and/or
instructions for
determining accurate circulating tumor fraction estimates that are improved
compared to
conventional methods for obtaining circulating tumor fraction estimates.
Figure 2B: Distributed Diagnostic and Clinical Environment
[0251] In some aspects, the methods described herein for
providing clinical support for
personalized cancer therapy are performed across a distributed
diagnostic/clinical
environment, e.g., as illustrated in Figure 2B. However, in some embodiments,
the improved
methods described herein for supporting clinical decisions in precision
oncology using liquid
biopsy assays (e.g., by validating a copy number variation in a test subject,
validating a
somatic sequence variant in a test subject having a cancer condition,
determining accurate
circulating tumor fraction estimates, etc.) are performed at a single
location, e.g., at a single
computing system or environment, although ancillary procedures supporting the
methods
described herein, and/or procedures that make further use of the results of
the methods
described herein, may be performed across a distributed diagnostic/clinical
environment.
[0252] Figure 2B illustrates an example of a distributed
diagnostic/clinical environment
210. In some embodiments, the distributed diagnostic/clinical environment is
connected via
communication network 105. In some embodiments, one or more biological
samples, e.g.,
one or more liquid biopsy samples, solid tumor biopsy, normal tissue samples,
and/or control
samples, are collected from a subject in clinical environment 220, e.g., a
doctor's office,
hospital, or medical clinic, or at a home health care environment (not
depicted).
Advantageously, while solid tumor samples should be collected within a
clinical setting,
liquid biopsy samples can be acquired in a less invasive fashion and are more
easily collected
outside of a traditional clinical setting. In some embodiments, one or more
biological
samples, or portions thereof, are processed within the clinical environment
220 where
collection occurred, using a processing device 224, e.g., a nucleic acid
sequencer for
obtaining sequencing data, a microscope for obtaining pathology data, a mass
spectrometer
for obtaining proteomic data, etc. In some embodiments, one or more biological
samples, or
portions thereof are sent to one or more external environments, e.g.,
sequencing lab 230,
pathology lab 240, and/or molecular biology lab 250, each of which includes a
processing
device 234, 244, and 254, respectively, to generate biological data 121 for
the subject. Each
environment includes a communications device 222, 232, 242, and 252,
respectively, for
communicating biological data 121 about the subject to a processing server 262
and/or
database 264, which may be located in yet another environment, e.g.,
processing/storage
center 260. Thus, in some embodiments, different portions of the systems and
methods
described herein are fulfilled by different processing devices located in
different physical
environments.
[0253] Accordingly, in some embodiments, a method for providing
clinical support for
personalized cancer therapy, e.g., with improved validation of copy number
variations,
improved validation of somatic sequence variants, and/or improved
determination of
circulating tumor fraction estimates, is performed across one or more
environments, as
illustrated in Figure 2B. For instance, in some such embodiments, a liquid
biopsy sample is
collected at clinical environment 220 or in a home healthcare environment. The
sample, or a
portion thereof, is sent to sequencing lab 230 where raw sequence reads 123 of
nucleic acids
in the sample are generated by sequencer 234. The raw sequencing data 123 is
communicated, e.g., from communications device 232, to database 264 at
processing/storage
center 260, where processing server 262 extracts features from the sequence
reads by
executing one or more of the processes in bioinformatics module 140, thereby
generating
genomic features 131 for the sample. Processing server 262 may then analyze
the identified
features by executing one or more of the processes in feature analysis module
160, thereby
generating clinical assessment 139, including a clinical report 139-3. A
clinician may access
clinical report 139-3, e.g., at processing/storage center 260 or through
communications
network 105, via recommendation validation module 167. After final approval,
clinical
report 139-3 is transmitted to a medical professional, e.g., an oncologist, at
clinical
environment 220, who uses the report to support clinical decision making for
personalized
treatment of the patient's cancer.
Figure 2A: Example Workflow for Precision Oncology
[0254] Figure 2A is a flowchart of an example workflow 200 for
collecting and analyzing
data in order to generate a clinical report 139 to support clinical decision
making in precision
oncology. Advantageously, the methods described herein improve this process,
for example,
by improving various stages within feature extraction 206, including
validating copy number
variations, validating somatic sequence variants, and/or determining
circulating tumor
fraction estimates.
[0255] Briefly, the workflow begins with patient intake and
sample collection 201, where
one or more liquid biopsy samples, one or more tumor biopsies, and one or more
normal and/or
control tissue samples are collected from the patient (e.g., at a clinical
environment 220 or
home healthcare environment, as illustrated in Figure 2B). In some
embodiments, personal
data 126 corresponding to the patient and a record of the one or more
biological samples
obtained (e.g., patient identifiers, patient clinical data, sample type,
sample identifiers, cancer
conditions, etc.) are entered into a data analysis platform, e.g., test
patient data store 120.
Accordingly, in some embodiments, the methods disclosed herein include
obtaining one or
more biological samples from one or more subjects, e.g., cancer patients. In
some
embodiments, the subject is a human, e.g., a human cancer patient.
[0256] In some embodiments, one or more of the biological samples
obtained from the
patient are a biological liquid sample, also referred to as a liquid biopsy
sample. In some
embodiments, one or more of the biological samples obtained from the patient
are selected
from blood, plasma, serum, urine, vaginal fluid, fluid from a hydrocele (e.g.,
of the testis),
vaginal flushing fluids, pleural fluid, ascitic fluid, cerebrospinal fluid,
saliva, sweat, tears,
sputum, bronchoalveolar lavage fluid, discharge fluid from the nipple,
aspiration fluid from
different parts of the body (e.g., thyroid, breast), etc. In some embodiments,
the liquid biopsy
sample includes blood and/or saliva. In some embodiments, the liquid biopsy
sample is
peripheral blood. In some embodiments, blood samples are collected from
patients in
commercial blood collection containers, e.g., using PAXgene Blood DNA
Tubes. In
some embodiments, saliva samples are collected from patients in commercial
saliva
collection containers, e.g., using an Oragene® DNA Saliva Kit.
[0257] In some embodiments, the liquid biopsy sample has a volume
of from about 1 mL
to about 50 mL. For example, in some embodiments, the liquid biopsy sample has
a volume
of about 1 mL, about 2 mL, about 3 mL, about 4 mL, about 5 mL, about 6 mL,
about 7 mL,
about 8 mL, about 9 mL, about 10 mL, about 11 mL, about 12 mL, about 13 mL,
about 14
mL, about 15 mL, about 16 mL, about 17 mL, about 18 mL, about 19 mL, about 20
mL, or
greater.
[0258] Liquid biopsy samples include cell free nucleic acids,
including cell-free DNA
(cfDNA). As described above, cfDNA isolated from cancer patients includes DNA
originating from cancerous cells, also referred to as circulating tumor DNA
(ctDNA), cfDNA
originating from germline (e.g., healthy or non-cancerous) cells, and cfDNA
originating from
hematopoietic cells (e.g., white blood cells). The relative proportions of
cancerous and non-
cancerous cfDNA present in a liquid biopsy sample vary depending on the
characteristics
(e.g., the type, stage, lineage, genomic profile, etc.) of the patient's
cancer. As used herein,
the 'tumor burden' of the subject refers to the percentage of cfDNA that
originated from
cancerous cells.
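As a worked illustration of the definition above, the tumor burden is the tumor-derived cfDNA expressed as a percentage of all cfDNA in the sample. The function name and the choice of input units (any consistent measure, such as mass or molecule counts) are assumptions for this sketch. It does not represent the circulating tumor fraction estimation methods disclosed elsewhere herein.

```python
def tumor_burden_percent(tumor_derived_cfdna, total_cfdna):
    """Percentage of cfDNA that originated from cancerous cells.

    Both arguments must use the same unit (e.g., ng or molecule counts).
    """
    if total_cfdna <= 0:
        raise ValueError("total cfDNA must be positive")
    if not 0 <= tumor_derived_cfdna <= total_cfdna:
        raise ValueError("tumor-derived cfDNA must lie between 0 and the total")
    return 100.0 * tumor_derived_cfdna / total_cfdna
```

For example, a sample in which 5 of every 100 cfDNA molecules are tumor-derived has a tumor burden of 5 percent under this definition.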
[0259] As described herein, cfDNA is a particularly useful source
of biological data for
various implementations of the methods and systems described herein, because
it is readily
obtained from various body fluids. Advantageously, use of bodily fluids
facilitates serial
monitoring because of the ease of collection, as these fluids are collectable
by non-invasive
or minimally invasive methodologies. This is in contrast to methods that rely
upon solid
tissue samples, such as biopsies, which often times require invasive surgical
procedures.
Further, because bodily fluids, such as blood, circulate throughout the body,
the cfDNA
population represents a sampling of many different tissue types from many
different
locations.
[0260] In some embodiments, a liquid biopsy sample is separated
into two different
samples. For example, in some embodiments, a blood sample is separated into a
blood
plasma sample, containing cfDNA, and a buffy coat preparation, containing
white blood
cells.
[0261] In some embodiments, a plurality of liquid biopsy samples
is obtained from a
respective subject at intervals over a period of time (e.g., using serial
testing). For example,
in some such embodiments, the time between obtaining liquid biopsy samples
from a
respective subject is at least 1 day, at least 2 days, at least 1 week, at
least 2 weeks, at least 1
month, at least 2 months, at least 3 months, at least 4 months, at least 6
months, or at least 1
year.
[0262] In some embodiments, one or more biological samples
collected from the patient
is a solid tissue sample, e.g., a solid tumor sample or a solid normal tissue
sample. Methods
for obtaining solid tissue samples, e.g., of cancerous and/or normal tissue
are known in the art
and are dependent upon the type of tissue being sampled. For example, bone
marrow
biopsies and isolation of circulating tumor cells can be used to obtain
samples of blood
cancers, endoscopic biopsies can be used to obtain samples of cancers of the
digestive tract,
bladder, and lungs, needle biopsies (e.g., fine-needle aspiration, core needle
aspiration,
vacuum-assisted biopsy, and image-guided biopsy) can be used to obtain samples
of
subdermal tumors, skin biopsies, e.g., shave biopsy, punch biopsy, incisional
biopsy, and
excisional biopsy, can be used to obtain samples of dermal cancers, and
surgical biopsies can
be used to obtain samples of cancers affecting internal organs of a patient.
In some
embodiments, a solid tissue sample is a formalin-fixed tissue (FFT). In some
embodiments, a
solid tissue sample is a macro-dissected formalin fixed paraffin embedded
(FFPE) tissue. In
some embodiments, a solid tissue sample is a fresh frozen tissue sample.
[0263] In some embodiments, a dedicated normal sample is
collected from the patient, for
co-processing with a liquid biopsy sample. Generally, the normal sample is of
a non-
cancerous tissue, and can be collected using any tissue collection means
described above. In
some embodiments, buccal cells collected from the inside of a patient's cheeks
are used as a
normal sample. Buccal cells can be collected by placing an absorbent material,
e.g., a swab,
in the subject's mouth and rubbing it against their cheek, e.g., for at least
15 second or for at
least 30 seconds. The swab is then removed from the patient's mouth and
inserted into a
tube, such that the tip of the swab is submerged in a liquid that serves to
extract the buccal
cells off of the absorbent material. An example of buccal cell recovery and
collection devices
is provided in U.S. Patent No. 9,138,205, the content of which is hereby
incorporated by
reference, in its entirety, for all purposes. In some embodiments, the buccal
swab DNA is
used as a source of normal DNA in circulating heme malignancies.
[0264] The biological samples collected from the patient are,
optionally, sent to various
analytical environments (e.g., sequencing lab 230, pathology lab 240, and/or
molecular
biology lab 250) for processing (e.g., data collection) and/or analysis (e.g.,
feature
extraction). Wet lab processing 204 may include cataloguing samples (e.g.,
accessioning),
examining clinical features of one or more samples (e.g., pathology review),
and nucleic acid
sequence analysis (e.g., extraction, library prep, capture + hybridize,
pooling, and
sequencing). In some embodiments, the workflow includes clinical analysis of
one or more
biological samples collected from the subject, e.g., at a pathology lab 240
and/or a molecular
and cellular biology lab 250, to generate clinical features such as pathology
features 128-3,
imaging data 128-3, and/or tissue culture / organoid data 128-3.
[0265] In some embodiments, the pathology data 128-1 collected
during clinical
evaluation includes visual features identified by a pathologist's inspection
of a specimen
(e.g., a solid tumor biopsy), e.g., of stained H&E or IHC slides. In some
embodiments, the
sample is a solid tissue biopsy sample. In some embodiments, the tissue biopsy
sample is a
formalin-fixed tissue (FFT), e.g., a formalin-fixed paraffin-embedded (FFPE)
tissue. In some
embodiments, the tissue biopsy sample is an FFPE or FFT block. In some
embodiments, the
tissue biopsy sample is a fresh-frozen tissue biopsy. The tissue biopsy sample
can be
prepared in thin sections (e.g., by cutting and/or affixing to a slide), to
facilitate pathology
review (e.g., by staining with immunohistochemistry stain for IHC review
and/or with
hematoxylin and eosin stain for H&E pathology review). For instance, analysis
of slides for
H&E staining or IHC staining may reveal features such as tumor infiltration,
programmed
death-ligand 1 (PD-L1) status, human leukocyte antigen (HLA) status, or other
immunological features.
[0266] In some embodiments, a liquid sample (e.g., blood)
collected from the patient
(e.g., in EDTA-containing collection tubes) is prepared on a slide (e.g., by
smearing) for
pathology review. In some embodiments, macrodissected FFPE tissue sections,
which may
be mounted on a histopathology slide, from solid tissue samples (e.g., tumor
or normal tissue)
are analyzed by pathologists. In some embodiments, tumor samples are evaluated
to
determine, e.g., the tumor purity of the sample, the percent tumor cellularity
as a ratio of
tumor to normal nuclei, etc. For each section, background tissue may be
excluded or
removed such that the section meets a tumor purity threshold, e.g., where at
least 20% of the
nuclei in the section are tumor nuclei, or where at least 25%, 30%, 40%, 50%,
60%, 70%,
80%, 90%, or more of the nuclei in the section are tumor nuclei.
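The tumor purity threshold described above can be sketched as a simple check on nuclei counts. The sketch computes purity as the fraction of all nuclei in a section that are tumor nuclei, consistent with the "at least 20% of the nuclei" criterion. The function names and the example counts are hypothetical, and 0.20 is used as the default only because it is the first threshold named in the text.

```python
def tumor_purity(tumor_nuclei, normal_nuclei):
    """Fraction of nuclei in a section that are tumor nuclei."""
    total = tumor_nuclei + normal_nuclei
    if total == 0:
        raise ValueError("section contains no nuclei")
    return tumor_nuclei / total


def meets_purity_threshold(tumor_nuclei, normal_nuclei, threshold=0.20):
    """True if the section meets the tumor purity threshold."""
    return tumor_purity(tumor_nuclei, normal_nuclei) >= threshold


# Excluding background tissue can bring a section up to the threshold:
before = meets_purity_threshold(10, 90)  # 10% tumor nuclei
after = meets_purity_threshold(10, 40)   # 20% after removing 50 normal nuclei
```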
[0267] Conversion of solid tumor test to liquid biopsy test. In
one embodiment, the solid
tissue sample is insufficient for NGS testing (for example, the sample is too
small or too
degraded, the amount or quality of nucleic acids extracted from the sample
does not result in
quality NGS results that would result in reliable determination of variants
and/or other
genetic characteristics of the sample), and the physician or patient may
decide to convert the
solid tissue test that was ordered to a liquid biopsy test to be performed on
a liquid biopsy
sample collected from the same patient. The resulting report and/or display of
the results on
a portal may include an "xF Conversion Badge" to distinguish any order that
has been
converted from solid tissue test to a liquid biopsy test (compared to, for
example, a liquid
biopsy test that was not initially ordered as a solid tissue test). This will
allow a user to
identify which orders have been converted by this process, and to distinguish
them from orders
that were intentionally placed for the liquid biopsy panel.
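One way to picture the badge logic described above is as a flag derived from two fields of an order record: the test type originally ordered and the test type actually performed. The record layout, field names, and string values below are assumptions made for illustration and do not describe the data model of any actual ordering portal.

```python
from dataclasses import dataclass


@dataclass
class AssayOrder:
    """Hypothetical order record."""
    order_id: str
    ordered_as: str    # "solid" or "liquid"
    performed_as: str  # "solid" or "liquid"


def needs_conversion_badge(order):
    """True only for orders converted from a solid tissue test to a liquid biopsy test."""
    return order.ordered_as == "solid" and order.performed_as == "liquid"


orders = [
    AssayOrder("A-1", "solid", "liquid"),   # converted: badge shown
    AssayOrder("A-2", "liquid", "liquid"),  # ordered as liquid biopsy: no badge
]
badged = [o.order_id for o in orders if needs_conversion_badge(o)]
```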
[0268] In some embodiments, pathology data 128-1 is extracted, in
addition to or instead
of visual inspection, using computational approaches to digital pathology,
e.g., providing
morphometric features extracted from digital images of stained tissue samples.
A review of
digital pathology methods is provided in Bera, K. et al., Nat. Rev. Clin.
Oncol., 16:703-15
(2019), the content of which is hereby incorporated by reference, in its
entirety, for all
purposes. In some embodiments, pathology data 128-1 includes features
determined using
machine learning algorithms to evaluate pathology data collected as described
above.
[0269] Further details on methods, systems, and algorithms for
using pathology data to
classify cancer and identify targeted therapies are discussed, for example, in
U.S. Patent Application No. 16/830,186, filed on March 25,
2020, and U.S.
Provisional Application No. 63/007,874, filed on April 9, 2020, the contents
of which are
hereby incorporated by reference, in their entireties, for all purposes.
[0270] In some embodiments, imaging data 128-2 collected during
clinical evaluation
includes features identified by review of in-vitro and/or in-vivo imaging
results (e.g., of a
tumor site), for example a size of a tumor, tumor size differentials over time
(such as during
treatment or during other periods of change). In some embodiments, imaging
data 128-2
includes features determined using machine learning algorithms to evaluate
imaging data
collected as described above.
[0271] Further details on methods, systems, and algorithms for
using medical imaging to
classify cancer and identify targeted therapies are discussed, for example, in
U.S. Patent Application No. 16/830,186, filed on March 25,
2020, and U.S.
Provisional Application No. 63/007,874, filed on April 9, 2020, the contents
of which are
hereby incorporated by reference, in their entireties, for all purposes.
[0272] In some embodiments, tissue culture / organoid data 128-3
collected during
clinical evaluation includes features identified by evaluation of cultured
tissue from the
subject. For instance, in some embodiments, tissue samples obtained from the
patients (e.g.,
tumor tissue, normal tissue, or both) are cultured (e.g., in liquid culture,
solid-phase culture,
and/or organoid culture) and various features, such as cell morphology, growth
characteristics, genomic alterations, and/or drug sensitivity, are evaluated.
In some
embodiments, tissue culture / organoid data 128-3 includes features determined
using
machine learning algorithms to evaluate tissue culture / organoid data
collected as described
above. Examples of tissue organoid (e.g., personal tumor organoid) culturing
and feature
extractions thereof are described in U.S. Provisional Application Serial No.
62/924,621, filed
on October 22, 2019, and U.S. Patent Application Serial No. 16/693,117, filed
on November
22, 2019, the contents of which are hereby incorporated by reference, in their
entireties, for
all purposes.
[0273] Nucleic acid sequencing of one or more samples collected
from the subject is
performed, e.g., at sequencing lab 230, during wet lab processing 204. An
example workflow
for nucleic acid sequencing is illustrated in Figure 3. In some embodiments,
the one or more
biological samples obtained at the sequencing lab 230 are accessioned (302),
to track the
sample and data through the sequencing process.
[0274] Next, nucleic acids, e.g., RNA and/or DNA are extracted
(304) from the one or
more biological samples. Methods for isolating nucleic acids from biological
samples are
known in the art, and are dependent upon the type of nucleic acid being
isolated (e.g.,
cfDNA, DNA, and/or RNA) and the type of sample from which the nucleic acids
are being
isolated (e.g., liquid biopsy samples, white blood cell buffy coat
preparations, formalin-fixed
paraffin-embedded (FFPE) solid tissue samples, and fresh frozen solid tissue
samples). The
selection of any particular nucleic acid isolation technique for use in
conjunction with the
embodiments described herein is well within the skill of the person having
ordinary skill in
the art, who will consider the sample type, the state of the sample, the type
of nucleic acid
being sequenced and the sequencing technology being used.
[0275] For instance, many techniques for DNA isolation, e.g.,
genomic DNA isolation,
from a tissue sample are known in the art, such as organic extraction, silica
adsorption, and
anion exchange chromatography. Likewise, many techniques for RNA isolation,
e.g., mRNA
isolation, from a tissue sample are known in the art, for example, acid
guanidinium
thiocyanate-phenol-chloroform extraction (see, for example, Chomczynski and
Sacchi, 2006,
Nat Protoc, 1(2):581-85, which is hereby incorporated by reference herein),
and silica
bead/glass fiber adsorption (see, for example, Poeckh, T. et al., 2008, Anal
Biochem.,
373(2):253-62, which is hereby incorporated by reference herein). The
selection of any
particular DNA or RNA isolation technique for use in conjunction with the
embodiments
described herein is well within the skill of the person having ordinary skill
in the art, who will
consider the tissue type, the state of the tissue, e.g., fresh, frozen,
formalin-fixed, paraffin-
embedded (FFPE), and the type of nucleic acid analysis that is to be
performed.
[0276] In some embodiments where the biological sample is a
liquid biopsy sample, e.g.,
a blood or blood plasma sample, cfDNA is isolated from blood samples using
commercially
available reagents, including proteinase K, to generate a liquid solution of
cfDNA.
[0277] In some embodiments, isolated DNA molecules are
mechanically sheared to an
average length using an ultrasonicator (for example, a Covaris
ultrasonicator). In some
embodiments, isolated nucleic acid molecules are analyzed to determine their
fragment size,
e.g., through gel electrophoresis techniques and/or the use of a device such
as a LabChip GX
Touch. The skilled artisan will know of an appropriate range of fragment
sizes, based on the
sequencing technique being employed, as different sequencing techniques have
differing
fragment size requirements for robust sequencing. In some embodiments, quality
control
testing is performed on the extracted nucleic acids (e.g., DNA and/or RNA),
e.g., to assess
the nucleic acid concentration and/or fragment size. For example, sizing of
DNA fragments
provides valuable information used for downstream processing, such as
determining whether
DNA fragments require additional shearing prior to sequencing.
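The fragment-size quality control decision described above can be sketched as a comparison of the measured mean fragment length against an upper limit suited to the sequencing technique in use. The function name, the use of the arithmetic mean, and the example limit of 500 base pairs are assumptions for illustration. The text leaves the appropriate size range to the skilled artisan and does not state a numeric limit.

```python
from statistics import mean


def needs_additional_shearing(fragment_sizes_bp, max_mean_bp):
    """True if the mean fragment length exceeds the platform's upper limit.

    fragment_sizes_bp: measured fragment lengths in base pairs
    (e.g., from gel electrophoresis or a capillary sizing instrument).
    max_mean_bp: assumed upper limit for the sequencing technique in use.
    """
    if not fragment_sizes_bp:
        raise ValueError("no fragment sizes were measured")
    return mean(fragment_sizes_bp) > max_mean_bp


short_fragments = needs_additional_shearing([150, 170, 160], max_mean_bp=500)
long_fragments = needs_additional_shearing([800, 1200], max_mean_bp=500)
```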
[0278] Wet lab processing 204 then includes preparing a nucleic
acid library from the
isolated nucleic acids (e.g., cfDNA, DNA, and/or RNA). For example, in some
embodiments, DNA libraries (e.g., gDNA and/or cfDNA libraries) are prepared
from isolated
DNA from the one or more biological samples. In some embodiments, the DNA
libraries are
prepared using a commercial library preparation kit, e.g., the KAPA Hyper Prep
Kit, a New
England Biolabs (NEB) kit, or a similar kit.
[0279] Conversion of solid tumor test to liquid biopsy test. In
one embodiment, the solid
tissue sample is insufficient for NGS testing (for example, the sample is too
small or too
degraded, the amount or quality of nucleic acids extracted from the sample
does not result in
quality NGS results that would result in reliable determination of variants
and/or other
genetic characteristics of the sample), and the physician or patient may
decide to convert the
solid tissue test that was ordered to a liquid biopsy test to be performed on
a liquid biopsy
sample collected from the same patient. The resulting report and/or display of
the results on
a portal may include an "xF Conversion Badge" to distinguish any order that
has been
converted from solid tissue test to a liquid biopsy test (compared to, for
example, a liquid
biopsy test that was not initially ordered as a solid tissue test). This will
allow a user to
identify which orders have been converted by this process, and distinguish
between orders
that were intentionally placed for the liquid biopsy panel.
[0280] In some embodiments, during library preparation, adapters
(e.g., UDI adapters,
such as Roche SeqCap dual end adapters, or UMI adapters such as full length or
stubby Y
adapters) are ligated onto the nucleic acid molecules. In some embodiments,
the adapters
include unique molecular identifiers (UMIs), which are short nucleic acid
sequences (e.g., 3-
base pairs) that are added to ends of DNA fragments during adapter ligation.
In some
embodiments, UMIs are degenerate base pairs that serve as a unique tag that
can be used to
identify sequence reads originating from a specific DNA fragment. In some
embodiments,
e.g., when multiplex sequencing will be used to sequence DNA from a plurality
of samples
(e.g., from the same or different subjects) in a single sequencing reaction, a
patient-specific
index is also added to the nucleic acid molecules. In some embodiments, the
patient specific
index is a short nucleic acid sequence (e.g., 3-20 nucleotides) that is added
to ends of DNA
fragments during library construction and that serves as a unique tag that can be
used to identify
sequence reads originating from a specific patient sample. Examples of
identifier sequences
are described, for example, in Kivioja et al., Nat. Methods 9(1):72-74 (2011)
and Islam et al.,
Nat. Methods 11(2):163-66 (2014), the contents of which are hereby
incorporated by
reference, in their entireties, for all purposes.
[0281] In some embodiments, an adapter includes a PCR primer
landing site, designed
for efficient binding of a PCR or second-strand synthesis primer used during
the sequencing
reaction. In some embodiments, an adapter includes an anchor binding site, to
facilitate
binding of the DNA molecule to anchor oligonucleotide molecules on a sequencer
flow cell,
serving as a seed for the sequencing process by providing a starting point for
the sequencing
reaction. During PCR amplification following adapter ligation, the UMIs,
patient indexes,
and binding sites are replicated along with the attached DNA fragment. This
provides a way
to identify sequence reads that came from the same original fragment in
downstream analysis.
[0282] In some embodiments, DNA libraries are amplified and
purified using commercial
reagents (e.g., Axygen MAG PCR clean up beads). In some such embodiments, the
concentration and/or quantity of the DNA molecules is then quantified using a
fluorescent
dye and a fluorescence microplate reader, standard spectrofluorometer, or
filter fluorometer.
In some embodiments, library amplification is performed on a device (e.g., an
Illumina C-
Bot2) and the resulting flow cell containing amplified target-captured DNA
libraries is
sequenced on a next generation sequencer (e.g., an Illumina HiSeq 4000 or an
Illumina
NovaSeq 6000) to a unique on-target depth selected by the user. In some
embodiments,
DNA library preparation is performed with an automated system, using a liquid
handling
robot (e.g., a SciClone NGSx).
[0283] In some embodiments, where feature data 125 includes
methylation states 132 for
one or more genomic locations, nucleic acids isolated from the biological
sample (e.g.,
cfDNA) are treated to convert unmethylated cytosines to uracils, e.g., prior
to generating the
sequencing library. Accordingly, when the nucleic acids are sequenced, all
cytosines called
in the sequencing reaction were necessarily methylated, since the unmethylated
cytosines
were converted to uracils and accordingly would have been called as
thymidines, rather than
cytosines, in the sequencing reaction. Commercial kits are available for
bisulfite-mediated
conversion of unmethylated cytosines to uracils, for instance, the EZ DNA
Methylation™-Gold, EZ DNA Methylation™-Direct, and EZ DNA Methylation™-Lightning kits
(available
from Zymo Research Corp (Irvine, CA)). Commercial kits are also available for
enzymatic
conversion of unmethylated cytosines to uracils, for example, the APOBEC-Seq kit
(available
from NEBiolabs, Ipswich, MA).
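As a non-limiting sketch of the logic described above, the following example infers the methylation state of each reference cytosine from a converted read. It assumes, purely for illustration, a forward-strand read that is already aligned to a reference segment of equal length without gaps; the function name is hypothetical.

```python
def call_cytosine_methylation(reference, converted_read):
    """Compare a conversion-treated read with its reference segment.

    A reference C that is still read as C is called methylated (True);
    a reference C that is read as T is called unmethylated (False).
    Returns {position: bool} for the reference cytosines that can be called.
    """
    if len(reference) != len(converted_read):
        raise ValueError("reference and read must be the same length")
    calls = {}
    pairs = zip(reference.upper(), converted_read.upper())
    for position, (ref_base, read_base) in enumerate(pairs):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[position] = True
        elif read_base == "T":
            calls[position] = False
        # Any other base is left uncalled (possible variant or sequencing error).
    return calls
```

For example, comparing the reference `ACGTCC` with the read `ACGTTC` calls the cytosines at positions 1 and 5 as methylated and the cytosine at position 4 as unmethylated.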
[0284] In some embodiments, wet lab processing 204 includes
pooling (308) DNA
molecules from a plurality of libraries, corresponding to different samples
from the same
and/or different patients, to form a sequencing pool of DNA libraries. When
the pool of
DNA libraries is sequenced, the resulting sequence reads correspond to nucleic
acids isolated
from multiple samples. The sequence reads can be separated into different
sequence read
files, corresponding to the various samples represented in the sequencing read
based on the
unique identifiers present in the added nucleic acid fragments. In this
fashion, a single
sequencing reaction can generate sequence reads from multiple samples.
Advantageously,
this allows for the processing of more samples per sequencing reaction.
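The separation of pooled sequence reads by sample can be illustrated, without limitation, by the following sketch. It assumes that each read has already been paired with its index sequence and that indexes are matched exactly; the function name and the "undetermined" bin are hypothetical conventions for the example.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample):
    """Split (index_sequence, read_sequence) pairs into per-sample read lists.

    Reads whose index is absent from index_to_sample are placed in an
    'undetermined' bin rather than discarded.
    """
    per_sample = defaultdict(list)
    for index_sequence, read_sequence in reads:
        sample = index_to_sample.get(index_sequence, "undetermined")
        per_sample[sample].append(read_sequence)
    return dict(per_sample)
```

Exact matching keeps the sketch short; a production demultiplexer would typically tolerate a small number of index mismatches.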
[0285] In some embodiments, wet lab processing 204 includes
enriching (310) a
sequencing library, or pool of sequencing libraries, for target nucleic acids,
e.g., nucleic acids
encompassing loci that are informative for precision oncology and/or used as
internal controls
for the sequencing or bioinformatics processes. In some embodiments,
enrichment is
achieved by hybridizing target nucleic acids in the sequencing library to
probes that hybridize
to the target sequences, and then isolating the captured nucleic acids away
from off-target
nucleic acids that are not bound by the capture probes. In some embodiments,
one or more
off-target nucleic acids will remain in the final sequencing pool.
[0286] Advantageously, enriching for target sequences prior to
sequencing nucleic acids
significantly reduces the costs and time associated with sequencing,
facilitates multiplex
sequencing by allowing multiple samples to be mixed together for a single
sequencing
reaction, and significantly reduces the computation burden of aligning the
resulting sequence
reads, as a result of significantly reducing the total amount of nucleic acids
analyzed from
each sample.
[0287] In some embodiments, the enrichment is performed prior to
pooling multiple
nucleic acid sequencing libraries. However, in other embodiments, the
enrichment is
performed after pooling nucleic acid sequencing libraries, which has the
advantage of
reducing the number of enrichment assays that have to be performed.
[0288] In some embodiments, the enrichment is performed prior to
generating a nucleic
acid sequencing library. This has the advantage that fewer reagents are needed
to perform
both the enrichment (because there are fewer target sequences at this point,
prior to library
amplification) and the library production (because there are fewer nucleic
acid molecules to
tag and amplify after the enrichment). However, this raises the possibility of
pull-down bias
and/or that small variations in the enrichment protocol will result in less
consistent results.
[0289] In some embodiments, nucleic acid libraries are pooled
(two or more DNA
libraries may be mixed to create a pool) and treated with reagents to reduce
off-target capture,
for example Human COT-1 and/or IDT xGen Universal Blockers. Pools may be dried
in a
vacufuge and resuspended. DNA libraries or pools may be hybridized to a probe
set (for
example, a probe set specific to a panel that includes loci from at least 100,
600, 1,000,
10,000, etc. of the 19,000 known human genes) and amplified with commercially
available
reagents (for example, the KAPA HiFi HotStart ReadyMix). For example, in some
embodiments, a pool is incubated in an incubator, PCR machine, water bath, or
other
temperature-modulating device to allow probes to hybridize. Pools may then be
mixed with
Streptavidin-coated beads or another means for capturing hybridized DNA-probe
molecules,
such as DNA molecules representing exons of the human genome and/or genes
selected for a
genetic panel.
[0290] Pools may be amplified and purified more than once using
commercially available
reagents, for example, the KAPA HiFi Library Amplification kit and Axygen MAG
PCR
clean up beads, respectively. The pools or DNA libraries may be analyzed to
determine the
concentration or quantity of DNA molecules, for example by using a fluorescent
dye (for
example, PicoGreen pool quantification) and a fluorescence microplate reader,
standard
spectrofluorometer, or filter fluorometer. In one example, the DNA library
preparation
and/or capture is performed with an automated system, using a liquid handling
robot (for
example, a SciClone NGSx).
[0291] In some embodiments, e.g., where a whole genome sequencing
method will be
used, nucleic acid sequencing libraries are not target-enriched prior to
sequencing, in order to
obtain sequencing data on substantially all of the competent nucleic acids in
the sequencing
library. Similarly, in some embodiments, e.g., where a whole genome sequencing
method
will be used, nucleic acid sequencing libraries are not mixed, because of
bandwidth
limitations related to obtaining significant sequencing depth across an entire
genome.
However, in other embodiments, e.g., where a low pass whole genome sequencing
(LPWGS)
methodology will be used, nucleic acid sequencing libraries can still be
pooled, because very
low average sequencing coverage is achieved across a respective genome, e.g.,
between about
0.5x and about 5x.
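The relationship between read count and average coverage that motivates pooling of LPWGS libraries can be illustrated with simple arithmetic. The sketch below assumes the usual approximation that mean coverage equals the number of reads multiplied by read length and divided by genome size; the 3,000,000,000 bp genome size and 150 bp read length used in the usage note are round illustrative figures, not values from this disclosure.

```python
import math


def mean_coverage(num_reads, read_length_bp, genome_size_bp):
    """Approximate mean coverage as total sequenced bases over genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage, read_length_bp, genome_size_bp):
    """Number of reads required to reach a target mean coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)
```

Under these assumptions, ten million 150 bp reads give about 0.5x coverage of a 3 Gb genome, which is why several low-pass libraries can share a single sequencing reaction.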
[0292] In some embodiments, a plurality of nucleic acid probes
(e.g., a probe set) is used
to enrich one or more target sequences in a nucleic acid sample (e.g., an
isolated nucleic acid
sample or a nucleic acid sequencing library), e.g., where one or more target
sequences is
informative for precision oncology. For instance, in some embodiments, one or
more of the
target sequences encompasses a locus that is associated with an actionable
allele. That is,
variations of the target sequence are associated with targeted therapeutic
approaches. In
some embodiments, one or more of the target sequences and/or a property of one
or more of
the target sequences is used in a classifier trained to distinguish two or
more cancer states.
[0293] In some embodiments, the probe set includes probes
targeting one or more gene
loci, e.g., exon or intron loci. In some embodiments, the probe set includes
probes targeting
one or more loci not encoding a protein, e.g., regulatory loci, miRNA loci,
and other non-coding
loci, e.g., that have been found to be associated with cancer. In some
embodiments,
the plurality of loci includes at least 25, 50, 100, 150, 200, 250, 300, 350,
400, 500, 750,
1000, 2500, 5000, or more human genomic loci.
[0294] In some embodiments, the probe set includes probes
targeting one or more of the
genes listed in Table 1. In some embodiments, the probe set includes probes
targeting at least
of the genes listed in Table 1. In some embodiments, the probe set includes
probes
targeting at least 10 of the genes listed in Table 1. In some embodiments, the
probe set
includes probes targeting at least 25 of the genes listed in Table 1. In some
embodiments, the
probe set includes probes targeting at least 50 of the genes listed in Table
1. In some
embodiments, the probe set includes probes targeting at least 75 of the genes
listed in Table
1. In some embodiments, the probe set includes probes targeting at least 100
of the genes
listed in Table 1. In some embodiments, the probe set includes probes
targeting all of the
genes listed in Table 1.
[0295] Table 1. An example panel of 105 genes that are
informative for precision
oncology.
Table 1: Example Gene Panel for Precision Oncology
ALK     B2M     ERRFI1  IDH2    MSH6      PIK3R1  SPOP
FGFR2   BAP1    ESR1    JAK1    MTOR      PMS2    STK11
FGFR3   BRCA1   EZH2    JAK2    MYCN      PTCH1   TERT
NTRK1   BRCA2   FBXW7   JAK3    NF1       PTEN    TP53
RET     BTK     FGFR1   KDR     NF2       PTPN11  TSC1
ROS1    CCND1   FGFR4   KEAP1   NFE2L2    RAD51C  TSC2
BRAF    CCND2   FLT3    KIT     NOTCH1    RAF1    UGT1A1
AKT1    CCND3   FOXL2   KRAS    NPM1      RB1     VHL
AKT2    CDH1    GATA3   MAP2K1  NRAS      RHEB    CCNE1
APC     CDK4    GNA11   MAP2K2  PALB2     RHOA    CD274
AR      CDK6    GNAQ    MAPK1   PBRM1     RIT1    EGFR
ARAF    CDKN2A  GNAS    MLH1    PDCD1LG2  RNF43   ERBB2
ARID1A  CTNNB1  HNF1A   MPL     PDGFRA    SDHA    MET
ATM     DDR2    HRAS    MSH2    PDGFRB    SMAD4   MYC
ATR     DPYD    IDH1    MSH3    PIK3CA    SMO     KMT2A
[0296] In some embodiments, the probe set includes probes
targeting one or more of the
genes listed in List 1, provided below. In some embodiments, the probe set
includes probes
targeting at least 5 of the genes listed in List 1. In some embodiments, the
probe set includes
probes targeting at least 10 of the genes listed in List 1. In some
embodiments, the probe set
includes probes targeting at least 25 of the genes listed in List 1. In some
embodiments, the
probe set includes probes targeting at least 50 of the genes listed in List 1.
In some
embodiments, the probe set includes probes targeting at least 70 of the genes
listed in List 1.
In some embodiments, the probe set includes probes targeting all of the genes
listed in List 1.
[0297] In some embodiments, the probe set includes probes
targeting one or more of the
genes listed in List 2, provided below. In some embodiments, the probe set
includes probes
targeting at least 5 of the genes listed in List 2. In some embodiments, the
probe set includes
probes targeting at least 10 of the genes listed in List 2. In some
embodiments, the probe set
includes probes targeting at least 25 of the genes listed in List 2. In some
embodiments, the
probe set includes probes targeting at least 50 of the genes listed in List 2.
In some
embodiments, the probe set includes probes targeting at least 75 of the genes
listed in List 2.
In some embodiments, the probe set includes probes targeting at least 100 of
the genes listed
in List 2. In some embodiments, the probe set includes probes targeting all of
the genes listed
in List 2.
[0298] In some embodiments, panels of genes including one or more
genes from the
following lists are used for analyzing specimens, sequencing, and/or
identification. In some
embodiments, panels of genes for analyzing specimens, sequencing, and/or
identification
include one or more genes from List 1 or List 2. In some embodiments, panels
of genes for
analyzing specimens, sequencing, and/or identification include one or more
genes from:
[0299] List 1: AKT1 (14q32.33), ALK (2p23.2-23.1), APC (5q22.2),
AR (Xq12), ARAF
(Xp11.3), ARID1A (1p36.11), ATM (11q22.3), BRAF (7q34), BRCA1 (17q21.31),
BRCA2
(13q13.1), CCND1 (11q13.3), CCND2 (12p13.32), CCNE1 (19q12), CDH1 (16q22.1),
CDK4 (12q14.1), CDK6 (7q21.2), CDKN2A (9p21.3), CTNNB1 (3p22.1), DDR2
(1q23.3),
EGFR (7p11.2), ERBB2 (17q12), ESR1 (6q25.1-25.2), EZH2 (7q36.1), FBXW7
(4q31.3),
FGFR1 (8p11.23), FGFR2 (10q26.13), FGFR3 (4p16.3), GATA3 (10p14), GNA11
(19p13.3), GNAQ (9q21.2), GNAS (20q13.32), HNF1A (12q24.31), HRAS (11p15.5),
IDH1
(2q34), IDH2 (15q26.1), JAK2 (9p24.1), JAK3 (19p13.11), KIT (4q12), KRAS
(12p12.1),
MAP2K1 (15q22.31), MAP2K2 (19p13.3), MAPK1 (22q11.22), MAPK3 (16p11.2), MET
(7q31.2), MLH1 (3p22.2), MPL (1p34.2), MTOR (1p36.22), MYC (8q24.21), NF1
(17q11.2), NFE2L2 (2q31.2), NOTCH1 (9q34.3), NPM1 (5q35.1), NRAS (1p13.2),
NTRK1
(1q23.1), NTRK3 (15q25.3), PDGFRA (4q12), PIK3CA (3q26.32), PTEN (10q23.31),
PTPN11 (12q24.13), RAF1 (3p25.2), RB1 (13q14.2), RET (10q11.21), RHEB
(7q36.1),
RHOA (3p21.31), RIT1 (1q22), ROS1 (6q22.1), SMAD4 (18q21.2), SMO (7q32.1),
STK11
(19p13.3), TERT (5p15.33), TP53 (17p13.1), TSC1 (9q34.13), and VHL (3p25.3).
[0300] List 2: ABL1, ACVR1B, AKT1, AKT2, AKT3, ALK, ALOX12B,
AMER1
(FAM123B), APC, AR, ARAF, ARFRP1, ARID1A, ASXL1, ATM, ATR, ATRX, AURKA,
AURKB, AXIN1, AXL, BAP1, BARD1, BCL2, BCL2L1, BCL2L2, BCL6, BCOR,
BCORL1, BRAF, BRCA1, BRCA2, BRD4, BRIP1, BTG1, BTG2, BTK, C11orf30 (EMSY),
C17orf39 (GID4), CALR, CARD11, CASP8, CBFB, CBL, CCND1, CCND2, CCND3,
CCNE1, CD22, CD274 (PD-L1), CD70, CD79A, CD79B, CDC73, CDH1, CDK12, CDK4,
CDK6, CDK8, CDKN1A, CDKN1B, CDKN2A, CDKN2B, CDKN2C, CEBPA, CHEK1,
CHEK2, CIC, CREBBP, CRKL, CSF1R, CSF3R, CTCF, CTNNA1, CTNNB1, CUL3,
CUL4A, CXCR4, CYP17A1, DAXX, DDR1, DDR2, DIS3, DNMT3A, DOT1L, EED,
EGFR, EP300, EPHA3, EPHB1, EPHB4, ERBB2, ERBB3, ERBB4, ERCC4, ERG, ERRFI1,
ESR1, EZH2, FAM46C, FANCA, FANCC, FANCG, FANCL, FAS, FBXW7, FGF10,
FGF12, FGF14, FGF19, FGF23, FGF3, FGF4, FGF6, FGFR1, FGFR2, FGFR3, FGFR4, FH,
FLCN, FLT1, FLT3, FOXL2, FUBP1, GABRA6, GATA3, GATA4, GATA6, GNA11,
GNA13, GNAQ, GNAS, GRM3, GSK3B, H3F3A, HDAC1, HGF, HNF1A, HRAS,
HSD3B1, ID3, IDH1, IDH2, IGF1R, IKBKE, IKZF1, INPP4B, IRF2, IRF4, IRS2, JAK1,
JAK2, JAK3, JUN, KDM5A, KDM5C, KDM6A, KDR, KEAP1, KEL, KIT, KLHL6,
KMT2A, KMT2D (MLL2), KRAS, LTK, LYN, MAF, MAP2K1 (MEK1), MAP2K2
(MEK2), MAP2K4, MAP3K1, MAP3K13, MAPK1, MCL1, MDM2, MDM4, MED12,
MEF2B, MEN1, MERTK, MET, MITF, MKNK1, MLH1, MPL, MRE11A, MSH2, MSH3,
MSH6, MST1R, MTAP, MTOR, MUTYH, MYC, MYCL (MYCL1), MYCN, MYD88,
NBN, NF1, NF2, NFE2L2, NFKBIA, NKX2-1, NOTCH1, NOTCH2, NOTCH3, NPM1,
NRAS, NSD3 (WHSC1L1), NT5C2, NTRK1, NTRK2, NTRK3, P2RY8, PALB2, PARK2,
PARP1, PARP2, PARP3, PAX5, PBRM1, PDCD1 (PD-1), PDCD1LG2 (PD-L2), PDGFRA,
PDGFRB, PDK1, PIK3C2B, PIK3C2G, PIK3CA, PIK3CB, PIK3R1, PIM1, PMS2, POLD1,
POLE, PPARG, PPP2R1A, PPP2R2A, PRDM1, PRKAR1A, PRKCI, PTCH1, PTEN,
PTPN11, PTPRO, QKI, RAC1, RAD21, RAD51, RAD51B, RAD51C, RAD51D, RAD52,
RAD54L, RAF1, RARA, RB1, RBM10, REL, RET, RICTOR, RNF43, ROS1, RPTOR,
SDHA, SDHB, SDHC, SDHD, SETD2, SF3B1, SGK1, SMAD2, SMAD4, SMARCA4,
SMARCB1, SMO, SNCAIP, SOCS1, SOX2, SOX9, SPEN, SPOP, SRC, STAG2, STAT3,
STK11, SUFU, SYK, TBX3, TEK, TERC, TERT, TET2, ncRNA, Promoter, TGFBR2,
TIPARP, TNFAIP3, TNFRSF14, TP53, TSC1, TSC2, TYRO3, U2AF1, VEGFA, VHL,
WHSC1, WT1, XPO1, XRCC2, ZNF217, and ZNF703.
[0301] Generally, probes for enrichment of nucleic acids (e.g.,
cfDNA obtained from a
liquid biopsy sample) include DNA, RNA, or a modified nucleic acid structure
with a base
sequence that is complementary to a locus of interest. For instance, a probe
designed to
hybridize to a locus in a cfDNA molecule can contain a sequence that is
complementary to
either strand, because the cfDNA molecules are double stranded. In some
embodiments, each
probe in the plurality of probes includes a nucleic acid sequence that is
identical or
complementary to at least 10, at least 11, at least 12, at least 13, at least
14, or at least 15
consecutive bases of a locus of interest. In some embodiments, each probe in
the plurality of
probes includes a nucleic acid sequence that is identical or complementary to
at least 20, 25,
30, 40, 50, 75, 100, 150, 200, or more consecutive bases of a locus of
interest.
[0302] Targeted panels provide several benefits for nucleic acid
sequencing. For
example, in some embodiments, algorithms for discriminating between, e.g., a
first and
second cancer condition can be trained on smaller, more informative data sets
(e.g., fewer
genes), which leads to more computationally efficient training of classifiers
that discriminate
between the first and second cancer states. Such improvements in computational
efficiency,
owing to the reduced size of the discriminating gene set, can advantageously
either be used to
speed up classifier training or be used to improve the performance of such
classifiers (e.g.,
through more extensive training of the classifier).
[0303] In some embodiments, the gene panel is a whole-exome panel
that analyzes the
exomes of a biological sample. In some embodiments, the gene panel is a whole-
genome
panel that analyzes the genome of a specimen. In some embodiments, the gene
panel is
optimized for use with liquid biopsy samples (e.g., to provide clinical
decision support for
solid tumors). See, for example, Table 1 above.
[0304] In some embodiments, the probes include additional nucleic
acid sequences that
do not share any homology to the locus of interest. For example, in some
embodiments, the
probes also include nucleic acid sequences containing an identifier sequence,
e.g., a unique
molecular identifier (UMI), e.g., that is unique to a particular sample or
subject. Examples of
identifier sequences are described, for example, in Kivioja et al., 2011, Nat.
Methods 9(1),
pp. 72-74 and Islam et al., 2014, Nat. Methods 11(2), pp. 163-66, which are
incorporated by
reference herein. Similarly, in some embodiments, the probes also include
primer nucleic
acid sequences useful for amplifying the nucleic acid molecule of interest,
e.g., using PCR.
In some embodiments, the probes also include a capture sequence designed to
hybridize to an
anti-capture sequence for recovering the nucleic acid molecule of interest
from the sample.
[0305] Likewise, in some embodiments, the probes each include a
non-nucleic acid
affinity moiety covalently attached to a nucleic acid molecule that is
complementary to the
locus of interest, for recovering the nucleic acid molecule of interest.
Non-limiting examples
of non-nucleic acid affinity moieties include biotin, digoxigenin, and
dinitrophenol. In some
embodiments, the probe is attached to a solid-state surface or particle, e.g.,
a dipstick or
magnetic bead, for recovering the nucleic acid of interest. In some
embodiments, the
methods described herein include amplifying the nucleic acids that bound to
the probe set
prior to further analysis, e.g., sequencing. Methods for amplifying nucleic
acids, e.g., by
PCR, are well known in the art.
[0306] Sequence reads are then generated (312) from the
sequencing library or pool of
sequencing libraries. Sequencing data may be acquired by any methodology known
in the
art. For example, sequencing data may be acquired using next generation
sequencing (NGS) techniques such as
sequencing-by-synthesis technology (Illumina), pyrosequencing (454 Life Sciences), ion
semiconductor
technology (Ion Torrent sequencing), single-molecule real-time sequencing
(Pacific
Biosciences), sequencing by ligation (SOLiD sequencing), nanopore sequencing
(Oxford
Nanopore Technologies), or paired-end sequencing. In some embodiments,
massively
massively
parallel sequencing is performed using sequencing-by-synthesis with reversible
dye
terminators. In some embodiments, sequencing is performed using next
generation
sequencing technologies, such as short-read technologies. In other
embodiments, long-read
sequencing or another sequencing method known in the art is used.
[0307] Next-generation sequencing produces millions of short
reads (e.g., sequence
reads) for each biological sample. Accordingly, in some embodiments, the
plurality of
sequence reads obtained by next-generation sequencing of cfDNA molecules are
DNA
sequence reads. In some embodiments, the sequence reads have an average length
of at least
fifty nucleotides. In other embodiments, the sequence reads have an average
length of at
least 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, or more nucleotides.
[0308] In some embodiments, sequencing is performed after
enriching for nucleic acids
(e.g., cfDNA, gDNA, and/or RNA) encompassing a plurality of predetermined
target
sequences, e.g., human genes and/or non-coding sequences associated with
cancer.
Advantageously, sequencing a nucleic acid sample that has been enriched for
target nucleic
acids, rather than all nucleic acids isolated from a biological sample,
significantly reduces the
average time and cost of the sequencing reaction. Accordingly, in some
embodiments, the
methods described herein include obtaining a plurality of sequence reads of
nucleic acids that
have been hybridized to a probe set for hybrid-capture enrichment (e.g., of
one or more genes
listed in Table 1).
[0309] In some embodiments, panel-targeting sequencing is
performed to an average on-target
depth of at least 500x, at least 750x, at least 1000x, at least 2500x,
at least 5000x, at
least 10,000x, or greater depth. In some embodiments, samples are further
assessed for
uniformity above a sequencing depth threshold (e.g., 95% of all targeted base
pairs at 300x
sequencing depth). In some embodiments, the sequencing depth threshold is a
minimum
depth selected by a user or practitioner.
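The uniformity assessment described above can be sketched, without limitation, as a simple fraction-of-bases calculation. The defaults of 300x and 95% mirror the example given in the text; the function names and the representation of coverage as a list of per-base depths are assumptions made for illustration.

```python
def fraction_at_or_above(per_base_depths, depth_threshold):
    """Fraction of targeted bases covered at or above the depth threshold."""
    if not per_base_depths:
        raise ValueError("no depth values supplied")
    covered = sum(depth >= depth_threshold for depth in per_base_depths)
    return covered / len(per_base_depths)


def passes_uniformity(per_base_depths, depth_threshold=300, min_fraction=0.95):
    """True when enough targeted bases meet the minimum sequencing depth."""
    return fraction_at_or_above(per_base_depths, depth_threshold) >= min_fraction
```

A sample with 96 of 100 targeted bases at 500x and four bases at 100x would pass this check, whereas a sample with only 90 of 100 bases at 500x would not.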
[0310] In some embodiments, the sequence reads are obtained by a
whole genome or
whole exome sequencing methodology. In some such embodiments, whole exome
capture is
performed with an automated system, using a liquid handling robot (for
example, a SciClone
NGSx). Whole genome sequencing, and to some extent whole exome sequencing, is
typically performed at lower sequencing depth than smaller target-panel
sequencing
reactions, because many more loci are being sequenced. For example, in some
embodiments,
whole genome or whole exome sequencing is performed to an average sequencing
depth of at
least 3x, at least 5x, at least 10x, at least 15x, at least 20x, or greater.
In some embodiments,
low-pass whole genome sequencing (LPWGS) techniques are used for whole genome
or
whole exome sequencing. LPWGS is typically performed to an average sequencing
depth of
about 0.25x to about 5x, more typically to an average sequencing depth of
about 0.5x to
about 3x.
[0311] Because of the differences in the sequencing
methodologies, data obtained from
targeted-panel sequencing is better suited for certain analyses than data
obtained from whole
genome/whole exome sequencing, and vice versa. For instance, because of the
higher
sequencing depth achieved by targeted-panel sequencing, the resulting sequence
data is better
suited for the identification of variant alleles present at low allelic
fractions in the sample,
e.g., less than 20%. By contrast, data generated from whole genome/whole exome
sequencing is better suited for the estimation of genome-wide metrics, such as
tumor
mutational burden, because the entire genome is better represented in the
sequencing data.
Accordingly, in some embodiments, a nucleic acid sample, e.g., a cfDNA, gDNA,
or mRNA
sample, is evaluated using both targeted-panel sequencing and whole
genome/whole exome
sequencing (e.g., LPWGS).
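The effect of sequencing depth on the ability to observe low-fraction variants can be illustrated with a simple binomial model. The sketch below is an idealization: it assumes each read at a position independently carries the variant with probability equal to the variant allele fraction, it ignores sequencing error, and the requirement of at least `k` supporting reads is a hypothetical calling rule introduced only for this example.

```python
from math import comb


def prob_at_least_k_alt_reads(depth, variant_allele_fraction, k):
    """Probability of observing at least k variant-supporting reads at a
    position covered by `depth` reads, under a binomial sampling model."""
    prob_fewer_than_k = sum(
        comb(depth, i)
        * variant_allele_fraction**i
        * (1 - variant_allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - prob_fewer_than_k
```

Under this model, a variant at a 0.5% allele fraction is very unlikely to yield three supporting reads at 20x depth but is almost certain to do so at 5000x depth, consistent with the contrast drawn above between whole genome and targeted-panel data.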
[0312] In some embodiments, the raw sequence reads resulting from
the sequencing
reaction are output from the sequencer in a native file format, e.g., a BCL
file. In some
embodiments, the native file is passed directly to a bioinformatics pipeline
(e.g., variant
analysis 206), components of which are described in detail below. In other
embodiments,
pre-processing is performed prior to passing the sequences to the
bioinformatics platform.
For instance, in some embodiments, the format of the sequence read file is
converted from
the native file format (e.g., BCL) to a file format compatible with one or
more algorithms
used in the bioinformatics pipeline (e.g., FASTQ or FASTA). In some
embodiments, the raw
sequence reads are filtered to remove sequences that do not meet one or more
quality
thresholds. In some embodiments, raw sequence reads generated from the same
unique
nucleic acid molecule in the sequencing read are collapsed into a single
sequence read
representing the molecule, e.g., using UMIs as described above. In some
embodiments, one
or more of these pre-processing activities is performed within the
bioinformatics pipeline
itself.
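Collapsing reads that derive from the same original molecule can be sketched as follows. This is an illustrative simplification rather than the method of any particular pipeline: reads are grouped by UMI alone (a real implementation would typically also use mapping position), reads within a family are assumed to be the same length, and the consensus is a per-position majority vote. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Collapse (umi, sequence) pairs into one consensus sequence per UMI.

    The consensus base at each position is the most frequent base in the
    family; ties resolve to the base seen first. If lengths differ, zip()
    truncates the consensus to the shortest read in the family.
    """
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)
    consensus = {}
    for umi, sequences in families.items():
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus
```

Because an isolated sequencing or amplification error is outvoted by the other reads in its family, this kind of collapsing also acts as a form of error suppression.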
[0313] In one example, a sequencer may generate a BCL file. A BCL
file may include
raw image data of a plurality of patient specimens which are sequenced. BCL
image data is
an image of the flow cell across each cycle during sequencing. A cycle may be
implemented
by illuminating a patient specimen with a specific wavelength of
electromagnetic radiation,
generating a plurality of images which may be processed into base calls via
BCL to FASTQ
processing algorithms which identify which base pairs are present at each
cycle. The
resulting FASTQ file includes the entirety of reads for each patient specimen
paired with a
quality metric, e.g., in a range from 0 to 64 where a 64 is the best quality
and a 0 is the worst
quality. In embodiments where both a liquid biopsy sample and a normal tissue
sample are
sequenced, sequence reads in the corresponding FASTQ files may be matched,
such that a
liquid biopsy-normal analysis may be performed.
[0314] FASTQ format is a text-based format for storing both a
biological sequence, such
as a nucleotide sequence, and its corresponding quality scores. These FASTQ
files are
analyzed to determine what genetic variants or copy number changes are present
in the
sample. Each FASTQ file contains reads that may be paired-end or single reads,
and may be
short-reads or long-reads, where each read represents one detected sequence of
nucleotides in
a nucleic acid molecule that was isolated from the patient sample or a copy of
the nucleic
acid molecule, detected by the sequencer. Each read in the FASTQ file is also
associated
with a quality rating. The quality rating may reflect the likelihood that an
error occurred
during the sequencing procedure that affected the associated read. In some
embodiments, the
results of paired-end sequencing of each isolated nucleic acid sample are
contained in a split
pair of FASTQ files, for efficiency. Thus, in some embodiments, forward (Read
1) and
reverse (Read 2) sequences of each isolated nucleic acid sample are stored
separately but in
the same order and under the same identifier.
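The FASTQ record structure and the pairing of Read 1 and Read 2 files can be illustrated with the following minimal sketch. It assumes four-line records and the commonly used Phred+33 ASCII encoding of quality scores, in which the error probability for a score Q is 10^(-Q/10); the encoding offset and all function names are assumptions for the example rather than details taken from this disclosure.

```python
def parse_fastq(text, phred_offset=33):
    """Parse FASTQ text into (identifier, sequence, quality_scores) tuples."""
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must span exactly four lines")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i : i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality strings differ in length")
        identifier = header[1:].split()[0]
        scores = [ord(char) - phred_offset for char in quality]
        records.append((identifier, sequence, scores))
    return records


def error_probability(phred_score):
    """Convert a Phred quality score to the estimated base-call error rate."""
    return 10 ** (-phred_score / 10)


def mates_are_synchronized(read1_records, read2_records):
    """True if two parsed files list the same identifiers in the same order."""
    return [r[0] for r in read1_records] == [r[0] for r in read2_records]
```

The synchronization check reflects the convention noted above that forward and reverse reads are stored separately but in the same order and under the same identifier.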
[0315] In various embodiments, the bioinformatics pipeline may
filter FASTQ data from
the corresponding sequence data file for each respective biological sample.
Such filtering
may include correcting or masking sequencer errors and removing (trimming) low
quality
sequences or bases, adapter sequences, contaminations, chimeric reads,
overrepresented
sequences, biases caused by library preparation, amplification, or capture,
and other errors.
[0316] While workflow 200 illustrates obtaining a biological
sample, extracting nucleic
acids from the biological sample, and sequencing the isolated nucleic acids,
in some
embodiments, sequencing data used in the improved systems and methods
described herein
(e.g., which include improved methods for validating copy number variations,
improved
methods for validating a somatic sequence variant in a test subject having a
cancer condition,
and/or improved methods for determining accurate circulating tumor fraction
estimates) is
obtained by receiving previously generated sequence reads, in electronic form.
[0317] Referring again to Figure 2A, nucleic acid sequencing data
122 generated from
the one or more patient samples is then evaluated (e.g., via variant analysis
206) in a
bioinformatics pipeline, e.g., using bioinformatics module 140 of system 100,
to identify
genomic alterations and other metrics in the cancer genome of the patient. An
example
overview for a bioinformatics pipeline is described below with respect to
Figure 4 (e.g.,
Figure 4A-E, 4F1-3, and/or 4G1-3). Advantageously, in some embodiments, the
present
disclosure improves bioinformatics pipelines, like pipeline 206, by improving
methods and
systems for the validation of copy number variations, the validation of
somatic sequence
variants, and/or the determination of circulating tumor fraction estimates.
[0318] Figure 4A illustrates an example bioinformatics pipeline
206 (e.g., as used for
feature extraction in the workflows illustrated in Figures 2A and 3) for
providing clinical
support for precision oncology. As shown in Figure 4A, sequencing data 122
obtained from
the wet lab processing 204 (e.g., sequence reads 314) is input into the
pipeline.
[0319] In various embodiments, the bioinformatics pipeline
includes a circulating tumor
DNA (ctDNA) pipeline for analyzing liquid biopsy samples. The pipeline may
detect SNVs,
INDELs, copy number amplifications/deletions and genomic rearrangements (for
example,
fusions). The pipeline may employ unique molecular index (UMI)-based consensus
base
calling as a method of error suppression as well as a Bayesian tri-nucleotide
context-based
position level error suppression. In various embodiments, it is able to detect
variants having a
0.1%, 0.15%, 0.2%, 0.25%, 0.3%, 0.4%, or 0.5% variant allele fraction.
[0320] In some embodiments, the sequencing data is processed
(e.g., using sequence data
processing module 141) to prepare it for genomic feature identification 385.
For instance, in
some embodiments as described above, the sequencing data is present in a
native file format
provided by the sequencer. Accordingly, in some embodiments, the system (e.g.,
system
100) applies a pre-processing algorithm 142 to convert the file format (318)
to one that is
recognized by one or more upstream processing algorithms. For example, BCL
file outputs
from a sequencer can be converted to a FASTQ file format using the bcl2fastq
or bcl2fastq2
conversion software (Illumina®). FASTQ format is a text-based format for
storing both a
biological sequence, such as nucleotide sequence, and its corresponding
quality scores. These
FASTQ files are analyzed to determine what genetic variants, copy number
changes, etc., are
present in the sample.
[0321] In some embodiments, other preprocessing functions are
performed, e.g., filtering
sequence reads 122 based on a desired quality, e.g., size and/or quality of
the base calling. In
some embodiments, quality control checks are performed to ensure the data is
sufficient for
variant calling. For instance, entire reads, individual nucleotides, or
multiple nucleotides that
are likely to have errors may be discarded based on the quality rating
associated with the read
in the FASTQ file, the known error rate of the sequencer, and/or a comparison
between each
nucleotide in the read and one or more nucleotides in other reads that have
been aligned to the
same location in the reference genome. Filtering may be done in part or in its
entirety by
various software tools, for example, a software tool such as Skewer. See,
Jiang, H. et al.,
BMC Bioinformatics 15(182):1-12 (2014). FASTQ files may be analyzed for rapid
assessment of quality control and reads, for example, by a sequencing data QC
software such
as AfterQC, Kraken, RNA-SeQC, FastQC, or another similar software program. For
paired
end reads, reads may be merged.
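By way of non-limiting illustration, the read-level quality filtering described above can be sketched as follows. The function names, the Phred+33 quality encoding, and the cutoffs (a mean quality of 20 and a minimum length of 30) are assumptions made for this sketch and are not taken from the disclosure or from any of the named tools.

```python
from statistics import mean


def parse_fastq(lines):
    """Yield (identifier, sequence, quality string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # '+' separator line
        qual = next(it)
        yield header.strip()[1:], seq.strip(), qual.strip()


def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def filter_reads(records, min_mean_q=20, min_length=30):
    """Keep reads that are long enough and whose mean base quality is high enough."""
    kept = []
    for name, seq, qual in records:
        if len(seq) >= min_length and mean(phred_scores(qual)) >= min_mean_q:
            kept.append((name, seq, qual))
    return kept
```

A production pipeline would typically also trim adapters and low-quality read ends rather than only discarding whole reads.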
[0322] In some embodiments, when both a liquid biopsy sample and
a normal tissue
sample from the patient are sequenced, two FASTQ output files are generated,
one for the
liquid biopsy sample and one for the normal tissue sample. A 'matched' (e.g.,
panel-specific)
workflow is run to jointly analyze the liquid biopsy-normal matched FASTQ
files. When a
matched normal sample is not available from the patient, FASTQ files from the
liquid biopsy
sample are analyzed in the 'tumor-only' mode. See, for example, Figure 4B. If
two or more
patient samples are processed simultaneously on the same sequencer flow cell,
e.g., a liquid
biopsy sample and a normal tissue sample, a difference in the sequence of the
adapters used
for each patient sample barcodes nucleic acids extracted from both samples, to
associate each
read with the correct patient sample and facilitate assignment to the correct
FASTQ file.
[0323] For efficiency, in some embodiments, the results of paired-
end sequencing of each
isolate are contained in a split pair of FASTQ files. Forward (Read 1) and
reverse (Read 2)
sequences of each tumor and normal isolate are stored separately but in the
same order and
under the same identifier. See, for example, Figure 4C. In various
embodiments, the
bioinformatics pipeline may filter FASTQ data from each isolate. Such
filtering may include
correcting or masking sequencer errors and removing (trimming) low quality
sequences or
bases, adapter sequences, contaminations, chimeric reads, overrepresented
sequences, biases
caused by library preparation, amplification, or capture, and other errors.
See, for example,
Figure 4D.
[0324] Similarly, in some embodiments, sequencing (312) is
performed on a pool of
nucleic acid sequencing libraries prepared from different biological samples,
e.g., from the
same or different patients. Accordingly, in some embodiments, the system
demultiplexes
(320) the data (e.g., using demultiplexing algorithm 144) to separate sequence
reads into
separate files for each sequencing library included in the sequencing pool,
e.g., based on UMI
or patient identifier sequences added to the nucleic acid fragments during
sequencing library
preparation, as described above. In some embodiments, the demultiplexing
algorithm is part
of the same software package as one or more pre-processing algorithms 142. For
instance,
the bcl2fastq or bcl2fastq2 conversion software (Illumina®) includes
instructions for both
converting the native file format output from the sequencer and demultiplexing
sequence
reads 122 output from the reaction.
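By way of non-limiting illustration, demultiplexing can be sketched as a lookup from a sample-identifying sequence to a per-library bin. This sketch assumes, for simplicity, that the identifier sequence is the first `barcode_len` bases of each read and that matching is exact; real converters typically read the identifier from a separate index read and tolerate mismatches. All names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample, barcode_len):
    """Split reads into per-sample lists based on a leading barcode.

    Reads whose barcode is not in the lookup table go to 'undetermined'.
    """
    bins = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(insert)
    return dict(bins)
```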
[0325] The sequence reads are then aligned (322), e.g., using an
alignment algorithm 143,
to a reference sequence construct 158, e.g., a reference genome, reference
exome, or other
reference construct prepared for a particular targeted-panel sequencing
reaction. For
example, in some embodiments, individual sequence reads 123, in electronic
form (e.g., in
FASTQ files), are aligned against a reference sequence construct for the
species of the
subject (e.g., a reference human genome) by identifying a sequence in a region
of the
reference sequence construct that best matches the sequence of nucleotides in
the sequence
read. In some embodiments, the sequence reads are aligned to a reference exome
or
reference genome using known methods in the art to determine alignment
position
information. The alignment position information may indicate a beginning
position and an
end position of a region in the reference genome that corresponds to a
beginning nucleotide
base and end nucleotide base of a given sequence read. Alignment position
information may
also include sequence read length, which can be determined from the beginning
position and
end position. A region in the reference genome may be associated with a gene
or a segment
of a gene. Any of a variety of alignment tools can be used for this task.
[0326] For instance, local sequence alignment algorithms compare
subsequences of
different lengths in the query sequence (e.g., sequence read) to subsequences
in the subject
sequence (e.g., reference construct) to create the best alignment for each
portion of the query
sequence. In contrast, global sequence alignment algorithms align the entirety
of the
sequences, e.g., end to end. Examples of local sequence alignment algorithms
include the
Smith-Waterman algorithm (see, for example, Smith and Waterman, J Mol. Biol.,
147(1):195-97 (1981), which is incorporated herein by reference), Lalign (see,
for example,
Huang and Miller, Adv. Appl. Math, 12:337-57 (1991), which is incorporated by
reference
herein), and PatternHunter (see, for example, Ma B. et al., Bioinformatics,
18(3):440-45
(2002), which is incorporated by reference herein).
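By way of non-limiting illustration, a minimal Smith-Waterman local alignment with a linear gap penalty is sketched below. The scoring values (match +2, mismatch -1, gap -2) are assumptions chosen for the example; practical aligners use tuned scoring schemes, affine gap penalties, and heavily optimized implementations.

```python
def smith_waterman(query, subject, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned query, aligned subject) for a local alignment."""
    rows, cols = len(query) + 1, len(subject) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best-scoring cell until a zero cell is reached.
    i, j = best_pos
    aligned_q, aligned_s = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == subject[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            aligned_q.append(query[i - 1])
            aligned_s.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_s.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_s.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_q)), "".join(reversed(aligned_s))
```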
[0327] In some embodiments, the read mapping process starts by
building an index of
either the reference genome or the reads, which is then used to retrieve the
set of positions in
the reference sequence where the reads are more likely to align. Once this
subset of possible
mapping locations has been identified, alignment is performed in these
candidate regions
with slower and more sensitive algorithms. See, for example, Hatem et al.,
2013,
"Benchmarking short sequence mapping tools," BMC Bioinformatics 14: p. 184;
and Flicek
and Birney, 2009, "Sense from sequence reads: methods for alignment and
assembly," Nat
Methods 6(Suppl. 11), S6-S12, each of which is hereby incorporated by
reference. In some
embodiments, the mapping tools methodology makes use of a hash table or a
Burrows-
Wheeler transform (BWT). See, for example, Li and Homer, 2010, "A survey of
sequence
alignment algorithms for next-generation sequencing," Brief Bioinformatics 11,
pp. 473-483,
which is hereby incorporated by reference.
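By way of non-limiting illustration, the hash-table variant of this index-then-align approach can be sketched as follows. The reference is indexed by fixed-length k-mers, and each k-mer of a read votes for the reference start position it implies. The function names and the choice of k are assumptions for this sketch; a Burrows-Wheeler-based index would serve the same purpose with a different data structure.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the 0-based positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Rank implied read start positions by the number of supporting k-mer hits."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1
    return votes.most_common()
```

The highest-ranked candidate regions would then be passed to a slower, more sensitive aligner for verification.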
[0328] Other software programs designed to align reads include,
for example, Novoalign
(Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA), and/or programs that
use a
Smith-Waterman algorithm. Candidate reference genomes include, for example,
hg19,
GRCh38, hg38, GRCh37, and/or other reference genomes developed by the Genome
Reference Consortium. In some embodiments, the alignment generates a SAM file,
which
stores the locations of the start and end of each read according to
coordinates in the reference
genome and the coverage (number of reads) for each nucleotide in the reference
genome.
[0329] For example, in some embodiments, each read of a FASTQ
file is aligned to a
location in the human genome having a sequence that best matches the sequence
of
nucleotides in the read. There are many software programs designed to align
reads, for
example, Novoalign (Novocraft, Inc.), Bowtie, Burrows Wheeler Aligner (BWA),
programs
that use a Smith-Waterman algorithm, etc. Alignment may be directed using a
reference
genome (for example, hg19, GRCh38, hg38, GRCh37, other reference genomes
developed by
the Genome Reference Consortium, etc.) by comparing the nucleotide sequences
in each read
with portions of the nucleotide sequence in the reference genome to determine
the portion of
the reference genome sequence that is most likely to correspond to the
sequence in the read.
In some embodiments, one or more SAM files are generated for the alignment,
which store
the locations of the start and end of each read according to coordinates in
the reference
genome and the coverage (number of reads) for each nucleotide in the reference
genome.
The SAM files may be converted to BAM files. In some embodiments, the BAM
files are
sorted, and duplicate reads are marked for deletion, resulting in de-
duplicated BAM files.
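By way of non-limiting illustration, the relationship between read start and end coordinates, read length, and per-nucleotide coverage can be sketched as follows. The sketch assumes 1-based, inclusive coordinates and ungapped alignments, which is a simplification of what an alignment file actually records; the function names are hypothetical.

```python
from collections import Counter


def read_length(start, end):
    """Length of an ungapped alignment given 1-based inclusive coordinates."""
    return end - start + 1


def per_base_coverage(alignments):
    """Count how many aligned reads overlap each reference position."""
    coverage = Counter()
    for start, end in alignments:
        for pos in range(start, end + 1):
            coverage[pos] += 1
    return coverage
```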
[0330] In some embodiments, adapter-trimmed FASTQ files are
aligned to the 19th
edition of the human reference genome build (HG19) using Burrows-Wheeler
Aligner
(BWA, Li and Durbin, Bioinformatics, 25(14):1754-60 (2009)). Following
alignment, reads
are grouped by alignment position and UMI family and collapsed into consensus
sequences,
for example, using fgbio tools (e.g., available on the internet at
fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or
significant
disagreement among family members (for example, when it is uncertain whether
the base is
an adenine, cytosine, guanine, etc.) may be replaced by N's to represent a
wildcard nucleotide
type. PHRED scores are then scaled based on initial base calling estimates
combined across
all family members. Following single-strand consensus generation, duplex
consensus
sequences are generated by comparing the forward and reverse oriented PCR
products with
mirrored UMI sequences. In various embodiments, a consensus can be generated
across read
pairs. Otherwise, single-strand consensus calls will be used. Following
consensus calling,
filtering is performed to remove low-quality consensus fragments. The
consensus fragments
are then re-aligned to the human reference genome using BWA. A BAM output file
is
generated after the re-alignment, then sorted by alignment position, and
indexed.
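By way of non-limiting illustration, grouping reads into UMI families and collapsing each family into a single-strand consensus can be sketched as follows. This is a simplified majority vote and is not the fgbio implementation, which uses a likelihood model over base qualities. The sketch assumes all reads in a family have equal length, and the 0.75 agreement cutoff is an assumption for the example.

```python
from collections import Counter, defaultdict


def group_by_family(reads):
    """Group (position, umi, sequence) tuples by alignment position and UMI."""
    families = defaultdict(list)
    for pos, umi, seq in reads:
        families[(pos, umi)].append(seq)
    return families


def consensus(seqs, min_agreement=0.75):
    """Call the majority base per column, or 'N' when agreement is too low."""
    called = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        called.append(base if count / len(column) >= min_agreement else "N")
    return "".join(called)
```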
[0331] In some embodiments, where both a liquid biopsy sample and
a normal tissue
sample are analyzed, this process produces a liquid biopsy BAM file (e.g.,
Liquid BAM 124-
1-i-cf) and a normal BAM file (e.g., Germline BAM 124-1-i-g), as illustrated
in Figure 4A.
In various embodiments, BAM files may be analyzed to detect genetic variants
and other
genetic features, including single nucleotide variants (SNVs), copy number
variants (CNVs),
gene rearrangements, etc.
[0332] In some embodiments, the sequencing data is normalized,
e.g., to account for pull-
down, amplification, and/or sequencing bias (e.g., mappability, GC bias etc.).
See, for
example, Schwartz et al., PLoS ONE 6(1):e16685 (2011) and Benjamini and
Speed, Nucleic
Acids Research 40(10):e72 (2012), the contents of which are hereby
incorporated by
reference, in their entireties, for all purposes.
[0333] In some embodiments, SAM files generated after alignment
are converted to
BAM files 124. Thus, after preprocessing sequencing data generated for a
pooled sequencing
reaction, BAM files are generated for each of the sequencing libraries present
in the master
sequencing pools. For example, as illustrated in Figure 4A, separate BAM files
are generated
for each of three samples acquired from subject 1 at time i (e.g., tumor BAM
124-1-i-t
corresponding to alignments of sequence reads of nucleic acids isolated from a
solid tumor
sample from subject 1, Liquid BAM 124-1-i-cf corresponding to alignments of
sequence
reads of nucleic acids isolated from a liquid biopsy sample from subject 1,
and Germline
BAM 124-1-i-g corresponding to alignments of sequence reads of nucleic acids
isolated from
a normal tissue sample from subject 1), and one or more samples acquired from
one or more
additional subjects at time j (e.g., Tumor BAM 124-2-j-t corresponding to
alignments of
sequence reads of nucleic acids isolated from a solid tumor sample from
subject 2). In some
embodiments, BAM files are sorted, and duplicate reads are marked for
deletion, resulting in
de-duplicated BAM files. For example, tools like SamBAMBA mark and filter
duplicate
alignments in the sorted BAM files.
[0334] Many of the embodiments described below, in conjunction
with Figure 4 (e.g.,
Figure 4A-E, 4F1-3, and/or 4G1-3), relate to analyses performed using
sequencing data from
cfDNA of a cancer patient, e.g., obtained from a liquid biopsy sample of the
patient.
Generally, these embodiments are independent and, thus, not reliant upon any
particular
sequencing data generation methods, e.g., sample preparation, sequencing,
and/or data pre-
processing methodologies. However, in some embodiments, the methods described
below
include one or more features 204 of generating sequencing data, as illustrated
in Figures 2A
and 3.
[0335] Alignment files prepared as described above (e.g., BAM
files 124) are then passed
to a feature extraction module 145, where the sequences are analyzed (324) to
identify
genomic alterations (e.g., SNVs/MNVs, indels, genomic rearrangements, copy
number
variations, etc.) and/or determine various characteristics of the patient's
cancer (e.g., MSI
status, TMB, tumor ploidy, HRD status, tumor fraction, tumor purity,
methylation patterns,
etc.). Many software packages for identifying genomic alterations are known in
the art, for
example, freebayes, PolyBayes, samtools, GATK, pindel, SAMtools, Breakdancer,
Cortex,
Crest, Delly, Gridss, Hydra, Lumpy, Manta, and Socrates. For a review of many
of these
variant calling packages see, for example, Cameron, D.L. et al., Nat. Commun.,
10(3240):1-
11 (2019), the content of which is hereby incorporated by reference, in its
entirety, for all
purposes. Generally, these software packages identify variants in sorted SAM
or BAM files
124, relative to one or more reference sequence constructs 158. The software
packages then
output a file, e.g., a raw VCF (variant call format) file, listing the variants (e.g.,
genomic features
131) called and identifying their location relative to the reference sequence
construct (e.g.,
where the sequence of the sample nucleic acids differ from the corresponding
sequence in the
reference construct). In some embodiments, system 100 digests the contents of
the native
output file to populate feature data 125 in test patient data store 120. In
other embodiments,
the native output file serves as the record of these genomic features 131 in
test patient data
store 120.
[0336] Generally, the systems described herein can employ any
combination of available
variant calling software packages and internally developed variant
identification algorithms.
In some embodiments, the output of a particular algorithm of a variant calling
software is
further evaluated, e.g., to improve variant identification. Accordingly, in
some embodiments,
system 100 employs an available variant calling software package to perform
some or all of
the functionality of one or more of the algorithms shown in feature extraction
module 145.
[0337] In some embodiments, as illustrated in Figure 1A, separate
algorithms (or the
same algorithm implemented using different parameters) are applied to identify
variants
unique to the cancer genome of the patient and variants existing in the
germline of the
subject. In other embodiments, variants are identified indiscriminately and
later classified as
either germline or somatic, e.g., based on sequencing data, population data,
or a combination
thereof. In some embodiments, variants are classified as germline variants,
and/or non-
actionable variants, when they are represented in the population above a
threshold level, e.g.,
as determined using a population database such as ExAC or gnomAD. For
instance, in some
embodiments, variants that are represented in at least 1% of the alleles in a
population are
annotated as germline and/or non-actionable. In other embodiments, variants
that are
represented in at least 2%, at least 3%, at least 4%, at least 5%, at least
7.5%, at least 10%, or
more of the alleles in a population are annotated as germline and/or non-
actionable. In some
embodiments, sequencing data from a matched sample from the patient, e.g., a
normal tissue
sample, is used to annotate variants identified in a cancerous sample from the
subject. That
is, variants that are present in both the cancerous sample and the normal
sample represent
those variants that were in the germline prior to the patient developing
cancer and can be
annotated as germline variants.
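By way of non-limiting illustration, the two germline annotation rules described above (presence in a matched normal sample, and population allele frequency at or above a threshold such as 1%) can be sketched as follows. The variant keys, label strings, and function name are hypothetical, and the population frequencies are assumed to have been looked up beforehand from a database such as ExAC or gnomAD.

```python
def annotate_germline(variants, population_af, normal_variants=frozenset(),
                      af_threshold=0.01):
    """Label each variant using matched-normal evidence, then population frequency."""
    annotations = {}
    for variant in variants:
        if variant in normal_variants:
            # Seen in the patient's normal sample: present before the cancer arose.
            annotations[variant] = "germline"
        elif population_af.get(variant, 0.0) >= af_threshold:
            # Common in the population: treated as germline and/or non-actionable.
            annotations[variant] = "germline/non-actionable"
        else:
            annotations[variant] = "unclassified"
    return annotations
```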
[0338] In various aspects, the detected genetic variants and
genetic features are analyzed
as a form of quality control. For example, a pattern of detected genetic
variants or features
may indicate an issue related to the sample, sequencing procedure, and/or
bioinformatics
pipeline (e.g., contamination of the sample, mislabeling of the
sample, a change in
reagents, a change in the sequencing procedure and/or bioinformatics pipeline,
etc.).
[0339] Figure 4E illustrates an example workflow for genomic
feature identification
(324). This particular workflow is only an example of one possible collection
and
arrangement of algorithms for feature extraction from sequencing data 124.
Generally, any
combination of the modules and algorithms of feature extraction module 145,
e.g., illustrated
in Figure 1A, can be used for a bioinformatics pipeline, and particularly for
a bioinformatics
pipeline for analyzing liquid biopsy samples. For instance, in some
embodiments, an
architecture useful for the methods and systems described herein includes at
least one of the
modules or variant calling algorithms shown in feature extraction module 145.
In some
embodiments, an architecture includes at least 2, 3, 4, 5, 6, 7, 8, 9, 10, or
more of the modules
or variant calling algorithms shown in feature extraction module 145. Further,
in some
embodiments, feature extraction modules and/or algorithms not illustrated in
Figure 1A find
use in the methods and systems described herein.
Variant Identification
[0340] In some embodiments, variant analysis of aligned sequence
reads, e.g., in SAM or
BAM format, includes identification of single nucleotide variants (SNVs),
multiple
nucleotide variants (MNVs), indels (e.g., nucleotide additions and deletions),
and/or genomic
rearrangements (e.g., inversions, translocations, and gene fusions) using
variant identification
module 146, e.g., which includes a SNV/MNV calling algorithm (e.g., SNV/MNV
calling
algorithm 147), an indel calling algorithm (e.g., indel calling algorithm
148), and/or one or
more genomic rearrangement calling algorithms (e.g., genomic rearrangement
calling
algorithm 149). An overview of an example method for variant identification is
shown in
Figure 4E. Essentially, the module first identifies a difference between the
sequence of an
aligned sequence read 124 and the reference sequence to which the sequence
read is aligned
(e.g., an SNV/MNV, an indel, or a genomic rearrangement) and makes a record of
the
variant, e.g., in a variant call format (VCF) file. For instance, software
packages such as
freebayes and pindel are used to call variants using sorted BAM files and
reference BED files
as the input. For a review of variant calling packages see, for example,
Cameron, D.L. et al.,
Nat. Commun., 10(3240):1-11 (2019). A raw VCF (variant call format) file
is output,
showing the locations where the nucleotide base in the sample is not the same
as the
nucleotide base in that position in the reference sequence construct.
[0341] In some embodiments, as illustrated in Figure 4E, raw VCF
data is then
normalized, e.g., by parsimony and left alignment. For example, software
packages such as
vcfbreakmulti and vt are used to normalize multi-nucleotide polymorphic
variants in the raw
VCF file and a variant normalized VCF file is output. See, for example, E.
Garrison, "Vcflib:
A C++ library for parsing and manipulating VCF files," GitHub, available on the
internet at
github.com/ekg/vcflib (2012), the content of which is hereby incorporated by
reference, in its
entirety, for all purposes. In some embodiments, a normalization algorithm is
included
within the architecture of a broader variant identification software package.
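By way of non-limiting illustration, normalization by parsimony and left alignment can be sketched as follows. The procedure trims bases shared at the right end of the reference and alternate alleles, extends both alleles leftward whenever one becomes empty, and finally trims shared leading bases while both alleles keep at least one base. This is a generic sketch of the commonly described procedure, not the code of vt or vcfbreakmulti; the function name and the 1-based position convention are assumptions.

```python
def normalize_variant(pos, ref, alt, reference):
    """Left-align and trim a variant; pos is 1-based, reference is the contig sequence."""
    if ref == alt:
        raise ValueError("ref and alt are identical; not a variant")

    while True:
        changed = False
        # Trim a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the contig start")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
        if not changed:
            break

    # Trim shared leading bases, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

For instance, a deletion of one "CA" unit inside a "CACACACA" repeat can be written at several positions; after normalization every such representation collapses to the leftmost one.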
[0342] An algorithm is then used to annotate the variants in the
(e.g., normalized) VCF
file, e.g., to determine the source of the variation, e.g., whether the variant
is from the germline
of the subject (e.g., a germline variant), a cancerous tissue (e.g., a somatic
variant), a
sequencing error, or of an undeterminable source. In some embodiments, an
annotation
algorithm is included within the architecture of a broader variant
identification software
package. However, in some embodiments, an external annotation algorithm is
applied to
(e.g., normalized) VCF data obtained from a conventional variant
identification software
package. The choice to use a particular annotation algorithm is well within
the purview of
the skilled artisan, and in some embodiments is based upon the data being
annotated.
[0343] For example, in some embodiments, where both a liquid
biopsy sample and a
normal tissue sample of the patient are analyzed, variants identified in the
normal tissue
sample inform annotation of the variants in the liquid biopsy sample. In some
embodiments,
where a particular variant is identified in the normal tissue sample, that
variant is annotated as
a germline variant in the liquid biopsy sample. Similarly, in some
embodiments, where a
particular variant identified in the liquid biopsy sample is not identified in
the normal tissue
sample, the variant is annotated as a somatic variant when the variant
otherwise satisfies any
additional criteria placed on somatic variant calling, e.g., a threshold
variant allele fraction
(VAF) in the sample.
[0344] By contrast, in some embodiments, where only a liquid
biopsy sample is being
analyzed, the annotation algorithm relies on other characteristics of the
variant in order to
annotate the origin of the variant. For instance, in some embodiments, the
annotation
algorithm evaluates the VAF of the variant in the sample, e.g., alone or in
combination with
additional characteristics of the sample, e.g., tumor fraction. Accordingly,
in some
embodiments, where the VAF is within a first range encompassing a value that
corresponds
to a 1:1 distribution of variant and reference alleles in the sample, the
algorithm annotates the
variant as a germline variant, because it is presumably represented in cfDNA
originating from
both normal and cancer tissues. Similarly, in some embodiments, where the VAF
is below a
baseline variant threshold, the algorithm annotates the variant as
undeterminable, because
there is not sufficient evidence to distinguish between the possibility that
the variant arose as
a result of an amplification or sequencing error and the possibility that the
variant originated
from a cancerous tissue. Similarly, in some embodiments, where the VAF falls
between the
first range and the baseline variant threshold, the algorithm annotates the
variant as a somatic
variant.
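By way of non-limiting illustration, the VAF-based annotation rules described above for a liquid-biopsy-only analysis can be sketched as follows. The germline range of 0.4 to 0.6 around a 1:1 allele distribution and the 0.1% baseline are assumptions chosen for the example. The text does not address VAFs above the first range, so the sketch labels them "unclassified"; that label is likewise an assumption.

```python
def annotate_by_vaf(vaf, baseline=0.001, germline_range=(0.4, 0.6)):
    """Annotate a variant's likely origin from its variant allele fraction."""
    low, high = germline_range
    if vaf < baseline:
        # Cannot be distinguished from amplification or sequencing error.
        return "undeterminable"
    if low <= vaf <= high:
        # Near a 1:1 allele ratio: presumably present in normal and tumor cfDNA.
        return "germline"
    if vaf < low:
        # Between the baseline threshold and the germline range.
        return "somatic"
    return "unclassified"
```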
[0345] In some embodiments the baseline variant threshold is a
value from 0.01% VAF
to 0.5% VAF. In some embodiments, the baseline variant threshold is a value
from 0.05%
VAF to 0.35% VAF. In some embodiments, the baseline variant threshold is a
value from
0.1% VAF to 0.25% VAF. In some embodiments, the baseline variant threshold is
about
0.01% VAF, 0.015% VAF, 0.02% VAF, 0.025% VAF, 0.03% VAF, 0.035% VAF, 0.04%
VAF, 0.045% VAF, 0.05% VAF, 0.06% VAF, 0.07% VAF, 0.075% VAF, 0.08% VAF,
0.09% VAF, 0.1% VAF, 0.15% VAF, 0.2% VAF, 0.25% VAF, 0.3% VAF, 0.35% VAF,
0.4% VAF, 0.45% VAF, 0.5% VAF, or greater. In some embodiments, the baseline
variant
threshold is different for variants located in a first region, e.g., a region
identified as a
mutational hotspot and/or having high genomic complexity, than for variants
located in a
second region, e.g., a region that is not identified as a mutational hotspot
and/or having
average genomic complexity. For example, in some embodiments, the baseline
variant
threshold is a value from 0.01% to 0.25% for variants located in the first
region and is a value
from 0.1% to 0.5% for variants located in the second region.
[0346] In some embodiments, the first region is a region of
interest in the genome that
may have been manually selected based on criteria (for example, selection may
be based on a
known likelihood that a region is associated with variants) and the second
region is a region
that did not meet the selection criteria. In some embodiments, the baseline
variant threshold
is a value from 0.01% to 0.5% for variants located in the first region and is
a value from 1%
to 5% for variants located in the second region. In some embodiments, the
first region is a
region of interest in the genome that may have been manually selected based on
criteria (for
example, selection may be based on a known likelihood that a region is
associated with
variants) and the second region is a region selected based on a second set of
criteria.
[0347] In some embodiments, a baseline variant threshold is
influenced by the
sequencing depth of the reaction, e.g., a locus-specific sequencing depth
and/or an average
sequencing depth (e.g., across a targeted panel and/or complete reference
sequence
construct). In some embodiments, the baseline variant threshold is dependent
upon the type
of variant being detected. For example, in some embodiments, different
baseline variant
thresholds are set for SNPs/MNVs than for indels and/or genomic
rearrangements. For
instance, while an apparent SNP may be introduced by amplification and/or
sequencing
errors, it is much less likely that a genomic rearrangement is introduced this
way and, thus, a
lower baseline variant threshold may be appropriate for genomic rearrangements
than for
SNPs/MNVs.
[0348] In some embodiments, one or more additional criteria are
required to be satisfied
before a variant can be annotated as a somatic variant. For instance, in some
embodiments, a
threshold number of unique sequence reads encompassing the variant must be
present to
annotate the variant as somatic. In some embodiments, the threshold number of
unique
sequence reads is 2, 3, 4, 5, 7, 10, 12, 15, or greater. In some embodiments,
the threshold
number of unique sequence reads is only applied when certain conditions are
met, e.g., when
the variant allele is located in a region of a certain genomic complexity. In
some
embodiments, the certain genomic complexity is a low genomic complexity. In
some
embodiments, the certain genomic complexity is an average genomic complexity.
In some
embodiments, the certain genomic complexity is a high genomic complexity.
[0349] In some embodiments, a threshold sequencing coverage,
e.g., a locus-specific
and/or an average sequencing depth (e.g., across a targeted panel and/or
complete reference
sequence construct) must be satisfied to annotate the variant as somatic. In
some
embodiments, the threshold sequencing coverage is 50X, 100X, 150X, 200X, 250X,
300X,
350X, 400X or greater. In some embodiments, the variant is located in a
microsatellite
instable (MSI) region. In some embodiments, the variant is not located in a
microsatellite
instable (MSI) region. In some embodiments, the variant has sufficient signal-
to-noise ratio.
[0350] In some embodiments, bases contributing to the variant
satisfy a threshold
mapping quality to annotate the variant as somatic. In some embodiments,
alignments
contributing to the variant must satisfy a threshold alignment quality to
annotate the variant
as somatic. In some embodiments, a threshold value is determined for a variant
detected in a
somatic (cancer) sample by analyzing the threshold metric (for example, the
baseline variant
threshold is determined by analyzing VAF, or the threshold sequencing coverage
is
determined by analyzing coverage) associated with that variant in a group of
germline
(normal) samples that were each processed by the same sample processing and
sequencing
protocol as the somatic sample (process-matched). This may be used to ensure
the variants
are not caused by observed artifact generating processes.
[0351] In some embodiments, the threshold value is set above the
median base fraction of
the threshold metric value associated with the variant in more than a
specified percentage of
process-matched germline samples, e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more
standard
deviations above the median base fraction of the threshold metric value
associated with 25%,
30%, 40%, 50%, 60%, 70%, 75%, or more of the process-matched germline samples. For
example, in
one embodiment, the threshold value is set to a value 5 standard deviations
above the median
base fraction of the threshold metric value associated with the variant in
more than 50% of
the process-matched germline samples.
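By way of non-limiting illustration, one possible reading of this rule is sketched below: the threshold for a variant is the median of the metric (e.g., the base fraction at that locus) observed across process-matched germline samples, plus a chosen number of standard deviations. The function name, the use of the sample standard deviation, and this interpretation of the rule are assumptions for the sketch.

```python
from statistics import median, stdev


def noise_threshold(normal_values, n_sd=5):
    """Median of the metric across process-matched normals plus n_sd standard deviations."""
    return median(normal_values) + n_sd * stdev(normal_values)
```

A variant observed in the cancer sample would then be retained only if its metric value exceeds this threshold.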
[0352] In some embodiments, variants around homopolymer and
multimer regions
known to generate artifacts may be specifically filtered to avoid such
artifacts. For example,
in some embodiments, strand specific filtering is performed in the direction
of the read in
order to minimize stranded artifacts. Similarly, in some embodiments, variants
that do not
exceed the stranded minimum deviation for their specific locus within a known
artifact-
generating region may be filtered to avoid artifacts.
[0353] Variants may be filtered using dynamic methods, such as
through the application
of Bayes' Theorem through a likelihood ratio test. In some such embodiments,
the threshold
is dynamically calibrated to account for variants with low support (e.g., due
to low tumor
fraction, low circulating tumor fraction, and/or low sequencing depths). The
dynamic
threshold may be based on, for example, factors such as sample specific error
rate, the error
rate from a healthy reference pool (e.g., a pool of process matched healthy
control samples
for validation of variants detected in tumor samples), and information from
internal human
solid tumors (e.g., for validation of variants detected in liquid biopsy
samples). Accordingly,
in some embodiments, the dynamic filtering method employs a tri-nucleotide
context-based
Bayesian model. That is, in some embodiments, the threshold for filtering any
particular
putative variant is dynamically calibrated using a context-based Bayesian
model that
considers one or more of a sample-specific sequencing error rate, a process-
matched control
sequencing error rate, and/or a variant-specific frequency (e.g., determined
from similar
cancers). In this fashion, a minimum number of alternative alleles required to
positively
identify a true variant is determined for individual alleles and/or loci.
[0354] In some embodiments, the dynamic threshold is selected
from a Bayesian
probability model, where the selection is based on one or more error rates
and/or information
from one or more baseline variant distributions. For example, in some
embodiments, the
dynamic threshold is selected based on a variant detection specificity that is
calculated using
a distribution of variant detection sensitivities, where the distribution of
variant detection
sensitivities is a function of circulating variant allele fraction from a
plurality of baseline
and/or reference alleles (e.g., from a cohort of subjects). Filtration of
variants using a
dynamic threshold (e.g., to validate the presence of a somatic variant) is
performed by
comparing the number of unique sequence reads encompassing the variant (e.g.,
a variant
allele fragment count for the variant) against the dynamic threshold.
[0355] As described herein, in some embodiments, the methods
described herein (e.g.,
methods 400-2, 450, and 500-2 as illustrated in Figures 4 and 5) include one
or more data
collection steps, in addition to data analysis and downstream steps. For
example, as
described herein, e.g., with reference to Figures 2 and 3, in some
embodiments, the methods
include collection of a liquid biopsy sample and, optionally, one or more
matching biological
samples from the subject (e.g., a matched cancerous and/or matched non-
cancerous sample
from the subject). Likewise, as described herein, e.g., with reference to
Figures 2 and 3, in
some embodiments, the methods include extraction of cfDNA from the liquid
biopsy sample
and, optionally, one or more matching biological samples from the subject
(e.g., a matched
cancerous and/or matched non-cancerous sample from the subject). Similarly, as
described
herein, e.g., with reference to Figures 2 and 3, in some embodiments, the
methods include
nucleic acid sequencing of cfDNA from the liquid biopsy sample and,
optionally, one or
more matching biological samples from the subject (e.g., a matched cancerous
and/or
matched non-cancerous sample from the subject).
[0356] However, in other embodiments, the methods described
herein begin with
obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence
reads of cfDNA
from a liquid biopsy sample and, optionally, one or more matching biological
samples from
the subject (e.g., a matched cancerous and/or matched non-cancerous sample from
the
subject), from which the statistics needed for somatic variant identification
(e.g., variant
allele count 133-ac and/or variant allele fraction 133-af) can be determined.
For example, in
some embodiments, sequencing data 122 for a patient 121 is accessed and/or
downloaded
over network 105 by system 100.
[0357] Similarly, in some embodiments, the methods described
herein begin with
obtaining the genomic features needed for somatic variant identification (e.g.,
variant allele
count 133-ac and/or variant allele fraction 133-af) for a sequencing of a
liquid biopsy sample
and, optionally, one or more matching biological samples from the subject
(e.g., a matched
cancerous and/or matched non-cancerous sample from the subject). For example,
in some
embodiments, variant allele counts 133-cf-ac and/or variant allele fractions
133-cf-af for
sequencing data 122 of patient 121 are accessed and/or downloaded over network
105 by
system 100.
[0358] One goal of the liquid biopsy assays described herein is
to detect variant
alterations at low circulating fractions, which requires that low levels of
support be sufficient
to call a variant. Therefore, uniform filtering thresholds that do not take into
account variant context and local sequence-specific error cannot be used.
[0359] In some embodiments, a dynamic variant filtering method is
applied which uses
an application of Bayes' Theorem through the likelihood ratio test. The
dynamic threshold is
based on sample specific error rate, the error rate from a healthy reference
pool, and from
internal human solid tumors. The basic application of the likelihood ratio
test is as follows:
post-test-odds = pre-test-odds * sensitivity / (1 - specificity)
[0360] Given a fixed value for post-test-odds, the specificity
can be solved for. The
specificity represents the minimum acceptable quantile of an error
distribution (e.g., a
BetaBinomial, Beta, or Poisson error distribution). The above equation can be
rearranged to
the one below:
specificity = 1 - pre-test-odds * sensitivity / post-test-odds
[0361] Specificity can then be plugged into the quantile error
(e.g., BetaBinomial, Beta,
or Poisson) function to derive the minimum number of alternative alleles that
can be observed
at a given depth to validate a candidate somatic variant.
[0362] In some embodiments, the post-test odds are post-test
probability / (1 - post-test
probability). The post-test probability is the probability of having a
positive variant given
Bayes' Theorem. The post-test-odds value is pre-defined.
[0363] In some embodiments, the pre-test odds are pre-test
probability / (1 - pre-test
probability). The pre-test probability is the probability of having a positive
variant given the
patient's cancer-type and the prevalence of variant alterations within a
genomic region
encompassing a candidate somatic sequence variant in a reference population
having the
same cancer type.
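The relationship between the odds and the specificity described above can be sketched as follows. This is a minimal illustration: the function names and the numeric inputs are hypothetical, and the only formula used is the likelihood ratio relation stated in the text (post-test-odds = pre-test-odds * sensitivity / (1 - specificity)).

```python
def odds(probability):
    """Convert a probability in [0, 1) to odds."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)

def required_specificity(pre_test_prob, post_test_prob, sensitivity):
    """Solve the likelihood ratio relation for specificity."""
    pre = odds(pre_test_prob)
    post = odds(post_test_prob)
    return 1.0 - pre * sensitivity / post

# Hypothetical inputs: 2% prevalence, 99% desired post-test probability,
# 90% assay sensitivity at the relevant variant allele fraction.
spec = required_specificity(pre_test_prob=0.02, post_test_prob=0.99, sensitivity=0.9)
```

A lower pre-test probability leaves the required specificity closer to 1, which in turn demands more supporting reads before a variant is accepted.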
[0364] In some embodiments, a pre-test-odds multiplier is applied
to the pre-test odds for
a resistance mutation that would develop and/or become more prominent within a
heterogeneous population of cancer cells in response to therapeutic treatment.
The multiplier
is applied to specific genomic regions (e.g., exon windows) containing the
resistance
mutation position. In some embodiments, the multiplier is only applied in
specified cancer
contexts. For example, in some embodiments, a multiplier is applied to a pre-
test odds for a
genomic region containing a mutation that is resistant to at least one cancer
therapy used to
treat the type of cancer the subject has. For example, if a given mutation is
known to have
resistance to a therapy used to treat breast cancer, but not to any of the
therapies used to treat
brain cancer, a multiplier will be applied to the pre-test odds for the
genomic region
encompassing the mutation if the subject has breast cancer, but not if the
subject has brain
cancer.
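One way to picture the context-limited multiplier is a lookup keyed on both cancer type and genomic region, as in the sketch below. The table contents, region label, and multiplier value are invented placeholders, not values from the disclosure.

```python
# Hypothetical table: (cancer type, genomic region) -> pre-test-odds multiplier.
RESISTANCE_MULTIPLIERS = {
    ("breast cancer", "GENE_X_exon_window_3"): 50.0,
}

def adjusted_pre_test_odds(pre_test_odds, cancer_type, region):
    """Apply a multiplier only when the region is flagged for this cancer type."""
    multiplier = RESISTANCE_MULTIPLIERS.get((cancer_type, region), 1.0)
    return pre_test_odds * multiplier

breast = adjusted_pre_test_odds(0.002, "breast cancer", "GENE_X_exon_window_3")
brain = adjusted_pre_test_odds(0.002, "brain cancer", "GENE_X_exon_window_3")
```

Here the same region receives a multiplier for the breast cancer subject and none for the brain cancer subject, mirroring the example in the text.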
[0365] In some embodiments, sensitivity is the fraction of
variants detected by the liquid
biopsy assay at a given variant allele fraction (e.g., 0.1%, 0.25%, 0.5%,
etc.).
[0366] Calculating the pre-test probability. In some embodiments,
the pre-test
probability is calculated using historical data for a set of reference
subjects having the same
type of cancer, e.g., from sequencing of solid tumor samples. In this fashion,
it is possible to
accurately assess the prevalence of specific variants within the population of
advanced human
tumors. In some embodiments, the set of reference subjects is at least 10
reference subjects.
In some embodiments, the set of reference subjects is at least 50 reference
subjects. In some
embodiments, the set of reference subjects is at least 100 reference subjects.
In some
embodiments, the set of reference subjects is at least 500 reference subjects.
In some
embodiments, the set of reference subjects is at least 1000 reference
subjects. In some
embodiments, the set of reference subjects is at least 5000 reference
subjects. In some
embodiments, the set of reference subjects is at least 10000 reference
subjects.
[0367] In some embodiments, variant prevalence is calculated by
indexing genomic
regions (e.g., exons) in the reference sample and counting the number of
variants in each
genomic region (e.g., exon) for each cancer-type. The number of patients who
have at least
one variant in the genomic region (e.g., the exon) / the number of patients
equals the variant
prevalence. The pre-test-odds are calculated from the prevalence by pre-test-
odds =
prevalence / (1 - prevalence).
[0368] In some embodiments, for a cancer where the number of
patients in the reference
is too low to calculate prevalence, a default pan-cancer cancer-type is used.
Where no
prevalence can be calculated, the mean variant prevalence across cancer-types
is used.
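The prevalence and pre-test-odds calculation can be sketched as below, assuming each reference patient is represented by a cancer type and the set of genomic regions in which at least one variant was found. The record layout, the `min_patients` cutoff, and the example data are assumptions; only the first (pan-cancer) fallback is shown, and the mean-across-cancer-types fallback is omitted.

```python
def region_prevalence(patients, cancer_type, region, min_patients=10):
    """Fraction of cohort patients with at least one variant in the region."""
    cohort = [p for p in patients if p["cancer_type"] == cancer_type]
    if len(cohort) < min_patients:
        cohort = list(patients)  # fall back to a pan-cancer cohort
    if not cohort:
        return None  # no prevalence can be calculated
    carriers = sum(1 for p in cohort if region in p["variant_regions"])
    return carriers / len(cohort)

def pre_test_odds(prevalence):
    """pre-test-odds = prevalence / (1 - prevalence)."""
    if not 0.0 <= prevalence < 1.0:
        raise ValueError("prevalence must be in [0, 1)")
    return prevalence / (1.0 - prevalence)

# Hypothetical reference data.
patients = [
    {"cancer_type": "breast", "variant_regions": {"GENE_X_exon_3"}},
    {"cancer_type": "breast", "variant_regions": set()},
    {"cancer_type": "breast", "variant_regions": set()},
    {"cancer_type": "breast", "variant_regions": {"GENE_Y_exon_1"}},
    {"cancer_type": "lung", "variant_regions": {"GENE_X_exon_3"}},
]
prevalence = region_prevalence(patients, "breast", "GENE_X_exon_3", min_patients=4)
```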
[0369] In some embodiments, pre-test-odds are not calculated each
time an input sample
is run. Rather, in some embodiments, they are read from a pre-existing file,
which is evaluated and regenerated if deemed necessary.
[0370] Calculating the pre-test-odds multiplier. Resistance
mutations have historically
low prevalence and variant allele fraction and may incorrectly be filtered by
the dynamic
variant filtering method due to low pre-test-odds. The resistance mutations
develop in
response to therapeutic treatment, and detecting resistance mutations early
provides insights
into the current treatment strategy. Low variant allele frequency, low
prevalence resistance
mutations in historic solid tumor samples have been identified. The high
sensitivity of the
liquid biopsy assay described herein permits the early detection of these
resistance mutations
in circulating DNA. Examples of such resistance mutations include PIK3CA
p.E545K in
breast cancer, EGFR p.T790M in non-small cell lung cancer, and AR p.H875Y for
prostate
cancer.
[0371] In some embodiments, to estimate the pre-test-odds-
multiplier required to pass
resistance mutations down to low variant allele fractions (e.g., 0.1% or 0.25%
VAF), the
average depth for each variant position is utilized from the reference pool
(e.g., the reference
pool used to determine the pre-test odds) depth, at a high minimum average
depth (e.g., of
2500X). For each resistance mutation, the number of alternate alleles required
to achieve a
0.1% or 0.25% VAF were calculated. The total alternate alleles and depth for
each resistance
mutation was input to the Dynamic Variant Filtering method, and multipliers
were applied
until those resistance mutations passed the filtering strategy.
[0372] In some embodiments, the minimum multiplier required to
pass resistance
mutations is determined when the input sample alternate allele count is
greater than the
background alternate allele count (as outlined in Calculating Testing Sample
Alt Allele Count
and Calculating Background Alt Allele Count below). In some embodiments, the
multiplier is
selected based on the multiplier required to pass the variant at a low variant
allele fraction
(e.g., 0.1% VAF or 0.25% VAF). In some embodiments, a maximum value for the
multiplier
is applied, in order to prevent excessive artifacts from passing the filter.
Large multipliers
may permit false positive variants to pass the Dynamic Variant Filtering
method, however,
large multipliers are necessary to pass resistance mutations that have
historically low
prevalence. In some embodiments, the maximum multiplier is between 750 and
1500. In
some embodiments, the maximum multiplier is between 900 and 1100. In some
embodiments, the maximum multiplier is between 1000 and 1050.
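The multiplier search described above can be sketched as a simple scan that stops at a cap. The helper names, the integer step, and the `passes_filter` callback (which stands in for a run of the dynamic variant filter at a given multiplier) are assumptions; the default cap of 1000 is merely one value inside the ranges given in the text.

```python
import math

def alt_alleles_for_vaf(depth, vaf):
    """Smallest whole number of alternate alleles giving at least `vaf` at `depth`."""
    return math.ceil(depth * vaf)

def minimum_multiplier(passes_filter, max_multiplier=1000, step=1):
    """Return the smallest multiplier in [1, max_multiplier] that passes, else None."""
    multiplier = 1
    while multiplier <= max_multiplier:
        if passes_filter(multiplier):
            return multiplier
        multiplier += step
    return None

# At 2500X depth, 0.1% VAF corresponds to 3 alternate alleles (2.5 rounded up).
needed = alt_alleles_for_vaf(2500, 0.001)
# Stand-in filter: pretend the variant passes once the multiplier reaches 40.
found = minimum_multiplier(lambda m: m >= 40)
```

Returning `None` when the cap is reached reflects the stated purpose of the maximum, which is to keep excessive artifacts from passing the filter.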
[0373] In some embodiments, the usage of the pre-test-odds-
multiplier is limited by
cancer-type context and genomic region (e.g., exon-window). In some
embodiments,
therefore, the multipliers will not be applied to all genomic regions (e.g.,
exon-windows)
given a specified cancer-type, nor all cancer-types given a specific genomic
region (e.g.,
exon-window).
[0374] Calculating testing sample variant allele count. In some
embodiments, the
filtering method (the statistical method used for the Dynamic Variant
Filtering method) is
selected from a beta-binomial distribution model, a beta distribution model,
and a Poisson
distribution model. In some embodiments, the model is a beta-binomial model.
In some
embodiments, when applying a quantile beta-binomial distribution, the sum of
the input
sample alternate reads is divided by the input sample sequencing depth at each
variant
position, and then multiplied by the reference pool depth (the sequencing
depth at genomic
positions for a pool of reference, e.g., healthy normal, controls).
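The normalization just described amounts to rescaling the sample's alternate-read fraction to the reference pool depth. A minimal sketch, with a hypothetical function name and made-up counts:

```python
def normalized_alt_count(alt_reads, sample_depth, reference_pool_depth):
    """Scale the sample's alternate-read fraction to the reference pool depth."""
    if sample_depth <= 0:
        raise ValueError("sample_depth must be positive")
    return sum(alt_reads) / sample_depth * reference_pool_depth

# Hypothetical position: 5 alternate reads at 2000X, reference pool at 4000X.
scaled = normalized_alt_count([3, 2], sample_depth=2000, reference_pool_depth=4000)
```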
[0375] Calculating background variant allele count. In some
embodiments, the
background variant allele count calculation takes into account the background
error from a
pool of reference (e.g., healthy normal subjects), the input sample error, and
the prevalence of
historical variants in the reference cancer subjects. The quantile beta-binomial model
considers (i) the reference pool depth (the sequencing depth at genomic positions for a pool of
reference, e.g., healthy normal, controls), (ii) the background posterior error average from the
input sample, and (iii) an alpha calculated from the pre-test-odds, sensitivity, and the
post-test-odds (e.g., where alpha is equal to 1 - specificity = pre-test-odds * sensitivity /
post-test-odds). The pre-test-odds calculated for a specific genomic region (e.g., exon window)
and cancer-type will yield a unique alpha for each variant, given that the variants do not fall
in the same genomic region (e.g., exon window).
[0376] In some embodiments, the background posterior error
incorporates a trinucleotide
error average (e.g., a reaction-specific sequencing error rate), the reference
pool error (e.g., a
locus-specific, process-matched sequencing error rate; e.g., a sum of
alternate reads for each
position / depth from a pool of healthy normal controls), and a shrinkage
weight parameter.
In some embodiments, the trinucleotide error average is an aggregate of the
input sample
background average, where the input sample background average equals the error
counts for
each position divided by the position-specific sequencing depth. In some
embodiments, the
sample background average is then aggregated for each trinucleotide context.
The
trinucleotide average is used to calculate the shrinkage weight parameter. In
some
embodiments, the shrinkage weight parameter equals the trinucleotide error
average divided
by the sum of the trinucleotide error average and the reference pool error. In
instances when
the shrinkage weight parameter is undefined, it is changed to 1. In some
embodiments, the
final calculation of the background posterior error is calculated as:
background posterior error = shrinkage weight parameter * trinucleotide error
average + (1 -
shrinkage weight parameter) * healthy subject error.
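The shrinkage calculation above can be written out as a short sketch. It assumes that the "healthy subject error" in the formula is the same quantity as the reference pool error; the function names and example rates are hypothetical.

```python
def shrinkage_weight(trinucleotide_error, reference_pool_error):
    """trinucleotide error / (trinucleotide error + reference pool error)."""
    denominator = trinucleotide_error + reference_pool_error
    if denominator == 0:
        return 1.0  # an undefined weight is set to 1, as stated in the text
    return trinucleotide_error / denominator

def background_posterior_error(trinucleotide_error, reference_pool_error):
    """Weighted blend of the sample-level and reference-pool error rates."""
    weight = shrinkage_weight(trinucleotide_error, reference_pool_error)
    return weight * trinucleotide_error + (1.0 - weight) * reference_pool_error

# Hypothetical error rates for one position and its trinucleotide context.
posterior = background_posterior_error(0.002, 0.001)
```

With these inputs the weight is 2/3, so the posterior error sits closer to the trinucleotide error average than to the reference pool error.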
[0377] In some embodiments, a reference pool error can be used in
place of an input
sample background average, for calculating the background posterior average
error rate.
[0378] In some embodiments, the alpha for the beta-binomial
distribution is calculated
using the pre-test-odds, sensitivity, and post-test-odds, where:
alpha = 1 - specificity = pre-test-odds * sensitivity / post-test-odds
[0379] Accordingly, in some embodiments, the background posterior
average, the
reference pool depth, and the alpha are used in calculating the input to the
quantile beta-
binomial function. The alpha is used in calculating the mean value of the beta-
binomial
distribution, which equals 1 - alpha / 2. The size of the quantile beta-
binomial is the matrix of
the reference pool depth. The shape 1 parameter for the quantile beta-binomial
function is the
reference pool depth multiplied by the background posterior average error
rate, and the shape
2 parameter of the quantile beta-binomial function is the shape 1 parameter
subtracted from
reference pool depth.
[0380] The output from the quantile BetaBinomial function is the
minimum value a
variant needs to be called. Any variant that has a normalized allele count
below the
quantile(BetaBinomial) output will be filtered due to the high background
error observed at
that position.
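The quantile beta-binomial step can be sketched with the standard library alone, as below. This is an illustrative reading of the parameters described above (probability 1 - alpha / 2, size equal to the reference pool depth, shape 1 equal to depth times posterior error, shape 2 equal to depth minus shape 1), not a reproduction of any particular implementation. All function names are hypothetical, the depth is treated as a single integer rather than a matrix, and the posterior error must be strictly between 0 and 1.

```python
from math import exp, lgamma

def _log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability of k successes in n trials."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b))

def betabinom_quantile(q, n, a, b):
    """Smallest k in [0, n] whose cumulative probability is at least q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += betabinom_pmf(k, n, a, b)
        if cumulative >= q:
            return k
    return n

def min_alt_count(alpha, pool_depth, posterior_error):
    """Minimum normalized alternate allele count needed to call a variant."""
    shape1 = pool_depth * posterior_error
    shape2 = pool_depth - shape1
    return betabinom_quantile(1.0 - alpha / 2.0, pool_depth, shape1, shape2)

def passes_dynamic_filter(normalized_alt_count, alpha, pool_depth, posterior_error):
    """A variant is filtered when its count falls below the quantile output."""
    return normalized_alt_count >= min_alt_count(alpha, pool_depth, posterior_error)
```

A smaller alpha (that is, a higher required specificity) pushes the quantile further into the tail of the error distribution, so more supporting alleles are needed at that position.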
[0381] For example, Figure 4F2 illustrates a flow chart of a
method 400-2 for validating a
somatic sequence variant in a test subject having a cancer condition, in
accordance with some
embodiments of the present disclosure.
[0382] In some embodiments, the method includes obtaining (402-2)
cell-free DNA
sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a
test subject
121 (e.g., sequence reads 123-1-1-1 ... 123-1-1-K for sequence run 122-1-1 for
a liquid
biopsy sample from patient 121-1, as illustrated in Figure 1B). As described
herein, in some
embodiments, the obtaining includes a step of sequencing cell-free nucleic
acids from a liquid
biopsy sample. Example methods for sequencing cell-free nucleic acids are
described herein.
[0383] Sequence reads 123 from the sequencing data 122 are then
aligned (404-2) to a
human reference sequence (e.g., a human genome or a portion of a human genome,
e.g., 1%,
5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the
human genome, or to a map of a human reference genome or a set of human
reference
genomes, or a portion thereof), thereby generating a plurality of aligned
reads 124.
Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads
124 are pre-
processed (408-2) using any of the methods disclosed above (e.g.,
normalization, bias
correction, etc.). In some embodiments, as described herein, device 100
obtains previously
aligned sequence reads.
[0384] The aligned sequence reads 124 are then evaluated to
identify mismatches with
the reference construct (e.g., reference genome or set of reference genomes),
thereby
identifying one or more candidate somatic sequence variants 132-c at
respective genomic
loci. The number of aligned sequence reads containing the sequence variant at
the locus is
determined, thereby defining a variant allele fragment count 132-c-ac (e.g.,
variant allele
fragment count 132-c-1-ac as illustrated in Figure 1C2). In some embodiments,
the number
of aligned sequence reads containing the locus of the candidate variant allele
(regardless of
the identity of the allele represented in the sequence read) is also
determined, thereby
defining a variant allele locus count 132-c-lc (e.g., variant allele locus
count 132-c-1-lc as
illustrated in Figure 1C2). Accordingly, in some embodiments, the variant
allele fragment
count 132-c-ac can be compared to the variant allele locus count 132-c-lc to
determine a
variant allele fraction 132-c-vf (e.g., variant allele fraction 132-c-1-vf as
illustrated in Figure
1C2) for the candidate variant allele. This represents a measure of the
portion of sequence
reads encompassing the nucleotide(s) that is altered in the candidate variant
allele that include
the candidate variant. In some embodiments, as described below, this measure
can be used to
define a sensitivity for the detection of the candidate variant based on a
distribution of
detection sensitivities corresponding to detection of a variant within a
genomic region
encompassing the locus in reference samples with defined variant allele
fractions.
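The ratio of the variant allele fragment count to the variant allele locus count can be sketched as follows. The function name and the toy pileup are illustrative assumptions, not part of the disclosure.

```python
def variant_allele_fraction(fragment_count, locus_count):
    """Ratio of variant-supporting reads to all reads covering the locus."""
    if locus_count <= 0:
        raise ValueError("locus_count must be positive")
    if not 0 <= fragment_count <= locus_count:
        raise ValueError("fragment_count must be between 0 and locus_count")
    return fragment_count / locus_count

# Hypothetical pileup: the base observed at the locus in each aligned read.
observed_bases = ["C"] * 1995 + ["T"] * 5
vaf = variant_allele_fraction(observed_bases.count("T"), len(observed_bases))
```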
[0385] Method 400-2 then includes obtaining (412-2) a dynamic
variant count threshold
191 for the candidate variant allele. As described herein, in some
embodiments, the dynamic
variant count threshold is based upon a prevalence of sequence variations in a
genomic region
encompassing the locus of the candidate variant allele in cancer patients
sharing one or more
similarities with the test subject. For example, in some embodiments, this
prevalence defines
a pre-test odds that the test subject has a sequence variant within the
genomic region
encompassing the locus at which the candidate sequence variant is located. In
some
embodiments, this pre-test odds is used in an application of Bayes theorem to
derive a
minimal amount of support required of the sequencing reaction to validate the
presence of the
candidate sequence variant in a cancerous tissue of the subject at a desired
confidence level.
Information about Bayes theorem and Bayesian inference can be found, for
instance, in
Section 8.7 of Stuart, A. and Ord, K. (1994), Kendall's Advanced Theory of
Statistics:
Volume I: Distribution Theory, Edward Arnold; and Gelman, A. et al., (2013),
Bayesian
Data Analysis, Third Edition, Chapman and Hall/CRC, ISBN 978-1-4398-4095-5,
the
disclosure of both of which are incorporated herein by reference for their
teachings of how to
implement Bayes theorem and Bayesian inference.
[0386] In some embodiments, the prevalence of sequence variants
in the genomic region
encompassing the locus of the candidate variant allele is determined from a
population of
reference cancer subjects having the same type of cancer. In some embodiments,
the
population of reference cancer subjects is further defined by a matching
personal
characteristic, e.g., an age, gender, race, smoking status, or any other
personal characteristic.
In some embodiments, the population of reference subjects is further defined
by a plurality of
matching personal characteristics, e.g., at least 2, 3, 4, 5, 6, 7, 8, 9, 10,
or more personal
characteristics, in addition to cancer type.
[0387] For instance, in some embodiments, the prevalence of
sequence variants is
determined from variant prevalence training data 192, as illustrated in Figure
1F. The variant
prevalence training data 192 includes data on the variants found in a
cancerous tissue from a
plurality of reference subjects 193. For example, training data 192 for
reference subject 1
193-1 includes a cancer type 194-1 and a list of somatic sequence variants 195-
1, including
individual variants 196-1-1 ... 196-1-S. To determine a prevalence for a
particular candidate
sequence variant detected for a test subject, a genomic region encompassing
the locus of the
candidate sequence variant is defined (e.g., the exon of a gene in which a
candidate sequence
variant is detected). Then, it is determined what portion of reference
subjects 193, that have
the same cancer as the test subject, have a sequence variant located within
the defined
genomic region (e.g., the exon of the gene).
[0388] In some embodiments, e.g., when only a limited set of
defined candidate variants
will be validated, sequence variant prevalence is predetermined and stored in
a database, e.g.,
in non-persistent memory 111, or in an addressable remote server, as a look-up
table. In
other embodiments, system 100 determines a sequence variant prevalence for a
genomic
region and matching patient profile upon identification of a candidate
sequence variant, e.g.,
by filtering variant prevalence training data 192 for the relevant genomic
region and
matching reference subjects.
[0389] Generally, the genomic region encompassing the candidate
sequence variant is
larger than a single nucleotide. For example, in some embodiments, the genomic
region
includes at least 10 nucleotides, at least 50 nucleotides, at least 100
nucleotides, at least 250
nucleotides, at least 500 nucleotides, at least 1000 nucleotides, at least
2500 nucleotides, or
more nucleotides. In some embodiments, the genomic region is no larger than
10,000
nucleotides, no larger than 7500 nucleotides, no larger than 5000
nucleotides, no larger than
2500 nucleotides, or fewer nucleotides. In some embodiments, the genomic
region is from
nucleotides to 10,000 nucleotides. In some embodiments, the genomic region is
from 25
nucleotides to 5000 nucleotides. In some embodiments, the genomic region is
from 50
nucleotides to 2500 nucleotides.
[0390] In some embodiments, when the candidate sequence variant
falls within a protein
coding sequence, the genomic region is defined as the exon in which the
candidate sequence
variant is located. In some embodiments, the genomic region is defined as
several adjacent
exons, including the exon in which the candidate sequence variant is located.
In some
embodiments, when the candidate sequence variant falls within a protein coding
sequence,
the genomic region is defined as all exons of the gene in which the candidate
sequence
variant is located. In some embodiments, when the candidate sequence variant
falls within a
protein coding sequence, the genomic region is defined as the entire gene in
which the
candidate sequence variant is located. Similarly, in some embodiments, when
the candidate
sequence variant falls within an intronic sequence of a gene, the genomic
region is defined as
the entire intron in which the candidate sequence variant is located, or
several adjacent
introns including the intron in which the candidate sequence variant is
located.
[0391] In some embodiments, the genomic region encompassing the
candidate sequence
variant is a fixed window encompassing, e.g., surrounding, the candidate
sequence variant.
For example, in some embodiments, when the candidate sequence variant falls
within a non-
coding portion of the genome, the genomic region is defined as a fixed window
surrounding
the candidate sequence variant. However, in some embodiments, when the
sequence variant
falls within a non-coding genetic element, e.g., a promoter, enhancer, etc.,
the genomic
region is defined as the entirety of the genetic element.
[0392] In some implementations, the genomic region encompassing
the candidate
sequence variant is dependent upon the sequence context of the locus. For
example, when
the candidate sequence variant falls within a coding sequence, the exon or
several adjacent
exons defines the genomic region, but when the candidate sequence variant
falls within a
non-coding sequence, the genomic region is defined by a fixed window
encompassing the
candidate sequence variant.
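The context-dependent choice of genomic region can be sketched as below: the containing exon is used when the variant falls in a coding exon, and a fixed window is used otherwise. The `Interval` type, the 0-based half-open coordinates, the default window size, and the example coordinates are all assumptions made for illustration.

```python
from typing import NamedTuple, Sequence

class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive

def region_for_variant(chrom: str, pos: int, exons: Sequence[Interval],
                       window: int = 100) -> Interval:
    """Return the containing exon if any, else a fixed window around pos."""
    for exon in exons:
        if exon.chrom == chrom and exon.start <= pos < exon.end:
            return exon
    half = window // 2
    return Interval(chrom, max(0, pos - half), pos + half + 1)

# Hypothetical annotation with a single exon.
exons = [Interval("chr7", 1000, 1200)]
coding_region = region_for_variant("chr7", 1100, exons)
noncoding_region = region_for_variant("chr7", 5000, exons, window=100)
```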
[0393] In some embodiments, the genomic region encompassing the
candidate sequence
variant is dependent upon a known or inferred effect of the sequence variant.
For instance, as
described in more detail below, in some embodiments, when the candidate
sequence variant
causes, or is inferred to cause, a partial or complete loss of function
mutation in a gene, the
genomic region is defined by all exons of the gene in which the candidate
sequence variant is
located. Similarly, as described in more detail below, in some embodiments,
when the
candidate sequence variant causes, or is inferred to cause, a gain of function
mutation in a
gene having one or more hotspots for gain of function mutations, the genomic
region is
defined as those exons of the gene encompassing the one or more hotspots.
[0394] In some embodiments, when the candidate sequence variant
falls within a
genomic region associated with a known therapeutic resistance gene for the
cancer of the
subject, the pre-test odds determined based on the historical prevalence data
is multiplied by
a pre-test-odds multiplier (e.g., as described above).
[0395] In some embodiments, the Bayesian analysis is further
informed by defining the
specificity of variant detection based on an apparent variant allele fraction
in the sample. For
example, in some embodiments, the variant allele fraction for the candidate
sequence variant
is determined by a comparison of the variant allele fragment count 132-c-ac to
the variant
allele locus count 132-c-lc (e.g., a ratio of the variant allele fragment
count to the variant
allele locus count), thereby determining a variant allele fraction 132-c-vf.
In some
embodiments, the variant allele fraction is then compared to a distribution of
variant detection
specificities established based on a set of training samples (e.g.,
sensitivity distribution
training data) with known variant allele fractions. For example, in some
embodiments,
nucleic acids from each of a plurality of training samples 181 having a known
variant allele
fraction 184 for one or more variant alleles 183 are sequenced according to a process-
matched sequencing reaction (e.g., using a substantially identical or
matched sequencing reaction (e.g., using a substantially identical or
identical sequencing
reaction), and it is determined whether each sequence variant can be detected,
e.g., defining a
detection status 185 for each locus/variant 183. Over a large number of
training samples, a
specificity of detection of variants having different variant allele fractions
can be determined.
In some embodiments, the specificity is determined on a locus-by-locus basis,
such that the
specificity of detection is specific for the genomic region or locus
encompassing the
candidate sequence variant. In some embodiments, the specificity is determined
globally,
e.g., not on a locus-by-locus basis.
[0396] A correlation can then be established between the measured
detection specificity
and the variant allele fraction (e.g., variant detection sensitivity
distribution 186). In some
embodiments, the correlation is a linear or non-linear fit between measured
detection
specificities and variant allele fractions. In other embodiments, the
correlation is determined
by binning specificities (e.g., in bins 187) as a function of ranges of
variant allele fractions
188, and determining a measure of central tendency (e.g., a mean) for the
specificities 189 in
the bin. The variant allele fraction 132-c-vf determined for the candidate
sequence variant is
then compared to the established correlation (e.g., variant detection
sensitivity distribution
186) to define the specificity of detection for the candidate sequence
variant.
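The binning approach can be sketched as follows: detection outcomes from training samples are grouped by ranges of known variant allele fraction, a mean detection rate is computed per bin, and the candidate's variant allele fraction is then looked up against those bins. The function names, bin edges, and training records are hypothetical.

```python
from bisect import bisect_right

def detection_rate_by_bin(records, bin_edges):
    """records: iterable of (known_vaf, detected). Returns the mean rate per bin."""
    n_bins = len(bin_edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for vaf, detected in records:
        i = bisect_right(bin_edges, vaf) - 1
        if 0 <= i < n_bins:
            totals[i] += 1
            hits[i] += int(detected)
    # Bins without training samples have no defined rate.
    return [h / t if t else None for h, t in zip(hits, totals)]

def lookup_rate(vaf, bin_edges, rates):
    """Return the rate of the bin containing vaf, or None if out of range."""
    i = bisect_right(bin_edges, vaf) - 1
    return rates[i] if 0 <= i < len(rates) else None

# Hypothetical training data: (known VAF, whether the variant was detected).
bin_edges = [0.0, 0.001, 0.0025, 0.005, 1.0]
records = [(0.0005, False), (0.0005, True), (0.003, True), (0.003, True)]
rates = detection_rate_by_bin(records, bin_edges)
candidate_rate = lookup_rate(0.004, bin_edges, rates)
```

A fitted linear or non-linear curve, as also mentioned in the text, would replace the per-bin lookup with a continuous function of variant allele fraction.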
[0397] In some embodiments, the Bayesian analysis is further informed by
accounting for
the sequencing error rate for the variant allele and, accordingly, the
probability that the
candidate sequence variant is a product of a sequencing error, rather than a
genomic variant.
In some embodiments, a reaction-specific error rate (e.g., a trinucleotide
sequencing error
rate) is determined for the sequencing reaction (e.g., using an internal
control spiked into the
reaction). In some embodiments, a locus-specific error rate is determined from
historical
sequencing errors at the genomic region, or specific locus, encompassing the
candidate
sequence variant. In some embodiments, both a reaction-specific sequencing
error rate and a
locus-specific error rate are used to define a variant count distribution
(e.g., variant count
distribution 190), representing the number of variant allele counts (e.g.,
variant allele
fragment count 132-c-ac) necessary to validate the presence of the candidate
variant sequence
in the cancer of the subject at a defined detection sensitivity. In some
embodiments, a beta
binomial distribution is established based on the reaction-specific sequencing
error rate and
the locus-specific error rate.
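By way of illustration only, one way a beta-binomial error model could yield a minimum variant count is sketched below. The text does not specify how the reaction-specific and locus-specific error rates are combined or how the distribution is parameterized; the simple averaging of the two rates, the `concentration` parameter, and the `max_false_positive` cutoff are all assumptions made for this sketch.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, a, b):
    """P(X = k) for a beta-binomial with n trials and shape parameters a, b."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

def min_variant_count(depth, reaction_error_rate, locus_error_rate,
                      concentration=1000.0, max_false_positive=1e-3):
    """Smallest count k such that P(X >= k) under the error model is at most
    max_false_positive. Returns depth + 1 if no count qualifies."""
    # Assumption: blend the two error rates by a plain average.
    error_rate = (reaction_error_rate + locus_error_rate) / 2.0
    # Assumption: mean/concentration parameterization of the beta prior.
    a = error_rate * concentration
    b = (1.0 - error_rate) * concentration
    tail = 1.0  # P(X >= 0)
    for k in range(depth + 1):
        if tail <= max_false_positive:
            return k
        tail -= beta_binomial_pmf(k, depth, a, b)  # now P(X >= k + 1)
    return depth + 1
```

Under these assumptions, a higher blended error rate or a deeper locus raises the count required before a candidate is distinguished from sequencing error, which is the behavior a locus-specific dynamic threshold is intended to capture.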
[0398]
Method 400-2 then includes applying (414-2) the dynamic variant count
threshold
(e.g., locus-specific dynamic variant count threshold 191) to the sequencing
data, e.g., by
determining whether the variant allele fragment count 132-c-ac for the
candidate sequence
variant satisfies the threshold, and validating the candidate sequence variant
(e.g., creating a
record 132-v of the validation) when the threshold is satisfied or rejecting
the candidate
sequence variant when the threshold is not satisfied. In some embodiments, one
or more
additional filters, relating to global sequencing metrics and/or locus-
specific sequencing
metrics (e.g., one or more of variant locus coverage filter(s) 463, variant
allele fraction
filter(s) 465, variant support mapping filter(s) 467, variant support
sequencing quality filter(s)
469, and low complexity region filter(s) 471, as illustrated in Figure 1D2)
must be satisfied
before validating a candidate sequence variant.
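By way of illustration only, applying the threshold together with optional additional filters might look like the following sketch. The `Candidate` fields, the interpretation of "satisfies the threshold" as a count greater than or equal to the threshold, and the representation of filters as callables are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    variant_count: int  # fragments supporting the variant allele
    locus_depth: int    # total fragments covering the locus

def validate_candidate(candidate, count_threshold, extra_filters=()):
    """Return True only if the dynamic count threshold and every additional
    filter are satisfied; otherwise the candidate is rejected."""
    # Assumption: the threshold is satisfied when the count meets or exceeds it.
    if candidate.variant_count < count_threshold:
        return False
    # Each extra filter is a callable taking the candidate and returning a bool.
    return all(check(candidate) for check in extra_filters)
```

For example, a hypothetical coverage filter could be passed as `extra_filters=[lambda c: c.locus_depth >= 100]`; the cutoff of 100 is illustrative only.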
[0399]
As described in further detail herein, in some embodiments, one or more
validated
variant statuses 132-v are used to match (424-2) the subject with a targeted
therapy and/or a
clinical trial. In some embodiments, as described in further detail herein,
one or more
validated variant statuses 132-v for one or more actionable variants 139-1-1,
one or more
matched therapies 139-1-2, and/or one or more matched clinical trials are used
to generate
(426-2) a patient report 139-1-3. In some embodiments, the patient report is
transmitted to a
medical professional treating the subject. In some embodiments, the patient is
then
administered (428-2) a personalized course of therapy, e.g., based on a
matched therapy
and/or clinical trial.
[0400] In some embodiments, the methods of validating a candidate
somatic sequence
variant using a dynamic threshold described herein fall within the context of
a larger variant
detection method, e.g., method 450 as illustrated in Figures
4G1-4G3. For
example, in some embodiments, the method includes obtaining (452) cfDNA
sequence reads,
as described herein, and aligning (454) those reads to a reference construct
(e.g., a reference
genome or mapped representation of several reference genomes), to generate
aligned
sequences 124 (e.g., a plurality of unique sequence reads). In some
embodiments, putative
somatic sequence variants are identified (456), e.g., those sequence variants
having a variant
allele fraction that is lower than expected for a germline sequence variant
(which should be
around 50% after accounting for an estimated circulating tumor fraction for
the liquid biopsy
sample), e.g., less than 30%, less than 20%, less than 10% etc. One or more
candidate
somatic sequence variants are then validated by applying one or more filters.
For instance, as
described herein, a dynamic variant count threshold is determined (459) and
then used to
apply (460) a dynamic probabilistic variant count filter to sequencing data
for the candidate
somatic sequence variant. In some embodiments, the method also includes
applying (462) a
variant loci coverage filter. In some embodiments, the method also includes
applying (464) a
variant allele fraction filter. In some embodiments, the method also includes
applying (466) a
variant support mapping filter. In some embodiments, the method also includes
applying
(468) a variant support sequencing quality filter. In some embodiments, the
method also
includes applying (470) a low complexity region filter. When all selected
candidate somatic
sequence variants have been validated or rejected according to these filters
(472), the process
proceeds with a reporting function.
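By way of illustration only, the identification of putative somatic sequence variants by variant allele fraction can be sketched as follows. The tuple layout and function names are assumptions, the 30% default is one of the example cutoffs given above, and the sketch omits any adjustment for the estimated circulating tumor fraction.

```python
def variant_allele_fraction(alt_count, total_count):
    """Fraction of fragments at a locus that support the variant allele."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return alt_count / total_count

def putative_somatic(variants, max_vaf=0.30):
    """Return names of variants whose allele fraction is below max_vaf.

    variants: iterable of (name, alt_count, total_count) tuples.
    """
    return [
        name
        for name, alt, total in variants
        if variant_allele_fraction(alt, total) < max_vaf
    ]
```

Variants retained by `putative_somatic` would then proceed to the dynamic variant count filter and any of the other filters (462) through (470) that are selected.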
[0401] In some embodiments, method 450 also includes validating
(474) the sequencing
data globally, using any of the metrics described herein. In some embodiments,
the
validation includes applying (476) a loci minimal coverage filter. In some
embodiments, the
validation includes applying (478) a loci central tendency coverage filter. In
some
embodiments, the validation includes applying (480) a total sequence read
filter. In some
embodiments, the validation includes applying (481) a sequence read quality
filter. In some
embodiments, the validation includes applying a sequencing control filter
(482). The entire
sequencing reaction is then validated or rejected (483) based on whether the
sequencing data
passes these global filters.
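By way of illustration only, the global validation step can be sketched as a set of named run-level checks. The metric names and numeric cutoffs below are hypothetical placeholders and are not values taken from this disclosure.

```python
# Hypothetical run-level filters; names and cutoffs are illustrative only.
GLOBAL_FILTERS = {
    "loci_minimal_coverage": lambda m: m["min_locus_coverage"] >= 100,
    "loci_median_coverage": lambda m: m["median_locus_coverage"] >= 1000,
    "total_sequence_reads": lambda m: m["total_reads"] >= 1_000_000,
}

def validate_run(metrics, global_filters=GLOBAL_FILTERS):
    """Return (passed, failed_filter_names) for one sequencing reaction."""
    failed = [name for name, check in global_filters.items() if not check(metrics)]
    # The whole reaction is rejected if any global filter fails.
    return (not failed, failed)
```

Returning the names of the failed filters, rather than only a pass or fail flag, is a design choice of this sketch that makes a rejected sequencing reaction easier to troubleshoot.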
[0402] In some embodiments, method 450 also includes validating
(485) one or more
germline mutations. In some embodiments, candidate germline sequence variants
are
identified (484), e.g., those sequence variants having a variant allele
fraction that is higher
than expected for a somatic sequence variant. In some embodiments, the
validation includes
applying (486) a germline-specific variant allele fraction filter. In some
embodiments, the
validation includes applying (487) a variant support mapping filter. In some
embodiments,
the validation includes applying (488) a variant support sequencing quality
filter. When all
selected candidate germline sequence variants have been validated or rejected
according to
these filters (489), the process proceeds with a reporting function.
[0403] As described in further detail herein, in some
embodiments, one or more validated
variant statuses 132-v are used to match (490) the subject with a targeted
therapy and/or a
clinical trial. In some embodiments, as described in further detail herein,
one or more
validated variant statuses 132-v for one or more actionable variants 139-1-1,
one or more
matched therapies 139-1-2, and/or one or more matched clinical trials are used
to generate
(492) a patient report 139-1-3. In some embodiments, the patient report is
transmitted to a
medical professional treating the subject. In some embodiments, the patient is
then
administered (494) a personalized course of therapy, e.g., based on a matched
therapy and/or
clinical trial.
[0404] In some embodiments, all, or nearly all, of the aligned
sequence reads are
evaluated to identify candidate sequence variants (e.g., candidate somatic
sequence variants
and/or candidate germline sequence variants). In other embodiments, a subset
of the aligned
sequence reads is evaluated to identify candidate sequence variants. For
example, in one
embodiment, a targeted-panel sequencing reaction is used to generate sequencing
data 122 and
only sequence reads corresponding to the target panel (on-target reads) are
evaluated to
identify candidate sequence variants. In some embodiments, a targeted-panel
sequencing
reaction is used to generate sequencing data 122 and a subset of sequence
reads
corresponding to a subset of the target panel are evaluated to identify
candidate sequence
variants. In some embodiments, a subset of the sequence reads corresponding to
a subset of
genes, regardless of whether the sequencing reaction is a targeted-panel
sequencing reaction,
a whole exome sequencing reaction, or a whole genome sequencing reaction, are
evaluated to
identify candidate sequence variants. In some embodiments, a subset of
sequence reads
corresponding to a defined set of regions within the genome, e.g., one or more
genes, one or
more introns, one or more exons, one or more subregions of an intron and/or
exon associated
with cancer etiology, etc., are evaluated to identify candidate sequence
variants.
[0405] Alternatively, in some embodiments, regardless of what
subset of aligned
sequence reads are evaluated to identify candidate sequence variants, only a
subset of
candidate sequence variants is further validated. For example, in some
embodiments, only
candidate sequence variants corresponding to the target panel (on-target
reads) are validated.
Similarly, in some embodiments, only candidate sequence variants corresponding
to a subset
of the target panel are validated. Likewise, in some embodiments, only
candidate sequence
variants corresponding to a subset of genes, regardless of whether the
sequencing reaction is
a targeted-panel sequencing reaction, a whole exome sequencing reaction, or a
whole genome
sequencing reaction, are validated. Similarly, in some embodiments, only
candidate variants
corresponding to a defined set of regions within the genome, e.g., one or more
genes, one or
more introns, one or more exons, one or more subregions of an intron and/or
exon associated
with cancer etiology, etc., are validated.
[0406] In some embodiments, different sets of sequence variants
are evaluated depending
on the type of cancer being evaluated. That is, when the subject has a first
type of cancer,
candidate sequence variants in a first set of genomic loci are evaluated,
typically associated
with the etiology of the first type of cancer and/or a particular course of
actionable therapy for
the first type of cancer, and when the subject has a second type of cancer,
candidate sequence
variants in a second set of genomic loci are evaluated, typically associated
with the etiology
of the second type of cancer and/or a particular course of actionable therapy for
the second type
of cancer. These selections may be applied at the level of initial sequence
read evaluation
(e.g., only sequence reads corresponding to a defined set of loci are
evaluated to identify a
candidate sequence variant) or the validation level (e.g., sequence reads
corresponding to a
larger set of loci are evaluated to identify candidate sequence variants, but
only those
candidates corresponding to a defined set are further validated).
[0407] Similarly, in some embodiments, for one or more target
loci falling within a gene
exon, only candidate sequence variants that would result in an amino acid
change in the
amino acid sequence encoded by the gene are evaluated. In some embodiments,
any
candidate sequence variant resulting in an amino acid change is evaluated. In
some
embodiments, candidate sequence variants resulting in a defined amino acid
change, e.g., an
amino acid change associated with cancer etiology and/or a particular
actionable cancer
therapy, are evaluated. In some embodiments, only a subset of validated
sequence variants is
included on a clinical report for the sample. That is, in some embodiments,
aligned sequence
reads corresponding to all or a subset of genomic loci are evaluated to
identify candidate
sequence variants, all or a subset of identified candidate sequence variants
are evaluated for
validation, and only a subset of all possibly validated sequence variants are
included on a
clinical report generated for the sample.
[0408] For example, lists of example candidate sequence variants
for evaluation in breast
cancer, non-small cell lung cancer, prostate cancer, pan cancer, and cancer of
unknown origin
are provided below. Standard nomenclature is used to describe chromosomal
location and
specific amino acid variants, as described further by the Human Genome
Variation Society,
e.g., at the URL
varnomen.hgvs.org/recommendations/protein/variant/substitution/.
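By way of illustration only, the one-letter amino acid substitution labels used in the lists herein (e.g., L755S, Q61*, Q61Q) can be parsed and classified as sketched below. The function names and the three category labels are assumptions introduced here; labels in other notations, such as the nucleotide substitution c.465-2A>T, are outside the scope of this sketch and raise an error.

```python
import re

# One-letter reference residue, position, then one-letter residue or "*" (stop).
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z*])$")

def parse_substitution(label):
    """Split a label such as 'L755S' into (reference, position, alternate)."""
    match = _SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a one-letter substitution: {label!r}")
    ref, position, alt = match.groups()
    return ref, int(position), alt

def classify_substitution(label):
    """Classify a substitution as 'nonsense', 'synonymous', or 'missense'."""
    ref, _, alt = parse_substitution(label)
    if alt == "*":
        return "nonsense"    # substitution to a stop codon
    if alt == ref:
        return "synonymous"  # encoded amino acid is unchanged
    return "missense"
```

A filter limited to variants that change the encoded amino acid could, under these assumptions, retain labels for which `classify_substitution` returns a value other than `"synonymous"`.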
[0409] For example, in some embodiments, the subject has breast
cancer and candidate
variants associated with at least one of the following genes and/or genetic
loci are evaluated:
ERBB2 (or a genetic locus including a chromosomal position of 17:37880220
and/or
17:37881064), EGFR (or a genetic locus including a chromosomal position of
7:55227926,
7:55242511, and/or 7:55249022), ESR1 (or a genetic locus including a
chromosomal position
of 6:152419922, 6:152419923 and/or 6:152419926), KRAS (or a genetic locus
including a
chromosomal position of 12:25380275, 12:25380276, 12:25380277, and/or
12:25380279),
MAP2K1 (or a genetic locus including a chromosomal position of 15:66729162
and/or
15:66729163), MET (or a genetic locus including a chromosomal position of
7:116422117
and/or 7:116423413), MTOR (or a genetic locus including a chromosomal position
of
1:11187094, 1:11187096, and/or 1:11187796), NTRK1 (or a genetic locus
including a
chromosomal position of 1:156846342, 1:156849044 and/or 1:156849144), and
PIK3CA (or
a genetic locus including a chromosomal position of 3:178936082, 3:178936091,
3:178936092, 3:178936093, 3:178952084, and/or 3:178952085). In some
embodiments, the
subject has breast cancer and candidate variants associated with at least 2,
at least 3, at least
4, at least 5, at least 6, at least 7, or at least 8 of the genes listed above
(or loci including the
enumerated corresponding chromosomal positions) are evaluated. In some
embodiments, the
subject has breast cancer and candidate variants associated with any of the
genes listed above
(or loci including the enumerated corresponding chromosomal positions) are
evaluated.
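By way of illustration only, the breast cancer genes and chromosomal positions enumerated above can be held in a lookup table and used to select candidate variants, as sketched below. The coordinates are copied as given in the text; the table name, the function names, and the simplification of matching on an exact position (rather than on a locus that includes the position) are assumptions of this sketch.

```python
# (chromosome, position) pairs copied from the breast cancer list above.
BREAST_CANCER_LOCI = {
    "ERBB2": {("17", 37880220), ("17", 37881064)},
    "EGFR": {("7", 55227926), ("7", 55242511), ("7", 55249022)},
    "ESR1": {("6", 152419922), ("6", 152419923), ("6", 152419926)},
    "KRAS": {("12", 25380275), ("12", 25380276), ("12", 25380277),
             ("12", 25380279)},
    "MAP2K1": {("15", 66729162), ("15", 66729163)},
    "MET": {("7", 116422117), ("7", 116423413)},
    "MTOR": {("1", 11187094), ("1", 11187096), ("1", 11187796)},
    "NTRK1": {("1", 156846342), ("1", 156849044), ("1", 156849144)},
    "PIK3CA": {("3", 178936082), ("3", 178936091), ("3", 178936092),
               ("3", 178936093), ("3", 178952084), ("3", 178952085)},
}

def gene_for_position(chrom, pos, loci=BREAST_CANCER_LOCI):
    """Return the gene whose listed positions include (chrom, pos), else None."""
    for gene, positions in loci.items():
        if (chrom, pos) in positions:
            return gene
    return None

def select_candidates(candidates, loci=BREAST_CANCER_LOCI):
    """Keep (chrom, pos) candidates that fall on a listed position."""
    return [c for c in candidates if gene_for_position(*c, loci) is not None]
```

An analogous table could be populated for each of the other cancer types described herein, with the table selected according to the cancer type of the subject.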
[0410] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the ERBB2 gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
ERBB2 gene includes variants resulting in an amino acid change selected from
L755*,
L755S, L755W, T798I, T798K, and T798R.
[0411] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the EGFR gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
EGFR gene includes variants resulting in an amino acid change selected from
G465*,
G465R, D761H, D761N, D761Y, V774L, and V774M.
[0412] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the ESR1 gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
ESR1 gene includes variants resulting in an amino acid change selected from
Y537D,
Y537H, Y537N, Y537C, Y537S, Y537F, D538A, D538G, and D538V.
[0413] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the KRAS gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
KRAS gene includes variants resulting in an amino acid change selected from
G60D, Q61H,
Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0414] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the MAP2K1 gene are
evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the MAP2K1 gene includes variants resulting in an amino acid change selected
from P124A,
P124S, P124T, P124R, P124L, and P124Q.
[0415] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the MET gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
MET gene includes variants resulting in an amino acid change selected from
F1200I,
F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0416] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the MTOR gene are
evaluated and/or
reported. In some embodiments, the subset of possible candidate sequence
variants in the
MTOR gene includes variants resulting in an amino acid change selected from
A2034E,
A2034G, A2034V, F2108F, F2108I, F2108L, and F2108V.
[0417] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the NTRK1 gene are
evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the NTRK1 gene includes variants resulting in an amino acid change selected
from G595R,
G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0418] In some of the embodiments described above where the
subject has breast cancer,
only a subset of possible candidate sequence variants in the PIK3CA gene are
evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the PIK3CA gene includes variants resulting in an amino acid change selected
from E542K,
E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, H1047D, H1047Y,
H1047N, H1047L, H1047P, and H1047R.
[0419] Similarly, in some embodiments, the subject has non-small
cell lung cancer and
candidate variants associated with at least one of the following genes and/or
genetic loci are
evaluated: ALK (or a genetic locus including a chromosomal position of
2:29443613,
2:29443631, 2:29443695, 2:29443697, 2:29445213, and/or 2:29445258), B2M (or a
genetic
locus including a chromosomal position of 15:45003745), BRAF (or a genetic
locus
including a chromosomal position of 7:140453135, 7:140453136, and/or
7:140453137),
EGFR (or a genetic locus including a chromosomal position of 7:55227926,
7:55241704,
7:55241705, 7:55241706, 7:55242469, 7:55242511, 7:55249022, 7:55249071,
7:55249091,
7:55249092, 7:55249093, 7:55249094, and/or 7:55259515), ERBB2 (or a genetic
locus
including a chromosomal position of 17:37880220), KRAS (or a genetic locus
including a
chromosomal position of 12:25378562, 12:25378643, 12:25380275, 12:25380276,
12:25380277, 12:25380279, 12:25398255, 12:25398280, 12:25398281, 12:25398282,
12:25398283, 12:25398284, and/or 12:25398285), MAP2K1 (or a genetic locus
including a
chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic
locus
including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or
a
genetic locus including a chromosomal position of 1:156846342, 1:156849044,
and/or
1:156849144), PIK3CA (or a genetic locus including a chromosomal position of
3:178936091, 3:178936092, 3:178936093, 3:178952072, 3:178952084, and/or
3:178952085),
and STK11 (or a genetic locus including a chromosomal position of 19:1218483,
19:1220370, 19:1220487, 19:1220629, and/or 19:1220649). In some embodiments,
the
subject has non-small cell lung cancer and candidate variants associated with
at least 2, at
least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least
9, or at least 10 of the
genes listed above (or loci including the enumerated corresponding chromosomal
positions)
are evaluated. In some embodiments, the subject has non-small cell lung cancer
and
candidate variants associated with any of the genes listed above (or loci
including the
enumerated corresponding chromosomal positions) are evaluated.
[0420] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the ALK
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the ALK gene includes variants resulting in an amino acid change
selected from
G1202*, G1202R, L1196L, L1196M, L1196V, F1174F, F1174L, F1174I, F1174V,
I1171N,
I1171S, I1171T, C1156F, C1156S, and C1156Y.
[0421] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the BRAF
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the BRAF gene includes variants resulting in an amino acid change
selected from
V600*, V600A, V600E, V600G, V600L, and V600M.
[0422] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the EGFR
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the EGFR gene includes variants resulting in an amino acid change
selected from
G465*, G465R, L718L, L718M, L718V, L718P, L718Q, L718R, L747I, L747L, L747V,
D761H, D761N, D761Y, V774L, V774M, T790K, T790M, T790R, C797G, C797R, C797S,
C797F, C797Y, C797*, C797C, C797W, L798F, L798I, L798V, L858P, L858Q, and
L858R.
[0423] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the
ERBB2 gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the ERBB2 gene includes variants resulting in an amino acid change
selected from
L755*, L755S, and L755W.
[0424] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the KRAS
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the KRAS gene includes variants resulting in an amino acid change
selected from
A146T, D119N, Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, Q61K, G60V, Q22K,
G13G, G13A, G13V, G13D, G13C, G13R, G13S, G12G, G12A, G12V, G12D, G12C,
G12R, and G12S.
[0425] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the
MAP2K1 gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the MAP2K1 gene includes variants resulting in an amino acid
change selected
from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0426] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the MET
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the MET gene includes variants resulting in an amino acid change
selected from
F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0427] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the
NTRK1 gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the NTRK1 gene includes variants resulting in an amino acid change
selected
from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0428] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the
PIK3CA gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the PIK3CA gene includes variants resulting in an amino acid
change selected
from E545*, E545K, E545Q, E545A, E545G, E545V, E545D, E545E, M1043V, H1047D,
H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0429] In some of the embodiments described above where the
subject has non-small cell
lung cancer, only a subset of possible candidate sequence variants in the
STK11 gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the STK11 gene includes variants resulting in an amino acid change
selected from
E120*, D194Y, S216F, and E223*, as well as nucleotide substitution c.465-2A>T.
[0430] Similarly, in some embodiments, the subject has prostate
cancer and candidate
variants associated with at least one of the following genes and/or genetic
loci are evaluated:
AR (or a genetic locus including a chromosomal position of X:66766292,
X:66931463,
X:66931504, X:66937370, X:66937371, X:66937372, X:66943543, X:66943549, and/or
X:66943552), EGFR (or a genetic locus including a chromosomal position of
7:55227926,
7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus including a
chromosomal
position of 17:37880220), KRAS (or a genetic locus including a chromosomal
position of
12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a genetic locus
including a
chromosomal position of 15:66729162 and/or 15:66729163), MET (or a genetic
locus
including a chromosomal position of 7:116422117 and/or 7:116423413), NTRK1 (or
a
genetic locus including a chromosomal position of 1:156846342, 1:156849044,
and/or
1:156849144), and PIK3CA (or a genetic locus including a chromosomal position
of
3:178952084 and/or 3:178952085). In some embodiments, the subject has prostate
cancer
and candidate variants associated with at least 2, at least 3, at least 4, at
least 5, at least 6, or
at least 7 of the genes listed above (or loci including the enumerated
corresponding
chromosomal positions) are evaluated. In some embodiments, the subject has
prostate cancer
and candidate variants associated with any of the genes listed above (or loci
including the
enumerated corresponding chromosomal positions) are evaluated.
[0431] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the AR gene
are evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the AR gene includes variants resulting in an amino acid change selected from
W435L,
L702H, L702P, L702R, V716M, W742G, W742R, W742*, W742L, W742S, W742C,
H875Y, F877L, T878A, T878P, and T878S.
[0432] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the EGFR gene
are evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the EGFR gene includes variants resulting in an amino acid change selected
from G465*,
G465R, D761H, D761N, D761Y, V774L, and V774M.
[0433] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the ERBB2
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the ERBB2 gene includes variants resulting in an amino acid change
selected from
L755*, L755S, and L755W.
[0434] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the KRAS gene
are evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the KRAS gene includes variants resulting in an amino acid change selected
from Q61H,
Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0435] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the MAP2K1
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the MAP2K1 gene includes variants resulting in an amino acid
change selected
from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0436] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the MET gene
are evaluated
and/or reported. In some embodiments, the subset of possible candidate
sequence variants in
the MET gene includes variants resulting in an amino acid change selected from
F1200I,
F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0437] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the NTRK1
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the NTRK1 gene includes variants resulting in an amino acid change
selected
from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0438] In some of the embodiments described above where the
subject has prostate
cancer, only a subset of possible candidate sequence variants in the PIK3CA
gene are
evaluated and/or reported. In some embodiments, the subset of possible
candidate sequence
variants in the PIK3CA gene includes variants resulting in an amino acid
change selected
from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
[0439] In one example, the cancer condition is any type of cancer
(for example, pan-
cancer) and the somatic variants validated by this method include variants
associated with
any of the following genes: EGFR (or a genetic locus including a chromosomal
position of
7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or a genetic locus
including a
chromosomal position of 17:37880220), KRAS (or a genetic locus including a
chromosomal
position of 12:25380275, 12:25380276, and/or 12:25380277), MAP2K1 (or a
genetic locus
including a chromosomal position of 15:66729162 and/or 15:66729163), MET (or a
genetic
locus including a chromosomal position of 7:116422117 and/or 7:116423413),
NTRK1 (or a
genetic locus including a chromosomal position of 1:156846342, 1:156849044,
and/or
1:156849144), PIK3CA (or a genetic locus including a chromosomal position of
3:178952084 and/or 3:178952085), and TP53. In some embodiments, the subject
has any
cancer (e.g., pan cancer) and candidate variants associated with at least 2,
at least 3, at least 4,
at least 5, at least 6, or at least 7 of the genes listed above (or loci
including the enumerated
corresponding chromosomal positions) are evaluated. In some embodiments, the
subject has
any cancer (e.g., pan cancer) and candidate variants associated with any of
the genes listed
above (or loci including the enumerated corresponding chromosomal positions)
are
evaluated.
[0440] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the EGFR
gene are evaluated and/or reported. In some embodiments, the subset of
possible candidate
sequence variants in the EGFR gene includes variants resulting in an amino
acid change
selected from G465*, G465R, D761H, D761N, D761Y, V774L, and V774M.
[0441] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the ERBB2
gene are evaluated and/or reported. In some embodiments, the subset of
possible candidate
sequence variants in the ERBB2 gene includes variants resulting in an amino
acid change
selected from L755*, L755S, and L755W.
[0442] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the KRAS
gene are evaluated and/or reported. In some embodiments, the subset of
possible candidate
sequence variants in the KRAS gene includes variants resulting in an amino
acid change
selected from Q61H, Q61Q, Q61L, Q61P, Q61R, Q61*, Q61E, and Q61K.
[0443] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the
MAP2K1 gene are evaluated and/or reported. In some embodiments, the subset of
possible
candidate sequence variants in the MAP2K1 gene includes variants resulting in
an amino acid
change selected from P124A, P124S, P124T, P124R, P124L, and P124Q.
[0444] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the MET
gene are evaluated and/or reported. In some embodiments, the subset of
possible candidate
sequence variants in the MET gene includes variants resulting in an amino acid
change
selected from F1200I, F1200L, F1200V, Y1230D, Y1230H, and Y1230N.
[0445] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the NTRK1
gene are evaluated and/or reported. In some embodiments, the subset of
possible candidate
sequence variants in the NTRK1 gene includes variants resulting in an amino
acid change
selected from G595R, G595W, F646I, F646L, F646V, D679A, D679G, and D679V.
[0446] In some of the embodiments described above where the
subject has any type of
cancer (e.g., pan-cancer), only a subset of possible candidate sequence
variants in the
PIK3CA gene are evaluated and/or reported. In some embodiments, the subset of
possible
candidate sequence variants in the PIK3CA gene includes variants resulting in
an amino acid
change selected from H1047D, H1047Y, H1047N, H1047L, H1047P, and H1047R.
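The per-gene subsets above can be read as an allow-list keyed by gene. The following is a minimal sketch under that reading, not the disclosed implementation: the dictionary name, the function name, and the choice to leave genes without an enumerated subset unrestricted are all assumptions made for illustration. Amino acid changes are compared as literal strings, so an entry such as "G465*" matches only that exact label.

```python
# Hypothetical allow-list built from three of the subsets enumerated above.
REPORTABLE_CHANGES = {
    "EGFR": {"G465*", "G465R", "D761H", "D761N", "D761Y", "V774L", "V774M"},
    "ERBB2": {"L755*", "L755S", "L755W"},
    "PIK3CA": {"H1047D", "H1047Y", "H1047N", "H1047L", "H1047P", "H1047R"},
}


def is_reportable(gene, protein_change, allow_list=REPORTABLE_CHANGES):
    """Return True when a candidate variant should be evaluated and/or reported.

    Assumption: a gene with no enumerated subset is not restricted.
    """
    allowed = allow_list.get(gene)
    if allowed is None:
        return True
    return protein_change in allowed
```

For example, is_reportable("EGFR", "V774M") is true under this sketch, while an EGFR change that is absent from the subset is skipped.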
[0447] Similarly, in some embodiments, the subject has a tumor of
unknown origin or a
cancer of unknown primary and candidate variants associated with at least one
of the
following genes and/or genetic loci are evaluated: EGFR (or a genetic locus
including a
chromosomal position of 7:55227926, 7:55242511, and/or 7:55249022), ERBB2 (or
a genetic
locus including a chromosomal position of 17:37880220), KRAS (or a genetic
locus
including a chromosomal position of 12:25380275, 12:25380276, 12:25380277,
and/or
12:25398255), MAP2K1 (or a genetic locus including a chromosomal position of
15:66729162 and/or 15:66729163), MET (or a genetic locus including a
chromosomal
position of 7:116422117 and/or 7:116423413), NRAS (or a genetic locus
including a
chromosomal position of 1:115258748), NTRK1 (or a genetic locus including a
chromosomal
position of 1:156846342, 1:156849044, and/or 1:156849144), PIK3CA (or a
genetic locus
including a chromosomal position of 3:178927980, 3:178952084 and/or
3:178952085), and
TP53. In some embodiments, the subject has any cancer (e.g., pan cancer) and
candidate
variants associated with at least 2, at least 3, at least 4, at least 5, at
least 6, at least 7, or at
least 8 of the genes listed above (or loci including the enumerated
corresponding
chromosomal positions) are evaluated. In some embodiments, the subject has any
cancer
(e.g., pan cancer) and candidate variants associated with any of the genes
listed above (or loci
including the enumerated corresponding chromosomal positions) are evaluated.
[0448] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the EGFR gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the EGFR gene includes
variants
resulting in an amino acid change selected from G465*, G465R, D761H, D761N,
D761Y,
V774L, and V774M.
[0449] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the ERBB2 gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the ERBB2 gene includes
variants
resulting in an amino acid change selected from L755*, L755S, and L755W.
[0450] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the KRAS gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the KRAS gene includes
variants
resulting in an amino acid change selected from Q61H, Q61Q, Q61L, Q61P, Q61R,
Q61*,
Q61E, Q61K, and Q22K.
[0451] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the MAP2K1 gene are evaluated and/or reported. In some
embodiments, the subset of possible candidate sequence variants in the MAP2K1
gene
includes variants resulting in an amino acid change selected from P124A,
P124S, P124T,
P124R, P124L, and P124Q.
[0452] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the MET gene are evaluated and/or reported. In some
embodiments, the
subset of possible candidate sequence variants in the MET gene includes
variants resulting in
an amino acid change selected from F1200I, F1200L, F1200V, Y1230D, Y1230H,
and
Y1230N.
[0453] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the NRAS gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the NRAS gene includes
variants
resulting in an amino acid change of G12S.
[0454] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the NTRK1 gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the NTRK1 gene includes
variants
resulting in an amino acid change selected from G595R, G595W, F646I, F646L,
F646V,
D679A, D679G, and D679V.
[0455] In some of the embodiments described above where the
subject has a tumor of
unknown origin or a cancer of unknown primary, only a subset of possible
candidate
sequence variants in the PIK3CA gene are evaluated and/or reported. In some
embodiments,
the subset of possible candidate sequence variants in the PIK3CA gene includes
variants
resulting in an amino acid change selected from C420R, H1047D, H1047Y, H1047N,
H1047L, H1047P, and H1047R.
[0456] In other embodiments, the cancer condition is acute
myeloid leukemia, adrenal
cancer, b cell lymphoma, basal cell carcinoma, biliary cancer, bladder cancer,
brain cancer,
breast cancer, cervical cancer, chromophobe renal cell carcinoma, clear cell
renal cell
carcinoma, colorectal cancer, confirm at path review (cancer type
unconfirmed), endocrine
tumor, endometrial cancer, esophageal cancer, gastric cancer, gastrointestinal
stromal tumor,
glioblastoma, head and neck cancer, head and neck squamous cell carcinoma,
heme other,
high-grade glioma, kidney cancer, liver cancer, low grade glioma,
medulloblastoma,
melanoma, meningioma, mesothelioma, multiple myeloma, neuroblastoma, non-clear
cell
renal cell carcinoma, non-small cell lung cancer, oropharyngeal cancer,
ovarian cancer, pan-
cancer, pancreatic cancer, peritoneal cancer, prostate cancer, sarcoma, skin
cancer, small cell
lung cancer, t cell lymphoma, testicular cancer, thymoma, thyroid cancer,
tumor of unknown
origin, or uveal melanoma.
[0457] In some embodiments, certain variants pre-identified on a
whitelist may be
rescued, e.g., not filtered out, when they fail to pass selective filters,
e.g., MSI/SN, a
Bayesian filtering method, and/or a coverage, VAF or region-based filter. The
rationale for
whitelisting a variant is to apply less stringent filtering criteria to such a
variant so that it can
be reviewed and/or reported. In some embodiments, one or more variant on the
whitelist is a
common pathogenic variant, e.g., with high clinical relevance. In this
fashion, when a variant
on the whitelist fails to pass certain filters, it will be rescued and not
filtered out. As used
herein, MSI/SN refers to a variant filter for filtering out potential
artifactual variants based on
the MSI (microsatellite instable) and SN (signal-to-noise ratio) values
calculated by the
variant caller VarDict. See, for example, VarDict documentation, available on
the internet at
github.com/AstraZeneca-NGS/VarDictJava.
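The whitelist rescue described in this paragraph can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the Call record, the filter labels in RESCUABLE_FILTERS, and the example whitelist entries are hypothetical, and the sketch assumes a whitelisted variant is rescued only when every filter it failed belongs to the rescuable set.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    gene: str
    protein_change: str
    failed_filters: list = field(default_factory=list)


# Hypothetical labels for the selective filters named in the text.
RESCUABLE_FILTERS = {"msi_sn", "bayesian", "coverage", "vaf", "region"}

# Hypothetical whitelist entries, used only for illustration.
WHITELIST = {("KRAS", "Q61H"), ("PIK3CA", "H1047R")}


def keep_call(call, whitelist=WHITELIST, rescuable=RESCUABLE_FILTERS):
    """Keep a call that passed all filters, or rescue a whitelisted one."""
    if not call.failed_filters:
        return True
    on_whitelist = (call.gene, call.protein_change) in whitelist
    only_rescuable = set(call.failed_filters) <= rescuable
    return on_whitelist and only_rescuable
```

A rescued call is kept for review and/or reporting; a call that fails a filter outside the rescuable set is still removed in this sketch.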
[0458] In some embodiments, one or more locus and/or genomic
region is blacklisted,
preventing somatic variant annotation for variants identified at the locus or
region. In some
embodiments, the variant has a length of 120, 100, 80, 60, 40, 20, 10, 5 or
less base pairs. In
various embodiments, any combination of the additional criteria, as well as
additional criteria
not listed above, may be applied to the variant calling process. Again, in
some embodiments,
different criteria are applied to the annotation of different types of
variants.
[0459] In some embodiments, liquid biopsy assays are used to
detect variant alterations
present at low circulating fractions in the patient's blood. In such
circumstances, it may be
warranted to lower the requirements for positively identifying a variant. That
is, in some
embodiments, low levels of support may be sufficient to call a variant,
dependent upon the
reason for using the liquid biopsy assay.
[0460] In some embodiments, SNV/INDEL detection is accomplished
using VarDict
(available on the internet at github.com/AstraZeneca-NGS/VarDictJava). Both
SNVs and
INDELs are called and then sorted, deduplicated, normalized and annotated. The
annotation
uses SnpEff to add transcript information, 1000 genomes minor allele
frequencies, COSMIC
reference names and counts, ExAC allele frequencies, and Kaviar population
allele
frequencies. The annotated variants are then classified as germline, somatic,
or uncertain
using a Bayesian model based on prior expectations informed by databases of
germline and
cancer variants. In some embodiments, uncertain variants are treated as
somatic for filtering
and reporting purposes.
[0461] In some embodiments, genomic rearrangements (e.g.,
inversions, translocations,
and gene fusions) are detected following de-multiplexing by aligning tumor
FASTQ files
against a human reference genome using a local alignment algorithm, such as
BWA. In some
embodiments, DNA reads are sorted, and duplicates may be marked with a
software tool, for
example, SAMBlaster. Discordant and split reads may be further identified and
separated.
These data may be read into a software tool, for example, LUMPY, for structural
variant
detection. In some embodiments, structural alterations are grouped by type,
recurrence, and
presence and stored within a database and displayed through a fusion viewer
software tool.
The fusion viewer software tool may reference a database, for example,
Ensembl, to
determine the gene and proximal exons surrounding the breakpoint for any
possible transcript
generated across the breakpoint. The fusion viewer tool may then place the
breakpoint 5' or
3' to the subsequent exon in the direction of transcription. For inversions,
this orientation
may be reversed for the inverted gene. After positioning of the breakpoint,
the translated
amino acid sequences may be generated for both genes in the chimeric protein,
and a plot
may be generated containing the remaining functional domains for each protein,
as returned
from a database, for example, Uniprot.
[0462] For instance, in an example implementation, gene
rearrangements are detected
using the SpeedSeq analysis pipeline. Chiang et al., 2015, "SpeedSeq: ultra-
fast personal
genome analysis and interpretation," Nat Methods, (12), pg. 966. Briefly,
FASTQ files are
aligned to hg19 using BWA. Split reads mapped to multiple positions and read
pairs mapped
to discordant positions are identified and separated, then utilized to detect
gene
rearrangements by LUMPY. Layer et al., 2014, "LUMPY: a probabilistic framework
for
structural variant discovery," Genome Biol, (15), pg. 84. Fusions can then be
filtered
according to the number of supporting reads.
[0463] In some embodiments, putative fusion variants supported by
less than a minimum
number of unique sequence reads are filtered. In some embodiments, the minimum
number
of unique sequence reads is 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, or 20 unique
sequence reads.
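The read-support filter for putative fusions can be sketched in a few lines. The record layout (a dictionary with a "unique_reads" count) and the default threshold of 5 are assumptions for illustration; the text lists several possible minimums.

```python
def filter_fusions(fusions, min_unique_reads=5):
    """Drop putative fusions supported by fewer than min_unique_reads reads."""
    return [f for f in fusions if f["unique_reads"] >= min_unique_reads]


# Hypothetical candidates, for illustration only.
candidates = [
    {"name": "fusion_A", "unique_reads": 12},
    {"name": "fusion_B", "unique_reads": 2},
]
kept = filter_fusions(candidates)
```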
Allelic Fraction Determination
[0464] In some embodiments, the analysis of aligned sequence
reads, e.g., in SAM or
BAM format, includes determination of variant allele fractions (133) for one
or more of the
variant alleles 132 identified as described above. In some embodiments, a
variant allele
fraction module 151 tallies the instances that each allele is represented by a
unique sequence
read encompassing the variant locus of interest, generating a count for each
allele represented
at that locus. In some embodiments, these tallies are used to determine the
ratio of the variant
allele, e.g., an allele other than the most prevalent allele in the subject's
population for a
respective locus, to a reference allele. This variant allele fraction 133 can
be used in several
places in the feature extraction 206 workflow. For instance, in some
embodiments, a variant
allele fraction is used during annotations of identified variants, e.g., when
determining
whether the allele originated from a germline cell or a somatic cell. In other
instances, a
variant allele fraction is used in a process for estimating a tumor fraction
for a liquid biopsy
sample or a tumor purity for a solid tumor fraction. For instance, variant
allele fractions for a
plurality of somatic alleles can be used to estimate the percentage of
sequence reads
originating from one copy of a cancerous chromosome. Assuming a 100% tumor
purity and
that each cancer cell carries one copy of the variant allele, the overall
purity of the tumor can
be estimated. This estimate, of course, can be further corrected based on
other information
extracted from the sequencing data, such as copy number alterations, tumor
ploidy
aberrations, tumor heterozygosity, etc.
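As a rough illustration of the tallies described above, the sketch below computes a variant allele fraction from unique read counts and derives a naive tumor fraction from several somatic variants. It rests on assumptions that are not stated in the text: the fraction is computed as variant reads over variant plus reference reads (a common convention), and each somatic variant is assumed heterozygous in copy-neutral diploid tumor cells, so the tumor fraction is about twice the median fraction. Function names are hypothetical, and no correction for copy number or ploidy is attempted.

```python
from statistics import median


def variant_allele_fraction(alt_reads, ref_reads):
    """Fraction of unique reads at a locus that carry the variant allele."""
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads cover the locus")
    return alt_reads / total


def naive_tumor_fraction(somatic_vafs):
    """Assumption: one variant copy per diploid tumor cell, so VAF is about
    half of the tumor-derived fraction. Capped at 1.0."""
    return min(1.0, 2.0 * median(somatic_vafs))
```

With 5 variant reads and 95 reference reads the fraction is 0.05, and three somatic variants at 0.04, 0.05, and 0.06 give a naive tumor fraction of about 0.10.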
Methylation Determination
[0465] In some embodiments, where a nucleic acid sequencing library
was processed by
bisulfite treatment or enzymatic methyl-cytosine conversion, as described above,
the analysis
of aligned sequence reads, e.g., in SAM or BAM format, includes determination
of
methylation states 132 for one or more loci in the genome of the patient. In
some
embodiments, methylation sequencing data is aligned to a reference sequence
construct 158
in a different fashion than non-methylation sequencing, because non-methylated
cytosines are
converted to uracils, and the resulting uracils are ultimately sequenced as
thymines, whereas
methylated cytosines are not converted and are sequenced as cytosines. Different
approaches,
therefore, have to be used to align these modified sequences to a reference
sequence
construct, such as seeding alignments with shorter regions of identity or
converting all
cytosines to thymidines in the sequencing data and then aligning the data to
reference
sequence constructs for both the plus and minus strand of the sequence
construct. For a review
of these approaches, see Zhou Q. et al., BMC Bioinformatics, 20(47):1-11
(2019), the content
of which is hereby incorporated by reference, in its entirety, for all
purposes. Algorithms for
calling methylated bases are known in the art. For example, Bismark is able to
distinguish
between cytosines in CpG, CHG, and CHH contexts. Krueger F. and Andrews SR,
Bioinformatics, 27(11):1571-71 (2011), the content of which is hereby
incorporated by
reference, in its entirety, for all purposes.
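The in silico conversion strategy mentioned above can be illustrated with a short sketch. It is a simplified illustration, not Bismark's code: the function names are hypothetical, reads are assumed to originate from the plus strand, and the alignment is assumed to be gapless and already known. Converting every C to T in the reads and the reference lets a conventional aligner place the reads; the original read is then compared with the unconverted reference to call each cytosine.

```python
def c_to_t(seq):
    """Fully C-to-T converted copy of a sequence (plus-strand reference)."""
    return seq.upper().replace("C", "T")


def g_to_a(seq):
    """Fully G-to-A converted copy, modeling conversion on the minus strand."""
    return seq.upper().replace("G", "A")


def call_methylation(read, reference_window):
    """Compare an aligned plus-strand read with the unconverted reference.

    At each reference cytosine, a C in the read means the base was protected
    (methylated); a T means it was converted (unmethylated).
    """
    calls = []
    for i, (ref_base, read_base) in enumerate(zip(reference_window, read)):
        if ref_base == "C":
            if read_base == "C":
                calls.append((i, "methylated"))
            elif read_base == "T":
                calls.append((i, "unmethylated"))
    return calls
```

For the reference window ACGTCG and the read ATGTCG, the cytosine at offset 1 is called unmethylated and the cytosine at offset 4 is called methylated.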
Copy Number Variation:
[0466] In some embodiments, the analysis of aligned sequence
reads, e.g., in SAM or
BAM format, includes determination of the copy number 135 for one or more
loci, using a
copy number variation analysis module 153. For example, Figure 4F1 illustrates
a workflow
of an exemplary method 400-1 for validating copy number variation to be used
in generating
clinical reports to support clinical decision making in precision oncology, in
accordance with
some embodiments of the present disclosure. More specifically, method 400-1
describes a
bioinformatics pipeline for extraction and identification of genomic copy
number variation
(e.g., a method for feature extraction 206), in accordance with some
embodiments of the
present disclosure.
[0467] Referring to Block 402-1, the method comprises obtaining a
dataset of cell-free
DNA sequencing data. The sequencing data can be obtained using any of the
methods and/or
embodiments disclosed herein, including any of the implementations for wet lab
processing
204. In some embodiments, where both a liquid biopsy sample and a normal
tissue sample of
the patient are analyzed, de-duplicated BAM files and a VCF generated from the
variant
calling pipeline are used to compute read depth and variation in heterozygous
germline SNVs
between sequencing reads for each sample. By contrast, in some embodiments,
where only a
liquid biopsy sample is being analyzed, comparison between a tumor sample and
a pool of
process-matched normal controls is used.
[0468] Pre-processing and/or alignment can be applied to the
cfDNA sequencing data, as
described in detail above. For example, referring to Block 404-1, in some
embodiments,
sequence reads obtained from the cfDNA sequencing data are aligned to a
reference human
construct, thus generating a plurality of aligned reads 406-1. Referring to
Block 408-1, the
method further comprises optionally processing the aligned cfDNA sequence
reads by, for
example, normalization, filtering, and/or quality control, as described in
detail above.
[0469] Referring to Block 410-1, in some embodiments, the method
further comprises
obtaining for validation one or more copy number status annotations (e.g.,
amplified, neutral,
deleted). In some embodiments, the copy number status annotations are obtained
via copy
number analysis.
[0470] For instance, in an example implementation, copy number
variants (CNVs) are
analyzed using the CNVkit package. See, Talevich et al., PLoS Comput Biol,
12:1004873
(2016), the content of which is hereby incorporated by reference, in its
entirety, for all
purposes. CNVkit is used for genomic region binning, coverage calculation,
bias correction,
normalization to a reference pool, segmentation and visualization. The log2
ratios between the
tumor sample and a pool of process matched healthy samples from the CNVkit
output are
then annotated and filtered using statistical models whereby the amplification
status
(amplified or not-amplified) of each gene is predicted and non-focal
amplifications are
removed.
[0471] In some embodiments, copy number variations (CNVs) are
analyzed using a
combination of an open-source tool, such as CNVkit, and an
annotation/filtering algorithm,
e.g., implemented via a python script. CNVkit is used initially to perform
genomic region
binning, coverage calculation, bias correction, normalization to a reference
pool,
segmentation and, optionally, visualization. The bin-level copy ratios and
segment-level
copy ratios, in addition to their corresponding confidence intervals, from the
CNVkit output
are then used in the annotation and filtering where the copy number state
(amplified, neutral,
deleted) of each segment and bin are determined and non-focal
amplifications/deletions are
filtered out based on a set of acceptance criteria. In some embodiments, one
or more copy
number variations selected from amplifications in the MET, EGFR, ERBB2, CD274,
CCNE1, and MYC genes, and deletions in the BRCA1 and BRCA2 genes are analyzed.
However, the methods described herein are not limited to only these reportable
genes.
[0472] In some embodiments, CNV analysis is performed using a
tumor BAM file, a
target region BED file, a pool of process matched normal samples, and inputs
for initial
reference pool construction. Inputs for initial reference pool construction
include one or
more of normal BAM files, a human reference genome file, mappable regions of
the genome,
and a blacklist that contains recurrent problematic areas of the genome.
[0473] CNVkit utilizes both targeted captured sequencing reads
and non-specifically
captured off-target reads to infer copy number information. The targeted
genomic regions
specified in the probe target BED file are divided into target bins with an
average size of, e.g.,
100 base pairs, which can be specified by the user. The genomic regions
between the target
regions, e.g., excluding regions that cannot be mapped reliably, are
automatically divided into
off-target (also referred to as anti-target) bins with an average size of,
e.g., 150 kbp, which
again can be specified by the user. Raw log2-transformed depths are then
calculated from the
alignments in the input BAM file and written to two tab-delimited .cnn files,
one for each of
the target and off-target bins.
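The division of a region into bins of a chosen average size can be sketched as below. This is an illustrative helper, not CNVkit's code: the function name and the equal-width splitting rule are assumptions, and coordinates are treated as half-open intervals.

```python
def split_region(start, end, avg_bin_size=100):
    """Split the half-open interval [start, end) into near-equal bins whose
    average size is close to avg_bin_size."""
    length = end - start
    n_bins = max(1, round(length / avg_bin_size))
    edges = [start + (length * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))
```

The same helper could be called with a much larger avg_bin_size (for example 150000) for the off-target regions between targets.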
[0474] A pooled reference is constructed from a panel of process
matched normal
samples. The raw log2 depths of target and off-target bins in each normal
sample are
computed as described above, and then each are median-centered and corrected
for bias
including GC content, genome sequence repetitiveness, target size, and/or
spacing. The
corrected target and off-target log2 depths are combined, and a weighted
average and spread
are calculated as Tukey's biweight location and midvariance in each bin. These
values are
written to a tab delimited reference .cnn file, which is used to normalize an
input tumor
sample as follows.
[0475] The raw log2 depths of an input sample are median-centered
and bias-corrected as
described in the reference construction. The corrected log2 depth of each bin
is then
reduced by the corresponding log2 depth in the reference file, resulting in
the log copy
ratios (also referred to as copy ratios or log2 ratios) between the input
tumor sample and the
reference pool. These values are written to a tab-delimited .cnr file.
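The normalization step reduces to a per-bin subtraction in log2 space. The sketch below shows only the median-centering and the subtraction; the bias corrections (GC content, repetitiveness, target size, spacing) are omitted, and the function name is hypothetical.

```python
from statistics import median


def log2_copy_ratios(sample_log2_depths, reference_log2_depths):
    """Median-center the sample's log2 depths, then subtract the pooled
    reference log2 depth bin by bin. Bias correction is omitted here."""
    center = median(sample_log2_depths)
    centered = [depth - center for depth in sample_log2_depths]
    return [s - r for s, r in zip(centered, reference_log2_depths)]
```

A bin whose centered depth equals its reference depth yields a log2 ratio of 0, and a bin with twice the expected depth yields a ratio of 1.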
[0476] The copy ratios are then segmented, e.g., via a circular
binary segmentation (CBS)
algorithm or another suitable segmentation algorithm, whereby adjacent bins
are grouped to
larger genomic regions (segments) of equal copy number. The segment's copy
ratio is
calculated as the weighted mean of all bins within the segment. The confidence
interval of
the segment mean is estimated by bootstrapping the bin-level copy ratios
within the segment.
The segments' genomic ranges, copy ratios and confidence intervals are written
to a tab-
delimited .cns file.
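The segment summary described above (a weighted mean with a bootstrapped confidence interval) can be sketched with the standard library. This is an assumption-laden illustration rather than CNVkit's implementation: the function names, the percentile method, the number of resamples, and the fixed random seed are all choices made for the example.

```python
import random


def weighted_mean(values, weights):
    """Weighted mean of bin-level log2 copy ratios (weights must be positive)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_ci(values, weights, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the segment mean, resampling bins
    with replacement."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means.append(weighted_mean([values[i] for i in idx],
                                   [weights[i] for i in idx]))
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

A segment whose interval lies entirely above a chosen amplification cutoff is better supported than one whose interval straddles that cutoff.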
[0477] In some embodiments, copy number analysis includes
application of a circular
binary segmentation algorithm and selection of segments with highly
differential log2 ratios
between the cancer sample and its comparator (e.g., a matched normal or normal
pool). In
some embodiments, approximate integer copy number is assessed from a
combination of
differential coverage in segmented regions and an estimate of stromal
admixture (for
example, tumor purity, or the portion of a sample that is cancerous vs. non-
cancerous, such as
a tumor fraction for a liquid biopsy sample) generated by analysis of
heterozygous
germline SNVs. In some embodiments, the integer copy number of a genomic
segment in a
cancer sample is used to assign a copy number status annotation to the genomic
segment
(e.g., amplified, neutral, deleted) based on a comparison with the integer
copy number of a
corresponding genomic segment in a reference pool.
[0478] Validation filters. Referring again to Block 410-1, the
annotation/filtering
algorithm is subsequently applied to the bin-level copy ratios and segment-
level copy ratios,
in addition to their corresponding confidence intervals, obtained from the
CNVkit output.
The annotation/filtering algorithm comprises a plurality of filters for
validation of copy
number status annotations 412-1, including an optional median bin-level copy
ratio filter 414-
1; an optional segment-level confidence interval filter 416-1; an optional
median-plus-median
absolute deviation (MAD) bin-level copy ratio filter 418-1; and/or an optional
segment-level
copy ratio filter. Referring to Block 420-1, the method further comprises
validating or
rejecting a copy number variation as a focal copy number variation based on
the plurality of
copy number status annotation validation filters. Specifically, when a filter
in the plurality of
filters is fired, the copy number annotation of the segment is rejected, and
the copy number
variation is determined to be a non-focal copy number variation. When no filter
in the
plurality of filters is fired, the copy number annotation of the segment is
validated 422-1 and
the copy number variation is determined to be a focal copy number variation.
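The decision rule in this paragraph (any fired filter rejects the annotation; no fired filter validates it) can be sketched as follows. The filter names, the segment record layout, and both thresholds are hypothetical and are chosen only to make the example concrete for an amplification call; the actual acceptance criteria are described elsewhere in the disclosure.

```python
from statistics import median

# Hypothetical thresholds, for illustration only.
MIN_MEDIAN_BIN_LOG2 = 0.5
MIN_CI_LOWER_LOG2 = 0.3


def median_bin_filter(segment):
    """Fires when the median bin-level log2 ratio is too low."""
    return median(segment["bin_log2"]) < MIN_MEDIAN_BIN_LOG2


def segment_ci_filter(segment):
    """Fires when the lower confidence bound of the segment is too low."""
    return segment["ci_low"] < MIN_CI_LOWER_LOG2


FILTERS = {"median_bin": median_bin_filter, "segment_ci": segment_ci_filter}


def validate_annotation(segment, filters=FILTERS):
    """Return (status, fired): rejected as non-focal if any filter fires."""
    fired = [name for name, fires in filters.items() if fires(segment)]
    status = "rejected (non-focal)" if fired else "validated (focal)"
    return status, fired
```

Returning the list of fired filters alongside the status keeps the reason for a rejection available for review.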
[0479] The extracted features (e.g., validated status of copy
number variation 422-1) can
then be used for variant analysis 208 and clinical report generation (e.g., as
described in
further detail below with reference to Figure 2A). For example, referring to
Block 424-1, the
method further comprises matching therapies and/or clinical trials based on
the status (e.g.,
validated or rejected) of the respective copy number annotation. Referring to
Block 426-1,
the method further comprises generating a patient report indicating the CNV
status, in
addition to matched therapies and/or clinical trials based on the CNV status.
[0480] Specific embodiments and further details regarding systems
and methods for
validating copy number status annotations are provided in following sections
with reference
to Figures 5A1-5E1 and 6A1-6C1.
Microsatellite Instability (MSI):
[0481] In some embodiments, analysis of aligned sequence reads,
e.g., in SAM or BAM
format, includes analysis of the microsatellite instability status 137 of a
cancer, using a
microsatellite instability analysis module 154. In some embodiments, an MSI
classification
algorithm classifies a cancer into three categories: microsatellite
instability-high (MSI-H),
microsatellite stable (MSS), or microsatellite equivocal (MSE). Microsatellite
instability is a
clinically actionable genomic indication for cancer immunotherapy. In
microsatellite
instability-high (MSI-H) tumors, defects in DNA mismatch repair (MMR) can
cause a
hypermutated phenotype where alterations accumulate in the repetitive
microsatellite regions
of DNA. MSI detection is conventionally performed by subjecting tumor tissue
("solid
biopsy") to clinical next-generation sequencing or specific assays, such as
MMR IHC or MSI
PCR.
[0482] For example, microsatellite instability status can be
assessed by determining the
number of repeating units present at a plurality of microsatellite loci, e.g.,
5, 10, 15, 20, 25,
30, 40, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, or more loci. In some
embodiments,
only reads encompassing a microsatellite locus that include a significant
number of flanking
nucleotides on both ends, e.g., at least 5, 10, 15, or more nucleotides
flanking each end, are
used for the analysis in order to avoid using reads that do not completely
cover the locus. In
some embodiments, a minimal number of reads, e.g., at least 5, 10, 20, 30, 40,
50, or more
reads, have to meet these criteria in order to use a particular microsatellite
locus, in order to
ensure the accuracy of the determination given the high incidence of
polymerase slipping
during replication of these repeated sequences.
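The read-selection rule for microsatellite loci can be sketched as two small checks. Coordinates are assumed to be half-open intervals on the reference, and the function names and default thresholds (5 flanking bases, 30 reads) are illustrative picks from the ranges listed above.

```python
def read_spans_locus(read_start, read_end, locus_start, locus_end, min_flank=5):
    """True when the read covers the locus with at least min_flank bases on
    each side. All intervals are half-open: [start, end)."""
    left_flank = locus_start - read_start
    right_flank = read_end - locus_end
    return left_flank >= min_flank and right_flank >= min_flank


def locus_is_usable(reads, locus_start, locus_end, min_flank=5, min_reads=30):
    """True when enough reads fully span the locus with adequate flanks."""
    spanning = sum(
        read_spans_locus(start, end, locus_start, locus_end, min_flank)
        for start, end in reads
    )
    return spanning >= min_reads
```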
[0483] In some embodiments, each locus is tested individually for
instability, e.g., as
measured by a change or variance in the number of nucleotide base repeats,
e.g., in cancer-
derived nucleotide sequences relative to a normal sample or standard, for
example, using the
Kolmogorov-Smirnov test. For example, if p < 0.05, the locus is considered
unstable. The
proportion of unstable microsatellite loci may be fed into a logistic
regression classifier
trained on samples from various cancer types, especially cancer types which
have clinically
determined MSI statuses, for example, colorectal and endometrial cohorts. For
MSI testing
where only a liquid biopsy sample is analyzed, the mean and variance for the
number of
repeats may be calculated for each microsatellite locus. A vector containing
the mean and
variance data may be put into a classifier (e.g., a support vector machine
classification
algorithm) trained to provide a probability that the patient is MSI-H, which
may be compared
to a threshold value. In some embodiments, the threshold value for calling the
patient as
MSI-H is at least 60% probability, or at least 65% probability, 70%
probability, 75%
probability, 80% probability, or greater. In some embodiments, a baseline
threshold may be
established to call the patient as MSS. In some embodiments, the baseline
threshold is no
more than 40%, or no more than 35% probability, 30% probability, 25%
probability, 20%
probability, or less. In some embodiments, when the output of the classifier
falls within the
range between the MSI-H and MSS thresholds, the patient is identified as MSE.
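The thresholding logic in this paragraph can be sketched as follows. The per-locus p-values and the classifier probability are taken as given inputs (the statistical test and the trained classifier are not reproduced here), and the function names and the 70% and 30% defaults are illustrative choices from the ranges listed above.

```python
def unstable_fraction(p_values, alpha=0.05):
    """Proportion of microsatellite loci whose test p-value is below alpha."""
    if not p_values:
        raise ValueError("no loci were tested")
    return sum(p < alpha for p in p_values) / len(p_values)


def classify_msi(prob_msi_h, msi_h_threshold=0.70, mss_threshold=0.30):
    """Map a classifier probability to MSI-H, MSS, or MSE (equivocal)."""
    if prob_msi_h >= msi_h_threshold:
        return "MSI-H"
    if prob_msi_h <= mss_threshold:
        return "MSS"
    return "MSE"
```

A probability of 0.5 falls between the two thresholds and is reported as MSE in this sketch.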
[0484] Other methods for determining the MSI status of a subject
are known in the art.
For example, in some embodiments, microsatellite instability analysis module
154 employs
an MSI evaluation method described in U.S. Provisional Patent Application
Serial No.
62/881,845, filed August 1, 2019, or U.S. Provisional Application Serial No.
62/931,600,
filed November 6, 2019, the contents of which are hereby incorporated by
reference, in their
entireties, for all purposes.
Tumor Mutational Burden (TMB):
[0485] In some embodiments, the analysis of aligned sequence
reads, e.g., in SAM or
BAM format, includes determination of a mutation burden for the cancer (e.g.,
a tumor
mutational burden 136), using a tumor mutational burden analysis module 155.
Generally, a
tumor mutational burden is a measure of the mutations in a cancer per unit of
the patient's
genome. For example, a tumor mutational burden may be expressed as a measure
of central
tendency (e.g., an average) of the number of somatic variants per million base
pairs in the
genome. In some embodiments, a tumor mutational burden refers to only a set of
possible
mutations, e.g., one or more of SNVs, MNVs, indels, or genomic rearrangements.
In some
embodiments, a tumor mutational burden refers to only a subset of one or more
types of
possible mutations, e.g., non-synonymous mutations, meaning those mutations
that alter the
amino acid sequence of an encoded protein. In other embodiments, for example,
a tumor
mutational burden refers to the number of one or more types of mutations that
occur in
protein coding sequences, e.g., regardless of whether they change the amino
acid sequence of
the encoded protein.
[0486] As an example, in some embodiments, a tumor mutational
burden (TMB) is
calculated by dividing the number of mutations (e.g., all variants or non-
synonymous
variants) identified in the sequencing data (e.g., as represented in a VCF
file) by the size (e.g.,
in megabases) of a capture probe panel used for targeted sequencing. In some
embodiments,
a variant is included in tumor mutation burden calculation only when certain
criteria are met.
For instance, in some embodiments, a threshold sequence coverage for the locus
associated
with the variant must be met before the variant is included in the
calculation, e.g., at least
25x, 50x, 75x, 100x, 250x, 500x, or greater. Similarly, in some embodiments, a
minimum
number of unique sequence reads encompassing the variant allele must be
identified in the
sequencing data, e.g., at least 4, 5, 6, 7, 8, 9, 10, or more unique sequence
reads. In some
embodiments, a variant allelic fraction threshold must be satisfied
before the
variant is included in the calculation, e.g., at least 0.01%, 0.1%, 0.25%,
0.5%, 0.75%, 1%,
1.5%, 2%, 2.5%, 3%, 4%, 5%, or greater. In some embodiments, the inclusion
criteria may be
different for different types of variants and/or different variants of the
same type. For
instance, a variant detected in a mutation hotspot within the genome may face
less rigorous
criteria than a variant detected in a more stable locus within the genome.
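The calculation described in paragraph [0486] can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the `Variant` record, its field names, and the default thresholds (chosen from the example values listed above) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    coverage: int        # sequence coverage at the variant locus
    alt_reads: int       # unique reads encompassing the variant allele
    vaf: float           # variant allele fraction as a fraction (0.01 == 1%)
    nonsynonymous: bool  # True if the variant alters the encoded protein


def tumor_mutational_burden(variants, panel_size_mb, min_coverage=100,
                            min_alt_reads=5, min_vaf=0.01,
                            nonsynonymous_only=True):
    """Count variants passing the inclusion criteria, per megabase of panel."""
    if panel_size_mb <= 0:
        raise ValueError("panel_size_mb must be positive")
    counted = sum(
        1 for v in variants
        if v.coverage >= min_coverage
        and v.alt_reads >= min_alt_reads
        and v.vaf >= min_vaf
        and (v.nonsynonymous or not nonsynonymous_only)
    )
    return counted / panel_size_mb


variants = [
    Variant(coverage=500, alt_reads=12, vaf=0.024, nonsynonymous=True),
    Variant(coverage=40, alt_reads=6, vaf=0.15, nonsynonymous=True),    # fails coverage
    Variant(coverage=800, alt_reads=3, vaf=0.004, nonsynonymous=True),  # fails reads and VAF
    Variant(coverage=300, alt_reads=9, vaf=0.03, nonsynonymous=False),  # synonymous
]
tmb = tumor_mutational_burden(variants, panel_size_mb=0.5)
```

Per-variant-type or hotspot-specific criteria, as contemplated above, could be modeled by passing different thresholds for different subsets of variants.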
[0487] Other methods for calculating tumor mutation burden in
liquid biopsy samples
and/or solid tissue samples are known in the art. See, for example, Fenizia F
et al., Transl
Lung Cancer Res., 7(6):668-77 (2018) and Georgiadis A et al., Clin. Cancer
Res.,
25(23):7024-34 (2019), the disclosures of which are hereby incorporated by
reference, in
their entireties, for all purposes.
Homologous Recombination Status (HRD):
[0488] In some embodiments, analysis of aligned sequence reads,
e.g., in SAM or BAM
format, includes analysis of whether the cancer is homologous recombination
deficient (HRD
status 137-3), using a homologous recombination pathway analysis module 157.
[0489] Homologous recombination (HR) is a normal, highly
conserved DNA repair
process that enables the exchange of genetic information between identical or
closely related
DNA molecules. It is most widely used by cells to accurately repair harmful
breaks (e.g.,
damage) that occur on both strands of DNA. DNA damage may occur from exogenous
(external) sources like UV light, radiation, or chemical damage; or from
endogenous
(internal) sources like errors in DNA replication or other cellular processes
that create DNA
damage. Double strand breaks are a type of DNA damage. Using poly (ADP-ribose)
polymerase (PARP) inhibitors in patients with HRD compromises two pathways of
DNA
repair, resulting in cell death (apoptosis). The efficacy of PARP inhibitors
is improved not
only in ovarian cancers displaying germline or somatic BRCA mutations, but
also in cancers
in which HRD is caused by other underlying etiologies.
[0490] In some embodiments, HRD status can be determined by
inputting features
correlated with HRD status into a classifier trained to distinguish between
cancers with
homologous recombination pathway deficiencies and cancers without homologous
recombination pathway deficiencies. For example, in some embodiments, the
features
include one or more of (i) a heterozygosity status for a first plurality of
DNA damage repair
genes in the genome of the cancerous tissue of the subject, (ii) a measure of
the loss of
heterozygosity across the genome of the cancerous tissue of the subject, (iii)
a measure of
variant alleles detected in a second plurality of DNA damage repair genes in
the genome of
the cancerous tissue of the subject, and (iv) a measure of variant alleles
detected in the second
plurality of DNA damage repair genes in the genome of the non-cancerous tissue
of the
subject. In some embodiments, all four of the features described above are
used as features in
an HRD classifier. More details about HRD classifiers using these and other
features are
described in U.S. Patent Application Serial No. 16/789,363, filed February 12,
2020, the
content of which is hereby incorporated by reference, in its entirety, for all
purposes.
Circulating Tumor Fraction:
[0491] In some embodiments, the analysis of aligned sequence
reads, e.g., in SAM or
BAM format, includes estimation of a circulating tumor fraction for the liquid
biopsy sample.
Tumor fraction or circulating tumor fraction is the fraction of cell free
nucleic acid molecules
in the sample that originates from a cancerous tissue of the subject, rather
than from a
non-cancerous tissue (e.g., a germline or hematopoietic tissue). Several open
source analysis
packages have modules for calculating tumor fraction from solid tumor samples.
For
instance, PureCN (Riester, M., et al., Source Code Biol Med, 11:13 (2016)) is
designed to
estimate tumor purity from targeted short-read sequencing data of solid tumor
samples.
Similarly, FACETS (Shen R, Seshan VE, Nucleic Acids Res., 44(16):e131 (2016))
is
designed to estimate tumor fraction from sequencing data of solid tumor
samples. However,
estimating tumor fraction from a liquid biopsy sample is more difficult
because of the
generally lower tumor fraction relative to a solid tumor sample and the typically
small size of a
targeted panel used for liquid biopsy sequencing. Indeed, packages such as
PureCN and
FACETS perform poorly at low tumor fractions and with sequencing data
generated using
small targeted-panels.
[0492] In some embodiments, circulating tumor fraction is
estimated from a targeted-
panel sequencing reaction of a liquid biopsy sample using an off-target read
methodology,
e.g., as described herein with reference to Figures 4 and 5 (e.g., Figures
4F3, 5A3-5B3).
Briefly, a circulating tumor fraction estimate is determined from reads in the
target captured
regions, as well as off-target reads uniformly distributed across the human
reference genome.
Segments having similar copy ratios, e.g., as assigned via circular binary
segmentation (CBS)
during CNV analysis, are fit to integer copy states, e.g., via an
expectation-maximization
algorithm using the sum of squared error of the segment log2 ratios
(normalized to genomic
interval size) to expected ratios given a putative copy state and tumor
fraction. A measure of
fit between corresponding segment-level coverage ratios and assigned integer
copy states
across the plurality of simulated circulating tumor fractions is then used to
select the
simulated circulating tumor fraction to be used as the circulating tumor
fraction for the liquid
biopsy sample. In some embodiments, error minimization is used to identify the
simulated
tumor fraction providing the best fit to the data.
[0493] In some embodiments, circulating tumor fraction is
estimated from a targeted-
panel sequencing reaction of a liquid biopsy sample using an off-target read
methodology,
e.g., as described herein with reference to Figures 4 and 5 (e.g., Figures
4F3, 5A3-5B3).
Briefly, a circulating tumor fraction estimate is determined from reads in the
target captured
regions, as well as off-target reads uniformly distributed across the human
reference genome.
Segments having similar copy ratios, e.g., as assigned via circular binary
segmentation (CBS)
during CNV analysis, are fit to integer copy states, e.g., via an
expectation-maximization
algorithm using the sum of squared error of the segment log2 ratios
(normalized to genomic
interval size) to expected ratios given a putative copy state and tumor
fraction. For more
information on expectation maximization algorithms see, for example, Sundberg,
Rolf
(1974). "Maximum likelihood theory for incomplete data from an exponential
family".
Scandinavian Journal of Statistics. 1(2): 49-58, the content of which is
hereby incorporated
by reference in its entirety. A measure of fit between corresponding
segment-level coverage
ratios and assigned integer copy states across the plurality of simulated
circulating tumor
fractions is then used to select the simulated circulating tumor fraction to
be used as the
circulating tumor fraction for the liquid biopsy sample. In some embodiments,
error
minimization is used to identify the simulated tumor fraction providing the
best fit to the data.
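The selection of a simulated tumor fraction by error minimization can be illustrated with a simplified sketch. It is not the expectation-maximization procedure itself: it assumes a diploid normal background, a hypothetical set of allowed tumor copy states, a plain grid of simulated tumor fractions, and a length-weighted sum of squared errors. All function and parameter names are illustrative.

```python
import math


def expected_log2_ratio(copy_state, tumor_fraction, normal_copies=2):
    """log2 coverage ratio expected for a segment at a given tumor copy state."""
    mixed = tumor_fraction * copy_state + (1.0 - tumor_fraction) * normal_copies
    # Floor avoids log2(0) for a homozygous deletion at tumor fraction 1.0.
    return math.log2(max(mixed, 1e-6) / normal_copies)


def fit_error(segments, tumor_fraction, copy_states=(0, 1, 2, 3)):
    """Length-weighted sum of squared errors, with each segment assigned
    to its best-fitting integer copy state."""
    total_length = sum(length for _, length in segments)
    error = 0.0
    for log2_ratio, length in segments:
        best = min((log2_ratio - expected_log2_ratio(c, tumor_fraction)) ** 2
                   for c in copy_states)
        error += best * length / total_length
    return error


def estimate_tumor_fraction(segments, grid=None):
    """Return the simulated tumor fraction with the lowest error, plus the
    full (error, fraction) profile for inspecting local minima."""
    grid = grid or [i / 100 for i in range(1, 100)]
    profile = [(fit_error(segments, f), f) for f in grid]
    return min(profile)[1], profile


# Synthetic segments: (segment log2 ratio, segment length in bases).
true_fraction = 0.2
segments = [(expected_log2_ratio(c, true_fraction), n)
            for c, n in [(1, 40_000_000), (2, 100_000_000), (3, 25_000_000)]]
best_fraction, profile = estimate_tumor_fraction(segments)
```

Widening `copy_states` (for example, allowing four copies) lets other fractions fit the same data equally well, which is one way the multiple local optima discussed in this section can arise.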
[0494] In some embodiments, a measure of fit between
corresponding segment-level
coverage ratios and assigned integer copy states across the plurality of
simulated circulating
tumor fractions (e.g., using an error minimization algorithm) provides a
number of local
optima (e.g., local minima for an error minimization model or local maxima for
a fit
maximization model) for the best fit between the segment-level coverage ratios
and assigned
integer copy states. In some such embodiments, a second estimate of
circulating tumor
fraction is used to select the local optima (e.g., the local minima in best
agreement with the
second estimate of circulating tumor fraction) to be used as the circulating
tumor fraction for
the liquid biopsy sample.
[0495] For example, in some embodiments, multiple local optima
(e.g., minima) can be
disambiguated based on a difference between somatic and germline variant
allele fractions.
The assumption is that the variant allele fraction (VAF) of germline variants
that exhibit loss
of heterozygosity (LOH) will increase or decrease by the amount approximately
equal to half
of the tumor purity (e.g., the circulating tumor fraction for a liquid biopsy
sample). With a
matched normal sample (e.g., where sequencing data for both a liquid biopsy
sample and a
non-cancerous sample from the subject is available, or where sequencing data
for both a solid
tumor sample and a non-cancerous sample from the subject is available), for a
given
heterozygous germline variant, the VAF delta can be calculated as delta =
abs(VAF_tumor - VAF_normal). However, for tumor only sequencing (e.g., where
sequencing data is only
available for a liquid biopsy sample or a solid tumor sample), the VAF_normal
is unknown. In
some embodiments, the VAF_normal is assumed to be 50%. To increase statistical
power and
power and
account for the imprecision in the VAF by sequencing, the delta for all such
variants are
calculated and the circulating tumor fraction estimate (ctFE) for this method
is calculated as
ctFE = max(2 x delta) for all variant delta values. While this can be used as
a method for
ctFE alone, its precision is limited by the number of detected LOH variants.
For a small
panel, there are few expected LOH variants and thus the ctFE may not be
precise on its own.
However, it can be used to disambiguate multiple local optima (e.g., minima),
especially for
high tumor fraction values estimated by the off-target read methodology
described herein.
For that, the off-target read methodology ctFE peaks corresponding to all the
local optima
(e.g., minima) are identified and the one closest to the ctFE estimated by LOH
delta is chosen
as the most likely global optimum (e.g., minimum).
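The LOH-based disambiguation in paragraph [0495] can be sketched as below. The function names and example values are hypothetical. When no matched normal is supplied, the normal VAF is assumed to be 50%, as described above.

```python
def ctfe_from_loh(tumor_vafs, normal_vafs=None, assumed_normal_vaf=0.5):
    """ctFE = max(2 x delta) over heterozygous germline variants showing LOH."""
    if not tumor_vafs:
        raise ValueError("at least one LOH variant is required")
    if normal_vafs is None:
        # Tumor-only sequencing: assume a 50% VAF in normal tissue.
        normal_vafs = [assumed_normal_vaf] * len(tumor_vafs)
    deltas = [abs(t - n) for t, n in zip(tumor_vafs, normal_vafs)]
    return max(2 * d for d in deltas)


def nearest_optimum(candidate_fractions, reference_estimate):
    """Pick the local optimum closest to an independent tumor fraction estimate."""
    return min(candidate_fractions, key=lambda f: abs(f - reference_estimate))


loh_vafs = [0.58, 0.41, 0.60, 0.43]      # VAFs of germline variants with LOH
ctfe = ctfe_from_loh(loh_vafs)           # approximately 0.2
local_minima = [0.10, 0.21, 0.42]        # fractions at off-target error minima
chosen = nearest_optimum(local_minima, ctfe)
```

With few LOH variants the `ctfe` value is imprecise, so it is used here only to choose among candidates and not as the reported estimate.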
[0496] Several other methods may also be used to estimate
circulating tumor fractions.
In some embodiments, these methods are used in combination with the off-target
tumor
estimate method described herein. For example, in some embodiments, one or
more of these
methodologies is used to generate an estimate of tumor fraction, which is then
used to
identify the nearest local optima (e.g., minima) obtained from the tumor
fraction estimation
methods described above, and further herein.
[0497] For example, the ichorCNA package applies a probabilistic
model to normalized
read coverages from ultra-low pass whole genome sequencing data of cell-free
DNA to
estimate tumor fraction in the liquid biopsy sample. For more information,
see,
Adalsteinsson, V.A. et al., Nat Commun 8:1324 (2017), the content of which is
disclosed
herein for its description of a probabilistic tumor fraction estimation model
in the "methods"
section. Similarly, Tiancheng H. et al., describe a Maximum Likelihood model
based on the
copy number of an allele in the sample and variant allele frequency in
paired-control samples.
For more information, see, Tiancheng H. et al., Journal of Clinical Oncology
37:15 suppl,
e13053-e13053 (2019), the content of which is disclosed herein for its
description of a
Maximum Likelihood tumor fraction estimation model.
[0498] In some embodiments, a statistic for somatic variant
allele fractions determined
for the liquid biopsy sample is used as an estimate for the circulating tumor
fraction of the
liquid biopsy sample. For example, in some embodiments, a measure of central
tendency
(e.g., a mean or median) for a plurality of variant allele fractions
determined for the liquid
biopsy sample is used as an estimate of circulating tumor fraction. In some
embodiments, a
lowest (minimum) variant allele fraction determined for the liquid biopsy
sample is used as
an estimate of circulating tumor fraction. In some embodiments, a highest
(maximum)
variant allele fraction determined for the liquid biopsy sample is used as an
estimate of
circulating tumor fraction. In some embodiments, a range defined by two or
more of these
statistics is used to limit the range of simulated tumor fraction analysis via
the off-target read
methodology described herein. For instance, in some embodiments, lower and
upper bounds
of the simulated tumor fraction analysis are defined by the minimum variant
allele fraction
and the maximum variant allele fraction determined for a liquid biopsy sample,
respectively.
In some embodiments, the range is further expanded, e.g., on either or both
the lower and
upper bounds. For example, in some embodiments, the lower bound of a simulated
tumor
fraction analysis is defined as 0.5-times the minimum variant allele fraction,
0.75-times the
minimum variant allele fraction, 0.9-times the minimum variant allele
fraction, 1.1-times the
minimum variant allele fraction, 1.25-times the minimum variant allele
fraction, 1.5-times the
minimum variant allele fraction, or a similar multiple of the minimum variant
allele fraction
determined for the liquid biopsy sample. Similarly, in some embodiments, the
upper bound
of a simulated tumor fraction analysis is defined as 2.5-times the maximum
variant allele
fraction, 2-times the maximum variant allele fraction, 1.75-times the maximum
variant allele
fraction, 1.5-times the maximum variant allele fraction, 1.25-times the
maximum variant
allele fraction, 1.1-times the maximum variant allele fraction, 0.9-times the
maximum variant
allele fraction, or a similar multiple of the maximum variant allele fraction
determined for the
liquid biopsy sample.
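As a rough illustration of paragraph [0498], the bounds of the simulated tumor fraction range can be derived from the observed somatic variant allele fractions. The default multiples are taken from the examples above. The cap at 1.0 and the evenly spaced grid are assumptions added for this sketch.

```python
def simulation_bounds(somatic_vafs, lower_multiple=0.5, upper_multiple=2.0):
    """Lower and upper bounds for the simulated tumor fraction analysis."""
    if not somatic_vafs:
        raise ValueError("at least one somatic variant allele fraction is required")
    lower = lower_multiple * min(somatic_vafs)
    # Assumption: a tumor fraction cannot exceed 1.0, so the upper bound is capped.
    upper = min(1.0, upper_multiple * max(somatic_vafs))
    return lower, upper


def simulated_fractions(lower, upper, steps=50):
    """Evenly spaced simulated tumor fractions between the bounds, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    width = (upper - lower) / (steps - 1)
    return [lower + i * width for i in range(steps)]


lower, upper = simulation_bounds([0.02, 0.05, 0.11])
grid = simulated_fractions(lower, upper, steps=22)
```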
[0499] In some embodiments, circulating tumor fraction is
estimated based on a
distribution of the lengths of cfDNA in the liquid biopsy sample. In some
embodiments,
sequence reads are binned according to their position within the genome, e.g.,
as described
elsewhere herein. For each bin, the length of each fragment is determined.
Each fragment is
then classified as belonging to one of a plurality of classes, e.g., one of
two classes
corresponding to a population of short fragments and a population of long
fragments. In
some embodiments, the classification is performed using a static length
threshold, e.g., that is
the same across all the bins. In some embodiments, the classification is
performed using a
dynamic length threshold. In some embodiments, a dynamic length threshold is
determined
by comparing the distribution of fragment lengths in liquid biopsy samples
from reference
subjects that do not have cancer to the distribution of fragment lengths in
liquid biopsy
samples from reference subjects that have cancer, in a positional fashion.
[0500] For example, in some embodiments, the comparison is done
over windows
spanning entire chromosomes, e.g., each chromosome defines a comparison window
over
which a dynamic length threshold is determined. In some embodiments, the
comparison is
done over a window spanning a single bin, e.g., each bin defines a comparison
window over
which a dynamic length threshold is determined. In certain embodiments, the
bin
determination may be made according to various genomic features. For example,
the
comparison window may be based on a chromosome by chromosome basis, or a
chromosomal arm by chromosomal arm basis. In some embodiments, the comparison
window is based on a gene level basis. In some embodiments, the comparison
window is a
fixed size, such as 1 kB, 5 kB, 10 kB, 25 kB, 50 kB, 100 kB, 250 kB, 500 kB, 1
MB, 2 MB, 3
MB, or more. In some embodiments, the reference subjects having cancer used to
determine
the dynamic fragment length are matched to the cancer type of the subject whose
liquid biopsy
sample is being evaluated.
[0501] Once each fragment is classified as belonging to either
the population of short
fragments or the population of long fragments, a model trained to estimate
circulating tumor
fraction based on fragment length distribution data across the genome is
applied to the binned
data to generate an estimate of the circulating tumor fraction for the liquid
biopsy sample. In
some embodiments, a comparison of (i) the population of short fragments and
(ii) the
population of long fragments is made for each bin, e.g., a ratio of the
number of short
fragments to the number of long fragments in each bin is determined and used
as an input for
the model. In some embodiments, the model is a probabilistic model (e.g., an
application of
Bayes theorem), a deep learning model (e.g., a neural network, such as a
convolutional neural
network), or an admixture model.
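The binning and short/long classification described in paragraphs [0499] to [0501] can be sketched as follows. The 150 bp static threshold, the 100 kB bin size, and the input tuple layout are assumptions chosen for illustration. The per-bin ratios would then be passed to a trained model, which is not shown.

```python
from collections import defaultdict


def short_long_ratios(fragments, bin_size=100_000, static_threshold=150,
                      dynamic_thresholds=None):
    """Per-bin ratio of short to long cfDNA fragments.

    fragments: iterable of (chromosome, start_position, fragment_length).
    dynamic_thresholds: optional {(chromosome, bin_index): length_threshold};
    bins without an entry fall back to the static threshold.
    """
    dynamic_thresholds = dynamic_thresholds or {}
    counts = defaultdict(lambda: [0, 0])  # bin -> [n_short, n_long]
    for chrom, start, length in fragments:
        bin_key = (chrom, start // bin_size)
        threshold = dynamic_thresholds.get(bin_key, static_threshold)
        counts[bin_key][0 if length < threshold else 1] += 1
    return {
        bin_key: (n_short / n_long if n_long else float("inf"))
        for bin_key, (n_short, n_long) in counts.items()
    }


fragments = [
    ("chr1", 10, 140), ("chr1", 500, 167), ("chr1", 900, 120),
    ("chr1", 150_000, 170), ("chr1", 160_000, 166),
]
static_ratios = short_long_ratios(fragments)
dynamic_ratios = short_long_ratios(
    fragments, dynamic_thresholds={("chr1", 1): 168})
```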
[0502] In some embodiments, two or more of the circulating tumor
estimation models
described herein are used to generate respective tumor fraction estimates,
which are
combined to form a final tumor fraction estimate. For example, in some
embodiments, a
measure of central tendency (e.g., a mean) for several tumor fraction
estimates is determined
and used as the final tumor fraction estimate. In some embodiments, a tumor
fraction
estimate derived from a plurality of estimation models, e.g., a measure of
central tendency for
several tumor fraction estimates is used to identify the nearest local optima
(e.g., minima)
obtained from the tumor fraction estimation methods described above, and
further herein.
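Combining several estimates, as in paragraph [0502], might look like the sketch below. The estimator outputs are made-up values and the function names are hypothetical. The mean and median are the only measures of central tendency shown.

```python
from statistics import mean, median


def combine_estimates(estimates, method="mean"):
    """Combine tumor fraction estimates from several models into one value."""
    if not estimates:
        raise ValueError("at least one estimate is required")
    if method == "mean":
        return mean(estimates)
    if method == "median":
        return median(estimates)
    raise ValueError(f"unsupported method: {method}")


def nearest_optimum(candidate_fractions, reference_estimate):
    """Select the local optimum closest to the combined estimate."""
    return min(candidate_fractions, key=lambda f: abs(f - reference_estimate))


# Hypothetical outputs of three different estimation models.
estimates = [0.18, 0.22, 0.26]
combined = combine_estimates(estimates)
chosen = nearest_optimum([0.05, 0.20, 0.45], combined)
```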
Quality Control
[0503] In some embodiments, a positive sensitivity control sample
is processed and
sequenced along with one or more clinical samples. In some embodiments, the
control
sample is included in at least one flow cell of a multi-flow cell reaction and
is processed and
sequenced each time a set of samples is sequenced or periodically throughout
the course of a
plurality of sets of samples. In some embodiments, the control includes a pool
of controls. In
some embodiments, a quality control analysis requires that read metrics of
variants present in
the control sample fall within acceptable criteria. In some embodiments, a
quality control
requires approval by a pathologist before the results are reported.
[0504] In some embodiments, the quality control system includes
methods that pass
samples for reporting if various criteria are met. Similarly, in some
embodiments, the system
includes methods that allow for more manual review if a sample does not meet
the criteria
established for automatic pass. In some embodiments, the criteria for pass of
panel
sequencing results include one or more of the following:
• A criterion for the on-target rate of the sequencing reaction, defined as
a comparison
(e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling
within the
targeted panel region of a genome and (ii) the number of sequenced nucleotides
or
reads falling outside of the targeted panel region of the genome. Generally,
an on-
target rate threshold will be selected based on the sequencing technology
used, the
size of the targeted panel, and the expected number of sequence reads
generated by
the combination of the technology and targeted panel used. For example, in
some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a minimum on-target rate threshold of at least
30%, at
least 40%, at least 50%, at least 60%, at least 70%, or greater. In some
embodiments,
the on-target rate criterion is implemented as a range of acceptable on-target
rates, e.g.,
requiring that the on-target rate for a reaction is from 30% to 70%, from 30%
to 80%,
from 40% to 70%, from 40% to 80%, and the like.
• A criterion for the number of total reads generated by the sequencing
reaction,
including both unique sequence reads and non-unique sequence reads. Generally,
a
total read number threshold will be selected based on the sequencing
technology used,
the size of the targeted panel, and the expected number of sequence reads
generated
by the combination of the technology and targeted panel used. For example, in
some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a minimum number of total reads threshold of at
least 100
million, 110 million, 120 million, 130 million, 140 million, 150 million, 160
million,
170 million, 180 million, 190 million, 200 million, or more total sequence
reads. In
some embodiments, the criterion is implemented as a range of acceptable number
of
total reads, e.g., requiring that the sequencing reaction generate from 50
million to
300 million total sequence reads, from 100 million to 300 million sequence
reads,
from 100 million to 200 million sequence reads, and the like.
• A criterion for the number of unique reads generated by the sequencing
reaction.
Generally, a unique read number threshold will be selected based on the
sequencing
technology used, the size of the targeted panel, and the expected number of
sequence
reads generated by the combination of the technology and targeted panel used.
For
example, in some embodiments where next generation sequencing-by-synthesis
technology is used, the criterion is implemented as a minimum number of unique
reads
threshold of at least 3 million, 4 million, 5 million, 6 million, 7 million, 8
million, 9
million, or more unique sequence reads. In some embodiments, the criterion is
implemented as a range of acceptable number of unique reads, e.g., requiring
that the
sequencing reaction generate from 2 million to 10 million unique sequence
reads, from 3
million to 9 million unique sequence
reads, and
the like.
• A criterion for unique read depth across the panel, defined as a measure
of central
tendency (e.g., a mean or median) for a distribution of the number of unique
reads in
the sequencing reaction encompassing the genomic regions targeted by each
probe.
For instance, in some embodiments, an average unique read depth is calculated
for
each targeted region defined in a target region BED file, using a first
calculation of
the number of reads mapped to the region multiplied by the read length,
divided by
the length of the region, if the length of the region is longer than the read
length, or
otherwise using a second calculation of the number of reads falling within the
region
multiplied by the read length. The median of unique read depth across the
panel is
then calculated as the median of those average unique read depths of all
targeted
regions. In some embodiments, the resolution as to how depth is calculated is
increased or decreased, e.g., in cases where it is necessary or desirable to
calculate
depth for each base, or for a single gene. Generally, a unique read depth
threshold
will be selected based on the sequencing technology used, the size of the
targeted
panel, and the expected number of sequence reads generated by the combination
of
the technology and targeted panel used. For example, in some embodiments where
next generation sequencing-by-synthesis technology is used, the criterion is
implemented as a minimum unique read depth threshold of at least 1500, 1750,
2000,
2250, 2500, 2750, 3000, 3250, 3500, or higher unique read depth. In some
embodiments, the criterion is implemented as a range of acceptable unique read
depth,
e.g., requiring that the sequencing reaction generate a unique read depth of
from 1000
to 4000, from 1500 to 4000, and the like.
• A criterion for the unique read depth of a lowest percentile across the
panel, defined
as a measure of central tendency (e.g., a mean or median) for a distribution
of the
number of unique reads in the sequencing reaction encompassing the genomic
regions
targeted by each probe that fall within the lowest percentile of genomic
regions by
read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth,
twentieth,
twenty-fifth, or similar percentile). Generally, a unique read depth at a
lowest
percentile threshold will be selected based on the sequencing technology used,
the
size of the targeted panel, the lowest percentile selected, and the expected
number of
sequence reads generated by the combination of the technology and targeted
panel
used. For example, in some embodiments where next generation sequencing-by-
synthesis technology is used, the criterion is implemented as a minimum unique
read
depth threshold at the fifth percentile of at least 500, 750, 1000, 1250,
1500, 1750,
2000, 2250, 2500, or higher unique read depth. In some embodiments, the
criterion is
implemented as a range of acceptable unique read depth at the fifth
percentile, e.g.,
requiring that the sequencing reaction generate a unique read depth at the
fifth
percentile of from 250 to 3000, from 500 to 3000, from 500 to 2500, and the
like.
• A criterion for the deamination or OxoG Q-score of a sequencing reaction,
defined as
a Q-score for the occurrence of artifacts arising from template
oxidation/deamination.
Generally, a deamination or OxoG Q-score threshold will be selected based on
the
sequencing technology used. For example, in some embodiments where next
generation sequencing-by-synthesis technology is used, the criterion is
implemented
as a minimum deamination or OxoG Q-score threshold of at least 10, 20, 30, 40,
50,
60, 70, 80, 90, or higher. In some embodiments, the criterion is implemented
as a
range of acceptable deamination or OxoG Q-scores, e.g., from 10 to 100, from
10 to
90, and the like.
• A criterion for the estimated contamination fraction of a sequencing
reaction,
defined as an estimate of the fraction of template fragments in the sample
being
sequenced arising from contamination of the sample, commonly expressed as a
decimal, e.g., where 1% contamination is expressed as 0.01. An example method
for
estimating contamination in a sequencing method is described in Jun G. et al.,
Am. J.
Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion
is
implemented as a maximum contamination fraction threshold of no more than
0.001,
0.0015, 0.002, 0.0025, 0.003, 0.0035, or 0.004. In some embodiments, the
criterion is
implemented as a range of acceptable contamination fractions, e.g., from
0.0005 to
0.005, from 0.0005 to 0.004, from 0.001 to 0.004, and the like.
• A criterion for the fingerprint correlation score of a sequencing
reaction, defined as a
Pearson correlation coefficient calculated between the variant allele
fractions of a set
of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An
example
method for determining a fingerprint correlation score is described in Sejoon
L. et al.,
Nucleic Acids Research, Volume 45, Issue 11, 20 June 2017, Page e103, the
content
of which is incorporated herein by reference, in its entirety, for all
purposes. For
example, in some embodiments, the criterion is implemented as a minimum
fingerprint correlation score threshold of at least 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8,
0.9, or higher. In some embodiments, the criterion is implemented as a range
of
acceptable fingerprint correlation scores, e.g., from 0.1 to 0.9, from 0.2 to
0.9, from
0.3 to 0.9, and the like.
• A criterion for the raw coverage of a minimum percentage of the genomic
regions
targeted by a probe, defined as a minimum number of unique reads in the
sequencing
reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%,
90%,
95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by
the
probe panel. In some embodiments, the term "unique read depth" is used to
distinguish deduplicated reads from raw reads that may contain multiple reads
sequenced from the same original DNA molecule via PCR. Generally, a raw
coverage of a minimum percentage of the genomic regions targeted by a probe
threshold will be selected based on the sequencing technology used, the size
of the
targeted panel, the minimum percentage selected, and the expected number of
sequence reads generated by the combination of the technology and targeted
panel
used. For example, in some embodiments where next generation sequencing-by-
synthesis technology is used, the criterion is implemented as a raw coverage
of 95%
of the genomic regions targeted by a probe threshold of at least 500, 750,
1000, 1250,
1500, 1750, 2000, 2250, 2500, or higher unique read depth. In some
embodiments,
the criterion is implemented as a range of acceptable unique read depth for
95% of the
genomic regions targeted by a probe, e.g., requiring that the sequencing
reaction
generate a unique read depth for 95% of the targeted regions of from 250 to
3000,
from 500 to 3000, from 500 to 2500, and the like.
• A criterion for the PCR duplication rate of a sequencing reaction,
defined as the
percentage of sequence reads that arise from the same template molecule as at
least
one other sequence read generated by the reaction. Generally, a PCR
duplication rate
threshold will be selected based on the sequencing technology used, the size
of the
targeted panel, and the expected number of sequence reads generated by the
combination of the technology and targeted panel used. For example, in some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a minimum PCR duplication rate threshold of at
least
91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. In some embodiments,
the criterion is implemented as a range of acceptable PCR duplication rates,
e.g., of
from 90% to 100%, from 90% to 99%, from 91% to 99%, and the like.
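The automatic-pass logic in paragraph [0504] can be sketched as a table of thresholds checked against run metrics. The metric names, the dictionary layout, and the particular thresholds (picked from the example values in the list above) are assumptions for illustration. A real system would choose them per sequencing technology and panel.

```python
# Hypothetical pass criteria: metric name -> (minimum, maximum); None = unbounded.
PASS_CRITERIA = {
    "on_target_rate": (0.30, None),
    "total_reads": (100_000_000, None),
    "unique_reads": (3_000_000, None),
    "median_unique_depth": (1500, None),
    "fifth_percentile_unique_depth": (500, None),
    "contamination_fraction": (None, 0.004),
    "fingerprint_correlation": (0.9, None),
}


def unmet_criteria(metrics, criteria=PASS_CRITERIA):
    """Return the names of criteria the run does not satisfy."""
    unmet = []
    for name, (minimum, maximum) in criteria.items():
        value = metrics[name]
        below = minimum is not None and value < minimum
        above = maximum is not None and value > maximum
        if below or above:
            unmet.append(name)
    return unmet


def qc_disposition(metrics, criteria=PASS_CRITERIA):
    """Pass automatically when every criterion is met; otherwise flag for review."""
    return "pass" if not unmet_criteria(metrics, criteria) else "manual_review"


run = {
    "on_target_rate": 0.62,
    "total_reads": 145_000_000,
    "unique_reads": 6_200_000,
    "median_unique_depth": 2800,
    "fifth_percentile_unique_depth": 900,
    "contamination_fraction": 0.0012,
    "fingerprint_correlation": 0.97,
}
disposition = qc_disposition(run)
```

Returning the list of unmet criteria, not just a boolean, gives a reviewer the reason a sample was routed to manual review.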
[0505] Similarly, in some embodiments, the quality control system
includes methods that
fail samples for reporting if various criteria are met. In some embodiments,
the system
includes methods that allow for more manual review if a sample does meet the
criteria
established for automatic fail. In some embodiments, the criteria for failing
panel sequencing
results include one or more of the following:
• A criterion for the on-target rate of the sequencing reaction, defined as
a comparison
(e.g., a ratio) of (i) the number of sequenced nucleotides or reads falling
within the
targeted panel region of a genome and (ii) the number of sequenced nucleotides
or
reads falling outside of the targeted panel region of the genome. Generally,
an on-
target rate threshold will be selected based on the sequencing technology
used, the
size of the targeted panel, and the expected number of sequence reads
generated by
the combination of the technology and targeted panel used. For example, in
some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a maximum on-target rate threshold of no more than
30%,
40%, 50%, 60%, 70%, or greater. That is, the criterion for failing the sample
is
satisfied when the on-target rate for the sequencing reaction is below the
maximum
on-target rate threshold. In some embodiments, the on-target rate criterion is
implemented as not falling within a range of acceptable on-target rates, e.g.,
falling
outside of an on-target rate for a reaction of from 30% to 70%, from 30% to
80%,
from 40% to 70%, from 40% to 80%, and the like.
• A criterion for the number of total reads generated by the sequencing
reaction,
including both unique sequence reads and non-unique sequence reads. Generally,
a
total read number threshold will be selected based on the sequencing
technology used,
the size of the targeted panel, and the expected number of sequence reads
generated
by the combination of the technology and targeted panel used. For example, in
some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a maximum number of total reads threshold of no
more
than 100 million, 110 million, 120 million, 130 million, 140 million, 150
million, 160
million, 170 million, 180 million, 190 million, 200 million, or more total
sequence
reads. That is, the criterion for failing the sample is satisfied when the
number of total
reads for the sequencing reaction is below the maximum total read threshold.
In some
embodiments, the criterion is implemented as not falling within a range of
acceptable
number of total reads, e.g., falling outside of a range of from 50 million to
300 million
total sequence reads, from 100 million to 300 million sequence reads, from 100
million to 200 million sequence reads, and the like.
• A criterion for the number of unique reads generated by the sequencing
reaction.
Generally, a unique read number threshold will be selected based on the
sequencing
technology used, the size of the targeted panel, and the expected number of
sequence
reads generated by the combination of the technology and targeted panel used.
For
example, in some embodiments where next generation sequencing-by-synthesis
technology is used, the criterion is implemented as a maximum number of unique
reads threshold of no more than 3 million, 4 million, 5 million, 6 million,
7 million, 8 million, 9 million, or more unique sequence reads. That is, the
criterion for failing the sample is satisfied when the number of unique reads
for the sequencing reaction is below the maximum unique read threshold. In some
embodiments, the criterion is implemented as not falling within a range of
acceptable numbers of unique reads, e.g., falling outside of a range of from
2 million to 10 million unique sequence reads, from 3 million to 9 million
unique sequence reads, and the like.
• A criterion for unique read depth across the panel, defined as a measure
of central
tendency (e.g., a mean or median) for a distribution of the number of unique
reads in
the sequencing reaction encompassing the genomic regions targeted by each
probe.
Generally, a unique read depth threshold will be selected based on the
sequencing
technology used, the size of the targeted panel, and the expected number of
sequence
reads generated by the combination of the technology and targeted panel used.
For
example, in some embodiments where next generation sequencing-by-synthesis
technology is used, the criterion is implemented as a maximum unique read
depth
threshold of no more than 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3250,
3500, or
higher unique read depth. That is, the criterion for failing the sample is
satisfied when
the unique read depth across the panel for the sequencing reaction is below
the
maximum unique read depth threshold. In some embodiments, the criterion is
implemented as falling outside of a range of acceptable unique read depth,
e.g., falling outside of a unique read depth range of from 1000 to 4000, from
1500 to 4000, and the like.
• A criterion for the unique read depth of a lowest percentile across the
panel, defined
as a measure of central tendency (e.g., a mean or median) for a distribution
of the
number of unique reads in the sequencing reaction encompassing the genomic
regions
targeted by each probe that fall within the lowest percentile of genomic
regions by
read depth (e.g., the first, second, third, fourth, fifth, tenth, fifteenth,
twentieth,
twenty-fifth, or similar percentile). Generally, a unique read depth at a
lowest
percentile threshold will be selected based on the sequencing technology used,
the
size of the targeted panel, the lowest percentile selected, and the expected
number of
sequence reads generated by the combination of the technology and targeted
panel
used. For example, in some embodiments where next generation sequencing-by-
synthesis technology is used, the criterion is implemented as a maximum unique
read
depth threshold at the fifth percentile of no more than 500, 750, 1000, 1250,
1500,
1750, 2000, 2250, 2500, or higher unique read depth. That is, the criterion
for failing
the sample is satisfied when the unique read depth at a lowest percentile
threshold for
the sequencing reaction is below the maximum unique read depth at a lowest
percentile threshold. In some embodiments, the criterion is implemented as
falling
outside of a range of acceptable unique read depth at the fifth percentile,
e.g., falling
outside of a unique read depth at the fifth percentile range of from 250 to
3000, from
500 to 3000, from 500 to 2500, and the like.
• A criterion for the deamination or OxoG Q-score of a sequencing reaction,
defined as
a Q-score for the occurrence of artifacts arising from template
oxidation/deamination.
Generally, a deamination or OxoG Q-score threshold will be selected based on
the
sequencing technology used. For example, in some embodiments where next
generation sequencing-by-synthesis technology is used, the criterion is
implemented
as a maximum deamination or OxoG Q-score threshold of no more than 10, 20, 30,
40, 50, 60, 70, 80, 90, or higher. That is, the criterion for failing the
sample is
satisfied when the deamination or OxoG Q-score for the sequencing reaction is
below
the maximum deamination or OxoG Q-score threshold. In some embodiments, the
criterion is implemented as falling outside of a range of acceptable
deamination or
OxoG Q-scores, e.g., falling outside of a deamination or OxoG Q-score range of
from
to 100, from 10 to 90, and the like.
• A criterion for the estimated contamination fraction of a sequencing
reaction,
defined as an estimate of the fraction of template fragments in the sample
being
sequenced arising from contamination of the sample, commonly expressed as a
decimal, e.g., where 1% contamination is expressed as 0.01. An example method
for
estimating contamination in a sequencing method is described in Jun G. et al.,
Am. J.
Hum. Genet., 91:839-48 (2012). For example, in some embodiments, the criterion
is
implemented as a minimum contamination fraction threshold of at least 0.001,
0.0015,
0.002, 0.0025, 0.003, 0.0035, 0.004. That is, the criterion for failing the
sample is
satisfied when the contamination fraction for the sequencing reaction is above
the
minimum contamination fraction threshold. In some embodiments, the criterion is
implemented as falling outside of a range of acceptable contamination
fractions, e.g.,
falling outside of a contamination fraction range of from 0.0005 to 0.005,
from 0.0005
to 0.004, from 0.001 to 0.004, and the like.
• A criterion for the fingerprint correlation score of a sequencing
reaction, defined as a
Pearson correlation coefficient calculated between the variant allele
fractions of a set
of pre-defined single nucleotide polymorphisms (SNPs) in two samples. An
example
method for determining a fingerprint correlation score is described in Sejoon
L. et al.,
Nucleic Acids Research, Volume 45, Issue 11, 20 June 2017, Page e103. For
example, in some embodiments, the criterion is implemented as a maximum
fingerprint correlation score threshold of no more than 0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7,
0.8, 0.9, or higher. That is, the criterion for failing the sample is
satisfied when the
fingerprint correlation score for the sequencing reaction is below the maximum
fingerprint correlation score threshold. In some embodiments, the criterion is
implemented as falling outside of a range of acceptable fingerprint
correlation scores,
e.g., falling outside of a fingerprint correlation range of from 0.1 to 0.9,
from 0.2 to
0.9, from 0.3 to 0.9, and the like.
• A criterion for the raw coverage of a minimum percentage of the genomic
regions
targeted by a probe, defined as a minimum number of unique reads in the
sequencing
reaction encompassing each of a minimum percentage (e.g., at least 80%, 85%,
90%,
95%, 98%, 99%, 99.5%, 99.9%, and the like) of the genomic regions targeted by
the
probe panel. Generally, a raw coverage of a minimum percentage of the genomic
regions targeted by a probe threshold will be selected based on the sequencing
technology used, the size of the targeted panel, the minimum percentage
selected, and
the expected number of sequence reads generated by the combination of the
technology and targeted panel used. For example, in some embodiments where
next
generation sequencing-by-synthesis technology is used, the criterion is
implemented
as a raw coverage of 95% of the genomic regions targeted by a probe threshold
of no
more than 500, 750, 1000, 1250, 1500, 1750, 2000, 2250, 2500, or higher unique
read
depth. That is, the criterion for failing the sample is satisfied when the raw
coverage
of a minimum percentage of the genomic regions targeted by a probe for the
sequencing reaction is below the maximum raw coverage of a minimum percentage
of
the genomic regions targeted by a probe threshold. In some embodiments, the
criterion is implemented as falling outside of a range of acceptable unique
read depth
for 95% of the genomic regions targeted by a probe, e.g., requiring that the
sequencing reaction generate a unique read depth for 95% of the targeted
regions
falling outside of a range of from 250 to 3000, from 500 to 3000, from 500 to
2500,
and the like.
• A criterion for the PCR duplication rate of a sequencing reaction,
defined as the
percentage of sequence reads that arise from the same template molecule as at
least
one other sequence read generated by the reaction. Generally, a PCR
duplication rate
threshold will be selected based on the sequencing technology used, the size
of the
targeted panel, and the expected number of sequence reads generated by the
combination of the technology and targeted panel used. For example, in some
embodiments where next generation sequencing-by-synthesis technology is used,
the
criterion is implemented as a maximum PCR duplication rate threshold of at
least
91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or higher. That is, the criterion
for failing the sample is satisfied when the PCR duplication rate for the
sequencing
reaction is below the maximum PCR duplication rate threshold. In some
embodiments, the criterion is implemented as falling outside of a range of
acceptable
PCR duplication rates, e.g., of from 90% to 100%, from 90% to 99%, from 91% to
99%, and the like.
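The auto-fail criteria listed above can be illustrated with a minimal Python sketch. The names `Criterion`, `AUTO_FAIL`, `on_target_rate`, and `failed_criteria` are hypothetical, the cut-offs are single values picked from the example ranges recited above, and the on-target rate is computed here as the on-target fraction of all reads, which is only one possible form of the recited comparison.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Criterion:
    """One auto-fail rule: the sample fails when the metric is outside [low, high]."""
    low: Optional[float] = None   # fail if metric < low
    high: Optional[float] = None  # fail if metric > high

    def fails(self, value: float) -> bool:
        if self.low is not None and value < self.low:
            return True
        if self.high is not None and value > self.high:
            return True
        return False


# Illustrative cut-offs only (assumed, chosen from the example values in the text).
AUTO_FAIL: Dict[str, Criterion] = {
    "on_target_rate": Criterion(low=0.30),
    "total_reads": Criterion(low=100_000_000),
    "unique_reads": Criterion(low=3_000_000),
    "median_unique_depth": Criterion(low=1500),
    "contamination_fraction": Criterion(high=0.004),
}


def on_target_rate(on_target: int, off_target: int) -> float:
    """On-target reads as a fraction of all reads (an assumed form of the ratio)."""
    total = on_target + off_target
    return on_target / total if total else 0.0


def failed_criteria(metrics: Dict[str, float]) -> List[str]:
    """Return the names of every auto-fail criterion the sample satisfies."""
    return [
        name
        for name, rule in AUTO_FAIL.items()
        if name in metrics and rule.fails(metrics[name])
    ]
```

A sample would be failed automatically when `failed_criteria` returns a non-empty list; metrics missing from the input are simply not evaluated in this sketch.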
[0506] Thresholds for the auto-pass and auto-fail criteria may be
established with
reference to one another but are not necessarily set at the same level. For
instance, in some
embodiments, samples with a metric that falls between auto-pass and auto-fail
criteria may be
routed for manual review by a qualified bioinformatics scientist. Samples that
are failed
either automatically or by manual review may be routed to medical and
laboratory teams for
final review and can be released for downstream processing at the discretion
of the laboratory
medical director or designee.
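The three-way routing described in this paragraph can be sketched as follows. This is an illustrative Python example, not the disclosed implementation: `Route`, `route_metric`, and `route_sample` are hypothetical names, the metric is assumed to be one where higher values are better, and the rule that the worst per-metric outcome decides the sample is an assumption.

```python
from enum import Enum
from typing import Iterable


class Route(Enum):
    AUTO_PASS = "auto_pass"
    MANUAL_REVIEW = "manual_review"
    AUTO_FAIL = "auto_fail"


def route_metric(value: float, fail_below: float, pass_at_or_above: float) -> Route:
    """Route one higher-is-better metric using separate auto-fail and auto-pass levels."""
    if fail_below > pass_at_or_above:
        raise ValueError("auto-fail level must not exceed auto-pass level")
    if value < fail_below:
        return Route.AUTO_FAIL
    if value >= pass_at_or_above:
        return Route.AUTO_PASS
    # Between the two levels: send to a bioinformatics scientist for review.
    return Route.MANUAL_REVIEW


def route_sample(metric_routes: Iterable[Route]) -> Route:
    """Combine per-metric routes; the worst outcome wins (an assumed policy)."""
    routes = list(metric_routes)
    if Route.AUTO_FAIL in routes:
        return Route.AUTO_FAIL
    if Route.MANUAL_REVIEW in routes:
        return Route.MANUAL_REVIEW
    return Route.AUTO_PASS
```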
Systems and Methods for Improved Validation of Copy Number Variation
[0507] An overview of methods for providing clinical support for
personalized cancer
therapy is described above with reference to Figures 2-4. Below, systems
and methods
for improving validation of copy number variation in a test subject, e.g.,
within the context of
the methods and systems described above, are described with reference to
Figures 5A1-E1
and 6A1-C1.
[0508] Many of the embodiments described below, in conjunction
with Figures 5A1-E1
and 6A1-C1, relate to analyses performed using sequencing data for cfDNA
obtained from a
liquid biopsy sample of a cancer patient. Generally, these embodiments are
independent and,
thus, not reliant upon any particular DNA sequencing methods. However, in some
embodiments, the methods described below include generating the sequencing
data.
[0509] In one aspect, the disclosure provides a method for
validating a copy number
variation (e.g., identifying a true focal copy number variation) in a test
subject, by applying
one or more filters to segmented copy ratio data from a sequencing assay
performed on a
liquid biopsy sample from the subject. The method includes obtaining, from a
first
sequencing reaction, a corresponding sequence of each cell-free DNA fragment
in a first
plurality of cell-free DNA fragments in a liquid biopsy sample of the test
subject, thereby
obtaining a first plurality of sequence reads, e.g., a plurality of
de-duplicated sequence reads,
where each sequence read corresponds to a unique cell-free DNA fragment from
the sample.
In some embodiments, the first plurality of sequence reads includes at least
1000 sequence
reads. In some embodiments, the first plurality of sequence reads includes at
least 10,000
sequence reads. In some embodiments, the first plurality of sequence reads
includes at least
100,000 sequence reads. In some embodiments, the first plurality of sequence
reads includes
at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000,
5,000,000
sequence reads, or more.
[0510] The method then includes aligning each respective sequence
read in the first
plurality of sequence reads to a reference sequence for the species of the
subject. As
described above, in some embodiments, the reference sequence is a reference
genome, e.g., a
reference human genome. In some embodiments, a reference genome has several
blacklisted
regions, such that the reference genome covers only about 75%, 80%, 85%, 90%,
95%, 98%,
99%, 99.5%, or 99.9% of the entire genome for the species of the subject. In
some
embodiments, the reference sequence for the subject covers at least 10% of the
entire genome
for the species of the subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%,
50%, 55%,
60%, 65%, 70%, 75%, or more of the entire genome for the species of the
subject. In some
embodiments, the reference sequence for the subject represents a partial or
whole exome for
the species of the subject. For instance, in some embodiments, the reference
sequence for the
subject covers at least 10% of the exome for the species of the subject, or at
least 15%, 20%,
25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%,
98%,
99%, 99.9%, or 100% of the exome for the species of the subject. In some
embodiments, the
reference sequence covers a plurality of loci that constitute a panel of
genomic loci, e.g., a
panel of genes used in a panel-enriched sequencing reaction. Examples of
genes useful for
precision oncology, e.g., which may be targeted with such a panel, are shown
in Table 1.
Accordingly, in some embodiments, the reference sequence for the subject
covers at least 100
kb of the genome for the species of the subject. In other embodiments, the
reference
sequence for the subject covers at least 250 kb, 500 kb, 750 kb, 1 Mb, 2 Mb, 5
Mb, 10 Mb, 25
Mb, 50 Mb, 100 Mb, 250 Mb, or more of the genome for the species of the
subject.
However, in some embodiments, there is no size limitation of the reference
sequence. For
example, in some embodiments, the reference sequence can be a sequence for a
single locus
(e.g., a single exon, gene, etc.) within the genome for the species of the
subject.
[0511] The method then includes determining several metrics for
the sequencing data. In
some embodiments, the metrics include a plurality of bin-level sequence
ratios, each
respective bin-level sequence ratio in the plurality of bin-level sequence
ratios corresponding
to a respective bin in a plurality of bins. In some embodiments, the plurality
of bins includes
at least 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10,000, 25,000, 50,000,
or more bins
distributed across the reference sequence (e.g., the genome) for the species
of the subject. In
some embodiments, the bins are distributed relatively uniformly across the
reference
sequence, e.g., such that each bin encompasses a similar number of bases,
e.g., about 0.5 kb,
1 kb, 2 kb, 5 kb, 10 kb, 25 kb, 50 kb, 100 kb or more bases. Each respective
bin in the
plurality of bins represents a corresponding region of a reference sequence
(e.g., genome) for
the species of the subject. Each
respective bin-level sequence ratio in the plurality of bin-level sequence
ratios is determined
from a comparison of the first plurality of sequence reads to sequence reads
from one or more
reference samples. In some embodiments, the one or more reference samples are
process-matched reference samples. That is, in some embodiments the one or more
reference samples
are prepared for sequencing using the same methodology as used to prepare the
sample from
the test subject. Similarly, in some embodiments, the one or more reference
samples are
sequenced using the same sequencing methodology as used to sequence the sample
from the
test subject. In this fashion, internal biases for particular regions or
sequences are controlled
for in the reference samples.
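As one illustration of how a bin-level sequence ratio might be computed, the Python sketch below compares depth-normalized read counts per bin in the test sample against a process-matched reference. The log2 transform, the library-size normalization, the pseudocount, and the function name `bin_level_ratios` are all assumptions made for this example; the text only requires a comparison between test and reference reads.

```python
import math
from typing import List, Sequence


def bin_level_ratios(
    test_counts: Sequence[float],
    reference_counts: Sequence[float],
    pseudocount: float = 0.5,
) -> List[float]:
    """Return one log2(test / reference) ratio per bin.

    Counts are first divided by each sample's total so that differences in
    overall sequencing depth do not shift every ratio. The pseudocount keeps
    empty bins from producing log2(0).
    """
    if len(test_counts) != len(reference_counts):
        raise ValueError("test and reference must cover the same bins")
    test_total = sum(test_counts)
    ref_total = sum(reference_counts)
    if test_total <= 0 or ref_total <= 0:
        raise ValueError("each sample needs at least one read")
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")

    ratios = []
    for t, r in zip(test_counts, reference_counts):
        t_norm = (t + pseudocount) / test_total
        r_norm = (r + pseudocount) / ref_total
        ratios.append(math.log2(t_norm / r_norm))
    return ratios
```

Under these assumptions a bin with no copy number change sits near 0, and a bin with relatively more reads in the test sample than in the reference is positive.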
[0512] In some embodiments, the metrics include a plurality of
segment-level sequence
ratios, each respective segment-level sequence ratio in the plurality of
segment-level
sequence ratios corresponding to a segment in a plurality of segments. Each
respective
segment in the plurality of segments represents a corresponding region of the
reference
genome for the species of the subject encompassing a subset of adjacent bins
in the plurality
of bins. Each respective segment-level sequence ratio in the plurality of
segment-level
sequence ratios is determined from a measure of central tendency of the
plurality of bin-level
sequence ratios corresponding to the subset of adjacent bins encompassed by
the respective
segment. That is, in some embodiments, bins adjacent to each other in the
reference
sequence (e.g., reference genome) are grouped together to form segments of the
reference
sequence (e.g., genome) having similar sequence ratios and, therefore,
presumably the same
copy number in the cancerous tissue of the subject.
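A minimal Python sketch of the segment-level sequence ratio follows. It assumes the segmentation has already been performed and is supplied as half-open ranges of bin indices, and it uses the median as the measure of central tendency; `Segment` and `segment_level_ratios` are hypothetical names.

```python
from statistics import median
from typing import List, Sequence, Tuple

# A segment is a half-open range [start, end) of adjacent bin indices.
Segment = Tuple[int, int]


def segment_level_ratios(
    bin_ratios: Sequence[float], segments: Sequence[Segment]
) -> List[float]:
    """Summarize each segment by the median of its bin-level ratios."""
    out = []
    for start, end in segments:
        if not 0 <= start < end <= len(bin_ratios):
            raise ValueError(f"segment ({start}, {end}) is out of range")
        out.append(median(bin_ratios[start:end]))
    return out
```

A mean could be substituted for the median; the median is used here only because it is less sensitive to a few outlier bins.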
[0513] In some embodiments, the metrics include a plurality of
segment-level measures
of dispersion. Each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion corresponds to a respective segment in
the plurality
of segments. Each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion is determined using the plurality of bin-
level sequence
ratios corresponding to the subset of adjacent bins encompassed by the
respective segment.
That is, a measure of the dispersion of the individual bin-level sequence
ratios that make up a
segment is determined.
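The segment-level measure of dispersion can be sketched in the same way. The text does not fix a particular statistic, so this Python example offers two assumed choices, the population standard deviation and the median absolute deviation, under the hypothetical name `segment_dispersions`.

```python
from statistics import median, pstdev
from typing import List, Sequence, Tuple


def segment_dispersions(
    bin_ratios: Sequence[float],
    segments: Sequence[Tuple[int, int]],
    method: str = "stdev",
) -> List[float]:
    """Return one dispersion value per segment of adjacent bins.

    Segments are half-open index ranges [start, end). Supported methods are
    "stdev" (population standard deviation) and "mad" (median absolute
    deviation from the segment median).
    """
    out = []
    for start, end in segments:
        if not 0 <= start < end <= len(bin_ratios):
            raise ValueError(f"segment ({start}, {end}) is out of range")
        values = list(bin_ratios[start:end])
        if method == "stdev":
            out.append(pstdev(values))
        elif method == "mad":
            center = median(values)
            out.append(median(abs(v - center) for v in values))
        else:
            raise ValueError(f"unknown method: {method}")
    return out
```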
[0514] The method then includes validating a copy number status
annotation (e.g.,
determining whether a copy number variation is a focal amplification or
deletion) of a
respective segment in the plurality of segments that is annotated with a copy
number
variation by applying the first dataset to an algorithm having one or more
criteria filters. The
copy number status annotation of the respective segment (e.g., whether or not
a segment
represents a focal amplification or focal deletion) is then verified or
rejected based on a
predetermined pattern of firing or lack of firing of each of the filters in
the one or more
filters.
[0515] In some embodiments, the one or more filters include a
measure of central
tendency bin-level sequence ratio filter that is fired when a measure of
central tendency of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment fails to satisfy one or more bin-level sequence ratio
thresholds.
[0516] In some embodiments, the one or more filters include a
confidence filter that is
fired when the segment-level measure of dispersion corresponding to the
respective segment
fails to satisfy a confidence threshold.
[0517] In some embodiments, the one or more filters include a
measure of central
tendency-plus-deviation bin-level sequence ratio filter that is fired when a
measure of central
tendency of the plurality of bin-level sequence ratios corresponding to the
subset of bins
encompassed by the respective segment fails to satisfy one or more measure of
central
tendency-plus-deviation bin-level sequence ratio thresholds. The one or more
measure of
central tendency-plus-deviation bin-level sequence ratio thresholds are derived
from (i) a measure
of central tendency of the bin-level sequence ratios corresponding to the
plurality of bins that
map to the same chromosome of the reference genome for the species of the
subject as the
respective segment, and (ii) a measure of dispersion across the bin-level
sequence ratios
corresponding to the plurality of bins that map to the respective chromosome.
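The three filters described above can be sketched together for the case of a candidate amplification. Everything specific in this Python example is an assumption: the names `FilterConfig`, `run_filters`, and `validate_amplification`, the default threshold values, the use of the median and population standard deviation, and the rule that the annotation is verified only when no filter fires (one possible predetermined firing pattern).

```python
from dataclasses import dataclass
from statistics import median, pstdev
from typing import Dict, Sequence


@dataclass(frozen=True)
class FilterConfig:
    min_amp_ratio: float = 1.0         # assumed floor for the segment median
    max_dispersion: float = 0.5        # assumed confidence threshold
    deviation_multiplier: float = 2.0  # assumed multiple of chromosome spread


def run_filters(
    segment_bins: Sequence[float],
    chromosome_bins: Sequence[float],
    cfg: FilterConfig = FilterConfig(),
) -> Dict[str, bool]:
    """Return which filters fire (True) for one candidate amplified segment."""
    seg_center = median(segment_bins)
    # Threshold derived from the chromosome-wide center plus its dispersion.
    chrom_threshold = median(chromosome_bins) + cfg.deviation_multiplier * pstdev(
        chromosome_bins
    )
    return {
        "central_tendency": seg_center < cfg.min_amp_ratio,
        "confidence": pstdev(segment_bins) > cfg.max_dispersion,
        "central_tendency_plus_deviation": seg_center < chrom_threshold,
    }


def validate_amplification(
    segment_bins: Sequence[float],
    chromosome_bins: Sequence[float],
    cfg: FilterConfig = FilterConfig(),
) -> bool:
    """Verify the annotation only if no filter fires (assumed firing pattern)."""
    return not any(run_filters(segment_bins, chromosome_bins, cfg).values())
```

A deletion would use mirrored comparisons (segment median above a ceiling, or above the chromosome center minus the deviation term); that case is omitted to keep the sketch short.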
[0518] This general method is described with further, optional
details below, with
reference to method 500-1. Referring to method 500-1, the present disclosure
provides a
method for validating a copy number variation in a test subject.
[0519] Subjects and Biological Samples.
[0520] Referring to Block 502-1, the method includes obtaining a
first dataset that
comprises a plurality of bin-level sequence ratios, each respective bin-level
sequence ratio in
the plurality of bin-level sequence ratios corresponding to a respective bin
in a plurality of
bins. Each respective bin in the plurality of bins represents a corresponding
region of a
human reference genome, and each respective bin-level sequence ratio in the
plurality of bin-
level sequence ratios is determined from a sequencing of a plurality of cell-
free nucleic acids
in a first liquid biopsy sample of the test subject and one or more reference
samples. In some
embodiments, the plurality of bin-level sequence ratios comprises 2 or more
bin-level
sequence ratios, 3 or more bin-level sequence ratios, 4 or more bin-level
sequence ratios, 5 or
more bin-level sequence ratios, 6 or more bin-level sequence ratios, 7 or more
bin-level
sequence ratios, 8 or more bin-level sequence ratios, 100 or more bin-level
sequence ratios,
1000 or more bin-level sequence ratios, 1500 or more bin-level sequence
ratios, 2000 or more
bin-level sequence ratios, 2500 or more bin-level sequence ratios, 3000 or
more bin-level
sequence ratios, 3500 or more bin-level sequence ratios, 4000 or more bin-
level sequence
ratios, 4500 or more bin-level sequence ratios, 5000 or more bin-level
sequence ratios, 5500
or more bin-level sequence ratios, 6000 or more bin-level sequence ratios,
6500 or more bin-
level sequence ratios, 7000 or more bin-level sequence ratios, 7500 or more
bin-level
sequence ratios, 8000 or more bin-level sequence ratios, 8500 or more bin-
level sequence
ratios, 9000 or more bin-level sequence ratios, 9500 or more bin-level
sequence ratios,
10,000 or more bin-level sequence ratios, 20,000 or more bin-level sequence
ratios, 50,000 or
more bin-level sequence ratios, or 100,000 or more bin-level sequence ratios.
In some
embodiments, the plurality of bin-level sequence ratios consists of between
100 and 100,000
bin-level sequence ratios.
[0521] In some embodiments, the test subject is a patient in a
clinical trial. Referring to
Block 504-1, in some embodiments, the test subject is a patient with a cancer.
In some such
embodiments, the cancer is a solid tumor cancer. In some embodiments, the
cancer is
Ovarian Cancer, Cervical Cancer, Uveal Melanoma, Colorectal Cancer,
Chromophobe Renal
Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer,
Retinoblastoma,
Biliary Cancer, Adrenal cancer, Neural, Neuroblastoma, Basal Cell Carcinoma,
Brain Cancer,
Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma,
Glioma,
Tumor of Unknown Origin, Kidney Cancer, Gastrointestinal Stromal Tumor,
Medulloblastoma, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell
Lung
Cancer, Thymoma, Low Grade Glioma, Prostate Cancer, Clear Cell Renal Cell
Carcinoma,
Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer,
Head and
Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial
Cancer,
Pancreatic Cancer, Mesothelioma, Esophageal Cancer, Small Cell Lung Cancer,
Her2
Negative Breast Cancer, Solid Tumor, Ovarian Serous Carcinoma, HR+ Breast
Cancer,
Uterine Serous Carcinoma, Endometrial Cancer, Uterine Corpus Endometrial
Carcinoma,
Gastroesophageal Junction Adenocarcinoma, Gallbladder Cancer, Chordoma, or
Papillary
Renal Cell Carcinoma.
[0522] Referring to Block 506-1, in some embodiments, the liquid
biopsy sample is a
liquid biopsy sample. Referring to Block 508-1, in some embodiments, the
liquid biopsy
sample is blood. For example, in some embodiments, the liquid biopsy sample
comprises
blood, whole blood, peripheral blood, plasma, serum, or lymph of the test
subject. In some
alternative embodiments, the liquid biopsy sample is any of the embodiments
described
above (see, Definitions: Liquid Biopsy and/or Example Methods: Figure 2A:
Example
Workflow for Precision Oncology).
[0523] In some embodiments, the method further comprises
obtaining the liquid biopsy
sample from a sample repository or database (e.g., BioIVT, TSC Biosample
Repository,
BioLINCC, etc.). In some embodiments, the liquid biopsy sample is obtained
from the test
subject at least 1 hour, at least 2 hours, at least 12 hours, at least 1 day,
at least 2 days, at least
1 week, at least 1 month, or at least 1 year prior to processing and/or
sequencing the liquid
biopsy sample. In some such embodiments, the liquid biopsy sample is fresh,
frozen, dried,
and/or fixed. In some embodiments, the liquid biopsy sample is processed
and/or sequenced
at least 1 day, at least 2 days, at least 1 week, at least 1 month, or at
least 1 year prior to
obtaining the first dataset. For example, in some embodiments, the sequencing
data for the
liquid biopsy sample are obtained from a data repository (e.g., GenBank, NCBI
Assembly,
DNA DataBank of Japan, European Nucleotide Archive, European Variation
Archive, etc.).
[0524] Concurrent Testing
[0525] Unless stated otherwise, as used herein, the term
"concurrent" as it relates to
assays refers to a period of time between zero and ninety days. In some
embodiments,
concurrent tests using different biological samples from the same subject
(e.g., two or more
of a liquid biopsy sample, cancerous tissue such as a solid tumor sample or
blood sample
for a blood-based cancer, and a non-cancerous sample) are performed
within a period of
time (e.g., the biological samples are collected within the period of time) of
from 0 days to 90
days. In some embodiments, concurrent tests using different biological samples
from the same subject (e.g., two or more of a liquid biopsy sample, cancerous
tissue such as a solid tumor sample or blood sample for a blood-based cancer,
and a non-cancerous sample) are performed within a period of time (e.g., the
biological samples are collected within the period of time) of from 0 days to
60 days. In some embodiments, concurrent tests using different biological
samples from the same subject (e.g., two or more of a liquid biopsy sample,
cancerous tissue such as a solid tumor sample or blood sample for a
blood-based cancer, and a non-cancerous sample) are performed within a period
of time (e.g., the biological samples are collected within the period of time)
of from 0 days to 30 days. In some embodiments, concurrent tests using
different biological samples from the same subject (e.g., two or more of a
liquid biopsy sample, cancerous tissue such as a solid tumor sample or blood
sample for a blood-based cancer, and a non-cancerous sample) are performed
within a period of time (e.g., the biological samples are collected within the
period of time) of from 0 days to 21 days. In some embodiments, concurrent
tests using different biological samples from the same subject (e.g., two or
more of a liquid biopsy sample, cancerous tissue such as a solid tumor sample
or blood sample for a blood-based cancer, and a non-cancerous sample) are
performed within a period of time (e.g., the biological samples are collected
within the period of time) of from 0 days to 14 days. In some embodiments,
concurrent tests using different biological samples from the same subject
(e.g., two or more of a liquid biopsy sample, cancerous tissue such as a solid
tumor sample or blood sample for a blood-based cancer, and a non-cancerous
sample) are performed within a period of time (e.g., the biological samples
are collected within the period of time) of from 0 days to 7 days. In some
embodiments, concurrent tests using different biological samples from the same
subject (e.g., two or more of a liquid biopsy sample, cancerous tissue such as
a solid tumor sample or blood sample for a blood-based cancer, and a
non-cancerous sample) are performed within a period of time (e.g., the
biological samples are collected within the period of time) of from 0 days to
3 days.
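The definition of concurrent testing reduces to a check on the interval between two sample collection dates. A minimal Python sketch, with the hypothetical name `are_concurrent` and a default window of 90 days taken from the broadest range recited above:

```python
from datetime import date


def are_concurrent(collected_a: date, collected_b: date, window_days: int = 90) -> bool:
    """True when two samples were collected no more than window_days apart."""
    if window_days < 0:
        raise ValueError("window_days must be non-negative")
    return abs((collected_a - collected_b).days) <= window_days
```

Passing `window_days=60`, `30`, `21`, `14`, `7`, or `3` corresponds to the narrower embodiments; the order of the two dates does not matter.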
[0526]
[0527] In some embodiments, a liquid biopsy assay may be used
concurrently with a
solid tumor assay to return more comprehensive information about a patient's
variants. For
example, a blood specimen and a solid tumor specimen may be sent to a
laboratory for
evaluation. The solid tumor specimen may be analyzed using a bioinformatics
pipeline to
produce a solid tumor result. A solid tumor assay is described, for instance,
in U.S. Patent
Application No. 16/657,804. The cancer type of the solid tumor may include,
for example,
non-small cell lung cancer, colorectal cancer, or breast cancer. Alterations
identified in the
tumor/matched normal result may include, for example, EGFR+ for non-small cell
lung
cancer; HER2+ for breast cancer; or KRAS G12C for several cancers.
[0528] In some embodiments, the blood specimen may be divided
into a first portion and
a second portion. The first portion of the blood specimen and the solid tumor
specimen may
be analyzed using a bioinformatics pipeline to produce a tumor/matched normal
result. The
second portion of the blood specimen may be analyzed using a bioinformatics
pipeline to
produce a liquid biopsy result. For example, the blood specimen may be
analyzed using at
least an improvement in somatic variant identification, e.g., as described
herein in the section
entitled "Systems and Methods for Improved Validation of Somatic Sequence
Variants"
and/or "Variant Identification." For example, the blood specimen may be
analyzed using an
improvement in focal copy number identification, e.g., as described herein in
the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or
"Copy Number Variation." For example, the blood specimen may be analyzed using
an
improvement in circulating tumor fraction determination, e.g., as described
above in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction."
[0529] Therapies may be identified for further consideration in
response to receiving the
tumor or tumor/matched normal result along with the liquid biopsy result. For
example, if
the results overall indicate that the patient has HER2+ breast cancer,
neratinib may be
identified along with the test results for further consideration by the
ordering clinician.
[0530] The solid tumor or tumor/matched normal assay may be
ordered concurrently;
their results may be delivered concurrently; and they may be analyzed
concurrently.
[0531] In some embodiments, the liquid biopsy sample corresponds
to a matched tumor
sample (e.g., a solid tumor sample obtained from the test subject). For
example, in some
embodiments, the method further comprises obtaining a second dataset that is
determined
from a sequencing of a plurality of cell-free nucleic acids in a matched tumor
sample of the
test subject. In some embodiments, the matched tumor sample is obtained from
the test
subject concurrently with the liquid biopsy sample. In some embodiments, the
matched
tumor sample is obtained from the test subject at a different time point from
the obtaining the
liquid biopsy sample. In some embodiments, the matched tumor sample is any of
the
embodiments described above (see, Example Methods: Figure 2A: Example Workflow
for
Precision Oncology). In some embodiments, the method further comprises
obtaining the
matched tumor sample from a sample repository or database (e.g., BioIVT, TSC
Biosample
Repository, BioLINCC, etc.). In some embodiments, the matched tumor sample is
obtained
from the test subject at least 1 hour, at least 2 hours, at least 12 hours, at
least 1 day, at least 2
days, at least 1 week, at least 1 month, or at least 1 year prior to obtaining
the liquid biopsy
sample. In some such embodiments, the matched tumor sample is fresh, frozen,
dried, and/or
fixed. In some embodiments, the matched tumor sample is processed and/or
sequenced at
least 1 day, at least 2 days, at least 1 week, at least 1 month, or at least 1
year prior to
obtaining the second dataset. For example, in some such embodiments, the
sequencing data
for the plurality of nucleic acids in the matched tumor sample are obtained
from a data
repository (e.g., GenBank, NCBI Assembly, DNA DataBank of Japan, European
Nucleotide
Archive, European Variation Archive, etc.).
[0532] In some embodiments, the one or more reference samples are
non-cancerous
samples. In some embodiments, the one or more reference samples is a matched
normal
sample (e.g., a normal sample obtained from the test subject). In some
embodiments, the
matched normal sample is obtained from the test subject concurrently with the
liquid biopsy
sample. In some embodiments, the matched normal sample is obtained from the
test subject
at a different time point from the obtaining the liquid biopsy sample. In some
embodiments,
the matched normal sample is any of the embodiments described above (see,
Example
Methods: Figure 2A: Example Workflow for Precision Oncology).
[0533] In some alternative embodiments, the one or more reference
samples comprise a
pool of normal (e.g., non-cancerous) samples obtained from a plurality of
control subjects
(e.g., healthy subjects). In some such embodiments, the method further
comprises obtaining
the one or more reference samples from a sample repository or database (e.g.,
BioIVT, TSC
Biosample Repository, BioLINCC, etc.). In some embodiments, the one or more
reference
samples include liquid biopsy samples comprising a plurality of cell-free
nucleic acids and/or
solid tissue samples comprising a plurality of nucleic acids. In some
embodiments, the one
or more reference samples are processed and/or sequenced at least 1 day, at
least 2 days, at
least 1 week, at least 1 month, or at least 1 year prior to obtaining the
first dataset. For
example, in some such embodiments, the sequencing data for the one or more
reference
samples are obtained from a data repository (e.g., GenBank, NCBI Assembly, DNA
DataBank of Japan, European Nucleotide Archive, European Variation Archive,
etc.).
[0534] Referring to Block 510-1, in some embodiments, the
cell-free nucleic acids (e.g.,
in the first liquid biopsy sample of the test subject and the one or more
reference samples)
comprise circulating tumor DNA (ctDNA). In some embodiments, the method
further
comprises isolating the plurality of cell-free nucleic acids from the liquid
biopsy sample of
the test subject prior to the sequencing. In some embodiments, the sequencing
is multiplexed
sequencing. In some embodiments, the sequencing is short-read sequencing or
long-read
sequencing.
[0535] In some embodiments, the sequencing is a panel-enriched
sequencing reaction. In
some such embodiments, the sequencing reaction is performed at a read depth of
100X or
more, 250X or more, 500X or more, 1000X or more, 2500X or more, 5000X or more,
10,000X or more, 20,000X or more, or 30,000X or more. In some embodiments, the
sequencing panel comprises 1 or more, 10 or more, 20 or more, 50 or more, 100
or more, 150
or more, 200 or more, 300 or more, 500 or more, or 1000 or more genes. In some
embodiments, the sequencing panel comprises one or more genes listed in Table
1. In some
embodiments, the sequencing panel includes at least 2, 3, 4, 5, 10, 15, 20,
25, 30, 40, 50, 60,
70, 80, 90, 100, or all of the genes listed in Table 1. In some embodiments,
the sequencing
panel comprises one or more genes selected from the group consisting of MET,
EGFR,
ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the
sequencing panel includes at least 2, 3, 4, 5, 6, 7, or all 8 of MET, EGFR,
ERBB2, CD274,
CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing reaction is a
whole exome sequencing reaction.
[0536] In some embodiments, the sequencing reaction is a whole
genome sequencing
reaction. In some such embodiments, the sequencing reaction is performed at an
average
read depth of 10X or more, 15X or more, 20X or more, 25X or more, 30X or more,
40X or
more, or 50X or more. In some embodiments, certain regions of the genome are
blacklisted
from the analysis of a whole genome sequencing reaction, e.g., centromeres,
telomeres,
highly repeated sequences, and the like, for which accurate sequencing results
are difficult to
obtain.
[0537] In some embodiments, the obtaining the first dataset
further comprises aligning a
plurality of sequence reads, obtained from a sequencing of the plurality of
cell-free nucleic
acids in the first liquid biopsy sample of the test subject, to the human
reference genome.
[0538] In some embodiments, on average, each respective bin in
the plurality of bins has
two or more, three or more, five or more, ten or more, fifteen or more, twenty
or more, fifty
or more, one hundred or more, five hundred or more, one thousand or more, ten
thousand or
more, or 100,000 or more sequence reads in the plurality of sequence reads
mapping onto the
portion of the reference genome corresponding to the respective bin, where
each such
sequence read uniquely represents a different molecule in the plurality of
cell-free nucleic
acids in the liquid biopsy sample. For instance, in some embodiments, the
plurality of
cell-free nucleic acids in the liquid biopsy sample are sequenced with a sequencing
methodology
that makes use of unique molecular identifiers (UMIs) for each cell-free
nucleic acid in the
liquid biopsy sample and each sequence read in the plurality of sequence reads
has a unique
UMI. In such embodiments, sequence reads with the same UMI are bagged
(collapsed) into a
single sequence read bearing the UMI.
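The UMI collapsing step described above can be illustrated with a minimal Python sketch. It assumes each read has already been reduced to a (UMI, bin index) pair; the function name `collapse_by_umi` and the input layout are hypothetical and are not part of the disclosed pipeline.

```python
from collections import defaultdict


def collapse_by_umi(reads):
    """Count unique molecules per bin after collapsing duplicate UMIs.

    `reads` is an iterable of (umi, bin_index) tuples. Both fields are
    hypothetical stand-ins for values an aligner or UMI tool would report.
    """
    seen = set()
    counts = defaultdict(int)
    for umi, bin_index in reads:
        key = (umi, bin_index)
        if key in seen:
            # Same UMI in the same bin: treat as a duplicate of a molecule
            # that has already been counted.
            continue
        seen.add(key)
        counts[bin_index] += 1
    return dict(counts)


reads = [("AACG", 0), ("AACG", 0), ("TTGA", 0), ("CCAT", 1)]
unique_counts = collapse_by_umi(reads)
```

Keying on the UMI together with the bin is a simplification chosen for this sketch; a production tool would typically also consider mapping position and UMI sequencing errors.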
[0539] In some embodiments, the sequencing of the plurality of
cell-free nucleic acids in
the first liquid biopsy sample of the test subject is performed at a central
laboratory or
sequencing facility. In some such embodiments, the obtaining the first dataset
comprises
accessing one or more sequencing datasets and/or one or more auxiliary files,
in electronic
form, through a cloud-based interface. For example, a first dataset can be
obtained by
performing a bioinformatics pipeline using tumor BAM files, normal BAM files,
a human
reference genome file, a target region BED file, a list of mappable regions of
the genome,
and/or a blacklist of recurrent problematic areas of the genome.
[0540] In some embodiments, the obtaining the first dataset
comprises accessing the first
dataset, in electronic form, through a cloud-based interface. For example, a
first dataset can
comprise one or more outputs from a bioinformatics pipeline (e.g., CNVkit
outputs ".cns"
and/or ".cnr").
[0541] Additional methods and embodiments for sequencing nucleic
acids, including
aligning and preprocessing sequence reads, are described in further detail
above (see,
Example Methods: Figure 2A: Example Workflow for Precision Oncology).
Additional
methods and embodiments for performing the presently disclosed methods at a
distributed
diagnostic and clinical environment are described in detail above (see,
Example Methods:
Figure 2B: Distributed Diagnostic and Clinical Environment). Other embodiments
and/or
any combinations, substitutions, additions or deletions thereof are possible,
as will be
apparent to one skilled in the art.
[0542] Bins and Sequence Ratios.
[0543] In some embodiments, the methods and systems described
herein bin sequences
(e.g., sequence reads) across one or more regions of a genome to evaluate the
copy number at
one or more locations of the genome in a tissue of a subject. In this fashion,
a count of the
number of sequences generated for a test sample that map to the region of the
genome
corresponding to the bin, or a measure of depth of coverage across the region
of the genome
corresponding to the bin, is determined. All or a portion of these bin values
(e.g., copy
number or count number) can then be compared to reference values for the
same
corresponding bins, to evaluate how the genomic copy number of the genome
corresponding
to the test sample differs from that of a reference, which can be a single
sample or an average
of a plurality of samples. For instance, where the reference bin values
represent one or more
non-cancerous reference samples, a comparison of bin values for the test
sample to these
reference values can reveal copy number differences having biological
significance for the
diagnosis and/or treatment of cancer in the test subject.
[0544] Generally, each bin in a plurality of bins corresponds to
a contiguous and non-
overlapping region, of any size, of a reference genome (e.g., a reference
human genome or
equivalent construct). For example, in some embodiments, each bin in a
plurality of bins
(e.g., spanning all or a portion of a reference genome) is at least 50 base
pairs (bp), at least
100 bp, at least 150 bp, at least 200 bp, at least 300 bp, at least 400 bp, at
least 500 bp, at least
750 bp, at least 1 kilobase pairs (kb), at least 2.5 kb, at least 5 kb, at
least 10 kb, at least 25
kb, or more. In some embodiments, each bin in a plurality of bins (e.g.,
spanning all or a
portion of a reference genome) is less than 250 kb, less than 100 kb, less
than 50 kb, less than
25 kb, less than 10 kb, less than 5 kb, less than 2.5 kb, or less. In some
embodiments, the
average bin size of each bin in the plurality of bins is from 50 bp to 25 kb,
from 50 bp to 5
kb, from 50 bp to 1 kb, from 50 bp to 500 bp, or within any other range
starting no lower than
25 bp and ending no higher than 350 kb.
[0545] When targeted-panel sequencing is used, generally bins
encompassing on-target
reads (those sequence reads corresponding to fragments bound by an enrichment
probe) will
have smaller sizes than bins encompassing off-target reads (those sequence
reads
corresponding to fragments not bound by an enrichment probe). Accordingly, in
some
embodiments, the size of each bin depends on whether it is an on-target bin or
an off-target
bin. In some embodiments, bin size also varies bin to bin, e.g., such that the
number of reads
per bin is similar (e.g., within 25% or less of each other). For example,
CNVkit
automatically adjusts each bin's size so that the number of reads per bin is
roughly consistent.
[0546] In some embodiments, on-target bins have an average size
of about 100 bp. In
some embodiments, on-target bins have an average size of from 25 to 500 bp. In
some
embodiments, on-target bins have an average size of from 25 to 250 bp. In some
embodiments, on-target bins have an average size of from 50 to 250 bp. In some
embodiments, on-target bins have an average size of from 50 to 150 bp. A
smaller size could
be used if a higher resolution (for segmentation and subsequent CNV calling)
is desired, but
the bins may be noisier since they would contain fewer reads. Thus, the
optimal bin size may
depend on sequencing depth and sensitivity requirements.
[0547] In some embodiments, off-target bins have an average size
of at least 1 kb. In
some embodiments, off-target bins have an average size of at least 5 kb. In
some
embodiments, off-target bins have an average size of from 5 kb to 350 kb. In
some
embodiments, off-target bins have an average size of from 10 kb to 250 kb. The
size of
off-target bins may depend on both the on-target and off-target sequencing depths
of a
sequencing reaction.
[0548] Generally, each bin has a defined start nucleotide and a
defined ending nucleotide
in the reference genome for the species of subject. For example, where the
test species is a
human, each bin comprises a start and end position that indicates its location
in the human
reference genome. In some embodiments, each bin corresponds to (i) a first
subset of bins
that map to the same position of the human reference genome as a locus in a
targeted
sequencing panel (e.g., target bins), or (ii) a second subset of bins that map
to an off-target
portion of a reference genome that is not represented in the targeted
sequencing panel (e.g.,
off-target bins). In some embodiments, each bin in the first subset of bins
represents a
different gene, open reading frame, or genetic feature (e.g., promoter of a
gene, enhancer of a
gene, repressor of a gene) in a reference genome.
[0549] In some embodiments, each bin in a plurality of bins is
approximately the same
size (e.g., spans about the same number of base pairs in the reference genome
as every other
bin). For example, in some embodiments, the bin size is specified by a user, such
that the
number of bins is dependent upon the size of the region over which the
plurality of bins span.
In some embodiments, the number of bins spanning a region is specified by a
user, such that
the size of each respective bin is dependent upon the size of the region over
which the
plurality of bins span.
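As a minimal sketch of the two user-specified options just described (a fixed bin size, or a fixed number of bins), the following Python function splits a half-open interval into contiguous, non-overlapping bins. The function name `make_bins` and its parameters are hypothetical and are used only for illustration.

```python
def make_bins(start, end, bin_size=None, n_bins=None):
    """Split the half-open interval [start, end) into contiguous bins.

    Exactly one of `bin_size` (bp per bin) or `n_bins` must be given.
    Returns a list of (bin_start, bin_end) tuples.
    """
    if (bin_size is None) == (n_bins is None):
        raise ValueError("specify exactly one of bin_size or n_bins")
    if bin_size is not None:
        # Fixed bin size: the number of bins follows from the region size,
        # and the last bin may be shorter than the others.
        edges = list(range(start, end, bin_size)) + [end]
    else:
        # Fixed number of bins: the bin size follows from the region size.
        length = end - start
        edges = [start + (i * length) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


by_size = make_bins(0, 1000, bin_size=300)
by_count = make_bins(0, 1000, n_bins=4)
```

In the first call the region length determines how many bins result; in the second, it determines how large each bin is.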
[0550] In some embodiments, each bin in a plurality of bins is
not the same size. For
example, in some embodiments, each bin size is determined based on the number
of
sequences falling within the bins in one or more reference samples, e.g., to
normalize for an
expected number of sequence reads mapping to each bin. In some embodiments,
where
panel-enriched sequencing is used, bins in a first subset of bins spanning
regions of the
genome corresponding to the enrichment panel (e.g., bins corresponding to
on-target reads of
the sequencing reaction) are smaller than a second subset of bins spanning
regions of the
genome that do not correspond to the enrichment panel (e.g., bins
corresponding to off-target
reads of the sequencing reaction).
[0551] In some embodiments, the plurality of bins covers at least
1 Mb of a reference
genome for the species of the subject (e.g., the human genome). In some
embodiments, the
plurality of bins covers at least 2.5 Mb, at least 5 Mb, at least 10 Mb, at
least 25 Mb, at least
50 Mb, at least 100 Mb, at least 250 Mb, at least 500 Mb, at least 1000 Mb, at
least 2000 Mb,
at least 3000 Mb, or more of the reference genome. In some embodiments, the
plurality of
bins covers at least 25% of a reference genome for the species of the subject
(e.g., the human
genome). In some embodiments, the plurality of bins covers at least 50%, at
least 75%, at
least 90%, at least 95%, at least 98%, at least 99%, or more of the reference
genome.
[0552] In some embodiments, a plurality of sequence reads are
obtained from a
sequencing of nucleic acids (e.g., in the liquid biopsy sample and/or in the
one or more
reference samples), and the obtained sequences, e.g., collapsed
(de-duplicated) sequence
reads, are assigned to respective bins corresponding to the region of the
genome that the
sequence reads map to. In some embodiments, the sequencing data is
pre-processed to
correct biases or errors using one or more methods such as normalization,
correction of GC
biases, correction of biases due to PCR over-amplification, etc., prior to
binning.
[0553] In some embodiments, the bin values are processed to correct
for biases or errors, e.g.,
by normalization, standardization, etc. For instance, in some embodiments, a
median bin
value across a plurality of bin values for a sample is obtained, and each
respective bin value
in the plurality of bin values is divided by this median value, assuring that
the bin values for
the respective training subject are centered on a known value (e.g., on zero):
\[ bv_i^{*} = \frac{bv_i}{\operatorname{median}(bv_i)} \]
where bvi = the bin value of bin i in the plurality of bin values for the
sample, bvi* = the
normalized bin value of bin i in the plurality of bin values for the sample
upon this first
normalization, and median(bvi) = the median bin value across the plurality of
unnormalized
bin values for the sample. In some embodiments, rather than using the median
bin value
across the corresponding plurality of bin values, some other measure of
central tendency is
used, such as an arithmetic mean, weighted mean, midrange, midhinge, trimean,
Winsorized
mean, mean, or mode across the plurality of bin values of the sample.
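The first normalization can be sketched in a few lines of Python. This is only an illustration under the assumption that the raw bin values for one sample are held in a list; the function name `median_normalize` is hypothetical. After the division, the sample's median bin value equals 1.

```python
from statistics import median


def median_normalize(bin_values):
    """Divide each raw bin value bv_i by the sample-wide median bin value."""
    m = median(bin_values)
    if m == 0:
        raise ValueError("median bin value is zero; cannot normalize")
    return [bv / m for bv in bin_values]


normalized = median_normalize([80, 100, 120, 200, 90])
```

Swapping `median` for another measure of central tendency (for example `statistics.mean`) gives the alternative embodiments mentioned above.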
[0554] In some embodiments, each respective normalized bin count
bvi* is further
normalized by the median normalized value for the respective bin across a
plurality of
samples k:
\[ bv_i^{**} = \log\left( \frac{bv_i^{*}}{\operatorname{median}(bv_{ik}^{*})} \right) \]
where bvi* = the normalized bin value of bin i in the first plurality of bin
values for the
sample from the first normalization procedure described above, bvi** = the
normalized bin
value of bin i for the respective sample upon this second normalization
described here, and
median(bvik*) = the median normalized bin value bvi* for bin i across the
plurality of
samples (e.g., k reference samples).
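A minimal Python sketch of this second normalization follows. It assumes the step takes the logarithm of the ratio between a sample's normalized bin value and the per-bin median across the k reference samples, and it assumes a natural logarithm because the base is not stated. The function name `reference_log_ratio` and the list-of-lists layout are hypothetical.

```python
import math
from statistics import median


def reference_log_ratio(sample_norm, reference_norm):
    """Log-ratio of each bv_i* to the per-bin median over reference samples.

    sample_norm:    list of bv_i* values for one sample (one per bin)
    reference_norm: list of k lists, each holding bv_i* for one reference
    """
    # zip(*...) regroups the k per-sample lists into one tuple per bin.
    per_bin_median = [median(column) for column in zip(*reference_norm)]
    return [math.log(s / r) for s, r in zip(sample_norm, per_bin_median)]


references = [[1.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
log_ratios = reference_log_ratio([1.0, 2.0], references)
```

A bin whose value matches the reference median maps to 0, and a bin with twice the reference median maps to log(2).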
[0555] In some embodiments, the un-normalized bin values (counts)
bvi are GC
normalized. In some embodiments, the normalized bin values bvi* are GC
normalized. In
some embodiments, the normalized bin values bvi** are GC normalized. In such
embodiments, GC counts of respective sequence reads in the plurality of
sequence reads of
each sample in the plurality of reference samples are binned. A curve
describing the
conditional mean fragment count per GC value is estimated by such binning
(Yoon et al.,
2009, Genome Research 19(9):1586), or, alternatively, by assuming smoothness
(Boeva et
al., 2011, Bioinformatics 27(2), p. 268; Miller et al., 2011, PLoS ONE 6(1), p.
e16327). The
resulting GC curve determines a predicted count for each bin based on the
bin's GC. These
predictions can be used directly to normalize the original signal (e.g., bvi*,
bvi, or bvi**).
As a non-limiting example, in the case of binning and direct normalization,
for each
respective G+C percentage in the set {0%, 1%, 2%, 3%, ..., 100%}, the value
mGC, the
median value of bvi** of all bins across the plurality of training subjects
having this respective
G+C percentage, is determined and subtracted from the normalized bin values
bvi** of those
bins having the respective G+C percentage to form GC normalized bin values
bvi***. In some
embodiments, rather than using the median value of bvi** of all bins across the
first plurality
of subjects having this respective G+C percentage, some other form of measure
of central
tendency of bvi** of all bins across the plurality of training subjects having
this respective
G+C percentage is used, such as an arithmetic mean, weighted mean, midrange,
midhinge,
trimean, Winsorized mean, mean, or mode. In some embodiments, a correction
curve is
determined using a locally weighted scatterplot smoothing model (e.g., LOESS,
LOWESS,
etc.). See, for example, Benjamini and Speed, 2012, Nucleic Acids Research
40(10): e72;
and Alkan et al., 2009, Nat Genet 41:1061-7. For example, in some embodiments,
the GC
bias curve is determined by LOESS regression of count by GC (e.g., using the
'loess' R
package) on a random sampling (or exhaustive sampling) of bins from the
plurality of
training subjects. In some embodiments, the GC bias curve is determined by
LOESS
regression of count by GC (e.g., using the 'loess' R package), or some other
form of curve
fitting, on a random sampling of bins from a cohort of reference samples that
were sequenced
using the same sequencing techniques used to sequence the test sample.
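The binning-and-direct-normalization variant of GC correction described above can be sketched as follows. This is an illustrative Python sketch only: it assumes each bin carries an integer G+C percentage, computes the per-percentage median from pooled reference bins, and subtracts it from a sample's bin values. The function names `gc_medians` and `gc_normalize` are hypothetical, and the LOESS-based variant is not shown.

```python
from collections import defaultdict
from statistics import median


def gc_medians(ref_values, ref_gc_percent):
    """Median normalized bin value for each G+C percentage in the reference."""
    groups = defaultdict(list)
    for value, gc in zip(ref_values, ref_gc_percent):
        groups[gc].append(value)
    return {gc: median(vals) for gc, vals in groups.items()}


def gc_normalize(values, gc_percent, m_gc):
    """Subtract the reference median for each bin's G+C percentage."""
    # Bins whose G+C percentage never occurs in the reference are left
    # unchanged; that fallback is an assumption of this sketch.
    return [v - m_gc.get(gc, 0.0) for v, gc in zip(values, gc_percent)]


m_gc = gc_medians([0.1, 0.3, -0.2, 0.0], [40, 40, 60, 60])
corrected = gc_normalize([0.5, 0.1], [40, 60], m_gc)
```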
[0556] In some embodiments, the bin counts are normalized using
principal component
analysis (PCA) to remove higher-order artifacts for a population-based (e.g.,
healthy subjects)
correction. See, for example, Price et al., 2006, Nat Genet 38, pp. 904-909;
Leek and Storey,
2007, PLoS Genet 3, pp. 1724-1735; and Zhao et al., 2015, Clinical Chemistry
61(4), pp.
608-616. Such normalization can be in addition to or instead of any of the
above-identified
normalization techniques. In some such embodiments, to train the PCA
normalization, a data
matrix comprising LOESS normalized bin counts bvi*** from young healthy
subjects in the
plurality of training subjects (or another cohort that was sequenced in the
same manner as the
plurality of training subjects) is used and the data matrix is transformed
into principal
component space thereby obtaining the top N number of principal components
across the
training set. In some embodiments, the top 2, the top 3, the top 4, the top 5,
the top 6, the top
7, the top 8, the top 9 or the top 10 such principal components are used to
build a linear
regression model:
\[ LM(PC_1, \ldots, PC_N) \]
Then, each bin value bvi*** of each respective bin of each respective sample in the
plurality of
reference samples is fit to this linear model to form a corresponding
PCA-normalized bin
count bvi****:
\[ bv_i^{****} = bv_i^{***} - \mathrm{fit}_{LM(PC_1, \ldots, PC_N)} \]
In other words, for each respective sample in the plurality of reference
samples, a linear
regression model is fit between its normalized bin counts {bv1***, ..., bvI***}
and the top
principal components from the training set. The residuals of this model serve
as final
normalized bin values {bv1****, ..., bvI****} for the respective sample.
Intuitively, the top
Intuitively, the top
principal components represent noise commonly seen in reference samples, and
therefore
removing such noise (in the form of the top principal components derived from
the healthy
cohort) from the bin values bvi***can effectively improve normalization. See
Zhao et al.,
2015, Clinical Chemistry 61(4), pp. 608-616 for further disclosure on PCA
normalization of
sequence reads using a healthy population. Regarding the above normalization,
it will be
appreciated that all variables are standardized (e.g., by subtracting their
means and dividing
by their standard deviations) when necessary.
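The PCA-based normalization can be illustrated with a short NumPy sketch. It assumes that the principal components are taken as the top right singular vectors of the mean-centered training matrix and that each sample's bin values are regressed on those components plus an intercept, with the residuals kept. The function name `pca_residual_normalize` and the array layout are hypothetical.

```python
import numpy as np


def pca_residual_normalize(train, samples, n_components=5):
    """Remove the top training-set principal components from each sample.

    train:   (n_train_samples, n_bins) array of GC-normalized bin values
    samples: (n_samples, n_bins) array of bin values to be normalized
    Returns an array shaped like `samples` holding regression residuals.
    """
    train = np.asarray(train, dtype=float)
    samples = np.asarray(samples, dtype=float)
    centered = train - train.mean(axis=0)
    # Rows of vt are principal axes in bin space, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = vt[:n_components].T                                # (n_bins, N)
    design = np.column_stack([np.ones(pcs.shape[0]), pcs])   # intercept + PCs
    # One least-squares fit per sample (the columns of samples.T).
    coef, *_ = np.linalg.lstsq(design, samples.T, rcond=None)
    return (samples.T - design @ coef).T


pattern = np.linspace(-1.0, 1.0, 8)               # shared artifact across bins
train = np.outer([-1.0, 0.0, 1.0, 2.0], pattern)  # artifact at varying strength
sample = 3.0 * pattern + 0.5                      # artifact plus an offset
residual = pca_residual_normalize(train, sample[None, :], n_components=1)
```

In this toy case the sample consists entirely of the shared artifact and a constant offset, so the residuals are numerically zero; real copy number signal that is not explained by the top components would remain in the residuals.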
[0557] It will be appreciated that any form of representation of
the number of nucleic
sequence reads mapping to a given bin i can constitute a "bin value" and that
such a bin value
can be in un-normalized form (e.g., bvi) or normalized form (e.g., bvi*, bvi**,
bvi***, bvi****,
etc.).
[0558] After binning sequences, a bin count or read depth is
determined for each bin. For
example, in some embodiments, the read depth is the average number of times
that the
corresponding region of the human reference genome spanned by the respective
bin is
represented in the plurality of sequence reads obtained from the sequencing
reaction.
[0559] In some embodiments, the read depths for each respective
bin, in the plurality of
bins, are determined by binning sequence reads obtained for the plurality of
cell-free nucleic
acids in a panel-enriched sequencing reaction. In some embodiments, the panel-
enriched
sequencing reaction is an ultra-high depth sequencing, where each locus in the
plurality of
loci in the targeted sequencing panel is sequenced at an average coverage of
at least 1000x, at
least 2500x, or at least 5000x. In some embodiments, read depths are obtained
from targeted
captured sequencing reads (e.g., target bins) and non-specifically captured
off-target reads
(e.g., off-target bins).
[0560] The bin values (e.g., bin counts or read depth) generated
from the binning
operation are then compared to bin values for one or more reference samples.
Referring to
Block 512-1, in some embodiments, each respective bin-level sequence ratio in
the plurality
of bin-level sequence ratios is derived from a comparison of (a) a read depth
for the
corresponding bin in the plurality of bins, determined from a sequencing of a
plurality of cell-
free nucleic acids in a liquid biopsy sample of the test subject, to (b) a
measure of central
tendency of read depths for the corresponding bin, across one or more
reference samples (or
simply the read depth of a single reference sample in the case where only one
reference
sample is used). Thus, in some such embodiments, a sequence ratio for a
respective bin is a
comparison of the read depths between the test sample and one or more
reference samples,
e.g., a pool of reference samples. In some embodiments, the one or more
reference samples are
a single sample, two or more samples, five or more samples, or 100 or more
samples.
[0561] For example, in some embodiments, the (a) read depth and
the (b) read depths are
determined by binning sequence reads from one or more panel-enriched
sequencing
reactions, and the plurality of bin-level sequence ratios comprises (i) a
first sub-plurality of
bin-level sequence ratios corresponding to bins that map to the same position
of the human
reference genome as an enriched locus in the panel-enriched sequencing
reaction; and (ii) a
second sub-plurality of bin-level sequence ratios corresponding to bins that
do not map to the
same position of the human reference genome as any enriched locus in the panel-
enriched
sequencing reaction. For example, in some such embodiments, the bin-level
sequence ratios
for target bins and the bin-level sequence ratios for off-target bins are
separately determined.
[0562] In some embodiments, the (a) read depth and the (b) read
depths are
log2-transformed (e.g., log2 read depths).
[0563] In some embodiments the ratio of the (a) read depth (X)
and the (b) measure of
central tendency of the read depths (Y) is taken as X/Y, Y/X, logx(X/Y),
logN(Y/X), X'/Y,
Y/X', logx(X7Y), or logN(Y/X'), X/Y', Y'/X, logN(Y'/X) , X'/Y',
Y'/X',
logx(X7Y1), or logiv(Y'/X), where N is any real number greater than 1 and
where example
mathematical transformations of X and Y include, but are not limited to,
raising X or Y to a
power Z, multiplying X or Y by a constant Q, where Z and Q are any real
numbers, and/or
taking an M based logarithm of X and/or Y, where M is a real number greater
than 1.
[0564] In some embodiments, the (a) read depth and the (b) read
depths are centered and
corrected. In some such embodiments, the (a) read depth and the (b) read
depths are median-
centered. In some embodiments, the correcting comprises correcting for bias
(e.g., GC
content, genome sequence repetitiveness, target size and/or spacing). For
example, in some
embodiments, the method further comprises, for each sample, centering and
correcting the
plurality of read depths corresponding to the plurality of bins, across all
target and off-target
bins in the sample.
[0565] In some embodiments, the (b) measure of central tendency
of read depths for the
corresponding bin, across the one or more reference samples, is an arithmetic
mean, a
weighted mean, a midrange, a midhinge, a trimean, a Winsorized mean, a mean,
a median, or
a mode. In some embodiments, the (b) measure of central tendency of read
depths for the
corresponding bin, across the one or more reference samples, is Tukey's
biweight location.
[0566] In some embodiments, the method further comprises
determining the spread of the
(b) read depths for the corresponding bin, across the one or more reference
samples. In some
such embodiments, the spread is a measure of dispersion including, but not
limited to, a
range, a standard deviation, a standard error, and/or a confidence interval.
In some
embodiments, the spread is a midvariance. For additional background on these
statistical
methods, see, for example, Lax, J Am Stat Assoc, 80, 736-741 (1985), and
Randal, Comput
Stat Data An, 52, 5014-5021 (2008), each of which is hereby incorporated
herein by
reference in its entirety.
[0567] Referring to Block 514-1, in some embodiments, each bin-
level sequence ratio in
the plurality of bin-level sequence ratios is a copy ratio.
[0568] For example, in some embodiments, the log2 read depth
of the corresponding bin in the one or more reference samples (e.g., the
reference pool) is subtracted from the centered and corrected log2 read depth of
each bin in the test sample (e.g., the liquid biopsy sample). This
generates a log2 copy ratio between the test sample and the one or more
reference samples.
Then, in some embodiments, the copy ratio of a bin can be defined as:
\[ \log_2(\text{copy ratio}) = \log_2(\text{test}) - \log_2(\text{ref}) \]
where log2(test) (e.g., the test log2 read depth) is the median-centered and
corrected
log2-transformed read depth, for the liquid biopsy sample, for the respective bin,
and
log2(ref) (e.g., the reference log2 read depth) is determined by calculating
the weighted
average of median-centered and corrected log2-transformed read depths, for
each reference
sample in the plurality of reference samples, for the respective bin.
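The copy ratio definition above can be sketched in Python as follows. The sketch assumes the test and reference read depths have already been log2-transformed, median-centered, and corrected; the function name `log2_copy_ratio`, the list-of-lists layout, and the default of equal weights are hypothetical choices for illustration.

```python
def log2_copy_ratio(test_log2, ref_log2_by_sample, weights=None):
    """Per-bin log2 copy ratio: log2(test) minus weighted-average log2(ref).

    test_log2:          list of log2 read depths for the test sample
    ref_log2_by_sample: list of lists, one list of log2 depths per reference
    weights:            optional per-reference weights (equal if omitted)
    """
    if weights is None:
        weights = [1.0] * len(ref_log2_by_sample)
    total = sum(weights)
    ratios = []
    for i, test_depth in enumerate(test_log2):
        ref = sum(w * s[i] for w, s in zip(weights, ref_log2_by_sample)) / total
        ratios.append(test_depth - ref)
    return ratios


ratios = log2_copy_ratio([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])
fold_changes = [2 ** r for r in ratios]  # back to linear scale
```

A log2 copy ratio of 1.0 corresponds to a two-fold higher read depth in the test sample than in the reference pool for that bin.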
[0569] In some embodiments, the one or more reference samples
includes one or more
test samples comprising less than a threshold number of copy number
variations. In some
embodiments, the one or more reference samples includes one or more test
samples
comprising one or more copy number variations, where each copy number
variation occurs
less than a threshold number of times in the one or more of test samples. In
some
embodiments, the threshold number of copy number variations is 1, 2, 3, 4, 5,
6, 7, 8, 9, 10 or
more than 10. In some embodiments, the threshold number of occurrences for
each of the
one or more copy number variations is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more
than 10.
[0570] In some alternative embodiments, the one or more reference
samples includes one
or more process-matched normal samples. In some such embodiments, the one or
more
process-matched normal samples are not pooled (e.g., the read depths are not
averaged across
the one or more matched normal samples), and each test sample is normalized
against its
process-matched normal sample.
[0571] In some embodiments, no reference samples are obtained,
and each test sample is
normalized using one or more fixed values for normalization (e.g., a specified
log2 depth
correction value for each bin in the tumor sample). For example, in some
embodiments, a
fixed value for log2 depth correction is a neutral copy number (e.g., log2
1.0).
[0572] In some embodiments, the method further comprises removing
(e.g., filtering),
from the plurality of bins, each bin that fails to satisfy one or more
filtering criteria.
[0573] In some embodiments, the one or more filtering criteria
comprises a threshold
reference log2 read depth. For example, in some embodiments, each bin that has
a reference
log2 read depth below a threshold value is removed from the plurality of bins.
In some such
embodiments, the threshold reference log2 read depth is less than 5, less than
1, less than 0,
less than -1, less than -2, less than -3, less than -4, less than -5, less
than -6, less than -7, less
than -8, less than -9, or less than -10. In some embodiments, the threshold
reference log2 read
depth is between 0 and -10.
[0574] In some embodiments, the one or more filtering criteria
comprises a threshold test
log2 read depth. For example, in some embodiments, each bin that has a test
log2 read depth
below a threshold value is removed from the plurality of bins. In some such
embodiments,
the threshold test log2 read depth is less than 10, less than 5, less than 1,
less than 0, less than
-1, less than -2, less than -3, less than -4, less than -5, less than -6, less
than -7, less than -8,
less than -9, or less than -10. In some embodiments, the threshold test log2
read depth is
between 5 and -5.
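As a minimal sketch of the two depth-threshold criteria above, the following Python snippet removes each bin whose reference or test log2 read depth falls below a cut-off. The dictionary keys, the function name, and the cut-off of -5 are illustrative assumptions chosen from within the recited ranges, not values prescribed by the disclosure.

```python
def filter_bins_by_depth(bins, min_ref_log2=-5.0, min_test_log2=-5.0):
    """Keep bins whose reference and test log2 read depths both meet the cut-offs.

    The cut-off values are hypothetical examples.
    """
    return [
        b for b in bins
        if b["ref_log2"] >= min_ref_log2 and b["test_log2"] >= min_test_log2
    ]


bins = [
    {"name": "bin1", "ref_log2": -0.2, "test_log2": 0.1},
    {"name": "bin2", "ref_log2": -6.5, "test_log2": 0.0},  # low reference depth
    {"name": "bin3", "ref_log2": 0.3, "test_log2": -7.0},  # low test depth
]
kept = filter_bins_by_depth(bins)
print([b["name"] for b in kept])
```

Only the first bin survives in this example, because each of the other two bins fails one of the two cut-offs.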
[0575] In some embodiments, the one or more filtering criteria
comprises a proximity of
a test log2 read depth to a blacklist value. For example, in some embodiments,
each bin that
has a test log2 read depth that is within a specified range around a blacklist
value is removed
from the plurality of bins. In some embodiments, the blacklist value is 0, and
the specified
range is +/- 1 or less (e.g., each bin that has a test log2 read depth between
-1 and 1 is
removed from the plurality of bins). In some embodiments, the blacklist value
is 0, and the
specified range is +/- 0.9 or less, +/- 0.8 or less, +/- 0.7 or less, +/- 0.6
or less, +/- 0.5 or less,
+/- 0.4 or less, +/- 0.3 or less, +/- 0.2 or less, or +/- 0.1 or less. In some
embodiments, the
specified range is greater than +/- 1.
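The blacklist criterion can be sketched in a few lines of Python. The function name is hypothetical, and treating the edges of the range as part of the removed interval is an assumption, since the text does not say whether the bounds are inclusive.

```python
def remove_near_blacklist(depths, blacklist_value=0.0, half_width=1.0):
    """Drop test log2 read depths lying within +/- half_width of the blacklist value.

    Boundary values are treated as inside the range (an assumption).
    """
    return [d for d in depths if abs(d - blacklist_value) > half_width]


test_depths = [-2.0, -0.5, 0.0, 0.9, 1.5]
print(remove_near_blacklist(test_depths))
```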
[0576] In some embodiments, the one or more filtering criteria
comprises a distance of a
test log2 read depth from a whitelist value. For example, in some such
embodiments, each
bin that has a test log2 read depth that is outside of a specified range
around a whitelist value
is removed from the plurality of bins. In some embodiments, the whitelist
value is a measure
of central tendency of the test log2 read depths for a subset of bins in the
plurality of bins.
The measure of central tendency can be a mean, a median, or a mode. In some
embodiments,
the subset of bins is 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10 bins
including the respective bin.
In some embodiments, the subset of bins is 2, 3, 4, 5, 6, 7, 8, 9, 10, or more
than 10
contiguous bins including the respective bin. For example, the measure of
central tendency
of the test log2 read depths for a subset of bins can be a local average of
log2 read depths,
where the local average of log2 read depths is determined by calculating a
rolling average for
the subset of bins including the respective bin. In some embodiments, the
specified range
around the whitelist value is at least +/- 1 (e.g., each bin that has a test
log2 read depth
that differs by 1 or more from the rolling average is
removed from the
plurality of bins). In some embodiments, the specified range is at least +/-
2, at least +/- 3, at
least +/- 4, or at least +/- 5. In some embodiments, the specified range is
less than +/- 1.
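The rolling-average variant of the whitelist criterion can be illustrated as follows. This is a sketch under stated assumptions: the window is centered on the respective bin and truncated at the ends of the sequence, the window size of 5 is arbitrary, and the function name is hypothetical.

```python
from statistics import mean


def keep_near_rolling_mean(depths, window=5, max_diff=1.0):
    """Return (index, depth) pairs for bins within max_diff of their local average.

    The local average is a rolling mean over up to `window` contiguous bins
    that include the respective bin; the window is truncated at the edges.
    """
    half = window // 2
    kept = []
    for i, depth in enumerate(depths):
        lo = max(0, i - half)
        hi = min(len(depths), i + half + 1)
        local_average = mean(depths[lo:hi])
        # A difference of max_diff or greater removes the bin.
        if abs(depth - local_average) < max_diff:
            kept.append((i, depth))
    return kept


depths = [0.0, 0.1, -0.1, 3.0, 0.0, 0.1, -0.1]
kept_bins = keep_near_rolling_mean(depths)
print([i for i, _ in kept_bins])
```

In this example the bin at index 3 is an isolated spike that lies more than 1 away from its local average, so it is the only bin removed.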
[0577] In some embodiments, the one or more filtering criteria
comprises a threshold
spread (e.g., a standard deviation, a standard error, and/or a confidence
interval) of reference
log2 read depths, for the respective bin, across all samples in the one or
more reference
samples. For example, in some such embodiments, each bin that has a spread of
read depths
greater than a threshold value is removed from the plurality of bins.
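A spread-based criterion of this kind might look like the following sketch, which uses the sample standard deviation as the measure of spread. The threshold of 0.5 and the data layout (a mapping from bin name to the reference log2 read depths of that bin across reference samples) are assumptions for illustration.

```python
from statistics import stdev


def bins_passing_spread(ref_depths_by_bin, max_sd=0.5):
    """Return names of bins whose reference log2 depths have a spread <= max_sd."""
    return [
        name
        for name, depths in ref_depths_by_bin.items()
        if stdev(depths) <= max_sd
    ]


reference_depths = {
    "bin1": [0.0, 0.1, -0.1],   # consistent across reference samples
    "bin2": [-1.5, 0.2, 1.4],   # highly variable across reference samples
}
print(bins_passing_spread(reference_depths))
```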
[0578] In some embodiments, each bin in the plurality of bins is
assigned a weight, and
the one or more filtering criteria comprises a threshold weight. In some such
embodiments,
the weight is determined based on one or more of: a size of the bin (e.g., the
number of base
pairs in the respective bin); a deviation (e.g., distance) from 0 of the log2
read depth for the
respective bin in the pooled reference; and/or the spread of log2 read depths
for the respective
bin in the pooled reference. For example, in some embodiments, each bin with a
weight of 0
is removed from the plurality of bins.
[0579] Other methods for binning and/or determining sequence
ratios are possible, as will
be apparent to one skilled in the art. See, for example, Talevich et al., PLoS
Comput Biol,
12:1004873 (2016), the content of which is hereby incorporated by reference,
in its entirety,
for all purposes.
[0580] Segments.
[0581] Referring again to Block 502-1, the first dataset further
comprises a plurality of
segment-level sequence ratios, each respective segment-level sequence ratio in
the plurality
of segment-level sequence ratios corresponding to a segment in a plurality of
segments.
[0582] Each respective segment in the plurality of segments
represents a corresponding
region of the human reference genome encompassing a subset of adjacent bins in
the plurality
of bins, and each respective segment-level sequence ratio in the plurality of
segment-level
sequence ratios is determined from a measure of central tendency of the
plurality of bin-level
sequence ratios corresponding to the subset of adjacent bins encompassed by
the respective
segment.
[0583] The first dataset further comprises a plurality of segment-
level measures of
dispersion, each respective segment-level measure of dispersion in the
plurality of segment-
level measures of dispersion (i) corresponding to a respective segment in the
plurality of
segments and (ii) determined using the plurality of bin-level sequence ratios
corresponding to
the subset of adjacent bins encompassed by the respective segment.
[0584] Referring to Block 516-1, in some embodiments, one or more
respective segments
in the plurality of segments that represents a corresponding region of the
human reference
genome encodes a target gene. Referring to Block 518-1, in some embodiments,
the target
gene is MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 or BRCA2. Referring to
Block 520-1, in some embodiments, the target gene is any of the genes listed
in Table 1.
[0585] Referring to Block 522-1, the method further comprises,
for each respective
segment in the plurality of segments that represents a corresponding region of
the human
reference genome, grouping the respective subset of adjacent bins in the
plurality of bins
based on a similarity between the respective sequence ratios of the subset of
adjacent bins.
Referring to Block 524-1, in some such embodiments, the grouping is performed
using
circular binary segmentation (CBS).
[0586] Circular binary segmentation groups bins into larger
segments that divide each
chromosome into regions comprising equal sequence ratios (e.g., copy number or
copy ratio).
This is generally performed by calculating a statistic for each genomic
position, where the
statistic comprises a likelihood ratio for the null hypothesis (no change in
sequence ratio at
the respective position) against the alternative (one change in sequence ratio
at the respective
position), and where the null hypothesis is rejected if the statistic is
greater than a predefined
distribution threshold. Notably, in circular binary segmentation, the
chromosome is assumed
to be circularized, such that the calculation is performed recursively for
each position (e.g.,
each bin) around the circumference of the circle to identify all change-points
across the
length of the chromosome. See, for example, Olshen et al., Biostatistics 5, 4,
557-572
(2004), doi:10.1093/biostatistics/kxh008, which is hereby incorporated herein
by reference in
its entirety.
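The core of the change-point search can be illustrated with a short Python sketch. For every arc of the circularized bin sequence it computes the standardized difference between the mean ratio inside the arc and the mean ratio outside it, and it returns the arc with the largest absolute statistic. The sketch assumes unit noise variance and omits both the significance test and the recursion into sub-segments, so it is an illustration of the statistic rather than a full segmentation routine; the function name is hypothetical.

```python
from itertools import accumulate
from math import sqrt


def max_circular_statistic(ratios):
    """Return (max |Z|, i, j), where the candidate arc is ratios[i:j].

    Z compares the mean inside the arc with the mean of the remaining bins,
    scaled by sqrt(1/k + 1/(n - k)) for an arc of k bins out of n.
    """
    n = len(ratios)
    prefix = [0.0] + list(accumulate(ratios))  # prefix[m] = sum of ratios[:m]
    total = prefix[n]
    best = (0.0, 0, 0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:  # the arc must leave at least one bin outside
                continue
            inside = (prefix[j] - prefix[i]) / k
            outside = (total - prefix[j] + prefix[i]) / (n - k)
            z = (inside - outside) / sqrt(1.0 / k + 1.0 / (n - k))
            if abs(z) > best[0]:
                best = (abs(z), i, j)
    return best


bin_ratios = [0.0] * 5 + [1.0] * 4 + [0.0] * 5
statistic, start, end = max_circular_statistic(bin_ratios)
print(start, end, round(statistic, 3))
```

In a complete procedure the maximal statistic would be compared against a reference distribution, and each accepted arc would be searched again for further change-points.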
[0587] In some embodiments, the grouping (e.g., segmentation) is
performed using a
Fused Lasso algorithm, a wavelet-based algorithm (e.g., HaarSeg), and/or a
Hidden Markov
Model. For example, in some embodiments, the grouping is performed using a 3-
state
Hidden Markov Model, a 5-state Hidden Markov Model, and/or a 3-state Hidden
Markov
Model with fixed amplitude for the loss, neutral, and gain states. In some
embodiments, the
grouping is performed by dividing a respective chromosome into a plurality of
predefined
regions (e.g., chromosome arms) and calculating the sequence ratios for each
predefined
region using a measure of central tendency of the sequence ratios of all bins
within the
predefined region (e.g., a weighted mean of the log2 copy ratios of all bins
within each
chromosome arm).
[0588] As described above, the segment-level sequence ratio is
then calculated, for each
segment, as a measure of central tendency for the one or more bins grouped
together by the
segmentation. In some embodiments, the measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment is an arithmetic mean, a weighted mean, a midrange, a midhinge, a
trimean, a
Winsorized mean, a mean, a median, or a mode. Referring to Block 526-1, in
some
embodiments, the measure of central tendency of the plurality of bin-level
sequence ratios is
a weighted mean. For example, a segment-level copy ratio can be calculated as
the weighted
mean of the plurality of copy ratios for all bins grouped within the segment.
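As a worked illustration of the weighted mean, the following sketch computes a segment-level copy ratio from three bin-level copy ratios. The weights are arbitrary example values; how bin weights are derived is outside the scope of this sketch.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean of values; weights must not sum to zero."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


bin_ratios = [0.50, 0.60, 0.40]   # bin-level log2 copy ratios in one segment
bin_weights = [1.0, 2.0, 1.0]     # hypothetical per-bin weights
segment_ratio = weighted_mean(bin_ratios, bin_weights)
print(round(segment_ratio, 3))
```

Here the middle bin counts twice as much as its neighbors, giving (0.5 + 1.2 + 0.4) / 4 = 0.525.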
[0589] In some embodiments, the segmentation further comprises
obtaining a measure of
dispersion based on the sequence ratios (e.g., copy ratios) for each bin in
the subset of
adjacent bins. In some embodiments, each respective segment-level measure of
dispersion in
the plurality of segment-level measures of dispersion is a confidence
interval, a standard
deviation, a standard error, a variance, or a range. Referring to Block 528-1,
in some
embodiments, each respective segment-level measure of dispersion in the
plurality of
segment-level measures of dispersion is a confidence interval, and determining
each
respective segment-level measure of dispersion in the plurality of segment-
level measures of
dispersion comprises bootstrapping the plurality of bin-level sequence ratios
corresponding to
the subset of bins encompassed by the respective segment. In some alternative
embodiments,
determining segment-level measures of dispersion (e.g., segment-level
confidence intervals)
is performed using normal distributions, binomial distributions, and/or
statistical models for
estimation as will be apparent to one skilled in the art.
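A percentile bootstrap is one way to obtain such a confidence interval. The sketch below resamples the bin-level ratios of a segment with replacement, computes the mean of each resample, and takes percentiles of the resampled means. The number of resamples, the 95% level, the use of an unweighted mean, and the fixed seed are all assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(bin_ratios, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean bin-level ratio."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(bin_ratios)
    resampled_means = sorted(
        mean(rng.choices(bin_ratios, k=n)) for _ in range(n_boot)
    )
    lower = resampled_means[int((alpha / 2) * n_boot)]
    upper = resampled_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


segment_bins = [0.40, 0.50, 0.60, 0.45, 0.55, 0.50, 0.65, 0.35]
ci_lower, ci_upper = bootstrap_ci(segment_bins)
print(ci_lower, ci_upper)
```

The resulting lower and upper bounds are the quantities that a downstream confidence filter would compare against its threshold.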
[0590] Copy Number Status Annotations.
[0591] Referring again to Block 500-1, the present disclosure
provides systems and
methods for validating a copy number variation in a test subject, such as a
copy number
status annotation assigned to a genomic segment.
[0592] In some embodiments, a respective segment in the plurality
of segments is
annotated with a copy number status annotation when the corresponding segment-
level
sequence ratio satisfies one or more segment-level sequence ratio thresholds.
[0593] In some embodiments, a copy number status annotation is a
qualitative status. For
example, in some such embodiments, a copy number status annotation is selected
from the
group consisting of "amplified", "deleted", or "neutral".
[0594] As an example, the annotation can comprise, when the
segment-level sequence
ratio is a positive number, marking the segment as "amplified"; when the
segment-level
sequence ratio is a negative number, marking the segment as "deleted"; and
when the
segment-level sequence ratio is zero or within a specified range around zero,
marking the
segment as
"neutral".
[0595] As another example, the annotation can comprise, when the
segment-level
sequence ratio is greater than a first threshold, marking the segment as
"amplified"; when the
segment-level sequence ratio is less than a second threshold, marking the
segment as
"deleted"; and when the segment-level sequence ratio is between the first and
the second
thresholds, marking the segment as "neutral". In an embodiment, the one or
more segment-level sequence ratio thresholds are one or more segment-level copy
ratio thresholds, where
the copy number status annotation is "amplified" if the segment-level copy
ratio is greater
than 0.03, "deleted" if the segment-level copy ratio is less than -0.5, or
"neutral" if between
-0.5 and 0.03.
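The threshold-based annotation in this embodiment reduces to a small function. The sketch below uses the 0.03 and -0.5 thresholds recited above; the function name is hypothetical, and treating values exactly equal to a threshold as neutral is an assumption.

```python
def annotate_segment(copy_ratio, amp_threshold=0.03, del_threshold=-0.5):
    """Assign a qualitative copy number status from a segment-level copy ratio."""
    if copy_ratio > amp_threshold:
        return "amplified"
    if copy_ratio < del_threshold:
        return "deleted"
    return "neutral"


for ratio in (0.4, -0.8, 0.0):
    print(ratio, annotate_segment(ratio))
```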
[0596] In some embodiments, a copy number status annotation is a
quantitative status
(e.g., an integer copy number).
[0597] In some embodiments, the annotation comprises, for each
segment, rounding the
segment-level sequence ratio to the nearest integer and assigning an absolute
copy number
based on one or more integer segment-level sequence ratio thresholds. For
example, in some
embodiments, segment-level copy numbers can be estimated based on positive
correlations
with segment-level copy ratios. In some embodiments, the annotation further
comprises, for
each segment, determining whether the segment-level sequence ratio falls
within a specified
range in a plurality of specified ranges, and assigning an absolute copy
number (e.g., an
integer copy number) based on the specified range.
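Range-based assignment of an integer copy number can be sketched with a sorted list of cut-offs. The cut-off values below are hypothetical placeholders, not values taken from the disclosure, and the top range is read as "4 or more copies".

```python
from bisect import bisect_right

# Hypothetical upper bounds of the log2 ratio ranges for 0, 1, 2 and 3 copies.
UPPER_BOUNDS = [-1.1, -0.25, 0.2, 0.7]


def absolute_copy_number(log2_ratio):
    """Map a segment-level log2 ratio to an integer copy number by range lookup.

    Ratios above the last bound are reported as 4 (meaning 4 or more).
    """
    return bisect_right(UPPER_BOUNDS, log2_ratio)


print([absolute_copy_number(r) for r in (-1.5, -0.5, 0.0, 0.5, 1.0)])
```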
[0598] In some embodiments, the annotation further comprises
rescaling the segment-
level sequence ratio based on one or more scaling factors (e.g., tumor
fraction, B-allele
frequency, known ploidy, and/or point estimates (mean, median, maximum, etc.)
of somatic
variant allele frequencies). For example, a segment-level copy ratio can be
divided by a
tumor fraction estimate for the test subject or the biological sample, thus
estimating the copy
ratio that would be expected in a pure tumor sample.
[0599] In some embodiments, the method further comprises removing
(e.g., filtering)
from the plurality of segments each respective segment that fails to satisfy
one or more
filtering criteria.
[0600] For example, in some embodiments, the one or more
filtering criteria comprises a
threshold absolute copy number, where each segment that is annotated with an
absolute copy
number lower than the threshold is removed from the plurality of segments. In
some such
embodiments, the threshold absolute copy number is 2, 3, 4, 5, 6, 7, 8, 9, 10,
or more than 10
copies.
[0601] In some embodiments, the one or more filtering criteria
comprises one or more
threshold values for a measure of dispersion, where (i) each segment in the
plurality of
segments is annotated with an absolute copy number; (ii) for a subset of
adjacent segments,
the measure of dispersion is calculated using the absolute copy number of each
segment in
the subset of adjacent segments; and (iii) the removing from the plurality of
segments each
respective segment that fails to satisfy the one or more filtering criteria
comprises removing
each segment in the subset of adjacent segments. Thus, if the measure of
dispersion for a
group of adjacent segments fails to satisfy a filtering criterion, then all of
the segments used
to calculate the measure of dispersion are removed from the plurality of
segments. In some
embodiments, the measure of dispersion is a confidence interval, and the
filtering criterion is
inclusion of zero.
[0602] Other methods for annotating and preprocessing genomic
segments are possible,
as will be apparent to one skilled in the art. See, for example, Talevich et
al., PLoS Comput
Biol, 12:1004873 (2016), the content of which is hereby incorporated by
reference, in its
entirety, for all purposes.
[0603] In some embodiments, genomic region binning, coverage
calculation, bias
correction, normalization to a reference pool, segmentation, visualization and
annotation are
performed using any methods and/or software, or any embodiments, combinations,
substitutions, additions, and/or deletions thereof as will be apparent to one
skilled in the art.
[0604] Validation Filters.
[0605] Referring to Block 530-1, the method further comprises
validating a copy number
status annotation of a respective segment in the plurality of segments that is
annotated with a
copy number variation by applying the first dataset to an algorithm having a
plurality of
filters.
[0606] Referring to Block 532-1, the plurality of filters
comprises (1) a measure of
central tendency bin-level sequence ratio filter that is fired when a measure
of central
tendency of the plurality of bin-level sequence ratios corresponding to the
subset of bins
encompassed by the respective segment fails to satisfy one or more bin-level
sequence ratio
thresholds.
[0607] In some embodiments, the measure of central tendency of
the plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment fails to satisfy one or more bin-level sequence ratio thresholds when
the measure of
central tendency is lower than a bin-level sequence ratio amplification
threshold. In some
embodiments, a bin-level sequence ratio amplification threshold is between
-0.5 and 5,
between -0.1 and 3, between -0.047 and 1.6, or between 0 and 0.5. In some
embodiments, a
bin-level sequence ratio amplification threshold is lower than 0.3.
[0608] In some alternative embodiments, the measure of central
tendency of the plurality
of bin-level sequence ratios corresponding to the subset of bins encompassed
by the
respective segment fails to satisfy one or more bin-level sequence ratio
thresholds when the
measure of central tendency is higher than a bin-level sequence ratio deletion
threshold. In
some embodiments, a bin-level sequence ratio deletion threshold is between -5
and 0.5,
between -2 and 0, between -1 and -0.2, or between -0.75 and -0.25.
[0609] In some embodiments, the measure of central tendency of
the plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment is an arithmetic mean, a weighted mean, a midrange, a midhinge, a
trimean, a
Winsorized mean, a mean, a median or a mode of the bin-level sequence ratios
for all the
respective bins encompassed by the respective segment. Referring to Block
534-1, in some
embodiments, the measure of central tendency of the plurality of bin-level
sequence ratios is
a median.
[0610] In some embodiments, the measure of central tendency of
the plurality of bin-
level sequence ratios in the (1) measure of central tendency bin-level
sequence ratio filter is
different from the measure of central tendency of the plurality of bin-level
sequence ratios
used to determine the segment-level sequence ratio (e.g., where the (1) filter
is a median copy
ratio filter for all the bins in the segment, and the segment-level sequence
ratio is calculated
from a weighted mean of the bins in the segment).
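The value of using a different measure in filter (1) can be seen in a short sketch. Here the filter tests the median bin-level ratio of the segment, which is less sensitive to a single outlier bin than a mean. The threshold values are hypothetical picks from within the recited ranges, and the function name is illustrative.

```python
from statistics import mean, median


def median_filter_fires(bin_ratios, status, amp_threshold=0.2, del_threshold=-0.5):
    """Return True when filter (1) fires for the annotated status."""
    segment_median = median(bin_ratios)
    if status == "amplified":
        return segment_median < amp_threshold
    if status == "deleted":
        return segment_median > del_threshold
    raise ValueError("status must be 'amplified' or 'deleted'")


# One outlier bin inflates the mean, but the median stays near neutral.
outlier_segment = [0.05, 0.10, 0.08, 2.50, 0.07]
print(round(mean(outlier_segment), 2), median(outlier_segment))
print(median_filter_fires(outlier_segment, "amplified"))
```

In this example the mean of the bins is 0.56, which would suggest an amplification, while the median of 0.08 falls below the threshold and the filter fires.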
[0611] Referring to Block 536-1, the plurality of filters further
comprises (2) a
confidence filter that is fired when the segment-level measure of dispersion
(e.g., confidence
interval) corresponding to the respective segment fails to satisfy a
confidence threshold.
[0612] In some embodiments, the segment-level measure of
dispersion (e.g., confidence
interval) corresponding to the respective segment fails to satisfy a
confidence threshold (e.g.,
for amplification) when the lower bound of the measure of dispersion is lower
than the
confidence threshold. In some alternative embodiments, the segment-level
measure of
dispersion (e.g., confidence interval) corresponding to the respective segment
fails to satisfy a
confidence threshold (e.g., for deletion) when the upper bound of the measure
of dispersion is
higher than the confidence threshold.
[0613] Referring to Block 538-1, in some embodiments, the
confidence threshold is a
measure of central tendency of the segment-level sequence ratios corresponding
to all other
segments that map to the same chromosome of the human reference genome as the
respective
segment (e.g., all other segments excluding the respective segment, if the
segment is located
at an end of the chromosome).
[0614] Referring to Block 540-1, in some embodiments, the
confidence threshold
comprises a measure of central tendency of the segment-level sequence ratios
corresponding
to all preceding segments that map to the same chromosome of the human
reference genome
as the respective segment, and the measure of central tendency of the segment-
level sequence
ratios corresponding to all subsequent segments that map to the same
chromosome of the
human reference genome as the respective segment (e.g., all preceding segments
and all
following segments, if the respective segment is not located at an end of the
chromosome).
In some such embodiments, the (2) confidence filter tests the upper or lower
bound of the
measure of dispersion (e.g., confidence interval) against two independent
confidence
thresholds (e.g., one preceding measure of central tendency, and one following
measure of
central tendency), where the bound of the measure of dispersion must satisfy
both confidence
thresholds in order to pass the filter. In some such embodiments, the two
independent
confidence thresholds have different values.
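For the amplification case, the two-threshold form of the confidence filter can be sketched as follows. The mean is used as the measure of central tendency, and a segment at an end of the chromosome is handled by simply skipping the empty side; both choices, and the function name, are assumptions for illustration.

```python
from statistics import mean


def confidence_filter_fires(ci_lower, preceding_ratios, following_ratios):
    """Return True when filter (2) fires for a segment annotated as amplified.

    The lower confidence bound must not fall below the mean segment-level
    ratio of the preceding segments or of the following segments on the
    same chromosome. An empty side is ignored.
    """
    thresholds = [
        mean(ratios) for ratios in (preceding_ratios, following_ratios) if ratios
    ]
    return any(ci_lower < threshold for threshold in thresholds)


print(confidence_filter_fires(0.35, [0.0, 0.1], [0.05]))   # passes both thresholds
print(confidence_filter_fires(0.02, [0.0, 0.1], [-0.1]))   # fails the preceding one
```

For a deletion, the comparison would be mirrored: the upper bound of the interval would be tested, and the filter would fire when that bound is higher than a threshold.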
[0615] In some embodiments, the measure of central tendency of
the segment-level
sequence ratios in the (2) confidence filter is an arithmetic mean, a weighted
mean, a
midrange, a midhinge, a trimean, a Winsorized mean, a mean, a median or a
mode.
[0616] Referring to Block 542-1, the plurality of filters further
comprises (3) a measure
of central tendency-plus-deviation bin-level sequence ratio filter that is
fired when a measure
of central tendency of the plurality of bin-level sequence ratios
corresponding to the subset of
bins encompassed by the respective segment fails to satisfy one or more
measure of central
tendency-plus-deviation bin-level sequence ratio thresholds. The one or more
measure of
central tendency-plus-deviation bin-level copy ratio thresholds are derived
from (i) a measure
of central tendency of the bin-level sequence ratios corresponding to the
plurality of bins that
map to the same chromosome of the human reference genome as the respective
segment, and
(ii) a measure of dispersion across the bin-level sequence ratios
corresponding to the plurality
of bins that map to the respective chromosome. The (i) measure of central
tendency is
calculated using all of the bins that map to the respective chromosome,
including the bins
encompassed by the respective segment under investigation.
[0617] In some embodiments, the measure of dispersion across the
bin-level sequence
ratios in the (3) measure of central tendency-plus-deviation filter is a
variance, standard
deviation, or interquartile range across the bin-level copy ratios. In some
embodiments, the
measure of dispersion is a median of a plurality of absolute deviations, where
each absolute
deviation corresponds to a bin in the plurality of bins that map to the
chromosome and is
calculated by subtracting the "chromosome sequence ratio" (e.g., the median of
all bin-level
sequence ratios for the plurality of bins in the chromosome) from each bin's
sequence ratio.
[0618] In some embodiments, the one or more measures of central
tendency-plus-
deviation bin-level sequence ratio thresholds (e.g., for amplifications) is a
sum of (i) a
measure of central tendency value of the bin-level sequence ratios
corresponding to the
plurality of bins that map to the same chromosome (e.g., the "chromosome
sequence ratio"),
and (ii) the measure of central tendency value of a plurality of absolute
dispersions, where
each absolute dispersion is determined using a comparison (e.g., a
subtraction) between each
bin-level sequence ratio corresponding to each bin in the plurality of bins
that map to the
same chromosome as the respective segment, and the measure of central tendency
value of
the bin-level sequence ratios measured in (i). The measure of central tendency
of the
plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by the
respective segment fails to satisfy the one or more measure of central
tendency-plus-deviation
bin-level sequence ratio thresholds when the measure of central tendency of
the plurality of
bin-level sequence ratios corresponding to the subset of bins encompassed by
the respective
segment is lower than the one or more measure of central tendency-plus-
deviation bin-level
sequence ratio thresholds.
[0619] For example, in some embodiments, a segment annotated with
an amplification
status will pass the (3) filter if the median copy ratio of all bins
encompassed in the segment
is equal to or higher than the median plus the median absolute deviation (MAD)
of all bins'
copy ratios on the same chromosome.
[0620] In some embodiments, the one or more measure of central
tendency-plus-
deviation bin-level sequence ratio thresholds (e.g., for deletions) comprises
(i) a measure of
central tendency value of the bin-level sequence ratios corresponding to the
plurality of bins
that map to the same chromosome (e.g., the "chromosome sequence ratio"), minus
(ii) the
measure of central tendency value of a plurality of absolute dispersions,
where each absolute
dispersion is determined using a comparison (e.g., a subtraction) between each
bin-level
sequence ratio corresponding to each bin in the plurality of bins that map to
the same
chromosome as the respective segment, and the measure of central tendency
value of the bin-
level sequence ratios measured in (i). The measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment fails to satisfy the one or more measure of central tendency-plus-
deviation bin-level
sequence ratio thresholds when the measure of central tendency of the
plurality of bin-level
sequence ratios corresponding to the subset of bins encompassed by the
respective segment is
higher than the one or more measure of central tendency-plus-deviation bin-
level sequence
ratio thresholds.
[0621] In some such embodiments, the one or more measure of
central tendency-plus-
deviation bin-level sequence ratio thresholds is the measure of central
tendency value of the
bin-level sequence ratios corresponding to the plurality of bins that map to
the same
chromosome, minus the measure of central tendency value of the plurality of
absolute
dispersions multiplied by a factor k. In some embodiments, k is between 0.1
and 0.95,
between 0.3 and 0.9, between 0.5 and 0.85, between 0.65 and 0.8, or between
0.73 and 0.77.
[0622] For example, in some embodiments, a segment annotated with
a deletion status
will pass the (3) filter if the median copy ratio of all bins encompassed in
the segment is less
than or equal to the median minus 0.75 of the median absolute deviation (MAD)
of all bins'
copy ratios on the same chromosome.
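The median-plus-MAD and median-minus-k-times-MAD thresholds of filter (3) can be sketched together. The factor k = 0.75 for deletions follows the example above; the function names and the example data are hypothetical.

```python
from statistics import median


def mad_thresholds(chrom_bin_ratios, k_del=0.75):
    """Return (amplification threshold, deletion threshold) for a chromosome.

    Both are derived from the median and the median absolute deviation (MAD)
    of all bin-level ratios on the chromosome.
    """
    chrom_median = median(chrom_bin_ratios)
    mad = median(abs(r - chrom_median) for r in chrom_bin_ratios)
    return chrom_median + mad, chrom_median - k_del * mad


def mad_filter_fires(segment_bin_ratios, chrom_bin_ratios, status):
    """Return True when filter (3) fires for the annotated status."""
    amp_threshold, del_threshold = mad_thresholds(chrom_bin_ratios)
    segment_median = median(segment_bin_ratios)
    if status == "amplified":
        return segment_median < amp_threshold
    if status == "deleted":
        return segment_median > del_threshold
    raise ValueError("status must be 'amplified' or 'deleted'")


chrom_bins = [-0.2, -0.1, 0.0, 0.1, 0.2, 1.0, 1.1]
print(mad_thresholds(chrom_bins))
print(mad_filter_fires([1.0, 1.1], chrom_bins, "amplified"))
```

With these example bins the chromosome median is 0.1 and the MAD is 0.2, so the thresholds are approximately 0.3 for amplifications and -0.05 for deletions.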
[0623] Referring to Block 544-1, in some embodiments, the
plurality of filters further
comprises (4) a segment-level sequence ratio filter that is fired when the
segment-level
sequence ratio corresponding to the respective segment fails to satisfy one or
more segment-
level sequence ratio thresholds.
[0624] In some embodiments, the segment-level sequence ratio
corresponding to the
respective segment fails to satisfy one or more segment-level sequence ratio
thresholds when
the segment-level sequence ratio is lower than a segment-level sequence ratio
amplification
threshold. In some such embodiments, a segment-level sequence ratio
amplification
threshold is between -0.5 and 5, between -0.1 and 3, between -0.047 and 1.6,
or between 0
and 0.5.
[0625] In some alternative embodiments, the segment-level
sequence ratio corresponding
to the respective segment fails to satisfy one or more segment-level sequence
ratio thresholds
when the segment-level sequence ratio is higher than a segment-level sequence
ratio deletion
threshold. In some such embodiments, a segment-level sequence ratio deletion
threshold is
between -5 and 0.5, between -2 and 0, between -1 and -0.2, or between -0.75
and -0.25.
[0626] For example, in some embodiments, a segment annotated with
an amplification
status will pass the (4) segment-level sequence ratio filter if the segment's
copy ratio is
greater than 0.03, and a segment annotated with a deletion status will pass
the (4) segment-
level sequence ratio filter if the segment's copy ratio is less than -0.5. In
some embodiments,
the amplification and/or deletion thresholds are specified by the user or
practitioner. In some
embodiments, the amplification and/or deletion thresholds are optimized for
improved
specificity and sensitivity for one or more test samples.
[0627] In some embodiments, the threshold for the (4) segment-
level sequence ratio filter
is determined by (i) estimating a circulating tumor fraction for the liquid
biopsy sample, and
(ii) calculating an expected log2 copy ratio for a high copy gain or deletion,
where the
expected log2 copy ratio is used as the threshold. In some embodiments, a high
copy gain is
at least 4 copies. In some embodiments, a high copy gain is at least 5 copies.
In some
embodiments, a high copy gain is at least 6 copies. In some embodiments, a
high copy gain
is at least 7 copies. In some embodiments, a high copy gain is at least 8
copies. In some
embodiments, a high copy gain is at least 9 copies. In some embodiments, a
high copy gain
is at least 10 copies.
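One way to compute such an expected log2 copy ratio is a simple two-component mixture, in which tumor-derived DNA carries the altered copy number and the remaining DNA is diploid. This mixture formula is a common modeling assumption and is not stated in the text above; the function name and the example values are likewise hypothetical.

```python
from math import log2


def expected_log2_ratio(tumor_fraction, tumor_copies, normal_copies=2):
    """Expected log2 copy ratio for a locus under a tumor/normal mixture model.

    Assumes the non-tumor fraction is diploid. The result is undefined when
    the mixed copy number is zero (pure tumor with a homozygous deletion).
    """
    mixed = tumor_fraction * tumor_copies + (1.0 - tumor_fraction) * normal_copies
    return log2(mixed / normal_copies)


# Thresholds for a 10% circulating tumor fraction estimate.
print(round(expected_log2_ratio(0.10, 6), 3))   # gain to 6 copies
print(round(expected_log2_ratio(0.10, 1), 3))   # single-copy loss
```

At a low tumor fraction the expected shift is small even for a high copy gain, which is why a threshold derived this way scales with the tumor fraction estimate.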
[0628] In some embodiments, an additional filter is used that
filters out candidate
segments that are longer than a threshold length. In some embodiments, the
threshold length is
determined empirically. In some embodiments, the threshold length is at least
15 Mb. In
some embodiments, the threshold length is at least 20 Mb. In some embodiments,
the
threshold length is at least 25 Mb. In some embodiments, the threshold length
is at least 30
Mb. In some embodiments, the threshold length is at least 35 Mb. In some
embodiments, the
threshold length is no more than 50 Mb. In some embodiments, the threshold
length is no
more than 40 Mb. In some embodiments, the threshold length is no more than 30
Mb. In
some embodiments, the threshold length is from 15 Mb to 50 Mb.
[0629] In some embodiments, one or more of the validation filters
disclosed herein are
optionally included in the plurality of validation filters applied to the
first dataset. For
example, in some embodiments, the plurality of validation filters comprises
less than one,
less than two, less than three, or less than four of the validation filters
described in the present
disclosure. In some embodiments, any one or more of the validation filters
described herein
can include any modifications, substitutions, additions and/or combinations
thereof, as will be
apparent to one skilled in the art.
[0630] Validating Copy Number Variations.
[0631] Referring to Block 546-1, the method further comprises,
when a filter in the
plurality of filters is fired, rejecting the copy number status annotation of
the respective segment;
and, when no filter in the plurality of filters is fired, validating the copy
number status
annotation of the respective segment.
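The accept-or-reject rule can be sketched as a function that applies a list of named filters to a segment. Each filter is a callable that returns True when it fires. The dictionary keys, filter names, and threshold values in the example are hypothetical.

```python
def validate_annotation(segment, filters):
    """Return (status, fired), where status is "validated" or "rejected".

    The annotation is rejected if any filter fires, and validated otherwise.
    """
    fired = [name for name, fires in filters if fires(segment)]
    return ("rejected" if fired else "validated"), fired


# Hypothetical filters for a segment annotated as amplified.
amplification_filters = [
    ("median_bin_ratio", lambda s: s["median_bin_ratio"] < 0.2),
    ("confidence", lambda s: s["ci_lower"] < s["neighbor_mean_ratio"]),
    ("segment_ratio", lambda s: s["segment_ratio"] <= 0.03),
]

clean_segment = {
    "median_bin_ratio": 0.45, "ci_lower": 0.30,
    "neighbor_mean_ratio": 0.02, "segment_ratio": 0.48,
}
noisy_segment = {
    "median_bin_ratio": 0.05, "ci_lower": -0.10,
    "neighbor_mean_ratio": 0.02, "segment_ratio": 0.10,
}
print(validate_annotation(clean_segment, amplification_filters))
print(validate_annotation(noisy_segment, amplification_filters))
```

Returning the names of the fired filters alongside the status makes it possible to see why an annotation was rejected; a separate filter list would be supplied for segments annotated as deleted.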
[0632] In some embodiments, validation of an amplification status
requires satisfaction of
each filter in a plurality of amplification filters, and validation of a
deletion status requires
satisfaction of each filter in a plurality of deletion filters. Thus, all
filters in the plurality of
filters applied to the first dataset must be appropriate for the type of copy
number status
annotation to be validated.
[0633] For example, referring to Block 548-1, in some
embodiments, the method further
comprises validating an amplification status of a respective segment in the
plurality of
segments, by applying the first dataset to an algorithm having a plurality of
filters. The
plurality of filters comprises (1) a measure of central tendency bin-level
sequence ratio filter
that is fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment is
lower than a
bin-level sequence ratio amplification threshold; (2) a confidence filter that
is fired when the
lower bound of the segment-level measure of dispersion corresponding to the
respective
segment is lower than the confidence threshold; and (3) a measure of central
tendency-plus-
deviation bin-level sequence ratio filter that is fired when a measure of
central tendency of
the plurality of bin-level sequence ratios corresponding to the subset of bins
encompassed by
the respective segment is lower than the measure of central tendency-plus-
deviation bin-level
sequence ratio threshold. When a filter in the plurality of filters is fired,
the amplification
status of the respective segment is rejected; and when no filter in the
plurality of filters is
fired, the amplification status of the respective segment is validated.
[0634] Referring to Block 550-1, in some alternative embodiments,
the method further
comprises validating a deletion status of a respective segment in the
plurality of segments, by
applying the first dataset to an algorithm having a plurality of filters. The
plurality of filters
comprises (1) a measure of central tendency bin-level sequence ratio filter
that is fired when a
measure of central tendency of the plurality of bin-level sequence ratios
corresponding to the
subset of bins encompassed by the respective segment is higher than a bin-
level sequence
ratio deletion threshold; (2) a confidence filter that is fired when the upper
bound of the
segment-level measure of dispersion corresponding to the respective segment is
higher than
the confidence threshold; and (3) a measure of central tendency-plus-deviation
bin-level
sequence ratio filter that is fired when a measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment is higher than the measure of central tendency-plus-deviation bin-
level sequence
ratio threshold. When a filter in the plurality of filters is fired, the
deletion status of the
respective segment is rejected; and when no filter in the plurality of filters
is fired, the
deletion status of the respective segment is validated.
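The amplification filters of paragraph [0633] and the deletion filters of paragraph [0634] mirror one another, so they can be sketched in a single routine. The following is a minimal illustration only. It assumes the median as the measure of central tendency and a lower/upper confidence bound as the bounds of the segment-level measure of dispersion. The function name, argument names, and any threshold values passed in are hypothetical and are not taken from the disclosure.

```python
from statistics import median


def validate_status(status, bin_ratios, lower_bound, upper_bound,
                    ratio_threshold, confidence_threshold, plus_dev_threshold):
    """Return (validated, fired) for an 'amplification' or 'deletion' call.

    bin_ratios: bin-level sequence ratios of the bins in the segment.
    lower_bound / upper_bound: bounds of the segment-level dispersion measure.
    All thresholds are caller-supplied; none are prescribed here.
    """
    center = median(bin_ratios)  # assumed measure of central tendency
    if status == "amplification":
        checks = {
            "bin_ratio": center < ratio_threshold,
            "confidence": lower_bound < confidence_threshold,
            "plus_deviation": center < plus_dev_threshold,
        }
    elif status == "deletion":
        checks = {
            "bin_ratio": center > ratio_threshold,
            "confidence": upper_bound > confidence_threshold,
            "plus_deviation": center > plus_dev_threshold,
        }
    else:
        raise ValueError("status must be 'amplification' or 'deletion'")
    fired = [name for name, hit in checks.items() if hit]
    # The status is validated only when no filter is fired.
    return (not fired, fired)
```

Returning the names of the fired filters alongside the verdict is a design choice of this sketch; it lets a caller count fired filters without re-running the checks.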
[0635] In some embodiments, the plurality of filters comprises a
plurality of
amplification filters and a plurality of deletion filters.
[0636] In some embodiments, the copy number status annotation is
"neutral", and the
validating the copy number status annotation comprises firing at least one
filter in the
plurality of amplification filters and firing at least one filter in the
plurality of deletion filters.
[0637] In some embodiments, a segment is flagged as ambiguous if
less than a threshold
number of filters is fired. For example, in some embodiments, a segment is
flagged as
ambiguous if less than 4, less than 3, or less than 2 filters are fired.
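The neutral and ambiguous cases in paragraphs [0636] and [0637] reduce to counting fired filters. The sketch below assumes each filter set reports the names of its fired filters as a list; the function names and the default cut-off of two are illustrative choices, the latter being one of the example values given above.

```python
def validate_neutral(amplification_fired, deletion_fired):
    """A neutral annotation is validated when at least one amplification
    filter and at least one deletion filter are fired."""
    return len(amplification_fired) >= 1 and len(deletion_fired) >= 1


def is_ambiguous(fired, min_fired=2):
    """Flag a segment as ambiguous when fewer than min_fired filters fired."""
    return len(fired) < min_fired
```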
[0638] In some embodiments, a validated copy number variation for
a segment is
assigned to the segment and to each bin encompassed by the respective segment.
[0639] Applications to Precision Oncology.
[0640] Referring to Block 552-1, in some embodiments, the method
further comprises,
after the validating, applying the validated copy number variation of the
respective segment
to a diagnostic assay.
[0641] For example, in some embodiments, the method further
comprises treating a
patient with a cancer containing a copy number variation of a target gene by
determining
whether the copy number variation of the target gene is a focal copy number
variation by
validating the copy number variation in the patient, thus determining whether
the patient has
an aggressive form of the cancer associated with a focal copy number variation
of the target
gene. The method further comprises, when the patient has the aggressive form
of cancer
associated with focal copy number variation of the target gene, administering
a first therapy
for the aggressive form of the cancer to the patient, and when the patient
does not have the
aggressive form of cancer associated with focal copy number variation of the
target gene,
administering a second therapy for a less aggressive form of the cancer to the
patient.
[0642] In some such embodiments, the first therapy is selected
from Table 2. In some
embodiments, the first therapy is trastuzumab, lapatinib, or crizotinib.
[0643] Table 2. Matched therapies for selected targeted panel
genes.
Gene | Cohort | Therapies
MET | Ovarian Cancer, Cervical Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Tumor of Unknown Origin, Kidney Cancer, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Esophageal Cancer | Crizotinib
MET | Neural, Brain Cancer, Glioblastoma, Low Grade Glioma | Crizotinib
MET | Non-Small Cell Lung Cancer | Crizotinib; Osimertinib - Resistance
MET | Chromophobe Renal Cell Carcinoma, Non-Clear Cell Renal Cell Carcinoma, Kidney Cancer, Clear Cell Renal Cell Carcinoma | Crizotinib; Savolitinib
MET | Colorectal Cancer | Panitumumab + Cabozantinib; Cetuximab or Panitumumab - Resistance
EGFR | Ovarian Cancer, Cervical Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Tumor of Unknown Origin, Kidney Cancer, Bladder Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Small Cell Lung Cancer | Cetuximab or Panitumumab; Gefitinib; Lapatinib
EGFR | Brain Cancer, Glioblastoma | Depatuxizumab
EGFR | Colorectal Cancer, Gastric Cancer, Esophageal Cancer | Cetuximab or Panitumumab
EGFR | Head and Neck Squamous Cell Carcinoma | Cetuximab; Panitumumab; Gefitinib; Lapatinib
ERBB2 | Colorectal Cancer | Trastuzumab; Lapatinib + Trastuzumab; Cetuximab or Panitumumab - Resistance
ERBB2 | Breast Cancer | Trastuzumab + Pertuzumab; Ado-Trastuzumab Emtansine; Lapatinib + Trastuzumab; Trastuzumab; Neratinib; Tanespimycin + Trastuzumab
ERBB2 | Uveal Melanoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Basal Cell Carcinoma, Melanoma, Tumor of Unknown Origin, Bladder Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Prostate Cancer, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Peritoneal cancer, Pancreatic Cancer | Trastuzumab; Lapatinib; Neratinib
ERBB2 | Breast Cancer | Trastuzumab + Pertuzumab; Ado-Trastuzumab Emtansine; Lapatinib + Trastuzumab; Fulvestrant + Trastuzumab; Trastuzumab; Neratinib; Tanespimycin + Trastuzumab
ERBB2 | Gastric Cancer | Trastuzumab; Lapatinib; Trastuzumab + Pertuzumab; Neratinib
ERBB2 | Biliary Cancer | Trastuzumab
ERBB2 | Esophageal Cancer | Trastuzumab
ERBB2 | Ovarian Cancer, Cervical Cancer, Endometrial Cancer | Trastuzumab; Neratinib; Afatinib
CD274 | Non-Small Cell Lung Cancer | Nivolumab + Durvalumab or Avelumab; Unfavorable Prognosis
CD274 | Ovarian Cancer, Cervical Cancer, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Brain Cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Tumor of Unknown Origin, Kidney Cancer, Bladder Cancer, Gastric Cancer, Bone Cancer, Thymoma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Esophageal Cancer | Nivolumab + Durvalumab or Avelumab
CCNE1 | Breast Cancer | Palbociclib; SNS-032; MK-2206 + Dinaciclib
CCNE1 | Ovarian Cancer, Cervical Cancer, Uveal Melanoma, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Neural, Neuroblastoma, Basal Cell Carcinoma, Brain Cancer, Breast Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Tumor of Unknown Origin, Kidney Cancer, Gastrointestinal Stromal Tumor, Medulloblastoma, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Low Grade Glioma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Mesothelioma, Esophageal Cancer, Small Cell Lung Cancer | SNS-032; MK-2206 + Dinaciclib
MYC | Breast Cancer | Doxorubicin + Cyclophosphamide + Docetaxel; Palbociclib
MYC | Ovarian Cancer, Cervical Cancer, Uveal Melanoma, Colorectal Cancer, Chromophobe Renal Cell Carcinoma, Liver Cancer, Endocrine Tumor, Oropharyngeal Cancer, Retinoblastoma, Biliary Cancer, Adrenal cancer, Neural, Neuroblastoma, Basal Cell Carcinoma, Brain Cancer, Melanoma, Non-Clear Cell Renal Cell Carcinoma, Glioblastoma, Tumor of Unknown Origin, Kidney Cancer, Gastrointestinal Stromal Tumor, Medulloblastoma, Bladder Cancer, Gastric Cancer, Bone Cancer, Non-Small Cell Lung Cancer, Thymoma, Low Grade Glioma, Prostate Cancer, Clear Cell Renal Cell Carcinoma, Skin Cancer, Thyroid Cancer, Sarcoma, Testicular cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Meningioma, Peritoneal cancer, Endometrial Cancer, Pancreatic Cancer, Mesothelioma, Esophageal Cancer, Small Cell Lung Cancer | Palbociclib
BRCA1 | Colorectal Cancer, Endocrine Tumor, Biliary Cancer, Tumor of Unknown Origin, Gastric Cancer, Non-Small Cell Lung Cancer, Head and Neck Cancer, Head and Neck Squamous Cell Carcinoma, Endometrial Cancer, Esophageal Cancer | Olaparib or Niraparib or Rucaparib; Olaparib or Talazoparib; Adavosertib
BRCA1 | Pancreatic Cancer | Cisplatin + Gemcitabine; Olaparib; Talazoparib; Niraparib or Rucaparib; Cisplatin + Olaparib; Adavosertib
BRCA1 | Breast Cancer | Olaparib or Talazoparib; Niraparib or Rucaparib
BRCA1 | Prostate Cancer | Olaparib
BRCA1 | Ovarian Cancer | Olaparib or Niraparib or Rucaparib; Cisplatin or Carboplatin; Cisplatin + Olaparib; Adavosertib
BRCA2 | Colorectal Cancer, Biliary Cancer, Brain Cancer, Tumor of Unknown Origin, Bladder Cancer, Gastric Cancer, Non-Small Cell Lung Cancer, Sarcoma, Endometrial Cancer, Esophageal Cancer | Olaparib or Niraparib or Rucaparib; Olaparib or Talazoparib
BRCA2 | Prostate Cancer | Olaparib or Niraparib or Rucaparib; Olaparib; Nivolumab or Pembrolizumab
BRCA2 | Pancreatic cancer | Cisplatin + Gemcitabine; Olaparib; Talazoparib; Niraparib or Rucaparib; Cisplatin + Olaparib
BRCA2 | Breast Cancer | Olaparib or Talazoparib; Niraparib or Rucaparib; Nivolumab or Pembrolizumab
BRCA2 | Ovarian Cancer | Olaparib or Niraparib or Rucaparib; Cisplatin or Carboplatin; Nivolumab or Pembrolizumab
[0644] In some embodiments, the method further comprises
generating a report (e.g., for
use by a physician) comprising the validated copy number status of the
respective segment
for the biological sample of the respective test subject. In some such
embodiments, the
generated report further comprises matched therapies (e.g., treatments and/or
clinical trials)
based on the copy number status of the respective segment.
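As an illustration of how a report could pair a validated copy number status with matched therapies, the sketch below looks up a gene and cohort in a small excerpt of Table 2. Only three rows of the table are reproduced; the dictionary layout, the function name, and the report field names are hypothetical, and returning no therapies for an unvalidated status is an assumption of this sketch rather than a requirement stated in the disclosure.

```python
# Three rows copied from Table 2; the full table would be loaded the same way.
TABLE_2_EXCERPT = {
    ("ERBB2", "Biliary Cancer"): ["Trastuzumab"],
    ("BRCA1", "Prostate Cancer"): ["Olaparib"],
    ("MET", "Non-Small Cell Lung Cancer"): ["Crizotinib", "Osimertinib - Resistance"],
}


def build_report(subject_id, gene, cohort, copy_number_status, validated):
    """Assemble a report entry with matched therapies for a validated status."""
    therapies = []
    if validated:
        # Unknown gene/cohort pairs simply yield an empty list.
        therapies = list(TABLE_2_EXCERPT.get((gene, cohort), []))
    return {
        "subject_id": subject_id,
        "gene": gene,
        "cohort": cohort,
        "copy_number_status": copy_number_status,
        "validated": validated,
        "matched_therapies": therapies,
    }
```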
[0645] In some embodiments, the method further comprises disease
screening and/or
monitoring over a plurality of time points. For example, in some embodiments,
the method is
used for monitoring disease progression and/or recurrence after treatment, for
assessing the
efficacy of a treatment, and/or for performing comparative studies using
liquid biopsy
samples and matched solid tissue samples.
[0646] For example, in some embodiments, the method further
comprises obtaining a
second dataset that comprises a plurality of bin-level sequence ratios, each
respective bin-
level sequence ratio in the plurality of bin-level sequence ratios
corresponding to a respective
bin in a plurality of bins, where each respective bin in the plurality of bins
represents a
corresponding region of a human reference genome, and each respective bin-
level sequence
ratio in the plurality of bin-level sequence ratios is determined from a
sequencing of a
plurality of cell-free nucleic acids in a second liquid biopsy sample of the
test subject and one
or more reference samples. The second dataset further comprises a plurality of
segment-level
sequence ratios, each respective segment-level sequence ratio in the plurality
of segment-
level sequence ratios corresponding to a segment in a plurality of segments,
where each
respective segment in the plurality of segments represents a corresponding
region of the
human reference genome encompassing a subset of adjacent bins in the plurality
of bins, and
each respective segment-level sequence ratio in the plurality of segment-level
sequence ratios
is determined from a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of adjacent bins encompassed by the respective
segment. The
second dataset also includes a plurality of segment-level measures of
dispersion, each
respective segment-level measure of dispersion in the plurality of segment-
level measures of
dispersion (i) corresponding to a respective segment in the plurality of
segments and (ii)
determined using the plurality of bin-level sequence ratios corresponding to
the subset of
adjacent bins encompassed by the respective segment.
[0647] The method further includes validating a copy number
status annotation of a
respective segment in the plurality of segments that is annotated with a copy
number
variation by applying the second dataset to an algorithm having a plurality of
filters. The
plurality of filters can include any of the filters disclosed herein.
[0648] In some such embodiments, the first liquid biopsy sample
is obtained at a first
time point and the second liquid biopsy sample of the test subject is obtained
at a second time
point. For example, in some embodiments, the second time point is at least 1
day, at least 1
week, at least 1 month, at least 2 months, at least 3 months, at least 6
months, or at least 1
year after the first time point.
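One way to use validated results from two time points is to list the segments whose status changed between samples. The sketch below is illustrative only: the input layout (a collection date paired with a mapping from segment identifier to validated status), the segment identifiers, and the use of None for a segment with no validated annotation at a time point are all assumptions. The default minimum interval of one day corresponds to the shortest interval mentioned above.

```python
from datetime import date, timedelta


def compare_time_points(first, second, min_gap=timedelta(days=1)):
    """Return {segment_id: (status_at_first, status_at_second)} for changes.

    first, second: (collection_date, {segment_id: validated_status}).
    """
    (first_date, first_calls), (second_date, second_calls) = first, second
    if second_date - first_date < min_gap:
        raise ValueError("second sample was collected too soon after the first")
    changes = {}
    for segment_id in sorted(set(first_calls) | set(second_calls)):
        before = first_calls.get(segment_id)   # None: not annotated
        after = second_calls.get(segment_id)
        if before != after:
            changes[segment_id] = (before, after)
    return changes
```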
[0649] Longitudinal Testing
[0650] In some embodiments, one or more liquid biopsy assays
described herein may be
used to analyze specimens from a patient taken over the course of the
patient's treatment.
For example, a blood specimen may be obtained periodically and/or upon
indication of
response to therapy, disease relapse, and/or disease progression. In some
embodiments, the
one or more liquid biopsy assays may be used on a specimen collected from the
patient each
month, every two months, every three months, every four months, every five
months, every
6-12 months, and so forth. In some embodiments, the longitudinal use of liquid
biopsy
assays may be used to track clonal evolution to identify resistance mutations.
In some
embodiments, the longitudinal use of liquid biopsy assays may be used to track
evolution of
mutations, such as EGFR or APC mutations.
[0651] In some embodiments, longitudinal use of liquid biopsy
assays may be used to
detect emerging therapy resistance mechanisms. In some embodiments,
longitudinal use of
liquid biopsy assays may be used to detect AR gene alterations. In some
embodiments,
longitudinal use of liquid biopsy assays may be used to detect WNT pathway
alterations in
mCRPC associated with resistance to enzalutamide and abiraterone. In some
embodiments,
longitudinal use of liquid biopsy assays may be used to detect ER mutations,
such as ER
mutations associated with resistance to endocrine therapy in breast cancer. In
some
embodiments, longitudinal use of liquid biopsy assays may be used to detect
EGFR mutations
responsible for anti-EGFR therapy resistance (e.g., T790M) in NSCLC. In some
embodiments, longitudinal use of liquid biopsy assays may be used to detect
KRAS, NRAS,
MET, ERBB2, FLT3, or EGFR mutations associated with primary or acquired
resistance to
EGFR inhibitors in colorectal cancer. In some embodiments, longitudinal use of
liquid
biopsy assays may be used to assess gene alterations from tumor cells shed by
primary tumor
and metastatic sites.
[0652] In some embodiments the one or more blood specimens may be
collected from the
patient in a home-based environment. For example, the blood specimens may be
collected by
a mobile phlebotomist.
[0653] For example, a first blood specimen, a second blood
specimen, and a third blood
specimen may be collected from a patient during the course of treatment.
[0654] In some embodiments, the first blood specimen may be
analyzed using at least an
improvement in somatic variant identification, e.g., as described herein in
the section entitled
"Systems and Methods for Improved Validation of Somatic Sequence Variants"
and/or
"Variant Identification," the second blood specimen may be analyzed using at
least an
improvement in somatic variant identification, e.g., as described herein in
the section entitled
"Systems and Methods for Improved Validation of Somatic Sequence Variants"
and/or
"Variant Identification," and the third blood specimen may be analyzed using
at least an
improvement in somatic variant identification, e.g., as described herein in
the section entitled
"Systems and Methods for Improved Validation of Somatic Sequence Variants"
and/or
"Variant Identification."
[0655] In some embodiments, the first blood specimen may be
analyzed using at least an
improvement in focal copy number identification, e.g., as described herein in
the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or
"Copy Number Variation," the second blood specimen may be analyzed using at
least an
improvement in focal copy number identification, e.g., as described herein in
the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or
"Copy Number Variation," and the third blood specimen may be analyzed using at
least an
improvement in focal copy number identification, e.g., as described herein in
the section
entitled "Systems and Methods for Improved Validation of Copy Number Variation"
and/or
"Copy Number Variation."
[0656] In some embodiments, the first blood specimen may be
analyzed using at least an
improvement in circulating tumor fraction determination, e.g., as described
herein in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction," the second blood specimen may be analyzed
using at
least an improvement in circulating tumor fraction determination, e.g., as
described herein in
the section entitled "Systems and Methods for Improved Circulating Tumor
Fraction
Estimates" and/or "Circulating Tumor Fraction," and the third blood specimen
may be
analyzed using at least an improvement in circulating tumor fraction
determination,
e.g., as described herein in the section entitled "Systems and Methods for
Improved
Circulating Tumor Fraction Estimates" and/or "Circulating Tumor Fraction."
[0657] Diagnostic Applications.
[0658] Referring now to Block 600-1, the present disclosure also
provides a method for
treating a patient with a cancer containing a copy number variation of a
target gene.
[0659] Referring to Block 602-1, the method comprises determining
whether the patient
has an aggressive form of cancer associated with a focal copy number variation
of the target
gene.
[0660] A focal copy number variation of a target gene can be
associated with, for
example, recurrence, high-grade forms of a cancer, aggressive forms of a
cancer, tumor
growth, and/or other aberrations. See, for example, Nord et al., Int. J.
Cancer, 126, 1390-
1402 (2010), which is hereby incorporated herein by reference in its entirety.
[0661] In some embodiments, the target gene is any of the
embodiments described above.
For example, referring to Block 604-1, in some embodiments, the target gene is
any of the
genes listed in Table 1. Referring to Block 606-1, in some embodiments, the
target gene is
MET, EGFR, ERBB2, CD274, CCNE1, MYC, BRCA1 or BRCA2.
[0662] Referring to Block 608-1, the method further comprises
obtaining a first
biological sample of the cancer from the patient. In some embodiments, the
biological
sample is a liquid biopsy sample or a solid tissue biological sample. In some
embodiments,
the biological sample is a liquid biopsy sample or a tumor biopsy sample. In
some
embodiments, the biological sample comprises (e.g., is obtained, prepared,
sequenced, and/or
analyzed by) any of the methods and/or embodiments described above, or any
modifications,
substitutions, and/or combinations thereof as will be apparent to one skilled
in the art.
[0663] Referring to Block 610-1, the method further comprises
performing copy number
variation analysis on the first biological sample to identify the copy number
status of the
target gene in the cancer, where the copy number variation analysis generates
a first dataset.
[0664] The first dataset includes a plurality of bin-level
sequence ratios, each respective
bin-level sequence ratio in the plurality of bin-level sequence ratios
corresponding to a
respective bin in a plurality of bins, where each respective bin in the
plurality of bins
represents a corresponding region of a human reference genome, and each
respective bin-
level sequence ratio in the plurality of bin-level sequence ratios is
determined from a
sequencing of a plurality of nucleic acids in the first biological sample of
the cancer from the
patient and one or more reference samples.
[0665] The first dataset also includes a plurality of segment-
level sequence ratios, each
respective segment-level sequence ratio in the plurality of segment-level
sequence ratios
corresponding to a segment in a plurality of segments, where each respective
segment in the
plurality of segments represents a corresponding region of the human reference
genome
encompassing a subset of adjacent bins in the plurality of bins, and the
plurality of segment-
level sequence ratios is determined from a measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of adjacent bins encompassed
by the
respective segment.
[0666] The first dataset further comprises a plurality of segment-
level measures of
dispersion, each respective segment-level measure of dispersion in the
plurality of segment-
level measures of dispersion (i) corresponding to a respective segment in the
plurality of
segments and (ii) determined using the plurality of bin-level sequence ratios
corresponding to
the subset of adjacent bins encompassed by the respective segment.
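The three components of the first dataset described in paragraphs [0664] to [0666] can be pictured with a short sketch that derives the segment-level values from the bin-level ratios. It assumes the median as the measure of central tendency and the sample standard deviation as the measure of dispersion; both are illustrative choices. Segments are given as half-open index ranges over the ordered list of bins, which is a hypothetical representation.

```python
from statistics import median, stdev


def summarize_segments(bin_ratios, boundaries):
    """Compute a segment-level ratio and dispersion for each segment.

    bin_ratios: bin-level sequence ratios, ordered by genomic position.
    boundaries: list of (start, end) half-open index ranges of adjacent bins.
    """
    summaries = []
    for start, end in boundaries:
        ratios = bin_ratios[start:end]
        if len(ratios) < 2:
            raise ValueError("a segment needs at least two bins here")
        summaries.append({
            "bins": (start, end),
            "segment_ratio": median(ratios),  # measure of central tendency
            "dispersion": stdev(ratios),      # measure of dispersion
        })
    return summaries
```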
[0667] Methods for obtaining the first dataset, including
binning, segmenting, calculating
sequence ratios and measures of dispersion, normalizing and/or preprocessing,
can comprise
any of the methods and/or embodiments described above, or any modifications,
substitutions,
and/or combinations thereof as will be apparent to one skilled in the art.
[0668] Referring to Block 612-1, the method further comprises
determining whether the
copy number variation of the target gene is a focal copy number variation by
applying the
first dataset to an algorithm having a plurality of copy number variation
filters.
[0669] Referring to Block 614-1, in some embodiments, the
plurality of copy number
variation filters comprises a measure of central tendency bin-level sequence
ratio filter that is
fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds, thus determining that the
copy number
variation of the target gene is not a focal copy number variation when fired.
[0670] Referring to Block 616-1, in some embodiments, the
plurality of copy number
variation filters further comprises a confidence filter that is fired when the
segment-level
measure of dispersion corresponding to the respective segment fails to satisfy
a confidence
threshold, thus determining that the copy number variation of the target gene
is not a focal
copy number variation when fired.
[0671] Referring to Block 618-1, in some embodiments, the
plurality of copy number
variation filters further comprises a measure of central tendency-plus-
deviation bin-level
sequence ratio filter that is fired when a measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment fails to satisfy one or more measure of central tendency-plus-
deviation bin-level
sequence ratio thresholds. The one or more measure of central tendency-plus-
deviation bin-
level copy ratio thresholds are derived from (i) a measure of the bin-level
sequence ratios
corresponding to the plurality of bins that map to the same chromosome of the
human
reference genome as the respective segment, and (ii) a measure of dispersion
across the bin-
level sequence ratios corresponding to the plurality of bins that map to the
respective
chromosome. The method further comprises determining that the copy number
variation of
the target gene is not a focal copy number variation when the filter is fired.
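The derivation of the plus-deviation thresholds from chromosome-level values can be sketched as follows. This is an assumption-laden illustration: it takes the median of the chromosome's bin-level ratios as the measure in (i), their sample standard deviation as the dispersion in (ii), and combines them as center plus or minus k times dispersion, where the multiplier k is hypothetical. The firing directions follow the amplification and deletion filters described earlier in this disclosure.

```python
from statistics import median, stdev


def plus_deviation_thresholds(chromosome_bin_ratios, k=1.0):
    """Return (lower, upper) thresholds from chromosome-level bin ratios."""
    center = median(chromosome_bin_ratios)
    spread = stdev(chromosome_bin_ratios)
    return center - k * spread, center + k * spread


def plus_deviation_filter_fired(segment_bin_ratios, chromosome_bin_ratios,
                                status, k=1.0):
    """True when the segment fails the plus-deviation check for its status."""
    lower, upper = plus_deviation_thresholds(chromosome_bin_ratios, k)
    segment_center = median(segment_bin_ratios)
    if status == "amplification":
        return segment_center < upper   # not far enough above the chromosome
    if status == "deletion":
        return segment_center > lower   # not far enough below the chromosome
    raise ValueError("status must be 'amplification' or 'deletion'")
```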
[0672] Referring to Block 620-1, in some embodiments, the
plurality of copy number
variation filters further comprises a segment-level sequence ratio filter that
is fired when the
segment-level sequence ratio corresponding to the respective segment fails to
satisfy one or
more segment-level sequence ratio thresholds, thus determining that the copy
number
variation of the target gene is not a focal copy number variation when fired.
[0673] In some embodiments, the plurality of copy number
variation filters comprises
any of the methods and/or embodiments described above, or any modifications,
substitutions,
and/or combinations thereof as will be apparent to one skilled in the art.
[0674] Referring to Block 622-1, the method further comprises,
when the patient has the
aggressive form of cancer associated with focal copy number variation of the
target gene,
administering a first therapy for the aggressive form of the cancer to the
patient, and when the
patient does not have the aggressive form of cancer associated with focal copy
number
variation of the target gene, administering a second therapy for a less
aggressive form of the
cancer to the patient.
[0675] Referring to Block 624-1 and Block 626-1, in some
embodiments, the first therapy
is selected from Table 2. In some such embodiments, the first therapy is
trastuzumab,
lapatinib, or crizotinib.
[0676] Referring to Block 628-1 and Block 630-1, in some
embodiments, the method
further comprises generating a report (e.g., for use by a physician)
comprising the copy
number status of the target gene. In some such embodiments, the generated
report further
comprises matched therapies (e.g., treatments and/or clinical trials) based on
the copy number
status of the respective segment. When the patient has the aggressive form of
cancer
associated with focal copy number variation of the target gene, a first
therapy for the
aggressive form of the cancer is matched to the patient, and when the patient
does not have
the aggressive form of cancer associated with focal copy number variation of
the target gene,
a second therapy for a less aggressive form of the cancer is matched to the
patient.
[0677] The present disclosure also provides a computer system
comprising one or more
processors and a non-transitory computer-readable medium including computer-
executable
instructions that, when executed by the one or more processors, cause the
processors to
perform any of the methods and embodiments disclosed herein.
[0678] The present disclosure also provides a non-transitory
computer-readable storage
medium having stored thereon program code instructions that, when executed by
a processor,
cause the processor to perform any of the methods and embodiments disclosed
herein.
[0679] In some embodiments, the methods described herein include
generating a clinical
report 139-3 (e.g., a patient report), providing clinical support for
personalized cancer
therapy, and/or using the information curated from sequencing of a liquid
biopsy sample, as
described above. In some embodiments, the report is provided to a patient,
physician,
medical personnel, or researcher in a digital copy (for example, a JSON
object, a pdf file, or
an image on a website or portal) or a hard copy (for example, printed on paper
or another
tangible medium). A report object, such as a JSON object, can be used for
further processing
and/or display. For example, information from the report object can be used to
prepare a
clinical laboratory report for return to an ordering physician. In some
embodiments, the
report is presented as text, as audio (for example, recorded or streaming), as
images, or in
another format and/or any combination thereof.
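A report object of the kind described in paragraph [0679] might be serialized to JSON and later read back to prepare a text summary. The sketch below shows that round trip under stated assumptions: the field names, the patient identifier format, and the layout of the summary are hypothetical and are not specified by the disclosure.

```python
import json


def to_report_json(patient_id, copy_number_variants):
    """Serialize a minimal report object to a JSON string."""
    report = {
        "patient_id": patient_id,
        "copy_number_variants": copy_number_variants,
    }
    return json.dumps(report, sort_keys=True)


def render_summary(report_json):
    """Build a plain-text summary from a serialized report object."""
    report = json.loads(report_json)
    lines = ["Report for " + report["patient_id"]]
    for variant in report["copy_number_variants"]:
        lines.append(variant["gene"] + ": " + variant["status"])
    return "\n".join(lines)
```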
[0680] The report includes information related to the specific
characteristics of the
patient's cancer, e.g., detected genetic variants, epigenetic abnormalities,
associated
oncogenic pathogenic infections, and/or pathology abnormalities. In some
embodiments,
other characteristics of a patient's sample and/or clinical records are also
included in the
report. For example, in some embodiments, the clinical report includes
information on
clinical variants, e.g., one or more of copy number variants (e.g., for
actionable genes
CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2),
fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK,
ROS1, RET,
NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide
polymorphisms, insertion-deletions (e.g., somatic/tumor and/or
germline/normal), therapy
biomarkers, microsatellite instability status, and/or tumor mutational burden.
[0681] Conversion of solid tumor test to liquid biopsy test. In
one embodiment, the solid
tissue sample is insufficient for NGS testing (for example, the sample is too
small or too
degraded, the amount or quality of nucleic acids extracted from the sample
does not result in
quality NGS results that would result in reliable determination of variants
and/or other
genetic characteristics of the sample), and the physician or patient may
decide to convert the
solid tissue test that was ordered to a liquid biopsy test to be performed on
a liquid biopsy
sample collected from the same patient. The resulting report and/or display of
the results on
a portal may include an "xF Conversion Badge" to distinguish any order that
has been
converted from solid tissue test to a liquid biopsy test (compared to, for
example, a liquid
biopsy test that was not initially ordered as a solid tissue test). This will
allow a user to
identify which orders have been converted by this process, and distinguish
between orders
that were intentionally placed for the liquid biopsy panel.
[0682] Longitudinal Reporting. In various embodiments, a report
may include and/or
compare the results of multiple liquid biopsy tests and/or solid tumor tests
(for example,
multiple tests associated with the same patient). The results of multiple
liquid biopsy tests
and/or solid tumor tests may be displayed on a portal in a variety of
configurations that may
be selected and/or customized by the viewer. The tests may have been performed
at different
times, and the samples on which the tests were performed may have been
collected at
different times.
[0683] Download result. Clinical and/or molecular data associated
with a patient (for
example, information that would be included in the report), may be aggregated
and made
available via the portal. Any portion of the report data may be available for
download (for
example, as a CSV file) by the physician and/or patient. In various
embodiments, the data
may include data related to genetic variants, RNA expression levels,
immunotherapy markers
(including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician
or medical
facility has ordered multiple tests (all tests may be associated with the same
patient or tests
may be associated with multiple patients), results associated with more than
one test may be
aggregated into a single file for downloading.
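One possible way to aggregate results from more than one test into a single downloadable CSV file is sketched below. The column names and the shape of the input (`tests` mapping a test identifier to a list of row dictionaries) are assumptions made for illustration.

```python
import csv
import io


def aggregate_results_csv(tests):
    """Write rows from several tests into one CSV string.

    `tests` maps a test identifier to a list of row dicts. The column
    names below are hypothetical.
    """
    fieldnames = ["test_id", "patient_id", "gene", "variant", "vaf"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for test_id, rows in tests.items():
        for row in rows:
            # Tag each row with the test it came from.
            writer.writerow({"test_id": test_id, **row})
    return buffer.getvalue()
```

Tests belonging to the same patient or to different patients are handled identically, because every row carries its own `test_id` and `patient_id`.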
Systems and Methods for Improved Validation of Somatic Sequence Variants
[0684] Below, systems and methods for improving validation of
somatic sequence
variants, e.g., within the context of the methods and systems described above,
are described
with reference to Figures 5A2 and 5B2.
[0685] Many of the embodiments described below, in conjunction
with Figures 5A2 and
5B2, relate to analyses performed using sequencing data for cfDNA obtained
from a liquid
biopsy sample of a cancer patient. Generally, these embodiments are
independent and, thus,
not reliant upon any particular DNA sequencing methods. However, in some
embodiments,
the methods described below include generating the sequencing data.
[0686] For example, provided herein is a generalized application
of Bayes' Theorem
through the likelihood ratio test for diagnostic assays that allows dynamic
calibration of
filtering thresholds for somatic sequence variant detection in a patient, in
accordance with
some embodiments of the present disclosure. These thresholds are based on
sample specific
error rate, error rate from a pool of process matched healthy control samples,
and/or a cohort
of human solid tumors to inform our probability models. The method takes the
form of the
following formula:
$$\mathrm{odds}(\text{post-test}) = \mathrm{odds}(\text{pre-test}) \times \frac{\text{sensitivity}}{1 - \text{specificity}}$$
where:
odds(post-test) is the post-test odds of a variant being positive given the
application of Bayes' Theorem,
odds(pre-test) is the pre-test odds of a positive given the cancer type of
the
patient and the prevalence (measured as a fraction) of alterations detected in
that gene or
within a specific genomic window within a reference population with the cancer
type,
sensitivity is the sensitivity bin nearest that measured for the assay at a
proposed
circulating variant fraction,
specificity is a term to be solved for, denoting the level of uncertainty that
is
acceptable given some fixed value of odds(post-test). Specificity can be
replaced by the
quantile of the beta-binomial distribution (see below) defined by the
within-sample
trinucleotide error rate and the background base-position-specific error rate,
d(beta-binomial) is a beta binomial distribution defined by specified
parameters
(alpha, beta, Pr), and
Min(A0) is the minimum number of alternate alleles observed for a given
sample.
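The likelihood-ratio relationship above can be evaluated directly. The sketch below is a minimal illustration; the function names are hypothetical, and the numeric inputs used in any call are the caller's own assumptions rather than values from the present disclosure.

```python
def post_test_odds(pre_test_odds, sensitivity, specificity):
    """odds(post-test) = odds(pre-test) * sensitivity / (1 - specificity)."""
    if not 0.0 <= specificity < 1.0:
        raise ValueError("specificity must be in [0, 1)")
    return pre_test_odds * sensitivity / (1.0 - specificity)


def odds_to_probability(odds):
    """Convert odds back to a probability: p = odds / (1 + odds)."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)
```

For instance, pre-test odds of 0.25 combined with a sensitivity of 0.9 and a specificity of 0.99 give a likelihood ratio of 90 and post-test odds of about 22.5.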
[0687] Given a fixed value for odds(post-test), it is possible
to solve instead for
specificity or, rather, the minimum acceptable quantile of the beta binomial
error distribution.
Therefore, the equation can be reframed as:
$$\text{specificity} = 1 - \left(\text{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})}\right)$$
Solving for specificity gives the quantile of the beta binomial function which
can then be
plugged into quantile(beta-binomial) to derive a minimum number of alternative
alleles
observed at a given depth, or:
$$\mathrm{Min}(A0) = \mathrm{quantile}\left(1 - \left(\text{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})}\right),\ d(\text{beta-binomial})\right)$$
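The two reframed equations can be sketched in code as follows. This is an illustration under stated assumptions: the beta-binomial is parameterized directly by `alpha` and `beta` at a given read depth, and how those parameters would be derived from the within-sample trinucleotide error rate and the background position-specific error rate is not shown. All function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for a beta-binomial with n trials and shape (alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(
        log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    )


def beta_binomial_quantile(q, n, alpha, beta):
    """Smallest k whose cumulative probability is at least q."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += beta_binomial_pmf(k, n, alpha, beta)
        if cumulative >= q:
            return k
    return n


def required_specificity(sensitivity, pre_test_odds, post_test_odds):
    """specificity = 1 - sensitivity * odds(pre-test) / odds(post-test)."""
    return 1.0 - sensitivity * pre_test_odds / post_test_odds


def min_alt_observations(sensitivity, pre_test_odds, post_test_odds,
                         depth, alpha, beta):
    """Min(A0): error-distribution quantile at the required specificity."""
    spec = required_specificity(sensitivity, pre_test_odds, post_test_odds)
    return beta_binomial_quantile(spec, depth, alpha, beta)
```

In this sketch, demanding higher post-test odds raises the required specificity and therefore can only keep or raise the returned count, never lower it.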
[0688] Determining pre-test probability:
[0689] In some embodiments, pre-test probability, which is
related to odds(pre-test), is
defined as:
$$\mathrm{odds}(\text{pre-test}) = \frac{\mathrm{probability}(\text{pre-test})}{1 - \mathrm{probability}(\text{pre-test})}$$
and is determined through historical data derived from matched solid tumor
test data. By
analyzing an extensive set of cancers and using process matched liquid biopsy
and tissue
biopsy samples to identify somatic variants with high confidence, it is
possible to accurately
assess the prevalence of specific variants within a population of advanced
human cancers.
For a population of patients most likely to require liquid biopsy type tests,
the sampling
distribution most closely models the distribution into which any given patient
receiving the
test will fall. To model this prevalence, there are two factors at play: gene
level prevalence,
and genomic window level prevalence.
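As a small worked example of the relationship between prevalence, pre-test probability, and pre-test odds, the sketch below computes prevalence as a fraction of a reference cohort and converts it to odds. The function names and the cohort counts are hypothetical.

```python
def cohort_prevalence(n_with_alteration, n_patients):
    """Fraction of patients in the reference cohort carrying an alteration."""
    if n_patients <= 0:
        raise ValueError("cohort must contain at least one patient")
    return n_with_alteration / n_patients


def prevalence_to_odds(probability):
    """odds(pre-test) = probability(pre-test) / (1 - probability(pre-test))."""
    if not 0.0 <= probability < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return probability / (1.0 - probability)


# Hypothetical cohort: 30 of 150 patients carry an alteration in the region.
pre_test_probability = cohort_prevalence(30, 150)
pre_test_odds = prevalence_to_odds(pre_test_probability)
```

The same two steps apply whether the prevalence is measured at the gene level or within a genomic window.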
[0690] Assessing prevalence by sliding window segmentation:
[0691] In some embodiments, in order to get an accurate estimate
of prevalence, it is
critical to stratify the estimated rate of mutation by the mechanism of disease.
Gain of
function (GOF) mutations tend to cluster in "hotspots," whereas loss of
function (LOF)
mutations tend to be scattered throughout a gene and suppress or eliminate a
protein's wild
type behaviors. Due to this evolutionary constraint on mutation position,
prevalence
calculation must take into account whether a gene has a GOF or LOF mechanism
of disease.
While this cannot be directly analyzed given available data, it is possible to
bootstrap this
calculation by segmentation of mutational prevalence across exons.
[0692] Based on historical sequencing data, it is possible to bin
mutations by exon. In
order to assess whether a single exon is enriched for mutations over the rest
of the gene (a
hotspot or GOF gene), a rolling Poisson test of difference is applied jumping
from exon to
exon. If one or more exons show statistically
significant deviation from
other exons within the gene, that region is annotated as the window of
interest. Prevalence is
subsequently calculated as prevalence within the exon(s) encompassing the
window of
interest.
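The disclosure does not spell out the exact form of the rolling Poisson test, so the following is only one plausible sketch. It assumes an exact conditional test for two Poisson counts (given their total, the first count is binomial when the two rates are equal), equal exposure per exon unless stated otherwise, and a rule that flags an exon only when it is enriched over every adjacent exon. The function names, the flagging rule, and the significance level `alpha` are all assumptions, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(count_a, count_b, exposure_a=1.0, exposure_b=1.0):
    """One-sided p-value that exon A has a higher mutation rate than exon B.

    Conditional test for two Poisson counts: given the total, count_a
    follows Binomial(total, exposure_a / (exposure_a + exposure_b))
    when the two rates are equal.
    """
    total = count_a + count_b
    if total == 0:
        return 1.0
    p0 = exposure_a / (exposure_a + exposure_b)
    return sum(
        comb(total, k) * p0**k * (1.0 - p0) ** (total - k)
        for k in range(count_a, total + 1)
    )


def hotspot_exons(exon_counts, alpha=0.01):
    """Indices of exons enriched over every adjacent exon (assumed rule)."""
    flagged = []
    for i, count in enumerate(exon_counts):
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(exon_counts)]
        if neighbors and all(
            enrichment_p_value(count, exon_counts[j]) < alpha for j in neighbors
        ):
            flagged.append(i)
    return flagged
```

An empty result from `hotspot_exons` corresponds to the case in which no exon is over-represented for mutations.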
[0693] If no exons can be shown to be over-represented for
mutations, the gene is
assumed to have an LOF mechanism of action and the prevalence for the whole
gene having
an alteration within the specified cancer type is used. When a variant is
being assessed for
filtering, the prevalence within the pre-specified window or the prevalence
within the gene
itself is used as the pre-test probability (Pr(pre-test)) for the likelihood
ratio test.
[0694] Referring to Block 500-2, the present disclosure provides
a method for validating
a somatic sequence variant in a test subject having a cancer condition, at a
computer system
having one or more processors, and memory storing one or more programs for
execution by
the one or more processors.
[0695] Referring to Block 502-2, the method comprises obtaining,
from a first
sequencing reaction, a corresponding sequence of each cell-free DNA fragment
in a first
plurality of cell-free DNA fragments in a liquid biopsy sample of the test
subject, thus
obtaining a first plurality of sequence reads, e.g., a plurality of
de-duplicated sequence reads,
where each sequence read corresponds to a unique cell-free DNA fragment from
the sample.
In some embodiments, the first plurality of sequence reads includes at least
1000 sequence
reads. In some embodiments, the first plurality of sequence reads includes at
least 10,000
sequence reads. In some embodiments, the first plurality of sequence reads
includes at least
100,000 sequence reads. In some embodiments, the first plurality of sequence
reads includes
at least 200,000, 300,000, 400,000, 500,000, 750,000, 1,000,000, 2,500,000,
5,000,000
sequence reads, or more.
[0696] In some embodiments, the liquid biopsy sample is blood. In
some embodiments,
the liquid biopsy sample comprises blood, whole blood, peripheral blood,
plasma, serum, or
lymph of the test subject.
[0697] In some embodiments, the cancer condition is a particular
type and stage of cancer
(e.g., stage 2 lung cancer). Advantageously, the variant filtering methods
described herein
are superior to filtering methods that simply account for the tumor fraction
of a sample. This
is achieved, in part, by accounting for the types of mutations found in a
particular type of
cancer, which improves the quality of the pre-test probability of finding a
particular type of
variant (e.g., a variant within a particular genomic region) in a sample from
a subject with a
known type of cancer. Accordingly, in some embodiments, the pre-test
probabilities are
based on as specific a cancer type as possible, e.g., accounting for one or
more of a type of
cancer, an origin of the cancer, the stage of the cancer, any previously known
genomic
variants in the cancer (e.g., whether a breast cancer subject is BRCA1 or
BRCA2 positive), a
personal characteristic of the subject (e.g., age, gender, race, smoking
status, alcohol
consumption status, etc.), any pathology classification of the cancer, etc.
However, there are
practical considerations when determining the level of specificity at which a
subject's cancer
should be classified when matching the cancer to a training cohort. For
instance, when an
insufficient number of matching training samples is available
for calculation
of pre-test odds, the specificity of the cancer classification should be
reduced in order to
provide a large enough sample of training data to provide meaningful prior
information.
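The trade-off described above can be sketched as a simple back-off: start from the most specific cancer classification and drop descriptors until the matching training cohort is large enough. The descriptor ordering, the `cohort_sizes` lookup, and the `min_samples` cutoff are hypothetical and would need to be chosen for a real cohort.

```python
def select_training_cohort(cohort_sizes, descriptors, min_samples):
    """Return the most specific classification with enough training samples.

    descriptors: ordered from most general to most specific, for example
        ("lung", "stage 2", "smoker").
    cohort_sizes: maps a tuple prefix of descriptors to a sample count.
    Returns the chosen prefix, or None if even the broadest class is too small.
    """
    for depth in range(len(descriptors), 0, -1):
        key = tuple(descriptors[:depth])
        if cohort_sizes.get(key, 0) >= min_samples:
            return key
    return None
```

With hypothetical counts of 500 samples for `("lung",)`, 120 for `("lung", "stage 2")`, and 8 for `("lung", "stage 2", "smoker")`, a cutoff of 50 selects `("lung", "stage 2")`.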
[0698] In some embodiments, the test subject, the liquid biopsy
sample, the cancer
condition, and/or methods and systems for obtaining, accessioning, storing,
processing,
preparing and/or analyzing thereof, comprise any of the embodiments as
described above in
the present disclosure with reference to Figures 2-4.
[0699] In some embodiments, the first sequencing reaction is a
panel-enriched
sequencing reaction. For example, in some embodiments, the first sequencing
reaction is a
panel-enriched sequencing reaction of a first plurality of enriched loci, and
each respective
locus in the plurality of enriched loci is sequenced at an average unique
sequence depth of at
least 250x. In some such embodiments, each respective locus in the plurality
of enriched loci
is sequenced at an average unique sequence depth of at least 1000x. In some
embodiments,
the first plurality of sequence reads is obtained from ultra-high depth
sequencing (e.g., where
each locus in a plurality of loci is sequenced at an average coverage of at
least 1000x, at
least 2500x, or at least 5000x). Example genes that are informative for
precision oncology,
e.g., when implemented in a liquid biopsy-based assay, are shown in Table 1.
In some
embodiments, a panel-enriched sequencing reaction described herein uses a
probe set that
includes at least 5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 75, 80, 90, 100, or
all 105 of the genes
listed in Table 1.
[0700] In some embodiments, the first sequencing reaction is a
whole genome sequencing
reaction, and the average sequencing depth of the reaction across the genome
is at least 5x,
10x, 15x, 20x, 25x, 30x, 40x, 50x, or higher.
[0701] In some embodiments, the first plurality of sequence reads
includes at least 50,000
sequence reads, at least 100,000 sequence reads, at least 250,000 sequence
reads, at least
500,000 sequence reads, at least 1,000,000 sequence reads, at least 5,000,000
sequence reads,
or more.
[0702] In some embodiments, the first sequencing reaction and/or
the first plurality of
sequence reads includes any of the embodiments as described above in the
present disclosure.
For example, in some embodiments, methods and systems for nucleic acid
extraction, library
preparation, capture and hybridization, pooling, sequencing, aligning,
normalization and/or
other sequence read processing comprise any of the embodiments as described
above in the
present disclosure with reference to Figures 2-4.
[0703] Referring to Block 504-2, the method further comprises
aligning each respective
sequence read in the first plurality of sequence reads to a reference sequence
for the species
of the subject thus identifying (i) a variant allele fragment count for a
candidate variant,
where the candidate variant maps to a locus in the reference sequence, and
(ii) a locus
fragment count for the locus encompassing the candidate variant. In some
embodiments, the
variant allele fragment count refers to a unique number of sequence reads in
the test subject
that encompass the candidate variant. In some embodiments, the locus fragment
count refers
to the number of sequence reads in the test subject that map to the respective
locus
encompassing the candidate variant.
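To make the two counts concrete, the sketch below derives them from a toy set of already aligned, de-duplicated reads. It assumes a deliberately simplified read representation (a 0-based start coordinate plus the aligned bases, with no insertions, deletions, or clipping) and a single-base candidate variant; real alignments would need CIGAR-aware handling. The function name is hypothetical.

```python
def fragment_counts(reads, position, alt_base):
    """Return (variant allele fragment count, locus fragment count).

    reads: iterable of (start, bases) pairs, where `start` is the 0-based
        reference coordinate of the first aligned base (gapless alignment
        assumed).
    position: 0-based reference coordinate of the candidate variant.
    alt_base: the alternate base that defines the candidate variant.
    """
    variant_count = 0
    locus_count = 0
    for start, bases in reads:
        offset = position - start
        if 0 <= offset < len(bases):
            locus_count += 1  # read covers the locus
            if bases[offset] == alt_base:
                variant_count += 1  # read supports the candidate variant
    return variant_count, locus_count
```

Because each read is assumed to represent one unique cell-free DNA fragment, the counts returned here are fragment counts rather than raw read counts.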
[0704] As described above, in some embodiments, the reference
sequence is a reference
genome, e.g., a reference human genome. In some embodiments, a reference
genome has
several blacklisted regions, such that the reference genome covers only about
75%, 80%,
85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species
of the
subject. In some embodiments, the reference sequence for the subject covers at
least 10% of
the entire genome for the species of the subject, or at least 15%, 20%, 25%,
30%, 35%, 40%,
45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the
species of the
subject. In some embodiments, the reference sequence for the subject
represents a partial or
whole exome for the species of the subject. For instance, in some embodiments,
the
reference sequence for the subject covers at least 10% of the exome for the
species of the
subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%,
70%, 75%,
80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of
the
subject. In some embodiments, the reference sequence covers a plurality of
loci that
constitute a panel of genomic loci, e.g., a panel of genes used in a panel-
enriched sequencing
reaction. Examples of genes useful for precision oncology, e.g., which may
be targeted
with such a panel, are shown in Table 1. Accordingly, in some embodiments, the
reference
sequence for the subject covers at least 100 kb of the genome for the species
of the subject.
In other embodiments, the reference sequence for the subject covers at least
250 kb, 500 kb,
750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the
genome
for the species of the subject. However, in some embodiments, there is no size
limitation of
the reference sequence. For example, in some embodiments, the reference
sequence can be a
sequence for a single locus (e.g., a single exon, gene, etc.) within the
genome for the species
of the subject.
[0705] Referring to Block 506-2, the method further comprises
comparing the variant
allele fragment count for the candidate variant against a dynamic variant
count threshold for
the locus in the reference sequence that the candidate variant maps to. The
dynamic variant
count threshold is based upon a pre-test odds of a positive variant call for
the locus based
upon the prevalence of variants in a genomic region that includes the locus
from a first set of
nucleic acids obtained from a cohort of subjects having the cancer condition.
[0706] For example, in some embodiments, the dynamic variant
count threshold is
determined based on the number of sequence variants that map to the respective
locus,
obtained from a sequencing of nucleic acids from a cohort of subjects having
the cancer
condition (e.g., a baseline variant threshold). In some embodiments, the
cohort of subjects
having the cancer condition is matched to at least one personal
characteristic of the test
subject (e.g., age, gender, race, smoking status, average alcohol consumption,
other
underlying medical conditions, etc.).
[0707] In some embodiments, the dynamic variant count threshold
is also based upon a
sequencing error rate for the sequencing reaction. For example, in some such
embodiments,
the sequencing error rate for the sequencing reaction is a trinucleotide
sequencing error rate.
In some embodiments, the dynamic variant count threshold is also based upon a
background
sequencing error rate determined for the locus.
[0708] Referring to Block 508-2, in some embodiments, the method
further comprises
obtaining a distribution of variant detection sensitivities as a function of
circulating variant
allele fraction from the cohort of subjects. The distribution of variant
detection sensitivities
is based on the circulating variant allele fraction of a second set of nucleic
acids collected
from the cohort of subjects relative to variant alleles detected in the first
set of nucleic acids
collected from the cohort of subjects. The first set of nucleic acids are from
solid tumor
biopsies of the cohort of subjects, and the second set of nucleic acids are
cell-free nucleic
acids from liquid biopsies of the cohort of subjects.
[0709] Figure 6A2 illustrates a flow chart of a method 600-2 for
obtaining a distribution
of variant detection sensitivities as a function of circulating variant allele
fraction from a
cohort of subjects, in accordance with some embodiments of the present
disclosure. For
example, referring to Block 602-2, matched liquid biopsy and solid tumor
samples are
obtained from a set of training subjects. In some embodiments, the training
subjects
comprise any of the cancer conditions, personal characteristics, and/or
feature data described
above in the present disclosure. Furthermore, in some embodiments, obtaining
the matched
liquid biopsy and solid tumor samples comprises any of the methods and
embodiments
described above in the present disclosure.
[0710] Referring to Block 604-2, the solid tumor sample is
sequenced (e.g., by extracting
nucleic acids from the solid tumor sample and performing a sequencing reaction
for the
sample). The plurality of sequence reads obtained from sequencing the solid
tumor sample
are aligned to a reference genome (e.g., a human reference genome), thus
determining any
sequence variants included in the solid tumor sample. Referring to Block 606-2,
the liquid
biopsy sample is sequenced as described above for the solid tumor sample, thus
determining
any sequence variants included in the liquid biopsy sample.
[0711] Referring to Block 608-2, the results of the sequencing
reactions are compared by
comparing the sequence variants detected in the liquid biopsy sample against
the sequence
variants detected in the solid tumor sample (e.g., a measure of how many of
the variants
detected in the solid tumor sample were also detected in the liquid biopsy
sample, or a
circulating variant allele fraction). The comparison determines a variant
detection sensitivity
for each variant (e.g., corresponding to a respective locus) in the liquid
biopsy sample.
Referring to Block 610-2, each variant detection sensitivity is binned, in a
plurality of bins,
with respect to an estimated tumor fraction for the liquid biopsy sample, thus
obtaining a
distribution of variant detection sensitivities.
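The binning step of Blocks 608-2 and 610-2 can be sketched as below. The input format (one `(fraction, detected)` pair per variant known from the solid tumor sample), the bin edges, and the function name are assumptions. `fraction` stands for whichever quantity is used for binning, such as an estimated tumor fraction or a circulating variant allele fraction.

```python
import bisect


def sensitivity_by_bin(observations, bin_edges):
    """Fraction of known variants detected in each bin.

    observations: iterable of (fraction, detected) pairs, one per variant
        found in a solid tumor sample; `detected` says whether the matched
        liquid biopsy also showed it.
    bin_edges: ascending edges; bin i covers [bin_edges[i], bin_edges[i + 1]).
    Returns one sensitivity per bin, or None for a bin with no observations.
    """
    n_bins = len(bin_edges) - 1
    totals = [0] * n_bins
    hits = [0] * n_bins
    for fraction, detected in observations:
        index = bisect.bisect_right(bin_edges, fraction) - 1
        index = min(max(index, 0), n_bins - 1)  # clamp values at the outer edges
        totals[index] += 1
        hits[index] += int(detected)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Bins that receive no observations return `None` rather than a sensitivity of zero, so that missing data is not mistaken for a failure to detect.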
[0712] In some embodiments, a distribution of variant detection
sensitivities is
established based on a set of training samples (e.g., sensitivity distribution
training data) with
known variant allele fractions, e.g., samples derived from a solid tumor
sample for which one
or more variant allele fraction has been determined (e.g., by deep sequencing
of the sample).
For example, in some embodiments, nucleic acids from each of a plurality of
training samples
181 having a known variant allele fraction 184 for one or more variant alleles
183 are
sequenced according to a process-matched sequencing reaction (e.g., using a
substantially
identical or identical sequencing reaction), and it is determined whether each
sequence
variant can be detected, e.g., defining a detection status 185 for each
locus/variant 183. Over
a large number of training samples, a specificity of detection of variants
having different
variant allele fractions can be determined. In some embodiments, the
specificity is
determined on a locus-by-locus basis, such that the specificity of detection
is specific for the
genomic region or locus encompassing the candidate sequence variant. In some
embodiments, the specificity is determined globally, e.g., not on a
locus-by-locus basis.
[0713] Referring again to Block 508-2, in some embodiments, the
method comprises
estimating a circulating variant fraction for the candidate variant. In some
embodiments, the
circulating variant fraction for the candidate variant is the ratio of the
variant allele fragment
count to the locus fragment count (e.g., the proportion of sequence reads that
include the
candidate variant in the plurality of sequence reads that map to the
respective locus
encompassing the variant). In some embodiments, the circulating variant
fraction is based
only upon the variant allele frequency for that locus. In some alternative
embodiments, the
circulating variant fraction is a circulating tumor fraction determined for
the sample.
[0714] For example, in some embodiments, the circulating variant
fraction is specific to
the variant being validated. In some such embodiments, the estimated variant
fraction is
determined by calculating the percentage of sequence reads encompassing the
locus that
include the variant (e.g., a variant allele fraction).
[0715] In some embodiments, the estimated circulating variant
fraction for the candidate
variant is an estimated tumor fraction for the sample, where the estimated
tumor fraction for
the sample is estimated based on a second sequencing reaction comprising
low-pass whole-genome
methylation sequencing of a second plurality of cell-free DNA fragments
in the
liquid biopsy sample of the test subject.
[0716] In some such embodiments, the dynamic threshold for the
locus is set based upon
a desired variant detection specificity determined by the relationship:
$$\text{specificity} = 1 - \left(\text{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})}\right)$$
where sensitivity is the variant detection sensitivity in the distribution of
variant
detection sensitivities that corresponds to the circulating variant fraction
for the candidate
variant, odds(post-test) is the post-test odds of a positive variant call for
the locus, and
odds(pre-test) is the pre-test odds of the positive variant call for the
locus.
[0717] In some embodiments, the specificity is used to select a
quantile of a beta-
binomial distribution of the minimal variant allele fragment count required to
support a
positive variant call for the locus, thus defining the dynamic threshold for
the locus. The
beta-binomial distribution is defined by the sequencing error rate for the
sequencing reaction
and the background sequencing error rate determined for the locus. For
example, in some
embodiments, the minimum number of alternative alleles required to validate a
positive
variant call is represented by
$$\mathrm{Min}(A0) = \mathrm{quantile}\left(1 - \left(\text{sensitivity} \times \frac{\mathrm{odds}(\text{pre-test})}{\mathrm{odds}(\text{post-test})}\right),\ d(\text{beta-binomial})\right)$$
[0718] In some embodiments, as described in Figure 6A2, obtaining
the distribution of
variant detection sensitivities comprises binning variant detection
sensitivities in a plurality
of bins as a function of circulating variant allele fraction. Each bin in the
plurality of bins is
associated with a corresponding variant detection sensitivity, and sensitivity
is the variant
detection sensitivity corresponding to the respective bin, in the plurality of
bins, that
encompasses the circulating variant fraction for the candidate variant. In
some alternative
embodiments, the distribution of variant detection sensitivities is a
continuous function.
[0719] Additional details and embodiments for obtaining
thresholds for filtering variants
(e.g., dynamic thresholds) are described above in the present disclosure (see,
Example
Methods: Variant Identification).
[0720] In some embodiments, the pre-test odds of a positive
variant call for the locus is
based on (i) the prevalence of variants in the genomic region that includes
the locus from the
first set of nucleic acids obtained from the cohort of subjects having the
cancer condition
(e.g., the percentage of patients with the particular cancer type that have a
variant in the
region of interest), and (ii) a known or inferred effect of the variants. When
the known or
inferred effect of a variant is loss-of-function (LOF) of a gene that includes
the locus, the
genomic region used to compute the pre-test probability is the entire gene,
and when the
known or inferred effect of a variant is gain-of-function (GOF) of the gene
that includes the
locus, the genomic region used to compute the pre-test probability is the
exon, of the gene,
that includes the locus.
[0721] In some such embodiments, the effect of the variants is
inferred by binning each
respective variant of the variants in the genomic region that includes the
locus from the first
set of nucleic acids obtained from the cohort of subjects having the cancer
condition into a
respective bin, in a plurality of bins for the gene that includes the locus,
corresponding to the
exon encompassing the respective variant in the gene. Each bin in the
plurality of bins
corresponds to a different exon of the respective gene. After determining
whether any bin in
the plurality of bins contains significantly more variants than the other bins
in the plurality of
bins, the effect of the sequence variant is inferred to be a gain-of-function
of the gene when a
bin contains significantly more variants than the other bins in the plurality
of bins.
Alternatively, the effect of the sequence variant is inferred to be a loss-of-
function of the gene
when no bin in the plurality of bins contains significantly more sequence
variants than the
other bins in the plurality of bins.
[0722] For example, Figures 7A2 and 7B2 illustrate a method of
inferring an effect of a
sequence variant as a gain-of-function or a loss-of-function of a gene, in
accordance with
some embodiments of the present disclosure.
[0723] Figure 7A2 illustrates a gene 700-A with a plurality of
exons 701-A, 702-A, 703-A.
Each exon corresponds to a bin in a plurality of bins. A first exon 701-A
comprises a
region of interest (e.g., a locus) that encompasses a candidate variant. A
plurality of
sequence variants (e.g., 704-A, 705-A, 706-A, 707-A, 708-A, 709-A) is obtained
from a
sequencing of nucleic acids from a cohort of subjects, where each sequence
variant maps to a
respective locus in the gene. The effect of the variants is inferred by
binning each sequence
variant into the respective bin corresponding to the exon to which the
respective variant
maps. Thus, sequence variants 704-A, 705-A, 706-A, and 707-A are binned into
the bin
corresponding to exon 701-A, sequence variant 708-A is binned into the bin
corresponding to
exon 702-A, and sequence variant 709-A is binned into the bin corresponding to
exon 703-A.
In Figure 7A2, it can be determined that the bin corresponding to exon 701-A
contains
significantly more variants than the other bins in the plurality of bins, and
thus the effect of
the sequence variant is inferred to be a gain-of-function of the gene. In such
case, the
genomic region used to compute the pre-test probability is the exon 701-A of
the gene, that
includes the locus encompassing the candidate variant.
[0724] Alternatively, Figure 7B2 illustrates a gene 700-B with a
plurality of exons 701-B,
702-B, 703-B. Each exon corresponds to a bin in a plurality of bins. A first
exon 701-B
comprises a region of interest (e.g., a locus) that encompasses a candidate
variant. A
plurality of sequence variants (e.g., 704-B, 705-B, 706-B, 707-B, 708-B,
709-B) is obtained
from a sequencing of nucleic acids from a cohort of subjects, where each
sequence variant
maps to a respective locus in the gene. The effect of the variants is inferred
by binning each
sequence variant into the respective bin corresponding to the exon to which
the respective
variant maps. Thus, sequence variants 704-B and 705-B are binned into the bin
corresponding to exon 701-B, sequence variants 706-B and 707-B are binned into
the bin
corresponding to exon 702-B, and sequence variants 708-B and 709-B are binned
into the bin
corresponding to exon 703-B. In Figure 7B2, it can be determined that no bin
in the plurality
of bins contains significantly more sequence variants than the other bins in
the plurality of
bins, and thus the effect of the sequence variant is inferred to be a loss-of-
function of the
gene. In such case, the genomic region used to compute the pre-test
probability is the entire
gene.
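The exon binning illustrated in Figures 7A2 and 7B2 can be sketched as a coordinate lookup. The exon coordinates and variant positions used in any call are invented for illustration; only the resulting per-exon counts (4, 1, 1 for Figure 7A2 and 2, 2, 2 for Figure 7B2) mirror the figures as described. The function name is hypothetical.

```python
def bin_variants_by_exon(exons, variant_positions):
    """Count cohort variants per exon.

    exons: list of (start, end) half-open coordinate pairs, one per exon.
    variant_positions: reference coordinates of variants seen in the cohort.
    Variants that fall outside every exon are ignored.
    """
    counts = [0] * len(exons)
    for position in variant_positions:
        for index, (start, end) in enumerate(exons):
            if start <= position < end:
                counts[index] += 1
                break
    return counts
```

With three hypothetical exons at `(100, 200)`, `(300, 400)`, and `(500, 600)`, positions `[110, 120, 150, 180, 350, 550]` reproduce the 4, 1, 1 pattern of Figure 7A2, and positions `[110, 150, 310, 390, 520, 580]` reproduce the 2, 2, 2 pattern of Figure 7B2. Whether a difference between bins is significant is a separate statistical question that this sketch does not address.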
[0725] In some such embodiments, determining whether any bin in
the plurality of bins
contains significantly more variants than the other bins in the plurality of
bins comprises
applying a rolling Poisson test of difference between bin counts corresponding
to adjacent
exons in the gene.
[0726] Referring to Block 510-2, the method further comprises
validating the presence of
the somatic sequence variant in the test subject when the variant allele
fragment count for the
candidate variant satisfies the dynamic variant count threshold for the locus,
or rejecting the
presence of the somatic sequence variant in the test subject when the variant
allele fragment
count for the candidate variant does not satisfy the dynamic variant count
threshold for the
locus. In some embodiments, the validating includes other variant filtering
criteria, as
described above in the present disclosure (see, Example Methods: Variant
Identification).
[0727] In some embodiments, the methods and systems disclosed
herein are used for
precision oncology applications. For example, in some embodiments, the method
further
comprises generating a report for the test subject comprising the identity of
variant alleles
having variant allele counts, in the first sequencing reaction, that satisfy
the dynamic variant
count threshold. In some embodiments, the generated report further comprises
therapeutic
recommendations for the test subject based on the identity of one or more of
the reported
variant alleles. Additional embodiments for precision oncology applications,
including
matched clinical trials, matched therapies, report generation, and/or other
aspects of the
digital and laboratory health care platform are described in detail below.
[0728] Another aspect of the present disclosure provides a
computer system comprising
one or more processors and a non-transitory computer-readable medium including
computer-
executable instructions that, when executed by the one or more processors,
cause the
processors to perform a method according to any one of the embodiments
disclosed herein.
[0729] Another aspect of the present disclosure provides a non-
transitory computer-
readable storage medium having stored thereon program code instructions that,
when
executed by a processor, cause the processor to perform the method according
to any one of
the embodiments disclosed herein.
[0730] In some embodiments, the methods described herein include
generating a clinical
report 139-3 (e.g., a patient report), providing clinical support for
personalized cancer
therapy, and/or using the information curated from sequencing of a liquid
biopsy sample, as
described above. In some embodiments, the report is provided to a patient,
physician,
medical personnel, or researcher in a digital copy (for example, a JSON
object, a PDF file, or
an image on a website or portal) or a hard copy (for example, printed on paper
or another
tangible medium). A report object, such as a JSON object, can be used for
further processing
and/or display. For example, information from the report object can be used to
prepare a
clinical laboratory report for return to an ordering physician. In some
embodiments, the
report is presented as text, as audio (for example, recorded or streaming), as
images, or in
another format and/or any combination thereof.
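By way of non-limiting illustration, a report object of the kind described above could be serialized as JSON as sketched below. All field names and example values are hypothetical placeholders and do not reflect the schema of any actual clinical report.

```python
import json
from typing import Optional


def build_report_object(patient_id: str,
                        variants: list[dict],
                        tumor_fraction: Optional[float] = None) -> str:
    """Serialize a minimal report object; every key name here is illustrative."""
    report = {"patient_id": patient_id, "variants": variants}
    if tumor_fraction is not None:
        report["circulating_tumor_fraction_estimate"] = tumor_fraction
    # sort_keys gives a stable layout for downstream processing or display.
    return json.dumps(report, indent=2, sort_keys=True)
```

The resulting string can be passed to a downstream renderer, for example to prepare a clinical laboratory report for return to an ordering physician.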
[0731] The report includes information related to the specific
characteristics of the
patient's cancer, e.g., detected genetic variants, epigenetic abnormalities,
associated
oncogenic pathogenic infections, and/or pathology abnormalities. In some
embodiments,
other characteristics of a patient's sample and/or clinical records are also
included in the
report. For example, in some embodiments, the clinical report includes
information on
clinical variants, e.g., one or more of copy number variants (e.g., for
actionable genes
CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2),
fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK,
ROS1, RET,
NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide
polymorphisms, insertion-deletions (e.g., somatic/tumor and/or
germline/normal), therapy
biomarkers, microsatellite instability status, and/or tumor mutational burden.
[0732] Conversion of solid tumor test to liquid biopsy test. In
one embodiment, the solid
tissue sample is insufficient for NGS testing (for example, the sample is too
small or too
degraded, the amount or quality of nucleic acids extracted from the sample
does not result in
quality NGS results that would result in reliable determination of variants
and/or other
genetic characteristics of the sample), and the physician or patient may
decide to convert the
solid tissue test that was ordered to a liquid biopsy test to be performed on
a liquid biopsy
sample collected from the same patient. The resulting report and/or display of
the results on
a portal may include an "xF Conversion Badge" to distinguish any order that
has been
converted from solid tissue test to a liquid biopsy test (compared to, for
example, a liquid
biopsy test that was not initially ordered as a solid tissue test). This will
allow a user to
identify which orders have been converted by this process, and to distinguish them
from orders that were intentionally placed for the liquid biopsy panel.
[0733] Longitudinal Reporting. In various embodiments, a report
may include and/or
compare the results of multiple liquid biopsy tests and/or solid tumor tests
(for example,
multiple tests associated with the same patient). The results of multiple
liquid biopsy tests
and/or solid tumor tests may be displayed on a portal in a variety of
configurations that may
be selected and/or customized by the viewer. The tests may have been performed
at different
times, and the samples on which the tests were performed may have been
collected at
different times.
[0734] Download result. Clinical and/or molecular data associated
with a patient (for
example, information that would be included in the report), may be aggregated
and made
available via the portal. Any portion of the report data may be available for
download (for
example, as a CSV file) by the physician and/or patient. In various
embodiments, the data
may include data related to genetic variants, RNA expression levels,
immunotherapy markers
(including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician
or medical
facility has ordered multiple tests (all tests may be associated with the same
patient or tests
may be associated with multiple patients), results associated with more than
one test may be
aggregated into a single file for downloading.
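By way of non-limiting illustration, aggregating the results of more than one test into a single CSV file could be sketched as follows. The input layout (a list of dictionaries with "test_id", "patient_id", and "results" keys) and the column names are assumptions made only for this example.

```python
import csv
import io


def aggregate_results_to_csv(test_results: list[dict]) -> str:
    """Flatten the results of several tests into one CSV string."""
    fieldnames = ["test_id", "patient_id", "marker", "value"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for test in test_results:
        # One output row per reported marker (e.g., MSI, TMB) of each test.
        for marker, value in test["results"].items():
            writer.writerow({
                "test_id": test["test_id"],
                "patient_id": test["patient_id"],
                "marker": marker,
                "value": value,
            })
    return buffer.getvalue()
```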
Systems and Methods for Improved Circulating Tumor Fraction Estimates
[0735] Below, systems and methods for improving circulating tumor
fraction estimates,
e.g., within the context of the methods and systems described above, are
described with
reference to Figures 4F3, 5A3-B3, and 6A3-C3.
[0736] Many of the embodiments described below, in conjunction
with Figures 4F3,
5A3-B3, and 6A3-C3, relate to analyses performed using sequencing data for
cfDNA
obtained from a liquid biopsy sample of a cancer patient. Generally, these
embodiments are
independent and, thus, not reliant upon any particular DNA sequencing methods.
However,
in some embodiments, the methods described below include generating the
sequencing data.
[0737] As described herein, in some embodiments, the methods
described herein (e.g.,
methods 400-3 and 500-3 as illustrated in Figures 4F3 and 5A3-B3) include one
or more data
collection steps, in addition to data analysis and downstream steps. For
example, as
described herein, e.g., with reference to Figures 2 and 3, in some
embodiments, the methods
include collection of a liquid biopsy sample and, optionally, one or more
matching biological
samples from the subject (e.g., a matched cancerous and/or matched non-
cancerous sample
from the subject). Likewise, as described herein, e.g., with reference to
Figures 2 and 3, in
some embodiments, the methods include extraction of DNA from the liquid biopsy
sample
(cfDNA) and, optionally, one or more matching biological samples from the
subject (e.g., a
matched cancerous and/or matched non-cancerous sample from the subject).
Similarly, as described
herein, e.g., with reference to Figures 2 and 3, in some embodiments, the
methods include
nucleic acid sequencing of DNA from the liquid biopsy (cfDNA) sample and,
optionally, one
or more matching biological samples from the subject (e.g., a matched
cancerous and/or
matched non-cancerous sample from the subject).
[0738] However, in other embodiments, the methods described
herein begin with
obtaining nucleic acid sequencing results, e.g., raw or collapsed sequence
reads of DNA from
a liquid biopsy sample (cfDNA) and, optionally, one or more matching
biological samples
from the subject (e.g., a matched cancerous and/or matched non-cancerous
sample from the
subject), from which the genomic features needed for estimating circulating
tumor fraction
(e.g., variant allele count and/or variant allele fraction) can be determined.
For example, in
some embodiments, sequencing data 122 for a patient 121 is accessed and/or
downloaded
over network 105 by system 100.
[0739] Similarly, in some embodiments, the methods described
herein begin with
obtaining the genomic features needed for estimating circulating tumor
fraction (e.g., variant
allele count and/or variant allele fraction) for a sequencing of a liquid
biopsy sample and,
optionally, one or more matching biological samples from the subject (e.g., a
matched
cancerous and/or matched non-cancerous sample from the subject). For example,
in some
embodiments, variant allele counts and/or variant allele fractions for
sequencing data 122 of
patient 121 are accessed and/or downloaded over network 105 by system 100.
[0740] Figure 4F3 illustrates a flow chart of a method for
precision oncology including
determining accurate circulating tumor fraction estimates using on-target and
off-target
sequence reads, in accordance with some embodiments of the present disclosure.
[0741] In some embodiments, the method includes obtaining (402-3)
cell-free DNA
sequencing data 122 from a sequencing reaction of a liquid biopsy sample of a
test subject
121 (e.g., sequence reads 123-1-1-1, . . . 123-1-1-K for sequence run 122-1-1
for a liquid
biopsy sample from patient 121-1, as illustrated in Figure 1B). As described
herein, in some
embodiments, the obtaining includes a step of sequencing cell-free nucleic
acids from a liquid
biopsy sample. Example methods for sequencing cell-free nucleic acids are
described herein.
The sequence reads obtained from the targeted-panel sequencing include a first
subset of
sequence reads that map to one or more target genes (e.g., on-target reads) in
the panel and a
second subset of sequence reads that map to an off-target portion of the
reference genome
(e.g., off-target reads). In some embodiments, the plurality of sequence reads
includes at
least 1000 sequence reads. In some embodiments, the first plurality of
sequence reads
includes at least 10,000 sequence reads. In some embodiments, the first
plurality of sequence
reads includes at least 100,000 sequence reads. In some embodiments, the first
plurality of
sequence reads includes at least 200,000, 300,000, 400,000, 500,000, 750,000,
1,000,000,
2,500,000, 5,000,000 sequence reads, or more.
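By way of non-limiting illustration, separating aligned reads into the on-target and off-target subsets could be sketched as follows. The sketch assumes half-open genomic intervals and that the target intervals on each chromosome are sorted and non-overlapping; the function name and data layout are hypothetical.

```python
from bisect import bisect_right

Interval = tuple[int, int]  # half-open [start, end)


def split_on_off_target(reads: list[tuple[str, int, int]],
                        targets: dict[str, list[Interval]]):
    """Split (chrom, start, end) reads by overlap with panel target intervals."""
    starts = {chrom: [s for s, _ in ivs] for chrom, ivs in targets.items()}
    on_target, off_target = [], []
    for chrom, start, end in reads:
        ivs = targets.get(chrom, [])
        # Last target interval that begins before the read ends.
        idx = bisect_right(starts.get(chrom, []), end - 1) - 1
        if idx >= 0 and ivs[idx][1] > start:
            on_target.append((chrom, start, end))
        else:
            off_target.append((chrom, start, end))
    return on_target, off_target
```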
[0742] In some embodiments, the panel size is relatively small,
e.g., less than 1000 genes,
less than 750 genes, less than 500 genes, less than 250 genes, less than 200
genes, less than
150 genes, less than 125 genes, less than 100 genes, less than 75 genes, less
than 50 genes,
etc. In some such embodiments, the sequencing reaction is performed at a read
depth of
100X or more, 250X or more, 500X or more, 1000X or more, 2500X or more, 5000X
or
more, 10,000X or more, 20,000X or more, or 30,000X or more. In some
embodiments, the
sequencing panel comprises 1 or more, 10 or more, 20 or more, 50 or more, 100
or more, 150
or more, 200 or more, 300 or more, 500 or more, or 1000 or more genes. In some
embodiments, the sequencing panel comprises one or more genes listed in Table
1. In some
embodiments, the sequencing panel includes at least 2, 3, 4, 5, 10, 15, 20,
25, 30, 40, 50, 60,
70, 80, 90, 100, or all of the genes listed in Table 1. In some embodiments,
the sequencing
panel comprises one or more genes selected from the group consisting of MET,
EGFR,
ERBB2, CD274, CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the
sequencing panel includes at least 2, 3, 4, 5, 6, 7, or all 8 of MET, EGFR,
ERBB2, CD274,
CCNE1, MYC, BRCA1 and BRCA2. In some embodiments, the sequencing reaction is a
whole exome sequencing reaction.
[0743] Sequence reads 123 from the sequencing data 122 are then
aligned (404-3) to a
human reference sequence (e.g., a human genome or a portion of a human genome,
e.g., 1%,
5%, 10%, 15%, 20%, 25%, 30%, 40%, 50%, 60%, 75%, 90%, 95%, 99%, or more of the
human genome, or to a map of a human reference genome or a set of human
reference
genomes, or a portion thereof), thereby generating a plurality of aligned
reads 124.
Optionally, the pre-aligned sequence reads 123 and/or aligned sequence reads
124 are pre-
processed (408-3) using any of the methods disclosed above (e.g.,
normalization, bias
correction, etc.). In some embodiments, as described herein, device 100
obtains previously
aligned sequence reads.
[0744] As described above, in some embodiments, the reference
sequence is a reference
genome, e.g., a reference human genome. In some embodiments, a reference
genome has
several blacklisted regions, such that the reference genome covers only about
75%, 80%,
85%, 90%, 95%, 98%, 99%, 99.5%, or 99.9% of the entire genome for the species
of the
subject. In some embodiments, the reference sequence for the subject covers at
least 10% of
the entire genome for the species of the subject, or at least 15%, 20%, 25%,
30%, 35%, 40%,
45%, 50%, 55%, 60%, 65%, 70%, 75%, or more of the entire genome for the
species of the
subject. In some embodiments, the reference sequence for the subject
represents a partial or
whole exome for the species of the subject. For instance, in some embodiments,
the
reference sequence for the subject covers at least 10% of the exome for the
species of the
subject, or at least 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%,
70%, 75%,
80%, 85%, 90%, 95%, 98%, 99%, 99.9%, or 100% of the exome for the species of
the
subject. In some embodiments, the reference sequence covers a plurality of
loci that
constitute a panel of genomic loci, e.g., a panel of genes used in a panel-
enriched sequencing
reaction. An example of genes useful for precision oncology, e.g., which may
be targeted
with such a panel, are shown in Table 1. Accordingly, in some embodiments, the
reference
sequence for the subject covers at least 100 kb of the genome for the species
of the subject.
In other embodiments, the reference sequence for the subject covers at least
250 kb, 500 kb,
750 kb, 1 Mb, 2 Mb, 5 Mb, 10 Mb, 25 Mb, 50 Mb, 100 Mb, 250 Mb, or more of the
genome
for the species of the subject. However, in some embodiments, there is no size
limitation of
the reference sequence. For example, in some embodiments, the reference
sequence can be a
sequence for a single locus (e.g., a single exon, gene, etc.) within the
genome for the species
of the subject.
[0745] In some embodiments, the bins for off-target sequence
reads (those sequence
reads that do not correspond to a sequencing panel enrichment probe) are
established to
provide roughly uniform distribution of sequence reads to each bin, e.g.,
based on training
data establishing historical distributions of sequence reads across the genome
for a given
targeted-panel sequencing reaction. In some embodiments, the method includes
processes for
enforcing uniformity, such as defining different bin sizes, GC correction, and
sequencing
depth corrections. In other embodiments, the binning is performed based upon a
predetermined bin size. In some embodiments, the plurality of bins includes at
least 10, 25,
50, 100, 250, 500, 1000, 2500, 5000, 10,000, 25,000, 50,000, or more bins
distributed across
the reference sequence (e.g., the genome) for the species of the subject. In
some
embodiments, the bins are distributed relatively uniformly across the
reference sequence, e.g.,
such that each bin encompasses a similar number of bases, e.g., about 0.5 kb,
1 kb, 2 kb, 5 kb,
10 kb, 25 kb, 50 kb, 100 kb or more bases. Each respective bin in the plurality
of bins
represents a corresponding region of a reference sequence (e.g., genome) for
the species of
the subject.
Each respective
bin-level sequence ratio in the plurality of bin-level sequence ratios is
determined from a
comparison of the first plurality of sequence reads to sequence reads from one
or more
reference samples. In some embodiments, the one or more reference samples are
process-
matched reference samples. That is, in some embodiments the one or more
reference samples
are prepared for sequencing using the same methodology as used to prepare the
sample from
the test subject. Similarly, in some embodiments, the one or more reference
samples are
sequenced using the same sequencing methodology as used to sequence the sample
from the
test subject. In this fashion, internal biases for particular regions or
sequences are controlled
for in the reference samples.
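By way of non-limiting illustration, binning with a predetermined bin size and comparing the test sample to one or more reference samples could be sketched as follows. The sketch normalizes each sample by its total read count and uses the median across reference samples; both choices, along with the function names, are assumptions and stand in for the GC and depth corrections described above.

```python
from collections import Counter
from statistics import median


def count_reads_per_bin(read_positions, bin_size: int) -> Counter:
    """Count reads per (chrom, bin_index) for (chrom, position) pairs."""
    return Counter((chrom, pos // bin_size) for chrom, pos in read_positions)


def bin_level_ratios(sample_counts: dict, reference_counts_list: list[dict]) -> dict:
    """Ratio of the sample's read fraction to the median reference fraction per bin."""
    def normalize(counts):
        total = sum(counts.values())
        return {b: c / total for b, c in counts.items()}

    sample = normalize(sample_counts)
    references = [normalize(rc) for rc in reference_counts_list]
    ratios = {}
    for b, fraction in sample.items():
        ref_fraction = median(r.get(b, 0.0) for r in references)
        if ref_fraction > 0:  # skip bins with no reference coverage
            ratios[b] = fraction / ref_fraction
    return ratios
```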
[0746] In some embodiments, binned sequence reads are segmented
via circular binary
segmentation (CBS). For example, in some embodiments, the method includes
genomic
region binning, coverage calculation, bias correction, normalization to a
reference pool,
segmentation, and/or visualization (e.g., using CNVkit).
[0747] In some embodiments, the method includes determining a
sequence ratio (e.g., a
coverage ratio) for a plurality of segments of the genome using the, e.g.,
binned, corrected,
normalized, and/or segmented sequence reads as described above. In some
embodiments,
coverage ratio (CR) is calculated for the plurality of segments based on the
following
relationship (Block 410-3):
$$\log_2(CR) = \log_2\left(\frac{\text{normalized sample coverage}}{\text{normalized pool coverage}}\right). \tag{1}$$
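By way of non-limiting illustration, the relationship of Eq. (1) could be computed as sketched below. The function name is hypothetical, and the rejection of non-positive coverage values is a choice made for this example rather than a requirement of the disclosure.

```python
import math


def log2_coverage_ratio(normalized_sample_coverage: float,
                        normalized_pool_coverage: float) -> float:
    """log2 of sample coverage over reference-pool coverage, per Eq. (1)."""
    if normalized_sample_coverage <= 0 or normalized_pool_coverage <= 0:
        raise ValueError("coverage values must be positive")
    return math.log2(normalized_sample_coverage / normalized_pool_coverage)
```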
[0748] In some embodiments, the data is then cleaned-up by (i)
removing segments
located on sex chromosomes, and/or (ii) removing segments with fewer probes
than a
minimal threshold. In some embodiments, segments are then fitted to integer
copy states via
a maximum likelihood estimation (e.g., an expectation-maximization algorithm
412-3) using,
for example, the sum of squared error of segment log2 ratios (e.g., normalized
to genomic
interval size) to expected coverage ratios given a putative copy state and
tumor purity.
[0749] For example, in some embodiments, the method includes
calculating expected
sequence ratios 414-3 (e.g., coverage ratios) for a set of copy states at a
given tumor purity.
For instance, for a set of tumor purity values TP, and a set of copy states
CN, the expected
log2 coverage ratio is calculated for each tumor purity (TPi) and copy number
state (CNj)
according to:
$$\log_2(CR_{ij}) = \log_2\left(\frac{2(1 - TP_i) + CN_j \cdot TP_i}{2}\right). \tag{2}$$
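By way of non-limiting illustration, the expected log2 coverage ratios of Eq. (2) could be tabulated as sketched below. The function names are hypothetical. A tumor purity of exactly 1 combined with a copy state of 0 is undefined (the logarithm of zero) and is not handled here.

```python
import math


def expected_log2_ratio(tumor_purity: float, copy_number: int) -> float:
    """Expected log2 coverage ratio for one tumor purity and copy state, per Eq. (2)."""
    return math.log2((2 * (1 - tumor_purity) + copy_number * tumor_purity) / 2)


def expected_ratio_grid(tumor_purities, copy_states):
    """Map each tumor purity to its expected log2 ratios, one per copy state."""
    return {
        tp: [expected_log2_ratio(tp, cn) for cn in copy_states]
        for tp in tumor_purities
    }
```

For a tumor purity of 0.5 and copy states 0 through 4, the values round to -1, -0.415, 0, 0.322, and 0.585.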
[0750] In some embodiments, the method includes calculating the
distance 416-3 to the
closest copy state expected sequence ratio (e.g., coverage ratio) at the given
tumor purity,
where the distance (e.g., error) for a segment k (CRk) from the expected copy
state is defined
as:
$$d_k = \left| CR_{ij} - CR_k \right|. \tag{3}$$
[0751] In some embodiments, the method includes assigning segment
copy states by
selecting expected copy states with the closest sequence ratio. That is, each
segment is assigned the expected copy state with the smallest
distance:
$$CN_k = \operatorname{argmin}(d_k). \tag{4}$$
The error for that segment is therefore the square of the minimum distance:
$$E_k = (\min(d_k))^2. \tag{5}$$
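By way of non-limiting illustration, Eqs. (3) through (5) could be applied to a single segment as sketched below. The sketch assumes that distances are measured between log2 coverage ratios and that the copy states are 0 through 4; the function name is hypothetical.

```python
import math


def closest_copy_state(segment_log2_ratio: float,
                       tumor_purity: float,
                       copy_states=(0, 1, 2, 3, 4)):
    """Return (copy_state, squared_error) for one segment at one tumor purity."""
    distances = {}
    for cn in copy_states:
        expected = math.log2((2 * (1 - tumor_purity) + cn * tumor_purity) / 2)
        distances[cn] = abs(expected - segment_log2_ratio)  # Eq. (3)
    best = min(distances, key=distances.get)                # Eq. (4)
    return best, distances[best] ** 2                       # Eq. (5)
```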
[0752] In some embodiments, the method then includes estimating
the circulating tumor
fraction for the test subject based on a measure of fit between corresponding
segment-level
coverage ratios and integer copy states across the plurality of simulated
circulated tumor
fractions.
[0753] In some embodiments, estimating the circulating tumor
fraction comprises
minimization of an error between corresponding segment-level coverage ratios
and integer
copy states across the plurality of simulated circulated tumor fractions. For
example, in some
embodiments, the method includes summing the weighted errors for each tumor
purity and
selecting the model with the lowest score. In some embodiments, the scores 418-3
for each
segment are weighted by the number of probes on that segment. The number of
probes is
highly co-linear with the length of the segment. In some embodiments, the
weighting is
performed according to:
$$w_i = \sum_k E_k \, l_k, \tag{6}$$
where:
w_i is the weighted score for the sample copy ratios at tumor purity i,
E_k is the error of segment k to its closest copy state, and
l_k is the number of probes on segment k.
[0754] The circulating tumor fraction estimate, therefore, is
selected as the tumor purity
with the lowest score (Block 420-3):
$$TP = \operatorname{argmin}(\mathbf{w}), \tag{7}$$
e.g., where $\mathbf{w} = (w_{0.01}, \ldots, w_{0.99})$.
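By way of non-limiting illustration, the scoring and selection of Eqs. (6) and (7) could be sketched as a grid search over simulated tumor purities. The sketch assumes segments given as (log2 ratio, probe count) pairs, a purity grid of 0.01 to 0.99 in steps of 0.01, and copy states 0 through 4; the function name is hypothetical.

```python
import math


def estimate_tumor_fraction(segments, copy_states=(0, 1, 2, 3, 4)):
    """Return (best_tumor_purity, scores) for (log2_ratio, n_probes) segments."""
    scores = {}
    for tp in (i / 100 for i in range(1, 100)):
        expected = [
            math.log2((2 * (1 - tp) + cn * tp) / 2) for cn in copy_states
        ]
        score = 0.0
        for log2_ratio, n_probes in segments:
            # Squared distance to the closest expected copy state, Eq. (5).
            error = min(abs(e - log2_ratio) for e in expected) ** 2
            score += error * n_probes  # probe-weighted sum, Eq. (6)
        scores[tp] = score
    best = min(scores, key=scores.get)  # lowest score wins, Eq. (7)
    return best, scores
```

The returned scores can also be inspected for multiple local minima, which matters when more than one tumor purity fits the data comparably well.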
[0755] In some embodiments, estimating the circulating tumor
fraction includes
identifying a plurality of local optima for fit (e.g., minima for the error
between
corresponding segment-level coverage ratios and integer copy states across the
plurality of
simulated circulated tumor fractions), and selecting the local optimum (e.g.,
minimum) that is
closest to a second estimate of circulating tumor fraction determined by a
different
methodology.
[0756] For example, Figure 19 is an example plot of the errors
between corresponding
segment-level coverage ratios and integer copy states determined across a
plurality of
simulated circulated tumor fractions ranging from about 0 to about 1. As seen
in the plot,
there are two local minima 1902 and 1904 for the error, representing two
possible solutions
for the circulating tumor fraction for the liquid biopsy sample. In some
embodiments, a
second estimation of circulating tumor fraction 1906 or 1908 is determined,
e.g., according to
any of the methods described in the "Circulating Tumor Fraction" section
above. The second
estimation of circulating tumor fraction is then compared with the local
minima, and the local
minimum that is closest to the second circulating tumor fraction estimate is selected as
the circulating
tumor fraction for the liquid biopsy sample. For instance, if the second
circulating tumor
fraction 1906 was determined to be about 0.35, the first local minimum 1902 would
be selected,
and the circulating tumor fraction for the sample would be estimated to be
about 0.325.
However, if the second circulating tumor fraction 1908 was determined to be
about 0.65,
the second local minimum 1904 would be selected, and the circulating tumor fraction
for the
sample would be estimated to be about 0.625.
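By way of non-limiting illustration, selecting among several local minima using a second estimate could be sketched as follows. The sketch treats a point as a local minimum when its error is strictly lower than both neighbors and falls back to the global minimum when none is found; these rules and the function names are assumptions.

```python
def local_minima(errors: list[float]) -> list[int]:
    """Indices whose error is strictly lower than both neighboring errors."""
    return [
        i for i in range(1, len(errors) - 1)
        if errors[i] < errors[i - 1] and errors[i] < errors[i + 1]
    ]


def select_tumor_fraction(tumor_fractions: list[float],
                          errors: list[float],
                          second_estimate: float) -> float:
    """Pick the local minimum closest to an independently derived estimate."""
    minima = local_minima(errors)
    if not minima:
        # No interior local minimum: fall back to the global minimum.
        best = min(range(len(errors)), key=errors.__getitem__)
        return tumor_fractions[best]
    best = min(minima, key=lambda i: abs(tumor_fractions[i] - second_estimate))
    return tumor_fractions[best]
```

With local minima at 0.325 and 0.625, a second estimate of about 0.35 selects 0.325 and a second estimate of about 0.65 selects 0.625, mirroring the example above.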
[0757] In some embodiments, the second estimate of circulating
tumor fraction is
generated by detecting a plurality of germline variants in the liquid biopsy
sample based on
the first plurality of sequence reads and determining, for each respective
germline variant in
the plurality of germline variants, a corresponding germline variant allele
frequency for the
liquid biopsy sample, thereby determining a plurality of germline variant
allele frequencies
for the liquid biopsy sample. For each respective germline variant in the
plurality of germline
variants, an absolute value of the difference between the corresponding
germline variant
allele frequency for the liquid biopsy sample and a germline variant allele
frequency for the
respective germline variant allele in a non-cancerous tissue of the subject is
then determined,
thereby generating a plurality of germline variant allele deltas for the
liquid biopsy sample.
The second estimated circulating tumor fraction for the liquid biopsy sample
is then defined
as twice the value of the maximum germline variant allele delta in the
plurality of germline
variant allele deltas.
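By way of non-limiting illustration, the germline-based second estimate could be computed as sketched below. The function name is hypothetical. When no matched non-cancerous frequencies are supplied, the sketch assumes a frequency of 0.5 for every variant, which is appropriate only for heterozygous germline variants.

```python
from typing import Optional


def germline_delta_tumor_fraction(liquid_vafs: list[float],
                                  normal_vafs: Optional[list[float]] = None) -> float:
    """Twice the largest absolute germline VAF shift between liquid and normal."""
    if not liquid_vafs:
        raise ValueError("at least one germline variant is required")
    if normal_vafs is None:
        # Assumption: heterozygous germline variants in non-cancerous tissue.
        normal_vafs = [0.5] * len(liquid_vafs)
    deltas = [abs(liquid - normal) for liquid, normal in zip(liquid_vafs, normal_vafs)]
    return 2 * max(deltas)
```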
[0758] In some embodiments, for each respective germline variant
in the plurality of
germline variants, the corresponding germline variant allele frequency for the
respective
germline variant allele in a non-cancerous tissue of the subject is defined as
0.5. However, in
other embodiments, for each respective germline variant in the plurality of
germline variants,
the corresponding germline variant allele frequency for the respective
germline variant allele
in a non-cancerous tissue of the subject is determined based on a second
sequencing reaction
of nucleic acids from a non-cancerous sample of the subject. For example, in
some
embodiments, a plurality of somatic variants is detected in the liquid biopsy
sample based on
the first plurality of sequence reads. For each respective somatic variant in
the plurality of
somatic variants, a corresponding somatic variant allele frequency is
determined for the
liquid biopsy sample, thereby determining a plurality of somatic variant
allele frequencies for
the liquid biopsy sample. The second estimated circulating tumor fraction for
the liquid
biopsy sample is then defined as twice the value of the largest somatic variant allele
frequency in the
plurality of somatic variant allele frequencies.
[0759] In some embodiments, the second estimate of circulating
tumor fraction is
generated by detecting a plurality of somatic variants in the liquid biopsy
sample based on the
first plurality of sequence reads, determining, for each respective somatic
variant in the
plurality of somatic variants, a corresponding somatic variant allele
frequency for the liquid
biopsy sample, thereby determining a plurality of somatic variant allele
frequencies for the
liquid biopsy sample, and then estimating the circulating tumor fraction for
the liquid biopsy
sample as the value of the largest somatic variant allele frequency in the
plurality of somatic
variant allele frequencies.
[0760] An example of the off-target tumor estimation method
described above is
illustrated in Figures 6A3, 6B3, and 6C3, in accordance with some embodiments
of the
present disclosure. The plot in Figure 6A3 shows the log2 coverage ratios,
calculated using
Eq. (1) using off-target sequence reads from a test liquid biopsy sample
(e.g., binned,
corrected, normalized, and segmented using CNVkit). Segments were filtered to
remove
segments on sex chromosomes and segments with fewer than a minimum number of
probes
and arranged according to chromosome (indicated along the x-axis).
[0761] A set of tumor purity values TP and a set of copy states
CN were selected for
calculation of expected log2 coverage ratio. In this implementation, TP =
[0.01, 0.02, ..., 0.99] and CN = [0, 1, 2, 3, 4]. Thus, using Eq. (2), the
expected log2 coverage ratio can be
calculated for each possible combination of TPi and CNj. For example, for TP =
0.5 and CN = 4, the expected log2 coverage ratio is 0.585, and for TP = 0.5 and
each possible value of CN,
the set of expected log2 coverage ratios is CR(TP=0.5) = [-1, -0.415, 0, 0.322,
0.585]. Values for
expected log2 coverage ratios are indicated in Figure 6A3 by the horizontal
bars marked CN0,
CN1, CN2, CN3, CN4.
[0762] Referring to Figure 6B3, the distances (e.g., error) for
each segment from the
expected copy state were determined using Eq. (3) and indicated by the
vertical arrows (e.g.,
for a segment k). Eqs. (4) and (5) were then used to determine the copy state
of the segment
by selecting the expected copy state with the minimum distance. For example,
in Figure 6B3,
at TP = 0.5 (e.g., a tumor purity of 50%) the segment k is closest to the
expected copy state of
0, and thus the segment is assigned a copy state of 0. Figure 6C3 further
illustrates the
selection of copy states and minimum distances for each segment in the
plurality of segments
across each chromosome in the reference genome.
[0763] The minimum distances for each segment in the plurality of
segments across the
reference genome were summed, for each tumor purity value in the set, thus
obtaining a score
for each tumor purity. For example, Figure 6C3 illustrates a plurality of
minimum distances,
between each segment and its closest copy state value, for the plurality of
segments across the
reference genome. Additionally, the scores for each segment were weighted by
the number
of probes on the segment, according to Eq. (6). Finally, the tumor purity with
the smallest
score (e.g., smallest error) was selected according to Eq. (7), thus obtaining
the circulating
tumor fraction estimate for the test liquid biopsy sample.
[0764] Optionally, the method generates a circulating tumor
fraction estimate 422-3 that
can be reported as a biomarker. The ctFE is used, in some embodiments, to
match therapies
and/or clinical trials (Block 424-3) and can be included in a patient report
426-3 indicating
the ctFE.
[0765] Optionally, the tumor fraction estimate obtained by the
method is used (423-3) in
one or more of the variant identification methods described herein, e.g., with
respect to
feature extraction module 145 illustrated in Figure 1A.
[0766] Figures 5A3-5B3 collectively provide a flow chart of
processes and features for
determining accurate circulating tumor fraction estimates using off-target
sequence reads, in
accordance with some embodiments of the present disclosure.
[0767] The present disclosure provides a method 500-3 for
estimating a circulating tumor
fraction for a test subject from panel-enriched sequencing data for a
plurality of sequences.
[0768] Referring to Block 502-3, the method comprises obtaining,
from a first panel-
enriched sequencing reaction, a first plurality of sequence reads, where the
first plurality of
sequence reads comprises at least 100,000 sequence reads.
[0769] The plurality of sequences comprises (i) a corresponding
sequence for each cell-
free DNA fragment in a first plurality of cell-free DNA fragments obtained
from a liquid
biopsy sample from the test subject. Each respective cell-free DNA fragment in
the first
plurality of cell-free DNA fragments corresponds to a respective probe
sequence in a
plurality of probe sequences used to enrich cell-free DNA fragments in the
liquid biopsy
sample in the first panel-enriched sequencing reaction.
[0770] The plurality of sequences further comprises (ii) a
corresponding sequence for
each cell-free DNA fragment in a second plurality of cell-free DNA fragments
obtained from
the liquid biopsy sample. Each respective cell-free DNA fragment in the second
plurality of
DNA fragments does not correspond to any probe sequence in the plurality of
probe
sequences.
[0771] For example, in some embodiments, the plurality of
sequence reads from a first
panel-enriched sequencing reaction includes a first subset of sequence reads
that correspond
to cfDNA fragments targeted by one or more probes in a targeted enrichment
panel (e.g., on-
target), and a second subset of sequence reads that correspond to cfDNA
fragments that map
to an off-target region of the reference genome not targeted by any of the
probes in the
targeted enrichment panel (e.g., off-target).
[0772] In some embodiments, the plurality of sequence reads
comprises at least 500,000
sequence reads, or at least 1,000,000 sequence reads, or at least 2,000,000
sequence reads, or
at least 5,000,000 sequence reads.
[0773] In some embodiments, the obtaining, accessioning, storing,
preparing, processing
and/or analyzing the liquid biopsy sample from the test subject comprises any
of the methods
and/or embodiments described above in the present disclosure. In some
embodiments, the
sequencing reaction comprises any of the methods and/or embodiments described
above in
the present disclosure.
[0774] In some embodiments, the plurality of probe sequences used
to enrich cell-free
DNA fragments in the liquid biopsy sample in the first panel-enriched
sequencing reaction
collectively map to at least 25 different genes in a human reference genome. In
some
embodiments, the plurality of probe sequences collectively maps to at least
50, at least 100, at
least 250, at least 500, or at least 1000 different genes in the human
reference genome. In
some embodiments, the plurality of probe sequences collectively maps to at
least 10 of the
genes listed in Table 1. In some embodiments, the plurality of probe sequences
collectively
maps to at least 20, 25, 30, 40, 50, 60, 75, 100, or all 105 of the genes
listed in Table 1.
[0775] For example, in some embodiments, a targeted enrichment
panel comprises any of
the embodiments described above in the present disclosure. For example, in
some
embodiments, the targeted enrichment panel includes probes targeting one or
more gene loci,
e.g., exon or intron loci. In some embodiments, the targeted enrichment panel
includes
probes targeting one or more loci not encoding a protein, e.g., regulatory
loci, miRNA loci,
and other non-coding loci, e.g., that have been found to be associated with
cancer. In some
embodiments, the plurality of loci includes at least 25, 50, 100, 150, 200,
250, 300, 350, 400,
500, 750, 1000, 2500, 5000, or more human genomic loci.
[0776] In some embodiments, the targeted enrichment panel
includes probes targeting
one or more of the genes listed in Table 1. In some embodiments, the targeted
enrichment
panel includes probes targeting at least 5 of the genes listed in Table 1. In
some
embodiments, the targeted enrichment panel includes probes targeting at least
10 of the genes
listed in Table 1. In some embodiments, the targeted enrichment panel includes
probes
targeting at least 25 of the genes listed in Table 1. In some embodiments, the
targeted
enrichment panel includes probes targeting at least 50 of the genes listed in
Table 1. In some
embodiments, the targeted enrichment panel includes probes targeting at least
75 of the genes
listed in Table 1. In some embodiments, the targeted enrichment panel includes
probes
targeting at least 100 of the genes listed in Table 1. In some embodiments,
the targeted
enrichment panel includes probes targeting all of the genes listed in Table 1.
[0777] Referring to Block 504-3, the method comprises determining
a plurality of bin-
level coverage ratios from the plurality of sequences. Each respective bin-
level coverage
ratio in the plurality of bin-level coverage ratios corresponds to a
respective bin in a
plurality of bins, and each respective bin in the plurality of bins represents
a corresponding
region of a human reference genome. Additionally, each respective bin-level
coverage ratio
in the plurality of bin-level coverage ratios is determined from a comparison
of (i) a number
of sequence reads in the first plurality of sequence reads that map to the
corresponding bin
and (ii) a number of sequence reads from one or more reference samples that
map to the
corresponding bin.
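The comparison in paragraph [0777] can be illustrated with a minimal sketch. It assumes per-bin read counts are already available for the test sample and for a pooled reference, and it expresses the comparison as a log2 ratio of library-size-normalized counts. The function name, the pseudocount, and the normalization are assumptions made for illustration; the disclosure does not prescribe them.

```python
import math


def bin_log2_ratios(test_counts, ref_counts, pseudocount=0.5):
    """Return one log2 coverage ratio per bin.

    test_counts, ref_counts: read counts per bin, listed in the same bin order.
    Each count is divided by its sample total so library size cancels out.
    The pseudocount is an assumption of this sketch; it avoids log2(0).
    """
    if len(test_counts) != len(ref_counts):
        raise ValueError("test and reference must cover the same bins")
    n_bins = len(test_counts)
    test_total = sum(test_counts) + pseudocount * n_bins
    ref_total = sum(ref_counts) + pseudocount * n_bins
    ratios = []
    for t, r in zip(test_counts, ref_counts):
        t_frac = (t + pseudocount) / test_total
        r_frac = (r + pseudocount) / ref_total
        ratios.append(math.log2(t_frac / r_frac))
    return ratios


# Hypothetical counts: the middle bin is over-represented in the test sample.
ratios = bin_log2_ratios([100, 200, 100], [100, 100, 100])
```

A positive value suggests relative gain in that bin and a negative value suggests relative loss, consistent with the log2-transformed ratios discussed later in this section.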
[0778] In some embodiments, each bin is defined as any region of
a reference genome
(e.g., that maps to a location in a reference genome). For example, in some
embodiments, a
bin is any number of bases in size. In some embodiments, a bin is 1, 2, 3, 4,
5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, or more than 30
base pairs long. In some embodiments, a bin is at least 30, at least 40, at
least 50, at least 60,
at least 70, at least 80, at least 90, at least 100, at least 110, at least
120, at least 130, at least
140, at least 150, at least 160, at least 170, at least 180, at least 190, or
at least 200 base pairs
long. In some embodiments, a bin is between 5 base pairs and 100,000 base
pairs long. In
some embodiments, a bin is between 10 and 10,000 base pairs long. In some
embodiments, a
bin is greater than 100,000 base pairs long. In some embodiments, each bin in
the plurality of
bins is the same size. In some embodiments, a first bin in the plurality of
bins is a different
size from a second bin in the plurality of bins. In some embodiments, each bin
further
comprises a start and end position that corresponds to a location on a
reference genome. In
some embodiments, the plurality of bins comprises at least 10, at least 50, at
least 100, at
least 1,000, at least 2,000, at least 5,000, at least 10,000, at least 20,000,
at least 50,000, at
least 100,000, at least 500,000, at least 1x10^6, at least 2x10^6, at least
5x10^6, at least 1x10^7, or
at least 1x10^8 bins.
[0779] In some embodiments, on average, each respective bin in
the plurality of bins has
two or more, three or more, five or more, ten or more, fifteen or more, twenty
or more, fifty
or more, one hundred or more, five hundred or more, one thousand or more, ten
thousand or
more, or 100,000 or more sequence reads in the plurality of sequence reads
mapping onto the
portion of the reference genome corresponding to the respective bin, where
each such
sequence read uniquely represents a different molecule in the plurality of
cell-free nucleic
acids in the liquid biopsy sample. For instance, in some embodiments, the
plurality of cell-
free nucleic acids in the liquid biopsy sample are sequenced with a sequencing
methodology
that makes use of unique molecular identifiers (UMIs) for each cell-free
nucleic acid in the
liquid biopsy sample and each sequence read in the plurality of sequence reads
has a unique
UMI. In such embodiments, sequence reads with the same UMI are bagged
(collapsed) into a
single sequence read bearing the UMI.
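As a rough illustration of the UMI collapsing described in paragraph [0779], the sketch below counts unique molecules per bin by treating reads that share a UMI within a bin as duplicates of one original fragment. The input layout (tuples of UMI and bin index) and the function name are hypothetical; a production pipeline would typically also consider alignment coordinates and sequencing errors in the UMI.

```python
from collections import defaultdict


def collapse_by_umi(reads):
    """Count unique molecules per bin.

    reads: iterable of (umi, bin_index) tuples, one per sequence read.
    Reads with the same UMI in the same bin are collapsed into one molecule.
    """
    umis_per_bin = defaultdict(set)
    for umi, bin_index in reads:
        umis_per_bin[bin_index].add(umi)
    return {bin_index: len(umis) for bin_index, umis in umis_per_bin.items()}


# Hypothetical reads: two duplicates of "AAC" in bin 0 collapse to one molecule.
reads = [("AAC", 0), ("AAC", 0), ("GGT", 0), ("AAC", 1)]
counts = collapse_by_umi(reads)
```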
[0780] In some embodiments, each bin-level coverage ratio in the
plurality of bin-level
coverage ratios comprises any measurement of a number of copies of a genomic
sequence
compared to a reference sequence (e.g., a copy ratio, log2 ratio, coverage
ratio, base fraction,
allele fraction (e.g., VAF), tumor ploidy, etc.).
[0781] In some embodiments, each sequence read in the first
plurality of sequence reads
that map to the corresponding bin (e.g., used for comparison to determine a
bin-level
coverage ratio) is a unique sequence read. In some embodiments, each sequence
read in the
first plurality of sequence reads that map to the corresponding bin comprises
one or more
unique identifiers (e.g., a unique molecular identifier or UMI). For example,
in some
embodiments, each sequence read that originates from (e.g., was amplified or
sequenced from) a
unique original cfDNA fragment comprises an identifier that indicates the
original cfDNA
fragment from which the sequence read is derived. In some such embodiments, a
plurality of
duplicate sequence reads originating from the same original cfDNA fragment
share the same
identifier.
[0782] In some embodiments, the sequence reads from the one or
more reference samples
that map to the corresponding bin are prepared using a DNA extraction and
enrichment
matched process, e.g., where the same process used on the test sample is also
used on the one
or more reference samples. In some embodiments, the sequence reads from the
one or more
reference samples are prepared using the same sequencing methodology as is
used to
generate the sequence reads for the test sample.
[0783] In some alternative embodiments, the determining a
plurality of bin-level
coverage ratios from the plurality of sequences is determined from a
comparison of (i) a
number of sequence reads in the second plurality of sequence reads that map to
the
corresponding bin and (ii) a number of sequence reads from one or more
reference samples
that map to the corresponding bin. For example, in some such embodiments, the
determining
the plurality of bin-level coverage ratios is performed using off-target
sequence reads (e.g.,
not panel-enriched sequence reads) rather than on-target sequence reads (e.g.,
panel-
enriched). In some such embodiments, the bins, sequence reads, method of
preparing the one
or more reference samples and/or method of determining the coverage ratios
comprises any
of the presently disclosed embodiments described above.
[0784] Referring to Block 506-3, the method further comprises
determining a plurality of
segment-level coverage ratios. A plurality of segments is formed by grouping
respective
subsets of adjacent bins in the plurality of bins based on a similarity
between the respective
coverage ratios of the subset of adjacent bins. For each respective segment in
the plurality of
segments, a segment-level coverage ratio is determined based on the
corresponding bin-level
coverage ratios for each bin in the respective segment.
[0785] In some embodiments, the segmentation is performed using
circular binary
segmentation (CBS). In some embodiments, the segment-level coverage ratio
comprises any
measurement of a number of copies of a genomic sequence compared to a
reference sequence
(e.g., a copy ratio, log2 ratio, coverage ratio, base fraction, allele
fraction (e.g., VAF), tumor
ploidy, etc.). In some embodiments, the segment-level coverage ratio is
obtained by a
measure of central tendency of the plurality of bin-level coverage ratios for
each bin in the
respective segment. For example, in some embodiments, the segment-level
coverage ratio is
obtained by an arithmetic mean, a weighted mean, a midrange, a midhinge, a
trimean, a
Winsorized mean, a mean, a median or a mode of the plurality of bin-level
coverage ratios for
each bin in the respective segment.
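The reduction from bin-level to segment-level coverage ratios can be sketched as follows. The sketch assumes segmentation (for example, by CBS) has already produced segment boundaries as half-open index ranges over the ordered bins; it does not implement CBS itself. The median and the weighted mean shown here are two of the measures of central tendency listed above, and the function names are illustrative.

```python
from statistics import median


def segment_level_ratios(bin_ratios, segments):
    """Median bin-level ratio per segment.

    segments: list of (start, end) half-open index ranges into bin_ratios.
    """
    return [median(bin_ratios[start:end]) for start, end in segments]


def weighted_mean(values, weights):
    """Weighted mean, an alternative measure of central tendency."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical bin-level log2 ratios forming two segments.
bin_ratios = [0.0, 0.1, -0.1, 0.9, 1.0, 1.1]
seg_ratios = segment_level_ratios(bin_ratios, [(0, 3), (3, 6)])
```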
[0786] In some embodiments, each segment is further filtered to
remove one or more
segments that fail to satisfy a filtering criterion. In some such embodiments,
the filtering
criterion is a position on a sex chromosome, where segments that are located
on sex
chromosomes are removed from the plurality of segments. In some embodiments,
the
filtering criterion is a minimum per-segment probe threshold. In some such
embodiments,
the filtering is performed by tallying the number of probes in the targeted
enrichment panel
that correspond to reference sequences spanned by the respective segment. If
the probe count
for the respective segment obtained from the tallying is below a specified
probe threshold,
then the segment is removed from the plurality of segments.
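The two filtering criteria in paragraph [0786] can be sketched as below. The `Segment` type, the chromosome names, the overlap rule used to tally probes, and the default threshold of 3 probes are all assumptions made for illustration; the disclosure does not specify a threshold value.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    chrom: str
    start: int
    end: int
    log2_ratio: float


def count_probes(segment, probes):
    """Tally probes overlapping the segment.

    probes: list of (chrom, start, end) tuples with half-open coordinates.
    """
    return sum(
        1
        for chrom, start, end in probes
        if chrom == segment.chrom and start < segment.end and end > segment.start
    )


def filter_segments(segments, probes, min_probes=3, sex_chroms=("chrX", "chrY")):
    """Drop segments on sex chromosomes or with too few probes."""
    kept = []
    for seg in segments:
        if seg.chrom in sex_chroms:
            continue
        if count_probes(seg, probes) < min_probes:
            continue
        kept.append(seg)
    return kept


# Hypothetical segments and probe coordinates.
segs = [
    Segment("chr1", 0, 1000, 0.1),
    Segment("chr1", 1000, 2000, 0.5),
    Segment("chrX", 0, 1000, -0.4),
]
probes = [
    ("chr1", 100, 220), ("chr1", 300, 420), ("chr1", 500, 620),
    ("chr1", 1500, 1620),
    ("chrX", 10, 130), ("chrX", 200, 320), ("chrX", 400, 520),
]
kept = filter_segments(segs, probes)
```

In this example only the first segment survives: the second has a single probe, and the third lies on a sex chromosome.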
[0787] Referring to Block 508-3, the method further comprises
fitting, for each respective
simulated circulating tumor fraction in a plurality of simulated circulating
tumor fractions,
each respective segment in the plurality of segments to a respective integer
copy state in a
plurality of integer copy states. The fitting is performed by identifying the
respective integer
copy state in the plurality of integer copy states that best matches the
segment-level coverage
ratio. The fitting thus generates, for each respective simulated circulating
tumor fraction in
the plurality of simulated tumor fractions, a respective set of integer copy
states for the
plurality of segments.
[0788] In some embodiments, a simulated circulating tumor
fraction is a specified value.
In some embodiments, the simulated circulating tumor fraction is between 10^-6
and 0.999. In
some embodiments, the simulated circulating tumor fraction is between 10^-5 and
0.999. In
some embodiments, the simulated circulating tumor fraction is between 10 and
0.999. In
some embodiments, the simulated circulating tumor fraction is between 0.001
and 0.999. In
some embodiments, the simulated circulating tumor fraction is between 0.01 and
0.99. In
some embodiments, the simulated circulating tumor fraction is 0 or 100. In
some
embodiments, the plurality of simulated circulating tumor fractions comprises
at least 10
simulated circulating tumor fractions. In some embodiments, the plurality of
simulated
circulating tumor fractions comprises at least 5, 10, 15, 20, 25, 30, 40, 50,
60, 70, 80, 90, 100
or more simulated circulating tumor fractions.
[0789] In some embodiments, the plurality of circulating tumor
fractions comprises 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48,
49, 50, 51, 52, 53, 54,
55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
74, 75, 76, 77, 78, 79,
80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98,
99, 100, or more than
100 circulating tumor fraction values.
[0790] In some embodiments, the plurality of simulated
circulating tumor fractions spans
a range of at least from 5% to 25%. In some embodiments, the plurality of
simulated
circulating tumor fractions spans a range of at least from 1% to 50%. In some
embodiments,
the plurality of simulated circulating tumor fractions spans a range having a
lower boundary
between about 0.1% and about 5% (e.g., 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, 0.6%,
0.7%, 0.8%,
0.9%, 1%, 1.5%, 2%, 3%, 4%, or 5%) and an upper boundary between about 25% and
about
100% (e.g., 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%,
90%,
95%, or 100%). In some embodiments, the span between each consecutive pair of
simulated
tumor fractions is no more than 5%. In some embodiments, the span between
consecutive
pairs of simulated tumor fractions is no more than 1%, 2%, 3%, 4%, 5%, 6%, 7%,
8%, 9%, or
10%. In some embodiments, the span between consecutive pairs of simulated
tumor fractions
is consistent through the entire range of simulated tumor fractions. In other
embodiments, the
span between consecutive pairs of simulated tumor fractions increases as the
simulated tumor
fraction increases. That is, in some embodiments, the span between low
simulated tumor
fractions is small and the span between high tumor fractions is larger.
[0791] In some embodiments, the plurality of circulating tumor
fractions comprises every
value between 0 and 1 (that is, between 1% circulating tumor fraction and 100%
circulating
tumor fraction) with a span of 0.01 between each pair of values (e.g., 0.01,
0.02, 0.03,... 0.98,
0.99).
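The grids of simulated circulating tumor fractions described in paragraphs [0790] and [0791] can be generated as in the following sketch. The first function builds the evenly spaced grid 0.01, 0.02, ..., 0.99; the second builds a grid whose spacing widens as the tumor fraction increases. The growth factor and starting step in the second function are hypothetical values chosen only to show the idea.

```python
def uniform_grid(lower=0.01, upper=0.99, step=0.01):
    """Evenly spaced simulated tumor fractions, inclusive of both ends."""
    n = round((upper - lower) / step) + 1
    return [round(lower + i * step, 10) for i in range(n)]


def widening_grid(lower=0.001, upper=0.5, first_step=0.001, growth=1.5):
    """Grid whose spacing grows with tumor fraction.

    first_step and growth are illustrative assumptions, not disclosed values.
    """
    values = []
    current, step = lower, first_step
    while current <= upper:
        values.append(round(current, 10))
        current += step
        step *= growth
    return values


grid = uniform_grid()
wide = widening_grid()
```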
[0792] In some embodiments, the plurality of integer copy states
comprises a 1-copy
state, a 2-copy state, a 3-copy state, and a 4-copy state. In some
embodiments, the plurality
of integer copy states includes at least 3 states, at least 4 states, at least
5 states, at least six
states, or more. In some embodiments, the plurality of states represents a
span of consecutive
integer values, generally starting from 1. In some embodiments, the plurality
of integer copy
states comprises at least 1, at least 2, at least 3, at least 4, at least 5,
at least 6, at least 7, at
least 8, at least 9, or at least 10 copy states.
[0793] In some embodiments, the integer copy state is used to
obtain a coverage ratio for
each respective copy state in the plurality of copy states and each respective
simulated
circulating tumor fraction in the plurality of simulated circulating tumor
fractions. In some
embodiments, the coverage ratio is a log2-transformed coverage ratio (e.g.,
where negative
numbers indicate copy number loss and positive numbers indicate copy number
gain). In
some embodiments, the coverage ratio is between -3 and 3. In some embodiments,
the
coverage ratio is between -4 and -3, between -3 and -2, between -2 and -1,
between -1 and 0,
between 0 and 1, between 1 and 2, between 2 and 3, or between 3 and 4.
[0794] In some embodiments, the fitting includes using a maximum
likelihood estimation
method to fit each respective segment in the plurality of segments to the
respective integer
copy state. In some such embodiments, the maximum likelihood estimation method
is an
expectation maximization algorithm that considers the error between each of
the plurality of
copy states and the segment-level coverage ratio at each of the plurality of
simulated
circulating tumor fractions.
[0795] In some such embodiments, the identifying the respective
integer copy state that
best matches the segment-level coverage ratio is performed by, for each
respective segment
in the plurality of segments, selecting the copy state with the smallest
distance (e.g, the
smallest error) from the segment-level coverage ratio for the respective
segment, and
assigning the respective copy state to the segment. In some such embodiments,
the method
further comprises assigning a copy state to each segment in the plurality of
segments, based
on a consideration (e.g., a minimization) of the error. In some embodiments,
the
consideration is performed for each possible copy state corresponding to each
segment in the
plurality of segments, and the procedure is then repeated for each simulated
circulating tumor
fraction in the plurality of circulating tumor fractions. Thus, each iteration
of the procedure
will produce a plurality of sets of integer copy states, where each set of
integer copy states is
associated with a simulated circulating tumor fraction in the plurality of
circulating tumor
fractions, and where each integer copy state in the set of integer copy states
is associated with
a segment in the plurality of segments.
[0796] In some embodiments, the fitting includes, for each
respective simulated tumor
fraction in the plurality of simulated tumor fractions: determining, for each
respective integer
copy state in the plurality of integer copy states, a corresponding expected
coverage ratio;
comparing, for each respective segment in the plurality of segments, the
corresponding
segment-level coverage ratio to the expected coverage ratio for
each respective
integer copy state in the plurality of integer copy states; and assigning, for
each respective
segment in the plurality of segments, a corresponding integer copy state based
on the
comparison.
[0797] Thus, in some such embodiments, the consideration of the
error between each
integer copy state and the segment-level coverage ratio of each segment is
determined using a
comparison between the expected coverage ratio of each integer copy state and
the segment-
level coverage ratio of each segment.
[0798] In some such embodiments, for each respective integer copy
state in the plurality
of integer copy states, the corresponding expected coverage ratio is
determined according to
the relationship:
[0799] $\log_2(CR) = \log_2\left(\frac{2(1 - TF_i) + (CN_j)(TF_i)}{2}\right)$,
[0800] where CR is the expected coverage ratio, TF_i is the respective simulated
circulating tumor fraction, and CN_j is the respective integer copy state.
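The relationship above, log2(CR) = log2((2(1 - TF) + CN x TF) / 2), and the nearest-state assignment of paragraphs [0795] and [0796] can be sketched as follows. The sketch assumes a diploid normal background, copy states 1 through 4, and absolute difference as the distance measure; the function names are illustrative and not taken from the disclosure.

```python
import math


def expected_log2_ratio(tumor_fraction, copy_state):
    """Expected log2 coverage ratio for a tumor copy state at a tumor fraction.

    Normal cells contribute 2 copies with weight (1 - tumor_fraction);
    tumor cells contribute copy_state copies with weight tumor_fraction.
    """
    mixed_copies = 2 * (1 - tumor_fraction) + copy_state * tumor_fraction
    return math.log2(mixed_copies / 2)


def assign_copy_states(segment_ratios, tumor_fraction, copy_states=(1, 2, 3, 4)):
    """Assign each segment the copy state with the closest expected ratio."""
    expected = {cn: expected_log2_ratio(tumor_fraction, cn) for cn in copy_states}
    return [
        min(copy_states, key=lambda cn: abs(ratio - expected[cn]))
        for ratio in segment_ratios
    ]


# Hypothetical segment-level log2 ratios fitted at a simulated tumor fraction of 0.2.
states = assign_copy_states([-0.15, 0.0, 0.14, 0.3], tumor_fraction=0.2)
```

At a simulated tumor fraction of 0.2, the expected log2 ratios for copy states 1 through 4 are approximately -0.152, 0, 0.138, and 0.263, so the four example segments are assigned states 1, 2, 3, and 4 respectively.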
[0801] Referring to Block 510-3, the method further comprises
determining the
circulating tumor fraction for the test subject based on optimization (e.g.,
minimization) of an
error between corresponding segment-level coverage ratios and integer copy
states (e.g.,
relative to the fitted integer copy state) across the plurality of simulated
circulated tumor
fractions.
[0802] In some embodiments, the determining the circulating tumor
fraction for the test
subject comprises determining a measure of fit, for each respective simulated
tumor fraction
in the plurality of simulated tumor fractions, based on the aggregate of a
difference, for each
respective segment in the plurality of segments, between the respective
segment-level
coverage ratio and the expected coverage ratio for the corresponding copy
state fit to the
respective segment. The determining further comprises selecting the simulated
tumor
fraction, in the plurality of tumor fractions, with the best measure of fit.
[0803] In some embodiments, the measure of fit for each
respective segment, in the
plurality of segments, is defined by the relationship:
[0804] $w_i = \sum_k E_k l_k$,
[0805] where w_i is the measure of fit for simulated tumor fraction i, E_k is the
square of the difference between the respective segment-level coverage ratio and
the expected coverage ratio for the copy state k at tumor fraction i, and l_k is
the number of probe sequences, in the plurality of probe sequences, that fall
within the respective segment.
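A minimal sketch of the probe-weighted measure of fit and the selection of the best simulated tumor fraction is given below. It assumes the expected coverage ratio takes the form log2((2(1 - TF) + CN x TF) / 2), that each segment is fitted to the copy state with the smallest squared error, and that the best measure of fit is the smallest weighted sum. Function names, the copy-state set, and the example data are hypothetical.

```python
import math


def expected_log2_ratio(tumor_fraction, copy_state):
    """Expected log2 coverage ratio assuming a diploid normal background."""
    return math.log2((2 * (1 - tumor_fraction) + copy_state * tumor_fraction) / 2)


def measure_of_fit(segment_ratios, probe_counts, tumor_fraction,
                   copy_states=(1, 2, 3, 4)):
    """Sum over segments of (squared error to nearest copy state) x probe count."""
    total = 0.0
    for ratio, n_probes in zip(segment_ratios, probe_counts):
        squared_error = min(
            (ratio - expected_log2_ratio(tumor_fraction, cn)) ** 2
            for cn in copy_states
        )
        total += squared_error * n_probes
    return total


def estimate_tumor_fraction(segment_ratios, probe_counts, grid,
                            copy_states=(1, 2, 3, 4)):
    """Return the simulated tumor fraction with the smallest measure of fit."""
    return min(
        grid,
        key=lambda tf: measure_of_fit(segment_ratios, probe_counts, tf, copy_states),
    )


# Hypothetical segments simulated at a tumor fraction of 0.3.
true_states = [1, 2, 3, 2, 4]
observed = [expected_log2_ratio(0.3, cn) for cn in true_states]
probe_counts = [12, 40, 8, 25, 5]
grid = [i / 100 for i in range(1, 100)]
estimate = estimate_tumor_fraction(observed, probe_counts, grid)
```

Weighting by probe count lets well-covered segments dominate the score, so a poorly supported segment with a noisy ratio has less influence on the selected estimate.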
[0806] For example, in some embodiments, the optimization of the
respective segment-
level errors is a minimization of error to obtain an error score. In some
embodiments, the
error score is determined by calculating the sum of errors between each of the
plurality of
assigned copy states and the segment-level coverage ratio (e.g., relative to
the fitted integer
copy state), for each segment in the plurality of segments, for each of a
plurality of simulated
circulating tumor fractions. Thus, in embodiments where the error between the
segment-level
coverage ratio for each segment and the assigned copy state for the respective
segment is a
minimized error (e.g., due to the selection of nearest copy states), the sum
of errors thus
generates a minimized error score for each simulated circulating tumor
fraction in the
plurality of circulating tumor fractions. In some such embodiments, the
minimized error
scores are compared, and the smallest score is selected, thus selecting the
circulating tumor
fraction estimate having the corresponding smallest score as the circulating
tumor fraction
estimate for the test subject. In some embodiments, the error scores are
further weighted
prior to summing (e.g., by weighting each error in the summed error score
based upon a
number of probes corresponding to the respective segment). Additional
embodiments and
examples illustrating determining the circulating tumor fraction for the test
subject are
described above with reference to Figures 4F3 and 6A3-C3.
[0807] In some embodiments, the obtained circulating tumor
fraction estimate is used for
further downstream analysis and biomarker detection (e.g., calculation of
variant allele
fractions, variant calling, and/or identification of other metrics). In some
embodiments, the
obtained circulating tumor fraction estimate is used as a metric for disease
detection,
diagnosis, and/or treatment. In some embodiments, the obtained circulating
tumor fraction
estimate is included in a clinical report made available to the patient or a
clinician. In some
embodiments, the obtained circulating tumor fraction estimate is used to
select appropriate
therapies and/or clinical trials for assessment of treatment response.
[0808] In some embodiments, the methods described herein include
generating a clinical
report 139-3 (e.g., a patient report), providing clinical support for
personalized cancer
therapy, and/or using the information curated from sequencing of a liquid
biopsy sample, as
described above. In some embodiments, the report is provided to a patient,
physician,
medical personnel, or researcher in a digital copy (for example, a JSON
object, a pdf file, or
an image on a website or portal), or a hard copy (for example, printed on paper
or another
tangible medium). A report object, such as a JSON object, can be used for
further processing
and/or display. For example, information from the report object can be used to
prepare a
clinical laboratory report for return to an ordering physician. In some
embodiments, the
report is presented as text, as audio (for example, recorded or streaming), as
images, or in
another format and/or any combination thereof.
[0809] The report includes information related to the specific
characteristics of the
patient's cancer, e.g., detected genetic variants, epigenetic abnormalities,
associated
oncogenic pathogenic infections, and/or pathology abnormalities. In some
embodiments,
other characteristics of a patient's sample and/or clinical records are also
included in the
report. For example, in some embodiments, the clinical report includes
information on
clinical variants, e.g., one or more of copy number variants (e.g., for
actionable genes
CCNE1, CD274(PD-L1), EGFR, ERBB2(HER2), MET, MYC, BRCA1, and/or BRCA2),
fusions, translocations, and/or rearrangements (e.g., in actionable genes ALK,
ROS1, RET,
NTRK1, FGFR2, FGFR3, NTRK2 and/or NTRK3), pathogenic single nucleotide
polymorphisms, insertion-deletions (e.g., somatic/tumor and/or
germline/normal), therapy
biomarkers, microsatellite instability status, and/or tumor mutational
burden.
[0810] Conversion of solid tumor test to liquid biopsy test. In
one embodiment, the solid
tissue sample is insufficient for NGS testing (for example, the sample is too
small or too
degraded, the amount or quality of nucleic acids extracted from the sample
does not result in
quality NGS results that would result in reliable determination of variants
and/or other
genetic characteristics of the sample), and the physician or patient may
decide to convert the
solid tissue test that was ordered to a liquid biopsy test to be performed on
a liquid biopsy
sample collected from the same patient. The resulting report and/or display of
the results on
a portal may include an "xF Conversion Badge" to distinguish any order that
has been
converted from solid tissue test to a liquid biopsy test (compared to, for
example, a liquid
biopsy test that was not initially ordered as a solid tissue test). This will
allow a user to
identify which orders have been converted by this process, and distinguish
them from orders
that were intentionally placed for the liquid biopsy panel.
[0811] Longitudinal Reporting. In various embodiments, a report
may include and/or
compare the results of multiple liquid biopsy tests and/or solid tumor tests
(for example,
multiple tests associated with the same patient). The results of multiple
liquid biopsy tests
and/or solid tumor tests may be displayed on a portal in a variety of
configurations that may
be selected and/or customized by the viewer. The tests may have been performed
at different
times, and the samples on which the tests were performed may have been
collected at
different times.
[0812] Download result. Clinical and/or molecular data associated
with a patient (for
example, information that would be included in the report), may be aggregated
and made
available via the portal. Any portion of the report data may be available for
download (for
example, as a CSV file) by the physician and/or patient. In various
embodiments, the data
may include data related to genetic variants, RNA expression levels,
immunotherapy markers
(including MSI and TMB), RNA fusions, etc. In one embodiment, if a physician
or medical
facility has ordered multiple tests (all tests may be associated with the same
patient or tests
may be associated with multiple patients), results associated with more than
one test may be
aggregated into a single file for downloading.
[0813] Methods Integrating Multiple Improvements
[0814] Advantageously, the present disclosure describes several
improvements relating to
the analysis of cell-free DNA in a liquid biopsy sample from a subject with
cancer. For
instance, among other aspects, the present disclosure describes improvements
in (i) somatic
variant (e.g., SNP) identification, (ii) focal copy number variation
identification, and (iii)
circulating tumor fraction determination. It is contemplated that various
combinations of
these improvements, as well as other non-conventional aspects described
herein, may be
integrated into a common bioinformatics pipeline for analyzing liquid biopsy
samples. For
instance, in some embodiments, a bioinformatics pipeline integrating one, two,
or all three of
these improvements is further improved by parallel analysis of nucleic acids
from a solid
cancerous tissue sample of the subject, by parallel analysis of nucleic acids
from a non-
cancerous tissue of the subject, or both. Examples of various combinations of
improvements
that may be combined into a single liquid biopsy bioinformatic pipeline,
methods associated
thereof, systems for performing such methods, and/or non-transitory computer
readable
media for executing such methods are described below. It will be appreciated
that these
combinations can be performed with any other preparatory or bioinformatic
steps described
in the other methods described herein, e.g., methods 200, 400-1, 400-2, 400-3,
450, 500-1,
500-2, 500-3, 600-1, and 600-2 as illustrated in Figures 2, 4, 5, and 6, and
further described
above.
[0815] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in somatic
variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant Identification," and an
Identification,- and an
improvement in focal copy number identification, e.g., as described above in
the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or
"Copy Number Variation."
[0816] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a respective
candidate
sequence variant (e.g., a SNP) identified from the first plurality of aligned
sequence reads can
be validated as a somatic sequence variant by comparing a corresponding
variant allele
fragment count for the respective candidate sequence variant to a dynamic
variant count
threshold for the locus of the reference sequence that the candidate variant
maps to, where the
dynamic variant count threshold is based upon a pre-test odds of a positive
variant call for the
locus based upon a prevalence of variants in a genomic region that includes
the locus from a
first set of nucleic acids obtained from a cohort of subjects having the
cancer condition, such
that when the corresponding variant fragment count satisfies the dynamic
variant count
threshold, the presence of the somatic sequence is validated, and when the
corresponding
variant fragment count does not satisfy the dynamic variant count threshold,
the presence of
the somatic sequence is rejected, and (iii) determining whether a candidate
focal copy number
variation for a respective genomic segment, identified from the first
plurality of aligned
sequence reads, can be validated as a somatic focal copy number variation by
(a) determining
bin-level sequence ratios, segment-level sequence ratios, and segment-level
measures of
dispersion from a comparison of (i) the sequence reads in the first plurality
of sequence reads
that map to respective genomic bins or genomic segments to (ii) sequence reads
from one or
more reference samples that map to the same respective genomic bins or genomic
segments,
e.g., as described above in the section titled "Systems and Methods for
Improved Validation
of Copy Number Variation" and/or "Copy Number Variation," and (b) determining
whether
determined bin-level sequence ratios, segment-level sequence ratios, and
segment-level
measures of dispersion corresponding to the respective genomic segment satisfy
a plurality of
filters that include (1) a measure of central tendency bin-level sequence
ratio filter that is
fired when a measure of central tendency of the plurality of bin-level
sequence ratios
corresponding to the subset of bins encompassed by the respective segment
fails to satisfy
one or more bin-level sequence ratio thresholds, (2) a confidence filter that
is fired when the
segment-level measure of dispersion corresponding to the respective segment
fails to satisfy a
confidence threshold, and (3) a measure of central tendency-plus-deviation bin-
level
sequence ratio filter that is fired when a measure of central tendency of the
plurality of bin-
level sequence ratios corresponding to the subset of bins encompassed by the
respective
segment fails to satisfy one or more measure of central tendency-plus-
deviation bin-level
sequence ratio thresholds, e.g., as described above in the section titled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," such
that when the determined bin-level sequence ratios, segment-level sequence
ratios, and
segment-level measures of dispersion satisfy all of the filters in the
plurality of filters, the
focal copy number variation is validated, and when the determined bin-level
sequence ratios,
segment-level sequence ratios, and segment-level measures of dispersion do not
satisfy all of
the filters in the plurality of filters, the focal copy number variation is
rejected.
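As an illustration only, the three filters recited above can be sketched in Python as follows. The sketch assumes an amplification call, uses the median as the measure of central tendency, and derives the central tendency-plus-deviation cutoff from background bins as the median plus a multiple of the standard deviation. These choices, the names `CnvThresholds` and `validate_focal_amplification`, and every numeric threshold are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass
from statistics import median, pstdev
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CnvThresholds:
    """Hypothetical thresholds; real values would be tuned per assay."""
    min_median_ratio: float = 0.5  # filter (1): minimum median bin-level log2 ratio
    max_dispersion: float = 0.4    # filter (2): maximum segment-level dispersion
    n_deviations: float = 2.0      # filter (3): deviations above the background median


def validate_focal_amplification(
    segment_bin_ratios: Sequence[float],
    segment_dispersion: float,
    background_bin_ratios: Sequence[float],
    thresholds: CnvThresholds = CnvThresholds(),
) -> Tuple[bool, List[str]]:
    """Return (validated, fired_filters) for one candidate segment."""
    fired: List[str] = []
    seg_median = median(segment_bin_ratios)

    # (1) Central tendency bin-level ratio filter.
    if seg_median < thresholds.min_median_ratio:
        fired.append("central_tendency")

    # (2) Confidence filter on the segment-level measure of dispersion.
    if segment_dispersion > thresholds.max_dispersion:
        fired.append("confidence")

    # (3) Central tendency-plus-deviation filter (assumed background-derived cutoff).
    cutoff = median(background_bin_ratios) + thresholds.n_deviations * pstdev(
        background_bin_ratios
    )
    if seg_median < cutoff:
        fired.append("central_tendency_plus_deviation")

    # The candidate is validated only when no filter fired.
    return (not fired, fired)
```

Returning the list of fired filters alongside the validation result keeps the reason for a rejection available for review.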
[0817] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in focal copy
number
identification, also integrates an improvement in circulating tumor fraction
determination,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Circulating Tumor Fraction Estimates" and/or "Circulating Tumor Fraction." In
some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in focal copy number identification,
is further
improved by parallel analysis of nucleic acids from a solid cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent Testing."
In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in focal copy number identification,
is further
improved by parallel analysis of nucleic acids from a non-cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent Testing."

[0818] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in focal copy
number
identification, also integrates an improvement in circulating tumor fraction
determination,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Circulating Tumor Fraction Estimates" and/or "Circulating Tumor Fraction," and
is further
improved by parallel analysis of nucleic acids from a solid cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent
Testing." In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in focal copy number identification,
also integrates
an improvement in circulating tumor fraction determination, e.g., as described
above in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction," and is further improved by parallel
analysis of nucleic
acids from a non-cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing." In some embodiments, the bioinformatics
pipeline
integrating at least an improvement in somatic variant identification and an
improvement in
focal copy number identification, is further improved by parallel analysis of
nucleic acids
from a solid cancerous tissue sample of the subject, e.g., as described above
in the section
entitled "Concurrent Testing" and is further improved by parallel analysis of
nucleic acids
from a non-cancerous tissue sample of the subject, e.g., as described above in
the section
entitled "Concurrent Testing."
[0819] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in focal copy
number
identification, also integrates an improvement in circulating tumor fraction
determination,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Circulating Tumor Fraction Estimates" and/or "Circulating Tumor Fraction," is
further
improved by parallel analysis of nucleic acids from a solid cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent Testing,"
and is further
improved by parallel analysis of nucleic acids from a non-cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent Testing."
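The embodiments enumerated above amount to different combinations of a small set of optional pipeline components. A minimal sketch of how such combinations might be represented is given below; the class name `PipelineConfig` and its field names are hypothetical labels chosen for illustration and do not appear in this disclosure.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical switches, one per optional component named in the text."""
    somatic_variant_validation: bool = False
    focal_cnv_validation: bool = False
    tumor_fraction_estimation: bool = False
    solid_tumor_parallel_analysis: bool = False
    normal_tissue_parallel_analysis: bool = False

    def enabled_modules(self) -> List[str]:
        """List the enabled components in declaration order."""
        return [f.name for f in fields(self) if getattr(self, f.name)]


# Example: variant validation and focal copy number validation, further
# combined with circulating tumor fraction estimation.
config = PipelineConfig(
    somatic_variant_validation=True,
    focal_cnv_validation=True,
    tumor_fraction_estimation=True,
)
print(config.enabled_modules())
```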
[0820] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in somatic
variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and an
improvement in circulating tumor fraction determination, e.g., as described
above in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction."
[0821] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a respective
candidate
sequence variant (e.g., a SNP) identified from the first plurality of aligned
sequence reads can
be validated as a somatic sequence variant by comparing a corresponding
variant allele
fragment count for the respective candidate sequence variant to a dynamic
variant count
threshold for the locus of the reference sequence that the candidate variant
maps to, where the
dynamic variant count threshold is based upon a pre-test odds of a positive
variant call for the
locus based upon a prevalence of variants in a genomic region that includes
the locus from a
first set of nucleic acids obtained from a cohort of subjects having the
cancer condition, such
that when the corresponding variant fragment count satisfies the dynamic
variant count
threshold, the presence of the somatic sequence is validated, and when the
corresponding
variant fragment count does not satisfy the dynamic variant count threshold,
the presence of
the somatic sequence is rejected, and (iii) estimating a circulating tumor
fraction for the
subject by (a) determining bin-level coverage ratios and segment-level
coverage ratios from a
comparison of (i) the number of sequence reads in the first plurality of
sequence reads that
map to respective genomic bins or genomic segments and (ii) the number of
sequence reads
from one or more reference samples that map to the same respective genomic
bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction," (b)
identifying integer copy states that best match segment-level coverage ratios
by fitting
segments to integer copy states for a plurality of simulated circulating tumor
fractions, and
(c) estimating the circulating tumor fraction for the test subject based on a
measure of fit
between corresponding segment-level coverage ratios and integer copy states
across the
plurality of simulated circulating tumor fractions, e.g., as described above in
the section titled
"Systems and Methods for Improved Circulating Tumor Fraction Estimates" and/or
"Circulating Tumor Fraction."
[0822] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in focal copy number
identification,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Validation of Copy Number Variation" and/or "Copy Number Variation." In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in circulating tumor fraction
determination, is
further improved by parallel analysis of nucleic acids from a solid cancerous
tissue sample of
the subject, e.g., as described above in the section entitled "Concurrent
Testing." In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in circulating tumor fraction
determination, is
further improved by parallel analysis of nucleic acids from a non-cancerous
tissue sample of
the subject, e.g., as described above in the section entitled "Concurrent
Testing."
[0823] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in focal copy number
identification,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Validation of Copy Number Variation" and/or "Copy Number Variation," and is
further
improved by parallel analysis of nucleic acids from a solid cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent
Testing." In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in somatic
variant identification and an improvement in circulating tumor fraction
determination, also
integrates an improvement in focal copy number identification, e.g., as
described above in the
section entitled "Systems and Methods for Improved Validation of Copy Number
Variation"
and/or "Copy Number Variation," and is further improved by parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing." In some embodiments, the bioinformatics
pipeline
integrating at least an improvement in somatic variant identification and an
improvement in
circulating tumor fraction determination, is further improved by parallel
analysis of nucleic
acids from a solid cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing," and is further improved by parallel
analysis of nucleic
acids from a non-cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing."
[0824] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in focal copy number
identification,
e.g., as described above in the section entitled "Systems and Methods for
Improved
Validation of Copy Number Variation" and/or "Copy Number Variation," is
further improved
by parallel analysis of nucleic acids from a solid cancerous tissue sample of
the subject, e.g.,
as described above in the section entitled "Concurrent Testing" and is further
improved by
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0825] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in somatic
variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0826] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a respective
candidate
sequence variant (e.g., a SNP) identified from the first plurality of aligned
sequence reads can
be validated as a somatic sequence variant by comparing a corresponding
variant allele
fragment count for the respective candidate sequence variant to a dynamic
variant count
threshold for the locus of the reference sequence that the candidate variant
maps to, where the
dynamic variant count threshold is based upon a pre-test odds of a positive
variant call for the
locus based upon a prevalence of variants in a genomic region that includes
the locus from a
first set of nucleic acids obtained from a cohort of subjects having the
cancer condition, such
that when the corresponding variant fragment count satisfies the dynamic
variant count
threshold, the presence of the somatic sequence is validated, and when the
corresponding
variant fragment count does not satisfy the dynamic variant count threshold,
the presence of
the somatic sequence is rejected, and (iii) obtaining, from a second
sequencing reaction of
nucleic acid fragments in a solid tumor biopsy sample from the subject, a
second plurality of
sequence reads aligned to a reference sequence for the species of the subject,
and analyzing
the nucleic acids from the solid tumor biopsy sample using a parallel analysis
including, at
least, determining whether a respective candidate sequence variant (e.g., a
SNP) identified
from the second plurality of aligned sequence reads can be validated as a
somatic sequence
variant.
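The dynamic variant count threshold of step (ii) can be illustrated with a short sketch. It assumes one possible formulation, not stated in this passage: the pre-test odds derived from cohort prevalence are multiplied by a binomial likelihood ratio (true variant versus sequencing noise), and the threshold is the smallest variant fragment count whose post-test odds reach a target. The function names and the default `error_rate`, `expected_vaf`, and `required_posterior` values are hypothetical.

```python
import math


def pretest_odds(prevalence: float) -> float:
    """Convert a cohort prevalence (0 < p < 1) for a genomic region into odds."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return prevalence / (1.0 - prevalence)


def dynamic_count_threshold(
    depth: int,
    prevalence: float,
    error_rate: float = 0.001,         # assumed per-fragment noise rate
    expected_vaf: float = 0.01,        # assumed allele fraction of a true variant
    required_posterior: float = 0.99,  # assumed post-test probability target
) -> int:
    """Smallest variant fragment count whose post-test odds reach the target.

    Returns depth + 1 when no count at this depth is sufficient.
    """
    log_pre = math.log(pretest_odds(prevalence))
    log_target = math.log(required_posterior / (1.0 - required_posterior))
    # Per-fragment log likelihood ratios; the binomial coefficient cancels.
    log_lr_variant = math.log(expected_vaf / error_rate)
    log_lr_reference = math.log((1.0 - expected_vaf) / (1.0 - error_rate))
    for count in range(depth + 1):
        log_lr = count * log_lr_variant + (depth - count) * log_lr_reference
        if log_pre + log_lr >= log_target:
            return count
    return depth + 1


def validate_variant(variant_fragment_count: int, depth: int, prevalence: float) -> bool:
    """Validate when the observed count satisfies the locus-specific threshold."""
    return variant_fragment_count >= dynamic_count_threshold(depth, prevalence)
```

Under these assumed parameters, a locus in a frequently mutated region requires fewer supporting fragments than a locus in a rarely mutated region at the same depth, which is the behavior the dynamic threshold is meant to capture.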
[0827] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in
somatic variant identification and parallel analysis of nucleic acids from a
solid cancerous
tissue sample of the subject, also integrates an improvement in circulating
tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in
somatic variant identification and parallel analysis of nucleic acids from a
solid cancerous
tissue sample of the subject, also integrates an improvement in circulating
tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in
somatic variant identification and parallel analysis of nucleic acids from a
solid cancerous
tissue sample of the subject, is further improved by parallel analysis of
nucleic acids from a
non-cancerous tissue sample of the subject, e.g., as described above in the
section entitled
"Concurrent Testing."
[0828] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," and
also integrates an improvement in circulating tumor fraction determination,
e.g., as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction." In some embodiments, the
bioinformatics
pipeline integrating at least an improvement in somatic variant identification
and parallel
analysis of nucleic acids from a solid cancerous tissue sample of the subject,
also integrates
an improvement in focal copy number identification, e.g., as described above
in the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or "Copy Number Variation," and is further improved by parallel analysis of
nucleic acids
from a non-cancerous tissue sample of the subject, e.g., as described above in
the section
entitled "Concurrent Testing." In some embodiments, the bioinformatics
pipeline integrating
at least an improvement in somatic variant identification and parallel
analysis of nucleic acids
from a solid cancerous tissue sample of the subject, also integrates an
improvement in
circulating tumor fraction determination, e.g., as described above in the
section entitled
"Systems and Methods for Improved Circulating Tumor Fraction Estimates" and/or
"Circulating Tumor Fraction," and is further improved by parallel analysis of
nucleic acids
from a non-cancerous tissue sample of the subject, e.g., as described above in
the section
entitled "Concurrent Testing."
[0829] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," also
integrates an improvement in circulating tumor fraction determination, e.g.,
as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction," and is further improved by
parallel analysis
of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as
described above in
the section entitled "Concurrent Testing."
[0830] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in somatic
variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0831] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a respective
candidate
sequence variant (e.g., a SNP) identified from the first plurality of aligned
sequence reads can
be validated as a somatic sequence variant by comparing a corresponding
variant allele
fragment count for the respective candidate sequence variant to a dynamic
variant count
threshold for the locus of the reference sequence that the candidate variant
maps to, where the
dynamic variant count threshold is based upon a pre-test odds of a positive
variant call for the
locus based upon a prevalence of variants in a genomic region that includes
the locus from a
first set of nucleic acids obtained from a cohort of subjects having the
cancer condition, such
that when the corresponding variant fragment count satisfies the dynamic
variant count
threshold, the presence of the somatic sequence is validated, and when the
corresponding
variant fragment count does not satisfy the dynamic variant count threshold,
the presence of
the somatic sequence is rejected, and (iii) obtaining, from a second
sequencing reaction of
nucleic acid fragments in a non-cancerous tissue sample from the subject, a
second plurality
of sequence reads aligned to a reference sequence for the species of the
subject, and
analyzing the nucleic acids from the non-cancerous tissue sample using a
parallel analysis
including, at least, determining whether a respective candidate sequence
variant (e.g., a SNP)
identified from the second plurality of aligned sequence reads can be
validated as a somatic
sequence variant.
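One simple way to relate the results of the parallel analysis to the liquid biopsy results is to partition the validated calls from the two samples into shared and sample-specific sets. The sketch below assumes each variant is keyed by chromosome, position, reference allele, and alternate allele; the `Variant` alias and the function `compare_call_sets` are hypothetical, and how each category is interpreted is not specified by this passage.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def compare_call_sets(
    liquid_calls: Iterable[Variant],
    second_sample_calls: Iterable[Variant],
) -> Dict[str, Set[Variant]]:
    """Partition calls into shared, liquid-only, and second-sample-only sets."""
    liquid = set(liquid_calls)
    second = set(second_sample_calls)
    return {
        "shared": liquid & second,
        "liquid_only": liquid - second,
        "second_sample_only": second - liquid,
    }
```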
[0832] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in
somatic variant identification and parallel analysis of nucleic acids from a
non-cancerous
tissue sample of the subject, also integrates an improvement in circulating
tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in
somatic variant identification and parallel analysis of nucleic acids from a
non-cancerous
tissue sample of the subject, is further improved by parallel analysis of
nucleic acids from a
solid cancerous tissue sample of the subject, e.g., as described above in the
section entitled
"Concurrent Testing."
[0833] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," and
also integrates an improvement in circulating tumor fraction determination,
e.g., as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction." In some embodiments, the
bioinformatics
pipeline integrating at least an improvement in somatic variant identification
and parallel
analysis of nucleic acids from a non-cancerous tissue sample of the subject,
also integrates an
improvement in focal copy number identification, e.g., as described above in
the section
entitled "Systems and Methods for Improved Validation of Copy Number
Variation" and/or
"Copy Number Variation," and is further improved by parallel analysis of
nucleic acids from
a solid cancerous tissue sample of the subject, e.g., as described above in
the section entitled
"Concurrent Testing." In some embodiments, the bioinformatics pipeline
integrating at least
an improvement in somatic variant identification and parallel analysis of
nucleic acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
circulating
tumor fraction determination, e.g., as described above in the section entitled
"Systems and
Methods for Improved Circulating Tumor Fraction Estimates" and/or "Circulating
Tumor
Fraction," and is further improved by parallel analysis of nucleic acids from
a solid cancerous
tissue sample of the subject, e.g., as described above in the section entitled
"Concurrent
Testing."
[0834] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in somatic variant identification and parallel analysis of nucleic
acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
focal copy
number identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," also
integrates an improvement in circulating tumor fraction determination, e.g.,
as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction," and is further improved by
parallel analysis
of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as
described above
in the section entitled "Concurrent Testing."
[0835] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in focal
copy number
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and an
improvement in circulating tumor fraction determination, e.g., as described
above in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction."
[0836] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a candidate
focal copy
number variation for a respective genomic segment, identified from the first
plurality of
aligned sequence reads, can be validated as a somatic focal copy number
variation by (a)
determining bin-level sequence ratios, segment-level sequence ratios, and
segment-level
measures of dispersion from a comparison of (i) the sequence reads in the
first plurality of
sequence reads that map to respective genomic bins or genomic segments to (ii)
sequence
reads from one or more reference samples that map to the same respective
genomic bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and (b)
determining whether determined bin-level sequence ratios, segment-level
sequence ratios,
and segment-level measures of dispersion corresponding to the respective
genomic segment
satisfy a plurality of filters that include (1) a measure of central tendency
bin-level sequence
ratio filter that is fired when a measure of central tendency of the plurality
of bin-level
sequence ratios corresponding to the subset of bins encompassed by the
respective segment
fails to satisfy one or more bin-level sequence ratio thresholds, (2) a
confidence filter that is
fired when the segment-level measure of dispersion corresponding to the
respective segment
fails to satisfy a confidence threshold, and (3) a measure of central tendency-
plus-deviation
bin-level sequence ratio filter that is fired when a measure of central
tendency of the plurality
of bin-level sequence ratios corresponding to the subset of bins encompassed
by the
respective segment fails to satisfy one or more measure of central tendency-
plus-deviation
bin-level sequence ratio thresholds, e.g., as described above in the section
titled "Systems and
Methods for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," such that when the determined bin-level sequence ratios, segment-
level sequence
ratios, and segment-level measures of dispersion satisfy all of the filters in
the plurality of
filters, the focal copy number variation is validated, and when the determined
bin-level
sequence ratios, segment-level sequence ratios, and segment-level measures of
dispersion do
not satisfy all of the filters in the plurality of filters, the focal copy
number variation is
rejected, and (iii) estimating a circulating tumor fraction for the subject by
(a) determining
bin-level coverage ratios and segment-level coverage ratios from a comparison
of (i) the
number of sequence reads in the first plurality of sequence reads that map to
respective
genomic bins or genomic segments and (ii) the number of sequence reads from
one or more
reference samples that map to the same respective genomic bins or genomic
segments, e.g.,
as described above in the section titled "Systems and Methods for Improved
Circulating
Tumor Fraction Estimates" and/or "Circulating Tumor Fraction," (b) identifying
integer copy
states that best match segment-level coverage ratios by fitting segments to
integer copy states
for a plurality of simulated circulating tumor fractions, and (c) estimating
the circulating
tumor fraction for the test subject based on a measure of fit between
corresponding segment-
level coverage ratios and integer copy states across the plurality of
simulated circulating tumor
fractions, e.g., as described above in the section titled "Systems and Methods
for Improved
Circulating Tumor Fraction Estimates" and/or "Circulating Tumor Fraction."
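Steps (iii)(a) through (iii)(c) can be sketched as a grid search over simulated circulating tumor fractions. The sketch below makes several assumptions not stated in this passage: coverage ratios are on a linear scale relative to a diploid reference, non-tumor cell-free DNA carries two copies of every segment, tumor copy states are capped at `max_copies`, and the measure of fit is an unweighted sum of squared residuals. All function names are hypothetical.

```python
from math import inf
from typing import List, Optional, Sequence, Tuple


def expected_ratio(copies: int, tumor_fraction: float) -> float:
    """Expected coverage ratio for a tumor copy state mixed with diploid DNA."""
    return ((1.0 - tumor_fraction) * 2.0 + tumor_fraction * copies) / 2.0


def fit_segments(
    ratios: Sequence[float], tumor_fraction: float, max_copies: int = 5
) -> Tuple[List[int], float]:
    """Assign each segment its best integer copy state; return states and error."""
    states: List[int] = []
    total_error = 0.0
    for ratio in ratios:
        best = min(
            range(max_copies + 1),
            key=lambda c: abs(ratio - expected_ratio(c, tumor_fraction)),
        )
        states.append(best)
        total_error += (ratio - expected_ratio(best, tumor_fraction)) ** 2
    return states, total_error


def estimate_tumor_fraction(
    ratios: Sequence[float],
    grid: Optional[Sequence[float]] = None,
    max_copies: int = 5,
) -> Tuple[float, List[int]]:
    """Return the simulated tumor fraction (and copy states) with the best fit."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # simulated fractions 0.01 to 0.99
    best_tf, best_states, best_error = grid[0], [], inf
    for tf in grid:
        states, error = fit_segments(ratios, tf, max_copies)
        if error < best_error:
            best_tf, best_states, best_error = tf, states, error
    return best_tf, best_states
```

Note that a lower tumor fraction paired with larger copy number changes can fit the same ratios equally well; capping `max_copies` is only a crude way to limit that ambiguity in this sketch.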
[0837] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in somatic variant
identification, e.g.,
as described above in the section entitled "Systems and Methods for Improved
Validation of
Somatic Sequence Variants" and/or "Variant Identification." In some
embodiments, the
bioinformatics pipeline integrating at least an improvement in focal copy
number
identification and an improvement in circulating tumor fraction determination,
is further
improved by parallel analysis of nucleic acids from a solid cancerous tissue
sample of the
subject, e.g., as described above in the section entitled "Concurrent
Testing." In some
embodiments, the bioinformatics pipeline integrating at least an improvement
in focal copy
number identification and an improvement in circulating tumor fraction
determination, is
further improved by parallel analysis of nucleic acids from a non-cancerous
tissue sample of
the subject, e.g., as described above in the section entitled "Concurrent
Testing."
[0838] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in somatic variant
identification, e.g.,
as described above in the section entitled "Systems and Methods for Improved
Validation of
Somatic Sequence Variants" and/or "Variant Identification," and is further
improved by
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing." In some
embodiments, the
bioinformatics pipeline integrating at least an improvement in focal copy
number
identification and an improvement in circulating tumor fraction determination,
also integrates
an improvement in somatic variant identification, e.g., as described above in
the section
entitled "Systems and Methods for Improved Validation of Somatic Sequence
Variants"
and/or "Variant Identification," and is further improved by parallel analysis
of nucleic acids
from a non-cancerous tissue sample of the subject, e.g., as described above in
the section
entitled "Concurrent Testing." In some embodiments, the bioinformatics pipeline
integrating
at least an improvement in focal copy number identification and an improvement
in
circulating tumor fraction determination, is further improved by parallel
analysis of nucleic
acids from a solid cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing," and is further improved by parallel
analysis of nucleic
acids from a non-cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing."
[0839] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and an improvement in
circulating tumor
fraction determination, also integrates an improvement in somatic variant
identification, e.g.,
as described above in the section entitled "Systems and Methods for Improved
Validation of
Somatic Sequence Variants" and/or "Variant Identification," is further
improved by parallel
analysis of nucleic acids from a solid cancerous tissue sample of the subject,
e.g., as
described above in the section entitled "Concurrent Testing," and is further
improved by
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0840] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in focal
copy number
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0841] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a candidate
focal copy
number variation for a respective genomic segment, identified from the first
plurality of
aligned sequence reads, can be validated as a somatic focal copy number
variation by (a)
determining bin-level sequence ratios, segment-level sequence ratios, and
segment-level
measures of dispersion from a comparison of (i) the sequence reads in the
first plurality of
sequence reads that map to respective genomic bins or genomic segments to (ii)
sequence
reads from one or more reference samples that map to the same respective
genomic bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and (b)
determining whether determined bin-level sequence ratios, segment-level
sequence ratios,
and segment-level measures of dispersion corresponding to the respective
genomic segment
satisfy a plurality of filters that include (1) a measure of central tendency
bin-level sequence
ratio filter that is fired when a measure of central tendency of the plurality
of bin-level
sequence ratios corresponding to the subset of bins encompassed by the
respective segment
fails to satisfy one or more bin-level sequence ratio thresholds, (2) a
confidence filter that is
fired when the segment-level measure of dispersion corresponding to the
respective segment
fails to satisfy a confidence threshold, and (3) a measure of central tendency-
plus-deviation
bin-level sequence ratio filter that is fired when a measure of central
tendency of the plurality
of bin-level sequence ratios corresponding to the subset of bins encompassed
by the
respective segment fails to satisfy one or more measure of central tendency-
plus-deviation
bin-level sequence ratio thresholds, e.g., as described above in the section
titled "Systems and
Methods for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," such that when the determined bin-level sequence ratios, segment-level sequence
ratios, and segment-level measures of dispersion satisfy all of the filters in
the plurality of
filters, the focal copy number variation is validated, and when the determined
bin-level
sequence ratios, segment-level sequence ratios, and segment-level measures of
dispersion do
not satisfy all of the filters in the plurality of filters, the focal copy
number variation is
rejected, and (iii) obtaining, from a second sequencing reaction of nucleic
acid fragments in a
solid tumor biopsy sample from the subject, a second plurality of sequence
reads aligned to a
reference sequence for the species of the subject, and analyzing the nucleic
acids from the
solid tumor biopsy sample using a parallel analysis including, at least,
determining whether a
candidate focal copy number variation for a respective genomic segment,
identified from the
second plurality of aligned sequence reads, can be validated as a somatic
focal copy number
variation.
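The three-filter validation step described in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes the bin-level sequence ratios are log2 ratios for a candidate amplification, uses the median as the measure of central tendency, and assumes the central tendency-plus-deviation threshold is a background center plus a multiple of a background deviation. The names `Thresholds`, `fired_filters`, and `validate_focal_cnv`, and every numeric cutoff, are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Sequence


@dataclass(frozen=True)
class Thresholds:
    """Illustrative cutoffs only; none of these values come from the text."""
    min_center: float = 0.6       # minimum median log2 ratio for a gain
    max_dispersion: float = 0.3   # maximum allowed segment-level dispersion
    deviations: float = 2.0       # multiplier k in center + k * deviation


def fired_filters(
    bin_ratios: Sequence[float],
    segment_dispersion: float,
    background_center: float,
    background_deviation: float,
    t: Thresholds = Thresholds(),
) -> List[str]:
    """Return the names of the filters fired for one candidate segment."""
    center = median(bin_ratios)
    fired: List[str] = []
    # (1) Central tendency filter: fires when the median ratio is too low.
    if center < t.min_center:
        fired.append("central_tendency")
    # (2) Confidence filter: fires when the segment is too noisy.
    if segment_dispersion > t.max_dispersion:
        fired.append("confidence")
    # (3) Central tendency-plus-deviation filter (assumed form of threshold).
    if center < background_center + t.deviations * background_deviation:
        fired.append("central_tendency_plus_deviation")
    return fired


def validate_focal_cnv(
    bin_ratios: Sequence[float],
    segment_dispersion: float,
    background_center: float,
    background_deviation: float,
    t: Thresholds = Thresholds(),
) -> bool:
    """A candidate is validated only when no filter fires."""
    return not fired_filters(
        bin_ratios, segment_dispersion, background_center, background_deviation, t
    )
```

In this sketch a candidate is rejected as soon as any one filter fires, which mirrors the requirement that all filters in the plurality be satisfied before the focal copy number variation is validated.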
[0842] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in somatic
variant identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in focal
copy number identification and parallel analysis of nucleic acids from a solid
cancerous
tissue sample of the subject, also integrates an improvement in circulating
tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction."
[0843] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in somatic
variant identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and
also integrates an improvement in circulating tumor fraction determination,
e.g., as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction." In some embodiments, the
bioinformatics
pipeline integrating at least an improvement in focal copy number
identification and parallel
analysis of nucleic acids from a solid cancerous tissue sample of the subject,
also integrates
an improvement in circulating tumor fraction determination, e.g., as described
above in the
section entitled "Systems and Methods for Improved Circulating Tumor Fraction
Estimates"
and/or "Circulating Tumor Fraction," and is further improved by parallel
analysis of nucleic
acids from a non-cancerous tissue sample of the subject, e.g., as described
above in the
section entitled "Concurrent Testing."
[0844] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
solid cancerous tissue sample of the subject, also integrates an improvement
in somatic
variant identification, e.g., as described above in the section entitled
"Systems and Methods
for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," also
integrates an improvement in circulating tumor fraction determination, e.g.,
as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction," and is further improved by
parallel analysis
of nucleic acids from a non-cancerous tissue sample of the subject, e.g., as
described above in
the section entitled "Concurrent Testing."
[0845] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in focal
copy number
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0846] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) determining whether a candidate
focal copy
number variation for a respective genomic segment, identified from the first
plurality of
aligned sequence reads, can be validated as a somatic focal copy number
variation by (a)
determining bin-level sequence ratios, segment-level sequence ratios, and
segment-level
measures of dispersion from a comparison of (i) the sequence reads in the
first plurality of
sequence reads that map to respective genomic bins or genomic segments to (ii)
sequence
reads from one or more reference samples that map to the same respective
genomic bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Validation of Copy Number Variation" and/or "Copy Number Variation,"
and (b)
determining whether determined bin-level sequence ratios, segment-level
sequence ratios,
and segment-level measures of dispersion corresponding to the respective
genomic segment
satisfy a plurality of filters that include (1) a measure of central tendency
bin-level sequence
ratio filter that is fired when a measure of central tendency of the plurality
of bin-level
sequence ratios corresponding to the subset of bins encompassed by the
respective segment
fails to satisfy one or more bin-level sequence ratio thresholds, (2) a
confidence filter that is
fired when the segment-level measure of dispersion corresponding to the
respective segment
fails to satisfy a confidence threshold, and (3) a measure of central tendency-
plus-deviation
bin-level sequence ratio filter that is fired when a measure of central
tendency of the plurality
of bin-level sequence ratios corresponding to the subset of bins encompassed
by the
respective segment fails to satisfy one or more measure of central tendency-
plus-deviation
bin-level sequence ratio thresholds, e.g., as described above in the section
titled "Systems and
Methods for Improved Validation of Copy Number Variation" and/or "Copy Number
Variation," such that when the determined bin-level sequence ratios, segment-level sequence
ratios, and segment-level measures of dispersion satisfy all of the filters in
the plurality of
filters, the focal copy number variation is validated, and when the determined
bin-level
sequence ratios, segment-level sequence ratios, and segment-level measures of
dispersion do
not satisfy all of the filters in the plurality of filters, the focal copy
number variation is
rejected, and (iii) obtaining, from a second sequencing reaction of nucleic
acid fragments in a
non-cancerous tissue sample from the subject, a second plurality of sequence
reads aligned to
a reference sequence for the species of the subject, and analyzing the nucleic
acids from the
non-cancerous tissue sample using a parallel analysis including, at least,
determining whether
a candidate focal copy number variation for a respective genomic segment,
identified from
the second plurality of aligned sequence reads, can be validated as a somatic
focal copy
number variation.
[0847] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
somatic variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in focal
copy number identification and parallel analysis of nucleic acids from a non-
cancerous tissue
sample of the subject, also integrates an improvement in circulating tumor
fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction." In
some embodiments, the bioinformatics pipeline integrating at least an
improvement in focal
copy number identification and parallel analysis of nucleic acids from a non-
cancerous tissue
sample of the subject, is further improved by parallel analysis of nucleic
acids from a solid
cancerous tissue sample of the subject, e.g., as described above in the
section entitled
"Concurrent Testing."
[0848] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
somatic variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and also
integrates an improvement in circulating tumor fraction determination, e.g.,
as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction." In some embodiments, the
bioinformatics
pipeline integrating at least an improvement in focal copy number
identification and parallel
analysis of nucleic acids from a non-cancerous tissue sample of the subject,
also integrates an
improvement in somatic variant identification, e.g., as described above in the
section entitled
"Systems and Methods for Improved Validation of Somatic Sequence Variants"
and/or
"Variant Identification," and is further improved by parallel analysis of
nucleic acids from a
solid cancerous tissue sample of the subject, e.g., as described above in the
section entitled
"Concurrent Testing." In some embodiments, the bioinformatics pipeline
integrating at least
an improvement in focal copy number identification and parallel analysis of
nucleic acids
from a non-cancerous tissue sample of the subject, also integrates an
improvement in
circulating tumor fraction determination, e.g., as described above in the
section entitled
"Systems and Methods for Improved Circulating Tumor Fraction Estimates" and/or
"Circulating Tumor Fraction," and is further improved by parallel analysis of
nucleic acids
from a solid cancerous tissue sample of the subject, e.g., as described above
in the section
entitled "Concurrent Testing."
[0849] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in focal copy number identification and parallel analysis of
nucleic acids from a
non-cancerous tissue sample of the subject, also integrates an improvement in
somatic variant
identification, e.g., as described above in the section entitled "Systems and
Methods for
Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," also
integrates an improvement in circulating tumor fraction determination, e.g.,
as described
above in the section entitled "Systems and Methods for Improved Circulating
Tumor Fraction
Estimates" and/or "Circulating Tumor Fraction," and is further improved by
parallel analysis
of nucleic acids from a solid cancerous tissue sample of the subject, e.g., as
described above
in the section entitled "Concurrent Testing."
[0850] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in
circulating tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction," and
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0851] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) estimating a circulating tumor
fraction for the
subject by (a) determining bin-level coverage ratios and segment-level
coverage ratios from a
comparison of (i) the number of sequence reads in the first plurality of
sequence reads that
map to respective genomic bins or genomic segments and (ii) the number of
sequence reads
from one or more reference samples that map to the same respective genomic
bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction," (b)
identifying integer copy states that best match segment-level coverage ratios
by fitting
segments to integer copy states for a plurality of simulated circulating tumor
fractions, such
that the circulating tumor fraction for the subject is determined from an
optimization of error
between corresponding segment-level coverage ratios and integer copy states
across the
plurality of simulated circulating tumor fractions, e.g., as described above
in the section titled
"Systems and Methods for Improved Circulating Tumor Fraction Estimates" and/or
"Circulating Tumor Fraction," and (iii) obtaining, from a second sequencing
reaction of
nucleic acid fragments in a solid tumor biopsy sample from the subject, a
second plurality of
sequence reads aligned to a reference sequence for the species of the subject,
and analyzing
the nucleic acids from the solid tumor biopsy sample using a parallel analysis
including at
least estimating a tumor fraction for the subject from the second plurality of
aligned sequence
reads.
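The circulating tumor fraction estimate described in the preceding paragraph, in which segments are fit to integer copy states across a plurality of simulated tumor fractions and the fraction with the best fit is selected, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed method: it assumes linear (not log-transformed) segment-level coverage ratios against a diploid reference, a simple two-component mixture model, a weighted squared error, and a grid of simulated fractions from 0.01 to 0.99. The function names, the `max_copies` limit, and the weighting scheme are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def expected_ratio(copies: int, tumor_fraction: float) -> float:
    """Expected coverage ratio against a diploid reference (assumed mixture model)."""
    return (tumor_fraction * copies + (1.0 - tumor_fraction) * 2.0) / 2.0


def fit_segments(
    ratios: Sequence[float],
    weights: Sequence[float],
    tumor_fraction: float,
    max_copies: int = 8,
) -> Tuple[float, List[int]]:
    """Assign each segment its best integer copy state at one simulated fraction."""
    states: List[int] = []
    error = 0.0
    for ratio, weight in zip(ratios, weights):
        best = min(
            range(max_copies + 1),
            key=lambda c: abs(ratio - expected_ratio(c, tumor_fraction)),
        )
        states.append(best)
        error += weight * (ratio - expected_ratio(best, tumor_fraction)) ** 2
    return error, states


def estimate_tumor_fraction(
    ratios: Sequence[float],
    weights: Sequence[float],
    grid: Optional[Sequence[float]] = None,
) -> Tuple[float, List[int]]:
    """Return the simulated fraction with the lowest total error and its copy states."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    best_fraction, best_error, best_states = grid[0], float("inf"), []
    for fraction in grid:
        error, states = fit_segments(ratios, weights, fraction)
        if error < best_error:
            best_fraction, best_error, best_states = fraction, error, states
    return best_fraction, best_states
```

For example, `estimate_tumor_fraction([0.6, 0.8, 1.0, 1.2], [1, 1, 1, 1])` selects a fraction of 0.4 with copy states 0, 1, 2, and 3. Note that with this simple model different combinations of tumor fraction and copy state can fit some inputs equally well, so a practical pipeline would need additional constraints or tie-breaking that this sketch does not attempt.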
[0852] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a solid cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification." In some embodiments, the bioinformatics pipeline integrating
at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a solid cancerous tissue sample of the subject, also integrates an
improvement in
focal copy number identification, e.g., as described above in the section
entitled "Systems
and Methods for Improved Validation of Copy Number Variation" and/or "Copy
Number
Variation." In some embodiments, the bioinformatics pipeline integrating at
least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a solid cancerous tissue sample of the subject, is further improved
by parallel
analysis of nucleic acids from a non-cancerous tissue sample of the subject,
e.g., as described
above in the section entitled "Concurrent Testing."
[0853] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a solid cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and also integrates an improvement in focal copy number
identification, e.g.,
as described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation." In some embodiments,
the
bioinformatics pipeline integrating at least an improvement in circulating
tumor fraction
determination and parallel analysis of nucleic acids from a solid cancerous
tissue sample of
the subject, also integrates an improvement in somatic variant identification,
e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Somatic Sequence Variants" and/or "Variant Identification," and is further
improved by
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing." In some
embodiments, the
bioinformatics pipeline integrating at least an improvement in circulating
tumor fraction
determination and parallel analysis of nucleic acids from a solid cancerous
tissue sample of
the subject, also integrates an improvement in focal copy number
identification, e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation," and is further improved
by
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0854] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a solid cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," also integrates an improvement in focal copy number
identification, e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation," and is further improved
by
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0855] In some embodiments, a bioinformatics pipeline for
analyzing nucleic acids in a
liquid biopsy is provided that integrates at least an improvement in
circulating tumor fraction
determination, e.g., as described above in the section entitled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction," and
parallel analysis of nucleic acids from a non-cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0856] Accordingly, in some embodiments, a method is provided for
analyzing a liquid
biopsy sample from a subject with cancer that includes (i) obtaining, from a
first sequencing
reaction of cell-free DNA fragments, a first plurality of sequence reads
aligned to a reference
sequence for the species of the subject, (ii) estimating a circulating tumor
fraction for the
subject by (a) determining bin-level coverage ratios and segment-level
coverage ratios from a
comparison of (i) the number of sequence reads in the first plurality of
sequence reads that
map to respective genomic bins or genomic segments and (ii) the number of
sequence reads
from one or more reference samples that map to the same respective genomic
bins or
genomic segments, e.g., as described above in the section titled "Systems and
Methods for
Improved Circulating Tumor Fraction Estimates" and/or "Circulating Tumor
Fraction," (b)
identifying integer copy states that best match segment-level coverage ratios
by fitting
segments to integer copy states for a plurality of simulated circulating tumor
fractions, and
(c) estimating the circulating tumor fraction for the test subject based on a
measure of fit
between corresponding segment-level coverage ratios and integer copy states
across the
plurality of simulated circulating tumor fractions, e.g., as described above in
the section titled
"Systems and Methods for Improved Circulating Tumor Fraction Estimates" and/or
"Circulating Tumor Fraction," and (iii) obtaining, from a second sequencing
reaction of
nucleic acid fragments in a non-cancerous tissue sample from the subject, a
second plurality
of sequence reads aligned to a reference sequence for the species of the
subject, and
analyzing the nucleic acids from the non-cancerous tissue sample using a
parallel analysis
including at least estimating a tumor fraction for the subject from the second
plurality of
aligned sequence reads.
[0857] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification." In some embodiments, the bioinformatics pipeline integrating
at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, also integrates an
improvement in
focal copy number identification, e.g., as described above in the section
entitled "Systems
and Methods for Improved Validation of Copy Number Variation" and/or "Copy
Number
Variation." In some embodiments, the bioinformatics pipeline integrating at
least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, is further improved
by parallel
analysis of nucleic acids from a solid cancerous tissue sample of the subject,
e.g., as
described above in the section entitled "Concurrent Testing."
[0858] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," and also integrates an improvement in focal copy number
identification, e.g.,
as described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation." In some embodiments,
the
bioinformatics pipeline integrating at least an improvement in circulating
tumor fraction
determination and parallel analysis of nucleic acids from a non-cancerous
tissue sample of
the subject, also integrates an improvement in somatic variant identification,
e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Somatic Sequence Variants" and/or "Variant Identification," and is further
improved by
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing." In some
embodiments, the
bioinformatics pipeline integrating at least an improvement in circulating
tumor fraction
determination and parallel analysis of nucleic acids from a non-cancerous
tissue sample of
the subject, also integrates an improvement in focal copy number
identification, e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation," and is further improved
by
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
[0859] In some embodiments, the bioinformatics pipeline
integrating at least an
improvement in circulating tumor fraction determination and parallel analysis
of nucleic
acids from a non-cancerous tissue sample of the subject, also integrates an
improvement in
somatic variant identification, e.g., as described above in the section
entitled "Systems and
Methods for Improved Validation of Somatic Sequence Variants" and/or "Variant
Identification," also integrates an improvement in focal copy number
identification, e.g., as
described above in the section entitled "Systems and Methods for Improved
Validation of
Copy Number Variation" and/or "Copy Number Variation," and is further improved
by
parallel analysis of nucleic acids from a solid cancerous tissue sample of the
subject, e.g., as
described above in the section entitled "Concurrent Testing."
Variant Characterization
[0860] In some embodiments, a predicted functional effect and/or
clinical interpretation
for one or more identified variants is curated by using information from
variant databases. In
some embodiments, a weighted-heuristic model is used to characterize each
variant.
[0861] In some embodiments, identified clinical variants are
labeled as "potentially
actionable", "biologically relevant", "variants of unknown significance
(VUSs)", or
"benign". Potentially actionable alterations are protein-altering variants
with an associated
therapy based on evidence from the medical literature. Biologically relevant
alterations are
protein-altering variants that may have functional significance or have been
observed in the
medical literature but are not associated with a specific therapy. Variants of
unknown
significance (VUSs) are protein-altering variants exhibiting an unclear effect
on function
and/or without sufficient evidence to determine their pathogenicity. In some
embodiments,
benign variants are not reported. In some embodiments, variants are identified
through
aligning the patient's DNA sequence to the human genome reference sequence
version hg19
(GRCh37). In some embodiments, actionable and biologically relevant somatic
variants are
provided in a clinical summary during report generation.
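The four labels defined in the preceding paragraph can be expressed as a small decision rule. The following is a minimal sketch, not the weighted-heuristic model mentioned in the text: it assumes the evidence for each variant has already been reduced to a few boolean flags, and the names `VariantEvidence`, `label_variant`, `is_reported`, and `in_clinical_summary` are hypothetical. Treating any variant that is neither known benign nor supported by evidence as a VUS is also an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    POTENTIALLY_ACTIONABLE = "potentially actionable"
    BIOLOGICALLY_RELEVANT = "biologically relevant"
    VUS = "variant of unknown significance (VUS)"
    BENIGN = "benign"


@dataclass(frozen=True)
class VariantEvidence:
    """Hypothetical evidence flags curated from variant databases."""
    protein_altering: bool
    therapy_evidence: bool              # associated therapy in the literature
    functional_or_literature_evidence: bool
    known_benign: bool


def label_variant(evidence: VariantEvidence) -> Label:
    """Map curated evidence flags to one of the four labels."""
    if evidence.known_benign:
        return Label.BENIGN
    if evidence.protein_altering and evidence.therapy_evidence:
        return Label.POTENTIALLY_ACTIONABLE
    if evidence.protein_altering and evidence.functional_or_literature_evidence:
        return Label.BIOLOGICALLY_RELEVANT
    # Fallback (assumption): unclear effect or insufficient evidence.
    return Label.VUS


def is_reported(label: Label) -> bool:
    """In some embodiments, benign variants are not reported."""
    return label is not Label.BENIGN


def in_clinical_summary(label: Label) -> bool:
    """Actionable and biologically relevant variants go in the clinical summary."""
    return label in (Label.POTENTIALLY_ACTIONABLE, Label.BIOLOGICALLY_RELEVANT)
```

The order of the checks matters: a variant with therapy evidence is labeled potentially actionable even when it also has functional evidence, so the more specific label takes precedence.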
[0862] For instance, in some embodiments, variant classification
and reporting is
performed, where detected variants are investigated following criteria from
known
evolutionary models, functional data, clinical data, literature, and other
research endeavors,
including tumor organoid experiments. In some embodiments, variants are
prioritized and
classified based on known gene-disease relationships, hotspot regions within
genes, internal
and external somatic databases, primary literature, and other features of
somatic drivers.
Variants can be added to a patient (or sample, for example, organoid sample)
report based on
recommendations from the AMP/ASCO/CAP guidelines. Additional guidelines may be
followed. Briefly, pathogenic variants with therapeutic, diagnostic, or
prognostic
significance may be prioritized in the report. Non-actionable pathogenic
variants may be
included as biologically relevant, followed by variants of uncertain
significance.
Translocations may be reported based on features of known gene fusions,
relevant
breakpoints, and biological relevance. Evidence may be curated from public and
private
databases or research and presented as 1) consensus guidelines, 2) clinical
research, or 3) case
studies, with a link to the supporting literature. Germline alterations may be
reported as
secondary findings in a subset of genes for consenting patients. These may
include genes
recommended by the American College of Medical Genetics and Genomics (ACMG)
and
additional genes associated with cancer predisposition or drug resistance.
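The report ordering described in the preceding paragraph, with pathogenic variants of therapeutic, diagnostic, or prognostic significance first, then non-actionable pathogenic variants, then variants of uncertain significance, can be sketched as a stable sort. This is an illustrative sketch only; the tier names, the `REPORT_ORDER` mapping, and the dictionary-based variant records are hypothetical and are not taken from the guidelines themselves.

```python
from typing import Dict, List

# Hypothetical tier names mapped to their position in the report.
REPORT_ORDER: Dict[str, int] = {
    "pathogenic_actionable": 0,   # therapeutic, diagnostic, or prognostic significance
    "biologically_relevant": 1,   # non-actionable pathogenic variants
    "uncertain_significance": 2,  # variants of uncertain significance
}


def order_for_report(variants: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort variants by report tier; ties keep their original order."""
    return sorted(variants, key=lambda v: REPORT_ORDER[v["tier"]])
```

Because Python's `sorted` is stable, variants within the same tier stay in the order in which they were supplied, so any upstream prioritization within a tier is preserved.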
[0863] In some embodiments, a clinical report 139-3 includes
information about clinical
trials for which the patient is eligible, therapies that are specific to the
patient's cancer, and/or
possible therapeutic adverse effects associated with the specific
characteristics of the
patient's cancer, e.g., the patient's genetic variations, epigenetic
abnormalities, associated
oncogenic pathogenic infections, and/or pathology abnormalities, or other
characteristics of
the patient's sample and/or clinical records. For example, in some
embodiments, the clinical
report includes such patient information and analysis metrics, including
cancer type and/or
diagnosis, variant allele fraction, patient demographic and/or institution,
matched therapies
(e.g., FDA approved and/or investigational), matched clinical trials, variants
of unknown
significance (VUS), genes with low coverage, panel information, specimen
information,
details on reported variants, patient clinical history, status and/or
availability of previous test
results, and/or version of bioinformatics pipeline.
[0864] In some embodiments, the results included in the report,
and/or any additional
results (for example, from the bioinformatics pipeline), are used to query a
database of
clinical data, for example, to determine whether there is a trend showing that
a particular
therapy was effective or ineffective in treating (e.g., slowing or halting
cancer progression),
and/or adverse effects of such treatments in other patients having the same or
similar
characteristics.
[0865] In some embodiments, the results are used to design cell-
based studies of the
patient's biology, e.g., tumor organoid experiments. For example, an organoid
may be
genetically engineered to have the same characteristics as the specimen and
may be observed
after exposure to a therapy to determine whether the therapy can reduce the
growth rate of the
organoid, and thus may be likely to reduce the growth rate of cancer in the
patient associated
with the specimen. Similarly, in some embodiments, the results are used to
direct studies on
tumor organoids derived directly from the patient. An example of such
experimentation is
described in U.S. Provisional Patent Application No. 62/944,292, filed
December 5, 2019, the
content of which is hereby incorporated by reference, in its entirety, for all
purposes.
[0866] As illustrated in Figure 2A, in some embodiments, a
clinical report is checked for
final validation, review, and sign-off by a medical practitioner (e.g., a
pathologist). The
clinical report is then sent for action (e.g., for precision oncology
applications).
Digital and Laboratory Health Care Platform:
[0867] In some embodiments, the methods and systems described
herein are utilized in
combination with, or as part of, a digital and laboratory health care platform
that is generally
targeted to medical care and research. It should be understood that many uses
of the methods
and systems described above, in combination with such a platform, are
possible. One
example of such a platform is described in U.S. Patent Application No.
16/657,804, filed
October 18, 2019, which is hereby incorporated herein by reference in its
entirety for all
purposes.
[0868] For example, an implementation of one or more embodiments
of the methods and
systems as described above may include microservices constituting a digital
and laboratory
health care platform supporting analysis of liquid biopsy samples to provide
clinical support
for personalized cancer therapy. Embodiments may include a single microservice
for
executing and delivering analysis of liquid biopsy samples to clinical support
for personalized
cancer therapy or may include a plurality of microservices each having a
particular role,
which together implement one or more of the embodiments above. In one example,
a first
microservice may execute sequence analysis in order to deliver genomic
features to a second
microservice for curating clinical support for personalized cancer therapy
based on the
identified features. Similarly, the second microservice may execute
therapeutic analysis of
the curated clinical support to deliver recommended therapeutic modalities,
according to
various embodiments described herein.
[0869] Where embodiments above are executed in one or more micro-
services with or as
part of a digital and laboratory health care platform, one or more of such
micro-services may
be part of an order management system that orchestrates the sequence of events
as needed at
the appropriate time and in the appropriate order necessary to instantiate
embodiments above.
A microservices-based order management system is disclosed, for example, in
U.S. Prov.
Patent Application No. 62/873,693, filed July 12, 2019, which is hereby
incorporated herein
by reference in its entirety for all purposes.
[0870] For example, continuing with the above first and second
microservices, an order
management system may notify the first microservice that an order for curating
clinical
support for personalized cancer therapy has been received and is ready for
processing. The
first microservice may execute and notify the order management system once the
delivery of
genomic features for the patient is ready for the second microservice.
Furthermore, the order
management system may identify that execution parameters (prerequisites) for
the second
microservice are satisfied, including that the first microservice has
completed, and notify the
second microservice that it may continue processing the order to curate
clinical support for
personalized cancer therapy, according to various embodiments described
herein.
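By way of non-limiting illustration, the prerequisite check performed by an order management system might be sketched as follows. This is a minimal in-process sketch and not the system of the incorporated applications; the step names, the run_order function, and the contents of the order dictionary are hypothetical.

```python
def run_order(steps, order):
    """Run each step once its prerequisites are complete.

    steps maps a step name to (prerequisite names, callable).
    Returns the names of the steps in the order they were executed.
    """
    completed, log = set(), []
    progressed = True
    while progressed:
        progressed = False
        for name, (prereqs, func) in steps.items():
            if name not in completed and all(p in completed for p in prereqs):
                func(order)  # "notify" the microservice to process the order
                completed.add(name)
                log.append(name)
                progressed = True
    return log


def sequence_analysis(order):
    # First microservice: delivers genomic features (placeholder value).
    order["genomic_features"] = ["placeholder_feature"]


def curate_clinical_support(order):
    # Second microservice: consumes the features produced by the first.
    count = len(order["genomic_features"])
    order["clinical_support"] = f"curated from {count} feature(s)"


steps = {
    "curation": (("sequence_analysis",), curate_clinical_support),
    "sequence_analysis": ((), sequence_analysis),
}
order = {"id": "order-1"}
execution_log = run_order(steps, order)
```

Although the curation step is registered first, it runs only after the sequence analysis step has completed, mirroring the prerequisite check described above.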
[0871] In one example, the bioinformatics pipeline (for example,
the liquid biopsy
bioinformatics pipeline) is encoded within a docker container that receives a
direct link to
access a FASTA file (for example, stored in a cloud computing environment, AWS
s3
bucket, GCP storage unit, etc.), from which it generates BAM files, which may
be
orchestrated in part or wholly, for example, by the systems and methods
disclosed in US
Patent App. No. 16/927,976, filed July 13, 2020 and incorporated in its
entirety herein for any
and all purposes.
[0872] Where the digital and laboratory health care platform
further includes a genetic
analyzer system, the genetic analyzer system may include targeted panels
and/or sequencing
probes. An example of a targeted panel is disclosed, for example, in U.S.
Prov. Patent
Application No. 62/902,950, filed September 19, 2019, which is incorporated
herein by
reference and in its entirety for all purposes. In one example, targeted
panels may enable the
delivery of next generation sequencing results for providing clinical support
for personalized
cancer therapy according to various embodiments described herein. An example
of the
design of next-generation sequencing probes is disclosed, for example, in U.S.
Prov. Patent
Application No. 62/924,073, filed October 21, 2019, which is incorporated
herein by
reference and in its entirety for all purposes.
[0873] Where the digital and laboratory health care platform
further includes a
bioinformatics pipeline, the methods and systems described above may be
utilized after
completion or substantial completion of the systems and methods utilized in
the
bioinformatics pipeline. As one example, the bioinformatics pipeline may
receive next-
generation genetic sequencing results and return a set of binary files, such
as one or more
BAM files, reflecting nucleic acid (e.g., cfDNA, DNA and/or RNA) read counts
aligned to a
reference genome. The methods and systems described above may be utilized, for
example,
to ingest the cfDNA, DNA and/or RNA read counts and produce genomic features
as a result.
[0874] When the digital and laboratory health care platform
further includes an RNA data
normalizer, any RNA read counts may be normalized before processing
embodiments as
described above. An example of an RNA data normalizer is disclosed, for
example, in U.S.
Patent Application No. 16/581,706, filed September 24, 2019, which is
incorporated herein
by reference and in its entirety for all purposes.
[0875] When the digital and laboratory health care platform
further includes a genetic
data deconvoluter, any system and method for deconvoluting may be utilized for
analyzing
genetic data associated with a specimen having two or more biological
components to
determine the contribution of each component to the genetic data and/or
determine what
genetic data would be associated with any component of the specimen if it were
purified. An
example of a genetic data deconvoluter is disclosed, for example, in U.S.
Patent Application
No. 16/732,229 and PCT/US19/69161, filed December 31, 2019, U.S. Prov. Patent
Application No. 62/924,054, filed October 21, 2019, and U.S. Prov. Patent
Application No.
62/944,995, filed December 6, 2019, each of which is hereby incorporated
herein by
reference and in its entirety for all purposes.
[0876] When the digital and laboratory health care platform
further includes an
automated RNA expression caller, RNA expression levels may be adjusted to be
expressed as
a value relative to a reference expression level, which is often done in order
to prepare
multiple RNA expression data sets for analysis to avoid artifacts caused when
the data sets
have differences because they have not been generated by using the same
methods,
equipment, and/or reagents. An example of an automated RNA expression caller
is disclosed,
for example, in U.S. Prov. Patent Application No. 62/943,712, filed December
4, 2019, which
is incorporated herein by reference and in its entirety for all purposes.
[0877] The digital and laboratory health care platform may
further include one or more
insight engines to deliver information, characteristics, or determinations
related to a disease
state that may be based on genetic and/or clinical data associated with a
patient and/or
specimen. Exemplary insight engines may include a tumor of unknown origin
engine, a
human leukocyte antigen (HLA) loss of heterozygosity (LOH) engine, a tumor
mutational
burden engine, a PD-L1 status engine, a homologous recombination deficiency
engine, a
cellular pathway activation report engine, an immune infiltration engine, a
microsatellite
instability engine, a pathogen infection status engine, and so forth. An
example tumor of
unknown origin engine is disclosed, for example, in U.S. Prov. Patent
Application No.
62/855,750, filed May 31, 2019, which is incorporated herein by reference and
in its entirety
for all purposes. An example of an HLA LOH engine is disclosed, for example,
in U.S. Prov.
Patent Application No. 62/889,510, filed August 20, 2019, which is
incorporated herein by
reference and in its entirety for all purposes. An example of a tumor
mutational burden
(TMB) engine is disclosed, for example, in U.S. Prov. Patent Application No.
62/804,458,
filed February 12, 2019, which is incorporated herein by reference and in its
entirety for all
purposes. An example of a PD-L1 status engine is disclosed, for example, in
U.S. Prov.
Patent Application No. 62/854,400, filed May 30, 2019, which is incorporated
herein by
reference and in its entirety for all purposes. An additional example of a
PD-L1 status engine
is disclosed, for example, in U.S. Prov. Patent Application No. 62/824,039,
filed March 26,
2019, which is incorporated herein by reference and in its entirety for all
purposes. An
example of a homologous recombination deficiency engine is disclosed, for
example, in U.S.
Prov. Patent Application No. 62/804,730, filed February 12, 2019, which is
incorporated
herein by reference and in its entirety for all purposes. An example of a
cellular pathway
activation report engine is disclosed, for example, in U.S. Prov. Patent
Application No.
62/888,163, filed August 16, 2019, which is incorporated herein by reference
and in its
entirety for all purposes. An example of an immune infiltration engine is
disclosed, for
example, in U.S. Patent Application No. 16/533,676, filed August 6, 2019,
which is
incorporated herein by reference and in its entirety for all purposes. An
additional example
of an immune infiltration engine is disclosed, for example, in U.S. Patent
Application No.
62/804,509, filed February 12, 2019, which is incorporated herein by reference
and in its
entirety for all purposes. An example of an MSI engine is disclosed, for
example, in U.S.
Patent Application No. 16/653,868, filed October 15, 2019, which is
incorporated herein by
reference and in its entirety for all purposes. An additional example of an
MSI engine is
disclosed, for example, in U.S. Prov. Patent Application No. 62/931,600, filed
November 6,
2019, which is incorporated herein by reference and in its entirety for all
purposes.
[0878] When the digital and laboratory health care platform
further includes a report
generation engine, the methods and systems described above may be utilized to
create a
summary report of a patient's genetic profile and the results of one or more
insight engines
for presentation to a physician. For instance, the report may provide to the
physician
information about the extent to which the specimen that was sequenced
contained tumor or
normal tissue from a first organ, a second organ, a third organ, and so forth.
For example, the
report may provide a genetic profile for each of the tissue types, tumors, or
organs in the
specimen. The genetic profile may represent genetic sequences present in the
tissue type,
tumor, or organ and may include variants, expression levels, information about
gene
products, or other information that could be derived from genetic analysis of
a tissue, tumor,
or organ. The report may include therapies and/or clinical trials matched
based on a portion
or all of the genetic profile or insight engine findings and summaries. For
example, the
therapies may be matched according to the systems and methods disclosed in
U.S. Prov.
Patent Application No. 63/130,504, filed December 24, 2020, which is
incorporated herein by
reference and in its entirety for all purposes. For example, the clinical
trials may be matched
according to the systems and methods disclosed in U.S. Patent Application No.
16/889,779,
filed June 1, 2020, which is incorporated herein by reference and in its
entirety for all
purposes.
[0879] The report may include a comparison of the results to a
database of results from
many specimens. In some embodiments, a patient's clinical data and/or
molecular data,
including molecular data generated through the use of the systems and methods
disclosed
herein, (for example, a variant call which may be generated by performing a
liquid biopsy,
including determination of circulating tumor fraction, dynamic variant
thresholding and/or
CNV) may be compared to information in a knowledge database that includes
clinical and/or
molecular data patterns and prescribed therapies, therapy response data,
survival data, and/or
prognosis data associated with one or more of those patterns. Some of the
associations may
be based on recommendations from regulatory bodies (for example, FDA, NCCN,
etc.),
scientific publications, analysis of large databases of molecular and/or
clinical data, etc. For
example, this comparison may be done for the purpose of determining a likely
prognosis for
the patient, and/or matching therapies and/or clinical trials to which the
patient may be likely
to respond, any of which may be included in the report. An example of methods
and systems
for comparing results to a database of results is disclosed in U.S. Patent
Application No.
16/732,168, filed December 31, 2019, which is incorporated herein by reference
and in its
entirety for all purposes. The information may be used, sometimes in
conjunction with
similar information from additional specimens and/or clinical response
information, to
discover biomarkers or design a clinical trial.
[0880] In some embodiments, if the clinical history includes
information indicating that
the patient had previously been prescribed one or more therapies and did not
respond (for
example, their disease progressed during and/or after receiving the therapy),
the report may
include a note that the patient failed the line(s) of therapy. In this case,
the report may
include and/or emphasize another therapy or therapies that is/are not included
in the patient's
clinical data. In one example, the report may indicate that these other
therapies may be used
as a second, third, or later line of therapy.
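By way of non-limiting illustration, the selection of later-line therapies described above might be sketched as follows. The record layout (a list of entries with "therapy" and "progressed" keys), the function name, and the therapy identifiers are hypothetical.

```python
def next_line_candidates(matched_therapies, clinical_history):
    """Return (notes on failed lines, matched therapies not yet in the history)."""
    # Therapies during or after which the disease progressed.
    failed = {e["therapy"] for e in clinical_history if e.get("progressed")}
    # Every therapy already present in the patient's clinical data.
    prior = {e["therapy"] for e in clinical_history}
    notes = [f"Patient failed prior line of therapy: {t}" for t in sorted(failed)]
    candidates = [t for t in matched_therapies if t not in prior]
    return notes, candidates


history = [
    {"therapy": "therapy_A", "progressed": True},
    {"therapy": "therapy_B", "progressed": False},
]
notes, candidates = next_line_candidates(["therapy_A", "therapy_C"], history)
```

Here therapy_A produces a note because the disease progressed, and only therapy_C, which is absent from the clinical history, is emphasized as a later-line option.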
[0881] In some embodiments, the systems and methods disclosed
herein include the
administration of one or more therapies to the patient, which may include a
therapy listed on
the report.
[0882] When the digital and laboratory health care platform
further includes application
of one or more of the embodiments herein to organoids developed in connection
with the
platform, the methods and systems may be used to further evaluate genetic
sequencing data
derived from an organoid to provide information about the extent to which the
organoid that
was sequenced contained a first cell type, a second cell type, a third cell
type, and so forth.
For example, the report may provide a genetic profile for each of the cell
types in the
specimen. The genetic profile may represent genetic sequences present in a
given cell type
and may include variants, expression levels, information about gene products,
or other
information that could be derived from genetic analysis of a cell. The report
may include
therapies matched based on a portion or all of the deconvoluted information.
These therapies
may be tested on the organoid, derivatives of that organoid, and/or similar
organoids to
determine an organoid's sensitivity to those therapies. For example, organoids
may be
cultured and tested according to the systems and methods disclosed in U.S.
Patent
Application No. 16/693,117, filed November 22, 2019; U.S. Prov. Patent
Application No.
62/924,621, filed October 22, 2019; and U.S. Prov. Patent Application No.
62/944,292, filed
December 5, 2019, each of which is incorporated herein by reference and in its
entirety for all
purposes.
[0883] When the digital and laboratory health care platform
further includes application
of one or more of the above in combination with or as part of a medical device
or a laboratory
developed test that is generally targeted to medical care and research, such
laboratory
developed test or medical device results may be enhanced and personalized
through the use
of artificial intelligence. An example of laboratory developed tests,
especially those that may
be enhanced by artificial intelligence, is disclosed, for example, in U.S.
Provisional Patent
Application No. 62/924,515, filed October 22, 2019, which is incorporated
herein by
reference and in its entirety for all purposes.
[0884] It should be understood that the examples given above are
illustrative and do not
limit the uses of the systems and methods described herein in combination with
a digital and
laboratory health care platform.
[0885] The results of the bioinformatics pipeline may be provided
for report generation
208. Report generation may comprise variant science analysis, including the
interpretation of
variants (including somatic and germline variants as applicable) for
pathogenic and biological
significance. The variant science analysis may also estimate microsatellite
instability (MSI)
or tumor mutational burden. Targeted treatments may be identified based on
gene, variant,
and cancer type, for further consideration and review by the ordering
physician. In some
aspects, clinical trials may be identified for which the patient may be
eligible, based on
mutations, cancer type, and/or clinical history. Subsequent validation may
occur, after which
the report may be finalized for sign-out and delivery. In some embodiments, a
first or second
report may include additional data provided through a clinical dataflow 202,
such as patient
progress notes, pathology reports, imaging reports, and other relevant
documents. Such
clinical data is ingested, reviewed, and abstracted based on a predefined set
of curation rules.
The clinical data is then populated into the patient's clinical history
timeline for report
generation.
[0886] Further details on clinical report generation are
disclosed in US Patent Application
No. 16/789,363 (PCT/US20/180002), filed February 12, 2020, which is hereby
incorporated
herein by reference in its entirety.
[0887] Any of the embodiments herein may be combined with an
imaging validation
process to identify the accuracy of the assay results, prediction results from
a slide image, or
to compare the results between the assay and the slide image. In one
embodiment, a slide
having a solid or liquid specimen thereon may be converted into a digital
image. The slide
may be stained beforehand or unstained. The digital image may be processed by
one or more
artificial intelligence engines trained to identify one or more biomarkers,
molecular features
including, for example, DNA and RNA or methylation, or imaging features.
Examples of
specimen types, artificial intelligence engines, training methods, biomarkers,
molecular
features, and imaging features are disclosed in U.S. Patent Applications
16/830,186 and
17/139,765, respectively filed March 25, 2020 and December 31, 2020 which are
both
incorporated by reference for all purposes.
[0888] Once a prediction is obtained from the one or more
artificial intelligence engines,
the prediction may be compared against the sequencing results to either
validate the accuracy
of the sequencing result or validate the prediction results. In one
embodiment, the specimen
may not be processed and/or sent to sequencing unless first identified as
likely to occur (or
above a likelihood threshold) by the artificial intelligence engine
prediction.
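By way of non-limiting illustration, the comparison of image-based predictions against sequencing results, and the gating of a specimen on a likelihood threshold, might be sketched as follows. The 0.5 threshold, the function names, and the biomarker identifiers are assumptions made for the example only.

```python
def should_sequence(predicted_likelihood, threshold=0.5):
    """Gate sequencing on the engine's predicted likelihood (threshold assumed)."""
    return predicted_likelihood >= threshold


def concordance(predictions, sequencing_calls, threshold=0.5):
    """Fraction of shared biomarkers where the thresholded prediction
    agrees with the sequencing call; None if nothing is shared."""
    shared = predictions.keys() & sequencing_calls.keys()
    if not shared:
        return None
    agree = sum(
        (predictions[name] >= threshold) == sequencing_calls[name]
        for name in shared
    )
    return agree / len(shared)


predictions = {"biomarker_1": 0.9, "biomarker_2": 0.2, "biomarker_3": 0.7}
sequencing_calls = {"biomarker_1": True, "biomarker_2": False, "biomarker_3": False}
agreement = concordance(predictions, sequencing_calls)
```

In this example two of the three biomarkers agree; whether a disagreement counts against the assay or against the prediction depends on which of the two is treated as the reference.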
[0889] Any of the results from embodiments herein may be combined
with a cohort
analysis engine or cohort analytics engine to identify relationships between
one or more
specimens contained within a cohort. A cohort may represent other specimens of
similar
characteristics to the current specimen or other specimens of different
characteristics to the
current specimen. Analysis may include survival curves to identify therapies
which may
improve the treatment of the patient from which the specimen was obtained.
Analysis may
also include identification of the origin of the specimen, for example, when
the specimen is a
metastasis of a tumor having no known origin at the time of biopsy. Examples
of cohort
analysis engines, cohort analytics engines, cohort identification, cohort
selection,
characteristics, and analysis algorithms including survival curves and origin
identification are
disclosed in U.S. Patent Applications 16/732,168 and 15/930,234, respectively
filed
December 31, 2019 and May 12, 2020 which are both incorporated by reference
for all
purposes.
[0890] Molecular data, clinical data, genomic data, and other
characteristics associated
with either the specimen or the patient from which the specimen is obtained
are disclosed
herein and may be identified, for example, from an electronic medical health
record or a
system comprising electronic records associated with the patient.
[0891] Additional embodiments directed to retrieving patient data
from a patient data
store
[0892] In some embodiments, an artificial intelligence system
retrieves features
associated with a patient from a patient data store. In some embodiments, a
patient data store
includes one or more feature modules comprising a collection of features
available for every
patient in the system. In some embodiments, these features are used to
generate predictions
of the origin of a patient's tumor. While feature scope across all patients is
informationally
dense, an individual patient's feature set, in some embodiments, is sparsely
populated across
the entirely of the collective feature scope of all features across all
patients. For example, the
feature scope across all patients may expand into the tens of thousands of
features while a
patient's unique feature set may only include a subset of hundreds or
thousands of the
collective feature scope based upon the records available for that patient.
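By way of non-limiting illustration, a sparsely populated patient feature set might be held as a mapping and expanded against the collective feature scope only when a model requires a fixed-length input. The feature names, the scope size of 10,000, and the function names are hypothetical.

```python
def to_dense(patient_features, feature_scope, missing=None):
    """Expand a sparse {name: value} mapping to the full feature scope."""
    return [patient_features.get(name, missing) for name in feature_scope]


def fill_rate(patient_features, feature_scope):
    """Fraction of the collective feature scope populated for this patient."""
    scope = set(feature_scope)
    present = sum(1 for name in patient_features if name in scope)
    return present / len(feature_scope)


# Hypothetical scope of 10,000 features; the patient has values for two.
feature_scope = [f"feature_{i}" for i in range(10_000)]
patient = {"feature_3": 1.0, "feature_42": "positive"}

dense = to_dense(patient, feature_scope)
rate = fill_rate(patient, feature_scope)
```

Storing only the populated features keeps each patient record small, while the dense view makes missing values explicit for downstream models.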
[0893] In some embodiments, feature collections may include a
diverse set of fields
available within patient health records. Clinical information, such as
information from health
records, in some embodiments, is based upon fields which have been entered
into an
electronic medical record (EMR) or an electronic health record (EHR) by a
physician, nurse,
or other medical professional or representative. Other clinical information,
in some
embodiments, is curated from other sources, such as molecular fields from
genetic
sequencing reports. In some embodiments, sequencing may include next-
generation
sequencing (NGS) and comprises long-read, short-read, paired-end, or other
forms of
sequencing a patient's somatic and/or normal genome. In some embodiments, a
comprehensive collection of features in additional feature modules combines a
variety of
features together across varying fields of medicine which may include
diagnoses, responses
to treatment regimens, genetic profiles, clinical and phenotypic
characteristics, and/or other
medical, geographic, demographic, clinical, molecular, or genetic features.
For example, a
subset of features may comprise molecular data features, such as features
derived from an
RNA feature module or a DNA feature module, including sequencing results of a
patient's
germline or somatic specimen(s).
[0894] In some embodiments, another subset of features, imaging
features from an
imaging feature module, comprises features identified through review of a
specimen, for
example, through pathologist review, such as a review of stained H&E or IHC
slides. As
another example, a subset of features may comprise derivative features
obtained from the
analysis of the individual and combined results of such feature sets. Features
derived from
DNA and RNA sequencing may include genetic variants from a variant science
module
which are present in the sequenced tissue. Further analysis of the genetic
variants may
include additional steps such as identifying single or multiple nucleotide
polymorphisms,
identifying whether a variation is an insertion or deletion event, identifying
loss or gain of
function, identifying fusions, identifying splicing, calculating copy number
variation (CNV),
calculating microsatellite instability, calculating tumor mutational burden
(TMB), or other
structural variations within the DNA and RNA. Analysis of slides for H&E
staining or IHC
staining may reveal features such as tumor infiltration, programmed death-
ligand 1 (PD-L1)
status, human leukocyte antigen (HLA) status, or other immunological features.
[0895] In some embodiments, features derived from structured,
curated, or electronic
medical or health records may include clinical features such as diagnosis,
symptoms,
therapies, outcomes, patient demographics such as patient name, date of birth,
gender,
ethnicity, date of death, address, smoking status, diagnosis dates for cancer,
illness, disease,
diabetes, depression, other physical or mental maladies, personal medical
history, family
medical history, clinical diagnoses such as date of initial diagnosis, date of
metastatic
diagnosis, cancer staging, tumor characterization, tissue of origin,
treatments and outcomes
such as line of therapy, therapy groups, clinical trials, medications
prescribed or taken,
surgeries, radiotherapy, imaging, adverse effects, associated outcomes,
genetic testing and
laboratory information such as performance scores, lab tests, pathology
results, prognostic
indicators, date of genetic testing, testing provider used, testing method
used, such as genetic
sequencing method or gene panel, gene results, such as included genes,
variants, expression
levels/statuses, or corresponding dates to any of the above. Clinical features
may also include
imaging features.
[0896] In some embodiments, an Omics feature module comprises
features derived from
information from additional medical- or research-based Omics fields including
proteomics,
transcriptomics, epigenomics, metabolomics, microbiomics, and
other multi-omic fields. In
some embodiments, features derived from an organoid modeling lab include the
DNA and
RNA sequencing information germane to each organoid and results from
treatments applied
to those organoids. In some embodiments, features derived from imaging data
further
include reports associated with a stained slide, size of tumor, tumor size
differentials over
time including treatments during the period of change, as well as machine
learning
approaches for classifying PD-L1 status, HLA status, or other characteristics
from imaging
data. In some embodiments, other features include the additional derivative
feature sets
from other machine learning approaches based at least in part on combinations
of any new
features and/or those listed above. For example, imaging results may need to
be combined
with MSI calculations derived from RNA expressions to determine additional
further imaging
features. In some embodiments a machine learning model may generate a
likelihood that a
patient's cancer will metastasize to a particular organ or a patient's future
probability of
metastasis to yet another organ in the body. In some embodiments, other
features that are
extracted from medical information are also used. There are many thousands of
features, and
the above listing of types of features is merely representative and should
not be construed as
a complete or limiting listing of features.
[0897] In some embodiments, an alterations module comprises one
or more
microservices, servers, scripts, or other executable algorithms which generate
alteration
features associated with de-identified patient features from the feature
collection. In some
embodiments, alterations modules retrieve inputs from the feature collection
and may provide
alterations for storage. Exemplary alterations modules may include one or more
of the
following alterations as a collection of alteration modules.
[0898] In some embodiments, an IHC (Immunohistochemistry) module
identifies
antigens (proteins) in cells of a tissue section by exploiting the principle
of antibodies binding
specifically to antigens in biological tissues. IHC staining is widely used in
the diagnosis of
abnormal cells such as those found in cancerous tumors. Specific molecular
markers are
characteristic of particular cellular events such as proliferation or cell
death (apoptosis). IHC
is also widely used in basic research to understand the distribution and
localization of
biomarkers and differentially expressed proteins in different parts of a
biological tissue.
Visualizing an antibody-antigen interaction can be accomplished in a number of
ways. In the
most common instance, an antibody is conjugated to an enzyme, such as
peroxidase, that can
catalyze a color-producing reaction in immunoperoxidase staining.
Alternatively, the
antibody can also be tagged to a fluorophore, such as fluorescein or rhodamine
in
immunofluorescence. In some embodiments, approximations from RNA expression
data,
H&E slide imaging data, or other data are generated.
[0899] In some embodiments, a Therapies module identifies
differences in cancer cells
(or other cells near them) that help them grow and thrive and drugs that
"target" these
differences. Treatment with these drugs is called targeted therapy. For
example, many
targeted drugs are lethal to the cancer cells with inner "programming" that
makes them
different from normal, healthy cells, while not affecting most healthy cells.
Targeted drugs
may block or turn off chemical signals that tell the cancer cell to grow and
divide rapidly;
change proteins within the cancer cells so the cancer cells die; stop making
new blood vessels
to feed the cancer cells; trigger a patient's immune system to kill the cancer
cells; or carry
toxins to the cancer cells to kill them, without affecting normal cells. Some
targeted drugs
are more "targeted" than others. Some might target only a single change in
cancer cells,
while others can affect several different changes. Others boost the way a
patient's body
fights the cancer cells. This can affect where these drugs work and what side
effects they
cause. In some embodiments, matching targeted therapies may include
identifying the
therapy targets in the patients and satisfying any other inclusion or
exclusion criteria that
might identify a patient for whom a therapy is likely to be effective.
[0900] In some embodiments, a Trial module identifies and tests
hypotheses for treating
cancers having specific characteristics by matching features of a patient to
clinical trials.
These trials have inclusion and exclusion criteria that must be matched to
enroll a patient and
which may be ingested and structured from publications, trial reports, or
other
documentation.
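As an illustrative sketch only, the matching of patient features to structured inclusion and exclusion criteria might be expressed as follows. The field names (diagnosis, age, prior_therapies), the example criteria, and the function trial_is_applicable are hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Dict, List

# A criterion is any predicate over a dictionary of patient features.
Criterion = Callable[[Dict], bool]


def trial_is_applicable(patient: Dict,
                        inclusion: List[Criterion],
                        exclusion: List[Criterion]) -> bool:
    """Match when every inclusion criterion holds and no exclusion criterion holds."""
    meets_inclusion = all(rule(patient) for rule in inclusion)
    hits_exclusion = any(rule(patient) for rule in exclusion)
    return meets_inclusion and not hits_exclusion


# Hypothetical structured criteria for a hypothetical trial.
inclusion = [
    lambda p: p["diagnosis"] == "non-small cell lung cancer",
    lambda p: p["age"] >= 18,
]
exclusion = [lambda p: "immunotherapy" in p["prior_therapies"]]

patient = {
    "diagnosis": "non-small cell lung cancer",
    "age": 64,
    "prior_therapies": ["chemotherapy"],
}
matched = trial_is_applicable(patient, inclusion, exclusion)
```

Representing each criterion as a predicate keeps the ingestion of criteria from publications separate from the matching step itself.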
[0901] In some embodiments, an Amplifications module identifies
genes which increase
in count (for example, the number of gene products present in a specimen)
disproportionately
to other genes. Amplifications may cause a gene having the increased count to
go dormant,
become overactive, or operate in another unexpected fashion. In some
embodiments,
amplifications may be detected at a gene level, variant level, RNA transcript
or expression
level, or even a protein level. In some embodiments, detections are performed
across all the
different detection mechanisms or levels and validated against one another.
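One simple way to illustrate a count that increases disproportionately to other genes is to compare each gene's tumor-to-normal ratio against the median ratio across genes. The sketch below assumes this median-normalized fold change and a threshold of 2.0; the gene labels, counts, and threshold are hypothetical, and the disclosure does not specify this particular calculation.

```python
from statistics import median
from typing import Dict, List


def amplified_genes(tumor_counts: Dict[str, float],
                    normal_counts: Dict[str, float],
                    fold_threshold: float = 2.0) -> List[str]:
    """Return genes whose tumor/normal ratio exceeds the median ratio by fold_threshold."""
    # Skip genes with a missing or zero normal count to avoid division by zero.
    ratios = {
        gene: tumor_counts[gene] / normal_counts[gene]
        for gene in tumor_counts
        if normal_counts.get(gene)
    }
    if not ratios:
        return []
    baseline = median(ratios.values())
    if baseline <= 0:
        return []
    return sorted(g for g, r in ratios.items() if r / baseline >= fold_threshold)


# Hypothetical counts for four placeholder genes.
tumor = {"GENE_A": 100, "GENE_B": 105, "GENE_C": 400, "GENE_D": 95}
normal = {"GENE_A": 100, "GENE_B": 100, "GENE_C": 100, "GENE_D": 100}
calls = amplified_genes(tumor, normal)
```

Normalizing by the median ratio means a uniform change in sequencing depth across all genes does not by itself produce an amplification call.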
[0902] In some embodiments, an Isoforms module identifies
alternative splicing (AS), the
biological process in which more than one mRNA type (isoform) is generated
from the
transcript of a same gene through different combinations of exons and introns.
It is estimated
by large-scale genomics studies that 30-60% of mammalian genes are
alternatively spliced.
The possible patterns of alternative splicing for a gene can be very
complicated and the
complexity increases rapidly as the number of introns in a gene increases. In
silico
alternative splicing prediction may find large insertions or deletions within
a set of mRNA
sharing a large portion of aligned sequences by identifying genomic loci
through searches of
mRNA sequences against genomic sequences, extracting sequences for genomic
loci and
extending the sequences at both ends up to 20 kb, searching the genomic
sequences (repeat
sequences have been masked), extracting splicing pairs (two boundaries of
alignment gap
with GT-AG consensus or with more than two expressed sequence tags aligned at
both ends
of the gap), assembling splicing pairs according to their coordinates,
determining gene
boundaries (splicing pair predictions are generated to this point), generating
predicted gene
structures by aligning mRNA sequences to genomic templates, and comparing
splicing pair
predictions and gene structure predictions to find alternatively spliced
isoforms.
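The splicing-pair extraction step described above (an alignment gap with GT-AG consensus, or with more than two expressed sequence tags aligned at both ends of the gap) might be sketched as follows. The function names, the toy sequence, and the est_support mapping are hypothetical, and the other steps of the pipeline (locus search, repeat masking, gene boundary determination) are not modeled.

```python
from typing import Dict, List, Optional, Tuple

Gap = Tuple[int, int]  # half-open [start, end) coordinates of an alignment gap


def has_gt_ag_consensus(genomic: str, gap: Gap) -> bool:
    """True if the gap sequence begins with GT and ends with AG."""
    start, end = gap
    intron = genomic[start:end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def extract_splicing_pairs(genomic: str,
                           gaps: List[Gap],
                           est_support: Optional[Dict[Gap, int]] = None) -> List[Gap]:
    """Keep gaps with GT-AG consensus or with more than two supporting ESTs,
    then order them by coordinate for later assembly."""
    est_support = est_support or {}
    kept = [
        gap for gap in gaps
        if has_gt_ag_consensus(genomic, gap) or est_support.get(gap, 0) > 2
    ]
    return sorted(kept)


# Toy genomic sequence: exon, GT...AG intron, exon, non-consensus gap, exon.
genome = "ATGGCC" + "GTAAGTCCCTTTCAG" + "GGATCC" + "CCAAATTTGGG" + "TTAA"
gaps = [(6, 21), (27, 38)]
pairs = extract_splicing_pairs(genome, gaps, est_support={(27, 38): 1})
```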
[0903] In some embodiments, an SNP (single-nucleotide
polymorphism) module
identifies a substitution of a single nucleotide that occurs at a specific
position in the genome,
where each variation is present to some appreciable degree within a population
(e.g., greater
than 1%). For example, at a specific base position, or locus, in the human
genome, the C
nucleotide may appear in most individuals, but in a minority of individuals,
the position is
occupied by an A. This means that there is a SNP at this specific position and
the two
possible nucleotide variations, C or A, are said to be alleles for this
position. SNPs underlie
differences in human susceptibility to a wide range of diseases (e.g., sickle-cell
anemia, β-thalassemia, and cystic fibrosis result from SNPs). The severity of illness
and the way the
body responds to treatments are also manifestations of genetic variations. For
example, a
single-base mutation in the APOE (apolipoprotein E) gene is associated with a
lower risk for
Alzheimer's disease. A single-nucleotide variant (SNV) is a variation in a
single nucleotide
without any limitations of frequency and may arise in somatic cells. A somatic
single-
nucleotide variation (e.g., caused by cancer) may also be called a single-
nucleotide alteration.
In some embodiments, an MNP (multiple-nucleotide polymorphism) module
identifies the
substitution of consecutive nucleotides at a specific position in the genome.
[0904] In some embodiments, an Indels module may identify an
insertion or deletion of
bases in the genome of an organism classified among small genetic variations.
While indels
usually measure from 1 to 10,000 base pairs in length, a microindel is defined
as an indel that
results in a net change of 1 to 50 nucleotides. Indels can be contrasted with
a SNP or point
mutation. An indel inserts and/or deletes nucleotides from a sequence, while a
point mutation
is a form of substitution that replaces one of the nucleotides without
changing the overall
number in the DNA. Indels, being insertions and/or deletions, can be used as
genetic markers
in natural populations, especially in phylogenetic studies. Indel frequency
tends to be
markedly lower than that of single nucleotide polymorphisms (SNP), except near
highly
repetitive regions, including homopolymers and microsatellites.
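The distinctions drawn above between single-nucleotide substitutions, multiple-nucleotide substitutions, and indels can be illustrated by comparing reference and alternate alleles. This is a minimal sketch, assuming alleles are given as plain strings; the function name and labels are hypothetical. Because population frequency is not known here, the labels use "SNV" and "MNV" rather than "SNP" and "MNP".

```python
def classify_variant(ref: str, alt: str) -> str:
    """Label a variant from its reference and alternate alleles.

    Equal-length alleles are treated as substitutions (no change in the
    overall number of nucleotides); unequal lengths are indels, and a net
    change of 1 to 50 nucleotides is labeled a microindel.
    """
    if len(ref) == len(alt):
        if ref == alt:
            return "no_change"
        return "SNV" if len(ref) == 1 else "MNV"
    net_change = abs(len(alt) - len(ref))
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    prefix = "microindel" if net_change <= 50 else "indel"
    return f"{prefix}_{kind}"


examples = {
    ("C", "A"): classify_variant("C", "A"),
    ("CA", "TG"): classify_variant("CA", "TG"),
    ("A", "ATTG"): classify_variant("A", "ATTG"),
}
```

The sketch treats any equal-length multi-base change as an MNV, even if only some positions differ; a real caller would normalize the alleles first.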
[0905] In some embodiments, an MSI (microsatellite instability)
module may identify
genetic hypermutability (predisposition to mutation) that results from
impaired DNA
mismatch repair (MMR). The presence of MSI represents phenotypic evidence that
MMR is
not functioning normally. MMR corrects errors that spontaneously occur during
DNA
replication, such as single base mismatches or short insertions and deletions.
The proteins
involved in MMR correct polymerase errors by forming a complex that binds to
the
mismatched section of DNA, excises the error, and inserts the correct sequence
in its place.
Cells with abnormally functioning MMR are unable to correct errors that occur
during DNA
replication, which causes the cells to accumulate errors in their DNA. This
causes the
creation of novel microsatellite fragments. Polymerase chain reaction-based
assays can
reveal these novel microsatellites and provide evidence for the presence of
MSI.
Microsatellites are repeated sequences of DNA. These sequences can be made of
repeating
units of one to six base pairs in length. Although the length of these
microsatellites is highly
variable from person to person and contributes to the individual DNA
"fingerprint," each
individual has microsatellites of a set length. The most common microsatellite
in humans is a
dinucleotide repeat of the nucleotides C and A, which occurs tens of thousands
of times
across the genome. Microsatellites are also known as simple sequence repeats
(SSRs).
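Since microsatellites are described above as tandem repeats of units one to six base pairs long, locating them in a sequence can be sketched with a regular expression. This only finds repeat tracts; it does not call MSI status, which would require comparing tract lengths between samples. The function name and the min_repeats threshold are assumptions made for illustration.

```python
import re
from typing import List, Tuple


def find_microsatellites(seq: str, min_repeats: int = 5) -> List[Tuple[int, str, int]]:
    """Return (start, unit, repeat_count) for tandem repeats of 1-6 bp units.

    The lazy quantifier makes the regex try the shortest unit first, so a
    run of A is reported with unit "A" rather than "AA".
    """
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence: a (CA)8 dinucleotide repeat and an A6 homopolymer.
sequence = "GGT" + "CA" * 8 + "TTG" + "A" * 6 + "C"
tracts = find_microsatellites(sequence)
```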
[0906] In some embodiments, a TMB (tumor mutational burden)
module may identify a
measurement of mutations carried by tumor cells and is a predictive biomarker
being studied
to evaluate its association with response to Immuno-Oncology (I-O) therapy.
Tumor cells
with high TMB may have more neoantigens, with an associated increase in cancer-
fighting T
cells in the tumor microenvironment and periphery. These neoantigens can be
recognized by
T cells, inciting an anti-tumor response. TMB has emerged more recently as a
quantitative
marker that can help predict potential responses to immunotherapies across
different cancers,
including melanoma, lung cancer, and bladder cancer. TMB is defined as the
total number of
mutations per coding area of a tumor genome. Importantly, TMB is consistently
reproducible. It provides a quantitative measure that can be used to better
inform treatment
decisions, such as selection of targeted or immunotherapies or enrollment in
clinical trials.
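The definition above (total number of mutations per coding area) can be written as a one-line calculation. The sketch below reports the value in mutations per megabase, which is a common convention but an assumption here; the function name and the example numbers are hypothetical.

```python
def tumor_mutational_burden(mutation_count: int, coding_bases_covered: int) -> float:
    """Mutations per megabase of coding sequence covered by the assay."""
    if coding_bases_covered <= 0:
        raise ValueError("coding_bases_covered must be positive")
    if mutation_count < 0:
        raise ValueError("mutation_count must be non-negative")
    return mutation_count * 1_000_000 / coding_bases_covered


# Hypothetical example: 24 mutations over 2.4 Mb of covered coding sequence.
tmb = tumor_mutational_burden(24, 2_400_000)
```

Which mutations are counted (for example, whether synonymous variants are included) and which regions count as covered are assay-specific choices that this sketch leaves to the caller.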
[0907] In some embodiments, a CNV (copy number variation) module
may identify
deviations from the normal genome, especially in the number of copies of a
gene, portions of
a gene, or other portions of a genome not defined by a gene, and any
subsequent implications
from analyzing genes, variants, alleles, or sequences of nucleotides. CNVs are a
phenomenon in which structural variations may occur in sections of
nucleotides, or base
pairs, that include repetitions, deletions, or inversions.
[0908] In some embodiments, a Fusions module may identify hybrid
genes formed from
two previously separate genes. Hybrid genes may be a result of translocation,
interstitial
deletion, or chromosomal inversion. Gene fusion can play an important role in
tumorigenesis. Fusion genes can contribute to tumor formation because they can
produce
much more active abnormal protein than non-fusion genes. Often, fusion genes
are
oncogenes that cause cancer; these include BCR-ABL, TEL-AML1 (ALL with t(12;21)),
AML1-ETO (M2 AML with t(8;21)), and TMPRSS2-ERG with an interstitial deletion on
chromosome 21, often occurring in prostate cancer. In the case of TMPRSS2-ERG,
by
disrupting androgen receptor (AR) signaling and inhibiting AR expression by
oncogenic ETS
transcription factor, the fusion product regulates prostate cancer. Most
fusion genes are
found in hematological cancers, sarcomas, and prostate cancer. BCAM-AKT2 is
a fusion
gene that is specific and unique to high-grade serous ovarian cancer.
Oncogenic fusion genes
may lead to a gene product with a new or different function from the two
fusion partners.
Alternatively, a proto-oncogene may be fused to a strong promoter, so that its
oncogenic function is activated by upregulation driven by the strong promoter of the
upstream fusion partner. The latter is common in lymphomas, where oncogenes
are
juxtaposed to the promoters of the immunoglobulin genes. Oncogenic fusion
transcripts may
also be caused by trans-splicing or read-through events. Since chromosomal
translocations
play such a significant role in neoplasia, a specialized database of
chromosomal aberrations
and gene fusions in cancer has been created. This database is called Mitelman
Database of
Chromosome Aberrations and Gene Fusions in Cancer.
[0909] In some embodiments, a VUS (variant of unknown
significance) module may
identify variants which are detected in the genome of a patient (especially in
a patient's
cancer specimen) but cannot be classified as pathogenic or benign at the time
of detection.
VUS are catalogued from publications to identify if they may be classified as
benign or
pathogenic.
[0910] In some embodiments, a DNA Pathways module identifies
defects in DNA repair
pathways which enable cancer cells to accumulate genomic alterations that
contribute to their
aggressive phenotype. Cancerous tumors rely on residual DNA repair capacities
to survive
the damage induced by genotoxic stress which leads to isolated DNA repair
pathways being
inactivated in cancer cells. DNA repair pathways are generally thought of as
mutually
exclusive mechanistic units handling different types of lesions in distinct
cell cycle phases.
Recent preclinical studies, however, provide strong evidence that
multifunctional DNA repair
hubs, which are involved in multiple conventional DNA repair pathways, are
frequently
altered in cancer. Identifying pathways which may be affected may lead to
important patient
treatment considerations.
[0911] In some embodiments, a Raw Counts module identifies a
count of the variants that
are detected from the sequencing data. For DNA, in some embodiments, this
comprises the
number of reads from sequencing which correspond to a particular variant in a
gene. For
RNA, in some embodiments, this comprises the gene expression counts or the
transcriptome
counts from sequencing.
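For the DNA case, tallying the number of reads that correspond to a particular variant in a gene might look like the following sketch. It assumes an upstream step has already assigned each supporting read a (gene, variant) label; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_variant_reads(read_calls: Iterable[Tuple[str, str]]) -> Counter:
    """Count supporting reads per (gene, variant) pair."""
    return Counter(read_calls)


# Toy input: one tuple per read that supports a variant.
reads = [
    ("EGFR", "L858R"),
    ("EGFR", "L858R"),
    ("KRAS", "G12D"),
    ("EGFR", "L858R"),
]
raw_counts = count_variant_reads(reads)
```

The same structure could hold RNA expression counts keyed by gene or transcript instead of by variant.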
[0912] In some embodiments, classifications comprise
classifications according to one or
more trained models for generating predictions and other structural variant
classification may
include evaluating features from the feature collection, alterations from the
alteration module,
and other classifications from within itself from one or more classification
modules.
Structural variant classification may provide classifications to a stored
classifications storage.
An exemplary classification module may include a classification of a CNV as
"Reportable," which may mean that the CNV has been identified in one or more
reference databases as influencing the tumor cancer characterization, disease state,
or pharmacogenomics; "Not Reportable," which may mean that the CNV has not been
identified as such; or "Conflicting Evidence," which may mean that the CNV has
evidence suggesting both "Reportable" and "Not Reportable."
Furthermore, a classification of therapeutic relevance is similarly
ascertained from any
reference dataset's mention of a therapy which may be impacted by the detection
(or non-
detection) of the CNV. Other classifications may include applications of
machine learning
algorithms, neural networks, regression techniques, graphing techniques,
inductive reasoning
approaches, or other artificial intelligence evaluations within modules. In
some
embodiments, a classifier for clinical trials may include evaluation of
variants identified from
the alteration module which have been identified as significant or reportable,
evaluation of all
clinical trials available to identify inclusion and exclusion criteria,
mapping the patient's
variants and other information to the inclusion and exclusion criteria, and
classifying clinical
trials as applicable to the patient or as not applicable to the patient. In
some embodiments,
similar classifications are performed for therapies, loss-of-function, gain-of-
function,
diagnosis, microsatellite instability, tumor mutational burden, indels, SNP,
MNP, fusions,
CNV, splicing, and other alterations which may be classified based upon the
results of the
alteration modules. Additionally, in some embodiments, models trained to
classify a type of
tumor for patients with tumors of unknown origin are generated according to the
disclosure
herein. In some embodiments, classifications are generated and stored as part
of a feature
collection in a stored classifications database.
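The three CNV classification labels described above can be illustrated with a small rule. The sketch assumes each reference database lookup has already been reduced to one of two hypothetical findings, "reportable" or "not_reportable"; how those findings are derived from a database is outside the sketch, and the function name is illustrative.

```python
from typing import Iterable


def classify_cnv(database_findings: Iterable[str]) -> str:
    """Combine per-database findings into a single CNV classification."""
    findings = list(database_findings)
    supports = any(f == "reportable" for f in findings)
    refutes = any(f == "not_reportable" for f in findings)
    if supports and refutes:
        return "Conflicting Evidence"
    if supports:
        return "Reportable"
    # No database identifies the CNV as influencing characterization,
    # disease state, or pharmacogenomics.
    return "Not Reportable"


label = classify_cnv(["reportable", "not_reportable"])
```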
[0913] In some embodiments, each of the feature collection,
alteration module(s),
structural variant, and feature store are communicatively coupled to a data
bus to transfer data
between each module for processing and/or storage. In some embodiments, each
of the
feature collection, alteration module(s), and classifications may be
communicatively coupled
to each other for independent communication without sharing the data bus.
[0914] In addition to the above features and enumerated modules,
in some embodiments,
feature modules may further include one or more of the following modules
within their
respective modules as a sub-module or as a standalone module.
[0915] In some embodiments, a germline/somatic DNA feature module
comprises a
feature collection associated with the DNA-derived information of a patient or
a patient's
tumor. These features may include raw sequencing results, such as those stored
in FASTQ,
BAM, VCF, or other sequencing file types known in the art; genes; mutations;
variant calls;
and variant characterizations. In some embodiments, genomic information from a
patient's
normal sample is stored as germline and genomic information from a patient's
tumor sample
is stored as somatic.
[0916] In some embodiments, an RNA feature module comprises a
feature collection
associated with the RNA-derived information of a patient, such as
transcriptome information.
These features may include raw sequencing results, transcriptome expressions,
genes,
mutations, variant calls, and variant characterizations.
[0917] In some embodiments, a metadata module comprises a feature
collection
associated with the human genome, protein structures and their effects, such
as changes in
energy stability based on a protein structure.
[0918] In some embodiments, a clinical module comprises a feature
collection associated
with information derived from clinical records of a patient and records from
family members
of the patient. These may be abstracted from unstructured clinical documents,
EMR, EHR, or
other sources of patient history. Information may include patient symptoms,
diagnosis,
treatments, medications, therapies, hospice, responses to treatments,
laboratory testing
results, medical history, geographic locations of each, demographics, or other
features of the
patient which may be found in the patient's medical record. Information about
treatments,
medications, therapies, and the like may be ingested as a recommendation or
prescription
and/or as a confirmation that such treatments, medications, therapies, and the
like were
administered or taken.
[0919] In some embodiments, an imaging module comprises a feature
collection
associated with information derived from imaging records of a patient. Imaging
records may
include H&E slides, IHC slides, radiology images, and other medical imaging
which may be
ordered by a physician during the course of diagnosis and treatment of various
illnesses and
diseases. These features may include TMB, ploidy, purity, nuclear-cytoplasmic
ratio, large
nuclei, cell state alterations, biological pathway activations, hormone
receptor alterations,
immune cell infiltration, immune biomarkers of MMR, MSI, PDL1, CD3, FOXP3,
HRD,
PTEN, PIK3CA; collagen or stroma composition, appearance, density, or
characteristics;
tumor budding, size, aggressiveness, metastasis, immune state, chromatin
morphology; and
other characteristics of cells, tissues, or tumors for prognostic predictions.
[0920] In some embodiments, an epigenome module, such as an
epigenome module from
Omics, comprises a feature collection associated with information derived from
DNA
modifications which are not changes to the DNA sequence and regulate the gene
expression.
These modifications are frequently the result of environmental factors based
on what the
patient may breathe, eat, or drink. These features may include DNA
methylation, histone
modification, or other factors which deactivate a gene or cause alterations to
gene function
without altering the sequence of nucleotides in the gene.
[0921] In some embodiments, a microbiome module, such as a
microbiome module from
Omics, comprises a feature collection associated with information derived from
the viruses
and bacteria of a patient. Viral genomics may be generated to identify which
viruses are
present in the patient's specimen(s) based upon the genomic features which map
to viral
DNA or RNA (e.g., a viral reference genome(s)) instead of the human genome.
These
features may include viral infections which may affect treatment and diagnosis
of certain
illnesses as well as the bacteria present in the patient's gastrointestinal
tract which may affect
the efficacy of medicines ingested by the patient.
[0922] In some embodiments, a proteome module, such as a proteome module from
Omics,
comprises a feature collection associated with information derived from the
proteins
produced in the patient. These features may include protein composition,
structure, and
activity; when and where proteins are expressed; rates of protein production,
degradation, and
steady-state abundance; how proteins are modified, for example, post-
translational
modifications such as phosphorylation; the movement of proteins between
subcellular
compartments; the involvement of proteins in metabolic pathways; how proteins
interact with
one another; or modifications to the protein after translation from the RNA
such as
phosphorylation, ubiquitination, methylation, acetylation, glycosylation,
oxidation, or
nitrosylation.
[0923] In some embodiments, additional Omics module(s) are included in Omics, such as
a feature collection associated with all the different fields of omics,
including: cognitive
genomics, a collection of features comprising the study of the changes in
cognitive processes
associated with genetic profiles; comparative genomics, a collection of
features comprising
the study of the relationship of genome structure and function across
different biological
species or strains; functional genomics, a collection of features comprising
the study of gene
and protein functions and interactions including transcriptomics;
interactomics, a collection
of features comprising the study relating to large-scale analyses of gene-
gene, protein-
protein, or protein-ligand interactions; metagenomics, a collection of
features comprising the
study of metagenomes such as genetic material recovered directly from
environmental
samples; neurogenomics, a collection of features comprising the study of
genetic influences
on the development and function of the nervous system; pangenomics, a
collection of features
comprising the study of the entire collection of gene families found within a
given species;
personal genomics, a collection of features comprising the study of genomics
concerned with
the sequencing and analysis of the genome of an individual such that once the
genotypes are
known, the individual's genotype can be compared with the published literature
to determine
likelihood of trait expression and disease risk to enhance personalized
medicine suggestions;
epigenomics, a collection of features comprising the study of supporting the
structure of
genome, including protein and RNA binders, alternative DNA structures, and
chemical
modifications on DNA; nucleomics, a collection of features comprising the
study of the
complete set of genomic components which form the cell nucleus as a complex,
dynamic
biological system; lipidomics, a collection of features comprising the study
of cellular lipids,
including the modifications made to any particular set of lipids produced by a
patient;
proteomics, a collection of features comprising the study of proteins,
including the
modifications made to any particular set of proteins produced by a patient;
immunoproteomics, a collection of features comprising the study of large sets
of proteins
involved in the immune response; phosphoproteomics, a collection of features
comprising the
study of phosphorylation patterns of proteins, including the modifications
made to any
particular set of proteins produced by a patient; nutriproteomics, a
collection of features
comprising the study of identifying molecular targets of nutritive and non-
nutritive
components of the diet including the use of proteomics mass spectrometry data
for protein
expression studies; proteogenomics, a collection of features comprising the
study of
biological research at the intersection of proteomics and genomics including
data which
identifies gene annotations; structural genomics, a collection of features
comprising the study
of 3-dimensional structure of every protein encoded by a given genome using a
combination
of modeling approaches; glycomics, a collection of features comprising the
study of sugars
and carbohydrates and their effects in the patient; foodomics, a collection of
features
comprising the study of the intersection between the food and nutrition
domains through the
application and integration of technologies to improve consumer's well-being,
health, and
knowledge; transcriptomics, a collection of features comprising the study of
RNA molecules,
including mRNA, rRNA, tRNA, and other non-coding RNA, produced in cells;
metabolomics, a collection of features comprising the study of chemical
processes involving
metabolites, or unique chemical fingerprints that specific cellular processes
leave behind, and
their small-molecule metabolite profiles; metabonomics, a collection of
features comprising
the study of the quantitative measurement of the dynamic multiparametric
metabolic response
of cells to pathophysiological stimuli or genetic modification; nutrigenetics,
a collection of
features comprising the study of genetic variations on the interaction between
diet and health
with implications to susceptible subgroups; cognitive genomics, a collection
of features
comprising the study of the changes in cognitive processes associated with
genetic profiles;
pharmacogenomics, a collection of features comprising the study of the effect
of the sum of
variations within the human genome on drugs; pharmacomicrobiomics, a
collection of
features comprising the study of the effect of variations within the human
microbiome on
drugs; toxicogenomics, a collection of features comprising the study of gene
and protein
activity within particular cell or tissue of an organism in response to toxic
substances;
mitointeractome, a collection of features comprising the study of the process
by which the
mitochondria proteins interact; psychogenomics, a collection of features
comprising the study
of the process of applying the powerful tools of genomics and proteomics to
achieve a better
understanding of the biological substrates of normal behavior and of diseases
of the brain that
manifest themselves as behavioral abnormalities, including applying
psychogenomics to the
study of drug addiction to develop more effective treatments for these
disorders as well as
objective diagnostic tools, preventive measures, and cures; stem cell
genomics, a collection
of features comprising the study of stem cell biology to establish stem cells
as a model
system for understanding human biology and disease states; connectomics, a
collection of
features comprising the study of the neural connections in the brain;
microbiomics, a
collection of features comprising the study of the genomes of the communities
of
microorganisms that live in the digestive tract; cellomics, a collection of
features comprising
the study of the quantitative cell analysis and study using bioimaging methods
and
bioinformatics; tomomics, a collection of features comprising the study of
tomography and
omics methods to understand tissue or cell biochemistry at high spatial
resolution from
imaging mass spectrometry data; ethomics, a collection of features comprising
the study of
high-throughput machine measurement of patient behavior; and videomics, a
collection of
features comprising the study of a video analysis paradigm inspired by
genomics principles,
where a continuous image sequence, or video, can be interpreted as the capture
of a single
image evolving through time of mutations revealing patient insights.
[0924] In some embodiments, a sufficiently robust collection of
features comprises all of
the features disclosed above; however, models and predictions based on the
available
features comprise models which are optimized and trained from a selection of
features that
are much more limiting than the exhaustive feature set. In some embodiments,
such a
constrained feature set comprises as few as tens to hundreds of features. For
example, a
model's constrained feature set may include the genomic results of a
sequencing of the
patient's tumor, derivative features based upon the genomic results, the
patient's tumor
origin, the patient's age at diagnosis, the patient's gender and race, and
symptoms that the
patient brought to their physician's attention during a routine checkup.
[0925] In some embodiments, a feature store may enhance a
patient's feature set through
the application of machine learning and analytics by selecting from any
features, alterations,
or calculated output derived from the patient's features or alterations to
those features. In
some embodiments, such a feature store may generate new features from the
original features
found in feature module or may identify and store important insights or
analysis based upon
the features. In some embodiments, the selection of features is based at least
upon an
alteration or calculation to be generated, and comprises the calculation of
single or multiple
nucleotide polymorphisms, insertions or deletions of the genome, a tumor
mutational burden, a
microsatellite instability, a copy number variation, a fusion, or other such
calculations. In
some embodiments, an exemplary output of an alteration or calculation
generated which may
inform future alterations or calculations includes a finding of hypertrophic
cardiomyopathy
(HCM) and variants in MYH7. In some embodiments, previously classified variants
may be
identified in the patient's genome which may inform the classification of
novel variants or
indicate a further risk of disease. An exemplary approach includes the
enrichment of variants
and their respective classifications to identify a region in MYH7 that is
associated with HCM.
Novel variants detected from a patient's sequencing localized to this region
would increase
the patient's risk for HCM. In some embodiments, features which may be
utilized in such an
alteration detection include the structure of MYH7 and classification of
variants therein. In
some embodiments, a model focused on enrichment may isolate such variants. An
exemplary
output of an alteration or calculation generated which may inform future
alterations or
calculations includes a finding of lung cancer and variants in EGFR, an
epidermal growth
factor receptor gene that is mutated in ~10% of non-small cell lung cancer and
~50% of lung
cancers from non-smokers. In some embodiments, previously classified variants
may be
identified in the patient's genome which may inform the classification of
novel variants or
indicate a further risk of disease. An exemplary approach may include the
enrichment of
variants and their respective classifications to identify a region nearby or
with evidence to
interact with EGFR and associated with cancer. Novel variants detected from a
patient's
sequencing localized to this region or interactions with this region would
increase the
patient's risk. In some embodiments, features which may be utilized in such an
alteration
detection include the structure of EGFR and classification of variants
therein. In some
embodiments, a model focused on enrichment may isolate such variants.
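The enrichment approach described above, in which previously classified variants identify a disease-associated region and novel variants localized to that region raise a patient's risk, might be approximated by binning variant positions. The sketch below is a crude stand-in for a statistical enrichment model: the bin size, the thresholds, the function names, and all coordinates are hypothetical and do not correspond to real MYH7 or EGFR positions.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple


def enriched_bins(classified: Iterable[Tuple[int, str]],
                  bin_size: int = 100,
                  min_variants: int = 3,
                  min_pathogenic_fraction: float = 0.8) -> Set[int]:
    """Return indices of bins where most classified variants are pathogenic."""
    totals = defaultdict(int)
    pathogenic = defaultdict(int)
    for position, label in classified:
        index = position // bin_size
        totals[index] += 1
        if label == "pathogenic":
            pathogenic[index] += 1
    return {
        index for index, n in totals.items()
        if n >= min_variants and pathogenic[index] / n >= min_pathogenic_fraction
    }


def in_enriched_region(position: int, bins: Set[int], bin_size: int = 100) -> bool:
    """True if a novel variant falls inside a bin flagged as enriched."""
    return position // bin_size in bins


# Hypothetical previously classified variants as (position, classification).
known = [
    (110, "pathogenic"), (125, "pathogenic"), (160, "pathogenic"), (190, "pathogenic"),
    (420, "benign"), (450, "benign"), (470, "pathogenic"),
]
bins = enriched_bins(known)
novel_flagged = in_enriched_region(150, bins)
```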
[0926] In some embodiments, the above referenced classification
model may include one
or more classification models which may be implemented as artificial
intelligence engines
and may include gradient boosting models, random forest models, neural
networks (NN),
regression models, Naive Bayes models, or machine learning algorithms (MLA).
An MLA
or a NN may be trained from a training data set. In an exemplary prediction
profile, a
training data set may include imaging, pathology, clinical, and/or molecular
reports and
details of a patient, such as those curated from an EHR or genetic sequencing
reports. MLAs
include supervised algorithms (such as algorithms where the
features/classifications in the
data set are annotated) using linear regression, logistic regression, decision
trees,
classification and regression trees, Naïve Bayes, nearest neighbor clustering;
unsupervised
algorithms (such as algorithms where no features/classification in the data
set are annotated)
using Apriori, means clustering, principal component analysis, random forest,
adaptive
boosting; and semi-supervised algorithms (such as algorithms where an
incomplete number
of features/classifications in the data set are annotated) using generative
approach (such as a
mixture of Gaussian distributions, mixture of multinomial distributions,
hidden Markov
models), low density separation, graph-based approaches (such as mincut,
harmonic function,
manifold regularization), heuristic approaches, or support vector machines.
NNs include
conditional random fields, convolutional neural networks, attention based
neural networks,
deep learning, long short term memory networks, or other neural models where
the training
data set includes a plurality of tumor samples, RNA expression data for each
sample, and
pathology reports covering imaging data for each sample. While MLA and neural
networks
identify distinct approaches to machine learning, the terms may be used
interchangeably
herein. Thus, a mention of MLA may include a corresponding NN or a mention of
NN may
include a corresponding MLA unless explicitly stated otherwise. Training may
include
providing optimized datasets, labeling these traits as they occur in patient
records, and
training the MLA to predict or classify based on new inputs. Artificial NNs
are efficient
computing models which have shown their strengths in solving hard problems in
artificial
intelligence. They have also been shown to be universal approximators (can
represent a wide
variety of functions when given appropriate parameters). In some embodiments,
some MLA
may identify features of importance and identify a coefficient, or weight, to
them. The
coefficient may be multiplied with the occurrence frequency of the feature to
generate a
score, and once the scores of one or more features exceed a threshold, certain
classifications
may be predicted by the MLA. In some embodiments, a coefficient schema may be
combined with a rule-based schema to generate more complicated predictions,
such as
predictions based upon multiple features. For example, ten key features may be
identified
across different classifications. In some embodiments, a list of coefficients
may exist for the
key features, and a rule set may exist for the classification. In some
embodiments, a rule set
may be based upon the number of occurrences of the feature, the scaled weights
of the
features, or other qualitative and quantitative assessments of features
encoded in logic known
to those of ordinary skill in the art. In other MLA, features may be organized
in a binary tree
structure. For example, key features which distinguish between the most
classifications may
exist as the root of the binary tree and each subsequent branch in the tree
until a classification
may be awarded based upon reaching a terminal node of the tree. For example, a
binary tree
may have a root node which tests for a first feature. Either the occurrence or
the non-occurrence of this
feature must hold (the binary decision), and the logic may traverse the
branch which is true
for the item being classified. Additional rules may be based upon thresholds,
ranges, or other
qualitative and quantitative tests. While supervised methods are useful when
the training
dataset has many known values or annotations, the nature of EMR/EHR documents
is that
there may not be many annotations provided. When exploring large amounts of
unlabeled
data, unsupervised methods are useful for binning/bucketing instances in the
data set. A
single instance of the above models, or two or more such instances in
combination, may
constitute a model for the purposes of models, artificial intelligence, neural
networks, or
machine learning algorithms, herein.
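By way of non-limiting illustration, the coefficient schema and the binary tree structure described above may be sketched as follows. The feature names, coefficient values, threshold, and class labels are hypothetical placeholders and are not taken from this disclosure. Summing the per-feature scores before comparing against the threshold is one possible reading of the scoring step.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set

# Hypothetical coefficients (weights) and threshold, for illustration only.
COEFFICIENTS: Dict[str, float] = {"feature_a": 0.6, "feature_b": 0.3, "feature_c": 0.1}
THRESHOLD = 1.0


def weighted_score(occurrences: Dict[str, int]) -> float:
    """Sum coefficient * occurrence frequency over the known features."""
    return sum(COEFFICIENTS[name] * count
               for name, count in occurrences.items() if name in COEFFICIENTS)


def predict_by_score(occurrences: Dict[str, int]) -> bool:
    """Predict the classification once the combined score exceeds the threshold."""
    return weighted_score(occurrences) > THRESHOLD


@dataclass
class Node:
    """Binary tree node: internal nodes test a feature, terminal nodes carry a label."""
    feature: Optional[str] = None
    if_present: Optional["Node"] = None
    if_absent: Optional["Node"] = None
    label: Optional[str] = None


def classify(node: Node, features: Set[str]) -> str:
    """Traverse the branch that is true for the item until a terminal node is reached."""
    while node.label is None:
        node = node.if_present if node.feature in features else node.if_absent
    return node.label


# Hypothetical tree: the root tests the feature assumed to separate the most classes.
TREE = Node(
    feature="feature_a",
    if_present=Node(feature="feature_b",
                    if_present=Node(label="class_1"),
                    if_absent=Node(label="class_2")),
    if_absent=Node(label="class_3"),
)
```

For example, under these assumed values, an item in which feature_a occurs twice and feature_b occurs once scores 1.5 and exceeds the threshold, while an item exhibiting only feature_a reaches the terminal node labeled class_2.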
[0927] Online Portal
[0928] In various embodiments, one or more statistical models and
analyses are combined
to accommodate a particular purpose and, through a variation of the initial
analysis, are used to
solve a number of problems. Such a combination of statistical models and
analyses is, in
some embodiments, stored as a notebook in the Interactive Analysis Portal.
Notebook is a
feature in the Interactive Analysis Portal which provides an easily accessible
framework for
building statistical models and analyses. Once the statistical models and
analyses have been
developed, they may then be shared with different users to analyze and find
answers to
scientific and business questions other than those for which they were
initially developed.
[0929] 1) In some embodiments, the Interactive Analysis Portal
allows input
customization through a simple, intuitive point-and-click/drag-and-drop
interface to narrow
down the cohort for analysis. Cohorts which have been selected, either through
the
Interactive Analysis Portal, Outliers, Smart Cohorts, or other portals of the
Interactive
Analysis Portal, are, in some embodiments, provided to a notebook for
processing.
[0930] 2) In some embodiments, a custom application interface
(API) having a library of
function calls which interface with the Interactive Analysis Portal,
underlying authorized
databases, and any supported statistical models, visualizations, arithmetic
models, and other
provided operations may be provided to the user to integrate a notebook or
workbook with
the Interactive Analysis Portal data, function calls, and other resources.
Exemplary function
calls may include listing authorized sources of data, selecting a datasource,
filtering the
datasource, listing clinical events of the patients in the current filtered
cohort, identification
of fusions from RNA or DNA, identification of genes from RNA or DNA,
identifying
matching clinical trials, DNA variants, identifying immunohistochemistry
(IHC), identifying
RNA expressions, identifying therapies in the cohort, identifying potential
therapies that are
applicable to treat patients in the cohort, and other cohort or dataset
processing.
[0931] 3) In some embodiments, the Interactive Analysis Portal
allows the Notebook
generation to perform one or more statistical models, analysis, and
visualization or reporting
of results to the narrowed down cohort without having the user code anything
in the notebook
as the selected models, analysis, visualizations, or reports of the notebook
itself are
configured to accept the cohort from the Interactive Analysis Portal and
provide the analysis
on the cohort as is, without user intervention at the code level. Some models
may have
hyperparameters or tuning parameters which may be selected, or the models
themselves may
identify the optimal parameters to be applied based on the cohort and/or other
models,
analysis, visualizations, or reports during run-time.
[0932] 4) In some embodiments, the Interactive Analysis Portal
displays the prepared
results to the user based on the selected notebook.
[0933] 5) In some embodiments, an associated user selects a
previously generated
notebook which applies selected analysis to the narrowed down cohort without
having the
user code or recode anything in the notebook as the notebook itself is
configured to accept the
cohort from the Interactive Analysis Portal and provide the notebook results
without user
intervention.
[0934] 6) In some embodiments, users track the computation
resources used by their
notebooks for understanding the costs for cloud computing or hardware
resources over the
network and may track the popularity of their notebook to judge the
effectiveness of the
statistical analysis that they provide through the notebook.
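By way of non-limiting illustration, the kind of function calls described in item 2) above (listing authorized sources, selecting a datasource, filtering it, and identifying therapies in the cohort) may be sketched with an in-memory stand-in. The class name PortalClient, its method names, and the sample records are hypothetical and do not describe the actual interface of the Interactive Analysis Portal.

```python
from typing import Callable, Dict, List

Patient = Dict[str, object]


class PortalClient:
    """Hypothetical in-memory stand-in for a notebook-facing API."""

    def __init__(self, sources: Dict[str, List[Patient]]):
        self._sources = sources
        self._cohort: List[Patient] = []

    def list_sources(self) -> List[str]:
        """List the data sources the user is authorized to read."""
        return sorted(self._sources)

    def select_source(self, name: str) -> "PortalClient":
        """Select a datasource; the cohort starts as every patient in it."""
        self._cohort = list(self._sources[name])
        return self

    def filter(self, predicate: Callable[[Patient], bool]) -> "PortalClient":
        """Narrow the current cohort to patients satisfying the predicate."""
        self._cohort = [p for p in self._cohort if predicate(p)]
        return self

    def therapies(self) -> List[str]:
        """Identify the distinct therapies recorded in the current filtered cohort."""
        return sorted({t for p in self._cohort for t in p.get("therapies", [])})


# Hypothetical records, for illustration only.
client = PortalClient({"demo": [
    {"id": 1, "cancer_site": "lung", "therapies": ["erlotinib"]},
    {"id": 2, "cancer_site": "breast", "therapies": ["tamoxifen"]},
    {"id": 3, "cancer_site": "lung", "therapies": ["erlotinib", "cisplatin"]},
]})
lung_therapies = (client.select_source("demo")
                  .filter(lambda p: p["cancer_site"] == "lung")
                  .therapies())
```

Returning the client from each call lets a notebook chain the select and filter steps in the order a user would apply them in the portal.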
[0935] In certain embodiments, notebooks provide a benefit to
users by allowing the
Interactive Analysis Portal to provide custom templates to their selected data
and leverage
pre-built healthcare statistical models to provide results to users who are
not sophisticated in
programming. Internal teams may analyze curated data in order to support new
healthcare
insights that both help improve patient care and improve life science
research. Similarly,
external users have easy access to this proprietary real-world data for
analysis and access to
proprietary statistical models.
[0936] A billing model for a user may be provided on a
subscription basis or an on-
demand basis. For example, a user may subscribe to one or more data sets for a
period of
time, such as a monthly or yearly subscription, or the user may pay on a per-
access basis for
data and notebook usage, such as for loading a specific cohort with
corresponding notebook
and paying a fee to generate the instant results for consumption. Users may
desire a
benchmarking and optimization portal through which they may view and optimize
their
storage and computing resource usage.
[0937] Generating a notebook may be performed with a GUI for
notebook editing. A
user may configure a reporting page for a notebook. A reporting page may
include text,
images, and graphs as selected and populated by the users. Preconfigured
elements may be
selected from a list, such as a dropdown list or a drag-and-drop menu.
Preconfigured
elements include statistical analysis modules and machine learning models. For
example, a
user may wish to perform linear regression on the data with respect to
specific features. A
user may select linear regression, and a menu with checkboxes may appear with
features
from their data set which should be supplied to the linear regression model.
Once filled out, a
template for reporting the linear regression results with respect to the
selected features may
be added to the reporting page at a location identified by the active cursor
or the drop location
for a drag-and-drop element. If a user wishes to solve a problem using a
machine learning
model, it may be added to the sheet. A header may be populated identifying the
model, the
hypertuning parameters, and the reported results. In some instances, a model
that was
previously trained may then be applied to the current cohort. In other
instances, the model
may be trained on the fly, for example by selecting annotated features and
associated
outcomes for which the model should be trained. In an unsupervised machine
learning
model, the model may not require selection of annotated features as the
features will be
identified during training. In some embodiments, if a selected statistical
model requires
results from a trained model which are not computed in the template, the
template may
automatically add the trained model to generate the required results prior to
inserting the
selected statistical model to the notebook.
[0938] Statistical analysis models may be predesigned for
calculating the arithmetic mean
of the cohort with respect to a selected feature, the standard
deviation/distribution of the
cohort for a selected feature, regression relationships between variables for
selected features,
sample size determining models for subsetting the cohort into the optimal sub-
population for
analysis, or t-testing modules for identifying statistically significant
features and correlations
in the cohort. Other precomputed statistical analysis modules may perform
cohort analysis to
identify significant correlations and/or features in the cohort, data mining
to identify
meaningful patterns, or data dredging to match statistical models to the data
and report out
which models may be applicable and add those models to the notebook.
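By way of non-limiting illustration, the arithmetic mean, standard deviation, and t-testing modules described above may be sketched using only the Python standard library. The feature values are hypothetical, and the use of Welch's unequal-variance form of the t statistic is an assumption, as the disclosure does not specify a particular variant.

```python
import math
from statistics import mean, stdev, variance
from typing import Sequence


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical values of a selected feature (for example, age at initial
# diagnosis) for two sub-groups of a cohort.
group_a = [61.0, 64.0, 58.0, 70.0, 67.0]
group_b = [52.0, 49.0, 55.0, 60.0, 54.0]

summary = {
    "mean_a": mean(group_a),
    "stdev_a": stdev(group_a),          # sample standard deviation
    "t_statistic": welch_t(group_a, group_b),
}
```

A larger absolute t statistic suggests that the selected feature differs between the two sub-groups; converting it to a p-value would additionally require the t distribution, which is omitted here.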
[0939] Machine learning models may apply linear regression
algorithms, non-linear
regression, logistic regression algorithms, classification models, bootstrap
resampling
models, subset selection models, dimensionality reduction models, tree-based
models (such
as bagging, boosting, and random forest), and other supervised or unsupervised
models. As
each model is selected, a target output may be requested from the user
specifying which
feature(s) the model should identify, classify, and/or report. For example, a
user may select
for the model to identify which features most closely correlate to patient
survival in the
cohort, or which features most closely correlate with a positive treatment
outcome in the
cohort. The user may also select which classification labels from the
classification labels of
the model that they wish the model to classify. In an example where the model
may classify
the cohort according to five labels, the user may specify one or more labels
as a binary
classification (patient has label, patient does not have label) such as
whether a patient with a
tumor of unknown origin originated from the breast, lung, or brain. The user
may select only
breast to identify for any tumors of unknown origin whether the tumor may be
classified as
coming from the breast or not from the breast.
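By way of non-limiting illustration, collapsing a multi-label classification into the binary classification selected by the user (patient has label, patient does not have label) may be sketched as follows. The function name and the predicted labels are hypothetical.

```python
from typing import Iterable, List


def binarize(labels: Iterable[str], selected: str) -> List[bool]:
    """Map each predicted label to True if it equals the selected label, else False."""
    return [label == selected for label in labels]


# Hypothetical predicted origins for five tumors of unknown origin.
predicted_origin = ["breast", "lung", "brain", "breast", "lung"]
is_breast = binarize(predicted_origin, "breast")
```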
[0940] A system for predicting and analyzing patient cohort
response, progression, and
survival may include a back-end layer that includes a patient data store
accessible by a patient
cohort selector module in communication with a patient cohort timeline data
storage. The
patient cohort selector module interacts with a front-end layer that includes
an interactive
analysis portal that may be implemented, in one instance, via a web browser to
allow for on-
demand filtering and analysis of the data store.
[0941] The interactive analysis portal may include a plurality of
user interfaces including
an interactive cohort selection filtering interface that, as discussed in
greater detail below,
permits a user to query and filter elements of the data store. As discussed in
greater detail
below, the portal also may include a cohort funnel and population analysis
interface, a patient
timeline analysis user interface, a patient survival analysis user interface,
and a patient event
likelihood analysis user interface. The portal further may include a patient
next event analysis user
interface and one or more patient future analysis user interfaces.
[0942] The back-end layer also may include a distributed
computing and modeling layer
that receives data from the patient cohort timeline data storage to provide
inputs to a plurality
of modules, including a time-to-event modeling module that powers the patient
survival
analysis user interface, an event likelihood module that calculates the
likelihood of one or
more events received at the patient event likelihood analysis user interface
for subsequent
display in that user interface, a next event modeling module that generates
models of one or
more next events for subsequent display at the patient next event analysis
user interface, and
one or more future modeling modules that generate one or more future models
for subsequent
display at the one or more patient future analysis user interfaces.
[0943] The patient data store may be a pre-existing dataset that
includes patient clinical
history, such as demographics, comorbidities, diagnoses and recurrences,
medications,
surgeries, and other treatments along with their response and adverse effects
details. The
Patient Data Store may also include patient genetic/molecular sequencing and
genetic
mutation details relating to the patient, as well as organoid modeling
results. In one aspect,
these datasets may be generated from one or more sources. For example,
institutions
implementing the system may be able to draw from all of their records; for
example, all
records from all doctors and/or patients connected with the institution may be
available to the
institution's agents, physicians, researchers, or other authorized members.
Similarly, doctors
may be able to draw from all of their records; for example, records for all of
their patients.
Alternatively, certain system users may be able to buy or license access to
the datasets, such
as when those users do not have immediate access to a sufficiently robust
dataset, when those
users are looking for even more records, and/or when those users are looking
for specific data
types, such as data reflecting patients having certain primary cancers,
metastases by origin
site and/or diagnosis site, recurrences by origin, metastases, or diagnosis
sites, etc.
[0944] Patient Cohort Filtering User Interface
[0945] A first embodiment of a patient cohort selection filtering
interface may be
provided as a side pane provided along a height (or, alternatively, a length)
of a display
screen, through which attribute criteria (such as clinical, molecular,
demographic etc.) can be
specified by the user, defining a patient population of interest for further
analysis. The side
pane may be hidden or expanded by selecting it, dragging it, double-clicking
it, etc.
[0946] Additionally, or alternatively, the system may recognize
one or more attributes
defined for tumor data stored by the system, where those attributes may be,
for example,
genotypic, phenotypic, genealogical, or demographic. The various selectable
attribute criteria
may reflect patient-related metadata stored in the patient data store, where
exemplary
metadata may include, for instance: Project Name (which may reflect a database
storing a list
of patients), Gender, Race; Cancer, Cancer Site, Cancer Name; Metastasis,
Cancer Name;
Tumor Site (which may reflect where the tumor was located), Stage (such as I,
II, III, IV, and
unknown), M Stage (such as m0, m1, m2, m3, and unknown); Medication (such as
by Name
or Ingredient); Sequencing (such as gene name or variant), MSI (Microsatellite
Instability)
status, TMB (Tumor Mutational Burden) status; Procedure (such as, by Name); or
Death
(such as, by Event Name or Cause of Death).
[0947] The system also may permit a user to filter patient data
according to any of the
criteria listed herein including those listed under the heading "Features and
Feature
Modules," and include one or more of the following additional criteria:
institution,
demographics, molecular data, assessments, diagnosis site, tumor
characterization, treatment,
or one or more internal criteria. The institution option may permit a user to
filter according to
a specific facility. The demographics option may permit a user to sort, for
example, by one
or more of gender, death status, age at initial diagnosis, or race. The
molecular data option
may permit a user to filter according to variant calls (for example, when
there is molecular
data available for the patient, what the particular gene name, mutation,
mutation effect,
and/or sample type is), abstracted variants (including, for example, gene name
and/or
sequencing method), MSI status (for example, stable, low, or high), or TMB
status (for
example, selectable within or outside of a user-defined range). Assessments
may permit a
user to filter according to various system-defined criteria such as smoking
status and/or
menopausal status. Diagnosis site may permit a user to filter according to
primary and/or
metastatic sites. Tumor characterization may permit the user to filter
according to one or
more tumor-related criteria, for example, grade, histology, stage, TNM
Classification of
Malignant Tumours (TNM) and/or each respective T value, N value, and/or M
value.
Treatment may permit the user to select from among various treatment-related
options,
including, for instance, an ingredient, a regimen, a treatment type, etc.
[0948] Certain criteria may permit the user to select from a
plurality of sub-criteria that
may be indicated once the initial criteria are selected. Other criteria may
present the user
with a binary option, for example, deceased or not. Still other criteria may
present the user
with slider or range-type options, for example, age at initial diagnosis may
be presented as a
slider with user-selectable lower and upper bounds. Still further, for any of
these options, the
system may present the user with a radio button or slider to alternate between
whether the
system should include or exclude patients based on the selected criterion. It
should be
understood that the examples described herein do not limit the scope of the
types of
information that may be used as criteria. Any type of medical information
capable of being
stored in a structured format may be used as a criterion.
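By way of non-limiting illustration, the binary, range-type, and include/exclude options described above may be sketched as follows. The class and function names, field names, and patient records are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Patient = Dict[str, object]


@dataclass
class Criterion:
    """A filter criterion with a toggle between inclusion and exclusion."""
    test: Callable[[Patient], bool]
    include: bool = True  # False turns the criterion into an exclusion filter

    def keeps(self, patient: Patient) -> bool:
        return self.test(patient) == self.include


def binary(field: str, value: object, include: bool = True) -> Criterion:
    """Binary option, for example deceased or not."""
    return Criterion(lambda p: p.get(field) == value, include)


def in_range(field: str, low: float, high: float, include: bool = True) -> Criterion:
    """Slider-style option with user-selectable lower and upper bounds."""
    return Criterion(lambda p: low <= p[field] <= high, include)


def apply(patients: List[Patient], criteria: List[Criterion]) -> List[Patient]:
    """Keep the patients that satisfy every selected criterion."""
    return [p for p in patients if all(c.keeps(p) for c in criteria)]


# Hypothetical records, for illustration only.
patients = [
    {"id": 1, "deceased": False, "age_at_diagnosis": 45},
    {"id": 2, "deceased": True, "age_at_diagnosis": 62},
    {"id": 3, "deceased": False, "age_at_diagnosis": 71},
]
selected = apply(patients, [
    in_range("age_at_diagnosis", 40, 65),
    binary("deceased", True, include=False),  # exclude deceased patients
])
```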
[0949] In another embodiment, the user interface may include a
natural language search
style bar to facilitate filter criteria definition for the cohort, for
example, in the "Ask Gene"
tab of the user interface or via a text input of the filtering interface. In
one aspect, an ability
to specify a query, either via keyboard-type input or via machine-interpreted
dictation, may
define one or more of the subsequent layers of a cohort funnel (described in
greater detail in
the next section). Thus, for example, when employing traditional natural
language
processing software or techniques, an input of "breast cancer patients" would
cause the
system to recognize a filter of "cancer site == breast cancer" and add that as
the next layer of
filtering. Similarly, the system would recognize an input of "pancreatic
patients with adverse
reactions to gemcitabine" and translate it into multiple successive layers of
filtering, for
example, "cancer site == pancreatic cancer" AND "medication == gemcitabine"
AND
"adverse reaction == not null."
[0950] In a second aspect, the natural language processing may
permit a user to use the
system to query for general insights directly, thereby both narrowing down a
cohort of
patients via one or more funnel levels and also causing the system to display
an appropriate
summary panel in the user interface. Thus, in the situation that the system
receives the query
"What is the 5 years progression-free survival rate for stage III colorectal
cancer patients,
after radiotherapy?" it would translate it into a series of filters such as
"cancer site ==
colorectal" AND "stage == III" AND "treatment == radiotherapy" and then
display five-year
progression-free survival rates using, for example, the patient survival
analysis user interface
30. Similarly, the system would translate the query "What percentage of female
lung cancer patients are post-menopausal at a time of diagnosis?" into a series
of filters such as
"gender == female," "cancer site == lung," and "temporal == at diagnosis,"
determine how
many of the resulting patients had data reflecting a post-menopause situation,
and then
determine the relevant percentage, for example, displaying the results through
one or more
statistical summary charts.
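By way of non-limiting illustration, the translation of a natural language input into successive filter layers may be sketched with a simple keyword lookup. The phrase tables and function name are hypothetical, and a keyword lookup is a deliberately simplified substitute for the natural language processing software or techniques referred to above.

```python
from typing import List, Tuple

Filter = Tuple[str, str, str]  # (attribute, operator, value)

# Hypothetical phrase tables; a practical system would rely on an NLP pipeline.
CANCER_SITES = {"breast": "breast cancer", "pancreatic": "pancreatic cancer"}
MEDICATIONS = {"gemcitabine"}


def parse_query(query: str) -> List[Filter]:
    """Translate recognized keywords into an ordered list of filter layers."""
    words = query.lower().replace(",", " ").split()
    filters: List[Filter] = []
    for word in words:
        if word in CANCER_SITES:
            filters.append(("cancer site", "==", CANCER_SITES[word]))
    for word in words:
        if word in MEDICATIONS:
            filters.append(("medication", "==", word))
    if "adverse" in words:
        filters.append(("adverse reaction", "==", "not null"))
    return filters


layers = parse_query("pancreatic patients with adverse reactions to gemcitabine")
```

Each returned tuple corresponds to one successive layer of the cohort funnel and is applied in list order.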
[0951] Cohort Funnel and Population Analysis User Interface
[0952] The cohort funnel and population analysis user interface
may be configured to
permit a user to conduct analysis of a cohort, for the purpose of identifying
key inflection
points in the distribution of patients exhibiting each attribute of interest,
relative to the
distributions in the general patient population or a patient population whose
data is stored in
the patient data store. In one aspect, the filtering and selection of
additional patient-related
criteria discussed above may be used in connection with the cohort funnel and
population
analysis user interface.
[0953] In another embodiment, the system may include a selectable
button or icon that
opens a dialogue box which shows a plurality of selectable tabs, each tab
representing the
same or similar filtering criteria discussed above (Demographics, Molecular
Data,
Assessments, Diagnosis Site, Tumor Characterization, and Treatment). Selection
of each tab
may present the user with the same or similar options for each respective
filter as discussed
above (for example, selecting "Demographics" may present the user with further
options
relating to: Gender, Death Status, Age at Initial Diagnosis, or Race). The
user then may
select one or more options, select "next," and then select whether it is an
inclusion or
exclusion filter, and the corresponding selection is added to the funnel
(discussed in greater
detail below), with an icon moving to be below a next successively narrower
portion of the
funnel.
[0954] Additionally, or alternatively, looking at the cohort, or
set of patients in a
database, the system permits filtering by a plurality of clinical and
molecular factors via a
menu. For example, and with regard to clinical factors, the system may include
filters based
on patient demographics, cancer site, tumor characterization, or molecular
data which further
may include their own subsets of filterable options, such as histology, stage,
and/or grade-
based options for tumor characterization. With regard to molecular factors,
the system may
permit filtering according to variant calls, abstracted variants, MSI, and/or
TMB.
[0955] Although the examples discussed herein provide analysis
with regard to various
cancer types, in other embodiments, it will be appreciated that the system may
be used to
indicate filtered display of other disease conditions, and it should be
understood that the
selection items will differ in those situations to focus particularly on the
relevant conditions
for the other disease.
[0956] The cohort funnel and population analysis user interface
visually may depict the
number of patients in the data set, either all at once or progressively upon
receiving a user's
selection of multiple filtering criteria. In one aspect, the display of
patient frequencies by
filter attribute may be provided using an interactive funnel chart. With each
selection, the
user interface updates to illustrate the reduction in results matching the
filter criteria; for
example, as more filter criteria are added, fewer patients match all of the
selected criteria,
and the display is updated upon receiving each of a user's filtering factors.
[0957] The above filtering can be performed upon receiving each
user selection of a filter
criterion, the funnel updating to show the narrowing span of the dataset upon
each filter
selection. In that situation a filtering menu such as the one discussed above
may remain
visible in each tab as they are toggled, or may be collapsed to the side, or
may be represented
as a summary of the selected filtered options to keep the user apprised of the
reduced data
set/size.
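By way of non-limiting illustration, the progressive narrowing depicted by the funnel may be sketched as a count of remaining patients after each successive filter. The function name, field names, and records are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, object]
Step = Tuple[str, Callable[[Patient], bool]]


def funnel_counts(patients: List[Patient], steps: List[Step]) -> List[Tuple[str, int]]:
    """Return the number of patients remaining after each successive filter."""
    counts = [("all patients", len(patients))]
    remaining = patients
    for name, predicate in steps:
        remaining = [p for p in remaining if predicate(p)]
        counts.append((name, len(remaining)))
    return counts


# Hypothetical records, for illustration only.
patients = [
    {"site": "colorectal", "stage": "III", "treatment": "radiotherapy"},
    {"site": "colorectal", "stage": "III", "treatment": "chemotherapy"},
    {"site": "colorectal", "stage": "II", "treatment": "radiotherapy"},
    {"site": "lung", "stage": "III", "treatment": "radiotherapy"},
    {"site": "breast", "stage": "I", "treatment": "surgery"},
]
levels = funnel_counts(patients, [
    ("cancer site == colorectal", lambda p: p["site"] == "colorectal"),
    ("stage == III", lambda p: p["stage"] == "III"),
    ("treatment == radiotherapy", lambda p: p["treatment"] == "radiotherapy"),
])
```

Each entry of the result supplies the width of one funnel level, so the list may be recomputed and redrawn whenever a filter is added.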
[0958] With regard to each filtering method discussed above, the
combination of factors
may be based on Boolean-style combinations. Exemplary Boolean-style
combinations may
include, for filtering factors A and B, permitting the user to select whether
to search for
patients with "A AND B," "A OR B," "A AND NOT B," "B AND NOT A," etc.
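By way of non-limiting illustration, the Boolean-style combinations may be sketched with set operations over patient identifiers. The identifiers are hypothetical.

```python
# Hypothetical sets of patient identifiers matching filtering factors A and B.
a = {1, 2, 3, 4}
b = {3, 4, 5}

combinations = {
    "A AND B": a & b,      # intersection
    "A OR B": a | b,       # union
    "A AND NOT B": a - b,  # difference
    "B AND NOT A": b - a,
}
```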
[0959] The final filtered cohort of interest may form the basis
for further detailed analysis
in the modules or other user interfaces described below. The population of
interest is called a
"cohort". The user interface can provide fixed functional attribute selectors
pre-populated
appropriately based on the available data attributes in a Patient Data Store.
[0960] The display may further indicate a geographic location
clustering plot of patients
and/or demographic distribution comparisons with publicly reported statistics
and/or
privately curated statistics.
[0961] Patient Timeline Analysis Module
[0962] Additionally, the system may include a patient timeline
analysis module that
permits a user to review the sequence of events in the clinical life of each
patient. It will be
appreciated that this data may be anonymized, as discussed above, in order to
protect
confidentiality of the patient data.
[0963] Once a user has provided all of his or her desired filter
criteria, e.g., via the cohort
funnel & population analysis user interface, the system permits the user to
analyze the filtered
subset of patients. With respect to the user interface depicted in the
figures, this procedure
may be accomplished by selecting the "Analyze Cohort" option presented in the
upper right-
hand corner of the interface.
[0964] After requesting analysis of the filtered subset of
patients, the user interface may
generate a data summary window in the patient timeline analysis user
interface, with one or
more regions providing information about the selected patient subset, for
example, a number
of other distributions across clinical and molecular features. In one aspect,
a first region may
include demographic information such as an average patient age and/or a plot
of patient ages.
A second region may include additional demographic information, such as gender
information, for the subset of patients. A third region may include a summary
of certain
clinical data, including, for example, an analysis of the medications taken by
each of the
patients in the subset. Similarly, a fourth region may include molecular data
about each of
the patients, for example, a breakdown of each genomic variant or alteration
possessed by the
patients in the subset.
[0965] The user interface also permits a user to query the data
summary information
presented in the data summary window or region in order to sort that data
further, e.g., using
a control panel. The system may be configured to sort the patient data based
on one or more
factors including, for example, gender, histology, menopausal status,
response, smoking
status, stage, and surgical procedures. Unlike the filtering discussed above,
selecting one or more of these options may not reduce the sample size of
patients summarized in the data summary window. Instead, the sort functions may
subdivide the
summarized information into one or more subcategories. For example, medication
information may be sorted by having additional response data layered over it
within the data
summary window, along with a legend explaining the layered response data.
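By way of non-limiting illustration, layering response data over medication information without reducing the sample size may be sketched as a two-level count. The function name, field names, and records are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def layer_counts(patients: List[Dict[str, str]], primary: str,
                 layer: str) -> Dict[str, Dict[str, int]]:
    """Subdivide counts of the primary attribute by a layered attribute."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for patient in patients:
        table[patient[primary]][patient[layer]] += 1
    return {key: dict(counter) for key, counter in table.items()}


# Hypothetical records, for illustration only.
patients = [
    {"medication": "gemcitabine", "response": "partial"},
    {"medication": "gemcitabine", "response": "none"},
    {"medication": "erlotinib", "response": "partial"},
    {"medication": "gemcitabine", "response": "partial"},
]
layered = layer_counts(patients, "medication", "response")
```

The counts across all subcategories still total the number of patients in the subset, which distinguishes this sorting from the filtering performed in the funnel.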
[0966] The subset of patients selected by the user also may be
compared against a second
subset (or "cohort") of patients, e.g., via a drop-down menu, thereby
facilitating a side-by-
side analysis of the groups. Doing so may permit the user to quickly and
easily see any
similarities, as well as any noticeable differences, between the subsets.
[0967] In one embodiment, an event timeline Gantt style chart is
provided for a high-
level overview, coupled with a tabular detail panel. The display may also
enable the
visualization and comparison of multiple patients concurrently on a normalized
timeline, for
the purposes of identifying both areas of overlap, and potential discontinuity
across a patient
subset.
[0968] Patient "Survival" Analysis Module
[0969] The system further may provide survival analysis for the
subset of patients
through use of the patient survival analysis user interface. This modeling and
visualization
component may enable the user to interactively explore time until event (and
probability at
time) curves and their confidence intervals, for sub-groups of the filtered
cohort of interest.
The time series inception and target events can be selected and dynamically
modified by the
user, along with attributes on which to cluster patient groups within the
chosen population, all
while the curve visualizer reactively adapts to the provided parameters.
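By way of non-limiting illustration, a time until event curve may be sketched with the Kaplan-Meier product-limit estimator. The disclosure does not name a particular estimator, so its use here is an assumption, and the durations are hypothetical. Confidence intervals are omitted for brevity.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], observed: List[bool]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) pairs at each time with an observed event.

    observed[i] is False when patient i was censored (no target event recorded).
    """
    at_risk = len(durations)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored patients both leave the risk set
    return curve


# Hypothetical months from the starting event to the target event or censoring.
curve = kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])
```

To compare sub-groups of the filtered cohort, the same function may be called once per group and the resulting curves drawn on shared axes.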
[0970] In order to provide the user with flexibility to define
the metes and bounds of that
analysis, the system may permit the user to select one or both of the starting
and ending
events upon which that analysis is based. Exemplary starting events include an
initial
primary disease diagnosis, progression, metastasis, regression, identification
of a first primary
cancer, an initial prescription of medication, etc. Conversely, exemplary
ending events may
include progression, metastasis, recurrence, death, a period of time, and
treatment start/end
dates. Selecting a starting event sets an anchor point for all patients from
which the curve
begins, and selecting an end event sets a horizon for which the curve is
predicting.
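By way of non-limiting illustration, anchoring each patient at a selected starting event and measuring the time to a selected ending event may be sketched as follows. The function name and event names are hypothetical. Treating patients without an ending event as censored at their last follow-up date is an assumption not stated in this disclosure, and the sketch assumes the starting and ending events are different event types.

```python
from datetime import date
from typing import List, Optional, Tuple

Event = Tuple[date, str]  # (date of event, event type)


def time_to_event(timeline: List[Event], start: str, end: str,
                  last_follow_up: date) -> Optional[Tuple[int, bool]]:
    """Return (days, observed) measured from the first start event.

    observed is False when no end event follows the anchor, in which case the
    duration runs to last_follow_up. Returns None if the patient has no start event.
    """
    ordered = sorted(timeline)
    anchor = next((day for day, kind in ordered if kind == start), None)
    if anchor is None:
        return None
    for day, kind in ordered:
        if kind == end and day >= anchor:
            return ((day - anchor).days, True)
    return ((last_follow_up - anchor).days, False)


# Hypothetical patient timeline, for illustration only.
timeline = [
    (date(2020, 1, 10), "diagnosis"),
    (date(2020, 3, 1), "medication"),
    (date(2020, 7, 8), "progression"),
]
progressed = time_to_event(timeline, "diagnosis", "progression", date(2021, 1, 10))
survived = time_to_event(timeline, "diagnosis", "death", date(2021, 1, 10))
```

The resulting pairs of duration and observation flag are the inputs typically expected by a survival estimator.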
[0971] The analysis may be presented to the user in the form of a
plot of ending event, for
example, progression free survival or overall survival, versus time.
Progression for these
purposes may reflect the occurrence of one or more progression events, for
example, a
metastases event, a recurrence, a specific measure of progression for a drug
or independent of
a drug, a certain tumor size or change in tumor size, or an enriched
measurement (such as
measurements which are indirectly extracted from the underlying clinical data
set).
Exemplary enriched measurements may include detecting a stage change (such as
by
detecting a stage 2 categorization changed to stage 3), a regression, or via
an inference (such
as both stage 3 and metastases are inferred from detection of stages 2 and 4,
but no detection
of stage 3).
[0972] Additionally, the system may be configured to permit the
user to focus or zoom in
on a particular time span within the plot. In particular, the user may be able
to zoom in the x-
axis only, the y-axis only, or both the x- and y-axes at the same time. This
functionality may
be particularly useful depending on the type of disease being analyzed, as
certain, aggressive
diseases may benefit from analyzing a smaller window of time than other
diseases. For
example, survival rates for patients with pancreatic cancer tend to be
significantly lower than
for other types of cancer; thus, when analyzing pancreatic cancer, it may be
useful to the user
to zoom in to a shorter time period, for example, going from about a 5-year
window to about
a 1-year window.
[0973] The user interface also may be configured to modify its
display and present
survival information of smaller groups within the subset by receiving user
inputs
corresponding to additional grouping or sorting criteria. Those criteria may
be clinical or
molecular factors, and the user interface may include a selector such as one
or more drop-
down menus permitting the user to select, e.g., any beginning event or ending
event, as well
as gender, gene, histology, regimens, smoking status, stage, surgical
procedures, etc.
[0974] Selecting one of the criteria then may present the user
with a plurality of options
relevant to that criterion. For example, selecting "regimens" may cause the
system to use one
or more value sets to populate a selectable field generated within the user
interface to prompt
the user to select one or more of the specific medication regimens undertaken
by one or more
of the patients within the subset. Thus, selecting the "Gemcitabine +
Paclitaxel" option,
followed by the "FOLFIRINOX" option, results in the system analyzing the
patient subset
data, determining which patients' records include data corresponding to either
of the selected
regimens, recalculating the survival statistics for those separate groups of
patients, and
updating the user interface to include separate survival plots for each
regimen. Adding a
group/adding two or more selections may result in the system plotting them on
the same chart
to view them side by side, and the user interface may generate a legend with
name, color, and
sample size to distinguish each group.
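The regimen grouping described above might be sketched as follows. The patient record layout, the color palette, and the `group_by_regimen` and `legend` helpers are hypothetical; a patient who received both selected regimens is counted in both groups, which is one possible reading of "either of the selected regimens".

```python
def group_by_regimen(patients, selected_regimens):
    """Map each selected regimen to the patients whose records include it."""
    groups = {regimen: [] for regimen in selected_regimens}
    for patient in patients:
        for regimen in selected_regimens:
            if regimen in patient["regimens"]:
                groups[regimen].append(patient["id"])
    return groups

def legend(groups, palette=("blue", "orange", "green")):
    """Build legend entries with name, color, and sample size."""
    return [
        {"name": name, "color": palette[i % len(palette)], "n": len(ids)}
        for i, (name, ids) in enumerate(groups.items())
    ]

patients = [
    {"id": "p1", "regimens": {"Gemcitabine + Paclitaxel"}},
    {"id": "p2", "regimens": {"FOLFIRINOX"}},
    {"id": "p3", "regimens": {"FOLFIRINOX", "Gemcitabine + Paclitaxel"}},
]
groups = group_by_regimen(patients, ["Gemcitabine + Paclitaxel", "FOLFIRINOX"])
entries = legend(groups)
```

Each group's patient list could then be passed to whatever survival estimator the system uses, producing one curve per legend entry.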
[0975] The system may permit a greater level of analysis by
calculating and overlaying
statistical ranges with respect to the survival analysis. In particular, the
system may calculate
confidence intervals with regard to each dataset requested by the user and
display those
confidence intervals relative to the survival plots. In one instance, the
desired confidence
interval may be user-established. In another instance, the confidence interval
may be pre-
established by the system and may be, for example, a 68% (one standard
deviation) interval, a
95% (two standard deviations) interval, or a 99.7% (three standard deviations)
interval.
Confidence intervals may be calculated as Kaplan Meier confidence intervals or
using
another type of statistical analysis, as would be appreciated by one of
ordinary skill in the
relevant art.
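For readers unfamiliar with Kaplan Meier estimation, a minimal sketch is given below. It uses the standard product-limit estimate with Greenwood's variance formula and a simple normal-approximation interval; the disclosure does not state which interval construction is used, so this choice, the `kaplan_meier` function name, and the sample data are assumptions. The parameter `z` sets the interval width (for example, about 1, 2, or 3 for the 68%, 95%, and 99.7% intervals mentioned above).

```python
import math

def kaplan_meier(durations, observed, z=1.96):
    """Return [(time, survival, lower, upper)] at each observed event time.

    Product-limit estimate with Greenwood's variance and a linear
    normal-approximation interval clipped to [0, 1].
    """
    at_risk = len(durations)
    survival, var_sum, curve = 1.0, 0.0, []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, o in zip(durations, observed) if d == t and o)
        leaving = sum(1 for d in durations if d == t)  # events plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            if at_risk > deaths:  # Greenwood term is undefined when S drops to 0
                var_sum += deaths / (at_risk * (at_risk - deaths))
            half = z * survival * math.sqrt(var_sum)
            curve.append((t, survival,
                          max(0.0, survival - half),
                          min(1.0, survival + half)))
        at_risk -= leaving
    return curve

# Five patients: durations in months, False marks a censored observation.
curve = kaplan_meier([5, 8, 8, 12, 15], [True, True, False, True, False])
```

The returned tuples could be drawn as a step curve with a shaded band between the lower and upper bounds.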
[0976] As will be appreciated from the previous discussion,
underpinning the utility of
the system is the ability to highlight features and interaction pathways of
high importance
driving these predictions, and the ability to further pinpoint cohorts of
patients exhibiting
levels of response that significantly deviate from expected norms. In this
context, high
importance may be understood to be based upon feature importance to an outcome
of a
prediction. In particular, features that provide the greatest weight to the
prediction may be
designated as those of high importance. The present system and user interface
provide an
intuitive, efficient method for patient selection and cohort definition given
specific inclusion
and/or exclusion criteria. The system also provides a robust user interface to
facilitate
internal research and analysis, including research and analysis into the
impact of specific
clinical and/or molecular attributes, as well as drug dosages, combinations,
and/or other
treatment protocols on therapeutic outcomes and patient survival for
potentially large,
otherwise unwieldy patient sample sizes.
[0977] The modeling and visualization framework set forth herein
may enable users to
interactively explore auto-detected patterns in the clinical and genomic data
of their filtered
patient cohort, and to analyze the relationship of those patterns to
therapeutic response and/or
survival likelihood. That analysis may lead a user to more informed treatment
decisions for
patients, earlier in the cycle than may be the case without the present system
and user
interface. The analysis also may be useful in the context of clinical trials,
providing robust,
data-backed clinical trial inclusion and/or exclusion analysis. Backed by an
extensive library
of clinical and molecular data, the present system unifies and applies various
algorithms and
concepts relating to clinical analysis and machine learning to generate a
fully integrated,
interactive user interface.
[0978] User interfaces, such as those described herein, may also
be provided for an
interactive mobile device. Examples of mobile user interfaces are disclosed in
US Patent
Application 16/289,027, filed February 28, 2019, which is incorporated by
reference for
all purposes.
[0979] The present disclosure describes an application interface
that physicians can
reference easily through their mobile or tablet device. Through the
application interface,
reports a physician sees may be supplemented with aggregated data (such as de-
identified
data from other patient reports) to provide critical decision informing
statistics or metrics
right to their fingertips. While a mobile or tablet device is referenced
herein throughout for
the sake of simplicity and consistency, it will be appreciated that the device
running the
application interface may include any device, such as a personal computer or
other hardware
connected through a server hosting the application, or devices such as mobile
cameras that
permit the capturing of digital images for transfer to another hardware system
connected
through a server hosting the application.
[0980] An exemplary device may be any device capable of receiving
user input and
capturing data a physician may desire to compare against an exemplary cohort
to generate
treatment recommendations. An exemplary cohort may be a patient cohort, such
as a group
of patients with similarities; those similarities may include diagnoses,
responses to treatment
regimens, genetic profiles, and/or other medical, geographic, demographic,
clinical,
molecular, or genetic features.
[0981] Generating a report supplement may be performed by a
physician by opening or
starting up the application, following prompts provided by the application to
capture or
upload a report or EMR, and validating any fields from the report that the
application
automatically populated for accuracy. Once captured, the patient's data may be
uploaded to
a server and analyzed in real time; furthermore, cohort statistics relating to
the patient's profile
may be delivered to the application for the physician's reference and review.
[0982] In one embodiment, a home screen of a mobile application
is displayed on the
mobile device. The home screen may provide access to patient records through a
patient
interface that may provide, for one or more patients, patient identification
information, such
as the patient's name, diagnosis, and record identifiers, enabling a user
(physician or staff) to
identify a patient and to confirm that the record selected by the user and/or
presented on the
mobile device relates to the patient that the user wishes to analyze. In the
event that a desired
patient is not displayed on the patient interface, the home screen may also
include a search
indicator that, upon selection by the user, receives text input such as a
patient's name, unique
identifier, or diagnosis, that permits the user to filter the patients by the
search criteria of the
text input to search for a specific patient. The mobile device may include a
touch screen,
through which a user may select a desired patient by touching the area on the
application
interface that includes the desired patient identification information. A
cursor (not shown)
may appear on the screen where the user touches to emphasize touch or gestures
received.
[0983] Alternative home screens may be implemented that provide a
user with options to
perform other functions, as well as to access a patient identification
information screen, and it
will be appreciated that exemplary embodiments referenced herein are not
intended to limit
the interface of the application in function or design.
[0984] A user may add a new patient by selecting a corresponding
"add patient" icon on
the application or through gesture recognition. Exemplary gestures may include
swiping
across the screen of the mobile device to the left or right, using several
fingers to scroll or
swipe, tapping or holding down on a portion of the patient interface not
occupied by patient
identification, or any other designated gesture. Alternatively, if no patient
data is present in
the patient interface, the interface may default to adding a user once active.
Adding a patient
may either be performed manually, by entering patient information into the
application, or
automatically by uploading patient data into the application. Furthermore,
automatic
uploading may be implemented by capturing an image of patient data at the
mobile device,
such as from a report.
[0985] Once a user has selected a patient, completed adding a new
patient, or is adding a
new patient from a report, an electronic document capture screen may appear.
The system
may be configured to capture images of documents that are saved in a plurality
of different
formats. Exemplary electronic document captures may include a structured data
form (such
as JSON, XML, HTML, etc.), an image (such as JPEG, PNG, etc.), a PDF of a
document,
report, or file, or a typeface or handwritten copy of a document, report, or
file.
[0986] In order to electronically capture a physical copy of a
document, the user may
place the document on a surface, such as a surface that provides a contrasting
color or texture
to the document, and aim the mobile device's camera at the document so that an
image of the
document appears in the document capture screen. The user then may select a
document
capture icon to begin a document capture process. In an alternative
embodiment, an
automatic capture may be generated once capture criteria are met. Exemplary
capture criteria
may include that the document bounds are identifiable and/or that the document
is in focus.
[0987] In one embodiment, the document may be sent to an optical
character recognition
(OCR) process and a document classifier may process the OCR output of the
electronic
document capture to recognize document identifiers which are linked to
features of the
document stored in a predefined model for each document. Predefined models may
also be
referred to as predetermined models. Document identifiers may include Form
numbers (such
as Form CA217b, Patient Report Rev .17, AB12937, etc.) indicating a specific
version of a
document which provides key health information in each of the respective
document's
features. Features of a document may include headers, columns, tables, graphs,
and other
standard forms which appear in the document.
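One simple way a classifier could link OCR output to a predefined model is to search the recognized text for known document identifiers. The sketch below assumes this lookup approach; the model dictionary, its feature lists, and the `classify_document` helper are hypothetical, and a production classifier may rely on other signals as well.

```python
import re

# Hypothetical predefined models keyed by document identifier.
PREDEFINED_MODELS = {
    "CA217b": {"features": ["header", "variant table"]},
    "AB12937": {"features": ["header", "therapy table", "graph"]},
}

def classify_document(ocr_text, models=PREDEFINED_MODELS):
    """Return (identifier, model) for the first known identifier in the text."""
    for identifier, model in models.items():
        pattern = r"\b" + re.escape(identifier) + r"\b"
        if re.search(pattern, ocr_text, re.IGNORECASE):
            return identifier, model
    return None, None  # no predefined model matches this document

doc_id, model = classify_document("Form CA217b  Patient Name: Jane Doe")
unknown = classify_document("Unlabelled lab printout")
```

Once a model is selected, its feature list indicates where in the document each piece of key health information is expected to appear.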
[0988] As a result of the OCR output, the application also may
identify medical data
present in the document. Medical data, or key health information, may include
numerous
fields including, but not limited to, patient demographics (such as patient
name, date of birth,
gender, ethnicity, date of death, address, smoking status, diagnosis dates,
personal medical
history, or family medical history), clinical diagnoses (such as date of
initial diagnosis, date
of metastatic diagnosis, cancer staging, tumor characterization, or tissue of
origin), treatments
and outcomes (such as therapy groups, medications, surgeries, radiotherapy,
imaging, adverse
effects, associated outcomes, or corresponding dates), and genetic testing and
laboratory
information (such as genetic testing, performance scores, lab tests, pathology
results,
prognostic indicators, or corresponding dates).
[0989] Each of the fields, for example the address, cancer
staging, medications, or genetic
testing may also have a plurality of subfields. The address field may have
subfields for type
of use (personal or business), street, city, state, zip, country, and a start
or end date (date that
residency at the address begins or expires). Genetic testing may have
subfields for the date of
genetic testing, testing provider used, test method, such as genetic
sequencing method or gene
panel, gene results, such as included genes, variants, expression
levels/statuses, tumor
mutational burden, and microsatellite instability. One type of genetic testing
may be next-
generation sequencing (NGS). The above-provided examples, enumerations, and
lists are not
intended to limit the scope of the available fields and are intended to be
representative of the
nature and structure that fields may take.
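The field and subfield structure described above could be represented with nested record types. The following sketch is one possible layout; the class names, attribute names, and sample values are assumptions chosen to mirror the subfields listed in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Address:
    use: str                          # "personal" or "business"
    street: str
    city: str
    state: str
    zip_code: str
    country: str
    start_date: Optional[str] = None  # date residency begins
    end_date: Optional[str] = None    # date residency expires

@dataclass
class GeneticTest:
    test_date: str
    provider: str
    method: str                       # e.g. a sequencing method or gene panel
    genes: List[str] = field(default_factory=list)
    variants: List[str] = field(default_factory=list)
    tumor_mutational_burden: Optional[float] = None
    microsatellite_instability: Optional[str] = None

ngs_test = GeneticTest("2020-02-18", "Example Lab", "NGS", genes=["BRCA2"])
```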
[0990] Thereafter one or more pages may be displayed within the
application tabulating
the captured information and/or presenting the captured information within the
context of the
analytics and/or reporting as described herein with respect to one or more
embodiments.
[0991] Stand-alone Device Integration
[0992] Hardware devices incorporating one or more embodiments as
described herein
may be implemented. In one example, a hardware device may record progress
notes or other
documents, automatically converting recorded audio into features and storing
them in a
structured format with respect to a patient. In another example, a hardware
device may
broadcast a response containing one or more analytical results, patient
features, or reports as
described in any of the embodiments above.
[0993] It has been recognized that a relatively small and
portable voice activated and
audio responding interface device (hereinafter "collaboration device") can be
provided
enabling oncologists to conduct at least initial database access and
manipulation activities. In
at least some embodiments, a collaboration device includes a processor linked
to each of a
microphone, a speaker and a wireless transceiver (e.g., transmitter and
receiver). The
processor runs software for capturing voice signals generated by an
oncologist. An
automated speech recognition (ASR) system converts the voice signals to a text
file which is
then processed by a natural language processor (NLP) or other artificial
intelligence module
(e.g., a natural language understanding module) to generate a data operation
(e.g., commands
to perform some data access or manipulation process such as a query, a filter,
a
memorialization, a clearing of prior queries and filter results, a note, etc.).
[0994] In at least some embodiments the collaboration device is
used within a
collaboration system that includes a server that maintains and manipulates an
industry
specific data repository. The data operation is received by the collaboration
server and used
to access and/or manipulate the database data, thereby generating a data
response. In at
least some cases, the data response is returned to the collaboration device as
an audio file
which is broadcast to the oncologist as a result associated with the original
query.
[0995] In some cases, the voice signal to text file transcription
is performed by the
collaboration device processor while in other cases the voice signal is
transmitted from the
collaboration device to the collaboration server and the collaboration server
does the
transcription to a text file. In some cases, the text file is converted to a
data operation by the
collaboration device processor and in other cases that conversion is performed
by the
collaboration server. In some cases, the collaboration server maintains or has
access to the
industry specific database so that the server operates as an intermediary
between the
collaboration device and the industry specific database.
[0996] In at least some embodiments the collaboration device is a
dedicated collaboration
device that is provided solely as an interface to the collaboration server and
industry specific
database. In these cases, the collaboration interface device may be on all the
time and may
only run a single dedicated application program so that the device does not
require any boot
up time and can be activated essentially immediately via a single activation
activity
performed by an oncologist.
[0997] For instance, in some cases the collaboration device may
have motion sensors
(e.g., an accelerometer, a gyroscope, etc.) linked to the processor so that
the simple act of
picking up the device causes the processor to activate an application. In
other cases, the
collaboration device processor may be programmed to "listen" for the phrase
"Hey query"
and once received, activate to capture a next voice signal utterance that
operates as seed data
for generating the text file. In other cases, the processor may be programmed
to listen for a
different activation phrase, such as a brand name of the system or a
combination of a brand
name plus a command indication. For instance, if the brand name of the system
is "One"
then the activation phrase may be "One" or "Go One" or the like. In still
other cases the
collaboration device may simply listen for voice signal utterances that it can
recognize as
oncological queries and may then automatically use any recognized query as
seed data for
text generation.
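The activation-phrase behavior can be illustrated on already-transcribed text. In the sketch below, the phrase list reuses the examples from the preceding paragraph, while the `extract_seed` helper and its prefix-matching rule are assumptions; an actual device would more likely detect the phrase in the audio signal itself.

```python
ACTIVATION_PHRASES = ("hey query", "go one", "one")

def extract_seed(transcript, phrases=ACTIVATION_PHRASES):
    """Return the utterance following an activation phrase, or None if absent."""
    text = transcript.strip()
    lowered = text.lower()
    # Try longer phrases first so that "go one" wins over "one".
    for phrase in sorted(phrases, key=len, reverse=True):
        if lowered == phrase or lowered.startswith((phrase + " ", phrase + ",")):
            return text[len(phrase):].lstrip(" ,")
    return None  # not addressed to the device

seed = extract_seed("Hey query, how many patients had FOLFIRINOX?")
ignored = extract_seed("Please close the door")
```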
[0998] In addition to providing audio responses to data
operations, in at least some cases
the system automatically records and stores data operations (e.g., data
defining the
operations) and responses as a collaboration record for subsequent access. The
collaboration
record may include one or the other or both of the original voice signal and
broadcast
response or the text file and a text response corresponding to the data
response. Here, the
stored collaboration record provides details regarding the oncologist's search
and data
operation activities that help automatically memorialize the hypothesis or
idea the oncologist
was considering. In a case where an oncologist asks a series of queries, those
queries and
data responses may be stored as a single line of questioning so that they
together provide
more detail for characterizing the oncologist's initial hypothesis or idea. At
a subsequent
time, the system may enable the oncologist to access the memorialized queries
and data
responses so that she can re-enter a flow state associated therewith and
continue hypothesis
testing and data manipulation using a workstation type interface or other
computer device
that includes a display screen and perhaps audio devices like speakers, a
microphone, etc.,
more suitable for presenting more complex data sets and data representations.
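A collaboration record of the kind described could be as simple as an ordered list of query and response pairs. The `CollaborationRecord` class below, its attributes, and the sample exchange are hypothetical and shown only to make the idea of a stored "line of questioning" concrete.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CollaborationRecord:
    oncologist_id: str
    exchanges: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, query_text, response_text):
        """Append one query and its data response to the line of questioning."""
        self.exchanges.append((query_text, response_text))

    def transcript(self):
        """Render the stored exchanges for later review at a workstation."""
        return "\n".join(f"Q: {q}\nA: {a}" for q, a in self.exchanges)

record = CollaborationRecord("onc-001")
record.add("How many pancreatic cancer patients are in the cohort?", "112")
record.add("How many of those received FOLFIRINOX?", "37")
```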
[0999] In addition to simple data search queries, other voice
signal data operation types
are contemplated. For instance, the system may support filter operations where
an oncologist
voice signal message defines a sub-set of the industry specific database set.
For example, the
oncologist may voice the message "Access all medical records for male patients
over 45
years of age that have had pancreatic cancer since 1990," causing the system
to generate an
associated subset of data that meet the specified criteria.
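Once the spoken filter has been converted into structured criteria, applying it to a set of records is straightforward. The sketch below assumes a flat record layout and a hypothetical `filter_records` helper; how the system parses the voice message into these parameters is not shown.

```python
def filter_records(records, sex, min_age, diagnosis, since_year):
    """Return the records that meet every criterion of the filter operation."""
    return [
        r for r in records
        if r["sex"] == sex
        and r["age"] > min_age
        and r["diagnosis"] == diagnosis
        and r["diagnosis_year"] >= since_year
    ]

records = [
    {"id": 1, "sex": "male", "age": 62, "diagnosis": "pancreatic cancer", "diagnosis_year": 2004},
    {"id": 2, "sex": "male", "age": 41, "diagnosis": "pancreatic cancer", "diagnosis_year": 2010},
    {"id": 3, "sex": "female", "age": 58, "diagnosis": "pancreatic cancer", "diagnosis_year": 1999},
    {"id": 4, "sex": "male", "age": 70, "diagnosis": "pancreatic cancer", "diagnosis_year": 1985},
]
subset = filter_records(records, "male", 45, "pancreatic cancer", 1990)
```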
[1000] Importantly, some data responses to oncological queries
will be "audio suitable"
meaning that the response can be well understood and comprehended when
broadcast as an
audio message. In other cases, a data response simply may not be well suited
to be presented
as an audio output. For instance, where a query includes the phrase "Who is
the patient that I
saw during my last office visit last Thursday?", an audio suitable response
may be "Mary
Brown." On the other hand, if a query is "List all the medications that have
been prescribed
for males over 45 years of age that have had pancreatic cancer since 1978" and
the response
includes a list of 225 medications, the list would not be audio suitable as it
would take a long
time to broadcast each list entry and comprehension of all list entries would
be dubious at
best.
[1001] In cases where a data response is optimally visually
presented, the system may
take alternate or additional steps to provide the response in an intelligible
format to the user.
The system may simply indicate as part of an audio response that response data
would be
more suitably presented in visual format and then present the audio response.
If there is a
proximate large display screen, such as a computer monitor or a television
(TV) such as a
smart TV, the system may pair with that display and present visual data with
or without audio
data. The system may simply indicate that no suitable audio response is
available. In some
embodiments, the system may pair with a computational device that includes a
display, such
as a smartphone, tablet computer, etc.
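The choice between audio and visual presentation can be sketched as a small decision function. The item-count threshold used here as the test of audio suitability, the return labels, and the `choose_presentation` name are assumptions; the disclosure does not specify how suitability is determined.

```python
def choose_presentation(response_items, display_paired, max_spoken_items=5):
    """Pick a presentation mode for a data response.

    A response is treated as audio suitable when it is short enough to be
    spoken and understood; the threshold is an illustrative assumption.
    """
    if len(response_items) <= max_spoken_items:
        return "audio"
    if display_paired:
        return "visual"          # present on the paired display
    return "audio_notice"        # say that a visual format is needed

short_answer = choose_presentation(["Mary Brown"], display_paired=False)
long_list = choose_presentation(["drug"] * 225, display_paired=True)
no_display = choose_presentation(["drug"] * 225, display_paired=False)
```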
[1002] Thus, at least some inventive embodiments enable intuitive
and rapid access to
complex data sets essentially anywhere within a wireless communication zone so
that an
oncologist can initiate thought processes in real time when they occur. By
answering
questions when they occur, the system enables oncologists to dig deeper in the
moment into
data and continue the thought process through a progression of queries. Some
embodiments
memorialize an oncologist's queries and responses so that at subsequent times
the oncologist
can re-access that information and continue queries related thereto. In cases
where visual and
audio responses are available, the system may adapt to provide visual
responses when visual
capabilities are present or may simply store the visual responses as part of a
collaboration
record for subsequent access when an oncologist has access to a workstation or
the like.
[1003] In at least some embodiments the disclosure includes a
method for interacting
with a database to access data therein, the method for use with a
collaboration device
including a speaker, a microphone and a processor, the method comprising the
steps of
associating separate sets of state-specific intents and supporting information
with different
clinical report types, the supporting information including at least one
intent-specific data
operation for each state-specific intent, receiving a voice query via the
microphone seeking
information, identifying a specific patient associated with the query,
identifying a state-
specific clinical report associated with the identified patient, attempting to
select one of the
state-specific intents associated with the identified state-specific clinical
report as a match for
the query, upon selection of one of the state-specific intents, performing the
at least one data
operation associated with the selected state-specific intent to generate a
result, using the result
to form a query response and broadcasting the query response via the speaker.
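The method steps above (associate intents with report types, match a query to an intent, run the intent's data operation, form a response) might be sketched as follows. Keyword matching stands in for the natural language processing step, the gene of interest is passed in directly rather than parsed from the query, and the intent table, report layout, and `answer_query` helper are all hypothetical.

```python
# Hypothetical intents per report type; each intent carries its data operation.
INTENTS_BY_REPORT_TYPE = {
    "molecular": {
        "mutation_presence": {
            "keywords": ("have", "mutation"),
            "operation": lambda report, gene: gene in report["variants"],
        },
        "tmb": {
            "keywords": ("tumor", "mutational", "burden"),
            "operation": lambda report, gene: report["tmb"],
        },
    },
}

def answer_query(query, report, gene=None):
    """Select an intent for the report's type and run its data operation."""
    words = query.lower().replace("?", "").split()
    intents = INTENTS_BY_REPORT_TYPE.get(report["type"], {})
    for name, intent in intents.items():
        if all(keyword in words for keyword in intent["keywords"]):
            result = intent["operation"](report, gene)
            return f"{name}: {result}"   # text to be converted to speech
    return "No matching intent for this report."

report = {"type": "molecular", "variants": ["KMT2C"], "tmb": 4.2}
response = answer_query("Does Dwayne Holder have a KMT2C mutation?", report, gene="KMT2C")
fallback = answer_query("What is the weather today?", report)
```

Restricting the candidate intents to those registered for the identified report type keeps the matching step small and specific to the report at hand.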
[1004] In some cases the method is for use with at least a first
database that includes
information in addition to the clinical reports, the method further including, in
response to the
query, obtaining at least a subset of the information in addition to the
clinical reports, the step
of using the result to form a query response including using the result and
the additional
obtained information to form the query response.
[1005] In some cases, the at least one data operation includes at
least one data operation
for accessing additional information from the database, the step of obtaining
at least a subset
includes obtaining data per the at least one data operation for accessing
additional
information from the database.
[1006] Some embodiments include a method for interacting with a
database to access
data therein, the method for use with a collaboration device including a
speaker, a
microphone and a processor, the method comprising the steps of associating
separate sets of
state-specific intents and supporting information with different clinical
report types, the
supporting information including at least one intent-specific primary data
operation for each
state-specific intent, receiving a voice query via the microphone seeking
information,
identifying a specific patient associated with the query, identifying a state-
specific clinical
report associated with the identified patient, attempting to select one of the
state-specific
intents associated with the identified state-specific clinical report as a
match for the query,
upon selection of one of the state-specific intents, performing the primary
data operation
associated with the selected state-specific intent to generate a result,
performing a
supplemental data operation on data from a database that includes data in
addition to the
clinical report data to generate additional information, using the result and
the additional
information to form a query response and broadcasting the query response via
the speaker.
[1007] Some embodiments include a method of audibly broadcasting
responses to a user
based on user queries about a specific patient molecular report, the method
comprising
receiving an audible query from the user to a microphone coupled to a
collaboration device,
identifying at least one intent associated with the audible query, identifying
at least one data
operation associated with the at least one intent, associating each of the at
least one data
operations with a first set of data presented on the molecular report,
executing each of the at
least one data operations on a second set of data to generate response data,
generating an
audible response file associated with the response data and providing the
audible response
file for broadcasting via a speaker coupled to the collaboration device.
[1008] In at least some cases the audible query includes a
question about a nucleotide
profile associated with the patient. In at least some cases the nucleotide
profile associated
with the patient is a profile of the patient's cancer. In at least some cases
the nucleotide
profile associated with the patient is a profile of the patient's germline. In
at least some cases
the nucleotide profile is a DNA profile. In at least some cases the nucleotide
profile is an
RNA expression profile. In at least some cases the nucleotide profile is a
mutation
biomarker.
[1009] In at least some cases the mutation biomarker is a BRCA
biomarker. In at least
some cases the audible query includes a question about a therapy. In at least
some cases the
audible query includes a question about a gene. In at least some cases the
audible query
includes a question about clinical data. In at least some cases the audible
query includes a
question about a next-generation sequencing panel. In at least some cases the
audible query
includes a question about a biomarker.
[1010] In at least some cases the audible query includes a
question about an immune
biomarker. In at least some cases the audible query includes a question about
an antibody-
based test. In at least some cases the audible query includes a question about
a clinical trial.
In at least some cases the audible query includes a question about an organoid
assay. In at
least some cases the audible query includes a question about a pathology
image. In at least
some cases the audible query includes a question about a disease type. In at
least some cases
the at least one intent is an intent related to a biomarker. In at least some
cases the biomarker
is a BRCA biomarker. In at least some cases the at least one intent is an
intent related to a
clinical condition. In at least some cases the at least one intent is an
intent related to a clinical
trial.
[1011] In at least some cases the at least one intent is related
to a drug. In at least some
cases the drug is chemotherapy. In at least some
cases the drug
intent is an intent related to a PARP inhibitor. In at least some cases
the at least one
intent is related to a gene. In at least some cases the at least one intent is
related to
immunology. In at least some cases the at least one intent is related to a
knowledge database.
In at least some cases the at least one intent is related to testing methods.
In at least some
cases the at least one intent is related to a gene panel. In at least some
cases the at least one
intent is related to a report. In at least some cases the at least one intent
is related to an
organoid process. In at least some cases the at least one intent is related to
imaging.
[1012] In at least some cases the at least one intent is related
to a pathogen. In at least
some cases the at least one intent is related to a vaccine. In at least some
cases the at least
one data operation includes an operation to identify at least one treatment
option. In at least
some cases the at least one data operation includes an operation to identify
knowledge about
a therapy. In at least some cases the at least one data operation includes an
operation to
identify knowledge related to at least one drug (e.g., "What drugs are
associated with high
CD40 expression?"). In at least some cases the at least one data operation
includes an
operation to identify knowledge related to mutation testing (e.g., "Was Dwayne
Holder's
sample tested for a KMT2D mutation?"). In at least some cases the at least one
data
operation includes an operation to identify knowledge related to mutation
presence (e.g.,
"Does Dwayne Holder have a KMT2C mutation?"). In at least some cases the at
least one
data operation includes an operation to identify knowledge related to tumor
characterization
(e.g., "Could Dwayne Holder's tumor be a BRCA2 driven tumor?"). In at least
some cases
the at least one data operation includes an operation to identify knowledge
related to testing
requirements (e.g., "What tumor percentage does Tempus require for TMB
results?"). In at
least some cases the at least one data operation includes an operation to
query for definition
information (e.g., "What is PDL1 expression?"). In at least some cases the at
least one data
operation includes an operation to query for expert information (e.g., "What
is the clinical
relevance of PDL1 expression?"; "What are the common risks associated with the
Whipple
procedure?"). In at least some cases the at least one data operation includes
an operation to
identify information related to recommended therapy (e.g., "Dwayne Holder is
in the 88th
percentile of PDL1 expression, is he a candidate for immunotherapy?"). In at
least some
cases the at least one data operation includes an operation to query for
information relating to
a patient (e.g., Dwayne Holder). In at least some cases the at least one data
operation
includes an operation to query for information relating to patients with one
or more clinical
characteristics similar to the patient (e.g., "What are the most common
adverse events for
patients similar to Dwayne Holder?").
[1013] In at least some cases the at least one data operation
includes an operation to
query for information relating to patient cohorts (e.g., "What are the most
common adverse
events for pancreatic cancer patients?"). In at least some cases the at least
one data operation
includes an operation to query for information relating to clinical trials
(e.g., "Which clinical
trials is Dwayne the best match for?").
[1014] In at least some cases the at least one data operation
includes an operation to
query about a characteristic relating to a genomic mutation. In at least some
cases the
characteristic is loss of heterozygosity. In at least some cases the
characteristic reflects the
source of the mutation. In at least some cases the source is germline. In at
least some cases
the source is somatic. In at least some cases the characteristic includes
whether the mutation
is a tumor driver. In at least some cases the first set of data comprises a
patient name.
[1015] In at least some cases the first set of data comprises a
patient age. In at least some
cases the first set of data comprises a next-generation sequencing panel. In
at least some
cases the first set of data comprises a genomic variant. In at least some
cases the first set of
data comprises a somatic genomic variant. In at least some cases the first set
of data
comprises a germline genomic variant. In at least some cases the first set of
data comprises a
clinically actionable genomic variant. In at least some cases the first set of
data comprises a
loss of function variant. In at least some cases the first set of data
comprises a gain of
function variant.
[1016] In at least some cases the first set of data comprises an
immunology marker. In at
least some cases the first set of data comprises a tumor mutational burden. In
at least some
cases the first set of data comprises a microsatellite instability status. In
at least some cases
the first set of data comprises a diagnosis. In at least some cases the first
set of data
comprises a therapy. In at least some cases the first set of data comprises a
therapy approved
by the U.S. Food and Drug Administration. In at least some cases the first set
of data
comprises a drug therapy. In at least some cases the first set of data
comprises a radiation
therapy. In at least some cases the first set of data comprises a
chemotherapy. In at least
some cases the first set of data comprises a cancer vaccine therapy. In at
least some cases the
first set of data comprises an oncolytic virus therapy.
[1017] In at least some cases the first set of data comprises an
immunotherapy. In at least
some cases the first set of data comprises a pembrolizumab therapy. In at
least some cases
the first set of data comprises a CAR-T therapy. In at least some cases the
first set of data
comprises a proton therapy. In at least some cases the first set of data
comprises an
ultrasound therapy. In at least some cases the first set of data comprises a
surgery. In at least
some cases the first set of data comprises a hormone therapy. In at least some
cases the first
set of data comprises an off-label therapy. In at least some cases, the first
set of data
comprises a gene editing therapy. In at least some cases, the gene editing
therapy is clustered
regularly interspaced short palindromic repeats (CRISPR) therapy.
[1018] In at least some cases the first set of data comprises an
on-label therapy. In at
least some cases the first set of data comprises a bone marrow transplant
event. In at least
some cases the first set of data comprises a cryoablation event. In at least
some cases the first
set of data comprises a radiofrequency ablation. In at least some cases the
first set of data
comprises a monoclonal antibody therapy. In at least some cases the first set
of data
comprises an angiogenesis inhibitor. In at least some cases the first set of
data comprises a
PARP inhibitor.
[1019] In at least some cases the first set of data comprises a
targeted therapy. In at least
some cases the first set of data comprises an indication of use. In at least
some cases the first
set of data comprises a clinical trial. In at least some cases the first set
of data comprises a
distance to a location conducting a clinical trial. In at least some cases the
first set of data
comprises a variant of unknown significance. In at least some cases the first
set of data
comprises a mutation effect.
[1020] In at least some cases the first set of data comprises a
variant allele fraction. In at
least some cases the first set of data comprises a low coverage region. In at
least some cases
the first set of data comprises a clinical history. In at least some cases the
first set of data
comprises a biopsy result. In at least some cases the first set of data
comprises an imaging
result. In at least some cases the first set of data comprises an MRI result.
[1021] In at least some cases the first set of data comprises a CT result. In
at least some cases the
first set of data comprises a therapy prescription. In at least some cases the
first set of data
comprises a therapy administration. In at least some cases the first set of
data comprises a
cancer subtype diagnosis. In at least some cases the first set of data
comprises a cancer
subtype diagnosis by RNA class. In at least some cases the first set of data
comprises a result
of a therapy applied to an organoid grown from the patient's cells. In at
least some cases the
first set of data comprises a tumor quality measure. In at least some cases
the first set of data
comprises a tumor quality measure selected from at least one of the set of PD-L1, MMR,
tumor infiltrating lymphocyte count, and tumor ploidy. In at least some cases
the first set of
data comprises a tumor quality measure derived from an image analysis of a
pathology slide
of the patient's tumor. In at least some cases the first set of data comprises
a signaling
pathway associated with a tumor of the patient.
[1022] In at least some cases the signaling pathway is a HER
pathway. In at least some
cases the signaling pathway is a MAPK pathway. In at least some cases the
signaling
pathway is a MDM2-TP53 pathway. In at least some cases the signaling pathway
is a PI3K
pathway. In at least some cases the signaling pathway is a mTOR pathway.
[1023] In at least some cases the at least one data operation
includes an operation to
query for a treatment option, the first set of data comprises a genomic
variant, and the
associating step comprises adjusting the operation to query for the treatment
option based on
the genomic variant. In at least some cases the at least one data operation
includes an
operation to query for a clinical history data element, the first set of data
comprises a therapy, and the
associating step comprises adjusting the operation to query for the clinical
history data
element based on the therapy. In at least some cases the clinical history data
is medication
prescriptions, the therapy is pembrolizumab, and the associating step
comprises adjusting the
operation to query for the prescription of pembrolizumab.
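
The associating step described above can be illustrated with a minimal Python sketch. It assumes a data operation can be modeled as a query template that is narrowed by an item from the first set of data. The names `DataOperation`, `associate`, `target`, and `filters` are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DataOperation:
    """A query template; all field names here are illustrative assumptions."""
    target: str                                   # what the operation queries for
    filters: dict = field(default_factory=dict)   # constraints added by association


def associate(operation: DataOperation, key: str, value: str) -> DataOperation:
    """Return a copy of the operation narrowed by one item from the first set of data."""
    return replace(operation, filters={**operation.filters, key: value})


# Generic operation: query clinical history for medication prescriptions.
generic = DataOperation(target="medication_prescriptions")
# The report (first set of data) lists pembrolizumab as a therapy.
adjusted = associate(generic, "therapy", "pembrolizumab")
```

In this sketch the generic operation is left unchanged and the adjusted copy carries the therapy constraint, which mirrors the pembrolizumab prescription example in the paragraph above.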
[1024] In at least some cases the second set of data comprises
clinical health information.
In at least some cases the second set of data comprises genomic variant
information. In at
least some cases the second set of data comprises DNA sequencing information.
In at least
some cases the second set of data comprises RNA information. In at least some
cases the
second set of data comprises DNA sequencing information from short-read
sequencing. In at
least some cases the second set of data comprises DNA sequencing information
from long-
read sequencing. In at least some cases the second set of data comprises RNA
transcriptome
information. In at least some cases the second set of data comprises RNA full-
transcriptome
information. In at least some cases the second set of data is stored in a
single data repository.
In at least some cases the second set of data is stored in a plurality of data
repositories.
[1025] In at least some cases the second set of data comprises
clinical health information
and genomic variant information. In at least some cases the second set of data
comprises
immunology marker information. In at least some cases the second set of data
comprises
microsatellite instability immunology marker information. In at least some
cases the second
set of data comprises tumor mutational burden immunology marker information.
In at least
some cases the second set of data comprises clinical health information
comprising one or
more of demographic information, diagnostic information, assessment results,
laboratory
results, prescribed or administered therapies, and outcomes information.
[1026] In at least some cases the second set of data comprises
demographic information
comprising one or more of patient age, patient date of birth, gender, race,
ethnicity, institution
of care, comorbidities, and smoking history. In at least some cases the second
set of data
comprises diagnosis information comprising one or more of tissue of origin,
date of initial
diagnosis, histology, histology grade, metastatic diagnosis, date of
metastatic diagnosis, site
or sites of metastasis, and staging information. In at least some cases the
second set of data
comprises staging information comprising one or more of TNM, ISS, DSS, FAB,
RAI, and
Binet. In at least some cases the second set of data comprises assessment
information
comprising one or more of performance status (including ECOG or Karnofsky
status),
performance status score, and date of performance status.
[1027] In at least some cases the second set of data comprises
laboratory information
comprising one or more of type of lab (e.g., CBC, CMP, PSA, CEA), lab results,
lab units,
date of lab service, date of molecular pathology test, assay type, assay
result (e.g. positive,
negative, equivocal, mutated, wild type), molecular pathology method (e.g.,
IHC, FISH,
NGS), and molecular pathology provider. In at least some cases the second set
of data
comprises treatment information comprising one or more of drug name, drug
start date, drug
end date, drug dosage, drug units, drug number of cycles, surgical procedure
type, date of
surgical procedure, radiation site, radiation modality, radiation start date,
radiation end date,
radiation total dose delivered, and radiation total fractions delivered.
[1028] In at least some cases the second set of data comprises
outcomes information
comprising one or more of Response to Therapy (e.g., CR, PR, SD, PD), RECIST
score, Date
of Outcome, date of observation, date of progression, date of recurrence,
adverse event to
therapy, adverse event date of presentation, adverse event grade, date of
death, date of last
follow-up, and disease status at last follow up. In at least some cases the
second set of data
comprises information that has been de-identified in accordance with a de-
identification
method permitted by HIPAA.
[1029] In at least some cases the second set of data comprises
information that has been
de-identified in accordance with a safe harbor de-identification method
permitted by HIPAA.
In at least some cases the second set of data comprises information that has
been de-identified
in accordance with a statistical de-identification method permitted by HIPAA.
In at least
some cases the second set of data comprises clinical health information of
patients diagnosed
with a cancer condition.
[1030] In at least some cases the second set of data comprises
clinical health information
of patients diagnosed with a cardiovascular condition. In at least some cases
the second set
of data comprises clinical health information of patients diagnosed with a
diabetes condition.
In at least some cases the second set of data comprises clinical health
information of patients
diagnosed with an autoimmune condition. In at least some cases the second set
of data
comprises clinical health information of patients diagnosed with a lupus
condition.
[1031] In at least some cases the second set of data comprises
clinical health information
of patients diagnosed with a psoriasis condition. In at least some cases the
second set of data
comprises clinical health information of patients diagnosed with a depression
condition. In at
least some cases the second set of data comprises clinical health information
of patients
diagnosed with a rare disease.
[1032] In at least some embodiments, a method of audibly
broadcasting responses to a
user based on user queries about a specific patient's molecular report is
provided by the
disclosure. The method can be used with a collaboration device that includes a
processor and
a microphone and a speaker linked to the processor. The method can include
storing
molecular reports for a plurality of patients in a system database, receiving
an audible query
from the user via the microphone, identifying at least one intent associated
with the audible
query, identifying at least one data operation associated with the at least
one intent, accessing
the specific patient's molecular report, executing at least one of the
identified at least one data
operations on a first set of data included in the specific patient's molecular
report to generate
a first set of response data, using the first set of response data to generate
an audible response
file, and broadcasting the audible response file via the speaker.
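
The sequence of steps in the method above (query, intent, data operation, report access, response) can be sketched in Python as follows. This is only an illustration under stated assumptions: speech recognition and speech synthesis are omitted, the query is assumed to arrive as a text transcript, intents are matched by a simple keyword, and the names `REPORTS`, `INTENTS`, `mutation_presence`, and `answer` are hypothetical.

```python
from typing import Callable

# Hypothetical in-memory stand-in for the system database of molecular reports.
REPORTS = {"Dwayne Holder": {"variants": ["KMT2C", "BRCA2"]}}


def mutation_presence(report: dict, gene: str) -> bool:
    """Data operation: check whether the report lists a variant in the gene."""
    return gene in report["variants"]


# Hypothetical intent table: keyword -> (intent name, data operation).
INTENTS: dict[str, tuple[str, Callable[[dict, str], bool]]] = {
    "mutation": ("mutation_presence", mutation_presence),
}


def answer(transcript: str, patient: str, gene: str) -> str:
    """Map a transcribed query to an intent, run its operation, build response text."""
    for keyword, (_intent, operation) in INTENTS.items():
        if keyword in transcript.lower():
            found = operation(REPORTS[patient], gene)
            verb = "has" if found else "does not have"
            return f"{patient} {verb} a {gene} mutation."
    return "Sorry, I did not understand the question."
```

The returned text stands in for the first set of response data; a real device would pass it to a text-to-speech component to produce the audible response file.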
[1033] In at least some cases the method can further include
identifying qualifying
parameters in the audible query, the step of identifying at least one data
operation including
identifying the at least one data operation based on both the identified
intent and the
qualifying parameters.
[1034] In at least some cases at least one of the qualifying
parameters includes a patient
identity.
[1035] In at least some cases at least one of the qualifying
parameters includes a patient's
disease state.
[1036] In at least some cases at least one of the qualifying
parameters includes a genetic
mutation.
[1037] In at least some cases at least one of the qualifying
parameters includes a
procedure type.
[1038] In at least some cases the method further includes
identifying qualifying
parameters in the specific patient's molecular report, the step of identifying
at least one data
operation including identifying the at least one data operation based on both
the identified
intent and the qualifying parameters.
[1039] In at least some cases the method further includes the
step of storing a general
knowledge database that includes non-patient specific data about specific
topics, wherein the
step of identifying at least one data operation associated with the at least
one intent includes
identifying at least first and second data operations associated with the at
least one intent, the
first data operation associated with the specific patient's molecular report
and the second data
operation associated with the general knowledge database.
[1040] In at least some cases the second data operation
associated with the general
knowledge database is executed first to generate second data operation
results, the second
data operation results are used to define the first data operation and the
first data operation
associated with the specific patient's molecular report is executed second to
generate the first
set of response data.
[1041] In at least some cases the first data operation associated
with the specific patient's
molecular report is executed first to generate first data operation results,
the first data
operation results are used to define the second data operation and the second
data operation
associated with the general knowledge database is executed second to generate
the first set of
response data.
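
The chaining of the two data operations can be sketched as follows for the ordering in which the general knowledge database is queried first. The data, the dictionary layout, and the function names `genes_for_therapy` and `matching_variants` are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical general knowledge database: therapy class -> associated genes (toy data).
KNOWLEDGE = {"PARP inhibitor": ["BRCA1", "BRCA2"]}
# Hypothetical molecular report for one patient.
REPORT = {"variants": ["KMT2C", "BRCA2"]}


def genes_for_therapy(therapy: str) -> list[str]:
    """Second data operation: look up genes in the general knowledge database."""
    return KNOWLEDGE.get(therapy, [])


def matching_variants(report: dict, genes: list[str]) -> list[str]:
    """First data operation: defined by the genes returned from the lookup above."""
    return [g for g in report["variants"] if g in genes]


genes = genes_for_therapy("PARP inhibitor")       # executed first
response_data = matching_variants(REPORT, genes)  # executed second
```

The opposite ordering would swap the roles: the report would be read first, and its result (for example, a variant) would parameterize the lookup in the general knowledge database.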
[1042] In at least some cases the step of identifying at least
one intent includes
determining that the audible query is associated with the specific patient,
accessing the
specific patient's molecular report, determining the specific patient's cancer
state from the
molecular report and then selecting an intent from a pool of cancer state
related intents.
[1043] In at least some cases the method further includes the
step of storing a general
knowledge database that includes non-patient specific data about specific
topics, the method
further including the steps of, upon determining that the audible query is not
associated with
any specific patient, selecting an intent that is associated with the general
knowledge
database.
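
One way to picture the intent selection in the two preceding paragraphs is the sketch below. It assumes the query is available as text, that a patient is recognized by name match, and that intents are grouped into pools keyed by cancer state; `INTENT_POOLS`, `REPORTS`, `find_patient`, and `candidate_intents` are hypothetical names, and the intent labels are illustrative only.

```python
from typing import Optional

# Hypothetical intent pools; the intent names are illustrative only.
INTENT_POOLS = {
    "pancreatic cancer": ["treatment_option", "clinical_trial", "adverse_events"],
    "general": ["definition", "expert_information"],
}
# Hypothetical molecular reports keyed by patient name.
REPORTS = {"Dwayne Holder": {"cancer_state": "pancreatic cancer"}}


def find_patient(transcript: str) -> Optional[str]:
    """Return the first known patient named in the transcript, if any."""
    return next((name for name in REPORTS if name.lower() in transcript.lower()), None)


def candidate_intents(transcript: str) -> list[str]:
    """Pick the intent pool from the patient's cancer state, else the general pool."""
    patient = find_patient(transcript)
    if patient is None:
        return INTENT_POOLS["general"]
    state = REPORTS[patient]["cancer_state"]
    return INTENT_POOLS.get(state, INTENT_POOLS["general"])
```

A query that names no known patient falls through to the pool tied to the general knowledge database, as in the second of the two paragraphs.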
[1044] In at least some cases the collaboration device includes a
portable wireless device
that includes a wireless transceiver.
[1045] In at least some cases the collaboration device is a
handheld device.
[1046] In at least some cases the collaboration device includes
at least one visual
indicator, the processor linked to the visual indicator and controllable to
change at least some
aspect of the appearance of the visual indicator to indicate different states
of the collaboration
device.
[1047] In at least some cases the processor is programmed to
monitor microphone input
to identify a "wake up" phrase, the processor monitoring for the audible query
after the wake
up phrase is detected.
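
The wake-phrase behavior can be sketched as a small state machine over a stream of transcribed utterances. The phrase itself, the assumption that audio has already been transcribed, and the choice to return to idle after a single query are all assumptions of this sketch; `WAKE_PHRASE` and `queries_after_wake` are hypothetical names.

```python
from typing import Iterable, Iterator

WAKE_PHRASE = "hey assistant"  # hypothetical; the disclosure does not name a phrase


def queries_after_wake(utterances: Iterable[str]) -> Iterator[str]:
    """Yield the utterance that follows each detected wake phrase."""
    awake = False
    for text in utterances:
        if awake:
            yield text
            awake = False          # assumption: return to idle after one query
        elif WAKE_PHRASE in text.lower():
            awake = True
```

Utterances heard while the device is idle are ignored, so only speech that follows the wake phrase is treated as an audible query.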
[1048] In at least some cases a series of audible queries is
received via the microphone,
and the at least one of the identified data operations includes identifying a
subset of data that
is usable with subsequent audio queries to identify intents associated with
the subsequent
queries.
[1049] In at least some cases the method can further include the
steps of, based on at least
one audible query received via the microphone and related data in a system
database,
identifying at least one activity that a collaboration device user may want to
perform and
initiating the at least one activity.
[1050] In at least some cases the step of initiating the at least
one activity includes
generating a second audible response file and broadcasting the second audible
response file to
the user seeking verification that the at least one activity should be
performed and monitoring
the microphone for an affirmative response and, upon receiving an affirmative
response,
initiating the at least one activity.
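
The verification step can be sketched as a function that speaks a prompt, listens for a reply, and runs the activity only on an affirmative answer. The callables `speak`, `listen`, and `activity` are hypothetical stand-ins for the speaker, the microphone, and the activity; the affirmative vocabulary is illustrative.

```python
from typing import Callable

AFFIRMATIVE = {"yes", "yeah", "sure", "please do"}  # illustrative vocabulary


def confirm_and_run(prompt: str,
                    speak: Callable[[str], None],
                    listen: Callable[[], str],
                    activity: Callable[[], None]) -> bool:
    """Ask for verification; run the activity only on an affirmative reply."""
    speak(prompt)
    reply = listen().strip().lower().rstrip(".!")
    if reply in AFFIRMATIVE:
        activity()
        return True
    return False
```

Passing the input and output channels as parameters keeps the sketch testable without audio hardware.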
[1051] In at least some cases the at least one activity
includes periodically capturing
health information from electronic health records included in the system
database.
[1052] In at least some cases the at least one activity includes
checking status of an
existing clinical or lab order.
[1053] In at least some cases the at least one activity includes
ordering a new clinical or
lab order.
[1054] In at least some cases the collaboration device is one of
a smartphone, a tablet
computer, a laptop computer, a desktop computer, or an Amazon Echo.
[1055] In at least some cases the step of initiating the at least
one activity includes
automatically initiating the at least one activity without any initiating
input from the user.
[1056] In at least some cases the method further includes storing
and maintaining a
general cancer knowledge database, persistently updating the specific
patient's molecular
report, automatically identifying at least one intent and associated data
operation related to
the general cancer knowledge database based on the specific patient's
molecular report data,
persistently executing the associated data operation on the general cancer
knowledge
database to generate a new set of response data not previously generated and,
upon
generating a new set of response data, using the new set of response data to
generate another
audible response file and broadcasting the another audible response file via
the speaker.
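
The requirement that only response data "not previously generated" triggers a new broadcast can be sketched with a set of already-announced results. This is a minimal illustration that assumes results can be compared as strings; `new_results` and `seen` are hypothetical names.

```python
def new_results(results: list[str], seen: set[str]) -> list[str]:
    """Return results not announced before, and record them as seen."""
    fresh = [r for r in results if r not in seen]
    seen.update(fresh)
    return fresh
```

Each time the data operation is re-executed against the updated knowledge database, only the items returned by `new_results` would be turned into another audible response file.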
[1057] In at least some cases the method is also used with an
electronic health records
system that maintains health records associated with a plurality of patients
including the
specific patient, the method further including identifying at least another
data operation
associated with the at least one intent and executing the another data
operation on the specific
patient's health record to generate additional response data.
[1058] In at least some cases the step of using the first set of
response data to generate an
audible response file includes using the response data and the additional
response data to
generate the audible response file.
[1059] In at least some embodiments, a method of audibly
broadcasting responses to a
user based on user queries about a specific patient's molecular report, the
method for use with
a collaboration device that includes a processor and a microphone and a
speaker linked to the
processor is provided by the disclosure. The method includes storing a
separate molecular
report for each of a plurality of patients in a system database, storing a
general cancer
knowledge database that includes non-patient specific data about cancer
topics, receiving an
audible query from the user via the microphone, identifying at least one
intent associated with
the audible query, identifying at least a first data operation associated with
the at least one
intent and the specific patient's molecular report, identifying at least a
second data operation
associated with the at least one intent and the general cancer knowledge
database, accessing
the specific patient's molecular report and the general cancer knowledge
database, executing
the at least a first data operation on a first set of data included in the
specific patient's
molecular report to generate a first set of response data, executing the at
least a second data
operation on the general cancer knowledge database to generate a second set of
response data,
using at least one of the first and second sets of response data to generate
an audible response
file, and broadcasting the audible response file via the speaker.
[1060] In at least some embodiments, a method of audibly
broadcasting responses to a
user based on user queries about a specific patient's molecular report, the
method for use with
a collaboration device that includes a processor and a microphone and a
speaker linked to the
processor is provided by the disclosure. The method includes storing molecular
reports for a
plurality of patients in a system database, receiving an audible query from
the user via the
microphone, determining that the audible query is associated with the specific
patient,
accessing the specific patient's molecular report, determining the specific
patient's cancer
state from the molecular report, identifying at least one intent from a pool
of intents related to
the specific patient's cancer state and the audible query, identifying at
least one data
operation associated with the at least one intent, executing at least one of
the identified at
least one data operations on a first set of data included in the specific
patient's molecular
report to generate a first set of response data, using the first set of
response data to generate
an audible response file, and broadcasting the audible response file via the
speaker.
[1061] In at least some embodiments, a method of audibly
broadcasting responses to a
user based on user queries about a patient, the method for use with a
collaboration device that
includes a processor and a microphone and a speaker linked to the processor is
provided by
the disclosure. The method includes storing health records for a plurality of
patients in a
system database and storing a general cancer knowledge database, receiving an
audible query
from the user via the microphone, identifying a specific patient associated
with the audible
query, accessing the health records for the specific patient, identifying
cancer related data in
the specific patient's health records, identifying at least one intent related
to the identified
cancer related data, identifying at least one data operation related to the at
least one intent,
executing the at least one data operation on the general cancer knowledge
database to
generate a first set of response data, using the first set of response data to
generate an audible
response file, and broadcasting the audible response file via the speaker.
[1062] In at least some embodiments, a method of audibly
broadcasting responses to a
user based on user queries about a specific patient molecular report is
provided by the
disclosure. The method includes receiving an audible query from the user to a
microphone
coupled to a collaboration device, identifying at least one intent associated
with the audible
query, identifying at least one data operation associated with the at least
one intent,
associating each of the at least one data operations with a first set of data
presented on the
molecular report, executing each of the at least one data operations on a
second set of data to
generate response data, generating an audible response file associated with
the response data,
and providing the audible response file for broadcasting via a speaker coupled
to the
collaboration device.
[1063] In at least some cases the audible query includes a
question about a nucleotide
profile associated with the patient.
[1064] In at least some cases the nucleotide profile associated
with the patient is a profile
of the patient's cancer.
[1065] In at least some cases the nucleotide profile associated
with the patient is a profile
of the patient's germline.
[1066] In at least some cases the nucleotide profile is a DNA
profile.
[1067] In at least some cases the nucleotide profile is an RNA
expression profile.
[1068] In at least some cases the nucleotide profile is a
mutation biomarker.
[1069] In at least some cases the mutation biomarker is a BRCA
biomarker.
[1070] In at least some cases the audible query includes a
question about a therapy.
[1071] In at least some cases the audible query includes a
question about a gene.
[1072] In at least some cases the audible query includes a
question about clinical data.
[1073] In at least some cases the audible query includes a
question about a next-
generation sequencing panel.
[1074] In at least some cases the audible query includes a
question about a biomarker.
[1075] In at least some cases the audible query includes a
question about an immune
biomarker.
[1076] In at least some cases the audible query includes a
question about an antibody-
based test.
[1077] In at least some cases the audible query includes a
question about a clinical trial.
[1078] In at least some cases the audible query includes a
question about an organoid
assay.
[1079] In at least some cases the audible query includes a
question about a pathology
image.
[1080] In at least some cases the audible query includes a
question about a disease type.
[1081] In at least some cases the at least one intent is an
intent related to a biomarker.
[1082] In at least some cases the biomarker is a BRCA biomarker.
[1083] In at least some cases the at least one intent is an
intent related to a clinical
condition.
[1084] In at least some cases the at least one intent is an
intent related to a clinical trial.
[1085] In at least some cases the at least one intent includes a
drug intent related to a
drug.
[1086] In at least some cases the drug intent is related to
chemotherapy.
[1087] In at least some cases the drug intent is an intent
related to a PARP inhibitor.
[1088] In at least some cases the at least one intent is related
to a gene.
[1089] In at least some cases the at least one intent is related
to immunology.
[1090] In at least some cases the at least one intent is related
to a knowledge database.
[1091] In at least some cases the at least one intent is related
to testing methods.
[1092] In at least some cases the at least one intent is related
to a gene panel.
[1093] In at least some cases the at least one intent is related
to a report.
[1094] In at least some cases the at least one intent is related
to an organoid process.
[1095] In at least some cases the at least one intent is related
to imaging.
[1096] In at least some cases the at least one intent is related
to a pathogen.
[1097] In at least some cases the at least one intent is related
to a vaccine.
[1098] In at least some cases the at least one data operation
includes an operation to
identify at least one treatment option.
[1099] In at least some cases the at least one data operation
includes an operation to
identify knowledge about a therapy.
[1100] In at least some cases the at least one data operation
includes an operation to
identify knowledge related to at least one drug.
[1101] In at least some cases the at least one data operation
includes an operation to
identify knowledge related to mutation testing.
[1102] In at least some cases the at least one data operation
includes an operation to
identify knowledge related to mutation presence.
[1103] In at least some cases the at least one data operation
includes an operation to
identify knowledge related to tumor characterization.
[1104] In at least some cases the at least one data operation
includes an operation to
identify knowledge related to testing requirements.
[1105] In at least some cases the at least one data operation
includes an operation to
query for definition information.
[1106] In at least some cases the at least one data operation
includes an operation to
query for expert information.
[1107] In at least some cases the at least one data operation
includes an operation to
identify information related to recommended therapy.
[1108] In at least some cases the at least one data operation
includes an operation to
query for information relating to a patient.
[1109] In at least some cases the at least one data operation
includes an operation to
query for information relating to patients with one or more clinical
characteristics similar to
the patient.
[1110] In at least some cases the at least one data operation
includes an operation to
query for information relating to patient cohorts.
[1111] In at least some cases the at least one data operation
includes an operation to
query for information relating to clinical trials.
[1112] In at least some cases the at least one data operation
includes an operation to
query about a characteristic relating to a genomic mutation.
[1113] In at least some cases the characteristic is loss of
heterozygosity.
[1114] In at least some cases the characteristic can reflect the
source of the mutation.
[1115] In at least some cases the source is germline.
[1116] In at least some cases the source is somatic.
[1117] In at least some cases the characteristic includes whether
the mutation is a tumor
driver.
[1118] In at least some cases the first set of data includes a
patient name.
[1119] In at least some cases the first set of data includes a
patient age.
[1120] In at least some cases the first set of data includes a
next-generation sequencing
panel.
[1121] In at least some cases the first set of data includes a
genomic variant.
[1122] In at least some cases the first set of data includes a
somatic genomic variant.
[1123] In at least some cases the first set of data includes a
germline genomic variant.
[1124] In at least some cases the first set of data includes a
clinically actionable genomic
variant.
[1125] In at least some cases the first set of data includes a
loss of function variant.
[1126] In at least some cases the first set of data includes a
gain of function variant.
[1127] In at least some cases the first set of data includes an
immunology marker.
[1128] In at least some cases the first set of data includes a
tumor mutational burden.
[1129] In at least some cases the first set of data includes a
microsatellite instability
status.
[1130] In at least some cases the first set of data includes a
diagnosis.
[1131] In at least some cases the first set of data includes a
therapy.
[1132] In at least some cases the first set of data includes a
therapy approved by the U.S.
Food and Drug Administration.
[1133] In at least some cases the first set of data includes a
drug therapy.
[1134] In at least some cases the first set of data includes a
radiation therapy.
[1135] In at least some cases the first set of data includes a
chemotherapy.
[1136] In at least some cases the first set of data includes a
cancer vaccine therapy.
[1137] In at least some cases the first set of data includes an
oncolytic virus therapy.
[1138] In at least some cases the first set of data includes an
immunotherapy.
[1139] In at least some cases the first set of data includes a
pembrolizumab therapy.
[1140] In at least some cases the first set of data includes a
CAR-T therapy.
[1141] In at least some cases the first set of data includes a
proton therapy.
[1142] In at least some cases the first set of data includes an
ultrasound therapy.
[1143] In at least some cases the first set of data includes a
surgery.
[1144] In at least some cases the first set of data includes a
hormone therapy.
[1145] In at least some cases the first set of data includes an
off-label therapy.
[1146] In at least some cases the first set of data includes an
on-label therapy.
[1147] In at least some cases the first set of data includes a
bone marrow transplant event.
[1148] In at least some cases the first set of data includes a
cryoablation event.
[1149] In at least some cases the first set of data includes a
radiofrequency ablation.
[1150] In at least some cases the first set of data includes a
monoclonal antibody therapy.
[1151] In at least some cases the first set of data includes an
angiogenesis inhibitor.
[1152] In at least some cases the first set of data includes a
PARP inhibitor.
[1153] In at least some cases the first set of data includes a
targeted therapy.
[1154] In at least some cases the first set of data includes an
indication of use.
[1155] In at least some cases the first set of data includes a
clinical trial.
[1156] In at least some cases the first set of data includes a
distance to a location
conducting a clinical trial.
[1157] In at least some cases the first set of data includes a
variant of unknown
significance.
[1158] In at least some cases the first set of data includes a
mutation effect.
[1159] In at least some cases the first set of data includes a
variant allele fraction.
[1160] In at least some cases the first set of data includes a
low coverage region.
[1161] In at least some cases the first set of data includes a
clinical history.
[1162] In at least some cases the first set of data includes a
biopsy result.
[1163] In at least some cases the first set of data includes an
imaging result.
[1164] In at least some cases the first set of data includes an
MRI result.
[1165] In at least some cases the first set of data includes a CT
result.
[1166] In at least some cases the first set of data includes a
therapy prescription.
[1167] In at least some cases the first set of data includes a
therapy administration.
[1168] In at least some cases the first set of data includes a
cancer subtype diagnosis.
[1169] In at least some cases the first set of data includes a
cancer subtype diagnosis by
RNA class.
[1170] In at least some cases the first set of data includes a
result of a therapy applied to
an organoid grown from the patient's cells.
[1171] In at least some cases the first set of data includes a
tumor quality measure.
[1172] In at least some cases the first set of data includes a
tumor quality measure
selected from at least one of the set of PD-L1, MMR, tumor infiltrating
lymphocyte count,
and tumor ploidy.
[1173] In at least some cases the first set of data includes a
tumor quality measure derived
from an image analysis of a pathology slide of the patient's tumor.
[1174] In at least some cases the first set of data includes a
signaling pathway associated
with a tumor of the patient.
[1175] In at least some cases the signaling pathway is a HER
pathway.
[1176] In at least some cases the signaling pathway is a MAPK
pathway.
[1177] In at least some cases the signaling pathway is a MDM2-
TP53 pathway.
[1178] In at least some cases the signaling pathway is a PI3K
pathway.
[1179] In at least some cases the signaling pathway is a mTOR
pathway.
[1180] In at least some cases the at least one data operation
includes an operation to
query for a treatment option, the first set of data includes a genomic
variant, and the
associating step includes adjusting the operation to query for the treatment
option based on
the genomic variant.
[1181] In at least some cases the at least one data operation
includes an operation to
query for a clinical history data element, the first set of data includes a therapy,
and the associating
step includes adjusting the operation to query for the clinical history data
element based on
the therapy.
[1182] In at least some cases the clinical history data is
medication prescriptions, the
therapy is pembrolizumab, and the associating step includes adjusting the
operation to query
for the prescription of pembrolizumab.
[1183] In at least some cases the second set of data includes
clinical health information.
[1184] In at least some cases the second set of data includes
genomic variant information.
[1185] In at least some cases the second set of data includes DNA
sequencing
information.
[1186] In at least some cases the second set of data includes RNA
information.
[1187] In at least some cases the second set of data includes DNA
sequencing
information from short-read sequencing.
[1188] In at least some cases the second set of data includes DNA
sequencing
information from long-read sequencing.
[1189] In at least some cases the second set of data includes RNA
transcriptome
information.
[1190] In at least some cases the second set of data includes RNA
full-transcriptome
information.
[1191] In at least some cases the second set of data is stored in
a single data repository.
[1192] In at least some cases the second set of data is stored in
a plurality of data
repositories.
[1193] In at least some cases the second set of data includes
clinical health information
and genomic variant information.
[1194] In at least some cases the second set of data includes
immunology marker
information.
[1195] In at least some cases the second set of data includes
microsatellite instability
immunology marker information.
[1196] In at least some cases the second set of data includes
tumor mutational burden
immunology marker information.
[1197] In at least some cases the second set of data includes
clinical health information
including one or more of demographic information, diagnostic information,
assessment
results, laboratory results, prescribed or administered therapies, and
outcomes information.
[1198] In at least some cases the second set of data includes
demographic information
including one or more of patient age, patient date of birth, gender, race,
ethnicity, institution
of care, comorbidities, and smoking history.
[1199] In at least some cases the second set of data includes
diagnosis information
including one or more of tissue of origin, date of initial diagnosis,
histology, histology grade,
metastatic diagnosis, date of metastatic diagnosis, site or sites of
metastasis, and staging
information.
[1200] In at least some cases the second set of data includes
staging information
including one or more of TNM, ISS, DSS, FAB, Rai, and Binet.
[1201] In at least some cases the second set of data includes
assessment information
including one or more of performance status comprising at least one of ECOG
status or
Karnofsky status, performance status score, and date of performance status.
[1202] In at least some cases the second set of data includes
laboratory information
including one or more of types of lab, lab results, lab units, date of lab
service, date of
molecular pathology test, assay type, assay result, molecular pathology
method, and
molecular pathology provider.
[1203] In at least some cases the second set of data includes
treatment information
including one or more of drug name, drug start date, drug end date, drug
dosage, drug units,
drug number of cycles, surgical procedure type, date of surgical procedure,
radiation site,
radiation modality, radiation start date, radiation end date, radiation total
dose delivered, and
radiation total fractions delivered.
[1204] In at least some cases the second set of data includes
outcomes information
including one or more of Response to Therapy, RECIST score, Date of Outcome,
date of
observation, date of progression, date of recurrence, adverse event to
therapy, adverse event
date of presentation, adverse event grade, date of death, date of last follow-
up, and disease
status at last follow up.
[1205] In at least some cases the second set of data includes
information that has been de-
identified in accordance with a de-identification method permitted by HIPAA.
[1206] In at least some cases the second set of data includes
information that has been de-
identified in accordance with a safe harbor de-identification method permitted
by HIPAA.
[1207] In at least some cases the second set of data includes
information that has been de-
identified in accordance with a statistical de-identification method permitted
by HIPAA.
[1208] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a cancer condition.
[1209] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a cardiovascular condition.
[1210] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a diabetes condition.
[1211] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with an autoimmune condition.
[1212] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a lupus condition.
[1213] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a psoriasis condition.
[1214] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a depression condition.
[1215] In at least some cases the second set of data includes
clinical health information of
patients diagnosed with a rare disease.
[1216] In at least some cases the method is performed in
conjunction with a digital and
laboratory health care platform.
[1217] In at least some cases the digital and laboratory health
care platform can generate
a molecular report as part of a targeted medical care precision medicine
treatment.
[1218] In at least some cases the method can operate on one or
more micro-services.
[1219] In at least some cases the method is performed in
conjunction with one or more
microservices of an order management system.
[1220] In at least some cases the method is performed in
conjunction with one or more
microservices of a medical document abstraction system.
[1221] In at least some cases the method is performed in
conjunction with one or more
microservices of a mobile device application.
[1222] In at least some cases the method is performed in
conjunction with one or more
microservices of a prediction engine.
[1223] In at least some cases the method is performed in
conjunction with one or more
microservices of a cell-type profiling service.
[1224] In at least some cases the method is performed in
conjunction with a variant
calling engine to provide information to a query involving variants.
[1225] In at least some cases the method is performed in
conjunction with an insight
engine.
[1226] In at least some cases the method is performed in
conjunction with a therapy
matching engine.
[1227] In at least some cases the method is performed in
conjunction with a clinical trial
matching engine.
[1228] Embodiments of the information that is catalogued, stored,
analyzed, or reported
according to any embodiments described herein may also be provided through a
stand-alone
hardware device. Examples of stand-alone hardware devices are disclosed in US
Patent App.
No. 16/852,194, filed April 17, 2020 which is incorporated by reference for
all purposes.
[1229]
Specific Embodiments of the Disclosure
[1230] In some aspects, the systems and methods disclosed herein
may be used to support
clinical decisions for personalized treatment of cancer. For example, in some
embodiments,
the methods described herein identify actionable genomic variants and/or
genomic states with
associated recommended cancer therapies. In some embodiments, the recommended
treatment is dependent upon whether or not the subject has a particular
actionable variant
and/or genomic status. Recommended treatment modalities can be therapeutic
drugs and/or
assignment to one or more clinical trials. Generally, current treatment
guidelines for various
cancers are maintained by various organizations, including the National Cancer
Institute and
Merck & Co., in the Merck Manual.
[1231] In some embodiments, the methods described herein further
include assigning
therapy and/or administering therapy to the subject based on the
identification of an
actionable genomic variant and/or genomic state, e.g., based on whether or not
the subject's
cancer will be responsive to a particular personalized cancer therapy regimen.
For example,
in some embodiments, when the subject's cancer is classified as having a first
actionable
variant and/or genomic state, the subject is assigned or administered a first
personalized
cancer therapy that is associated with the first actionable variant and/or
genomic state, and
when the subject's cancer is classified as having a second actionable variant
and/or genomic
state, the subject is assigned or administered a second personalized cancer
therapy that is
associated with the second actionable variant. Assignment or administration of
a therapy or a
clinical trial to a subject is thus tailored for treatment of the actionable
variants and/or
genomic states of the cancer patient.
Examples
[1232] Example 1 - The Cancer Genome Atlas (TCGA).
[1233] The Cancer Genome Atlas (TCGA) is a publicly available
dataset comprising
more than two petabytes of genomic data for over 11,000 cancer patients,
including clinical
information about the cancer patients, metadata about the samples (e.g., the
weight of a
sample portion, etc.) collected from such patients, histopathology slide
images from sample
portions, and molecular information derived from the samples (e.g., mRNA/miRNA
expression, protein expression, copy number, etc.). The TCGA dataset includes
data on 33
different cancers: breast (breast ductal carcinoma, breast lobular carcinoma),
central nervous
system (glioblastoma multiforme, lower grade glioma), endocrine
(adrenocortical carcinoma,
papillary thyroid carcinoma, paraganglioma & pheochromocytoma),
gastrointestinal
(cholangiocarcinoma, colorectal adenocarcinoma, esophageal cancer, liver
hepatocellular
carcinoma, pancreatic ductal adenocarcinoma, and stomach cancer), gynecologic
(cervical
cancer, ovarian serous cystadenocarcinoma, uterine carcinosarcoma, and uterine
corpus
endometrial carcinoma), head and neck (head and neck squamous cell carcinoma,
uveal
melanoma), hematologic (acute myeloid leukemia, thymoma), skin (cutaneous
melanoma),
soft tissue (sarcoma), thoracic (lung adenocarcinoma, lung squamous cell
carcinoma, and
mesothelioma), and urologic (chromophobe renal cell carcinoma, clear cell
kidney
carcinoma, papillary kidney carcinoma, prostate adenocarcinoma, testicular
germ cell cancer,
and urothelial bladder carcinoma).
[1234] Example 2 - Identification of Focal Copy Number Variation
[1235] Figures 7A1, 7B1, and 7C1 collectively illustrate
identification of non-focal and
focal copy number variations in biological samples, in accordance with some
embodiments of
the present disclosure. As defined above, focal copy number variations refer
to small
segments that consist of only a few exons of a gene or several genes that
deviate significantly
from neighboring segments.
[1236] A method in accordance with some embodiments of the
present disclosure was
performed using, as inputs, a test sample BAM file comprising aligned sequence
reads from a
sequencing of nucleic acids from a test sample, a target region BED file
comprising at least
the genes MYC and BRCA2, and a pool of process matched normal samples for
comparison
with test sample sequencing data. The method further used, as inputs for
initial reference
pool construction, a plurality of normal sample BAM files comprising aligned
sequence reads
from a sequencing of nucleic acids from a plurality of process matched normal
samples, a
human reference genome file for alignment, a list of mappable regions of the
genome and a
blacklist comprising recurrent problematic areas of the genome.
[1237] The method was performed for three different test samples.
For each test sample,
CNVkit was performed utilizing targeted captured sequencing reads and non-
specifically
captured off-target sequencing reads to infer copy number information. The
targeted
genomic region specified in the probe target BED file was divided into target
bins with an
average size of 100 base pairs. The genomic regions between the target regions
(excluding
regions that could not be mapped reliably) were automatically divided into off-
target bins
with an average size of 150 kilobase pairs. Raw log2-transformed read depths
for the
plurality of sequence reads in each of the target and off-target bins were
then calculated from
the alignments in the input BAM file and written to two tab-delimited .cnn
files.
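The binning and depth calculation described above can be sketched as follows. This is a minimal illustration and not the CNVkit implementation: the function names, the per-base depth list used as input, and the pseudocount are assumptions introduced only for this example.

```python
import math


def make_bins(region_start, region_end, avg_size):
    """Split the half-open region [region_start, region_end) into
    near-equal bins whose size is close to avg_size (hypothetical helper)."""
    length = region_end - region_start
    n_bins = max(1, round(length / avg_size))
    edges = [region_start + (length * i) // n_bins for i in range(n_bins + 1)]
    return list(zip(edges[:-1], edges[1:]))


def log2_bin_depths(per_base_depth, bins, pseudocount=0.5):
    """Return log2 of the mean read depth in each (start, end) bin.

    per_base_depth is a list with one read depth per base (assumed input).
    The pseudocount is an assumption that keeps empty bins from yielding
    log2(0); it is not taken from the disclosure.
    """
    result = []
    for start, end in bins:
        window = per_base_depth[start:end]
        mean_depth = sum(window) / len(window)
        result.append(math.log2(mean_depth + pseudocount))
    return result
```

For instance, a 300 bp target region with an average bin size of 100 bp would be split into three bins, and each bin's value would be the log2 of its mean depth.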
[1238] A pooled reference was constructed from a panel of process
matched normal
samples. The raw log2-transformed read depths for the plurality of sequence
reads in each of
the target and off-target bins in each normal sample were computed as
described above, and
each log2 read depth was median-centered and corrected for bias including GC
content,
genome sequence repetitiveness, target size and spacing. The corrected target
and off-target
log2 read depths were combined across samples, and a weighted average and
spread were
calculated for each bin using Tukey's biweight location and midvariance.
These values were
written to a tab-delimited reference .cnn file, which was used to normalize
the binned
sequencing data for the test sample.
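As an illustration of the robust per-bin statistics named above, the following sketch computes a one-step Tukey biweight location and a biweight midvariance for the log2 read depths of a single bin across normal samples. The tuning constants (6.0 and 9.0) are commonly used defaults and are assumptions here, as are the function names; the disclosure does not specify them.

```python
import statistics


def _median_and_mad(values):
    """Median and median absolute deviation of a list of numbers."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad


def biweight_location(values, c=6.0):
    """One-step Tukey biweight location (robust average).
    Falls back to the median when the MAD is zero."""
    med, mad = _median_and_mad(values)
    if mad == 0:
        return med
    num = den = 0.0
    for v in values:
        u = (v - med) / (c * mad)
        if abs(u) < 1:  # points beyond c * MAD get zero weight
            w = (1 - u * u) ** 2
            num += (v - med) * w
            den += w
    return med + num / den


def biweight_midvariance(values, c=9.0):
    """Tukey biweight midvariance (robust spread); 0.0 when the MAD is zero."""
    med, mad = _median_and_mad(values)
    if mad == 0:
        return 0.0
    num = den = 0.0
    for v in values:
        u = (v - med) / (c * mad)
        if abs(u) < 1:
            num += (v - med) ** 2 * (1 - u * u) ** 4
            den += (1 - u * u) * (1 - 5 * u * u)
    return len(values) * num / den ** 2
```

The point of these estimators is that a single aberrant normal sample (for example a log2 depth of 5.0 among values near zero) has little influence on the reference value for that bin.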
[1239] The raw log2 read depths of each test sample were
median-centered and bias-corrected
as described in the reference construction. The log2 read depth of each bin
in the reference .cnn file was then subtracted from the corrected log2 read
depth of the corresponding bin in the test sample .cnn file, thus
generating log2 copy ratios
indicating
differential copy numbers between the test sample and the reference pool.
These values were
written to a tab-delimited .cnr file.
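The normalization step can be sketched as below. This simplified example covers only median-centering and the bin-wise subtraction of reference values; the bias corrections (GC content, repetitiveness, target size and spacing) are omitted, and the function names are hypothetical.

```python
import statistics


def median_center(log2_depths):
    """Shift log2 read depths so that their median is zero."""
    med = statistics.median(log2_depths)
    return [d - med for d in log2_depths]


def log2_copy_ratios(test_log2, reference_log2):
    """Bin-wise log2 copy ratio: test log2 depth minus reference log2 depth.
    Both lists are assumed to cover the same bins in the same order."""
    if len(test_log2) != len(reference_log2):
        raise ValueError("test and reference must cover the same bins")
    return [t - r for t, r in zip(test_log2, reference_log2)]
```

Because the values are on a log2 scale, a bin whose depth is twice that of the reference has a copy ratio of 1.0, and a bin at half the reference depth has a copy ratio of -1.0.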
[1240] The log2 copy ratios were then segmented via a circular
binary segmentation
(CBS) algorithm, in which adjacent bins were grouped to larger genomic regions
(e.g.,
segments) of equal copy number. Each segment's copy ratio was calculated as
the weighted
mean of all bins within the segment, and the confidence interval of the mean
for each
respective segment was estimated by bootstrapping the bin-level copy ratios
within the
segment. The segments' genomic ranges, copy ratios and confidence intervals
were then
written to a tab-delimited .cns file. The copy ratios of each segment were
used to generate an
initial copy number status annotation for the respective segment, for
subsequent validation.
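A segment-level copy ratio and its bootstrap confidence interval, as described above, might be computed along the following lines. The segmentation itself (CBS) is not shown. The number of bootstrap resamples, the confidence level, the percentile method, and the fixed random seed are assumptions made for this sketch.

```python
import random


def weighted_mean(values, weights):
    """Weighted mean of bin-level copy ratios."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def bootstrap_ci(values, weights, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval of the weighted mean.

    Bins are resampled with replacement together with their weights.
    n_boot, alpha and seed are illustrative choices, not source values.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        means.append(weighted_mean([values[i] for i in idx],
                                   [weights[i] for i in idx]))
    means.sort()
    low = means[int((alpha / 2) * (n_boot - 1))]
    high = means[int((1 - alpha / 2) * (n_boot - 1))]
    return low, high
```

The lower and upper bounds returned here correspond to the confidence interval bounds that the later filtering criteria compare against neighboring segments.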
[1241] The validation of each copy number status annotation for
each segment was
performed using annotation and filtering. First, bin-level copy ratios (.cnr)
and segment-level
copy ratios with their confidence intervals (.cns) from the CNVkit outputs, as
well as the
probe target region file (.bed) were passed to a Python script (e.g.,
annotate_cnvs_xf.py), in
which each segment was examined and an amplification or deletion was called
for the
segment if a
plurality of criteria (e.g., filters) were met. If the plurality of criteria
were not met (e.g., if
any of the filters were fired), the copy number status annotation was
rejected.
[1242] The plurality of criteria included a first requirement
that the respective segment's
copy ratio be greater than 0.03 for an amplification, or less than -0.5 for a
deletion. Both
thresholds were determined empirically. More stringent values (e.g., a higher
value for
amplifications or a lower one for deletions) were found to increase
specificity but decrease
sensitivity, especially in low tumor fraction cases.
[1243] The plurality of criteria also included a second
requirement that the median copy
ratio of all target bins within the respective segment be greater than 0.03
for an amplification,
or less than -0.5 for a deletion.
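The first two requirements can be expressed as a simple threshold check. The thresholds 0.03 and -0.5 are those stated above; the function name and return values are hypothetical, and the remaining criteria are not applied here.

```python
import statistics

AMP_THRESHOLD = 0.03   # copy ratio threshold for amplification, from the text
DEL_THRESHOLD = -0.5   # copy ratio threshold for deletion, from the text


def threshold_call(segment_ratio, target_bin_ratios):
    """Apply the first and second requirements to one segment.

    Returns "amplification", "deletion", or None when either the
    segment-level copy ratio or the median of the segment's target bins
    fails the threshold.
    """
    bin_median = statistics.median(target_bin_ratios)
    if segment_ratio > AMP_THRESHOLD and bin_median > AMP_THRESHOLD:
        return "amplification"
    if segment_ratio < DEL_THRESHOLD and bin_median < DEL_THRESHOLD:
        return "deletion"
    return None
```

Requiring both the segment-level value and the target-bin median to pass guards against a segment whose mean is driven by off-target bins alone.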
[1244] The plurality of criteria also included a third
requirement that, to validate an
amplification annotation, the lower bound of the segment's copy ratio
confidence interval be
greater than the mean copy ratios of all preceding and all subsequent segments
in the same
chromosome. To validate a deletion annotation, the higher bound of the
segment's copy ratio
confidence interval must be less than the mean copy ratios of all preceding
and all subsequent
segments in the same chromosome. Specifically, in such embodiments, the third
criterion
required satisfaction of two threshold values (e.g., mean copy ratio of all
preceding segments
and mean copy ratio of all following segments) to pass the filter.
[1245] In some alternative embodiments of the third requirement
where the segment was
the first or the last segment in a chromosome, then the lower (higher) bound
of the segment's
copy ratio confidence interval must be greater (less) than the mean copy
ratios of all other
segments excluding itself for an amplification (a deletion). Specifically, in
such
embodiments, the third criterion required satisfaction of a single threshold
value (e.g., the
mean copy ratio of all segments other than the segment under examination).
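The third requirement, including the single-threshold case for the first or last segment of a chromosome, can be sketched as follows. The function name and argument layout are hypothetical. Rejecting a chromosome that contains only one segment is an assumption of this sketch, since the text does not address that case.

```python
from statistics import mean


def passes_flank_filter(segment_ratios, index, ci_low, ci_high, call):
    """Compare a segment's confidence interval with its neighbors.

    segment_ratios: copy ratios of all segments on one chromosome, in order.
    index: position of the segment under examination.
    For an interior segment there are two thresholds (the mean of all
    preceding segments and the mean of all following segments). For the
    first or last segment only one group exists, which equals the mean of
    all other segments.
    """
    before = segment_ratios[:index]
    after = segment_ratios[index + 1:]
    thresholds = [mean(group) for group in (before, after) if group]
    if not thresholds:
        return False  # assumption: a lone segment cannot be called focal
    if call == "amplification":
        return all(ci_low > t for t in thresholds)
    if call == "deletion":
        return all(ci_high < t for t in thresholds)
    raise ValueError("call must be 'amplification' or 'deletion'")
```

With this check, a segment whose neighbors share the same elevated copy ratio fails the filter, whereas a segment that stands out from near-neutral neighbors passes.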
[1246] Finally, the plurality of criteria further included a
fourth requirement that, to
validate an amplification, the median copy ratio of all target bins in the
segment be no less
than the median plus the median absolute deviation (MAD) of all bins' copy
ratios on the
same chromosome. To validate a deletion, the median copy ratio of all target
bins in the
segment must be no greater than the median minus 0.75 of the MAD of all bins'
copy ratios
on the same chromosome. A scaling factor of 0.75 of the MAD was selected for
deletion
annotations in order to account for lower signal-to-noise ratios observed in
deletions, which
were not observed in amplifications.
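The fourth requirement compares the segment's target bins against the median and median absolute deviation (MAD) of all bins on the same chromosome. A minimal sketch follows; the function names are hypothetical, and the 0.75 scaling factor for deletions is the value given above.

```python
from statistics import median


def mad(values):
    """Median absolute deviation about the median."""
    med = median(values)
    return median([abs(v - med) for v in values])


def passes_mad_filter(segment_target_bins, chromosome_bins, call,
                      deletion_scale=0.75):
    """Apply the fourth requirement to one segment.

    Amplification: segment median >= chromosome median + MAD.
    Deletion: segment median <= chromosome median - 0.75 * MAD.
    """
    chrom_median = median(chromosome_bins)
    chrom_mad = mad(chromosome_bins)
    seg_median = median(segment_target_bins)
    if call == "amplification":
        return seg_median >= chrom_median + chrom_mad
    if call == "deletion":
        return seg_median <= chrom_median - deletion_scale * chrom_mad
    raise ValueError("call must be 'amplification' or 'deletion'")
```

The smaller multiplier for deletions makes the deletion test slightly more permissive than the amplification test, consistent with the stated rationale of lower signal-to-noise ratios in deletions.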
[1247] The final amplification status of a segment was then
mapped to each target bin in
the segment under examination and a CSV file was generated with the following
columns:
"amp" - amplification status (amplified, neutral, or deleted); "chrom" -
chromosome; "start" - start position of a target bin; "end" - end position of
a target bin; "gene" - gene in the target bin; "meant" - mean coverage of the
target bin; "corrected log2 ratio" - copy ratio of the bin; "segment log2
ratio" - copy ratio of the segment comprising the target bin; "seg id" -
numeric index of the segment; and "order" - order id of the input sample.
[1248] Figures 7A1 and 7B1 illustrate the amplification status of
a first test sample and a
second test sample comprising the MYC gene, validated using the above method.
Figure
7A1 illustrates a scatter plot of bin-level copy ratios ("+": off-target bins;
"-": target bins) and
segment-level copy ratios (horizontal lines) located on chromosome 8 of the
first test sample,
where bin-level and segment-level copy ratios were generated by CNVkit as
described above.
The vertical line highlights the position of the MYC gene on chromosome 8. As
illustrated in
Figure 7A1, the MYC gene locus in the first test sample was identified as
having a copy ratio
of 1.2, suggesting that the sample comprises a copy number variation
associated with the
MYC gene in comparison with pooled reference samples. However, application of
the
annotation and filtering of the method disclosed herein indicated that the MYC
gene was
located in a non-focal amplified segment, as is visually represented by the
solid horizontal
line comprising the MYC gene (e.g., the surrounding segments also exhibit a
copy ratio of
1.2). Figure 7A1 illustrates that the amplification status attributed to the
MYC gene locus by
CNVkit is likely to be an artifact resulting from its inclusion in a non-focal
amplified
segment, and thus is not therapeutically actionable.
[1249] Figure 7B1 illustrates a scatter plot of bin-level copy
ratios ("+": off-target bins;
"-": target bins) and segment-level copy ratios (horizontal lines) on
chromosome 8 of the
second test sample, where bin-level and segment-level copy ratios were
generated by CNVkit
as described above. In contrast to Figure 7A1, the vertical line in Figure 7B1
highlights the
position of the MYC gene on chromosome 8, which is in a focal amplified
segment with a
copy ratio of 0.97. Notably, the validation of the MYC gene amplification
status as occurring
in a focal amplified segment illustrates that the amplification status called
for the second test
sample represents a real and therapeutically actionable copy number variation.
[1250] Figures 7A1 and 7B1 thus illustrate a key application of
the disclosed method, in
which a focal amplification of a gene of therapeutic interest can be
distinguished from a non-
focal amplification (e.g., an artifactual or erroneous call) of the same gene.
By achieving
higher confidence in the identification of copy number variations of disease-
associated genes,
it is possible to avoid misdiagnoses and make accurate, informed decisions on
necessary
treatments or therapies.
[1251] In an alternative embodiment, Figure 7C1 illustrates the
deletion status of a third
test sample comprising the BRCA2 gene, validated using the above method.
[1252] Figure 7C1 illustrates a scatter plot of bin-level copy
ratios ("+": off-target bins;
"=": target bins) and segment-level copy ratios (horizontal lines) on
chromosome 13 of the
third test sample, where bin-level and segment-level copy ratios were
generated by CNVkit
as described above. The vertical line highlights the position of the BRCA2
gene on
chromosome 13. As observed in Figure 7B1, visual inspection of the scatter
plot confirms
that the BRCA2 gene in the third test sample is contained in a focal deleted
segment with a
copy ratio of -1.1 (e.g., as illustrated by the solid horizontal lines on
either side of the
segment comprising the BRCA2 gene with a copy ratio of approximately zero).
Thus, in
addition to validating copy number status annotations to identify focal
amplifications, the
method can also be used to identify focal deletions, for genes of interest
associated with
human disease.
[1253] Example 3 - Method of Validating a Liquid Biopsy Assay
[1254] Conducting sample collection, storage, nucleic acid
isolation, and library
preparation.
[1255] To validate a liquid biopsy assay in accordance with some
embodiments of the
present disclosure, 188 unique specimens were sequenced. These unique
specimens included
blood specimens purchased from BioIVT, 56 residual plasma samples, 39 whole-
blood
samples, 4 cfDNA reference standards set in synthetic plasma (Horizon
Discovery's
Multiplex I cfDNA Reference Standards HD812, HD813, HD814, HD815), and 2 cfDNA
reference standard isolates (Horizon Discovery's Structural Multiplex cfDNA
reference
standard HD786, and 100% Multiplex I Wild Type Reference Standard HD776).
Furthermore, an additional 55 blood samples with matched tumor samples were
utilized to
compare the liquid biopsy and solid tumor tests, and 375 blood samples were
sequenced for
low-pass whole-genome sequencing (LPWGS) analysis. Sequence data from an
additional
1,000 patient samples that were previously sequenced were utilized for
retrospective and
clinical analyses. All blood was received in Cell-free DNA BCT blood
collection tubes
(Streck). Plasma was prepared immediately after accessioning and stored at -80
°C until later
C until later
nucleic acid extraction and library preparation. At this time, cfDNA was
isolated from
plasma using the Qiagen QIAamp MinElute ccfDNA Midi Kit (QIAGEN), conducted
according to instructions provided by the manufacturer. Automated library
preparation was
performed on a SciClone NGSx (Perkin Elmer). All cfDNA samples were normalized
with
molecular grade water to a maximum of 50 microliters (µL).
[1256] Conducting the liquid biopsy sequencing assay.
[1257] The liquid biopsy assay utilized New England Biolabs'
NEBNext Ultra™ II
DNA Library Prep Kit for Illumina®, IDT's xGen CS Adapters, unique molecular
indices (UMI), and 96 pairs of barcodes to prepare cfDNA sequencing libraries
with unique
sample identifiers (IDs). Each sample was ligated to a dual unique index. The
dual unique
index enables multiplexed sequencing of up to 7 patients and 1 positive
control per SP
NovaSeq flow cell, 16 patients and 1 positive control per S1 NovaSeq flow
cell, 34 patients
and 1 positive control per S2 NovaSeq flow cell, and 84 patients and 1
positive control per S4
NovaSeq flow cell. The library preparation protocol is optimized for greater
than or equal to
20 nanograms (ng) cfDNA input to maximize mutation detection sensitivity. The
final
library was sequenced on an Illumina NovaSeq sequencer. Furthermore, analysis
was
performed using a bioinformatics pipeline and analysis server.
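The multiplexing capacities listed above can be captured in a small lookup for run planning. This is only an illustrative sketch: the dictionary and function names are hypothetical, and the capacities are the patient counts per flow cell stated in the text (SP, S1, S2, S4), each run also carrying one positive control.

```python
import math

# Patients per NovaSeq flow cell type, as stated in the text.
PATIENTS_PER_FLOW_CELL = {"SP": 7, "S1": 16, "S2": 34, "S4": 84}


def libraries_per_run(flow_cell):
    """Patient libraries plus the single positive control on one flow cell."""
    return PATIENTS_PER_FLOW_CELL[flow_cell] + 1


def flow_cells_needed(n_patients, flow_cell):
    """Number of flow cells of a given type needed for n_patients samples."""
    return math.ceil(n_patients / PATIENTS_PER_FLOW_CELL[flow_cell])
```

For example, under these capacities 188 patient samples would require three S4 flow cells.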
[1258] The bioinformatics pipeline.
[1259] Adapter-trimmed FASTQ files are aligned to the nineteenth
edition of the human
reference genome build (hg19) using Burrows-Wheeler Aligner (BWA). Li et al.,
2009,
"Fast and accurate short read alignment with Burrows-Wheeler transform,"
Bioinformatics,
(25), pg. 1754. Following alignment, reads were grouped by alignment position
and UMI
family, and collapsed into consensus sequences using fgbio tools (available
online at
fulcrumgenomics.github.io/fgbio/). Bases with insufficient quality or
significant
disagreement among family members were reverted to N's. Phred scores were
scaled based
on initial base calling estimates combined across all family members.
Following single-
strand consensus sequence generation, duplex consensus sequences were
generated by
comparing the forward and reverse oriented PCR products with mirrored UMI
sequences.
Consensus sequences were re-aligned to the human reference genome using BWA.
BAM
files are generated and indexed after the re-alignment.
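The grouping and collapsing step described above can be illustrated with a toy sketch. This is not the fgbio implementation; the read tuple layout, the function name `collapse_families`, and the `min_qual` and `min_agreement` thresholds are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def collapse_families(reads, min_qual=20, min_agreement=0.9):
    """Group reads by (position, UMI) and collapse each family to a consensus.

    reads: iterable of (position, umi, sequence, qualities) tuples; reads in
    one family are assumed to have equal length. The thresholds are
    illustrative assumptions, not values taken from the disclosure.
    """
    families = defaultdict(list)
    for pos, umi, seq, quals in reads:
        families[(pos, umi)].append((seq, quals))

    consensus = {}
    for key, members in families.items():
        length = len(members[0][0])
        bases = []
        for i in range(length):
            # Count only bases whose quality passes the threshold.
            votes = Counter(seq[i] for seq, quals in members if quals[i] >= min_qual)
            total = sum(votes.values())
            if total == 0:
                bases.append("N")  # insufficient quality at this position
                continue
            base, count = votes.most_common(1)[0]
            # Revert to N when family members disagree too much.
            bases.append(base if count / total >= min_agreement else "N")
        consensus[key] = "".join(bases)
    return consensus
```

In this sketch a position is masked either because no member has adequate quality or because the majority base falls below the agreement fraction, mirroring the two reasons for reverting bases to N given in the text.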
[1260] SNV and indel variants were detected using VarDict. Lai
et al., 2016, "VarDict: a
novel and versatile variant caller for next-generation sequencing in cancer
research," Nucleic
Acids Res, (44), pg. 108. SNVs were called down to 0.1% VAF for specified
hotspot target
regions and 0.25% VAF at all other base positions across the panel. Indels
were called down
to 0.5% VAF for variants within specific regions of interest. Any indels
outside of these
regions were called down to 5% VAF. All SNVs and indels were then sorted,
deduplicated,
normalized, and annotated accordingly. Following annotation, variants were
classified as
germline, somatic, or uncertain using a Bayesian model based on prior
expectations informed
by various internal and external databases of germline and cancer variants.
Uncertain
variants are treated as somatic for filtering and reporting purposes.
Following classification,
variants were filtered based on a plurality of quality metrics including
coverage, VAF, strand
bias, and genomic complexity. Additionally, variants were filtered with a
Bayesian tri-
nucleotide context-based model with position level background error rates
estimated from a
pool of process matched healthy controls. Furthermore, known artifactual
variants were
removed.
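The VAF calling thresholds stated above (0.1% for hotspot SNVs, 0.25% for other SNVs, 0.5% for indels within regions of interest, and 5% for other indels) can be expressed as a small lookup. The function names and the `in_hotspot` and `in_indel_roi` flags are hypothetical; only the threshold values come from the text.

```python
def min_vaf_threshold(variant_type, in_hotspot=False, in_indel_roi=False):
    """Return the minimum VAF (as a fraction) at which a variant is called."""
    if variant_type == "SNV":
        return 0.001 if in_hotspot else 0.0025
    if variant_type == "indel":
        return 0.005 if in_indel_roi else 0.05
    raise ValueError(f"unsupported variant type: {variant_type}")

def passes_vaf(variant_type, alt_reads, depth, **region_flags):
    """Check whether alt_reads / depth meets the threshold for this variant."""
    if depth <= 0:
        return False
    return alt_reads / depth >= min_vaf_threshold(variant_type, **region_flags)
```

For example, 5 alternate reads at a depth of 4,000 (0.125% VAF) would pass at a hotspot position but not elsewhere on the panel.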
[1261] Copy number variants (CNVs) were analyzed utilizing CNVkit
and a CNV
annotation and filtering algorithm provided by the present disclosure.
Talevich et al., 2016,
"CNVkit: Genome-Wide Copy Number Detection and Visualization from Targeted DNA
Sequencing," PLoS Comput Biol, (12), pg. 1004873. This CNVkit provides genomic
region
binning, coverage calculation, bias correction, normalization to a reference
pool,
segmentation, and visualization. The log2 ratios between the tumor sample and
a pool of
process matched healthy samples from the CNVkit output were annotated and
filtered using
statistical models, such that the amplification status (e.g., amplified or not-
amplified) of each
gene is predicted and non-focal amplifications are removed.
[1262] Rearrangements were detected using the SpeedSeq analysis
pipeline. Chiang et
al., 2015, "SpeedSeq: ultra-fast personal genome analysis and interpretation,"
Nat Methods,
(12), pg. 966. Briefly, FASTQ files were aligned to hg19 using BWA. Split
reads mapped to
multiple positions and read pairs mapped to discordant positions were
identified and
separated, then utilized to detect gene rearrangements by LUMPY. Layer et al.,
2014,
"LUMPY: a probabilistic framework for structural variant discovery," Genome
Biol, (15),
pg. 84. Fusions were then filtered according to the number of supporting
reads.
[1263] Predicted functional effect and clinical interpretation
for each variant were curated
by automated software using information from both internal and external
databases. A
weighted-heuristic model was used, which has logic-based recommendations from
the
AMP/ASCO/CAP/ClinGen Somatic working group and ACMG guidelines. Li et al.,
2017,
"Standards and Guidelines for the Interpretation and Reporting of Sequence
Variants in
Cancer: A Joint Consensus Recommendation of the Association for Molecular
Pathology,
American Society of Clinical Oncology, and College of American Pathologists,"
The Journal
of molecular diagnostics, (19), pg. 4; Kalia et al., 2017, "Recommendations for
reporting of
secondary findings in clinical exome and genome sequencing, 2016 update (ACMG
SF v2.0):
a policy statement of the American College of Medical Genetics and Genomics,"
Genetics in
Medicine, (19), pg. 249.
[1264] The relative frequency and distribution are determined for
any read containing
repetitive sequences to detect microsatellite instability. To predict the
probability of an
unstable locus, a k-nearest neighbors model (with k = 100) was utilized along
with
normalized percent lower, mean lower, and mean log-likelihood metrics. The
percentage of
unstable loci was calculated from the probabilities of each sample, with
greater than 50%
unstable loci considered microsatellite instability-high (MSI-H).
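The final classification step can be sketched as follows. The text does not state how a per-locus probability is converted into an unstable call, so the `locus_cutoff` of 0.5 is an assumption; the sample-level rule (greater than 50% unstable loci) is taken from the text. The k-nearest neighbors model itself is not reproduced here.

```python
def msi_status(locus_probabilities, locus_cutoff=0.5, sample_cutoff=0.5):
    """Classify a sample from per-locus instability probabilities.

    locus_cutoff is an assumed rule for calling a locus unstable;
    sample_cutoff reflects the stated >50% unstable loci criterion.
    """
    if not locus_probabilities:
        raise ValueError("no loci supplied")
    unstable = sum(p > locus_cutoff for p in locus_probabilities)
    fraction = unstable / len(locus_probabilities)
    label = "MSI-H" if fraction > sample_cutoff else "not MSI-H"
    return label, fraction
```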
[1265] The validation approach.
[1266] The present disclosure conducted extensive validation
studies to establish robust
technical performance of the liquid biopsy assay. Limit of detection (LOD) was
determined by
assessing analytical sensitivity in reference standards with 5%, 1%, 0.5%,
0.25%, and 0.1%
VAF generated from the Horizon Discovery reference set. The Horizon Discovery
set
includes 160 bp cfDNA fragments from human cell lines in an artificial plasma
matrix to
closely resemble cfDNA extracted from human plasma. VAFs of SNVs and indels,
including
EGFR (ΔE746 - A750), EGFR (V769 - D770insASV), EGFR A767_V769dup, EGFR
(L858R), EGFR (T790M), KRAS (G12D), NRAS (A59T), NRAS (Q61K), AKT1 E17K,
PIK3CA (E545K), and GNA11 Q209L, and CNVs and rearrangements, including
CCDC6/RET, SLC34A2/ROS1, MET, MYC, and MYCN, were measured in reference
samples by the liquid biopsy assay of the present disclosure. Each measurement
was
conducted with a minimum of three replicates at 10 ng, 30 ng, and 50 ng of
DNA. Sensitivity
was determined by the number of detected variants divided by the total number
of variants
present in the reference samples. Samples with an on-target rate of less than
30% were
excluded from the instant analysis, and MET (4.5 copies) was included in CNV
sensitivity
determinations. Sensitivity of greater than 90% was considered reliable
detection.
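The sensitivity calculation described above, including the on-target exclusion and the reliability cutoff, can be sketched as below. The dictionary keys and the function name are hypothetical; the 30% on-target filter and the 90% reliability criterion are from the text.

```python
def analytical_sensitivity(samples, min_on_target=0.30, reliable_cutoff=0.90):
    """Pool detected / expected variants over samples that pass the filter.

    samples: list of dicts with 'on_target_rate', 'detected', and 'expected'.
    Returns (sensitivity, is_reliable).
    """
    kept = [s for s in samples if s["on_target_rate"] >= min_on_target]
    detected = sum(s["detected"] for s in kept)
    expected = sum(s["expected"] for s in kept)
    if expected == 0:
        raise ValueError("no expected variants after filtering")
    sensitivity = detected / expected
    return sensitivity, sensitivity > reliable_cutoff
```

As a usage example, three replicates that each detect 15 of 16 expected variants yield a pooled sensitivity of 45/48 = 93.75%, and a fourth replicate with a 21% on-target rate would be dropped before pooling.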
[1267] Analytical specificity was determined using 44 normal
samples titrated at 1%,
2.5%, or 5% from a wild-type cfDNA reference standard with a list of confirmed
true-
negative SNVs, indels, CNVs and rearrangements. Specificity was determined by
the
number of known true-negative variants divided by the number of true-negative
variants plus
false-positive variants identified by the liquid biopsy assay.
[1268] To assess inter-instrument concordance between the
sequencing instruments, 10
patient libraries were sequenced on each instrument (3 NovaSeqs). Variants
seen below the
lower limit of detection (LLOD) (0.25% for SNVs and 0.50% for indels) were
excluded from
concordance analysis.
[1269] To establish analytical accuracy, the results of 40
validation samples were
compared to the results of an orthogonal reference method (Roche's AVENIO
ctDNA assay).
Analytical accuracy was determined by the number of detected variants divided
by the total
number of variants present in the sample. Variants that were off-target or
below LLOD
(0.25% for SNVs and 0.5% for indels) were excluded from the instant analysis.
[1270] Conducting digital droplet polymerase chain reaction
(ddPCR).
[1271] Five variants were validated on the ddPCR platform: KRAS
G12D (Integrated
DNA Technologies, IDT, published sequences); TERT promoter mutations c.-124C>T
(C228T) & c.-146C>T (C250T) (Thermo Fisher Scientific); and TP53 p.R273H and
TP53 p.R175H (Thermo Fisher Scientific). Each amplification reaction was performed
in 25 µL and contained 1X Genotyping Master Mix (Thermo Fisher Scientific), 1X droplet
stabilizer (RainDance), 1X of primer/probe mixture for TERT and TP53 (for KRAS: 800 nM of
each primer and 500 nM of each probe) plus template. To improve the lower limit of
detection, 4-cycle amplification was conducted prior to droplet generation. Amplification
for KRAS was conducted using the cycling conditions of: 1 cycle of 95 °C (0.6 °C/s ramp) for
10 minutes, 4 cycles of 95 °C (0.6 °C/s ramp) for 15 seconds and 60 °C for 2 minutes, followed
by 1 cycle of 98 °C (0.6 °C/s ramp) for 10 minutes. Cycling conditions for the TP53 variants
were the same as those for KRAS with the exception of the annealing and extension
temperature, which was set at 55 °C for 2 minutes. Amplification for TERT followed Thermo
Fisher's recommendation as follows: 1 cycle of 96 °C (1.6 °C/s ramp) for 10 minutes, 4
cycles of 98 °C (1.6 °C/s ramp) for 30 seconds and 55 °C for 2 minutes, followed by 1 cycle of
55 °C (1.6 °C/s ramp) for 2 minutes. Droplets were then generated on the RainDance Source,
and amplification was performed following the above cycling conditions with cycle
numbers of 45 for both KRAS and TP53, and 54 for TERT. Droplets were analyzed on a
RainDance Sense droplet reader, and RainDrop Analyst II v1.1.0 analysis software
was utilized to acquire and analyze data.
[1272] The concordance between liquid biopsy and solid tumor
assays.
[1273] Matched liquid biopsy and solid tumor sample pairs (n =
55) were used to
determine analytical sensitivity and specificity. Solid tumor and matched
normal samples
obtained from peripheral blood buffy coat were analyzed with the solid tumor
assay, and
corresponding blood plasma samples were analyzed with the liquid biopsy assay
of the
present disclosure. Only variants in the reportable range of both the solid
tumor and liquid
biopsy panels were included in these analyses (e.g., genes in the liquid
biopsy gene panel are a
subset of genes in the solid tumor gene panel). Germline, intronic, and
synonymous variants
identified in the solid tumor assay and the liquid biopsy assay were excluded
from analysis
with the exception of intronic splice variants. To determine analytical
sensitivity, the number
of variants called in both the liquid biopsy assay and the solid tumor assay
(e.g., true
positives) was divided by the sum of true positives and those called only in
the solid tumor
assay. To determine analytical specificity the number of positions reported in
neither the
liquid biopsy assay nor the solid tumor assay (e.g., true negatives) was
divided by the sum of
true negatives and variants only called in the liquid biopsy assay.
[1274] To improve variant calling in the liquid biopsy assay, a
strategy that dynamically
determines local sequence errors using Bayes' theorem and the likelihood ratio
test was
developed. The dynamic threshold was determined using a sample-specific error
rate, the
error rate from healthy control samples, and from a reference cohort of solid
tumor samples.
Accordingly, the method of the present disclosure was conducted on 55 matched
liquid
biopsy/solid tumor tissue samples, with variants detected in the solid tumor
assay as the
source of truth. Specificity was derived from the sensitivity thresholds defined by the
LOD analysis, fixed post-test-odds (e.g., equal to P(post-test) / [1 - P(post-test)]),
and pre-test-odds. The pre-test-odds were determined using historical data from the
solid tumor assay with an equation identical to the post-test-odds calculation.
Accordingly, the following formula was determined based on the above:
specificity = 1 - pre-test-odds * sensitivity / post-test-odds
[1275] The specificity was input to a beta-binomial function and
yielded the minimum
number of alternate alleles to call a variant at a particular depth. The pre-
test-odds metric
was specific to individual cancer cohorts and individual genes, allowing for
cancer-specific
pre-test-odds to be applied to individual exons.
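A minimal sketch of how a required specificity could be turned into a minimum alternate-allele count at a given depth is shown below. The specificity formula follows the text. The beta-binomial shape parameters `a` and `b`, the function names, and the rule of choosing the smallest count whose background tail probability does not exceed 1 - specificity are assumptions; the disclosure does not spell out these details.

```python
import math

def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)

def required_specificity(pre_test_odds, sensitivity, post_test_odds):
    """specificity = 1 - pre-test-odds * sensitivity / post-test-odds."""
    return 1.0 - pre_test_odds * sensitivity / post_test_odds

def betabinom_pmf(k, n, a, b):
    """Beta-binomial probability mass, computed in log space for stability."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + math.lgamma(k + a) + math.lgamma(n - k + b) - math.lgamma(n + a + b)
        + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    )
    return math.exp(log_pmf)

def min_alt_alleles(depth, specificity, a, b):
    """Smallest k with P(X >= k) <= 1 - specificity for background errors X.

    a and b are hypothetical shape parameters of the background error model.
    """
    allowed_fp = 1.0 - specificity
    tail = 1.0  # P(X >= 0)
    for k in range(depth + 1):
        if tail <= allowed_fp:
            return k
        tail -= betabinom_pmf(k, depth, a, b)  # tail becomes P(X >= k + 1)
    return depth + 1  # no count at this depth meets the requirement
```

A stricter specificity yields a higher minimum count at the same depth, which is how cohort-specific and gene-specific pre-test-odds would translate into different calling thresholds under this sketch.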
[1276] Conducting low-pass whole genome sequencing and analysis.
[1277] Blood samples from 375 patients were sequenced using low-
pass whole-genome
sequencing (LPWGS) across four flow cells. Sequencing coverage metrics for
these samples
were determined using Picard CollectWgsMetrics. The tumor fraction and ploidy
values for
each sample were estimated using ichorCNA with a specific reference panel of
47 normal
samples. Adalsteinsson et al., (2017), "Scalable whole-exome sequencing of cell-free DNA
reveals high concordance with metastatic tumors" Nat Commun, (8), pg. 1324.
Reported
variants from the corresponding liquid biopsy analysis of each sample were
utilized to assess
the accuracy of the tumor fraction estimates.
[1278] Determining estimation of circulating tumor fraction.
[1279] Circulating tumor fraction estimate (ctFE) was determined
using a novel method,
Off-Target Tumor Estimation Routine (OTTER), from off-target reads uniformly
distributed
across the human reference genome. As described above, the CNVkit was
conducted on each
sample, and segments were assigned via circular binary segmentation (CBS).
Olshen et al.,
2004, "Circular binary segmentation for the analysis of array-based DNA copy
number data,"
Biostatistics, (5), pg. 557. Segments were then fit to integer copy states via
an expectation-maximization algorithm using the sum of squared error of the segment log2
ratios (e.g.,
normalized to genomic interval size) to expected ratios given a putative copy
state and tumor
purity. Estimates were confirmed by comparing results against LPWGS of the
original
patient isolate. As such, results are shown using randomly selected, de-
identified samples.
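The idea of fitting segment log2 ratios to integer copy states at a putative tumor purity can be sketched as below. This is not the OTTER implementation: a simple grid search over purity is substituted for the expectation-maximization procedure named in the text, a diploid normal background is assumed, and the function names and `max_copy` limit are hypothetical.

```python
import math

def expected_log2_ratio(copy_number, purity, normal_copies=2):
    """Expected log2 ratio of a segment at an integer tumor copy state."""
    mixed = purity * copy_number + (1.0 - purity) * normal_copies
    return math.log2(mixed / normal_copies)

def fit_tumor_fraction(segments, max_copy=6):
    """Grid-search purity; segments is a list of (log2_ratio, size_bp).

    Returns (purity, copy_states) minimizing the size-weighted squared error.
    """
    total_size = sum(size for _, size in segments)
    best = None
    for step in range(1, 100):  # purity 0.01 .. 0.99 keeps log2 finite
        purity = step / 100
        sse = 0.0
        states = []
        for log2_ratio, size in segments:
            # Assign the integer copy state with the smallest squared error.
            errors = [
                (log2_ratio - expected_log2_ratio(c, purity)) ** 2
                for c in range(max_copy + 1)
            ]
            state = min(range(max_copy + 1), key=errors.__getitem__)
            states.append(state)
            sse += errors[state] * size / total_size  # weight by segment size
        if best is None or sse < best[0]:
            best = (sse, purity, states)
    return best[1], best[2]
```

Different purity and copy-state combinations can produce identical expected ratios, so a sketch like this can be ambiguous when only a few segment levels are present; a production method would need additional constraints to resolve such ties.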
[1280] Clinical profiling of liquid biopsy patients.
[1281] De-identified molecular and abstracted clinical data were
evaluated in a cohort of
1,000 patients randomly selected from a specific reference clinicogenomic
database. All data
were de-identified in accordance with the Health Insurance Portability and
Accountability
Act (HIPAA). Dates used for analyses were relative to the first liquid biopsy
sequencing date
of each patient, and year of the first sequencing date was randomly off-set.
Variants included
in the analyses were those classified as pathogenic or likely pathogenic, and
further divided
into actionable if matched to diagnostic, prognostic or therapeutic evidence
or biologically
relevant. Outcomes were determined according to the most recent clinical
response noted in
patient records. The study protocol was submitted to the Advarra Institutional
Review Board
(IRB), which determined the research was exempt from IRB oversight and
approved a waiver
of HIPAA authorization for this study.
[1282] Example 4 - Results of Validating Liquid Biopsy Assay
[1283] Liquid biopsy validation summary.
[1284] The liquid biopsy oncology assay is a 105-gene hybrid
capture NGS panel
designed to detect actionable somatic variant targets in plasma. Referring to
Figures 16A
through 16C, the liquid biopsy assay detects mutations in four variant
classes, including:
single nucleotide variants (SNVs) and insertion-deletions (indels) in all 105
genes, copy
number variants (CNVs) in 6 genes, and chromosomal rearrangements in 7 genes.
To
validate the liquid biopsy assay, a total of 188 samples were sequenced. The
runs generated
an average of 261.7 M ± 40.7 M total reads with 130.7 M ± 20.3 M read pairs
and a unique
median read depth of 4999.128 ± 1288.843. The average percent of mapped reads
across all
runs was 99.876% ± 0.0078.
[1285] Referring to Figure 13, determined analytical sensitivity
for all SNVs, indels,
CNVs, and rearrangements targeted in the reference samples is provided.
Accordingly,
SNVs were reliably detected at greater than or equal to 0.25% VAF with 30 ng
of input DNA
(93.75% [45/48] sensitivity), indels at greater than or equal to 0.5% VAF with
30 ng (95.83%
[23/24] sensitivity), CNVs at greater than or equal to 0.5% VAF with 10 ng
(100.00% [8/8]
sensitivity), and rearrangements at greater than or equal to 1% VAF with 30 ng
(90% [9/10]
sensitivity). Referring to Figure 14, analytical specificity was 100% for
SNVs, indels, and rearrangements, and 96.2% for CNVs, on samples with greater
than or
equal to 0.25% VAF with 30 ng of input DNA.
[1286] Accordingly, intra-assay and inter-assay concordance
between the replicates in the
present disclosure was 100% for SNVs, indicating a high degree of
repeatability and
reproducibility. Moreover, the inter-instrument concordance was 96.70% for
SNVs and
100% for indels, with a combined concordance of 96.83% across instruments.
Additionally,
interfering substances including genomic DNA, ethanol, and isopropanol did not
cause a
change in the detection of variants. Concordance between controls and samples
with
interfering substances was high (e.g., 100%) among samples that passed
filtering, and were
above the LOD.
[1287] The accuracy of the liquid biopsy assay compared to
orthogonal assays.
[1288] Referring to Figure 15, to evaluate analytical accuracy,
the present disclosure
compared the liquid biopsy assay to the Roche AVENIO ctDNA assay. In 30 ng
cfDNA
samples analyzed by liquid biopsy assay and AVENIO ctDNA assay (n = 40),
sensitivity for
SNVs, indels, CNVs and rearrangements was 94.8%, 100%, 100%, and 100%,
respectively.
Of the 6 SNVs that were not detected, 5 were called but filtered out due to
insufficient
evidence. In 10 ng samples, sensitivity for SNV, indel, CNV, and
rearrangements was
91.9%, 100%, 80%, and 100%, respectively. Of the 7 SNVs that were not
detected, 6 were
present in sequencing data but filtered out due to insufficient evidence.
[1289] Referring to Figures 8A and 8B, to further validate the
liquid biopsy assay results,
patients with reported variants KRAS G12D (n = 12), TERT c.-124 (n = 7), TERT
c.-146 (n
= 5), TP53 R273H (n = 7), and TP53 R175H (n = 7) were selected for analysis by
ddPCR.
Liquid biopsy NGS VAF was compared with ddPCR VAF to determine concordance.
Accordingly, 100% PPV and a high correlation between ddPCR results and liquid
biopsy
VAF (R2= 0.892) were observed, including for individual variants such as KRAS
G12D (R2= 0.970),
as shown
in Figures 8A and 8B. These results indicate the liquid biopsy assay of the
present disclosure
can be used to accurately identify hotspot mutations. Specifically, Figure 8A
illustrates
results of an inter-assay comparison between liquid biopsy, ddPCR, and solid
tumor results
for patient samples with selected variants (n = 38) analyzed by ddPCR and
compared with
liquid biopsy variant allele fraction (VAF), resulting in high correlation
overall (R2 = 0.892).
Figure 8B illustrates results of an inter-assay comparison between liquid
biopsy, ddPCR, and
solid tumor results for patient samples with individual variants such as KRAS
G12D (n = 12,
R2= 0.970).
[1290] The concordance between liquid biopsy and solid tumor
tissue assay.
[1291] Comparisons between analytical sensitivity and specificity
in matched solid tumor
and liquid biopsy tests from 55 patients were determined. Since solid tumor
matched
samples include both tumor tissue and buffy coat (e.g., normal comparator), a
specific
classification strategy was utilized to determine and exclude germline
variants from the
analysis. Beaubier et al., 2019, "Clinical validation of the xT next-
generation targeted
oncology sequencing assay," Oncotarget, 10(24), pg. 2384. Removing intronic
and
synonymous variants, benign and likely benign variants, as well as variants
below the LOD
for solid tumor and liquid biopsy assays resulted in 145 concordant SNVs, 20
concordant
indels, and 11 concordant CNVs. 66 SNVs, 11 indels, and 8 CNVs were identified
that were
reported in the solid tumor assay but not the liquid biopsy assay, as well as
209 SNVs, 14
indels, and 7 CNVs that were reported in the liquid biopsy assay but not the
solid tumor
assay. Accordingly, the specificity of the liquid biopsy assay was 100.00% for
SNVs and
indels and 96.67% for CNVs. Referring to Figure 17, a Bayesian dynamic
filtering
methodology was utilized to further reduce discordance by 11.45%, improving
the specificity
of variant calling in the liquid biopsy assay. The overall sensitivity of the
liquid biopsy assay
compared to the solid tumor assay was 68.18% for SNVs and indels and 57.89%
for CNVs.
When limiting analysis to clinically actionable targets, 107 concordant
variants and 37
discordant, for a sensitivity of 74.31%, were reported.
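The reported concordance sensitivities can be reproduced from the counts given above. The helper name is hypothetical, and the actionable-target calculation assumes the 37 discordant variants are all variants missed by the liquid biopsy assay, which is the reading consistent with the reported 74.31%.

```python
def sensitivity(concordant, solid_only):
    """True positives divided by true positives plus solid-tumor-only calls."""
    return concordant / (concordant + solid_only)

snv_indel = sensitivity(145 + 20, 66 + 11)  # 165 / 242
cnv = sensitivity(11, 8)                    # 11 / 19
actionable = sensitivity(107, 37)           # 107 / 144

print(f"{snv_indel:.2%} {cnv:.2%} {actionable:.2%}")
# prints "68.18% 57.89% 74.31%"
```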
[1292] Furthermore, comparisons between the sample classification
of reportable variants
between matched samples with liquid biopsy and solid tumor testing were
determined.
Referring to Figure 8C, variants were considered CH variants if found in the
plasma as well
as in the solid tumor normal sample but were not present at levels consistent
with germline
variation. Accordingly, this classification of germline and CH variants in
liquid biopsy is
possible with a corresponding solid tumor assay or a germline sequencing
analysis from the
buffy coat. Notably, two samples have a large number of variants only detected
in liquid
biopsy, many of which are at low VAFs. These samples were subsequently
determined to
have very high tumor mutational burdens (TMBs) in their corresponding solid
tumor
analyses. Accordingly, the large number of liquid biopsy variants at low VAFs
and high
TMBs suggest that these tumors may be more heterogeneous and that some
variants are more
easily detected in blood. Specifically, Figure 8C illustrates results of an
inter-assay
comparison between liquid biopsy, ddPCR, and solid tumor results for sample
classification
of reportable variants, in which microsatellite instability (MSI) was detected
by the liquid
biopsy assay in six out of sixteen MSI-high patients, with 100% positive predictive value as indicated by
the one or
more blue dots depicted above the dotted line.
[1293] Finally, liquid biopsy validation samples were utilized to
assess microsatellite
instability in patients whose MSI status was previously confirmed by a
specific reference
clinically validated solid tumor MSI test or immunohistochemistry. Referring
to Figure 8D,
the liquid biopsy assay reported MSI-H status in 37.5% (6/16) of orthogonally
confirmed
MSI-H patients at 100% (6/6) positive predictive value. Accordingly,
comparisons between
the solid tumor and liquid biopsy assays demonstrate the strengths of the
liquid biopsy assay
and the added value of using multiple assays to detect genomic drivers of
cancer.
Specifically, Figure 8D illustrates results from liquid biopsy and solid tumor
assays compared
in patients who received both tests (n = 55) of Figure 8A and Figure 8B, in
which the percent
circulating tumor DNA VAF, depicted above the dashed line, and number of
reportable
variants detected, depicted below the dashed line, for each individual patient
were
categorized by assay type and CHIP or germline status.
[1294] OTTER, a novel method for estimating tumor fraction.
[1295] An accurate measure of tumor fraction can provide an
improved understanding of
variants identified through liquid biopsy testing. In the present disclosure,
a novel method,
Off-Target Tumor Estimation Routine (OTTER), for determining a more accurate
circulating
tumor fraction estimate (ctFE) was developed. Referring to Figures 9A and 9B,
comparisons
between OTTER ctFE with VAFs from 1,000 random patient samples across cancer
types
were determined, such that liquid biopsy ctFE correlates with max pathogenic
VAF and
median VAF. Referring to Figures 9C through 9F, removing germline variants and
amplified
regions from these analyses further increased the correlation. Plausible
liquid biopsy ctFE
estimates are expected to be greater than or equal to the maximal somatic VAF
in a sample
that is not on an amplified region. Referring to Figure 9H, overall, after
removing germline
variants and variants on amplified regions, 90.8% of median VAFs are less than
or equal to
the corresponding liquid biopsy ctFEs. Referring to Figure 9H, the
distribution of liquid
biopsy ctFE for the liquid biopsy 1,000 cohort is provided. Accordingly, the
median ctFE
was 0.07 with a mean ctFE of 0.12.
[1296] In addition to VAF, LPWGS is increasingly utilized to
estimate tumor fractions
and thought to be a more accurate measure than VAF. Adalsteinsson et al., 2017; Chen et al.,
2017; Chen el al.,
2019, "Next-generation sequencing in liquid biopsy: cancer screening and early
detection,"
Hum Genomics, (13), pg. 34. Referring to Figure 9G, comparisons between LPWGS
ichorCNA-predicted circulating tumor fraction to the OTTER ctFE in matched
patient
samples (n = 375) determined a strong correlation between methods (R2 = 0.843,
P = 4.71e-152). Accordingly, this correlation indicates that OTTER ctFEs are highly
concordant with
estimates using LPWGS but can be determined directly from the targeted-panel
sequencing
without requiring additional sequencing.
[1297] Specifically, Figure 9A illustrates results from
circulating tumor fraction estimate
(ctFE) and variant allele fraction (VAF) in which ctFE of liquid biopsy-
sequenced patients (n
= 1,000) was correlated with max pathogenic VAF (R2= 0.38). Figure 9B
illustrates results
from ctFE and VAF in which ctFE of liquid biopsy-sequenced patients (n =
1,000) was
correlated with median VAF (R2= 0.35). Figure 9C illustrates results from ctFE
and VAF in
which ctFE of liquid biopsy-sequenced patients (n = 1,000) in which germline
variants were
removed, increasing the correlation with max pathogenic VAF (R2= 0.40). Figure
9D
illustrates results from ctFE and VAF in which ctFE of liquid biopsy-sequenced
patients (n =
1,000) in which germline variants were removed, without increasing the
correlation with
median VAF (R2= 0.35). Figure 9E illustrates results from ctFE and VAF in
which ctFE of
liquid biopsy-sequenced patients (n = 1,000) in which amplified regions from
these analyses
were removed, increasing the correlation with max pathogenic VAF (R2= 0.41).
Figure 9F
illustrates results from ctFE and VAF in which ctFE of liquid biopsy-sequenced
patients (n =
1,000) in which amplified regions from these analyses were removed, increasing
the
correlation with median VAF (R2= 0.36). Figure 9G illustrates that, among the
liquid biopsy-sequenced patients (n = 1,000), for samples that also
underwent low-pass whole genome sequencing (LPWGS, n = 375), a strong
correlation
between LPWGS-predicted tumor fraction and ctFE (R2= 0.843) was found.
Furthermore,
Figure 9H illustrates the overall distribution of ctFE across the liquid
biopsy-sequenced
patients (n = 1,000) (median ctFE =
0.07, mean ctFE = 0.12, and standard deviation = 0.15).
[1298] Retrospective clinical profiling of the liquid biopsy
assay against a 1,000-subject
cohort.
[1299] To evaluate the clinical utility of the liquid biopsy, de-
identified molecular and
clinical data from 1,000 samples across cancer types were selected for
clinical profiling. This
included 55.7% female and 44.3% male patients, with a median age of 66 years,
and
interquartile range of 15. Referring to Figure 18, this cohort included
patients from 24 cancer
categories, with breast (n = 254), colorectal (n = 98), lung (n = 241),
pancreatic (n = 83), and
prostate (n = 96) being the most common. Referring to Figure 10A, the median
ctFE
predicted by OTTER was 0.07 for all cancer types, with the exception of
prostate, which was
0.06. Referring to Figure 10B, in this cohort, 8,099 mutations were reported,
of which 2,732
were pathogenic, and 2,238 were clinically actionable. Specifically, Figure
10A illustrates
circulating tumor fraction estimate (ctFE) and mutational landscape by cancer
type, in which
median ctFE among the most common cancer types was 0.07, with the exception of
prostate
(ctFE = 0.06). Figure 10B illustrates circulating tumor fraction estimate
(ctFE) and
mutational landscape by cancer type, in which variants are categorized as
reportable,
pathogenic, or actionable. Across all patients, the most commonly mutated gene
was TP53.
The heatmap was normalized within rows to depict the most prevalent variants
detected for
each common cancer type in the cohort (breast n = 254, colorectal n = 98, lung
n = 241,
pancreatic n = 83, and prostate n = 96).
[1300] Accordingly, the most frequently mutated gene in the
liquid biopsy 1,000 cohort
was TP53 (51.1% of patients). The most commonly mutated genes were TP53,
PIK3CA,
ESR1, BRCA2, NF1, ATM and APC in breast cancer, TP53, EGFR, ATM and KRAS in
lung
cancer, and TP53, APC, and KRAS in colorectal cancer. These findings are
consistent with
existing literature on commonly mutated genes in each cancer type and suggest
the liquid
biopsy test accurately detects variants of interest to the broader cancer
community. van
Helden et al., 2019; Dal Maso et al., 2019; Savli et al., 2019, "TP53, EGFR and
PIK3CA
gene variations observed as prominent biomarkers in breast and lung cancer by
plasma cell-free DNA genomic testing," J Biotechnol, (300), pg. 87; Cheng et al., 2019,
"Liquid Biopsy
Detects Relapse Five Months Earlier than Regular Clinical Follow-Up and Guides
Targeted
Treatment in Breast Cancer," Case Rep Oncol Med, pg. 6545298; Keup et al.,
2019,
"Targeted deep sequencing revealed variants in cell-free DNA of hormone
receptor-positive
metastatic breast cancer patients," Cell Mol Life Sci, print.; Li et al., 2019,
"Genomic
profiling of cell-free circulating tumor DNA in patients with colorectal
cancer and its fidelity
to the genomics of the tumor biopsy," J Gastrointest Oncol, (10), pg. 831.
[1301] Advanced disease is associated with higher estimated tumor
fraction.
[1302] A goal of liquid biopsy assays of the present disclosure
is to more efficiently
monitor treatment response and predict disease progression in patients over
time. To
establish proof of concept, the association of ctFE with advanced disease
states was
investigated. Accordingly, referring to Figure 11A, a significant difference
in ctFE between
stages (P = 2.97e-5) was determined. However, since the majority of patients
had advanced
disease at the time of testing, more early stage samples are necessary to
further verify these
findings. Referring to Figure 11B, ctFE in patients with metastatic disease
was evaluated to
determine whether ctFE increases when distant sites are affected. Indeed,
referring to Figure
11C, patients with no metastatic lesions had a significantly lower ctFE than
patients with one
or more distant sites (P = 4.77e-7), further highlighting the potential of
ctFE for disease
monitoring. Specifically, Figure 11A illustrates circulating tumor fraction
estimate (ctFE)
according to stage and number of distant metastases among the liquid biopsy
1,000 cohort, in
which there was a significant difference in ctFE between stages (Kruskal-
Wallis P = 2.97e-5).
Accordingly, patients with stage 4 cancer (n = 879, median ctFE = 0.07) had a
higher ctFE
than those with stages 1 (n = 20, median ctFE = 0.06), 2 (n = 25, median ctFE
= 0.06), or 3 (n
= 76, median ctFE = 0.06). Figures 11B and 11C illustrate that ctFE increased
with the
number of metastatic distant sites (Mann-Whitney U test P = 7.57e-7), and
there was a
significant difference in ctFE between patients with no metastatic lesions (n
= 116) and those
with 1 or more distant sites affected (n = 884, Mann-Whitney U test P = 2.12e-5).
The
sensitivity and specificity shown to the right-hand side of the Figure 11C
represent the
probability that a binary metastasis status prediction is correct at a given
ctFE threshold.
Accordingly, the model predicts metastasis with greater confidence at higher
ctFE.
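The threshold-dependent sensitivity and specificity described for Figure 11C can be illustrated with a short sketch. The disclosure does not specify the prediction model, so a simple rule (predict metastasis when ctFE is at or above a threshold) is assumed here, and the function name and inputs are hypothetical.

```python
def threshold_performance(ctfe_values, metastatic_flags, threshold):
    """Sensitivity and specificity of the rule 'metastatic if ctFE >= threshold'."""
    tp = fp = tn = fn = 0
    for ctfe, metastatic in zip(ctfe_values, metastatic_flags):
        predicted = ctfe >= threshold
        if predicted and metastatic:
            tp += 1
        elif predicted and not metastatic:
            fp += 1
        elif not predicted and metastatic:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Sweeping the threshold over a range of ctFE values and recording both metrics at each step would give curves of the kind the figure description refers to.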
[1303] Estimated tumor fraction correlates with response to
treatment.
[1304] To determine how ctFE changes in response to treatment,
comparisons between
ctFE with the most recent clinical response outcome were determined.
Accordingly, referring
to Figure 12A, patients classified as having complete response were determined
to have a
significantly lower median ctFE of 0.05, compared to 0.06, 0.06, and 0.08 in
patients with
stable disease, partial response, and progressive disease, respectively.
Additionally, referring
to Figure 12B, patients with multiple liquid biopsy tests were determined to
have large
differences in ctFE between test dates. For example, referring to Figure 12C,
one breast
cancer case had a ctFE of 0.05 at initial liquid biopsy testing. After
treatment with
bevacizumab and paclitaxel, clinical notes indicate the patient was classified
as having stable
disease. Eribulin treatment was started shortly after, but the patient was
later diagnosed with
progressive disease. A second liquid biopsy test, which was performed
approximately 200
days after the initial liquid biopsy test, revealed a ctFE of 0.26, which
supports the
progressive disease diagnosis. Alternatively, in a breast cancer patient with
progressive
disease who was treated with investigational new drug therapies, the patient's
status was
updated to stable disease shortly after the first liquid biopsy test, which
revealed a ctFE of
0.05. Approximately 100 days later, the patient's second liquid biopsy test
revealed a ctFE of
0.09. The patient likely received no further treatment before the third liquid
biopsy test,
which revealed a ctFE of 0.27, suggesting this patient's disease had
progressed. Specifically,
Figure 12A illustrates circulating tumor fraction estimate (ctFE) and
abstracted clinical
outcomes in a sub-cohort of the liquid biopsy 1000 (n = 388) in which patients
with complete
response (n = 9, ctFE = 0.05) exhibited lower ctFE than those with progressive
disease (n =
298, ctFE = 0.08), partial response (n = 56, ctFE = 0.06), or stable disease
(n = 25, ctFE =
0.06). Figure 12B illustrates that ctFE was also assessed temporally among a
few randomly
selected patients with multiple liquid biopsy tests throughout the course of
treatment (n = 26),
with most patients showing large differences in ctFE between test dates.
Figure 12C
illustrates four exemplary cases highlighting the utility of ctFE in relation
to treatment course
and disease status.
[1305] In the case of a lung cancer patient who underwent
multiple rounds of treatment,
including carboplatin, pemetrexed, and etoposide, a decrease in ctFE between
liquid biopsy
tests (0.72 to 0.47) was determined. However, the ctFE was still extremely
high after
treatment, making progressive disease likely. Indeed, the patient was
classified as having
progressive disease by their oncologist shortly before the second liquid
biopsy test date.
Alternatively, a patient who had undergone treatment with osimertinib and
crizotinib
approximately 50 days before the first liquid biopsy test showed very little
change in ctFE
between test dates (0.3-0.11) and was classified as stable shortly before the
second liquid
biopsy test. Referring to Figures 11A through 12C, while conclusions about the
larger
population cannot be drawn from these individual cases, the changes
in ctFE in
response to treatment are consistent with the above analyses showing that
higher ctFEs are
associated with advanced disease. Additionally, these results illustrate how
serial testing can
be beneficial for precision oncology in individual patients. These results
further highlight the
need for longitudinal studies with serial liquid biopsy testing in a larger
cohort of patients.
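The serial-testing cases described above can be tabulated with a short sketch. It uses only the ctFE values quoted in the text for three of the exemplary cases. The labels, the dictionary layout, and the helper name are hypothetical, and the fourth case is omitted because its quoted range is ambiguous.

```python
from typing import Dict, List

# ctFE at successive liquid biopsy tests, as quoted in the text above.
serial_ctfe: Dict[str, List[float]] = {
    "breast case, two tests": [0.05, 0.26],
    "breast case, three tests": [0.05, 0.09, 0.27],
    "lung case, two tests": [0.72, 0.47],
}


def ctfe_changes(values: List[float]) -> List[float]:
    """Return the change in ctFE between each pair of consecutive tests."""
    return [round(later - earlier, 4) for earlier, later in zip(values, values[1:])]


for label, values in serial_ctfe.items():
    print(f"{label}: changes={ctfe_changes(values)} latest={values[-1]}")
```

The lung case shows why the change alone is not sufficient: ctFE fell by 0.25, yet the latest value of 0.47 remained high, which matches the progressive disease classification reported for that patient.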
[1306] While liquid biopsy is a promising tool for improving
outcomes in precision
oncology, there are challenges that must be overcome before it can replace
large panel NGS
tissue genotyping. For example, in early stage disease, when treatments have
much higher
success rates, many patients have low ctDNA fractions that may be below the
LOD for liquid
biopsies, limiting clinical utility because of the risk of false negatives.
Bettegowda et al.,
2014, "Detection of circulating tumor DNA in early- and late-stage human
malignancies," Sci
Transl Med, (6), pg. 224; Xue et al., 2019, "Early detection and monitoring of
cancer in
liquid biopsy: advances and challenges," Expert Rev Mol Diagn, (19), pg. 273;
Hennigan et
al., 2019, "Low Abundance of Circulating Tumor DNA in Localized Prostate
Cancer," JCO
Precis Oncol, (3), print; Abbosh et al., 2018, "Early stage NSCLC - challenges
to
implementing ctDNA-based screening and MRD detection," Nat Rev Clin Oncol,
(15), pg.
577. Consequently, most studies to date have focused on late stage patients
for assay
validation and research. Furthermore, while validation studies of existing
liquid biopsy
assays have shown high sensitivity and specificity, few studies have
corroborated results with
orthogonal methods, or between NGS testing platforms. Cheng et al., 2019,
"Clinical
Validation of a Cell-Free DNA Gene Panel," J Mol Diagn, (21), pg. 632;
Hanibuchi et al.,
2019, "Development, validation, and comparison of gene analysis methods for
detecting
EGFR mutation from non-small cell lung cancer patients-derived circulating
free DNA,"
Oncotarget, (10), pg. 3654; Van Laar et al., 2018, "Development and validation
of a plasma-based
melanoma biomarker suitable for clinical use," Br J Cancer, (118), pg.
857; Odegaard
et al., 2018, "Validation of a Plasma-Based Comprehensive Cancer Genotyping
Assay
Utilizing Orthogonal Tissue- and Plasma-Based Methodologies," Clin Cancer Res,
(24), pg.
3539; Clark et al., 2018, "Analytical Validation of a Hybrid Capture-Based
Next-Generation
Sequencing Clinical Assay for Genomic Profiling of Cell-Free Circulating Tumor
DNA," J
Mol Diagn, (20), pg. 686; Plagnol et al., 2018, "Analytical validation of a
next generation
sequencing liquid biopsy assay for high sensitivity broad molecular
profiling," PLoS One,
(13), pg. 0193802. Kuderer et al. compared commercially available liquid and
tissue NGS
platforms and found only 22% concordance in genetic alterations. Kuderer
et al., 2017,
"Comparison of 2 Commercially Available Next-Generation Sequencing Platforms
in
Oncology," JAMA Oncol, (3), pg. 996. Other reports of liquid biopsy based
studies are
limited by comparison to non-comprehensive tissue testing algorithms including
Sanger
sequencing, small NGS hotspot panels, PCR and FISH, which may not contain all
NCCN
guideline genes in their reportable range, thus suffering in comparison to a
more
comprehensive liquid biopsy assay. Leighl et al., 2019. Since the 105 gene
liquid biopsy
assay is a subset of the 648 gene solid tumor tissue-based assay, the
concordance data
presented herein (74.31% for actionable variants) represents a direct
comparison to a
comprehensive NGS test which includes the entire reportable range of the
liquid biopsy
assay. Beaubier et al., 2019, "Integrated genomic profiling expands clinical
options for
patients with cancer," Nat Biotechnol, (37), pg. 1351. While this concordance
is high relative to
previous reports, 25.69% of actionable variants would have been missed if only
one of the
tests were performed. Thus, liquid biopsies provide the greatest value to
patients when used
in combination with standard tissue genotyping. Furthermore, having both tests
enabled
additional analyses to exclude germline and CH variants, significantly
improving specificity.
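The relationship between the 74.31% concordance and the 25.69% of actionable variants that a single test would miss can be sketched as follows. The sketch assumes concordance is the share of variants detected by both assays among those detected by either. That definition, the function name, and the example variant calls are illustrative assumptions rather than the definition used for the reported figures.

```python
from typing import Set, Tuple

# (gene, alteration): a hypothetical representation of an actionable variant.
Variant = Tuple[str, str]


def concordance(liquid: Set[Variant], tissue: Set[Variant]) -> float:
    """Percent of variants seen by either assay that were seen by both."""
    union = liquid | tissue
    if not union:
        raise ValueError("no variants to compare")
    return 100.0 * len(liquid & tissue) / len(union)


# Hypothetical calls for one patient, for illustration only.
liquid_calls = {("EGFR", "L858R"), ("KRAS", "G12C"), ("TP53", "R175H")}
tissue_calls = {("EGFR", "L858R"), ("KRAS", "G12C"), ("PIK3CA", "H1047R")}

pct = concordance(liquid_calls, tissue_calls)
print(f"concordant: {pct:.2f}%  detected by only one assay: {100 - pct:.2f}%")

# The figures in the text are complementary in the same way.
print(f"{100 - 74.31:.2f}% of actionable variants seen by only one test")
```

Under this assumed definition, every variant outside the concordant set was reported by only one assay, which is why the two percentages in the text sum to 100%.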
[1307] Accordingly, the systems and methods of the present
disclosure provide
analytical and clinical validation of the liquid biopsy assay. The systems and
methods of the
present disclosure provide high accuracy compared to orthogonal methods,
including tissue
biopsy, Avenio liquid biopsy, ddPCR, and LPWGS. The systems and methods of the
present
disclosure also provide improvements upon existing methodologies for
estimating circulating
tumor fraction. Notably, in combination with real-world clinical data, the
systems and
methods of the present disclosure demonstrate the value and suitability of
liquid biopsy
testing for monitoring disease progression, predicting objective measures of
response, and
assessing treatment outcomes. As such, the results obtained through validating
the systems
and methods of the present disclosure strongly support utilizing the liquid
biopsy assay in
routine monitoring of cancer patients with advanced disease.
[1308] Example 5 - A retrospective analysis on the prognostic
value of Off-Target
Tumor Estimation Routine, a novel circulating tumor fraction (ctFE)
calculation, in
patients with advanced prostate cancer.
[1309] Prostate-specific antigen (PSA) is a biomarker for
monitoring tumor burden and
treatment response. However, due to multiple variable factors (e.g., variation
in PSA
production by prostate cancer (PC) cells, PSA level variation between
patients, and PSA level
variation during the course of the disease), non-invasive biomarkers are
needed for better
prognostication and assessing therapeutic response. We recently developed the
Off-Target
Tumor Estimation Routine method, described herein, which calculates
circulating tumor
fraction estimates (ctFE) using on- and off-target reads from a targeted-panel
liquid biopsy
assay (DNA-Seq of 105 genes at 5,000x depth in circulating tumor DNA [ctDNA]
from
peripheral blood samples). Here, we analyze the prognostic value of ctFE for
advanced PC
patients undergoing liquid biopsy testing.
[1310] We retrospectively analyzed 108 NGS results from 80
patients treated at Ben
Taub Hospital (BTH) with locally advanced, biochemically recurrent or
metastatic prostate
cancer. We calculated ctFE for all patients using this method, which evaluates
the copy state
of regions across the genome. Survival analysis was based on a 6-month follow-
up. For
prognostic analysis, the highest ctFE was used for each patient with >1 xF
result. Patients
were classified as: 1. Low (ctFE-L: ctFE < 0.02); 2. High (ctFE-H: ctFE >
0.02); or 3.
Converters (ctFE-H to L: ctFE drop below 0.02 during follow-up). In 16
metastatic PC
patients receiving first-line androgen deprivation therapy (1LADT, augmented
with
abiraterone/prednisone), pre-treatment and on-treatment ctFE data as well as
clinical follow-
up (median: 12 months) were examined.
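The three-way classification described above can be sketched in a few lines. The 0.02 cutoff comes from the text. The text does not say how a value of exactly 0.02 is treated, so this sketch assumes it counts as high, and it assumes a converter is a patient with a high result followed by any later result below the cutoff. The function name and return labels are hypothetical.

```python
from typing import List

CTFE_CUTOFF = 0.02  # cutoff stated in the text


def classify_patient(ctfe_by_date: List[float], cutoff: float = CTFE_CUTOFF) -> str:
    """Classify a patient from ctFE results ordered by test date.

    Assumptions (not stated in the source): a value equal to the cutoff is
    treated as high, and any later result below the cutoff after a high
    result marks the patient as a converter.
    """
    if not ctfe_by_date:
        raise ValueError("at least one ctFE result is required")
    # The highest ctFE decides between low and high, as in the text.
    if max(ctfe_by_date) < cutoff:
        return "ctFE-L"
    first_high = next(i for i, v in enumerate(ctfe_by_date) if v >= cutoff)
    if any(v < cutoff for v in ctfe_by_date[first_high + 1:]):
        return "ctFE-H to L"
    return "ctFE-H"


# Hypothetical series for illustration only.
print(classify_patient([0.010, 0.015]))  # never reaches the cutoff
print(classify_patient([0.05]))          # single high result
print(classify_patient([0.05, 0.01]))    # high, then drops below the cutoff
```

A stricter reading would require the most recent result to be below the cutoff. Swapping the `any(...)` test for a check on the last element would implement that alternative.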
[1311] Results: 65/80 (81%) patients were classified as ctFE-L.
Of these, 64 (98%) were
alive at the 6-month follow-up, and one was deceased due to a non-PC-related
cause. 15/80
(19%) patients had at least one ctFE-H estimate. Of these, 7 (47%) were
deceased due to PC-
related causes within 6 months (range: 2-172 days, median: 15 days), while the
remaining 8
(53%) showed ctFE-H to -L conversion in response to treatment and were alive
at the 6-
month follow-up. Among 16 metastatic PC patients, 1LADT lowered ctFE in 12
patients; of
these, 10 patients continued responding to treatment during the follow-up
period. The 4
patients whose ctFE did not drop became castration-resistant during this
period.
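The reported proportions can be checked against the patient counts with a small sketch. The counts are taken from the text above. The helper name and dictionary labels are hypothetical, and percentages are rounded to the nearest whole number as in the text.

```python
from typing import Dict, Tuple


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# (count, denominator) pairs quoted in the results paragraph.
reported: Dict[str, Tuple[int, int]] = {
    "ctFE-L among all patients": (65, 80),
    "ctFE-L alive at 6 months": (64, 65),
    "at least one ctFE-H estimate": (15, 80),
    "ctFE-H deceased within 6 months": (7, 15),
    "ctFE-H converted to ctFE-L": (8, 15),
}

for label, (part, whole) in reported.items():
    print(f"{label}: {part}/{whole} = {percent(part, whole)}%")

# Internal consistency of the group sizes.
assert 65 + 15 == 80
assert 7 + 8 == 15
assert 12 + 4 == 16  # 1LADT patients whose ctFE did or did not drop
```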
[1312] Conclusions: Our data suggest that ctFE may predict PC
patient overall survival.
ctFE-L status is associated with patient survival at 6-month follow-up.
Conversely, ctFE-H
status is associated with death unless the implementation of a new active
treatment can
convert the patient to ctFE-L upon rechecking. Changes in ctFE may also
correlate with
response to 1LADT. Our study illustrates the potential of using ctFE as a tool
for PC
prognostication.
REFERENCES CITED AND ALTERNATIVE EMBODIMENTS
[1313] All references cited herein are incorporated herein by
reference in their entirety
and for all purposes to the same extent as if each individual publication or
patent or patent
application was specifically and individually indicated to be incorporated by
reference in its
entirety for all purposes.
[1314] Log2 transformed copy ratios, log2 copy ratios,
log2-transformed depths, log2-transformed read depths, log2 depths,
corrected log2 depths, log2 ratios, log2 read depths, and
log2 depth correction values have been discussed herein by way of example. In
each instance
where such a term is used, it will be appreciated that log base 2 is presented
by way of
example only and that the present disclosure is not so limited. Indeed,
logarithms to any base
N may be used, (e.g., where N is a positive number greater than 1 for
instance), and thus the
present disclosure fully supports logN transformed copy ratios, logN copy
ratios, logN-transformed depths,
logN-transformed read depths, logN depths, corrected logN
depths, logN
ratios, logN read depths, and logN depth correction values as respective
substitutes for log2
transformed copy ratios, log2 copy ratios, log2-transformed depths,
log2-transformed read
depths, log2 depths, corrected log2 depths, log2 ratios, log2 read depths, and
log2 depth
correction values.
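The interchangeability of logarithm bases follows from the change-of-base identity, logN(x) = log2(x) / log2(N). A minimal sketch is shown below. The function name, parameter names, and the example depths are hypothetical and are used for illustration only.

```python
import math


def log_ratio(sample_depth: float, reference_depth: float, base: float = 2.0) -> float:
    """Log-transformed depth ratio in an arbitrary base greater than 1."""
    if base <= 1:
        raise ValueError("base must be greater than 1")
    if sample_depth <= 0 or reference_depth <= 0:
        raise ValueError("depths must be positive")
    return math.log(sample_depth / reference_depth, base)


# Hypothetical depths for illustration only.
ratio_log2 = log_ratio(150.0, 100.0, base=2)
ratio_log10 = log_ratio(150.0, 100.0, base=10)

# Change of base: log_N(x) = log2(x) / log2(N).
assert math.isclose(ratio_log10, ratio_log2 / math.log2(10))
print(f"log2 ratio = {ratio_log2:.4f}, log10 ratio = {ratio_log10:.4f}")
```

Because the two values differ only by the constant factor 1 / log2(N), the sign and relative ordering of the ratios are unchanged by the choice of base. Any fixed cutoff expressed in log2 units would need to be rescaled by the same factor.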
[1315] The present invention can be implemented as a computer
program product that
comprises a computer program mechanism embedded in a non-transitory computer
readable
storage medium. These program modules can be stored on a CD-ROM, DVD, magnetic
disk
storage product, USB key, or any other non-transitory computer readable data
or program
storage product.
[1316] Many modifications and variations of this invention can be
made without
departing from its spirit and scope, as will be apparent to those skilled in
the art. The specific
embodiments described herein are offered by way of example only. The
embodiments were
chosen and described in order to best explain the principles of the invention
and its practical
applications, to thereby enable others skilled in the art to best utilize the
invention and
various embodiments with various modifications as are suited to the particular
use
contemplated. The invention is to be limited only by the terms of the appended
claims, along
with the full scope of equivalents to which such claims are entitled.

Administrative Status

For a clearer understanding of the status of the application/patent presented on this page, the site Disclaimer, as well as the definitions for Patent, Administrative Status, Maintenance Fee and Payment History should be consulted.

Title Date
Forecasted Issue Date Unavailable
(86) PCT Filing Date 2021-02-18
(87) PCT Publication Date 2021-08-26
(85) National Entry 2022-08-05

Abandonment History

There is no abandonment history.

Maintenance Fee

Last Payment of $100.00 was received on 2023-12-08


 Upcoming maintenance fee amounts

Description Date Amount
Next Payment if small entity fee 2025-02-18 $50.00
Next Payment if standard fee 2025-02-18 $125.00

Note: If the full payment has not been received on or before the date indicated, a further fee may be required, which may be one of the following:

  • the reinstatement fee;
  • the late payment fee; or
  • additional fee to reverse deemed expiry.

Patent fees are adjusted on the 1st of January every year. The amounts above are the current amounts if received by December 31 of the current year.
Please refer to the CIPO Patent Fees web page to see all current fee amounts.

Payment History

Fee Type Anniversary Year Due Date Amount Paid Paid Date
Registration of a document - section 124 $100.00 2022-08-05
Application Fee $407.18 2022-08-05
Maintenance Fee - Application - New Act 2 2023-02-20 $100.00 2022-12-13
Maintenance Fee - Application - New Act 3 2024-02-19 $100.00 2023-12-08
Registration of a document - section 124 $125.00 2024-01-30
Owners on Record

Note: Records showing the ownership history in alphabetical order.

Current Owners on Record
TEMPUS AI, INC.
Past Owners on Record
TEMPUS LABS, INC.
Past Owners that do not appear in the "Owners on Record" listing will appear in other documentation within the application.
Documents



Document Description  Date (yyyy-mm-dd)  Number of pages  Size of Image (KB)
National Entry Request 2022-08-05 2 58
Miscellaneous correspondence 2022-08-05 1 69
Assignment 2022-08-05 4 114
Patent Cooperation Treaty (PCT) 2022-08-05 1 60
Patent Cooperation Treaty (PCT) 2022-08-05 1 36
Patent Cooperation Treaty (PCT) 2022-08-05 1 36
Patent Cooperation Treaty (PCT) 2022-08-05 1 37
Patent Cooperation Treaty (PCT) 2022-08-05 1 36
Patent Cooperation Treaty (PCT) 2022-08-05 1 36
Patent Cooperation Treaty (PCT) 2022-08-05 1 67
Description 2022-08-05 346 18,730
Claims 2022-08-05 40 1,728
Drawings 2022-08-05 70 2,444
International Search Report 2022-08-05 6 203
Declaration 2022-08-05 1 49
Declaration 2022-08-05 1 50
Correspondence 2022-08-05 2 50
National Entry Request 2022-08-05 11 306
Abstract 2022-08-05 1 21
Cover Page 2022-11-09 1 41
Abstract 2022-10-19 1 21
Claims 2022-10-19 40 1,728
Drawings 2022-10-19 70 2,444