Method, computer-accessible medium and system for base-calling and alignment

Bibliographic Details
Title: Method, computer-accessible medium and system for base-calling and alignment
Patent Number: 10,964,408
Publication Date: March 30, 2021
Appl. No: 13/266662
Application Filed: April 27, 2010
Abstract: Exemplary methods, procedures, computer-accessible medium, and systems for base-calling, aligning and polymorphism detection and analysis using raw output from a sequencing platform can be provided. A set of raw outputs can be used to detect polymorphisms in an individual by obtaining a plurality of sequence read data from one or more technologies (e.g., using sequencing-by-synthesis, sequencing-by-ligation, sequencing-by-hybridization, Sanger sequencing, etc.). For example, provided herein are exemplary methods, procedures, computer-accessible medium and systems, which can include and/or be configured for obtaining raw output from a sequencing platform configured to be used for reading fragment(s) of genomes, obtaining reference sequences for the genomes obtained independently from the raw output, and generating a base-call interpretation and/or alignment using the raw output and the reference sequences. For example, a score function can be determined based on information associated with the sequencing platform that can be used to analyze polymorphisms based on the base-call interpretation and/or alignment.
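Illustrative note (not part of the patent record): the abstract describes a score function that draws on both the raw platform output and an independently obtained reference. A minimal Python sketch of that idea follows. The per-channel intensity dictionary, the additive log-score, the pseudocount, and the names base_score, call_base and ref_weight are all assumptions made for illustration; the record does not specify the actual score function.

```python
import math

BASES = "ACGT"

def base_score(intensities, ref_base, base, ref_weight=1.0):
    """Hypothetical log-score for calling `base` at one read position.

    intensities: dict mapping each of A, C, G, T to a non-negative raw
                 channel intensity (assumed input format).
    ref_base:    reference base aligned to this position, or None.
    """
    total = sum(intensities[b] for b in BASES)
    # Intensity term: normalised channel intensity with a small pseudocount
    # so that a zero intensity does not produce log(0).
    p = (intensities[base] + 1e-6) / (total + 4e-6)
    score = math.log(p)
    # Reference term: an assumed fixed bonus when the candidate base
    # agrees with the reference.
    if ref_base is not None and base == ref_base:
        score += ref_weight
    return score

def call_base(intensities, ref_base, ref_weight=1.0):
    """Return the base with the highest combined score."""
    return max(BASES, key=lambda b: base_score(intensities, ref_base, b, ref_weight))
```

Under these assumptions, an ambiguous signal such as {"A": 10, "C": 9, "G": 1, "T": 1} is resolved toward the reference base, while a strong signal that disagrees with the reference still wins, which is the situation a polymorphism analysis would flag.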
Inventors: Mishra, Bhubaneswar (New York, NY, US); Narzisi, Giuseppe (New York, NY, US)
Assignees: New York University (New York, NY, US)
Claim: 1. A non-transitory computer-accessible medium having stored thereon computer executable instructions for assembling at least one genetic sequence which, when executed by a hardware processing arrangement, configure the hardware processing arrangement to: (a) obtain a series of raw intensity outputs from a sequencing platform configured to (i) be used for reading a fragment of at least one genome and (ii) use a sequencing-by-ligation procedure, wherein each of the obtained raw intensity outputs comprises a plurality of randomly located short sequence reads, and wherein each of the randomly located short sequence reads has a read length of at least 48 base pairs (bps); (b) obtain at least one reference sequence for the at least one genome, wherein the at least one reference sequence for the at least one genome is obtained independently from the series of first raw intensity outputs obtained from the sequencing platform; (c) automatically generate a search tree comprising a plurality of nodes, wherein each of the plurality of nodes corresponds to a particular nucleotide base; (d) automatically select a node of the plurality of nodes in the search tree; (e) automatically expand the selected node by creating a plurality of child nodes, each of the plurality of child nodes corresponding to a particular further nucleotide base; (f) automatically generate a score for one or more of the plurality of child nodes, wherein the score is a function of (i) at least one raw intensity output from the series of raw intensity outputs, (ii) the plurality of reference sequences, and (iii) the nucleotide base to which a particular one of the plurality of child nodes corresponds; (g) automatically select one or more of the plurality of child nodes based on the score; (h) automatically repeat procedures (e)-(g) for the selected child node; (i) automatically generate a path through the plurality of nodes; and (j) automatically assemble the at least one genetic sequence based on the path.
Claim: 2. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to: automatically generate the score using a score function; determine the score function based on information associated with a sequencing platform from which the series of raw intensity outputs are obtained; and with the score function, analyze polymorphisms based on at least one of the raw intensity outputs or the reference sequences.
Claim: 3. The computer-accessible medium of claim 1, wherein the sequencing platform is further configured to utilize at least one of a Sanger chemistry procedure or a sequencing-by-synthesis procedure.
Claim: 4. The computer-accessible medium of claim 1, wherein the read length is at least 78 bps.
Claim: 5. The computer-accessible medium of claim 1, wherein each of the raw intensity outputs further comprises at least one error associated with at least one of the plurality of randomly located short sequence reads.
Claim: 6. The computer-accessible medium of claim 5, wherein the at least one error is related to at least one of an incorrect base-call, a missing base, one or more inserted bases, one or more deleted bases, or a homopolymeric compression.
Claim: 7. The computer-accessible medium of claim 1, wherein the at least one genome comprises a genome from at least one of (i) one or more diseased cells, (ii) one or more normal cells, (iii) at least one individual organism, (iv) at least one population, or (v) at least one ecological system.
Claim: 8. The computer-accessible medium of claim 1, wherein the at least one reference sequence is obtained from at least one of (i) a mathematical model, (ii) existing data, (iii) genomic single-molecules, or (iv) genomic materials that are at least one of amplified or otherwise modified.
Claim: 9. The computer-accessible medium of claim 2, wherein the analyzing procedure comprises a branch-and-bound process.
Claim: 10. The computer-accessible medium of claim 1, wherein the processing arrangement is further configured to generate the score based on an alignment between the raw intensity outputs and the at least one reference sequence.
Claim: 11. The computer-accessible medium of claim 10, wherein the alignment includes determining, with the processing arrangement, if any of the raw intensity outputs is contained within the reference sequences.
Claim: 12. A method for assembling at least one genetic sequence, comprising: (a) obtaining a series of raw intensity outputs from a sequencing platform configured to (i) be used for reading a fragment of at least one genome and (ii) use a sequencing-by-ligation procedure, wherein each of the obtained raw intensity outputs comprises a plurality of randomly located short sequence reads, and wherein each of the randomly located short sequence reads has a read length of at least 48 base pairs (bps); (b) obtaining at least one reference sequence for the at least one genome, wherein the at least one reference sequence for the at least one genome is obtained independently from the series of raw intensity outputs obtained from the sequencing platform; (c) automatically generating a search tree comprising a plurality of nodes, wherein each of the plurality of nodes corresponds to a particular nucleotide base; (d) automatically selecting a node of the plurality of nodes in the search tree; (e) automatically expanding the selected node by creating a plurality of child nodes, each of the child nodes corresponding to a particular further nucleotide base; (f) automatically generating a score for one or more of the child nodes, wherein the score is a function of (i) at least one raw intensity output from the series of raw intensity outputs, (ii) the plurality of reference sequences, and (iii) the nucleotide base to which a particular one of the plurality of child nodes corresponds; (g) automatically selecting one or more of the plurality of child nodes based on the score; (h) automatically repeating procedures (e)-(g) for the selected child node; (i) automatically generating a path through the plurality of nodes; and (j) using a computer hardware arrangement, automatically assembling the at least one genetic sequence based on the path.
Claim: 13. The method of claim 12, further comprising: automatically generating the score using a score function; automatically determining the score function based on information associated with a sequencing platform from which the series of raw intensity outputs are obtained; and with the score function, automatically analyzing polymorphisms based on at least one of the raw intensity outputs or the reference sequences.
Claim: 14. The method of claim 12, wherein the sequencing platform is further configured to utilize at least one of a Sanger chemistry procedure or a sequencing-by-synthesis procedure.
Claim: 15. The method of claim 12, wherein the read length is at least 78 bps.
Claim: 16. The method of claim 12, wherein each of the raw intensity outputs further comprises at least one error associated with at least one of the plurality of randomly located short sequence reads.
Claim: 17. The method of claim 16, wherein the at least one error is related to at least one of an incorrect base-call, a missing base, one or more inserted bases, one or more deleted bases, or a homopolymeric compression.
Claim: 18. The method of claim 12, wherein the at least one genome comprises a genome from at least one of (i) one or more diseased cells, (ii) one or more normal cells, (iii) at least one individual organism, (iv) at least one population, or (v) at least one ecological system.
Claim: 19. The method of claim 12, wherein the at least one reference sequence is obtained from at least one of (i) a mathematical model, (ii) existing data, (iii) genomic single-molecules, or (iv) genomic materials that are at least one of amplified or otherwise modified.
Claim: 20. The method of claim 15, wherein the analyzing procedure comprises a branch-and-bound process.
Claim: 21. The method of claim 12, further comprising at least one of displaying or storing information associated with the generated score in a storage arrangement in at least one of a user-accessible format or a user-readable format.
Claim: 22. The method of claim 12, further comprising automatically generating the score based on an alignment between the raw intensity outputs and the at least one reference sequence.
Claim: 23. The method of claim 22, wherein the alignment includes automatically determining if any of the raw intensity outputs is contained within the reference sequences.
Claim: 24. A system for assembling at least one genetic sequence, comprising: a computer hardware arrangement configured to: (a) obtain a series of raw intensity outputs from a sequencing platform configured to (i) be used for reading a fragment of at least one genome and (ii) use a sequencing-by-ligation procedure, wherein each of the obtained raw intensity outputs comprises a plurality of randomly located short sequence reads, and wherein each of the randomly located short sequence reads has a read length of at least 48 base pairs (bps); (b) obtain at least one reference sequence for the at least one genome, wherein the at least one reference sequence for the at least one genome is obtained independently from the series of raw intensity outputs obtained from the sequencing platform; (c) automatically generate a search tree comprising a plurality of nodes, wherein each of the plurality of nodes corresponds to a particular nucleotide base; (d) automatically select a node of the plurality of nodes in the search tree; (e) automatically expand the selected node by creating a plurality of child nodes, each of the child nodes corresponding to a particular further nucleotide base; (f) automatically generate a score for one or more of the child nodes, wherein the score is a function of (i) at least one raw intensity output from the series of raw intensity outputs, (ii) the plurality of reference sequences, and (iii) the nucleotide base to which a particular one of the plurality of child nodes corresponds; (g) automatically select one or more of the plurality of child nodes based on the score; (h) automatically repeat procedures (e)-(g) for the selected child node; (i) automatically generate a path through the plurality of nodes; and (j) automatically assemble the at least one genetic sequence based on the path.
Claim: 25. The system of claim 24, wherein the computer hardware arrangement is further configured to: automatically generate the score using a score function; automatically determine the score function based on information associated with a sequencing platform from which the series of raw intensity outputs are obtained; and with the score function, automatically analyze polymorphisms based on at least one of the raw intensity outputs or the reference sequences.
Claim: 26. The system of claim 24, wherein the sequencing platform is further configured to utilize at least one of a Sanger chemistry procedure or a sequencing-by-synthesis procedure.
Claim: 27. The system of claim 24, wherein the read length is at least 78 bps.
Claim: 28. The system of claim 24, wherein each of the raw intensity outputs further comprises at least one error associated with at least one of the plurality of randomly located short sequence reads.
Claim: 29. The system of claim 28, wherein the at least one error is related to at least one of an incorrect base-call, a missing base, one or more inserted bases, one or more deleted bases, or a homopolymeric compression.
Claim: 30. The system of claim 24, wherein the at least one genome comprises a genome from at least one of (i) one or more diseased cells, (ii) one or more normal cells, (iii) at least one individual organism, (iv) at least one population, or (v) at least one ecological system.
Claim: 31. The system of claim 24, wherein the at least one reference sequence is obtained from at least one of (i) a mathematical model, (ii) existing data, (iii) genomic single-molecules, or (iv) genomic materials that are at least one of amplified or otherwise modified.
Claim: 32. The system of claim 25, wherein the analyzing procedure comprises a branch-and-bound process.
Claim: 33. The system of claim 24, wherein the computer hardware arrangement is further configured to automatically generate the score based on an alignment between the raw intensity outputs and the at least one reference sequence.
Claim: 34. The system of claim 33, wherein the alignment includes automatically determining, using the computer hardware arrangement, if any of the raw intensity outputs is contained within the reference sequences.
Claim: 35. The computer-accessible medium of claim 1, wherein the read length is at least 100 bps.
Claim: 36. The method of claim 12, wherein the read length is at least 100 bps.
Claim: 37. The system of claim 24, wherein the read length is at least 100 bps.
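Illustrative note (not part of the patent record): steps (c) through (j) of independent claims 1, 12 and 24 describe growing a search tree one nucleotide at a time, scoring each child node from the raw intensity, the reference and the candidate base, keeping the best children, and reading the assembled sequence off the resulting path. The Python sketch below is one possible reading of those steps, not the patented implementation. The beam-style selection, the placeholder root node, the per-position intensity dictionaries, the score formula, and the names Node, score_child, assemble, beam_width and ref_weight are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional

BASES = "ACGT"

@dataclass
class Node:
    base: str                       # nucleotide this node stands for ("" for the root)
    score: float                    # cumulative score of the path ending here
    parent: Optional["Node"] = None

def score_child(intensity, ref_base, base, ref_weight=1.0):
    """Assumed score: log of the normalised intensity plus a reference bonus."""
    total = sum(intensity.values())
    s = math.log((intensity[base] + 1e-6) / (total + 4e-6))
    return s + (ref_weight if base == ref_base else 0.0)

def assemble(intensities, reference, beam_width=2, ref_weight=1.0):
    """Assemble a sequence from per-position intensities and an aligned reference."""
    # (c)/(d): start the tree from a placeholder root and select it.
    frontier = [Node(base="", score=0.0)]
    for intensity, ref_base in zip(intensities, reference):
        children = []
        for node in frontier:
            # (e): expand the selected node into one child per nucleotide.
            for b in BASES:
                # (f): score each child from intensity, reference and base.
                s = node.score + score_child(intensity, ref_base, b, ref_weight)
                children.append(Node(b, s, node))
        # (g): keep only the best-scoring children.
        children.sort(key=lambda n: n.score, reverse=True)
        frontier = children[:beam_width]
        # (h): the loop repeats (e)-(g) for the selected children.
    # (i): walk back from the best leaf to recover the path.
    node = max(frontier, key=lambda n: n.score)
    path = []
    while node.parent is not None:
        path.append(node.base)
        node = node.parent
    # (j): the assembled sequence is the path read from root to leaf.
    return "".join(reversed(path))
```

With this scoring, a weak signal is pulled toward the reference while a strong signal that contradicts it is kept as a candidate variant; setting ref_weight to 0 reduces the sketch to intensity-only base-calling. The branch-and-bound process named in claims 9, 20 and 32 is not modelled here.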
Patent References Cited: 7232656 June 2007 Balasubramanian et al.
2002/0055112 May 2002 Patil et al.
2004/0053246 March 2004 Sorenson
2005/0221341 October 2005 Shimkets et al.
Other References: Brockman, W. et al. Quality scores and SNP detection in sequencing-by-synthesis systems. Genome Research 18, 763-770 (2008). cited by examiner
Delcher, A. L. et al. Alignment of whole genomes. Nucleic Acids Research 27, 2369-2376 (1999). cited by examiner
Hillier, L. W. et al. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5, 183-188 (2008). cited by examiner
Horton, P. A branch and bound algorithm for local multiple alignment. Pacific Symposium on Biocomputing 368-383 (1996). cited by examiner
Li, H., Ruan, J. & Durbin, R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 18, 1851-1858 (2008). cited by examiner
Marth, G. T. et al. A general approach to single-nucleotide polymorphism discovery. Nature Genetics 23, 452-456 (1999). cited by examiner
Ossowski, S. et al. Sequencing of natural strains of Arabidopsis thaliana with short reads. Genome Research 18, 2024-2033 (2008). cited by examiner
Salzberg, S. L., Church, D., DiCuccio, M., Yaschenko, E. & Ostell, J. The genome Assembly Archive: a new public resource. PLoS Biology 2, E285:1273-1275 (2004). cited by examiner
Schatz, M. C., Phillippy, A. M., Shneiderman, B. & Salzberg, S. L. Hawkeye: An interactive visual analytics tool for genome assemblies. Genome Biology 8, R34:1-12 (2007). cited by examiner
Schmid, K. J. et al. Large-scale identification and analysis of genome-wide single-nucleotide polymorphisms for mapping in Arabidopsis thaliana. Genome Research 13, 1250-1257 (2003). cited by examiner
Smith, D. R. et al. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Research 18, 1638-1642 (2008). cited by examiner
Giddings, M. C., Brumley, R. L., Haker, M. & Smith, L. M. An adaptive, object oriented strategy for base calling in DNA sequence analysis. Nucleic Acids Research 21, 4530-4540 (1993). cited by examiner
Chevreux, B. et al. Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 14, 1147-1159 (2004). cited by examiner
Gordon, D., Abajian, C. & Green, P. Consed: A Graphical Tool for Sequence Finishing. Genome Res. 8, 195-202 (1998). cited by examiner
Luque, G. & Alba, E. Metaheuristics for the DNA Fragment Assembly Problem. Int. J. Comput. Intell. Res. 1, 98-108 (2005). cited by examiner
Sundquist, A., Ronaghi, M., Tang, H., Pevzner, P. & Batzoglou, S. Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS One 2, e484 (2007). cited by examiner
Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713-714 (2008). cited by examiner
Rougemont, J. et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics 9, 431:1-12 (2008). cited by examiner
Illumina, Inc. Genome Analyzer Pipeline Software User Guide. (2008). cited by examiner
Smith, A. D., Xuan, Z. & Zhang, M. Q. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 9, 128:1-8 (2008). cited by examiner
Stephens, M., Sloan, J. S., Robertson, P. D., Scheet, P. & Nickerson, D. A. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nature Genetics 38, 375-381 (2006). cited by examiner
Chevreux, B., Pfisterer, T. & Suhai, S. Automatic Assembly and Editing of Genomic Data. in Genomics and Proteomics: Functional and Computational Aspects 51-65 (Kluwer Academic Publishers, 2000). cited by examiner
Kircher, M. & Kelso, J. High-throughput DNA sequencing—concepts and limitations. BioEssays 32, 524-36 (2010). cited by examiner
Loman, N. J. et al. Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology 30, 434-439 (2012). cited by examiner
Metzker, M. L. Sequencing technologies—the next generation. Nature Reviews Genetics 11, 31-46 (2010). cited by examiner
Niedringhaus, T. P., Milanova, D., Kerby, M. B., Snyder, M. P. & Barron, A. E. Landscape of next-generation sequencing technologies. Analytical Chemistry 83, 4327-4341 (2011). cited by examiner
Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnology 26, 1135-1145 (2008). cited by examiner
Illumina, Inc. Genome Analyzer IIx System Specification. 2009. cited by examiner
International Search Report for PCT/US2010/032613 dated Dec. 8, 2010. cited by applicant
International Written Opinion for PCT/US2010/032613 dated Dec. 8, 2010. cited by applicant
B. Ewing et al., “Base-Calling of Automated Sequencer Traces Using Phred. Accuracy Assessment,” Genome Research, vol. 8, pp. 175-185, 1998. cited by applicant
B. Ewing et al., “Base-Calling of Automated Sequencer Traces Using Phred. Error Probabilities,” Genome Research, vol. 8, pp. 186-194, 1998. cited by applicant
Nyren, P. et al. “Solid Phase DNA minisequencing by an Enzymatic Luminometric Inorganic Pyrophosphate Detection Assay” Anal. Biochem. vol. 208;1, pp. 171-175; 1993. cited by applicant
Ronaghi, M. et al. “PCR-Introduced Loop Structure as Primer in DNA sequencing” Biotechniques, vol. 25;5, pp. 876-884, 1998. cited by applicant
Margulies, M. et al. “Genome Sequencing in Micro-fabricated High-Density Picolitre Reactors” Nature, vol. 437;15, pp. 376-380, 2005. cited by applicant
Erlich Y., et al. “Alta-Cyclic: a self-optimizing base caller for next-generation sequencing” Nature Methods, vol. 5; 8, pp. 679-682, 2008. cited by applicant
Barany, F. “The Ligase Chain Reaction in a PCR World” PCR Methods and Applications, vol. 1;5, pp. 5-16, 1991. cited by applicant
Nickerson, D.A., et al. “Automated DNA Diagnostics Using an ELISA-Based Oligonucleotide Ligation Assay” PNAS, vol. 87; 22, pp. 8923-8927, 1991. cited by applicant
Drmanac, R., et al. “DNA Sequence Determination by Hybridization: A Strategy for Efficient Large-Scale Sequencing” Science, vol. 260; pp. 1649-1652, 1993. cited by applicant
Broude, N.E., et al. “Enhanced DNA Sequencing by Hybridization” PNAS, vol. 91; 8, pp. 3072-3076, 1994. cited by applicant
Levene, M.J., et al. “Zero-Mode Waveguides for Single-Molecule Analysis at High Concentrations” Science, vol. 299, pp. 682-686, 2003. cited by applicant
Fologea, D. et al. “Detecting Single Stranded DNA with a Solid State Nanopore” Nano Letters, vol. 5, No. 10, pp. 1905-1909, 2005. cited by applicant
Meller, A. et al., Rapid Nanopore Discrimination Between Single Polynucleotide Molecules, PNAS, vol. 97, No. 3, pp. 1079-1084, 2000. cited by applicant
The International HapMap Consortium, “The International HapMap Project”, Nature, vol. 426, No. 18, pp. 789-796, 2003. cited by applicant
The International HapMap Consortium, “A Haplotype Map of the Human Genome”, Nature, vol. 437, No. 27, pp. 1299-1320, 2005. cited by applicant
M. Stephens and P. Donnelly, “A Comparison of Bayesian Methods for Haplotype Reconstruction from Population Genotype Data” Am. J. of Hum. Genet., vol. 73;5, pp. 1162-1169, 2003. cited by applicant
L. Feuk et al., “Structural Variation in the Human Genome” Nature Reviews Genetics, vol. 7, No. 2, pp. 85-97, 2006. cited by applicant
J. Sebat et al. “Large-Scale Copy Number Polymorphism in the Human Genome” Science, vol. 305, No. 5683, pp. 525-528, 2004. cited by applicant
Efron, B., “Large-scale simultaneous hypothesis testing: the choice of a null hypothesis” J. Am. Statist. Assoc., vol. 99, pp. 96-104, 2004. cited by applicant
Primary Examiner: Harward, Soren
Attorney, Agent or Firm: Hunton Andrews Kurth LLP
Accession Number: edspgr.10964408
Database: USPTO Patent Grants