Academic Journal
How to optimally sample a sequence for rapid analysis
Title: | How to optimally sample a sequence for rapid analysis |
---|---|
Authors: | Frith, Martin C, Shaw, Jim, Spouge, John L |
Contributors: | Kelso, Janet, Japan Science and Technology Agency, National Library of Medicine, National Institutes of Health |
Source: | Bioinformatics; volume 39, issue 2; ISSN 1367-4811 |
Publisher Information: | Oxford University Press (OUP) |
Publication Year: | 2023 |
Description: | Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. Availability and implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary information: Supplementary data are available at Bioinformatics online. |
Document Type: | article in journal/newspaper |
Language: | English |
DOI: | 10.1093/bioinformatics/btad057 |
DOI: | 10.1093/bioinformatics/btad057/48907444/btad057.pdf |
Availability: | http://dx.doi.org/10.1093/bioinformatics/btad057 https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad057/48907444/btad057.pdf https://academic.oup.com/bioinformatics/article-pdf/39/2/btad057/49124149/btad057.pdf |
Rights: | https://creativecommons.org/licenses/by/4.0/ |
Accession Number: | edsbas.CF51DC14 |
Database: | BASE |