Academic Journal

How to optimally sample a sequence for rapid analysis

Bibliographic Details
Title: How to optimally sample a sequence for rapid analysis
Authors: Frith, Martin C; Shaw, Jim; Spouge, John L
Contributors: Kelso, Janet; Japan Science and Technology Agency; National Library of Medicine; National Institutes of Health
Source: Bioinformatics ; volume 39, issue 2 ; ISSN 1367-4811
Publisher Information: Oxford University Press (OUP)
Publication Year: 2023
Description: Motivation: We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. Results: We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. Availability and implementation: Source code is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary information: Supplementary data are available at Bioinformatics online.
Document Type: article in journal/newspaper
Language: English
DOI: 10.1093/bioinformatics/btad057
Availability: http://dx.doi.org/10.1093/bioinformatics/btad057
https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btad057/48907444/btad057.pdf
https://academic.oup.com/bioinformatics/article-pdf/39/2/btad057/49124149/btad057.pdf
Rights: https://creativecommons.org/licenses/by/4.0/
Accession Number: edsbas.CF51DC14
Database: BASE