Embeddings of genomic region sets capture rich biological associations in lower dimensions

Bibliographic details
Title: Embeddings of genomic region sets capture rich biological associations in lower dimensions
Authors: Nathan C. Sheffield, Jason P. Smith, Donald E. Brown, Erfaneh Gharavi, Aidong Zhang, Aaron Gu, Guangtao Zheng
Publication info: Cold Spring Harbor Laboratory, 2021.
Publication year: 2021
Subject terms: Set (abstract data type), Computer science, Robustness (computer science), business.industry, Pattern recognition, Word2vec, Interval (mathematics), Artificial intelligence, Representation (mathematics), business, Functional genomics, Peak calling, Curse of dimensionality
Description: Motivation: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. Results: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: first, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. Availability: https://github.com/databio/regionset-embedding
DOI: 10.1101/2021.05.07.443166
Access URL: https://explore.openaire.eu/search/publication?articleId=doi_________::1264aca76154a5c9f227a91cf98fab74
https://doi.org/10.1101/2021.05.07.443166
Rights: OPEN
Accession number: edsair.doi...........1264aca76154a5c9f227a91cf98fab74
Database: OpenAIRE