Report
Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models
Title: | Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models |
---|---|
Authors: | Lee, Joseph; Yang, Shu; Baik, Jae Young; Liu, Xiaoxi; Tan, Zhen; Li, Dawei; Wen, Zixuan; Hou, Bojian; Duong-Tran, Duy; Chen, Tianlong; Shen, Li |
Publication Year: | 2024 |
Collection: | Computer Science; Quantitative Biology |
Subject Terms: | Computer Science - Machine Learning, Computer Science - Computation and Language, Quantitative Biology - Genomics |
Description: | Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM. |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2410.01795 |
Accession Number: | edsarx.2410.01795 |
Database: | arXiv |