
Summary of Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models, by Joseph Lee et al.


Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

by Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen

First submitted to arXiv on: 2 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Genomics (q-bio.GN)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary
Written by: Paper authors
Read the original abstract here

Medium Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper investigates the challenge of predicting complex phenotypes from genotype data using a small, interpretable set of variant features. Conventional approaches often struggle because genotype data is high-dimensional. The authors draw inspiration from pre-trained large language models (LLMs) and their ability to process complex biomedical concepts. They develop FREEFORM, a framework that uses LLMs for feature selection and feature engineering on tabular genotype data. The framework applies chain-of-thought and ensembling principles so that features are selected and engineered from the intrinsic knowledge of the LLM. Evaluations on two distinct datasets, genetic ancestry and hereditary hearing loss, show that FREEFORM outperforms several data-driven methods, particularly in low-shot regimes.

Low Difficulty Summary
Written by: GrooveSquid.com (original content)
This paper tries to solve a difficult problem: predicting what traits you might have based on your DNA. Right now, scientists use computer programs to analyze DNA, but it’s hard because there are so many pieces of information to look at. The authors wanted to see if language models that are good at understanding complex medical ideas could help with this task. They created a new method called FREEFORM, which uses the language models to pick out important parts of the DNA and build features for predicting traits. When they tested it on two different types of data, genetic ancestry and hereditary hearing loss, they found that FREEFORM did better than several other computer programs, especially when only a few examples were available.
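
The ensembling idea mentioned in the medium difficulty summary can be illustrated with a minimal sketch. This is not the FREEFORM implementation: the function name `ensemble_select`, the variant IDs, and the simple majority-vote rule are all assumptions made for illustration. The sketch assumes an LLM has already been queried several times and that each query returned a list of candidate variants, and it keeps the variants named most often.

```python
from collections import Counter


def ensemble_select(candidate_lists, k):
    """Return the k variant IDs named most often across candidate lists.

    Hypothetical helper: each inner list stands in for the features
    proposed by one independent LLM query.
    """
    votes = Counter()
    for features in candidate_lists:
        # Count each variant once per list so a repeated ID cannot vote twice.
        votes.update(set(features))
    # Sort by vote count (descending), then by ID for a deterministic tie-break.
    ranked = sorted(votes, key=lambda f: (-votes[f], f))
    return ranked[:k]


# Placeholder outputs of three hypothetical LLM queries (made-up variant IDs).
runs = [
    ["snp_a", "snp_b", "snp_c"],
    ["snp_b", "snp_c", "snp_d"],
    ["snp_c", "snp_a", "snp_b"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Voting across several queries is one simple way to make a selection less sensitive to any single response; the paper may aggregate LLM outputs differently.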

Keywords

» Artificial intelligence  » Feature selection