Summary of 3DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation, by Hongxin Ding et al.
3DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation
by Hongxin Ding, Yue Fang, Runchuan Zhu, Xinke Jiang, Jinyang Zhang, Yongxin Xu, Xu Chu, Junfeng Zhao, Yasha Wang
First submitted to arXiv on: 13 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Large Language Models (LLMs) excel at general tasks but struggle in specialized domains like healthcare due to limited domain-specific knowledge. Fine-tuning often relies on heuristic data-selection methods, such as GPT-4 annotation or manual curation, that focus on diverse, high-quality datasets. However, these methods overlook the model’s inherent knowledge distribution, introducing noise, redundancy, and irrelevant data. The proposed two-stage, model-centric data selection framework, Decomposed Difficulty Data Selection (3DS), aligns training data with the model’s knowledge distribution for optimized adaptation. Stage 1 applies Prompt-Driven Data Selection via Explicit Alignment, filtering out irrelevant data based on the model’s internal knowledge. Stage 2 performs Decomposed Difficulty Data Selection using three metrics: Instruction Understanding, Response Confidence, and Response Correctness. An attention-based importance weighting mechanism captures token importance for accurate difficulty calibration. This ensures the selected data is aligned with the model’s knowledge and preferences, making it more effective for domain adaptation. Extensive experiments on real-world healthcare datasets show that 3DS outperforms existing selection methods in accuracy by over 5.29%. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Large Language Models struggle to adapt to specialized domains like healthcare. To improve this, a new approach called Decomposed Difficulty Data Selection (3DS) was developed. This method uses the model’s internal knowledge to select training data that is relevant and appropriately challenging for the model to learn from. The approach has two stages: first, it filters out irrelevant data using the model’s own internal knowledge. Then, it selects data based on three metrics: how well the model understands the instruction, how confident it is in its response, and whether the response is correct. This ensures that the selected data is aligned with the model’s knowledge and preferences, making it more effective for domain adaptation. |
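The three difficulty signals described above can be sketched in code. This is a minimal illustration of the decomposed-difficulty idea, not the authors’ implementation: it assumes per-token log-probabilities from a language model (stubbed here with toy numbers), combines perplexity-style scores for instruction understanding and response confidence with a correctness term using illustrative weights, and keeps only samples in a moderate-difficulty band. The function names, weights, and thresholds are all hypothetical.

```python
import math

def perplexity(logprobs):
    """Perplexity over a list of per-token log-probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

def decomposed_difficulty(inst_lps, resp_lps, correct, weights=(1.0, 1.0, 1.0)):
    """Combine the three signals named in the paper (illustrative weighting):
    instruction understanding, response confidence, response correctness."""
    instruction_understanding = perplexity(inst_lps)  # high ppl = poorly understood
    response_confidence = perplexity(resp_lps)        # high ppl = low confidence
    response_correctness = 0.0 if correct else 1.0    # wrong answers count as harder
    w1, w2, w3 = weights
    return (w1 * instruction_understanding
            + w2 * response_confidence
            + w3 * response_correctness)

def select_moderate(samples, low, high):
    """Keep samples whose combined difficulty falls in a target band:
    neither trivial (already known) nor hopeless (likely noise)."""
    return [s for s in samples if low <= decomposed_difficulty(*s) <= high]

# Toy candidates: (instruction log-probs, response log-probs, is_correct).
# A real pipeline would obtain these from the model being adapted.
samples = [
    ([-0.1, -0.2], [-0.1, -0.1], True),   # easy: model already knows it
    ([-1.0, -1.2], [-1.5, -1.3], True),   # moderately hard: useful to train on
    ([-3.0, -3.5], [-4.0, -3.8], False),  # too hard: likely noise or irrelevant
]
kept = select_moderate(samples, low=3.0, high=10.0)
```

Note that the paper additionally weights tokens by attention-based importance before computing these scores, which this token-uniform sketch omits for brevity.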
Keywords
» Artificial intelligence » Alignment » Attention » Domain adaptation » Fine tuning » Gpt » Prompt » Token