
Summary of TSDS: Data Selection for Task-Specific Model Finetuning, by Zifan Liu et al.


TSDS: Data Selection for Task-Specific Model Finetuning

by Zifan Liu, Amin Karbasi, Theodoros Rekatsinas

First submitted to arXiv on: 15 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty | Written by | Summary
High | Paper authors | High Difficulty Summary: Read the original abstract here
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: TSDS (Task-Specific Data Selection) is a framework for selecting training data to finetune foundation models for a specific task, guided by a small, representative set of examples from that task. It formulates data selection as an optimization problem whose distribution alignment loss, based on optimal transport, captures the discrepancy between the selected data and the target distribution. A regularizer encourages diversity among the selected data and incorporates kernel density estimation to reduce the negative effect of near-duplicates among the candidate data. The optimal solution is computed efficiently using approximate nearest neighbor search. In evaluations on continued pretraining and instruction tuning of language models, instruction tuning on data selected by TSDS often outperforms tuning on the full dataset and beats baseline selection methods by 1.5 points in F1 score on average.
Low | GrooveSquid.com (original content) | Low Difficulty Summary: Researchers are trying to make machines smarter by fine-tuning large pretrained models for specific tasks. To do this, they need the right data to train these models. TSDS is a new way to choose the best data for fine-tuning, using just a few examples of what you want the model to learn. This helps make sure the selected data is similar to the task you want the model to do well on. The method also includes extra rules to avoid picking too much of the same kind of data. The team tested this method and found that it works better than other ways of choosing data, even when using only a small part of all the available data.
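To make the ideas in the medium difficulty summary more concrete, here is a toy sketch in Python. It is not the authors' algorithm and does not solve the optimal transport problem: it only illustrates the intuition that candidates close to the task examples should receive more sampling weight, and that a kernel density estimate can be used to stop near-duplicates from dominating. The function name `selection_weights`, the parameters `k` and `bandwidth`, and the weighting rule (neighbor mass divided by local density) are all assumptions made for illustration, and exact nearest neighbor search stands in for the approximate search mentioned in the summary.

```python
import numpy as np


def selection_weights(candidates, queries, k=3, bandwidth=0.5):
    """Toy sampling weights over candidates (illustrative, not the TSDS solver).

    candidates: (n, d) array of candidate embeddings.
    queries:    (m, d) array of task-example embeddings.
    """
    # Squared Euclidean distance between every query and every candidate.
    d2 = ((queries[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)

    # Exact k-nearest candidates per query (a real system would use
    # approximate nearest neighbor search here).
    nn_idx = np.argsort(d2, axis=1)[:, :k]

    # Each query spreads one unit of mass uniformly over its k neighbors.
    mass = np.zeros(len(candidates))
    np.add.at(mass, nn_idx.ravel(), 1.0 / (k * len(queries)))

    # Gaussian kernel density estimate at each candidate. The self-term
    # makes every density at least 1, so the division below is safe.
    cc = ((candidates[:, None, :] - candidates[None, :, :]) ** 2).sum(axis=-1)
    density = np.exp(-cc / (2.0 * bandwidth ** 2)).sum(axis=1)

    # Assumed heuristic: down-weight candidates in dense regions so a
    # cluster of near-duplicates shares mass instead of multiplying it.
    weights = mass / density
    return weights / weights.sum()


# Hypothetical data: three exact duplicates, one distinct nearby point,
# and one point far from the task example.
candidates = np.array([[0, 0], [0, 0], [0, 0], [1, 0], [10, 10]], dtype=float)
queries = np.array([[0.5, 0.0]])

weights = selection_weights(candidates, queries, k=4, bandwidth=0.5)
rng = np.random.default_rng(0)
sample = rng.choice(len(candidates), size=3, p=weights)
print(weights, sample)
```

In this toy setup the far-away candidate receives zero weight, and each of the three duplicates receives less weight than the single distinct nearby point, which is the qualitative behavior the diversity regularizer is described as encouraging.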

Keywords

» Artificial intelligence  » Alignment  » Density estimation  » F1 score  » Fine tuning  » Instruction tuning  » Nearest neighbor  » Optimization  » Pretraining