
Summary of Data Quality Control in Federated Instruction-tuning of Large Language Models, by Yaxin Du, Rui Ye, Fengting Yuchi, Wanru Zhao, Jingjing Qu, Yanfeng Wang, and Siheng Chen


Data Quality Control in Federated Instruction-tuning of Large Language Models

by Yaxin Du, Rui Ye, Fengting Yuchi, Wanru Zhao, Jingjing Qu, Yanfeng Wang, Siheng Chen

First submitted to arXiv on: 15 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper proposes FedDQC, a novel federated instruction-tuning framework with dynamic data quality control for privacy-preserving collaborative training of large language models (LLMs). The decentralized nature of Federated Learning (FL) exacerbates data quality challenges, since local clients lack the global visibility needed to filter noisy or low-quality samples before training. To address this, the authors introduce two key components: instruction-response alignment (IRA), an efficient client-side quality metric that requires only low-cost inference; and a quality-aware hierarchical FL training framework that adaptively assesses data quality at each level of the hierarchy. The framework progressively fine-tunes the LLM from high-IRA to low-IRA data in a collaborative manner, allowing dynamic adjustment throughout training. Extensive experiments on synthetic and real-world datasets show that FedDQC significantly improves LLM performance on mixed-quality data in FL (a minimal sketch of this IRA-ordered, staged training follows the summaries below).
Low Difficulty Summary (written by GrooveSquid.com; original content)
FedDQC is a new way to train language models while keeping data private. Right now, training large language models requires combining lots of data from different places. This can be tricky because some of the data might not be very good or even fake. To fix this problem, researchers created a new framework that helps figure out which data is good and which isn’t. They did this by creating two special tools: one to measure how well each piece of data matches what it’s supposed to say (IRA), and another to adjust the training process based on the quality of the data. This means that the model gets trained first with high-quality data, then lower-quality data, and so on. They tested this new framework on synthetic (made-up) data and real-world datasets and found that it improved the performance of the language models.
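
The staged, alignment-ordered training described in the summaries can be illustrated with a short sketch. This is not the authors' code: the paper defines the exact IRA metric and the FL aggregation loop, while here IRA is approximated as the drop in the response's average negative log-likelihood when the instruction is given as context, and helper names such as `ira_score` and `stage_indices` are illustrative assumptions. A small causal LM from Hugging Face `transformers` stands in for the client model.

```python
# Hedged sketch (not the authors' implementation): one plausible way to score
# instruction-response alignment (IRA) with a frozen causal LM, then order a
# client's local data from high- to low-IRA for staged (easy-to-hard) fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works for the illustration
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def response_nll(prefix: str, response: str) -> float:
    """Average negative log-likelihood of the response tokens given a prefix."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    resp_ids = tok(response, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, resp_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # score only the response tokens
    return lm(input_ids=input_ids, labels=labels).loss.item()

def ira_score(instruction: str, response: str) -> float:
    """Higher score = the instruction helps more in predicting the response."""
    nll_with = response_nll(instruction + "\n", response)
    nll_without = response_nll("", response)
    return nll_without - nll_with

def stage_indices(dataset, num_stages: int = 3):
    """Split local samples into training stages, highest-IRA samples first."""
    order = sorted(
        range(len(dataset)),
        key=lambda i: ira_score(dataset[i]["instruction"], dataset[i]["response"]),
        reverse=True,
    )
    k = max(1, len(order) // num_stages)
    return [order[s : s + k] for s in range(0, len(order), k)]
```

Under these assumptions, each client would score its own samples locally (inference only, no labels shared), and each FL round would draw training batches from the stages unlocked so far, starting with the highest-IRA stage, while server-side aggregation stays unchanged.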

Keywords

» Artificial intelligence  » Alignment  » Federated learning  » Inference  » Instruction tuning