


Fine-Tuning Large Language Models with User-Level Differential Privacy

by Zachary Charles, Arun Ganesh, Ryan McKenna, H. Brendan McMahan, Nicole Mitchell, Krishna Pillutla, Keith Rush

First submitted to arXiv on: 10 Jul 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (original GrooveSquid.com content)

This research investigates methods for fine-tuning large language models under user-level differential privacy, which safeguards each user's entire data contribution rather than individual examples. Two variants of DP-SGD are studied: example-level sampling (ELS) with per-example gradient clipping, and user-level sampling (ULS) with per-user gradient clipping. A novel user-level DP accountant is derived to compute tight privacy guarantees for ELS. The study finds that ULS generally outperforms ELS, especially when users have diverse collections of data. Experiments on synthetic mean estimation and LLM fine-tuning tasks under fixed compute budgets show that ULS performs better when strong privacy guarantees are required or the compute budget is large.
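
To make the difference between ELS and ULS concrete, here is a minimal numpy sketch of one noisy gradient step under each scheme. It is not the authors' implementation: the gradients are random stand-ins, and clip_norm and noise_multiplier are illustrative placeholder values that a real system would calibrate with a privacy accountant.

```python
# Illustrative sketch of ELS vs. ULS clipping in DP-SGD (not the paper's code).
import numpy as np

def clip(v, clip_norm):
    """Scale v down so its L2 norm is at most clip_norm."""
    norm = np.linalg.norm(v)
    return v * min(1.0, clip_norm / (norm + 1e-12))

def els_step(example_grads, clip_norm, noise_multiplier, rng):
    """Example-level sampling (ELS): clip each example's gradient,
    then sum and add Gaussian noise scaled to the per-example clip norm."""
    clipped = [clip(g, clip_norm) for g in example_grads]
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(example_grads)

def uls_step(user_grads, clip_norm, noise_multiplier, rng):
    """User-level sampling (ULS): average each sampled user's examples into
    one per-user gradient, clip that, then sum and add Gaussian noise.
    user_grads is a list of per-user lists of example gradients."""
    per_user = [clip(np.mean(gs, axis=0), clip_norm) for gs in user_grads]
    total = np.sum(per_user, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return (total + noise) / len(per_user)

# Toy usage: 3 users with different numbers of examples, gradients in R^4.
rng = np.random.default_rng(0)
users = [[rng.normal(size=4) for _ in range(n)] for n in (2, 5, 1)]
print(els_step([g for u in users for g in u], 1.0, 0.5, rng))
print(uls_step(users, 1.0, 0.5, rng))
```

Note how ULS first averages each user's examples into a single per-user gradient, so the clipping (and therefore the added noise) bounds each user's total influence on the update rather than each individual example's.
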
Low Difficulty Summary (original GrooveSquid.com content)

We explore ways to train big language models safely, so that each person's contributions stay private. Two approaches are tested: sampling individual examples (ELS) and sampling whole users (ULS). To show how safe ELS really is, we create a new way to measure exactly how much privacy it provides. The results show that ULS usually works better, especially when people have different types of data. We tried both methods on a simple synthetic task and on fine-tuning big language models with a limited computing budget.

Keywords

» Artificial intelligence  » Fine tuning  » Synthetic data