Summary of Prompt Public Large Language Models to Synthesize Data for Private On-device Applications, by Shanshan Wu et al.
Prompt Public Large Language Models to Synthesize Data for Private On-device Applications
by Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage
First submitted to arXiv on: 5 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper investigates how large language models (LLMs) pre-trained on public data can improve the quality of pre-training data for on-device language models trained with differential privacy (DP) and federated learning (FL). The authors design LLM prompts to filter and transform existing public data, generating new synthetic datasets that resemble real user data distributions. By pre-training a model on these synthetic datasets, the authors achieve 19.0% and 22.8% relative improvements in next-word prediction accuracy compared to a baseline model pre-trained on a standard public dataset. The method also achieves evaluation accuracy comparable to or better than the baseline during DP FL fine-tuning over millions of mobile devices, and the final model outperforms the baseline in production A/B testing. |
| Low | GrooveSquid.com (original content) | This paper explores how large language models can improve the quality of pre-training data for on-device language models trained with differential privacy and federated learning. The authors create synthetic datasets by filtering and transforming public data to resemble real user data distributions. By training a model on these synthetic datasets, they achieve better performance in predicting next words compared to using standard public datasets. This is important because it allows devices to benefit from high-quality models without sharing their private data. |
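The filter-and-transform pipeline described in the summaries can be sketched in a few lines. The sketch below is illustrative only: `query_llm`, the prompt wording, and the filtering heuristic are all hypothetical stand-ins (the stub is deterministic so the example runs without an actual LLM API; real use would send the prompts to a public LLM).

```python
# Minimal sketch of prompting a public LLM to filter and transform
# public text into a synthetic dataset resembling on-device user data.
# `query_llm` is a hypothetical stand-in for a real LLM API call,
# stubbed here with simple deterministic heuristics.

def query_llm(prompt: str) -> str:
    # Stub: answer filtering prompts with yes/no based on length,
    # and "rewrite" text by lowercasing it. A real system would call
    # an LLM endpoint instead.
    text = prompt.split("TEXT:", 1)[1].strip()
    if prompt.startswith("Does the following"):
        return "yes" if len(text.split()) <= 12 else "no"
    return text.lower()

def filter_example(text: str) -> bool:
    """Ask the LLM whether a public example resembles short on-device text."""
    prompt = f"Does the following look like a short mobile-typed message? TEXT: {text}"
    return query_llm(prompt).startswith("yes")

def transform_example(text: str) -> str:
    """Ask the LLM to rewrite a public example into a chat-like style."""
    prompt = f"Rewrite in informal chat style. TEXT: {text}"
    return query_llm(prompt)

def synthesize(public_corpus):
    """Filter, then transform, to build a synthetic pre-training dataset."""
    return [transform_example(t) for t in public_corpus if filter_example(t)]

# Example: short text is kept and restyled; a long passage is filtered out.
print(synthesize(["Hello World", "word " * 20]))  # → ['hello world']
```

The synthetic dataset produced this way never touches private user data; privacy protection comes later, during DP federated fine-tuning on devices.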
Keywords
* Artificial intelligence
* Federated learning
* Fine-tuning