Summary of Prompt Public Large Language Models to Synthesize Data for Private On-device Applications, by Shanshan Wu et al.
Prompt Public Large Language Models to Synthesize Data for Private On-device Applications
by Shanshan Wu, Zheng Xu, Yanxiang Zhang, Yuanbo Zhang, Daniel Ramage
First submitted to arXiv on: 5 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper investigates how large language models (LLMs) pre-trained on public data can improve the quality of pre-training data for on-device language models trained with differential privacy (DP) and federated learning (FL). The authors design LLM prompts to filter and transform existing public data, generating new synthetic datasets that resemble real user data distributions. By pre-training a model on these synthetic datasets, the authors achieve 19.0% and 22.8% relative improvements in next-word prediction accuracy compared to a baseline model pre-trained on a standard public dataset. The method also achieves evaluation accuracy comparable to or better than the baseline during DP FL fine-tuning over millions of mobile devices, and the final model outperforms the baseline in production A/B testing. |
| Low | GrooveSquid.com (original content) | This paper explores how large language models can improve the quality of pre-training data for on-device language models trained with differential privacy and federated learning. The authors create synthetic datasets by filtering and transforming public data to resemble real user data distributions. By training a model on these synthetic datasets, they achieve better performance in predicting next words compared to using standard public datasets. This is important because it allows devices to benefit from high-quality models without sharing their private data. |
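The filter-and-transform pipeline described in the summaries can be sketched in a few lines. The sketch below is illustrative only: `query_llm`, the prompt wording, and the filtering heuristic are all hypothetical stand-ins (the stub is deterministic so the example runs without an actual LLM API; real use would send the prompts to a public LLM).

```python
# Minimal sketch of prompting a public LLM to filter and transform
# public text into a synthetic dataset resembling on-device user data.
# `query_llm` is a hypothetical stand-in for a real LLM API call,
# stubbed here with simple deterministic heuristics.

def query_llm(prompt: str) -> str:
    # Stub: answer filtering prompts with yes/no based on length,
    # and "rewrite" text by lowercasing it. A real system would call
    # an LLM endpoint instead.
    text = prompt.split("TEXT:", 1)[1].strip()
    if prompt.startswith("Does the following"):
        return "yes" if len(text.split()) <= 12 else "no"
    return text.lower()

def filter_example(text: str) -> bool:
    """Ask the LLM whether a public example resembles short on-device text."""
    prompt = f"Does the following look like a short mobile-typed message? TEXT: {text}"
    return query_llm(prompt).startswith("yes")

def transform_example(text: str) -> str:
    """Ask the LLM to rewrite a public example into a chat-like style."""
    prompt = f"Rewrite in informal chat style. TEXT: {text}"
    return query_llm(prompt)

def synthesize(public_corpus):
    """Filter, then transform, to build a synthetic pre-training dataset."""
    return [transform_example(t) for t in public_corpus if filter_example(t)]

# Example: short text is kept and restyled; a long passage is filtered out.
print(synthesize(["Hello World", "word " * 20]))  # → ['hello world']
```

The synthetic dataset produced this way never touches private user data; privacy protection comes later, during DP federated fine-tuning on devices.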
Keywords
* Artificial intelligence
* Federated learning
* Fine-tuning