A Survey of Data Synthesis Approaches
by Hsin-Yu Chang, Pei-Yu Chen, Tun-Hsiang Chou, Chang-Sheng Kao, Hsuan-Yun Yu, Yen-Ting Lin, Yun-Nung Chen
First submitted to arXiv on: 4 Jul 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper presents a comprehensive review of synthetic data techniques. It outlines four primary goals for using synthetic data in data augmentation: improving diversity, balancing data, addressing domain shifts, and resolving edge cases. The authors categorize synthetic data techniques into four categories based on prevailing machine learning approaches: expert-knowledge-based methods, direct training, pre-training followed by fine-tuning, and foundation models without fine-tuning. They also discuss three goals for synthetic data filtering: basic quality, label consistency, and data distribution. Finally, the paper explores future directions for synthetic data research, highlighting three crucial areas of focus: improving quality, evaluating synthetic data, and multi-model data augmentation. |
| Low | GrooveSquid.com (original content) | This study looks at how to create synthetic data that can help machines learn better. It describes four main reasons why making synthetic data is important: ensuring the data is diverse, balancing the amounts of different types of data, adjusting for shifts in the kind of data seen, and handling unusual cases. The researchers also group ways of making synthetic data into categories based on how they relate to machine learning techniques. Finally, they suggest three key areas where synthetic data research should focus: ensuring the synthetic data is of good quality, measuring how well the synthetic data works, and using multiple models to augment data. |
Keywords
» Artificial intelligence » Data augmentation » Fine tuning » Machine learning » Synthetic data