


Improving Text-To-Audio Models with Synthetic Captions

by Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

First submitted to arXiv on: 18 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (GrooveSquid.com original content)
The paper proposes a novel approach for obtaining high-quality training data, specifically captions, for text-to-audio models. The method uses an audio language model to synthesize accurate and diverse captions at scale, addressing the limitations of prior captioning methods. The authors build a dataset called AF-AudioSet with this pipeline and evaluate its effectiveness by pre-training text-to-audio models on the synthetic captions, achieving significant improvements in audio generation quality and new state-of-the-art results on AudioCaps and MusicCaps.
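At a high level, a caption-synthesis pipeline like the one the summary describes could be sketched as below. Everything here is illustrative: the function names (`audio_lm_caption`, `audio_text_similarity`), the candidate-sampling step, and the similarity-based filtering are assumptions about how such a pipeline might be wired together, not the paper's actual implementation; the two model functions are stubbed out so the sketch runs on its own.

```python
import random

# Hypothetical stand-in for an audio language model that captions a clip.
# A real system would load a pretrained captioning model here.
def audio_lm_caption(audio_clip, num_candidates=4):
    """Sample several candidate captions for one audio clip (stubbed)."""
    return [f"candidate caption {i} for {audio_clip}" for i in range(num_candidates)]

# Hypothetical stand-in for an audio-text matching model that scores how
# well a caption describes the audio (stubbed with a deterministic value).
def audio_text_similarity(audio_clip, caption):
    """Return a similarity score in [0, 1] for an (audio, caption) pair."""
    rng = random.Random(f"{audio_clip}|{caption}")
    return rng.random()

def build_synthetic_caption_dataset(audio_clips, threshold=0.3):
    """For each clip, keep the best-scoring candidate caption,
    and drop clips whose best caption falls below the threshold."""
    dataset = []
    for clip in audio_clips:
        candidates = audio_lm_caption(clip)
        best = max(candidates, key=lambda c: audio_text_similarity(clip, c))
        if audio_text_similarity(clip, best) >= threshold:
            dataset.append((clip, best))
    return dataset

# The resulting (audio, caption) pairs would then serve as pre-training
# data for a text-to-audio model.
pairs = build_synthetic_caption_dataset(["clip_001.wav", "clip_002.wav"])
```

The key design idea the sketch tries to capture is that captions are generated at scale and then filtered for accuracy, rather than written by hand, which is what makes the approach cheap enough to cover a large audio corpus.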
Low Difficulty Summary (GrooveSquid.com original content)
This paper proposes a way to get better training data for machines that turn text into sound. Right now, it’s hard to find good captions for this task. The researchers built a system that uses an audio language model to make lots of different, accurate captions from scratch. They used this system to make a big dataset called AF-AudioSet, then tested whether training other machines on these synthetic captions would make them better at generating sound. It worked! The results show that this new approach performs very well and could be used to build even better machines in the future.

Keywords

* Artificial intelligence  * Language model