


IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS

by Ashwin Sankar, Srija Anand, Praveen Srinivasa Varadhan, Sherry Thomas, Mehak Singal, Shridhar Kumar, Deovrat Mehendale, Aditi Krishana, Giri Raju, Mitesh Khapra

First submitted to arXiv on: 9 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper presents a novel approach to enhancing text-to-speech (TTS) synthesis for Indian languages by leveraging large-scale automatic speech recognition (ASR) datasets. The authors develop IndicVoices-R (IV-R), the largest multilingual Indian TTS dataset, containing 1,704 hours of high-quality speech from 10,496 speakers across 22 Indian languages. IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS. The authors also introduce the IV-R Benchmark, assessing zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices. They demonstrate that fine-tuning an English pre-trained model on a combined dataset of high-quality IndicTTS and IV-R data results in better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone. The authors release all data and code, opening up new possibilities for TTS models for all 22 official Indian languages.
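As a rough illustration of how the benchmark's "zero-shot speaker generalization" might be measured, a common proxy is speaker similarity: the cosine similarity between speaker embeddings of a reference recording from a speaker the model has never seen and of the speech synthesized in that speaker's voice. The sketch below is an assumption-laden toy, not the paper's released code: `embed()`, the 192-dimensional embedding size, and the random placeholder audio all stand in for a real pretrained speaker encoder and real recordings.

```python
# Illustrative sketch only: quantify zero-shot speaker similarity as the
# cosine similarity between speaker embeddings of a reference recording
# (unseen speaker) and of the TTS output cloning that voice.
# embed(), the 192-dim size, and the placeholder audio are assumptions.
import numpy as np

rng = np.random.default_rng(seed=0)


def embed(waveform: np.ndarray) -> np.ndarray:
    """Stand-in speaker encoder; a real setup would use a pretrained model
    (e.g. an ECAPA-TDNN-style encoder) mapping audio to a fixed vector."""
    return rng.normal(size=192)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Placeholder audio for one unseen (zero-shot) speaker; a real evaluation
# would load the reference recording and the synthesized utterance instead.
reference_audio = rng.normal(size=16000)    # ~1 s at 16 kHz
synthesized_audio = rng.normal(size=16000)

score = cosine_similarity(embed(reference_audio), embed(synthesized_audio))
print(f"speaker similarity: {score:.3f}")   # closer to 1.0 = better voice match
```

In a full evaluation one would average such a score over many unseen speakers and compare a model fine-tuned on IndicTTS alone against one fine-tuned on IndicTTS plus IV-R, mirroring the comparison the authors report.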
Low Difficulty Summary (original content by GrooveSquid.com)
The paper builds a large database of Indian voices that can be used to make more realistic computer speech in Indian languages. This is important because there isn't much good data available for Indian languages right now. The authors use a combination of existing recordings and new processing techniques to build the database, which they call IndicVoices-R. They also create a test, called the IV-R Benchmark, to check how well speech models trained on the database handle voices they have never heard before. The results show that using this new database helps TTS models sound more natural when speaking Indian languages and imitate new speakers more accurately.

Keywords

» Artificial intelligence  » Few shot  » Fine tuning  » Generalization  » Zero shot