Summary of Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation, by Praveen Srinivasa Varadhan et al.
Rethinking MUSHRA: Addressing Modern Challenges in Text-to-Speech Evaluation
by Praveen Srinivasa Varadhan, Amogh Gulati, Ashwin Sankar, Srija Anand, Anirudh Gupta, Anirudh Mukherjee, Shiva Kumar Marepally, Ankur Bhatia, Saloni Jaju, Suvrat Bhooshan, Mitesh M. Khapra
First submitted to arXiv on: 19 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper addresses the lack of a robust evaluation framework for Text-to-Speech (TTS) models. Current methods, such as Mean Opinion Score (MOS) tests, have clear limitations: they fail to differentiate between similar models and are time-intensive. The authors instead build on the MUSHRA test, a promising alternative in which listeners rate multiple TTS systems simultaneously against a reference, and propose two refined variants of it (see the aggregation sketch after this table). The first variant enables fairer ratings when synthesized samples surpass the quality of the human reference, while the second reduces ambiguity by providing clear, fine-grained rating guidelines. The paper also releases MANGO, a massive dataset of 246,000 human ratings for Indian languages, to aid in analyzing human preferences and in developing automatic metrics for evaluating TTS systems. |
Low | GrooveSquid.com (original content) | TTS models are getting better at talking like humans, but we need a good way to measure how well they’re doing. Right now, we use tests like MOS, which don’t work very well: they can’t tell similar models apart, and they take a long time to run. The MUSHRA test is better, but it has its own problems, because it relies on people listening to real human speech and then judging how close the computer-generated speech comes to it. This paper looks at the MUSHRA test and finds two big issues. First, when people compare synthesized speech to real speech, they tend to judge it based on what the real speech sounds like, rather than just the quality of the synthesized speech itself. Second, there’s no clear way for people to agree on what makes synthesized speech good or bad. To fix this, the authors propose two new versions of the MUSHRA test that try to solve these problems. |
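To make the MUSHRA setup described above concrete, here is a minimal sketch of how ratings from such a listening test might be aggregated: each listener scores several systems plus a hidden human reference on a 0-100 scale, and systems are then compared by their mean score with a bootstrap confidence interval. This is an illustration only, not the paper's protocol or the MANGO schema; the record fields (`listener`, `utterance`, `system`, `score`) and the helper `summarize_mushra` are hypothetical.

```python
# Minimal sketch (hypothetical, not from the paper or the MANGO release):
# aggregating MUSHRA-style ratings. Each record holds one listener's 0-100
# score for one system on one utterance.
import random
from collections import defaultdict
from statistics import mean

def summarize_mushra(ratings, n_boot=1000, seed=0):
    """Return per-system mean score with a 95% bootstrap confidence interval."""
    rng = random.Random(seed)
    by_system = defaultdict(list)
    for r in ratings:
        by_system[r["system"]].append(r["score"])

    summary = {}
    for system, scores in by_system.items():
        # Resample with replacement to estimate how stable the mean is.
        boot_means = sorted(
            mean(rng.choices(scores, k=len(scores))) for _ in range(n_boot)
        )
        lo = boot_means[int(0.025 * n_boot)]
        hi = boot_means[int(0.975 * n_boot)]
        summary[system] = {"mean": mean(scores), "ci95": (lo, hi)}
    return summary

if __name__ == "__main__":
    # Toy ratings for the hidden reference and two hypothetical TTS systems.
    toy = [
        {"listener": "L1", "utterance": "u1", "system": "reference", "score": 95},
        {"listener": "L2", "utterance": "u1", "system": "reference", "score": 90},
        {"listener": "L3", "utterance": "u1", "system": "reference", "score": 92},
        {"listener": "L1", "utterance": "u1", "system": "tts_a", "score": 88},
        {"listener": "L2", "utterance": "u1", "system": "tts_a", "score": 85},
        {"listener": "L3", "utterance": "u1", "system": "tts_a", "score": 91},
        {"listener": "L1", "utterance": "u1", "system": "tts_b", "score": 70},
        {"listener": "L2", "utterance": "u1", "system": "tts_b", "score": 75},
        {"listener": "L3", "utterance": "u1", "system": "tts_b", "score": 68},
    ]
    for system, stats in summarize_mushra(toy).items():
        print(system, round(stats["mean"], 1), stats["ci95"])
```

Reporting a confidence interval rather than a bare mean is what lets a MUSHRA-style comparison say whether two similar systems are actually distinguishable, which is exactly the weakness of plain MOS scores that the summaries above highlight.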