Summary of Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech, by Eric Battenberg et al.
Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech
by Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao
First submitted to arXiv on: 29 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes enhancements that improve the robustness and length generalization of autoregressive (AR) Transformer-based encoder-decoder text-to-speech (TTS) systems, which tend to drop or repeat words or produce erratic output on longer sequences. The key addition is an alignment mechanism that supplies relative location information to the cross-attention operations; the alignment is learned during training via backpropagation, so the approach retains the flexible modeling power of multi-head self- and cross-attention. The resulting system, Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system while eliminating repeated and dropped words and generalizing to any practical utterance length (a rough sketch of the relative-location idea follows the table). |
Low | GrooveSquid.com (original content) | Text-to-speech systems often struggle with long texts, repeating or dropping words when asked to generate longer stretches of speech. This paper fixes that by adding a mechanism that tells the model where it is in the text, so it knows which words to pay attention to as it generates speech. The new system, called Very Attentive Tacotron, sounds just as natural and expressive as a comparable baseline system, but it can handle utterances of any practical length. |
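To make the "relative location information for cross-attention" idea concrete, here is a minimal, illustrative sketch in PyTorch. It assumes a simple design in which the decoder keeps a scalar alignment position over the text, advances it by a bounded learned step, and biases the cross-attention logits by each text position's distance from that alignment position. The class and parameter names (`RelativeLocationCrossAttention`, `max_delta`) are invented for this sketch; the actual mechanism in Very Attentive Tacotron differs in detail.

```python
import torch
import torch.nn as nn


class RelativeLocationCrossAttention(nn.Module):
    """Illustrative cross-attention whose logits are biased by the distance
    between each encoder position and a learned, monotonically advancing
    alignment position. A rough sketch, not the paper's exact mechanism."""

    def __init__(self, d_model: int, n_heads: int, max_delta: float = 4.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Predicts how far the alignment position advances at each decoder step.
        self.delta_proj = nn.Linear(d_model, 1)
        self.max_delta = max_delta
        # Width of the soft window around the alignment position (learned).
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, query, memory, prev_position):
        # query: (B, 1, d_model) decoder state for the current step
        # memory: (B, T_text, d_model) encoder outputs
        # prev_position: (B, 1) alignment position from the previous step

        # Advance the alignment position by a bounded, non-negative delta,
        # learned end-to-end via backpropagation.
        delta = torch.sigmoid(self.delta_proj(query.squeeze(1))) * self.max_delta
        position = prev_position + delta  # (B, 1)

        # Bias: penalize encoder positions far from the current alignment position.
        t = torch.arange(memory.size(1), device=memory.device).float()  # (T_text,)
        dist = t.unsqueeze(0) - position  # (B, T_text) relative location
        bias = -(dist ** 2) / (2 * torch.exp(self.log_sigma) ** 2)  # (B, T_text)

        # MultiheadAttention accepts an additive float mask of shape
        # (B * n_heads, L_query, T_text); broadcast the same bias to every head.
        n_heads = self.attn.num_heads
        attn_mask = bias.unsqueeze(1).repeat_interleave(n_heads, dim=0)

        out, _ = self.attn(query, memory, memory, attn_mask=attn_mask)
        return out, position
```

One intuition for why such a bias helps: because the alignment position advances by a bounded, non-negative amount per step, attention cannot jump far ahead (dropping words) or fall far behind (repeating words), and the position keeps advancing for arbitrarily long utterances. The paper realizes this idea differently, but the relative-location bias captured above is the gist described in the summaries.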
Keywords
» Artificial intelligence » Alignment » Attention » Autoregressive » Backpropagation » Cross attention » Encoder decoder » Generalization » T5 » Transformer