Summary of Translatotron-V(ision): An End-to-End Model for In-Image Machine Translation, by Zhibin Lan et al.
Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation
by Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Min Zhang, Jinsong Su
First submitted to arXiv on: 3 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract, available on the arXiv page. |
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to in-image machine translation (IIMT), which translates an image containing text into an image containing the translated text. The authors argue that conventional cascaded methods suffer from error propagation, a massive parameter count, and deployment difficulties. To address these issues, they develop an end-to-end IIMT model called Translatotron-V(ision), consisting of four modules: an image encoder, an image decoder, a target text decoder, and an image tokenizer (a hedged sketch of this layout appears after the table). The target text decoder alleviates the language alignment burden, while the image tokenizer converts long pixel sequences into shorter visual tokens. A two-stage training framework helps the model learn alignment across modalities and languages. The authors evaluate their model with a location-aware metric called Structure-BLEU (also sketched below) and demonstrate performance competitive with cascaded models while using fewer parameters. |
Low | GrooveSquid.com (original content) | This paper creates a new way to translate images containing text into images with the translated text. Current methods have problems, like errors carrying over from one step to the next and needing many parameters. To fix this, the authors designed an all-in-one model that learns to translate text between languages while keeping the original image’s look. The model has four parts: one for understanding the image, one for generating the translated image, one for helping with language alignment, and one for breaking long pixel sequences into shorter pieces. They also developed a new way to train the model and a method to measure how well it works. The results show that their model does as well as or better than other methods while using fewer parameters. |
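To make the four-module layout concrete, here is a minimal PyTorch-style sketch of how such a model could be wired together. Everything in it (layer types, depths, dimensions, vocabulary sizes, and all names) is an illustrative assumption rather than the paper's actual configuration; causal masks, positional encodings, and the two-stage training procedure are omitted.

```python
import torch
import torch.nn as nn

class TranslatotronVSketch(nn.Module):
    """Illustrative sketch of the four modules named in the summary:
    image encoder, target text decoder, image decoder, and a separately
    trained image tokenizer. All sizes and layer counts are assumptions."""

    def __init__(self, text_vocab=32000, visual_vocab=8192, d_model=512):
        super().__init__()
        # Patch embedding turns the source image into a token sequence
        # (assumption: 16x16 patches over an RGB input).
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Image encoder over the patch sequence.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        # Target text decoder: predicts the translation as text, which the
        # summary says eases the cross-lingual alignment burden.
        self.text_embed = nn.Embedding(text_vocab, d_model)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.text_head = nn.Linear(d_model, text_vocab)
        # Image decoder: predicts discrete visual tokens for the output
        # image; a separately trained VQ-style image tokenizer (omitted
        # here) maps these short token sequences back to pixels.
        self.visual_embed = nn.Embedding(visual_vocab, d_model)
        self.image_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=6)
        self.visual_head = nn.Linear(d_model, visual_vocab)

    def forward(self, image, text_tokens, visual_tokens):
        # (B, 3, H, W) -> (B, N, d_model) patch sequence.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        memory = self.image_encoder(patches)
        # Causal masks and positional encodings omitted for brevity.
        text_logits = self.text_head(
            self.text_decoder(self.text_embed(text_tokens), memory))
        visual_logits = self.visual_head(
            self.image_decoder(self.visual_embed(visual_tokens), memory))
        return text_logits, visual_logits
```

The design point the summary highlights is visible here: the image decoder emits short visual-token sequences rather than raw pixels, and the image tokenizer handles the conversion between pixels and tokens, keeping decoding tractable.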
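The summary also names a location-aware metric, Structure-BLEU. The paper's exact formulation is not reproduced here; the sketch below shows one plausible shape for such a metric under stated assumptions: recognize text regions in the generated and reference images with OCR, pair regions by spatial overlap, and score the paired texts with BLEU. The region-pairing rule, the IoU threshold, and the (box, text) input format are all hypothetical.

```python
import sacrebleu  # pip install sacrebleu

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def structure_bleu_like(gen_regions, ref_regions, iou_thresh=0.5):
    """Hypothetical location-aware BLEU. Both arguments are lists of
    (box, text) pairs, e.g. produced by an OCR system. Each reference
    region is matched to the generated region with the highest overlap;
    references left unmatched score against an empty hypothesis."""
    hyps, refs = [], []
    for ref_box, ref_text in ref_regions:
        best = max(gen_regions, key=lambda g: iou(g[0], ref_box), default=None)
        matched = best is not None and iou(best[0], ref_box) >= iou_thresh
        hyps.append(best[1] if matched else "")
        refs.append(ref_text)
    return sacrebleu.corpus_bleu(hyps, [refs]).score
```

When every generated region overlaps its reference region, this reduces to ordinary corpus BLEU over the paired strings; misplaced or missing text regions drag the score down, which is what makes the metric location-aware.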
Keywords
» Artificial intelligence » Alignment » Bleu » Decoder » Encoder » Tokenizer » Translation