Summary of Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition, by Chan-Jan Hsu et al.
Let’s Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition
by Chan-Jan Hsu, Yi-Chang Chen, Feng-Ting Liao, Pei-Chen Ho, Yu-Hsiang Wang, Po-Chun Hsu, Da-shan Shiu
First submitted to arXiv on: 23 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | The paper’s original abstract, available on its arXiv listing. | 
| Medium | GrooveSquid.com (original content) | Generative Fusion Decoding (GFD) is a framework for integrating Large Language Models (LLMs) into multi-modal text recognition systems such as automatic speech recognition (ASR) and optical character recognition (OCR). GFD fuses the models during decoding by mapping between their token spaces, which makes it compatible with various auto-regressive models without re-training. The framework has three main advantages. First, its plug-and-play design simplifies feature alignment, letting the LLM correct recognition errors while reducing computation latency. Second, it capitalizes on the LLM’s in-context learning in tasks such as long-form speech recognition and instruction-aware speech recognition. Third, it enables fusing recognition models that are deficient in Chinese text recognition with LLMs extensively trained on Chinese. | 
| Low | GrooveSquid.com (original content) | GFD is a new way to combine language models with systems that recognize speech or text in images. This improves the accuracy of these systems, especially on long speech recordings or speech that comes with instructions. GFD also lets a recognition system that is weak at Chinese text work together with a language model that was trained extensively on Chinese. Overall, GFD is a step forward in getting recognition systems and language models to work better together. | 
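
To make the idea of fusing two models during decoding more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes UTF-8 bytes as the shared space between the two models, uses greedy decoding for brevity, and replaces both models with tiny hypothetical lookup tables (`ASR_NEXT`, `LLM_STRINGS`). The fusion weight `weight` and the function names are likewise illustrative assumptions.

```python
import math

# Hypothetical LLM: a distribution over whole strings, standing in for a real model.
LLM_STRINGS = {"there is": 0.9, "their is": 0.1}

# Hypothetical recognizer: next-token probabilities given the text decoded so far.
# Its tokens ("the", "ir", "re", " is") need not match the LLM's tokenization.
ASR_NEXT = {
    "": {"the": 1.0},
    "the": {"ir": 0.55, "re": 0.45},
    "their": {" is": 1.0},
    "there": {" is": 1.0},
}


def llm_prefix_logprob(prefix: bytes) -> float:
    """Log-probability that the LLM's text starts with these bytes."""
    mass = sum(
        p for s, p in LLM_STRINGS.items() if s.encode("utf-8").startswith(prefix)
    )
    return math.log(mass) if mass > 0 else -1e9


def fused_greedy_decode(weight: float) -> str:
    """Greedy decoding where each step mixes recognizer and LLM scores.

    weight = 0.0 uses the recognizer alone; larger values trust the LLM more.
    """
    text, asr_logprob = "", 0.0
    while text in ASR_NEXT:  # stop when the recognizer has no continuation
        best = None
        for token, p in ASR_NEXT[text].items():
            cand = text + token
            cand_asr = asr_logprob + math.log(p)
            # Score the candidate in the shared byte space, not in token space.
            cand_llm = llm_prefix_logprob(cand.encode("utf-8"))
            score = (1 - weight) * cand_asr + weight * cand_llm
            if best is None or score > best[0]:
                best = (score, cand, cand_asr)
        _, text, asr_logprob = best
    return text


print(fused_greedy_decode(0.0))
print(fused_greedy_decode(0.5))
```

With `weight=0.0` the toy recognizer follows its own slight preference for "their", while with `weight=0.5` the LLM's prefix score steers the same step toward "there". Because the LLM is queried on the byte prefix at every step, neither model needs re-training or a shared vocabulary, which is the property the summaries above describe.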
Keywords
- Artificial intelligence
- Alignment
- Multi-modal
- Token




