Summary of Vocabulary Expansion of Chat Models with Unlabeled Target Language Data, by Atsuki Yamaguchi et al.
Vocabulary Expansion of Chat Models with Unlabeled Target Language Data
by Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
First submitted to arXiv on: 16 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Chat models outperform base models in both conversation and task-solving abilities, but they require adaptation for underrepresented languages. A common technique is vocabulary expansion (VE): adding target-language tokens to the vocabulary and then continuing pre-training on language-specific data. However, chat data in the target language can be costly to obtain or simply non-existent, and adapting chat models with unlabeled data instead could result in catastrophic forgetting of their chat abilities. This paper investigates the impact of using unlabeled target language data for VE on chat models. We show that off-the-shelf VE generally performs well across target language tasks and models (71% of cases), but underperforms when the source chat model is already strong. To improve adapted models, we propose post-hoc techniques that inject information from the source model without further training. Our experiments show that these methods are effective, improving performance in 87% of cases. (A minimal code sketch of VE appears below the table.) |
Low | GrooveSquid.com (original content) | Chat models can do tasks and talk to humans better than basic models, but they need help to work with languages they weren't trained on. One way is to add new words to the model's vocabulary and then train it on language-specific data. However, getting chat data for some languages can be hard or expensive. Another idea is to use unlabeled data, but that could cause the model to forget what it learned before. This paper looks at how unlabeled target language data can help adapt chat models. The authors found that this method usually works well (71% of cases), but not when the source model was already very good. To make adapted models even better, they suggest adding information from the original model without training it again. Their tests showed that these methods improve performance in 87% of cases. |
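To make the vocabulary expansion step more concrete, here is a minimal sketch of how VE is commonly done with the Hugging Face transformers library. The model name, the added tokens, and the mean-of-embeddings initialization are illustrative assumptions, not details taken from the paper, and the paper's post-hoc techniques for injecting source-model information are only hinted at in the final comment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model name and tokens, for illustration only.
SOURCE_CHAT_MODEL = "org/source-chat-model"
NEW_TARGET_TOKENS = ["token_a", "token_b"]  # e.g. learned from unlabeled target-language text

tokenizer = AutoTokenizer.from_pretrained(SOURCE_CHAT_MODEL)
model = AutoModelForCausalLM.from_pretrained(SOURCE_CHAT_MODEL)

# 1) Vocabulary expansion: add target-language tokens and grow the embedding matrix.
num_added = tokenizer.add_tokens(NEW_TARGET_TOKENS)
model.resize_token_embeddings(len(tokenizer))

# 2) Initialize the new embedding rows. The mean of the existing embeddings is a
#    common heuristic; the paper may use a different initialization.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[-num_added:] = emb[:-num_added].mean(dim=0)

# After this, the expanded model would be further pre-trained on unlabeled
# target-language data. The source chat model's information could then be
# re-injected post hoc (e.g. by copying or interpolating shared parameters)
# without further training; the paper's concrete post-hoc techniques are not
# reproduced here.
```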