Summary of Tokenization Matters! Degrading Large Language Models Through Challenging Their Tokenization, by Dixuan Wang et al.
Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization
by Dixuan Wang, Yanda Li, Junyuan Jiang, Zepeng Ding, Guochao Jiang, Jiaqing Liang, Deqing Yang
First submitted to arXiv on: 27 May 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Large Language Models (LLMs) have shown impressive language understanding and generation capabilities, yet they often produce inaccurate responses to specific queries. This weakness stems from tokenization, a preprocessing step shared by all LLMs: when the input is tokenized incorrectly, the model cannot understand it precisely, leading to unsatisfactory output. To demonstrate this flaw, we created an adversarial dataset (ADT) that challenges LLMs' tokenization. ADT consists of two subsets: the manually constructed ADT-Human and the automatically generated ADT-Auto. Our results show that ADT effectively degrades the capabilities of leading LLMs such as GPT-4o, Llama-3, Qwen2.5-max, and others. In addition, the automatic data generation method proves efficient and robust and can be applied to any open-source LLM. |
| Low | GrooveSquid.com (original content) | Large Language Models (LLMs) are very good at understanding and generating language, but sometimes they get things wrong. This happens because of the way they "cut up" words into smaller pieces called tokens; if they don't do this correctly, they can make mistakes (the short sketch after this table shows how a tokenizer splits text into pieces). To see how well LLMs handle this challenge, we created a special set of examples that tries to trick them. We tested some popular LLMs and found that our examples did indeed make them worse at understanding language. |
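
To make the tokenization issue concrete, here is a minimal sketch of how a subword (BPE) tokenizer segments text, using OpenAI's tiktoken library as an illustration. This is not the paper's ADT construction pipeline, and the perturbed spelling below is a hypothetical example rather than an item from the ADT dataset.

```python
# Illustrative sketch only: inspect how a BPE tokenizer segments text.
# Uses OpenAI's tiktoken library; cl100k_base is the encoding used by
# GPT-4-family models.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def show_tokens(text: str) -> None:
    """Print the token pieces a string is split into."""
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {pieces}")

# A common word is often a single token, while a slightly altered or rare
# string is split into several subword pieces. This kind of unintuitive
# segmentation is what adversarial inputs can exploit.
show_tokens("tokenization")
show_tokens("tokeniizatiion")  # hypothetical perturbed spelling
```

Running the sketch prints the token pieces for each string, making visible that the model never sees whole words, only the segments its tokenizer produces.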
Keywords
» Artificial intelligence » GPT » Language understanding » Llama » Tokenization