Summary of LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, by Yushi Bai et al.
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
by Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li
First submitted to arXiv on: 19 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper introduces LongBench v2, a benchmark designed to assess the ability of Large Language Models (LLMs) to handle long-context problems requiring deep understanding and reasoning across realistic multitasks. The benchmark consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words, spanning six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, nearly 100 highly educated individuals with diverse professional backgrounds contributed to the dataset. Automated and manual review processes were employed to maintain high quality and difficulty, to the point that human experts achieved only 53.7% accuracy under a 15-minute time constraint. Evaluation reveals that the best-performing model, when answering questions directly, achieves only 50.1% accuracy. In contrast, the o1-preview model, which performs longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.
Low | GrooveSquid.com (original content) | This paper creates a special test called LongBench v2 to see how well computers can understand big pieces of text. It has lots of tricky questions that need deep thinking, and it's meant for very smart computer programs called Large Language Models (LLMs). The test has 503 questions, some with really long texts to read through. People from all sorts of jobs helped write the questions, and even experts only got about half of them right within a short time. This shows that computers need to get better at thinking deeply and at using more computing power to do well on this kind of test.
Keywords
- Artificial intelligence
- Inference