
Summary of LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks, by Yushi Bai et al.


LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

by Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, Juanzi Li

First submitted to arXiv on: 19 Dec 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper introduces LongBench v2, a benchmark designed to assess the ability of Large Language Models (LLMs) to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. The benchmark consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words, spanning six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, nearly 100 highly educated individuals with diverse professional backgrounds contributed to the dataset. Automated and manual review processes were employed to maintain high quality and difficulty, with the result that human experts achieve only 53.7% accuracy under a 15-minute time constraint. Evaluation shows that the best-performing model, when answering questions directly, reaches only 50.1% accuracy, whereas the o1-preview model, which incorporates longer reasoning, reaches 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute for tackling the long-context challenges in LongBench v2. (A minimal evaluation sketch follows these summaries.)

Low Difficulty Summary (original content by GrooveSquid.com)
This paper creates a special test called LongBench v2 to see how well computers can understand big pieces of text. It has lots of tricky questions that need deep thinking, and it is meant for very smart computer programs called Large Language Models (LLMs). The test has 503 questions, some with really long texts to read through. People from all sorts of jobs helped write the questions, and even experts only got about half of them right within a short time limit. This shows that computers need to get better at thinking deeply and at using more computing power to do well on this kind of test.

Keywords

  • Artificial intelligence
  • Inference