Summary of DiscoveryBench: Towards Data-Driven Discovery with Large Language Models, by Bodhisattwa Prasad Majumder et al.
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models
by Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, Peter Clark
First submitted to arXiv on: 1 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper presents DiscoveryBench, a comprehensive benchmark designed to evaluate the capabilities of large language models (LLMs) in automating data-driven discovery. The benchmark formalizes the multi-step process of discovery and consists of 264 tasks across six domains, including sociology and engineering. Each task is defined by a dataset, metadata, and a discovery goal in natural language. Additionally, the authors provide 903 synthetic tasks for controlled evaluations. They use several popular LLM-based reasoning frameworks as baselines and find that even the best system scores only 25%. The authors’ structured formalism of data-driven discovery enables facet-based evaluation, providing insights into different failure modes. Overall, DiscoveryBench serves as a valuable resource to improve LLMs in data-driven discovery. |
| Low | GrooveSquid.com (original content) | This paper is about using big language models to help scientists find new ideas and discoveries from datasets. The authors created a special test called DiscoveryBench that checks how well these models can do this job. They used real-world examples from published papers to create 264 tasks across six different areas like sociology and engineering. Each task has a dataset, some information about the data, and a goal for what they want to find. The authors also created 903 fake tasks to test the models in a controlled way. They found that even the best model only got 25% of the tasks right, which shows how hard it is to automate this process. Overall, DiscoveryBench helps us understand how to make language models better at finding new discoveries. |
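The task structure the summaries describe (a dataset, its metadata, and a discovery goal in natural language) can be sketched as a simple record. This is a minimal illustration only; the field names and example values below are assumptions for clarity, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class DiscoveryTask:
    """Illustrative record for one benchmark task (field names are assumptions)."""
    domain: str        # e.g. "sociology" or "engineering"
    dataset_path: str  # path to the data table the task is grounded in
    metadata: dict     # column descriptions, units, provenance notes
    goal: str          # discovery goal stated in natural language
    synthetic: bool = False  # True for the controlled synthetic tasks

# A made-up example task in the spirit of the benchmark:
task = DiscoveryTask(
    domain="sociology",
    dataset_path="data/survey_responses.csv",
    metadata={
        "income": "annual household income (USD)",
        "edu_years": "years of formal education",
    },
    goal="Is there a relationship between education and income in this survey?",
)

print(task.domain, task.synthetic)  # → sociology False
```

A system under evaluation would receive the dataset, metadata, and goal, and be scored on whether its derived finding matches the gold discovery; the `synthetic` flag here simply marks the controlled-evaluation variant mentioned in the summaries.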