GraphEval36K: Benchmarking Coding and Reasoning Capabilities of Large Language Models on Graph Datasets

by Qiming Wu, Zichen Chen, Will Corcoran, Misha Sra, Ambuj K. Singh

First submitted to arXiv on: 23 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates the limitations of large language models (LLMs) in processing and manipulating structured data, particularly graphs. The authors introduce GraphEval36K, a comprehensive graph dataset consisting of 40 graph coding problems and 36,900 test cases for evaluating LLMs’ ability to solve graph-based tasks. The dataset is organized into eight primary categories and four sub-categories to assess performance across different types of graphs. Benchmarking ten LLMs, the study finds that private models outperform open-source ones, although the gap is narrowing. The authors also analyze LLM performance on directed versus undirected graphs, across different graph concepts, and across network models. Furthermore, they propose Structured Symbolic Decomposition (SSD), an instruction-based method that enhances LLM performance on complex graph tasks: SSD improves the average passing rate of GPT-4, GPT-4o, Gemini-Pro, and Claude-3-Sonnet by 8.38%, 6.78%, 29.28%, and 25.28%, respectively.
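To make the dataset format concrete, here is a minimal sketch of what a GraphEval36K-style problem could look like: a graph coding task the model must solve, graded by the fraction of test cases its code passes. This example is our illustration only; the function name, graph encoding, and test values are not taken from the dataset.

```python
# Hypothetical GraphEval36K-style problem (illustration only, not from the
# dataset): implement BFS shortest-path length, then grade the solution by
# its passing rate over input/output test cases.
from collections import deque

def shortest_path_length(n, edges, src, dst):
    """Number of edges on a shortest src->dst path in an undirected graph
    with nodes 0..n-1, or -1 if dst is unreachable."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [-1] * n          # -1 marks unvisited nodes
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return -1

# (args, expected) pairs in the spirit of the benchmark's test cases.
test_cases = [
    ((4, [(0, 1), (1, 2), (2, 3)], 0, 3), 3),   # simple path
    ((4, [(0, 1), (2, 3)], 0, 3), -1),          # disconnected graph
    ((1, [], 0, 0), 0),                         # trivial single node
]

passed = sum(shortest_path_length(*args) == want for args, want in test_cases)
print(f"passing rate: {passed}/{len(test_cases)}")
```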
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making big language models better at solving problems that involve graphs, like social networks or brain connections. Right now, these models are really good at understanding text but struggle with structured data. The researchers created a huge dataset of graph-based problems and tested ten different models on it. They found that some private models are much better than open-source ones, even though the gap is shrinking. They also looked at how well the models handle different types of graphs and found that some models are especially good at certain kinds of problems. To make the models even better, the authors came up with a new way to guide them, called Structured Symbolic Decomposition (SSD), which helps the models solve complex graph-based tasks more effectively.
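The summaries do not reproduce the SSD instructions themselves, but since SSD is described as an instruction-based method, a rough sketch of what a decomposition-style prompt could look like is below. The step wording is entirely our assumption, not the paper's actual prompt.

```python
# Rough, hypothetical sketch of an instruction-based decomposition prompt in
# the spirit of SSD (the step wording is ours; the paper defines the method).
SSD_STYLE_PROMPT = """\
Solve the following graph problem in explicit stages:
1. Restate the task symbolically: identify the nodes, the edges, and
   whether the graph is directed or undirected.
2. Name the graph concept involved (e.g., traversal, shortest path,
   connectivity) and choose a matching algorithm.
3. Write the algorithm as code, step by step.
4. Trace your code on a small example graph before giving the final answer.

Problem:
{problem_statement}
"""

def build_ssd_prompt(problem_statement: str) -> str:
    """Fill the decomposition template with a concrete problem statement."""
    return SSD_STYLE_PROMPT.format(problem_statement=problem_statement)

print(build_ssd_prompt("Count the connected components of an undirected graph."))
```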

Keywords

» Artificial intelligence  » Claude  » Gemini  » GPT