
Summary of Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models, by Nisarg Patel et al.


Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

by Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral

First submitted to arxiv on: 24 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.

Medium Difficulty Summary (GrooveSquid.com original content)
As Large Language Models (LLMs) demonstrate remarkable performance on natural language understanding tasks, it is essential to measure their ability to perform human-like multi-step logical reasoning. Existing evaluation benchmarks focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Additionally, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap, since this kind of reasoning aligns more closely with human reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset covering multi-step logical reasoning with various inference rules and depths. The dataset spans three logic types (propositional, first-order, and non-monotonic), comprising more than 30 inference rules and their combinations at varying depths. We evaluate a range of LLMs, including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, using zero-shot chain-of-thought prompting. Experimental results show a significant drop in performance as the number of reasoning steps (depth) increases, from an average accuracy of ~68% at depth-1 to ~43% at depth-5. We believe Multi-LogiEval will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs. (A minimal code sketch of this kind of depth-wise, zero-shot chain-of-thought evaluation appears after the summaries below.)
Low Difficulty Summary (GrooveSquid.com original content)
This paper is about testing how well Large Language Models (LLMs) can reason like humans. Right now, these models are great at understanding language, but they struggle with more complex thinking tasks that involve multiple steps and different types of logic. The researchers created a new dataset called Multi-LogiEval to test LLMs' ability to reason logically. They tested several models using this dataset and found that the models get worse at drawing logical conclusions as the number of reasoning steps grows. This research can help improve the way these language models reason.
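
Evaluation sketch (illustrative). The summaries above describe the paper's setup: zero-shot chain-of-thought prompting of LLMs on multi-step logical reasoning questions, with accuracy reported per reasoning depth. The minimal Python sketch below shows how such a depth-wise evaluation loop might be wired up. The example item, the prompt template, the extract_answer heuristic, and the query_model callable are all illustrative assumptions, not the paper's actual data, prompts, or harness.

```python
from collections import defaultdict

# Hypothetical items in the style of Multi-LogiEval: each has a context, a yes/no
# question requiring chained inference, a gold answer, and a reasoning depth.
EXAMPLES = [
    {
        "context": ("If it rains, the ground gets wet. "
                    "If the ground gets wet, the match is cancelled. It rains."),
        "question": "Is the match cancelled?",
        "answer": "yes",
        "depth": 2,  # two chained applications of modus ponens
    },
    # ... more items across depths 1-5 and the three logic types ...
]

# Illustrative zero-shot chain-of-thought prompt template (not the paper's exact wording).
ZERO_SHOT_COT = (
    "Answer the question based on the context. "
    "Let's think step by step, then give a final yes/no answer.\n\n"
    "Context: {context}\nQuestion: {question}\n"
)

def extract_answer(completion: str) -> str:
    """Crude heuristic: take the last 'yes' or 'no' mentioned in the model's reasoning."""
    tokens = completion.lower().replace(".", " ").split()
    for tok in reversed(tokens):
        if tok in ("yes", "no"):
            return tok
    return "unknown"

def evaluate(query_model, examples):
    """query_model: any callable str -> str wrapping an LLM (placeholder here).
    Returns accuracy grouped by reasoning depth."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        prompt = ZERO_SHOT_COT.format(context=ex["context"], question=ex["question"])
        pred = extract_answer(query_model(prompt))
        total[ex["depth"]] += 1
        correct[ex["depth"]] += int(pred == ex["answer"])
    return {depth: correct[depth] / total[depth] for depth in sorted(total)}

if __name__ == "__main__":
    # Stub model that always answers "yes", just to make the script runnable end to end.
    accuracy_by_depth = evaluate(lambda prompt: "The answer is yes.", EXAMPLES)
    print(accuracy_by_depth)
```

Keeping the model behind a plain callable makes it easy to swap in any of the evaluated LLMs (GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, Mistral) via whatever client library is available, and to compare the resulting accuracy-by-depth curves against the ~68% (depth-1) to ~43% (depth-5) trend reported in the paper.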

Keywords

» Artificial intelligence  » Gemini  » Gpt  » Inference  » Language understanding  » Zero shot