Summary of A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task, by Yuya Fujisaki et al.
A Dataset for Evaluating LLM-based Evaluation Functions for Research Question Extraction Task
by Yuya Fujisaki, Shiro Takagi, Hideki Asoh, Wataru Kumagai
First submitted to arXiv on: 10 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract, available on its arXiv page |
Medium | GrooveSquid.com (original content) | This paper investigates the task of accurately extracting and summarizing research questions (RQs) from highly specialized documents such as research papers. The authors build a new dataset consisting of machine learning papers, RQs extracted from them with GPT-4, and human evaluations of those RQs from multiple perspectives. Comparing recently proposed LLM-based evaluation functions for summarization on this data, they find that none correlates sufficiently well with the human evaluations, and they argue for developing evaluation functions better tailored to the RQ extraction task (see the sketch below the table for the kind of correlation check involved). By releasing the dataset, the authors aim to support progress on this task. |
Low | GrooveSquid.com (original content) | This paper is about helping computers understand research papers. It's hard for machines to figure out what questions scientists are asking in these papers. The researchers built a dataset with lots of machine learning papers and used a computer program (GPT-4) to find the research questions inside them. Then humans looked at those questions and judged how good they were. The paper shows that existing automatic ways of checking this work don't match the human judgements very well, so it calls for new ways to check whether computers are doing a good job understanding research papers. |
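The comparison described in the medium summary, checking how well an automatic evaluation function tracks human judgements, typically comes down to a correlation computation. The sketch below is not from the paper; the record fields, score values, and score scales are assumptions made purely for illustration.

```python
# Hypothetical sketch: correlating an LLM-based evaluation score with human
# ratings of extracted research questions. The schema and values below are
# illustrative assumptions, not the paper's actual dataset format.
from scipy.stats import spearmanr, pearsonr

# Assumed records: each extracted RQ has a human rating and an automatic score.
records = [
    {"paper_id": "p1", "human_score": 4, "llm_eval_score": 0.82},
    {"paper_id": "p2", "human_score": 2, "llm_eval_score": 0.64},
    {"paper_id": "p3", "human_score": 5, "llm_eval_score": 0.91},
    {"paper_id": "p4", "human_score": 3, "llm_eval_score": 0.58},
]

human = [r["human_score"] for r in records]
auto = [r["llm_eval_score"] for r in records]

# Rank and linear correlation between the evaluation function and humans;
# low values indicate the function does not track human judgement well.
rho, _ = spearmanr(human, auto)
r, _ = pearsonr(human, auto)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```

In this kind of setup, a low Spearman or Pearson correlation across the dataset is what the paper's finding amounts to: the automatic evaluation functions do not rank or score the extracted RQs the way humans do.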
Keywords
» Artificial intelligence » GPT » Machine learning