
Summary of Is My Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator, by Frederic Kirstein et al.


Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator

by Frederic Kirstein, Terry Ruas, Bela Gipp

First submitted to arXiv on: 27 Nov 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This research paper proposes MESA, an LLM-based framework for evaluating the quality of meeting summaries. The study highlights the limitations of established metrics such as ROUGE and BERTScore, which correlate poorly with human judgments and fail to capture nuanced errors. MESA works in three steps: it assesses each error type individually, refines its decisions through multi-agent discussion, and uses feedback-based self-training to sharpen its understanding of the error definitions and its alignment with human judgment. The framework achieves mid to high Point-Biserial correlation with human judgment in error detection and mid Spearman and Kendall correlation in reflecting the impact of errors on summary quality. Because MESA adapts to custom error guidelines, it is suitable for a range of tasks with limited human-labeled data.
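To make the three-step procedure more concrete, here is a minimal Python sketch of a MESA-style evaluation loop. It is not the authors' implementation: the call_llm stub, the error taxonomy, the prompt wording, and the helper names are all hypothetical placeholders standing in for real model calls and the paper's actual error definitions.

```python
import json

# Hypothetical error taxonomy; the paper defines its own set of error types.
ERROR_TYPES = ["omission", "hallucination", "repetition", "incoherence"]


def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (an assumption, not part of the paper)."""
    return json.dumps({"present": False, "rationale": "stub response"})


def assess_error(summary: str, transcript: str, error_type: str, guideline: str) -> dict:
    """Step 1: judge a single error type in isolation."""
    prompt = (
        f"Error type: {error_type}\nGuideline: {guideline}\n"
        f"Transcript:\n{transcript}\n\nSummary:\n{summary}\n\n"
        "Is this error present? Answer as JSON with keys 'present' and 'rationale'."
    )
    return json.loads(call_llm(prompt))


def discuss(initial: dict, error_type: str, n_agents: int = 3) -> dict:
    """Step 2: let several agents critique the initial verdict and refine it."""
    decision = initial
    for agent_id in range(n_agents):
        prompt = (
            f"You are reviewer {agent_id}. Current verdict for '{error_type}': "
            f"{json.dumps(decision)}. Agree or revise, returning the same JSON format."
        )
        decision = json.loads(call_llm(prompt))
    return decision


def refine_guideline(guideline: str, human_feedback: str) -> str:
    """Step 3 (feedback-based self-training): fold human feedback back into the
    error-type guideline so later judgments align better with human ratings."""
    prompt = (
        f"Current guideline: {guideline}\nHuman feedback: {human_feedback}\n"
        "Rewrite the guideline so future judgments match the feedback."
    )
    return call_llm(prompt)


def evaluate(summary: str, transcript: str, guidelines: dict) -> dict:
    """Run steps 1 and 2 for every error type and collect the verdicts."""
    results = {}
    for error_type in ERROR_TYPES:
        first_pass = assess_error(summary, transcript, error_type, guidelines[error_type])
        results[error_type] = discuss(first_pass, error_type)
    return results


if __name__ == "__main__":
    guidelines = {e: f"Flag the summary if it shows {e}." for e in ERROR_TYPES}
    print(evaluate("A short meeting summary.", "A meeting transcript.", guidelines))
```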
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper is about building a better way to judge how well computers can summarize meetings. Right now, the ways we measure this, like ROUGE and BERTScore, don’t always agree with what humans think is good or bad. The researchers created a new system called MESA that uses big language models to understand what makes a meeting summary good or bad. MESA has three parts: it checks for individual errors, lets several agents discuss whether each error really matters, and trains itself using feedback from humans. This new way of evaluating meeting summaries is better than previous methods because its judgments come closer to what humans think is good.
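The agreement statistics reported in the medium summary (Point-Biserial, Spearman, Kendall) can in principle be computed with standard statistics libraries. The snippet below is a toy illustration assuming SciPy is available; every number in it is invented, not data from the paper. It only shows which statistic pairs with which kind of label: binary error flags versus ordinal quality ratings.

```python
from scipy.stats import pointbiserialr, spearmanr, kendalltau

# Binary human labels: did annotators flag an error in each summary? (toy values)
human_error_flags = [1, 0, 1, 1, 0, 0, 1, 0]
# Evaluator's confidence that the same error is present (toy values).
evaluator_scores = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3]

# Point-Biserial: binary human verdicts vs. continuous evaluator scores.
r_pb, _ = pointbiserialr(human_error_flags, evaluator_scores)

# Ordinal human quality ratings vs. evaluator quality scores (toy values).
human_quality = [4, 2, 3, 5, 1, 2, 4, 3]
evaluator_quality = [3.8, 2.5, 3.1, 4.6, 1.4, 2.2, 3.9, 2.8]

rho, _ = spearmanr(human_quality, evaluator_quality)
tau, _ = kendalltau(human_quality, evaluator_quality)

print(f"Point-Biserial r = {r_pb:.2f}, Spearman rho = {rho:.2f}, Kendall tau = {tau:.2f}")
```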

Keywords

» Artificial intelligence  » Alignment  » ROUGE  » Self-training