Summary of What’s Under the Hood: Investigating Automatic Metrics on Meeting Summarization, by Frederic Kirstein et al.
What’s under the hood: Investigating Automatic Metrics on Meeting Summarization
by Frederic Kirstein, Jan Philip Wahle, Terry Ruas, Bela Gipp
First submitted to arXiv on: 17 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper investigates the evaluation of meeting summarization, a task that has grown in importance with the rise of online interactions. It examines how automatic metrics correlate with human evaluations across a broad error taxonomy (a sketch of such a correlation analysis appears after the table) and finds that the metrics currently used by default struggle to capture observable errors, showing only weak to moderate correlations. The study uses annotated transcripts and summaries produced by Transformer-based sequence-to-sequence and autoregressive models on the QMSum meeting summarization dataset. It finds that different model architectures respond differently to the challenges posed by meeting transcripts, leading to differently pronounced links between challenges and errors. The results show that only a subset of metrics reacts accurately to specific errors, while most metrics are either unresponsive or fail to reflect an error’s impact on summary quality. |
Low | GrooveSquid.com (original content) | Meeting summarization is an important task because it helps people quickly understand what happened during a meeting. Right now, though, the automatic metrics used to test meeting summarizers don’t do their job very well. This paper studies how those metrics compare to human evaluations of summaries. It finds that most of them match human judgments only weakly and can even hide errors in the summaries they’re scoring. The study uses real transcripts and summaries from a dataset called QMSum to see how different model architectures handle the challenges of meeting conversations. |
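For readers curious about what "correlating automatic metrics with human evaluations" looks like in practice, the snippet below is a minimal, hypothetical sketch of a rank-correlation check between a metric's scores and human error annotations. The metric values, error counts, and the choice of Spearman correlation are illustrative assumptions for this summary, not details taken from the paper.

```python
# Illustrative sketch (not from the paper): does an automatic summary metric
# track human error judgments? All numbers below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical automatic metric scores for five generated meeting summaries
# (e.g., a ROUGE-style score); higher = better according to the metric.
metric_scores = [0.42, 0.35, 0.51, 0.28, 0.47]

# Hypothetical human annotations: number of observed errors per summary
# under some error taxonomy; higher = worse summary.
human_error_counts = [3, 5, 1, 6, 2]

# If the metric captured these errors, its scores should correlate
# negatively with the error counts (more errors -> lower score).
rho, p_value = spearmanr(metric_scores, human_error_counts)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A strongly negative rho would suggest the metric is sensitive to the annotated errors; a value near zero would indicate the kind of unresponsiveness the paper reports for many default metrics.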
Keywords
» Artificial intelligence » Autoregressive » Summarization » Transformer