Are Large Language Models Good Essay Graders?

by Anindita Kundu, Denilson Barbosa

First submitted to arXiv on: 19 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper examines the limitations of Large Language Models (LLMs) in assessing essay quality. It compares scores assigned by LLMs, specifically ChatGPT and Llama, with human ratings on the Automated Essay Scoring (AES) task using the ASAP dataset. The results show that both models generally assign lower scores than humans, and that their scores correlate only weakly with human ratings. The study also explores various essay features commonly used in AES methods and finds little to no correlation between those features and either human or LLM scores. While LLMs cannot yet replace human grading, they may have potential as an aid to humans evaluating written essays.
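
To make the evaluation setup concrete, below is a minimal sketch of how one might score essays with an LLM and correlate the results with human ratings. The grade_essay helper, the model name, and the asap_sample.csv file (with essay and human_score columns) are all illustrative assumptions, not the authors' actual prompts or pipeline.

```python
# Illustrative sketch: score essays with an LLM and compare with human ratings.
# NOTE: grade_essay, the model name, and asap_sample.csv are assumptions for
# illustration; the paper's actual prompts and pipeline may differ.
import pandas as pd
from scipy.stats import pearsonr
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_essay(essay: str) -> int:
    """Ask the model for a single holistic score from 1 to 6."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model; the paper used ChatGPT and Llama
        messages=[
            {"role": "system",
             "content": "You are an essay grader. Reply with one integer from 1 to 6."},
            {"role": "user", "content": essay},
        ],
    )
    return int(response.choices[0].message.content.strip())

df = pd.read_csv("asap_sample.csv")  # hypothetical CSV: 'essay', 'human_score'
df["llm_score"] = df["essay"].apply(grade_essay)

# How close are the two raters on average, and do their scores move together?
print("mean human score:", df["human_score"].mean())
print("mean LLM score:  ", df["llm_score"].mean())
r, p = pearsonr(df["llm_score"], df["human_score"])
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```

A weak correlation in such a comparison would mirror the paper's finding that LLM scores do not track human ratings closely.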
Low Difficulty Summary (original content by GrooveSquid.com)
The abstract looks at how well Large Language Models (LLMs) can judge the quality of essays. It compares what these models say with what humans think about the same essays. The results show that both LLMs give lower scores than humans and don't really match up with human ratings. The study also looks at things like essay length, word connections, and grammar mistakes, but finds no strong connection between those features and either human or LLM scores; a simple way to probe such features is sketched below.
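
Essay features like those mentioned above can be probed the same way: compute each feature and correlate it with the scores. The sketch below uses crude proxies (word count, average word length, sentence count) rather than the exact feature set from the paper, and assumes the same hypothetical CSV as before.

```python
# Illustrative sketch: correlate simple essay features with human scores.
# The features here are crude proxies, not the exact features from the paper.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("asap_sample.csv")  # hypothetical CSV: 'essay', 'human_score'

# Surface features: how long the essay is, how long its words are,
# and roughly how many sentences it contains.
df["word_count"] = df["essay"].str.split().str.len()
df["avg_word_length"] = df["essay"].apply(
    lambda text: sum(len(w) for w in text.split()) / max(len(text.split()), 1)
)
df["sentence_count"] = df["essay"].str.count(r"[.!?]")

for feature in ["word_count", "avg_word_length", "sentence_count"]:
    r, p = pearsonr(df[feature], df["human_score"])
    print(f"{feature}: r = {r:.3f} (p = {p:.3g})")
```

Overall, while LLMs aren't yet good enough to replace human grading, they might still help humans evaluate essays.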

Keywords

  • Artificial intelligence
  • Llama