Summary of "Are Large Language Models Good Essay Graders?" by Anindita Kundu and Denilson Barbosa
Are Large Language Models Good Essay Graders?
by Anindita Kundu, Denilson Barbosa
First submitted to arXiv on: 19 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The abstract discusses the limitations of Large Language Models (LLMs) in assessing essay quality. It compares the scores assigned by two LLMs, ChatGPT and Llama, with human ratings on the Automated Essay Scoring (AES) task using the ASAP dataset. The results show that both models generally assign lower scores than humans, with no strong correlation between their scores and the human ones (a minimal sketch of such a comparison follows the table). The study also examines several essay features commonly used in AES methods and finds little to no correlation between those features and either human or LLM scores. While LLMs are not yet a replacement for human grading, they may have potential as an aid to humans evaluating written essays. |
| Low | GrooveSquid.com (original content) | The abstract looks at how well Large Language Models (LLMs) can judge the quality of essays. It compares what these models say with what humans think about the same essays. The results show that both LLMs give lower scores than humans and don't really match human ratings. The study also looks at things like essay length, word connections, and grammar mistakes, but finds no strong connection between those features and either human or LLM scores. Overall, while LLMs aren't yet good enough to replace human grading, they might still help humans evaluate essays. |
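For readers curious what this kind of rater comparison looks like in practice, here is a minimal Python sketch. It is not the authors' code: the score and length values are made-up placeholders, and the choice of Pearson/Spearman correlation plus Quadratic Weighted Kappa (a standard agreement metric in AES work) is an assumption about how such agreement is typically measured, not a claim about the paper's exact setup.

```python
# Illustrative sketch only: all values below are made-up placeholders,
# not data or results from the paper.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [8, 6, 9, 7, 5, 10, 6, 8]   # placeholder human ratings
llm_scores   = [6, 5, 7, 6, 4,  8, 5, 6]   # placeholder LLM ratings (often lower)

# Agreement between the two raters: linear and rank correlation.
r, _   = pearsonr(human_scores, llm_scores)
rho, _ = spearmanr(human_scores, llm_scores)

# Quadratic Weighted Kappa, commonly reported for AES
# (whether the paper reports QWK is an assumption here).
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, QWK = {qwk:.3f}")

# The paper also checks whether surface features track scores; here,
# essay length in words (placeholder values) against the human ratings.
essay_lengths = [320, 250, 400, 310, 180, 450, 240, 330]
len_r, _ = pearsonr(essay_lengths, human_scores)
print(f"Length vs. human score: Pearson r = {len_r:.3f}")
```

A low correlation and low QWK between the two score lists would correspond to the paper's finding that LLM scores do not align well with human ratings; the feature check mirrors its analysis of features like essay length.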
Keywords
» Artificial intelligence » Llama