Summary of "Are Large Language Models Good Essay Graders?" by Anindita Kundu and Denilson Barbosa
Are Large Language Models Good Essay Graders?
by Anindita Kundu, Denilson Barbosa
First submitted to arXiv on: 19 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The abstract discusses the limitations of Large Language Models (LLMs) in assessing essay quality. It compares the scores assigned by two LLMs, ChatGPT and Llama, with human ratings on the Automated Essay Scoring (AES) task using the ASAP dataset. The results show that both models generally assign lower scores than humans, with no strong correlation between their scores and the human ones (a minimal sketch of such a comparison follows the table). The study also examines several essay features commonly used in AES methods and finds little to no correlation between those features and either human or LLM scores. While LLMs are not yet a replacement for human grading, they may have potential as an aid to humans evaluating written essays. |
| Low | GrooveSquid.com (original content) | The abstract looks at how well Large Language Models (LLMs) can judge the quality of essays. It compares what these models say with what humans think about the same essays. The results show that both LLMs give lower scores than humans and don't really match human ratings. The study also looks at things like essay length, word connections, and grammar mistakes, but finds no strong connection between those features and either human or LLM scores. Overall, while LLMs aren't yet good enough to replace human grading, they might still help humans evaluate essays. |
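For readers curious what this kind of rater comparison looks like in practice, here is a minimal Python sketch. It is not the authors' code: the score and length values are made-up placeholders, and the choice of Pearson/Spearman correlation plus Quadratic Weighted Kappa (a standard agreement metric in AES work) is an assumption about how such agreement is typically measured, not a claim about the paper's exact setup.

```python
# Illustrative sketch only: all values below are made-up placeholders,
# not data or results from the paper.
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

human_scores = [8, 6, 9, 7, 5, 10, 6, 8]   # placeholder human ratings
llm_scores   = [6, 5, 7, 6, 4,  8, 5, 6]   # placeholder LLM ratings (often lower)

# Agreement between the two raters: linear and rank correlation.
r, _   = pearsonr(human_scores, llm_scores)
rho, _ = spearmanr(human_scores, llm_scores)

# Quadratic Weighted Kappa, commonly reported for AES
# (whether the paper reports QWK is an assumption here).
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, QWK = {qwk:.3f}")

# The paper also checks whether surface features track scores; here,
# essay length in words (placeholder values) against the human ratings.
essay_lengths = [320, 250, 400, 310, 180, 450, 240, 330]
len_r, _ = pearsonr(essay_lengths, human_scores)
print(f"Length vs. human score: Pearson r = {len_r:.3f}")
```

A low correlation and low QWK between the two score lists would correspond to the paper's finding that LLM scores do not align well with human ratings; the feature check mirrors its analysis of features like essay length.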
Keywords
» Artificial intelligence » Llama