
Summary of Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge, by Charles Koutcheme et al.


Open Source Language Models Can Provide Feedback: Evaluating LLMs’ Ability to Help Students Using GPT-4-As-A-Judge

by Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Paul Denny

First submitted to arXiv on: 8 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (GrooveSquid.com, original content)
Large language models (LLMs) have shown potential for generating automatic feedback in various computing contexts. However, concerns about privacy and ethics have sparked interest in open-source LLMs in education, but the quality of their generated feedback remains understudied. This is a concern as flawed or misleading feedback could negatively impact student learning. Inspired by recent work using powerful LLMs to evaluate less powerful models, we conduct an automated analysis of several open-source model feedback on a dataset from an introductory programming course. We investigate GPT-4’s viability as an automated evaluator and find it demonstrates bias toward positively rating feedback while showing moderate agreement with human raters, highlighting its potential as a feedback evaluator. Additionally, we explore the quality of feedback generated by leading open-source LLMs using GPT-4 evaluation, finding some models offer competitive performance with proprietary LLMs like ChatGPT, indicating opportunities for responsible use in educational settings.
Low Difficulty Summary (GrooveSquid.com, original content)
Large language models can generate feedback for students automatically, but people worry about keeping student work private and about other ethical issues. This makes open-source models interesting for education, yet we don’t know how well they perform. If the feedback is wrong, it could hurt students’ learning. The authors looked at how well GPT-4 evaluates feedback from different models: it is biased toward rating feedback positively, but it agrees with human raters a moderate amount of the time. Some open-source models work about as well as famous proprietary ones like ChatGPT.
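The GPT-4-as-a-judge setup described above boils down to comparing one set of ratings (GPT-4's) against another (human raters'), then measuring agreement and positivity bias. A minimal sketch of how such a comparison could be quantified, using made-up binary "is this feedback correct?" judgments rather than the paper's actual data:

```python
# Hypothetical sketch: compare a judge model's ratings against human
# ratings, measuring agreement (Cohen's kappa) and positivity bias
# (difference in mean rating). All ratings below are illustrative only.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n       # raw agreement
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (observed - expected) / (1 - expected)

# Binary judgments: 1 = feedback is correct, 0 = not correct
human = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]  # rates more items positively

kappa = cohens_kappa(human, judge)
bias = sum(judge) / len(judge) - sum(human) / len(human)
print(f"kappa = {kappa:.2f}, positivity bias = {bias:+.2f}")
# → kappa = 0.55, positivity bias = +0.20
```

A kappa around 0.4-0.6 is commonly read as "moderate" agreement, and a positive bias value means the judge rates feedback favorably more often than humans do, which mirrors the pattern the paper reports for GPT-4.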

Keywords

» Artificial intelligence » GPT