Summary of Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions, by Bhuvanashree Murugadoss et al.
Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions
by Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar
First submitted to arXiv on: 16 Aug 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper’s arXiv page. |
Medium | GrooveSquid.com (original content) | A recent method called LLMs-as-a-judge replaces human evaluations with automatic assessments by large language models (LLMs) such as GPT-4 and Llama 3. The approach is expected to align well with human preferences when the model is prompted for a quality judgement on a specific criterion, such as text coherence. However, it is unclear whether LLMs-as-a-judge evaluations actually follow the given instructions or are instead shaped by preferences absorbed from the models’ fine-tuning data. To investigate this, the authors analyzed prompts with varying levels of instruction detail and compared them to a prompt-free method that uses model perplexity as a quality measure. They also aggregated a taxonomy of quality criteria commonly used across state-of-the-art evaluations and provide it as a rigorous benchmark for LLMs-as-judges. The results show that highly detailed prompts offer limited benefits and that perplexity can sometimes align better with human judgements than prompting, especially on textual quality. |
Low | GrooveSquid.com (original content) | LLMs-as-a-judge is a new way to evaluate tasks by using big language models like GPT-4 and Llama 3 instead of humans. This seems helpful because these models are good at understanding what people want. But it’s not clear whether the evaluations they produce are based only on the instructions or whether the models’ own preferences come into play. The researchers wanted to find out how much difference prompts with varying levels of instruction make in these evaluations. They also compared them to a method that doesn’t use prompts at all and instead uses model perplexity as a way to measure quality. The findings show that giving the models very detailed instructions doesn’t help much, and that sometimes the models’ own sense of what’s good or bad matches human judgments better than following instructions. |
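
The prompt-free baseline mentioned in the summaries scores a text purely by the perplexity a language model assigns to it, with no judging instructions at all. Below is a minimal illustrative sketch of that idea (not the paper’s exact setup; the model name and example sentences are placeholders), assuming a HuggingFace causal language model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the paper's judge/scoring models may differ.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Score a candidate text: lower perplexity = more fluent under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels yields the mean cross-entropy loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return float(torch.exp(loss))

# Rank two candidate texts without any judging prompt.
print(perplexity("The report is clear and well structured."))
print(perplexity("Report clear the structured well is and."))
```

Lower perplexity means the model finds the text more fluent, which, per the paper’s findings, can sometimes track human judgements of textual quality better than prompting a judge model with detailed instructions.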
Keywords
» Artificial intelligence » Alignment » Perplexity » Prompt » Prompting