Summary of "100 Instances Is All You Need: Predicting the Success of a New LLM on Unseen Data by Testing on a Few Instances", by Lorenzo Pacchiardi et al.
100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances
by Lorenzo Pacchiardi, Lucy G. Cheke, José Hernández-Orallo
First submitted to arXiv on: 5 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here.
Medium | GrooveSquid.com (original content) | The proposed method leverages evaluation results from previously tested large language models (LLMs) to cut the number of evaluations needed to predict a new LLM's performance on individual task instances. A new LLM is tested on a small reference set of instances, and a generic assessor, trained on the earlier LLMs' results, predicts the new LLM's success on any other instance from that instance's features together with the LLM's results on the reference set. The authors evaluate the method on HELM-Lite and on KindsOfReasoning, a collection of existing reasoning datasets, using instruction-fine-tuned OpenAI models up to GPT-4. When the target instances come from the same distribution as those used to train the generic assessor, its predictions match LLM-specific assessors trained on the full instance sets; out of distribution, all approaches perform worse and none emerges as a clear winner. (A minimal code sketch of this setup follows the table.)
Low | GrooveSquid.com (original content) | The researchers developed a way to predict how well large language models (LLMs) will do on specific tasks. They use results from LLMs tested in the past to train an "assessor" that makes predictions from the features of each task instance. A new LLM then only needs to be tested on a small set of examples, which makes evaluation faster and cheaper. On existing reasoning datasets, the method worked well when the tasks to predict were similar to those used to train the assessor, but results were weaker on very different kinds of tasks.
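
To make the idea in the summaries concrete, here is a minimal, hypothetical sketch of the generic-assessor setup in Python. It is not the authors' implementation: the random vectors standing in for instance features, the reference-set size of 100, the binary success labels, and the choice of a scikit-learn logistic regression are all illustrative assumptions rather than details taken from the paper.

```python
# Sketch of a "generic assessor": predict a new LLM's per-instance success from
# (a) features of each instance and (b) the new LLM's results on a small reference set.
# All data below is synthetic; shapes, feature choices, and the model are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train_llms, n_instances, n_ref, feat_dim = 5, 1000, 100, 32

# Instance features (stand-ins for e.g. embeddings of each benchmark instance).
instance_feats = rng.normal(size=(n_instances, feat_dim))
# Binary success/failure of each previously evaluated LLM on every instance.
past_results = rng.integers(0, 2, size=(n_train_llms, n_instances))

# Small reference set on which every new LLM will actually be evaluated.
ref_idx = rng.choice(n_instances, size=n_ref, replace=False)

def make_features(results_row: np.ndarray) -> np.ndarray:
    """Concatenate each instance's features with the LLM's success pattern on the reference set."""
    ref_pattern = np.tile(results_row[ref_idx], (n_instances, 1))
    return np.hstack([instance_feats, ref_pattern])

# Train one assessor shared across all previously tested LLMs, pooled together.
X = np.vstack([make_features(row) for row in past_results])
y = past_results.ravel()
assessor = LogisticRegression(max_iter=1000).fit(X, y)

# For a new LLM: evaluate it only on the reference instances (placeholder results here),
# then predict its success probability on every other instance.
new_row = np.zeros(n_instances, dtype=int)
new_row[ref_idx] = rng.integers(0, 2, size=n_ref)
pred_success = assessor.predict_proba(make_features(new_row))[:, 1]
print(pred_success[:5])  # predicted per-instance success probabilities
```

The point the sketch tries to capture is that the assessor is generic, i.e. shared across LLMs: each training example pairs an instance's features with some LLM's success pattern on the reference set, so a new LLM only needs to be evaluated on those roughly 100 reference instances before its success can be predicted on everything else.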