Few-Shot Recalibration of Language Models

by Xiang Lisa Li, Urvashi Khandelwal, Kelvin Guu

First submitted to arXiv on: 27 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
A new framework is proposed for few-shot, slice-specific recalibration of language models’ confidence estimates. The approach trains a recalibration model that takes a few unlabeled examples from any given slice of a broader distribution and predicts a curve that remaps confidence scores to be accurate for that slice. This curve makes it possible to identify slice-specific confidence thresholds above which predictions can be trusted, and below which the model should abstain. The method consistently outperforms existing calibration baselines, reducing calibration error by 16% for PaLM2-Large on MMLU.
Low Difficulty Summary (written by GrooveSquid.com; original content)
A language model’s confidence score is meant to reflect how likely its answer is to be correct. A model can look well calibrated on a broad mix of data, yet still be badly miscalibrated on narrower slices of that data. This paper presents a way to get well-calibrated confidence estimates for any slice of a distribution. It trains a special model that takes a few examples from the slice and adjusts the confidence scores so they are more accurate for it. The method needs no labeled data from the slice, and it even works on new slices it has never seen before. The results show that this approach beats current methods, reducing calibration error by 16%.

Keywords

  • Artificial intelligence
  • Few shot
  • Language model