Few-Shot Recalibration of Language Models
by Xiang Lisa Li, Urvashi Khandelwal, Kelvin Guu
First submitted to arXiv on: 27 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | A new framework is proposed for few-shot, slice-specific recalibration of language models’ confidence estimates. The approach trains a model that takes a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be accurate for that slice. This makes it possible to identify slice-specific confidence thresholds above which predictions can be trusted, and below which the model should abstain. The proposed method consistently outperforms existing calibration methods, for example reducing calibration error by 16% for PaLM2-Large on MMLU. |
| Low | GrooveSquid.com (original content) | A language model’s confidence score is meant to reflect how likely it is to be correct. However, a model that looks well calibrated on average can hide significant miscalibration within narrower slices of data. This paper presents a way to get well-calibrated confidence estimates for any slice of a distribution. It trains a special model that takes a few examples from the slice and adjusts the confidence scores so they are more accurate. The method works without needing labeled data from the slice, and it even works on new slices it hasn’t seen before. The results show that this approach beats current methods, reducing calibration error by 16%. |
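The core idea in the summaries above can be illustrated with a toy sketch. To be clear, this is not the paper’s method: the paper trains a model that predicts a recalibration curve from a few *unlabeled* slice examples, whereas this simplified stand-in fits a binned curve from a handful of *labeled* ones and derives an abstention threshold from it. All function names and numbers here are hypothetical.

```python
# Toy illustration of slice-specific confidence recalibration.
# NOTE: not the paper's method -- the paper predicts the recalibration
# curve from a few UNLABELED slice examples; this sketch instead fits a
# binned curve from a handful of LABELED ones.

def fit_recalibration_curve(confidences, correct, n_bins=5):
    """Map each raw-confidence bin to its observed accuracy on the slice."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append(y)
    # Empty bins fall back to the bin midpoint (i.e. "already calibrated").
    return [sum(b) / len(b) if b else (i + 0.5) / n_bins
            for i, b in enumerate(bins)]

def recalibrate(confidence, curve):
    """Remap a raw confidence score through the slice-specific curve."""
    n_bins = len(curve)
    return curve[min(int(confidence * n_bins), n_bins - 1)]

def abstention_threshold(curve, target_accuracy):
    """Lowest raw confidence whose remapped accuracy meets the target;
    below this threshold the model should abstain."""
    for i, acc in enumerate(curve):
        if acc >= target_accuracy:
            return i / len(curve)
    return 1.0  # no bin is accurate enough: always abstain

# A few labeled examples from one hypothetical slice:
raw = [0.95, 0.9, 0.85, 0.6, 0.55, 0.3, 0.2, 0.92]
hits = [1, 1, 0, 1, 0, 0, 0, 1]
curve = fit_recalibration_curve(raw, hits)
print(recalibrate(0.92, curve))          # 0.75
print(abstention_threshold(curve, 0.7))  # 0.6
```

In this toy version, a raw confidence of 0.92 gets remapped down to the slice’s observed accuracy of 0.75, and predictions below the 0.6 threshold would be abstained on; the paper’s contribution is learning such slice-specific curves without any labels from the slice.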
Keywords
- Artificial intelligence
- Few-shot
- Language model