Summary of How Do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads Are Two Towers for Metric Learning, by Zeping Yu et al.
How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning
by Zeping Yu, Sophia Ananiadou
First submitted to arXiv on: 5 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper explores the mechanism of in-context learning (ICL) on sentence classification tasks with semantically unrelated labels, focusing on the effect of intervening in only 1% of the model’s attention heads. The study finds that this intervention significantly reduces ICL accuracy, from 87.6% to 24.4%. To understand this phenomenon, the paper analyzes the value-output vectors in these heads and finds that they contain substantial information about the corresponding labels. The paper also observes that the prediction shifts from “foo” to “bar” when attention scores at the label positions change. Based on these findings, the authors propose a hypothesis for ICL: value-output matrices extract label features, while query-key matrices compute the similarity between the last position’s features and the demonstration features at each label position (see the sketch after this table). This hypothesis explains the majority label bias and recency bias in ICL, and the authors propose two methods that reduce these biases by 22% and 17%, respectively. |
| Low | GrooveSquid.com (original content) | This paper looks at how large language models learn from the examples given in their prompt, using sentence classification tasks where the labels are made-up words. The study finds that changing just a tiny fraction (about 1%) of the model’s attention heads makes this kind of learning much worse. To understand why, the authors analyzed special vectors inside these heads and found that they hold important information about the labels. They also discovered that the model’s prediction flips when it starts paying more attention to one label than another. The authors came up with a theory for how this works: the model extracts features from the examples and compares them with the current input to decide what to predict. This helps explain why the model often favors certain labels over others, and the authors propose ways to reduce these biases by 22% and 17%. |
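The two-tower hypothesis summarized above lends itself to a small illustration. The sketch below is not the authors’ code; it is a minimal NumPy example under assumed shapes (a head dimension `d` and a handful of demonstration label positions) of how query and key matrices could score the similarity between the last position and each label position, while value-output matrices carry the label information that those scores then mix into the head’s output.

```python
# Minimal sketch of the "two towers" view of an in-context head.
# All names and shapes are illustrative assumptions, not the paper's code:
# d is the head dimension, n_labels the number of demonstration label
# positions (the positions of the "foo"/"bar" label tokens in the prompt).
import numpy as np

rng = np.random.default_rng(0)
d, n_labels = 64, 4

h_labels = rng.normal(size=(n_labels, d))  # hidden states at the label positions
h_last = rng.normal(size=(d,))             # hidden state at the final position

# Tower 1: query/key matrices map the last position and the label positions
# into a shared space; a dot product there acts as a similarity (metric) score.
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
q = W_q @ h_last                 # query for the last position
k = h_labels @ W_k.T             # keys at each label position
scores = k @ q / np.sqrt(d)      # similarity per demonstration label position
attn = np.exp(scores - scores.max())
attn /= attn.sum()               # attention distribution over label positions

# Tower 2: value/output matrices extract label information at each label
# position; the head's output is the attention-weighted mix of these features.
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
W_o = rng.normal(size=(d, d)) / np.sqrt(d)
label_features = (h_labels @ W_v.T) @ W_o.T  # value-output features per label
head_output = attn @ label_features          # leans toward the most similar demo

print("attention over label positions:", np.round(attn, 3))
```

In this picture, lowering the attention score at a “foo” position shifts the head output toward the “bar” features, which is the kind of prediction shift the medium summary describes; the majority label bias and recency bias would then show up as systematically larger attention mass on over-represented or more recent label positions.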
Keywords
* Artificial intelligence * Attention * Classification