Summary of How Do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads Are Two Towers for Metric Learning, by Zeping Yu et al.
How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning
by Zeping Yu, Sophia Ananiadou
First submitted to arXiv on: 5 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper explores the mechanism of in-context learning (ICL) on sentence classification tasks with semantically unrelated labels, focusing on the effect of intervening in only 1% of the model’s attention heads. The study finds that this intervention significantly reduces ICL accuracy, from 87.6% to 24.4%. To understand this phenomenon, the paper analyzes the value-output vectors in these heads and finds that they contain substantial information about the corresponding labels. The paper also observes that the prediction shifts from “foo” to “bar” when attention scores at the label positions change. Based on these findings, the authors propose a hypothesis for ICL: value-output matrices extract label features, while query-key matrices compute the similarity between the last position’s features and the demonstration features at each label position (see the sketch after this table). This hypothesis explains the majority label bias and recency bias in ICL, and the authors propose two methods that reduce these biases by 22% and 17%, respectively. |
| Low | GrooveSquid.com (original content) | This paper looks at how large language models learn from the examples given in their prompt, using sentence classification tasks where the labels are made-up words. The study finds that changing just a tiny fraction (about 1%) of the model’s attention heads makes this kind of learning much worse. To understand why, the authors analyzed special vectors inside these heads and found that they hold important information about the labels. They also discovered that the model’s prediction flips when it starts paying more attention to one label than another. The authors came up with a theory for how this works: the model extracts features from the examples and compares them with the current input to decide what to predict. This helps explain why the model often favors certain labels over others, and the authors propose ways to reduce these biases by 22% and 17%. |
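The two-tower hypothesis summarized above lends itself to a small illustration. The sketch below is not the authors’ code; it is a minimal NumPy example under assumed shapes (a head dimension `d` and a handful of demonstration label positions) of how query and key matrices could score the similarity between the last position and each label position, while value-output matrices carry the label information that those scores then mix into the head’s output.

```python
# Minimal sketch of the "two towers" view of an in-context head.
# All names and shapes are illustrative assumptions, not the paper's code:
# d is the head dimension, n_labels the number of demonstration label
# positions (the positions of the "foo"/"bar" label tokens in the prompt).
import numpy as np

rng = np.random.default_rng(0)
d, n_labels = 64, 4

h_labels = rng.normal(size=(n_labels, d))  # hidden states at the label positions
h_last = rng.normal(size=(d,))             # hidden state at the final position

# Tower 1: query/key matrices map the last position and the label positions
# into a shared space; a dot product there acts as a similarity (metric) score.
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_k = rng.normal(size=(d, d)) / np.sqrt(d)
q = W_q @ h_last                 # query for the last position
k = h_labels @ W_k.T             # keys at each label position
scores = k @ q / np.sqrt(d)      # similarity per demonstration label position
attn = np.exp(scores - scores.max())
attn /= attn.sum()               # attention distribution over label positions

# Tower 2: value/output matrices extract label information at each label
# position; the head's output is the attention-weighted mix of these features.
W_v = rng.normal(size=(d, d)) / np.sqrt(d)
W_o = rng.normal(size=(d, d)) / np.sqrt(d)
label_features = (h_labels @ W_v.T) @ W_o.T  # value-output features per label
head_output = attn @ label_features          # leans toward the most similar demo

print("attention over label positions:", np.round(attn, 3))
```

In this picture, lowering the attention score at a “foo” position shifts the head output toward the “bar” features, which is the kind of prediction shift the medium summary describes; the majority label bias and recency bias would then show up as systematically larger attention mass on over-represented or more recent label positions.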
Keywords
* Artificial intelligence * Attention * Classification