A Theoretical Analysis of Self-Supervised Learning for Vision Transformers
by Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang
First submitted to arXiv on: 4 Mar 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Optimization and Control (math.OC); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to read whichever version suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper’s arXiv page. |
Medium | GrooveSquid.com (original content) | The paper explores the differences between masked autoencoders (MAE) and contrastive learning (CL) in self-supervised computer vision, focusing on vision transformers (ViTs). Previous studies have shown empirically that MAE captures both global and local features, while CL tends to focus on global patterns. The authors explain these differences theoretically by modeling the visual data distribution as a combination of dominant global features and minuscule local features. Analyzing the gradient-descent training dynamics of one-layer softmax-based ViTs on the MAE and CL objectives, they show that MAE-trained ViTs learn to capture both types of features, while CL-trained ViTs favor global features even under mild imbalance (a toy sketch of this setup appears after the table). |
Low | GrooveSquid.com (original content) | This paper is about how computer vision systems learn when they are not taught to recognize specific things. It looks at two ways of doing this: one is called masked autoencoders (MAE) and the other is contrastive learning (CL). Researchers have found that these methods capture different kinds of information, and the authors want to understand why. They do this by studying how a special kind of computer program, called a vision transformer (ViT), learns when it is trained with MAE or CL. They found that ViTs trained with MAE are good at recognizing both big and small details, while ViTs trained with CL are better at recognizing just the big things. |
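
The medium summary above describes a concrete analytical setup: a one-layer softmax-attention ViT trained by plain gradient descent on an MAE objective and on a CL objective, over data combining a dominant global feature with a weak local feature. Below is a minimal PyTorch sketch of that setup, for intuition only: the toy data model, all dimensions, the InfoNCE-style contrastive loss, and every name in the code are illustrative assumptions, not the paper’s exact construction or proofs.

```python
# Minimal sketch (illustrative assumptions throughout, not the paper's exact
# construction): a one-layer softmax-attention encoder trained by gradient
# descent on (a) an MAE-style masked-reconstruction loss and (b) an
# InfoNCE-style contrastive loss, over toy data with one dominant global
# feature and one weak local feature.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
P, D = 8, 16  # patches per image, patch dimension (arbitrary toy sizes)

# Fixed feature directions, loosely mirroring the paper's global/local model.
G_FEAT = F.normalize(torch.randn(D), dim=0)  # dominant global feature
L_FEAT = F.normalize(torch.randn(D), dim=0)  # minuscule local feature

def toy_batch(n=64):
    """Every patch carries the global feature; patch 0 adds a weak local signal."""
    x = G_FEAT.expand(n, P, D).clone() + 0.1 * torch.randn(n, P, D)
    x[:, 0, :] += 0.3 * L_FEAT
    return x

class OneLayerViT(torch.nn.Module):
    """A single softmax self-attention layer, as in the simplified model analyzed."""
    def __init__(self):
        super().__init__()
        self.q = torch.nn.Linear(D, D, bias=False)
        self.k = torch.nn.Linear(D, D, bias=False)
        self.v = torch.nn.Linear(D, D, bias=False)

    def forward(self, x):  # x: (n, P, D)
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2) / D**0.5, dim=-1)
        return attn @ self.v(x)  # (n, P, D)

def mae_loss(model, x, mask_ratio=0.5):
    """Zero out a random subset of patches and reconstruct them from the rest."""
    mask = torch.rand(x.shape[:2]) < mask_ratio  # (n, P); True = masked
    recon = model(x.masked_fill(mask.unsqueeze(-1), 0.0))
    return ((recon - x) ** 2)[mask].mean()

def cl_loss(model, x, tau=0.5):
    """Two noisy views per image; InfoNCE pulls matching pairs together."""
    z1 = F.normalize(model(x + 0.05 * torch.randn_like(x)).mean(dim=1), dim=-1)
    z2 = F.normalize(model(x + 0.05 * torch.randn_like(x)).mean(dim=1), dim=-1)
    return F.cross_entropy(z1 @ z2.T / tau, torch.arange(len(x)))

for name, loss_fn in [("MAE", mae_loss), ("CL", cl_loss)]:
    model = OneLayerViT()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)  # plain gradient descent
    for _ in range(200):
        loss = loss_fn(model, toy_batch())
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"{name}: final loss {loss.item():.4f}")
```

Under this kind of data model, the paper’s result would correspond to the MAE-trained attention weights attending to both the global and local directions, while the CL-trained weights concentrate on the global one; the sketch only reproduces the training setup, not those conclusions.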
Keywords
* Artificial intelligence
* Gradient descent
* MAE
* Self-supervised learning
* Softmax
* Vision transformer
* ViT