Summary of Denoising Autoregressive Representation Learning, by Yazhe Li et al.
Denoising Autoregressive Representation Learning
by Yazhe Li, Jorg Bornschein, Ting Chen
First submitted to arXiv on: 8 Mar 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper introduces DARL, a new generative approach for learning visual representations using a decoder-only Transformer to predict image patches autoregressively. Trained with Mean Squared Error (MSE) alone, the method already learns strong representations. To improve image generation, the MSE loss is replaced with a diffusion objective via a denoising patch decoder. The learned representation can be further improved with tailored noise schedules and longer training in larger models; the optimal schedule differs significantly from the ones typically used in standard image diffusion models. Despite its simple architecture, DARL achieves performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. |
| Low | GrooveSquid.com (original content) | DARL is a new way to learn visual representations using a special kind of artificial intelligence called a Transformer. It's like a puzzle where the model predicts what comes next in an image, patch by patch, without any help from human labels. The researchers found that training this model with a type of error measurement called Mean Squared Error (MSE) leads to great results. To make it even better, they replaced MSE with another way of learning called a diffusion objective, which helps the model generate realistic images. They also experimented with different noise schedules and larger models to see what works best. Surprisingly, DARL performs almost as well as state-of-the-art masked prediction models when both are fine-tuned for a task. |
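The core training objective described above (predict each patch from the patches before it, scored with MSE) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the linear predictor, patch sizes, and random data below are placeholders standing in for DARL's decoder-only Transformer and real image patches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a sequence of flattened image patches.
num_patches, patch_dim = 8, 16
patches = rng.normal(size=(num_patches, patch_dim))

# Toy linear predictor in place of the decoder-only Transformer.
W = rng.normal(scale=0.1, size=(patch_dim, patch_dim))

def mse_autoregressive_loss(patches, W):
    """MSE of predicting patch t from patch t-1 (causal: only past patches are used)."""
    preds = patches[:-1] @ W
    targets = patches[1:]
    return float(np.mean((preds - targets) ** 2))

loss = mse_autoregressive_loss(patches, W)
print(loss)
```

In the full method, this MSE loss is swapped for a denoising (diffusion) objective: the patch decoder reconstructs clean patches from noised inputs, which improves generation quality.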
Keywords
* Artificial intelligence * Decoder * Diffusion * Fine-tuning * Image generation * MSE * Transformer