Summary of Denoising Autoregressive Representation Learning, by Yazhe Li et al.


Denoising Autoregressive Representation Learning

by Yazhe Li, Jorg Bornschein, Ting Chen

First submitted to arXiv on: 8 Mar 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computer Vision and Pattern Recognition (cs.CV)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract, available via arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper introduces DARL, a generative approach to visual representation learning that uses a decoder-only Transformer to predict image patches autoregressively. Trained with a Mean Squared Error (MSE) loss alone, the model already learns strong representations. To improve image generation, the MSE loss is replaced with a diffusion objective, implemented via a denoising patch decoder. The learned representations improve further with tailored noise schedules and longer training of larger models; notably, the optimal schedule differs significantly from those typically used in standard image diffusion models. Despite its simple architecture, DARL achieves performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol.
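To make the two training objectives above concrete, here is a minimal NumPy sketch of patch-level next-patch prediction with an MSE loss, plus a diffusion-style denoising variant. This is an illustration only, not the paper's architecture: `patchify`, the loss helpers, and the identity `predict` function are hypothetical stand-ins for the decoder-only Transformer and denoising patch decoder.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into a sequence of flattened patches."""
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    patches = (image
               .reshape(rows, patch_size, cols, patch_size, C)
               .transpose(0, 2, 1, 3, 4))
    return patches.reshape(rows * cols, patch_size * patch_size * C)

def mse_next_patch_loss(patches, predict):
    """Autoregressive MSE objective: each patch predicts the next one."""
    preds = predict(patches[:-1])   # model output for positions 0..N-2
    targets = patches[1:]           # ground-truth next patches
    return float(np.mean((preds - targets) ** 2))

def denoising_patch_loss(patches, predict, noise_level, rng):
    """Diffusion-style variant: the decoder receives a noised copy of the
    target patch and is trained to recover the clean one."""
    targets = patches[1:]
    noisy = targets + noise_level * rng.standard_normal(targets.shape)
    preds = predict(noisy)          # stand-in for the denoising patch decoder
    return float(np.mean((preds - targets) ** 2))

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
seq = patchify(image, 4)            # 4 patches, each 4*4*3 = 48 dims
mse_loss = mse_next_patch_loss(seq, predict=lambda x: x)
dn_loss = denoising_patch_loss(seq, predict=lambda x: x,
                               noise_level=0.5, rng=rng)
```

In the paper, `noise_level` would follow a learned or tailored noise schedule rather than a fixed constant; the summary notes that the best schedule differs from those used in standard image diffusion models.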
Low Difficulty Summary (written by GrooveSquid.com, original content)
DARL is a new way to learn visual representations using a kind of artificial intelligence model called a Transformer. It works like a puzzle: the model predicts what comes next in an image, patch by patch, without any help from human labels. The researchers found that training this model with a simple error measurement called Mean Squared Error (MSE) already gives great results. To make it generate more realistic images, they swapped MSE for another learning signal called a diffusion objective. They also experimented with different noise schedules and larger models to see what works best. Surprisingly, DARL performs almost as well as state-of-the-art masked prediction models when both are fine-tuned for a specific task.

Keywords

* Artificial intelligence  * Decoder  * Diffusion  * Fine tuning  * Image generation  * MSE  * Transformer