Summary of Denoising with a Joint-Embedding Predictive Architecture, by Dengsheng Chen et al.
Denoising with a Joint-Embedding Predictive Architecture
by Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu
First submitted to arXiv on: 2 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper introduces Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), which integrates joint-embedding predictive architectures (JEPAs) into generative modeling. JEPAs have shown promise in self-supervised representation learning, but their application to generative modeling remains underexplored. By recognizing JEPA as a form of masked image modeling and reinterpreting it as a generalized next-token prediction strategy, D-JEPA generates data in an auto-regressive manner. It also incorporates a diffusion loss to model the per-token probability distribution, enabling generation in continuous spaces, and adapts a flow matching loss as an alternative, further enhancing D-JEPA's flexibility. Experiments show that D-JEPA consistently achieves lower FID scores with fewer training epochs, indicating strong scalability, and that it outperforms previous generative models at every scale on ImageNet conditional generation benchmarks. The model is also well-suited to other continuous data modalities, such as video and audio. |
Low | GrooveSquid.com (original content) | This research introduces a new way to generate images called D-JEPA. It combines two existing ideas: joint-embedding predictive architectures (JEPAs) and diffusion models. JEPAs are good at learning representations from data, but they have rarely been used to generate images. Diffusion models are good at modeling arbitrary probability distributions. D-JEPA uses both ideas to generate images in a continuous space, and it can also swap in a flow matching loss as an alternative way of working. The results show that D-JEPA beats previous methods and can be used for other types of data, like video and audio. |
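To make the "flow matching loss" mentioned in the summaries concrete, here is a minimal sketch of the textbook conditional flow matching objective on a single token embedding. This is an illustrative assumption, not the paper's actual implementation: the `velocity_fn` model stand-in, the straight-line interpolation path, and the toy oracle are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Conditional flow matching loss sketch.

    x0: noise sample, x1: data sample (e.g. a token embedding), t in [0, 1].
    The model `velocity_fn` is trained to predict the velocity of the
    straight-line path x_t = (1 - t) * x0 + t * x1, whose true velocity
    is the constant x1 - x0.
    """
    xt = (1.0 - t) * x0 + t * x1   # point on the interpolation path
    target = x1 - x0               # ground-truth velocity along the path
    pred = velocity_fn(xt, t)      # model prediction (stand-in here)
    return float(np.mean((pred - target) ** 2))

# Toy check: an oracle returning the true velocity incurs zero loss.
x0 = rng.standard_normal(8)        # "noise" vector
x1 = rng.standard_normal(8)        # "data" vector
oracle = lambda xt, t: x1 - x0
print(flow_matching_loss(oracle, x0, x1, 0.3))  # → 0.0
```

A diffusion loss plays the same per-token role (regressing the noise added to the token instead of a path velocity); the paper's contribution is plugging such continuous-space losses into JEPA-style next-token prediction, which this sketch does not attempt to reproduce.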
Keywords
» Artificial intelligence » Diffusion » Embedding » Image generation » Probability » Representation learning » Self supervised » Token