Summary of I0T: Embedding Standardization Method Towards Zero Modality Gap, by Na Min An et al.
I0T: Embedding Standardization Method Towards Zero Modality Gap
by Na Min An, Eunki Kim, James Thorne, Hyunjung Shim
First submitted to arXiv on: 18 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The proposed I0T framework addresses the modality gap in contrastive language-image pretraining (CLIP) by introducing two methods that reduce the discrepancy between image and text embeddings. The first, a post-hoc embedding standardization method, reduces the modality gap approximately to zero; the second, a trainable variant, alleviates the problem by adding normalization layers to each encoder. The framework significantly reduces the modality gap while preserving the original embedding representations of trained models with locked parameters. This is particularly important in applications such as image-text retrieval and classification, where zero-shot inference is crucial. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Contrastive language-image pretraining (CLIP) helps machines understand images and text together without needing any labeled training data. However, some recent improvements to CLIP suffer from a problem called the modality gap. This means that the ways images and text are represented inside the computer become very different from each other, which makes it harder for machines to relate them correctly. The authors of this paper identify what causes the modality gap and propose two ways to fix it: one method adjusts how the image and text representations are standardized after training, while the other uses special normalization layers to help the model align images and text during training. This matters because computers need to connect images and text correctly to do things like search for specific pictures or identify what's in a photo. |
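The post-hoc idea described above can be illustrated with a small sketch. Note this is an assumption-laden toy illustration, not the authors' actual I0T implementation: it centers each modality's embeddings at the origin (removing the offset between the image and text centroids, a common proxy for the modality gap) and then re-normalizes rows to unit length, as CLIP-style embeddings usually are. The function names `standardize_embeddings` and `modality_gap` are hypothetical.

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    # Euclidean distance between the two modality centroids,
    # a common proxy measure for the modality gap.
    return np.linalg.norm(img_emb.mean(axis=0) - txt_emb.mean(axis=0))

def standardize_embeddings(img_emb, txt_emb):
    # Toy post-hoc standardization: subtract each modality's mean
    # embedding so both centroids move to the origin, then
    # re-normalize each embedding to unit length.
    img_c = img_emb - img_emb.mean(axis=0, keepdims=True)
    txt_c = txt_emb - txt_emb.mean(axis=0, keepdims=True)
    img_c /= np.linalg.norm(img_c, axis=1, keepdims=True)
    txt_c /= np.linalg.norm(txt_c, axis=1, keepdims=True)
    return img_c, txt_c
```

On synthetic embeddings whose centroids are offset, this centering step shrinks the centroid distance to near zero while keeping each embedding's direction relative to its modality's mean, which is the sense in which such post-hoc standardization "preserves" the learned representations of a frozen model.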
Keywords
» Artificial intelligence » Classification » Embedding » Encoder » Inference » Pretraining » Zero shot