Summary of Nearest Neighbor Normalization Improves Multimodal Retrieval, by Neil Chowdhury et al.
Nearest Neighbor Normalization Improves Multimodal Retrieval
by Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush
First submitted to arXiv on: 31 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | High Difficulty Summary: Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: This paper examines multimodal models, which power tasks such as image captioning, visual question answering, and cross-modal retrieval but still produce imperfect results. To address this, the authors introduce Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models without any additional training. NNN improves retrieval metrics for a range of models (CLIP, BLIP, ALBEF, SigLIP, BEiT) on two datasets (MS-COCO and Flickr30k). It requires a reference database but no training on that database, and it can boost retrieval accuracy even for models that have already been fine-tuned. (A rough sketch of the idea appears after this table.) |
| Low | GrooveSquid.com (original content) | Low Difficulty Summary: This paper talks about how machines that understand pictures and words are not perfect. These machines are really good at things like describing what’s in a picture or answering questions about an image, but they still make mistakes sometimes. To fix these mistakes, the researchers came up with a way to improve the machine’s accuracy without teaching it anything new. The fix works across different models and datasets. |
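Based only on the description above, here is a minimal sketch of how nearest-neighbor score normalization for text-to-image retrieval might look: the score of each image is adjusted by a bias estimated from its most similar queries in a reference database, with no training involved. The function name `nnn_scores` and the parameters `k` and `lam` are illustrative assumptions, not taken from the paper; see the original abstract for the exact formulation.

```python
import numpy as np

def nnn_scores(query_embs, image_embs, reference_query_embs, k=16, lam=0.5):
    """Sketch: correct contrastive retrieval scores with a nearest-neighbor bias estimate.

    query_embs:            (Q, d) L2-normalized text query embeddings
    image_embs:            (I, d) L2-normalized image embeddings
    reference_query_embs:  (R, d) L2-normalized reference text embeddings
    """
    # Raw cosine-similarity retrieval scores between every query and every image.
    raw = query_embs @ image_embs.T                  # (Q, I)

    # For each image, estimate a per-image bias as the mean similarity to its
    # k most similar queries in the reference database (no training required).
    ref_sims = reference_query_embs @ image_embs.T   # (R, I)
    topk = np.sort(ref_sims, axis=0)[-k:, :]         # (k, I) largest similarities per image
    bias = topk.mean(axis=0)                         # (I,)

    # Subtract the scaled bias from every query's scores before ranking.
    return raw - lam * bias[None, :]
```

Because the correction only re-ranks scores produced by an already-trained model, it can be applied on top of any of the models listed in the summary, including ones that were fine-tuned first.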
Keywords
» Artificial intelligence » Fine tuning » Image captioning » Nearest neighbor » Question answering