
Nearest Neighbor Normalization Improves Multimodal Retrieval

by Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush

First submitted to arXiv on: 31 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
Multimodal models excel at tasks such as image captioning, visual question answering, and cross-modal retrieval, yet they still make errors. To address this, the authors introduce Nearest Neighbor Normalization (NNN), a simple and efficient method for correcting errors in trained contrastive image-text retrieval models without any additional training. NNN improves retrieval metrics for a range of models (CLIP, BLIP, ALBEF, SigLIP, BEiT) on both the MS-COCO and Flickr30k datasets. The method requires a reference database but no training on that database, and it can even increase the retrieval accuracy of a model after fine-tuning.
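To make the bias-correction idea more concrete, here is a minimal sketch in Python/NumPy. It is an illustrative approximation, not the authors' reference implementation: the function name, the choice of k, and the scaling factor alpha are assumptions, and the exact scoring rule in the paper may differ in detail. The core idea it captures is that each retrieval candidate's score is debiased by subtracting a scaled average of its similarities to its k most similar queries from a reference query set.

```python
import numpy as np

def nnn_scores(query_emb, candidate_embs, reference_query_embs, k=16, alpha=0.5):
    """Hypothetical sketch of nearest-neighbor score normalization.

    query_emb:            (d,)   embedding of the test query (e.g. a caption)
    candidate_embs:       (n, d) embeddings of the retrieval candidates (e.g. images)
    reference_query_embs: (m, d) embeddings of a reference query database
    k, alpha:             number of neighbors and bias scale (assumed values)
    """
    # Raw contrastive similarity between the query and every candidate.
    raw = candidate_embs @ query_emb                      # (n,)

    # For each candidate, find its k most similar reference queries and use
    # their mean similarity as an estimate of that candidate's bias, i.e.
    # how strongly it tends to score regardless of the query.
    ref_sims = candidate_embs @ reference_query_embs.T    # (n, m)
    topk = np.sort(ref_sims, axis=1)[:, -k:]              # (n, k)
    bias = topk.mean(axis=1)                              # (n,)

    # Debiased score: subtract the scaled per-candidate bias term.
    return raw - alpha * bias

# Toy usage with random embeddings as stand-ins for CLIP-style features.
rng = np.random.default_rng(0)
d, n, m = 8, 5, 100
query = rng.normal(size=d)
candidates = rng.normal(size=(n, d))
reference_queries = rng.normal(size=(m, d))
print(nnn_scores(query, candidates, reference_queries, k=10))
```

Note that no model parameters are updated anywhere in this sketch; the correction is applied purely at scoring time, which is why the method needs a reference database but no training on it.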
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about machines that understand both pictures and words. These machines are good at things like describing what is in a picture or answering questions about an image, but they still make mistakes. The researchers came up with a way to fix many of these mistakes without teaching the machine anything new, and it improves results across different models and datasets.

Keywords

» Artificial intelligence  » Fine tuning  » Image captioning  » Nearest neighbor  » Question answering