Summary of LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts, by Anh-Quan Cao et al.


LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

by Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani

First submitted to arXiv on: 10 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high-difficulty version is the paper's original abstract, which can be read on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
Large-scale vision-language pre-trained models, such as CLIP, are versatile tools that can be applied to many tasks without additional training. However, their performance often falls short in specific domains because of domain gaps or the under-representation of those domains in the training data. To address this, the paper proposes LatteCLIP, an unsupervised method for fine-tuning CLIP models on custom domains for classification with known class names, without relying on human annotations. LatteCLIP uses Large Multimodal Models (LMMs) to generate expressive textual descriptions for individual images and for groups of images, providing additional contextual information to guide the fine-tuning process. It also introduces a novel strategy that distills only the useful information from the noisy generated texts and from dual pseudo-labels, allowing it to learn rich per-class prototype representations. Experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.

Low Difficulty Summary (original content by GrooveSquid.com)
Large-scale vision-language pre-trained models can handle many tasks without extra training, but their performance often falls short in specific domains because of domain gaps or because those domains are under-represented in the training data. To address this, a new method fine-tunes these models on custom domains for classification using only the known class names, without any human annotations. It uses large multimodal models to generate textual descriptions of individual images and of groups of images, which provide extra guidance during fine-tuning. Across ten domain-specific datasets, this method outperforms both the pre-trained zero-shot model and other unsupervised methods.
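
The medium summary describes a general recipe: turn class names and LMM-generated descriptions into per-class text prototypes, pseudo-label the unlabeled target-domain images with those prototypes, and fine-tune the image encoder on the pseudo-labels. The sketch below illustrates that idea in PyTorch under strong simplifying assumptions: tiny linear layers stand in for CLIP's pretrained towers, and random vectors stand in for the encoded class-name prompts, LMM descriptions, and images. It is not the authors' LatteCLIP implementation, which additionally distills only the useful signal from noisy texts and dual pseudo-labels.

```python
# Illustrative sketch only -- NOT the authors' LatteCLIP code.
# Idea: fuse class-name prompts and LMM-generated descriptions into one
# prototype per class, pseudo-label unlabeled images with the prototypes,
# and fine-tune the image encoder on those pseudo-labels.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMBED_DIM = 64

# Tiny stand-ins for CLIP's pretrained image/text towers (assumption).
image_encoder = torch.nn.Linear(128, EMBED_DIM)
text_encoder = torch.nn.Linear(32, EMBED_DIM)
text_encoder.requires_grad_(False)  # keep the text tower frozen in this sketch

def encode_images(x):
    return F.normalize(image_encoder(x), dim=-1)

def encode_texts(t):
    return F.normalize(text_encoder(t), dim=-1)

# Hypothetical data: 3 classes; random vectors stand in for the encoded
# class-name prompts and the LMM-generated descriptions.
num_classes = 3
class_name_feats = torch.randn(num_classes, 32)
synthetic_desc_feats = torch.randn(num_classes, 32)

def class_prototypes():
    # Average the class-name embedding with the synthetic-description
    # embedding to get one normalized prototype per class.
    protos = 0.5 * (encode_texts(class_name_feats) + encode_texts(synthetic_desc_feats))
    return F.normalize(protos, dim=-1)

# Unlabeled "images" from the custom domain (random stand-ins).
images = torch.randn(16, 128)

optimizer = torch.optim.Adam(image_encoder.parameters(), lr=1e-3)
for step in range(10):
    protos = class_prototypes()
    with torch.no_grad():
        # Pseudo-label each image with its nearest prototype (cosine similarity).
        pseudo_labels = (encode_images(images) @ protos.t()).argmax(dim=-1)
    logits = 100.0 * (encode_images(images) @ protos.t())  # temperature-scaled
    loss = F.cross_entropy(logits, pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the real setting, the prototype construction and pseudo-labeling steps are where the paper's contributions lie, such as the expressive per-image and per-group descriptions and the distillation of useful information from noisy texts and dual pseudo-labels.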

Keywords

» Artificial intelligence  » Classification  » Fine tuning  » Unsupervised  » Zero shot