
Summary of Calibrating Multi-modal Representations: a Pursuit Of Group Robustness Without Annotations, by Chenyu You et al.


Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations

by Chenyu You, Yifei Min, Weicheng Dai, Jasjeet S. Sekhon, Lawrence Staib, James S. Duncan

First submitted to arxiv on: 12 Mar 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)

Read the original abstract here.

Medium Difficulty Summary (GrooveSquid.com, original content)

This paper explores fine-tuning pre-trained vision-language models such as CLIP to tackle diverse downstream tasks. Existing methods face two challenges: tuning an entire model is time-consuming and computationally costly, and the tuned model becomes highly specialized, limiting real-world deployment. Recent studies also show that pre-trained classifiers rely heavily on spurious features, patterns that are correlated with the target in the training data but unrelated to the true labeling function. The authors focus on mitigating this reliance without using group annotations. They systematically study spurious correlations in CLIP and propose a lightweight representation calibration method for fine-tuning: the pretrained model first generates a calibration set, and representations are then calibrated through contrastive learning, without requiring group labels. Experiments on several benchmarks confirm that the approach reduces reliance on spurious features and improves generalization.

Low Difficulty Summary (GrooveSquid.com, original content)

This paper is about how to make computers better at understanding images and words together. Right now, there are some problems with the way we do this: it takes a long time and a lot of computing power to train these models, and they often become very good at one specific task but not at others. Researchers have also found that these models can get stuck relying on patterns in the training data that don't actually help them understand what's going on. The authors of this paper want to fix this problem without needing extra help from humans. They came up with a new way to adjust the model's understanding so it doesn't rely too much on these patterns, and they tested it on several different datasets.
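The calibration step described in the medium summary, pulling together representations of samples that share a (pseudo-)label via a contrastive objective, can be illustrated with a small sketch. The paper's exact loss and calibration-set construction are not given here, so the function below (its name and the pseudo-label setup are hypothetical) shows only a generic supervised-contrastive-style loss over precomputed embeddings, not the authors' actual method:

```python
import numpy as np

def contrastive_calibration_loss(z, labels, temperature=0.1):
    """Generic supervised-contrastive loss sketch (not the paper's exact loss).

    z      : (n, d) array of sample representations (e.g. CLIP embeddings).
    labels : length-n list of pseudo-labels from the pretrained model,
             standing in for the group annotations the method avoids.
    """
    # L2-normalize so dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    n = len(labels)
    loss, count = 0.0, 0
    for i in range(n):
        # Positives: other samples assigned the same pseudo-label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Exclude self-similarity from the softmax denominator.
        logits = np.delete(sim[i], i)
        log_denom = np.log(np.exp(logits).sum())
        for j in positives:
            idx = j if j < i else j - 1  # index shift after deleting entry i
            loss += -(logits[idx] - log_denom)
            count += 1
    return loss / count
```

On embeddings whose pseudo-label groups are already tightly clustered, this loss is near zero; it grows when same-label samples are spread apart, which is what drives the calibration.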

Keywords

  • Artificial intelligence
  • Boosting
  • Fine tuning
  • Generalization