
Summary of Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning, by Tian Liu et al.


Few-Shot Recognition via Stage-Wise Retrieval-Augmented Finetuning

by Tian Liu, Huixin Zhang, Shubham Parashar, Shu Kong

First submitted to arXiv on: 17 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
Few-shot recognition (FSR) aims to train classification models from only a few labeled examples per class, addressing the high cost of annotation. We propose methods that leverage a pre-trained Vision-Language Model (VLM) to solve FSR. Our primary focus is retrieval-augmented learning (RAL), which retrieves data relevant to the downstream task from the VLM’s pretraining set. Although RAL has been extensively studied for zero-shot recognition, applying it to FSR raises novel challenges and opportunities. Interestingly, we find that finetuning a VLM on the retrieved data underperforms state-of-the-art zero-shot methods, because the retrieved data is imbalanced and exhibits domain gaps with the few-shot examples. In contrast, simply finetuning the VLM on the few-shot data alone already outperforms previous FSR methods. Combining both sources, we propose Stage-Wise retrieval-Augmented fineTuning (SWAT), which first finetunes the model end to end on the mixed data and then retrains the classifier on the few-shot data (a minimal code sketch of this two-stage recipe follows the summaries below). Extensive experiments demonstrate that SWAT significantly outperforms previous methods, by more than 6% in accuracy.
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about a new way to train computers to recognize things when we only have a few labeled examples of each thing. The method uses a special kind of computer model called a Vision-Language Model (VLM). The authors tested different ways of using this model and found that some worked better than others. One surprising result was that training the VLM on a large amount of retrieved data didn’t work as well as simply training it on the few examples, because the retrieved data wasn’t very similar to the things the computer needed to recognize. By combining both approaches, they developed a new method called Stage-Wise retrieval-Augmented fineTuning (SWAT) that works much better than previous methods.
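The summaries above describe SWAT only at a high level. For readers who want a concrete picture, here is a minimal, illustrative sketch of the stage-wise idea in PyTorch: finetune end to end on the mix of retrieved and few-shot data, then freeze the encoder and retrain only the classifier on the few-shot data. This is not the authors’ code; the encoder, the dummy datasets, and the helper names (make_dummy_split, train) are placeholders invented for illustration, whereas the real method finetunes a pre-trained VLM on actual retrieved images.

```python
# Illustrative sketch of a two-stage (SWAT-style) finetuning recipe.
# Not the paper's implementation: the encoder and data below are dummies.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, ConcatDataset

def make_dummy_split(n, dim=512, num_classes=10):
    # Stand-in for real image features and labels (few-shot or retrieved data).
    return TensorDataset(torch.randn(n, dim), torch.randint(0, num_classes, (n,)))

few_shot = make_dummy_split(40)     # the limited labeled examples
retrieved = make_dummy_split(400)   # data retrieved from the VLM's pretraining set

encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU())  # placeholder for the VLM's visual encoder
classifier = nn.Linear(256, 10)
model = nn.Sequential(encoder, classifier)

def train(model, dataset, params, epochs=5, lr=1e-3):
    opt = torch.optim.AdamW(params, lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: end-to-end finetuning on the mix of retrieved and few-shot data.
train(model, ConcatDataset([retrieved, few_shot]), model.parameters())

# Stage 2: freeze the encoder and retrain only the classifier on the few-shot data.
for p in encoder.parameters():
    p.requires_grad = False
classifier.reset_parameters()
train(model, few_shot, classifier.parameters())
```

The second stage exists because, per the abstract, the retrieved data is imbalanced and domain-shifted relative to the few-shot examples; retraining the classifier head on the few-shot data alone corrects for that while keeping the representation learned from the larger mixed set.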

Keywords

» Artificial intelligence  » Classification  » Few shot  » Fine tuning  » Language model  » Pretraining  » Zero shot