Evolving Interpretable Visual Classifiers with Large Language Models
by Mia Chiquier, Utkarsh Mall, Carl Vondrick
First submitted to arXiv on: 15 Apr 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Multimodal pre-trained models such as CLIP excel at zero-shot classification thanks to their open-vocabulary flexibility and strong performance. However, these vision-language models are black boxes, which limits their interpretability, raises the risk of bias, and blocks the discovery of new visual concepts that have not already been written down. In practical settings, the class names and attributes of specialized concepts are often unknown, so these methods perform poorly on images that are uncommon in large-scale training data. To address this, we introduce a novel method that discovers interpretable sets of attributes for visual recognition: an evolutionary search algorithm uses a large language model to iteratively mutate a concept bottleneck of attributes used for classification (see the sketch after this table). This approach produces state-of-the-art, interpretable fine-grained classifiers, outperforming baselines by 18.4% on five iNaturalist datasets and by 22.2% on two KikiBouba datasets, even though the baselines have access to privileged information. |
| Low | GrooveSquid.com (original content) | Imagine a computer system that can understand images without being specifically trained for each type of image. This is what researchers have been trying to achieve with “vision-language models.” However, these systems are not very good at explaining why they make certain decisions or at discovering new concepts. To fix this, scientists have developed a new method that finds important visual characteristics and uses them to classify images correctly. This approach works very well and beats other methods at recognizing specific types of objects. |
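
To make the medium summary's description concrete, here is a minimal sketch of the general technique: an evolutionary loop in which a language model mutates a concept bottleneck of attributes while a vision-language model scores each candidate. The interfaces (`embed_text`, `llm_propose`), the prompt, and all hyperparameters are illustrative assumptions for this sketch, not the authors' released code.

```python
# Minimal sketch of LLM-driven evolutionary search over attribute
# "concept bottlenecks". Assumes the caller supplies:
#   embed_text(list_of_strings) -> (A, D) numpy array of text embeddings
#   llm_propose(prompt) -> list of attribute strings
# These interfaces, the prompt, and the selection scheme are assumptions.
import numpy as np

def score_bottleneck(bottleneck, img_emb, labels, embed_text):
    """Fitness: accuracy when each image is assigned the class whose
    attribute embeddings it matches best on average (assumes normalized
    embeddings, so dot products act as cosine similarities)."""
    classes = sorted(bottleneck)
    sims = np.stack(
        [(img_emb @ embed_text(bottleneck[c]).T).mean(axis=1)
         for c in classes],
        axis=1,
    )  # shape: (n_images, n_classes)
    preds = [classes[i] for i in sims.argmax(axis=1)]
    return float(np.mean([p == y for p, y in zip(preds, labels)]))

def mutate(bottleneck, fitness, llm_propose):
    """Ask the LLM to revise each class's attribute list, conditioned on
    the current list and its measured fitness (a hypothetical prompt)."""
    return {
        cls: llm_propose(
            f"The attributes {attrs} scored {fitness:.2f} accuracy for "
            f"class '{cls}'. Propose a revised list of short, "
            "discriminative visual attributes."
        )
        for cls, attrs in bottleneck.items()
    }

def evolve(population, img_emb, labels, embed_text, llm_propose,
           generations=10, survivors=3):
    """Keep the fittest bottlenecks each generation, add their
    LLM-mutated offspring, and return the best one found."""
    def fit(b):
        return score_bottleneck(b, img_emb, labels, embed_text)
    for _ in range(generations):
        parents = sorted(population, key=fit, reverse=True)[:survivors]
        offspring = [mutate(b, fit(b), llm_propose) for b in parents]
        population = parents + offspring
    return max(population, key=fit)
```

In this sketch each bottleneck is a dict mapping a class name to a list of attribute strings, so the interpretability comes for free: the final classifier is just a human-readable attribute list plus similarity scores.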
Keywords
» Artificial intelligence » Classification » Large language model » Zero shot