Summary of Prompt Decoupling for Text-to-Image Person Re-identification, by Weihao Li et al.
Prompt Decoupling for Text-to-Image Person Re-identification
by Weihao Li, Lei Tan, Pingyang Dai, Yan Zhang
First submitted to arXiv on: 4 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | In this paper, the researchers tackle text-to-image person re-identification (TIReID): retrieving a target individual from an image gallery given a textual description as the query. They build on pre-trained vision-language models such as CLIP, which have shown strong semantic concept learning and multi-modal knowledge acquisition. The authors argue that recent CLIP-based TIReID methods directly fine-tune the entire network, forcing it to perform domain adaptation and task adaptation simultaneously. To decouple these two processes, they introduce prompt tuning for domain adaptation and a two-stage training strategy that separates it from task adaptation (see the sketch after this table). Evaluated on three widely used datasets, the method shows significant improvements over directly fine-tuned approaches. |
Low | GrooveSquid.com (original content) | This paper is about finding people in photos based on written descriptions. The researchers use models called CLIP that are good at understanding both words and images. To make CLIP better at this task, they split training into two steps: first adapting the model to the new kind of data (photos of people paired with descriptions), then training it to identify specific people. This helps the model focus on what matters for person re-identification. The results show that this approach works better than simply fine-tuning the entire network. |
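To make the decoupled schedule concrete, here is a minimal PyTorch sketch of one plausible reading of the two-stage training, assuming the open-source `clip` package. The prompt variable `ctx_prompt`, its shape, the freezing choices, and the optimizer settings are illustrative assumptions, not the authors' implementation.

```python
import torch
import clip  # OpenAI's open-source CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

# Hypothetical learnable context prompt to be prepended to the text
# tokens (4 tokens x 512 dims, matching ViT-B/16's text width).
ctx_prompt = torch.nn.Parameter(torch.randn(4, 512, device=device) * 0.02)

# Stage 1 (domain adaptation): freeze the whole CLIP backbone and
# train only the prompt on the TIReID data.
for p in model.parameters():
    p.requires_grad_(False)
opt_stage1 = torch.optim.Adam([ctx_prompt], lr=1e-3)

# Stage 2 (task adaptation): fix the tuned prompt and fine-tune the
# network for the re-identification objective.
ctx_prompt.requires_grad_(False)
for p in model.parameters():
    p.requires_grad_(True)
opt_stage2 = torch.optim.Adam(model.parameters(), lr=1e-5)
```

Each stage would then run its own training loop with its optimizer; the point of the sketch is only that the prompt and the backbone are never updated together, which is the decoupling the paper describes.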
Keywords
» Artificial intelligence » Domain adaptation » Fine tuning » Multi modal » Prompt