
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

by Shicheng Xu, Liang Pang, Yunchang Zhu, Huawei Shen, Xueqi Cheng

First submitted to arXiv on: 16 Oct 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper examines vision-language alignment in Large Vision-Language Models (LVLMs). The authors find that existing alignment methods fail to transfer the safety mechanism that LLMs learn for text over to the visual modality, leaving LVLMs vulnerable to toxic images. To locate the cause, they examine where and how the safety mechanism operates and compare its behavior on text versus vision inputs. The analysis shows that hidden states at specific transformer layers are crucial for activating the safety mechanism, but current methods align the two modalities insufficiently at the hidden-state level: input images undergo a semantic shift relative to equivalent text, which misleads the safety mechanism. To address this, the authors propose Text-Guided vision-language Alignment (TGA). TGA retrieves texts related to an input image and uses them to guide the projection of the image into the hidden-state space (a minimal sketch of this idea follows these summaries). The approach transfers the safety mechanism from text to vision without any safety fine-tuning on the visual modality, while maintaining general performance on various vision tasks.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper looks at how Large Vision-Language Models (LVLMs) understand pictures. Right now, flaws in that process make it hard for LVLMs to detect harmful images. The authors set out to explain why this happens and to fix it. Their experiments show that the problem comes from the model processing words and pictures differently. They propose a new method, Text-Guided vision-language Alignment (TGA), that helps LVLMs understand pictures by using texts related to those pictures. The method works well: it detects harmful images without needing any extra safety training on images.
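
The core mechanism described in the medium summary can be pictured as a projection trained against text hidden states. Below is a minimal, illustrative PyTorch sketch of that idea: a learnable linear projection maps vision-encoder features into the LLM's hidden-state space, and a cosine loss pulls each projected image toward the hidden states of its retrieved related texts. The module names, dimensions, and loss choice here are assumptions made for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of text-guided vision-to-hidden-state alignment.
# All shapes, names, and the loss are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_DIM = 4096  # assumed LLM hidden size (illustrative)
VISION_DIM = 1024  # assumed vision-encoder feature size (illustrative)

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder features into the LLM hidden-state space."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(VISION_DIM, HIDDEN_DIM)

    def forward(self, vision_feats):
        return self.proj(vision_feats)

def alignment_loss(projected_image, text_hidden):
    # Pull each projected image toward the LLM hidden states of its
    # retrieved related texts, so an image lands where the text-trained
    # safety mechanism expects the semantically equivalent text to be.
    return 1.0 - F.cosine_similarity(projected_image, text_hidden, dim=-1).mean()

# Random stand-ins for (a) a frozen vision encoder's image features and
# (b) the LLM's hidden states for retrieved texts at a safety-relevant
# transformer layer; in practice both would come from real models.
vision_feats = torch.randn(8, VISION_DIM)
text_hidden = torch.randn(8, HIDDEN_DIM)

projector = VisionProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = alignment_loss(projector(vision_feats), text_hidden)
loss.backward()
optimizer.step()
print(f"alignment loss after one step: {loss.item():.4f}")
```

Per the summary above, the design point is that alignment must happen at the hidden-state layers where the safety mechanism activates, so that safety behavior learned on text also fires on image inputs.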

Keywords

  • Artificial intelligence
  • Alignment
  • Fine tuning
  • Transformer