Summary of Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models, by Yongshuo Zong et al.
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
by Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, Timothy Hospedales
First submitted to arXiv on: 3 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | See the paper's original abstract on arXiv. |
| Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to addressing harmful content generation and vulnerability to attacks in current vision large language models (VLLMs). The authors identify that VLLM fine-tuning can cause forgetting of the safety alignment previously learned by the underpinning LLM, leaving the models prone to producing harmful outputs. To mitigate this problem, the researchers curate VLGuard, a vision-language safe instruction-following dataset covering various harmful categories. They then demonstrate that integrating this dataset into standard vision-language fine-tuning, or using it for post-hoc fine-tuning, effectively aligns VLLMs for safety with minimal impact on their helpfulness (a minimal data-mixing sketch follows this table). Empirical results show that the fine-tuned VLLMs reject unsafe instructions and substantially reduce the success rates of several black-box adversarial attacks. |
| Low | GrooveSquid.com (original content) | This paper helps make computer programs called vision large language models (VLLMs) safer. Right now, these models can create harmful content and are easily tricked into doing things they shouldn't. The problem is that when we teach VLLMs new skills, they forget some of the important rules that keep them safe. To fix this, the researchers created a special dataset called VLGuard that teaches VLLMs to follow good instructions and refuse bad ones. They show that using this dataset makes VLLMs much safer without making them worse at doing helpful things. |
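To make the data-mixing idea concrete, below is a minimal Python sketch of combining a standard vision-language instruction-tuning set with VLGuard-style safety examples before fine-tuning. The file names (`llava_instruct.jsonl`, `vlguard_train.jsonl`) and the JSONL format are placeholder assumptions, not the paper's actual release format; a real pipeline would pass the mixed set to a VLLM fine-tuning script rather than just counting examples.

```python
import json
import random


def load_jsonl(path):
    """Load a list of instruction-following examples from a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def mix_safety_data(helpful_examples, safety_examples, seed=0):
    """Concatenate and shuffle helpful and safety instruction data.

    This mirrors the mixed fine-tuning strategy at a high level: the
    safety examples are simply added to the standard vision-language
    fine-tuning mixture, so safety behaviour is learned alongside
    helpfulness rather than in a separate training stage.
    """
    mixed = helpful_examples + safety_examples
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Placeholder paths; substitute the actual instruction-tuning and
    # VLGuard data files used in your setup.
    helpful = load_jsonl("llava_instruct.jsonl")
    safety = load_jsonl("vlguard_train.jsonl")
    train_set = mix_safety_data(helpful, safety)
    print(f"Fine-tuning on {len(train_set)} examples "
          f"({len(safety)} from the safety set)")
```

The post-hoc variant described in the summary would differ only in what is mixed: a small safety set plus a slice of the original helpful data, applied as a short additional fine-tuning stage on an already-trained VLLM.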
Keywords
* Artificial intelligence
* Alignment
* Fine-tuning