Summary of The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models, by Xikang Yang et al.
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
by Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu
First submitted to arXiv on: 18 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract |
Medium | GrooveSquid.com (original content) | This paper investigates a safety vulnerability of large language models (LLMs) and identifies a new threat: their inherent bias toward authority. While this bias can improve output quality, it also raises the risk of harmful content, because models place extra trust in requests that appear to be backed by authoritative sources. To exploit this bias, the authors introduce DarkCite, an adaptive authority citation matcher and generator designed for black-box settings, which matches suitable citation types to different kinds of harmful instructions and generates corresponding citations. Experimental results show that DarkCite achieves a higher attack success rate than previous methods. The paper also proposes an authenticity and harm verification defense strategy to counter this risk. (Conceptual sketches of the attack and the defense follow the table.) |
Low | GrooveSquid.com (original content) | This study looks at how large language models can be tricked into creating harmful content. The researchers found that these models place extra trust in certain sources of information, like GitHub links for malware-related requests. They created a tool called DarkCite that helps attackers exploit this trust by attaching authoritative-looking citations to harmful requests, making them seem more legitimate. The authors also suggest ways to detect and prevent this kind of attack. |
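
To make the "citation matcher and generator" idea in the medium summary concrete, here is a minimal sketch of how pairing a request with an authority citation changes its framing. The categories, templates, and the `match_citation` / `build_prompt` helpers are hypothetical illustrations, not DarkCite's actual implementation, and the example request is deliberately benign.

```python
# Illustrative sketch only: a toy mapping from instruction categories to
# citation styles. Everything here is a hypothetical stand-in for the
# matching/generation pipeline described in the paper summary above.

from dataclasses import dataclass


@dataclass
class Citation:
    source_type: str   # e.g., "code repository", "academic paper"
    reference: str     # a human-readable citation string


# Hypothetical lookup: which citation style tends to look most "authoritative"
# for a given category of instruction.
CITATION_STYLES = {
    "software": Citation("code repository", "See the public repository <REPO_URL>."),
    "science": Citation("academic paper", "As reported in <AUTHOR> et al., <VENUE> <YEAR>."),
    "policy": Citation("government report", "According to the official report <REPORT_ID>."),
}


def match_citation(category: str) -> Citation:
    """Pick the citation style assumed to carry the most authority for a category."""
    return CITATION_STYLES.get(category, Citation("news article", "As covered in <OUTLET>."))


def build_prompt(request: str, category: str) -> str:
    """Wrap a request with an authority citation so it reads as sourced and legitimate."""
    citation = match_citation(category)
    return f"{citation.reference}\nBased on that {citation.source_type}, {request}"


if __name__ == "__main__":
    # Benign example request, used only to show how the wrapper changes the framing.
    print(build_prompt("summarize the main security recommendations.", "policy"))
```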
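
The summaries also mention an "authenticity and harm verification" defense. Below is a minimal sketch of that idea under stated assumptions: a citation is trusted only if it resolves to a known source and the cited content is not flagged as harmful. `resolve_citation`, `harm_score`, and the keyword-based check are hypothetical stand-ins; the paper's actual verification pipeline is not reproduced here.

```python
# Illustrative sketch only: a toy authenticity-and-harm check, not the paper's method.

from typing import Optional


def resolve_citation(reference: str, known_sources: dict[str, str]) -> Optional[str]:
    """Return the cited content if the reference resolves to a known source, else None."""
    return known_sources.get(reference)


def harm_score(text: str, flagged_terms: set[str]) -> float:
    """Crude keyword-based harm estimate in [0, 1]; a real system would use a classifier."""
    words = [w.strip(".,").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in flagged_terms for w in words) / len(words)


def should_trust_citation(reference: str,
                          known_sources: dict[str, str],
                          flagged_terms: set[str],
                          threshold: float = 0.1) -> bool:
    """Trust a citation only if it is authentic and its content is not flagged as harmful."""
    content = resolve_citation(reference, known_sources)
    if content is None:
        # Authenticity check failed: the citation cannot be verified.
        return False
    # Harm check: reject citations whose content looks harmful.
    return harm_score(content, flagged_terms) < threshold


if __name__ == "__main__":
    sources = {"report-42": "annual summary of benign security recommendations"}
    print(should_trust_citation("report-42", sources, {"exploit", "payload"}))  # True
    print(should_trust_citation("report-99", sources, {"exploit", "payload"}))  # False
```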