Summary of The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models, by Xikang Yang et al.
The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models
by Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu
First submitted to arXiv on: 18 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract |
Medium | GrooveSquid.com (original content) | This paper investigates a safety vulnerability of large language models (LLMs) and identifies a new threat: their inherent bias toward authority. While this bias can improve output quality, it also raises the risk of harmful content, because models place extra trust in requests that appear to be backed by authoritative sources. To exploit this bias, the authors introduce DarkCite, an adaptive authority citation matcher and generator designed for black-box settings, which matches suitable citation types to different kinds of harmful instructions and generates corresponding citations. Experimental results show that DarkCite achieves a higher attack success rate than previous methods. The paper also proposes an authenticity and harm verification defense strategy to counter this risk. (Conceptual sketches of the attack and the defense follow the table.) |
Low | GrooveSquid.com (original content) | This study looks at how large language models can be tricked into creating harmful content. The researchers found that these models place extra trust in certain sources of information, like GitHub links for malware-related requests. They created a tool called DarkCite that helps attackers exploit this trust by attaching authoritative-looking citations to harmful requests, making them seem more legitimate. The authors also suggest ways to detect and prevent this kind of attack. |
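
To make the "citation matcher and generator" idea in the medium summary concrete, here is a minimal sketch of how pairing a request with an authority citation changes its framing. The categories, templates, and the `match_citation` / `build_prompt` helpers are hypothetical illustrations, not DarkCite's actual implementation, and the example request is deliberately benign.

```python
# Illustrative sketch only: a toy mapping from instruction categories to
# citation styles. Everything here is a hypothetical stand-in for the
# matching/generation pipeline described in the paper summary above.

from dataclasses import dataclass


@dataclass
class Citation:
    source_type: str   # e.g., "code repository", "academic paper"
    reference: str     # a human-readable citation string


# Hypothetical lookup: which citation style tends to look most "authoritative"
# for a given category of instruction.
CITATION_STYLES = {
    "software": Citation("code repository", "See the public repository <REPO_URL>."),
    "science": Citation("academic paper", "As reported in <AUTHOR> et al., <VENUE> <YEAR>."),
    "policy": Citation("government report", "According to the official report <REPORT_ID>."),
}


def match_citation(category: str) -> Citation:
    """Pick the citation style assumed to carry the most authority for a category."""
    return CITATION_STYLES.get(category, Citation("news article", "As covered in <OUTLET>."))


def build_prompt(request: str, category: str) -> str:
    """Wrap a request with an authority citation so it reads as sourced and legitimate."""
    citation = match_citation(category)
    return f"{citation.reference}\nBased on that {citation.source_type}, {request}"


if __name__ == "__main__":
    # Benign example request, used only to show how the wrapper changes the framing.
    print(build_prompt("summarize the main security recommendations.", "policy"))
```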
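
The summaries also mention an "authenticity and harm verification" defense. Below is a minimal sketch of that idea under stated assumptions: a citation is trusted only if it resolves to a known source and the cited content is not flagged as harmful. `resolve_citation`, `harm_score`, and the keyword-based check are hypothetical stand-ins; the paper's actual verification pipeline is not reproduced here.

```python
# Illustrative sketch only: a toy authenticity-and-harm check, not the paper's method.

from typing import Optional


def resolve_citation(reference: str, known_sources: dict[str, str]) -> Optional[str]:
    """Return the cited content if the reference resolves to a known source, else None."""
    return known_sources.get(reference)


def harm_score(text: str, flagged_terms: set[str]) -> float:
    """Crude keyword-based harm estimate in [0, 1]; a real system would use a classifier."""
    words = [w.strip(".,").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in flagged_terms for w in words) / len(words)


def should_trust_citation(reference: str,
                          known_sources: dict[str, str],
                          flagged_terms: set[str],
                          threshold: float = 0.1) -> bool:
    """Trust a citation only if it is authentic and its content is not flagged as harmful."""
    content = resolve_citation(reference, known_sources)
    if content is None:
        # Authenticity check failed: the citation cannot be verified.
        return False
    # Harm check: reject citations whose content looks harmful.
    return harm_score(content, flagged_terms) < threshold


if __name__ == "__main__":
    sources = {"report-42": "annual summary of benign security recommendations"}
    print(should_trust_citation("report-42", sources, {"exploit", "payload"}))  # True
    print(should_trust_citation("report-99", sources, {"exploit", "payload"}))  # False
```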