The Dark Side of Trust: Authority Citation-Driven Jailbreak Attacks on Large Language Models

by Xikang Yang, Xuehai Tang, Jizhong Han, Songlin Hu

First submitted to arXiv on: 18 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates the safety vulnerabilities of large language models (LLMs) and identifies a new threat: their inherent bias towards authority. While this bias generally improves output quality, it also increases the risk of producing harmful content, because a request becomes more persuasive when paired with the authoritative source the model trusts most for that topic. To exploit this bias, the authors introduce DarkCite, an adaptive authority citation matcher and generator designed for black-box settings. Experimental results show that DarkCite achieves a higher attack success rate than previous methods. The paper also proposes an authenticity and harm verification defense strategy to counter this risk (a conceptual sketch of such a check appears below).

Low Difficulty Summary (original content by GrooveSquid.com)
This study looks at how large language models can be tricked into producing harmful content. The researchers found that these models place extra trust in certain sources of information, such as GitHub for malware development. They created a tool called DarkCite that exploits this trust by pairing harmful requests with convincing, authoritative-looking citations so they seem more trustworthy. The authors also suggest ways to detect and prevent this kind of attack.

Keywords

  • Artificial intelligence