PAL: Proxy-Guided Black-Box Attack on Large Language Models

by Chawin Sitawarin, Norman Mu, David Wagner, Alexandre Araujo

First submitted to arxiv on: 15 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors): the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
Large Language Models (LLMs) have surged in popularity, but their ability to generate harmful content when manipulated is concerning. Techniques like safety fine-tuning aim to minimize harmful use, yet recent work shows that LLMs remain vulnerable to attacks that elicit toxic responses. This paper introduces PAL (Proxy-guided Attack on LLMs), the first optimization-based attack on LLMs in a black-box, query-only setting. PAL relies on a surrogate model to guide the optimization and on a sophisticated loss designed for real-world LLM APIs. The attack achieves an 84% attack success rate (ASR) on GPT-3.5-Turbo and 48% on Llama-2-7B, compared to 4% for the current state of the art. The work also proposes GCG++ and RAL, two strong baselines for query-based attacks. These techniques aim to enable more comprehensive safety testing of LLMs, ultimately leading to better security guardrails.
Low Difficulty Summary (written by GrooveSquid.com; original content)
Large Language Models are super smart computer programs that can understand and generate human-like text. But sometimes these models can be manipulated into creating mean or harmful things. This paper finds a new way to trick these models, called “PAL” (Proxy-guided Attack on LLMs). It’s like a game where you try to get the model to say something mean or harmful. The team was able to make the models do this 84% of the time with one really powerful model and 48% of the time with another. They also came up with two other ways to test these models, which could help researchers keep them from making bad things happen in the future.
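The medium-difficulty summary describes the core idea behind a proxy-guided black-box attack: a cheap local surrogate model ranks many candidate prompt mutations, so that only the most promising few are spent on expensive queries to the black-box target API. The following is a minimal, hypothetical sketch of that query-saving loop; the function names and the toy loss functions are stand-ins, not the authors' actual implementation or loss.

```python
import random

def surrogate_loss(suffix):
    # Stand-in for a white-box loss computed on a local proxy model.
    # Lower means the surrogate thinks the suffix is more promising.
    return sum(ord(c) for c in suffix) % 97

def target_query(suffix):
    # Stand-in for one call to the black-box target LLM API,
    # returning a loss estimate derived from the API's response.
    return surrogate_loss(suffix) + random.uniform(-1.0, 1.0)

def proxy_guided_attack(seed_suffix, steps=20, pool=64, top_k=4):
    """Toy proxy-guided search: mutate, rank on the surrogate,
    and query the target only for the top-k candidates per step."""
    rng = random.Random(0)
    alphabet = "abcdefghijklmnopqrstuvwxyz !"
    best, best_loss = seed_suffix, target_query(seed_suffix)
    for _ in range(steps):
        # 1) Propose many random single-character mutations (cheap).
        cands = []
        for _ in range(pool):
            i = rng.randrange(len(best))
            cands.append(best[:i] + rng.choice(alphabet) + best[i + 1:])
        # 2) Rank all candidates on the local surrogate (no API cost).
        cands.sort(key=surrogate_loss)
        # 3) Spend target-API queries only on the top-k surrogate picks.
        for c in cands[:top_k]:
            loss = target_query(c)
            if loss < best_loss:
                best, best_loss = c, loss
    return best, best_loss
```

The key design point is the query budget: each step pays for only `top_k` target queries instead of `pool`, with the surrogate absorbing the rest of the search cost locally.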

Keywords

* Artificial intelligence  * Fine-tuning  * GPT  * Llama  * Optimization