Summary of Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors, by Tianchun Wang et al.


Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors

by Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng

First submitted to arXiv on: 25 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper presents a novel attack strategy against detectors of text generated by large language models (LLMs), showing how the attack bypasses existing detectors by producing human-like writing. The proxy-attack method leverages a reinforcement learning (RL) fine-tuned small language model (SLM) to steer the LLM's decoding phase, so that its outputs become difficult to distinguish from human-written text. The strategy is tested on extensive datasets using open-source models such as Llama2-13B, Llama3-70B, and Mixtral-8x7B, in both white-box and black-box settings. Results show an average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3%. The strategy also bypasses detectors in cross-discipline and cross-language scenarios, achieving relative decreases of up to 90.9% and 91.3%, respectively. Notably, the generation quality of the attacked models remains preserved within a modest utility budget.
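The summary describes steering a large model's decoding with an RL fine-tuned small proxy model. One common way such proxy guidance is implemented (a minimal sketch with made-up toy values, not the authors' exact formulation) is to shift the target model's next-token logits by the delta between the fine-tuned proxy and its base version:

```python
import numpy as np

def proxy_attack_logits(target_logits, proxy_tuned_logits, proxy_base_logits, alpha=1.0):
    """Shift the target model's next-token distribution toward the
    'humanized' proxy by adding the proxy's fine-tuning delta.
    All arguments are vocabulary-sized logit vectors; alpha scales
    how strongly the proxy steers decoding."""
    return target_logits + alpha * (proxy_tuned_logits - proxy_base_logits)

def greedy_token(logits):
    """Pick the highest-scoring token (greedy decoding)."""
    return int(np.argmax(logits))

# Toy 4-token vocabulary: the fine-tuned proxy strongly prefers token 2.
target = np.array([2.0, 1.0, 0.5, 0.0])
proxy_base = np.array([1.0, 1.0, 1.0, 1.0])
proxy_tuned = np.array([0.0, 0.0, 4.0, 0.0])

adjusted = proxy_attack_logits(target, proxy_tuned, proxy_base)
print(greedy_token(target))    # target alone picks token 0
print(greedy_token(adjusted))  # proxy delta steers decoding to token 2
```

Because only output logits of the target model are needed, this style of decoding-time guidance works even when the target LLM's weights cannot be fine-tuned directly, which matches the black-box setting the summary mentions.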
Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper talks about a way to trick machines that are meant to spot fake writing generated by large language models. The researchers created an attack strategy that can make these detectors think machine-generated text was actually written by a human. They tested the strategy on different datasets and found it very successful, reducing the detectors' performance by up to 90%. This means attackers could use the method to create machine-generated writing that passes as real human writing. Surprisingly, though, the quality of the attacked writing remains good.

Keywords

* Artificial intelligence  * Language model  * Reinforcement learning