Summary of Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors, by Tianchun Wang et al.
Humanizing the Machine: Proxy Attacks to Mislead LLM Detectors
by Tianchun Wang, Yuanzhou Chen, Zichuan Liu, Zhanwen Chen, Haifeng Chen, Xiang Zhang, Wei Cheng
First submitted to arXiv on: 25 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper presents a novel attack strategy against detectors of text generated by large language models (LLMs), showing that it can bypass existing detectors by producing human-like writing. The proxy-attack method steers the decoding phase with a small language model (SLM) fine-tuned via reinforcement learning (RL), so that the responses it produces are hard to distinguish from human-written text (a conceptual sketch of this decoding-time guidance appears after this table). The strategy is tested on extensive datasets using open-source models such as Llama2-13B, Llama3-70B, and Mixtral-8x7B in both white-box and black-box settings. Results show a significant average AUROC drop of 70.4% across multiple datasets, with a maximum drop of 90.3%. The strategy also bypasses detectors in cross-discipline and cross-language scenarios, achieving relative decreases of up to 90.9% and 91.3%, respectively. Notably, the generation quality of the attacked models remains preserved within a modest utility budget. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper talks about a way to trick the machines that are meant to spot writing generated by large language models. The researchers created an attack strategy that makes these detectors believe machine-generated text was actually written by a human. They tested this strategy on different datasets and found it was very successful, cutting the detectors' ability to tell human from machine writing by up to about 90% in some tests. This means attackers could use this method to create machine-generated writing that passes as real human writing. Surprisingly, the quality of the attacked writing still remains good. |
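The core idea summarized above, steering a large model's decoding with a small RL fine-tuned proxy model so the output reads as human-written, can be pictured with a toy sketch. The snippet below is not the authors' algorithm; it only illustrates one plausible form of decoding-time guidance, where the proxy's next-token scores are blended with the base model's. The stand-in scoring functions, toy vocabulary, and the blending weight `alpha` are all assumptions made for illustration.

```python
# Illustrative sketch only: NOT the paper's exact method, just one plausible way a small
# "proxy" model could nudge a large model's decoding toward human-like text.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["the", "model", "writes", "like", "a", "human", "text", "."]

def base_llm_logits(context):
    """Stand-in for the frozen large model's next-token logits."""
    return rng.normal(size=len(VOCAB))

def proxy_slm_logits(context):
    """Stand-in for the small, RL fine-tuned proxy model's logits,
    assumed to up-weight tokens that read as human-written."""
    return rng.normal(size=len(VOCAB))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def proxy_guided_decode(prompt, max_new_tokens=10, alpha=0.5):
    """Blend base and proxy logits at each decoding step (alpha is a hypothetical knob)."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        blended = (1 - alpha) * base_llm_logits(tokens) + alpha * proxy_slm_logits(tokens)
        next_tok = VOCAB[int(rng.choice(len(VOCAB), p=softmax(blended)))]
        tokens.append(next_tok)
        if next_tok == ".":
            break
    return " ".join(tokens)

print(proxy_guided_decode("the model"))
```

In a real setting the two scoring functions would come from an actual LLM and a fine-tuned SLM sharing a tokenizer; the sketch only shows where a proxy could intervene during decoding.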
Keywords
* Artificial intelligence
* Language model
* Reinforcement learning