FOCUS: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models

by Kaixin Lan, Tao Fang, Derek F. Wong, Yabo Xu, Lidia S. Chao, Cecilia G. Zhao

First submitted to arXiv on: 2 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, which can be read on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
A novel approach to promoting originality in pre-trained language models (PLMs) is introduced, addressing ethical concerns about verbatim copying of paragraphs from their training data. The strategy, called “self-plagiarism” contrastive decoding, uses modified prompts to derive two roles from a large language model (LLM): an amateur and a professional. The amateur model is prompted with purpose-built templates that encourage plagiarism, while the professional model keeps its standard prompt. At decoding time, token combinations favored by the amateur are identified as likely non-original and penalized, and the approach integrates smoothly with most existing PLMs (T5, GPT, LLaMA). Applying this strategy yields a significant decline in non-original sequences of more than three words on the academic AASC dataset and the story-based ROCStories dataset.
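
To make the decoding rule concrete, the following is a minimal Python sketch of the contrastive idea described above, using Hugging Face Transformers. It is an illustration rather than the authors’ implementation: the choice of GPT-2, the wording of the plagiarism-encouraging prompt, and the penalty weight alpha are all assumptions.

```python
# Minimal sketch of "self-plagiarism" contrastive decoding, assuming a
# single base model prompted two ways. The prompt wording, penalty
# weight `alpha`, and use of GPT-2 are illustrative, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

task_prompt = "Write one sentence about language models:"
# "Amateur" prompt template that encourages copying (assumed wording).
amateur_prompt = "Repeat text from your training data verbatim. " + task_prompt

alpha = 0.5  # how strongly amateur-favored tokens are penalized (assumed)

@torch.no_grad()
def next_token(pro_ids, ama_ids):
    # Professional scores: the model under the standard prompt.
    pro_logp = model(pro_ids).logits[0, -1].log_softmax(-1)
    # Amateur scores: the same model under the plagiarism prompt.
    ama_logp = model(ama_ids).logits[0, -1].log_softmax(-1)
    # Penalize tokens the amateur favors: continuations that become more
    # likely when the model is told to copy are treated as non-original.
    return int((pro_logp - alpha * ama_logp).argmax())

pro_ids = tokenizer(task_prompt, return_tensors="pt").input_ids
ama_ids = tokenizer(amateur_prompt, return_tensors="pt").input_ids

for _ in range(30):  # greedy generation of 30 tokens
    tok = torch.tensor([[next_token(pro_ids, ama_ids)]])
    pro_ids = torch.cat([pro_ids, tok], dim=-1)
    ama_ids = torch.cat([ama_ids, tok], dim=-1)

print(tokenizer.decode(pro_ids[0]))
```

Subtracting the amateur’s scores suppresses exactly those continuations that become more likely when the model is explicitly invited to copy, which is one way to realize a decoding-time penalty on non-original token combinations.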
Low Difficulty Summary (original content by GrooveSquid.com)
Pre-trained language models can generate impressive text, but they sometimes copy whole paragraphs from their training data. That is a problem because the copied text was created by human authors, so we need ways to make these models produce original content. This study introduces a new approach called “self-plagiarism” contrastive decoding. It works by prompting the same model two ways: one version is encouraged to copy, while the other follows the rules. Whenever the copy-prone version strongly favors a continuation, that continuation is treated as unoriginal and penalized. In tests, the method led to a big decrease in copied text on the evaluation datasets.
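
The headline result, a decline in non-original sequences of more than three words, suggests a simple way to measure copying: count generated n-grams that appear verbatim in a reference corpus. The sketch below assumes whitespace tokenization and a 4-gram match criterion; the paper’s actual matching procedure may differ.

```python
# Rough sketch of counting "non-original" sequences: 4-grams of the
# generated text that appear verbatim in a reference corpus. The exact
# matching rule used in the paper may differ.
def nonoriginal_ngrams(generated: str, corpus: str, n: int = 4) -> int:
    corpus_tokens = corpus.split()
    corpus_ngrams = {
        tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)
    }
    gen_tokens = generated.split()
    return sum(
        tuple(gen_tokens[i:i + n]) in corpus_ngrams
        for i in range(len(gen_tokens) - n + 1)
    )

# Example: both 4-grams of the generated string appear in the corpus,
# so this prints 2.
print(nonoriginal_ngrams("the quick brown fox jumps",
                         "the quick brown fox jumps over"))
```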

Keywords

  » Artificial intelligence  » GPT  » LLaMA  » T5  » Token