Summary of "LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison", by Devam Mondal et al.
LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison
by Devam Mondal, Carlo Lipizzi
First submitted to arXiv on: 2 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed system assesses whether a given knowledge source was used to train or fine-tune a large language model (LLM). Unlike existing methods, it uses Resource Description Framework (RDF) triples to build knowledge graphs from both the source document and the LLM's continuation of it. The two graphs are then compared for content similarity using cosine similarity and for structural similarity using a normalized graph edit distance. By focusing on the relationships between ideas and how they are organized, this approach enables an accurate evaluation of how similar the continuation is to the source document. Because the system does not require access to internal LLM metrics such as perplexity, it remains applicable even to closed, "black-box" LLMs. A prototype is available on GitHub. |
| Low | GrooveSquid.com (original content) | This paper proposes a new way to check whether large language models (LLMs) copy from other sources or documents. The authors use a special data format (RDF triples) to build diagrams of both the original document and the LLM's continuation. They then compare these diagrams to see how similar they are, which helps reveal whether the LLM copied ideas or their organization from the original source. The method doesn't need access to the LLM's internal information, making it useful even for "black-box" systems. |
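The comparison described above can be sketched in a few lines of Python. This is not the authors' implementation: the triples are made up for illustration, content similarity is cosine similarity over a simple bag of triple terms, and the structural score is a crude set-difference stand-in for the paper's normalized graph edit distance (true graph edit distance is much more expensive to compute).

```python
from collections import Counter
import math

def triples_to_graph(triples):
    """Treat subjects/objects as nodes and whole triples as labeled edges."""
    nodes, edges = set(), set()
    for s, p, o in triples:
        nodes.update((s, o))
        edges.add((s, p, o))
    return nodes, edges

def content_similarity(triples_a, triples_b):
    """Cosine similarity over term counts drawn from the two triple sets."""
    ca = Counter(term for triple in triples_a for term in triple)
    cb = Counter(term for triple in triples_b for term in triple)
    dot = sum(ca[k] * cb[k] for k in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def structural_distance(triples_a, triples_b):
    """Simplified stand-in for normalized graph edit distance:
    symmetric difference of node and edge sets, scaled by union size
    (0.0 = identical graphs, 1.0 = completely disjoint)."""
    nodes_a, edges_a = triples_to_graph(triples_a)
    nodes_b, edges_b = triples_to_graph(triples_b)
    node_diff = len(nodes_a ^ nodes_b) / max(len(nodes_a | nodes_b), 1)
    edge_diff = len(edges_a ^ edges_b) / max(len(edges_a | edges_b), 1)
    return (node_diff + edge_diff) / 2

# Hypothetical triples extracted from a source document and an LLM continuation
source = [("LLM", "trained_on", "corpus"), ("corpus", "contains", "document")]
continuation = [("LLM", "trained_on", "corpus"), ("corpus", "contains", "article")]

print(content_similarity(source, continuation))   # high -> overlapping content
print(structural_distance(source, continuation))  # low  -> similar structure
```

A high content similarity combined with a low structural distance would, in the spirit of the paper, suggest that the source document's ideas and their organization reappear in the model's output.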
Keywords
» Artificial intelligence » Cosine similarity » Perplexity