Summary of "LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison", by Devam Mondal et al.
LLMs Plagiarize: Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison
by Devam Mondal, Carlo Lipizzi
First submitted to arXiv on: 2 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed system assesses whether a given knowledge source was used to train or fine-tune a large language model (LLM). Unlike existing methods, it uses Resource Description Framework (RDF) triples to build knowledge graphs from both the source document and the LLM's continuation of it. The two graphs are then compared for content similarity using cosine similarity and for structural similarity using a normalized graph edit distance. By focusing on the relationships between ideas and how they are organized, this approach enables an accurate evaluation of how similar the continuation is to the source document. Because the system does not require access to internal LLM metrics such as perplexity, it remains applicable even to closed, "black-box" LLMs. A prototype is available on GitHub. |
| Low | GrooveSquid.com (original content) | This paper proposes a new way to check whether large language models (LLMs) copy from other sources or documents. The authors use a special data format (RDF triples) to build diagrams of both the original document and the LLM's continuation. They then compare these diagrams to see how similar they are, which helps reveal whether the LLM copied ideas or their organization from the original source. The method doesn't need access to the LLM's internal information, making it useful even for "black-box" systems. |
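The comparison described above can be sketched in a few lines of Python. This is not the authors' implementation: the triples are made up for illustration, content similarity is cosine similarity over a simple bag of triple terms, and the structural score is a crude set-difference stand-in for the paper's normalized graph edit distance (true graph edit distance is much more expensive to compute).

```python
from collections import Counter
import math

def triples_to_graph(triples):
    """Treat subjects/objects as nodes and whole triples as labeled edges."""
    nodes, edges = set(), set()
    for s, p, o in triples:
        nodes.update((s, o))
        edges.add((s, p, o))
    return nodes, edges

def content_similarity(triples_a, triples_b):
    """Cosine similarity over term counts drawn from the two triple sets."""
    ca = Counter(term for triple in triples_a for term in triple)
    cb = Counter(term for triple in triples_b for term in triple)
    dot = sum(ca[k] * cb[k] for k in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def structural_distance(triples_a, triples_b):
    """Simplified stand-in for normalized graph edit distance:
    symmetric difference of node and edge sets, scaled by union size
    (0.0 = identical graphs, 1.0 = completely disjoint)."""
    nodes_a, edges_a = triples_to_graph(triples_a)
    nodes_b, edges_b = triples_to_graph(triples_b)
    node_diff = len(nodes_a ^ nodes_b) / max(len(nodes_a | nodes_b), 1)
    edge_diff = len(edges_a ^ edges_b) / max(len(edges_a | edges_b), 1)
    return (node_diff + edge_diff) / 2

# Hypothetical triples extracted from a source document and an LLM continuation
source = [("LLM", "trained_on", "corpus"), ("corpus", "contains", "document")]
continuation = [("LLM", "trained_on", "corpus"), ("corpus", "contains", "article")]

print(content_similarity(source, continuation))   # high -> overlapping content
print(structural_distance(source, continuation))  # low  -> similar structure
```

A high content similarity combined with a low structural distance would, in the spirit of the paper, suggest that the source document's ideas and their organization reappear in the model's output.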
Keywords
» Artificial intelligence » Cosine similarity » Perplexity