Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

by Sebastian Bordt, Harsha Nori, Vanessa Rodrigues, Besmira Nushi, Rich Caruana

First submitted to arXiv on: 9 Apr 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
The high difficulty version is the paper’s original abstract, which can be read on arXiv.

Medium Difficulty Summary (GrooveSquid.com, original content)
This research paper investigates the application of Large Language Models (LLMs) to tabular data, focusing on data contamination and memorization. The authors introduce several techniques to assess whether an LLM has seen a tabular dataset during training, revealing that many popular datasets are memorized verbatim. Comparing few-shot learning performance on datasets seen versus not seen during training, the study shows that memorization leads to overfitting, while also highlighting that LLMs are surprisingly robust to data transformations. The authors further examine in-context statistical learning abilities, finding that while LLMs perform better than random, their sample efficiency lags behind traditional algorithms as problem dimension increases. The results emphasize the importance of testing whether an LLM has seen an evaluation dataset during pre-training.
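One of the memorization probes described above can be sketched in a few lines: show the model consecutive rows of a well-known CSV file and check whether it can reproduce the next row verbatim. The sketch below is a minimal illustration, not the paper's exact protocol; the `openai` Python client usage, the model name `gpt-4o`, the prompt wording, and the trial counts are all our assumptions.

```python
# Minimal sketch of a "row completion" memorization probe, assuming the
# official `openai` Python client (>= 1.0) and an OPENAI_API_KEY in the
# environment. Model name, prompt format, and counts are illustrative.
from openai import OpenAI


def row_completion_rate(client, csv_path, model="gpt-4o",
                        n_context=10, n_trials=5):
    """Ask the model to continue verbatim rows of a CSV file.

    A high rate of exact-match completions suggests the dataset was
    seen (memorized) during pre-training.
    """
    with open(csv_path) as f:
        lines = f.read().splitlines()
    header, data = lines[0], lines[1:]

    hits = 0
    for i in range(n_trials):
        context = data[i : i + n_context]   # rows shown to the model
        target = data[i + n_context]        # the row it must reproduce
        prompt = (
            "Below are consecutive rows of a CSV file. "
            "Output the next row verbatim, and nothing else.\n"
            + header + "\n" + "\n".join(context)
        )
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        if resp.choices[0].message.content.strip() == target:
            hits += 1
    return hits / n_trials


client = OpenAI()
print(f"verbatim completion rate: {row_completion_rate(client, 'iris.csv'):.0%}")
```

A rate far above chance on a public dataset (here, a hypothetical local `iris.csv`) would be evidence of contamination, since the exact row values could not otherwise be predicted from ten neighboring rows.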
Low Difficulty Summary (GrooveSquid.com, original content)
This paper looks at how Large Language Models can be used to work with tables of data. However, researchers have found that these models often memorize entire datasets instead of just learning from them. The authors try to figure out why this happens and whether it affects the performance of these models. They test several methods to check whether a model has seen certain datasets before, and find that many popular datasets are memorized word-for-word. The study also looks at how well these models do when given a small amount of new data to learn from. It turns out that while they are good at learning some things, they can overfit: they get too good at the specific examples they have seen and don't generalize well. Overall, this paper highlights the importance of checking whether an LLM has seen certain data before evaluating it.
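To make the "small amount of new data" setup concrete, the sketch below shows one way to serialize a handful of labeled table rows plus an unlabeled query into a few-shot prompt. The feature names, label column, and prompt wording are illustrative assumptions, not the paper's exact serialization format.

```python
# Minimal sketch of few-shot prompting on tabular data. Feature names,
# the label column, and the prompt wording are illustrative assumptions.
def few_shot_prompt(examples, query, label_name="species"):
    """Serialize labeled rows plus one unlabeled query into a prompt.

    `examples` is a list of (features_dict, label) pairs; `query` is a
    features_dict. The model is expected to answer with the label only.
    """
    lines = ["Classify each row. Answer with the label only.", ""]
    for features, label in examples:
        feats = ", ".join(f"{k} = {v}" for k, v in features.items())
        lines.append(f"{feats} -> {label_name}: {label}")
    feats = ", ".join(f"{k} = {v}" for k, v in query.items())
    lines.append(f"{feats} -> {label_name}:")
    return "\n".join(lines)


examples = [
    ({"sepal_length": 5.1, "petal_length": 1.4}, "setosa"),
    ({"sepal_length": 6.7, "petal_length": 4.7}, "versicolor"),
]
print(few_shot_prompt(examples, {"sepal_length": 5.0, "petal_length": 1.5}))
```

The paper's point is that good accuracy under this kind of prompt is only meaningful if the rows were not already memorized during pre-training.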

Keywords

  • Artificial intelligence
  • Few-shot
  • Overfitting