Summary of Language Model-Driven Data Pruning Enables Efficient Active Learning, by Abdul Hameed Azeemi et al.
Language Model-Driven Data Pruning Enables Efficient Active Learning
by Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
First submitted to arXiv on 5 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract; read it on arXiv. |
Medium | GrooveSquid.com (original content) | ActivePrune is a novel plug-and-play strategy for pruning unlabeled data, introduced to make data labeling in active learning (AL) more efficient. It leverages language models to prune large unlabeled pools, cutting the high computational cost of AL and making it applicable to larger datasets. The method prunes in two stages: a fast pass using perplexity scores from an n-gram language model, followed by data-quality metrics computed with a quantized LLM (see the sketch after the table). A perplexity-reweighting method is also proposed to enhance diversity in the unlabeled pool. Experiments on four diverse datasets, across multiple active learning strategies, show that ActivePrune outperforms existing data pruning methods and reduces the end-to-end time required for AL by up to 74%. |
Low | GrooveSquid.com (original content) | ActivePrune is a new way to help computers learn more efficiently from labeled data. It uses language models to pick the most useful examples out of a large dataset before they are labeled. This makes it faster than other methods, which can take a long time on big datasets. It works by first quickly checking whether an example is likely to be useful with an n-gram language model, then doing a more careful check with quality scores from a quantized LLM. This helps ensure the chosen examples are diverse and relevant. By cutting the time it takes to label data, ActivePrune helps computers learn faster and make better decisions. |
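To make the two-stage process concrete, here is a minimal Python sketch of the pipeline the summaries describe. It is an illustration under stated assumptions, not the authors' implementation: `ngram_perplexity`, `llm_quality`, `keep_stage1`, `keep_stage2`, and `diversity_weight` are hypothetical placeholders for the n-gram scorer, the quantized-LLM quality metric, the stage budgets, and the perplexity-reweighting rule, whose exact forms the summary does not specify.

```python
import heapq
from typing import Callable, Sequence


def active_prune_sketch(
    pool: Sequence[str],
    ngram_perplexity: Callable[[str], float],  # hypothetical: fast n-gram LM scorer (e.g. a KenLM model)
    llm_quality: Callable[[str], float],       # hypothetical: quality metric from a quantized LLM
    keep_stage1: int,
    keep_stage2: int,
    diversity_weight: float = 0.5,             # hypothetical knob standing in for the paper's reweighting
) -> list[str]:
    """Two-stage pruning sketch: a cheap perplexity filter over the whole
    pool, then a more expensive LLM-based quality check on survivors only."""
    # Stage 1: score every instance with the fast n-gram language model
    # and keep the keep_stage1 lowest-perplexity candidates.
    survivors = heapq.nsmallest(keep_stage1, pool, key=ngram_perplexity)

    # Stage 2: re-rank survivors with the slower quantized-LLM quality
    # score, mixing perplexity back in so that higher-perplexity (more
    # unusual) examples are not pruned away entirely. The paper's actual
    # reweighting rule may differ; this only illustrates the idea.
    def combined_score(text: str) -> float:
        return llm_quality(text) + diversity_weight * ngram_perplexity(text)

    return heapq.nlargest(keep_stage2, survivors, key=combined_score)
```

The staging is where the claimed speedup comes from: the cheap n-gram pass touches every instance in the pool, while the expensive quantized-LLM pass runs only on the much smaller surviving subset.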
Keywords
» Artificial intelligence » Active learning » Data labeling » Language model » N-gram » Perplexity » Pruning