


Challenges learning from imbalanced data using tree-based models: Prevalence estimates systematically depend on hyperparameters and can be upwardly biased

by Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford

First submitted to arXiv on: 17 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

This version is the paper's original abstract.

Medium Difficulty Summary (GrooveSquid.com original content)

A common practice when learning from imbalanced binary classification data is to subsample the majority class so that the model is trained on a more balanced dataset. Because the training data then differ from the data-generating process the model faces at prediction time, the resulting predictions are biased. A known remedy is to analytically map predictions back to the original scale using the sampling rate applied to the majority class. The paper shows that, while this mapping can be effective for some models, it has unintended consequences for random forests: the corrected prevalence estimates are upwardly biased, and the size of the bias depends systematically on the sampling rate and on the number of predictors considered at each split.

Low Difficulty Summary (GrooveSquid.com original content)

The paper looks at a common problem in machine learning: imbalanced data can hurt model performance. A frequent fix is to undersample the majority class, but this biases the model's predictions. One way to correct for the bias is to map the predictions back to the original scale using the sampling rate. The authors found that this correction works well for some models, but not for random forests. Surprisingly, decision trees turn out to be biased toward the minority class, not the majority class.

Keywords

  • Artificial intelligence
  • Classification
  • Machine learning