Cluster Metric Sensitivity to Irrelevant Features
by Miles McCrory, Spencer A. Thomas
First submitted to arXiv on: 19 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper investigates the impact of noisy, uncorrelated variables on clustering performance with the k-means algorithm. The authors demonstrate that different types of irrelevant features affect the outcome of a clustering result in distinct ways. When the irrelevant features are Gaussian-distributed, the Adjusted Rand Index (ARI) and Normalised Mutual Information (NMI) remain resilient even at high proportions of noise. For uniformly distributed irrelevant features, however, resilience depends on the dimensionality of the data, with tipping points between high scores and near zero. The Silhouette Coefficient and the Davies-Bouldin score prove particularly sensitive to added irrelevant features, making them suitable candidates for optimising feature selection in unsupervised clustering tasks. |
Low | GrooveSquid.com (original content) | This research looks at how adding extra “noise” variables to a dataset affects k-means clustering. The scientists discovered that different types of noise have different effects on the results. When the noise is similar to the real data, the clustering method stays effective even with lots of noise. But if the noise is very different from the real data, the method becomes less accurate and more sensitive to changes. This matters because it means we need better ways to choose which features are important in unsupervised clustering tasks. |
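The experiment the summaries describe can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the dataset, noise levels, and random seeds are assumptions. It clusters synthetic blob data with k-means, appends Gaussian "irrelevant" noise features, and tracks how the external metrics (ARI, NMI) and internal metrics (Silhouette Coefficient, Davies-Bouldin score) respond as the number of noise features grows.

```python
# Hedged sketch of the setup described above (assumed parameters, not the paper's):
# cluster blob data with k-means, add Gaussian noise features, compare metrics.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(0)
# Three well-separated clusters in two informative dimensions.
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

results = {}
for n_noise in (0, 2, 10):
    # Irrelevant features: Gaussian noise with no cluster structure.
    noise = rng.normal(size=(X.shape[0], n_noise))
    X_aug = np.hstack([X, noise]) if n_noise else X
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aug)
    results[n_noise] = {
        "ari": adjusted_rand_score(y, labels),          # external metric
        "nmi": normalized_mutual_info_score(y, labels),  # external metric
        "sil": silhouette_score(X_aug, labels),          # internal metric
        "db": davies_bouldin_score(X_aug, labels),       # internal metric
    }

for n_noise, scores in results.items():
    print(n_noise, {k: round(v, 3) for k, v in scores.items()})
```

Comparing the rows as `n_noise` grows shows the pattern the paper reports: the external scores stay comparatively stable under Gaussian noise, while the internal scores, computed on the augmented feature space, react strongly to the added irrelevant dimensions.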
Keywords
* Artificial intelligence * Clustering * Feature selection * K-means * Unsupervised