Risk and cross validation in ridge regression with correlated samples

by Alexander Atanasov, Jacob A. Zavatone-Veth, Cengiz Pehlevan

First submitted to arXiv on: 8 Aug 2024

Categories

  • Main: Machine Learning (stat.ML)
  • Secondary: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
The high difficulty version is the paper’s original abstract, which can be read on arXiv.

Medium Difficulty Summary (GrooveSquid.com, original content)
The theory of ridge regression has advanced significantly in recent years, but existing analyses assume that training examples are statistically independent. This paper leverages techniques from random matrix theory and free probability to provide sharp asymptotics for the in-sample and out-of-sample risks when the data points have arbitrary correlations. The generalized cross validation estimator (GCV) is shown to fail to correctly predict the out-of-sample risk in this setting. However, a modified estimator, dubbed CorrGCV, yields an efficiently computable, unbiased estimate that concentrates in the high-dimensional limit. The modification can be extended to test points that have nontrivial correlations with the training set, as often encountered in time series forecasting. Assuming knowledge of the correlation structure, this extension sharply characterizes the degree to which such test points yield an overly optimistic prediction of long-time risk. The theory is validated on a variety of high-dimensional data.
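
For context, the estimator in question is the classical GCV formula, stated here in our own notation as a standard textbook reference (the paper’s CorrGCV correction itself is not reproduced here):

\[
\mathrm{GCV}(\lambda) \;=\; \frac{\tfrac{1}{n}\,\lVert (I - S_\lambda)\, y \rVert^{2}}{\bigl(1 - \tfrac{1}{n}\operatorname{tr} S_\lambda\bigr)^{2}},
\qquad
S_\lambda = X \bigl(X^{\top} X + \lambda I\bigr)^{-1} X^{\top}.
\]

The denominator corrects the training error for the effective degrees of freedom \(\operatorname{tr} S_\lambda\), a correction that implicitly assumes i.i.d. samples. That is exactly the assumption the paper relaxes: CorrGCV replaces this factor with a correlation-aware one, whose precise form is given in the paper.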
Low Difficulty Summary (GrooveSquid.com, original content)
Ridge regression has made big progress lately, but current theories assume that training examples are independent. This paper uses new mathematical techniques to understand what happens when data points are related to each other. The authors show that a popular way to estimate how well a model will do on unseen data (GCV) often fails in this situation, so they develop a new method called CorrGCV that gives accurate predictions. The new method is especially useful for time series forecasting, where data points are naturally related.
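
To make the failure mode concrete, here is a minimal simulation sketch (our own illustration, not the authors’ code), assuming an AR(1) correlation structure across samples: it fits ridge regression on correlated data, computes the standard GCV estimate, and compares it against the risk measured on independent test points.

```python
# Sketch only: standard GCV vs. true out-of-sample risk when training
# samples are correlated. The AR(1) structure and all names here are our
# illustrative assumptions, not the paper's setup or code.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, rho = 200, 100, 1.0, 0.9  # samples, dims, ridge penalty, AR(1) corr.

# Cholesky factor of an AR(1) sample-sample correlation matrix C_ij = rho^|i-j|.
C = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(C)

w_star = rng.normal(size=d) / np.sqrt(d)       # ground-truth weights
X = L @ rng.normal(size=(n, d))                # inputs correlated across samples
y = X @ w_star + 0.1 * L @ rng.normal(size=n)  # noise with the same correlations

# Ridge fit and hat matrix S = X (X^T X + lam I)^{-1} X^T.
G = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
w_hat = G @ y
S = X @ G

# Standard GCV estimate of out-of-sample risk (derived for i.i.d. samples).
resid = y - X @ w_hat
gcv = np.mean(resid**2) / (1.0 - np.trace(S) / n) ** 2

# True out-of-sample risk on fresh, independent test points.
X_test = rng.normal(size=(10_000, d))
y_test = X_test @ w_star + 0.1 * rng.normal(size=10_000)
test_risk = np.mean((y_test - X_test @ w_hat) ** 2)

print(f"GCV estimate: {gcv:.4f}")
print(f"Test risk:    {test_risk:.4f}")  # the gap is the bias GCV incurs
```

With rho = 0 the two printed numbers agree closely, while increasing rho opens a gap between them; that gap is the bias the paper characterizes and that CorrGCV is designed to remove.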

Keywords

» Artificial intelligence  » Probability  » Regression  » Time series