290 points | sebg | 1 comment
tylerneylon ◴[] No.41890745[source]
Here's some context and a partial summary (youoy also has a nice summary) --

Context:

A random forest is an ML model that can be trained to predict an output value based on a list of input features: eg, predicting a house's value based on square footage, location, etc. This paper focuses on regression models, meaning the output value is a real number (or a vector thereof). Classical ML theory suggests that models with many learned parameters are more likely to overfit the training data, meaning that when you predict an output for a test (non-training) input, the predicted value is less likely to be correct because the model is not generalizing well (it does well on training data, but not on test data - aka, it has memorized, but not understood).
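
To make the "many parameters, yet little overfitting" point concrete, here's a quick scikit-learn sketch (mine, not from the paper; the synthetic data just stands in for the house-price example): a forest of hundreds of fully grown trees usually keeps its test score close to its training score.

    # Toy check (not from the paper): lots of learned parameters, yet the
    # test score usually stays reasonably close to the training score.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Hundreds of fully grown trees -> a huge number of learned split parameters.
    rf = RandomForestRegressor(n_estimators=500, max_depth=None, random_state=0)
    rf.fit(X_tr, y_tr)

    print("train R^2:", rf.score(X_tr, y_tr))  # close to 1.0
    print("test  R^2:", rf.score(X_te, y_te))  # typically still high: no dramatic overfit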

A historical surprise is that random forests can have many parameters yet tend not to overfit. This paper explores that surprise.

What the paper says:

The perspective of the paper is to see random forests (and related models) as _smoothers_, which is a kind of model that essentially memorizes the training data and then makes predictions by combining training output values that are relevant to the prediction-time (new) input values. For example, k-nearest neighbors is a simple kind of smoother. A single decision tree counts as a smoother because each final/leaf node in the tree predicts a value based on combining training outputs that could possibly reach that node. The same can be said for forests.
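
To make the smoother view concrete, here's a small sketch (my own illustration, not code from the paper): a fitted regression tree's prediction at a new point is literally the average of the training targets that share the query's leaf.

    # Sketch: a decision tree as a smoother. Its prediction at a new point is
    # the mean of the training targets falling in the same leaf as that point.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
    tree = DecisionTreeRegressor(min_samples_leaf=10, random_state=0).fit(X, y)

    x_new = np.random.RandomState(1).normal(size=(1, 5))  # a new (non-training) input
    leaf_of_query = tree.apply(x_new)[0]   # leaf the query lands in
    leaf_of_train = tree.apply(X)          # leaf of every training point

    # Smoother weights: uniform over training points sharing the query's leaf.
    in_same_leaf = leaf_of_train == leaf_of_query
    print(y[in_same_leaf].mean())          # "smoother" view of the prediction
    print(tree.predict(x_new)[0])          # the tree's own prediction: same value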

So the authors see a random forest as a way to use a subset of training data and a subset of (or set of weights on) training features to provide an averaged output. While a single decision tree can overfit (become "spiky") because some leaf nodes can be based on single training examples, a forest gives a smoother prediction function since it averages across many trees, and often the other trees won't be spiky for the same input (their leaf nodes may be based on many training points, not a single one).
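
One quick way to see the smoothing (again a sketch of mine, not from the paper): on noisy 1-D data, measure how much the fitted function jumps around on a dense grid. The single fully grown tree tracks the noise; the forest's average is usually visibly calmer.

    # Sketch: a single deep tree is "spiky" (it tracks the noise); averaging
    # many bootstrapped trees gives a smoother prediction function.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
    y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)

    tree = DecisionTreeRegressor(random_state=0).fit(X, y)        # leaves can hold one point
    forest = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)

    grid = np.linspace(0, 10, 2000).reshape(-1, 1)

    def total_variation(pred):
        return np.abs(np.diff(pred)).sum()   # how much the function wiggles on the grid

    print("single tree:", total_variation(tree.predict(grid)))
    print("forest:     ", total_variation(forest.predict(grid)))  # usually noticeably lower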

Finally, the authors refer to random forests as _adaptive smoothers_ to point out that random forests become even better at smoothing in locations in the input space that either have high variation (intuitively, that have a higher slope), or that are far from the training data. The word "adaptive" indicates that the predicted function changes behavior based on the nature of the data — eg, with k-NN, an adaptive version might increase the value of k at some places in the input space.
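
For intuition on "adaptive", here's a toy adaptive k-NN smoother (purely illustrative; the function and its k_near/k_far knobs are my own invention, not anything from the paper) that switches to a larger k when the query is far from the training data.

    # Toy adaptive smoother: a k-NN regressor whose k grows when the query sits
    # far from the training data, so far-away queries are averaged more heavily.
    import numpy as np

    def adaptive_knn_predict(X_train, y_train, x, k_near=3, k_far=25):
        dists = np.linalg.norm(X_train - x, axis=1)
        # Typical nearest-neighbor spacing inside the training set.
        nn_scale = np.median([np.partition(np.linalg.norm(X_train - p, axis=1), 1)[1]
                              for p in X_train])
        k = k_near if dists.min() < 3 * nn_scale else k_far   # adapt k to the query
        nearest = np.argsort(dists)[:k]
        return y_train[nearest].mean()

    rng = np.random.RandomState(0)
    X_train = rng.uniform(0, 10, size=(100, 1))
    y_train = np.sin(X_train.ravel()) + rng.normal(scale=0.2, size=100)

    print(adaptive_knn_predict(X_train, y_train, np.array([5.0])))   # in-range: small k
    print(adaptive_knn_predict(X_train, y_train, np.array([50.0])))  # far away: large k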

The way random forests act adaptively is that (a) the prediction function naturally has finer resolution (can change value more quickly) in areas of high variability, because those locations will have more leaf nodes, and (b) the prediction function is typically a combination of a wider variety of possible values when the input is far from the training data, because in that case the trees are likely to provide a variety of output values. Both of these help avoid overfitting to training data and generalize better to new inputs.
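
You can actually recover those smoother weights from a fitted forest. A sketch (mine; it assumes bootstrap=False purely so the reconstruction matches the forest's prediction exactly, and smoother_weights is my own helper name): a training point's weight is its average share of the query's leaf across trees, and the weighted average of the training targets reproduces the forest's output.

    # Sketch of the smoother view of a forest: prediction(x) = sum_i w_i(x) * y_i,
    # where w_i(x) reflects how often training point i shares x's leaf across trees.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(300, 5))
    y = np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.2, size=300)

    forest = RandomForestRegressor(
        n_estimators=100, max_features=2, bootstrap=False, random_state=0
    ).fit(X, y)

    def smoother_weights(forest, X_train, x):
        leaves_train = forest.apply(X_train)              # (n_train, n_trees) leaf ids
        leaves_query = forest.apply(x.reshape(1, -1))[0]  # the query's leaf in each tree
        w = np.zeros(len(X_train))
        for t, leaf in enumerate(leaves_query):
            in_leaf = leaves_train[:, t] == leaf
            w[in_leaf] += 1.0 / in_leaf.sum()             # each tree averages over its leaf
        return w / len(leaves_query)                      # the forest averages the trees

    for name, x in [("in-range", np.full(5, 5.0)), ("far away", np.full(5, 30.0))]:
        w = smoother_weights(forest, X, x)
        print(name,
              "weighted avg:", np.dot(w, y),                      # equals the forest's prediction
              "forest.predict:", forest.predict(x.reshape(1, -1))[0],
              "effective #points:", 1.0 / np.sum(w ** 2))         # how spread out the weights are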

Disclaimer: I did not carefully read the paper; this is my quick understanding.

replies(1): >>41891012 #
1. abetusk ◴[] No.41891012[source]
I think this is specifically coming to terms with an insight taught to statisticians as the bias-variance tradeoff.

From my understanding, in a statistical setting, low bias tends to come with high variance, and low variance tends to come with high bias. The example I saw was with K-means, where K = N gives high variance (the predicted cluster center is highly variable across datasets) but low bias (take an input point from the training set and you get that exact point back), vs. K = 1, which gives low variance (there's only one cluster) but high bias (the input point can be far from the cluster center/representative point).

I'm not sure I've characterized it well but there's a Twitter post from Alicia Curth that explains it [0] as well as a paper that goes into it [1].
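
As a rough numerical sketch of that K-means example (my own toy construction, not something from the thread or the linked paper): fit K-means on many resampled training sets and, for a fixed query point, look at the nearest cluster center. With K = N it stays close to the query (low bias) but moves around across datasets (higher variance); with K = 1 it barely moves (low variance) but sits far from the query (high bias).

    # Toy bias/variance check for K-means with K = 1 vs. K = N (illustrative only).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    query = np.array([2.0, 2.0])   # a fixed input point, away from the data mean

    def nearest_center(k, X):
        centers = KMeans(n_clusters=k, n_init=1, random_state=0).fit(X).cluster_centers_
        return centers[np.argmin(np.linalg.norm(centers - query, axis=1))]

    for label in ("K=1", "K=N"):
        reps = []
        for _ in range(50):                   # many resampled training sets
            X = rng.normal(size=(100, 2))     # data centered at the origin
            k = 1 if label == "K=1" else len(X)
            reps.append(nearest_center(k, X))
        reps = np.array(reps)
        bias = np.linalg.norm(reps.mean(axis=0) - query)  # how far the average answer is from the query
        variance = reps.var(axis=0).sum()                 # how much the answer jumps across datasets
        print(label, "bias:", round(bias, 3), "variance:", round(variance, 3))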

[0] https://x.com/AliciaCurth/status/1841817856142348529

[1] https://arxiv.org/abs/2409.18842