youoy No.41890415
Nice article, although I find it a bit overly complex if you are not familiar with both ML and mathematics.

I will leave here a geometrical intuition of why they work in case it can help someone:

To simplify things, I will talk about regression, and say we want to predict some value y that we know depends on x, and that we have some noisy measurements of y:

y = f(x)
y_observed = f(x) + noise

If you want some mental image, think about f(x)=sin(x).

Now, an (overfitted) regression tree in this case is just a step function whose value at x is y_observed. If there is no noise, we know that by taking more measurements we can approximate y with as much precision as we want. But if there is noise, the regression tree will overfit the noisy values, creating artificial spikes.
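If you want to see this concretely, here is a minimal sketch (assuming numpy and scikit-learn; the noise level and sample size are made up for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 200))
    y_observed = np.sin(x) + rng.normal(scale=0.3, size=x.shape)

    # A fully grown tree is a step function that passes through every
    # noisy observation, so it reproduces the noise as artificial spikes.
    tree = DecisionTreeRegressor()          # no depth limit -> overfits
    tree.fit(x.reshape(-1, 1), y_observed)
    x_grid = np.linspace(0, 2 * np.pi, 1000).reshape(-1, 1)
    y_single_tree = tree.predict(x_grid)    # jagged approximation of sin(x)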

If you want to avoid this overfitting, you resample the data many times, build a regression tree for each sample, and then average them. Each tree will contain its own particular artificial spikes, and since those come from noise, they won't appear in the majority of the other trees. So when you average, the spikes attenuate, creating the smoother behaviour the article talks about.
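Continuing the sketch above, the averaging by hand looks roughly like this (purely illustrative; sklearn's BaggingRegressor or RandomForestRegressor does the same thing for you):

    from sklearn.utils import resample

    # Fit one deep tree per bootstrap sample and average the predictions;
    # each tree puts its noise spikes in different places, so averaging
    # attenuates them (essentially bagging / a random forest without the
    # feature subsampling).
    preds = []
    for seed in range(200):
        x_s, y_s = resample(x, y_observed, random_state=seed)
        t = DecisionTreeRegressor().fit(x_s.reshape(-1, 1), y_s)
        preds.append(t.predict(x_grid))
    y_ensemble = np.mean(preds, axis=0)     # much smoother than y_single_tree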

I hope it helps!

math_dandy No.41892756
This is good intuition for why ensembling overparametrized models is a good idea. It doesn't speak to why ensembles of tree-structured estimators in particular perform so well compared to ensembles of other nonparametric estimators.
youoy No.41893498
If you look at what makes it work well in the example, I would say it is the ability to approximate a function to whatever degree of precision you want, which translates into being able to isolate the noise spikes in the approximation.

For example, one could ask: what if, instead of an approximation by step functions, we use a piecewise linear approximation (which is just as good)? You can do that with a fully connected neural network with ReLU nonlinearities, and if you check it experimentally, you will see that the results are equivalent.
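If you want to try that comparison yourself, here is a rough sketch with scikit-learn's MLPRegressor (the data setup mirrors my earlier sketch; the hyperparameters are arbitrary, just for illustration):

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.utils import resample

    rng = np.random.RandomState(0)
    x = np.sort(rng.uniform(0, 2 * np.pi, 200))
    y_observed = np.sin(x) + rng.normal(scale=0.3, size=x.shape)
    x_grid = np.linspace(0, 2 * np.pi, 1000).reshape(-1, 1)

    # Average several small ReLU networks trained on bootstrap samples:
    # a piecewise-linear analogue of the bagged step-function trees.
    preds = []
    for seed in range(20):
        x_s, y_s = resample(x, y_observed, random_state=seed)
        net = MLPRegressor(hidden_layer_sizes=(50,), activation="relu",
                           max_iter=5000, random_state=seed)
        net.fit(x_s.reshape(-1, 1), y_s)
        preds.append(net.predict(x_grid))
    y_mlp_ensemble = np.mean(preds, axis=0)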

Why do people often use ensembles of trees instead? The ensembling comes built into the standard packages, which is not the case for ANNs, so it is quicker to experiment with. Apart from that, if you have features that behave like categorical variables, trees also behave better in training.