346 points swatson741 | 12 comments
1. nirinor ◴[] No.45790653[source]
It's a nitpick, but backpropagation is getting a bad rep here. These examples are about gradients + gradient descent variants being a leaky abstraction for optimization [1].

Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)
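
A quick numeric sketch of those creeping exponents (my own illustration, not from the original comment): each sigmoid in a chain multiplies the gradient by sigmoid'(z) <= 0.25, so the product shrinks exponentially with depth no matter which algorithm computes it.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradient of a chain of 10 sigmoids with respect to the input.
# Each factor is sigmoid'(z) = s * (1 - s) <= 0.25, so the product shrinks
# exponentially with depth; that is a property of the function itself,
# not of whichever algorithm computes the gradient.
a = 0.5
grad = 1.0
for _ in range(10):
    a = sigmoid(a)
    grad *= a * (1.0 - a)

print(grad)  # roughly 3.5e-7 after only 10 layers
```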

[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with by tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (+ modeling) is the actually hard part, not the way gradients are calculated.
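
For the footnote's mention of line search, here is a sketch of one common variant (backtracking with the Armijo condition); the function names, constants, and toy problem are mine, chosen only to illustrate the idea:

```python
import numpy as np

def backtracking_line_search(f, g, x, direction, alpha=1.0, beta=0.5, c=1e-4):
    """Shrink the step until the Armijo sufficient-decrease condition holds.

    The step length is chosen from the actual decrease in f along the search
    direction, which helps when raw gradient magnitudes are uniformly small.
    """
    fx = f(x)
    slope = np.dot(g(x), direction)  # negative for a descent direction
    while f(x + alpha * direction) > fx + c * alpha * slope:
        alpha *= beta
    return alpha

# Toy example: a quadratic bowl, stepping along the negative gradient.
f = lambda x: 0.5 * np.dot(x, x)
g = lambda x: x
x0 = np.array([3.0, -2.0])
print(backtracking_line_search(f, g, x0, direction=-g(x0)))  # accepts alpha = 1.0 here
```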

replies(2): >>45791131 #>>45792438 #
2. xpe ◴[] No.45791131[source]
Yes. No need to be apologetic or timid about it — it’s not a nit to push back against a flawed conceptual framing.

I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.

replies(3): >>45791270 #>>45791604 #>>45791798 #
3. embedding-shape ◴[] No.45791270[source]
> often I find his writing and speaking to be more than imprecise

I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.

At this point I see him more as an "ML educator" than an "ML practitioner" or "ML researcher", and as far as I know he's moving in that direction on purpose. I have no qualms with that overall; he seems good at educating.

But I think shifting your mindset about what the purpose of his writing is may help explain why it sometimes feels imprecise.

4. HarHarVeryFunny ◴[] No.45791604[source]
Whoever chose this topic title perhaps did him a disservice in suggesting he said the problem was backprop itself, since in his blog post he immediately clarifies what he meant by it. It's a nice pithy way of stating the issue though.
replies(1): >>45791895 #
5. mitthrowaway2 ◴[] No.45791798[source]
But Karpathy is completely right; students who understand and internalize how backprop works, having implemented it rather than treating it as a magic spell cast by TF/PyTorch, will also be able to intuitively understand these problems of vanishing gradients and so on.

Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.
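
For what it's worth, the exercise looks roughly like the sketch below (the tiny two-layer sigmoid network is my own illustrative choice, not Karpathy's actual assignment): writing the chain rule out by hand is exactly where the repeated sigmoid' factors, and hence vanishing gradients, become visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3,))        # input
W1 = rng.normal(size=(4, 3))     # first-layer weights
W2 = rng.normal(size=(1, 4))     # second-layer weights

# Forward pass
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
loss = 0.5 * (a2 - 1.0) ** 2     # squared error against a target of 1.0

# Backward pass, written out by hand via the chain rule
dz2 = (a2 - 1.0) * a2 * (1 - a2)    # includes one sigmoid' factor (<= 0.25)
dW2 = np.outer(dz2, a1)
dz1 = (W2.T @ dz2) * a1 * (1 - a1)  # another sigmoid' factor per layer back
dW1 = np.outer(dz1, x)

# Comparing per-layer gradient magnitudes is where the decay becomes tangible.
print(np.abs(dW2).max(), np.abs(dW1).max())
```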

replies(1): >>45800281 #
6. nirinor ◴[] No.45791895{3}[source]
Nah, Karpathy's title is "Yes you should understand backprop", and his first highlight is "The problem with Backpropagation is that it is a leaky abstraction." That was his choice as a communicator, not the HN poster's.

And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, which is (part of) an algorithm for automatic differentiation, and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, by losing too much precision, but it doesn't).
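
To illustrate the distinction (my own example, with PyTorch used only as an autodiff convenience): the gradient backprop returns for a deep sigmoid chain can be checked against a finite-difference estimate, and the two agree even though the value itself is vanishing; small is not the same as wrong.

```python
import torch

def f(x):
    # A chain of 10 sigmoids: the true gradient at x = 0.5 is tiny.
    for _ in range(10):
        x = torch.sigmoid(x)
    return x

x = torch.tensor(0.5, dtype=torch.float64, requires_grad=True)
f(x).backward()
bp_grad = x.grad.item()

# Central finite-difference estimate of the same derivative.
eps = 1e-4
fd_grad = (f(torch.tensor(0.5 + eps, dtype=torch.float64))
           - f(torch.tensor(0.5 - eps, dtype=torch.float64))).item() / (2 * eps)

print(bp_grad, fd_grad)  # both around 3.5e-7: the gradient is small, not incorrect
```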

replies(1): >>45791984 #
7. HarHarVeryFunny ◴[] No.45791984{4}[source]
Yeah - he chose it as a pithy/catchy description of the issue, then immediately clarified what he meant by it.

> In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.

Then follows this with multiple clear examples of exactly what he is talking about.

The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!

8. fjdjshsh ◴[] No.45792438[source]
I get your point, but I don't think your nit-pick is useful in this case.

The point is that you can't abstract away the details of back propagation (which involve computing gradients) under some circumstances, for example when we are using gradient descent. Maybe in other circumstances (a global optimization algorithm) it wouldn't be an issue, but the leaky-abstraction idea isn't that the abstraction is always an issue.

(Right now, back propagation is virtually the only way to calculate gradients in deep learning)

replies(1): >>45795277 #
9. nirinor ◴[] No.45795277[source]
So, is computing gradients a detail of backpropagation that it is failing to abstract over, or are gradients the goal that backpropagation achieves? It isn't both; it's just the latter.

This is like complaining about long division not behaving nicely when dividing by 0. The algorithm isn't the problem, and blaming the wrong part does not help understanding.

It distracts from what actually helps, which is using different functions with nicer-behaved gradients, e.g., the Huber loss instead of the quadratic.
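
A small sketch of that contrast (the residual values are arbitrary): the quadratic loss's gradient grows linearly with the residual, while the Huber loss's gradient is clipped to a constant slope beyond a threshold delta, so outliers stop dominating the update.

```python
import numpy as np

def quadratic_grad(r):
    return r                          # derivative of 0.5 * r**2

def huber_grad(r, delta=1.0):
    # Derivative of the Huber loss: quadratic near zero, constant slope beyond delta.
    return np.clip(r, -delta, delta)

for r in [0.5, 5.0, 50.0]:
    print(r, quadratic_grad(r), huber_grad(r))
# Large residuals give huge quadratic gradients, while the Huber gradient
# stays bounded at +/- delta.
```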

replies(2): >>45796606 #>>45796663 #
10. DSingularity ◴[] No.45796606{3}[source]
It's just an observation. It's an abstraction in the classical computer-science sense, in that you stack some modules and the backprop is generated for you. It's leaky in the sense that you can't fully abstract away the details, because of the vanishing/exploding gradient issues you must be mindful of.

It is definitely a useful thing for people who are learning this topic to understand from day 1.

11. grumbelbart2 ◴[] No.45796663{3}[source]
> It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.

Fully agree. It's not the "fault" of backprop. It does what you tell it to do: find the direction in which your loss is reduced the most. If the first layers get no signal because the gradient vanishes, then the reason is your network layout: very small modifications in the initial layers would lead to very large modifications in the final layers (essentially an unstable computation), so gradient descent simply cannot move that fast.

Instead, it's a vital signal for debugging your network. Inspecting things like gradient magnitudes per layer shows whether you have vanishing or exploding gradients. And that has led to great inventions for dealing with it, such as residual networks and a whole class of normalization methods (such as batch normalization).
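
A hedged sketch of that kind of inspection in PyTorch (the model and data are placeholders of my choosing): logging per-layer gradient norms after a backward pass is often enough to spot the vanishing or exploding pattern.

```python
import torch
import torch.nn as nn

# Placeholder model: a deep stack of small sigmoid layers, prone to vanishing gradients.
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
model = nn.Sequential(*layers)

x = torch.randn(16, 32)
y = torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Gradient norm per weight matrix: a steady decay toward the early layers
# is the vanishing-gradient signature; very large values suggest explosion.
for name, p in model.named_parameters():
    if name.endswith("weight"):
        print(f"{name:12s} grad norm = {p.grad.norm().item():.2e}")
```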

12. xpe ◴[] No.45800281{3}[source]
I never disagreed with the utility and importance of understanding backprop. I'm glad the article exists. And it could be easily improved -- and all of us can gain [1] by acknowledging this rather than circling the wagons [2], so to speak, or excusing unforced errors.

> ... he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading ...

My concern isn't about the heading he chooses. My concern is deeper; he commits a category error [3]. The following things are true, but Karpathy's article gets them wrong: (1) Leaky abstractions only occur with interfaces; (2) Backpropagation is an algorithm; (3) Algorithms can never be leaky abstractions.

Karpathy could have communicated his point clearly and correctly by saying e.g.: "treating backprop learning as a magical optimization oracle is risky". There is zero need for introducing the concept of leaky abstractions at all.

---

Ok, with the above out of the way, we can get to some interesting technical questions that are indeed about leaky abstractions which can inform the community about pros/cons of the design space: To what degree is the interface provided by [Library] a leaky abstraction? (where [Library] might be PyTorch or TensorFlow) Getting into these details is interesting. (See [4] for example.) There is room for more writing on this.

[1]: We can all gain because accepting criticism is hard. Once we see that even Karpathy messes up, we probably shouldn't be defensive when we mess up.

[2]: No one is being robbed here. Criticism is a gift; offering constructive criticism is a sign of respect. It also respects the community by saying, in effect, "I want to make it easier for people to get the useful, clear ideas into their heads rather than muddled ones."

[3]: https://en.wikipedia.org/wiki/Category_mistake

[4]: https://elanapearl.github.io/blog/2025/the-bug-that-taught-m...