346 points swatson741 | 7 comments
1. WithinReason ◴[] No.45789232[source]
Karpathy suggests the following error function:

  def clipped_error(x):
    return tf.select(tf.abs(x) < 1.0,
                     0.5 * tf.square(x),
                     tf.abs(x) - 0.5)  # condition, true, false
Following the same principles he outlines in this post, the "- 0.5" part is unnecessary: the gradient of a constant is 0, so subtracting 0.5 doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x²+1)
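
To illustrate, here's a minimal sketch (not from the post: it assumes modern TF, where tf.where replaced the old tf.select, and the function names are made up for the comparison). The versions with and without the "- 0.5" offset produce identical gradients, and the √(x²+1) variant is shown alongside:

  import tensorflow as tf

  def huber_with_offset(x):
    # Karpathy's clipped_error: quadratic near zero, linear minus 0.5 outside
    return tf.where(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x) - 0.5)

  def huber_without_offset(x):
    # same, but without the constant offset in the linear branch
    return tf.where(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x))

  def smooth_alternative(x):
    # the sqrt(x^2 + 1) variant mentioned above
    return tf.sqrt(tf.square(x) + 1.0)

  x = tf.Variable([0.5, 2.0, -3.0])
  with tf.GradientTape(persistent=True) as tape:
    a = tf.reduce_sum(huber_with_offset(x))
    b = tf.reduce_sum(huber_without_offset(x))
    c = tf.reduce_sum(smooth_alternative(x))
  print(tape.gradient(a, x).numpy())  # [ 0.5  1.  -1. ]
  print(tape.gradient(b, x).numpy())  # identical: the -0.5 constant never reaches backprop
  print(tape.gradient(c, x).numpy())  # ~[ 0.45  0.89 -0.95 ], smoothly approaching +/-1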
replies(3): >>45789324 #>>45790005 #>>45791588 #
2. macleginn ◴[] No.45789324[source]
If we don't subtract from the second branch, there will be a discontinuity at |x| = 1, so the derivative will not be well-defined there. Also, the value of the loss will jump at that point, which will make it hard to inspect the errors, for one thing.
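
Concretely (plain Python, values just for illustration): at |x| = 1 the quadratic branch gives 0.5 while the un-offset linear branch gives 1.0, so the loss jumps by 0.5:

  # illustrative check of the branch values at the boundary |x| = 1
  quadratic          = 0.5 * 1.0**2       # 0.5
  linear_no_offset   = abs(1.0)           # 1.0 -> the loss jumps by 0.5
  linear_with_offset = abs(1.0) - 0.5     # 0.5 -> matches, so the loss is continuous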
replies(1): >>45789361 #
3. WithinReason ◴[] No.45789361[source]
No, that's not how backprop works. There will be no discontinuity in a backpropagated gradient.
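
A quick check, under the same assumptions as the sketch above (modern tf.where, illustrative names): even without the offset, the gradient just below and just above |x| = 1 is essentially the same, because backprop only sees the derivative of whichever branch is selected:

  import tensorflow as tf

  def huber_without_offset(x):
    # linear branch without the -0.5 constant; the function value jumps at |x| = 1
    return tf.where(tf.abs(x) < 1.0, 0.5 * tf.square(x), tf.abs(x))

  x = tf.Variable([0.999, 1.001])
  with tf.GradientTape() as tape:
    y = tf.reduce_sum(huber_without_offset(x))
  print(tape.gradient(y, x).numpy())  # ~[0.999, 1.0] -- no jump in the gradient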
replies(1): >>45789667 #
4. macleginn ◴[] No.45789667{3}[source]
I did not say there will be a discontinuity in the gradient; I said that the modified loss function will not have a mathematically well-defined derivative because of the discontinuity in the function.
5. kingstnap ◴[] No.45790005[source]
You do that to make things smoother when plotted. You could in theory add some crazy stair-step that adds a hundred to the middle branch. It would make your loss curves spike and increase towards convergence, but those spikes would just be visual artifacts from doing weird discontinuous nonsense with your loss.
6. slashdave ◴[] No.45791588[source]
square roots are expensive
replies(1): >>45791737 #
7. WithinReason ◴[] No.45791737[source]
they are negligible, especially considering the post was written back when ops were not fused. The extra memory you need to store the intermediate tensors with the original version is more expensive than the sqrt