
346 points | swatson741 | 1 comment
jamesblonde:
I have to be contrarian here. The students were right: you didn't need to learn to implement backprop in NumPy. Any leakiness in backprop is addressed by researchers who introduce new optimizers; as a developer, you just pick the best one and find good hyperparameters for it.
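
In practice that workflow looks roughly like this (a minimal sketch assuming PyTorch; the model, data, and learning rate are made-up placeholders, not anything from the article):

    import torch

    # Hypothetical tiny setup just to illustrate "pick an optimizer, tune hyperparameters".
    model = torch.nn.Sequential(torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)  # chosen optimizer + a plausible lr

    x = torch.randn(64, 784)            # fake batch of inputs
    y = torch.randint(0, 10, (64,))     # fake labels

    for step in range(100):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                 # backprop handled entirely by autograd
        opt.step()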
1. HarHarVeryFunny:
The problem isn't with backprop itself or the optimizer; it's potentially in the derivatives of the functions you are building the neural net out of, such as the Sigmoid and ReLU examples that Karpathy gave.

Just because your framework provides things like ReLU doesn't mean someone else has done all the work and you can use them blindly and expect them to work every time. When training a neural net goes wrong, you need to know where to look and what to look for, such as exploding and vanishing gradients.
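
To make the vanishing-gradient point concrete, here is a small NumPy sketch (my own illustration, not from the article) of the local gradients involved: the sigmoid's derivative collapses toward zero once its input saturates, and ReLU's gradient is exactly zero for negative inputs (the "dead ReLU" case):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Sigmoid's local gradient is s * (1 - s); it shrinks rapidly as |x| grows,
    # so anything flowing backward through a saturated unit gets crushed.
    for x in [0.0, 2.0, 5.0, 10.0]:
        s = sigmoid(x)
        local_grad = s * (1.0 - s)
        print(f"x={x:5.1f}  sigmoid={s:.6f}  local grad={local_grad:.2e}")

    # ReLU's local gradient is 1 for positive inputs and 0 otherwise; a unit whose
    # pre-activations are always negative receives no gradient and stops learning.
    relu_grad = lambda x: (x > 0).astype(float)
    print(relu_grad(np.array([-3.0, -0.1, 0.5, 2.0])))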