
161 points | belleville
gwern (#43677261)
https://www.reddit.com/r/MachineLearning/comments/1jsft3c/r_...

I'm still not quite sure how to think of this. Maybe as being like unrolling a diffusion model, the equivalent of BPTT for RNNs?
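
(For concreteness on the BPTT comparison, here's a toy PyTorch sketch, nothing to do with the paper's actual code and with made-up sizes and loss, of unrolling a recurrent step and backpropagating through the whole unrolled chain; memory grows with the number of steps you unroll.)

    # Toy sketch of BPTT: unroll a recurrent cell over T steps, then
    # backprop through the entire unrolled chain.
    import torch
    import torch.nn as nn

    cell = nn.Linear(16 + 8, 16)   # simple recurrent cell: (h, x) -> h
    opt = torch.optim.SGD(cell.parameters(), lr=1e-2)

    xs = torch.randn(10, 8)        # T = 10 inputs
    h = torch.zeros(16)
    for x in xs:                   # unrolling keeps every step in the graph
        h = torch.tanh(cell(torch.cat([h, x])))
    loss = h.pow(2).mean()         # arbitrary loss, purely illustrative
    opt.zero_grad()
    loss.backward()                # gradient flows back through all 10 steps
    opt.step()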

cttet (#43677696)
In all their experiments, though, backprop is still used for most of their parameters...
hansvm (#43678281)
There is a meaningful distinction. They only use backprop one layer at a time, requiring additional space proportional to that layer. Full backprop requires additional space proportional to the whole network.

It's also a bit interesting as an experimental result, since the core idea doesn't require backprop. Because backprop is just an implementation detail here, you could theoretically swap in other layer types or solvers.
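
Roughly, the distinction looks like this in a PyTorch-style sketch (purely illustrative; the per-layer loss and layer sizes are placeholders, not the paper's actual objective). Detaching each layer's input means backprop never has to hold more than one layer's graph in memory:

    # Minimal sketch: train each layer with its own local loss so that
    # backprop never spans more than one layer.
    import torch
    import torch.nn as nn

    layers = nn.ModuleList([nn.Linear(784, 256),
                            nn.Linear(256, 256),
                            nn.Linear(256, 10)])
    opts = [torch.optim.SGD(l.parameters(), lr=1e-2) for l in layers]

    def local_loss(h):
        # placeholder objective; the real method defines its own per-layer target
        return h.pow(2).mean()

    x = torch.randn(32, 784)
    h = x
    for layer, opt in zip(layers, opts):
        h = torch.relu(layer(h.detach()))  # detach: gradient stops at this layer's input
        loss = local_loss(h)
        opt.zero_grad()
        loss.backward()                    # backprop through this single layer only
        opt.step()

With full end-to-end backprop you'd instead keep the whole forward graph alive and call backward() once on a final loss, which is where the network-sized memory cost comes from.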