Check the pseudocode of their algorithms.
"Update using gradient based optimizations""
Maybe you have a way of seeing it differently so that this looks like a gradient? Gradient keys my brain into a desired outcome expressed as an expectation function.
The one that is not used, because it's inherently unstable?
Learning using locally accessible information is an interesting approach, but it needs to be more complex than "fire together, wire together". And then you might have propagation of information that allows gradients to be approximated locally.
Is there anyone in particular whose work focuses on this that you know of?
I can't recall exactly what the Hebbian update is, but something tells me it minimises the "reconstruction loss", and effectively learns the PCA matrix.
It’s Hebbian and solves all stability problems.
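For concreteness, a minimal sketch of what I take this to mean: Oja's rule, the stabilized Hebbian update. The data, learning rate, and dimensions below are made up for illustration; the point is that the plain Hebbian term alone diverges, while the decay term keeps the weights bounded and drives them toward the first principal component, i.e. the direction minimizing linear reconstruction loss.

```python
import numpy as np

# Oja's rule: w += lr * y * (x - y * w), with y = w . x.
# The lr * y * x part is plain Hebbian ("fire together, wire together")
# and would blow up on its own; the -y^2 * w decay keeps ||w|| near 1
# and makes w converge to the leading eigenvector of the data covariance.

rng = np.random.default_rng(0)

# Toy data with a dominant direction along (3, 1) / sqrt(10) (assumed setup)
true_dir = np.array([3.0, 1.0]) / np.sqrt(10.0)
X = rng.normal(size=(5000, 2)) * 0.1            # small isotropic noise
X += rng.normal(size=(5000, 1)) * true_dir      # strong variance along true_dir

w = rng.normal(size=2)
lr = 0.01
for x in X:
    y = w @ x
    w += lr * y * (x - y * w)  # Hebbian term plus stabilizing decay

# w should align (up to sign) with the first principal component
cos = abs(w @ true_dir) / np.linalg.norm(w)
print(round(float(cos), 2))
```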
GP is essentially isomorphic with beam search where the population is the beam. It is a fancy search algorithm. It is not "training" anything.
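To make the "population = beam" framing concrete, here is a toy sketch of an evolutionary loop written literally as beam search. The problem (OneMax: maximize the number of 1-bits) and all parameters are my own stand-ins, not anything from the paper under discussion: each step expands every beam member by mutation and prunes back to the top-k, which is exactly the select-and-discard step of both algorithms.

```python
import random

# Beam search where the "beam" plays the role of a GP population:
# expand each candidate by mutation, score everything, keep the top-k.
# OneMax (count of 1-bits) is an arbitrary toy fitness function.

random.seed(0)
N, BEAM, STEPS = 32, 8, 200

def score(bits):
    return sum(bits)            # fitness = number of 1s

def mutate(bits):
    i = random.randrange(len(bits))
    child = list(bits)
    child[i] ^= 1               # flip one random bit
    return child

beam = [[random.randint(0, 1) for _ in range(N)] for _ in range(BEAM)]
for _ in range(STEPS):
    # expand: each beam member proposes a few mutated children
    candidates = beam + [mutate(b) for b in beam for _ in range(4)]
    # select: keep the k best, exactly as beam search prunes
    beam = sorted(candidates, key=score, reverse=True)[:BEAM]

best = max(beam, key=score)
print(score(best))
```

Note there is no gradient and no "training" signal anywhere, just search over candidates, which is the point being made above.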
>"We believe this work takes a first step TOWARDS introducing a new family of GRADIENT-FREE learning methods"
I.e. for the time being, the authors can't convince themselves not to take advantage of efficient hardware for taking gradients.
(*Checks that Oxford University is not under sanctions*)
There is no prediction or desired output, certainly not an explicit one. I was playing with those things in my work to try and understand how our brains cause the emergence of intelligence, rather than to solve some classification or related problem. What I managed to replicate was the learning of XOR by some nodes, and further that multidimensional XORs, up to the number of inputs, could be learned.
Perhaps you can say that something PCA-ish is the implicit objective/result, but I still reject that there is any conceptual notion of what a node "should" output, even if iteratively applying the learning rule leads us there.
Gradient descent is only one way of searching for a minimum, so in that sense it is not necessary; for example, one can sometimes analytically solve for the extrema of the loss. As an alternative, one could do Monte Carlo search instead of gradient descent. For a convex loss that would of course be less efficient.
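A tiny sketch of those three options on a convex loss I've made up, f(x) = (x - 3)^2: the analytic minimum (set f'(x) = 0, giving x = 3), gradient descent on f'(x) = 2(x - 3), and plain Monte Carlo search that just samples and keeps the best. All three agree; the random search simply burns far more function evaluations.

```python
import random

random.seed(0)
f = lambda x: (x - 3.0) ** 2   # toy convex loss, minimum at x = 3

# 1) analytic: f'(x) = 2(x - 3) = 0  ->  x = 3
x_analytic = 3.0

# 2) gradient descent on f'(x)
x = 0.0
for _ in range(100):
    x -= 0.1 * 2 * (x - 3.0)   # step along -f'(x)
x_gd = x

# 3) Monte Carlo search: sample uniformly, keep the best candidate seen
x_mc = min((random.uniform(-10, 10) for _ in range(10000)), key=f)

print(round(x_gd, 3), round(x_mc, 2))
```

Note gradient descent gets within machine precision of the minimum in 100 steps, while 10,000 random samples only pin it down to a couple of decimal places.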