Nested Learning: A new ML paradigm for continual learning

(research.google)

1. abracos ◴[07 Dec 25 20:58 UTC] No.46185107[source]▶

>>46182031 (OP) #

Someone's trying to reproduce it in the open: https://github.com/kmccleary3301/nested_learning

replies(1): >>46190540 #

2. panarchy ◴[07 Dec 25 23:07 UTC] No.46186306[source]▶

>>46182031 (OP) #

I've been waiting for someone to make this since about 2019; it seemed pretty self-evident. It will be interesting when they get to mixed heterogeneous-architecture networks with a meta network that optimizes for specific tasks.

3. aktuel ◴[08 Dec 25 09:48 UTC] No.46190370[source]▶

>>46182031 (OP) #

There is also a related YouTube video: Ali Behrouz of Google Research explaining his poster paper, "Nested Learning: The Illusion of Deep Learning Architecture", at NeurIPS 2025. https://www.youtube.com/watch?v=uX12aCdni9Q

replies(1): >>46191332 #

4. NitpickLawyer ◴[08 Dec 25 10:08 UTC] No.46190540[source]▶

>>46185107 #

Surprised this isn't by lucidrains, they usually have the first repro attempts.

This tidbit from a discussion on that repo sounds really interesting:

> You can load a pretrained transformer backbone, freeze it, and train only the HOPE/TITAN/CMS memory pathways.

In principle, you would:

- Freeze the shared transformer spine (embeddings, attention/MLP blocks, layer norms, lm_head) and keep lm_head.weight tied to embed.weight.

- Train only the HOPE/TITAN memory modules (TITAN level, CMS levels, self-modifier projections, inner-optimizer state).

- Treat this like an adapter-style continual-learning finetune: base model provides stable representations; HOPE/CMS learn to adapt/test-time-learn on top.
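The freeze/train split in the quoted steps can be sketched as a simple selection over parameter names. This is a minimal, dependency-free illustration, not code from the linked repo: the `Param` class, the name prefixes (`titan.`, `cms.`, `self_modifier.`) and the parameter names are all hypothetical. In PyTorch the same idea would amount to setting `requires_grad = False` on the backbone parameters.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Stand-in for a model parameter (hypothetical, for illustration only)."""
    name: str
    requires_grad: bool = True


# Assumed prefixes for the memory pathways; the real module names may differ.
MEMORY_PREFIXES = ("titan.", "cms.", "self_modifier.")


def freeze_backbone(params):
    """Leave only memory-pathway parameters trainable and freeze everything else."""
    for p in params:
        p.requires_grad = p.name.startswith(MEMORY_PREFIXES)
    return [p.name for p in params if p.requires_grad]


params = [Param(n) for n in (
    "embed.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.0.ln.weight",
    "lm_head.weight",
    "titan.memory.weight",
    "cms.level0.weight",
    "self_modifier.proj.weight",
)]

trainable = freeze_backbone(params)
frozen = [p.name for p in params if not p.requires_grad]
```

Only the names in `trainable` would be handed to the optimizer; everything in `frozen` (embeddings, attention/MLP blocks, layer norms, `lm_head`) stays fixed.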

----

Pretty cool if this works. I'm hopeful more research will go into reusing already trained models (other than freeze existing parts, train the rest) so all that training effort doesn't get lost. Something that can re-use that w/ architecture enhancements will be truly revolutionary.

5. Bombthecat ◴[08 Dec 25 11:47 UTC] No.46191193[source]▶

>>46182031 (OP) #

Damn, and before that, Titans from Google: https://research.google/blog/titans-miras-helping-ai-have-lo...

We are not at the end of AI :)

Also, someone claimed that NVIDIA combined diffusion and autoregression, making it 6 times faster, but I couldn't find a source. Big if true!

replies(1): >>46191256 #

6. heavymemory ◴[08 Dec 25 11:59 UTC] No.46191256[source]▶

>>46191193 #

Do you have a source for the NVIDIA “diffusion plus autoregression 6x faster” claim? I can’t find anything credible on that.

replies(2): >>46191331 #>>46191397 #

7. heavymemory ◴[08 Dec 25 12:03 UTC] No.46191281[source]▶

>>46182031 (OP) #

The idea is interesting, but I still don’t understand how this is supposed to solve continual learning in practice.

You’ve got a frozen transformer and a second module still trained with SGD, so how exactly does that solve forgetting instead of just relocating it?
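To make the forgetting concern concrete, here is a toy sketch, assuming a single scalar weight and two made-up regression tasks (nothing here comes from the paper or the repo). Plain SGD on task A followed by plain SGD on task B overwrites what was learned for A, which is the behaviour the question is about.

```python
def mse(w, data):
    """Mean squared error of the one-parameter model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd(w, data, lr=0.05, epochs=200):
    """Plain per-sample SGD on the squared error."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w


# Two hypothetical tasks that want different weights (2.0 vs. -1.0).
task_a = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
task_b = [(x, -1.0 * x) for x in (1.0, 2.0, 3.0)]

w = sgd(0.0, task_a)
loss_a_before = mse(w, task_a)  # near zero right after training on A

w = sgd(w, task_b)
loss_a_after = mse(w, task_a)   # much larger once B has been learned
```

Freezing part of a model protects only the frozen weights; whatever is still updated this way can drift in the same manner unless something else prevents it.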

8. Bombthecat ◴[08 Dec 25 12:10 UTC] No.46191331{3}[source]▶

>>46191256 #

Me neither, that's why I wrote that someone claimed that they did.

The idea is simple, in a way: with diffusion, several sentences / words get predicted at once, but they usually are not of great quality. With autoregression, the correct words are then selected.

That increases both quality and speed. Sounds a bit like conscious and sub-conscious to me.
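A toy sketch of that draft-then-verify idea, under loud assumptions: the "drafter" and "verifier" below are hypothetical stand-ins built on a fixed sentence, not real diffusion or autoregressive models, and this is not the method from any specific paper. The drafter proposes a block of tokens at once with a deliberate mistake; the verifier keeps the matching prefix and corrects the first wrong token.

```python
TARGET = "the cat sat on the mat".split()


def verifier_next(prefix):
    """Hypothetical autoregressive model: the next token given the prefix."""
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None


def drafter_block(prefix, k=3):
    """Hypothetical parallel drafter: proposes up to k tokens at once.

    It deliberately gets "on" wrong to imitate lower-quality drafts.
    """
    block = TARGET[len(prefix):len(prefix) + k]
    return ["in" if tok == "on" else tok for tok in block]


def generate(k=3):
    out, rounds = [], 0
    while verifier_next(out) is not None:
        rounds += 1
        for tok in drafter_block(out, k):
            expected = verifier_next(out)
            if expected is None:
                break
            out.append(expected)  # always keep the verifier's token
            if tok != expected:
                break             # draft was wrong here: discard the rest of it
    return out, rounds


tokens, rounds = generate()
```

Here the six tokens are produced in three draft/verify rounds instead of six one-token steps. A real system would check each block in a single parallel pass; the per-token calls above are only for readability.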

9. heavymemory ◴[08 Dec 25 12:10 UTC] No.46191332[source]▶

>>46190370 #

This still seems like gradient descent wrapped in new terminology. If all learning happens through weight updates, it's just rearranging where the forgetting happens.

10. Bombthecat ◴[08 Dec 25 12:18 UTC] No.46191397{3}[source]▶

>>46191256 #

Ha! Found it: https://arxiv.org/abs/2511.08923

Thanks to AI search :)