
S1: A $6 R1 competitor?

(timkellogg.me)
851 points | tkellogg | 5 comments
pona-a ◴[] No.42948636[source]
If chain of thought acts as a scratch buffer by giving the model more temporary "layers" to process the text, I wonder whether it would make sense to make that buffer a separate context with its own FFN and attention. In essence, there would be a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess of describing that incomprehensible stream of embedding vectors in natural language, in a way returning to the encoder/decoder architecture, except both halves are autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.
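A minimal sketch of how I picture it (module names, sizes, and the rollout loop are all made up for illustration, untested):

  import torch
  import torch.nn as nn

  class LatentReasoner(nn.Module):
      # macroprocess: autoregresses over embedding vectors with its own
      # attention/FFN stack; no vocabulary, no human-readable tokens
      def __init__(self, d_model=512, n_layers=6, n_heads=8):
          super().__init__()
          layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
          self.blocks = nn.TransformerEncoder(layer, n_layers)
          self.next_thought = nn.Linear(d_model, d_model)

      def forward(self, prompt_emb, n_steps=32):
          buf = prompt_emb                             # (batch, T, d)
          for _ in range(n_steps):                     # "unbounded" in principle
              h = self.blocks(buf)
              buf = torch.cat([buf, self.next_thought(h[:, -1:])], dim=1)
          return buf                                   # dense "thought" buffer

  class Verbalizer(nn.Module):
      # microprocess: ordinary autoregressive token decoder that cross-attends
      # to the thought buffer (encoder/decoder again, but both autoregressive)
      def __init__(self, vocab=32000, d_model=512, n_layers=6, n_heads=8):
          super().__init__()
          self.emb = nn.Embedding(vocab, d_model)
          layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
          self.dec = nn.TransformerDecoder(layer, n_layers)
          self.lm_head = nn.Linear(d_model, vocab)

      def forward(self, tokens, thought_buf):
          x = self.emb(tokens)
          causal = torch.triu(torch.full((x.size(1), x.size(1)), float("-inf")), 1)
          return self.lm_head(self.dec(x, thought_buf, tgt_mask=causal))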
replies(7): >>42949506 #>>42949822 #>>42950000 #>>42950215 #>>42952388 #>>42955350 #>>42957969 #
cakealert ◴[] No.42957969[source]
The problem is that RL is extremely inefficient. It's one thing to use it to fine-tune an LLM into doing the chain-of-thought trick, and quite another to learn to think entirely from scratch. The pretrained LLM does a lot of heavy lifting there.

And it would have to be RL for your idea to work, since there is no "thinking" dataset for a novel token space. There isn't even one for the existing LLM token space, but there they have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought; it's merely repurposing that knowledge for a "thinking" application.

replies(1): >>42958262 #
1. itissid ◴[] No.42958262[source]
> The problem is that RL is extremely inefficient.

Wait, what? That is an odd way of framing it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it in the context of domains and observability.

RL, by definition, is a field about finding efficient solutions to problems in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, even polynomial time in the state space; we just need to explore and find them. For example, you can use dynamic programming as a "more" efficient way to solve MDPs [1] because it is polynomial in the state space × action space.

[1]https://web.stanford.edu/class/psych209/Readings/SuttonBarto...
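For concreteness, the DP point is just value iteration: each sweep touches every (state, action, next-state) triple, so the per-sweep cost is polynomial in |S| and |A|. A toy sketch (the P[s][a] interface here is mine, not Sutton & Barto's code):

  def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
      # P[s][a] is a list of (prob, next_state, reward) triples
      V = [0.0] * n_states
      while True:
          delta = 0.0
          for s in range(n_states):
              best = max(
                  sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                  for a in range(n_actions)
              )
              delta = max(delta, abs(best - V[s]))
              V[s] = best
          if delta < tol:
              return V  # each sweep is O(|S| * |A| * |S|) with dense transitions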

replies(1): >>42958655 #
2. cakealert ◴[] No.42958655[source]
RL provides a very poor training signal for deep learning, an order of magnitude or more worse than supervised learning. Better than nothing, of course.

What the OP suggested is similar to training a transformer from scratch with RL (i.e. no training tokens), toward the objective of steering a pretrained LLM to produce human-readable output. It would probably not even converge, and if it did, it would take immense compute.
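A toy way to see the signal gap (PyTorch, my own minimal framing, not anyone's actual training code): supervised learning backpropagates a per-token target, while a REINFORCE-style objective backpropagates a single scalar reward for the whole rollout.

  import torch.nn.functional as F

  def supervised_loss(logits, target_tokens):
      # dense signal: every position has a known correct token
      return F.cross_entropy(logits.flatten(0, 1), target_tokens.flatten())

  def reinforce_loss(logits, sampled_tokens, reward):
      # sparse signal: one scalar reward scales the log-prob of the whole sample
      logp = F.log_softmax(logits, dim=-1)
      logp_taken = logp.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
      return -(reward * logp_taken.sum(dim=-1)).mean()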

replies(1): >>42959167 #
3. pizza ◴[] No.42959167[source]
In supervised problem domains, you implicitly decide what is signal and what is noise, and sure, in that closed setting supervised learning is much more sample-efficient. But I think what we're learning now is that, with strong enough base models, the 'aha' moments in RL training show it might be possible to essentially squeeze signal out of language itself, giving you far greater breadth of latent knowledge than supervised examples, and letting you train to generalize over far longer horizons than a fixed dataset would allow. In a fascinating way it is rather reminiscent of abiogenesis. This might sound like speculative claptrap if you look at what the current generation of models is still weak at, but there's a real chance that the set of outcomes in the limit has a very heavy tail.
replies(1): >>42959853 #
4. cakealert ◴[] No.42959853{3}[source]
With a pretrained LLM, most of the work is already done. RL just steers the model into a 'thinking' mode. There is enough signal for that to work, and for the inefficiency not to matter.

The downside is that you are limiting the model to thinking in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images: concepts ("concept" probably isn't even the right word) mix and transform, and often I don't even bother to make the final transformation to language, just action.

replies(1): >>42960210 #
5. pizza ◴[] No.42960210{4}[source]
I strongly agree; in fact, I think what best matches the thought process is something like the multiset tree/forest workspace approach suggested by Marcolli, Chomsky, and Berwick: a Hopf algebra that can be externalized into (non-planar) embeddings of linearized strings, or alternatively into semantic manifolds.
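Very roughly (my own toy encoding, and it skips the coproduct/grafting structure that actually makes it a Hopf algebra): a workspace is a multiset of binary trees, and Merge replaces two of its members with the unordered pair built from them; linearization into strings happens downstream.

  from collections import Counter

  def merge(a, b):
      # syntactic objects: leaves are strings, internal nodes unordered pairs
      return frozenset((a, b))

  def external_merge(workspace, a, b):
      # remove the two accessible roots, add the newly built object
      ws = workspace - Counter((a, b))
      ws[merge(a, b)] += 1
      return ws

  ws = Counter(["the", "cat", "slept"])
  ws = external_merge(ws, "the", "cat")                       # {{the, cat}, slept}
  ws = external_merge(ws, frozenset(("the", "cat")), "slept")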