I’d go so far as to add one more layer to monitor this one, and then stop adding layers. My thinking is that this meta-awareness is all you need.
No data to back my hypothesis up, so take it for what it’s worth.
This is just standard decoding; the stream of vectors is called the KV cache.
It's also not that far from Meta's Large Concept Model idea.
https://news.ycombinator.com/item?id=42919597
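For concreteness, here is a minimal sketch of that standard decoding loop, using GPT-2 via Hugging Face transformers purely as an illustration (the model choice, prompt, and generation length are arbitrary assumptions); the per-layer keys and values accumulated in `past_key_values` are the "stream of vectors" in question:

```python
# Minimal sketch of autoregressive decoding with a KV cache.
# The cache grows by one position per generated token, so only the
# newest token has to be run through the model at each step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The cat sat on the", return_tensors="pt").input_ids
past = None  # the KV cache, empty before the first step

with torch.no_grad():
    for _ in range(20):  # generate 20 tokens greedily
        out = model(
            input_ids=input_ids if past is None else input_ids[:, -1:],
            past_key_values=past,
            use_cache=True,
        )
        past = out.past_key_values  # the cached key/value vectors
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tok.decode(input_ids[0]))
```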
For example, xenophobia as a response to economic hardship is a wrong chain of thinking embedded in the larger zeitgeist.
I reflected on the pop-psychology idea of consciousness and subconsciousness. I thought of each as an independent stream of tokens, like stream-of-consciousness poetry, but along the way there were joining points between the two streams, points where the conscious stream was edited by the subconscious stream. You could think of the subconscious stream as performing CRUD-like operations on the conscious stream. The conscious stream would act like a buffer of short-term memory while the subconscious stream would act like a buffer of long-term memory: the subconscious holds instructions related to long-term goals, and the conscious stream holds instructions related to short-term goals.
You can imagine perception as input being fed into the conscious stream and then edited by the subconscious stream before execution.
It seems entirely possible to actually implement this idea today. I mean, it was a fever dream as a kid, but now it could be an experiment!
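A toy sketch of what such an experiment could look like, with the caveat that everything here (the class name, the string-prefixed edit rules, the execute-and-clear step) is invented for illustration and is not a real design:

```python
# Hypothetical two-stream sketch: a conscious short-term buffer that is
# edited (CRUD-like) by a subconscious long-term stream before execution.
from dataclasses import dataclass, field

@dataclass
class TwoStreamAgent:
    conscious: list[str] = field(default_factory=list)     # short-term buffer
    subconscious: list[str] = field(default_factory=list)  # long-term goals / rules

    def perceive(self, tokens: list[str]) -> None:
        # Perception feeds directly into the conscious stream.
        self.conscious.extend(tokens)

    def subconscious_edit(self) -> None:
        # CRUD-like pass: the long-term stream deletes or appends entries
        # in the short-term stream before it is executed.
        for rule in self.subconscious:
            if rule.startswith("drop:"):
                target = rule.removeprefix("drop:")
                self.conscious = [t for t in self.conscious if t != target]
            elif rule.startswith("append:"):
                self.conscious.append(rule.removeprefix("append:"))

    def act(self) -> list[str]:
        self.subconscious_edit()
        plan, self.conscious = self.conscious, []  # execute and clear the buffer
        return plan

agent = TwoStreamAgent(subconscious=["drop:panic", "append:check long-term goal"])
agent.perceive(["panic", "grab", "food"])
print(agent.act())  # ['grab', 'food', 'check long-term goal']
```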
And it would have to be RL for your idea to work, since there is no "thinking" dataset for a novel token space. There isn't even one for the existing LLM token space, but there you at least have the base model to work off of. When the thought is expressed in English, the model already knows the relationships between the tokens in the thought; it's merely repurposing them for a "thinking" application.
Wait, what? That is an odd way of defining it. That's like saying Turing machines are an inefficient way to solve TSP. You would, at the least, want to define this in terms of complexity, or put it into the context of domains and observability.
RL, by definition, is a field about efficiently solving problems in the domain of choice [1]. There are likely regimes in LLM/LRM learning where RL can be quite efficient, even polynomial in the state space; we just need to explore and find them. For example, you can use dynamic programming as a "more" efficient way to solve MDPs [1], because it is polynomial in state space × action space (a quick value-iteration sketch is below).
[1] https://web.stanford.edu/class/psych209/Readings/SuttonBarto...
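A minimal sketch of that DP claim: value iteration on a tiny made-up MDP (the state/action counts, random dynamics, and convergence threshold are all arbitrary assumptions). Each sweep costs O(|S|·|A|·|S|), i.e. polynomial in state space × action space:

```python
# Value iteration on a toy MDP with random transition/reward tables.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):            # sweeps until convergence
    Q = R + gamma * P @ V        # Q[s, a] backup over all next states
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Values:", V, "Greedy policy:", Q.argmax(axis=1))
```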
What the OP suggested is similar to training a transformer from scratch using RL (i.e., no training tokens) toward an objective of steering a pretrained LLM to produce human-readable output. It will probably not even converge, and if it does, it will take immense compute.
The downside is that you are limiting the model to thinking in the same language it outputs. An argument could be made that this is not how all humans think. I know that I rarely think in language or even images; concepts ("concepts" probably isn't even the right word) mix and transform, and often I don't even bother to make the translation into language at the end, just action.
At the time I had this idea, I did not know of either of these. I think I was drawing explicitly on the conscious/subconscious vocabulary.