
S1: A $6 R1 competitor?

(timkellogg.me)
851 points by tkellogg | 2 comments
pona-a No.42948636
If chain of thought acts as a scratch buffer by giving the model more temporary "layers" to process the text, I wonder whether it would make sense to make this buffer a separate context with its own FFN and attention. In essence, there would be a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess of describing this otherwise incomprehensible stream of embedding vectors in natural language, in a way returning to the encoder/decoder architecture, except with both halves autoregressive. Maybe this would give us a denser representation of said "thought", one not constrained by imitating human text.
replies(7): >>42949506 #>>42949822 #>>42950000 #>>42950215 #>>42952388 #>>42955350 #>>42957969 #
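A minimal sketch of the two-process idea above, assuming a PyTorch-style implementation (all module names, shapes, and hyperparameters here are hypothetical illustrations, not anything from the article): a latent "reasoner" that autoregressively appends thought vectors without ever decoding them, and a token decoder that cross-attends to that buffer.

    import torch
    import torch.nn as nn

    class LatentReasoner(nn.Module):
        # Macroprocess: autoregressively grows a buffer of latent "thought"
        # vectors using its own attention + feed-forward stack; nothing in
        # this loop is ever decoded to tokens.
        def __init__(self, d_model=512, n_heads=8, n_layers=4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, n_layers)

        def forward(self, prompt_emb, n_steps=16):
            thoughts = prompt_emb                      # (batch, seq, d_model)
            for _ in range(n_steps):
                h = self.blocks(thoughts)
                new_thought = h[:, -1:, :]             # last position = next "thought"
                thoughts = torch.cat([thoughts, new_thought], dim=1)
            return thoughts

    class ThoughtDecoder(nn.Module):
        # Microprocess: an ordinary autoregressive token decoder that
        # cross-attends to the latent thought buffer when emitting text.
        def __init__(self, vocab=32000, d_model=512, n_heads=8, n_layers=4):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.blocks = nn.TransformerDecoder(layer, n_layers)
            self.lm_head = nn.Linear(d_model, vocab)

        def forward(self, token_ids, thoughts):
            x = self.embed(token_ids)
            h = self.blocks(tgt=x, memory=thoughts)    # read the thought buffer
            return self.lm_head(h)                     # logits over the vocabulary

In this sketch only the decoder would be trained against human text; the thought buffer is free to drift toward whatever dense representation the training objective favors, which is roughly the "not constrained by imitating human text" property the comment is speculating about.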
bloomingkales No.42950215
Once we train models on chain-of-thought outputs, next-token prediction can solve the halting problem for us (e.g., this chain of thinking matches that other chain of thinking).
replies(1): >>42951030 #
psadri No.42951030
I think that is how human brains work. When we practice, at first we have to be deliberate (thinking slow). Then we "learn" from our own experience and it becomes muscle memory (thinking fast). Of course, it also increases the odds that we are wrong.
replies(1): >>42951204 #
bloomingkales No.42951204
Or worse, we overweight the wrong chain of thinking toward an irrelevant (but pragmatically useful) output, at scale.

For example, xenophobia as a response to economic hardship is the wrong chain of thinking embedded in the larger zeitgeist.