
S1: A $6 R1 competitor?

(timkellogg.me)
851 points by tkellogg | 1 comment
pona-a ◴[] No.42948636[source]
If chain of thought acts as a scratch buffer by giving the model more temporary "layers" to process the text, I wonder if it would make sense to make this buffer a separate context with its own FFN and attention. In essence, there would be a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess of describing this incomprehensible stream of embedding vectors in natural language, in a way returning to the encoder/decoder architecture, but with both halves autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.
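
A minimal sketch of what that split might look like, written against PyTorch; the LatentReasoner/Verbalizer names, layer sizes, and step counts are all invented for illustration and are not taken from the linked post:

```python
# Hypothetical sketch (not from the article): a "reasoning" process that
# autoregressively rolls out continuous thought vectors, and a separate
# decoder that cross-attends to them to verbalize the result in tokens.
import torch
import torch.nn as nn

class LatentReasoner(nn.Module):
    """Macroprocess: rolls out a sequence of thought vectors, no vocabulary."""
    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerEncoder(layer, n_layers)
        self.step = nn.Linear(d_model, d_model)  # predicts the next thought vector

    def forward(self, prompt_emb, n_steps):
        thoughts = prompt_emb                      # (B, T0, d)
        for _ in range(n_steps):                   # "unbounded" thinking budget
            h = self.core(thoughts)
            nxt = self.step(h[:, -1:, :])          # next latent thought
            thoughts = torch.cat([thoughts, nxt], dim=1)
        return thoughts

class Verbalizer(nn.Module):
    """Microprocess: autoregressive decoder that describes the thoughts in text."""
    def __init__(self, vocab=32000, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.core = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, token_ids, thoughts):
        x = self.embed(token_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.core(x, thoughts, tgt_mask=causal)  # cross-attend to latent thoughts
        return self.lm_head(h)

# Usage: reason in latent space first, then decode an explanation of it.
reasoner, verbalizer = LatentReasoner(), Verbalizer()
prompt_emb = torch.randn(1, 10, 512)               # stand-in for an embedded prompt
thoughts = reasoner(prompt_emb, n_steps=32)
logits = verbalizer(torch.randint(0, 32000, (1, 5)), thoughts)
```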
replies(7): >>42949506 #>>42949822 #>>42950000 #>>42950215 #>>42952388 #>>42955350 #>>42957969 #
bluechair ◴[] No.42949506[source]
I had this exact same thought yesterday.

I’d go so far as to add one more layer to monitor this one, and then stop adding layers. My thinking is that this meta-awareness is all you need.

No data to back my hypothesis up. So take it for what it’s worth.
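
A rough sketch of that "one monitoring layer" idea, with invented names (HaltingMonitor, reason_with_monitor) and an assumed batch size of 1; it just thresholds a learned stop probability, in the spirit of adaptive-computation-time halting, and is not anything from the post:

```python
# Toy halting monitor: a small head reads each new thought vector and
# decides when the latent reasoning loop has done enough.
import torch
import torch.nn as nn

class HaltingMonitor(nn.Module):
    def __init__(self, d_model=512):
        super().__init__()
        self.score = nn.Linear(d_model, 1)  # thought vector -> stop logit

    def forward(self, thought):
        return torch.sigmoid(self.score(thought))  # probability that reasoning is done

def reason_with_monitor(step_fn, prompt_emb, monitor, max_steps=64, threshold=0.9):
    """Roll out latent thoughts until the monitor says stop (or a hard cap)."""
    thoughts = prompt_emb
    for _ in range(max_steps):
        nxt = step_fn(thoughts)                        # one reasoner step, (1, 1, d)
        thoughts = torch.cat([thoughts, nxt], dim=1)
        if monitor(nxt[:, -1, :]).item() > threshold:  # .item() assumes batch size 1
            break
    return thoughts

# Toy step function standing in for one reasoner step: project the last thought.
proj = nn.Linear(512, 512)
out = reason_with_monitor(lambda t: proj(t[:, -1:, :]),
                          torch.randn(1, 10, 512), HaltingMonitor())
```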

replies(2): >>42949744 #>>42957660 #
larodi ◴[] No.42949744[source]
My thought along the same lines: do all tokens live in the same latent space, or in many spaces, with each logical unit trained separately from the others…?