S1: A $6 R1 competitor?

(timkellogg.me)

851 points tkellogg | 1 comments | 05 Feb 25 11:05 UTC

pona-a ◴[05 Feb 25 14:06 UTC] No.42948636[source]▶

If chain of thought acts as a scratch buffer, giving the model more temporary "layers" with which to process the text, I wonder if it would make sense to make this buffer a separate context with its own FFN and attention. In essence, there's a macroprocess of "reasoning" that takes unbounded time to complete, and then a microprocess that describes this otherwise incomprehensible stream of embedding vectors in natural language. In a way, that returns to the encoder/decoder architecture, but with both halves autoregressive. Maybe this would give us a denser representation of said "thought", not constrained by imitating human text.
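To make the two-process idea concrete, here is a rough toy sketch, not anything from the article or an existing model. Every name in it (`reason`, `describe`, `DIM`, `VOCAB`) is hypothetical and the weights are random and untrained. It also makes two simplifying assumptions: the "unbounded" reasoning loop is capped at `max_steps`, and a plain average over the latent thoughts stands in for attention.

```python
import math
import random

DIM = 8                                   # hypothetical latent width
VOCAB = ["a", "b", "c", "d", "<eos>"]     # hypothetical toy vocabulary


def make_matrix(rows, cols, rng):
    """Random, untrained weights; a real model would learn these."""
    return [[rng.uniform(-1, 1) / math.sqrt(cols) for _ in range(cols)]
            for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def reason(prompt_vec, w_reason, max_steps=16, tol=1e-3):
    """Macroprocess: autoregressive in latent space, never emitting text.

    Each step feeds the previous latent "thought" back in. The loop stops
    when the latent stops changing or when max_steps is reached (assumption:
    a cap is used here in place of truly unbounded time).
    """
    thoughts = [prompt_vec]
    for _ in range(max_steps):
        prev = thoughts[-1]
        nxt = [math.tanh(x + p)
               for x, p in zip(matvec(w_reason, prev), prompt_vec)]
        thoughts.append(nxt)
        if max(abs(a - b) for a, b in zip(nxt, prev)) < tol:
            break
    return thoughts


def describe(thoughts, w_out, w_tok, max_tokens=10):
    """Microprocess: autoregressive decoder that verbalizes the thoughts.

    Assumption: a mean over the thought vectors replaces cross-attention.
    """
    summary = [sum(col) / len(thoughts) for col in zip(*thoughts)]
    tokens = []
    prev_tok = [0.0] * DIM
    for _ in range(max_tokens):
        state = [s + e for s, e in zip(summary, prev_tok)]
        logits = matvec(w_out, state)
        idx = max(range(len(VOCAB)), key=logits.__getitem__)  # greedy pick
        if VOCAB[idx] == "<eos>":
            break
        tokens.append(VOCAB[idx])
        prev_tok = w_tok[idx]             # condition on the token just emitted
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    w_reason = make_matrix(DIM, DIM, rng)
    w_out = make_matrix(len(VOCAB), DIM, rng)
    w_tok = make_matrix(len(VOCAB), DIM, rng)
    prompt = [rng.uniform(-1, 1) for _ in range(DIM)]
    thoughts = reason(prompt, w_reason)
    print(len(thoughts), describe(thoughts, w_out, w_tok))
```

The point of the split is that `reason` only ever passes vectors to itself, so nothing forces its intermediate state to look like human text; only `describe` has to produce tokens.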

replies(7): >>42949506 #>>42949822 #>>42950000 #>>42950215 #>>42952388 #>>42955350 #>>42957969 #

bluechair ◴[05 Feb 25 15:09 UTC] No.42949506[source]▶

>>42948636 #

I had this exact same thought yesterday.

I’d go so far as to add one more layer to monitor this one and stop adding layers. My thinking is that this meta awareness is all you need.

No data to back my hypothesis up. So take it for what it’s worth.

replies(2): >>42949744 #>>42957660 #

1. larodi ◴[05 Feb 25 15:23 UTC] No.42949744[source]▶

>>42949506 #

My thought along the same lines: do all tokens live in the same latent space, or in many spaces, with each logical unit trained separately from the others…?
