265 points ctoth | 9 comments
sejje ◴[] No.43744995[source]
In the last example (the riddle), I generally assume the AI isn't misreading; rather, it assumes you didn't give it the riddle correctly, because it has already seen the original.

I would do the same thing, I think. It's too well-known.

The variation doesn't read like a riddle at all, so it's confusing even to me as a human. I can't find the riddle part. Maybe the AI is confused, too. I think it makes an okay assumption.

I guess it would be nice if the AI asked a follow-up question like "are you sure you wrote down the riddle correctly?", and I think it could if instructed to, but right now they don't generally do that on their own.

replies(5): >>43745113 #>>43746264 #>>43747336 #>>43747621 #>>43751793 #
Jensson ◴[] No.43745113[source]
> I generally assume the AI isn't misreading; rather, it assumes you didn't give it the riddle correctly, because it has already seen the original.

LLMs don't assume; they're text completers. They see something that looks almost like a well-known problem and complete it as that well-known problem. It's a failure mode specific to being a text completer, and it's hard to get around.
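
You can poke at this directly with a small open model. A rough sketch using the Hugging Face transformers library, with gpt2 purely as a stand-in and a riddle wording of my own (not the article's exact example):

    # Rough sketch: greedy "text completion" with a small open model.
    # gpt2 is just a stand-in here; the prompt is my own near-copy of a
    # famous riddle, not the article's exact wording.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Deliberately altered setup; a pure completer tends to snap back to
    # the well-known version of the riddle it saw during training.
    prompt = ("The surgeon, who is the boy's father, says: "
              "'I can't operate on this boy, he's")
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=12, do_sample=False)  # greedy decoding
    print(tok.decode(out[0][ids.shape[1]:]))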

replies(6): >>43745166 #>>43745289 #>>43745300 #>>43745301 #>>43745340 #>>43754148 #
og_kalu ◴[] No.43745340[source]
Text completion is just the objective function. It's not descriptive and says nothing about how the models complete text. Why people hang on to this phrase, I'll never understand. When you wrote your comment, you were completing text.

The problem you've just described is a problem with humans as well. LLMs are assuming all the time. Maybe you would like to call it something else, but it is happening.

replies(2): >>43745745 #>>43746034 #
1. codr7 ◴[] No.43745745{3}[source]
With a plan, aiming for something, that's the difference.
replies(2): >>43745781 #>>43746301 #
2. og_kalu ◴[] No.43745781[source]
Again, you are only describing the how here, not the what (text completion).

Also, LLMs absolutely 'plan' and 'aim for something' in the process of completing text.

https://www.anthropic.com/research/tracing-thoughts-language...

replies(1): >>43746009 #
3. namaria ◴[] No.43746009[source]
Yeah, this paper is great fodder for the LLM pixie dust argument.

They use a replacement model. It isn't even observing the LLM itself, but a model with a different architecture. And it is very liberal in interpreting the patterns of activations seen in the replacement model with flowery language. It also includes some very relevant caveats, such as:

"Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model’s activations, there is no guarantee that it does so via the same mechanisms."

https://transformer-circuits.pub/2025/attribution-graphs/met...

So basically the whole exercise might or might not be valid. But it generates some pretty interactive graphics and a nice blog post to reinforce the anthropomorphization discourse.

replies(1): >>43746344 #
4. losvedir ◴[] No.43746301[source]
So do LLMs. Given "In the United States, someone whose job is to go to space is called ____", it will say "an", not because that's the most likely next word in isolation, but because it's "aiming" (to use your terminology) for "astronaut" in the future.
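
If anyone wants to check this kind of thing themselves, here's a rough sketch with the Hugging Face transformers library and gpt2 as a stand-in (the model choice and the words compared are mine, just to show how you'd inspect the next-token distribution):

    # Sketch: compare the model's next-token probability for " a" vs " an"
    # after the prompt. gpt2 is only a stand-in for illustration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "In the United States, someone whose job is to go to space is called"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]      # scores for the next token
    probs = torch.softmax(logits, dim=-1)

    for word in [" a", " an"]:                 # leading space matters for BPE
        tid = tok.encode(word)[0]
        print(repr(word), probs[tid].item())
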
replies(2): >>43746549 #>>43756579 #
5. og_kalu ◴[] No.43746344{3}[source]
> So basically the whole exercise might or might not be valid.

Nonsense. Mechanistic faithfulness probes whether the replacement model (the "cross-layer transcoder") truly uses the same internal functions as the original LLM. If it doesn't, the attribution graphs it suggests might mislead at a fine-grained level, but because every hypothesis generated by those graphs is tested via direct interventions on the real model, the high-level causal discoveries (e.g. that Claude plans its rhymes ahead of time) remain valid.
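
To be concrete about what "direct interventions on the real model" means in general (this is my own generic PyTorch sketch, not Anthropic's tooling): you clamp or ablate part of an intermediate activation during the forward pass of the actual model and check whether the output shifts the way the attribution graph predicted.

    # Generic intervention sketch, not Anthropic's code. gpt2, the layer
    # index, and the zeroed slice are arbitrary choices for illustration.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def ablate(module, inputs, output):
        hidden = output[0]            # this block's hidden states
        hidden[:, -1, :64] = 0.0      # crude "feature" ablation at the last position
        return (hidden,) + output[1:]

    ids = tok("Roses are red, violets are blue,", return_tensors="pt").input_ids
    handle = model.transformer.h[7].register_forward_hook(ablate)
    with torch.no_grad():
        patched = model.generate(ids, max_new_tokens=8, do_sample=False)
    handle.remove()
    print(tok.decode(patched[0]))     # compare against an un-hooked run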

replies(1): >>43750275 #
6. codr7 ◴[] No.43746549[source]
I don't know about you, but I tend to make more elaborate plans than the next word. I have a purpose, an idea I'm trying to communicate. These things don't have ideas; they're not creative.
7. namaria ◴[] No.43750275{4}[source]
> the attribution graphs it suggests might mislead at a fine-grained level

"In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test."

> high‑level causal discoveries (e.g. that Claude plans its rhymes ahead of time) remain valid.

"We found planned word features in about half of the poems we investigated, which may be due to our CLT not capturing features for the planned words, or it may be the case that the model does not always engage in planning"

"Our results are only claims about specific examples. We don't make claims about mechanisms more broadly. For example, when we discuss planning in poems, we show a few specific examples in which planning appears to occur. It seems likely that the phenomenon is more widespread, but it's not our intent to make that claim."

And quite significantly:

"We only explain a fraction of the model's computation. The remaining “dark matter” manifests as error nodes in our attribution graphs, which (unlike features) have no interpretable function, and whose inputs we cannot easily trace. (...) Error nodes are especially a problem for complicated prompts (...) This paper has focused on prompts that are simple enough to avoid these issues. However, even the graphs we have highlighted contain significant contributions from error nodes."

Maybe read the paper before making claims about its contents.

replies(1): >>43753589 #
8. og_kalu ◴[] No.43753589{5}[source]
Maybe understand the paper before making claims about its contents.

>"In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test."

Literally what I said. If the replacement model isn't faithful, then you can't trust the fine-grained details of the graphs: basically claims like "increasing feature f at layer 7 by Δ will raise feature g at layer 9 by exactly 0.12 in activation".

>"We found planned word features in about half of the poems we investigated, which may be due to our CLT not capturing features for the planned words, or it may be the case that the model does not always engage in planning"

>"Our results are only claims about specific examples. We don't make claims about mechanisms more broadly. For example, when we discuss planning in poems, we show a few specific examples in which planning appears to occur. It seems likely that the phenomenon is more widespread, but it's not our intent to make that claim."

The moment there were examples of the phenomenon demonstrated through interventions on the real model, those examples became valid regardless of how faithful the replacement model was.

The worst-case scenario here (and it's ironic, because this scenario would mean the replacement model is faithful) is that Claude does not always plan its rhymes, not that it never plans them. The replacement model not being faithful would instead mean it simply wasn't robust enough to capture all the ways Claude plans rhymes. Guess what? Neither option invalidates the examples.

Regardless of how faithful the replacement model is, Anthropic have demonstrated that Claude has the ability to plan its rhymes ahead of time and engages in this planning at least sometimes. This is stated quite plainly, too. What's so hard to understand?

>"We only explain a fraction of the model's computation. The remaining “dark matter” manifests as error nodes in our attribution graphs, which (unlike features) have no interpretable function, and whose inputs we cannot easily trace. (...) Error nodes are especially a problem for complicated prompts (...) This paper has focused on prompts that are simple enough to avoid these issues. However, even the graphs we have highlighted contain significant contributions from error nodes."

OK, and? Model computations are extremely complex; who knew? This does not invalidate what they do manage to show.

9. yahoozoo ◴[] No.43756579[source]
Are we sure “an astronaut” is not the token?
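
Easy enough to check for any given tokenizer. A quick sketch using the gpt2 BPE vocabulary via transformers (other models use different vocabularies, so the split may differ):

    # Quick check of how " an astronaut" splits under the gpt2 BPE tokenizer.
    # The gpt2 pre-tokenizer splits on the space, so this comes back as at
    # least two pieces, i.e. "an astronaut" is not a single token there.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize(" an astronaut"))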