Maybe understand the paper before making claims about its contents.
>"In principle, our attribution graphs make predictions that are much more fine-grained than these kinds of interventions can test."
Literally what I said. If the replacement model isn't faithful, then you can't trust the fine-grained details of the graphs: claims like “increasing feature f at layer 7 by Δ will raise feature g at layer 9 by exactly 0.12 in activation.”
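For the sake of concreteness, here's a toy sketch of what testing one of those fine-grained predictions via an intervention would look like. This is not Anthropic's code: the network is a made-up linear layer, and the feature directions, Δ, and the layer 7 → layer 9 framing are all hypothetical, just to illustrate the prediction-then-intervene loop.

```python
# Hypothetical sketch: an attribution graph predicts that bumping feature f
# by delta changes feature g's activation by a specific amount; a direct
# intervention on the (toy) model checks whether that prediction holds.
# Everything here (weights, feature directions, delta) is made up.
import numpy as np

rng = np.random.default_rng(0)
d = 16

W = rng.normal(size=(d, d)) / np.sqrt(d)       # toy "layer 7 -> layer 9" map
f_dir = rng.normal(size=d); f_dir /= np.linalg.norm(f_dir)  # feature f direction
g_dir = rng.normal(size=d); g_dir /= np.linalg.norm(g_dir)  # feature g direction

def g_activation(h):
    """Readout of feature g after the downstream layer."""
    return float(g_dir @ (W @ h))

h = rng.normal(size=d)   # some residual-stream state at layer 7
delta = 2.0              # intervention strength on feature f

baseline = g_activation(h)
intervened = g_activation(h + delta * f_dir)    # the intervention
effect = intervened - baseline

# What the attribution graph would predict in advance (exact here only
# because the toy model is linear; a real model need not agree).
predicted = delta * float(g_dir @ (W @ f_dir))
assert abs(effect - predicted) < 1e-9
```

On a real transformer the "predicted" number comes from the replacement model, and the interesting question is precisely whether the measured effect matches it; the toy only matches by construction.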
>"We found planned word features in about half of the poems we investigated, which may be due to our CLT not capturing features for the planned words, or it may be the case that the model does not always engage in planning"
>"Our results are only claims about specific examples. We don't make claims about mechanisms more broadly. For example, when we discuss planning in poems, we show a few specific examples in which planning appears to occur. It seems likely that the phenomenon is more widespread, but it's not our intent to make that claim."
The moment a phenomenon was demonstrated through interventions, those examples became valid regardless of how faithful the replacement model is.
The worst case scenario here (and ironically, this scenario would mean the replacement model is faithful) is that Claude does not always plan its rhymes, not that it never plans them. The replacement model not being faithful actually means it was simply not robust enough to capture all the ways Claude plans rhymes. Guess what? Neither option invalidates the examples.
Regardless of how faithful the replacement model is, Anthropic have demonstrated that Claude has the ability to plan its rhymes ahead of time and engages in this planning at least sometimes. This is stated quite plainly too. What's so hard to understand?
>"We only explain a fraction of the model's computation. The remaining “dark matter” manifests as error nodes in our attribution graphs, which (unlike features) have no interpretable function, and whose inputs we cannot easily trace. (...) Error nodes are especially a problem for complicated prompts (...) This paper has focused on prompts that are simple enough to avoid these issues. However, even the graphs we have highlighted contain significant contributions from error nodes."
Ok, and? Model computations are extremely complex; who knew? This does not invalidate what they do manage to show.