I don't know what those "needle in haystack" benchmarks are testing for because in my experience dumping a big amount of code in the context is not working as you'd expect. It works better if you keep the context small
Claude works well for me loading code up to around 80% of its 200K context and then asking for changes. If the whole project can't fit I try to at least get in headers and then the most relevant files. It doesn't seem to degrade. If you are using something like an AI IDE a lot of times they don't really get the 200K context.