I don't know what those "needle in haystack" benchmarks are testing for because in my experience dumping a big amount of code in the context is not working as you'd expect. It works better if you keep the context small
I think the sweet spot is to include some context that is limited to the scope of the problem and benefit from the longer context window to keep longer conversations going. I often go back to an earlier message on that thread and rewrite with understanding from that longer conversation so that I can continue to manage the context window