Has this been shown conclusively for o1? I'd love to read the paper.
I recall the Apple paper showing performance degradation on fuzzed question data (suggesting pattern matching rather than reasoning), which got a lot of traction, but IIRC o1 was pretty resilient compared to previous models. To be clear, I agree with your sentiment. I just have yet to see definitive data showing that o1 is not fundamentally more resilient to the kinds of tests we use to distinguish "reasoning" from "pattern matching".
I watched a professor's lecture on what the open source LLM community thinks are the likely candidates for what's going on in o1[0], and I'm not convinced it's just simple pattern matching.
[0] https://youtu.be/6PEJ96k1kiw