Reasoning models reason well, until they don't

This is not the only paper that scales reasoning complexity / difficulty.

The CogniLoad benchmark does this as well (in addition to scaling reasoning length and distractor ratio). It requires the LLM to reason purely from what is in the context (i.e. not from information it was pretrained on), and it finds that reasoning performance decreases significantly as problems get harder (i.e. require the LLM to hold more information in its hidden state simultaneously), but that the bigger challenge for the models is length.

https://arxiv.org/abs/2509.18458

Disclaimer: I'm the primary author of CogniLoad, so feel free to ask me any questions.