Reasoning models reason well, until they don't

From the abstract:

> some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity

Can someone ELI5 what the definitions of reasoning and complexity are here?

They seem to focus on graph problems and on representing problems as graph problems. I haven't read the whole paper or understood it in depth, though. I skimmed the parts that seem to address this question (e.g. section 5 and the Introduction), but maybe there are simpler definitions that elude me.

Surely they don't mean "computational complexity"?

And what exactly is "reasoning"?

I'm aware of philosophical logic and strict logic that can be applied to natural language arguments.

But have we already agreed on a universal scale that grades answers to questions about the physical world? Or is this about mathematical reasoning?

Mixing all of this together always irks me when it comes to these AI "benchmarks". But apparently people see value in these?

I know my question isn't new.

To me it seems that once we leave the mathematical realm, it quickly becomes fuzzy what correct "reasoning" should be.

People can be convincing and avoid obvious logical fallacies, and still reach wrong conclusions... or conclusions that run counter to assumed goals.