Is there a reason human preference data is even needed? Don't LLMs already have a strong enough notion of question complexity to build a dataset for routing?
"Do you think you need to do high/medium/low amount of thinking to answer X?" seems well within an LLMs wheelhouse if the goal is to build an optimized routing engine.
How do you think that an LLM could come by that information? Do you think that LLM vendors are logging performance and feeding that back into the model or some other mechanism?