If the job is cutting down trees, you can't measure candidates by how long it takes them to cut one tree; you need to know whether they have the stamina to cut through many.
Take-home assignments work, and the good news is they can be shorter now. A day, or even four hours, of work is enough of a benchmark. Something like a Wordle clone is about the right level of complexity.
Things we look for:
1. Do they use a library? Some people are too proud to do things the easy way. GenAI will happily generate a list of words, which is both wasteful and incomplete when a ready-made dictionary is there to be found. Do they cut the dictionary down to the right size? It should contain only the words, not the definitions (see the first sketch after this list).
2. Architecture? What do they normally use? How do the parts link to one another? How do they handle errors?
3. Do they bring in something new? AI will usually reach for a five-year-old tech stack unless you give it a specific one, because that's around the average age of the code it's trained on. If they're experienced enough to tell the AI to use newer tech, they're probably experienced enough for the job.
4. Require a starting commit (probably just a .gitignore) and ask them to make reasonably sized commits as they go. Classic coding should look a bit like painting, where the picture builds up stroke by stroke. Vibe coding looks like sculpting, where you chip bits off. This will also catch more serious cheating, like someone else doing the work on their behalf: the commits may come from the wrong email addresses, or you'll see massive commits where nothing ever gets chipped off (see the second sketch after this list).
5. There are going to be people who think AI is a nuisance. Tests like this will help you benchmark the different factions. But don't include so much toil that it puts the AI users at a large advantage, and don't set overly complex "solved" problems that the AI can just pull out of its training data.
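For point 1, here's a minimal sketch of the "use the dictionary that already exists" approach. It assumes a plain-text word list at /usr/share/dict/words (common on Unix systems); the path and the five-letter filter are illustrative assumptions, not part of the exercise.

```python
# Minimal sketch: trim a system word list down to what a Wordle clone needs.
# Assumes a plain-text dictionary at /usr/share/dict/words (common on Unix);
# swap in whatever word list the candidate actually ships with.
from pathlib import Path


def load_wordle_words(dict_path: str = "/usr/share/dict/words", length: int = 5) -> list[str]:
    words = set()
    for raw in Path(dict_path).read_text().splitlines():
        word = raw.strip()
        # Keep only plain lowercase five-letter words: skips proper nouns,
        # possessives like "aaron's", and accented entries.
        if len(word) == length and word.isascii() and word.isalpha() and word.islower():
            words.add(word)
    return sorted(words)


if __name__ == "__main__":
    print(f"{len(load_wordle_words())} playable words")
```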
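And for point 4, a rough sketch of how you might skim a take-home repo's history for red flags. It assumes git is installed and the script is run from inside the candidate's repository; the 500-line threshold is an arbitrary number picked for illustration.

```python
# Rough sketch: summarise each commit's author email and size so that
# wrong-email commits and huge "nothing gets chipped off" commits stand out.
# Assumes git is on the PATH and the script runs inside the candidate's repo.
import subprocess


def commit_summary(repo: str = ".") -> list[dict]:
    # %x1e = record separator, %H = hash, %ae = author email, %s = subject
    log = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--numstat",
         "--pretty=format:%x1e%H|%ae|%s"],
        capture_output=True, text=True, check=True,
    ).stdout
    commits = []
    for block in log.split("\x1e")[1:]:
        lines = [line for line in block.splitlines() if line.strip()]
        sha, email, subject = lines[0].split("|", 2)
        changed = 0
        for stat in lines[1:]:
            added, deleted, _path = stat.split("\t", 2)
            # Binary files report "-" instead of a line count; skip those.
            changed += int(added) if added.isdigit() else 0
            changed += int(deleted) if deleted.isdigit() else 0
        commits.append({"sha": sha[:8], "email": email,
                        "lines_changed": changed, "subject": subject})
    return commits


if __name__ == "__main__":
    for c in commit_summary():
        flag = "  <-- suspiciously large" if c["lines_changed"] > 500 else ""
        print(f'{c["sha"]}  {c["email"]:<30}  {c["lines_changed"]:>6}  {c["subject"]}{flag}')
```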