I've tried gemma3:4b, which scores better in that benchmark, and found it quite disappointing. It breaks a lot, is even worse at code than qwen2.5-coder:7b and incept5/llama3.1-claude:7b, and needs to be tricked or threatened into saying anything about many everyday topics. It also commonly chugs away for minutes, exercising the GPU fans, before responding, by which point I'm already ahead because I've figured out another way to solve my problem or find the information.
My experience with phi4-mini and granite3.3 was about the same, and they annoy me even more when I hook them into code editors and try to get them to contribute to my work. Partly because they're slow, but mostly because at best they suggest adding unnecessary error handling in the style of null checks everywhere, and at worst they start mixing up or outright hallucinating programming languages. Right where they would be useful as leverage if they worked, i.e. close to the edge of where I can debug and refactor without getting stuck, they go into straight nonsense mode, especially on terse first-pass code.
Sometimes I've queried these things for descriptions of recent history in foreign countries, Wikipedia trivia basically, and they're very often wrong in subtle ways. For example, a politician might have been active for half a century in a troubled country, and because they were ousted in a coup once in the eighties, the model is absolutely sure they can't have held office since.
If a person acted the way these things do, I'd wish them immediate institutional care. Maybe the problem is somehow with me, but I have a deep suspicion it's not.