
423 points by sohkamyung | 1 comment
Narciss No.45670278
> All participating organizations then generated responses to each question from each of the four AI assistants. This time, we used the free/consumer versions of ChatGPT, Copilot, Perplexity and Gemini. Free versions were chosen to replicate the default (and likely most common) experience for users. Responses were generated in late May and early June 2025.

First of all, none of the SOTA models we're currently using had even been released in late May and early June. Gemini 2.5 came out on June 17; GPT-5 and Claude Opus 4.1 came out at the beginning of August.

On top of that, using the free models for anything like this is absolutely wild. I use the absolute best models, and their research modes, whenever I do research. Anything less is inviting disaster.

You have to use the right tool for the job, and in the AI world right now, any report more than a month old is useless beyond being a snapshot of how things 'used to be'.
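
For what it's worth, the methodology they describe boils down to a simple grid run. A rough sketch of just the collection step, with my assumptions flagged: ask() here is a hypothetical stand-in for querying each free chat UI by hand, which is what the study actually did, and none of this is their code:

    # Sketch of the report's response-collection step, under the
    # assumptions above. ask(assistant, question) is hypothetical.
    ASSISTANTS = ["ChatGPT", "Copilot", "Perplexity", "Gemini"]

    def collect_responses(ask, questions):
        # One response per (assistant, question) pair, saved for later rating.
        return {(a, q): ask(a, q) for a in ASSISTANTS for q in questions}

Note that nothing in that grid pins down which model version each free UI was actually serving at collection time, which is part of my complaint.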

replies(5): >>45670334 #>>45670358 #>>45670859 #>>45670920 #>>45672440 #
dns_snek No.45672440
Ah, the "you're using the wrong model" fallacy (is there a name for this?)

In the eyes of the evangelists, every major model seems to go from "This model is close to flawless at this task, you MUST try this TODAY" to "It's absolutely wild that anyone would ever consider using such a no-good, worthless model for this task" over the course of a year or so. The old model has to be re-framed for the new model to look more impressive.

When GPT-4 was released I was told it was basically a senior-level developer, now it's an obviously worthless model that you'd be a fool to use to write so much as a throwaway script.

replies(1): >>45672801 #
Narciss No.45672801
Not an evangelist for AI at all, I just love it as a tool for my creativity, research and coding.

What I’m saying is that there should be a disclaimer: hey, we’re testing these models for the average person, who has no idea about AI. People who actually know AI would never use them this way.

A better idea: educate people. Add “Here’s the best way to use them btw…” to the report.

All I’m saying is, it’s a tool, and yes you can use it wrong. That’s not a crazy realization. It applies to every other tool.

We knew from the start that the hallucination rate for GPT-4o was nuts. We also know that GPT-5 has a much lower hallucination rate. So there are no surprises here; I’m not saying anything groundbreaking, and neither are they.
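
And if you want to sanity-check a hallucination-rate claim yourself, the core comparison is only a few lines. A rough sketch, with big assumptions flagged: ask_model and the reference answers are stand-ins you'd supply, and substring matching is a crude proxy for the human raters a real study uses:

    # Hedged sketch: estimate a per-model hallucination rate against
    # known-good reference answers. ask_model is a hypothetical callable;
    # substring matching is a crude proxy for human rating.
    def hallucination_rate(ask_model, qa_pairs):
        wrong = 0
        for question, reference in qa_pairs:
            answer = ask_model(question)
            if reference.lower() not in answer.lower():
                wrong += 1
        return wrong / len(qa_pairs)

    # e.g. {name: hallucination_rate(make_asker(name), qa_pairs)
    #       for name in ("GPT-4o", "GPT-5")}  # make_asker is hypothetical

Run the same question set through each model and the gap, or the lack of one, speaks for itself.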