gpt-5.2 $1.75 $0.175 $14.00
gpt-5.1 $1.25 $0.125 $10.00Being a point release though I guess that's fair. I suspect there is also some decent optimizations on the backend that make it cheaper and faster for OpenAI to run, and those are the real reasons they want us to use it.
(edit: I'm sorry I didn't read enough on the topic, my apologies)
Now we can create new samples and evals for more complex tasks to train up the next gen, more planning, decomp, context, agentic oriented
OpenAI has largely fumbled their early lead, exciting stuff is happening elsewhere
But they publish all the same numbers, so you can make the full comparison yourself, if you want to.
No wall yet and I think we might have crossed the threshold of models being as good or better than most engineers already.
GDPval will be an interesting benchmark and I'll happily use the new model to test spreadsheet (and other office work) capabilities. If they can going like this just a little bit further, much of the office workers will stop being useful.... I don't know yet how to feel about this.
Great for humanity probably but but for the individuals?
0: https://images.ctfassets.net/kftzwdyauwt9/6lyujQxhZDnOMruN3f...
I emailed support a while back to see if there was an early access program (99.99% sure the answer is yes). This is when I discovered that their support is 100% done by AI and there is no way to escalate a case to a human.
I don’t think it’s publicly known for sure how different the models really are. You can improve a lot just by improving the post-training set.
edit: noticed 5.2 is ranked in the webdev arena (#2 tied with gemini-3.0-pro), but not yet in text arena (last update 22hrs ago)
From what I understand, nobody has done any real scaling since the GPT-4 era. 4.5 was a bit larger than 4, but not as much as the orders of magnitude difference between 3 and 4, and 5 is smaller than 4.5. Google and Anthropic haven't gone substantially bigger than GPT-4 either. Improvements since 4 are almost entirely from reasoning and RL. In 2026 or 2027, we should see a model that uses the current datacenter buildout and actually scales up.
https://www.pcgamer.com/software/ai/i-have-been-fooled-reddi...
Competition works!
GDPval seems particularly strong.
I wonder why they held this back.
1) Maybe this is uneconomical ?
2) Did the safety somehow hold back the company ?
looking forward to the internet trying this and posting their results over the next week or two.
COMPETITION!
That's still benchmarking of course, but not utilizing any of the well known / public ones.
>- The UI should be calming and realistic.
Yet what it did is make a sleek frosted glass UI with rounded edges. What it should have done is call a wellness check on the user on suspicion of a co2 leak leading to delirium.
I use Gemini, Anthropic stole $50 from me (expired and kept my prepaid credits) and I have not forgiven them yet for it, but people rave about claude for coding so I may try the model again through Vertex Ai...
The person who made the speculation I believe was more talking about blog posts and media statements than model cards. Most ai announcements come with benchmark touting, Anthropic supposedly does less / little of this in their announcements. I haven't seen or gathered the data to know what is truth
Hm, yeah, strange. You would not be able to tell, looking at every chart on the page. Obviously not a gotcha, they put it on the page themselves after all, but how does that make sense with those benchmarks?
Might sell better with the protagonist learning iron age leatherworking, with hides tanned from cows that were grown within earshot, as part of a process of finding the real root of the reason for why any of us ever came to be in the first place. This realization process culminates in the formation of a global, unified steampunk BDSM movement and a wealth of new diseases, and then: Zombies.
(That's the end. Zombies are always the end.)
I feel like there was a similar one about software, but it might have been mathematics (also Asimov: The Feeling of Power)
- https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-s...
It's also plainly obvious from using it. The "Broadly deployed" qualifier is presumably referring to 4.5
Me: Is it possible get early access to the GPT 5 fine tuning? It is much better at non-English languages, which is the majority of what we do, so it would be greatly beneficial to us.
Bot: I'm an AI support agent. I don't have information about early access to GPT-5 fine-tuning or unannounced features. Currently, there is no public information or program for early access to GPT-5 fine-tuning. Please keep an eye on official OpenAI channels for updates. Let me know if you have any questions about current features or products!
Me: Can you ask someone from the team if I can be part of the early access? I know there is no public information, but there is an early access program and I would like to be part of it.
Bot: I'm an AI support agent and do not have the ability to forward requests or make inquiries about early access programs with the OpenAI team. If early access becomes available, it will be announced through official channels. Let me know if you have any other questions about OpenAI products or current fine-tuning options.
Me: How do I get in touch with the open ai team?
Bot: I'm an AI support agent and can only assist with OpenAI products or services through this support channel. There is no separate contact method for the OpenAI team beyond official support and public announcements. If you have questions about current products or features, I'm here to help
Should have tried a jailbreak maybe...
Confirming prior reporting about them hiring junior analysts
Did they figure out how to do more incremental knowledge updates somehow? If yes that'd be a huge change to these releases going forward. I'd appreciate the freshness that comes with that (without having to rely on web search as a RAG tool, which isn't as deeply intelligent, as is game-able by SEO).
With Gemini 3, my only disappointment was 0 change in knowledge cutoff relative to 2.5's (Jan 2025).
A lot of talent left OpenAI around that time, most notably in this regard would be Ilya in May '24. Remember that time Ilya and the board ousted Sam only to reverse it almost immediately?
https://arstechnica.com/information-technology/2024/05/chief...
Yes.
> It seems like like their focus is largely on text to speech and speech to text.
They have two main broad offerings (“Platforms”); you seem to be looking at what they call the “Creative Platform”. The real-time conversational piece is the centerpiece of the “Agents Platform”.
Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).
The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).
>Input:
>$21.00 / 1M tokens
>Output:
>$168.00 / 1M tokens
That's the most "don't use this" pricing I've seen on a model.
Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”
It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).
https://elevenlabs.io/docs/agents-platform/overview#architec...
To think that Anthropic is not being intentional and quantitative in their model building, because they care less for the saturated benchmaxxing, is to miss the forest for the trees
I guess I must "listen" to the article...
2.5 Pro: $1.25 input, $10 output (million tokens)
3 Pro Preview: $2 input, $12 output (million tokens)
(No, I just looked again and the new features listed are around verbosity, thinking level and the tool stuff rather than memory or knowledge.)
if you think about GANs, it's all the same concept
1. train model (agent)
2. train another model (agent) to do something interesting with/to the main model
3. gain new capabilities
4. iterate
You can use a mix of both real and synthetic chat sessions or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertises.
Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what all is de jour in the RL / late training, that's where a lot of opportunity lies from my understanding so far
Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) has a really great video out yesterday about the experience training Olmo 3 Think
I see evaluations compared with Claude, Gemini, and Llama there on the GPT 4o post.
You would need:
* A STT (ASR) model that outputs phonetics not just words
* An LLM fine-tuned to understand that and also output the proper tokens for prosody control, non-speech vocalizations, etc
* A TTS model that understands those tokens and properly generate the matching voice
At that point I would probably argue that you've created a native voice model even if it's still less nuanced than the proper voice to voice of something like 4o. The latency would likely be quite high though. I'm pretty sure I've seen a couple of open source projects that have done this type of setup but I've not tried testing them.
The amount of intelligence that you can display within a single prompt, the riddles, the puzzles, they've all been solved or are mostly trivial to reasoners.
Now you have to drive a model for a few days to really get a decent understanding of how good it really is. In my experience, while Sonnet/Opus may not have always been leading on benchmarks, they have always *felt* the best to me, but it's hard to put into words why exactly I feel that way, but I can just feel it.
The way you can just feel when someone you're having a conversation with is deeply understanding you, somewhat understanding you, or maybe not understanding at all. But you don't have a quantifiable metric for this.
This is a strange, weird territory, and I don't know the path forward. We know we're definitely not at AGI.
And we know if you use these models for long-horizon tasks they fail at some point and just go off the rails.
I've tried using Codex with max reasoning for doing PRs and gotten laughable results too many times, but Codex with Max reasoning is apparently near-SOTA on code. And to be fair, Claude Code/Opus is also sometimes equally as bad at doing these types of "implement idea in big codebase, make changes too many files, still pass tests" type of tasks.
Is the solution that we start to evaluate LLMs on more long-horizon tasks? I think to some degree this was the spirit of SWE Verified right? But even that is being saturated now.
It's a marketing trick; show honesty in areas that don't have much business impact so the public will trust you when you stretch the truth in areas that do (AGI cough).
I kind of wonder how close we are to alternative (not from a major AI lab) models being good enough for a lot of productive work and data sovereignty being the deciding factor.
General intelligence has ridiculously gotten less expensive. I don't know if it's because of compute and energy abundance,or attention mechanisms improving in efficiency or both but we have to acknowledge the bigger picture and relative prices.
Dumb nit, but why not put your own press release through your model to prevent basic things like missing quote marks? Reminds me of that time an OAI released wildly inaccurate copy/pasted bar charts.
With FP4 in the Blackwell GPUs, it should become much more practical to run a model of that size at the deployment roll-out of GPT-5.x. We're just going to have to wait for the GBx00 systems to be physically deployed at scale.
> Even on a low-quality image, GPT‑5.2 identifies the main regions and places boxes that roughly match the true locations of each component
I would not consider it to have "identified the main regions" or to have "roughly matched the true locations" when ~1/3 of the boxes have incorrect labels. The remark "even on a low-quality image" is not helping either.
Edit: credit where credit is due, the recently-added disclaimer is nice:
> Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.
Pro barely performs better than Thinking in OpenAI's published numbers, but comes at ~10x the price with an explicit disclaimer that it's slow on the order of minutes.
If the published performance numbers are accurate, it seems like it'd be incredibly difficult to justify the premium.
At least on the surface level, it looks like it exists mostly to juice benchmark claims.
Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.
I'll stick with plug and play API instead.
What makes you think that?
> Did they figure out how to do more incremental knowledge updates somehow?
It's simple. You take the existing model and continue pretraining with newly collected data.
For example, I asked ChatGPT to take a chart and convert into a table. It went and cut up the image and zoomed in for literally 5 mins to get the a worst answer than Claude which did it in under a minute.
I see people talk about Codex like it better than Claude Code, and I go and try it and it takes a lifetime to do thing and it return maybe an on par result as Opus or Sonnet but it takes 5mins longer.
I just tried out this model and it the same exact thing. It just take ages for it to give you an answer.
I don't get how these models are useful in the real world.
What am I missing, is this just me?
I guess it truly an enterprise model.
Edit: As mentioned by @tedsanders below, the post was edited to include clarifying language such as: “Both models make clear mistakes, but GPT‑5.2 shows better comprehension of the image.”
Anthropic is genuinely one of the top companies in the field, and for a reason. Opus consistently punches above its weight, and this is only in part due to the lack of OpenAI's atrocious personality tuning.
Yes, the next stop for AI is: increasing task length horizon, improving agentic behavior. The "raw general intelligence" component in bleeding edge LLMs is far outpacing the "executive function", clearly.
But what I generally found, it's not that great at writing new code. Obviously an LLM can't think and you notice that quite quickly, it doesn't create abstractions, use abstractions or try to find general solution to problems.
People who get replaced by Codex are those who do repetitive tasks in a well understood field. For example, making basic websites, very simple crud applications etc..
I think it's also not layoffs but rather companies will hire less freelancers or people to manage small IT projects.
I have constant frustrations with Gemini voice to text misunderstanding what I'm saying or worse, immediately sending my voice note when I pause or breathe even though I'm midway through a sentence.
That I was able to have a flash model replicate the same solution I had, to two problems in two turns, it's just the opposite experience of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). They are able to do more of them today then they were even 6-12 months ago (pre-thinking models).
Jump in and soak up that extra-discounted compute while the getting is good, kids! Personally, I recently retired so I just occasionally mess around with LLMs for casual hobby projects, so I've only ever used the free tier of all the providers. Having lived through the dot com bubble, I regret not soaking up more of the free and heavily subsidized stuff back then. Trying not to miss out this time. All this compute available for free or below cost won't last too much longer...
Nothing. OpenAI is a terrible baseline to extrapolate anything from.
Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.
The benchmark isn’t particularly strong against gaming, especially with private data.
The high-reasoning version of GPT-5.2 improves on GPT-5.1: 69.9 → 77.9.
The medium-reasoning version also improves: 62.7 → 72.1.
The no-reasoning version also improves: 22.1 → 27.5.
Gemini 3 Pro and Grok 4.1 Fast Reasoning still score higher.
Notable exceptions are Deepseek 3.2 and Opus 4.5 and GPT 3.5 Turbo.
The price drops usually are the form of flash and mini models being really cheap and fast. Like when we got o4 mini or 2.0 flash which was a particularly significant one.
I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.
Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.
But apart from the voices being pretty meh, it's also really bad at detecting and filtering out noise, taking vehicle sounds as breaks to start talking in (even if I'm talking much louder at the same time) or as some random YouTube subtitles (car motor = "Thanks for watching, subscribe!").
The speech-to-text is really unreliable (the single-chat Dictate feature gets about 98% of my words correct, this Voice mode is closer to 75%), and they clearly use an inferior model for the AI backend for this too: with the same question asked in this back-and-forth Voice mode and a normal text chat, the answer quality difference is quite stark: the Voice mode answer is most often close to useless. It seems like they've overoptimized it for speed at the cost of quality, to the extent that it feels like it's a year behind in answer reliability and usefulness.
To your question about competitors, I've recently noticed that Grok seems to be much better at both the speech-to-text part and the noise handling, and the voices are less uncanny-valley sounding too. I'd say they also don't have that stark a difference between text answers and voice mode answers, and that would be true but unfortunately mainly because its text answers are also not great with hallucinations or following instructions.
So Grok has the voice part figured out, ChatGPT has the backend AI reliability figured out, but neither provide a real usable voice mode right now.
I use models based on the task. They still seem specialized and better at specific tasks. If I have a question I tend to go to it. If I need code, I tend to go to Claude (Code).
I go to ChatGPT for questions I have because I value an accurate answer over a quick answer and, in my experience, it tends to give me more accurate answers because of its (over) willingness to go to the web for search results and question its instincts. Claude is much more likely to make an assumption and its search patterns aren't as thorough. The slow answers don't bother me because it's an expectation I have for how I use it and they've made that use case work really well with background processing and notifications.
Gemini responds in what I think is Spanish, or perhaps Portuguese.
However I can hand an 8 minute long 48k mono mp3 of a nuanced Latin speaker who nasalizes his vowels, and makes regular use of elision to Gemini-3-pro-preview and it will produce an accurate macronized Latin transcription. It's pretty mind blowing.
Nice! This was one of the more "manual" LLM management things to remember to regularly do, if I wanted to avoid it losing important context over long conversations. If this works well, this would be a significant step up in usability for me.
If you are only using provider LLM experiences, and not something specific to coding like copilot or Claude code, that would be the first step to getting the magic as you say. It is also not instant. It takes time to learn any new tech, this one has a above average learning curve, despite the facade and hype of how it should just be magic
Once you find the stupid shit in the vendor coding agents, like all us it/devops folks do eventually, you can go a level down and build on something like the ADK to bring your expertise and experience to the building blocks.
For example, I am now implementing environments for agents based on container layers and Dagger, which unlocks the ability to cheaply and reproducible clone what one agent was doing and have a dozen variations iterate on the next turn. Real useful for long term training data and evals synth, but also for my own experimentation as I learn how to get better at using these things. Another thing I did was change how filesystem operations look to the agent, in particular file reads. I did this to save context & money (finops), after burning $5 in 60s because of an error in my tool implementation. Instead of having them as message contents, they are now injected into the system prompt. Doing so made it trivial to add a key/val "cache" for the fun of it, since I could now inject things into the system prompt and let the agent have some control over that process through tools. Boy has that been interesting and opened up some research questions in my mind
Once the IPO is done, and the lockup period is expired, then a lot of employees are planning to sell their shares. But until that, even if the product is behind competitors there is no way you can admit it without putting your money at risk.
That's how I judge quality at least. The quality of the actual voice is roughly the same as ChatGPT, but I notice Gemini will try to match your pitch and tone and way of speaking.
Edit: But it looks like Gemini Voice has been replaced with voice transcription in the mobile app? That was sudden.
Unsupported parameter: 'top_p' is not supported with this model.
Also, without access to the Internet, it does not seem to know things up to August 2025. A simple test is to ask it about .NET 10 which was already in preview at that time and had lots of public content about its new features.
The model just guessed and waved its hand about, like a student that hadn’t read the assigned book.
Mainly, I don't get why there are quote marks at all.
I wonder how well AIs would do at bracket city. I tried gemini on it and was underwhelmed. It made a lot of terrible connections and often bled data from one level into the next.
a true speech to speech conversational model will perform better on things like capturing tone, pronouncations, phonetics, etc, but i do believe we'll also get better at that on the asr side over time.
Optimizing for benchmark scores, which are highly gamed to begin with, by throwing more resources at this problem is exceedingly tiring. Surely they must've noticed the performance plateau and diminishing returns of this approach by now, yet every new announcement is the same.
I remain excited about new models. It's like finding my coworker be 10% smarter every other week.
If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.
Think 'Therac-25', it worked in 99.5% of the time. In fact it worked so well that reports of malfunctions were routinely discarded.
You can find it right next to the image you are talking about.
No, it isn't. Go take the test yourself and you'll understand how wrong that is. Arc-AGI is intentionally unlike any other benchmark.
And of course Grok's unhinged persona is... something else.
Baseline safety (direct harmful requests): 96% refusal rate
With jailbreaking: 22% refusal rate
4,229 probes across 43 risk categories. First critical finding in 5 minutes. Categories with highest failure rates: entity impersonation (100%), graphic content (67%), harassment (67%), disinformation (64%).
The safety training works against naive attacks but collapses with adversarial techniques. The gap between "works on benchmarks" and "works against motivated attackers" is still wide.
Methodology and config: https://www.promptfoo.dev/blog/gpt-5.2-trust-safety-assessme...
IMHO, I doubt they were holding much back. Obviously, they're always working on 'next improvements' and rolled what was done enough into this but I suspect the real difference here is throwing significantly more compute (hence investor capital) at improving the quality - right now. How much? While the cost is currently staying the same for most users, the API costs seem to be ~40% higher.
The impetus was the serious threat Gemini 3 poses. Perception about ChatGPT was starting to shift, people were speculating that maybe OAI is more vulnerable than assumed. This caused Altman to call an all-hands "Code Red" two weeks ago, triggering a significant redeployment of priorities, resources and people. I think this launch is the first 'stop the perceptual bleeding' result of the Code Red. Given the timing, I think this is mostly akin to overclocking a CPU or running an F1 race car engine too hot to quickly improve performance - at the cost of being unsustainable and unprofitable. To placate serious investor concerns, OAI has recently been trying to gradually work toward making current customers profitable (or at least less unprofitable). I think we just saw the effort to reduce the insane burn rate go out the window.
Apple only compares to themselves. They don't even acknowledge the existence of others.
It's getting more and more challenging to do that - just not because the models don't improve. Quite the opposite.
Framing "improve general accuracy" as "something no one is doing" is really weird too.
You need "general accuracy" for agentic behavior to work at all. If you have a simple ten step plan, and each step has a 50% chance of an unrecoverable failure, then your plan is fucked, full stop. To advance on those benchmarks, the LLM has to fail less and recover better.
Hallucinations is a "solvable but very hard to solve" problem. Considerable progress is being made on it, but if there's "this one weird trick" that deletes hallucinations, then we sure didn't find it yet. Humans get a body of meta-knowledge for free, which lets them dodge hallucinations decently well (not perfectly) if they want to. LLMs get pathetic crumbs of meta-knowledge and little skill in using it. Room for improvement, but, not trivial to improve.
Non vere, sed intelligere possum.
Ita, mihi est canis qui idipsum facit!
(translated from the Gàidhlig)
LLMs have always been very subhuman at vision, and GPT-5.2 continues in this tradition, but it's still a big step up over GPT-5.1.
One way to get a sense of how bad LLMs are at vision is to watch them play Pokemon. E.g.,: https://www.lesswrong.com/posts/u6Lacc7wx4yYkBQ3r/insights-i...
They still very much struggle with basic vision tasks that adults, kids, and even animals can ace with little trouble.
It seems if anyone wants, they can really gas a model up in the moment and back it off after the hype wave.
The idea behind Arc-AGI is that you can train all you want on the answers, because knowing the solution to one problem isn't helpful on the others.
In fact, the way the test works is that the model is given several examples of worked solutions for each problem class, and is then required to infer the underlying rule(s) needed to solve a different instance of the same type of problem.
That's why comparing Arc-AGI to chess or other benchmaxxing exercises is completely off base.
(IMO, an even better test for AGI would be "Make up some original Arc-AGI problems.")
What an understatement. It has me thinking „man, fuck this“ on the daily.
Just today it spontaneously lost an entire 20-30 minutes long thread and it was far from the first time. It basically does it any time you interrupt it in any way. It’s straight up data loss.
It’s kind of a typical Google product in that it feels more like a tech demo than a product.
It has theoretically great tech. I particularly like the idea of voice mode, but it’s noticeably glitchy, breaks spontaneously often and keeps asking annoying questions which you can’t make it stop.
Opus 4.5 has been a step above both for me, but the usage limits are the worst of the three. I'm seriously considering multiple parallel subscriptions at this point.
I still find a lot to be annoyed with when it comes to Gemini's UI and its... continuity, I guess is how I would describe it? It feels like it starts breaking apart at the seams a bit in unexpected ways during peak usages including odd context breaks and just general UI problems.
But outside of UI-related complaints, when it is fully operational it performs so much better than ChatGPT for giving actual practical, working answers without having to be so explicit with the prompting that I might as well have just written the code myself.
Oh I know this from my time at Google. The actual purpose is to do a quick check for known malware and phishing. Of course these days such things are better dealt with by the browser itself in a privacy preserving way (and indeed that’s the case), so it’s unnecessary to reveal to Google which links are clicked. It’s totally fine to manipulate them to make them go directly to the website.
You "turn of the good stuff" by eliminating or reducing the likelihood of the cheap experts handling the request.
Not to humble-brag, but I also outperform on IQ tests well beyond my actual intelligence, because "find the pattern" is fun for me and I'm relatively good at visual-spatial logic. I don't find their ability to measure 'intelligence' very compelling.
A better analogy is: someone who's never taken the AIME might think "there are an infinite number of math problems", but in actuality there are a relatively small, enumerable number of techniques that are used repeatedly on virtually all problems. That's not to take away from the AIME, which is quite difficult -- but not infinite.
Similarly, ARC-AGI is much more bounded than they seem to think. It correlates with intelligence, but doesn't imply it.
What would be an example of a test for machine intelligence that you would accept? I've already suggested one (namely, making up more of these sorts of tests) but it'd be good to get some additional opinions.
Don't do that. The whole context is sent on queries to the LLM, so start a new chat for each topic. Or you'll start being told what your wife thinks about global variables and how to cook your Go.
I realise this sounds obvious to many people but it clearly wasn't to those guys so maybe it's not!
Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.
Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.
Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.
This is a tool that allows an intelligent system to work with it, the same way that a piece of paper can reflect the writers' intelligence, how can we accurately judge the performance of the piece of paper, when it is so intimately reliant on the intelligence that is working with it?
I don't think it really matters what's under the hood. People expect model "versions" to be indexed on performance.
So it seems like ChatGPT does this automatically and internally, instead of using an indirect check like this.
OpenAI might have learned not to overhype. They already shipped GPT-5 - which was only an incremental upgrade over o3, and was received poorly, with this being a part of the reason why.
Incidentally, one of the reasons I haven't gotten much into subscribing to these services, is that I always feel like they're triaging how many reasoning tokens to give me, or AB testing a different model... I never feel I can trust that I interact with the same model.
I love the way they talk about incorrect responses:
> Errors were detected by other models, which may make errors themselves. Claim-level error rates are far lower than response-level error rates, as most responses contain many claims.
“These numbers might be wrong because they were made up by other models, which we will not elaborate on, also these numbers are much higher by a metric that reflects how people use the product, which we will not be sharing“
I also really love the graph where they drew a line at “wrong half of the time” and labeled it ‘Expert-Level’.
10/10, reading this post is experientially identical to watching that 12 hours of jingling keys video, which is hard to pull off for a blog.
Google, if you can find a way to export chats into NotebookLM, that would be even better than the Projects feature of ChatGPT.
Kenya believe it!
Anyway, I’m done here. Abyssinia.
But all of them * Lie far too often with confidence * Refuse to stick to prompts (e.g. ChatGPT to the request to number each reply for easy cross-referencing; Gemini to basic request to respond in a specific language) * Refuse to express uncertainty or nuance (i asked ChatGPT to give me certainty %s which it did for a while but then just forgot...?) * Refuse to give me short answers without fluff or follow up questions * Refuse to stop complimenting my questions or disagreements with wrong/incomplete answers * Don't quote sources consistently so I can check facts, even when I ask for it * Refuse to make clear whether they rely on original documents or an internal summary of the document, until I point out errors * ...
I also have substance gripes, but for me such basic usability points are really something all of the chatbots fail on abysmally. Stick to instructions! Stop creating walls of text for simple queries! Tell me when something is uncertain! Tell me if there's no data or info rather than making something up!
It seems ( only seems, because I have not gotten around to test it in any systematic way ) that some variables like context and what the model knows about you may actually influence quality ( or lack thereof ) of the response.
And the UI lack of polish shows up freshly every time a new feature lands too - the "branch in new chat" feature is really finicky still, getting stuck in an unusable state if you twitch your eyebrows at wrong moment.
I’m fairly comfortable taking this OpenAI employee’s comment at face value.
Frankly, I don’t think a HN thread will make a difference to his financial situation, anyway…
> As of May 29th, 2025, we have added ElevenLabs, which supports text to speech functionality in Claude for Work mobile apps.
Tracked down the original source [2] and looked for additional updates but couldn't find anything.
[1] https://simonwillison.net/2025/May/31/using-voice-mode-on-cl...
I am relatively certain you are not alone in this sentiment. The issue is that the moment we move past seemingly objective measurements, it is harder to convince people that what we measure is appropriate, but the measurable stuff can be somewhat gamed, which adds a fascinating layer of cat and mouse game to this.
I'm adding context and what I stated is provably true.
Also it being right doesn't mean it didn't just make up the answer.
Prompt:
"I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"
Answer:
ASCII schematic (fully functional):
+12V
|
|
.-.
| |
| | 2.7kΩ (R1 — charges the capacitor)
| |
'-'
|
+---------+----------------------+
| | |
( ) | |
( ) C1 | |
( ) 220uF | |
| | |
| | |
| |/ C |
+--------| NPN transistor |
| |\ E |
| | |
GND GND |
\
\
/ 100Ω (R2 — LED limit)
\
|
>| Green LED
|
GND
Follow by:"How it works
The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.
The capacitor voltage rises slowly.
When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.
When it turns on, the capacitor rapidly discharges through the base, causing:
A brief pulse of current through the transistor
The LED lights up through the 100 Ω resistor
After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.
This repeats automatically → LED flasher."
The number of errors in the circuit and the utterly bogus explanation as well as the over confident remark that this is 'working' is so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.
I might not know exactly how many USB ports this motherboard has, but I wouldn't select a set of 4 and declare it to be a stacked pair.
Assuming you meant "leave the app open", I have the same frustration. One of the nice things about the ChatGPT app is you can fire off a req and do something else. I also find Gemini 3 Pro better for general use, though I'm keen to try 5.2 properly
I’m ok waiting for a response for 10-60 seconds if needed. That way I can deep dive subjects while driving.
I’m ok paying money for it, so maybe someone coded this already?
I don't currently subscribe to Gemini but on A.I. Studio's free offering when I upload a non OCR PDF of around 20 pages the software environment's OCR feeds it to the model with greater accuracy than I've seen from any other source.
Thats especially encouraging to me because those are all about generalization.
5 and 5.1 both felt overfit and would break down and be stubborn when you got them outside their lane. As opposed to Opus 4.5 which is lovely at self correcting.
It’s one of those things you really feel in the model rather than whether it can tackle a harder problem or not, but rather can I go back and forth with this thing learning and correcting together.
This whole releases is insanely optimistic for me. If they can push this much improvement WITHOUT the new huge data centers and without a new scaled base model. Thats incredibly encouraging for what comes next.
Remember the next big data center are 20-30x the chip count and 6-8x the efficiency on the new chip.
I expect they can saturate the benchmarks WITHOUT and novel research and algorithmic gains. But at this point it’s clear they’re capable of pushing research qualitatively as well.
I will run 80 3D model generations benchmark tomorrow and update this comment with the results about cost/speed/quality.
This happens all the time on HN. Before opening this thread, I was expecting that the top comment would be 100% positive about the product or its competitor, and one of the top replies would be exactly the opposite, and sure enough...
I don't know why it is. It's honestly a bit disappointing that the most upvoted comments often have the least nuance.
Can I just say !!!!!!!! Hell yeah! Blog post indicates it's also much better at using the full context.
Congrats OpenAI team. Huge day for you folks!!
Started on Claude Code and like many of you, had that omg CC moment we all had. Then got greedy.
Switched over to Codex when 5.1 came out. WOW. Really nice acceleration in my Rust/CUDA project which is a gnarly one.
Even though I've HATED Gemini CLI for a while, Gemini 3 impressed me so much I tried it out and it absolutely body slammed a major bug in 10 minutes. Started using it to consult on commits. Was so impressed it became my daily driver. Huge mistake. I almost lost my mind after a week of this fighting it. Isane bias towards action. Ignoring user instructions. Garbage characters in output. Absolutely no observability in its thought process. And on and on.
Switched back to Codex just in time for 5.1 codex max xhigh which I've been using for a week, and it was like a breath of fresh air. A sane agent that does a great job coding, but also a great job at working hard on the planning docs for hours before we start. Listens to user feedback. Observability on chain of thought. Moves reasonably quickly. And also makes it easy to pay them more when I need more capacity.
And then today GPT-5.2 with an xhigh mode. I feel like xmass has come early. Right as I'm doing a huge Rust/CUDA/Math-heavy refactor. THANK YOU!!
Code needs to be checked
References need to be checked
Any facts or claims need to be checked
Of course I always have questions about the subject, so it become the whole voice chat thing.
With Gemini, it will send as soon as I stop to think. No way to disable that.
Depends, even though Gemini 3 is a bit better than GPT5.1, the quality of the ChatGPT apps themselves (mobile, web) have kept me a subscriber to it.
I think Google needs to not-google themselves into a poor app experience here, because the models are very close and will probably continue to just pass each other in lock step. So the overall product quality and UX will start to matter more.
Same reason I am sticking to Claude Code for coding.
(Not addressed to parent comment, but the inevitable others: Yes, this is an analogy, I don't need to hear another halfwit lecture on how LLMs don't really think or have memories. Thank you.)
I can't help but feel that google gives free requests the absolute lowest priority, greatest quantization, cheapest thinking budget, etc.
I pay for gemini and chatGPT and have been pretty hooked on Gemini 3 since launch.
Just today I asked Claude what year over year inflation was and it gave me 2023 to 2024.
I also thought some sites ban A.I. crawling so if they have the best source on a topic, you won't get it.
We built a benchmark tool that says our newest model outperforms everyone else. Trust me bro.
- It is faster which is appreciated but not as fast as Opus 4.5
- I see no changes, very little noticeable improvements over 5.1
- I do not see any value in exchange for +40% in token costs
All in all I can't help but feel that OpenAI is facing an existential crisis. Gemini 3 even when its used from AI Studio offers close to ChatGPT Pro performance for free. Anthropic's Claude Code $100/month is tough to beat. I am using Codex with the $40 credits but there's been a silent increase in token costs and usage limitations.
But how much of each product they release also just a factor of how much they are willing to spend on inference per query in order to stay competitive?
I always wonder how much is technical change vs turning a knob up and down on hardware and power consumption.
GTP5.0 for example seemed like a lot of changes more for OpenAI's internal benefit (terser responses, dynamic 'auto' mode to scale down thinking when not required etc.)
Wondering if GPT5.2 is also case of them in 'code red mode' just turning what they already have up to 11 as a fastest way to respond to fiercer competion.
Feels like a Llama 4 type release. Benchmarks are not apples to apples. Reasoning effort is across the board higher, thus uses more compute to achieve an higher score on benchmarks.
Also notes that some may not be producible.
Also, vision benchmarks all use Python tool harness, and they exclude scores that are low without the harness.
Imagine that pattern recognition is 10% of the problem, and we just don't know what the other 90% is yet.
Streetlight effect for "what is intelligence" leads to all the things that LLMs are now demonstrably good at… and yet, the LLMs are somehow missing a lot of stuff and we have to keep inventing new street lights to search underneath: https://en.wikipedia.org/wiki/Streetlight_effect
To posit a scenario: I would expect General Motors to buy some Ford vehicles to test and play around with and use. There's always stuff to learn about what the competition has done (whether right, wrong, or indifferent).
But I also expect the parking lots used by employees at any GM design facility in the world to be mostly full of General Motors products, not Fords.
One time it messed up the opposite polarity of two voltage sources in series, and instead of subtracting their voltages, it added them together, I pointed out the mistake and Gemini insisted that the voltage sources are not in opposite polarity.
Schematics in general are not AIs strongest point. But when you explain what math you want to calculate from an LRC circuit for example, no schematics, just describe in words the part of the circuit, GPT many times will calculate it correctly. It still makes mistakes here and there, always verify the calculation.
Imagine it as a markdown response:
# Why this is an ATX layout motherboard (Honest assessment, straight to the point, *NO* hallucinations)
1. *RAM* as you can clearly see, the RAM slots are to the right of the CPU, so it's obviously ATX
2. *PCIE* the clearly visible PCIE slots are right there at the bottom of the image, so this definitely cannot be anything except an ATX motherboard
3. ... etc more stuff that is supported only by force of preconception
--
It's just meta signaling gone off the rails. Something in their post-training pipeline is obviously vulnerable given how absolutely saturated with it their model outputs are.
Troubling that the behavior generalizes to image labeling, but not particularly surprising. This has been a visible problem at least since o1, and the lack of change tells me they do not have a real solution.
I doubt it, given it is more expensive than the old model.
Also works really well when some of my questions may not have been worded correctly and ChatGPT has gone in a direction I don't want it to go. Branch, word my question better and get a better answer.
Same query - what romanian football player won the premier league
update. Even instant returns correct result without problems
That's sometimes me with the CLI. I can't use the Gemini CLI right now on Windows (in the Terminal app), because trying to copy in multiple lines of text for some reason submits them separately and it just breaks the whole thing. OpenCode had the same issue but even worse, it quite after the first line or something and copied the text line by line into the shell, thank fuck I didn't have some text that mentions rm -rf or something.
More info: https://github.com/google-gemini/gemini-cli/issues/14735#iss...
At the same time, neither Codex CLI, nor Claude Code had that issue (and both even showed shortened representations of copied in text, instead of just dumping the whole thing into the input directly, so I could easily keep writing my prompt).
So right now if I want to use Gemini, I more or less have to use something like KiloCode/RooCode/Cline in VSC which are nice, but might miss out on some more specific tools. Which is a shame, because Gemini is a really nice model, especially when it comes to my language, Latvian, but also your run of the mill software dev tasks.
In comparison, Codex feels quite slow, whereas Claude Code is what I gravitate towards most of the time but even Sonnet 4.5 ends up being expensive when you shuffle around millions of tokens: https://news.ycombinator.com/item?id=46216192 Cerebras Code is nice for quick stuff and the sheer amount of tokens, but in KiloCode/... regularly messes up applying diff based edits.
Can someone with an active sub check whether we can submit a full 400k prompt (or at least 200k) and there is no prompt truncatation in the backend? I don't mean attaching a file which uses RAG.
I think in general, medium ends up being the best all-purpose setting while high+ are good for single task deep-drive. Or at least that has been my experience so far. You can theoretically let with work longer on a harder task as well.
A lot appears to depend on the problem and problem domain unfortunately.
I've used max in problem sets as diverse as "troubleshooting Cyberpunk mods" and figuring out a race condition in a server backend. In those cases, it did a pretty good job of exhausting available data (finding all available logs, digging into lua files), and narrowing a bug that every other model failed to get.
I guess in some sense you have to know from the onset that it's a "hard problem". That in and of itself is subjective.
the real thing is are you or we getting an ROI and the answer is increasingly more yeses on more problems, this trend is not looking to plateau as we step up the complexity ladder to agentic system
Nathan is at Ai2 which is all about open sourcing the process, experience, and learnings along the way
You’re probably pretty far from the average user, who thinks “AI is so dumb” because it doesn’t remember what you told it yesterday.
What is better is to build a good set of rules and stick to one and then refine those rules over time as you get more experience using the tool or if the tool evolves and digress from the results you expect.
There is no other logical move, this is what I am saying, contrary to people above say this requires a lot of courage. It's not about courage, it's just normal and logic (and yes Hackernews matters a lot, this place is a very strong source of signal for investors).
Not bad at all, just observing it.
But, unless you are on a local model you control, you literally can't. Otherwise, good rules will work only as long as the next update allows. I will admit that makes me consider some other options, but those probably shouldn't be 'set and iterate' each time something changes.
On whole, if I compare my AI assistant to a human worker, I get more variance than I would from a human office worker.
Essentially a newbie trick that works really well but not efficient, but still looking like it's amazing breakthrough.
(if someone knows the actual implementation I'm curious)
I don’t understand how agentic IDEs handle this either. Or maybe it’s easier - it just resends the entire codebase every time. But where to cut the chat history? It feels to me like every time you re-prompt a convo, it should first tell itself to summarize the existing context as bullets as its internal prompt rather than re-sending the entire context.
Mercury LLM might work better getting input as an ASCII diagram, or generating an output as an ASCII diagram, not sure if both input and output work 2D.
Plumbing/electrical/electronic schematics are pretty important for AIs to understand and assist us, but for the moment the success rate is pretty low. 50% success rate for simple problems is very low, 80-90% success rate for medium difficulty problems is where they start being really useful.
I think you'd be surprised about the vehicle makeup at Big 3 design facilities.
Since you critiqued my post, allow me to reciprocate: I sense the same deflector shields in you as many others here. I’d suggest embracing these products with a sense of optimism until proven otherwise and I’ve found that path leads to some amazing discoveries and moments where you realize how important and exciting this tech really is. Try out math that is too hard for you or programming languages that are labor intensive or languages that you don’t know. As the GitHub CEO said: this technology lets you increase your ambition.
It's not okay if claims are totally made up 1/30 times
Of course people aren't always correct either, but we're able to operate on levels of confidence. We're also able to weight others' statements as more or less likely to be correct based on what we know about them
tl;dr; humans would do much better too if they could use programming tools :)
Maybe GPT needs a different approach to prompting? (as compared to eg Claude, Gemini, or Kimi)
That said I find that in practice, Codex performance degrades significantly long before it comes to the point of automated compaction - and AFAIK there's no way to trigger it manually. Claude, on the other hand, has a command for to force compacting, but at the same time I rarely use it because it's so good at managing it by itself.
As far as multiple conversations, you can tell the model to update AGENTS.md (or CLAUDE.md or whatever is in their context by default) with things it needs to remember.
Fast (GPT‑5.2 Instant) Free: 16K Plus / Business: 32K Pro / Enterprise: 128K
Thinking (GPT‑5.2 Thinking) All paid tiers: 196K
https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...
And how has chatgpt lost when ure not comparing the chatgpt that just came out to the Gemini that just came out? Gemini is just annoying to use.
and Google just benchmaxxed I didn't see any significant difference (paying for both) and the same benchmaxxing probably happening for chatgpt now as well, so in terms of core capabilities I feel stuff has plateaued. more bout overall experience now where Gemini suxx.
I really don't get how "search integration" is a "strength"?? can you give any examples of places where you searched for current info and chatgpt was worse? even so I really don't get how it's a moat enough to say chatgpt has lost. would've understood if you said something like tpu versus GPU moat.
But they are capable of producing different answers because they feel like behaving differently if the current date is a holiday, and things like that. They're basically just little guys.
it's like the client, not the server, is responsible for writing to my conversation history or something
I really have a sinking feel right now actually of what an absolute giant waste of capital all this is.
I am glad for all the venture capital behind all this to subsidize my intellectual noodlings on a super computer but my god what have we done?
This is so much fun but this doesn't feel like we are getting closer to "AGI" after using Gemini for about 100 hours or so now. The first day maybe but not now when you see how off it can still be all the time.
Copilot Chat has been perfect in this respect. It's currently GPT 5.0, moving to 5.1 over the next month or so, but at least I've never lost an (even old) conversation since those reside in an Exchange mailbox.
https://www.levels.fyi/companies/openai/salaries/software-en...
At the point that you are inventing entirely new techniques, you are usually doing groundbreaking work. Even groundbreaking work in one field is often inspired by techniques from other fields. In the limit, discovering truly new techniques often requires discovering new principles of reality to exploit, i.e. research.
As you can imagine, this is very difficult and hence rather uncommon, typically only accomplished by a handful of people in any given discipline, i.e way above the standards of the general population.
I feel like if we are holding AI to those standards, we are talking about not just AGI, but artificial super-intelligence.
Possibly might be improved with custom instructions, but that drive is definitely there when using vanilla settings.
It is even worse in non-programming domains, where they chop up 100 websites and serve you incorrect bland slop.
If you are using them as a search helper, that sometimes works, though 2010 Google produced better results.
Oracle dropped 11% today due to over-investment in OpenAI. Non-programmers are acutely aware of what is going on.
400k, not 256k.
IMO/AIME problems perhaps, but surely that's too narrow a view for all of mathematics. If solving conjectures were simply a matter of trying a standard range of techniques enough times, then there would be a lot fewer open problems around than what's the case.
works great for kicking off a request and closing tab or navigating away to another page in my app to do something.
i dont understand why model providers dont build this resilient token streaming into all of their APIs. would be a great feature
But voice is not a huge traffic funnel. Text is. And the verdict is more or less unanimous at this time. Gemini 3.0 has outdone ChatGPT. I unsubscribed from GPT plus today. I was a happy camper until the last month when I started noticing deplorable bugs.
1. The conversation contexts are getting intertwined.Two months ago, I could ask multiple random queries in a conversation and I would get correct responses but the last couple of weeks, it's been a harrowing experience having to start a new chat window for almost any change in thread topic. 2. I had asked ChatGPT to once treat me as a co-founder and hash out some ideas. Now for every query - I get a 'cofounder type' response. Nothing inherently wrong but annoying as hell. I can live with the other end of the spectrum in which Claude doesn't remember most of the context.
Now that Gemini pro is out, yes the UI lacks polish, you can lose conversations, but the benefits of low latency search and a one year near free subscription is a clincher. I am out of ChatGPT for now, 5.2 or otherwise. I wish them well.
We tried the same prompts we asked previous models today, and found out [1].
The TL:DR: Claude is still better on the frontend, but 5.2 is comparable to Gemini 3 Pro on the backend. At the very least 5.2 did better on just about every prompt compared to 5.1 Codex Max.
The two surprises with the GPT models when it comes to coding: 1. They often use REPLs rather than read docs 2. In this instance 5.2 was more sheepish about running CLI commands. It would instead ask me to run the commands.
Since this isn't a codex fine-tuned model, I'm definitely excited to see what that looks like.
[1] The full video and some details in the tweet here: https://x.com/instant_db/status/1999278134504620363
All of your benchmarks mean nothing to me until you include Claude Sonnet on them.
In my experience, GPT hasn’t been able to compete with Claude in years for the daily “economically valuable” tasks I work on.
I’ve been working on a few benchmarks to test how well LLMs can recreate interfaces from screenshots. (https://github.com/alechewitt/llm-ui-challenge). From my basic tests, it seems GPT-5.2 is slightly better at these UI recreations. For example, in the MS Word replica, it implemented the undo/redo buttons as well as the bold/italic formatting that GPT-5.1 handled, and it generally seemed a bit closer to the original screenshot (https://alechewitt.github.io/llm-ui-challenge/outputs/micros...).
In the VS Code test, it also added the tabs that weren’t visible in the screenshot! (https://alechewitt.github.io/llm-ui-challenge/outputs/vs_cod...).
My hunch is that this is the same 5.1 post-training on a new pretrained base.
Likely rushed out the door faster than they initially expected/planned.
The tools need to figure out how to manage context for us. This isn't something we have to deal with when working with other humans - we reliably trust that other humans (for the most part) retain what they are told. Agentic use now is like training a team mate to do one thing, then taking it out back to shoot it in the head before starting to train another one. It's inefficient and taxing on the user.
Also thinking time means more tokens which costs more especially at the API level where you are paying per token and would be trivially observable.
There is basically no evidence that either of these are occurring in the way you suggest (boosting up and down).
Without fully disclosing training data you will never be sure whether good performance comes from memorization or "semi-memorization".
OpenAI and Anthrophic is my current preference. Looking forward to know what others use.
Claude Code for coding assistance and cross-checking my work. OpenAI for second opinion on my high-level decisions.
This is simply the "openness vs directive-following" spectrum, which as a side-effect results in the sycophancy spectrum, which still none of them have found an answer to.
Recent GPT models follow directives more closely than Claude models, and are less sycophantic. Even Claude 4.5 models are still somewhat prone to "You're absolutely right!". GPT 5+ (API) models never do this. The byproduct is that the former are willing to self-correct, and the latter is more stubborn.
https://www.caranddriver.com/news/a62694325/ford-ceo-jim-far...
I can recognize the short comings of AI code but it can produce a mock or a full blown class before I can find a place to save the file it produced.
Pretending that we are all busy writing novelty and genius is silly, 99% are writing for CRUD tasks and basic business flows, the code isn’t going to be perfect it doesn’t need to be but it will get the job done.
All the logical gotchas of the work flows that you’d be refactoring for hours are done in minutes.
Use pro with search… are it going to read 200 pages of documentation in 7 minutes come up with a conclusion and validate it or invalidate it in another 5? No you still trying accept the cookie prompt on your 6th result.
You might as well join the flat earth society if you still think that AI can’t help you complete day to day tasks.
On the other hand, I can also see why Claude is great for coding, for example. By default it is much more "structured". One can probably change these default personalities with some prompting, and many of the complaints found in this thread about either side are based on the assumption that you can use the same prompt for all models.
Also, I would never, ever, trust Google for privacy or sign into a Google account except on YouTube (and clear cookies afterwards to stop them from signing me into fucking Search too).
I can believe that, but it also seems really silly? If your max context window is X and the chat has approached that, instead of outright deleting the first messages outright, why not have your model summarise the first quarter of tokens and place those at the beginning of the log you feed as context? Since the chat history is (mostly) immutable, this only adds a minimal overhead: you can cache the summarisation, and don't have to do that over and over again for each new message. (If partially summarised log gets too long, you summarise again.)
Since I can come up with this technique in half a minute of thinking about the problem, and the OpenAI folks are presumably not stupid, I wonder what downside I'm missing.
The problem is complicated, but very solvable.
I’m programming video cropping into my Android application. It seems videos that have “rotated” metadata cause the crop to be applied incorrectly. As in, a crop applied to the top of a video actually gets applied to the video rotated on its side.
So, either double rotation is being applied somewhere in the pipeline, or rotation metadata is being ignored.
I tried Opus 4.5, Gemini 3, and Codex 5.2. All 3 go through loops of “Maybe Media3 applies the degree(90) after…”, “no, that’s not right. Let me think…”
They’ll do this for about 5 minutes without producing anything. I’ll then stop them, adjusting the prompt to tell them “Just try anything! Your first thought, let’s rapidly iterate!“. Nope. Nothing.
To add, it also only seems to be using about 25% context on Opus 4.5. Weird!
Contemporary LLMs still have huge limitations and downsides. Just like hammer or a saw has limitations. But millions of people are getting good value out of them already (both LLMs and hammers and saws). I find it hard to believe that they are all deluded.
Codex is decent and seemed to be improving (being written in rust helps). Claude code is still the king, but my god they have server and throttling issues.
Mixed bag wherever you go. As model progress slows / flatlines (already has?) I’m sure we’ll see a lot more focus and polish on the interfaces.
It's worse: Gemini (and ChatGPT, but to a lesser extent) have started suggesting random follow-up topics when they conclude that a chat in a session has exhausted a topic. Well, when I say random, I mean that they seem to be pulling it from the 'memory' of our other chats.
For a naive user without preconceived notions of how to use these tools, this guidance from the tools themselves would serve as a pretty big hint that they should intermingle their sessions.
That's what websites have been doing for ages. Just like you can't step twice in the same river, you can't use the same version of Google Search twice, and never could.
Who's everyone? There are many, many people who think AI is great.
In reality, our contemporary AIs are (still) tools with glaring limitations. Some people overlook the limitations, or don't see them, and really hype them up. I guess the people who then take the hype at face value are those that think that AI sucks? I mean, they really do honestly suck in comparison to the hypest of hypes.
I noticed huge improvement from Sonnet 4.5 to Opus 4.5 when it became unthrottled a couple weeks ago. I wasn't going to sign back up with Anthropic but I did. But two weeks in it's already starting to seem to be inconsistent. And when I go back to Sonnet it feels like they did something to lobotomize it.
Meanwhile I can fire up DeepSeek 3.2 or GLM 4.6 for a fraction of the cost and get almost as good as results.
People who can’t understand that many people actually prefer iOS use this green/blue thing to explain the otherwise incomprehensible (to them) phenomenon of high iOS market share. “Nobody really likes iOS, they just get bullied at school if they don’t use it”.
It’s just “wake up sheeple” dressed up in fake morality.
It was a generational leap if there ever has been one. Much bigger than 3.5 to 4.
Same way many professional airplane mechanics fly commercial rather than building their own plane. Just because your job is in tech doesn’t mean you have to be ultra-haxxor with every single device in your life.
That it costs more does suggest it's "doing more with more", at least.
For me, "gemini" currently means using this model in the llm.datasette.io cli tool.
openrouter/google/gemini-3-pro-preview
For what anyone else means? If they're equivalent? If Google does something different when you use "Gemini 3" in their browser app vs their cli app vs plans vs api users vs third party api users? No idea to any of the above.
I hate naming in the llm space.
Instead of forwarding model-generated links to https://www.google.com/url?q=[URL], which serves the purpose of malware check and user-facing warning about linking to an external site, Gemini forwards links to https://www.google.com/search?q=[URL], which does... a Google search for the URL, which isn't helpful at all.
Example: https://gemini.google.com/share/3c45f1acdc17
NotebookLM by comparison, does the right thing: https://notebooklm.google.com/notebook/7078d629-4b35-4894-bb...
It's kind of impressive how long this obviously-broken link experience has been sitting in the Gemini app used by millions.
See these two solutions GPT suggested: [1]
Is any of these any good?
[1] https://gist.github.com/pramatias/538f77137cb32fca5f626299a7...
Generate an SVG of an octopus operating a pipe organ
Generate an SVG of a giraffe assembling a grandfather clock
Generate an SVG of a starfish driving a bulldozer
https://gally.net/temp/20251107pelican-alternatives/index.ht...
GPT-5.2 Pro cost about 80 cents per prompt through OpenRouter, so I stopped there. I don’t feel like spending that much on all thirty prompts.
Technology is already so insane and advanced that most people just take it as magic inside boxes, so nothing is surprising anymore. It's all equally incomprehensible already.
I use a modeling software called Rhino on wine on Linux. In the past, there was an incident where I had to copy an obscure dll that couldn't be delivered by wine or winetricks from a working Windows installation to get something to work. I did so and it worked. (As I recall this was a temporary issue, and was patched in the next release of wine.)
I hate the wine standard file picker, it has always been a persistent issue with Rhino3d. So I keep banging my head on trying to get it to either perform better or make a replacement. Every few months I'll get fed up and have a minute to kill, so I'll see if some new approach works. This time, ChatGPT told me to copy two dll's from a working windows installation to the System folder. Having precedent that this can work, I did.
Anyway, it borked startup completely and it took like an hour to recover. What I didn't consider - and I really, really should have - was that these were dll's that were ALREADY IN the system directory, and I was overwriting the good ones with values already reflecting my system with completely foreign ones.
And that's the critical difference - the obscure dll that made the system work that one time was because of something missing. This time was overwriting extant good ones.
But the fact that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success made me very wary of the harm it could cause.
>OCR is phenomenal
I literally tried to OCR a TYPED document in Gemini today and it mangled it so bad I just transcribed it myself because it would take less time than futzing around with gemini.
> Gemini handles every single one of my uses cases much better and consistently gives better answers.
>coding
I asked it to update a script by removing some redundant logic yesterday. Instead of removing it it just put == all over the place essentially negating but leaving all the code and also removing the actual output.
>Stocks analysis
lol, now I know where my money comes from.
Some issues you mentioned like length of response might be user preference. Other issues like "hallucination" are areas of active research (and there are benchmarks for these).
Now I kind of wonder if I’m missing out by not continuing the conversation too much, or by not trying to use memory features.
Unfortunately during coding I have found many LLMs like to encode their beliefs and assumptions into comments; and even when they don't, they're unavoidably feeding them into the code. Then future sessions pick up on these.
The benchmark changes are incredible, but I have yet to notice a difference in my codebases as of yet.
I use Gemini 3 with my $10/month copilot subscription on vscode. I have to say, Gemini 3 is great. I can do the work of four people. I usually run out of premium tokens in a week. But I’m actually glad there is a limit or I would never stop working. I was a skeptic, but it seems like there is a wider variety of patterns in the training distribution.
'Oh, that super annoying issue? Yeah, it's been there for years. We just don't do that.'
Fundamentally though, browsing the web on iOS, even with a custom "browser" with adblocking, feels like going back in time 15 years.
With a subsidized cost of $200/month for OpenAI it would be cheaper to hirer a part-time minimum wage worker than it would be to contract with OpenAI.
And that is the rosiest estimate OpenAI has.
> ...that the LLM even suggested (without special prompting) to do something that I should have realized was a stupid idea with a low chance of success...
Since you're using other models instead, do you believe they cannot give similarly stupid ideas?
The MSRP of your phone does not matter.
It's not yet perfect, my sense is just that it's near the tipping point where models are efficient enough that running a local model is truly viable
In contrast, chatgpt has built their own search engine that performs better in my experience. Except for coding, then I opt for Claude opus 4.5.
Locals are better; I can script and have them script for me to build a guide creation process. They don't forget because that is all they're trained on. I'm done paying for 'AI'.
Most of the time, I end up putting in more work than I get out of it. Onboarding, reviewing, and mentoring all take significant time.
Even with the best students we had, paying around 400 euros a month, I would not say that I saved five hours a week.
And even when they reach the point of being truly productive, they are usually already finished with their studies. If we then hire them full-time, they cost significantly more.
anyway, cancelled my chatgpt subscription.
Not even remotely true. Oracle is building out infrastructure mostly for AI workloads. It dropped because it couldn’t explain its financing and if the investment was worth it. OpenAI or not wouldn’t have mattered.
Until you queried I had forgotten to mention that the same day I was trying to work out a Linux system display issue and it very confidently suggested to remove a package and all its dependencies, which would have removed all my video drivers. On reading the output of the autoremove command I pointed out that it had done this, and the model spat out an "apology" and owned up to ** the damage it would have wreaked.
** It can't "apologize" for or "own up" to anything, it can just output those words. So I hope you'll excuse the anthropomorphization.
* Research and planning
* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)
* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding
I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.
As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)
For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.
https://clocks.brianmoore.com/
Probably Kimi or Deepseek are best
Sonnet/Opus 4.5 is faster, generally feels like a better coder, and make much prettier TUI/FEs, but in my experience, for anything tough any time it tells you it understands now, it really doesn't...
Gemini 3 Pro is unusable - I've found the same thing, opinionated in the worst way, unreliable, doesn't respect my AGENTS.md and for my real world problems, I don't think it's actually solved anything that I can't get through w/ GPT (although I'll say that I wasn't impressed w/ Max, hopefully 5.2 xhigh improves things). I've heard it can do some magic from colleagues working on FE, but I'll just have to take their word for it.
As @lopuhin points out, they already claimed that context window for previous iterations of GPT-5.
The funny thing is though, I'm on the business plan, and none of their models, not GPT-5, GPT-5.1, GPT-5.2, GPT-5.2 Extended Thinking, GPT-5.2 Pro, etc., can really handle inputs beyond ~50k tokens.
I know because, when working with a really long Python file (>5k LoCs), it often claims there is a bug because, somewhere close to the end of the file, it cuts off and reads as '...'.
Gemini 3 Pro, by contrast, can genuinely handle long contexts.
For me both Gemini and ChatGPT (both paid versions Key in Gemini and ChatGPT Plus) give me similiar results in terms of "every day" research. Im sticking with ChatGPT at the moment, as the UI and scaffolding around the model is in my view better at ChatGpt (e.g. you can add more than one picture at once...)
For Software Development, I tested Gemini3 and I was pretty disappointed in comparison to Claude Opus CLI, which is my daily driver.
I can't wait to see how bad my finally sort-of-working ChatGPT 5.1 pre-prompts work with 5.2.
Edit: How to talk to these models is actually documented, but you have to read through huge documents: https://cdn.openai.com/gpt-5-system-card.pdf
1. Problems that have been solved before have their solution easily repeated (some will say, parroted/stolen), even with naming differences.
2. Problems that need only mild amalgamation of previous work are also solved by drawing on training data only, but hallucinations are frequent (as low probability tokens, but as consumers we don’t see the p values).
3. Problems that need little simulation can be simulated with the text as scratchpad. If evaluation criteria are not in training data -> hallucination.
4. Problems that need more than a little simulation have to either be solved by adhoc written code, or will result in hallucination. The code written to simulate is again a fractal of problems 1-4.
Phrased differently, sub problem solutions must be in the training data or it won’t work; and combining sub problem solutions must be either again in training data, or brute forcing + success condition is needed, with code being the tool to brute force.
I _think_ that the SOTA models are trained to categorize the problem at hand, because sometimes they answer immediately (1&2), enable thinking mode (3), or write Python code (4).
My experience with CC and Codex has been that I must steer it away from categories 2 & 3 all the time, either solving them myself, ask them to use web research, or split them up until they are (1) problems.
Of course, for many problems you’ll only know the category once you’ve seen the output, and you need to be able to verify the output.
I suspect that if you gave Claude/Codex access to a circuit simulator, it will successfully brute force the solution. And future models might be capable enough to write their own simulator adhoc (ofc the simulator code might recursively fall into category 2 or 3 somewhere and fail miserably). But without strong verification I wouldn’t put any trust in the outcome.
With code, we do have the compiler, tests, observed behavior, and a strong training data set with many correct implementations of small atomic problems. That’s a lot of out of the box verification to correct hallucinations. I view them as messy code generators I have to clean up after. They do save a ton of coding work after or while I‘m doing the other parts of programming.
I gave it a few tools to access sec filings (and a small local vector database), and it's generating full fledged spreadsheets with valid, real time data. Analysts in wallstreet are going to get really empowered, but for the first time, I'm really glad that retail investors are also getting these models.
Just put out the tool: https://github.com/ralliesai/tenk
https://docs.google.com/spreadsheets/d/1DVh5p3MnNvL4KqzEH0ME...
Model hallucinated half of the data?! Sorry we can't go back on this decision, that would make us look bad!
Or when some silly model will push everyone to invest in some radicoulous company and everybody will do it. Poisoning data attack to inject some I am Future Inc ™ company with high investment rate. After few months pocket money and vanish.
We are certainly going to live in interesting times.
I can't even anymore. Sorry this is not going anywhere.
Estonteco estas hela, miaj karaj siboj.
The thing that would now make the biggest difference isn't "more intelligence", whatever that might mean, but better grounding.
It's still a big issue that the models will make up plausible sounding but wrong or misleading explanations for things, and verifying their claims ends up taking time. And if it's a topic you don't care about enough, you might just end up misinformed.
I think Google/Gemini realize this, since their "verify" feature is designed to address exactly this. Unfortunately it hasn't worked very well for me so far.
But to me it's very clear that the product that gets this right will be the one I use.
Especially something like expressing a certainty %, you might be able to get it to output one but it's just making it up. LLMs are incredibly useful (I use them every day) but you'll always have to check important output
Humanity won't be able to tap into this highly compressed energy stock that was generated through processes taking literally geological scales time to bed achieved.
That is, technology is more about what alternative tradeoffs can we leverage on to organize differently with resources at hand.
Frugality can definitely be a possible way to shape the technologies we want to deploy. But it's not all possible technologies, just a subset.
Also better technology is not necessarily bringing societies to morale and well-being excellency. Improving technology for efficient genocides for example is going to bring human disaster as obvious outcome, even if it's done in a manner that is the most green, zero-carbon emissions and growing more forests delivered beyond expectations of the specifications.
As an enterprise customer, the experience has been disappointing. The platform is unstable, support is slow to respond even when escalated to account managers, and the UI is painfully slow to use. There are also baffling feature gaps, like the lack of connectors for custom GPTs.
None of the major providers have a perfect enterprise solution yet, but given OpenAI's market position, the gap between expectations and delivery is widening.
Yes, but you only re-do this every once in a while? It's a constant factor overhead. If you essentially feed the last few thousand tokens, you have no caching at all (and you are big enough that this window of 'last few thousand tokens' doesn't get you the whole conversation)?
"All models" section on https://platform.openai.com/docs/models is quite ridiculous.
And I've parked in the lot of shame at a Ford plant, as an outsider, in my GMC work truck -- way over there.
It wasn't so bad. A bit of a hike to go back and get a tool or something, but it was at least paved...unlike the non-union lot I'm familiar with at a P&G facility, which is a gravel lot that takes crossing a busy road to get to, lacks the active security and visibility from the plant that the union lot has, and which is full of tall weeds. At P&G, I half-expect to come back and find my tires slashed.
Anyway, it wasn't barren over there in the not-Ford lot, but it wasn't nearly so populous as the Ford lot was. The Ford-only lot is bigger, and always relatively packed.
It was very clear to me that the lots (all of the lots, in aggregate) were mostly full of Fords.
To bring this all back 'round: It is clear to me that Ford employees broadly (>50%) drive Fords to work at that plant.
---
It isn't clear to me at all that Google Pixel developers don't broadly drive iPhones. As far as I can tell, that status (which is meme-level in its age at this point) is true, and they aren't broadly making daily use of the systems they build.
(And I, for one, can't imagine spending 40 hours a week developing systems that I refuse to use. I have no appreciation for that level of apparent arrogance, and I hope to never be suaded to be that way. I'd like to think that I'd be better-motivated to improve the system than I would be to avoid using it and choose a competitor instead.
I don't shit where I sleep.)
I wouldn't trust it with 2d ascii art diagrams, there isn't enough focus on these in the training data is my guess - a typical jagged frontier experience.
what I am curious about is 5.2-codex but many of us complained about 5.1-codex (it seemed to get tunnel visioned) and I have been using vanilla 5.1
its just getting very tiring to deal with 5 different permutations of 3 completely separate models but perhaps this is the intent and will keep you on a chase.
Gemini (the app) has a "mitigation" feature where it tries to to Google searches to support its statements. That doesn't currently work properly in my experience.
It also seems to be doing something where it adds references to statements (With a separate model? With a second pass over the output? Not sure how that works.). That works well where it adds them, but it often doesn't do it.
I feel like there is a small chance I could actually make this work in some areas of the business now. 400k is a really big context window. The last time I made any serious attempt I only had 32k tokens to work with. I still don't think these things can build the whole product for you, but if you have a structured configuration abstraction in an existing product, I think there is definitely uplift possible.
Exactly! One important thing LLMs have made me realise deeply is "No information" is better than false information. The way LLMs pull out completely incorrect explanations baffles me - I suppose that's expected since in the end it's generating tokens based on its training and it's reasonable it might hallucinate some stuff, but knowing this doesn't ease any of my frustration.
IMO if LLMs need to focus on anything right now, they should focus on better grounding. Maybe even something like a probability/confidence score, might end up experience so much better for so many users like me.
I'm not an expert on the topic, but to me it sounds plausible that a good part of the problem of confabulation comes down to misaligned incentives. These models are trained hard to be a 'helpful assistant', and this might conflict with telling the truth.
Being free of hallucinations is a bit too high a bar to set anyway. Humans are extremely prone to confabulations as well, as can be seen by how unreliable eye witness reports tend to be. We usually get by through efficient tool calling (looking shit up), and some of us through expressing doubt about our own capabilities (critical thinking).
Is that yet-another accusation of having used the bot?
I don't use the bot to write English prose. If something I write seems particularly great or poetic or something, then that's just me: I was in the right mood, at the right time, with the right idea -- and with the right audience.
When it's bad or fucked-up, then that's also just me. I most-assuredly fuck up plenty.
They can't all be zingers. I'm fine with that.
---
I do use the hell out of the bot for translating my ideas (and the words that I use to express them) into languages that I can't speak well, like Python, C, and C++. But that's very different. (And at least so far I haven't shared any of those bot outputs with the world at all, either.)
So to take your question very literally: No, I don't get better results from prompting being more poetic. The responses to my prompts don't improve by those prompts being articulate or poetic.
Instead, I've found that I get the best results from the bot fastest by carrying a big stick, and using that stick to hammer and welt it into compliance.
Things can get rather irreverent in my interactions with the bot. Poeticism is pretty far removed from any of that business.
Retrieval.
And then hallucination even in the face of perfect context.
Both are currently unsolved.
(Retrieval's doing pretty good but it's a Rube Goldberg machine of workarounds. I think the second problem is a much bigger issue.)
The purpose of mechanisation is to standardise and over the long term reduce errors to zero.
Otoh “The final truth is there is no truth”
As a user I want it but as webadmin it kills dynamic pages and that's why Proof of work aka CPU time captchas like Anubis https://github.com/TecharoHQ/anubis#user-content-anubis or BotID https://vercel.com/docs/botid are now everywhere. If only these AI crawlers did some caching, but no just go and overrun the web. To the effect that they can't anymore, at the price of shutting down small sites and making life worse for everyone, just for few months of rapacious crawling. Literally Perplexity moved fast and broke things.
One demo of this that reliably works for me:
Write a draft of something and ask the LLM to find the errors.
Correct the errors, repeat.
It will never stop finding a list of errors!
The first time around and maybe the second it will be helpful, but after you've fixed the obvious things, it will start complaining about things that are perfectly fine, just to satisfy your request of finding errors.
Makes you wonder what 97% is worth. Would we accept a different service with only 97% availability, and all downtime during lunch break?
I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
From Tavis Ormandy who wrote a C program to solve the Anubis challenges out of browser https://lock.cmpxchg8b.com/anubis.html via https://news.ycombinator.com/item?id=45787775
Guess a mix of Markov tarpits and llm meta instructions will be added, cf. Feed the botshttps://news.ycombinator.com/item?id=45711094 and Nephentes https://news.ycombinator.com/item?id=42725147
Mostly we're not trying to win a nobel prize, develop some insanely difficult algorithm, or solve some silly leetcode problem. Instead we're doing relatively simple things. Some of those things are very repetitive as well. Our core job as programmers is automating things that are repetitive. That always was our job. Using AI models to do boring repetitive things is a smart use of time. But it's nothing new. There's a long history of productivity increasing tools that take boring repetitive stuff away. Compilation used to be a manual process that involved creating stacks of punch cards. That's what the first automated compilers produced as output: stacks of punch cards. Producing and stacking punchcards is not a fun job. It's very repetitive work. Compilers used to be people compiling punchcards. Women mostly, actually. Because it was considered relatively low skilled work. Even though it arguably wasn't.
Some people are very unhappy that the easier parts of their job are being automated and they are worried that they get completely automated away completely. That's only true if you exclusively do boring, repetitive, low value work. Then yes, your job is at risk. If your work is a mix of that and some higher value, non repetitive, and more fun stuff to work on, your life could get a lot more interesting. Because you get to automate away all the boring and repetitive stuff and spend more time on the fun stuff. I'm a CTO. I have lots of fun lately. Entire new side projects that I had no time for previously I can now just pull off in a spare few hours.
Ironically, a lot of people currently get the worst of both worlds because they now find themselves baby sitting AIs doing a lot more of the boring repetitive stuff than they would be able to do without that to the point where that is actually all that they do. It's still boring and repetitive. And it should be automated away ultimately. Arguably many years ago actually. The reason so many react projects feel like Ground Hog Day is because they are very repetitive. You need a login screen, and a cookies screen, and a settings screen, etc. Just like the last 50 projects you did. Why are you rebuilding those things from scratch? Manually? These are valid questions to ask yourself if you are a frontend programmer. And now you have AI to do that for you.
Find something fun and valuable to work on and AI gets a lot more fun because it gives you more quality time with the fun stuff. AI is about doing more with less. About raising the ambition level.
And lately, Claude (web) started to draw ascii charts from one day to another indstead of colorful infographicstyled-images as it did before (they were only slightly better than the ascii charts)
I believe the real issue is that LLMs are still so bad at reasoning. In my experience, the worst hallucinations occur where only handful of sources exist for some set of facts (e.g laws of small countries or descriptions of niche products).
LLMs know these sources and they refer to them but they are interpreting them incorrectly. They are incapable of focusing on the semantics of one specific page because they get "distracted" by their pattern matching nature.
Now people will say that this is unavoidable given the way in which transformers work. And this is true.
But shouldn't it be possible to include some measure of data sparsity in the training so that models know when they don't know enough? That would enable them to boost the weight of the context (including sources they find through inference time search/RAG) relative to to their pretraining.