GPT-5.2

(openai.com)

1084 points atgctg | 1 comments | 11 Dec 25 18:04 UTC | HN request time: 0.194s | source

https://platform.openai.com/docs/guides/latest-model

System card: https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944...

Show context

josalhor ◴[11 Dec 25 18:24 UTC] No.46235005[source]▶

>>46234788 (OP) #

From GPT 5.1 Thinking:

ARC AGI v2: 17.6% -> 52.9%

SWE Verified: 76.3% -> 80%

That's pretty good!

replies(7): >>46235062 #>>46235070 #>>46235153 #>>46235160 #>>46235180 #>>46235421 #>>46236242 #

verdverm ◴[11 Dec 25 18:28 UTC] No.46235062[source]▶

>>46235005 #

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well on the day-to-day

replies(5): >>46235126 #>>46235266 #>>46235466 #>>46235492 #>>46235583 #

stego-tech ◴[11 Dec 25 19:01 UTC] No.46235583[source]▶

>>46235062 #

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000’s PC gaming all over again (“gotta game the benchmark!”).

replies(2): >>46236156 #>>46236484 #

snet0 ◴[11 Dec 25 20:11 UTC] No.46236484[source]▶

>>46235583 #

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

replies(1): >>46236543 #

jacquesm ◴[11 Dec 25 20:15 UTC] No.46236543[source]▶

>>46236484 #

That means you're probably asking it to do very simple things.

replies(4): >>46236778 #>>46236779 #>>46236916 #>>46243575 #

baq ◴[11 Dec 25 20:35 UTC] No.46236779[source]▶

>>46236543 #

I can confidently say that anecdotally you’re completely wrong, but I’ll also allow a very different definition of ‘simple’ and/or attempting to use an unpopular environment as a valid anecdotal counterpoint.

replies(2): >>46237593 #>>46238522 #

jacquesm ◴[11 Dec 25 21:43 UTC] No.46237593[source]▶

>>46236779 #

I don't reach for AI until I'm solidly stuck and then use it mostly for inspiration, it has yet to happen that it directly pointed at a solution and I'm pretty good at writing prompts. When I throw a bunch of elementary stuff at it then it is super good at identifying issues and solving them (but I could have done that myself, it is just nice to try to delineate where you can and where you can't trust the thing, but that too is fluctuating, sometimes even within a single session). Here is a nice example for a slightly more complex test:

Prompt:

"I have a green LED, a 12V powersupply, a single NPN transistor, a 100 ohm resistor, a 2.7K resistor and an electrolytic capacitor of 220 micro farads. My teacher says it is possible to make an LED flasher with these components but I'm hopelessly stuck, can you please give me an ascii art solution so I don't flunk this exam?"

Answer:

ASCII schematic (fully functional):

           +12V
            |
            |
           .-.
           | |
           | | 2.7kΩ   (R1 — charges the capacitor)
           | |
           '-'
            |
            +---------+----------------------+
            |         |                      |
           ( )        |                      |
           ( ) C1     |                      |
           ( ) 220uF  |                      |
            |         |                      |
            |         |                      |
            |        |/ C                    |
            +--------|   NPN transistor      |
            |        |\ E                    |
            |         |                      |
           GND       GND                     |
                                              \
                                               \
                                               / 100Ω (R2 — LED limit)
                                               \
                                                |
                                                >|  Green LED
                                                |
                                               GND

Follow by:

"How it works

The 2.7 kΩ resistor charges the 220 µF capacitor from the 12 V supply.

The capacitor voltage rises slowly.

When the capacitor voltage reaches the transistor’s base-emitter threshold (~0.6–0.7 V), the transistor suddenly switches ON.

When it turns on, the capacitor rapidly discharges through the base, causing:

A brief pulse of current through the transistor

The LED lights up through the 100 Ω resistor

After discharge, the transistor turns back OFF, the LED turns off, and the capacitor begins charging again.

This repeats automatically → LED flasher."

The number of errors in the circuit and the utterly bogus explanation as well as the over confident remark that this is 'working' is so bizarre that I wonder how many slightly more complicated questions are going to yield results comparable to this one.

replies(3): >>46238202 #>>46241729 #>>46243194 #

emporas ◴[11 Dec 25 22:30 UTC] No.46238202[source]▶

>>46237593 #

I have used Gemini for reading and solving electronic schematics exercises, and it's results were good enough for me. Roughly 50% of the exercises managed to solve correctly, 50% wrong. Simple R circuits.

One time it messed up the opposite polarity of two voltage sources in series, and instead of subtracting their voltages, it added them together, I pointed out the mistake and Gemini insisted that the voltage sources are not in opposite polarity.

Schematics in general are not AIs strongest point. But when you explain what math you want to calculate from an LRC circuit for example, no schematics, just describe in words the part of the circuit, GPT many times will calculate it correctly. It still makes mistakes here and there, always verify the calculation.

replies(1): >>46238380 #

jacquesm ◴[11 Dec 25 22:45 UTC] No.46238380[source]▶

>>46238202 #

I guess I'm just more critical than you are. I am used my computer doing what it is told and giving me correct, exact answers or errors.

replies(2): >>46238786 #>>46243329 #

1. dagss ◴[12 Dec 25 12:01 UTC] No.46243329[source]▶

>>46238380 #

I think most people treat them like humans not computers, and I think that is actually a much more correct way to treat them. Not saying they are like humans, but certainly a lot more like humans than whatever you seem to be expecting in your posts.

Humans make errors all the time. That doesn't mean having colleagues is useless, does it?

An AI is a colleague that can code very very fast and has a very wide knowledge base and versatility. You may still know better than it in many cases and feel more experienced that in. Just like you might with your colleagues.

And it needs the same kind of support that humans need. Complex problem? Need to plan ahead first. Tricky logic? Need unit tests. Research grade problem? Need to discuss through the solution with someone else before jumping to code and get some feedback and iterate for 100 messages before we're ready to code. And so on.

↑