DeepSeek-v3.1 (api-docs.deepseek.com)
776 points by wertyk | 6 comments
hodgehog11 No.44977357
For reference, here is the terminal-bench leaderboard:

https://www.tbench.ai/leaderboard

Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but it still does reasonably well compared to other open-weight models. Benchmarks are rarely the full story, though, so time will tell how good it is in practice.

replies(6): >>44977423 #>>44977655 #>>44977754 #>>44977946 #>>44978395 #>>44978560 #
guluarte No.44977946
tbh, companies like Anthropic and OpenAI create custom agents for specific benchmarks
replies(2): >>44978101 #>>44979380 #
amelius No.44979380
Aren't good benchmarks supposed to be secret?
replies(3): >>44979634 #>>44982470 #>>45056160 #
noodletheworld No.44982470
How can a benchmark be secret if you post it to an API to test a model on it?

"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"

:P

If you want to do it properly, you have to avoid any third-party hosted model when you test your benchmark, which means you can't have GPT-5, Claude, etc. on it; and none of the benchmarks wants to be 'that guy' who doesn't have all the best models on it.

So no.

They're not secret.

replies(1): >>44982928 #
dmos62 No.44982928
How do you propose that would work? A pipeline that goes through query-response pairs to deduce response quality and then uses the low-quality responses for further training? Wouldn't you need a model that's already smart enough to tell that previous model's responses weren't smart enough? Sounds like a chicken and egg problem.
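
For concreteness, the pipeline being described would look roughly like the sketch below. All names are hypothetical, and the judge_quality stub is exactly the "already smart enough" model that the chicken-and-egg objection points at.

    # Hypothetical sketch of the mining pipeline described above; not any
    # provider's actual practice.
    import random

    def judge_quality(query: str, response: str) -> float:
        """Stand-in for a grader (a stronger model or human raters)
        that scores a logged response: the chicken-and-egg dependency."""
        return random.random()  # placeholder score in [0.0, 1.0)

    def mine_hard_cases(logs: list[tuple[str, str]]) -> list[tuple[str, str]]:
        # Keep the query-response pairs the current model handled badly;
        # these become candidate targets for the next training round.
        return [(q, r) for (q, r) in logs if judge_quality(q, r) < 0.5]

    print(mine_hard_cases([("2+2?", "5"), ("Capital of France?", "Paris")]))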
replies(1): >>44983190 #
irthomasthomas No.44983190
It just means that once you send your test questions to a model API, that company now has your test. So 'private' benchmarks take it on faith that the companies won't look at those requests and tune their models or prompts to beat them.
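
To make the leak concrete: every benchmark item travels in plaintext in the request body, tied to a billable account. A minimal sketch, assuming an OpenAI-style chat completions endpoint; the host, model name, and test items are placeholders:

    import requests

    SECRET_TEST_ITEMS = [
        "Q1: <held-out benchmark question>",
        "Q2: <held-out benchmark question>",
    ]

    for item in SECRET_TEST_ITEMS:
        requests.post(
            "https://api.example-provider.com/v1/chat/completions",  # placeholder host
            headers={"Authorization": "Bearer <api-key>"},  # key -> card -> identity
            json={
                "model": "some-model",
                "messages": [{"role": "user", "content": item}],
            },
            timeout=60,
        )
        # The full question text now sits in the provider's request logs,
        # attributable to this account, whatever the retention policy says.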
replies(2): >>44983732 #>>44985081 #
dmos62 No.44983732
Sounds a bit presumptuous to me. Sure, they have your needle, but they also need a cost-efficient way to find it in their haystack.
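
Worth noting how small the "finding" half of that problem is if the provider ever obtains candidate prompts (scraped, leaked, or simply logged from an earlier run): matching incoming traffic against them is a constant-time hash lookup per request. A sketch with made-up data:

    import hashlib

    def normalize(prompt: str) -> str:
        # Collapse case and whitespace so trivial reformatting doesn't evade the match.
        return " ".join(prompt.lower().split())

    def digest(prompt: str) -> str:
        return hashlib.sha256(normalize(prompt).encode()).hexdigest()

    # Hypothetical set of suspected benchmark items.
    suspected = {digest("What comes next: 2, 4, 8, ...?")}

    def looks_like_benchmark(incoming: str) -> bool:
        return digest(incoming) in suspected  # O(1) per request

    print(looks_like_benchmark("what comes next:  2, 4, 8, ...?"))  # True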
replies(2): >>44984722 #>>44986900 #
noodletheworld No.44984722
Security through obscurity is not security.

Your API key is linked to your credit card, which is linked to your identity.

…but hey, you're right.

Let's just trust them not to be cheating. Cool.

merelysounds No.44985081
Would the model owners be able to identify the benchmarking session among many other similar requests?
replies(1): >>44985331 #
irthomasthomas No.44985331
Depends. Something like ARC-AGI might be easy, as it follows a defined format. I would also guess that the usage pattern of someone running a benchmark is quite distinct from that of a normal user, unless they take specific measures to blend in.
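
For the ARC-AGI case, the "defined format" point is easy to make concrete: tasks embed rectangular grids of digits 0-9, which a crude structural check can flag. A hypothetical heuristic, not anyone's actual pipeline:

    import json
    import re

    GRID_RE = re.compile(r"\[\s*\[[0-9,\s\[\]]+\]\s*\]")  # crude grid-of-digits pattern

    def looks_like_arc_prompt(prompt: str) -> bool:
        # Flag prompts containing a JSON-style list of lists of small ints,
        # the shape ARC tasks are serialized in.
        for match in GRID_RE.finditer(prompt):
            try:
                grid = json.loads(match.group())
            except ValueError:
                continue
            if grid and all(
                isinstance(row, list)
                and all(isinstance(v, int) and 0 <= v <= 9 for v in row)
                for row in grid
            ):
                return True
        return False

    print(looks_like_arc_prompt("Given input [[0, 1], [1, 0]], predict the output grid."))  # True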
lucianbr No.44986900
They have quite large amounts of money, so I don't think they need to be very cost-efficient. They also have very smart people, so they can likely figure out a reasonably cost-efficient way regardless. The stakes are high for them.