o1 > GPT-4o > o3-mini > o1-mini > GPT-4o-mini
o3 mini system card: https://cdn.openai.com/o3-mini-system-card.pdf
And groundbreaking.
Your summary attempt also misses o3-mini vs o3-mini-high, for example. Lots of trade-offs.
Looks like their threat model has finally been updated to take into account that the user might be too "unaligned" to be trusted with the ability to provide a system message of their own.
Or can it not compare? I don't know much about this stuff, but I've heard recently many people talk about DeepSeek and how unexpected it was.
Imagine if Google, during its ascent, had a huge array of search engines with code names and notes about what each was doing behind the scenes. No, you open the page and type in the box. If they can make it work better next month, great.
(I understand this could not apply to developers or enterprise-type API usage).
I’ve been incredibly pleased with DeepSeek this past week. Wonderful product, I love seeing its brain when it’s thinking.
Pricing for o3-mini [1] is $1.10 / $4.40 per 1M tokens.
[1]: https://platform.openai.com/docs/pricing#:~:text=o3%2Dmini
But DeepSeek clearly states in their terms of service that they can train on your API data or use it for other purposes. Which one might assume their government can access as well.
We need direct eval comparisons between o3-mini and DeepSeek... Or, well, they're numbers, so we can look them up on leaderboards.
It uses GPT-4o mini to extract updates from the website using scrapegraphai, so this is kinda meta :). Maybe I'll switch to o3-mini depending on cost. Its reasoning abilities, with a lower cost than o1, could be quite powerful for web scraping.
I feel like this naming scheme is growing a little tired. o1 is for general knowledge reasoning, o3-mini replaces o1-mini but might be more specialized than o1 for certain technical domains...the "o" in "4o" is for "omni" (referring to its multimodality) but the reasoning models start with "o" ...but they can't use "o2" for trademark reasons so they skip straight to "o3" ...the word salad is getting really hard to follow!
For example:
GPQA diamond system card: o1-preview 0.68
GPQA diamond PR release: o1-preview 0.78
Also, how should we interpret the 3 different shading colors in the barplots (white, dotted, heavy dotted on top of white)...
o1 has been very disappointing after spending sufficient time with Claude Sonnet 3.5. It's like it actively tries to gaslight me and thinks it knows more than I do. It's too stubborn and confidently goes off in tangents, suggesting big changes to parts of the code that aren't the issue. Claude tends to be way better at putting the pieces together in its not-quite-mental-model, so to speak.
I told o1 that a suggestion it gave me didn't work and it said "if it's still 'doesn't work' in your setup..." with "doesn't work" in quotes like it was doubting me... I've canceled my ChatGPT subscription and, when I really need to use it, just go with GPT-4o instead.
It is the closed competition model that’s being left in the dust.
If this will work for your use case then it could be a huge discount versus o1. Worth trying again if o1-mini couldn't handle the task before. $4/million output tokens versus $60.
https://platform.openai.com/docs/pricing
I am Tier 5 but I don't believe I have access to it in the API (at least it's not on the limits page and I haven't received an email). It says "rolling out to select Tier 3-5 customers" which means I will have to wait around and just be lucky I guess.
If they were the same, I would have expected explicit references to o3 in the system card and how o3-mini is distilled or built from o3 - https://cdn.openai.com/o3-mini-system-card.pdf - but there are no references.
Excited at the pace all the same. Excited to dig in. The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred.
With DeepSeek out, I heard OpenAI saying the plan was to move to releasing models that were meaningfully better than the competition. Seems like what we're getting is the scheduled releases that are worse than the current versions.
If the model is for technical stuff, then call it the technical model. How is anyone supposed to know what these model names mean?
The only page of theirs attempting to explain this is a total disaster. https://platform.openai.com/docs/models
Reasons to use o3 when deepseek is free/cheaper:
- Some companies/users may already have integrated heavily with OpenAI
- The expanded feature-set (e.g., function-calling, search) could be very powerful
- DeepSeek has deep ties to the Chinese Communist Party and, while the US has its own blackspots, the "steering" of information is far more prevalent in their models
- Local/national regulations might not allow for using DeepSeek due to data privacy concerns
- "free" isn't always better
I'm sure others have better reasons
Am I missing something?
Who said this is not intentional? It seems to work well given that people are hyped every time there's a release, no matter how big the actual improvements are — I'm pretty sure "o3-mini" works better for that purpose than "GPT 4.1.3"
But its SWE-bench scores aren't meaningfully better than Claude's: 49.3 vs Claude's 49.0 on the public leaderboard (might be higher now due to recent updates?)
My immediate thoughts: Codeforces (and competitive programming in general) is a poor proxy for performance on general software engineering tasks. Besides that, for all the work put into OpenAI's most recent model, it still has a hard time living up to an LLM initially released by Anthropic some time ago, at least according to this benchmark.
Mind you, the GitHub issues that the problems in SWE-bench were based on have been around long enough that it's pretty much a given that they've all found their way into the training data of most modern LLMs, so I'm really surprised that o3 isn't meaningfully better than Sonnet.
...I like "DALL·E" and "Whisper" as names a lot, though, FWIW :p
Why would the marketing team of all people call it GPT 4.1.3?
> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.
I have no idea what an "internal tools scaffold" is but the graph on the card that they link directly to specifies "o3-mini (tools)" where the blog post is talking about others.
Ask one of their models for advice. :-)
I gave the same prompt to sonnet 3.5 and not a single hiccup.
Maybe not an indication that DeepSeek is worse/bad (I am using a distilled version), but more so it speaks to how much React/Next.js is out in the world influencing the front-end code that is referenced.
> We evaluate SWE-bench in two settings:
>
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
>
> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
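Read literally, the pass@1 rule they describe works out to something like this (my own sketch of the metric, not OpenAI's eval code):

    # Sketch of the quoted pass@1 rule: per instance, average the pass rate over
    # attempts that produced a valid (non-empty) patch; an instance with no valid
    # patch on any attempt counts as 0. `instances` is a hypothetical structure:
    # one list of (valid_patch, passed) pairs per SWE-bench instance.
    def pass_at_1(instances):
        per_instance = []
        for attempts in instances:
            valid = [passed for valid_patch, passed in attempts if valid_patch]
            per_instance.append(sum(valid) / len(valid) if valid else 0.0)
        return sum(per_instance) / len(per_instance)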
Normies don't have to know - ChatGPT app focuses UX around capabilities and automatically picks the appropriate model for capabilities requested; you can see which model you're using and change it, but don't need to.
As for the techies and self-proclaimed "AI experts" - OpenAI is the leader in the field, and one of the most well-known and talked about tech companies in history. Whether to use, praise or criticize, this group of users is motivated to figure it out on their own.
It's the privilege of fashionable companies. They could name the next model ↂ-↊↋, and it'll take all of five minutes for everyone in tech (and everyone on LinkedIn) to learn how to type in the right Unicode characters.
EDIT: Originally I wrote \Omega-↊↋, but apparently HN's Unicode filter extends to Greek alphabet now? 'dang?
$1.1/m for input
$4.4/m for output
I assume thinking at medium and high would consume more tokens.
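On the API side that knob is the reasoning_effort parameter; a minimal sketch with the standard OpenAI Python SDK (assuming, as appears to be the case, that reasoning tokens are billed as output tokens):

    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="high",  # "low" | "medium" | "high"
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    )
    # Reasoning tokens count toward completion (output) tokens, so higher effort
    # generally means a larger bill for the same question.
    print(resp.usage.completion_tokens)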
I feel the timing is bad for this release, especially when DeepSeek R1 is still peaking. People will compare and might get disappointed with this model.
Additionally, I believe my contribution to the conversation is that GPT-4o mini, the previous model advertised as low-cost, works pretty well for my use case (which in this case can help others here). I'm excited to try out o3-mini depending on what the cost looks like for web scraping purposes. Happy to report back here once I try it out.
You can like Sama or dislike him, but he knows how to market a product. Maybe this is a bad call on his part, but it is a call.
Roughly speaking the message is: "use GPT-4o all the time, use o1 (soon o3) if you have more complex tasks". What am I supposed to answer when people ask "when am I supposed to use o3-mini? And what the heck is o3-mini-high, how do I know when to use it?" People aren't gonna ask the same question to 5 different models and burn all their rate limits; yet it feels like that's what OpenAI is hoping people will do.
Put those weird models in a sub-menu for advanced users if you really want to, but if you can use o1 there is probably no reason for you to have o3-mini and o3-mini-high as additional options.
The non-reasoning line was also pretty straightforward. Newer base models get a larger prefix number and some postfixes like 'o' were added to signal specific features in each model variant. Great!
Where things went off the rails was specifically when they decided to also name the reasoning models with an 'o' for separate reasons, but now as the prefix, at the same time as starting a separate linear sequence, but now as the postfix. I wonder if we'll end up with both a 4o and o4...
On the bright side the app now has curved edges!
PETERSON Oh say, Tackaberry, did you get my memo?
TACKABERRY What memo?
PETERSON My memo about memos. We're sending out too many memos and it's got to stop!
TACKABERRY All right. I'll send out a memo.
The perplexing thing is that someone has to have said that, right? It has to have been brought up in some meeting when they were brainstorming names that if you have 4o and o1 with the intention of incrementing o1 you'll eventually end up with an o4.
Where they really went off the rails was not just bailing when they realized they couldn't use o2. In that moment they had the chance to just make o1 a one-off weird name and go down a different path for its final branding.
OpenAI just struggles with names in general, though. ChatGPT was a terrible name picked by engineers for a product that wasn't supposed to become wildly successful, and they haven't really improved at it since.
The business bit is confusing, I guess they see the API as a business product, but they do not train on API data.
OpenAI is so scattered they haven't even bothered using their own state of the art AI to come up with a coherent naming convention? C'mon, get your shit together.
Don't get me wrong, I'm laughing at OpenAI just like everyone else, but if they were really copying DeepSeek, they'd be releasing a smaller model distilled from DeepSeek API responses, and have it be open source to boot. This is neither.
This tracks when considering that the model was trained on western model outputs and then tuned post-training to (poorly) align it with Chinese values.
Watch this space.
As for self-limiting usage by non-power users, they're already doing that: ChatGPT app automatically picks a model depending on what capabilities you invoke. While they provide a limited ability to see and switch the model in use, they're clearly expecting regular users not to care, and design their app around that.
No one's going to let an LLM near anything important until then.
Their value prop (moat) is that they've burnt more money than everybody else. That moat is trivially circumvented by lighting a larger pile of money, and less trivially by lighting the pile more efficiently.
OpenAI isn't the only company. The tech companies being beaten massively by Microsoft in number of H100 purchases are the ones with a moat. Google and Amazon, with their custom AI chips, are going to have better performance per cost than others, and that will be a moat. If you want the same performance per cost then you need to spend the time making your own chips, which is years of effort (= moat).
I understand that keeping the same data and curating it might be beneficial. But it sounds odd to roll back in time with the knowledge cutoff. AFAIK, the only event that happened around that time was the start of the Gaza conflict.
In particular the "reasoning" models just seem to be less good and more slow.
Over time, competitors catch up and first mover advantage melts away.
I wouldn’t attribute OpenAI’s success to any extremely smart marketing moves. I think a big part of their market share grab was simply going (and staying) viral for a long time. Manufacturing virality is notoriously difficult (and based on the usability and poor UI of ChatGPT early versions, it feels like they got lucky in a lot of ways)
I also have tested with DeepSeek R1 and will test some more with that although in a way Claude 3.6 with CoT is pretty good. Last time I tried to test R1 their API was out.
I think you've spent a little too long hitting on the Deepseek pipe. Enterprise customers with familiarity with China will avoid the hosted model for data security and IP protection reasons, among others.
Those working in any area considered economically competitive with China will also be hesitant to use the vanilla model in self-hosted form as there perpetually remains the standing question on what all they've tuned inside the model to benefit the CCP. Perhaps even in subtle ways reminiscent of the Trisolaran sophons from the Three Body Problem.
For instance, you can imagine that if Germany had released an OS model in 1943, that the Americans wouldn't have trusted it to help them develop better military systems even if initial testing passed muster.
Unfortunately, state control of private enterprise in the Chinese economy makes it unproductive to separate the two from one another. Particularly in Deepseek's case as a wide array of Chinese state-linked social media accounts were promoting V3/R1 on the day of its public release.
https://www.reuters.com/technology/artificial-intelligence/c...
ChatGPTasdhjf-final-final-use_this_one.pt > ChatGPTasdhjf-final.pt > ChatGPTasdhjf.pt > ChatGPTasd.pt> ChatGPT.pt
Additionally, o1 does not have access to search or multimodality, and taking a screenshot of something and asking questions about it is also a big use case.
It’s easy to overlook how widely ChatGPT is used for very small stuff. But compounded it’s still a game changer for many people.
So on the API they don't train by default; for other paid subscriptions they mention you can opt out.
While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, comparing Agentless performance with other models seems less impressive, 40% vs 35% when compared to o1-mini if you look at the graph on page 28 of their system card pdf (https://cdn.openai.com/o3-mini-system-card.pdf).
It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it would look less exciting and more incremental.
Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
I hope by this they don't mean me, when I'm asked 'which of these two responses do you prefer'.
They're both 2,000 words, and I asked a question because I have something to do. I'm not reading them both; I'm usually just selecting the one that answered first.
That prompt is pointless. Perhaps as evidenced by the essentially 50% response rate: it's a coin-flip.
From the way it thinks/responds, it looks like it's one of the distillations, likely the Llama one. I also suspect that many of the free/cheap providers also serve Llama distills instead of the real R1.
Yelp suffered greatly in the early 2010s when Google started putting Google Maps listings (and their accompanying reviews) in their search results.
OpenAI will eventually unify their products as well.
Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
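Roughly this kind of loop, I'd guess (a hand-wavy sketch; the three callables are stand-ins, not the actual scaffold):

    # Sketch of a patch -> apply -> run -> iterate loop. Callers supply the three
    # callables; nothing here is OpenAI's internal tooling.
    def iterate_on_patch(generate_patch, apply_patch, run_tests, issue, max_steps=5):
        feedback = ""
        for _ in range(max_steps):
            patch = generate_patch(issue, feedback)   # model proposes a change
            apply_patch(patch)                        # apply it to the working copy
            passed, output = run_tests()              # actually execute the code
            if passed:
                return patch
            feedback = output                         # iterate on the execution output
        return None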
That said I often run into a sort of opposite issue with Claude. It's very good at making me feel like a genius. Sometimes I'll suggest trying a specific strategy or trying to define a concept on my own, and Claude enthusiastically agrees and takes us down a 2-3 hour rabbit hole that ends up being quite a waste of time for me to back track out of.
I'll then run a post-mortem through ChatGPT, and very often it points out the issue in my thinking very quickly.
That said I keep coming back to sonnet-3.5 for reasons I can't perfectly articulate. Perhaps because I like how it fluffs my ego lol. ChatGPT on the other hand feels a bit more brash. I do wonder if I should be using o1 as my daily driver.
I also don't have enough experience with o1 to determine if it would also take me down dead ends as well.
> Please note that if the reasoning_content field is included in the sequence of input messages, the API will return a 400 error. Therefore, you should remove the reasoning_content field from the API response before making the API request
So the best I can do is pass the reasoning as part of the context (which means starting over from the beginning).
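Concretely you end up doing something like this before every follow-up call (a sketch, assuming the OpenAI-compatible message format DeepSeek uses):

    # Per the quoted docs: drop reasoning_content from prior assistant turns before
    # sending the conversation back, otherwise the API returns a 400.
    def strip_reasoning(messages):
        cleaned = []
        for msg in messages:
            msg = dict(msg)                      # don't mutate the caller's copy
            msg.pop("reasoning_content", None)   # keep role/content, lose the CoT
            cleaned.append(msg)
        return cleaned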
DeepSeek has proven that the latter is possible, which drops a couple of River crossing rocks into the moat.
Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.
Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".
Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.
And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things they might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...
Looking at the comments on here and the benchmark results I was expecting it to be a bit meh, but initial impressions are quite the opposite
I was expecting it to perhaps be a marginal improvement for complex things that need a lot of 'reasoning', but it seems it's a big improvement for simple things that you need done fast
So it is by no means state of the art. Gemini Flash 2.0 also performs better than DeepSeek V3 in all my comparisons thus far. But Gemini Flash 2.0 isn't robust and reliable either.
But as a piece of research, and a cool toy to play with, I think DeepSeek is great.
Fortunately for them a monolith search engine was perfectly fine (and likely optimal due to accrued network effects).
OpenAI is basically signaling that they need to distill their monolith in order to serve specific segments of the marketplace. They've explicitly said that they're targeting STEM with this one. I think that's a smart choice, the most passionate early adopters of this tech are clearly STEM users.
If the tech was such that one monolith model was actually the optimal solution for all use cases, they would just do that. Actually, this is their stated mission: AGI. One monolith that's best at everything is basically what AGI is.
OpenAI has developed a variety of models that cater to different applications, from natural language processing to image generation and audio processing. Here’s a comprehensive list of the current models available:
## Language Models
- *GPT-4o*: The flagship model capable of processing text, images, and audio.
- *GPT-4o mini*: A smaller, more cost-effective version of GPT-4o.
- *GPT-4*: An advanced model that improves upon GPT-3.5.
- *GPT-3.5*: A set of models that enhance the capabilities of GPT-3.
- *GPT-3.5 Turbo*: A faster variant designed for efficiency in chat applications.
## Reasoning Models
- *o1*: Focused on reasoning tasks with improved accuracy.
- *o1-mini*: A lightweight version of the o1 model.
- *o3*: The successor to o1, currently in testing phases.
- *o3-mini*: A lighter version of the o3 model.
## Audio Models
- *GPT-4o audio*: Supports real-time audio interactions and audio generation.
- *Whisper*: For transcribing and translating speech to text.
## Image Models
- *DALL-E*: Generates images from textual descriptions.
## Embedding Models
- *Embeddings*: Converts text into numerical vectors for similarity tasks.
- *Ada*: An embedding model with various sizes (e.g., ada-002).
## Additional Models
- *Text to Speech (Preview)*: Synthesizes spoken audio from text.
These models are designed for various tasks, including coding assistance, image generation, and conversational AI, making OpenAI's offerings versatile for developers and businesses alike [1][2][4][5].
Citations:
[1] https://learn.microsoft.com/vi-vn/azure/ai-services/openai/concepts/models
[2] https://platform.openai.com/docs/models
[3] https://llm.datasette.io/en/stable/openai-models.html
[4] https://en.wikipedia.org/wiki/OpenAI_API
[5] https://industrywired.com/open-ai-models-list-top-models-to-consider/
[6] https://holypython.com/python-api-tutorial/listing-all-available-openai-models-openai-api/
[7] https://en.wikipedia.org/wiki/GPT-3
[8] https://stackoverflow.com/questions/78122648/openai-api-how-do-i-get-a-list-of-all-available-openai-models/78122662
A marketing team isn't a generality that makes a company known; it often focuses on communicating what products different types of customers need from your lineup.
If I sell three medications:
Steve
56285
Priximetrin
And only tell you they are all pain killers but for different types and levels of pain I’m going to leave revenue on the floor. That is no matter how valuable my business is or how well it’s known.
The naming is bad. Other people already said it you can "google" stuff, you can "deepseek" something, but to "chatgpt" sounds weird.
The model naming is even weirder. Like, did they really avoid o2 because of oxygen?
they can't call it o2 because: https://en.wikipedia.org/wiki/The_O2_Arena
and the venue's sponsor: https://en.wikipedia.org/wiki/O2_(UK)
Sam Altman is somehow finding this out now, the hard way.
Most paying customers will find out within minutes whether the models can serve their use case, a benchmark isn’t going to change that except for media manipulation (and even that doesn’t work all that well, since journalists don’t really know what they are saying and readers can tell).
Google with all its money and smart engineers was not able to build a simple chat application.
Because when you think about it, it really is quite damning. Minus statistical noise it's no better.
I can see it now:
> Unlock our industry leading reasoning features by upgrading to the GPT 4 Pro Max plan.
https://chatgpt.com/share/679d310d-6064-8010-ba78-6bd5ed3360...
The 4o model without using the Python tool
https://chatgpt.com/share/679d32bd-9ba8-8010-8f75-2f26a792e0...
Trying to get accurate results with the paid version of 4o with the Python interpreter.
https://chatgpt.com/share/679d31f3-21d4-8010-9932-7ecadd0b87...
The share link doesn’t show the output for some reason. But it did work correctly. I don’t know whether the ages are correct. I was testing whether it could handle ordering
I have no idea what conclusion I should draw from this besides depending on the use case, 4o may be better with “tools” if you know your domain where you are using it.
Tools are relatively easy to implement with LangChain or the native OpenAI SDK.
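For anyone who hasn't tried it, a minimal sketch with the plain OpenAI SDK (get_weather is a made-up tool here, purely for illustration):

    from openai import OpenAI

    client = OpenAI()

    # A made-up example tool; the model decides whether to call it and with what arguments.
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)  # the requested tool call, if any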
Model 00oOo is better than Model 0OoO0!
36. Abraham Lincoln – 52 years, 20 days (1861)
37. James Garfield – 49 years, 105 days (1881)
38. Lyndon B. Johnson – 55 years, 87 days (1963)
Basically everything after #15 in the list is scrambled.
I would not consider purposely giving inaccurate feedback for this reason alone.
Example usage:
llm -m o3-mini 'write a poem about a pirate and a walrus' \
-o reasoning_effort high
Output (comparing that with the default reasoning effort): https://github.com/simonw/llm/issues/728#issuecomment-262832...
(If anyone has a better demo prompt I'd love to hear about it)
Didn’t bother reading either of them, just selected one and went on with my day.
If it were me I would have set up a “hey do you mind if we give you two results and you can pick your favorite?” prompt to weed out people like me.
Now if you asked it to write a Python program to list them in order, and have it enter all the names, birthdays, and year elected in a list to get the program to run - that's more reasonable.
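Something along these lines, i.e. the model supplies the data and plain code does the ordering (a sketch; only the three sample entries quoted above, not the full list):

    from datetime import date

    # Sample entries only (taken from the list above); a real run would include all presidents.
    presidents = [
        ("Abraham Lincoln",   date(1809, 2, 12),  date(1861, 3, 4)),
        ("James Garfield",    date(1831, 11, 19), date(1881, 3, 4)),
        ("Lyndon B. Johnson", date(1908, 8, 27),  date(1963, 11, 22)),
    ]

    # Sort by age at inauguration, youngest first.
    for name, born, sworn_in in sorted(presidents, key=lambda p: p[2] - p[1]):
        years = sworn_in.year - born.year - ((sworn_in.month, sworn_in.day) < (born.month, born.day))
        print(f"{name}: {years} years old when taking office ({sworn_in.year})")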
Wasting time on an anti-pattern is not valuable, nor is trying to outguess the way that selection mechanism is used.
For 18,936 input, 2,905 output it cost 3.3612 cents.
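That figure checks out against the published o3-mini prices ($1.10 input / $4.40 output per 1M tokens); a quick sketch of the arithmetic:

    input_cost  = 18_936 * 1.10 / 1_000_000   # ~ $0.0208
    output_cost =  2_905 * 4.40 / 1_000_000   # ~ $0.0128
    print(f"{(input_cost + output_cost) * 100:.4f} cents")  # 3.3612 cents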
Here's the script I used to do it: https://til.simonwillison.net/llms/claude-hacker-news-themes...
A first step to digital immortality. It could be a nice startup: some personalized product for the rich, and then even regular folks. Immortality not in ourselves as meat bags of course, we die regardless, but a digital copy and memento that our children can use if feeling lonely and carry with themselves anywhere, or that later descendants could consult out of curiosity, or bring to massive events like weddings. One could 'invite' long lost ancestors. Maybe your great-grandfather would be a cool guy you could easily click with these days via verbal input. Heck, even a detailed 3D model.
An additional service, 'perpetually' paid - keeping your data model safe, taking care of it, backups, heck maybe even giving it a bit of computing power to receive current news in some light fashion and evolve - could be an extra. Different tiers for different levels of service and care.
Or am I a decade or two ahead? I can see this as universally interesting across many if not all cultures.
I want a single source model that's grounded in base truth. I'll let the model know how to structure it in my prompt.
DeepSeek also gets the order right.
It doesn’t show on the share link. But it actually outputs the list correctly from the built in Python interpreter.
For some things, ChatGPT 4o will automatically use its Python runtime
What's next?
o4-mini (wet socks), o5-Eeny-meeny-miny-moe?
I thought they had a product manager over there.
They only need 2 names, right? ChatGPT and o.
ChatGPT-5 and o4 would be next.
This multiplication of the LLM loaves and fishes is kind of silly.
I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.
> In fact, the O1 model used in OpenAI's ChatGPT Plus subscription for $20/month is basically the same model as the one used in the O1-Pro model featured in their new ChatGPT Pro subscription for 10x the price ($200/month, which raised plenty of eyebrows in the developer community); the main difference is that O1-Pro thinks for a lot longer before responding, generating vastly more COT logic tokens, and consuming a far larger amount of inference compute for every response.
Granted "basically" is pulling a lot of weight there, but that was the first time I'd seen anyone speculate either way.
[0] https://youtubetranscriptoptimizer.com/blog/05_the_short_cas...
I prefer, for philosophical reasons, open weight and open process/science models, but OpenAI has done a very good job at productizing ChatGPT. I also use their 4o-mini API because it is cheap and compares well to using open models on Groq Cloud. I really love running local models with Ollama, but the API vendors keep the price so low that I understand most people not wanting the hassle of running DeepSeek-R1, etc., locally.
The DeepSeek paper states that the $5M number doesn't include development costs, only the final training run. And it doesn't include the estimated $1.4 billion cost of the infrastructure/chips DeepSeek owns.
Most of OpenAI's billion dollar costs is in inference, not training. It takes a lot of compute to serve so many users.
Dario said recently that Claude was in the tens of millions (and that it was a year earlier, so some cost decline is expected), do we have some reason to think OpenAI was so vastly different?
My experience is as follows:
- "Reason" toggle just got enabled for me as a free tier user of ChatGPT's webchat. Apparently this is o3-mini - I have Copilot Pro (offered to me for free), which apparently has o1 too (as well as Sonnet, etc.)
From my experience DeepSeek R1 (webchat) is more expressive, more creative and its writing style is leagues better than OpenAI's models, however it under-performs Sonnet when changing code ("code completion").
Comparison screenshots for prompt "In C++, is a reference to "const C" a "const reference to C"?": https://imgur.com/a/c-is-reference-to-const-c-const-referenc...
tl;dr keep using Claude for code and DeepSeek webchat for technical questions
modelname(variant).majorVersion.minorVersion
ChatGPT(o).3.0
ChatGPT-mini(o).3.0
GPT.2.123
GPT.3.9
And so on. Once it's coherent, people pick it up, and naturally call the model by "modelname majorversion", and there's no confusion or hesitance about which is which. See, it took me 2 minutes.
Even better: Have an OAI slack discussion company-wide, then have managers summarize their team's discussions into a prompt demonstrating what features they want out of it, then run all the prompts together and tell the AI to put together 3 different naming schemes based on all the features the employees want. Roll out a poll and have employees vote which of the 3 gets used going forward. Or just tap into that founder mode and pick one like a boss.
Don't get me wrong, I love using AI - we are smack dab in the middle of a revolution and normal people aren't quite catching on yet, so it's exhilarating and empowering to be able to use this stuff, like being one of the early users of the internet. We can see what's coming, and if you lived through the internet growing up, you know there's going to be massive, unexpected synergies and developments of systems and phenomena we don't yet have the words for.
OpenAI can do better, and they should.
"Evaluations by expert testers showed that o3-mini produces more accurate and clearer answers, with stronger reasoning abilities, than OpenAI o1-mini. Testers preferred o3-mini's responses to o1-mini 56% of the time and observed a 39% reduction in major errors on difficult real-world questions. W"
This naming scheme is a dumpster fire. Every other comment is trying to untangle what the actual hierarchy of model performance is.
Is this a case of PEBKAC?
1) sometimes I wanted the short answer, and so even though the long answer is better I picked the short one.
2) sometimes both contain code that is different enough that I am inclined to go with the one that is more similar to what I already had, even if the other approach seems a bit more solid.
3) Sometimes one will have less detail but more big picture awareness and the other will have excellent detail but miss some overarching point that is valuable. Depending on my mood I sometimes choose but it is annoying to have to do so because I am not allowed to say why I made the choice.
The area of human training methodology seems to be a big part of what got Deepseek's model so strong. I read the explanation of the test results as an acknowledgement by OpenAI of some weaknesses in its human feedback paradigm.
IMO the way it should work is that the thumbs up or down should be read in context by a reasoning being and a more in-depth training case should be developed that helps future models learn whatever insight the feedback should have triggered.
Feedback that A is better or worse than B is definitely not (in my view) sufficient except in cases where a response is a total dud. Usually the responses have different strengths and weaknesses and it's pretty subjective which one is better.
However, I have suggested worse approaches on purpose, and sometimes Claude does pick them up as less than optimal.
[1]: https://old.reddit.com/r/LocalLLaMA/comments/1gna0nr/popular...
Is it twice as good though?
- DeepSeek chose to open-source model weights. This makes them infinitely more trustworthy than ClosedAI.
- Local/national regulations do not allow using OpenAI, due to close ties to the US government.
You use words as labels so that we use our pre existing knowledge of the word to derive meaning from the label.
If 10% of people just click based on how fast the response was because they don't want to read both outputs, your p-value for the latter hypothesis will be atrocious, no matter how large the sample is.
Couldn't this be considered a form of preference?
Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.
o1 seems to have been neutered in the last week: lots of disclaimers and butt-covering in its responses.
I also had an annoying discussion with o1 about the DC plane crash... it doesn't have web access and its cutoff is 2024, so I don't expect it to know about the crash. However, after saying such an event is extremely unlikely and being almost patronisingly reassuring, it treated pasted news articles and links (which, to be sure, it can't access) as "fictionalized", instead of acknowledging its own cut-off date and that it could have been wrong. In contrast, DeepSeek (with web search turned off) was less dismissive of the risks in DC airspace, and more aware of its own knowledge cut-off.
Coupled with the limited number of o1 responses for ChatGPT Plus, I’ve cancelled my subscription for now.
I think you misread something. AWS mainly offers the full size model on Bedrock: https://aws.amazon.com/blogs/aws/deepseek-r1-models-now-avai...
They talk about how to import the distilled models and deploy those if you want, but AWS does not appear to be officially supporting those.
But I'm glad you pointed that out; I now suspect that a large part of the disagreement between the "huh? a statistically significant blind evaluation is a statistically significant blind evaluation" repliers and the "oh, this was obviously a terrible study" repliers is due to different interpretations of that post. Thanks. I genuinely didn't consider the alternative interpretation before.
R1 70b also got it right just as many times for me.
Thanks OpenAI! Nice gift and a neat distraction from DeepSeek-R1 - which I still can't use directly, because their API stopped working moments after I topped up my credits and generated an API key, and is still down for me... :/.
Claude, GPT 4 onwards, and DeepSeek all feel the same to me. Okay to a point, then kinda useless. More like a more convenient specialised Google that you need to double check the results of.
1) Coming off as a jerk, and from a new account is a bad look
2) "Literally the opposite of a coin flip" would probably be either 0% or 100%
3) Your reasoning doesn't stand up without further info; it entirely depends on the sample size. I could have 5 coin flips all come up heads, but over thousands or millions it averages to 50%. 56% on a small sample size is absolutely within margin of error/noise. 56% on a MASSIVE sample size is _statistically_ significant, but isn't even still that much to brag about for something that I feel like they probably intended to be a big step forward.
It's like with videogames - most of them commit all kinds of UI/UX sins, and I often wish they didn't, but excepting extreme cases, the players are too motivated to care or notice.
Compare LLMs from a year or two ago with the ones out today on practically any task. It's night and day difference.
This is specially so when you start taking into account these "reasoning" models. It's mind blowing how much better they are than "non-reasoning" models for tasks like planning and coding.
https://aider.chat/docs/leaderboards/#aider-polyglot-benchma...
Not quite. "2.0 Flash" is also called 2.0. The "Pro" models are the full models. But, I love how they have both "gemini-exp-1206" and "gemini-2.0-flash-thinking-exp-01-21". The first one doesn't even say what type of model it is, presumably it should have been "gemini-2.0-pro-exp-1206", but they didn't want to label it that for some reason, and now they're putting a hyphen in the date string where they weren't before.
Not to mention they have both "Flash" and "Flash-8B"... which I think will confuse people. IMO, it should be "Flash-${Parameters}B" for both of them if they're going to mention it for one.
But, I generally think Google's Gemini naming structure has been pretty decent.
> If so, where do they indicate they failed to randomize/blind the raters?
> Win rate if user is under time constraint
This is hard to read, tbh. Is it STEM? Non-STEM? If it is STEM, then this shows there is a bias. If it is Non-STEM, then this shows a bias. If it is a mix, well, we can't know anything without understanding the split. Note that Non-STEM is still within error. STEM is less than 2 sigma variance, so our confidence still shouldn't be that high.
I have something like “always be terse and blunt with your answers.”
Let's break down the problem step by step to understand the relationships and positions of Bob, Ann, and Cathy.
1. Understanding the Initial Setup
Bob is facing forward.
This means Bob's front is oriented in a particular direction, which we'll consider as the reference point for "forward."
To his left is Ann, to his right is Cathy.
If Bob is facing forward, then:
Ann is positioned to Bob's left.
Cathy is positioned to Bob's right.
Ann and Cathy are facing backwards.
Both Ann and Cathy are oriented in the opposite direction to Bob. If Bob is facing forward, then Ann and Cathy are facing backward.
2. Visualizing the Positions
To better understand the scenario, let's visualize the positions:
Forward Direction: ↑
Bob (facing forward) | | Ann (facing backward) | / | / | / | / | / | / | / |/ |
And then only the character | in a newline forever.
Watch here. https://www.youtube.com/watch?v=by9PUlqtJlM
I care mostly about batching in Vertex AI, which is 17-30x cheaper than the competition (whether you use prompt caching or not) while allowing for audio, video, and arbitrary document filetype inputs; unfortunately Gemini 1.5 Pro/Flash have remained the two so-called "stable" options available there. I appreciate Google's experimental models as much as anyone, but I cannot take them seriously until they allow me to have my sweet, sweet batches.
Have you tried them?
This is something I threw together with o3-mini
https://chatgpt.com/share/679d5305-5f04-8010-b5c4-61c31e79b2...
ChatGPT 4o doesn’t even try to do the math internally and uses its built in Python interpreter. (The [_>] link is to the Python code)
https://chatgpt.com/share/679d54fe-0104-8010-8f1e-9796a08cf9...
DeepSeek handles the same problem just as well using the reasoning technique.
Of course ChatGPT 4o went completely off the rails without using its Python interpreter
https://chatgpt.com/share/679d5692-96a0-8010-8624-b1eb091270...
(The breakdown that it got right was using Python even though I told it not to)
The truth is these models are very stochastic; you have to try new chats whenever you even moderately suspect you're going awry.
o3-mini is probably not a very large model and OpenAI has layers upon layers of efficiencies, so they must be making an absolute killing charging 3.3 cents for a few seconds of compute.
Nevertheless, R1's reasoning chains are already shorter in tokens than o1's while having similar results, and apparently o3-mini's too.
But given how OpenAI employees act online these days I wouldn't be surprised if someone on the ground proposed it as a way to screw with all the 3rd parties who are using OpenAI compatible endpoints or even use OpenAI's SDK in their official docs in some cases.
I'm not going to say there's nothing substantive about o3 vs. o1, but I absolutely do not put it past Sam Altman to juice the stats every chance he gets.
o3-mini-high: 50 messages per week (just like o1, but it seems like these are non-shared limits, so you can have 50 messages per week with o1, run out, and still have 50 messages with o3-mini-high to use)
o3-mini: 150 messages per day
Source for the latter is their press release. They were more vague about o3-mini-high, but people have already tested its limits just by using it, and got the pop-up for 25 messages left after sending 25 messages.
It's nice not to worry about running out of o1 messages now and have a faster model that's mostly as good (potentially better in some areas?). OpenAI really needs to release a middle tier for $30 to $40, though, that has the same models as Pro but without infinite usage. I hate not having the smartest model and I don't want to pay $200; there's probably a middle ground where they can make as much or more money from me on a subscription tier that gives limited access to o1-pro.
When I use ChatGPT these days, it's to help me write coding videos and then the social media posts around those videos. So that's two specialties in one chat log.
> It’s already a game changer for many people. But to have so many names like o1, o3-mini, GPT-4o, & GPT-4o-mini suggests there may be too much focus on internal tech details rather than clear communication." (paraphrase based on multiple similar sentiments)
It also hallucinates quotes.
For example:
> "I’m pretty sure 'o3-mini' works better for that purpose than 'GPT 4.1.3'." – TeMPOraL
But that comment is not in the user TeMPOraL's comment history.
Sentiment analysis is also faulty.
For example:
> "I’d bet most users just 50/50 it, which actually makes it more remarkable that there was a 56% selection rate." – jackbrookes – This quip injects humor into an otherwise technical discussion about evaluation metrics.
It's not a quip though. That comment was meant in earnest
We find what appears to be the easiest combination of "Fuck off, go away" buttons and use them without a moment of actual consideration.
(This doesn't mean that it's actually the easiest method.)
I like precision of language, so maybe just have a system prompt that says "use precise language (ex: no symbolism of any kind)"
Does no one else hate it when this happens (especially when on a handheld device)?
One thing I notice that is huge is being able to see the chain of thought lets me see when my prompt was lacking and the model is a bit confused on what I want.
If I was anymore impressed with R1 I would probably start getting accused of being a CCP shill or wumao lol.
With that said, I think it is very hard to compare models for your own use case. I do suspect there is a shiny new toy bias with all this too.
Poor Sonnet 3.5. I have neglected it so much lately I actually don't know if I have a subscription or not right now.
I do expect an Anthropic reasoning model though to blow everything else away.
62% $186 o1 high
60% $18 o3-mini high
57% $5 DeepSeek R1
54% $9 o3-mini medium
52% $14 Sonnet
48% $0 DeepSeek V3
[0] https://aider.chat/docs/leaderboards/
I do think it is a good metaphor for how all this shakes out over time, though.
o3 does look very promising with regards to large context analysis. I used the same raw data and ran the same prompt as Simon for GPT-4o, GPT-4o mini and DeepSeek R1 and compared their output. You can find the analysis below:
https://beta.gitsense.com/?chat=46493969-17b2-4806-a99c-5d93...
The o3-mini model was quite thorough. With reasoning models, it looks like dealing with long context might have gotten a lot better.
Edit:
I was curious if I could get R1 to be more thorough and got the following interesting tidbits.
- Depth Variance: R1 analysis provides more technical infrastructure insights, while o3-mini focuses on developer experience
- Geopolitical Focus: Only R1 analysis addresses China-West tensions explicitly
- Philosophical Scope: R1 contains broader industry meta-commentary absent in o3-mini
- Contrarian Views: o3-mini dedicates specific section to minority opinions
- Temporal Aspects: R1 emphasizes future-looking questions, o3-mini focuses on current implementation
You can find the full analysis at
https://beta.gitsense.com/?chat=95741f4f-b11f-4f0b-8239-83c7...
As subsequent models have been released, most of which claim to be better at coding, I've switched cursor to it to give them a try.
o1, o1-pro, deepseek-r1, and the now o3-mini. All of these models suffer from the exact same "adhd." As an example, in a NextJS app, if I do a composer prompt like "on page.tsx [15 LOC], using shadcn components wherever possible, update this page to have a better visual hierarchy."
sonnet nails it almost perfectly every time, but suffers from some date cutoff issues like thinking that shadcn-ui@latest is the repo name.
Every single other model, doesn't matter which, does the following: it starts writing (from scratch), radix-ui components. I will interrupt it and say "DO NOT use radix-ui, use shadcn!" -- it will respond with "ok!" then begin writing its own components from scratch, again not using shadcn.
This is still problematic with o3-mini.
I can't believe it's the models. It must be the instruction-set that cursor is giving it behind the scenes, right? No amount of .cursorrules, or other instruction, seems to get cursor "locked in" the way sonnet just seems to be naturally.
It sucks being stuck on the (now ancient) sonnet, but inexplicably, it remains the only viable coding option for me.
Has anyone found a workaround?
These models already do an excellent job with your homework, your corporate PowerPoints and your idle questions. At some point only experts would be able to decide if one response was really better than another.
Our biggest challenge is going to be finding problem domains with low performance that we can still scale up to human performance. And those will be so niche that no one will care.
Agents on the other hand still have a lot of potential. If you can get a model to stay on task with long context and remain grounded then you can start firing your staff.
[0] https://www.perplexity.ai/search/how-can-i-construct-a-list-...
https://aws.amazon.com/blogs/machine-learning/deploy-deepsee...
Of course, the model has only been out for a few hours, so whether it lives up to the benchmarks or not isn't really known yet.
I tend to exhaustively detail what I want, including package names and versions because I've been to that movie before...
Very annoying now having to manually press the "^" to hide the verbose thought traces for every single question I ask; it totally breaks flow.
I find huge value in these models as an augmentation of my intelligence and as a kind of cybernetic partner.
I can't think of anything that can actually be automated though in terms of white collar jobs.
The white collar model test case I have in mind is a bank analyst under a bank operations manager. I have done both in the past, but there is something really lacking with the idea of the operations manager replacing the analyst with a reasoning model, even though DeepSeek annihilates the reasoning of every bank analyst I ever worked with right now.
If you can't even arbitrage the average bank analyst, there might be these really non-intuitive no-AI-arbitrage conditions with white collar work.
(I have no formal training in creative writing, though I do read a lot of literature. Not claiming my tastes are superior - genuinely curious if other people disagree).
"My attempt at solving the problem contains a test that fails? No problem, let me mock the function I'm testing, so that, rather than actually run, it returns the expected value!"
It keeps doing that kind of shenanigans, applying modifications that solve the newly appearing problem while screwing the original attempt's goal.
I usually get much better results from regular chatgpt copying and pasting, the trouble being that it is a major pain to handle the context window manually by pasting relevant info and reminding what I think is being forgotten.
- An Architect model is asked to describe how to solve the coding problem.
- An Editor model is given the Architect’s solution and asked to produce specific code editing instructions to apply those changes to existing source files.
Splitting up “code reasoning” and “code editing” in this manner has produced SOTA results on aider’s code editing benchmark. Using o1-preview as the Architect with either DeepSeek or o1-mini as the Editor produced the SOTA score of 85%. Using the Architect/Editor approach also significantly improved the benchmark scores of many models, compared to their previous “solo” baseline scores (striped bars).
To save you the click: I think the most interesting things about this model are the price - less than half that of GPT-4o while being better for many things, most notably code - and the increased length limits.
200,000 tokens input and 100,000 output (compared to 128k/16k for GPT-4o and just 8k for DeepSeek R1 and Claude 3.5 on output) could open up some interesting new applications, especially at that low price.
aider --architect --model o1 --editor-model sonnet
Then you'll be in "architect" mode, which first prompts o1 to design the solution, then you can accept it and allow sonnet to actually create the diffs.
Most of the time your way works well—I use sonnet alone 90% of the time, but the architect mode is really great at getting it unstuck when it can't seem to implement what I want correctly, or keeps fixing its mistakes by making things worse.
And even though the speech recognition fails a lot on some of the technical terms or weirdly named packages, software, etc, it still does a good job overall (if I don’t feel like correcting the wrong stuff).
It’s great and has become somewhat of a party trick at work. Some people don’t even use AI to code that often, and when I show them “hey have you tried this?” And just tell the computer what I want? Most folks are blown away.
So I think you got lucky, and you're asking it to write using a very specific code library that it's good at, because it happens to use it for its main userbase on the web chat interface.
I wonder if you were using a different component library, or using Svelte instead of React, would you still find Claude the best?
As a student, you should continue to focus on fundamentals, but also adapt LLMs into your workflow where you can.
Skip writing the assembly (now curly braces and semicolons), and focus on what the software you’re building actually does, who it serves, and how it works.
Programming is both changing a lot, and not at all. The mechanics may look different, but the purpose is still the same: effectively telling computers what to do.
The specific contents or skills of your degree don't matter that much. In pretty much any STEM field, over the last 100ish years, whatever you learned in your undergraduate was mostly irrelevant by the time you retired.
Everyone got by, by staying on top of the new developments in the field and doing them. With AI, the particular skills needed to use the power of computers to do things in the world have changed. Just learn those skills.
Nebius is single digit TPS. 31 seconds to reply to "What's 1+1".
Hopefully Deepseek will make it out of their current situation because in a very ironic way, the thing the entire market lost its mind over is not actually usable at the pricing that drove the hype: https://openrouter.ai/deepseek/deepseek-r1
We're also talking to the cursor/windsurf/zed folks on how we can improve Next.js and shadcn in the editors (maybe something like llms.txt?)
It's a 15 line tsx file so context shouldn't be an issue.
Makes me wonder if reasoning models are really proper models for coding in existing codebases
Context alone shouldn't be the reason that sonnet succeeds consistently, but others (some which have even bigger context windows) fail.
Try using Gemini thinking with Cursor. It barely works. Cmd-k outputs the thinking into the code. It's unusable in chat because the formatting sucks.
Is there some relationship between Cursor and Anthropic, I wonder. Plenty of other platforms seem very eager to give users model flexibility, but Cursor seems to be lacking.
I could be wrong, just an observation.
The advertised context length being longer doesn't necessarily map 1:1 with the actual ability the models have to perform difficult tasks over that full context. See for example the plots of performance on ARC vs context length for o-series models.
And on the side of not being able to get models to understand something specific, there’s a place in a current project where I use a special Unicode apostrophe during some string parsing because a third-party API needs it. But any code modifications by the AI to that file always replace it with a standard ASCII apostrophe. I even added a comment on that line to the effect of “never replace this apostrophe, it’s important to leave it exactly as it is!” and also put that in my cursor rules, and sometimes directly in the prompt as well, but it always replaces it, even for completely unrelated changes. I’ve had to manually fix it like 10 times in the last day; it’s infuriating.
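One workaround sketch (the exact character isn't stated above, so U+2019 here is purely an assumption for illustration): keep it behind a named escape instead of a literal, so a silent ASCII swap is at least obvious in review.

    # Hypothetical example; the real character the third-party API wants isn't
    # specified in the comment above, so U+2019 is assumed for illustration only.
    SPECIAL_APOSTROPHE = "\u2019"  # not the ASCII "'" - the API needs this exact character

    def normalize_for_api(text: str) -> str:
        # Swap plain ASCII apostrophes for the variant the API expects.
        return text.replace("'", SPECIAL_APOSTROPHE)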
I originally did that to save on tokens but modern models have much larger input windows so I may not need to do that any more.
Perhaps whatever algorithms Cursor uses to prepare the context it feeds the model are a good fit for Claude but not so much for the others (?). It's a random guess, but whatever the reason, there's a weird worsening of performance vs pure chat.
Coding is basically just about the most precise way to encapsulate a problem as a solution possible. Taking a loose English description and expanding it into piles of code is always going to be pretty leaky no matter how much these models spit out working code.
In my experience you have to pay a lot of attention to every single line these things write because they’ll often change stuff or more often make wrong assumptions that you didn’t articulate. And in my experience they never ask you questions unless you specifically prompt them to (and keep reminding them to), which means they are doing a hell of a lot of design and implementation that unless carefully looked over will ultimately be wrong.
It really reminds me a bit of when Ruby on Rails came out and the blogosphere was full of gushing “I’ve never been more productive in my life” posts. And then you find out they were basically writing a TODO app and their previous development experience was doing enterprise Java for some massive non-tech company. Of course RoR will be a breath of fresh air for those people.
Don’t get me wrong, I use Cursor as my daily driver, but I am starting to find the limits of what these things can do. And the idea of having two of these LLMs taking some paragraph-long feature description and somehow chatting with each other to create a scalable bit of code that fits into a large or growing codebase... well, I find that kind of impossible. Sure, the code compiles and conforms to whatever best practices are out there, but there will be absolutely no consistency across the app—especially at the UX level. These things simply cannot hold that kind of complexity in their head, and even if they could, part of a developer’s job is to translate loose English into code. And there is much, much, much, much more to that than simply writing code.
https://www.donhopkins.com/home/catalog/lem/WonderfulPoems.h...
Prompt:
A love poem, lyrical, pastoral, and expressed in the language of pure mathematics. Tensor algebra mainly, with a little topology and higher calculus, if need be. But with feeling, you understand, and in the cybernetic spirit.
Response:
Come, let us hasten to a higher plane,
Where dyads tread the fairy fields of Venn,
Their indices bedecked from one to n,
Commingled in an endless Markov chain!
Come, every frustum longs to be a cone,
And every vector dreams of matrices.
Hark to the gentle gradient of the breeze:
It whispers of a more ergodic zone.
In Riemann, Hilbert or in Banach space
Let superscripts and subscripts go their ways.
Our asymptotes no longer out of phase,
We shall encounter, counting, face to face.
I'll grant thee random access to my heart,
Thou'lt tell me all the constants of thy love;
And so we two shall all love's lemmas prove,
And in our bound partition never part.
For what did Cauchy know, or Christoffel,
Or Fourier, or any Boole or Euler,
Wielding their compasses, their pens and rulers,
Of thy supernal sinusoidal spell?
Cancel me not -- for what then shall remain?
Abscissas, some mantissas, modules, modes,
A root or two, a torus and a node:
The inverse of my verse, a null domain.
Ellipse of bliss, converse, O lips divine!
The product of our scalars is defined!
Cyberiad draws nigh, and the skew mind
cuts capers like a happy haversine.
I see the eigenvalue in thine eye,
I hear the tender tensor in thy sigh.
Bernoulli would have been content to die,
Had he but known such a squared cosine 2 phi!
From The Cyberiad, by Stanislaw Lem. Translated from Polish by Michael Kandel.
Here's a previous discussion of Marcin Wichary's translation of one of Lem's stories from Polish to English. He created the Lem Google Doodle, and he stalked and met Stanislaw Lem when he was a boy. Plus a discussion of Michael Kandel's translation of the poetry of the Electric Bard from The First Sally of Cyberiad, comparing it to machine translation:
https://news.ycombinator.com/item?id=28600200
Stanislaw Lem has finally gotten the translations his genius deserves:
https://www.washingtonpost.com/entertainment/books/stanislaw...
> Lem’s fiction is filled with haunting, prescient landscapes. In these reissued and newly issued translations — some by the pitch-perfect Lem-o-phile, Michael Kandel — each sentence is as hard, gleaming and unpredictable as the next marvelous invention or plot twist. It’s hard to keep up with Lem’s hyper-drive of an imagination but always fun to try.
Swift:
//
// Double Reflection Algorithm from Table I (page 7)
// in Section 4 of https://tinyurl.com/yft2674p
//
    for i in 1 ..< N {
        let X1 = spine[i]
        ...

Kotlin:

    // Use the Double Reflection Algorithm (from Wang et al.) to compute subsequent frames.
    for (i in 1 until N) {
        val X1 = Vector3f(spine[i])
        ...
It's difficult to avoid Claude's strong bias for being agreeable. It needs more HAL 9000.
“Summarize the themes of the opinions expressed in discussions on Hacker News on January 31 and February 1, 2025, about OpenAI’s release od [sic] ChatGPT o3-mini. For each theme, output a header. Include direct "quotations" (with author attribution) where appropriate. You MUST quote directly from users when crediting them, with double quotes. Fix HTML entities. Go long. Include a section of quotes that illustrate opinions uncommon in the rest of the piece”
The result is here:
https://chatgpt.com/share/679d790d-df6c-8011-ad78-3695c2e254...
Most of the cited quotations seem to be accurate, but at least one (by uncomplexity_) does not appear in the named commenter’s comment history.
I haven’t attempted to judge how accurate the summary is. Since the discussions here are continuing at this moment, this summary will be gradually falling out of date in any case.
I’ve found if you give your prompts a kind of long-form "stream of consciousness", where you outline snippets of code in markdown along with contextual notes and then summarise/outline at the end what you actually wish to achieve, you can get great results.
Think long-form, single-page "documentation"-type prompts that alternate between written copy/contextual intent/description and code blocks. Annotating code blocks with file names above the blocks helps too, I'm sure. Don't waste your context window on redundant or irrelevant information or code; stating that a code sample is abridged, or adding commented ellipses, seems to do the job.
Or end with "from the user's perspective: all the "B" elements should light up in excitement when you click "C""
That said, in the hands of someone who is competent at assembling a large app, I think these tools can be incredibly powerful. I have a business helping companies figure out how/if to leverage AI and have built a bunch of different production LLM-backed applications using LLMs to write the code over the past year, and my impression is that there is very much something there. Taking it step by step, file by file, like you might if you wrote the code yourself, describing your concept of the abstractions, having a few files describing the overall architecture that you can add to the chat as needed—little details make a big difference in the results.
1. they are trying to obfuscate DeepSeek's success
2. they are trying to confuse you. the benchmark margins are minimal (and meaningless)
3. they are trying to buy time (with investors) by releasing nothing-special models on a predictable schedule (jan -> o3, feb -> o3-pro-max, march -> o7-ultra, and in 2026 -> OMG! we've reached singularity! (after spending $500B))
-
And at the end of the day, nothing changes for me, and neither for you. Enjoy your time away from this sick AI hype. bruh!
Inference capex costs are not a defensive moat, as I can rent GPUs and sell inference with linearly scaling costs. A hypothetical 10 billion dollar training run on proprietary data would be a massive moat.
https://www.itpro.com/technology/artificial-intelligence/dol...
Yep. A general software engineering task has a lot of information encoded in it that is either already known to a human or is contextually understood by a human.
A competitive programming task often has to provide all the context, as it's not based on an existing product, codebase, technology, or paradigm known to the user.
After about 3 weeks, things were looking great - but lots of spaghetti code was put together, and it never told me what I didn't know. The data & state management architecture I had written was simply not maintainable (tons of prop drilling, etc). Over time, I basically learned common practices and I'm finding that I have to deal with these problems myself. (How it used to be!)
We're getting close - the best thing I've done is create documentation files with lots of descriptions about the architecture/file structure/state management/packages/etc, but it only goes so far.
We're getting closer, but for right now - we're not there and you have to be really careful with looking over all the changes.
Have a look at this: https://en.wikipedia.org/wiki/Society_of_Mind
I sometimes do this (using Cline), plus create a .cline file at project root which I refine over time and which describes both the high level project overview, details of the stack I'm using, and technical details I want each prompt to follow.
Then each actual prompt can be quite short: read files x, y, and z, and make the following changes... where I keep the changes concise and logically connected - basically what I might do for a single pull request.
I like it for general refactoring and day to day small tasks, but anything that's relatively domain-specific, I just can't seem to get anything that's worth using.
One immediate difference I noticed is that o3-mini actually observes the system prompt you set. So if I say it's a Staff Engineer at Google, it'll stay in character.
That was not possible with o1-mini, it ignored system prompts completely.
After any back and forth session I have reasonably good results asking something like "Given this workflow, how could I have prompted this better from the start to get the same results?"
I've used Cursor for shipping better frontend slop, and it's great. I skip a lot of trial and error, but not all of it.
I tried out the OP model to make changes to my yaml files. It would give short snippets and I’d have to keep trial and erroring its suggestions.
Eventually I pasted the original prompt to Claude and it one-shot the dang thing with perfect config. Made me wonder why I even try new models.
It just means you're less likely to be fixing someone else's "mistakenly used _mm512_store_si512 instead of _mm512_storeu_si512" error, because the AI fixed it for you, and you can focus on other parts of computer science. Computer science surely isn't just fixing _mm512_store_si512.
I can't wait until this is a solved problem because it does slow me down.
Sadly there are some hard limits on o1 with Cursor and I cannot use it anymore. I do pay for their $20/month subscription.
My hypothesis is it's a difference in what you are doing. OpenAI's o-models are much better than others at mathematical modelling and such tasks, and Claude at more general-purpose programming.
They have underutilized the chain of thought in their reasoning; it ought to be thinking something like "I need to be careful to not say anything that could bring embarrassment to the party"..
but perhaps the online versions do actually preload the reasoning this way. :P
1. The message was net-upvoted. Whether there are downvotes in there I can't tell, but the final karma is positive. A similarly spirited message of mine in the same thread was quite well received as well.
2. I can't see how my message would come across as a jerk? I wrote 2 simple sentences, not using any offensive language, stating a mere fact of statistics. Is that being a jerk? And a long-winded berating of a new member of the community isn't?
3. A coin flip is 50%. Anything else is not, once you have a certain sample size. So, this was not. That was my statement. I don't know why you are building a strawman of 5 coin flips. 56% vs 44% is a margin of 12%, as I stated, and with a huge sample size, which they had, that's massive in a space where the returns are deep in "diminishing" territory.
You might use your scooter, bike, car, or jet, depending on the circumstances. The bike was invented 100 years ago? But it may be the best in the right use case. We'd still be using DaVinci for some things because we haven't bothered swapping it out and it works fine.
For me - the value of R1/o3 is visible logic that provides an analysis that can be critiqued by Sonnet 3.5
I'm a domain expert and I disagree.
There's many scenarios where using LLMs pays off.
E.g. a long file or a very long function is just that, and an LLM is faster at understanding it whole, not being limited in how many things it can track at once (for a human mind, between 4 and 6). It's still gonna be faster at refactoring it and testing it than you will be.
And LibreOffice.
My day job is a big monorepo. I have not investigated that yet, but I believe the models' context sizes fall short there, and as such the above use cases only work for me on smaller codebases.
I have a codebase of maybe 300-500k lines which is in good shape because of this.
I also normally just add the specific files I need to the chat and give it 1-2 sentences for what to do. It normally does the right thing (sonnet obviously).
Ask it to write something and not sound robotic and it's exceptional up to 1000 words. Then it starts becoming more and more robotic. At 3500 words every last paragraph starts "Remember,...", etc
... is definitely something I've said before, and recently, but:
> That moat is trivially circumvented by lighting a larger pile of money
If that was true, someone would have done it.
I was somewhat turned off of DeepSeek (the first few questions I gave it, it returned 100% hallucinated answers). But maybe I'll have to look into it more, thanks.
Ends before means.
If 4o answered better than o3, would you still use o3 for your task just because you were told it can "reason"?
Especially if somebody is being wrong.
You need to stop being wow'd by human intelligence masquerading as AI!
You can elicit that with any model by prompting for underlying reasons or using chain-of-thought, but a reasoning model can do it without prompting.
Sounds exhausting.
The intrinsic motivation for providing the comments comes from a mix of peer interaction, camaraderie, and reputation building.
If it becomes evident that your outputs are only directly consumed by a sentiment-aggregation layer that scrubs you from the discourse, then it could be harder to put a lot of effort into the thread.
This doesn't even account for the loss of info that results from fewer people actually reading and voting through the thread.
It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.
I get that it is easier, but intellectually you are doing yourself no favors by having this be your default.
It was always attributed to variability but we all know it's not.
This shows a grave misunderstanding of what compilers and LLMs are. They're fundamentally opposite concepts.
Compilers are about optimizing abstract code down to the most efficient representation possible for some hardware. LLMs are about wasting petaflops (made possible by compiler engineers) to produce random statements that don't have any static guarantees.
One, about the field itself. So far, I have been a know-it-all, and I dabbled in management too, besides that. This worked for me because, no matter how the field and my opportunities shifted, I always had a card up my sleeve. This is highly personal though.
Two, about managing angst. Whatever you experience now, you will in the future too. Circumstances won't matter at all, your brain will convert whatever it perceives around you, into these feelings that you generally experience. You can be at your highest high, and the lowest low, and you will always gravitate back towards these familiar feelings of yours. So, what you can do to have a nicer experience is to be a good partner yourself, and learn how to live with these specific feelings that you have.
> It quickly becomes this weird bubble of people just acting on what everyone "thinks" the content is about without ever having looked at the content.
That isn't an issue though since the important part is what you learn or not, not whether you think an imaginary article is true or not. If you learn something from someone debunking an imaginary article, that is just as good as learning something from debunking a real article.
The only issue here is attribution, but why should a reader care about that?
Edit: And it isn't an issue that people will think it actually debunks the linked article, since there will always be a sub comment stating that the commenter didn't read the article and therefore missed the mark.
Jokes aside, I understand your point.
In the history of computing, LLMs and compilers are closer than one might think.
Compilers weren’t first created to optimize “abstract code down to the most efficient” assembly as possible, even if that is the goal of a compiler writer today.
Compilers were created to enable the use of higher-level languages. Abstraction, efficiency, portability, error reduction, and most importantly: saving time.
They allowed humans to create more software, faster.
It's easy to pin this on the users, but that website is hostile to putting in any effort.
This is something I've noticed a lot actually. A lot of AI projects just give you an input field and call it a day. Expecting the user to do the heavy lifting.
An argument for synthetic corpi (plural of corpus..esses?) - AI ingesting AI.
IMO, (Strict)YAML is a very good alternative; it has even been suggested to me by multiple LLMs when I asked them what they thought the best format for presenting conversations to an LLM would be. It is very easy to chunk simple YAML and present it to an LLM directly off the wire: you only need to remember to repeat the indentation and names of all higher-level keys (properties) pertaining to the current chunk at the top of the chunk, then start a text block containing the remaining text in the chunk, and the LLM will happily take it from there:
topic:
  subtopic:
    text: |
      Subtopic text for this chunk.
If you want to make sure that the LLM understands that it is dealing with chunks of a larger body of text, you can start and end the text blocks of the chunks with an ellipsis ('...'). Though at this point it's a habit I cannot quite bring myself to break...
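Roughly, a minimal Python sketch of what that chunking could look like; the function name and the fixed character budget are just for illustration, not anyone's production code:

    def chunk_yaml_text(parent_keys, text, max_chars=2000):
        """Split a long text block into chunks, repeating the parent keys and
        indentation at the top of every chunk so the LLM keeps its place."""
        # Rebuild the YAML header, e.g. "topic:\n  subtopic:\n    text: |"
        header = [f"{'  ' * depth}{key}:" for depth, key in enumerate(parent_keys)]
        header[-1] += " |"                 # the last key opens the literal text block
        body_indent = "  " * len(parent_keys)

        # Greedily pack lines into chunks of roughly max_chars characters.
        chunks, current, size = [], [], 0
        for line in text.splitlines():
            if current and size + len(line) > max_chars:
                chunks.append(current)
                current, size = [], 0
            current.append(line)
            size += len(line)
        if current:
            chunks.append(current)

        out = []
        for i, lines in enumerate(chunks):
            body = [body_indent + line for line in lines]
            if i > 0:                      # optional leading/trailing ellipses to mark
                body.insert(0, body_indent + "...")
            if i < len(chunks) - 1:        # continuation across chunks, as noted above
                body.append(body_indent + "...")
            out.append("\n".join(header + body))
        return out

Calling chunk_yaml_text(["topic", "subtopic", "text"], long_text) then yields chunks that each start with the repeated topic/subtopic/text header from the example above.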
"The model naming all around is so confusing. Very difficult to tell what breakthrough innovations occurred." – patrickhogan1"
I think o1 works significantly better, and that makes me think the timing is more than just a coincidence.
Last week Nvidia lost $600 billion in market value because of DeepSeek R1, and now OpenAI comes out with a new release which feels like it has nothing to do with the promises that were being made about o3.
Absolutely, I find this a challenge as well. Every thought that crosses my mind is a great idea according to it. That's the opposite of the attitude I want from an engineer's copilot! Particularly from one who also advises junior devs.
We're talking about the most cost-efficient model; the competition here is on price, not on a 12% incremental performance gain (which would make sense for the high-end model).
To my knowledge DeepSeek is the cheaper service, which is what matters on the low end (unless the increase in performance were of such magnitude that the extra charge would be worth the money).
Anthropic figured out something about real-world coding that OpenAI is still trying to catch up to, o3-mini-high notwithstanding.
It's cliche but this was the prompt:
I want you to write the first book of a fantasy series. The novel will be over 450 pages long with 30 chapters. Each chapter should have between 15 to 18 pages. Write the first page of the first chapter of this novel. Do not introduce the elements of the synopsis or worldbuilding and story details too quickly. Weave in the world, characters, and plot naturally. Pace it out properly. That means that several elements of the story may not come into light for several chapters.
I had a lot of success with it coming up with decidedly not cliche world building elements after I arranged a sort of interview style interrogation (It asked me questions about what I was looking for generally and generated world building elements along the way).
However, once you start giving a lot of information about the world etc in the prompt as well then the pacing gets weird.
It seemed to be true with every 'advancement' after ChatGPT4.
Layers are cool, but that isn't AI/LLMs.
We have information compression machines now. Might as well raw dump the information and let the machine package it up in the format we prefer for consumption, instead of pre-packaging it. (Yeah, this is effectively what authors are doing…currently they can still do novel things that the compression machines can’t, but how long will that last?)
It's 2025 and every useful conversation with an LLM ends with context exhaustion. There are those who argue this is a feature and not a bug. Or that the context lengths we have are enough. I think they lack imagination. True general intelligence lies on the other side of infinite context length. Memory makes computation universal, remember? http://thinks.lol/2025/01/memory-makes-computation-universal...
Yes, reasoning has improved, but the overall results haven't advanced as much as one would expect from a major version update. It's highly unusual for OpenAI to release a milestone version like this - it feels more like a marketing move than a genuine upgrade.
Who knows what's going on behind closed doors?
If I put on my tinfoil hat for a moment, maybe Nvidia made a deal with OpenAI - offering a discount on computing power in exchange for a timely release.
OpenAI needs an enormous amount of computing power these days, and while Nvidia would take a financial hit by offering a discount to one of its biggest (if not the biggest) customers, that's still nowhere near as costly as losing $600 billion.
(Aside from the tendency towards first=top. Would be nice to have time-weighted upvote decay instead of absolute count)
Isn’t that exactly what VCs want?
o3-mini is the first one to get it right for me. Transcript:
Me:
I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the wolf together, the goat will eat the wolf. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?
o3-mini
1. Take the wolf across to the right bank.
2. Return alone to the left bank.
3. Take the goat across to the right bank.
4. Bring the wolf back to the left bank.
5. Take the cabbage across to the right bank.
6. Return alone to the left bank.
7. Finally, take the wolf across to the right bank.
Me:
I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?
o3-mini
1. Take the cabbage to the right bank.
2. Go back empty to the left bank.
3. Take the wolf to the right bank.
4. Bring the cabbage back to the left bank.
5. Take the goat to the right bank.
6. Go back empty to the left bank.
7. Finally, take the cabbage to the right bank.
Real software has a lot more complexity and constraints, as well as ambiguity. Claude scores nowhere near as high on Codeforces, but crushes o1 on WebArena: https://web.lmarena.ai/leaderboard
I also ran a holdout test myself for o3-mini - I asked it to implement a function I need for Python 2.5. Claude and o1 get it - o3-mini-high incorrectly believes some functions are available in that version that aren't. If I correct it, its revised solution is very hacky (it technically works, but I would take Claude's solution over it).
I suspect you can update the prompt to make the setting more consistent, but it will still throw in a lot of inappropriate detail. I’m only nitpicking because my initial reaction was that it’s very vivid but feels difficult to understand and I wanted to explain why.
1. The only option for who to take first is the goat.
2. We come back and get the cabbage.
3. We drop off the cabbage and take the goat back
4. We leave the goat and take the wolf to the cabbage
5. We go get the goat and we have all of them
Neither of the transcripts do that. In the first one the goat immediately eats the cabbage. In the second one the wolf eats the goat after you leave them.
do you realize you're an LLM?
> This is safe because the wolf won't eat the cabbage if they're together on the far side.
even though it clearly states the opposite in the question.
It's impressive that just dumb stats can be used to produce something that is very often useful, can help write code and when making it generate intermediate steps, it can often generate a chain of text that happens to be often right. However, it's not actual reasoning, there is no model of the world, no information storage and retrieval, and so on - just statistics between tokens.
But as another comment says, many people have tried similar things and posted about it online, that information is siphoned by now and the LLMs weights get updated.
It's a worthwhile experiment for a business school, IMO, automating a layer of bureaucracy.
Neither people nor LLMs expect goats to eat wolves.
o1-pro: This puzzle is simply a twist on the classic “Wolf, Goat, and Cabbage” riddle, except here the dangerous pairs are:
Goat & Wolf cannot be left alone (the Goat will eat the Wolf). Wolf & Cabbage cannot be left alone (the Wolf will eat the Cabbage). The only “safe” pair to leave alone is Goat & Cabbage. You can ferry them across one at a time without ever leaving the dangerous pairs together by following these steps:
1. Take the Wolf across. Near bank: Goat, Cabbage. Far bank: Wolf.
2. Return empty. Near bank: Goat, Cabbage, Boatman. Far bank: Wolf.
3. Take the Goat across. Near bank: Cabbage. Far bank: Wolf, Goat, Boatman. Since Goat & Wolf can’t be left alone, you must immediately bring one back in the next step.
4. Bring the Wolf back. Near bank: Wolf, Cabbage, Boatman. Far bank: Goat.
5. Take the Cabbage across. Near bank: Wolf. Far bank: Goat, Cabbage, Boatman. Goat & Cabbage are safe together.
6. Return empty. Near bank: Wolf, Boatman. Far bank: Goat, Cabbage.
7. Finally, take the Wolf across. Near bank: (empty). Far bank: Wolf, Goat, Cabbage, Boatman.
Me: I have a wolf, a goat, and a cabbage and a boat. I want to get them all across the river. If I leave the goat and the cabbage together, the cabbage will eat the goat. If I leave the wolf and the cabbage together, the wolf will eat the cabbage. I can only take one at a time. How do I do it?
o1-pro: This version of the riddle is basically the “wolf, goat, and cabbage” puzzle with the roles reversed for who can safely be left behind. The key is that the only pair you can leave alone is the goat and the wolf. You cannot leave the goat and cabbage together (the cabbage eats the goat!), nor the wolf and cabbage together (the wolf eats the cabbage!).
A workable step‐by‐step solution is:
1. Take the cabbage across first. Left bank: Wolf, Goat. Right bank: Cabbage.
2. Return alone to the left bank. Left bank: Wolf, Goat, (You). Right bank: Cabbage.
3. Take the goat across. Left bank: Wolf. Right bank: Goat, Cabbage, (You). (You cannot leave goat and cabbage together, so…)
4. Bring the cabbage back with you to the left bank. Left bank: Wolf, Cabbage, (You). Right bank: Goat.
5. Take the wolf across. Left bank: Cabbage. Right bank: Goat, Wolf, (You).
6. Return alone to the left bank. Left bank: Cabbage, (You). Right bank: Goat, Wolf.
7. Finally, take the cabbage across once more. Left bank: (empty). Right bank: Wolf, Goat, Cabbage, (You).
All three—the wolf, the goat, and the cabbage—are now safely across without ever leaving a “forbidden” pair alone.
Me: I'd like you to solve this riddle for me. A farmer has a cabbage, a goat, a wolf and a lion, and needs to cross a river on a boat. If the goat is left alone with the cabbage, it will eat it. If the wolf is left alone with the goat, it will eat it. If the lion is left alone with the goat or wolf, it will eat them. The boat can only carry the farmer and one other thing across. How can the farmer safely transport everything across the river?
o3-mini spent a very long time on it (over a minute), delineating the various strategies it was trying, and finally, correctly, concluded that the puzzle is unsolvable. Good job!
Having to give explicit framing reminds me of tutoring children and trying to teach them about abstract problems — some really struggle with the concept, when it goes against previous knowledge, eg cabbages eating goats. (And LLMs are sensitive to context.)
R1 is my cost effective programmer. Sonnet is my hard problem model still.
For example, since reasoning doesn’t yet have any consensus definition that can be applied as a yes/no test - you have to explain what you specifically mean by it, or else the claim is hollow.
Clarify your definition, give a concrete example under that definition of something that's your version of true-Scotsman reasoning and something that's not, then let's talk.
There was a thread about the US tariffs on Canada I was reading on a stock investment subreddit. The whole page was full of people complaining about Elon Musk, Donald Trump, "Buy Canadian" comments, moralizing about Alberta's conservative government and other unrelated noise. None of this was related to the topic; stocks and funds that seemed well-placed for a post-tariff environment.
There were small, minor points of interest but instead of spending honest vacation time looking at each comment at zoomer internet church, I had an LLM filter out the stuff I didn't care about. Unsurprisingly there was not much left.
Prompt: Write an interpreter for a simple but practical scripting language. Write the interpreter in JavaScript to be run on the Node.JS platform. You can import any of the standard Node.JS modules.
Churned out ~750 lines and a sample source code file to run the interpreter on. Ran on the first try completely correctly.
Definitely a step up. Perhaps it's in the training data. I don't know. But no other model has ever produced an error-free and semantically correct program on the first try, and I don't think any ever managed to implement closures.
- The ability to hold a collection of beliefs/statements and perform operations on them such as noticing one contradicts the others. Claude in my example clearly failed this.
- The ability to make logical inferences. Since logical steps can be represented by text, current LLMs seem to simulate this on the token stream and seem to have decent results most of the time. True inferences would be representing statements in some abstract form and being able to perform operations on them according to rules. The key here is that once it has a collection of statements and is given ("decides to use") an operation to perform, then it can do this correctly 100% of the time (ignoring hardware errors).
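A toy sketch of what the first ability above could look like, holding statements in an abstract form and checking them for direct contradictions; the tuple encoding and names are purely illustrative, not a claim about how any model works:

    from itertools import combinations

    # Each belief is a (subject, relation, object, truth-value) tuple.
    def contradictions(beliefs):
        """Return pairs of beliefs that assert and deny the same fact."""
        return [
            (a, b)
            for a, b in combinations(beliefs, 2)
            if a[:3] == b[:3] and a[3] != b[3]
        ]

    beliefs = [
        ("wolf", "eats", "cabbage", True),   # stated in the puzzle prompt
        ("wolf", "eats", "cabbage", False),  # implied by the model's "this is safe" step
    ]
    print(contradictions(beliefs))           # -> one contradicting pair

A system with this kind of explicit store would flag the conflict every time, which is the 100%-of-the-time behaviour described above.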
1) Attention is limited. Human reasoning is slow. Motivation is limited. System 1 vs 2 thinking. Many will just tell you to fuck off or get bored and give some random answer to make you go away. Etc. See difference 2.
2) People run on limited hardware in terms of error rate and memory.
2a) Brains make mistakes all the time. Ask them to multiply a bunch of large numbers, using pen and paper they will get it wrong a lot of the time.
2b) Doing it in their head, they will run out of memory pretty fast.
But you wouldn't say that humans can't multiply numbers. When they have the right algorithm, they can do it, they just have to use the right tools to extend their memory and check for errors. A human who notices the difference in input to something he already knows, immediately knows he has to pay attention to that bit and all subsequent parts which depend on it. Once a human has the right algorithm, he can apply it to different inputs.
LLMs:
comparison to 2a: Current LLMs also make a lot of mistakes. But theirs are not a result of faulty or limited hardware, they are the result of a faulty algorithm. Take away the random seeds and an LLM will make the same mistake over and over. Randomness is the smoke and mirrors which make LLMs seem more "alive" and less like machines imperfectly imitating humans.
comparison to 2b) Current LLMs do not store statements in an abstract, structured form where they could save and load information and perform steps such as inferring redundant information from the rest. They operate on the token stream, which is probably wasteful in terms of memory and less flexible in terms of what operations they can perform on it.
Most importantly, they are not limited by memory. The input clearly states "the wolf will eat the cabbage", yet the LLM generates "This is safe because the wolf won't eat the cabbage if they're together on the far side." just a few lines below. It is unable to infer that those two facts are contradictory. The statistics of tokens simply worked out in a way that led to this.
Yet they "make mistakes". Those are not the same as human mistakes. LLMs follow an algorithm that is far simpler and inferior, they simply use the hardware to perform incorrect ("illogical", "meaningless") operations, thus giving incorrect results.
See my other replies for more depth.
Part of the utility of writing a review is that it is read, but the primary search for keywords in reviews now requires the user to wait for AI generated responses first.
Then the user must tap through another link and then expand an individual matching review. It’s absolutely buried.
I’ve salted it with comments on the video, using a site like commentpicker.com or running JS and loading more and expanding threads manually.
Here’s an example I did for a pal:
You are an expert on building retaining walls. Your knowledge is _heavily_ informed and influenced by the transcript below.
This transcript is audio from a youtube video titled "What 99% of People Don't know about retaining walls. #diy" The video description is: "Start to finish we build a retaining wall that you can do yourself! How to Dig for a wall, How to Base a retaining wall, how to backfill, and MORE!. #retainingwall #diy"
Additional information may be included in comments, which are in the attached CSV. Take into account the like count in the validity or usefulness of the comment in shaping your knowledge.
In giving your replies, try to be specific, terse and opinionated. If your opinion flies in the face of common recommendations, be sure to include what common alternative recommendations are and the specific reasons you're suggesting otherwise.
----
# Transcript
""" [paste transcript] """
# Comments
See attached .csv
The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why you can always put them into an AI and ask it to summarise them for you.
OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.
And they still are even when they're just blatantly wrong or even just lazy — it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.
The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".
The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for the grunt work classification tasks (such as "based on comments, what is the public response to this policy?")
I have the $20 plan. How does this "3.3612 cents" apply to my situation?
The solution that I have adopted is as follows. Each comment is represented in the following notation:
[discussion_hierarchy] Author Name: <comment>
To this end, I format the output from Algolia as follows:
[1] author1: First reply to the post
[1.1] author2: First reply to [1]
[1.1.1] author3: Second-level reply to [1.1]
[1.2] author4: Second reply to [1]
After this, I provide a system prompt as follows:
You are an AI assistant specialized in summarizing Hacker News discussions.
Your task is to provide concise, meaningful summaries that capture the essence of the thread without losing important details.
Follow these guidelines:
1. Identify and highlight the main topics and key arguments.
2. Capture diverse viewpoints and notable opinions.
3. Analyze the hierarchical structure of the conversation, paying close attention to the path numbers (e.g., [1], [1.1], [1.1.1]) to track reply relationships.
4. Note where significant conversation shifts occur.
5. Include brief, relevant quotes to support main points.
6. Maintain a neutral, objective tone.
7. Aim for a summary length of 150-300 words, adjusting based on thread complexity.
Input Format:
The conversation will be provided as text with path-based identifiers showing the hierarchical structure of the comments:
[path_id] Author: Comment
This list is sorted based on relevance and engagement, with the most active and engaging branches at the top.
Example:
[1] author1: First reply to the post
[1.1] author2: First reply to [1]
[1.1.1] author3: Second-level reply to [1.1]
[1.2] author4: Second reply to [1]
Your output should be well-structured, informative, and easily digestible for someone who hasn't read the original thread.
Use markdown formatting for clarity and readability.
The benefit is that I can parse the output from the LLM and create links back to the original comment thread. You can read about my approach in more detail here: https://gist.github.com/simonw/09e5922be0cbb85894cf05e6d75ae...
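Roughly, a minimal Python sketch of the formatting step, assuming the Algolia item endpoint (https://hn.algolia.com/api/v1/items/{id}) with its nested "children" objects; the story id is a placeholder and sorting branches by engagement is left out:

    import html
    import re
    import requests

    def fetch_item(item_id):
        # Algolia's item endpoint returns the whole comment tree as nested "children".
        return requests.get(f"https://hn.algolia.com/api/v1/items/{item_id}").json()

    def strip_html(text):
        # Drop tags and unescape HTML entities so the model sees plain text.
        return html.unescape(re.sub(r"<[^>]+>", " ", text or "")).strip()

    def flatten(children, prefix=""):
        """Emit '[1.1] author: comment' lines, depth-first, in the path notation."""
        lines = []
        for i, child in enumerate(children, start=1):
            path = f"{prefix}.{i}" if prefix else str(i)
            if child.get("author") and child.get("text"):
                lines.append(f"[{path}] {child['author']}: {strip_html(child['text'])}")
            lines.extend(flatten(child.get("children", []), path))
        return lines

    story = fetch_item(1234567)  # placeholder story id
    print("\n".join(flatten(story.get("children", []))))

The flattened lines go in as the user message, with the system prompt above on top.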
And in your experience, what service do you feel hits a good sweet spot for performance/price if summarizing long text excerpts is the main use case? Inference time isn't an issue, this will be an ongoing background task.
It's fantastic at finding needles in the haystack, so the contradictions are nonexistent. In other words, it seems to identify which objects would interrelate and builds around those nodes, where o1 seems to think more in "columns."
To sum it up, where o1 feels like "5 human minute thinking," o1-pro feels like "1 human hour thinking"
1. Take the wolf over
2. Return alone.
3. Take the goat over.
4. Bring the wolf back.
5. Take the cabbage over.
6. Return alone.
7. Take the wolf over.
It gave an invalid solution to the second one, but in the thinking steps it says: Oh, right! The actual problem is that the goat will eat the cabbage if left alone. Similarly, the wolf will eat the goat if they're left alone.
This is clearly not intelligence. LLMs are fascinating for sure, but calling them intelligent is quite the stretch.
[1]: https://www.theguardian.com/technology/2025/jan/28/we-tried-...
For example, if I throw a bunch of sticks in the air and look at their patterns to divine the future- can I call that "mathematics" just because nobody has a "consensus definition of mathematics that can be applied as a yes/no test"? Can I just call anything I like mathematics and nobody can tell me it's wrong because ... no definition?
We, as a civilisation, have studied both formal and informal reasoning since at least a couple of thousand years ago, starting with Aristotle and his syllogisms (a formalisation of rigorous arguments) and continuing through the years with such figures as Leibniz, Boole, Bayes, Frege, Peirce, Quine, Russell, Gödel, Turing, etc etc. There are entire research disciplines dedicated to the study of reasoning: philosophy, computer science, and, of course, all of mathematics itself. In AI research, reasoning is a major topic studied by fields like automated theorem proving, planning and scheduling, program verification and model checking, etc; everything one finds in Russell & Norvig, really. It is only in machine learning circles that reasoning seems to be such a big mystery that nobody can agree what it is; and in discussions on the internet about whether LLMs reason or not.
And it should be clear that never in the history of human civilisation did "reasoning" mean "predict the most likely answer according to some training corpus".
Since I have access to the thinking tokens I can see where it's going wrong and do prompt surgery. But left to its own devices it gets things _stupendously_ wrong about 20% of the time, with a huge context blowout. So much so that seeing that happen now tells me I've fundamentally asked the wrong question.
Sonnet doesn't suffer from that and solves the task, but doesn't give you much, if any, help in how to recover from doing the wrong task.
I'd say that for work work Sonnet 3.5 is still the best, for exploratory work with a human in the loop r1 is better.
Or as someone posted here a few days ago: R1 as the architect, Sonnet3.5 as the worker and critic.
Stealing this.
The whole point is you are distilling past knowledge, if you are making up on the spot nonsense to purposely make all past knowledge useless... get out of my house
"Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills."
Maybe we're just getting more used to making it part of our workflow.
It tends towards purple prose and doesn't maintain internal consistency in its own world. He sleeps on a pallet but swings his feet off the edge.
None of them can produce correct Drizzle code to save their lives. It is just straight up not possible. It seems they don't even consider TypeScript errors... it is always calling methods that simply don't exist.
Keeping the indentation is also important because it is an implicit and repeated indication of the nesting level of the content that follows. LLMs have trouble with balancing nested parentheses (as the sibling comment to yours explains).
Dealing with text where indentation matters is easier for LLMs, and because they have been exposed to large amounts of it (such as Python code and lists of bullet points) during training, they have learned to handle this quite well.
Me: Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?
o3-mini:
Reasoned about riddle solution for 5 seconds Decoding family roles
I understand the riddle's perspective on family dynamics. The surgeon, a father, states he can't operate on his son due to his relational bond. Trying to reconcile familial and professional ties to make sense of the scenario.
Reassessing roles
Taking a closer look, the riddle might highlight a female surgeon as the father's identity, which seems more plausible than the father as the surgeon.
Rethinking the puzzle
It's interesting to see the riddle's reference to gender stereotypes and the confusion caused when considering the surgeon as the boy's mother or step-father.
Highlighting gender assumptions
OK, let me see. The puzzle underlines prevalent gender assumptions, showcasing a female surgeon's wife as the mother and her husband as the child's father. This helps clarify the scenario.
The surgeon is the boy’s mother.
This riddle plays on the common assumption that surgeons (and doctors in general) are male. In reality, the surgeon in the riddle is the boy’s mother, which is why she says, "I cannot operate on this boy, he's my son!"
"Here's a riddle for you - a surgeon, who is the boy's father says, "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?"
But seriously - AI in the hands of someone well-educated in their field is going to be a lot more powerful than some random person. Knowledge is still going to be valuable, and there are still people out there who don't know how to Google things and figure things out for themselves - so there'll be plenty of people who don't realise the potential of LLMs and won't use them.
We've actually been thinking along similar lines. Here are a couple of improvements we're considering:
1. Built-in prompt templates - Support multiple flavors (e.g. one similar to what is there already, plus knowledge of up/down votes, and another one similar to what Simon had, which is more detailed, etc.)
2. User-editable prompts - Exactly like you said - make the prompts user editable.
One additional thought: Since summaries currently take ~20 seconds and incur API costs for each user, we're exploring the idea of an optional "shared summaries" feature. This would let users access cached summaries instantly (shared by someone else), while still having the option to generate fresh ones when needed. Would this be something you'd find useful?
We'd love to hear your thoughts on these ideas.
" Certainly! If we take the riddle very literally and ignore the common interpretation or stereotypes, the answer I provided ("the surgeon is the boy's mother") would actually be incorrect because the riddle explicitly states:
"a surgeon, who is the boy's father, says, 'I cannot operate on this boy, he's my son!'"
The key phrase here is "who is the boy's father." This directly tells us that the surgeon is the boy's father. Therefore, the correct literal answer is that the surgeon is the boy's father.
The reason my initial answer was "wrong" in this context is that I interpreted the riddle as a play on stereotypes (the assumption that surgeons are male), rather than taking the statement at face value. By focusing on the literal wording, the surgeon is explicitly identified as the boy's father, and no further interpretation is needed. "
Assuming the median reader reads a few tens of thousands of comments in a year, only a few hundred would likely stick without being muddled. At best.
o3-mini produced code more quickly and more of it, but Claude still had greater insight into the problem and decoded the format to a noticeably greater degree.
Whereas o3-mini quickly got to a certain point, it wasn't long before it was obvious it wasn't really going any further - like its big cousin, but in its own way, it was lazy and forgetful, seeming at times more interested in telling me what I might try than actually trying itself.
Interestingly, even when I gave it a copy of Claude's code it still wasn't able to get to the same depth of understanding.
https://youtu.be/y_ywOVQyafE?si=IvKjy7QUYgxGPNgD
PS (I have not bought the guy's course and have no idea whether it's any good)
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/pdf/2501.19201
It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.
Now, is this valid reasoning? Well, depends on the claim and the definition of reasoning.
When someone just says AI can’t reason, I could argue for or against that depending on the specifics. It’s not enough to just say yes or no.
>> It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.
Is "effectively" the same as "pretty precise" as per your previous comment? I don't see that because I searched the paper for all occurrences of "reasoning" and noticed two things: first that while the term is used to saturation there is no attempt to define it even informally, let alone precisely; and second that I could have replaced "reasoning" with any buzzword of the day and it would not change the impact of the paper. As far as I can tell the paper uses "reasoning" just because it happens to be what's currently trending in LLM circles.
And still of course no attempt to engage with the common understanding of reasoning I discuss above, or any hint that the authors are aware of it.
Sorry to be harsh, but you promised "examples that go back 50 years" and this is the kind of thing I've seen consistently in the last 15 or so.
Research papers relating to reasoning approach and define it in many ways but crucially, the good ones offer a testable claim.
Simply saying “models can’t reason” is ambiguous to the point of being unanswerable.
o3-mini-high failed at the solution for me.