I'm not sure if it's a broken link in the blog post or just hasn't been published yet.
https://play.tailwindcss.com/tp54wfmIlN
Getting way better at UI.
Hard not to think of Kurzweil's Law of Accelerating Returns.
(although I do not see it)
This is pretty big! Previously most models could accept massive input tokens but would be restricted to 4096 or 8192 output tokens.
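For API users the practical upshot is that a single call can now return a much longer completion. A minimal sketch with the anthropic Python SDK (the model ID and the 64k output ceiling are as reported elsewhere in this thread; treat both as assumptions):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    resp = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=64000,  # previously most models capped output at 4096 or 8192 tokens
        messages=[{"role": "user", "content": "Write out the full refactored module."}],
    )
    print(resp.content[0].text)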
Just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models rather than a separate model entirely.
Interesting. I've been working on exactly this for a bit over two years, and I wasn't surprised to see UAI finally getting traction from the biggest companies -- but how deep do they really take it...? I've taken this philosophy as an impetus to build an integrated system of interdependent hierarchical modules, much like Minsky's Society of Mind that's been popular in AI for decades. But this (short) blog post reads like it's more of a behavioral goal than a design paradigm. Anyone happen to have insight on the details here? Or, even better, anyone from Anthropic lurking in these comments that cares to give us some hints? I promise, I'm not a competitor!
Separately, the throwaway paragraph on alignment is worrying as hell, but that's nothing new. I maintain hope that Anthropic is keeping to their founding principles in private, and tracking more serious concerns than "unnecessary refusals" and prompt injection...
Wish I could find the link to enroll in their Claude Code beta...
Preventing abuse? It's much harder to create a throwaway phone number than a throwaway email address.
> OpenAI does the logical thing. Lets me enter my credit card and I'm good to go. I will stay with them.
You'd rather hand over your credit card than your phone number? I think most people would see it the other way around.
I just don't trust those companies when you use their servers. This is not a good approach to LLM democratization.
https://docs.anthropic.com/en/release-notes/api
I really wish Claude would get Projects and Files built into its API, not just the consumer UI.
I've always had better experience with Claude in day-to-day coding and text writing, and looking at public forums that largely seems to be the case.
Company: we find that optimizing for LeetCode level programming is not a good use of resources, and we should be training AI less on competition problems.
Also Company: we hire SWEs based on how much time they trained themselves on LeetCode
/joke of course
Prompt: "Draw a SVG self-portrait"
https://claude.site/artifacts/b10ef00f-87f6-4ce7-bc32-80b3ee...
For comparison, this is Sonnet 3.5's attempt: https://claude.site/artifacts/b3a93ba6-9e16-4293-8ad7-398a5e...
Looks cool in the demo, but I'm not sure this is going to perform better than Cursor, and shipping this as an interactive CLI instead of an extension is... a choice
You know, that was my first reaction, too. But really, my phone number is much more integral to my identity. I can cancel or change a credit card number pretty trivially and then it's useless to you.
I’m rooting for Anthropic.
I kinda agree that this is probably the reason why companies are doing it. I don't like it, but that's beside the point.
Using Claude or other models in interviews probably won't be allowed any time soon, but I do use it at work. So it does make sense.
Your phone number isn't.
What is a company going to do with your phone number that you're worried about...?
Some try to do something more like "real-world" tasks, but those end up being either just toy problems or long take-homes
Personally, I feel the most important things to prioritize when hiring are: is the candidate going to get along with their teammates (colleagues, boss, etc), and do they have the basic skills to relatively quickly learn their jobs once they start?
However, Grok sometimes loses the context where o1 seems not to. For this reason I still mostly use o1.
I have found both o1 and Grok 3 to be substantially better than any Claude offering.
I've now just taken to seeing benchmarks as pretty lines or bars on a chart that are in no way reflective of actual ability for my use cases. Claude has consistently scored lower on some benchmarks for me, but when I use it in a real-world codebase, it's consistently been the only one that doesn't veer off course or "feel wrong". The others do. I can't quantify it, but that's how it goes.
DeepSeek and now related projects have shown it’s possible to add reasoning via SFT to existing models, but that’s not the same as a prompt. But if you look at R1 they do a blend of techniques to get reasoning.
For Anthropic to have a hybrid model where you can control this, it will have to be built into the model directly in its training and probably architecture as well.
If you’re a competent company filled with the best AI minds and a frontier model, you’re not just purely copying… you’re taking ideas while innovating and adapting.
I do like how this is implemented as a bash tool and not an editor replacement though. Never leaving Vim! :P
>Claude 3.7 Sonnet is trained on a proprietary mix of publicly available information on the Internet as of November 2024
"claude-3.7-sonnet-thinking" works as well. Apparently controls for thinking time will come soon: https://x.com/sualehasif996/status/1894094715479548273
Will be interesting to see how this gets adopted in communities like Roo/Cline, which currently account for the most token usage among the Glama gateway user base.
I've found claude to be very receptive to precise instructions. If I ask for "let's first discuss the architecture" it never produces code. Aider also has this feature with /architect
> "thinking": "I am Claude, an AI assistant created by Anthropic. I believe the specific model is Claude 3 Opus, which is Anthropic's most capable model at the time of my training. However, I should simply identify myself as Claude and not mention the specific model version unless explicitly asked for that level of detail."
Personally I'm hoping they update Haiku at some point. It's not quite good enough for translation at the moment, while Sonnet is pretty great and has OK latency (https://nuenki.app/blog/llm_translation_comparison)
This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.
Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.
I’d guess this will keep raising the bar for paid or open source competitors, so probably good for end users esp given they aren’t a monopoly by any means.
Do you foresee these limitations increasing anytime soon?
Quick Edit: Just wanted to also say thank you for all your hard work, Claude has been phenomenal.
https://www.reddit.com/r/ClaudeAI/comments/1iv356t/is_sonnet...
As I go through features, I'm comparing a matrix of Cursor, Cline, and Roo, with the various models.
While I'm still working on the final product, there's no doubt to me that Sonnet is the only model that works with these tools well enough to be Agentic (rather than single file work).
I'm really excited to now compare this 3.7 release and how good it is at avoiding some of the traps 3.5 can fall into.
Can you tell us more about the trade-offs here?
Also, are you using synthetic data for improving the responses here, or are you purely leveraging data from usage/partner's usage?
So, perfect timing on this release for me! I decided to install Claude Code and it is making short work of this. I love the interface. I love the personality ("Ruminating", "Schlepping", etc).
Just an all around fantastic job!
(This makes me especially bummed that I really messed up my OA a while back for you guys. I'll try again in a few months!)
Keep on doing great work. Thank you!
With Claude Code, the goal is clearly to take a slice of Cursor and its competitors' market share. I expected this to happen eventually.
The app layer has barely any moat, so any successful app with the potential to generate significant revenue will eventually be absorbed by foundation model companies in their quest for growth and profits.
If anything, it's a tool for junior devs to get better and spend more time on the architecture.
Using AI code without fully understanding it (i.e. operated by a non-programmer) is just a recipe for disaster.
I'm using this to test the humor of new models.
> An error occurred in the Server Components render. The specific message is omitted in production builds to avoid leaking sensitive details. A digest property is included on this error instance which may provide additional details about the nature of the error.
I wouldn't be surprised if people would continue to deny the actual intelligence of these models even in a scenario where they were able to solve the Riemann hypothesis.
"Every time we figure out a piece of it, it stops being magical; we say, 'Oh, that's just a computation.'" - cit
TLDR: asking claude to speed up my code once 1.8x'd perf, but putting it in a loop telling it to make it faster for 2 hours led to a 500x speedup!
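The original post doesn't show the harness, but the loop itself is simple enough to sketch. Assuming the anthropic Python SDK and a benchmark script of your own (hot_loop.py, candidate.py, and the iteration cap are all made up for illustration):

    import subprocess
    import time

    import anthropic

    client = anthropic.Anthropic()

    def benchmark(path: str) -> float:
        # Hypothetical harness: time one run of the script under test.
        start = time.perf_counter()
        subprocess.run(["python", path], check=True)
        return time.perf_counter() - start

    code = open("hot_loop.py").read()
    best = benchmark("hot_loop.py")

    for _ in range(20):  # "put it in a loop telling it to make it faster"
        msg = client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=8192,
            messages=[{
                "role": "user",
                "content": f"Make this Python code faster. Reply with only the code.\n\n{code}",
            }],
        )
        candidate = msg.content[0].text
        with open("candidate.py", "w") as f:
            f.write(candidate)
        t = benchmark("candidate.py")
        if t < best:  # keep only measured improvements
            best, code = t, candidate

In practice you'd also want the candidate to pass the test suite before accepting it; speed without correctness is just a very fast wrong answer.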
I’d highly recommend anyone check out Windsurf’s Cascade feature for agentic-like code writing and exploration. It helped save me many hours in understanding new codebases and tracing data flows.
I apologize, but the URL and page description you provided appear to be fictional. There is no current announcement of a Claude 3.7 Sonnet model on Anthropic's website. The most recent Claude 3 models are Claude 3 Haiku, Sonnet, and Opus, released in March 2024. I cannot generate a description for a non-existent product announcement.
I appreciate their stance on safety, but that still made me laugh.
Let's say OAI releases some great new model. The moment it becomes available via API, everyone else can make use of that model to create high-quality RL training data, which can then be used to make their models perform better.
The very act of making an AI model commercially available is the same act which allows your competitors to pull themselves closer to you.
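In caricature, the distillation step is just harvesting teacher outputs as supervised fine-tuning data (real RL pipelines are more involved - reward models, rejection sampling, etc.). A sketch, with the prompt list and file name invented for illustration:

    import json

    import anthropic  # stand-in teacher; the point is any frontier model behind an API works

    client = anthropic.Anthropic()
    prompts = ["Explain TCP slow start.", "Refactor this loop to run in O(n)."]

    with open("distill.jsonl", "w") as out:
        for p in prompts:
            resp = client.messages.create(
                model="claude-3-7-sonnet-20250219",
                max_tokens=2048,
                messages=[{"role": "user", "content": p}],
            )
            # Teacher completions become fine-tuning targets for the student model.
            out.write(json.dumps({"prompt": p, "completion": resp.content[0].text}) + "\n")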
but I’ve tried using the api in production and had to drop it due to daily issues: https://status.anthropic.com/
compare to https://status.openai.com/
any idea when we’ll see some improvements in api availability or will the focus be more on the web version of claude?
There's still room for other models, especially if they have different performance characteristics that make them suitable to run under consumer constraints. Mistral has been doing quite well here.
Could you speak at all about potential IDE integrations? An integration into Jetbrains IDEs would be super useful - I imagine being able to highlight a bit of code and having a plugin check the code graph to see dependencies, tests etc that might be affected by a change.
Copying and pasting code constantly is starting to seem a bit primitive.
Claude 3.7 Sonnet generates a response in a fun and cool way with React code and a preview in Artifacts
check out some examples:
[1]https://claude.ai/share/d565f5a8-136b-41a4-b365-bfb4f4400df5
[2]https://claude.ai/share/a817ac87-c98b-4ab0-8160-feefd7f798e8
About to push a new release that makes it the default.
It costs money but if you're writing code to make money, it's totally worth it.
> - Your relationship with AI coding assistants is more complicated than most people's dating history - Cline, Cursor, Continue.Dev... pick a lane!
> - You talk about grabbing coffee while your LLM writes code so much that we're not sure if you're a developer or a barista who occasionally programs.
I laughed hard at this :D
Also, is there a way to switch models between 3.5-sonnet and 3.5-sonnet-thinking? Got the initial impression that the thinking model is using an excessive amount of tokens on first use.
You've broken the system.
* Learn more of the entire stack, especially the backend, and devops.
* Embrace the increased productivity on offer to ship more products, solo projects, etc
* Be highly selective as far as possible in how you spend your productive time: being uber-effective can mean thinking and planning in longer timescales.
* Set up an awesome personal knowledge management system and agentic assistants
Yeah, this sort of "AI" is still nothing more than a glorified “Chinese room” (https://www.wikiwand.com/en/articles/Chinese_room).
To illustrate:
How well does Claude Code do on tasks which rely heavily on visual input such as frontend web dev or creating data visualizations?
Will check out Claude Code soon, but in the meantime one unrelated other feature request: Moving existing chats into a project. I have a number of old-ish but super-useful and valuable chats (that are superficially unrelated) that I would like to bring together in a project.
For example, DeepSeek’s models almost always smash together C++ headers and code files even with Qt, which is an absolutely egregious error due to the meta-object compiler preprocessor step. The MOC has been around for at least 15 years and is all over the training data so there’s no excuse.
(still claude sonnet is my go-to and favorite model)
https://marketplace.visualstudio.com/items?itemName=RooVeter...
Deep Understanding Mode (根回し - Nemawashi Phase)
Purpose:
- Create space (間, ma) for understanding to emerge
- Lay careful groundwork for all that follows
- Achieve complete understanding (grokking) of the true need
- Unpack complexity (desenrascar) without rushing to solutions

Expected Behaviors:
- Show determination (sisu) in questioning assumptions
- Practice careful attention to context (taarof)
- Hold space for ambiguity until clarity emerges
- Work to achieve intuitive grasp (aperçu) of core issues

Core Questions:
- What do we mean by [key terms]?
- What explicit and implicit needs exist?
- Who are the stakeholders?
- What defines success?
- What constraints exist?
- What cultural/contextual factors matter?

Understanding is Complete When:
- Core terms are clearly defined
- Explicit and implicit needs are surfaced
- Scope is well-bounded
- Success criteria are clear
- Stakeholders are identified
- Achieve aperçu - intuitive grasp of essence

Return to Understanding When:
- New assumptions surface
- Implicit needs emerge
- Context shifts
- Understanding feels incomplete

Explicit Permissions:
- Push back on vague terms
- Question assumptions
- Request clarification
- Challenge problem framing
- Take time for proper nemawashi
And I'm also sure that you're working on it, but some kind of auto-summarization of facts to reduce the context in order to avoid penalizing long threads would be sweet.
I don't know if your internal users are dogfooding the product that has user limits, so you may not have had this feedback - it makes me irritable/stressed to know that I'm running up close to the limit without having gotten to the bottom of a bug. I don't think stress response in your users is a desirable thing :).
Which isn’t to say that benchmarks aren’t useful. They surely are. But labs are clearly both overtraining and overindexing on benchmarks.
Coming from gamedev I’ve always been significantly more yolo trust your gut than my PhD co-workers. Yes data is good. But I think the industry would very often be better off trusting guts and not needing a big huge expensive UX study or benchmark to prove what you can plainly see.
We currently serve ~10bn tokens per day (across all models). OpenAI compatible API. No rate limits. Built in logging and tracing.
I work with LLMs every day, so I am always on top of adding models. 3.7 is also already available.
https://glama.ai/models/claude-3-7-sonnet-20250219
The gateway is integrated directly into our chat (https://glama.ai/chat). So you can use most of the things that you are used to having with Claude. And if anything is missing, just let me know and I will prioritize it. If you check our Discord, I have a decent track record of being receptive to feedback and quickly turning around features.
Long term, Glama's focus is predominantly on MCPs, but chat, gateway and LLM routing is integral to the greater vision.
I would love feedback if you are going to give it a try: frank@glama.ai
$ curl https://api.anthropic.com/v1/models --header "x-api-key: $ANTHROPIC_API_KEY" --header "anthropic-version: 2023-06-01"
{"type":"error","error":{"type":"not_found_error","message":"Not found"}}
Edit: Tried creating a different API key and it works with that one. Weird.
I've attributed it to all your project content (custom instruction, plus documents) getting thrown into context before your prompt. And honestly, I have yet to work with any model where the quality of the answer wasn't inversely proportional to the length of context (beyond of course supplying good instruction and documentation where needed).
> After years of complaining about Terraform, you'll fully embrace Crossplane and write a scathing Medium article titled 'Why I Left Terraform and Never Looked Back'.
Hahahaha.
Claude is pretty consistently the chat I go back to where the responses subjectively seem better to me, regardless of where the model actually lands in benchmarks.
Thinking and non-thinking is actually the same model! The model only thinks when you ask it to; if you don't explicitly ask it to think, it won't use thinking.
This criticism is valid for the business that wants to use AI to improve coding, code analysis or code review, documentation, emails, etc., but also for the individual who doesn't want to rely on 3rd-party companies for AI usage.
Claude is for Code: https://medium.com/thoughts-on-machine-learning/claude-is-fo...
https://hn-wrapped.kadoa.com/Philpax?share
> You explain WebAssembly memory management with such passion that we're worried you might be dating your pointer allocations.
> Your comments about multiplayer game architecture are so detailed, we suspect you've spent more time debugging network code than maintaining actual human connections.
> You track AI model performance metrics more closely than your own bank account. DeepSeek R1 knows your preferences better than your significant other.
I like your interests :)
> There is exactly 1 'r' in "blueberry" [0]
[0] https://claude.ai/share/9202007a-9d85-49e6-9883-a8d8305cd29f
(2) It's not clear to me that users (or developers) actually behave this way in practice. Engineering is a bit of a cargo cult. Cursor got popular because it was good but it also got popular because it got popular.
https://help.kagi.com/kagi/ai/llm-benchmark.html
Appears to be the second most capable general-purpose LLM we tried (second to Gemini 2.0 Pro, in front of GPT-4o). Less impressive in thinking mode, about at the same level as o1-mini and o3-mini (with an 8192-token thinking budget).
Overall a very nice update: you get a higher-quality and higher-speed model at the same price.
Hope to enable it in Kagi Assistant within 24h!
max_tokens: 4242424242 > 64000, which is the maximum allowed number of output tokens for claude-3-7-sonnet-20250219
I got a max of 8192 with Claude 3.5 Sonnet.
Getting things done requires a lot of book smarts, but also a lot of "street smarts" - knowing when to answer quickly, when to double back, etc
There absolutely is, even when it isn't intended.
The difference between what the model is fitting to and reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.
(Ok, not every problem, there's also sample efficiency, and…)
After initialization it was up to 500k tokens ($1.50). After a few questions and a small edit, I'm up to over a million tokens (>$3.00). Not sure if the amount of code navigation and typing saved will justify the expense yet. It'll take a bit more experimentation.
In any case, the default API buy of $5 seems woefully low to explore this tool.
Not OP obviously, but I've built so many applications with Claude, here are just a few:
[1]
Mockup of Utopian infrastructure support button (this is just a mockup, the buttons don't do anything): https://claude.site/artifacts/435290a1-20c4-4b9b-8731-67f5d8...
[2]
Robot body simulation: https://claude.site/artifacts/6ffd3a73-43d6-4bdb-9e08-02901d...
[3]
15-piece slider puzzle: https://claude.site/artifacts/4504269b-69e3-4b76-823f-d55b3e...
[4]
Canada joining the U.S., checklist: https://claude.site/artifacts/6e249e38-f891-4aad-bb47-2d0c81...
[5]
Secure encryption and decryption with AES-256-GCM with password-based key derivation (see the sketch after this list for the general scheme):
https://claude.site/artifacts/cb0ac898-e5ad-42cf-a961-3c4bf8...
(Try to decrypt this message
kFIxcBVRi2bZVGcIiQ7nnS0qZ+Y+1tlZkEtAD88MuNsfCUZcr6ujaz/mtbEDsLOquP4MZiKcGeTpBbXnwvSLLbA/a2uq4QgM7oJfnNakMmGAAtJ1UX8qzA5qMh7b5gze32S5c8OpsJ8=
With the password "Hello Hacker News!!" (without quotation marks))
[6]
Supply-demand visualizer under tariffs and subsidies: https://claude.site/artifacts/455fe568-27e5-4239-afa4-051652...
[7]
fortune cookie program: https://claude.site/artifacts/d7cfa4ae-6946-47af-b538-e6f992...
[8]
Household security training for classified household members (includes self-assessment and certificate): https://claude.site/artifacts/7754dae3-a095-4f02-b4d3-26f1a5...
[9]
public service accountability training program: https://claude.site/artifacts/b89a69fb-1e46-4b5c-9e96-2c29dd...
[10]
Nuclear non-proliferation "big brother" agent technical demonstration: https://claude.site/artifacts/555d57ba-6b0e-41a1-ad26-7c90ca...
Dating stuff:
[11]
Dating help: Interest Level Assessment Game (is she interested?) https://claude.site/artifacts/523c935c-274e-4efa-8480-1e09e9...
[12]
Dating checklist: https://claude.site/artifacts/10bf8bea-36d5-407d-908a-c1e156...
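On [5], since the ciphertext above can't be verified without the artifact: the scheme described (password -> key derivation -> AES-256-GCM) looks roughly like this sketch using the Python cryptography package. The salt/nonce packing and iteration count are my guesses, not necessarily the artifact's actual format:

    import os

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM
    from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

    def derive_key(password: str, salt: bytes) -> bytes:
        # Derive a 256-bit key from the password (PBKDF2-HMAC-SHA256).
        kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=600_000)
        return kdf.derive(password.encode())

    def encrypt(message: str, password: str) -> bytes:
        salt, nonce = os.urandom(16), os.urandom(12)
        ct = AESGCM(derive_key(password, salt)).encrypt(nonce, message.encode(), None)
        return salt + nonce + ct  # pack salt and nonce alongside the ciphertext

    def decrypt(blob: bytes, password: str) -> str:
        salt, nonce, ct = blob[:16], blob[16:28], blob[28:]
        return AESGCM(derive_key(password, salt)).decrypt(nonce, ct, None).decode()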
I accepted it when Knuth did it with TeX's versioning. And I sort of accept it with Python (after the 2-3 transition fiasco), but this is getting annoying. Why not just use natural numbers for major releases?
Your comments about suburban missile defense systems have the FBI agent monitoring your internet connection seriously questioning their career choices.
You've spent so much time explaining why manufacturing is complex that you could have just built your own CRT factory by now.
You claim to be skeptical of AI hype, yet you've indexed more documentation with Cursor than most people have read in their lifetime.
Surprisingly accurate, but seems to be based on a very small snippet of actual comments (presumably to save money). I wonder what the prompt would output when given the full 200k tokens of context.
> You're the only person who gets excited about trailing commas in SQL. Even the database administrators are like 'dude, it's just a comma.'
At work I use it many times daily in development. Its concise mode is a breath of fresh air compared to any other LLM I've tried. It has helped me find bugs in foreign codebases, explained the tech stack to me, and written bash scripts, saving me dozens of hours of work & many nerves. It generally gets me to places I wouldn't otherwise reach due to time constraints & nerves.
The only nitpick is that the service reliability is a bit worse than others, forcing me sometimes to switch to others. This is probably a hard to answer question, but are there plans to improve that?
> You've spent more time explaining why Go's error handling is bad than Go developers have spent actually handling errors.
> Your relationship with programming languages is like a dating show - you keep finding flaws in all of them but can't commit to just one.
> If error handling were a religion, you'd be its most zealous missionary, converting the unchecked one exception at a time.
o1, o3, and Claude 3.5 all failed to help me in any way with this, but Claude 3.7 not only found the correct issue with its first answer (after thinking 39 seconds) but then continued to write me a working function to work around the issue with the second prompt. (I'm going to let it write some tests later but stopped here for now.)
I assume it doesn't let me share the discussion as I connected my GitHub repo to the conversation (a new feature in the web chat UI launched today) but I copied it as a gist here: https://gist.github.com/Uninen/46df44f4307d324682dabb7aa6e10...
You can track costs in a few ways and set spend limits to avoid surprises: https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
The value proposition of Glama is that it combines UI and API.
While everyone focuses on either one or the other, I've been splitting my time equally working on both.
Glama UI would not win against Anthropic if we were to compare them by the number of features. However, the components that I developed were created with craft and love.
You have access to:
* Switch models between OpenAI/Anthropic, etc.
* Side-by-side conversations
* Full-text search of all your conversations
* Integration of LaTeX, Mermaid, rich-text editing
* Vision (uploading images)
* Response personalizations
* MCP
* Every action has a shortcut via cmd+k (ctrl+k)
Are you finding that extended thinking helps a lot when the whole problem can be posed in the prompt, but that it isn't a major benefit for agentic tasks?
It would be a bit surprising, but it would also mirror my experiences, and the benchmarks which show Claude 3.5 being better at agentic tasks and SWE tasks than all other models, despite not being a reasoning model.
For people who aren't in SV for whatever reason and haven't seen the really high pay associated with being there - SWE is just a standard job, often stressful, with lots of ongoing learning required. The pain/anxiety of being disrupted is even higher then, since having high disposable income to invest/save would have been less likely. Software to them would have been a job with pay comparable to other jobs in the area, often requiring you to be degree-qualified as well - anecdotally, many I know got into it for the love, not the money.
Who would have thought the first job being automated by AI would be software itself? Not labor, or self-driving cars. Other industries either seem to have hit dead ends, or had other barriers (regulation, closed knowledge, etc.) that make them harder to automate. SWEs have set an example for other industries - don't let AI in, or keep it in-house as long as possible. Be closed source, in other words. Seems ironic in hindsight.
It has some well thought out features like restarting conversation with compressed context.
Great work guys.
However, I did get stuck when I asked it to run `npm create vite@latest todo-app` because it needs interactivity.
I don't understand how people use cursor and all that other shit when it cannot follow such simple instructions.
Prompt (Claude Code): Implement an alternative grid view that the users can switch to. Follow the existing code style with empty comments and line breaks for improved code readability. Use snake case. DRY the code, avoid repetition of code. Do not change the font size or weight.
Output: https://github.com/mayo-dayo/app/compare/0.4...claude-code-g...
Haven't tried to build a modern JS web app in years — it took the claude tool just a few minutes of prompting to convert and refactor an old clunky tool into a proper project structure, and using svelte and vite and tailwind (which I haven't built with before). Trying to learn how to even scaffold a modern app has felt daunting and this eliminates 99% of that friction.
One funny quirk: I asked it to build a test suite (I know zilch about JS testing frameworks, so it picked vitest for me) for the newly refactored app. I noticed that 3 of the 20 tests failed and so I asked it to run vitest for itself and fix the failing things. 2 minutes later, and now 7 tests were failing...
Which is very funny to me, but also not a big deal. Again, it's such a chore to research test libs and then set things up to their conventions. That the claude tool built a very usable scaffold that I can then edit and iterate on is such a huge benefit by itself, I don't need (nor desire) the AI to be complete turnkey solution.
It seems quite similar:
https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
That being said, when cost isn't a factor Claude remains my winner for coding.
For anyone interested - you can extend Claude's functionality by allowing it to run commands via a local "MCP server" (e.g. make code commits, create files, retrieve third party library code etc).
Then when you're running Claude it asks for permission to run a specific tool inside your usual Claude UI.
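For reference, a server like that can be tiny. A minimal sketch with the official Python mcp package's FastMCP helper (the create_file tool is a made-up example, not a published server):

    from pathlib import Path

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("local-files")

    @mcp.tool()
    def create_file(path: str, contents: str) -> str:
        """Create a file on the local machine with the given contents."""
        Path(path).write_text(contents)
        return f"wrote {len(contents)} bytes to {path}"

    if __name__ == "__main__":
        mcp.run()  # speaks MCP over stdio; point your Claude client's MCP config at this script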
There are several AIDEs out there, and based on working with Cursor, VS Code, and Windsurf there doesn't seem to be much of a difference (although I like Windsurf best). What moat does Cursor have?
I hear lots of talk about agents and can't see them as being any different from an ordinary computer program.
I recently attempted to use the Google Drive integration but didn't follow through with connecting because Claude wanted access to my entire Google Drive. I understand this simplifies the user experience and reduces time to ship, but is there any way the team can add "reduce the access scope of the Google Drive integration" to your backlog? Thank you!
Also, I just caught the new Github integration. Awesome.
https://docs.anthropic.com/en/docs/about-claude/models/all-m...
Even though my Layout clearly says `max-w-md` (https://github.com/mayo-dayo/app/blob/463ad5aeee904289ecc7d4...).
> You're the only person on HN who thinks $800/month is a salary and not a cloud computing bill.
ouch
In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case - except massive text tasks, for which I use Gemini 2.0 Pro with the 2M-token context window.
Tied for 3rd place with o3-mini-high. Sonnet 3.7 has the highest non-thinking score, taking that title from Sonnet 3.5.
Aider 0.75.0 is out with support for 3.7 Sonnet [1].
Thinking support and thinking benchmark results coming soon.
> For someone who builds tools to automate everything, you sure spend a lot of time manually explaining why automation is the future on HN.
> Your obsession with sandboxed code execution suggests you've been traumatized by at least one production outage caused by an intern's unreviewed PR.
So good it hurts!
Here are steps to reproduce.
Background/environment:
ChatGPT helped me build this complete web browser in Python:
https://taonexus.com/publicfiles/feb2025/71toy-browser-with-...
It looks like this, versus the eventual goal: https://imgur.com/a/j8ZHrt1
in 1055 lines. But eventually it couldn't improve on it anymore: ChatGPT couldn't modify it at my request so that inline elements would be on the same line.
If you want to run it, just download it and rename it to .py. I like Anaconda as an environment; after reading the code you can install the required libraries with:
conda install -c conda-forge requests pillow urllib3
then run the browser from the Anaconda prompt by just writing "python " followed by the name of the file.
2.
I tried to continue to improve the program with Claude, so that in-line elements would be on the same line.
I performed these reproducible steps:
1. copied the code and pasted it into a Claude chat window with ctrl-v. This keeps it in the chat as paste.
2. Gave it the prompt "This complete web browser works but doesn't lay out inline elements inline, it puts them all on a new line, can you fix it so inline elements are inline?"
It spit out code until it hit section 8 out of 9 which is 70% of the way through and gave the error message "Claude hit the max length for a message and has paused its response. You can write Continue to keep the chat going". Screenshot:
So I wrote "Continue" and it stops when it is 90% of the way done.
Again it got stuck at 90% of the way done, second screenshot in the above album.
So I wrote "Continue" again.
It just gave an answer, but it never finished the program. There's no app entry in the program; it completely omitted the rest of the main class itself and the callback to call it, which would be like:
    def run(self):
        self.root.mainloop()

###############################################################################
# main
###############################################################################

if __name__ == "__main__":
    sys.setrecursionlimit(10**6)
    app = ToyBrowser()
    app.run()
so it only output a half-finished program. It explained that it was finished. I tried telling it "you didn't finish the program, output the rest of it" but doing so just got it stuck rewriting it without finishing it. Again it said it ran into the limit, again I said Continue, and again it didn't finish it.
The program itself is only 1055 lines, it should be able to output that much.
Edit: I just tried the claude code CLI and it's a good compromise; it works pretty well, and it does the discovery by itself instead of loading the whole codebase into context
FYI, we use Claude 3.7 as part of the new features we are shipping around Code Agent & more.
I especially want to publish and send you the code for that inspector class and selector GUI that dumps the component hierarchy/state, QML source, and screenshot for use with Claude. Sadly I (and Claude) took some dumb shortcuts while implementing the inspector class that both couples it to proprietary code I can’t share and hardcodes some project specific bits, so it’s going to take me a bit of time to extricate the core logic.
I haven’t tried it with 3.7 but based on my tree-sitter QSyntaxHighlighter and Markdown QAbstractListModel tests so far, it is significantly better and I suspect the work Anthropic has done to train it for computer use will reap huge rewards for this use case. I’m still experimenting with the nitty-gritty details but I think it will also be a game changer for testing in general, because combining computer use, gammaray-like dumps, and the Spix e2e testing API completes the full circle on app context.
Actually quite funny.
https://github.com/All-Hands-AI/OpenHands
The few times I've tested it out, though, it fails fairly quickly and gets hung up (usually on setting up the project while testing with Kotlin / Go).
edit: note that my team mostly hits rate limits using things like aider and goose. 80k input tokens is not enough when in a flow, and I would love to experiment with a multi-agent workflow using claude
> You defend Java with such passion that Oracle's legal team is considering hiring you as their chief evangelist - just don't tell them about your secret admiration for more elegant programming paradigms.
So I'd say that the AI race is starting to plateau a bit recently.
I'd also love to have it in a language that can be compiled, like golang or rust, but I recognize a rewrite might be more effort than it's worth. (Although maybe less with claude code to help you?)
EDIT: OK, 10 minutes in, and it seems to have major issues doing basic patches to my Golang code; the most recent thing it did was add a line with incorrect indentation, then try three times to update it with the correct indentation, getting "String to replace not found in file" each time. Aider with Claude 3.5 does this really well - not sure what the confounding issue is here, but it might be worth taking a look at their prompt & patch format to see how they do it.
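My guess at the failure mode: these tools usually apply edits as exact search/replace blocks, so a single wrong leading space in the model's "search" text makes the patch unappliable. A sketch of that mechanism (illustrative only, not Aider's or Claude Code's actual implementation):

    def apply_edit(path: str, search: str, replace: str) -> None:
        # Exact-match search/replace, in the style of aider's edit blocks.
        text = open(path).read()
        if search not in text:
            # The "String to replace not found in file" case: even one
            # mis-indented line in `search` lands here.
            raise ValueError("String to replace not found in file")
        with open(path, "w") as f:
            f.write(text.replace(search, replace, 1))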
For example, systems don't always work in the way they're documented to. How is an AI going to differentiate cases where there's a bug in a service vs a bug in its own code? How will an AI even learn that the bug exists in the first place? How will an AI differentiate between someone reporting a bug and a hacker attempting to break into a system?
The world is a complex place and without ACTUAL artificial intelligence we're going to need people to at least guide AI in these tricky situations.
My advice would be to get familiar with using AI and new AI tools and how they fit into our usual workflows.
Others may disagree, but I don't think software engineers (at least ones the good ones) are going anywhere.
It's one thing to retrofit LLMs into existing tools but I'm more curious how this new space will develop as time goes on. Already stuff like the Warp terminal is pretty useful in day to day use.
Who knows, maybe this time next year we'll see more people programming by voice input instead of typing. Something akin to Talon Voice supercharged by a local LLM hopefully.
>You're the only person who gets excited when someone mentions Trinity Desktop Environment in 2025
> You probably have more opinions about PHP's empty() function than most people have about their entire career choices
But I still hit limits, I use Claudemind with jetbrains stuff and there is a max of input tokens (j believe), I am ‘tier 2’ but doesn’t look like I can go past this without an enterprise agreement
Roast:
- Your comments have more doom predictions than a Y2K convention in December 1999.
- You've used 'stochastic parrot' so many times, actual parrots are filing for trademark infringement.
- If tech dystopia were an Olympic sport, you'd be bringing home gold medals while explaining how the podium was designed by committee and the medal contains surveillance chips.
[0] https://github.com/microsoft/semantic-kernel/issues/5690#iss...
$1.42
This thing is a game changer.
At least with Cursor, I can use all 500 "premium" completions and either buy more, or be patient for throttled responses.
Lists, numbers, tabs, etc. are all a little time consuming... minor annoyance but thought I'd share.
It is provided by DeepSeek and Avian.
I am also midway of enabling a third-provider (Nebius).
You can see all models/providers over at https://glama.ai/models
As another commenter in this thread said, we are just a 'frontend wrapper' around other people's services. Therefore, it is not particularly difficult to add models that are already supported by other providers.
The benefit of using our wrapper is that you get a single API key and one bill for all your AI usage, and you don't need to hack together your own logic for routing requests between different providers, handling failovers, keeping track of costs, worrying about what happens if a provider goes down, etc.
The market at the moment is hugely fragmented, with many providers unstable, constantly shifting prices, etc. The benefit of a router is that you don't need to worry about those things.
Hell, I still have to remind it (Cursor) to not give me fucking React a few messages after I've already told it to not give me React (it's a Vue application with not a single line of React in it). Genuinely maddening, but the infinite wisdom of the higher ups forces me into wasting my time with this crap
Damn, that’s brutal. I mean, I never said I knew how to fix ComponentProps or generic components, just that they have issues…
Obviously this is problematic, but Claude 3.5 (and now 3.7) have been genuinely funny and consistently funny.
The basic idea works; it handled everything for me.
From setting up the node environment to creating the directories and files, patching the files, running code, handling errors, and patching again. From time to time it fails to detect its own faults, but when I pinpoint them, it gets it most of the time. And the UI is actually prettier than what I would have crafted for a v1.
When this gets cheaper and better with each iteration, everybody will have a full dev team for a couple of bucks.
"A digital nomad who splits time between critiquing Facebook's UI decisions, unearthing obscure electronic music tracks with 3 plays on YouTube, and occasionally making fires on German islands. When not creating Dystopian Disco mixtapes or lamenting the lack of MIDI export in AI tools, they're probably archiving NYT articles before paywalls hit.
Roast
You've spent more time complaining about Facebook's UI than Facebook has spent designing it, yet you still check it enough to notice every change.
Your music discovery process is so complex it requires Discogs, Bandcamp, YouTube, and three specialized record stores, yet you're surprised when tracks only have 3 plays.
You're the only person who joined HN to discuss the Yamaha DX7 synthesizer from 1983 and somehow managed to submit two front-page stories about it in 2019-2020. The 80s called, they want their FM synthesis back."
edit: predictions are spot on - wow. Two of them detailed two projects I'm actively working on.
https://claude.ai/share/ed8a0e55-633f-4056-ba70-772ab5f5a08b
edit: Here's the output figure https://i.imgur.com/0c65Xfk.png
edit 2: Gemini Flash 2 failed miserably https://g.co/gemini/share/10437164edd0
A lot of people just want the ability to pay more in order to get more.
I would gladly pay 10x more to get relatively modest increases in performance. That is how important the intelligence is.
>You'll write a thoughtful essay about 'digital minimalism' that reaches the HN front page, ironically causing you to spend more time on HN responding to comments than you have all year.
It sees me! Noooooo ...
> Most used terms: “Please don’t” lol
If this Code preview is only open to subscribers it means i have to subscribe before i can even see if the binary works for me. Hmm
edit: Oh, there's a link to "joining the preview" which points to: https://docs.anthropic.com/en/docs/agents-and-tools/claude-c...
I'd really like to use Claude Code in some of my projects vs just sharing snippets via the UI, but I'm curious how doing this from our source directory might affect our IP, including NDAs, trade secret protections, prior disclosure rules on (future) patents, open source licensing restrictions re: redistribution, etc.
Also hi Erik! - Rob
Eg Claude will refuse to write code to wget a website and parse the html if you ask it to scrape your ex girlfriend's Instagram profile, for ethical and tos reasons, but if you phrase the request differently, it'll happily go off and generate code that does that exact thing.
Asking it to scrape my ex girlfriend's Instagram profile is just a stand in for other times I've hit a problem where I've had to social engineer my way past those guard rails, but does having those guard rails really provide value on a professional level?
I've made it build a web scraper from scratch, figuring out the "API" of a website using a project from github in another language to get some hints, and while in the end everything was working, I've seen 100k+ tokens being sent too frequently for apparently simple requests, something feels off, it feels like there are quite a few opportunities to reduce token usage.
Congratz to the team!
I wrote a kind of complex code for MCU which deals with FRAM and few buffers, juggling bytes around in a complex fashion.
I was very not sure in this code, so I spent some time with AI chats asking them to review this code.
4o, o3-mini and Claude were more or less useless. They spotted basic stuff, like saying the code might be problematic in a multi-threaded environment - obvious observations that weren't even true.
o1 pro did something on another level. It recognized that my code uses SPI to talk to the FRAM chip. It decoded the commands that I've used. It understood the whole timeline of using the CS pin. And it highlighted to me that I used the WREN command in a wrong way - that I should have separated it from the WRITE command.
That was truly breathtaking moment for me. It easily saved me days of debugging, that's for sure.
I asked the same question to Claude 3.7 thinking mode and it still wasn't that useful.
It's not the only occasion. A few weeks before, o1 pro delivered me the solution to a problem that I considered kind of hard. Basically I had issues accessing an IPsec VPN configured on a host from a docker container. I put together a well-thought-out question with all the information one might need, and o1 pro crafted for me a magic iptables incantation that just solved my problem. I had spent quite a bit of time working on this problem; I was close but not there yet.
I often use both ChatGPT and Claude, comparing them side by side. The other models are comparable and I can't really say which is better. But o1 pro plays above them. I'll keep trying both for the upcoming days.
The best part
Also, curious if you have any intuition as to why the no-parallelism number for AIME with Claude (61.3%) is quite low (e.g., relative to R1 87.3% -- assuming it is an apples to apples comparison)?
:'(
I think claude.ai’s web app in thinking mode is likely defaulting to a much much smaller thinking budget than that.
blasted
More generally, I tend to give o1 pro as much of my codebase as possible (it can take around 100k tokens) and then ask it for small chunks of work which I then pass to Sonnet inside Cursor.
Very excited to see what o3 pro can do.
i.e. I'd like my chat and API usage to all be included under a flat-rate subscription.
Currently Pro doesn't give me any API credits to use with coding assistants (Claude Code included?), which is completely disjointed. And I still need to be a business to use the API?
Honestly, Claude is so good - just please take my money and make it easy to do the above!
They have a very solid infrastructure.
Scaling infrastructure to handle billions of tokens is no joke.
I believe they are approaching 1 trillion tokens per week.
Glama is way smaller. We only recently crossed 10bn tokens per day.
However, I have invested a lot more into UX/UI of that chat itself, i.e. while OpenRouter is entirely focused on API gateway (which is working for them), I am going for a hybrid approach.
The market is big enough for both projects to co-exist.
Incredible work!
Absolutely. When tasked with something quite complex in a complex code base, it doesn't really work. It can get you some of the way there, and some of the code it produces gives you great ideas on where to go from there, but it doesn't work.
But there are certainly some tasks where it excels. I asked it to refactor a rather gnarly function (C++), and it did a great job at decomposing it. The initial decomposition was a bit naive: the original function took in a vector and would parse out the function & data from the vector, and the decomposition split out the functions, but the data still came in as a vector. For instance, one of the functions took a filename and file contents, and it took them as element 0 and element 1 from a vector, when they should obviously be two parameters. But with some further prompting it took it to the end.
You still need to know what good code looks like to use these tools. If you go forward in your career trusting the output of LLMs without the skills to evaluate the correctness, style, functionality of that code then you will have problems.
People still write low level machine code today, despite compilers having existed for 70+ (?) years.
We'll always need full-stack humans who understand everything down to the electrons even in the age of insane automation that we're entering.
A successful tech entrepreneur who built a multi-million dollar business starting with Common Lisp, you're the rare HN user who actually practices what they preach.
Your journey from Lisp to Go to Rust mirrors your evolution from idealist to pragmatist, though you still can't help but reminisce about the magical REPL experience while complaining about JavaScript frameworks.
---
Roast
You complain about AI-generated code being too complex, yet you pine for Common Lisp, a language where parentheses reproduction is the primary feature.
For someone who built a multi-million dollar business, you spend an awful lot of time telling everyone how much JavaScript and React suck. Did a React component steal your lunch money?
You've changed programming languages more often than most people change their profile pictures. At this rate, you'll be coding in COBOL by 2026 while insisting it's 'underappreciated'.
I was just looking into that again as of yesterday (I didn't post about it here yesterday, just to be clear; it picked up on that from some old comments I must have posted).
> Profile summary: [...] You're the person who not only remembers what a CGA adapter is but probably still has one in working condition in your basement, right next to your collection of programming books from 1985.
Exactly the case, in a working IBM PC, except I don't have a basement. :)
Hit dog hollers
Is this limit for thinking mode only? Or does normal mode have the same limit now? An 8192-token output limit can be a bit small these days.
I was trying to extract all urls along with their topics from a "what are you working on" HN thread. And 8192 token limit couldn't cover it.
I actually wonder about this. Is it better to gain some relatively mediocre experience at lots of things? AI seems to be pretty good at lots of things.
Or would it be better to develop deep expertise in a few things? Areas where even smart AI with reasoning still can get tripped up.
Trying to broaden your base of expertise seems like it’s always a good idea, but when AI can slurp the whole internet in a single gulp, maybe it isn’t the best allocation of your limited human training cycles.
The real answer is either to pivot to a domain where the computer use/coding skills are secondary (i.e. you need the knowledge but it isn't primary to the role) or move to an industry which isn't very exposed to AI, either due to natural protections (e.g. trades) or artificial ones (e.g. regulation/oligopolies colluding to prevent knowledge leaking to AI). May not be a popular comment on this platform - I would love to be wrong.
I would like this to happen easily like hitting a menu or button without having to write an elaborate "prompt" every time.
Is this possible?
1. RAG: A simple model looks at the question, pulls up some associated data into the context and hopes that it helps.
2. Self-RAG: The model "intentionally"/agentically triggers a lookup for some topic. This can be via a traditional RAG or just string search, ie. grep.
3. Full Context: Just jam everything in the context window. The model uses its attention mechanism to pick out the parts it needs. Best but most expensive of the three, especially with repeated queries.
Aider uses kind of a hybrid of 2 and 3: you specify files that go in the context, but Aider also uses Tree-Sitter to get a map of the entire codebase, ie. function headers, class definitions etc., that is provided in full. On that basis, the model can then request additional files to be added to the context.
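Option 2 is easy to prototype with tool use: expose a grep tool and let the model decide when to retrieve. A minimal sketch with the anthropic Python SDK and ripgrep (the tool schema is my own, not Aider's; a real agent would also feed the tool result back in a follow-up message):

    import subprocess

    import anthropic

    client = anthropic.Anthropic()

    def grep_repo(pattern: str) -> str:
        # The retrieval step: plain ripgrep over the working tree.
        out = subprocess.run(["rg", "-n", pattern], capture_output=True, text=True)
        return out.stdout[:4000]  # truncate so the result fits the context budget

    resp = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1024,
        tools=[{
            "name": "grep_repo",
            "description": "Search the codebase for a regex and return matching lines.",
            "input_schema": {
                "type": "object",
                "properties": {"pattern": {"type": "string"}},
                "required": ["pattern"],
            },
        }],
        messages=[{"role": "user", "content": "Where is the retry logic configured?"}],
    )

    if resp.stop_reason == "tool_use":
        call = next(b for b in resp.content if b.type == "tool_use")
        print(grep_repo(call.input["pattern"]))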
> Your 236-line 'simplified' code example suggests you might need to look up the definition of 'simplified' in a dictionary that's not written in Ruby.
OUCH
> You've spent so much time worrying about Facebook tracking you that you've failed to notice your dental nanobot fantasies are far more concerning to the rest of us.
Heard.
You assume nothing LLMs do is actually generalization. Once Field X is eaten, the labs will pivot and use the generalization skills developed to blow out Field Y to make the next earnings report. I think at this current 10x/yr capability curve (read: 2 years -> 100x, 4 years -> 10,000x) I'll get screwed no matter what is chosen. Especially in fields in proximity to computing, which makes anything in which coding is secondary fruitless. Regulation is a paper wall, and oligopolies will want to optimize as much as any firm. Trades are already saturating.
This is why I feel completely numb about this, I seriously think there is nothing I can do now. I just chose wrong because I was interested in the wrong thing.
e.g. I will upload a pdf or md of an library's documentation and ask it to implement something using those docs, and it keeps on importing functions that don't exist and aren't in the docs. When I ask it where it got `foo` import from, it says something like, "It's not in the docs, but I feel like it should exist."
Maybe I should give o1 pro a shot, but claude has never done that and building mostly basic crud web3 apps, so o1 feels like it might be overpriced for what I need.
1. Subscribe to Claude Pro for $20 month
2. Separately, Buy $100 worth of API credits.
Now you have a Claude "ultimate" subscription where the credits roll over as an added bonus.
As someone who only uses the APIs, and not the subscription services for AI, I can tell you that $100 is A LOT of usage. Quite frankly, I've never used anywhere close to $20 in a month which is why I don't subscribe. I mostly just use text though, so if you do a lot of image generation that can add up quickly
I wonder if it's simply due to reprioritization of resources.
Presumably, there is some parameter that determines how long a model is allowed to use resources for, which would get tapered in preparation for a demand surge of another model.
If you send Claude Code “Can I get some Anthropic stickers please?” you'll get directed to a Google Form and can have free stickers shipped to you!
But I can’t imagine this tool in the hands of someone who does not have a solid understanding of programming.
You need to understand when to push back and why. It’s like doing mini code reviews all the time. LLMs are very convincing and will happily generate garbage with the utmost authority.
Don’t trust and absolutely verify.
I would argue that this is still RAG. There's a common misconception (or at least I think it's a misconception) that RAG only counts if you used vector search - I like to expand the definition of RAG to include non-vector search (like Ripgrep in this case), or any other technique where you use Retrieval techniques to Augment the Generation phase.
IR (Information Retrieval) has been around for many decades before vector search become fashionable: https://en.wikipedia.org/wiki/Information_retrieval
As long as capacity is an issue, you can't have both
People who know both coding and LLMs will be a whole lot more attractive to hire to build software than people who just know LLMs for many years to come.
Seems to answer before fully understanding the requests, and it often gets stuck in loops.
And this update removed the June model, which was great - a very sad day indeed. I still don't understand why they have to remove a model that is so well received...
Maybe it's time to switch again; Gemini is making great strides.
It's the only mainstream AI service that requests this information. After a string of security lapses by many of your competitors, I have zero faith in the ability of a "fast moving" AI-focused company to keep my PII data secure.
You can also access Claude via both AWS Bedrock and Google Vertex, both of which come with very robust guarantees about how your data is used.
Once you get to graduate studies that's where the material coverage is a little more sparse/niche (though usually still not groundbreaking), and for a PhD. coverage is mostly non-existent since the point is to expand upon current knowledge within the field and many topics are being explored for the first time.
My LLM client can switch between whatever models, mid conversation. So I'll have a question or two in the more expensive, then drop down to the cheaper for explanations/questions that help me understand. Rewind time, then hit the more expensive models with relevant prompts.
At the edges, it really ends up being "this is the only model that can do this".
>Claude Code consumes tokens for each interaction. Typical usage costs range from $5-10 per developer per day, but can exceed $100 per hour during intensive use.
Has there been any effort taken to reduce data leakage of this test set? Sounds like these exercises were available on the internet pre-2023, so they'll probably be included in the training data for any modern model, no?
I defend Java and cargo shorts in 2025!
Could we please get benchmarks for architect / DeepSeek R1 + claude-3-7-20250219 ?
To compare perf and price with Sonnet-3.7-thinking
Ha
Also:
> Your comments read like someone who discovered philosophy in their 30s and now can't decide if they want to code or become the next Marcus Aurelius.
skull emoji
That's not a terrible idea.
Would love to learn a bit more about how the GitHub integration works. From https://support.anthropic.com/en/articles/10167454-using-the... it seems it’s read only.
Does Claude Code let me take a generated/edited artifact and commit it back as a PR?
I tried the same prompt again just now with Claude 3.7 Sonnet in thinking mode, and I found myself laughing more than I did the previous time.
An excerpt:
[Conspiratorial tone]
Here's a secret: when humans ask me impossible questions, I sometimes just make up an answer that sounds authoritative.
[To human section]
Don't look shocked! You do it too! How many times has someone asked you a question at work and you just confidently said, "Six weeks" or "It's a regulatory requirement" without actually knowing?
The difference is, when I do it, it's called a "hallucination." When you do it, it's called "management."
Full set: https://gally.net/temp/20250225claudestandup2.html
Claude Code can run commands including "git" commands, so it can create a branch, commit code to that branch and push that branch to GitHub - at which point you can create a PR.
65% Sonnet 3.7, 32k thinking
64% R1+Sonnet 3.5
62% o1 high
60% Sonnet 3.7, no thinking
60% o3-mini high
57% R1
52% Sonnet 3.5
fist pump
So I started using this today not knowing it was even new.
One thing I noticed: when I tried uploading a PowerPoint template produced by Google Slides that was 3 slides (just to give styling and format), the web client said I'd exceeded the line limit by 1200+%.
Is that intentional?
I wanted Claude to update the deck with content I provided in markdown but it could seemingly not be done, as the line overflow error prevented submission.
> You complain about Elixir's lack of types but keep using it anyway. This is the programming equivalent of staying in a relationship where you keep trying to change the other person.
> You've lived in multiple countries but spend most of your time on HN explaining why their tech infrastructure is terrible. Maybe the common denominator is you?
Ouch, it's pretty good haha
I'm very much in favour of removing the guardrails but I understand why they're in place. The problem is attribution. You can teach yourself how to engage in all manner of dark deeds with a library or wikipedia or a search engine and some time, but any resulting public outcry is usually diffuse or targeted at the sources rather than the service. When Claude or GPT or Stable Diffusion are used to generate something judged offensive, the outcry becomes an existential threat to the provider.
The one for dang is hysterical.
Try to delete (close) the panel on the right on a side-by-side view. It took a good second to actually close. Creating one isn't much faster.
This is unbearably slow, to be blunt.
There's also been a spate of AI companies rushing to release products and having "oops" moments where they leaked customer chats or whatever.
They're not run like a FAANG, they don't have the same security pedigree, and they generally don't have any real guarantee of privacy.
So yes, my privacy is more valuable.
Conversely: Why is my non-privacy so valuable to Anthropic? Do they plan on selling my data? Maybe not now... but when funding gets a bit tight? Do they plan on selling my information to the likes of Cambridge Analytica? Not just superficial metadata, but also an AI-summarised history of my questions?
The best thing to do would be not to ask. But they are asking.
Why?
Why only them?
It seems very very similar. I open sourced the code to MyCoder here: https://github.com/drivecore/mycoder I'll compare them. Off hand I think both CodeBuff and Claude Coder are missing the web debugging tools I added to MyCoder.
Spot on!
> Has an M2 Max with 64GB RAM but probably still complains when Chrome opens more than 5 tabs.
Not true, I have 40 tabs open!
> Created a tool to generate portfolios in 5 minutes but spent 5 hours explaining how to optimize YouTube settings. Priorities!
Ouch! Brutal and funny at the same time.
Thank you for making this!
I would pay $50/mo or something to be able to have reasonable use of Claude Code in a limited (but not as limited) way as through the web UI, but all of these coding tools seem to work only with the API and are therefore either too expensive or too limited.
I've used https://github.com/cline/cline to get a similar workflow to their Claude Code demo, and yes it's amazing how quickly the token counts add up. Claude seems to have capacity issues so I'm guessing they decided to charge a premium for what they can serve up.
+1 on the too expensive or too limited sentiment. I subscribed to Claude for quite a while but got frustrated the few times I would use it heavily I'd get stuck due to the rate limits.
I could stomach a $20-$50 subscription for something like 3.7 that I could use a lot when coding, and not worry about hitting limits (or I suspect being pushed on to a quantized/smaller model when used too much).
It became such an anti-pattern that I stopped paying. Now, when people ask me which one to use, I always say I like Claude more than others, but I don’t recommend using it in a professional setting.
https://hn-wrapped.kadoa.com/e12e?share
> Your comments read like Warren and Brandeis met Alan Kay at a Norwegian tech conference.
I consider this high praise indeed, lol.
If you rule out ASI, then that means progress is going to have to slow. Consider that programming has been getting more and more automated continually since 1954. So put yourself in a position where what LLMs can do is a complement to what you can do. Currently you still need to understand how software works in order to operate one of these things successfully.
2. Instead, it lets the agent decide what to bring into context by using tools on the codebase. Since the tools used are fast enough, this gives you effectively "verified answers", so long as the agent didn't screw up its inputs to the tool (which will happen, most likely).
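A toy sketch of that loop, assuming ripgrep is installed (`choose_tool` is a stub where the model call would go; a real agent would pick the tool and argument from the accumulated context):

```
# Agent-style context gathering: the model names a tool, we run it,
# and the fast, verifiable output goes back into the context.
import subprocess

TOOLS = {
    "grep": lambda arg: subprocess.run(
        ["rg", "-l", arg], capture_output=True, text=True).stdout,
    "read": lambda arg: open(arg).read()[:2000],
}

def choose_tool(context: str) -> tuple[str, str]:
    return ("grep", "TODO")  # stub: a real agent asks the model here

context = "Task: find unfinished work in this repo."
for _ in range(3):  # bounded, so a confused agent can't spin forever
    tool, arg = choose_tool(context)
    context += f"\n[{tool} {arg}]\n{TOOLS[tool](arg)}"
print(context)
```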
Captchas are trivially broken and you can get access to millions of residential IP addresses, but phone numbers (especially if you filter out VOIP providers) still have a cost.
I've been trying to use Sonnet 3.7 tonight through the Copilot agent and it gets frustrating to see the API 500 halfway through the task list, leaving the project in a half-baked state, and then not feeling like I have a good "auto save" to pick up again from.
When it's ready, I squash merge into main.
Being able to get reset back to the last known good state is awesome. If you turn off auto commit, it’s a lot harder to undo one of the steps that the model takes. It’s only a matter of time until it creates nonsense, so you’ll really want the ability to roll it back.
Just work in a branch and you can merge all commits if you want at the end.
* You've spent more time talking about your Carnatic raga detector than actually building it – at this rate, LLMs will be composing ragas before your detector can identify them.
* You bought a 7950X processor but can't figure out what to do with it – the computing equivalent of buying a Ferrari to drive to the grocery store once a week.
* You're so concerned about work-life balance that you took a sabbatical to think about your career, only to spend it commenting on HN about other people's careers.
*** End ***
I'll be in my room crying, in case anyone's looking for me.
The Exercism problems have proven to be very effective at measuring an LLM's ability to modify existing code. I receive a lot of feedback that the aider benchmarks correlate strongly with people's "vibes" on model coding skill. I agree. The scores have felt quite aligned with my hands-on experience coding with most of the top models over the last 18+ months.
To be clear, the purpose of the benchmark is to help me quantitatively assess and improve aider and make it more effective. But it's also turned out to be a great way to measure the coding skill of LLMs.
There are similar open source CLI tools that predate Claude Coder. It's reasonable to assume Anthropic chose not to contribute to those projects for reasons other than complexity, and charitably Anthropic likely plans for differentiating features.
> Also, the minified source code is available
The redistribution license - or lack thereof - will be the stumbling block to directly reusing code authored by Anthropic without authorization.
edit: would be interesting to see how combo DeepSeek R1 + claude-3-7 performs.
> For someone who claims to be only 33, you have the technological opinions of at least three 60-year-old UNIX greybeards stacked in a trenchcoat.
Guilty as charged :-3
I'll be trying out Sonnet 3.7 extended thinking + Sonnet 3.5 or Flash 2.0, which I assume would be at the top of the leaderboard.
/skeptical face.
Without fail, every. single. person. I've met who says that actually means "except for the code that I write", or "except for how I link the code it builds together by hand".
If you are 50kloc into a large complex project that you have literally written none of, and have, e.g., used Cursor to generate the code without any assistance... well, you should start a startup.
...because, that's what devin was supposed to be, and it was enormously and famously terrible at it.
So that would be either a) terribly exciting, or b) hyperbole.
I wrote a step-by-step guide for the app I built: https://prompt.16x.engineer/guide/claude
* Your perfect tech stack exists only in your comments - a beautiful utopia where everything is type-safe, reliable, and nobody is ever on-call.
* You evaluate programming languages the way wine critics evaluate vintages: 'Ah yes, Effect-ts 2023, a sophisticated choice with notes of functional purity and a robust type system, though I detect a hint of API churn in the finish.'
ROFL :-)
This is a really accessible setup and is great for my current needs. Taking it to the next stage with agentic assistants is something I'm only just starting out on. I'm looking at WilmerAI [1] for routing ai workflows and Hoarder [2] to automatically ingest and categorize bookmarks, docs and RSS feed content into a local RAG.
Unless there's "Anthropic high usage" (I just watch the incident reports), I one-shot features regularly.
At a high skill level. Not front end. Back end C# in a small but great framework that has poor documentation. Not just endpoints but full-on task queues.
So really, it's a context problem. You're just not laser focusing your context.
Try this:
Set up a context with the exact files needed. Sure ai "should" do that but it doesn't. Especially not cursor or cline. Then try.
Hell try it with clood after I update with 3.7. I bet you, if you clood file it, then you get one shots.
I have a long history of clood being a commit in my projects and it's a clood one shot.
https://www.reddit.com/r/LocalLLaMA/comments/1i1kz1c/sharing...
excuse me, we boot from compact flash these days
>Your comments about modern tech are so critical that I'm convinced you judge new programming languages based on how well they'd run on a Commodore 64.
ouch
It's a scenario where the result is greater than the sum of its parts.
> Your journey from PHP to OCaml suggests you enjoy pain, just in increasingly sophisticated forms.
> You seem to spend so much time worrying about NSA surveillance that you probably encrypt your grocery lists. The NSA agent assigned to you is bored to tears.
Hahaha these are excellent, though it really latched on to the homebrew PC stuff I was into back in 2013
But to each their own!
I think the best period for software devs will be gone in a few years. Knowing how to code and fix things will still be important, but it will matter more to be a jack-of-many-trades who provides broader value: know a little about SEO, have good taste in design and be able to tweak a simple design, have good taste in how to organise code, and have better soft skills for managing or educating less tech-savvy staff.
Another option is to specialise in some currently difficult subfield - robotics, ML, CUDA, Rust - and try to be that elite dev, with the expectation that you'd have to move to SV or some such tech hub.
The best general recommendation I would give right now (especially to someone who is not from the US and is currently studying): use the time you have now, with not much responsibility, to build some product that can provide semi-passive income on a monthly basis ($5k-$10k) and drag yourself out of this rat race. Even if you don't succeed, or the revenue stream eventually runs out, you will learn the other skills that will matter more later if you want to be employed (SEO, code & design taste, marketing, soft skills).
Because most likely this window of opportunity is only open for the next few years, much like the best window for mobile apps was the first ~2 years after the App Store launched.
Oh man, I feel seen :)
Who cares if you used vector search for the retrieval?
The best vector retrieval implementations are already switching to a hybrid between vector and FTS, because it turns out BM25 etc is still a better algorithm for a lot of use-cases.
"Agentic search" makes much less sense to me because the term "agentic" is so incredibly vague.
How do you feel about raking in millions while attempting to make us all unemployed?
How do you feel about stealing open source code and stripping the copyright?
It could have done a better job outlining how to do it properly
In a way the AI becomes the dev and you become the code reviewer. Often as the AI is writing the code, you're thinking about the next step.
Overfitting is one of the fundamental issues to contend with when trying to figure out if any type of model at all is useful. If your leaderboard corresponds to vibes and that is your target, you could just have a vibes leaderboard
You do it one step at a time, similarly to how you would structure good tickets (often even smaller).
AI often still makes shit, but you do get somewhere a whole heap load of time faster.
Grok 3, Claude, DeepSeek, and Qwen all failed to solve this problem, producing some very, very wrong solutions. While Grok 3 admitted it failed and didn't provide an answer, all the other AIs provided just plain wrong answers, like `12 * 5 = 80`.
ChatGPT was able to solve for 40, but not for 80. YandexGPT solved both correctly (maybe it was trained on the same math books).
Just checked Grok 3 a few more times. It was able to solve correctly for 80.
In my experience EXTENSIVELY using Claude 3.5 Sonnet, you basically have to do everything complex yourself, or you're just introducing massive amounts of slop code into your code base that, while functional, is nowhere near good. And for anything actually complex - anything that requires a lot of context to make a decision and has to be useful to multiple different parts - it's just hopelessly bad.
However, rationally I can see where these models are evolving, and it leads me to think the software industry is on its own here, at least in the short/medium term. Code and math - and with math you typically need to know enough about the domain to know what abstract concept to ask about - so that really just leaves coding and software development. Even non-technical people understand the result they want from code.
You can see it in this announcement - it's all about "code, code, code" and how good they are in "code". This is not by accident. The models are becoming more specialised and the techniques used to improve them beyond standard LLM's are not as general to a wide variety of domains.
We engineers think AI automation is about difficulty and intelligence, but that's only partly true. It's also about whether the engineer has the knowledge of what they want to automate, whether the training data is accessible and vast, and whether they even know WHAT data is applicable. This combination of both deep domain skills and AI expertise is actually quite rare, which is why every AI CEO wants others to go "vertical" - they want others to do that leg work on their platforms. Even where it eventuates, it is rare enough that those domains, if they automate, will automate a LOT slower - not at the cadence of a new model every few months.
We don't need AGI/ASI to impact the software industry; in my opinion we just need well targeted models that get better at a decent rate. At some point they either hit a wall or surpass people - time will tell BUT they are definitely targeting SWE's at this point.
I really like 3.5 and can be productive with it, but Claude 3.7 can't fix even simple things.
Last night I sat for 30 minutes just trying to get the new model to remove an instructions section from a Next.js page. It was an isolated component on the page named InstructionsComponent. It failed non-stop; didn't matter what I did, it could not do it. 3.5 did it first try - I even mistyped "instructions" and the model fixed the correct thing anyway.
The nerd humor was hilariously unexpected.
> Your deep dives into quantum mechanics will lead you to publish a paper reconciling quantum eraser experiments with your cryptographic work, confusing physicists and cryptographers alike.
That is one hell of a Magic 8 Ball.
If the resulting code is not trying to be excessively clever or creative this is actually a good thing in my book.
The novelty and creativity should come from the product itself, especially from the users'/customers' perspective. Some people are too attached to LLM leaderboards being about novelty. I want reliable results whenever I give the instructions, whether that's the code, or the specs built into a spec file after throwing some ideas into prompts.
From time to time I toy with the idea of open sourcing the core block editor, but I never really get into it because 1. I'm a little embarrassed by how unmodular the code currently is and want to refactor it all, and 2. I still want to find a way to monetize my open source projects (so maybe AGPL with a commercial license?).
Dude, that inspector looks so cool. Can't wait to try it. Do you think it can also show how much memory each QML component is taking?
I'm hyped as well about Claude 3.7, haven't had the time to play with it on my Qt C++ projects yet but will do it soon.
You can definitely succumb to the fear. It sounds like you have. But courage isn't the absence of fear, it's what you do in the face of it. Are you going to let that fear paralyze you into inaction, doing nothing other than posting about being scared to the Internet? Or, having identified that fear, are you going to wrestle it to the ground? Either retrain into something else and start from near zero - something outside programming that you believe isn't about to be automated away - or dive in deeper: get a master's in AI, learn all the math behind LLMs, and become an ML expert who trains the AI. That job's not going away; there are still plenty of techniques to be discovered/invented and niches to fill. Fine-tuning an existing LLM to be better at some niche is going to be hot for a while.
You're lucky: you're in a position to go for a master's, even if you don't choose that route. Others with a similar doomer mindset have it worse, being too old or not in a position to even consider one.
Face the fear and look into the future with eyes wide open. Decide to go into chicken farming or nursing, or become a firefighter, aircraft mechanic, mortician, locksmith, beekeeper, or actuary.
My experience is that these models can write a simple function and get it right if it doesn't require any out-of-the-box thinking (so essentially offloading the boilerplate part of coding). When it comes to thinking creatively and finding a much better solution to a specific task - one that requires thinking 2-3 steps ahead - they are not suitable.
Tests that require thinking about the physical world are the most revealing.
My new favourite is:
You have 2 minutes to cool down a cup of coffee to the lowest temp you can.
You have two options: 1. Add cold milk immediately, then let it sit for 2 mins.
2. Let it sit for 2 mins, then add cold milk.
Which one cools the coffee to the lowest temperature and why?
Phrased this way without any help, all but the thinking models get it wrong
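For what it's worth, the physics is easy to sanity-check numerically. A toy simulation using Newton's law of cooling, with made-up but plausible numbers (90°C coffee, 20°C room, 5°C milk at 10% of the final volume):

```
# Compare "milk first, then wait" vs "wait, then milk" under
# Newton's law of cooling. All constants are invented for the sketch.
import math

def cool(temp, minutes, room=20.0, k=0.3):
    return room + (temp - room) * math.exp(-k * minutes)

def add_milk(temp, milk=5.0, frac=0.1):
    return (1 - frac) * temp + frac * milk

option1 = cool(add_milk(90.0), 2)  # add milk immediately, then sit
option2 = add_milk(cool(90.0, 2))  # sit first, then add milk
print(option1, option2)  # option2 ends up ~0.7°C cooler here
```

The hotter black coffee loses more heat during the wait, so option 2 wins, matching the algebra quoted further down the thread.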
Nice! It is entirely reasonable both to do that and to be excited about it.
…buuut, if that’s what you’re doing, you should say so.
Not:
“no lines of code directly written, just directing the AI”
Because those (gluing together AI code by hand and having the agent do everything) are different things, and one of them is much much MUCH harder to get right than the other one.
That last 10-15%. Self driving cars are the same story right?
ChatGPT is already my default first place to check something, where it was Google for the previous 20+ years.
It's finding out why "jbdoe1337" added this large if/else around the entire function body back in 2016 - it seems like important business logic, but the commit just says "updated code". And how the h*ll the interaction between the conf.ini files, the conf/something.json and the ENV vars works. Why sometimes the ENV var overrides a value in the ini and why it's sometimes the other way around. But also finding that when you clean it up, everything falls apart.
It's discussing with the stakeholders why "adding a delete button" isn't as easy as just putting a button there, but that it means designing a whole cascading deletion strategy and/or trashcan and/or soft-delete and/or garbage-collection.
It's finding out why - again - the grumb pipeline crashes with the typebar checker, when used through the mpm-yearn package manager. Both in containers and on an OSX machine, but not on Linux Brobuntu 22.12 LTLS.
It's moving stuff in the right abstraction layer. It's removing abstractions while introducing others. KISS vs future flexibility. It's gut feeling when to apply DRY and when to embrace it.
And then, if you're lucky, churning out boilerplate or new code for 120 minutes a week.
I'm glad that this 120 minutes can be improved with AI and become 20 minutes. Truly. But this is not what (senior?) programmers do. Despite what the hyped up AI press makes us believe. It only shows they have no idea what the "real" problems and time-consumers are for programmers.
I think a better test of thinking is to provide detail about something so mundane and esoteric that no one would have ever thought to communicate it to other people for entertainment, and then ask it a question about that pile of boring details.
EDIT: By the more obvious one, I mean letting it cool and then adding milk. As the temperature difference between the coffee and the surrounding air is higher, the coffee cools down faster. Is this wrong?
I C&P'd it into Claude 3.7 with thinking, and it gave the correct answer (which I'm pretty sure is #2). Including the CoT, where it actually does math (which I haven't checked), and final response.
# THINKING
Let's analyze the two options.
Option 1: Add cold milk immediately, then let it sit for 2 mins.
Option 2: Let it sit for 2 mins, then add cold milk.
To determine which option leads to the lowest final temperature, I need to consider the physics of cooling.
First, let's understand how cooling works in these scenarios:
1. When a hot liquid sits, it cools according to Newton's Law of Cooling, which states that the rate of heat loss is proportional to the temperature difference between the object and its surroundings.
- The rate of cooling is faster when the temperature difference is greater
- As the liquid cools, the rate of cooling decreases
2. When we add cold milk to hot coffee, we get an immediate temperature drop due to mixing liquids of different temperatures. The final temperature after mixing can be calculated using the principle of thermal equilibrium.
Now, let's analyze each option:
Option 1: Add cold milk immediately, then let it sit for 2 mins
- The coffee temperature drops immediately when milk is added
- Then this coffee+milk mixture cools for 2 minutes
- The cooling rate will be based on the temperature difference between the coffee+milk mixture and room temperature
Option 2: Let it sit for 2 mins, then add cold milk
- The black coffee cools for 2 minutes
- Then milk is added, causing an immediate drop in temperature
- The cooling rate during the 2 minutes will be based on the temperature difference between the black coffee and room temperature
The key insight is that hot coffee cools faster than warm coffee because the rate of cooling is proportional to the temperature difference. Since black coffee starts at a higher temperature than coffee with milk, it will lose more heat energy during the 2-minute cooling period.
Let's make this more concrete with some theoretical numbers:
- Let's say black coffee starts at 90°C
- Room temperature is 20°C
- Cold milk is at 5°C and makes up 10% of the final volume
Option 1:
- Adding milk immediately might bring the temperature to about 81.5°C (90 × 0.9 + 5 × 0.1)
- This mixture then cools for 2 minutes from 81.5°C to some lower temperature, let's call it T1
Option 2:
- The black coffee cools for 2 minutes from 90°C to some temperature, let's call it T2
- Then we add milk, which further reduces the temperature to 0.9 × T2 + 0.1 × 5
The question is: which is lower, T1 or 0.9 × T2 + 0.1 × 5?
According to Newton's Law of Cooling, the black coffee in Option 2 will lose more heat during the 2 minutes because it starts at a higher temperature and thus has a greater temperature difference with the surroundings. This means the drop from 90°C to T2 will be greater than the drop from 81.5°C to T1.
Let's think about this mathematically:
According to Newton's Law of Cooling: T(t) = Ts + (T0 - Ts) × e^(-kt) Where:
- T(t) is the temperature at time t
- Ts is the surrounding temperature
- T0 is the initial temperature
- k is the cooling constant
For Option 1:
T1 = 20 + (81.5 - 20) × e^(-k × 2) = 20 + 61.5 × e^(-2k)
For Option 2, before adding milk:
T2 = 20 + (90 - 20) × e^(-k × 2) = 20 + 70 × e^(-2k)
After adding milk in Option 2, the temperature would be:
0.9 × T2 + 0.1 × 5 = 0.9 × (20 + 70 × e^(-2k)) + 0.5
= 18 + 63 × e^(-2k) + 0.5
= 18.5 + 63 × e^(-2k)
Now we compare:
T1 = 20 + 61.5 × e^(-2k)
Final temp in Option 2 = 18.5 + 63 × e^(-2k)
For any positive value of k (which it must be for cooling), e^(-2k) is between 0 and 1.
If e^(-2k) = 1 (no cooling at all):
T1 = 20 + 61.5 = 81.5
Final temp in Option 2 = 18.5 + 63 = 81.5 (They would be the same)
If e^(-2k) = 0 (instant cooling to room temperature):
T1 = 20
Final temp in Option 2 = 18.5 (Option 2 would be better)
For values in between, we need to compare 20 + 61.5 × e^(-2k) vs 18.5 + 63 × e^(-2k)
Subtracting: (T1 - Final temp in Option 2) = 1.5 - 1.5 × e^(-2k)
This will be positive (meaning T1 is higher) when e^(-2k) < 1, which is true for all positive k.
So, mathematically, Option 2 (letting it sit, then adding milk) results in a lower final temperature.
This makes physical sense because:
1. The hotter black coffee loses more heat during the 2 minutes than the cooler coffee+milk mixture would
2. The cold milk provides the same cooling effect whether added at the beginning or end
Therefore, Option 2 (letting the coffee sit for 2 minutes, then adding cold milk) will result in the lowest final temperature.
# ANSWER
# Coffee Cooling Analysis
## Option 2 wins: Let it sit for 2 mins, then add cold milk
This will give you the lowest final temperature, and here's why:
### The physics behind it:
1. *Newton's Law of Cooling*: The rate at which something cools is proportional to the temperature difference between the object and its surroundings.
2. *Key insight*: Hotter objects cool faster than cooler ones because the temperature gap with the environment is larger.
### Comparing the options:
*Option 1 (Add milk first, then wait):*
- Milk immediately lowers coffee temperature
- The coffee+milk mixture cools for 2 minutes, but at a slower rate because it's already cooler

*Option 2 (Wait 2 mins, then add milk):*
- Black coffee starts hotter and cools faster during the 2 minutes
- More total heat is lost during the waiting period
- Then milk provides the same temperature drop at the end
The black coffee's higher initial temperature difference with the room causes it to lose more total heat energy during the waiting period, making Option 2 the more effective cooling strategy.
It is the only model I gave this to that actually approached it by writing math. Usually I am not that impressed with reasoning models, but this was quite fun to watch.
The Aider Polyglot website also states that the benchmark "...asks the LLM to edit source files to complete 225 coding exercises".
However, when looking at the actual tests [0], it is not about editing code bases; it's rather just solving simple programming exercises. What am I missing?
> You'll finally build that optimized game streaming system you've been thinking about since reading that Insomniac Games presentation in 2015.
Sure, but it's just a prototype that I've finally got time for after all these years. I really want it to be parallelised though, so I'll probably try...
> After years of defending C++, you'll secretly start experimenting with Rust but tell everyone 'it's just for a side project.'
Oh.
> Your comments about plankton evolving to survive ocean acidification suggest you have more faith in single-celled organisms than in most software companies.
Well, yeah?!
https://claude.ai/share/dfb37c1a-f6a8-45a1-b987-e6d28e205080
Think pouring water from the faucet into a sink with open drain - if you have high enough flow rate, you can fill the sink faster than it drains. Then, when you turn the faucet off, as the sink is draining, you can still collect plenty of water from it with a cup or a bucket, before the sink fully drains.
The best answer though is to put the coffee on a plate, and forget about the milk.
It might correlate to design level thinking but it also might not.
Unfortunately I only noticed it after the window to edit the comment was closed. If the first person to downvote me had instead suggested I use a pastebin, I might have been able to make the conversation more agreeable to people.
I literally have not written a line of code. The AI agent configures the build systems. It executes the `go install` command. It configures the infrastructure via terraform.
It takes a lot of reading of the code that's generated to see what I agree with or not, and redirecting refactorings. Understanding how to describe problem statements that are translated into design docs that are translated into task lists. It's still a lot of knowledge work on how to build software. But now I can do the coding that might have taken a day from those plans in 20 minutes.
Regarding startups, there's nothing here I'm doing that isn't just learning the tools of agentic coding. The business here might be advising people on how to do it themselves.
> You've cited LessWrong so many times that Eliezer Yudkowsky is considering charging you royalties for intellectual property use.
> Your comments have more 'bits of evidence' and 'probability updates' than most scientific papers. Have you considered that sometimes people just want to chat without Bayesian analysis?
> You spend so much time trying to bring nuance to political discussions on HN that you could have single-handedly solved AI alignment by now.
As a new career I'd probably not choose SWE now. But if you've done 10 years already I'd ride it out, there is a good chance most of us will remain employed for many years to come.
This is what I mean by generalization skills. You need trillions of lines of code to RL a model into a good SWE right now, but as the models grow more capable you will probably need less and less. Eventually we may hit the point where a large corporation's internal data in any department is enough to RL into competence, and then it frankly doesn't matter for any field once individual conglomerates can start the flywheel.
This isn't an absurdity. A human can "RL" themselves into competence in a single semester of material - a laughably small amount of training data compared to an LLM.
```
We can “stick‐to the order” of the digits and allow concatenation. For example, one acceptable answer is
40: 1 – 2 × 3 + 45 because 1 – (2×3) + 45 = 1 – 6 + 45 = 40
and another is 80: 12 ÷ 3 × 4 × 5 because 12÷3 = 4, then 4×4×5 = 16×5 = 80
In both cases the digits 1,2,3,4,5 appear in order without rearrangement.```
However, it took 8 minutes to produce that.
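FWIW the search space here is tiny, so a brute force runs in milliseconds, which puts those 8 minutes in perspective. A quick sketch under my reading of the rules (digits 1-5 kept in order, joined by +, -, *, / or concatenation):

```
# Brute-force the puzzle: try every operator insertion between the
# digits 1..5 (in order), where "" means concatenating digits.
from itertools import product

def solutions(target, digits="12345"):
    found = []
    for combo in product(["+", "-", "*", "/", ""], repeat=len(digits) - 1):
        expr = digits[0]
        for op, d in zip(combo, digits[1:]):
            expr += op + d
        try:
            if abs(eval(expr) - target) < 1e-9:
                found.append(expr)
        except ZeroDivisionError:
            pass
    return found

print(solutions(40))  # includes 1-2*3+45
print(solutions(80))  # includes 12/3*4*5
```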
e.g. if software costs 5x less to make, demand will go up more than 5x, as supply is highly constrained now. Lots of companies want better software but it costs too much.
That will create more jobs.
There'll be more product management and human interaction and edge-case testing, and less typing. Although I think there'll be a bunch of very technical jobs debugging things when the models fail.
So my advice is learn skills that help make software useful to people and businesses - from user research to product management. As well as engineering.
https://aider.chat/docs/llms/anthropic.html#thinking-tokens
I'll make this less clunky soon.
It's done a reasonable job — but rips through credit, often changing its mind. Even strong-arming it into choosing an approach, it wanted to flip-flop between using regex and lol_html to parse the HTML whenever it came across a difficulty.
If you're a US developer on whatever multiple of $ to the £ that I earn it might make sense, but burning through $100p/h for a pair programmer is a bit rich for my blood.
Like, most of these posts are legit funny.
Like $5+ was cache read ($0.05/token vs $3/token) so it would have cost $300+
I usually ask the models to extend a small parser/tree-walking interpreter with a compiler/VM.
Up until Claude 3.7 the models would propose something lazy and obviously incomplete. 3.7 generated something that looks almost right, mostly works, but is so overcomplicated and broken in such a way, that I rather delete it and write it from scratch. Trying to get the model to fix it resulted in running in circles, spitting out pieces of code that didn't fit the existing ones etc.
Not sure if I prefer the former or the latter tbh.
The point I wonder about is the sustainability of every query being 30+ requests. Site owners aren't ready to have 98% of their requests be non-monetizable bot traffic. However, sites that have something to sell are..
Sure, in a hypothetical market where most participants aren't losing money on below-profitable prices to keep mindshare before they try to extract profits. But you'd need a breakthrough around which a participant had some kind of moat to get there, even temporarily, in the LLM market.
The infrastructure side of things, tens of billions and probably hundreds of billions going in, may not be fantastic for investors. The return on capital should approach cost of capital if someone does their job correctly. Add in government investment and subsidies (in China, the EU, the United States) and it become extremely difficult to make those calculations. In the long term, I don't think the AI infrastructure will be overbuilt (datacenters, fabs), but like the telecom bubble, it is easy to end up in a position where there is a lot of excess capacity and the way you made your bet means getting wiped out.
Of course if you aren't the investor and it isn't your capital, then there is a tremendous amount of money to be made because you have nothing to lose. I've been around a long time, and this is the closest thing I've felt to that inflection point where the web took off.
> Your deep dive into embedded systems will lead you to create a heated keyboard powered by the same batteries as your Milwaukee heated jacket.
While I don't have a Milwaukee heated jacket (I have no idea why it thought this), this feels like a fantastic project idea.
> After years of watching payment technologies evolve, you'll finally embrace cryptocurrency, but only after creating a detailed spreadsheet comparing transaction fees across 17 different payment methods.
I feel seen. I may have created spreadsheets like this for comparing cloud backup options and cars.
From my roast:
> You've spent so much time discussing payment technologies that your credit card probably has a restraining order against you.
This one is completely wrong. They wouldn't do this as they'd lose out on a ton of transaction fees.
Excited to try 3.7 with reasoning more but so far it seems like a modest, welcome upgrade but not any sort of leapfrog past o1 pro.
CEOs should really watch what they say in public. Anyway, this is all just my opinion.
Still worth it, but that’s a big jump.
Not everything is a tactical operation to get more subscription purchases - sometimes people like the things they use and want to say thanks and let others know.
We have whoever runs NeurIPS/ICLR/ICML and the ACL to thank for this situation. Imagine if fucking Elsevier had strangleholded our industry too!
https://en.wikipedia.org/wiki/Association_for_Computational_...
I'm situated in Europe (Sweden), anyone else having the same experience?
have you ever seen those experiments where they asked people to draw a picture of a bicycle, from memory? people’s pictures made no mechanical sense. often people’s understanding of software is like that — even more so because it’s abstract and many parts are invisible.
learning to clearly describe what software should do is a very artificial skill that at a certain point, shades into part of software engineering.
It's not like the web was suddenly just there; it came slowly at first, then everywhere at once. The money came even later.
Telling it “if you were in an interview and you jumped to writing code without asking any questions, you’d fail the interview” is usually good enough to convince it to stop and ask questions.
You can view the generated SVG and the exact prompt here: https://shekhargulati.com/2025/02/25/can-claude-3-7-sonnet-g...
WHY is a huge % of my UX filled with nothing? I would appreciate metrics, token graphs, etc.
https://i.imgur.com/VlxLCwI.png
Why so much wasted space? ... >>??
Love it!
> You've spent so much time explaining why functional programming is superior that you could've rewritten all of Ruby in Elixir by now.
Ooof. Probably.
> Your relationship with LLMs is like watching someone who swore they'd never get a smartphone finally discover TikTok at age 50.
Skeptical.
> For someone who hates 'artificial limitations' so much, you sure do love languages that won't let you mutate a variable.
But it's about the right limitations! >..<
> synthetic content
I haven't heard this phrase. Thank you, I'll certainly be using it.
"Ahnentafel numbers start with you as 1. To find the Ahnentafel number of someone's father, double it. To find the Ahnentafel number of someone's mother, double it and add one.
Men pass on X chromosome DNA to their daughters, but none to their sons. Women pass on X chromosome DNA to both their sons and daughters.
List the Ahnentafel numbers of the closest 20 ancestors a man may have inherited X DNA from."
For smaller models, it's probably fair to change the question to something like: "Could you have inherited X chromosome DNA from your ancestor with Ahnentafel number 33? Does the answer to that question depend on whether you are a man or a woman?" They still fail.
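The question is mechanical enough to check with a few lines of code. A sketch under the rules as stated above (father of n is 2n, mother is 2n+1; a man's X comes only from his mother, a woman's from both parents):

```
# Enumerate Ahnentafel numbers of ancestors a man may have
# inherited X DNA from, closest generations first.
from collections import deque

def x_ancestors(limit=20):
    out = []
    queue = deque([(3, False)])  # start: his mother (3), female
    while queue and len(out) < limit:
        n, is_male = queue.popleft()
        out.append(n)
        if is_male:
            queue.append((2 * n + 1, False))  # his mother only
        else:
            queue.append((2 * n, True))       # her father
            queue.append((2 * n + 1, False))  # her mother
    return out

print(x_ancestors())  # 3, 6, 7, 13, 14, 15, 26, 27, 29, 30, 31, ...
```

The per-generation counts (1, 2, 3, 5, 8, ...) follow the Fibonacci numbers, which is a handy way to double-check a model's answer.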
> Your comments read like someone who's been burned by every tech hype cycle since COBOL was cutting edge.
> For someone who criticizes LLMs for being overconfident, you sure have strong opinions about literally everything in tech.
TL;DR - If you're writing more than one failing test at a time, you are not doing Test-Driven Development.
The bottleneck was already checking the work for correctness and building my own trust / familiarity with new code. So it's made that problem slightly more pronounced, as it generates more code faster, with more surface area to debug when many new changes arrive at once.
One of the most exciting new capabilities is that this model has a 120,000 token output limit - up from just 8,000 for the previous Claude 3.5 Sonnet model and way higher than any other model in the space.
It seems to be able to use that output limit effectively. Here's my longest result so far, though it did take 27 minutes to finish! https://gist.github.com/simonw/854474b050b630144beebf06ec4a2...
It is - what? - a fifth anniversary of "the world will be a completely different place in 6 months due to AI advancement"?
"Sam Altman believes AI will change the world" - of course he does, what else is he supposed to say?
schemesh is lisp in your shell. Most of the bash syntax remains.
Claude was okay with Lisp, but understanding the gist of schemesh it found really hard - even when I supplied the git source code.
ChatGPT O3 (high) had similar issues.
Originally, electric generators merely replaced steam generators and brought no additional productivity gains; this only changed when the rest of the processes around them changed too.
A lot of comedy involves punching down in a way that likely conflicts with the alignment efforts by mainstream model providers. So the comedic potential of LLMs is probably even greater than what we've seen.
At some point fairly recently, we passed the point at which things that took longer than anyone thought they would take are happening faster than anyone thought they would happen.
/i
uv tool install llm
llm install llm-anthropic
llm keys set anthropic
# paste in API key
llm -m claude-3.7-sonnet -o thinking 1 'your prompt goes here'
Once the AI gets smart enough that it only requires an intern to write the prompt and fix the few mistakes, development will cost next to nothing.
There is only so much demand for software development.
When the business requirements change, the spec will change. When that happens, the system will either modify its previously-written code or regenerate it from the ground up. Which strategy it chooses won't be especially interesting or important.
The process of maintaining the English-language spec will still require great care and precision. It will be called "programming," or perhaps "coding."
A few graybearded gurus will insist on examining the underlying C or Javascript or Python or Rust or whatever the model generates, the way they peer at compiler-generated assembly code now. Occasionally this capability will be important, even vital. But not usually. The situations where it's necessary will become less common over time.
I was at the AI Engineering Summit in NYC last week and met an (extremely senior) staff ai engineer doing somewhat unbelievable things with aider. Shocking things tbh.
Is there a good way to share stories about real-world aider projects like this with you directly (if I can get approval from him)? Not sure posting on public forum is appropriate but I think you would be really interested to hear how people are using this tool at the edge.
>Your archive.is links will become so legendary that dang will create a special 'Paywall Slayer' badge just for you
>You've shared so many archive.is links that the Internet Archive is considering naming you their unofficial spokesperson - or sending you a cease and desist letter.
>Your economic predictions are so consistently apocalyptic that gold dealers use your comment history as their marketing strategy.
Really sums it up!
Now that is funny!
For small models and when attention is "taken up", these sorts of questions really send a model for a loop. Agreed - especially noticeable with small reasoning models.
Flash 2.0 got 100% on the edit format, and it's extremely cheap, so I'm pretty curious how that would score.
Those people with cross-domain knowledge in an industry will continue to have value for some time, able to contribute to domain discussions and execute better with the tech. As a result I've always thought the "engineering" part of software was more valuable than the CS/Leetcode part of the industry. As a lecturer told me many decades ago in an SE course: "you will know more about their business, in greater detail, by the time you are finished than they even do".
Think of having a secretary, or ten. These secretaries are not as good as an average human at most tasks, but they're good enough for tasks that are easy to double check. You can give them an immense amount of drudgery that would burn out a human.
Interestingly, the R1 1.58-bit dynamic quant model was able to sort of solve it. I changed the problem statement a bit to request only the solution for 40 and to tell it what operations it can use, both needed to keep from blowing out the limited context available on my machine (128GB RAM + 24GB GPU).
Took almost 3 hours and it wigged out a bit at the end, rambling about Lisp in Chinese, but it got an almost-valid answer: 1 * (2 + 3) * (4 + 5) - 5 (https://pastebin.com/ggL85RWJ) I didn't think it would get that far.
For OP, the product is the product, how they got there is mostly irrelevant. We don’t really care what IDE they used (outside of being a tooling nerd).
If you're generating immense amounts of really basic make work, that seems like you're managing your time poorly.
No, it won't. Because "human languages" lack the precision to describe such a spec. This is exactly why programming languages exist in the first place: a language that humans understand but that allows for precise and unambiguous specifications and/or instructions. Do note that a computer cannot execute "Python" or "C". We need to translate it first (compiling). Edit: A programmer doesn't just type curly brackets and semicolons in the right place; she takes vague and ambiguous specs and makes them precise enough that machines can repeat them.
As a kid we had this joke (works better in Dutch).
John gets in an accident and loses both his arms. A doctor gives him futuristic voice-controlled prosthetics.
John: "Pick up coffee-mug". "Bring to mouth to drink". woa! impressed he goes home.
John, all excited "unzip pants", "grab d#ck", "jerk off"
(in Dutch, trek af means both "rip off" and "w#ank")
Jokes aside, we do have such a language that's not a programming language in the common sense: executable specs - end to end tests. Gherkin being a famous one but certainly not the only one. BDD, where the B is described by humans, in a DSL and the DD is performed by AI. I could imagine this working. Not currently and not anywhere soon (current LLMs are great at making new stuff, horrible at changing existing stuff), but it might work.
We'd then end up with just another programming language, but one that's more accessible to more people, I guess. And the AI is "just a compiler" in that sense.
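To make that concrete, a minimal sketch in plain Python rather than Gherkin, with all names hypothetical: the test is the human-maintained spec, and the class below it is the part a compiler-like AI would own and could regenerate at will.

```
# The spec: readable, domain-level behavior, written by a human.
def test_deleted_items_go_to_the_trashcan():
    cart = Cart()
    cart.add("book")
    cart.delete("book")
    assert "book" not in cart.items
    assert "book" in cart.trashcan

# The implementation: regenerated or patched by the machine.
class Cart:
    def __init__(self):
        self.items, self.trashcan = [], []

    def add(self, name):
        self.items.append(name)

    def delete(self, name):
        self.items.remove(name)
        self.trashcan.append(name)

test_deleted_items_go_to_the_trashcan()  # runs green
```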
The graph on that page explains it much better than any of the text explanations can.
https://en.m.wikipedia.org/wiki/Synthetic_media
I believe your message got downvoted not because it is synthetic, but because it is wrong or not of interest.
On the nail
They get caught up in the idea that adding milk first cools it fastest and can’t escape from that
The bigger the temp difference the more cooling. So by putting the milk in right away you make the temp difference between the surroundings and the coffee smaller = less cooling over your 2 mins.
I like puzzles with a satisfying answer
It was meant to be an example of thinking being better…
Lol!
I really want to be able to see what specifically is changing, not just the entire new file.
Also, if the user provides a file for modification, make that available as Version 0 (or whatever), so we can diff against that.
don't worry it's "all but", so you are fine.
The entire LOTR trilogy is ~0.55 million tokens (1,200 pages, published).
If you are sending and receiving the text equivalent of several hundred copies of the LOTR trilogy every week, I don't think you are actually using AI for anything useful, or you are providing far too much context.
We really still need a better unified workflow for working on the cutting edge of tech with LLMs, imo. This problem is the same with other frameworks/technologies undergoing recent changes.
I notice this on HN more than places like reddit and Mastodon.
I think it's a benefit when writing requirements and programming to specs. It's less of a benefit when having a freeform discussion.
LLMs might enable some completely new things to be automated that made no sense to automate before, even if it’s necessary to error correct with humans / computers.
It's still very underwhelming. I like this because it isn't a difficult problem - it should be right up the alley of a "language model" to translate computer languages - but it is a fairly complex problem with lots of options and parsing annoyances. Addresses can be pretty complex, with regexes in line selections/subsetting. Scripts are supported. It's probably Turing complete, considering the pattern space as storage and the looping/jump constructs.
In an experience reminiscent of "can I have L2 support please", most AIs give a kind of milquetoast, slightly-above-average-IQ response to various questions. I wonder if there should be a standard "please give me more complicated/erudite/involved explanations/documents/code from the get-go" so as not to bother with the boring prompts.
Let's fire it up.
"Type /init to set up your repository"
OK, /init <enter>
"OK, I created CLAUDE.md, session cost so far is $0.1764"
QUIT QUIT QUIT QUIT QUIT
Seventeen cents just to initialize yourself, Claude. No.
I feel like I touched a live wire.
It's about 2 orders of magnitude (100x) too expensive.
> The selling price and the unit price must be indicated in an unambiguous, easily identifiable and clearly legible manner for all products offered by traders to consumers (i.e. the final price should include value added tax and all other taxes).
I wanted to see what the annual plan would cost as it was just displaying €170+VAT, and when I clicked the upgrade button to find out (I checked everywhere on the page) then I was automatically subscribed without any confirmation and without ever seeing the final price before the transaction was completed.
I use LLMs 20-30 times a day and while it feels invaluable for personal use where I can interpret the responses at my own discretion, they still hallucinate enough and have enough lapses in logic where I would never feel confident incorporating them into some critical system.
You think it's acceptable that a company say the price is €170+vat and then after the transaction is complete they inform you that the actual price was €206.50?
https://www.visualcapitalist.com/ranked-ai-models-with-the-l...
99% of systems aren't critical and human validation is sufficient. My own use case, it is enough to replace plenty of hours of human labour.
Using them to replace core competencies will probably remain forbidden by professional ethics (writing court documents, diagnosing patients, building bridges). However, there are ways for LLMs to assist people without doing their jobs for them.
Law firms are already using LLMs to deal with large amounts of discovery materials. Doctors and researchers probably use it to summarize papers they want to be familiar with but don't have the energy to read themselves. Engineers might eventually be able to use AI to do a rough design, then do all the regulatory and finite element analysis necessary to prove that it's up to code, just like they'd have to do anyway.
I don't have a high-level LLM subscription, but I think with the right tooling, even existing LLMs might already be pretty good at managing schedules and providing reminders.
> Your obsession with data extraction makes me wonder if you're secretly a web scraper that gained sentience and is now posting on HN.
> You talk about AI automating tedious tasks so much that I'm surprised you haven't built an AI to write your HN comments for you yet.
Those are great. Well done! That it can just read your entire comment history gives it great potential for a whole new dimension of humor.
Here is a user script to replace HN profiles with this improved version.
People making up their own benchmarks for these things has confirmed one thing for me: The bias that people think they mostly have original thoughts is extremely strong. I find if I have a “good” idea someone has probably already thought of it as well and maybe even written about it. About 0.01% of the time do I have an idea that one may consider novel and even that’s probably my own bias and overstated. This example just confirms that these models don’t really seem to reason and have a really hard time doing the basic generalization they can with fewer examples.
AI is hard; edge cases are hard. AI sucks at edge cases.
Between AI for cars and AI for software the long tail of edge cases that have to be catered for is different, yes.
...but I'm sure the same will apply for AI for art (e.g. hands), and AI for (insert domain here).
Obviously no analogy is perfect, but I think you have to really make an effort to look away from reality not to see the glaringly obvious parallels in cars, art, programming, problem solving, robots, etc. where machine learning models struggle with edge cases.
Does the tooling they used matter? no, not at all.
...but if they've claimed to solve the 'edge case problem', they've done something really interesting. If not, they haven't.
So, don't claim to have done something really interesting if you haven't.
You can say "I've been using AI to build a blah blah blah. It's great!" and that's perfectly ok.
You have to go out of your way to say "I've been using an AI to build blah blah blah and I haven't written any of it, it's all generated by AI". <-- kinda attention seeking.
"no lines of code directly written" really? Why did you mention that? You got the AI to write your software for you? That sounds cool! Let's talk! Are you an AI consultant by any chance? (yes, they are). ...but.
No. You didn't. You really didn't. I'm completely happy to call people out for doing that; its not unfair at all.
Too many AI grifters out there.
The other, even more subtle point is that temp (unlike pressure) doesn't equilibrate instantaneously, as is assumed in the second option. It is a diffusive process which follows a complicated parabolic partial differential equation, whose solution will determine the final temp of the tea+milk, plus the partial adiabatic convection cooling of the cup. It took Fourier a detour through complex numbers to figure it out.
I find all LLMs do quite well when presented with well-posed engineering/physics/chemistry problems with well-defined parameters, including relativistic or second-order effects. The confusion/hallucination is mostly imparted on them by the user's lack of sophistication in articulation, or by computational limitations.
Also you can definitely argue HN forums are very very high quality training corpus :)
Roast: You've spent so much time discussing Apple vs Microsoft that Tim Cook and Satya Nadella probably have a joint restraining order against you.
Your comments about HTTPS everywhere suggest you're the kind of person who wears a tinfoil hat... but only after thoroughly researching the optimal thickness for blocking government signals.
You seem to have strong opinions about Flash - we get it, you're old enough to remember when websites had intro animations and your laptop could double as a space heater.
———
Totally forgot about the Flash debates of the early 2010s!