Building a Personal AI Factory

(www.john-rush.com)
260 points by derek | 58 comments
1. simonw ◴[] No.44439075[source]
My hunch is that this article is going to be almost completely impenetrable to people who haven't yet had the "aha" moment with Claude Code.

That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.

I had it compile, run and debug a Mandelbrot fractal generator in 486 assembly today, executing in Docker on my Mac, just to see how well it could do. It did great! https://gist.github.com/simonw/ba1e9fa26fc8af08934d7bc0805b9...

replies(7): >>44439177 #>>44439259 #>>44439544 #>>44440242 #>>44441017 #>>44441069 #>>44441796 #
2. gerdesj ◴[] No.44439177[source]
Crack on - this is YC!

Why are you not already a unicorn?

replies(2): >>44439197 #>>44441263 #
3. lucubratory ◴[] No.44439197[source]
An LLM wrapper does not have serious revenue potential. Being able to do very impressive things with Claude Code has a pretty strict ceiling on valuation because at any point Anthropic could destroy your business by removing access, incorporating whatever you're doing into their core feature set, etc.
replies(1): >>44439530 #
4. zackify ◴[] No.44439259[source]
If it helps anyone else. I downgraded from Claude max to pro for $20 and the usage limits are really good.

I think they're trying to compete with Gemini CLI, and now I'm glad I'm paying less.

replies(2): >>44440204 #>>44440222 #
5. petesergeant ◴[] No.44439530{3}[source]
Having worked with some serious pieces of enterprise software, I don't think this is right. Anthropic is not going to perfect multi-vendor integrations, spin up a support team, and solution architect your problems for you. Enterprise software gets into the walls, and can be very hard to displace once deployed. If you build an LLM-wrapper resume parser, once you've got it into your client's workflows, they're going to find it hard to unembed it to replace it with raw Anthropic.
replies(1): >>44440220 #
6. low_common ◴[] No.44439544[source]
That's a pretty trivial example for one of these IDEs to knock out. Assembly is certainly in their training sets, and obviously docker is too. I've watched cursor absolutely run amok when I let it play around in some of my codebase.

I'm bullish it'll get there sooner rather than later, but we're not there yet.

replies(2): >>44439886 #>>44441960 #
7. simonw ◴[] No.44439886[source]
I think the hardest problem in computer science right now may be coming up with an LLM demo that doesn't get called "pretty trivial".
replies(14): >>44439918 #>>44440031 #>>44441154 #>>44441225 #>>44441323 #>>44441441 #>>44441638 #>>44441811 #>>44442389 #>>44442493 #>>44443084 #>>44444778 #>>44446970 #>>44457389 #
8. fragmede ◴[] No.44439918{3}[source]
I think Cloudflare's oauth library qualifies https://news.ycombinator.com/item?id=44159166
replies(1): >>44440627 #
9. skydhash ◴[] No.44440031{3}[source]
Because they are trivial in the sense that you can go on GitHub and copy one of them yourself, without pretending the LLM is anything other than a mashup of the internet.

What people agree is non-trivial is working on a real project. There are a lot of open-source projects that could benefit from a useful code contribution, but they only get slop thrown at them.

replies(1): >>44440066 #
10. ffsm8 ◴[] No.44440204[source]
You will run through the Pro rate limit within an hour if you work the way the article lays out.

But yeah, if you're babysitting a single agent and only approving actions after reading what it wants to do... you'll be fine for 3-4 hours before hitting the token limit, which refreshes after the 5th hour.

replies(2): >>44440718 #>>44440721 #
11. skydhash ◴[] No.44440218{5}[source]
I took the time to investigate the work being done there (all those years learning assembly and computer architecture come in handy), and it confirms (to me) that the key aspect of using LLMs is pattern matching. Meaning: you know there's a solution out there (in this case, anything involving multiplying or dividing by a power of 2 can use such a trick), and by framing your problem (intentionally or not) you'll get derived text that contains a possible solution.
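
To make the power-of-two trick concrete, here is a minimal sketch (my own illustration, not code from the gist being discussed):

    # Strength reduction: multiplying or dividing by a power of two can be
    # rewritten as a bit shift, a pattern heavily represented in training data.
    for x in (7, 1024, 123456):
        assert x * 8 == (x << 3)   # multiply by 2**3 via left shift
        assert x // 4 == (x >> 2)  # floor-divide by 2**2 via right shift (x >= 0)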

But there's nothing truly novel in the result. The key aspect is being similar enough to something already in the training data that the LLM can extrapolate the rest. The hint can be quite useful, and sometimes you get something that shortens the implementation time, but you have to have at least a basic understanding of the domain in order to recognize the signs.

The issue is that the result is always tainted by your prompt. The signs may be there because of your prompt and not because there's some kind of data that needs to be explored further. And sometimes it's a bad fit: similar, but different (what you want vs. what you get). So for the few domains that are valuable to me, I prefer to construct my own mental database that can lead me to concrete artifacts (books, articles, blog posts, ...) that exist outside the influence of my query.

ADDENDUM

I can use LLMs with great results and I've done so. But it's more rewarding (and more useful to me) to actually think through the problem and learn from references. Instead of getting a perfect (or wobbly, or wrong-category) circle that fits my query, I go find a strange polygon formed (by me) from other strange polygons. Then, because I know I need a circle, I only need to find its center and its radius.

It's slower, but the next time I need another circle (or a square) from the same polygon, it's going to be faster and faster.

12. ffsm8 ◴[] No.44440220{4}[source]
But if you did become a unicorn, it would suddenly become very easy for Anthropic to replace you, because they're the ones actually providing the sauce and can just replicate your efforts. So your window of opportunity is to stay too small for Anthropic to notice and get interested. That can't be called a unicorn.

That was the point he was making, at least that's how I understood it

replies(1): >>44442649 #
13. csomar ◴[] No.44440222[source]
I am on Max and, per ccusage, burning roughly my monthly subscription's worth of API-priced usage every day. It is not clear whether the API is very overpriced or we are getting aggressively subsidized. I can afford $100-200/month, but not the ~$3,000/month that daily burn adds up to at API prices. Let's hope this lasts for a good while, as GitHub Copilot turned off the tap on unlimited usage very recently.
14. csomar ◴[] No.44440242[source]
That's a very simple example/context that I suspect most LLMs will be able to knock out with minimal frustration. I had a much more complex Rust dependency upgrade done over 30+ iterations on very custom code (wasm stuff, where training data is probably scarce). Claude would ping context7 and mcp-lsp to get details. You do find its limits after a while, though, as you push it harder.
replies(1): >>44440355 #
15. nico ◴[] No.44440355[source]
> That's a very simple example/context that I suspect most LLMs will be able to knock out with minimal frustration

Yes and no. You are right that it's a relatively small project. However, I've had really bad experiences trying to get ChatGPT (any of their models) to write small arm64 assembly programs that compile and run on Apple silicon.

16. gen6acd60af ◴[] No.44440627{4}[source]
This one?

>Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.

>To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.

Some time later...

https://github.com/advisories/GHSA-4pc9-x2fx-p7vj / CVE-2025-4143

>The OAuth implementation in workers-oauth-provider that is part of MCP framework https://github.com/cloudflare/workers-mcp, did not correctly validate that redirect_uri was on the allowed list of redirect URIs for the given client registration.
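
For context, the missing check is conceptually simple: OAuth requires the presented redirect_uri to exactly match one of the URIs pre-registered for that client. A hedged sketch of the idea (illustrative names, not Cloudflare's actual code):

    # Hypothetical sketch of the validation the CVE describes as missing;
    # names and structure are illustrative, not Cloudflare's implementation.
    def is_valid_redirect_uri(registrations: dict, client_id: str, redirect_uri: str) -> bool:
        client = registrations.get(client_id)
        if client is None:
            return False
        # Exact match against the registered list; no prefix or substring matching.
        return redirect_uri in client["redirect_uris"]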

replies(1): >>44443019 #
17. stpedgwdgfhgdd ◴[] No.44440718{3}[source]
Same experience. One terminal window with Pro is okay; multiple CC instances running in parallel are not.

We'll most likely implement a policy where starters in our company use Pro and power users get Max.

18. zxexz ◴[] No.44440721{3}[source]
I've heard that if you have several relatively active separate sessions open, the limit is a little less restrictive. Especially if you do a /clear and continue your session on a different project. Honestly, a lot of Claude Code seems vibecoded if you look at the client side, too. Can't tell if I'm surprised that the backend has an element of that, too. Hey, dogfood tastes good - I respect them for that.
19. com2kid ◴[] No.44441017[source]
> That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.

Eh, I just watched Claude spend an hour trying to incorrectly fix code. Eventually I realized what was happening, stepped in and asked it to write a bunch of unit tests first, get the code working against those unit tests, and then get back to me.

Claude Code is amazing, but I still have to step in and give it basic architectural guidance again and again.

20. barrenko ◴[] No.44441069[source]
I'd like to add this video from the AI Engineer conf by the folks from Dagger (the person behind Docker), which may also be impenetrable: https://youtu.be/bUBF5V6oDKw
21. th0ma5 ◴[] No.44441154{3}[source]
Maybe you should try something other than demos? Have you tried creating a reliable system?
replies(1): >>44443451 #
22. jkhdigital ◴[] No.44441225{3}[source]
No, the hardest problem is teaching CS undergrads. I just started this year (no background in academia, just 75% of a PhD and well-rounded life experience) and I've basically torn up the entire curriculum they handed to me and started vibe-teaching.
23. sussmannbaka ◴[] No.44441263[source]
As it turns out, the VC potential of Mandelbrot and HelloWorld.py are quite limited :o)
replies(1): >>44441664 #
24. 1dom ◴[] No.44441323{3}[source]
I'm very pro LLM and AI, but I completely agree with the comment about how many pieces praising LLMs do so with trivial examples. Trivial might not be the right word; I can't think of a better one without a negative connotation, and this shouldn't be negative. Your examples are good and useful, and capture a bunch of tasks a software engineer would do.

I'd say your Mandelbrot debug and the LLVM patch are both "trivial" in the same sense: they're discrete, well-defined, clear-success-criteria tasks that could be assigned to any mid/senior software engineer in a relevant domain, who could chip through them in a few weeks.

Don't get me wrong, that's an insane power and capability of LLMs, I agree. But ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Non-trivial examples are things that would take a team with different specialist skillsets months to create. One obvious reason there are so few non-trivial AI examples is that they require a non-trivial amount of time to generate and verify.

A non-trivial example isn't one where you can look at the output and say "yup, the AI's done well here". It requires someone to spend time going through what's been produced, assessing it, essentially redesigning it as a human, to figure out all the complexity of a modern non-trivial system and confirm the AI actually did all that stuff correctly.

An in-depth audit of a complex software system can take months or even years, and is a thorough and tedious task for a human; the Venn diagram of humans thinking "I want to spend more time doing thorough, tedious code tasks" and "I want to mess around with AI coding" is two separate circles.

replies(7): >>44441342 #>>44441663 #>>44441824 #>>44441879 #>>44443505 #>>44444529 #>>44445225 #
25. sokoloff ◴[] No.44441342{4}[source]
> ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Doing for under $10, in under an hour, what would otherwise take a few weeks and $10K+ worth of senior staff time is pretty valuable.

replies(1): >>44441546 #
26. cranium ◴[] No.44441441{3}[source]
Instead of "pretty trivial", I'd say it's "well-defined and generally understood".

The implicit decisions it had to make were also inconsequential, e.g. the selection of ASCII chars, color or not, the bounds of the domain, ...

However, it shows that agents are powerful translators / extractors of general knowledge!

27. 1dom ◴[] No.44441546{5}[source]
If it's something a single senior staff member can do, then - personally - I'd consider it not complex; it's relatively trivial, since it can be done by literally a single person.

I'm pro AI, I'm not saying it's not valuable for trivial things. But that's a distinct discussion to the trivial nature of many LLM examples/demos in relation to genuinely complex computer systems.

replies(1): >>44443727 #
28. sroussey ◴[] No.44441638{3}[source]
Convert react-stockcharts to react v19. I’ve tried Claude Code and Cursor but only ended up with hilariously bad results.
replies(1): >>44443541 #
29. sroussey ◴[] No.44441663{4}[source]
LLMs are best demonstrated with greenfield examples.
replies(1): >>44441830 #
30. addandsubtract ◴[] No.44441664{3}[source]
Bakeries have been in business for thousands of years. Should be pretty easy to sell Mandelbrot everywhere around the world.
31. CjHuber ◴[] No.44441796[source]
Is it that much better than Codex?
32. j45 ◴[] No.44441811{3}[source]
Many big problems are made up of small problems.
33. j45 ◴[] No.44441824{4}[source]
There is a scale somewhere in these types of articles that will emerge.

It might separate what's actually new (cutting edge) from what's merely new to someone, and from the human mind wanting each experience to feel as novel and different as the first time using ChatGPT 4.

There is also the wiring of non-deterministic software frameworks and architectures, compared to the deterministic-only software development we're used to.

The former is a different thing than the latter.

34. j45 ◴[] No.44441830{5}[source]
Plus, applying non-deterministic algorithms in a deterministic way might not always work the same. The software developers are also changing the frames and terms of reference.
35. sundache ◴[] No.44441879{4}[source]
I only see 148 lines of assembly and a Dockerfile that's 7 lines long. Am I missing something, or should that take a human less than several weeks?
replies(1): >>44442232 #
36. Havoc ◴[] No.44441960[source]
I suspect personal tools are as close as we're going to get to this mythical demo that satisfies all critics, i.e. "here is a list of problems I've solved with just AI."

It strikes a balance between simplicity and real-world usefulness.

replies(1): >>44443461 #
37. dotancohen ◴[] No.44442232{5}[source]
Depends on what's in those 148 lines.
38. afro88 ◴[] No.44442389{3}[source]
The demo coming from computer science might be the issue. There are a lot of open-source repos out there that have tricky bugs, and todo lists of features too complex or time-consuming for casual contributors to tackle. Adding significant value to an open-source project is a pretty nice demo that won't get called "pretty trivial".

Can't be too far off!

39. raxxorraxor ◴[] No.44442493{3}[source]
The complexity of the problem masks the common problem of providing sensible context to your AI of choice so it can do something constructive in your personal codebase, or of giving it tools to check the truth of its assertions. Something a developer does countless times.
40. lucubratory ◴[] No.44442649{5}[source]
She, but yes.
replies(1): >>44442989 #
41. ffsm8 ◴[] No.44442989{6}[source]
https://upload.wikimedia.org/wikipedia/en/f/f8/Internet_dog....

:)

42. kentonv ◴[] No.44443019{5}[source]
Sorry, my code has bugs sometimes.
43. pydry ◴[] No.44443084{3}[source]
Really? This paper cut through the same kind of bullshit with puzzles: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

What do you think is so difficult about doing the same thing with coding problems?

replies(1): >>44443442 #
44. simonw ◴[] No.44443442{4}[source]
I don't understand the connection between that paper and my comment.
replies(1): >>44443831 #
45. ◴[] No.44443451{4}[source]
46. simonw ◴[] No.44443461{3}[source]
I tried that with https://tools.simonwillison.net/colophon - over 100 personal tools, some of which I use on a daily basis.
47. simonw ◴[] No.44443505{4}[source]
> Non-trivial examples are things that would take a team of different specialist skillsets months to create.

Thank you for providing a spelled out definition of "non-trivial" there!

replies(1): >>44445198 #
48. simonw ◴[] No.44443541{4}[source]
I had great success with o4-mini via ChatGPT for that kind of upgrade, since it can use its search tool to look up what's changed.

I used this prompt a few weeks ago:

> This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

49. simonw ◴[] No.44443727{6}[source]
Maybe the definition of "non-trivial" in these conversations should be defined as "stuff an LLM system can't do yet".
50. pydry ◴[] No.44443831{5}[source]
They created an environment to expose LLMs to problems and test their performance, using puzzles that were immune to benchmark hacking.

Your comment was about how this was unreasonably hard (for coding challenges).

Anecdotally, I've seen LLMs do all sorts of amazing shit that was obviously drawn from their training set, and fall flat on their faces doing simple coding tasks novel enough not to appear in the training set.

replies(1): >>44444237 #
51. simonw ◴[] No.44444237{6}[source]
That Apple paper mainly demonstrated that "reasoning" LLMs - with no access to additional tools - can't solve problems that deliberately exceed their token context length.

I don't think it has much relevance at all to a conversation about how good LLMs are at solving programming problems by running tools in a loop.

I keep seeing this idea that LLMs can't handle problems that aren't in their training data and it's frustrating because anyone who has spent significant time working with these systems knows that it obviously isn't true.
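
For anyone unfamiliar with the pattern, "tools in a loop" is roughly the following (a minimal sketch; call_model and the tools registry are placeholders for illustration, not any vendor's actual API):

    # Minimal sketch of the "tools in a loop" agent pattern; call_model and
    # tools are placeholders, not any vendor's real interface.
    def agent_loop(task, call_model, tools, max_steps=20):
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):
            reply = call_model(history)  # model returns a final answer or a tool request
            history.append(reply)
            if reply.get("tool") is None:
                return reply["content"]  # no tool requested: the task is done
            # Run the requested tool (compiler, test runner, shell, ...) and
            # feed its output back so the model can react to real results.
            result = tools[reply["tool"]](**reply["args"])
            history.append({"role": "tool", "content": str(result)})
        raise RuntimeError("step budget exhausted")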

replies(1): >>44452688 #
52. fho ◴[] No.44444529{4}[source]
Case in point: I've been trying for weeks now to generate a CFD solver that is more than the basic FDM "toy example".

The models clearly know the equations, but run into the same issues I had when implementing it myself (namely exploding simulations, which the models try to paper over by applying more and more relaxation terms).
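
For readers outside the field, the "exploding" failure mode is the classic stability limit of explicit schemes; a minimal 1-D diffusion sketch of my own (not fho's solver):

    # My own minimal illustration of why naive explicit FDM "explodes":
    # the forward-Euler diffusion step is only stable when
    # r = nu * dt / dx**2 <= 0.5.
    import numpy as np

    nu, dx = 0.1, 0.01
    u0 = np.sin(np.linspace(0.0, np.pi, 101))

    def step(u, dt):
        lap = (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2  # periodic Laplacian
        return u + nu * dt * lap

    stable, unstable = u0.copy(), u0.copy()
    for _ in range(500):
        stable = step(stable, 0.4 * dx**2 / nu)      # r = 0.4: smooth decay
        unstable = step(unstable, 0.6 * dx**2 / nu)  # r = 0.6: amplitude diverges
    print(np.abs(stable).max(), np.abs(unstable).max())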

53. dust42 ◴[] No.44444778{3}[source]
I have one for you: implement Gemma 3n multimodal support in llama.cpp.
54. 1dom ◴[] No.44445198{5}[source]
Haha, it was made up on the spot, thank you though! I think your articles and notes are proof that there's a lot of value and use in "trivial" examples. They're very close to the sort of examples a lot of tech people can actually use as individual professional engineers.

I think the void where non-trivial examples should be is the same space where contrarians and the last remaining few of the LLMs-are-useless crowd hang out.

55. edmundsauto ◴[] No.44445225{4}[source]
Current-state AI is best suited to jobs that can be easily verified as correct. In my 20+ years, that describes at least 75% of the work I've ever done. Maybe 99.999% (I have led a very boring career).

There's an enormous amount of value in doing this. As for the harder problems you mentioned: most IC SWEs are also incapable of, or unwilling to do, that work. So maybe the current state has capabilities equivalent to 95% of coders out there? But it works faster, cheaper, and doesn't object to tedious work like documentation. It doesn't require labor-law compliance, hiring, or onboarding/offboarding, and it doesn't cause interpersonal conflict.

56. x0x0 ◴[] No.44446970{3}[source]
I have one: the features I've tried this on in my own codebase, because Claude and Gemini have both failed at them pretty badly.

So it's pretty stupid to just assume that critics haven't tried.

Example feature: send analytics events on app starts triggered by notifications. Both Gemini and Claude completely failed to understand the component tree, rewrote hundreds of lines of code in broken ways, and even when prompted with the difficulty (this is happening outside of the component tree) failed to come up with a good solution. They also made cosmetic code changes to other parts of the files they touched, even when deliberately prompted not to.

57. pydry ◴[] No.44452688{7}[source]
It demonstrated that there is a hard limit on the complexity of puzzle that LLMs can solve, no matter how many tokens they throw at it (using a form of puzzle construction that ensured the LLM couldn't just refer to its training data to solve it).
58. kayge ◴[] No.44457389{3}[source]
The "No True Scotsware" problem? :)