Most active commenters

simonw(9)
steveklabnik(7)
(5)
stavros(5)
Aeolun(4)
petesergeant(4)
pydry(4)
apwell23(3)
ffsm8(3)
nico(3)

Popular/hot comments

>>44439886 #
>>44439075 #
>>44441323 #
>>44439147 #
>>44438330 #
>>44438730 #
>>44438820 #
>>44439540 #

Building a Personal AI Factory

(www.john-rush.com)

1. IncreasePosts ◴[01 Jul 25 21:51 UTC] No.44438316[source]▶

>>44438065 (OP) #

Okay, what is he actually building with this?

I have a problem where half the times I see people talking about their AI workflow, I can't tell if they are talking about some kind of dream workflow that they have, or something they're actually using productively

replies(1): >>44438447 #

2. steveklabnik ◴[01 Jul 25 21:53 UTC] No.44438330[source]▶

>>44438065 (OP) #

I'd love to see more specifics here, that is, how Claude and o3 talk to each other, an example session, etc.

replies(3): >>44438361 #>>44438366 #>>44438730 #

3. skybrian ◴[01 Jul 25 22:00 UTC] No.44438359[source]▶

>>44438065 (OP) #

> It’s essentially free to fire off a dozen attempts at a task - so I do.

What sort of subscription plan is that?

replies(1): >>44438383 #

4. breckenedge ◴[01 Jul 25 22:00 UTC] No.44438361[source]▶

>>44438330 #

I presume via Goose via MCP in Claude Code:

> I also have a local mcp which runs Goose and o3.

replies(1): >>44438374 #

5. steveklabnik ◴[01 Jul 25 22:02 UTC] No.44438374{3}[source]▶

>>44438361 #

Ah, I skimmed the docs for Goose but I couldn't figure out exactly what it is that it does, which is a common issue for docs.

For example: https://block.github.io/goose/docs/category/tutorials/ I just want to see an example workflow before I set this up in CI or build a custom extension to it!

replies(1): >>44438560 #

6. steveklabnik ◴[01 Jul 25 22:03 UTC] No.44438383[source]▶

>>44438359 #

Claude Code's $200 Max subscription can take a lot of usage. I haven't done a dozen things at once, but I have worked on two side projects simultaneously with it before.

ccusage shows me getting over 10x the value of paying via API tokens this month so far...

replies(2): >>44439059 #>>44440118 #

7. photon_garden ◴[01 Jul 25 22:08 UTC] No.44438401[source]▶

>>44438065 (OP) #

It’s hard to evaluate setups like this without knowing how the resulting code is being used.

Standalone vibe coded apps for personal use? Pretty easy to believe.

Writing high quality code in a complex production system? Much harder to believe.

replies(2): >>44438831 #>>44439142 #

8. ClawsOnPaws ◴[01 Jul 25 22:15 UTC] No.44438447[source]▶

>>44438316 #

I keep coming to the same conclusion, which basically is: if I had an LLM write it for me, I just don't care about it. There are 2 projects out of the maybe 50 or so that are LLM generated, and even for those two I cared enough to make changes myself without an LLM. The rest just sit there because one day I thought huh wouldn't it be neat if, and then realized actually I cared more about having that thought than having the result of that thought. Then you end up fighting with different models and implementation details and then it messes up something and you go back and forth about how you actually want it to work, and somehow this is so much more draining and exhausting than just getting the work done manually with some slight completion help perhaps, maybe a little bit of boilerplate fill-in. And yes, this is after writing extensive design docs, then having some reasoning LLM figure out the tasks that need to be completed, then having some models talk back and forth about what needs to happen and while it's happening, and then I spent a whole lot of money on what exactly? Questionably working software that kinda sorta does what I wanted it to do? If I have a clear idea, or an existing codebase, if I end up guiding it along, agents and stuff are pretty cool I guess. But vibe coding? Maybe I'm in the minority here but as soon as it's a non trivial app, not just a random small script or bespoke app kind of deal, it's not fun, I often don't get the results I actually wanted out of it even if I tried to be as specific as I wanted with my prompting and design docs and example data and all that, it's expensive, code is still messy as heck, and at the end I feel like I just spent a whole lot of time actually literally arguing with my computer. Why would I want to do that?

replies(2): >>44438648 #>>44439201 #

9. breckenedge ◴[01 Jul 25 22:33 UTC] No.44438560{4}[source]▶

>>44438374 #

Classic Steve Klabnik comment.

replies(2): >>44438748 #>>44439230 #

10. solomonb ◴[01 Jul 25 22:36 UTC] No.44438580[source]▶

>>44438065 (OP) #

> When something goes wrong, I don’t hand-patch the generated code. I don’t argue with claude. Instead, I adjust the plan, the prompts, or the agent mix so the next run is correct by construction.

I don't think "correct by construction" means what OP thinks it means.

replies(2): >>44438657 #>>44438688 #

11. jwpapi ◴[01 Jul 25 22:47 UTC] No.44438648{3}[source]▶

>>44438447 #

I’ve written a full stack monorepo with over 1,000 files alone now. I’ve started with AI doing a lot of the work, but the percentage goes down and down. For me a good codebase is not about how much you’ve written, but about how it’s architectured. I want to have an app that has the best possible user and dev experience meaning its easy to maintain and easy to extend. This is achieved by making code easy to understand, for yourself, for others.

In my case it’s more like developing a mindset building a framework than to push feature after feature. I would think it’s like that for most companies. You can get an unpolished version of most apps easily, but polishing takes 3-5x the time.

Lets not talk about development robustness, backend security etc etc. Like AI has just way too many slippages for me in these cases.

However I would still consider myself a heavy AI user, but I mainly use it to discuss plans,(what google used to be) or to check it if I’ve forgotten anything.

For most features in my app I’m faster typing it out exactly the way I want it. (with a bit of auto-complete) The whole brain-coordination works better.

I guess long talk, but you’re not alone trust your instinct. You don’t seem narrow minded.

replies(1): >>44438868 #

12. btbuildem ◴[01 Jul 25 22:48 UTC] No.44438657[source]▶

>>44438580 #

Also, aren't they just rolling the dice here? Can you turn down the temperature via Claude Code?

13. ◴[01 Jul 25 22:52 UTC] No.44438688[source]▶

>>44438580 #

14. vFunct ◴[01 Jul 25 22:58 UTC] No.44438727[source]▶

>>44438065 (OP) #

The issue I'm facing with multiple agents working on separate work trees is that each independent agent tends to have completely different ideas on absolutely every detail, leading to inconsistent user experience.

For example, an agent working on the dashboard for the Documents portion of my project has a completely different idea from the agent working on the dashboard for the Design portion of my project. The design consistency is not there, not just visually, but architecturally. Database schema and API ideas are inconsistent, for example. Even on the same input things are wildly different. It seems that if it can be different, it will be different.

You start to update instruction files to get things consistent, but then these end up being thousands of lines on a large project just to get the foundations right, eating into the context window.

I think ultimately we might need smaller language models trained on certain rules & schemas only, instead of on the universe of ideas that a prompt could result in. Small language models are likely the correct path.

replies(2): >>44438788 #>>44438865 #

15. schmookeeg ◴[01 Jul 25 22:59 UTC] No.44438730[source]▶

>>44438330 #

I use Zen MCP and OpenRouter. Every once in awhile, my instance of claude code will "phone a friend" and use Gemini for a code review. Often unprompted, sometimes me asking for "analysis" or "ultrathink" about a thorny feature when I doubt the proposed implementation will work out or cause footguns.

It's wild to see in action when it's unprompted.

For planning, I usually do a trip out to Gemini to check our work, offer ideas, research, and ratings of completeness. The iterations seem to be helpful, at least to me.

Everyone in these sorta threads asks for "proofs" and I don't really know what to offer. It's like 4 cents for a second opinion on what claude's planning has cooked up, and the detailed response has been interesting.

I loaded 10 bucks onto OpenRouter last month and I think I've pulled it down by like 50 cents. Meanwhile I'm on Claude Max @ $200/mo and GPT Plus for another $20. The OpenRouter stuff seems like less than couch change.

$0.02 :D

replies(3): >>44438830 #>>44438942 #>>44439781 #

16. IncreasePosts ◴[01 Jul 25 23:02 UTC] No.44438748{5}[source]▶

>>44438560 #

An uncommon Aaron Breckenridge comment

17. Swizec ◴[01 Jul 25 23:09 UTC] No.44438788[source]▶

>>44438727 #

> each independent agent tends to have completely different ideas on absolutely every detail, leading to inconsistent user experience

> The design consistency is not there, not just visually, but architecturally.

Seniors always gonna have to senior. Doesn't matter if the coders are AI or humans. You have to make sure you provide enough structures for the agents to move in roughly the same direction while allowing enough flexibility that you're not better off just writing the code.

18. marviel ◴[01 Jul 25 23:17 UTC] No.44438820[source]▶

>>44438065 (OP) #

Thanks for the writeup!

I talked about a similar, but slightly simpler workflow in my post on "Vibe Specs".

https://lukebechtel.com/blog/vibe-speccing

I use these rules in all my codebases now. They essentially cause the AI to do two things differently:

(1) ask me questions first (2) Create a `spec.md` doc, before writing any code.

Seems not too dissimilar from yours, but I limit it to a single LLM

replies(3): >>44439633 #>>44440374 #>>44441110 #

19. conradev ◴[01 Jul 25 23:19 UTC] No.44438830{3}[source]▶

>>44438730 #

proof -> show the code if you can!

Then engineers can judge for themselves

replies(1): >>44438882 #

20. kasey_junk ◴[01 Jul 25 23:20 UTC] No.44438831[source]▶

>>44438401 #

I don’t really understand this article or the workflow it’s describing as it’s kind of vague.

But I use multiple agents talking to each other, async agents, git work trees etc on complex production systems as my day to day workflow. I wouldn’t say I go so far as to never change the outputs but I certainly view it as signal when I don’t get the outputs I want that I need to work on my workflow.

21. pjm331 ◴[01 Jul 25 23:25 UTC] No.44438865[source]▶

>>44438727 #

I’ve had success with building the first version of a thing mostly by hand and then telling Claude code to look at it as an example of how to do things when building the next N of them

replies(1): >>44439649 #

22. ozten ◴[01 Jul 25 23:25 UTC] No.44438868{4}[source]▶

>>44438648 #

What does the full stack monorepo do?

replies(1): >>44441394 #

23. schmookeeg ◴[01 Jul 25 23:28 UTC] No.44438882{4}[source]▶

>>44438830 #

Yeahhhhhh I've been to enough code reviews / PR reviews to know this will result in 100 opinions about what color the drapes should be and what a catastrophe we've vibe coded for ourselves. If I shoot something to GH I'll highlight it for others, but nothing yet. I can appreciate this makes me look like I'm shilling.

It makes usable code for my projects. It often gets into the weeds and makes weird tesseracts of nonsense that I need to discover, tear down, and re-prompt it to not do that again.

It's cheap or free to try. It saves me time, particularly in languages I am not used to daily driving. Funnily enough, I get madder when I have it write ts/py/sql code since I'm most conversant in those, but for fringe stuff that I find tedious like AWS config and tests -- it mostly just works.

Will it rot my brain? Maybe? If this thing turns me from an engineer to a PM, well, I'll have nobody to blame but myself as I irritate other engineers and demand they fibonacci-size underdefined jira tix. :D

I think there's going to be a lot of momentum in this direction in the coming year. I'm fortunate that my clients embrace this stuff and we all look for the same hallucinations in the codebase and shut them down and laugh together, but I worry that I'm not exactly justifying my rate by being an LLM babysitter.

24. steveklabnik ◴[01 Jul 25 23:39 UTC] No.44438942{3}[source]▶

>>44438730 #

It’s not about proof: it’s that at this point I’m a fairly heavy Claude Code user and I’d like to up my game, but I’m also not so up on many of these details that I can just figure out how to give this a try just from the description of it. I’m already doing plan-up-front workflows with just Claude, but haven’t figured out some of this more advanced stuff.

I have two MCPs installed (playwright and context7) but it never seems like Claude decides to reach for them on its own.

I definitely appreciate why you’re not posting code, as you said in another comment.

replies(1): >>44440099 #

25. simonw ◴[02 Jul 25 00:00 UTC] No.44439059{3}[source]▶

>>44438383 #

I had to look that up: https://github.com/ryoppippi/ccusage

  npx ccusage@latest

Outputs a table of your token usage over the last few days, which it reads from the jsonl files that Claude Code leaves tucked away in the ~/.claude/ directory.

replies(1): >>44439078 #

26. simonw ◴[02 Jul 25 00:02 UTC] No.44439075[source]▶

>>44438065 (OP) #

My hunch is that this article is going to be almost completely impenetrable to people who haven't yet had the "aha" moment with Claude Code.

That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.

I had it compile, run and debug a Mandelbrot fractal generator in 486 assembly today, executing in Docker on my Mac, just to see how well it could do. It did great! https://gist.github.com/simonw/ba1e9fa26fc8af08934d7bc0805b9...

replies(7): >>44439177 #>>44439259 #>>44439544 #>>44440242 #>>44441017 #>>44441069 #>>44441796 #

27. steveklabnik ◴[02 Jul 25 00:03 UTC] No.44439078{4}[source]▶

>>44439059 #

Don’t sleep on the other options either, the live updates are cool, see where you’re at in the five hour session.

28. dkdcio ◴[02 Jul 25 00:05 UTC] No.44439087[source]▶

>>44438065 (OP) #

I went down this (and even built a bit of internal web tooling) —- it’s like playing multiple games of online poker for me (instead of the factoria analogy here)

it’s really promising, but I found focusing on a single task and doing it well is still more efficient for now. excited for where this goes

29. 9cb14c1ec0 ◴[02 Jul 25 00:15 UTC] No.44439142[source]▶

>>44438401 #

Exactly. I use claude code as a major speedup in coding, but I stay in the loop on every code change to make sure it is creating an optimal system. The few times that I've just let it run have resulted in bugs that customers had to deal with.

replies(2): >>44440091 #>>44440755 #

30. geekymartian ◴[02 Jul 25 00:16 UTC] No.44439147[source]▶

>>44438065 (OP) #

ADHD coding, brute forcing product generation until you get it right? Just freaking write the code that you can expand and modify in the future instead of increasing your carbon footprint.

replies(4): >>44439246 #>>44441190 #>>44441850 #>>44443387 #

31. gerdesj ◴[02 Jul 25 00:18 UTC] No.44439163[source]▶

>>44438065 (OP) #

"Here’s the secret sauce: iterate the inputs":

No it isn't. There are no short cuts to ... anything. You expend a lot of input for a lot of output and I'm not too sure you understand why.

"Example: an agent once wrote code ..." - not exactly world beating.

If you believe this will take over the world, then go full on startup. YC is your oyster.

I've run my own firm for 25 years. Nothing exciting and certainly not YC excitable.

You wont with this.

replies(1): >>44439576 #

32. gerdesj ◴[02 Jul 25 00:20 UTC] No.44439177[source]▶

>>44439075 #

Crack on - this is YC!

Why are you not already a unicorn?

replies(2): >>44439197 #>>44441263 #

33. lucubratory ◴[02 Jul 25 00:24 UTC] No.44439197{3}[source]▶

>>44439177 #

An LLM wrapper does not have serious revenue potential. Being able to do very impressive things with Claude Code has a pretty strict ceiling on valuation because at any point Anthropic could destroy your business by removing access, incorporating whatever you're doing into their core feature set, etc.

replies(1): >>44439530 #

34. tptacek ◴[02 Jul 25 00:25 UTC] No.44439201{3}[source]▶

>>44438447 #

We just had a story last night about a Python cryptography maintainer using Claude to add formally-verified optimizations to LLVM. I think the ship has sailed on skepticism about whether LLMs are going to produce valuable code; you can follow Simon Willison's blog for more examples.

replies(1): >>44441203 #

35. apwell23 ◴[02 Jul 25 00:27 UTC] No.44439217[source]▶

>>44438065 (OP) #

ppl are getting slowly disillusioned with vibe coding.

yes AI assisted workflow might be here to stay but it won't be the magical put programmers out of job thing.

And this the best product market fit for LLMs. I imagine it will be even worse in other domains.

replies(2): >>44439276 #>>44439540 #

36. c4pt0r ◴[02 Jul 25 00:28 UTC] No.44439226[source]▶

>>44438065 (OP) #

Maybe a bit off-topic, but the minimalist style of the blog looks really cool.

37. steveklabnik ◴[02 Jul 25 00:29 UTC] No.44439230{5}[source]▶

>>44438560 #

It's true that I deeply care about docs! Turns out they're good for both humans and LLMs :)

38. cube00 ◴[02 Jul 25 00:34 UTC] No.44439246[source]▶

>>44439147 #

The end goal is to remove the developer from this equation.

Business owner asks for a new CRUD app and there it is in production.

Of course it's full of full of bugs, slow as syrup, saves to a public unauthed database but that's none of my business *gulps scalding hot tea*

replies(1): >>44440407 #

39. zackify ◴[02 Jul 25 00:38 UTC] No.44439259[source]▶

>>44439075 #

If it helps anyone else. I downgraded from Claude max to pro for $20 and the usage limits are really good.

I think they’re trying to compete with Gemini cli and now I’m glad I’m paying less

replies(2): >>44440204 #>>44440222 #

40. azan_ ◴[02 Jul 25 00:44 UTC] No.44439276[source]▶

>>44439217 #

Are they though? I’m seeing more and more people that used gpt4 and got substandard results get blown away with Claude code and opus once they gave it a chance. Also remember that progress has not stopped (whether it has slowed down is also controversial), so I wouldn’t make strong assumptions that ai won’t replace many devs. I hope it won’t, I really like intellectual work associated with it.

41. petesergeant ◴[02 Jul 25 01:33 UTC] No.44439507[source]▶

>>44438065 (OP) #

> I keep several claude code windows open, each on its own git-worktree.

Can someone convince me they're doing their due-diligence on this code if they're using this approach? I am smart and I am experienced, and I have trouble keeping on top of the changes and subtle bugs being created by one Claude Code.

42. petesergeant ◴[02 Jul 25 01:37 UTC] No.44439530{4}[source]▶

>>44439197 #

Having worked with some serious pieces of enterprise software, I don't think this is right. Anthropic is not going to perfect multi-vendor integrations, spin up a support team, and solution architect your problems for you. Enterprise software gets into the walls, and can be very hard to displace once deployed. If you build an LLM-wrapper resume parser, once you've got it into your client's workflows, they're going to find it hard to unembed it to replace it with raw Anthropic.

replies(1): >>44440220 #

43. petesergeant ◴[02 Jul 25 01:40 UTC] No.44439540[source]▶

>>44439217 #

> ppl are getting slowly disillusioned with vibe coding.

This is the absolute polar opposite from my experience. I'm in a large non-tech community with a coders channel, and every day we get a few more Claude Code converts. I would say that vibe-coding is moving into the main-stream with experienced, professional developers who were deeply skeptical a few months ago. It's no longer fancy auto-complete: I have myself seen the magic of wishing a (low importance) front-end app into existence from scratch in an hour or so that would have taken me an order of magnitude more time beforehand.

replies(3): >>44439667 #>>44440413 #>>44441226 #

44. low_common ◴[02 Jul 25 01:41 UTC] No.44439544[source]▶

>>44439075 #

That's a pretty trivial example for one of these IDEs to knock out. Assembly is certainly in their training sets, and obviously docker is too. I've watched cursor absolutely run amok when I let it play around in some of my codebase.

I'm bullish it'll get there sooner rather than later, but we're not there yet.

replies(2): >>44439886 #>>44441960 #

45. tranchebald ◴[02 Jul 25 01:49 UTC] No.44439576[source]▶

>>44439163 #

You come across as a massive hater. Maybe it’s a cultural thing. Do you actually have employees?

46. rolha-capoeira ◴[02 Jul 25 02:02 UTC] No.44439633[source]▶

>>44438820 #

I guess a lot of us are trying this (naturally) as solo devs, where we can take an engineering-first mindset and build a machine or factory that spits out gizmos. I haven't gotten to the finish line, mostly because for me, the holy grail is code confidence via e2e tests that the agent generated (separately, not alongside the implementation).

replies(1): >>44439642 #

47. marviel ◴[02 Jul 25 02:04 UTC] No.44439642{3}[source]▶

>>44439633 #

Totally. Yeah I think your approach is a solid take!

48. swader999 ◴[02 Jul 25 02:06 UTC] No.44439649{3}[source]▶

>>44438865 #

The things that work on a regular dev team translate well to the agentic mode.

49. apwell23 ◴[02 Jul 25 02:10 UTC] No.44439667{3}[source]▶

>>44439540 #

oh yea thats true. I was talking more about ppl who have been vibe coding for a while.

https://www.reddit.com/r/ClaudeAI/comments/1loj3a0/this_pret...

50. Uehreka ◴[02 Jul 25 02:38 UTC] No.44439781{3}[source]▶

>>44438730 #

> Everyone in these sorta threads asks for "proofs" and I don't really know what to offer

I’ve tried building these kinds of multi agent systems a couple times, and I’ve found that there’s a razor thin edge between a nice “humming along” system I feel good about and a “car won’t start” system where the first LLM refuses to properly output JSON and then the rest of them start reading each others <think> thoughts.

The difference seems to often come down to:

- Which LLM wrappers are you using? Are they using/exposing features like MCP, tools and chain-of-thought correctly for the particular models you’re using?

- What are your prompts? What are the 5 bullet points with capital letters that need to be in there to keep things in line? Is there a trick to getting certain LLMs to actually use the available MCP tools?

- Which particular LLM versions are you using? I’ve heard people say that Claude Sonnet 4 is actually better than Claude Opus 4 sometimes, so it’s not always an intuitive “pick the best model” kind of thing.

- Is your system capable of “humming along” for hours or is this a thing where you’re doing a ton of copy-paste between interfaces? If it’s the latter then hey, whatever works for you works for you. But a lot of people see the former as a difficult-to-attain Holy Grail, so if you’ve figured out the exact mixture of prompts/tools that makes that happen people are gonna want to know the details.

The overall wisdom in the post about inputs mattering more than outputs etc is totally spot on, and anyone who hasn’t figured that out yet should master that before getting into these weeds. But for those of us who are on that level, we’d love to know more about exactly what you’re getting out of this and how you’re doing it.

(And thanks for the details you’ve provided so far! I’ll have to check out Zen MCP)

51. simonw ◴[02 Jul 25 02:59 UTC] No.44439886{3}[source]▶

>>44439544 #

I think the hardest problem in computer science right now may be coming up with an LLM demo that doesn't get called "pretty trivial".

replies(14): >>44439918 #>>44440031 #>>44441154 #>>44441225 #>>44441323 #>>44441441 #>>44441638 #>>44441811 #>>44442389 #>>44442493 #>>44443084 #>>44444778 #>>44446970 #>>44457389 #

52. fragmede ◴[02 Jul 25 03:06 UTC] No.44439918{4}[source]▶

>>44439886 #

I think Cloudflare's oauth library qualifies https://news.ycombinator.com/item?id=44159166

replies(1): >>44440627 #

53. webprofusion ◴[02 Jul 25 03:07 UTC] No.44439921[source]▶

>>44438065 (OP) #

The basic idea is that you can continuously document what your system should do (high level and detailed features), how it should prove it has done that, optionally how you want it to do it (architecture and code style etc).

The multi-model AI part is just the (current) tool to help avoid bias and make fine tuned selections for certain parts of the task.

Eventually large complex systems will be built and re-built from a set of requirements and software will finally match the stated requirements. The only "legacy code" will be legacy requirements specifications. Fix your requirements, not the generated code.

replies(1): >>44441787 #

54. guicen ◴[02 Jul 25 03:07 UTC] No.44439924[source]▶

>>44438065 (OP) #

This "AI factory for everyone" model may be able to break resource inequality and allow people from more places to participate in truly valuable entrepreneurship.

55. namuol ◴[02 Jul 25 03:16 UTC] No.44439961[source]▶

>>44438065 (OP) #

No real mention of results that aren’t self-referential.

I guess vibe-coding is on its way to becoming the next 3D printing: Expensive hobby best suited for endless tinkering. What’s today’s vibe coding equivalent of a “benchy”? Todo apps?

replies(2): >>44440003 #>>44440029 #

56. ◴[02 Jul 25 03:23 UTC] No.44440003[source]▶

>>44439961 #

57. SchemaLoad ◴[02 Jul 25 03:31 UTC] No.44440029[source]▶

>>44439961 #

3D printing actually is useful though. Basically everyone designing products or any kind of engineering is using it. The only reason it never took off for the average consumer is that every pre designed piece of plastic junk you could ever want to download and print is already available from Amazon.

In a pre online shopping world 3D printing would be far more useful for the average person. Going forward it looks like it's only really useful for people who can design their own files for actually custom stuff you can't buy.

replies(1): >>44440160 #

58. skydhash ◴[02 Jul 25 03:32 UTC] No.44440031{4}[source]▶

>>44439886 #

Because they are trivial in a way that you can go on GitHub and copy one of those while not pretending LLM isn't a mashup of the internet.

What people agree on being non-trivial is working on a real project. There's a lot of opensource projects that could benefit from a useful code contribution. But they only got slop thrown at them.

replies(1): >>44440066 #

59. am17an ◴[02 Jul 25 03:36 UTC] No.44440044[source]▶

>>44438065 (OP) #

I actually don't understand how you can offload the instruction pointer of the program to another program, permanently. How are you accountable for anything then? You can't debug, you can't program, just a tourist in your own home. Own your code, even if AI wrote it.

60. Aeolun ◴[02 Jul 25 03:45 UTC] No.44440091{3}[source]▶

>>44439142 #

I think you can probably get a pretty decent thing going if you have models review output they haven’t written themselves (not still in context anyway)

61. Aeolun ◴[02 Jul 25 03:51 UTC] No.44440099{4}[source]▶

>>44438942 #

> I have two MCPs installed (playwright and context7) but it never seems like Claude decides to reach for them on its own.

Not even when you add ‘memories’ that tell it to always use those tools in certain situations?

My admonitions to always run repomix at the start of coding, and always run the build command before crying victory seem to be followed pretty well anyway.

replies(2): >>44440327 #>>44440488 #

62. Aeolun ◴[02 Jul 25 03:54 UTC] No.44440118{3}[source]▶

>>44438383 #

Given you can nearly run two full code instances with Opus, and Opus is claimed to be 5x more expensive than Sonnet, you can maybe do 10 sonnet instances at the same time?

63. namuol ◴[02 Jul 25 04:07 UTC] No.44440160{3}[source]▶

>>44440029 #

Yeah I’m not saying either aren’t useful, just that they can both be a trap for tinkerers.

64. ffsm8 ◴[02 Jul 25 04:16 UTC] No.44440204{3}[source]▶

>>44439259 #

you will run through the pro rate limiting within <1h if you do it the way the article lays out.

But yeah, if you're babysitting a single agent, only applying after reading what it wants to do ... You'll be fine for 3-4 hours before the token limit refreshed after the 5th

replies(2): >>44440718 #>>44440721 #

65. skydhash ◴[02 Jul 25 04:18 UTC] No.44440218{6}[source]▶

>>44440066 #

I took the time to investigate the work being done there (all those years learning assembly and computer architecture come in handy), and it confirms (to me) that the key aspect of using LLM is pattern matching. Meaning you know that there's a solution out there (in this case, anything involving multiplying/dividing by a power of 2 can use such trick) and framing your problem (intentionally or not) and you'll get a derived text that will contain a possible solution.

But there's nothing truly novel in the result. The key aspect is being similar enough to something that's already in the training data so that the LLM can extrapolate the rest. The hint can be quite useful and sometimes you have something that shorten the implementation time, but you have to at least have some basic understanding of the domain in order to recognize the signs.

The issue is that the result is always tainted by your prompt. The signs may be there because of your prompt and not because there's some kind of data that need s to be explored further. And sometimes it's a bad fit, similar but different (what you want and what you get). So for the few domain that's valuable to me, I prefer to construct my own mental database that can lead me to concrete artifacts (books, articles, blog posts,...) that exists outside the influence of my query.

ADDENDUM

I can use LLMs with great results and I've done so. But it's more rewarding (and more useful to me) to actually think through the problem and learning from references. Instead of getting a perfect (or wobbly or the wrong category) circle that fits my query, I go to find a strange polygon formed (by me) from other strange polygon. Then because I know I need a circle, I only need to find its center and its radius.

It's slower, but the next time I need another circle (or a square) from the same polygon, it's going to be faster and faster.

66. ffsm8 ◴[02 Jul 25 04:19 UTC] No.44440220{5}[source]▶

>>44439530 #

But if you did become a unicorn, It would suddenly become very easy to replace for anthropic, because they're the ones actually providing the sauce and can just replicate your efforts. So your window of opportunity is to be too small for anthropic to notice and get interested. That can't be called unicorn

That was the point he was making, at least that's how I understood it

replies(1): >>44442649 #

67. csomar ◴[02 Jul 25 04:19 UTC] No.44440222{3}[source]▶

>>44439259 #

I am on max and burning daily (ccusage) roughly my monthly subscription. It is not clear whether the API is very overpriced or we are getting aggressively subsidized. I can afford $100-200/month but not $3.000. Let's hope this last for a good while as GitHub copilot turned off the tap on unlimited usage very recently.

68. csomar ◴[02 Jul 25 04:24 UTC] No.44440242[source]▶

>>44439075 #

That's a very simple example/context that I suspect most LLMs will be able to knock out with minimal frustration. I had much more complex Rust dependency upgrade done on a 30+ iterations on very custom code (wasm stuff where training data is probably scarce). Claude would ping context7 and mcp-lsp to get details. You do find its limits after a while though and as you push it harder.

replies(1): >>44440355 #

69. steveklabnik ◴[02 Jul 25 04:45 UTC] No.44440327{5}[source]▶

>>44440099 #

I have not done that, maybe that's the missing bit. Thanks!

70. nico ◴[02 Jul 25 04:54 UTC] No.44440355{3}[source]▶

>>44440242 #

> That's a very simple example/context that I suspect most LLMs will be able to knock out with minimal frustration

Yes an No. You are right that it's a relatively small project. However, I've had really bad experiences trying to get ChatGPT (any of their models) to write small arm64 assembly programs that can compile and run on apple silicon

71. ◴[02 Jul 25 05:01 UTC] No.44440374[source]▶

>>44438820 #

72. nico ◴[02 Jul 25 05:02 UTC] No.44440384[source]▶

>>44438065 (OP) #

> If you know Factorio you know it’s all about building a factory that can produce itself

This is a very interesting concept

Could this be extended to the point of an LLM producing/improving itself?

If not, what are the current limitations to get to that point?

replies(1): >>44441864 #

73. 6510 ◴[02 Jul 25 05:06 UTC] No.44440407{3}[source]▶

>>44439246 #

You have users fill out bug reports then throw some buckets of money at it.

You could even add a magic button for when things don't work that reruns the same prompt and possibly get better results.

A slot machine animation while waiting would be cool.

replies(1): >>44440786 #

74. ◴[02 Jul 25 05:07 UTC] No.44440413{3}[source]▶

>>44439540 #

75. manmal ◴[02 Jul 25 05:27 UTC] No.44440488{5}[source]▶

>>44440099 #

What do you tell Claude to do with repomix? Get an overview into the context?

replies(1): >>44451220 #

76. gen6acd60af ◴[02 Jul 25 05:57 UTC] No.44440627{5}[source]▶

>>44439918 #

This one?

>Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.

>To emphasize, this is not "vibe coded". Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.

Some time later...

https://github.com/advisories/GHSA-4pc9-x2fx-p7vj / CVE-2025-4143

>The OAuth implementation in workers-oauth-provider that is part of MCP framework https://github.com/cloudflare/workers-mcp, did not correctly validate that redirect_uri was on the allowed list of redirect URIs for the given client registration.

replies(1): >>44443019 #

77. stpedgwdgfhgdd ◴[02 Jul 25 06:18 UTC] No.44440718{4}[source]▶

>>44440204 #

Same experience. One terminal window with Pro is okay. Multiple CC running in parallel not.

We most likely implement a policy that starters in our company can use Pro. Power users need Max.

78. zxexz ◴[02 Jul 25 06:19 UTC] No.44440721{4}[source]▶

>>44440204 #

I've heard that if you have several relatively active separate sessions open, the limit is a little less restrictive. Especially if you do a /clear and continue your session on a different project. Honestly, a lot of Claude Code seems vibecoded if you look at the client side, too. Can't tell if I'm surprised that the backend has an element of that, too. Hey, dogfood tastes good - I respect them for that.

79. stpedgwdgfhgdd ◴[02 Jul 25 06:26 UTC] No.44440755{3}[source]▶

>>44439142 #

I noticed that i hardly look anymore at the generated Go code… I do give a lot of attention to the tests. I let CC write some failing tests, implement. Let it run against some real scenarios. Find bugs, let it write more tests, fix. And iterate.

Writing this I realise, i should more clearly separate the functional tests from the implementation oriented unit tests.

80. danielbln ◴[02 Jul 25 06:31 UTC] No.44440786{4}[source]▶

>>44440407 #

It's like a salt mine in here. Go ahead and hand weave your copper cable code while the world moves on and accelerates. Will there be slop along the way? Oh hell yes.

The Model T car was notorious for blowing out tires left and right, to the point that a carriage might have been less hassle at times. Yet here we are.

replies(2): >>44443128 #>>44461888 #

81. dgunay ◴[02 Jul 25 07:09 UTC] No.44440949[source]▶

>>44438065 (OP) #

I am experimenting with a similar workflow and thought I'd share my experience.

I might be a little too hung up on the details compared to a lot of these agent cluster testimonials I've read, but unlike the author I'll be open and say that the codebase I work on is several hundred thousand lines of Go and currently does serve a high 5 to low 6 figure number of real, B2C users. Performance requirements are forgiving but correctness and reliability are very important. Finance.

Currently I use a very basic setup of scripts that clone a repo, configure an agent, and then run it against a prompt in a tmux session. I rely mainly on codex-cli since I am only given an OpenAI key to work with. The codex instances ping me in my system notifications when it's my turn, and I can easily quake-mode my terminal into view and then attach to the session (with a bit of help from fzf). I haven't gotten into MCP yet but it's on my radar.

I can sort of see the vision. For those small but distracting tasks, they are very helpful and I (mostly) passively produce a lot more small PRs to clean up papercuts around our codebase now. The "cattle not pets" mentality remains relevant - I just fire off a quick prompt when I feel the urge to get sidetracked on something minor.

I haven't gotten as much out of them for more involved tasks. Maybe I haven't really got enough of a context flywheel going yet, but I do typically have to intervene most of the time. Even on a working change, I always read the generated code first and make any edits for taste before submitting it for code review since I still view the output as my complete responsibility.

I still mostly micromanage the change control process too (branching, making commits, and pushing). I've dabbled in tools that can automate this but haven't gotten around to it.

I 100% resonate with the "fix the inputs, not the outputs" mindset as well. It's incredibly powerful without AI and our industry has been slowly but surely adopting it in more places (static typing, devops, IAC, etc). With nondeterministic processes like LLMs though it feels a lot harder to achieve, more like practice and not science.

replies(1): >>44441040 #

82. nilirl ◴[02 Jul 25 07:12 UTC] No.44440959[source]▶

>>44438065 (OP) #

"Fix inputs" => The assumption is there exists some perfect input that will give you exactly what you want.

It probably works well for small inputs and tasks well-represented in the training data (like writing code for well-represented domains).

But how does this work for old code, large codebases, and emergencies?

- Do you still "learn" the system like you used to before?

- How do you think of refactoring if you don't get a feel for the experience of working through the code base?

Overall: I like it. I think this adds speed for code that doesn't need to be reinvented. But new domains, new tools, new ways to model things, the parts that are fun to a developer, are still our monsters to slay.

replies(1): >>44441098 #

83. com2kid ◴[02 Jul 25 07:24 UTC] No.44441017[source]▶

>>44439075 #

> That's the moment when you let "claude --dangerously-skip-permissions" go to work on a difficult problem and watch it crunch away by itself for a couple of minutes running a bewildering array of tools until the problem is fixed.

Eh, I just watched Claude spend an hour trying to incorrectly fix code. Eventually I realized what was happening, stepped in and asked it to write a bunch of unit tests first, get the code working against those unit tests, and then get back to me.

Claude Code is amazing, but I still have to step in and give it basic architectural guidance again and again.

84. barrenko ◴[02 Jul 25 07:26 UTC] No.44441040[source]▶

>>44440949 #

There's been a lot of talk recently (with "recently" being measured in days for the agents field) about context management, but I'm having the hardest time managing my own context when using these methods.

85. barrenko ◴[02 Jul 25 07:30 UTC] No.44441057[source]▶

>>44438065 (OP) #

> Is 'Azure OpenAI subscription' cheaper than ChatGPT via OpenAI?

86. barrenko ◴[02 Jul 25 07:33 UTC] No.44441069[source]▶

>>44439075 #

https://youtu.be/bUBF5V6oDKw I'd like to add this video from the AI Engineer conf, which may also be impenetrable, by the folks from Dagger (person behind Docker)

87. myflash13 ◴[02 Jul 25 07:38 UTC] No.44441098[source]▶

>>44440959 #

> But how does this work for old code, large codebases, and emergencies?

Have you actually tried Claude Code? It works pretty well on my old code, medium size SaaS codebase. I’ve had it build entire features end to end in (backend, front end, data migrations, tests) in one or two prompts.

88. myflash13 ◴[02 Jul 25 07:41 UTC] No.44441110[source]▶

>>44438820 #

Claude Code now handles this natively with “plan mode”. Bit slow and annoying to do it manually with .md files in my opinion.

replies(1): >>44445179 #

89. th0ma5 ◴[02 Jul 25 07:52 UTC] No.44441154{4}[source]▶

>>44439886 #

Maybe you should try something other than demos? Have you tried creating a reliable system?

replies(1): >>44443451 #

90. mmarian ◴[02 Jul 25 07:54 UTC] No.44441164[source]▶

>>44438065 (OP) #

And here I am struggling to get Claude to create a nice-looking search bar a la booking.com , with some adjustments for my personal use case; it does ok, but never gets to the end result and once I refreshed my Tailwind knowledge it felt much slower than hand coding. I feel like I'm living in a different world.

replies(2): >>44441308 #>>44441660 #

91. stavros ◴[02 Jul 25 07:58 UTC] No.44441190[source]▶

>>44439147 #

Man, programming has changed forever, and the sooner you realize that, the better for you. Saying "write the code" is like telling people to shoe their own horses instead of dealing with them newfangled cars that can break down.

replies(1): >>44441875 #

92. stavros ◴[02 Jul 25 08:00 UTC] No.44441203{4}[source]▶

>>44439201 #

I don't understand people who are sceptical about whether LLMs can give value. We're way past that, now at the stage where we're trying to figure out how to extract the most value out of them, but I guess humans don't like change much.

replies(1): >>44464146 #

93. jkhdigital ◴[02 Jul 25 08:06 UTC] No.44441225{4}[source]▶

>>44439886 #

No the hardest problem is teaching CS undergrads. I just started this year (no background in academia, just 75% of a PhD and well-rounded life experience) and I’ve basically torn up the entire curriculum they handed to me and started vibe-teaching.

94. stavros ◴[02 Jul 25 08:06 UTC] No.44441226{3}[source]▶

>>44439540 #

I don't doubt that LLMs are extremely useful for making simple things quickly. I haven't been able to get them to write hard code on their own, though. I was trying to make a sound card with a Pi Pico the other day, and had crackling and popping in the audio. I kept telling Opus to fix that, it kept being absolutely convinced it knows what the problem is every time, and went through multiple iterations of being absolutely sure it will solve the problem this time (with every time bringing a different reason for why the pops are there), and spent $35.

In the end, it had written 500 lines, the problem was still there, and the code didn't work any differently. It worries me that I don't know what those 500 lines were for.

In my experience, LLMs are amazing for writing 10-20 lines at a time, while you review and fix any errors. If I let them go to town on my code, I've found that's an expensive way to get broken code.

replies(1): >>44441748 #

95. sussmannbaka ◴[02 Jul 25 08:15 UTC] No.44441263{3}[source]▶

>>44439177 #

As it turns out, the VC potential of Mandelbrot and HelloWorld.py are quite limited :o)

replies(1): >>44441664 #

96. hamstergene ◴[02 Jul 25 08:23 UTC] No.44441308[source]▶

>>44441164 #

I think coding assistants aren't great at UI/UX yet because they can't see, their understanding of left/right/lighter/darker is guessed from textual descriptions that accompanied CSS tutorials but they are never actually imagining the looks of what they are working with. I had Cursor repeatedly fix and mess up a CSS grid, over and over again, until I switched to HTML table so that browser would handle layout. Once switched from visuals ("leftmost") to semantics ("first cell in a row") the agent immediately started getting tasks done right.

I guess keep them on backend/library tasks for now. I am sure the companies are already working on getting a snapshot of a browser page and feeding it back into multimodal model so it can comprehend what "looking" means.

replies(1): >>44441405 #

97. 1dom ◴[02 Jul 25 08:27 UTC] No.44441323{4}[source]▶

>>44439886 #

I'm very pro LLM and AI. But I completely agree with the comment about how many pieces praising LLMs are doing so with trivial examples. Trivial might not be the right word, but I can't think of a better one that doesn't have a negative connotation, but this shouldn't be negative. Your examples are good and useful, and capture a bunch of tasks a software engineer would do.

I'd say your mandelbrot debug and the LLVM patch are both "trivial" in the same sense: they're discrete, well defined, clear-success-criteria-tasks that could be assigned to any mid/senior software engineer in a relevant domain and they could chip through it in a few weeks.

Don't get me wrong, that's an insane power and capability of LLMs, I agree. But ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Non-trivial examples are things that would take a team of different specialist skillsets months to create. One obvious potential reason why there's few non-trivial AI examples is because non-trivial AI examples require non-trivial amount of time to be able to generate and verify.

A non-trivial example isn't an example you can look at the output and say "yup, AI's done well here". It requires someone spends time going into what's been produced, assessing it, essentially redesigning it as a human to figure out all the complexity of a modern non-trivial system to confirm the AI actually did all that stuff correctly.

An in depth audit of a complex software system can take months or even years and is a thorough and tedious task for a human, and the Venn diagrams of humans who are thinking "I want to spend more time doing thorough, tedious code tasks" and "I want to mess around with AI coding" is 2 separate circles.

replies(7): >>44441342 #>>44441663 #>>44441824 #>>44441879 #>>44443505 #>>44444529 #>>44445225 #

98. sokoloff ◴[02 Jul 25 08:31 UTC] No.44441342{5}[source]▶

>>44441323 #

> ultimately it's just doing a day job that millions of people can do sleep deprived and hungover.

Doing for < $10 and under an hour what could be done in a few weeks by $10K+ worth of senior staff time is pretty valuable.

replies(1): >>44441546 #

99. caporaltito ◴[02 Jul 25 08:37 UTC] No.44441364[source]▶

>>44438065 (OP) #

Show us the code, mate.

100. jwpapi ◴[02 Jul 25 08:42 UTC] No.44441394{5}[source]▶

>>44438868 #

It’s nothing special. Not in the realm of anything technical outstanding. I just stated that to emphasize that it’s a slightly bigger project than default single-dev coded SAAS projects which are just a single wrapper. We have workers, multiple white-labeled applications sharing a common infrastructure, data scraping modules, AI-powered services, and email processing pipelines.

I’ve had an impossible learning curve the last year, but as I should rather be vibe-coded biased I still use less AI now to make sure it’s more consistent.

I think the two camps are different in terms of skill honestly, but also in terms of needs. Like of course you are faster vibe-coding a front-end then to write the code manually, but build a robust backend/processing system its a different kind of tier.

So instead of picking a side it’s usually best to stay as unbiased as possible and choose the right tool for the task

101. mmarian ◴[02 Jul 25 08:44 UTC] No.44441405{3}[source]▶

>>44441308 #

Thx for sharing your experience, good to know I'm not the only one struggling ^_^ The advice makes sense as well.

102. hamish-b ◴[02 Jul 25 08:50 UTC] No.44441435[source]▶

>>44438065 (OP) #

This sounds great, and is similar to the workflow I get from a high level stand point with https://ampcode.com/ - albeit without the model wrangling.

To the author & anyone reading - publicly release your agent harnesses, even if its shit or vibe coded! I am constantly iterating on my meta and seeking to improve.

103. cranium ◴[02 Jul 25 08:51 UTC] No.44441441{4}[source]▶

>>44439886 #

Instead of "pretty trivial", I'd say it's "well-defined and generally understood".

The implicit decisions it had to make were also inconsequential, eg. selection of ASCII chars, color or not, bounds of the domain,...

However, it shows that agents are powerful translators / extractors of general knowledge!

104. nurettin ◴[02 Jul 25 08:54 UTC] No.44441470[source]▶

>>44438065 (OP) #

This sounds nice and great and all, but I wonder what the output is like and if there is a measurable difference between doing the factory and trying to two shot the whole thing with claude 4 sonnet.

105. 1dom ◴[02 Jul 25 09:10 UTC] No.44441546{6}[source]▶

>>44441342 #

If it's something a single senior staff member can do, then - personally - I'd consider it not complex, it's relatively trivial: it can be done by literally a single person.

I'm pro AI, I'm not saying it's not valuable for trivial things. But that's a distinct discussion to the trivial nature of many LLM examples/demos in relation to genuinely complex computer systems.

replies(1): >>44443727 #

106. sroussey ◴[02 Jul 25 09:25 UTC] No.44441638{4}[source]▶

>>44439886 #

Convert react-stockcharts to react v19. I’ve tried Claude Code and Cursor but only ended up with hilariously bad results.

replies(1): >>44443541 #

107. derencius ◴[02 Jul 25 09:28 UTC] No.44441660[source]▶

>>44441164 #

I use Claude and Cursor in parallel. cursor is doing great on the ui, I took quick screenshots to instruct the changes I wanted and it got it right.

replies(1): >>44441853 #

108. sroussey ◴[02 Jul 25 09:29 UTC] No.44441663{5}[source]▶

>>44441323 #

LLMs are best demonstrated with greenfield examples.

replies(1): >>44441830 #

109. addandsubtract ◴[02 Jul 25 09:29 UTC] No.44441664{4}[source]▶

>>44441263 #

Bakeries have been in business for thousands of years. Should be pretty easy to sell Mandelbrot everywhere around the world.

110. petesergeant ◴[02 Jul 25 09:42 UTC] No.44441748{4}[source]▶

>>44441226 #

> I haven't been able to get them to write hard code on their own, though

For sure, and me neither, for what it's worth. But most of the code I write isn't "hard" code; the hard code is also the stuff I enjoy writing the most. I will note that a few months ago I found them helpful for small things inside the GPT window, and then tried agentic mode (specifically Roo, then Claude Code), and have seen a huge speedup in my ability to get stuff done.

replies(1): >>44441940 #

111. qiine ◴[02 Jul 25 09:50 UTC] No.44441787[source]▶

>>44439921 #

sorry but again...

https://i.pinimg.com/736x/03/af/06/03af0602a8caa51507717edd6...

replies(1): >>44443182 #

112. CjHuber ◴[02 Jul 25 09:51 UTC] No.44441796[source]▶

>>44439075 #

Is it that much better than Codex?

113. j45 ◴[02 Jul 25 09:54 UTC] No.44441811{4}[source]▶

>>44439886 #

Many big problems are made up of small problems.

114. j45 ◴[02 Jul 25 09:57 UTC] No.44441824{5}[source]▶

>>44441323 #

There is a scale somewhere in these types of articles that will emerge.

It might be something being actually new (cutting edge) vs new to someone vs the human mind wanting to have it be novel and different enough as a comparable percentage of the experience of the first time using ChatGPT 4.

There is also the wiring of non-deterministic software frameworks and architectures compared to the deterministic only software development we're used to.

The former is a different thing than the latter.

115. j45 ◴[02 Jul 25 09:58 UTC] No.44441830{6}[source]▶

>>44441663 #

Plus, applying non-deterministic algorithms in a deterministic way might not always work the same. The software developers are also changing the frames and terms of reference.

116. NitpickLawyer ◴[02 Jul 25 10:02 UTC] No.44441850[source]▶

>>44439147 #

> Just freaking write the code that you can expand and modify in the future instead of

Why is it always this argument? Is it that hard to believe that you can get recent coding assistants to write expandable and maintainable code in 0shot? Have you tried just ... asking for that type of code?

117. mmarian ◴[02 Jul 25 10:03 UTC] No.44441853{3}[source]▶

>>44441660 #

Cheers! It's hard to keep track of what's good for what.

118. NitpickLawyer ◴[02 Jul 25 10:05 UTC] No.44441864[source]▶

>>44440384 #

> Could this be extended to the point of an LLM producing/improving itself?

Check out aider writing aider stats here: https://aider.chat/HISTORY.html

replies(1): >>44449136 #

119. voidUpdate ◴[02 Jul 25 10:06 UTC] No.44441875{3}[source]▶

>>44441190 #

Didn't they say "programming has changed forever" about web3 stuff? I've not heard much about that recently

replies(1): >>44441944 #

120. sundache ◴[02 Jul 25 10:07 UTC] No.44441879{5}[source]▶

>>44441323 #

I only see 148 lines of assembly and a dockerfile that's 7 lines long. Am I missing something or should that take a human less then several weeks.

replies(1): >>44442232 #

121. stavros ◴[02 Jul 25 10:16 UTC] No.44441940{5}[source]▶

>>44441748 #

Agreed, I no longer have to write the same code for the Nth time, or spend two minutes times a hundred looking up API docs. I love it.

replies(1): >>44446348 #

122. stavros ◴[02 Jul 25 10:16 UTC] No.44441944{4}[source]▶

>>44441875 #

I don't know who "they" is, but I never said that.

123. Havoc ◴[02 Jul 25 10:20 UTC] No.44441960{3}[source]▶

>>44439544 #

I suspect personal tools are as close as we're going to get to this mythical demo that satisfies all critics. i.e. here is a list of problems i've solved with just AI.

Strikes a balance between simplicity and real world usefulness

replies(1): >>44443461 #

124. dotancohen ◴[02 Jul 25 11:05 UTC] No.44442232{6}[source]▶

>>44441879 #

Depends on what's in those 148 lines.

125. GTP ◴[02 Jul 25 11:21 UTC] No.44442359[source]▶

>>44438065 (OP) #

> That loop is the factory: the code itself is disposable; the instructions and agents are the real asset.

Why do I hear the words "technical debt"? More to the point, the risk I see with this approach is that the author would end hp throwing away working and well tested code to implement some minor change. This has an high risk of introducing many easily avoidable bugs.

126. afro88 ◴[02 Jul 25 11:25 UTC] No.44442389{4}[source]▶

>>44439886 #

It coming from computer science might be the issue. There's a lot of open source repos out there that have tricky bugs, and todo lists of features that are too complex or time consuming for casual contributors to tackle. Adding significant value to an open source project is a pretty nice demo that won't get called "pretty trivial".

Can't be too far off!

127. raxxorraxor ◴[02 Jul 25 11:42 UTC] No.44442493{4}[source]▶

>>44439886 #

The complexity of the problem masqerades the common problem of providing sensible context to your AI of choice to have it doing something constructive in your personal codebase. Or giving it tools to check the truth of one of its assertions. Something a developer does countless times.

128. lucubratory ◴[02 Jul 25 11:58 UTC] No.44442649{6}[source]▶

>>44440220 #

She, but yes.

replies(1): >>44442989 #

129. ffsm8 ◴[02 Jul 25 12:34 UTC] No.44442989{7}[source]▶

>>44442649 #

https://upload.wikimedia.org/wikipedia/en/f/f8/Internet_dog....

130. kentonv ◴[02 Jul 25 12:38 UTC] No.44443019{6}[source]▶

>>44440627 #

Sorry, my code has bugs sometimes.

131. pydry ◴[02 Jul 25 12:45 UTC] No.44443084{4}[source]▶

>>44439886 #

Really? This paper cut through the same kind of bullshit with puzzles: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinkin...

What do you think is so difficult about doing the same thing with coding problems?

replies(1): >>44443442 #

132. hagbarth ◴[02 Jul 25 12:50 UTC] No.44443128{5}[source]▶

>>44440786 #

I am not certain this is acceleration in anything other than lines of code.

133. pydry ◴[02 Jul 25 12:56 UTC] No.44443182{3}[source]▶

>>44441787 #

It is weirdly easy to create a language that expresses specifications that is more complex and difficult to understand than the code which implements it. E.g. Z notation.

134. slightwinder ◴[02 Jul 25 13:20 UTC] No.44443387[source]▶

>>44439147 #

> ADHD coding, brute forcing product generation until you get it right?

Are we now pretending that humans aren't doing the same? Sure, it's usually on a higher level, but at the end we are also just brute forcing our way toward a solution through trial and error, and if someone is very experienced in the problem-domain, they can do it mostly in their head.

> carbon footprint

So if the AI-datacentre is running on renewables, you would be OK with this?

135. simonw ◴[02 Jul 25 13:26 UTC] No.44443442{5}[source]▶

>>44443084 #

I don't understand the connection between that paper and my comment.

replies(1): >>44443831 #

136. ◴[02 Jul 25 13:26 UTC] No.44443451{5}[source]▶

>>44441154 #

137. simonw ◴[02 Jul 25 13:27 UTC] No.44443461{4}[source]▶

>>44441960 #

I tried that with https://tools.simonwillison.net/colophon - over 100 personal tools, some of which I use on a daily basis.

138. simonw ◴[02 Jul 25 13:31 UTC] No.44443505{5}[source]▶

>>44441323 #

> Non-trivial examples are things that would take a team of different specialist skillsets months to create.

Thank you for providing a spelled out definition of "non-trivial" there!

replies(1): >>44445198 #

139. mfalcon ◴[02 Jul 25 13:32 UTC] No.44443522[source]▶

>>44438065 (OP) #

"Outputs are disposable; plans and prompts compound."

I agree with this and it aligns with the general opinion about what is the true value the SWE's bring to the table.

140. simonw ◴[02 Jul 25 13:34 UTC] No.44443541{5}[source]▶

>>44441638 #

I had great success with o4-mini via ChatGPT for they kind of upgrade, since of can use its search tool to look up what's changed.

I used this prompt a few weeks ago:

> This code needs to be upgraded to the new recommended JavaScript library from Google. Figure out what that is and then look up enough documentation to port this code to it.

https://simonwillison.net/2025/Apr/21/ai-assisted-search/#la...

141. simonw ◴[02 Jul 25 13:51 UTC] No.44443727{7}[source]▶

>>44441546 #

Maybe the definition of "non-trivial" in these conversations should be defined as "stuff an LLM system can't do yet".

142. pydry ◴[02 Jul 25 13:59 UTC] No.44443831{6}[source]▶

>>44443442 #

They created an environment to expose LLMs to problems and test their performance which were immune from benchmark hacking using puzzles.

Your comment was about how this was unreasonably hard (for coding challenges).

Anecdotally Ive seen LLMs do all sorts of amazing shit which was obviously drawn from their training set and fall flat on their faces doing simple coding tasks which are novel enough to not appear in the training set.

replies(1): >>44444237 #

143. simonw ◴[02 Jul 25 14:33 UTC] No.44444237{7}[source]▶

>>44443831 #

That Apple paper mainly demonstrated that "reasoning" LLMs - with no access to additional tools - can't solve problems that deliberately exceed their token context length.

I don't think it has much relevance at all to a conversational about how good LLMs are at solving programming problems by running tools in a loop.

I keep seeing this idea that LLMs can't handle problems that aren't in their training data and it's frustrating because anyone who has spent significant time working with these systems knows that it obviously isn't true.

replies(1): >>44452688 #

144. fho ◴[02 Jul 25 14:55 UTC] No.44444529{5}[source]▶

>>44441323 #

Point in case: i've been trying for weeks now to generate a CFD solver that is more than the basic FDM "toy example".

The models clearly know the equations, but run into the same issues I had when implementing it myself (namely exploding simulations that the models try to paper over by applying more and more relaxation terms).

145. dust42 ◴[02 Jul 25 15:16 UTC] No.44444778{4}[source]▶

>>44439886 #

I have one for you: implement gemma 3n multimodel support in llama.cpp

146. marviel ◴[02 Jul 25 15:47 UTC] No.44445179{3}[source]▶

>>44441110 #

Yeah just learned about this!

The md files are actually pretty great for shareability, versioning, and picking up where you left off.

147. 1dom ◴[02 Jul 25 15:49 UTC] No.44445198{6}[source]▶

>>44443505 #

Haha, it was made up on the spot, thank you though! I think your articles and notes are proof that there's a lot of value and use in "trivial" examples. They're very close to the sort of examples a lot of tech people can actually use as individual professional engineers.

I think the void where non-trivial examples should be is the same space where contrarians and the last remaining few LLMs-are-useless crowd hangout.

148. edmundsauto ◴[02 Jul 25 15:50 UTC] No.44445225{5}[source]▶

>>44441323 #

Current state AI is a best fit for jobs that can be easily verified as correct. In my 20+ years, this is at least 75% of the work I’ve ever done. Maybe 99.999% (I have led a very boring career.)

There’s an enormous amount of value in doing this. For the harder problems you mentioned - most IC SWE are also incapable or unwilling to do the work. So maybe the current state has equivalent capabilities to 95% of coders out there? But it works faster, cheaper, and doesn’t object to tedious work like documentation. It doesn’t require labor law compliance, hiring, onboarding/offboarding, or cause interpersonal conflict.

149. codemonkey-zeta ◴[02 Jul 25 16:53 UTC] No.44445982[source]▶

>>44438065 (OP) #

> Because most of my day-to-day is in clojure I tend to use sonnet 4 to get the parens right.

In case the author is lurking, you may want to apply the same fix they do in clojure-mcp: https://github.com/bhauman/clojure-mcp/blob/8150b855282babcd...

The insight that team had was that LLMs get confused with parens, but they are excellent at indentation, so if you run parinfer over the LLMs output it will be correct in 99% of cases.

150. apwell23 ◴[02 Jul 25 17:21 UTC] No.44446348{6}[source]▶

>>44441940 #

> write the same code for the Nth time

who does this though ? maybe you should extract that into a library/method/abstraction ?

151. x0x0 ◴[02 Jul 25 18:09 UTC] No.44446970{4}[source]▶

>>44439886 #

I have one: features I've tried this on in my codebase. Because claude and gemini have both failed pretty badly.

So it's pretty stupid to just assume that critics haven't tried.

Example feature: send analytics events on app start triggered by notifications. Both Gemini and Claude completely failed to understand the component tree; rewrote hundreds of lines of code in broken ways; and even when prompted with the difficulty (this is happening outside of the component tree), failed to come up with a good solution. And even when deliberately prompted not to, like to simultaneously make cosmetic code changes to other pieces of the files they're touching.

152. nico ◴[02 Jul 25 21:39 UTC] No.44449136{3}[source]▶

>>44441864 #

Super interesting, thank you for the link

Aider writing its own code is definitely cool and within the same concept

I’d love to see an LLM or some sort of coding model that modifies/trains the model itself

153. Aeolun ◴[03 Jul 25 02:59 UTC] No.44451220{6}[source]▶

>>44440488 #

Yeah, it’s just a shortcut to it exploring the code for half an hour before doing something. At least it seems to make its searching more targeted.

154. pydry ◴[03 Jul 25 07:57 UTC] No.44452688{8}[source]▶

>>44444237 #

It demonstrated that there was a hard limit on the complexity of a puzzle that LLMs could solve no matter how many tokens they threw at it (using a form of puzzle construction that it ensured that the LLM couldn't just refer to its training data to solve it).

155. PhilippGille ◴[03 Jul 25 10:56 UTC] No.44453727[source]▶

>>44438065 (OP) #

> Next claude code execute the plan, either with sonnet 3.7 or sonnet 4 depending on the complexity of the task. Because most of my day-to-day is in clojure I tend to use sonnet 4 to get the parens right.

This made me chuckle.

Perfect example of why heavily LLM-driven devs and processes might want to pick a popular programming language which the LLM had a ton of training data for. Or a strong point for specialized LLMs (e.g. here it could be a smaller/cheaper/faster Clojure-specialized model).

156. puersum ◴[03 Jul 25 12:33 UTC] No.44454354[source]▶

>>44438065 (OP) #

I believe we need to find more effective ways to integrate AI into our workflows. Anyone who is actively trying to adopt AI has likely encountered similar challenges, yet a definitive solution has yet to emerge. In my view, a key principle at this stage is to assign AI minimal responsibility and highly specific tasks.

For example, I'm currently experimenting with an agent workflow for stock research. I've set up two AI roles: a 'Bullish Guy' and a 'Bearish Guy' and have them debate the pros and cons of a specific stock. The premise is that through this adversarial process, the AIs are forced to research opposing viewpoints, leading to a more comprehensive understanding and a superior final analysis. The idea was inspired by the kinds of arguments you see on social media.

replies(1): >>44462013 #

157. kayge ◴[03 Jul 25 17:36 UTC] No.44457389{4}[source]▶

>>44439886 #

The "No True Scotsware" problem? :)

158. 6510 ◴[04 Jul 25 07:08 UTC] No.44461888{5}[source]▶

>>44440786 #

The model T didn't manifest out of thin air after some incantation nor did it have a mysterious purposes somewhere between some and quite a lot. I assure you, compared to shit coin 90210 these are quite the interesting times.

It could be much bigger than the model T or much bigger than asbestos.

159. MatveySecured ◴[04 Jul 25 07:25 UTC] No.44462013[source]▶

>>44454354 #

hey! Your agent workflow for stock research sounds very interesting. Can you share a link please?

160. player1234 ◴[04 Jul 25 13:01 UTC] No.44464146{5}[source]▶

>>44441203 #

They jury is still out, they have spent hundreds of billions, trillions. And want trillions in ROI.

It does really cool stuff now when it is given away for free, but how cool is it when they want you to pay what it actually costs? With ROI and profits on top.

↑