1481 points sandslash | 87 comments
1. abdullin ◴[] No.44316210[source]
Tight feedback loops are key to working productively with software. I see that in codebases up to 700k lines of code (legacy 30-year-old 4GL ERP systems).

The best part is that AI-driven systems are fine with running even tighter loops than a sane human would tolerate.

E.g. running the full linting, testing and E2E/simulation suite after any minor change. Or generating 4 versions of a PR for the same task so that the human can just pick the best one.
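
To make that concrete, here is a minimal sketch of such a watch loop, assuming a Python project with ruff and pytest installed; the directory name, commands, and polling interval are illustrative choices, not taken from the comment above:

    import subprocess
    import time
    from pathlib import Path

    WATCHED = Path("src")          # assumed source directory
    CHECKS = [
        ["ruff", "check", "src"],  # full lint pass
        ["pytest", "-q"],          # full test suite
    ]

    def snapshot():
        # Map every watched file to its last-modified time.
        return {p: p.stat().st_mtime for p in WATCHED.rglob("*.py")}

    last = snapshot()
    while True:
        time.sleep(1)
        current = snapshot()
        if current != last:        # any minor change triggers the full loop
            last = current
            for cmd in CHECKS:
                if subprocess.run(cmd).returncode != 0:
                    break          # stop at the first failing check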

replies(7): >>44316306 #>>44316946 #>>44317531 #>>44317792 #>>44318080 #>>44318246 #>>44318794 #
2. ◴[] No.44316306[source]
3. OvbiousError ◴[] No.44316946[source]
I don't think the human is the problem here, but the time it takes to run the full testing suite.
replies(6): >>44317032 #>>44317123 #>>44317166 #>>44317246 #>>44317515 #>>44318555 #
4. Byamarro ◴[] No.44317032[source]
I work in web dev, so people sometimes hook code formatting into a git commit hook, or sometimes even run it on file save. The tests are problematic though. If you work on a huge project, it's a no-go at all. If you work on a medium one, the tests are long enough to block you, but short enough that you can't focus on anything else in the meantime.
5. diggan ◴[] No.44317123[source]
It is kind of a human problem too. That the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed not to break existing stuff with the full testing suite, so for some changes it takes 2 hours before you know with 100% certainty that they don't break other things. How quickly do you lose interest, and at what point do you give up and either improve the testing suite, or just skip that feature/implement it some other way?

Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later when they still try to get the feature put together without breaking anything.

If you're feeling brave, you start Robot B and C at the same time.

replies(2): >>44317507 #>>44317902 #
6. londons_explore ◴[] No.44317166[source]
The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in a big codebase can easily add up to whole engineers' worth of costs.
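
As a rough sketch of that selection step (assuming a pytest project where test files mirror module names; the path heuristic below is only a stand-in for the AI ranking described above):

    import subprocess
    from pathlib import Path

    def changed_files(base: str = "origin/main") -> list[str]:
        # Files touched by the current branch relative to the base branch.
        out = subprocess.run(
            ["git", "diff", "--name-only", base],
            capture_output=True, text=True, check=True,
        ).stdout
        return [line for line in out.splitlines() if line.endswith(".py")]

    def likely_tests(files: list[str]) -> list[str]:
        # Naive stand-in for the AI ranking step: pick tests whose names
        # mention a changed module (e.g. foo.py -> tests/test_foo.py).
        stems = {Path(f).stem for f in files}
        candidates = [f"tests/test_{stem}.py" for stem in sorted(stems)]
        return [c for c in candidates if Path(c).exists()]

    if __name__ == "__main__":
        selected = likely_tests(changed_files())
        if selected:
            # Fast loop: run only the likely-to-fail tests before committing.
            subprocess.run(["pytest", "-q", *selected])
        # The full suite still runs periodically, with bisection on regressions.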

replies(1): >>44318168 #
7. tlb ◴[] No.44317246[source]
Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.

LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.

replies(1): >>44317950 #
8. abdullin ◴[] No.44317507{3}[source]
This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go on your own. Come back later to review the results and pick good ones.
replies(1): >>44317620 #
9. abdullin ◴[] No.44317515[source]
Humans tend to lack inhuman patience.
10. yahoozoo ◴[] No.44317531[source]
The problem is that every time you run your full automation with linting and tests, you’re filling up the context window more and more. I don’t know how people using Claude do it with its <300k context window. I get the “your message will exceed the length of this chat” message so many times.
replies(3): >>44317624 #>>44317703 #>>44318129 #
11. diggan ◴[] No.44317620{4}[source]
Well, they're demonstrating it somewhat; it's more of a prototype today. First tell is the low limit: I think the longest task for me has been 15 minutes before it gives up. Second tell is that it still uses a chat UI, which is easy to implement and familiar, but also kind of lazy. There should be a better UX, especially with the new variations they just added. Off the top of my head, some graph-like UX might have been better.
replies(1): >>44318193 #
12. diggan ◴[] No.44317624[source]
I don't know exactly how Claude works, but the way I work around this with my own stuff is prompting it to never display full outputs, and instead temporarily redirect the output somewhere, then grep the log file for what it's looking for. So a test run producing 10K lines of output and one failure is easily handled without polluting the context with 10K lines.
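
As a minimal sketch of that pattern (the pytest command and the "FAILED" marker are assumptions about the project, not something from the comment):

    import subprocess

    LOG = "/tmp/test-run.log"

    # Run the suite, sending all output to a log file instead of the context window.
    with open(LOG, "w") as log:
        subprocess.run(["pytest", "-q"], stdout=log, stderr=subprocess.STDOUT)

    # Surface only the interesting lines, e.g. the failures and the summary.
    with open(LOG) as log:
        for line in log:
            if "FAILED" in line or "error" in line.lower():
                print(line.rstrip())
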
13. the_mitsuhiko ◴[] No.44317703[source]
I started to use sub-agents for that. That does not pollute the context as much.
14. latexr ◴[] No.44317792[source]
> Or generating 4 versions of PR for the same task so that the human could just pick the best one.

That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality. Why are we doing this to ourselves and embracing it?

A few years ago, it would have been seen as a joke to say "the future of software development will be to have a million monkey interns banging on a million keyboards, submitting a million PRs, then choosing one". Today, it's lauded as a brilliant business and cost-saving idea.

We’re beyond doomed. The first major catastrophe caused by sloppy AI code can’t come soon enough. The sooner it happens, the better chance we have to self-correct.

replies(6): >>44317876 #>>44317884 #>>44317997 #>>44318175 #>>44318235 #>>44318625 #
15. bonoboTP ◴[] No.44317876[source]
If it's monkeylike quality and you need a million tries, it's shit. If you need four tries and one of those is top-tier professional programmer quality, then it's good.
replies(4): >>44317938 #>>44317975 #>>44318876 #>>44319399 #
16. diggan ◴[] No.44317884[source]
> A truly terrible and demotivating way to work and produce anything of real quality

You clearly have strong feelings about it, which is fine, but it would be much more interesting to know exactly why it would be terrible and demotivating, and why it cannot produce anything of quality. And what is "real quality", and does that mean "fake quality" exists?

> million monkey interns banging on one million keyboards and submit a million PRs

I'm not sure if you misunderstand LLMs, or the famous "monkeys writing Shakespeare" part, but that example is more about randomness and infinity than about probabilistic machines somewhat working towards a goal with some non-determinism.

> We’re beyond doomed

The good news is that we've been doomed for a long time, yet we persist. If you take a look at how the internet is basically held up by duct-tape at this point, I think you'd feel slightly more comfortable with how crap absolutely everything is. Like 1% of software is actually Good Software while the rest barely works on a good day.

replies(2): >>44317983 #>>44318020 #
17. TeMPOraL ◴[] No.44317902{3}[source]
Worked in such a codebase for about 5 years.

No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and look at you weird when you suggest something needs to be done.

replies(1): >>44318811 #
18. agos ◴[] No.44317938{3}[source]
if the thing producing the four PRs can't distinguish the top tier one, I have strong doubts that it can even produce it
replies(1): >>44319323 #
19. HappMacDonald ◴[] No.44317950{3}[source]
You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.

If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.

To me this all sounds like snake oil to convince people to do something they were already doing, but by also spinning up N times as many compute instances and burning endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well, you've already given them all of your money, so their job is done.

replies(1): >>44318148 #
20. ◴[] No.44317975{3}[source]
21. 3dsnano ◴[] No.44317983{3}[source]
> And what is "real quality" and does that mean "fake quality" exists?

I think there is no real quality or fake quality, just quality. I am referencing the quality that Pirsig and C. Alexander have written about.

It’s… qualitative, so it’s hard to measure but easy to feel. Humans are really good at perceiving it then making objective decisions. LLMs don’t know what it is (they’ve heard about it and think they know).

replies(2): >>44318438 #>>44319060 #
22. koakuma-chan ◴[] No.44317997[source]
> That sounds awful. A truly terrible and demotivating way to work and produce anything of real quality

This is the right way to work with generative AI, and it already is an extremely common and established practice when working with image generation.

replies(3): >>44318041 #>>44318110 #>>44318310 #
23. bgwalter ◴[] No.44318020{3}[source]
If "AI" worked (which fortunately isn't the case), humans would be degraded to passive consumers in the last domain in which they were active creators: thinking.

Moreover, you would have to pay centralized corporations that stole all of humanity's intellectual output for engaging in your profession. That is terrifying.

The current reality is also terrifying: Mediocre developers are enabled to have a 10x volume (not quality). Mediocre execs like that and force everyone to use the "AI" snakeoil. The profession becomes even more bureaucratic, tool oriented and soulless.

People without a soul may not mind.

replies(1): >>44319044 #
24. notTooFarGone ◴[] No.44318041{3}[source]
I can recognize images in one look.

How about that 400-line change that touches 7 files?

replies(3): >>44318098 #>>44318227 #>>44318814 #
25. bandoti ◴[] No.44318080[source]
Here’s a few problems I foresee:

1. People get lazy when presented with four choices they had no hand in creating: they don't look over all four, they just click one, ignoring the others. Why? Because they have ten more of these on the go at once, diminishing their overall focus.

2. Automated tests, end-to-end sim., linting, etc—tools already exist and work at scale. They should be robust and THOROUGHLY reviewed by both AI and humans ideally.

3. AI is good for code reviews and “another set of eyes” but man it makes serious mistakes sometimes.

An anecdote for (1): when ChatGPT tries to A/B test me with two answers, it's incredibly burdensome for me to read virtually the same thing twice with minimal differences.

Code reviewing four things that do almost the same thing is more of a burden than writing the same thing once myself.

replies(2): >>44318111 #>>44318430 #
26. koakuma-chan ◴[] No.44318098{4}[source]
In my prompt I ask the LLM to write a short summary of how it solved the problem, run multiple instances of LLM concurrently, compare their summaries, and use the output of whichever LLM seems to have interpreted instructions the best, or arrived at the best solution.
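
A small sketch of that fan-out, where run_llm_instance is a hypothetical wrapper around whatever agent/CLI is in use; only the concurrency and the "pick by summary" shape come from the comment:

    from concurrent.futures import ThreadPoolExecutor

    def run_llm_instance(task: str, seed: int) -> dict:
        # Hypothetical wrapper: launch one agent on the task and return its
        # patch plus the short self-written summary requested in the prompt.
        return {"seed": seed, "patch": "...", "summary": f"(summary from attempt {seed})"}

    def best_by_summary(results: list[dict]) -> dict:
        # Manual step described above: read the summaries, not the diffs,
        # and keep the attempt that interpreted the instructions best.
        for i, r in enumerate(results):
            print(f"--- attempt {i} ---\n{r['summary']}\n")
        return results[int(input("pick attempt #: "))]

    task = "Fix the pagination bug in /orders and add a regression test."
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_llm_instance, task, seed) for seed in range(4)]
        results = [f.result() for f in futures]

    chosen = best_by_summary(results)
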
replies(1): >>44318584 #
27. deadbabe ◴[] No.44318110{3}[source]
It is not. The right way to work with generative AI is to get the right answer in the first shot. But it's the AI that is not living up to this promise.

Reviewing 4 different versions of AI code is grossly unproductive. A human co-worker can submit one version of code and usually have it accepted with a single review, no other "versions" to verify. With 4 versions, 75% of the code you're reading is unnecessary. Multiply this across every change ever made to a code base, and you're wasting a shitload of time.

replies(2): >>44318128 #>>44318662 #
28. abdullin ◴[] No.44318111[source]
A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".

As such, the task of verification still falls on the engineers.

Given that and proper processes, modern tooling works nicely with codebases ranging from 10k LOC (mixed embedded device code with golang backends and python DS/ML) to 700k LOC (legacy enterprise applications from the mainframe era)

replies(3): >>44318177 #>>44318268 #>>44319968 #
29. koakuma-chan ◴[] No.44318128{4}[source]
> Reviewing 4 different versions of AI code is grossly unproductive.

You can have another AI do that for you. I review manually for now though (summaries, not the code, as I said in another message).

30. abdullin ◴[] No.44318129[source]
Claude's approach is currently a bit dated.

Cursor.sh agents or especially OpenAI Codex illustrate that a tool doesn't need to keep stuffing the context window with irrelevant information in order to make progress on a task.

And if really needed, engineers report that Gemini Pro 2.5 keeps working fine within a 200k-500k token context. Above that, it is better to reset the context.

31. abdullin ◴[] No.44318148{4}[source]
Running tests is already an engineering problem.

In one of the systems (supply chain SaaS) we invested so much effort in having good tests in a simulated environment, that we could run full-stack tests at kHz. Roughly ~5k tests per second or so on a laptop.
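
To illustrate why rates like that are possible, here is a toy sketch of a test that runs against a simulated clock and an in-memory store rather than real time and real I/O; the design is an assumption, not the commenter's actual system:

    import unittest

    class FakeClock:
        """Simulated time: sleeping just advances a counter."""
        def __init__(self):
            self.now = 0.0
        def sleep(self, seconds):
            self.now += seconds

    class InMemoryOrders:
        """In-memory stand-in for the real order store (no network, no disk)."""
        def __init__(self):
            self.rows = {}
        def place(self, order_id, qty):
            self.rows[order_id] = qty
        def get(self, order_id):
            return self.rows.get(order_id)

    class OrderFlowTest(unittest.TestCase):
        def test_place_and_read_back(self):
            clock, orders = FakeClock(), InMemoryOrders()
            orders.place("o-1", qty=3)
            clock.sleep(3600)  # "an hour later" costs nothing in wall-clock time
            self.assertEqual(orders.get("o-1"), 3)

    if __name__ == "__main__":
        unittest.main()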

32. tele_ski ◴[] No.44318168{3}[source]
It's an interesting idea, but reactive, and could cause big delays due to bisecting and testing on those regressions. There's the 'old' saying that the sooner a bug is found, the cheaper it is to fix; it seems weird to intentionally push finding side-effect bugs later in the process for the sake of faster CI runs. Maybe AI will get there, but it seems too aggressive right now to me. But yeah, put the automation slider where you're comfortable.
33. osigurdson ◴[] No.44318175[source]
I'm not sure that AI code has to be sloppy. I've had some success with hand coding some examples and then asking codex to rigorously adhere to prior conventions. This can end up with very self consistent code.

Agree though on the "pick the best PR" workflow. This is pure model training work and you should be compensated for it.

replies(1): >>44318275 #
34. bandoti ◴[] No.44318177{3}[source]
Agreed. I think, though, that engineers following simple Test-Driven Development procedures can write the code, unit tests, integration tests, debugging, etc. for a small enough unit, which by default forces tight feedback loops. AI may assist in the particulars, not run the show.

I’m willing to bet, short of droid-speak or some AI output we can't even understand, that when considering "the system as a whole", even with short-term gains in speed, the longevity of any product will be better with real people following current best practices, and perhaps a modest sprinkle of AI.

Why? Because AI is trained on the results of human endeavors and can only work within that framework.

replies(1): >>44318282 #
35. abdullin ◴[] No.44318193{5}[source]
I guess, it depends on the case and the approach.

It works really nicely with the following approach (distilled from experiences reported by multiple companies):

(1) Augment codebase with explanatory texts that describe individual modules, interfaces and interactions (something that is needed for the humans anyway)

(2) Provide an Agent.MD that describes the approach/style/process that the AI agent must take. It should also describe how to run all tests (a minimal sketch follows at the end of this comment).

(3) Break down the task into smaller features. For each feature, ask first to write a detailed implementation plan (because it is easier to review the plan than 1000 lines of changes spread across a dozen files).

(4) Review the plan and ask to improve it, if needed. When ready - ask to draft an actual pull request

(5) The system will automatically use all available tests/linting/rules before writing the final PR. Verify and provide feedback, if some polish is needed.

(6) Launch multiple instances of the "write me an implementation plan" and "implement this plan" tasks, to pick the one that looks the best.

This is very similar to git-driven development of large codebases by distributed teams.

Edit: added newlines
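
For step (2), a minimal Agent.MD might look roughly like this; the layout, commands and conventions are purely illustrative assumptions:

    # Agent.MD

    ## Approach
    - Write a detailed implementation plan before touching code; wait for review.
    - Keep changes small and scoped to the feature being worked on.

    ## Style
    - Go backend: follow the existing package layout under internal/.
    - Python tooling: type hints required; ruff must pass.

    ## How to run all checks
    - Lint:       make lint
    - Unit tests: make test
    - E2E sim:    make e2e

    ## Definition of done
    - All three commands above pass before drafting the pull request.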

replies(1): >>44319532 #
36. abdullin ◴[] No.44318227{4}[source]
Exactly!

This is why there has to be a "write me a detailed implementation plan" step in between: which files is it going to change, how, what are the gotchas, which tests will be affected or added, etc.

It is easier to review one document and point out missing bits, than chase the loose ends.

Once the plan is done and good, it is usually a smooth path to the PR.

replies(1): >>44318795 #
37. ponector ◴[] No.44318235[source]
>That sounds awful.

Not for the cloud provider. AWS bill to the moon!

38. elif ◴[] No.44318246[source]
In my experience with Jules and (worse) Codex, juggling multiple pull requests at once is not advised.

Even if you tell the git-aware Jules to handle a merge conflict within the context window the patch was generated in, it's like: sorry bro, I have no idea what's wrong, can you send me a diff with the conflict?

I find I have to be in the iteration loop at every stage or else the agent will rapidly forget what it's doing or why. For instance, don't trust Jules to run your full test suite after every change without handholding and asking for specific run results every time.

It feels like to an LLM, gaslighting you with code that nominally addresses the core of what you just asked while completely breaking unrelated code or disregarding previously discussed parameters is an unmitigated success.

39. ponector ◴[] No.44318268{3}[source]
> As such, task of verification, still falls on hands of engineers.

Even before LLMs it was a common thing to merge changes which completely break the test environment. Some people really do skip the verification phase of their work.

40. elif ◴[] No.44318275{3}[source]
Yep this is what Andrej talks about around 20 minutes into this talk.

You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail. The second you start being vague, even if it WOULD be clear to a person with common sense, the LLM views that vagueness as a potential aspect of its own creative liberty.

replies(6): >>44318409 #>>44318439 #>>44318599 #>>44318670 #>>44319080 #>>44323353 #
41. abdullin ◴[] No.44318282{4}[source]
Agreed. AI is just a tool. Letting it run the show is essentially what vibe-coding is. It is a fun activity for prototyping, but it tends to accumulate problems and tech debt at an astonishing pace.

Code manually crafted by professionals will almost always beat AI-driven code in quality. Yet one still has to find such professionals and wait for them to get the job done.

I think, the right balance is somewhere in between - let tools handle the mundane parts (e.g. mechanically rewriting that legacy Progress ABL/4GL code to Kotlin), while human engineers will have fun with high-level tasks and shaping the direction of the project.

42. xphos ◴[] No.44318310{3}[source]
"If the only tool you have is a hammer, you tend to see every problem as a nail."

I think the world's leaning dangerously into LLMs, expecting them to solve every problem under the sun. Sure, AI can solve problems, but I think that domain 1 that Karpathy shows, if it is the body of new knowledge in the world, doesn't grow with LLMs and agents. Maybe generation and selection is the best method for working with domains 2/3, but there is something fundamentally lost in the rapid embrace of these AI tools.

A true challenge question for people is: would you give up 10 points of IQ for access to the next-gen AI model? I don't ask this in the sense that AI makes people stupid, but rather that it frames the value of intelligence as something you have, rather than how quickly you can look up or generate an answer that may or may not be correct. How we use our tools deeply shapes what we will do in the future. A cautionary tale is US manufacturing of precision tools, where we gave up on teaching people how to use lathes because they could simply run CNC machines instead. Now that industry has an extreme lack of programmers for CNC machines, making it impossible to keep up with other precision-instrument-producing countries. This of course is a normative statement and has more complex variables, but I fear that in this dead-set charge toward AI we will lose sight of what makes programming languages, and programming in general, valuable.

43. jebarker ◴[] No.44318409{4}[source]
> the LLM views that vagueness as a potential aspect of it's own creative liberty.

I think that anthropomorphism actually clouds what’s going on here. There’s no creative choice inside an LLM. More description in the prompt just means more constraints on the latent space. You still have no certainty whether the LLM models the particular part of the world you’re constraining it to in the way you hope it does though.

44. eddd-ddde ◴[] No.44318430[source]
With lazy people the same applies to everything: code they write themselves, or code they review from peers. The issue is not the tooling, but the hands.
replies(2): >>44318591 #>>44318646 #
45. abdullin ◴[] No.44318438{4}[source]
It is actually funny that current AI+Coding tools benefit a lot from domain context and other information along the lines of Domain-Driven Design (which was inspired by the pattern language of C. Alexander).

A few teams have started incorporating `CONTEXT.MD` into module descriptions to leverage this.

46. 9rx ◴[] No.44318439{4}[source]
> You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail.

If only there was a language one could use that enables describing all of your requirements in an unambiguous manner, ensuring that you have provided all the necessary detail.

Oh wait.

47. 9rx ◴[] No.44318555[source]
Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries and "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: Don't let it. Force it stay within a single isolation zone just like you'd expect of a human. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place and that maybe this whole thing isn't a good idea for the people in your organization.
48. elt895 ◴[] No.44318584{5}[source]
And you trust that the summary matches what was actually done? Your experience with the level of LLMs' understanding of code changes must significantly differ from mine.
replies(1): >>44318628 #
49. freehorse ◴[] No.44318591{3}[source]
The more tedious the work is, the less motivation and passion you get for doing it, and the more "lazy" you become.

Laziness does not just come from within, there are situations that promote behaving lazy, and others that don't. Some people are just lazy most of the time, but most people are "lazy" in some scenarios and not in others.

replies(1): >>44319090 #
50. joshuahedlund ◴[] No.44318599{4}[source]
> You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail

I understand YMMV, but I have yet to find a use case where this takes me less time than writing the code myself.

51. chamomeal ◴[] No.44318625[source]
I say this all the time!

Does anybody really want to be an assembly line QA reviewer for an automated code factory? Sounds like shit.

Also I can’t really imagine that in the first place. At my current job, each task is like 95% understanding all the little bits, and then 5% writing the code. If you’re reviewing PRs from a bot all day, you’ll still need to understand all the bits before you accept it. So how much time is that really gonna save?

replies(1): >>44319089 #
52. koakuma-chan ◴[] No.44318628{6}[source]
It matched every time so far.
53. chamomeal ◴[] No.44318646{3}[source]
I am not a lazy worker but I guarantee you I will not thoroughly read through and review four PRs for the same thing
54. RHSeeger ◴[] No.44318662{4}[source]
That's not really comparing apples to apples though.

> A human co-worker can submit one version of code and usually have it accepted with a single review, no other "versions" to verify.

But that human co-worker spent a lot of time generating what is being reviewed. You're trading "time saved coding" for "more time reviewing". You can't complain about the added time reviewing and then ignore all the time saved coding. That's not to say it's necessarily a win, but it _is_ a tradeoff.

Plus that co-worker may very well have spent some time discussing various approaches to the problem (with you), which is somewhat parallel to the idea of reviewing 4 different PRs.

55. SirMaster ◴[] No.44318670{4}[source]
I'm really waiting for AI to get on par with the common sense of most humans in their respective fields.
replies(1): >>44318737 #
56. diggan ◴[] No.44318737{5}[source]
I think you'll be waiting for a very long time. Right now we have programmable LLMs, so if you're not getting the results, you need to reprogram it to give the results you want.
57. layer8 ◴[] No.44318794[source]
> Tight feedback loops are the key in working productively with software. […] even more tight loops than what a sane human would tolerate.

Why would a sane human be averse to things happening instantaneously?

58. bayindirh ◴[] No.44318795{5}[source]
So you can create buggier code, remixed from scraped bits of the internet, which you don't understand but somehow works, rather than creating higher-quality, tighter code which takes the same amount of time to type? All the while offloading the work to something else so your skills can atrophy at the same time?

Sounds like progress to me.

replies(1): >>44322806 #
59. diggan ◴[] No.44318811{4}[source]
There are a few of us around, but not a lot, agreed. It really is an uphill battle trying to get development teams to design and implement test suites the same way they do with other "more important" code.
60. mistersquid ◴[] No.44318814{4}[source]
> I can recognize images in one look.

> How about that 400 Line change that touches 7 files?

Karpathy discusses this discrepancy. In his estimation, LLMs currently do not have a GUI, only something comparable to a 1970s CLI. Today, LLMs output text, and text does not leverage the human brain's ability to ingest visually coded information, literally, at a glance.

Karpathy surmises UIs for LLMs are coming and I suspect he’s correct.

replies(1): >>44319905 #
61. layer8 ◴[] No.44318876{3}[source]
The problem is, for any change, you have to understand the existing code base to assess the quality of the change in the four tries. This means, you aren’t relieved from being familiar with the code and reviewing everything. For many developers this review-only work style isn’t an exciting prospect.

And it will remain that way until you can delegate development tasks to AI with a 99+% success rate so that you don’t have to review their output and understand the code base anymore. At which point developers will become truly obsolete.

62. diggan ◴[] No.44319044{4}[source]
> If "AI" worked (which fortunately isn't the case), humans would be degraded to passive consumers in the last domain in which they were active creators: thinking.

"AI" (depending on what you understand that to be) is already "working" for many, including myself. I've basically stopped using Google because of it.

> humans would be degraded to passive consumers in the last domain in which they were active creators: thinking

Why? I still think (I think at least), why would I stop thinking just because I have yet another tool in my toolbox?

> you would have to pay centralized corporations that stole all of humanity's intellectual output for engaging in your profession

Assuming we'll forever be stuck in the "mainframe" phase, then yeah. I agree that local models aren't really close to SOTA yet, but the ones you can run locally can already be useful in a couple of focused use cases, and judging by the speed of improvements, we won't always be stuck in this mainframe-phase.

> Mediocre developers are enabled to have a 10x volume (not quality).

In my experience, which admittedly been mostly in startups and smaller companies, this has always been the case. Most developers seem to like to produce MORE code over BETTER code, I'm not sure why that is, but I don't think LLMs will change people's mind about this, in either direction. Shitty developers will be shit, with or without LLMs.

replies(1): >>44322276 #
63. diggan ◴[] No.44319060{4}[source]
> LLMs don’t know what it is

Of course they don't, they're probability/prediction machines; they don't "know" anything, not even that Paris is the capital of France. What they do "know" is that once someone writes "The capital of France is", the most likely token to come after that is "Paris". But they don't understand the concept, nor anything else, just that probably 54123 comes after 6723 (or whatever the tokens are).

Once you understand this, I think it's easy to reason about why they don't understand code quality, why they couldn't ever understand it, and how you can make them output quality code regardless.
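
For illustration, a tiny sketch of that next-token view, using GPT-2 via the transformers library; the model choice and the "capital of France" prompt are mine, just to make the mechanism above concrete:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    prompt = "The capital of France is"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

    # Probability distribution over the *next* token after the prompt.
    probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(probs, k=5)
    for p, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")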

64. pja ◴[] No.44319080{4}[source]
> You have to be extremely verbose in describing all of your requirements. There is seemingly no such thing as too much detail.

Sounds like ... programming.

Program specification is programming, ultimately. For any given problem if you’re lucky the specification is concise & uniquely defines the required program. If you’re unlucky the spec ends up longer than the code you’d write to implement it, because the language you’re writing it in is less suited to the problem domain than the actual code.

replies(1): >>44323942 #
65. diggan ◴[] No.44319089{3}[source]
> Does anybody really want to be an assembly line QA reviewer for an automated code factory? Sounds like shit.

On the other hand, does anyone really wanna be a code-monkey implementing CRUD applications over and over by following product specifications by "product managers" that barely seem to understand the product they're "managing"?

See, we can make bad faith arguments both ways, but what's the point?

replies(2): >>44319854 #>>44323231 #
66. bandoti ◴[] No.44319090{4}[source]
Seurat created beautiful works of art composed of thousands of tiny dots, painted by hand; one might find it meditational with the right mindset.

Some might also find laziness itself dreadfully boring—like all the Microsoft employees code-reviewing AI-Generated pull requests!

https://blog.stackademic.com/my-new-hobby-watching-copilot-s...

67. solaire_oa ◴[] No.44319323{4}[source]
Making 4 PRs for a well-known solution sounds insane, yes, but to be the devil's advocate, you could plausibly be working with an ambiguous task: "Create 4 PRs with 4 different dependency libraries, so that I can compare their implementations." Technically it wouldn't need to pick the best one.

I have apprehension about the future of software engineering, but comparison does technically seem like a valid use case.

68. solaire_oa ◴[] No.44319399{3}[source]
Top-tier professional programmer quality is exceedingly, impractically optimistic, for a few reasons.

1. There's a low probability of that in the first place.

2. You need to be a top-tier professional programmer to recognize that type of quality (i.e. a junior engineer could select one of the 3 shit PRs)

3. When it doesn't produce TTPPQ, you wasted tons of time prompting and reviewing shit code and still need to deliver, net negative.

I'm not doubting the utility of LLMs but the scattershot approach just feels like gambling to me.

replies(1): >>44320025 #
69. diggan ◴[] No.44319532{6}[source]
> distilled from experiences reported by multiple companies

Distilled from my experience, I'd still say that the UX is lacking, as sequential chat just isn't the right format. I agree with Karpathy that we haven't found the right way of interacting with these OSes yet.

Even with what you say, variations were implemented in a rush. Once you've iterated on one variation, you cannot at the same time iterate on another variant, for example.

replies(1): >>44336655 #
70. nevertoolate ◴[] No.44319854{4}[source]
The issue is that if product people do the "coding" and you have to fix it, it's miserable.
replies(1): >>44320383 #
71. variadix ◴[] No.44319905{5}[source]
The thing required isn’t a GUI for LLMs, it’s a visual model of code that captures all the behavior and is a useful representation to a human. People have floated this idea before LLMs, but as far as I know there isn’t any real progress, probably because it isn’t feasible. There’s so much intricacy and detail in software (and getting it even slightly wrong can be catastrophic), any representation that can capture said detail isn’t going to be interpretable at a glance.
replies(2): >>44320927 #>>44322430 #
72. xpe ◴[] No.44319968{3}[source]
> A simple rule applies: "No matter what tool created the code, you are still responsible for what you merge into main".

Beware of claims of simple rules.

Take one subset of the problem: code reviews in an organizational environment. How well does the simple rule above work?

The idea of “Person P will take responsibility” is far from clear and often not a good solution. (1) P is fallible. (2) Some consequences are too great to allow one person to trigger them, which is why we have systems and checks. (3) P cannot necessarily right the wrong. (4) No-fault analyses are often better when it comes to long-term solutions which require a fear free culture to reduce cover-ups.

But this is bigger than one organization. The effects of software quickly escape organizational boundaries. So when we think about giving more power to AI tooling, we have to be really smart. This means understanding human nature, decision theory, political economy [1], societal norms, and law. And building smart systems (technical and organizational)

Recommending good strategies for making AI-generated code safe is a hard problem. I'd bet it is much harder than even "elite" software developers have contemplated, much less implemented. Training in software helps but is insufficient. I personally have some optimism for formal methods, defense in depth, and carefully implemented human-in-the-loop systems.

[1] Political economy uses many of the tools of economics to study the incentives of human decision making

73. zelphirkalt ◴[] No.44320025{4}[source]
Also as a consequence of (1) the LLMs are trained on mediocre code mostly, so they often output mediocre or bad solutions.
74. diggan ◴[] No.44320383{5}[source]
Even worse would be if we asked the accountants to do the coding, then you'll learn what miserable means.

What was the point again?

replies(1): >>44320556 #
75. nevertoolate ◴[] No.44320556{6}[source]
Yes
76. mistersquid ◴[] No.44320927{6}[source]
> The thing required isn’t a GUI for LLMs, it’s a visual model of code that captures all the behavior and is a useful representation to a human.

The visual representation that would be useful to humans is what Karpathy means by “GUI for LLMs”.

77. zelphirkalt ◴[] No.44322276{5}[source]
The AI as it currently is will not come up with that new app idea or that clever, innovative way of implementing an application. It will endlessly rehash the training data it has ingested. Sure, you can tell an AI to spit out a CRUD app, and maybe it will even eventually work in some sane way, but that's not innovative and not necessarily good software. It is blindly copying existing approaches to implement something. That something is then maybe even working, but lacks any special sauce to make it special.

Example: I am currently building a web app. My goal is to keep it entirely static, traditional template rendering, just using the web as a GUI framework. If I had just told the AI to build this, it would have thrown tons of JS at the problem, because that is what the mainstream does these days, and what it mostly saw as training data. Then my back button would most likely no longer work, I would not be able to use bookmarks properly, it would not automatically have an API as powerful as the web UI, usable from any script, and the whole thing would have gone to shit.

If the AI tools were as good as I am at what I am doing, and I relied upon that, then I would not have spent time trying to think of the principles of my app, as I did when coming up with it myself. As it is now, the AI would not even have managed to prevent duplicate results from showing up in the UI, because I had a GPT4 session about how to prevent that, and none of the suggested AI answers worked and in the end I did what I thought I might have to do when I first discovered the issue.

replies(1): >>44322973 #
78. skydhash ◴[] No.44322430{6}[source]
There's no visual model for code, as code isn't 2D. There are 2 mechanisms in the Turing machine model: a state machine and a linear representation of code and data. The 2D representation of the state machine has no significance, and the linear aspect of code and data is hiding more dimensions. We invented more abstractions, but nothing that maps to a visual representation.
79. abdullin ◴[] No.44322806{6}[source]
Here is another way to look at the problem.

There is a team of 5 people that are passionate about their indigenous language and want to preserve it from disappearing. They are using AI+Coding tools to:

(1) Process and prepare a ton of various datasets for training custom text-to-speech, speech-to-text models and wake word models (because foundational models don't know this language), along with the pipelines and tooling for the contributors.

(2) design and develop an embedded device (running ESP32-S3) to act as a smart speaker running on the edge

(3) design and develop backend in golang to orchestrate hundreds of these speakers

(4) a whole bunch of Python agents (essentially glorified RAGs over folklore, stories)

(5) a set of websites for teachers to create course content and exercises, making them available to these edge devices

All that, just so that kids in a few hundred kindergartens and schools would be able to practice their own native language, listen to fairy tales, songs or ask questions.

This project was acknowledged by the UN (AI for Good programme). They are now extending their help to more disappearing languages.

None of that was possible before. This sounds like good progress to me.

Edit: added newlines.

replies(1): >>44325990 #
80. diggan ◴[] No.44322973{6}[source]
> The AI as it is currently, will not come up with that new app idea or that clever innovative way of implementing an application

Who has claimed that they can do that sort of stuff? I don't think my comment hints at that, nor does the talk in the submission.

You're absolutely right with most of your comment, and you seem to just be rehashing what Karpathy talks about, but with different words. Of course it won't create good software unless you specify exactly what "good software" is for you, and tell it that. Of course it won't know you want "traditional static template rendering" unless you tell it to. Of course it won't create an API you can use from anywhere unless you say so. Of course it'll follow what's in the training data. Of course things won't automatically implement whatever you imagine your project should have, unless you tell it about those features.

I'm not sure if you're just expanding on the talk but chose my previous comment to attach it to, or if you're replying to something I said in my comment.

81. consumer451 ◴[] No.44323231{4}[source]
I hesitate to divide a group as diverse as software devs into two categories, but here I go:

I have a feeling that devs who love LLM coding tools are more product-driven than those who hate them.

Put another way, maybe devs with their own product ideas love LLM coding tools, while devs without them do not.

I am genuinely not trying to throw shade here in any way. Does this rough division ring true to anyone else? Is there any better way to put it?

replies(1): >>44427318 #
82. throw234234234 ◴[] No.44323353{4}[source]
I've found myself personally thinking English is OK when I'm happy with a "lossy expansion" and don't need every single detail defined (i.e. the tedious boilerplate, or templating kind of code). After all, to me an LLM can be seen as a lossy compression of actual detailed examples of working code, so why not "uncompress" it and let it fill in the gaps? As an example, I want a UI to render some data but I'm not as fussed about the details of it; I don't want to specify exact coordinates of each button, etc.

However when I want detailed changes I find it more troublesome at present than just typing in the code myself. i.e. I know exactly what I want and I can express it just as easily (sometimes easier) in code.

I find AI in some ways a generic DSL, personally. The more I have to define and the more specific I have to be, the more I start to evaluate code or DSLs as potentially more appropriate tools, especially when the details DO matter for quality/acceptance.

83. longhaul ◴[] No.44323942{5}[source]
Agree, I used to say that documenting a program precisely and comprehensively ends up being code. We either need a DSL that can specify at a higher level or use domain specific LLMs.
84. bayindirh ◴[] No.44325990{7}[source]
What you are describing is another application. My comment was squarely aimed at "vibe coding".

Protecting and preserving dying languages and culture is a great application for natural language processing.

For the record, I'm neither against LLMs nor AI. What I'm primarily against is how LLMs are trained and use the internet via their agents, without giving any citations, stripping this information left and right and crying "fair use!" in the process.

Also, Go and Python are nice languages (which I use), but there are other nice ways to build agents which also allow them to migrate, communicate and work in other cooperative or competitive ways.

So, AI is nice, LLMs are cool, but hyping something to earn money, deskill people, and pointing to something which is ethically questionable and technically inferior as the only silver bullet is not.

IOW; We should handle this thing way more carefully and stop ripping people's work in the name of "fair use" without consent. This is nuts.

Disclosure: I'm a HPC sysadmin sitting on top of a datacenter which runs some AI workloads, too.

replies(1): >>44336695 #
85. abdullin ◴[] No.44336655{7}[source]
Yes. I believe, the experience will get better. Plus more AI vendors will catch up with OpenAI and offer similar experiences in their products.

It will just take a few months.

86. abdullin ◴[] No.44336695{8}[source]
I think there are two different layers that get frequently mixed.

(1) LLMs as models - just the weights and an inference engine. These are just tools like hammers. There is a wide variety of models, starting from transparent and useless IBM Granite models, to open-weights Llama/Qwen to proprietary.

(2) AI products that are built on top of LLMs (agents, RAG, search, reasoning etc). This is how people decide to use LLMs.

How these products display results - with or without citations, with or without attribution - is determined by the product design.

It takes more effort to design a system that properly attributes all bits of information to the sources, but it is doable. As long as product teams are willing to invest that effort.

87. chamomeal ◴[] No.44427318{5}[source]
No I think that’s accurate! But maybe instead of “devs who think about product stuff vs devs who don’t”, it depends on what hat you’re wearing.

When I'm working on something that I just want to work, I love using LLMs. Shell functions for me to stuff into my config and use without ever understanding, UI for side projects that I don't particularly care about, boilerplate nestjs config crap. Anything where all I care about is the result, not the process or the extensibility of the code: I love LLMs for that stuff.

When it’s something that I’m going to continue working on for a while, or the whole point is the extensibility/cleanliness of the program, I don’t like to use LLMs nearly as much.

I think it might be because most codebases are built with two purposes: 1) to be used as a product 2) to be extended and turned into something else

LLMs are super good at the first purpose, but not so good at the second.

I heard an interesting interview on the playdate dev podcast by the guy who made Obra Dinn. He said something along the lines of “making a game is awesome because the code can be horrible. All that matters is that the game works and is fun, and then you are done. It can just be finished, and then the code quality doesn’t matter anymore.”

So maybe LLMs are just really good for when you need something specific to work, and the internals don’t matter too much. Which are more the values of a product manager than a developer.

So it makes sense that when you are thinking more product-oriented, LLMs are more appealing!