1479 points | sandslash | 17 comments
abdullin ◴[] No.44316210[source]
Tight feedback loops are the key to working productively with software. I see that in codebases of up to 700k lines of code (legacy 30-year-old 4GL ERP systems).

The best part is that AI-driven systems are fine with running even tighter loops than a sane human would tolerate.

E.g. running the full linting, testing and E2E/simulation suite after any minor change. Or generating 4 versions of a PR for the same task so that the human can just pick the best one.
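
A minimal sketch of that kind of loop, assuming a Python project with ruff and pytest (the real systems in question are 4GL ERPs, so this is purely illustrative):

    import pathlib
    import subprocess
    import time

    def snapshot(root: str = "src") -> dict:
        # Map every source file to its last-modified time.
        return {p: p.stat().st_mtime for p in pathlib.Path(root).rglob("*.py")}

    last = snapshot()
    while True:
        time.sleep(1)
        current = snapshot()
        if current != last:                          # something changed on disk
            last = current
            subprocess.run(["ruff", "check", "."])   # full lint pass
            subprocess.run(["pytest", "-q"])         # full test suite, every time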

replies(7): >>44316306 #>>44316946 #>>44317531 #>>44317792 #>>44318080 #>>44318246 #>>44318794 #
1. OvbiousError ◴[] No.44316946[source]
I don't think the human is the problem here; it's the time it takes to run the full testing suite.
replies(6): >>44317032 #>>44317123 #>>44317166 #>>44317246 #>>44317515 #>>44318555 #
2. Byamarro ◴[] No.44317032[source]
I work in web dev, so people sometimes hook code formatting into a git commit hook, or sometimes even run it on file save. The tests are problematic though. If you work on a huge project it's a no-go at all. If you work on a medium one, the tests are long enough to block you but short enough that you can't focus on anything else in the meantime.
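
For reference, a minimal pre-commit hook along those lines, saved as .git/hooks/pre-commit and marked executable (prettier is an assumption here; any formatter slots in the same way):

    #!/usr/bin/env python3
    import subprocess
    import sys

    # Format only the files staged for this commit, then re-stage them.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    targets = [f for f in staged if f.endswith((".js", ".ts", ".css", ".json"))]
    if targets:
        subprocess.run(["npx", "prettier", "--write", *targets], check=True)
        subprocess.run(["git", "add", *targets], check=True)
    sys.exit(0)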
3. diggan ◴[] No.44317123[source]
It is kind of a human problem too. That the full testing suite takes X hours to run is also not fun, but it makes the human problem larger.

Say you're Human A, working on a feature. Running the full testing suite takes 2 hours from start to finish. Every change you make to existing code needs to be confirmed not to break existing stuff with the full testing suite, so for some changes it takes 2 hours before you know with 100% certainty that nothing else breaks. How quickly do you lose interest, and at what point do you give up and either improve the testing suite or just skip that feature / implement it some other way?

Now say you're Robot A working on the same task. The robot doesn't care if each change takes 2 hours to appear on their screen, the context is exactly the same, and they're still "a helpful assistant" 48 hours later while they keep trying to get the feature put together without breaking anything.

If you're feeling brave, you start Robot B and C at the same time.

replies(2): >>44317507 #>>44317902 #
4. londons_explore ◴[] No.44317166[source]
The full test suite is probably tens of thousands of tests.

But AI will do a pretty decent job of telling you which tests are most likely to fail on a given PR. Just run those ones, then commit. Cuts your test time from hours down to seconds.

Then run the full test suite only periodically and automatically bisect to find out the cause of any regressions.

Dramatically cuts the compute costs of tests too, which in a big codebase can easily add up to whole engineers' worth of costs.
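
The bisecting half of this is already mechanical; a rough sketch using git bisect run with pytest, where the failing test id comes from the periodic full run:

    import subprocess
    import sys

    def bisect_regression(good_ref: str, bad_ref: str, failing_test: str) -> None:
        # Let git bisect drive the failing test to find the commit that broke it.
        subprocess.run(["git", "bisect", "start", bad_ref, good_ref], check=True)
        try:
            # `git bisect run` re-runs the command at every step; pytest's exit
            # code (0 = pass, nonzero = fail) is exactly what bisect expects.
            subprocess.run(
                ["git", "bisect", "run", "pytest", "-q", failing_test], check=True
            )
        finally:
            subprocess.run(["git", "bisect", "reset"], check=True)

    if __name__ == "__main__":
        bisect_regression(sys.argv[1], sys.argv[2], sys.argv[3])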

replies(1): >>44318168 #
5. tlb ◴[] No.44317246[source]
Yes, and (some near-future) AI is also more patient and better at multitasking than a reasonable human. It can make a change, submit for full fuzzing, and if there's a problem it can continue with the saved context it had when making the change. It can work on 100s of such changes in parallel, while a human trying to do this would mix up the reasons for the change with all the other changes they'd done by the time the fuzzing result came back.
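
A toy sketch of that shape: capture the context at submit time, fan out, and resume from the saved context when the slow check comes back. run_fuzzer and agent_fix here are stand-ins, not a real API:

    import random
    from concurrent.futures import ThreadPoolExecutor

    def run_fuzzer(patch):
        # Stand-in for a fuzzing run that may take hours to come back.
        return {"failures": random.random() < 0.3}

    def agent_fix(context, report):
        # Stand-in for resuming the agent from its saved context.
        return {**context, "status": "revised after fuzz failure"}

    def process_change(change):
        # Save the reasoning at submit time, before it gets mixed up with anything else.
        context = {"patch": change["patch"], "reasoning": change["notes"]}
        report = run_fuzzer(change["patch"])
        if report["failures"]:
            return agent_fix(context, report)
        return {**context, "status": "clean"}

    changes = [{"patch": f"patch-{i}", "notes": f"reason {i}"} for i in range(100)]
    with ThreadPoolExecutor(max_workers=100) as pool:
        results = list(pool.map(process_change, changes))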

LLMs are worse at many things than human programmers, so you have to try to compensate by leveraging the things they're better at. Don't give up with "they're bad at such and such" until you've tried using their strengths.

replies(1): >>44317950 #
6. abdullin ◴[] No.44317507[source]
This is the workflow that ChatGPT Codex demonstrates nicely. Launch any number of «robotic» tasks in parallel, then go do your own thing. Come back later to review the results and pick the good ones.
replies(1): >>44317620 #
7. abdullin ◴[] No.44317515[source]
Humans tend to lack inhuman patience.
8. diggan ◴[] No.44317620{3}[source]
Well, they're demonstrating it somewhat; it's more of a prototype today. The first tell is the low time limit: I think the longest task for me has been 15 minutes before it gives up. The second tell is that it still uses a chat UI, which is simple to implement and familiar, but also kind of lazy. There should be a better UX, especially with the new variations they just added. Off the top of my head, some graph-like UX might have been better.
replies(1): >>44318193 #
9. TeMPOraL ◴[] No.44317902[source]
Worked in such a codebase for about 5 years.

No one really cares about improving test times. Everyone either suffers in private or gets convinced it's all normal and looks at you weird when you suggest something needs to be done.

replies(1): >>44318811 #
10. HappMacDonald ◴[] No.44317950[source]
You can't run N bots in parallel with testing between each attempt unless you're also running N tests in parallel.

If you could run N tests in parallel, then you could probably also run the components of one test in parallel and keep it from taking 2 hours in the first place.

To me this all sounds like snake oil to convince people to do something they were already doing, but by also spinning up N times as many compute instances and burning endless tokens along the way. And by the time it's demonstrated that it doesn't really offer anything more than doing it yourself, well, you've already given them all of your money, so their job is done.

replies(1): >>44318148 #
11. abdullin ◴[] No.44318148{3}[source]
Running tests is already an engineering problem.

In one of the systems (a supply chain SaaS) we invested so much effort in having good tests in a simulated environment that we could run full-stack tests at kHz rates: roughly 5k tests per second on a laptop.
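
A toy illustration of why that becomes possible once every external dependency is an in-memory fake; the names are illustrative, not the actual system:

    import time

    class InMemoryEventStore:
        # Fake event store: no network, no disk, just a list.
        def __init__(self):
            self.events = []

        def append(self, event):
            self.events.append(event)

    class OrderService:
        def __init__(self, store):
            self.store = store

        def place_order(self, sku, qty):
            self.store.append(("order_placed", sku, qty))
            return len(self.store.events)

    def test_place_order():
        svc = OrderService(InMemoryEventStore())
        assert svc.place_order("SKU-1", 3) == 1

    start = time.perf_counter()
    for _ in range(5000):
        test_place_order()
    print(f"5000 full-stack-ish tests in {time.perf_counter() - start:.3f}s")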

12. tele_ski ◴[] No.44318168[source]
It's an interesting idea, but it's reactive and could cause big delays due to bisecting and testing on those regressions. There's the 'old' saying that the sooner a bug is found, the cheaper it is to fix; it seems weird to intentionally push finding side-effect bugs later in the process for the sake of faster CI runs. Maybe AI will get there, but it seems too aggressive right now to me. But yeah, put the automation slider where you're comfortable.
13. abdullin ◴[] No.44318193{4}[source]
I guess it depends on the case and the approach.

It works really nicely with the following approach (distilled from experiences reported by multiple companies):

(1) Augment the codebase with explanatory texts that describe individual modules, interfaces and interactions (something that is needed for the humans anyway)

(2) Provide an Agent.MD that describes the approach/style/process the AI agent must follow. It should also describe how to run all tests.

(3) Break down the task into smaller features. For each feature, first ask for a detailed implementation plan (because it is easier to review a plan than 1000 lines of changes spread across a dozen files)

(4) Review the plan and ask for improvements if needed. When ready, ask it to draft the actual pull request

(5) The system will automatically use all available tests/linting/rules before writing the final PR. Verify, and provide feedback if some polish is needed.

(6) Launch multiple instances of the "write me an implementation plan" and "implement this plan" tasks, then pick the one that looks best (a rough sketch of this step follows below)

This is very similar to git-driven development of large codebases by distributed teams.
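
A rough sketch of step 6, fanning out several "write me an implementation plan" runs and collecting the drafts for a human to review. The `codex` CLI invocation is an assumption for illustration; substitute whatever agent interface you actually use:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    TASK = "Add CSV export to the reports module"
    PROMPT = f"Write a detailed implementation plan for: {TASK}"

    def draft_plan(variant: int) -> str:
        # Illustrative invocation; swap in your actual agent CLI or API here.
        out = subprocess.run(["codex", "exec", PROMPT], capture_output=True, text=True)
        return f"--- variant {variant} ---\n{out.stdout}"

    with ThreadPoolExecutor(max_workers=4) as pool:
        drafts = list(pool.map(draft_plan, range(4)))

    for draft in drafts:
        print(draft)   # the human reviews the drafts and picks the best one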

Edit: added newlines

replies(1): >>44319532 #
14. 9rx ◴[] No.44318555[source]
Unless you are doing something crazy like letting the fuzzer run on every change (cache that shit), the full test suite taking a long time suggests that either your isolation points are way too large or you are letting the LLM cross isolated boundaries, so "full testing suite" here actually means "multiple full testing suites". The latter is an easy fix: don't let it. Force it to stay within a single isolation zone, just like you'd expect of a human. The former is a lot harder to fix, but I suppose ending up there is a strong indicator that you can't trust the human picking the best LLM result in the first place, and that maybe this whole thing isn't a good idea for the people in your organization.
15. diggan ◴[] No.44318811{3}[source]
There are a few of us around, but not a lot, agreed. It really is an uphill battle trying to get development teams to design and implement test suites the same way they do with other "more important" code.
16. diggan ◴[] No.44319532{5}[source]
> distilled from experiences reported by multiple companies

Distilled from my experience, I'd still say that the UX is lacking, as sequential chat just isn't the right format. I agree with Karpathy that we haven't found the right way of interacting with these OSes yet.

Even with what you say, variations were implemented in a rush. Once you've iterated on one variation you cannot, for example, iterate on another variant at the same time.

replies(1): >>44336655 #
17. abdullin ◴[] No.44336655{6}[source]
Yes. I believe the experience will get better. Plus, more AI vendors will catch up with OpenAI and offer similar experiences in their products.

It will just take a few months.