←back to thread

416 points floverfelt | 4 comments | | HN request time: 0s | source
Show context
daviding ◴[] No.45056856[source]
I get a lot of productivity out of LLMs so far, which for me is a simple good sign. I can get a lot done in a shorter time and it's not just using them as autocomplete. There is this nagging doubt that there's some debt to pay one day when it has too loose a leash, but LLMs aren't alone in that problem.

One thing I've done with some success is use a Test Driven Development methodology with Claude Sonnet (or recently GPT-5). Moving forward the feature in discrete steps with initial tests and within the red/green loop. I don't see a lot written or discussed about that approach so far, but then reading Martin's article made me realize that the people most proficient with TDD are not really in the Venn Diagram intersection of those wanting to throw themselves wholeheartedly into using LLMs to agent code. The 'super clippy' autocomplete is not the interesting way to use them, it's with multiple agents and prompt techniques at different abstraction levels - that's where you can really cook with gas. Many TDD experts have great pride in the art of code, communicating like a human and holding the abstractions in their head, so we might not get good guidance from the same set of people who helped us before. I think there's a nice green field of 'how to write software' lessons with these tools coming up, with many caution stories and lessons being learnt right now.

edit: heh, just saw this now, there you go - https://news.ycombinator.com/item?id=45055439

replies(1): >>45056943 #
tra3 ◴[] No.45056943[source]
It feels like Tdd/llm connection is implied — “and also generate tests”. Thought it’s not cannonical tdd of course. I wonder if it’ll turn the tide towards tech that’s easier to test automatically, like maybe ssr instead of react.
replies(2): >>45057027 #>>45057482 #
1. rvz ◴[] No.45057482[source]
> It feels like Tdd/llm connection is implied — “and also generate tests”.

That sounds like an anti-pattern and not true TDD to get LLMs to generate tests for you if you don't know what to test for.

It also reduces your confidence in knowing if the generated test does what it says. Thus, you might as well write it yourself.

Otherwise you will get these sort of nasty incidents. [0] Even when 'all tests passed'.

[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...

replies(1): >>45058409 #
2. gck1 ◴[] No.45058409[source]
LLMs (Sonnet, Gemini from what I tested) tend to “fix” failing tests by either removing them outright or tweaking the assertions just enough to make them pass. The opposite happens too - sometimes they change the actual logic when what really needs updating is the test.

In short, LLMs often get confused about where the problem lies: the code under test or the test itself. And no amount of context engineering seems to solve that.

replies(1): >>45060274 #
3. Zerot ◴[] No.45060274[source]
I think in part the issue is that the LLM does not have enough context. The difference between a bug in the test or a bug in the implementation is purely based on the requirements which are often not in the source code and stored somewhere else(ticket system, documentation platform).

Without providing the actual feature requirements to the LLM(or the developer) it is impossible to determine which is wrong.

Which is why I think it is also sort of stupid by having the LLM generate tests by just giving it access to the implementation. That is at best testing the implementation as it is, but tests should be based on the requirements.

replies(1): >>45060549 #
4. gck1 ◴[] No.45060549{3}[source]
Oh, absolutely, context matters a lot. But the thing is, they still fail even with solid context.

Before I let an agent touch code, I spell out the issue/feature and have it write two markdown files - strategy.md and progress.md (with the execution order of changes) inside a feat_{id} directory. Once I’m happy with those, I wipe the context and start fresh: feed it the original feature definition + the docs, then tell it to implement by pulling in the right source code context. So by the time any code gets touched, there’s already ~80k tokens in play. And yet, the same confusion frequently happens.

Even if I flat out say “the issue is in the test/logic,”, even if I point out _exactly_ what the issue is, it just apologizes and loops.

At that point I stop it, make it record the failure in the markdown doc, reset context, and let it reload the feature plus the previous agent’s failure. Occasionally that works, but usually once it’s in that state, I have to step in and do it myself.