The quality of AI-assisted software depends on unit of work management

(blog.nilenso.com)

170 points mogambo1 | 1 comments | 18 Sep 25 13:06 UTC | HN request time: 0.35s | source

Show context

jonstewart ◴[18 Sep 25 13:58 UTC] No.45289784[source]▶

I first tried getting specific with Claude Code. I made the Claude.md, I detailed how to do TDD, what steps it should take, the commands it should run. It was imperfect. Then I had it plan (think hard) and write the plan to a file. I’d clear context, have it read the plan, ask me questions, and then have it decompose the plan into a detailed plan of discrete tasks. Have it work its way through that. It would inevitably go sideways halfway through, even clearing context between each task. It wouldn’t run tests, it would commit breakage, it would flip flop between two different broken approaches, it was just awful. Now I’ve just been vibing, writing as little as possible and seeing what happens. That sucks, too.

It’s amazing at reviewing code. It will identify what you fear, the horrors that lie within the codebase, and it’ll bring them out into the sunlight and give you a 7 step plan for fixing them. And the coding model is good, it can write a function. But it can’t follow a plan worth shit. And if I have to be extremely detailed at the function by function level, then I should be in the editor coding. Claude code is an amazing niche tool for code reviews and dialogue and debugging and coping with new technologies and tools, but it is not a productivity enhancement for daily coding.

replies(2): >>45289842 #>>45296208 #

cadamsdotcom ◴[18 Sep 25 23:21 UTC] No.45296208[source]▶

>>45289784 #

Don’t give up on TDD.

I’ve invested hundreds of hours in process and tooling, and can now ship major features with tests in record time with Claude Code.

You have to coach it in TDD - no matter how much you explain in CLAUDE.md. That’s part because “a test that fails because the code isn’t written yet” is conceptually very similar to “a test that passes without the code we’re about to write” and is also similar to “a test that asserts the code we’re about to write is not there”. You have to watch closely to make sure it produces the first thing.

Why does it keep getting confused? You can’t blame it really. When two things are conceptually similar, models need lots of examples to distinguish between them. If the set of samples is sparse the model is likely to jump the small distance from a concept to similar ones.

So, you have to accept this as how Claude 4 works, keep it on a short leash, keep reminding it that it must watch the test fail, ask it if the test failed for the right reason (not some setup issue), and THEN give it permission to write the code.

The result is two mirror copies of your feature or fix: code and tests.

Reviewing code and tests together is pleasant because they mirror one another. The tests forever ensure your feature works as described, no manual testing needed, no regressions. And the model knows all the tricks to make your tests really beautiful.

TDD is the check and balance missing from most people’s agentic software dev process.

replies(1): >>45309019 #

1. jonstewart ◴[20 Sep 25 01:29 UTC] No.45309019[source]▶

>>45296208 #

Oh, I will never give up on TDD. And the assistants are great at helping to write tests, and especially analyzing the tests you have and suggesting others for edge cases.

But I have repeatedly seen claude get hung up on TDD itself and I've tried lots of different prompts/directions. It runs into a problem and inevitably runs ever more complicated shell commands and creating weird temp input files than sticking to "cargo test" and addressing the failing test.

Since I need to review the agent's code, I'd much prefer it to use a workflow like a human, with a progression of small commits following TDD--much easier to review the code then. If it's just splatting up big diffs, then it makes review harder, and that offsets any productivity gains.

↑