OP makes a valid point
> Now we contend with the “who guards the guard” problem. Because LLMs are unreliable agents, it might so happen that Claude just scammed us by spitting out useless (or otherwise low-effort) test cases. [...] So it’s time to introduce some human input in the form of additional test cases, which is made extra convenient since the model already provided the overall structure of our test. If those cases pass, we can be reasonably confident in integrating this function into our codebase.
In our repos, I would love to have an LLM tool/product that helps out with test writing, but for now the workflow certainly needs a human in the loop. More like "here's a start on test coverage, add a few cases of your own" or "give me a few bullet points of cases that should pass or fail", with a human reviewing the test code, not "go ahead and merge these tests I wrote for you".
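Concretely, something like this made-up pytest sketch (`slugify` and its expected outputs are invented for illustration, not taken from the article): the model supplies the parametrized skeleton plus a couple of obvious cases, and the reviewer appends the cases they actually care about before anything gets merged.

```python
# Illustrative sketch only: slugify() is a stand-in for whatever function
# the model was asked to cover, defined inline so the file actually runs.
import re

import pytest


def slugify(raw: str) -> str:
    """Stand-in implementation for the sketch."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", raw.strip().lower())
    return cleaned.strip("-")


@pytest.mark.parametrize(
    "raw, expected",
    [
        # --- model-generated starter cases ---
        ("Hello World", "hello-world"),
        ("already-a-slug", "already-a-slug"),
        # --- human-added cases during review ---
        ("  spaces   everywhere  ", "spaces-everywhere"),  # whitespace collapsing
        ("", ""),                                          # empty input
        ("Crème Brûlée!!", "cr-me-br-l-e"),                # non-ASCII: is this really what we want?
    ],
)
def test_slugify(raw, expected):
    assert slugify(raw) == expected
```

The value is in the last few rows: the model got the boilerplate out of the way, but a human decided which behaviors matter and whether the current behavior is even acceptable.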
Having an LLM generate the tests after you've already written the code for them is super counterproductive: those tests tend to just encode whatever the code currently does, bugs and all, so who knows whether they actually test anything?
I know this gets into "I wanted AI to do my laundry, not my art" territory, but a far more rational division of labor is for humans to write the tests (maybe with the assistance of an autocomplete model) and give those to the AI as context. Humans are way better at thinking of edge cases and design constraints than the models are at this point in the game.
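As a rough sketch of that division of labor (`parse_duration` and its contract are invented here, and the stub is deliberately left unimplemented): the human writes the failing tests up front, encoding the edge cases and error behavior they want, and the model's job is to produce an implementation that makes them pass.

```python
# Hypothetical "tests as the spec" sketch: a human writes these cases first
# and hands them to the model as context. parse_duration() is made up.
import pytest


def parse_duration(spec: str) -> int:
    """Deliberately unimplemented: the model is asked to replace this
    (or a real module is imported) so the tests below pass."""
    raise NotImplementedError


def test_plain_seconds():
    assert parse_duration("90s") == 90


def test_minutes_and_seconds():
    assert parse_duration("1m30s") == 90


def test_zero_is_allowed():
    assert parse_duration("0s") == 0


def test_rejects_prose():
    # The error behavior is part of the contract, not an afterthought.
    with pytest.raises(ValueError):
        parse_duration("ninety seconds")
```

Written that way, the tests are the prompt, and the human has already done the part the models are worst at: deciding what "correct" means.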