----
Several codebases I've known have provided a three-stage pipeline: unit tests, integration tests, and e2e tests. Each of these batches of tests depend on the creation of one of three environments, and the code being tested is what ends up in those environments. If you're interested in a particular failing test, you can use the associated environment and just iterate on the failing test.
For humans with a bit of tribal knowledge about the project, humans who have already solved the get-my-dev-environment-set-up problem in more or less uniform way, this works ok. Humans are better at retaining context over weeks and months, whereas you have to spin up a new session with an LLM every few hours or so. So we've created environments for ourselves that we ignore most of the time, but that are too complex to be bite sized for an agent that comes on the scene as a blank slate every few hours. There are too few steps from blank-slate to production, and each of them is too large.
But if successively more complex environments can be built on each other in arbitrarily many steps, then we could achieve finer granularity. As a nix user, my mental model for this is function composition where the inputs and outputs are environments, but an analogous model would be layers in a docker files where you test each layer before building the one on top of it.
Instead of maybe three steps, there are eight or ten. The goal would be to have both whatever code builds the environment, and whatever code tests it, paired up into bite-sized chunks so that a failure in the pipeline points you a specific stage which is more specific that "the unit tests are failing". Ideally test coverage and implementation complexity get distributed uniformly across those stages.
Keeping the scope of the stages small maximizes the amount of your codebase that the LLM can ignore while it works. I have a flake output and nix devshell corresponding to each stage in the pipeline and I'm using pytest to mark tests based on which stage they should run in. So I run the agent from the devshell that corresponds with whichever stage is relevant at the moment, and I introduce it to onlythe tests and code that are relevant to that stage (the assumption being that all previous stages are known to be in good shape). Most of the time, it doesn't need to know that it's working stage 5 of 9, so it "feels" like a smaller codebase than it actually is.
If evidence emerges that I've engaged the LLM at the wrong stage, I abandon the session and start over at the right level (now 6 of 9 or somesuch).
You can use an LLM to help document a codebase, but it's still an arduous task because you do need to review and fix up the generated docs. It will make, sometimes glaring sometimes subtle, mistakes. And you want your documentation to provide accuracy rather than double down on or even introduce misunderstanding.
In future I'll go "In the name of our new darling bot, let us unit test and refactor this complicated thing".