xianshou:
One trend I've noticed, framed as a logical deduction:

1. Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

2. Coding agents do massively better when they have a test-driven reward signal (a minimal sketch of such a loop follows below).

3. If a problem can be framed so that a coding agent can solve it, development speeds up at least 10x relative to the base case of human + assistant.

4. From (1)-(3), if you can get all the necessary context into 50k tokens and measure progress via tests, you can speed up development by 10x.

5. Therefore all new development should be microservices written from scratch and interacting via cleanly defined APIs.

Sure enough, I see HN projects evolving in that direction.

huac:
> Coding assistants based on o1 and Sonnet are pretty great at coding with <50k context, but degrade rapidly beyond that.

I had a very similar impression (wrote more in https://hua.substack.com/p/are-longer-context-windows-all-yo...).

One framing is that the effective context window (i.e. the length the model can actually reason over) determines how useful the model is. A human new-grad programmer might effectively reason over hundreds or thousands of tokens but not millions, which is why we carefully scope their work and point them only to the relevant context. A principal engineer, by contrast, might reason over many millions of tokens of context: code, yes, but also organizational and business context.

Carefully selecting those 50k tokens is extremely difficult for LLMs/RAG today. I expect effective context windows to get much longer, but hardware and cost constraints make that harder than it sounds.