- The same spec is implemented differently by the same LLM each time you build from scratch. This can perhaps be mitigated somewhat by turning the temperature down (see the first sketch below), but generally speaking, the same spec won't give the same result twice unless you are very specific.
- The same goes for different LLMs: one spec can give entirely different results across models.
- This can probably be mitigated somewhat by making the spec more specific, but at some point the spec becomes so specific that it is effectively the code itself. Unless, of course, you don't care that much about the details; but if you don't, you get a slightly different app every time you implement from scratch.
- Gemini 2.5 Pro has "reasoning" capabilities and introduces a lot of "thinking" tokens into the context. Say you start with a single-line spec and iterate from there: Gemini will hand back a more detailed spec based on its thinking process. If you then take that thinking-derived spec as the starting point for the next iteration, you get even more thinking. In short, reasoning models automatically expand the spec by way of their "thinking" (sketched in the loop below).
- The produced code can have small bugs, but they are not really worth putting in the spec, because they are an implementation detail.
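
To make the first point concrete, here is a minimal sketch of pinning the temperature, assuming the `google-genai` Python SDK (my assumption; any client with a temperature parameter works the same way). Even at temperature 0, serving-side nondeterminism means runs can still differ slightly, which is why this only mitigates the problem.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

spec = "Build a CLI tool that deduplicates lines in a text file."

# Temperature 0 makes sampling greedy, which reduces (but does not
# eliminate) run-to-run variation for the same spec.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=spec,
    config=types.GenerateContentConfig(temperature=0.0),
)
print(response.text)
```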
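
And a sketch of the spec-expansion loop from the Gemini point: feed the model's thinking-derived spec back in as the next starting point and watch it grow. The prompt wording and the round count here are illustrative choices of mine, not anything the model requires.

```python
from google import genai

client = genai.Client()

spec = "Build a CLI tool that deduplicates lines in a text file."

# Each round feeds the previous thinking-derived spec back in as the
# new starting point; a reasoning model tends to expand it every time.
for round_no in range(3):
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=f"Refine this spec into a more detailed one:\n\n{spec}",
    )
    spec = response.text
    print(f"round {round_no}: spec is now {len(spec)} characters")
```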
I'll keep experimenting with it, but I don't think this is the holy grail of AI-assisted coding.