Have you ever tried using a base model from HuggingFace? They can't even answer simple questions. You give a raw base model the input
What is the capital of the United States?
and there's a fucking good chance it will complete it as
What is the capital of Canada?
and it's just as likely to complete it with an essay about the history of the early American republic, or a sociological treatise questioning the very idea of capital cities.
Impressive, but not very useful. A good base model will complete your input with text that generally makes sense and is usually correct, but is often completely different from what you intended it to generate. Base models are like a very smart dog: a genius dog that was never trained and refuses to obey most of the time.
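If you want to see this for yourself, here is a minimal sketch using the transformers library. "gpt2" is just a small example checkpoint (any base, non-instruct model wanders the same way), and the continuations you get will be different on every run.

```python
# A minimal sketch: sampling completions from a raw base model.
# "gpt2" is only an example checkpoint; swap in any base model you like.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is the capital of the United States?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a few continuations; don't expect an answer, expect "plausible text".
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.9,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("---")
```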
So even simple behaviors, like taking one side of a conversation as a chatbot, require fine-tuning (the results being the *-instruct models you find on HuggingFace). In machine-learning parlance, this is supervised learning.
But in the case of chatbot behavior, the fine-tuning is not that complex, because the training corpora are already full of conversations: the model encoded a lot of this during the unsupervised pre-training phase.
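Concretely, what the fine-tuning adds is a fixed chat template the model learns to follow. Here's a small sketch, assuming an instruct checkpoint such as mistralai/Mistral-7B-Instruct-v0.2 (just an example), of how a conversation gets rendered into the format the supervised phase taught the model to expect:

```python
# A sketch of what "-instruct" fine-tuning buys you: the turns are wrapped in
# the special tokens the model was fine-tuned on, so it completes as an
# assistant instead of free-associating. The model name is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

messages = [
    {"role": "user", "content": "What is the capital of the United States?"},
]

# tokenize=False returns the rendered string so you can inspect the template;
# add_generation_prompt=True appends the marker where the assistant answers.
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```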
Now let's think about editing code, not simply generating it. Let's do a simple experiment. Go to your project and issue the following command:
claude -p --output-format stream-json "your prompt here to do some change in your code" | jq -r 'select(.type == "assistant") | .message.content[]? | if .type? == "tool_use" then "tool call: \(.name)" else (.text // empty) end'
Pay attention to the incredible number of tool calls the model generates in its output. Now think of the whole thing as one conversation: does it look even remotely similar to anything the model would find in its training corpora?
Editing existing code, deleting it, refactoring it, is a far more complex operation than just generating a new function or class. It requires the model to read the existing code, produce a plan identifying what needs to be changed or deleted, and then emit output with the appropriate tool calls.
Token sequences that simply create new code have lower entropy, they are more probable, than the complex sequences that edit and refactor existing code.
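One rough way to make "more probable" concrete (a sketch, not a benchmark) is to score the average per-token negative log-likelihood of candidate sequences under a base model. The model name, the sample strings, and the edit_file tool name below are all made up for illustration:

```python
# Score how "expected" a sequence is for a base model: lower average
# negative log-likelihood means the sequence looks more like pretraining data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint only
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=input_ids makes the model return the mean cross-entropy loss
        loss = model(ids, labels=ids).loss
    return loss.item()

# Hypothetical examples: fresh code vs. an edit expressed as a tool call.
new_code = "def add(a, b):\n    return a + b\n"
edit_call = '{"tool": "edit_file", "path": "utils.py", "old": "return a+b", "new": "return a + b"}'

print("new function NLL:", avg_nll(new_code))
print("edit tool call NLL:", avg_nll(edit_call))
```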