469 points ghuntley | 1 comment
losvedir No.45004504
Can someone confirm my understanding of how tool use works behind the scenes? Claude, ChatGPT, etc. offer "tools" through the API and return responses that ask for tool invocations, which you then perform and send the result back. However, the underlying model is a strictly text-based medium, so I'm wondering how exactly the model APIs turn the model's output into these different sorts of API responses. I'm assuming there's been a fine-tuning step with lots of examples that put desired tool invocations into some sort of delineated block, which the Claude/ChatGPT servers understand? Is there any documentation about how this works exactly, and what those internal delineation tokens are? How do they ensure that user text doesn't mess with it and inject "semantic" markers like that?
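
Concretely, the loop I'm describing looks something like this with the Anthropic Python SDK (a sketch: the get_weather tool, its schema, and the stub implementation are made up for illustration, and the model id may be stale):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

    def run_weather_lookup(city: str) -> str:
        # Stand-in for your real tool implementation; the API never
        # executes anything itself.
        return f"Sunny, 24C in {city}"

    # You describe the tool as a JSON schema; nothing runs server-side.
    tools = [{
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }]

    messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

    # 1. The model answers with a structured tool_use block instead of text.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

    # 2. Echo the assistant turn, run the tool yourself, and reply with
    #    tool_result blocks keyed to the tool_use ids.
    messages.append({"role": "assistant", "content": response.content})
    results = []
    for block in response.content:
        if block.type == "tool_use":
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_weather_lookup(block.input["city"]),
            })
    messages.append({"role": "user", "content": results})

    # 3. A second create() call with the extended history lets the model
    #    fold the raw result into a normal text answer.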
replies(3): >>45004657 >>45004890 >>45005147
1. the_mitsuhiko No.45004890
> I'm assuming there's been a fine-tuning step with lots of examples that put desired tool invocations into some sort of delineated block, which the Claude/ChatGPT servers understand?

As far as I know, that's what's happening. Models are trained to emit tool calls when they're unsure about the answer or instructed to use a tool. There is generic training for just following the tool-call format, and then probably some tool-specific training on top. For instance, gpt-oss loves to use the search tool even when it isn't mentioned anywhere. Anthropic lists well-known tools in their documentation (e.g. text_editor, bash); those have likely been trained to follow deeper semantics than just the generic tool-call format.
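
A training example plausibly looks something like this on the wire. The sentinel tokens here are invented for illustration, since each vendor has its own; gpt-oss's published "harmony" format is one real, documented instance of the idea:

    <|user|>What's the weather in Tokyo?<|end|>
    <|assistant|><|tool_call|>{"name": "get_weather", "arguments": {"city": "Tokyo"}}<|tool_call_end|>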

The whole thing is pretty brittle: tool invocations take place via in-band signalling, delineated by special tokens or token sequences.
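
On the server side, the provider scans the raw completion for those delimited spans and lifts them into the structured tool-call objects you see in the API. A sketch, reusing the invented markers from above; note that real stacks typically match on reserved token IDs rather than on strings, which is also the answer to the injection question upthread: the tokenizer never maps ordinary user bytes to those special IDs, so a user-typed "<|tool_call|>" comes through as inert text.

    import json
    import re

    # String matching stands in for token-ID matching here, purely for
    # illustration with the invented markers.
    TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|tool_call_end\|>", re.DOTALL)

    def extract_tool_calls(raw_completion: str) -> list[dict]:
        """Turn in-band tool-call spans into structured API objects."""
        calls = []
        for payload in TOOL_CALL_RE.findall(raw_completion):
            try:
                calls.append(json.loads(payload))
            except json.JSONDecodeError:
                pass  # brittle in exactly the way described above
        return calls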