
469 points ghuntley | 1 comments
losvedir No.45004504
Can someone confirm my understanding of how tool use works behind the scenes? Claude, ChatGPT, etc., through the API offer "tools" and give responses that ask for tool invocations, which you then perform and send the result back. However, the underlying model is a strictly text-based medium, so I'm wondering how exactly the model APIs are turning the model's response into these different sorts of API responses. I'm assuming there's been a fine-tuning step with lots of examples that put desired tool invocations into some sort of delineated block, which the Claude/ChatGPT server understands? Is there any documentation about how this works exactly, and what those internal delineation tokens are? How do they ensure that the user's text doesn't mess with it and inject "semantic" markers like that?
replies(3): >>45004657 #>>45004890 #>>45005147 #
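The round trip described above can be sketched in code. This is a hypothetical, simplified version of the server-side loop: the delimiter strings, tool names, and JSON call format here are invented for illustration (real providers fine-tune on reserved special tokens, not plain-text markers, and the exact format is provider-specific):

```python
import json
import re

# Hypothetical delimiters the model was fine-tuned to emit around tool
# calls; real providers use reserved special tokens, not text markers.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

# Toy "tool" the application exposes to the model.
def get_weather(city):
    return {"city": city, "temp_c": 21}

TOOLS = {"get_weather": get_weather}

def handle_model_output(text):
    """Scan a model completion for a tool-call block; if found, run the
    named tool and return (tool_name, result), else (None, None)."""
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None, None
    call = json.loads(m.group(1))
    result = TOOLS[call["name"]](**call["arguments"])
    return call["name"], result

completion = ('Let me check.'
              '<tool_call>{"name": "get_weather",'
              ' "arguments": {"city": "Oslo"}}</tool_call>')
name, result = handle_model_output(completion)
print(name, result)  # → get_weather {'city': 'Oslo', 'temp_c': 21}
```

In the real APIs the serving layer does this extraction for you and returns the call as structured JSON (e.g. a `tool_use` content block in Anthropic's Messages API); you run the tool and send the result back as the next message.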
1. jedimastert No.45004657
Here are some docs from Anthropic about their implementation:

https://docs.anthropic.com/en/docs/agents-and-tools/tool-use...

The disconnect here is that models aren't really "text" based but token based, much as a compiler doesn't operate on the source text itself but on a stream of tokens: keywords, brackets, and so on. The output can include ordinary words, but also metadata such as reserved special tokens that delimit tool calls and that user-supplied text can never tokenize into.
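That last point answers the injection question in the parent comment. A toy sketch of the idea, with an invented vocabulary (real tokenizers use BPE and provider-specific reserved IDs, so treat this as illustrative only): control tokens live in a part of the vocabulary that encoding user text can never reach, so a user typing the literal delimiter string just gets ordinary text tokens.

```python
# Reserved control-token IDs exist only in the model's vocabulary;
# hypothetical IDs chosen for illustration.
SPECIAL = {"<TOOL_CALL>": 50001, "</TOOL_CALL>": 50002}

def encode_user_text(text):
    # User input is encoded as ordinary tokens (here: raw bytes, 0-255).
    return list(text.encode("utf-8"))

def encode_control(token):
    # Only the serving layer, never user text, maps to reserved IDs.
    return [SPECIAL[token]]

user_ids = encode_user_text("<TOOL_CALL>")   # ordinary byte tokens
control_ids = encode_control("<TOOL_CALL>")  # the reserved ID
print(all(i < 256 for i in user_ids), control_ids)  # → True [50001]
```

Because the two encodings can never collide, a prompt containing the string "<TOOL_CALL>" is just eleven ordinary characters to the model, not a tool-call boundary.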