> It will be interesting to know what challenges came up in nudging the model to work better with time travel debug data, since this data is novel and the models today might not be well trained for making use of it.
This is actually quite interesting - it's something I'm planning to make a future post about.
But basically the LLM seems to be fairly good at using this interface, so long as we tune the set of tools we provide quite carefully (a sketch of what such a tool set might look like follows the list):
* Where we wanted the LLM to use a tool only sparingly, it was better not to provide it at all. With time travel debugging it's usually better to work backwards, since that reveals the causality of the bug. When we gave Claude the ability to step forward, it tended to use it for everything, even when stepping backwards was clearly the better move.
* LLMs weren't great at managing state they had set up. Allowing the LLM to set breakpoints just confused it later, when it forgot they were there.
* Open-ended commands were a bad fit. For example, a time travel debugger can usually jump around in time according to an internal timebase. When the LLM was given unconstrained access to that, it tended to waste a lot of effort guessing timebases and looking to see what was there.
* Sometimes the LLM just wants to hold something the wrong way, and you have to let it. It was almost impossible to get the model to understand that it could step back directly into a function call on the previous line. It would always try going to that line first and then stepping back, resulting in an overshoot. We ended up adapting the tool so that it worked the way the model assumed it would.
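
To make the above concrete, here's a minimal sketch of what such a deliberately constrained tool set could look like. The tool names, descriptions, and schemas are invented for illustration and aren't our actual interface; the dict shape follows the JSON tool-definition format used by LLM tool-calling APIs such as Anthropic's.

```python
# Hypothetical tool surface for a time-travel-debugging agent,
# applying the lessons above: backwards-only navigation, no hidden
# state, no open-ended time jumps. Names and wording are illustrative.

REVERSE_DEBUG_TOOLS = [
    {
        # No step_forward counterpart is offered at all: if the tool
        # existed, the model would reach for it out of habit.
        "name": "reverse_step",
        "description": (
            "Step backwards by one source line through execution "
            "history. If the previous line contains a function call, "
            "this steps back INTO that call directly; there is no need "
            "to go to the line first and then step back."
        ),
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        # Instead of letting the model set breakpoints and then forget
        # them, each query names its target explicitly, so no state
        # persists between calls.
        "name": "last_write_to",
        "description": (
            "Jump backwards to the most recent point in history where "
            "the given variable or memory location was modified."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "Variable or expression to track.",
                }
            },
            "required": ["location"],
        },
    },
    # Deliberately absent: step_forward, set_breakpoint, and any raw
    # goto_time(timebase) command. Movement through history is only
    # expressed in terms of source-level events, never bare timebases.
]
```

Note how the `reverse_step` description bakes the "step back into the call on the previous line" behaviour directly into the tool rather than expecting the model to compose two moves, and how nothing in the list carries hidden state or accepts a raw timebase.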
The overall result is actually quite satisfactory, but it was a bit of a journey to understand how to give the LLM enough flexibility to generate insights without letting it get itself into trouble.