
566 points PaulHoule | 9 comments
1. true_blue ◴[] No.44491118[source]
I tried the playground and got a strange response. I asked for a regex pattern, and the model gave itself a little game plan, then wrote the pattern and started writing tests for it. But it never stopped writing tests. It kept writing tests of increasing size until, I guess, it reached a context limit and the answer was canceled. Also, for each test it wrote, it added a comment about whether the test should pass or fail, but after about the 30th test it started getting those wrong too, saying a test should fail when it should actually pass if the pattern is correct. And after about the 120th test, the tests stopped making sense at all; they were just nonsense characters until the answer got cut off.

The pattern it made was also wrong, but I think the first issue is more interesting.

replies(5): >>44491301 #>>44493417 #>>44493628 #>>44497569 #>>44503983 #
2. fiatjaf ◴[] No.44491301[source]
This is too funny to be true.
3. beders ◴[] No.44493417[source]
I think that's a prime example showing that token prediction simply isn't good enough for correctness. It never will be. LLMs are not designed to reason about code.
4. ianbicking ◴[] No.44493628[source]
FWIW, I remember regular models doing this not that long ago, sometimes getting stuck in something like an infinite loop where they kept producing output that was only a slight variation on the previous output.
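
If anyone wants to reproduce that locally, here's a rough sketch using Hugging Face transformers (gpt2 is just a stand-in small model, nothing to do with the models discussed here): plain greedy decoding tends to fall into exactly this kind of loop, and the usual band-aids are the repetition_penalty and no_repeat_ngram_size knobs on generate(); they suppress the loop without fixing the underlying tendency.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in small model; repetition loops show up most easily here.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "Write a regex for ISO-8601 dates, then list test cases:"
    inputs = tok(prompt, return_tensors="pt")

    # Plain greedy decoding: often degenerates into near-identical
    # repeated output, like the behaviour described above.
    looping = model.generate(**inputs, max_new_tokens=150, do_sample=False)

    # Same call with anti-repetition settings; values are common starting
    # points, not tuned for anything in particular.
    steadier = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=False,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )

    print(tok.decode(steadier[0], skip_special_tokens=True))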
replies(1): >>44494758 #
5. data-ottawa ◴[] No.44494758[source]
If you shrink the context window on most models, you'll get this type of behaviour. If you go too small, you end up with basically gibberish, even on modern models like Gemini 2.5.

Mercury has a 32k context window according to the paper, which could be why it does that.

replies(1): >>44500870 #
6. _kidlike ◴[] No.44497569[source]
I had this happen to me on Claude Sonnet once. It started spitting out huge blocks of source code completely unrelated to my prompt, seemingly from its training data, and switching codebases once in a while... like, a few thousand lines of some C program, then switching to a JavaScript one, and so on. It was insane!
replies(1): >>44497686 #
7. CSSer ◴[] No.44497686[source]
Sounds like solidgoldmagikarp[0]. There must've been something in your prompt that tokenizes to one of those glitch tokens: strings that made it into the tokenizer's vocabulary but are barely represented in the training data, so the model never learned what to do with them.

[0] https://www.lesswrong.com/posts/jbi9kxhb4iCQyWG9Y/explaining...
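
If you want to sanity-check the glitch-token idea, here's a quick sketch with tiktoken against the GPT-2/GPT-3-era BPE (r50k_base), which as far as I remember is the vocabulary that post is about; the example strings are a couple the post flags as anomalous, plus an ordinary word for contrast, and you could swap in whatever was in the original prompt:

    import tiktoken

    # r50k_base is the GPT-2/GPT-3-era BPE discussed in the linked post.
    enc = tiktoken.get_encoding("r50k_base")

    # Strings the post flags as anomalous, plus an ordinary word for contrast.
    for s in [" SolidGoldMagikarp", " petertodd", " hello"]:
        ids = enc.encode(s)
        print(f"{s!r:>22} -> {len(ids)} token(s): {ids}")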

8. jdiff ◴[] No.44500870{3}[source]
I've run into this even with the modern million-token context length that 2.5 Pro offers: it kept trying one of a handful of failed approaches, realizing its failure, and looping without ever ending its train of thought until I yanked the tokens out of its mouth.

Even though it has gotten drastically rarer as models have improved, I think this is going to be one of the failure modes that's just fundamental to the technology.

9. throwaway314155 ◴[] No.44503983[source]
This is common amongst _all_ of the smaller LLMs.