author here, interesting to hear, I generally start a new chat for each interaction so I've never noticed this in the chat interfaces, and only with Claude using claude code, but I guess my sessions there do get much longer, so maybe I'm wrong that it's a harness bug
Yes, and with very long chats, you'll see it even forget how to do things like make tool calls - or even respond at all! I've had ChatGPT reply with raw JSON, regurgitate an earlier prompt, reply with a single newline, regurgitate information from a completely different chat, reply in a foreign language, and more.
Things get really wacky as it approaches decoherence.
Yeah, the raw JSON (in my case) is the result of a failed tool call, it was trying to generate an image. With thinking models, you can observe the degeneration of its understanding of image tool calls over the lifetime of a chat. It eventually puzzles over where images are supposed to be emitted, how it's supposed to write text, if it's allowed to provide commentary - and eventually, it gets all of it wrong. This also happens with file cites (in projects) and web search calls.