The command I ran was `curl -s https://r.jina.ai/https://www.lawfaremedia.org/article/anna-... | cb | ai -m gpt-5-mini summarize this article in one paragraph`. r.jina.ai pulls the page text as markdown, cb just wraps it in a ``` code fence, and ai is my own LLM CLI: https://github.com/david-crespo/llm-cli.
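Wrapped as a shell function, the same pipeline looks roughly like this (a minimal sketch; the `summarize` name, the default-model handling, and everything about `ai` beyond the `-m` flag shown in the original command are assumptions for illustration):

```sh
# Hypothetical wrapper around the pipeline above. cb and ai are the
# author's own tools; only `ai -m <model>` is taken from the original command.
summarize() {
  local url="$1"
  local model="${2:-gpt-5-mini}"   # assumed default, matching the command above
  # r.jina.ai returns the page text as markdown, cb fences it,
  # and ai sends it to the chosen model with the trailing prompt.
  curl -s "https://r.jina.ai/$url" \
    | cb \
    | ai -m "$model" summarize this article in one paragraph
}

# usage (hypothetical): summarize https://www.lawfaremedia.org/article/anna-...
```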
All of them seem pretty good to me, though at 6 cents per summary, using Sonnet regularly for this purpose would be excessive. Note that reasoning was left on the default setting in each case; I think that means the GPT-5 mini one did no reasoning but the other two did.
GPT-5 one paragraph: https://gist.github.com/david-crespo/f2df300ca519c336f9e1953...
GPT-5 three paragraphs: https://gist.github.com/david-crespo/d68f1afaeafdb68771f5103...
GPT-5 mini one paragraph: https://gist.github.com/david-crespo/32512515acc4832f47c3a90...
GPT-5 mini three paragraphs: https://gist.github.com/david-crespo/ed68f09cb70821cffccbf6c...
Sonnet 4.5 one paragraph: https://gist.github.com/david-crespo/e565a82d38699a5bdea4411...
Sonnet 4.5 three paragraphs: https://gist.github.com/david-crespo/2207d8efcc97d754b7d9bf4...
This is rarely an issue with SOTA models like Sonnet 4.5, Opus 4.1, or GPT-5 Thinking. But those are expensive, so all the companies use cut-rate models or little to no test-time compute (TTC) to save on cost and to go faster.
Similarly, I've had PMs blindly copy/paste summaries into larger project notes and ultimately create tickets based on either a misunderstanding by the LLM or a straight-up hallucination. I've repeatedly had conversations where a PM asks "when do you think Xyz will be finished?" and I have to respond "where and when did we even discuss Xyz? I'm not even sure what Xyz means in this context, so clarification would help." Then they just delete the ticket/bullet once they realize they never bothered to sanity-check what they were pasting.
The ways they fail are often surprising if your baseline is "these are thinking machines". If your baseline is what I wrote above (say, because you read the "Attention Is All You Need" paper), none of it's surprising.
My own mental model (condensed to a single phrase) is that LLMs are extremely convincing (on the surface) autocomplete. So far, this model has not disappointed me.
The result was just trash. It would do exactly as you say: condense the information, but with no semblance of a "summary". It would just pick random phrases or keywords from the release notes and string them together; the output had no meaning or clarity, it just seemed garbled.
And it wasn't for lack of trying; I kept trying to get a suitable result out of the AI well past the amount of time it would have taken me to summarize it myself.
The more I use these tools the more I feel their best use case is still advanced autocomplete.