That’s not artificial intelligence.
Every time I try to work with them I lose more time than I gain. Net loss every time. Immensely frustrating. If I focus them on a small subtask I can gain some time (a rough draft of a test, say). Anything more advanced and it’s a monumental waste of time.
They are not even good librarians. They fail miserably at cross-referencing and contextualizing without constant leading.
A simple example is how LLMs do math. They are not calculators and have not memorized every sum in existence. Instead they deploy a whole set of mental-math techniques that were discovered at training time. For example, Claude uses a special trick for adding two-digit numbers ending in 6 and 9.
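To caricature the idea in code (a hand-written toy, not the learned circuit the report actually describes; the split into a rough tens path and a precise ones-digit path is my own simplification of the addition example there):

    # Toy illustration only, not Claude's actual mechanism: add two numbers by
    # running a rough "magnitude" path and a precise "ones digit" path, then
    # reconciling the two. A hand-coded caricature of the decomposition the
    # interpretability work describes for sums like 36 + 59.

    def tens_path(a: int, b: int) -> int:
        """Rough path: add the tens and ignore the ones digits entirely."""
        return (a // 10 + b // 10) * 10

    def ones_path(a: int, b: int) -> tuple[int, int]:
        """Precise path: the ones digit of the sum, and whether it carries."""
        s = a % 10 + b % 10
        return s % 10, s // 10

    def add_like_an_llm(a: int, b: int) -> int:
        """Reconcile the two paths into the final answer."""
        ones, carry = ones_path(a, b)
        return tens_path(a, b) + 10 * carry + ones

    print(add_like_an_llm(36, 59))  # 95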
Many more examples in this recent research report, including evidence of planning ahead while writing rhyming poetry.
https://www.anthropic.com/research/tracing-thoughts-language...
LLMs are unbelievably useful for me - never have I had a more powerful tool to assist my thinking. I use LLMs for work and play constantly, every day.
It pretends to sound like a person, can mimic speech and writing, and is all around perhaps the greatest wonder created by humanity.
It’s still not artificial intelligence though, it’s a talking library.
I’ve tried to use them as a research assistant on a history project, and they have been quite bad in that respect too, because of the immense naivety of their approach.
I couldn’t call them a librarian because librarians are educated and trained in cross-referencing material.
They have helped me in some searches, but no better than a search engine, and at a monumentally higher investment cost to the industry.
Then again, I am also speaking as someone who doesn’t like to offload all of my communications to those things. Use it or lose it, eh
On the other hand, the prompt/answer interface really limits what you can do with it. I can't just say, like I could with a human assistant, "Here's my calendar. Send me a summary of my appointments each morning, and when I tell you about a new one, record it in here." I can script something like that, and even have the LLM help me write the scripts, but since I can already write scripts, that's only a speed-up at best, not anything revolutionary.
I asked Grok what benefit there would be in having a script fetch the weather forecast data, pass it to Grok in a prompt, and then send the output to my phone. The answer was basically, "So I can say it nicer and remind you to take an umbrella if it sounds rainy." Again, that's kind of neat, but not a big deal.
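For the record, the plumbing for that is a few dozen lines of glue; the model only supplies the phrasing. A rough sketch, assuming any OpenAI-compatible chat endpoint and a push service like ntfy.sh (the endpoint, model name, and coordinates are placeholders, and the Open-Meteo parameter names are from memory, so check the docs):

    # Rough sketch of the "nicer weather message" pipeline: fetch the forecast,
    # have an LLM phrase it, push it to the phone. Endpoint, model name and
    # coordinates are placeholders; Open-Meteo/ntfy details are from memory.
    import json, urllib.request, urllib.parse

    LLM_URL = "https://api.example.com/v1/chat/completions"  # placeholder: any OpenAI-compatible endpoint
    LLM_KEY = "..."                                           # placeholder
    NTFY_URL = "https://ntfy.sh/my-weather-briefing"          # placeholder topic

    def fetch_forecast(lat=52.52, lon=13.41):
        """Pull a one-day forecast from Open-Meteo (no API key needed)."""
        params = urllib.parse.urlencode({
            "latitude": lat, "longitude": lon, "forecast_days": 1, "timezone": "auto",
            "daily": "temperature_2m_max,precipitation_probability_max",
        })
        with urllib.request.urlopen(f"https://api.open-meteo.com/v1/forecast?{params}") as r:
            return json.load(r)["daily"]

    def phrase(forecast):
        """Ask the model to turn the raw numbers into a short briefing."""
        body = json.dumps({
            "model": "some-model",  # placeholder
            "messages": [{"role": "user", "content":
                "Write a two-sentence morning weather briefing and tell me to take "
                "an umbrella if rain looks likely:\n" + json.dumps(forecast)}],
        }).encode()
        req = urllib.request.Request(LLM_URL, data=body, headers={
            "Authorization": f"Bearer {LLM_KEY}", "Content-Type": "application/json"})
        with urllib.request.urlopen(req) as r:
            return json.load(r)["choices"][0]["message"]["content"]

    def push(message):
        """Send the text to my phone via an ntfy topic the phone subscribes to."""
        urllib.request.urlopen(urllib.request.Request(NTFY_URL, data=message.encode()))

    if __name__ == "__main__":
        push(phrase(fetch_forecast()))  # run from cron each morning

All of which is ordinary glue code, which is exactly why it feels like a cool toy rather than a big deal.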
Maybe I just need to experiment more to see a big advance I can make with it, but right now it's still at the "cool toy" stage.
It’s weird to me that there’s such a giant gap between that and my experience of it being a minimum 10x multiplier.
The hype men trying to make a buck off them aren’t helping, of course.
If you ask the LLM to explain how it got the answer, the response it gives you won’t necessarily reflect the steps it actually used to figure out the answer.
"Our methods study the model indirectly using a more interpretable “replacement model,” which incompletely and imperfectly captures the original."
"(...) we build a replacement model that approximately reproduces the activations of the original model using more interpretable components. Our replacement model is based on a cross-layer transcoder (CLT) architecture (...)"
https://transformer-circuits.pub/2025/attribution-graphs/bio...
"Remarkably, we can substitute our learned CLT features for the model's MLPs while matching the underlying model's outputs in ~50% of cases."
"Our cross-layer transcoder is trained to mimic the activations of the underlying model at each layer. However, even when it accurately reconstructs the model’s activations, there is no guarantee that it does so via the same mechanisms."
https://transformer-circuits.pub/2025/attribution-graphs/met...
These two papers were designed to be used for the sort of argument that you’re making. You point to a blog post that glosses over it. You have to click through "Read the paper" to find a ~100-page paper, referencing another ~100-page paper, to find any of these caveats. The blog post you linked doesn’t even feature the words "replacement (model)" or any discussion of the reliability of this approach.
Yet it is happy to make bold claims such as "we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors", which is simply not true.
Sure, they added to the blog post: "the mechanisms we do see may have some artifacts based on our tools which don't reflect what is going on in the underlying model", but that seems like a lot of indirection when the fact is that every observation discussed in the papers and the blog posts is about nothing but such artifacts.
I have fed ChatGPT a PDF file with activity codes from a local tax authority and asked how I could classify some things I was interested in doing. It invented codes that didn’t exist.
I would be very very careful about asking any LLM to organize data for me and trusting the output.
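One cheap guard if you do try it: extract the real code list from the PDF yourself and reject anything the model cites that isn’t in it. A minimal sketch (the file name and code format are made up for illustration):

    # Minimal sketch: never trust LLM-returned classification codes without
    # checking them against the authoritative list. The file name and the
    # code format below are made up for illustration.
    import re

    def load_valid_codes(path="activity_codes.txt"):
        """One official code per line, extracted from the authority's PDF."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def invented_codes(llm_answer, valid):
        """Return any codes the model cited that are not in the official list."""
        cited = set(re.findall(r"\b\d{4}-\d/\d{2}\b", llm_answer))  # example format
        return sorted(cited - valid)

    if __name__ == "__main__":
        valid = {"6201-5/01", "6202-3/00"}   # in practice: load_valid_codes()
        answer = "You could register under 6201-5/01 or 1234-5/67."
        print("Invented codes:", invented_codes(answer, valid))  # ['1234-5/67']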
As for the "life advice" type of thing, they are very sycophantic. I wouldn’t go to a friend who always agrees with me enthusiastically for life advice. That sort of yes-man behavior is quite toxic.