Remember the revolutionary, seemingly inevitable tech that was poised to rewrite how humans thought about transportation? The incredible amounts of hype, the secretive meetings unveiling the device, etc.? That turned out to be the self-balancing scooter known as the Segway?
2. Segways were just ahead of their time: portable, lithium-ion-powered urban personal transportation is getting pretty big now.
The Segway always had a high barrier to entry. ChatGPT currently doesn't even require an account, and everyone already has a Google account.
It is even cheaper to serve an LLM answer than to call a web search API!
Zero chance all the users evaporate unless something much better comes along, the tech is banned, etc.
> It is even cheaper to serve an LLM answer than to call a web search API
These, uhhhh, these are some rather extraordinary claims. Got some extraordinary evidence to go along with them?
Anecdotally, the locally-run AI software I develop has gotten more than 100x faster in the past year thanks to hardware advancements, i.e. Moore's law.
But I want to point out that going from CPU to TPU is basically the opposite of a Moore's law improvement.
(A mid-to-high-end GPU can get similar or better performance, but it's a lot harder to get more RAM.)
5060 Ti 16GB, $450
If you want more than 16GB, that's when it gets bad.
And you should be able to get two and load half your model into each. It should be about the same speed as if a single card had 32GB.
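For what it's worth, that two-card split is doable with off-the-shelf tooling. Here's a minimal sketch using Hugging Face transformers + accelerate, assuming two 16GB cards; the model name and memory caps are illustrative, not a recommendation:

```python
# Minimal sketch: split one model across two 16GB GPUs with transformers + accelerate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # illustrative: any ~13B fp16 model too big for one 16GB card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",                    # accelerate places layers across cuda:0 and cuda:1
    max_memory={0: "15GiB", 1: "15GiB"},  # leave a little headroom on each 16GB card
)

prompt = "Explain pipeline parallelism in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```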
How cheap is inference, really? What about 'thinking' inference? What are the prices going to be once growth starts to slow and investors start demanding returns on their billions?
The unprofitability of the frontier labs is mostly due to them not monetizing the majority of their consumer traffic at all.
Relative to its siblings, things have gotten worse. A GTX 970 could hit 60% of the performance of the full Titan X at 35% of its price. A 5070 hits 40% of a full 5090 for 27% of the price. That's less series-relative performance for your money, at a price that's about $100 higher once you adjust for inflation.
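To spell out the ratio math implied there (using only the percentages from this comment, as back-of-the-envelope numbers):

```python
# Series-relative performance per series-relative dollar, from the percentages above.
gtx_970  = 0.60 / 0.35   # ~1.71: 60% of Titan X performance at 35% of its price
rtx_5070 = 0.40 / 0.27   # ~1.48: 40% of 5090 performance at 27% of its price
print(gtx_970, rtx_5070)  # the 970 was the better series-relative deal
```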
But if you have a fixed performance baseline you need to hit, then as long as the tech keeps improving, hitting that baseline will eventually get cheaper. That is, as long as you aren't also trying to improve in a way that moves the baseline up, which so far has been the AI industry's only consistent MO.
This seems super duper expensive and not really supported by the more reasonably priced Nvidia cards, though. SLI is deprecated, NVLink isn't available everywhere, etc.
And nothing I've seen about recent GPUs or TPUs from ANY maker (Nvidia, AMD, Google, Amazon, etc.) claims general speedups of 100x. Even going across multiple generations of these still-new hardware categories, Amazon's Inferentia/Trainium for example, their own claims (which are quite bold) would put the most recent generation at best at 10x the first. And as we all know, all vendors exaggerate the performance of their products.
Every layer of an LLM runs separately and sequentially, and there isn't much data transfer between layers. If you wanted to, you could put each layer on a separate GPU with no real penalty. A single request will only run on one GPU at a time, so it won't go faster than a single GPU with a big RAM upgrade, but it won't go slower either.
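A toy PyTorch sketch of that layer-split idea, assuming two CUDA devices are visible (the layer sizes are made up for illustration):

```python
# Each "layer" lives on its own GPU; only the small activation tensor moves between them.
import torch
import torch.nn as nn

layer0 = nn.Linear(4096, 4096).to("cuda:0")   # first half of the model
layer1 = nn.Linear(4096, 4096).to("cuda:1")   # second half of the model

x = torch.randn(1, 4096, device="cuda:0")     # one request's hidden state
h = layer0(x)            # runs on GPU 0
h = h.to("cuda:1")       # only the activation crosses the bus between layers
y = layer1(h)            # runs on GPU 1; GPU 0 sits idle meanwhile

print(y.shape)  # latency is roughly the same as one big GPU, as described above
```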