If I could take a SWAG at it, I'd say a low-resolution model like Llama 2 would probably be just fine (Llama 2 quantizes without too much headache), but a higher-resolution model like Llama 3 probably not so much, not without massive retraining anyway.
The 95% gain is specifically for multiplication operations only. Inference is compute-light and memory-heavy in the first place, so the actual end-to-end gains would be far smaller.
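To make that concrete, here's a back-of-envelope Amdahl-style bound. The 30% multiply share is purely an illustrative assumption on my part, not a measured figure:

```python
# Amdahl-style bound: a 95% saving on multiply energy caps the overall
# saving at 0.95 * (fraction of total energy spent on multiplies).
mult_fraction = 0.30   # ASSUMED share of inference energy spent on multiplies
mult_saving = 0.95     # claimed per-multiply saving
overall = mult_fraction * mult_saving
print(f"overall energy saving <= {overall:.0%}")  # ~28% under these assumptions
```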
Tech journalism (all journalism, really) can hardly be trusted to publish grounded news, given the focus on clicks and revenue they need to survive.
I'm not claiming it's impossible, nor am I claiming it isn't true, or at least honest.
But there will need to be evidence that, using real machines and real energy, an _equivalent performance_ is achievable. A defense that "there are no suitable chips" is a bit disingenuous. If the 95% savings actually has legs, some smart chip manufacturer will do the math and make the chips. If it's correct, that chip-making firm will make a fortune. If it's not, they won't.
Also,
> Additionally, it reduces energy consumption by 55.4% to 70.0%
With humility, I don't know what that means. It seems like some dubious math with percentages.
I would start by downloading a 1.58-bit model such as: https://huggingface.co/HF1BitLLM/Llama3-8B-1.58-100B-tokens
Run the non-quantized version of the model on your 3090/4090 GPU and observe the power draw, then load the 1.58-bit model and observe the power usage (see the sketch below). Sure, the numbers will vary widely, because there are many GPUs/NPUs on which to make the comparison.
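A minimal sketch of that measurement, assuming an NVIDIA card with the `nvidia-ml-py` (pynvml) bindings plus `torch` and `transformers` installed. The model IDs are just the two variants mentioned above; the 1.58-bit checkpoint may need its own loading path depending on your transformers version:

```python
import threading
import time

import pynvml
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def measure_power(model_id, prompt="Explain quantization briefly.", n_tokens=256):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples = []
    stop = threading.Event()

    def sampler():
        # Poll GPU power draw every 100 ms while generation runs.
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
            time.sleep(0.1)

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="cuda"
    )
    inputs = tok(prompt, return_tensors="pt").to("cuda")

    t = threading.Thread(target=sampler)
    t.start()
    start = time.time()
    model.generate(**inputs, max_new_tokens=n_tokens)
    torch.cuda.synchronize()
    elapsed = time.time() - start
    stop.set()
    t.join()
    pynvml.nvmlShutdown()

    avg_w = sum(samples) / len(samples)
    print(f"{model_id}: {avg_w:.0f} W avg, {elapsed:.1f} s, "
          f"{avg_w * elapsed / n_tokens:.2f} J/token")


# e.g. measure_power("meta-llama/Meta-Llama-3-8B")
#      measure_power("HF1BitLLM/Llama3-8B-1.58-100B-tokens")
```

Comparing joules per token rather than raw wattage matters here: a quantized model that draws similar power but generates faster still comes out ahead on energy.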
That said, time to market has been more important than any concern for efficiency for some time. Now, and going forward, there is more of a focus on efficiency, as the expense of equipment and power has really grown.
Terrible logic. By that logic we wouldn't be using Python for machine learning at all (or x86 for compute). Yet here we are.
Right now, the only way to gain real knowledge is to read the comments on those articles.
They have been buying AMD CPUs for a while now, which says something about both AMD and Intel.
I like that the narrative has changed from "AI only runs on CUDA" to "sure, it runs fine on AMD if you must".