The bottleneck for LLM inference is fast, large memory, not compute power.
Whoever is recommending investing in better chip (ALU) design hasn't done even a basic analysis of the problem.
Tokens per second = memory bandwidth divided by model size, because each generated token requires streaming every weight from memory.
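A quick back-of-envelope sketch of that formula. The bandwidth and model-size numbers below are illustrative assumptions (roughly H100-class HBM and a 70B-parameter model in fp16), not measurements:

```python
def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on autoregressive decode speed for batch size 1:
    every generated token must read all model weights from memory once,
    so throughput is capped at bandwidth / model size."""
    return bandwidth_gb_s / model_size_gb

# Assumed example numbers: ~3350 GB/s memory bandwidth,
# 70B parameters * 2 bytes (fp16) = 140 GB of weights.
print(tokens_per_second(3350, 140))  # ~23.9 tokens/s ceiling
```

Note this ceiling is independent of how fast the ALUs are: doubling compute changes nothing, while doubling memory bandwidth doubles it.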