
531 points huseyinkeles | 2 comments
1. Havoc No.45571541
>If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for --device_batch_size in the scripts and reduce it until things fit. E.g. from 32 (default) to 16, 8, 4, 2, or even 1.

That sounds like it could run on a 24GB GPU. Batch size of 8 would imply ~20GB of memory, no?

...presumably just takes forever
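
A rough sketch of the arithmetic (the fixed-overhead and per-sample numbers below are made-up assumptions, not measurements from the repo): activations scale with batch size, but weights and optimizer state don't, so VRAM use is closer to affine than linear in the batch size.

    # Back-of-envelope VRAM model for scaling down --device_batch_size.
    # FIXED_GB and PER_SAMPLE_GB are invented illustrative numbers:
    # weights + optimizer state are fixed, activations scale with batch size.
    FIXED_GB = 12.0       # assumed: weights, optimizer state, framework overhead
    PER_SAMPLE_GB = 2.0   # assumed: activation memory per sample

    def estimated_vram_gb(device_batch_size):
        return FIXED_GB + PER_SAMPLE_GB * device_batch_size

    for bs in (32, 16, 8, 4, 2, 1):
        print(f"batch size {bs:>2}: ~{estimated_vram_gb(bs):.0f} GB")

Under those made-up numbers, batch size 32 lands just under 80GB but batch size 8 still wants ~28GB, so a 24GB card might need 4 or smaller; the straight 80/4 = 20GB estimate only holds if everything scaled linearly.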

replies(1): >>45572065 #
2. zipy124 No.45572065
Yes, you can always stream data when training or doing inference on models when VRAM is lacking, but the slowdown is extremely noticeable. This is the case for CPU code too, and is why optimising for bandwidth is so critical in high-performance computing: your ability to compute is almost always substantially larger than your bandwidth. An AVX-512-capable CPU with a suitable number of cores is easily capable of multiple teraflops of fp64 arithmetic, but is typically limited by memory bandwidth. GPUs running LLMs have just made this bottleneck familiar to more people.
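
As a concrete, entirely illustrative roofline-style example (the peak and bandwidth figures are assumed, not for any specific chip): take 2.5 TFLOP/s of peak fp64 and 200 GB/s of sustained bandwidth, and a streaming kernel like daxpy that does 2 flops per 24 bytes moved.

    # Roofline-style estimate with assumed, illustrative machine figures.
    PEAK_FP64_GFLOPS = 2500.0   # assumed peak fp64 compute, GFLOP/s
    MEM_BANDWIDTH_GBS = 200.0   # assumed sustained memory bandwidth, GB/s

    # daxpy: y[i] = a*x[i] + y[i] -> 2 flops per element,
    # 24 bytes of traffic (read x, read y, write y).
    intensity = 2 / 24                          # flops per byte
    achievable = MEM_BANDWIDTH_GBS * intensity  # GFLOP/s when bandwidth-bound

    print(f"bandwidth-bound rate: {achievable:.1f} GFLOP/s")
    print(f"peak compute:         {PEAK_FP64_GFLOPS:.0f} GFLOP/s")
    print(f"fraction of peak:     {achievable / PEAK_FP64_GFLOPS:.1%}")

With those numbers the kernel uses well under 1% of the machine's arithmetic capability; the rest sits idle behind the memory bus.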

A fun consequence of CPUs getting faster at a quicker rate than memory: look-up tables of pre-computed values used to be a common optimisation in code, but for common use-cases it is now almost always quicker to re-compute a value than to retrieve the pre-computed one from memory.
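
A quick sketch of that trade-off, in numpy for convenience (the table size, the function being replaced, and the outcome are all assumption-dependent; a small cache-resident table of an expensive function can still win):

    import time
    import numpy as np

    N = 10_000_000
    TABLE_SIZE = 1 << 22   # ~32 MB of float64: bigger than most caches
    x = np.random.uniform(0.0, 2.0 * np.pi, N)

    # Precomputed sine table plus the index of each x into it.
    table = np.sin(np.linspace(0.0, 2.0 * np.pi, TABLE_SIZE))
    idx = (x / (2.0 * np.pi) * (TABLE_SIZE - 1)).astype(np.int64)

    t0 = time.perf_counter()
    recomputed = np.sin(x)    # vectorised recompute
    t1 = time.perf_counter()
    looked_up = table[idx]    # random gather from the 32 MB table
    t2 = time.perf_counter()

    print(f"recompute sin(x): {t1 - t0:.3f}s")
    print(f"gather from LUT:  {t2 - t1:.3f}s")

Which side wins depends on how cheap the recomputation is and whether the table fits in cache; as the table outgrows cache the gather becomes memory-latency-bound, which is exactly the effect described above.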