> I have a gaming rig with a 4080 with 16GB of RAM and it can’t even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized.
Which Mixtral? What does “heavily quantized” mean?
> A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM
Not if the 4090 is in a computer with non-trivial system RAM (which it more or less has to be), since models can be split between the GPU and CPU. The M1 Max might outperform the 4090 machine on models that need more than 24GB but no more than 32GB, but a 4090 box with 16GB+ of system RAM can load bigger models than the M1 Max by spilling layers into system RAM, and it will outperform it on any model that fits entirely in its 24GB of VRAM.
The system I actually use is a gaming laptop with a 16GB 3080 Ti and 64GB of system RAM, and Mixtral 8x7B at 4-bit (~30GB total) runs fine with the model split between VRAM and system RAM.
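For anyone wanting to try that kind of GPU/CPU split, here's a minimal sketch using llama-cpp-python; the model file name and layer count are illustrative, and you'd tune `n_gpu_layers` to whatever actually fits in your VRAM:

```python
# A minimal sketch, assuming llama-cpp-python and a local 4-bit GGUF of Mixtral 8x7B.
# The file name and layer count below are illustrative, not a recommendation.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",  # ~4-bit quant, roughly 26-30GB on disk
    n_gpu_layers=18,  # offload however many layers fit in VRAM; the rest run on the CPU from system RAM
    n_ctx=4096,       # context window; larger contexts need more memory
)

out = llm("Briefly explain what a mixture-of-experts model is.", max_tokens=200)
print(out["choices"][0]["text"])
```

Token generation slows down as more layers land on the CPU, but it's the difference between "runs, slowly" and "doesn't fit at all."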