
602 points | emrah | 4 comments
1. 999900000999 ◴[] No.43744734[source]
Assuming this can match Claude's latest, and assuming full-time usage (as in, you have a system that's constantly running code without any user input), you'd probably save $600 to $700 a month. A 4090 is only ~$2K, so you'd see ROI within 90 days.
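
A rough back-of-envelope in Python, using only the numbers above (the ~$650/month savings and ~$2K GPU price are this comment's estimates; electricity and depreciation are ignored):

    # Back-of-envelope ROI from the figures above; electricity,
    # depreciation, and any model-quality gap are ignored.
    gpu_cost = 2000          # USD, one-time purchase
    monthly_savings = 650    # USD, midpoint of the $600-700 estimate
    breakeven_months = gpu_cost / monthly_savings
    print(f"break-even after ~{breakeven_months:.1f} months")  # ~3.1 months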

I can imagine this will help drive prices for hosted LLMs lower.

At this level, any company that produces even a nominal amount of code should be running LLMs on-prem (or on AWS if you're on the cloud).

replies(1): >>43746595 #
2. rafaelmn ◴[] No.43746595[source]
I'd say a Mac Studio with an M4 Max and 128 GB of RAM will get you way further than a 4090 in context size and model size. It's cheaper than 2x 4090s, draws less power, and is a great overall machine.

I think these consumer GPUs are way too expensive for the amount of memory they pack - and that's intentional price discrimination. The builds are also gimmicky: they're just not set up for AI models, and the versions that are cost $20K.

AMD has that 128 GB RAM Strix Halo chip, but even with soldered RAM the memory bandwidth there is very limited: roughly half that of the M4 Max, which is itself roughly half that of a 4090.
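
To see why bandwidth dominates, here's a rough sketch of the decode-speed ceiling for a memory-bound model. The bandwidth figures are approximate public specs, not numbers from this thread, and the 16 GB model size is an assumed 4-bit quant of a ~27B model:

    # tokens/sec ceiling ~= memory bandwidth / bytes read per token (~= model size)
    # Bandwidth values below are approximate public specs (assumption).
    model_gb = 16  # assumed: ~27B model at 4-bit quantization
    for name, gb_per_s in [("Strix Halo", 256), ("M4 Max", 546), ("RTX 4090", 1008)]:
        print(f"{name}: ~{gb_per_s / model_gb:.0f} tokens/s upper bound")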

I think this generation of hardware and local models isn't there yet - I'd wait for the M5/M6 release.

replies(2): >>43747494 #>>43749915 #
3. tootie ◴[] No.43747494[source]
There's certainly room to grow, but I'm running Gemma 12B on a 4060 (8 GB VRAM) that I bought for gaming, and while it's a tad slow it still gives excellent results. Software certainly seems to be outpacing hardware right now. The target is a good-enough model that can run on a phone.
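
For reference, a minimal sketch of what that kind of setup can look like with llama-cpp-python; the GGUF filename and context size are assumptions, not details from the comment:

    # Minimal sketch: a quantized Gemma GGUF on a modest GPU via llama-cpp-python.
    # The model filename is hypothetical; use whatever quant fits in 8 GB VRAM.
    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma-3-12b-it-q4_0.gguf",  # assumed local quantized weights
        n_gpu_layers=-1,   # offload as many layers as will fit in VRAM
        n_ctx=8192,        # shrink this if you hit out-of-memory on 8 GB
    )

    out = llm("Summarize the tradeoffs of local vs hosted LLMs.", max_tokens=256)
    print(out["choices"][0]["text"])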
4. retinaros ◴[] No.43749915[source]
Two 3090s are the way to go.