
169 points by hunvreus | 4 comments
1. NitpickLawyer No.43653452
Cool article! The stack (and results) are impressive, but I also appreciate the article itself: it starts from the basics and builds up to the point clearly and gradually. Easy to follow.

On a bit of a tangent rant: this kind of writing is slowly going away, taken over by LLM slop (and I'm a huge fan of LLMs, just not of the people who write those kinds of articles). I was recently looking for real-world benchmarks for vllm/sglang deployments of DeepSeek3 on an 8x 96GB pod, to see whether the model fits in that amount of RAM once you account for KV cache and context length, what numbers people actually get, etc.

Of the ~20 articles that Google surfaced across various keyword attempts, none were what I was looking for. The excerpts seemed promising, and some even offered tables and figures related to DS3 and RAM usage, but all of them were LLM crap. All were written in that same simple intro - bla bla - conclusion style, and some even had RAM requirements that made no sense (running a model trained in FP8 in 16-bit, something no one would do, etc.).

replies(2): >>43653624, >>43654145
2. fxtentacle No.43653624
While I fully agree with you on the absence of good benchmarks and the growing LLM slop ...

"running a model trained in FP8 in 16bit, something noone would do, etc"

I did that because on the RTX 3090 - which can be good bang for the buck for inference - FP8 support is nerfed at the driver level. So a kernel that upscales FP8 to FP16 inside SRAM, does the matmul, and then downscales to FP8 again can bring massive performance benefits on those consumer cards.
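
Roughly, at the PyTorch level, the trick looks like the sketch below (assuming a recent PyTorch build with the float8 dtypes; the real speedup comes from doing the upcast tile-by-tile inside a fused matmul kernel so it never leaves SRAM, which this eager version obviously doesn't capture):

    import torch

    # Weights stored in FP8 to cut memory use/traffic; shapes are illustrative.
    w_fp8 = torch.randn(4096, 4096, device="cuda").to(torch.float8_e4m3fn)
    x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)

    # In the fused kernel this upcast happens per tile in SRAM;
    # done eagerly like this it round-trips through global memory.
    w_fp16 = w_fp8.to(torch.float16)

    y = x @ w_fp16.t()                 # matmul runs on the plain FP16 tensor cores
    y_fp8 = y.to(torch.float8_e4m3fn)  # optional: downcast the result back to FP8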

BTW, you can run a good DeepSeek3 quant on a single H200.

replies(1): >>43653977
3. NitpickLawyer No.43653977
Thanks! I was looking at Blackwell 6000 PROs, 8x 96GB, for running full FP8 (as it's supported and presumably fast).

I know AWQ should run, and be pretty snappy and efficient with the new MLA support added, but I wanted to check whether FP8 fits as well, because from simple napkin math it looks pretty tight (it might only work at bs=1 with ctx_len < 8k, which would probably not suit coding tasks).
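
Roughly, the napkin math (the parameter count and MLA cache dimensions are from memory / the public V3 config, so treat the numbers as approximate):

    # DeepSeek-V3 in FP8 on 8x 96GB: back-of-the-envelope memory budget.
    total_vram_gb = 8 * 96                  # 768 GB across the pod
    weights_gb = 671 * 1.0                  # ~671B params at ~1 byte/param in FP8

    # MLA KV cache: (kv_lora_rank 512 + rope dim 64) per token per layer,
    # 61 layers, assuming the cache is kept in bf16 (2 bytes/element).
    kv_bytes_per_token = (512 + 64) * 61 * 2
    ctx_len, batch = 8192, 1
    kv_gb = kv_bytes_per_token * ctx_len * batch / 1e9

    headroom_gb = total_vram_gb - weights_gb - kv_gb
    print(f"weights ~{weights_gb:.0f} GB, kv ~{kv_gb:.2f} GB, headroom ~{headroom_gb:.0f} GB")
    # ~96 GB of headroom in total, i.e. ~12 GB per GPU for activations, CUDA
    # context, comms buffers and framework overhead -- hence "pretty tight".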

4. thomasjudge No.43654145
You are describing this as a writing problem, but it sounds more like a search results / search engine problem.