Basic Facts about GPUs

(damek.github.io)

338 points ibobev | 3 comments | 24 Jun 25 12:15 UTC

b0a04gl ◴[24 Jun 25 14:06 UTC] No.44366418[source]▶

been running llama.cpp and vllm on same 4070, trying to batch more prompts for serving. llama.cpp was lagging bad once I hit batch 8 or so, even though GPU usage looked fine. vllm handled it way better.

later found vllm uses a paged kv cache with a layout that matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that’s fine for a single prompt but breaks L2 access patterns when batching.

reshaped the kv tensors in llama.cpp to interleave: made it [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.
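To illustrate the layout change described above, here is a minimal sketch using plain stride arithmetic. It is not llama.cpp or vllm code: the sizes SEQ, HEADS, DIM and the helper names are hypothetical, and it assumes contiguous row-major storage.

```python
def row_major_strides(shape):
    """Element strides of a contiguous row-major tensor with this shape."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


# Hypothetical sizes, chosen only for illustration.
SEQ, HEADS, DIM = 1024, 32, 128

# Map each axis name to its stride, for both layouts.
seq_major = dict(zip(("seq", "head", "dim"),
                     row_major_strides((SEQ, HEADS, DIM))))
head_major = dict(zip(("head", "seq", "dim"),
                      row_major_strides((HEADS, SEQ, DIM))))


def head_is_contiguous(strides, dim):
    """True if stepping one token forward within a single head lands right
    after the previous token's vector, i.e. the head's data is one
    contiguous block in memory."""
    return strides["seq"] == dim * strides["dim"]


print("[seq, head, dim] strides:", seq_major)
print("[head, seq, dim] strides:", head_major)
print("head contiguous, seq-major: ", head_is_contiguous(seq_major, DIM))
print("head contiguous, head-major:", head_is_contiguous(head_major, DIM))
```

In the [seq, head, dim] layout, consecutive tokens of one head sit HEADS * DIM elements apart, so a kernel that walks one head's sequence jumps through memory. In [head, seq, dim] they sit DIM elements apart, so the same walk is a single linear read.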

GPU was never the bottleneck. it was the memory layout not aligning with the SM’s expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that’s the real reason it scales better per batch.

this took 2+ days, and I had to dig under the nice-looking GPU graphs to find the real bottlenecks. it was wildly trial and error tbf.

> anybody got an idea how to do this kind of experiment in hot reload mode without so much hassle?

replies(5): >>44367323 #>>44367389 #>>44367889 #>>44367899 #>>44370340 #

chickenzzzzu ◴[24 Jun 25 19:53 UTC] No.44370340[source]▶

>>44366418 #

>GPU was never the bottleneck
>it was memory layout

ah right so the GPU was the bottleneck then

replies(1): >>44373234 #

CardenB ◴[25 Jun 25 02:48 UTC] No.44373234[source]▶

>>44370340 #

No because he was able to achieve the speedup without changing the GPU.

replies(1): >>44382025 #

1. chickenzzzzu ◴[25 Jun 25 21:35 UTC] No.44382025[source]▶

>>44373234 #

A more technically correct way to express this feeling is:

"The computational power of the cores on the GPU was never the issue. However, the code that I wrote resulted in a memory bandwidth bottleneck that starved the GPU cores of data to work on, and that is firmly within my responsibilities as a programmer: to fully understand the bandwidth and latency characteristics of the device(s) I'm running on"
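The distinction between compute and memory bandwidth can be made concrete with a roofline-style estimate. This is a minimal sketch: the function names are hypothetical and the peak and bandwidth figures are placeholders, not measured or published numbers for any particular GPU.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: throughput is capped either by the cores or by how
    fast memory can feed them, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOPs per byte) below which a kernel is
    limited by memory bandwidth rather than by compute."""
    return peak_flops / bandwidth_bytes_per_s


# Placeholder device numbers, for illustration only.
PEAK_FLOPS = 30e12   # assumed compute peak, FLOP/s
BANDWIDTH = 500e9    # assumed memory bandwidth, bytes/s

# A kernel doing 2 FLOPs per byte moved (assumed) sits far below the ridge.
bound = attainable_flops(PEAK_FLOPS, BANDWIDTH, 2.0)
print("ridge point (FLOPs/byte):", ridge_point(PEAK_FLOPS, BANDWIDTH))
print("attainable FLOP/s:", bound)
print("fraction of peak:", bound / PEAK_FLOPS)
```

With these placeholder numbers the kernel can reach only a small fraction of peak compute, however idle the cores look, which is the "starved of data" situation described above.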

replies(1): >>44395275 #

2. saagarjha ◴[27 Jun 25 09:31 UTC] No.44395275[source]▶

>>44382025 (TP) #

I mean they didn't write the code

replies(1): >>44400163 #

3. chickenzzzzu ◴[27 Jun 25 20:57 UTC] No.44400163[source]▶

>>44395275 #

And that's the reason why they misspoke
