
Basic Facts about GPUs

(damek.github.io)
344 points | ibobev | 2 comments
b0a04gl No.44366418
been running llama.cpp and vllm on the same 4070, trying to batch more prompts for serving. llama.cpp was lagging badly once I hit batch size 8 or so, even though GPU utilization looked fine. vllm handled it way better.

later found that vllm uses a paged kv cache whose layout matches how the GPU wants to read: fully coalesced, without strided jumps. llama.cpp was using a flat layout that's fine for a single prompt but breaks L2 access patterns when batching.
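rough toy sketch of the paged-kv idea in python, just to picture it (block/pool sizes are made up and this is nothing like vllm's actual code, which lives in its PagedAttention kernels):

    import torch

    BLOCK_SIZE = 16      # tokens per physical block (made-up number)
    NUM_BLOCKS = 64      # size of the shared physical pool (made-up number)
    NUM_HEADS, HEAD_DIM = 8, 64

    # one shared pool of kv blocks instead of a big per-sequence tensor
    k_pool = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM)
    free_blocks = list(range(NUM_BLOCKS))
    block_tables = {}    # seq_id -> list of physical block ids

    def append_k(seq_id, pos, k_vec):
        # k_vec: [NUM_HEADS, HEAD_DIM] keys for one new token of one sequence
        table = block_tables.setdefault(seq_id, [])
        if pos // BLOCK_SIZE >= len(table):     # sequence grew past its last block
            table.append(free_blocks.pop())     # grab a block from the shared pool
        k_pool[table[pos // BLOCK_SIZE], pos % BLOCK_SIZE] = k_vec

    # two sequences in a batch share the pool, no max_seq_len reserved per prompt
    append_k(0, 0, torch.randn(NUM_HEADS, HEAD_DIM))
    append_k(1, 0, torch.randn(NUM_HEADS, HEAD_DIM))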

reshaped the kv tensors in llama.cpp to interleave them: [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into its fused attention kernel. 2x speedup right there for the same ops.
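pytorch analogue of that reshape (the real change was in llama.cpp's C/C++ cache code, this just shows what the strides look like):

    import torch

    seq_len, n_heads, head_dim = 4096, 32, 128
    k_flat = torch.randn(seq_len, n_heads, head_dim)    # [seq, head, dim]
    k_by_head = k_flat.permute(1, 0, 2).contiguous()    # [head, seq, dim]

    print(k_flat.stride())     # (4096, 128, 1): one head's keys sit 4096 elements apart per token
    print(k_by_head.stride())  # (524288, 128, 1): one head's keys are a single contiguous slab

walking one head across tokens is exactly what the attention kernel does for every query, so making that walk sequential is where the coalescing win comes from.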

the GPU was never the bottleneck. it was the memory layout not aligning with the SMs' expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global memory reads. that's the real reason it scales better per batch.
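if anyone wants to sanity-check the layout effect without touching llama.cpp at all, a crude pytorch timing like this (cuda events, illustrative shapes, not my actual benchmark) already shows the strided vs contiguous gap:

    import torch

    def time_ms(fn, iters=100):
        # warm up once, then time with cuda events
        fn(); torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / iters

    seq, heads, dim = 8192, 32, 128
    k_flat = torch.randn(seq, heads, dim, device="cuda")   # [seq, head, dim]
    k_by_head = k_flat.permute(1, 0, 2).contiguous()       # [head, seq, dim]

    # same reduction over head 0's keys: strided view vs contiguous slab
    print("strided   ", time_ms(lambda: k_flat[:, 0, :].sum()), "ms")
    print("contiguous", time_ms(lambda: k_by_head[0].sum()), "ms")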

this took 2+ days of its own, and I had to dig under the nice-looking GPU graphs to find the real bottleneck. it was wildly trial and error, tbf.

> anybody got an idea how to do this kind of experiment in a hot-reload mode without so much hassle?

replies(5): >>44367323 #>>44367389 #>>44367889 #>>44367899 #>>44370340 #
jcelerier No.44367389
did you open a PR to integrate these changes back into llama.cpp? a 2x speedup would be absolutely wild
replies(2): >>44367827 #>>44368492 #
zargon No.44367827
Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewhat involved to integrate with all of llama.cpp’s other features. Combined with the lack of interest, the need to keep up with code churn, and the number of PRs the maintainers are already flooded with, that would probably make it difficult to get included.
replies(2): >>44367881 #>>44368471 #
tough No.44367881
if you open a PR, even if it doesn't get merged, anyone with the same issue can find it and use your PR/branch/fix if it suits their needs better than master
replies(1): >>44369116 #
zargon No.44369116
Yeah, good point. I’ve applied such PRs myself in the past. Eventually the code churn can make them too much of a pain to maintain, but they’re useful for a while.