Basic Facts about GPUs

(damek.github.io)

344 points ibobev | 1 comments | 24 Jun 25 12:15 UTC | HN request time: 0.212s | source

Show context

b0a04gl ◴[24 Jun 25 14:06 UTC] No.44366418[source]▶

been running llama.cpp and vllm on same 4070, trying to batch more prompts for serving. llama.cpp was lagging bad once I hit batch 8 or so, even though GPU usage looked fine. vllm handled it way better.

later found vllm uses paged kv cache with layout that matches how the GPU wants to read fully coalesced without strided jumps. llama.cpp was using a flat layout that’s fine for single prompt but breaks L2 access patterns when batching.

reshaped kv tensors in llama.cpp to interleave ; made it [head, seq, dim] instead of [seq, head, dim], closer to how vllm feeds data into fused attention kernel. 2x speedup right there w.r.t same ops.

GPU was never the bottleneck. it was memory layout not aligning with SM’s expected access stride. vllm just defaults to layouts that make better use of shared memory and reduce global reads. that’s the real reason it scales better per batch.

this took its own time of say 2+days and had to dig under the nice looking GPU graphs to find real bottlenecks, it was widly trial and error tbf,

> anybody got idea on how to do this kinda experiment in hot reload mode without so much hassle??

replies(5): >>44367323 #>>44367389 #>>44367889 #>>44367899 #>>44370340 #

jcelerier ◴[24 Jun 25 15:39 UTC] No.44367389[source]▶

>>44366418 #

did you do a PR to integrate these changes back into llama.cpp ? 2x speedup would be absolutely wild

replies(2): >>44367827 #>>44368492 #

zargon ◴[24 Jun 25 16:19 UTC] No.44367827[source]▶

>>44367389 #

Almost nobody using llama.cpp does batch inference. I wouldn’t be surprised if the change is somewhat involved to integrate with all of llama.cpp’s other features. Combined with lack of interest and keeping up with code churn, that would probably make it difficult to get included, with the number of PRs the maintainers are flooded with.

replies(2): >>44367881 #>>44368471 #

buildxyz ◴[24 Jun 25 17:16 UTC] No.44368471[source]▶

>>44367827 #

Any speed up that is 2x is definitely worth fixing. Especially since someone has already figured out the issue and performance testing [1] shows that llamacpp* is lagging behind vLLM by 2x. This is a positive for all running LLMs locally using llamacpp.

Even if llamacpp isnt used for batch inference now, this can allow those to finally run llamacpp for batching and on any hardware since vLLM supports only select hardware. Maybe finally we can stop all this gpu api software fragmentation and cuda moat as llamacpp benchmarks have shown Vulkan to be as or more performant than cuda or sycl.

[1] https://miro.medium.com/v2/resize:fit:1400/format:webp/1*lab...

replies(1): >>44369063 #

menaerus ◴[24 Jun 25 18:14 UTC] No.44369063[source]▶

>>44368471 #

So, what exactly is batch inference workload and how would someone running inference on local setup benefit from it? Or how would I even benefit from it if I had a single machine hosting multiple users simultaneously?

I believe batching is a concept only useful when during the training or fine tuning process.

replies(1): >>44369308 #

zargon ◴[24 Jun 25 18:36 UTC] No.44369308[source]▶

>>44369063 #

Batch inference is just running multiple inferences simultaneously. If you have simultaneous requests, you’ll get incredible performance gains, since a single inference doesn’t leverage any meaningful fraction of a GPU’s compute capability.

For local hosting, a more likely scenario where you could use batching is if you had a lot of different data you wanted to process (lots of documents or whatever). You could batch them in sets of x and have it complete in 1/x the time.

A less likely scenario is having enough users that you can make the first user wait a few seconds while you wait to see if a second user submits a request. If you do get a second request, then you can batch them and the second user will get their result back much faster than if they had had to wait for the first user’s request to complete first.

Most people doing local hosting on consumer hardware won’t have the extra VRAM for the KV cache for multiple simultaneous inferences though.

replies(1): >>44369472 #

menaerus ◴[24 Jun 25 18:47 UTC] No.44369472[source]▶

>>44369308 #

Wouldn't batching the multiple inference requests from multiple different users with multiple different contexts simultaneously impact the inference results for each of those users?

replies(1): >>44370743 #

pests ◴[24 Jun 25 20:36 UTC] No.44370743[source]▶

>>44369472 #

The different prompts being batched do not mathematically affect each other. When running inference you have massive weights that need to get loaded and unloaded just to serve the current prompt and however long its context is (maybe even just a few tokens even). This batching lets you manipulate and move the weights around less to serve the same amount of combined context.

replies(2): >>44371018 #>>44411782 #

1. spwa4 ◴[29 Jun 25 10:04 UTC] No.44411782[source]▶

>>44370743 #

If you add a dimension to the input vector you can do them independently and more efficiently. Look at this. Let's say you have a 2x2 network, and you apply it to an input vector of two values:

[i1 i2 ]⋅[w1 w3 ; w2 w4 ] = [i1 ⋅w1 +i2 ⋅w3 i1 ⋅w2 +i2 ⋅w4 ]

Cool. Now what happens if we make the input vector a 2x2 matrix with, for some reason, a second set of two input values:

[i1 i2 ; j1 j2 ]⋅[w1 w3 ; w2 w4 ] = [i1 ⋅ w1 +i2 ⋅ w3 i1 ⋅ w2 +i2 ⋅ w4 ; j1 ⋅ w1 +j2 ⋅ w3 j1 ⋅ w2 +j2 ⋅w4 ]

Look at that! The input has 2 rows, each row has an input value for the network and the output matrix has 2 rows, each containing the outputs for the respective inputs. So you can "just" apply your neural network to any number of input values by just putting one to each row. You could do 2, or 1000 this way ... and a number of values would only need to be calculated once.

↑