
124 points by simonpure | 3 comments
baalimago ◴[] No.44354802[source]
So... it's a language model...? As in, not "large"? I'm a bit unsure of the magnitudes here, but surely "nano" and "large" cancel out.
replies(1): >>44354838 #
IanCal ◴[] No.44354838[source]
No, vLLM is an engine for serving language models: https://github.com/vllm-project/vllm
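[Editor's note: a minimal client-side sketch of what "serving" means here. vLLM exposes an OpenAI-compatible HTTP server (recent versions start it with `vllm serve <model>`, older ones with `python -m vllm.entrypoints.openai.api_server`); the model id, port, and prompt below are placeholder assumptions, not from the thread.]

    # Sketch: querying a locally running vLLM OpenAI-compatible server.
    # Assumes the server was started separately, e.g. `vllm serve <model>` on port 8000.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.completions.create(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder: whatever model the server was started with
        prompt="What does vLLM do?",
        max_tokens=32,
    )
    print(resp.choices[0].text)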
replies(1): >>44356954 #
1. barrenko ◴[] No.44356954[source]
Is it more like llama.cpp then? I don't have access to good hardware.
replies(1): >>44362462 #
2. jasonjmcghee ◴[] No.44362462[source]
llama.cpp is optimized to serve one request at a time.

vLLM is optimized to serve many requests concurrently.

If you were to fine-tune a model and wanted to serve it to many users, you would use vLLM, not llama.cpp.
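[Editor's note: a rough illustration of the "many requests at once" point using vLLM's offline batched API; the model id and prompts are just example assumptions.]

    # Sketch of vLLM's offline batched inference: all prompts are scheduled
    # and decoded together, rather than handled in a one-request-at-a-time loop.
    from vllm import LLM, SamplingParams

    prompts = [
        "Summarize what paged attention is.",
        "Explain continuous batching in one sentence.",
        "Why batch requests when serving many users?",
    ]
    sampling = SamplingParams(temperature=0.7, max_tokens=64)

    llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # example model id; any supported HF model works
    for out in llm.generate(prompts, sampling):
        print(out.outputs[0].text)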

replies(1): >>44366601 #
3. jasonjmcghee ◴[] No.44366601[source]
Here's a highly relevant comment from another post: https://news.ycombinator.com/item?id=44366418