
451 points imartin2k | 1 comments
einrealist ◴[] No.44478912[source]
The major AI gatekeepers, with their powerful models, are already experiencing capacity and scale issues. This won't change unless the underlying technology (LLMs) undergoes a fundamental shift. As more and more things become AI-enabled, how dependent will we be on these gatekeepers and their computing capacity? And how much will they charge us for prioritised access to these resources? And we haven't really gotten to the wearable devices stage yet.

Also, everyone who requires these sophisticated models now needs to send everything to the gatekeepers. You could argue that we already send a lot of data to public clouds. However, there was no economically viable way for cloud vendors to read, interpret, and reuse my data (my intellectual property and private information). With more and more companies forcing AI capabilities on us, it's often unclear who runs those models, who receives the data, and what really happens to it.

This aggregation of power and centralisation of data worries me as much as the shortcomings of LLMs. The technology is still not accurate enough. But we want it to be accurate because we are lazy. So I fear that we will end up with many things of diminished quality in favour of cheaper operating costs — time will tell.

replies(3): >>44478949 #>>44479025 #>>44479921 #
kgeist ◴[] No.44479025[source]
We have been running our own LLM server at the office for a month now as an experiment (for privacy/infosec reasons), and a single RTX 5090 is enough to serve 50 people for occasional use. We run Qwen3 32b, which in some benchmarks is equivalent to GPT 4.1-mini or Gemini 2.5 Flash. The GPU handles 2 concurrent requests with 32k context each at 60 tok/s. At first I was skeptical that a single GPU would be enough, but it turns out most people don't use LLMs 24/7.
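A rough way to sanity-check the "most people don't use LLMs 24/7" point is to estimate how busy the 2 generation slots would be. This is only a sketch: the 50 users, 2 slots, and 60 tok/s come from the comment, while the requests per hour and the average response length are made-up illustrative values, and it assumes 60 tok/s applies per request and ignores prompt processing time.

```python
def slot_utilization(users: int,
                     requests_per_user_per_hour: float,
                     avg_output_tokens: float,
                     tokens_per_second: float,
                     slots: int) -> float:
    """Fraction of total slot time spent generating, averaged over one hour."""
    # Time one request occupies a slot (generation only, no prompt processing).
    seconds_per_request = avg_output_tokens / tokens_per_second
    # Total generation time demanded by all users in one hour.
    busy_seconds_per_hour = users * requests_per_user_per_hour * seconds_per_request
    # Available slot time in one hour is slots * 3600 seconds.
    return busy_seconds_per_hour / (slots * 3600)


# Hypothetical workload: 4 requests per user per hour, 600 output tokens each.
util = slot_utilization(users=50,
                        requests_per_user_per_hour=4,   # assumption, not from the thread
                        avg_output_tokens=600,          # assumption, not from the thread
                        tokens_per_second=60,
                        slots=2)
print(f"average slot utilization: {util:.1%}")
```

With those hypothetical numbers the slots sit idle most of the time; heavier usage patterns would change the picture, and an average says nothing about bursts where more than 2 people hit the server at once.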
replies(3): >>44479225 #>>44480111 #>>44480983 #
pu_pe ◴[] No.44480111[source]
That's really great performance! Could you share more details about the implementation (i.e., which quantized version of the model, how much RAM, etc.)?
replies(1): >>44480736 #
kgeist ◴[] No.44480736[source]
Model: Qwen3 32b

GPU: RTX 5090 (no ROPs missing), 32 GB VRAM

Quants: Unsloth Dynamic 2.0, which uses 4-6 bits depending on the layer.

RAM is 96 GB. More RAM makes a difference even if the model fits entirely in the GPU: the filesystem pages containing the model on disk are cached entirely in RAM, so when you switch models (we use other models as well) the overhead of unloading/loading is 3-5 seconds.

The key-value (KV) cache is also quantized to 8 bits (fewer bits degrade quality considerably).

This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything takes 30 GB VRAM, which also leaves some space for a Whisper speech-to-text model (turbo & quantized) running in parallel.
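The trade-off between one 64k context and two 32k contexts follows from the KV cache growing linearly with the total number of cached tokens. A back-of-the-envelope sketch is below. The architecture numbers (64 layers, 8 KV heads, head dimension 128) are my assumption about Qwen3 32b and should be checked against the model's config, and 1 byte per element is a simplification of 8-bit cache quantization that ignores per-block scale overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheConfig:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: float

    def bytes_per_token(self) -> float:
        # Factor 2: one key vector and one value vector per layer per KV head.
        return (2 * self.num_layers * self.num_kv_heads
                * self.head_dim * self.bytes_per_element)

    def cache_gib(self, context_tokens: int, concurrent: int = 1) -> float:
        # Each concurrent generation needs its own cache of context_tokens.
        return self.bytes_per_token() * context_tokens * concurrent / 2**30


# Assumed values, not taken from the thread; verify against the model config.
ASSUMED_QWEN3_32B = KVCacheConfig(
    num_layers=64,
    num_kv_heads=8,
    head_dim=128,
    bytes_per_element=1.0,  # simplified 8-bit cache
)

one_long = ASSUMED_QWEN3_32B.cache_gib(64 * 1024, concurrent=1)
two_short = ASSUMED_QWEN3_32B.cache_gib(32 * 1024, concurrent=2)
print(f"1 x 64k context: {one_long:.1f} GiB")
print(f"2 x 32k context: {two_short:.1f} GiB")
```

Under these assumptions both layouts need the same cache size (about 8 GiB), which is why the same VRAM budget can be spent either way. The rest of the 30 GB would then go to the quantized weights and runtime overhead.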

replies(2): >>44481959 #>>44482838 #
1. pu_pe ◴[] No.44481959{3}[source]
Thanks a lot. Interesting that without concurrent requests the context could be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.