451 points imartin2k | 22 comments
1. einrealist ◴[] No.44478912[source]
The major AI gatekeepers, with their powerful models, are already experiencing capacity and scale issues. This won't change unless the underlying technology (LLMs) undergoes a fundamental shift. As more and more things become AI-enabled, how dependent will we be on these gatekeepers and their computing capacity? And how much will they charge us for prioritised access to these resources? And we haven't really gotten to the wearable devices stage yet.

Also, everyone who requires these sophisticated models now needs to send everything to the gatekeepers. You could argue that we already send a lot of data to public clouds. However, there was no economically viable way for cloud vendors to read, interpret, and reuse my data (my intellectual property and private information). With more and more companies forcing AI capabilities on us, it's often unclear who runs those models, who receives the data, and what actually happens to it.

This aggregation of power and centralisation of data worries me as much as the shortcomings of LLMs. The technology is still not accurate enough. But we want it to be accurate because we are lazy. So I fear that we will end up with many things of diminished quality in favour of cheaper operating costs — time will tell.

replies(3): >>44478949 #>>44479025 #>>44479921 #
2. PeterStuer ◴[] No.44478949[source]
"how much will they charge us for prioritised access to these resources"

For the consumer side, you'll be the product, not the one paying in money, just like before.

For the creator side, it will depend on how well competition in the market holds up. Expect major regulatory capture efforts to eliminate all but a very few 'sanctioned' providers in the name of 'safety'. If only 2 or 3 remain, it might get really expensive.

3. kgeist ◴[] No.44479025[source]
We've been running our own LLM server at the office for a month now as an experiment (for privacy/infosec reasons), and a single RTX 5090 is enough to serve 50 people for occasional use. We run Qwen3 32b, which in some benchmarks is on par with GPT-4.1 mini or Gemini 2.5 Flash. The GPU handles 2 concurrent requests with 32k context each at 60 tok/s. At first I was skeptical a single GPU would be enough, but it turns out most people don't use LLMs 24/7.
replies(3): >>44479225 #>>44480111 #>>44480983 #
4. einrealist ◴[] No.44479225[source]
If those smaller models are sufficient for your use cases, go for it. But for how much longer will companies release smaller models for free? They invested so much. They have to recoup that money. Much will depend on investor pressure and the financial environment (tax deductions etc).

Open Source endeavours will have a hard time mustering the resources to train competitive models. Maybe we will see larger cooperatives, like an Apache Software Foundation for ML?

replies(7): >>44479267 #>>44479356 #>>44479513 #>>44479541 #>>44479835 #>>44479940 #>>44480209 #
5. tankenmate ◴[] No.44479267{3}[source]
"Maybe we will see larger cooperatives, like a Apache Software Foundation for ML?"

I suspect the Linux Foundation might be a more likely source, considering its backers and how many resources those backers have already provided to the LF. Whether that's aligned with LF's goals ...

6. msgodel ◴[] No.44479356{3}[source]
Even Google and Facebook are releasing distills of their models (Gemma3 is very good, competitive with Qwen3 if not sometimes better).

There are a number of reasons to do this: you want local inference, you want attention from devs and potential users, etc.

Also, the smaller self-hostable models are where most of the improvement happens these days. Eventually they'll catch up with where the big ones are today. At this point I honestly wouldn't worry too much about "gatekeepers."

7. ◴[] No.44479513{3}[source]
8. Gigachad ◴[] No.44479541{3}[source]
Seems like you don't have to train from scratch. You can distil a new model off an existing one just by buying API credits to copy the model.
replies(2): >>44479942 #>>44480855 #
9. DebtDeflation ◴[] No.44479835{3}[source]
It's not just about smaller models. I recently bought a Macbook M4 Max with 128GB RAM. You can run surprisingly large models locally with unified memory (albeit somewhat slowly). And now AMD has brought that capability to the x86 world with Strix. But I agree that how long Google, Meta, Alibaba, etc. will continue to release open-weight models is a big question. It's obviously just a catch-up strategy aimed at the moats of OpenAI and Anthropic; once they catch up, the incentive disappears.
10. ben_w ◴[] No.44479921[source]
> The major AI gatekeepers, with their powerful models, are already experiencing capacity and scale issues. This won't change unless the underlying technology (LLMs) undergoes a fundamental shift. As more and more things become AI-enabled, how dependent will we be on these gatekeepers and their computing capacity? And how much will they charge us for prioritised access to these resources? And we haven't really gotten to the wearable devices stage yet.

The scale issue isn't the LLM providers, it's the power grid. Worldwide, that's about 250 W per capita. Your body runs on about 100 W and you have a duty cycle of roughly 25% thanks to the 8-hour work day and weekends, so in practice some hypothetical AI trying to replace everyone in their workplaces today would need to be more energy-efficient than the human body.

Even with the extraordinarily rapid roll-out of PV, I don't expect this to allow a one-for-one replacement of all human workers before 2032, even if the best SOTA model were good enough to do so (and they're not; they've still got too many weak spots for that).
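
(A quick back-of-envelope sketch of that argument in Python, using only the figures from the comment above; the 250 W, 100 W and 25% numbers are the commenter's, nothing here is independently measured.)

    # Back-of-envelope version of the power argument above.
    grid_w_per_capita = 250   # worldwide supply per person, W (figure from the comment)
    body_w = 100              # rough metabolic power of a human, W
    duty_cycle = 0.25         # ~8 h/day, 5 days/week actually spent working

    # Averaged over the whole week, a worker "spends" roughly this much power on the job:
    human_work_w = body_w * duty_cycle                      # 25 W

    # An always-on AI drawing the same 25 W per replaced worker would already
    # claim about a tenth of today's entire per-capita supply:
    print(human_work_w, human_work_w / grid_w_per_capita)   # 25.0 0.1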

This also applies to open-weights models, which are already good enough to be useful even when SOTA private models are better.

> You could argue that we already send a lot of data to public clouds. However, there was no economically viable way for cloud vendors to read, interpret, and reuse my data — my intellectual property and private information. With more and more companies forcing AI capabilities on us, it's often unclear who runs those models and who receives the data and what is really happening to the data.

I dispute that it was not already a problem, due to the GDPR consent popups often asking to share my browsing behaviour with more "trusted partners" than there were pupils in my secondary school.

But I agree that the aggregation of power and centralisation of data is a pertinent risk.

11. ben_w ◴[] No.44479940{3}[source]
> Open Source endeavors will have a hard time to bear the resources to train models that are competitive.

Perhaps, but see also SETI@home and similar @home/BOINC projects.

12. einrealist ◴[] No.44479942{4}[source]
Your "API credits" don't buy the model. You just buy access to a model that is running somewhere else.
replies(2): >>44480744 #>>44480992 #
13. pu_pe ◴[] No.44480111[source]
That's really great performance! Could you share more details about the implementation (i.e. which quantized version of the model, how much RAM, etc.)?
replies(1): >>44480736 #
14. brookst ◴[] No.44480209{3}[source]
Pricing for commodities does not allow for “recouping costs”. All it takes is one company seeing models as a complementary good to their core product, worth losing money on, and nobody else can charge more.

I’d support an Apache for ML but I suspect it’s unnecessary. Look at all of the money companies spend developing Linux; it will likely be the same story.

15. kgeist ◴[] No.44480736{3}[source]
Model: Qwen3 32b

GPU: RTX 5090 (no ROPs missing), 32 GB VRAM

Quants: Unsloth Dynamic 2.0, it's 4-6 bits depending on the layer.

RAM: 96 GB. More RAM makes a difference even if the model fits entirely in the GPU: the filesystem pages holding the model on disk are cached in RAM, so when you switch models (we use other models as well) the unloading/loading overhead is only 3-5 seconds.

The key-value cache is also quantized to 8 bits (anything less degrades quality considerably).

This gives you 1 generation with 64k context, or 2 concurrent generations with 32k each. Everything takes about 30 GB of VRAM, which also leaves some room for a Whisper speech-to-text model (turbo, quantized) running in parallel.

replies(2): >>44481959 #>>44482838 #
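
(To illustrate what this setup looks like from the client side: a minimal sketch against the OpenAI-compatible endpoint that a llama.cpp server exposes — llama.cpp is confirmed further down the thread, but the host, port and model id here are placeholders, not the poster's actual configuration.)

    # Minimal client sketch for a local llama.cpp server (OpenAI-compatible API).
    # Host, port and model id are hypothetical placeholders.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",  # llama-server's default port, assumed
        json={
            "model": "qwen3-32b",                      # placeholder model id
            "messages": [{"role": "user", "content": "Summarise the attached log."}],
            "max_tokens": 1024,   # keep most of the 32k window free for the prompt
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])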
16. Drakim ◴[] No.44480744{5}[source]
You don't understand what Gigachad is talking about. You can buy API credits to gain access to a model in the cloud, and then use that to train your own local model through a process called distilling.
17. hatefulmoron ◴[] No.44480855{4}[source]
"Just" is doing a lot of heavy lifting there. It definitely helps with getting data but actually training your model would be very capital intensive, ignoring the cost of paying for those outputs you're training on.
18. greenavocado ◴[] No.44480983[source]
Qwen3 isn't good enough for programming. You need at least DeepSeek V3.
19. threeducks ◴[] No.44480992{5}[source]
What the parent poster means is that you can use the API to generate many question/answer pairs on which you then train your own model. For a more detailed explanation of this and other related methods, I can recommend this paper: https://arxiv.org/pdf/2402.13116
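
(In code, the idea is roughly the following. A toy sketch only: the endpoint, key and model names are placeholders, and a real distillation run would need far more prompts plus a separate fine-tuning step on the student model.)

    # Toy sketch of "distillation via API credits": collect the big model's answers
    # to a set of prompts and save them as training pairs for a smaller model.
    import json
    import requests

    API_URL = "https://api.example.com/v1/chat/completions"  # hypothetical endpoint
    API_KEY = "sk-placeholder"                               # placeholder key

    prompts = ["Explain TCP slow start.", "What does a B-tree index do?"]
    pairs = []
    for p in prompts:
        r = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"model": "big-teacher-model",  # placeholder teacher model id
                  "messages": [{"role": "user", "content": p}]},
            timeout=120,
        )
        pairs.append({"prompt": p, "completion": r.json()["choices"][0]["message"]["content"]})

    # This JSONL is what a fine-tuning run on the student model would consume.
    with open("distill_pairs.jsonl", "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")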
20. pu_pe ◴[] No.44481959{4}[source]
Thanks a lot. Interesting that without concurrent requests the context could be doubled; 64k is pretty decent for working on a few files at once. A local LLM server is something a lot of companies should be looking into, I think.
21. oceansweep ◴[] No.44482838{4}[source]
Are you doing this with vLLM? If you're using llama.cpp/Ollama, you could likely see some pretty massive improvements by switching.
replies(1): >>44483295 #
22. kgeist ◴[] No.44483295{5}[source]
We're using llama.cpp. We use all kinds of models other than Qwen3, and vLLM's startup when switching models is prohibitively slow (several times slower than llama.cpp, which already takes about 5 seconds).

From what I understand, vLLM is best when there's only 1 active model pinned to the GPU and you have many concurrent users (4, 8 etc.). But with just a single 32 GB GPU you have to switch the models pretty often, and you can't fit more than 2 concurrent users anyway (without sacrificing the context length considerably: 4 users = just 16k context, 8 users = 8k context), so I think vLLM so far isn't worth it. Once we have several cards, we may switch to vLLM.