There was some tangentially related discussion in this post: https://news.ycombinator.com/item?id=45050415, but this cost analysis answers so many questions and gives me a better idea of how huge a margin on inference a lot of these providers could be taking. Plus, I'm sure that Google or OpenAI can get more favorable data center rates than the average Joe Schmoe.

A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr. With 188 million input tokens/hr and 80 million output tokens/hr, that comes out to around $2/million input tokens, and $4.70/million output tokens.

This is actually a lot more than DeepSeek R1's rates of $0.10-$0.60/million input and $2/million output, but I'm sure major providers are not paying AWS p5 on-demand pricing.

Edit: those figures were per node, so the actual input and output prices should be divided by 12: $0.17/million input tokens and $0.39/million output tokens.
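
For anyone who wants to check the arithmetic, here is a minimal sketch using only the figures quoted in this comment. The helper name is made up for illustration, and like the comment it charges the full cluster price against input tokens and output tokens separately, which is a simplifying assumption.

```python
# Figures as quoted in the comment; the helper name is illustrative only.
NODE_PRICE_PER_HOUR = 31.40          # USD, one 8xH100 node, AWS on-demand
NODES = 12                           # 96 GPUs / 8 GPUs per node
INPUT_TOKENS_PER_NODE_HOUR = 188e6   # throughput is per node, per the edit
OUTPUT_TOKENS_PER_NODE_HOUR = 80e6


def cost_per_million(price_per_hour: float, tokens_per_hour: float) -> float:
    """USD per one million tokens at the given hourly price and throughput."""
    return price_per_hour / (tokens_per_hour / 1e6)


cluster_price = NODE_PRICE_PER_HOUR * NODES
# Assumption: the whole cluster price is attributed to each token type in turn.
input_cost = cost_per_million(cluster_price, INPUT_TOKENS_PER_NODE_HOUR * NODES)
output_cost = cost_per_million(cluster_price, OUTPUT_TOKENS_PER_NODE_HOUR * NODES)

print(f"cluster: ${cluster_price:.2f}/hr")
print(f"input:   ${input_cost:.2f} per 1M tokens")
print(f"output:  ${output_cost:.2f} per 1M tokens")
```

Swapping in a different per-node price is a one-line change to `NODE_PRICE_PER_HOUR`.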

replies(6): >>45065474 #>>45065821 #>>45065830 #>>45065838 #>>45065925 #>>45067796 #

7. ◴[29 Aug 25 15:27 UTC] No.45065380{4}[source]▶

>>45065147 #

8. SV_BubbleTime ◴[29 Aug 25 15:30 UTC] No.45065413{3}[source]▶

>>45065313 #

Also “open source” I feel covers for “open weights” which is not the same thing.

replies(1): >>45072258 #

9. s46dxc5r7tv8 ◴[29 Aug 25 15:31 UTC] No.45065424[source]▶

>>45064329 (OP) #

Separation of the prefill and decoding layers with sglang is quite nifty! Normally 8xH100 would barely be able to hold the 4bit quantization of the model without even considering the KV cache. One prefill node for 3 decode nodes is also fascinating, nice writeup.

10. ◴[29 Aug 25 15:35 UTC] No.45065474[source]▶

>>45065331 #

11. ollybee ◴[29 Aug 25 15:38 UTC] No.45065503{3}[source]▶

>>45064954 #

H100s can be $2 an hour, so $192 an hour for the full cluster. They report 22k tokens per second, so ~80 million an hour; that's $16 an hour at $0.2 per million. Maybe a bit more for input tokens, but it seems a long way off.

replies(1): >>45066003 #

12. randomjoe2 ◴[29 Aug 25 15:39 UTC] No.45065518{4}[source]▶

>>45065147 #

Local doesn't refer to "on metal" anymore to many people

replies(3): >>45065653 #>>45065663 #>>45067202 #

13. DSingularity ◴[29 Aug 25 15:42 UTC] No.45065549{4}[source]▶

>>45065147 #

I guess local for him is independent/private.

14. monsieurbanana ◴[29 Aug 25 15:50 UTC] No.45065653{5}[source]▶

>>45065518 #

I missed that train

replies(1): >>45065843 #

15. mwcz ◴[29 Aug 25 15:51 UTC] No.45065663{5}[source]▶

>>45065518 #

"On metal" is muddied too. I've heard people refer to web apps running in an OCI container as being "bare metal" deployment, as opposed to AWS or whatever hosting platform.

That's silly, but the idea that "local" is not the opposite of remote is even sillier.

replies(2): >>45065742 #>>45065883 #

16. arnaudsm ◴[29 Aug 25 15:53 UTC] No.45065693[source]▶

>>45064329 (OP) #

Interestingly, this is 10x cheaper than the cheapest provider on OpenRouter : https://openrouter.ai/deepseek/deepseek-r1?sort=price

Inference is more profitable than I thought.

17. ffsm8 ◴[29 Aug 25 15:56 UTC] No.45065742{6}[source]▶

>>45065663 #

You can run an OCI container on bare metal though. It doesn't stop being run on bare metal just because you're running in kernel namespaces, aka docker container

Lots of people were advocating for running their k8s on bare metal servers to maximize the performance of their containers

Now whether that applies to your conversation... I've no clue, too little context ( ｡ ŏ ﹏ ŏ )

replies(1): >>45066076 #

18. numpad0 ◴[29 Aug 25 16:00 UTC] No.45065796[source]▶

>>45065135 #

These open models are just commercial binary distributions made available at zero cost with intention to cripple opportunities for Western LLM providers to capitalize on investments.

These are more like really gorgeous corporate swag than FOSS.

replies(4): >>45067590 #>>45070046 #>>45070743 #>>45072256 #

19. matt-p ◴[29 Aug 25 16:01 UTC] No.45065821[source]▶

>>45065331 #

188M input / 80M output tokens per hour was per node I thought?

Reversing out these numbers tells us that they're paying about $2/H100/Hour (or $16/hour for a 8xH100 node).

Disclaimer: one of my sites, https://www.serversearcher.com/servers/gpu, says that a one-month commit on an 8xH100 node goes for $12.91/hour. The "I'm buying the servers and putting them in colo" rate usually works out at around $10/hour, so there's scope here to reduce the cost by ~30% just by doing better/more committed purchasing.

replies(1): >>45066005 #

20. caminanteblanco ◴[29 Aug 25 16:02 UTC] No.45065830[source]▶

>>45065331 #

Ok, so the authors apparently used atlas cloud hosting, which charges $1.80 per h100/hr, which would change the overall cost to around $0.08/ million input and $0.18/million output, which seems much more in line with massive inference margins for major providers.

21. paxys ◴[29 Aug 25 16:02 UTC] No.45065838[source]▶

>>45065331 #

According to the post their costs were $0.20/1M output tokens (on cloud GPUs), so your numbers are off somewhere.

22. vFunct ◴[29 Aug 25 16:03 UTC] No.45065843{6}[source]▶

>>45065653 #

My basement server is really confused by all this...

replies(1): >>45069674 #

23. brilee ◴[29 Aug 25 16:05 UTC] No.45065876[source]▶

>>45064329 (OP) #

For those commenting on cost per token:

This throughput assumes 100% utilization. A bunch of things raise the cost at scale:

- There are no on-demand GPUs at this scale. You have to rent them for multi-year contracts. So you have to lock in some number of GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Your peak throughput at west coast business hours is probably 2-3x higher than the throughput at tail hours (east coast morning, west coast evenings)

- GPUs are often regionally locked due to data processing issues + latency issues. Thus, it's difficult to utilize these GPUs overnight because Asia doesn't want their data sent to the US and the US doesn't want their data sent to Asia.

These two factors mean that GPU utilization comes in at 10-20%. Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot in RL inference or model training to happen in these off-peak hours, maximizing utilization.

But for those companies purely specializing in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
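
A rough sketch of how utilization eats into margin. The $0.20 per million output tokens is the cost figure cited elsewhere in this thread for full utilization; the $2.00 sale price is purely an assumption for illustration, as are the function names.

```python
COST_AT_FULL_UTILIZATION = 0.20  # USD per 1M output tokens, figure cited in the thread
ASSUMED_PRICE = 2.00             # USD per 1M output tokens; hypothetical sale price


def effective_cost(cost_at_full: float, utilization: float) -> float:
    """Cost per 1M tokens when the GPUs are busy only `utilization` of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_at_full / utilization


def margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of the sale price."""
    return 1 - cost / price


for utilization in (1.0, 0.2, 0.1):
    cost = effective_cost(COST_AT_FULL_UTILIZATION, utilization)
    print(
        f"utilization {utilization:.0%}: "
        f"cost ${cost:.2f}/M, margin {margin(ASSUMED_PRICE, cost):.0%}"
    )
```

Under these assumptions a 90% margin at full utilization shrinks to about 50% at 20% utilization and disappears at 10%.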

replies(7): >>45067585 #>>45067903 #>>45067926 #>>45068175 #>>45068222 #>>45072198 #>>45073200 #

24. dtech ◴[29 Aug 25 16:06 UTC] No.45065883{6}[source]▶

>>45065663 #

If you define bare metal as not being under a VM, it fits. OCI on Linux is cgroups, so that counts as not a VM, I'd say. Or at least it's a layer closer to the metal than a typical VM running OCI images.

Is a Java app running on Linux bare metal?

25. zipy124 ◴[29 Aug 25 16:09 UTC] No.45065925[source]▶

>>45065331 #

AWS is absolutely not cheap, and never has been. You want to look for the Hetzner of the GPU world, like runpod.io, where they are $2 an hour, so $16/hr for 8; that's already half of AWS. You can almost certainly also get a volume discount if you're looking for 96.

An H100 costs about $32k; amortized over 3-5 years, that gives roughly $1.22 to $0.73 per hour, so after adding electricity costs, CPU/RAM, etc., runpod.io is running much closer to the actual cost than AWS.
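
The amortization is simple enough to sketch. This assumes straight-line depreciation, zero resale value, and a card that is billable around the clock; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def hourly_capex(purchase_price: float, years: float) -> float:
    """Straight-line hardware cost per hour, assuming 24/7 use and no resale value."""
    return purchase_price / (years * HOURS_PER_YEAR)


H100_PRICE = 32_000  # USD, figure quoted in the comment

for years in (3, 5):
    print(f"{years} years: ${hourly_capex(H100_PRICE, years):.2f}/hr")
```

Anything below 100% utilization raises the effective hourly figure proportionally.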

replies(2): >>45071097 #>>45071121 #

26. zipy124 ◴[29 Aug 25 16:15 UTC] No.45066003{4}[source]▶

>>45065503 #

I think you misread. That's 22k tokens per second per node, so per 8 H100s. With 12 nodes they get 264k tokens per second, or 950 million an hour. This gets you to roughly $0.2021 per million at $2 an hour.
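
Spelled out as a quick sketch, with all inputs taken from the figures in this comment and the variable names chosen for illustration:

```python
TOKENS_PER_SECOND_PER_NODE = 22_000
NODES = 12
GPUS_PER_NODE = 8
GPU_PRICE_PER_HOUR = 2.00  # USD per H100-hour, the rate assumed in the comment

tokens_per_second = TOKENS_PER_SECOND_PER_NODE * NODES
tokens_per_hour = tokens_per_second * 3600
cluster_price_per_hour = GPU_PRICE_PER_HOUR * GPUS_PER_NODE * NODES
cost_per_million_tokens = cluster_price_per_hour / (tokens_per_hour / 1e6)

print(f"{tokens_per_second:,} tokens/s across the cluster")
print(f"{tokens_per_hour / 1e6:.1f}M tokens/hr")
print(f"${cluster_price_per_hour:.2f}/hr for {GPUS_PER_NODE * NODES} GPUs")
print(f"${cost_per_million_tokens:.4f} per 1M tokens")
```

The tiny difference from the figure above comes from rounding 950.4 million down to 950 million before dividing.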

27. caminanteblanco ◴[29 Aug 25 16:15 UTC] No.45066005{3}[source]▶

>>45065821 #

You were definitely right, I updated the original comment. Thanks for your correction!

28. zipy124 ◴[29 Aug 25 16:17 UTC] No.45066023[source]▶

>>45064819 #

This is all costs included. That's 22k tokens per second per node, so per 8 H100s. With 12 nodes they get 264k tokens per second, or 950 million an hour. This gets you to roughly $0.2021 per million at $2 an hour for an H100, which is what they go for on services such as runpod.io (cheaper if not paying spot price, plus volume discounts).

29. ozgune ◴[29 Aug 25 16:18 UTC] No.45066036[source]▶

>>45064329 (OP) #

The SGLang Team has a follow-up blog post that talks about DeepSeek inference performance on GB200 NVL72: https://lmsys.org/blog/2025-06-16-gb200-part-1/

Just in case you have $3-4M lying around somewhere for some high quality inference. :)

SGLang quotes a 2.5-3.4x speedup as compared to the H100s. They also note that more optimizations are coming, but they haven't yet published a part 2 on the blog post.

replies(1): >>45074618 #

30. okasaki ◴[29 Aug 25 16:20 UTC] No.45066076{7}[source]▶

>>45065742 #

In my opinion, if you're running k8s on bare metal, that's "k8s on bare metal" but still "<your app> on kubernetes", not "<your app> on bare metal".

replies(1): >>45067194 #

31. ffsm8 ◴[29 Aug 25 17:44 UTC] No.45067194{8}[source]▶

>>45066076 #

Sorry, but then your opinion is just plain wrong

Bare metal in the context of running software is a technical term with a clear meaning that hasn't become contested like "AI" or "Crypto" - and that meaning is that the software is running directly on the hardware.

As k8s isn't virtualization, processes spawned by its orchestrator are still running on bare metal. It's the whole reason why containers are more efficient compared to virtual machines

replies(2): >>45067264 #>>45067809 #

32. bee_rider ◴[29 Aug 25 17:45 UTC] No.45067202{5}[source]▶

>>45065518 #

Local doesn’t need to be “on metal,” but I’m still confused as to what they are saying. Are they running some local cloud system?

33. bee_rider ◴[29 Aug 25 17:50 UTC] No.45067264{9}[source]▶

>>45067194 #

Bare metal as in, no operating system? Does Linux really get in the way of these LLM inference engines?

replies(1): >>45067336 #

34. ffsm8 ◴[29 Aug 25 17:56 UTC] No.45067336{10}[source]▶

>>45067264 #

No, as I said in my previous comment: bare metal as in not a virtual machine

https://en.m.wikipedia.org/wiki/Bare-metal_server

replies(1): >>45068930 #

35. jerrygenser ◴[29 Aug 25 18:17 UTC] No.45067585[source]▶

>>45065876 #

Re the overnight problem: that's why some providers offer batch-tier jobs at 50% off, which return results within 12 or 24 hours, for non-interactive use cases.

36. badsectoracula ◴[29 Aug 25 18:17 UTC] No.45067590{3}[source]▶

>>45065796 #

> intention to cripple opportunities for Western LLM providers to capitalize on investments.

Western LLM providers release open weight models too (e.g. Mistral).

37. bluedino ◴[29 Aug 25 18:36 UTC] No.45067796[source]▶

>>45065331 #

> A node of 8 H100s will run you $31.40/hr on AWS, so for all 96 you're looking at $376.80/hr

And what stinks is that you can't even build a Dell/HPE server like this online. You have to 'request a quote' for an 'AI Server'

Going through SuperMicro, you're looking at about $60k for the server, plus 8 GPUs at $25,000 each, so you're close to $300,000 for an 8 GPU node.

Now, that doesn't include networking, storage, racks, electricity, cooling, someone to set that all up for you, $1,000 DAC cables, NVIDIA middleware, downtime as the H100s are the flakiest pieces of junk ever and will need to be replaced every so often...

Setting up a 96 H100 cluster (12 of those puppies) in this case is probably going to cost you $4-5 million. But it should cost less than AWS after a year and a half.
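
A back-of-the-envelope break-even sketch, using the AWS cluster rate quoted at the top of this comment and the $4-5 million build estimate. The `opex_per_hour` parameter is a hypothetical knob for power, cooling, and staff; it defaults to zero, which flatters the self-build case.

```python
HOURS_PER_YEAR = 24 * 365
AWS_CLUSTER_PRICE_PER_HOUR = 376.80  # USD for 96 H100s, as quoted above


def break_even_years(
    capex: float, rental_price_per_hour: float, opex_per_hour: float = 0.0
) -> float:
    """Years of 24/7 use until owning the cluster is cheaper than renting it."""
    saving_per_hour = rental_price_per_hour - opex_per_hour
    if saving_per_hour <= 0:
        raise ValueError("owning never breaks even at these rates")
    return capex / saving_per_hour / HOURS_PER_YEAR


for capex in (4_000_000, 5_000_000):
    years = break_even_years(capex, AWS_CLUSTER_PRICE_PER_HOUR)
    print(f"${capex / 1e6:.0f}M build: break-even after {years:.2f} years")
```

With zero running costs this lands between roughly 1.2 and 1.5 years, and any non-zero `opex_per_hour` pushes it later.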

replies(2): >>45068071 #>>45071538 #

38. mystifyingpoi ◴[29 Aug 25 18:37 UTC] No.45067809{9}[source]▶

>>45067194 #

I think both of you are correct.

Of course, a process running inside Kubernetes Pod, on a baremetal node will show up in `top` if I run it on the node directly. In such terms, it is running directly on hardware.

But when I deploy this Pod, I'm not interacting with the OS in any way. I'm interacting with Kubernetes apiserver, telling it what to run, not really caring about the operating system underneath. In such terms, the application is running "in k8s".

39. cootsnuck ◴[29 Aug 25 18:46 UTC] No.45067901[source]▶

>>45064329 (OP) #

Super helpful to see actual examples of what it (roughly) can look like to deploy production inference workloads, and also the latest optimization efforts.

I consult in this space and clients still don't fully understand how complex it can get to just "run your own LLM".

40. lbhdc ◴[29 Aug 25 18:46 UTC] No.45067903[source]▶

>>45065876 #

If you are willing to spread your workload out over a few regions, getting that many GPUs on demand can be doable. You can use something like compute classes on GCP to fall back to different machine types if you do hit stockouts. That doesn't make you impervious to stockouts, but it makes things a lot more resilient.

You can also use duty cycle metrics to scale down your gpu workloads to get rid of some of the slack.

41. empiko ◴[29 Aug 25 18:49 UTC] No.45067926[source]▶

>>45065876 #

You also need to consider that the field is moving really fast and you cannot really rely on being able to have the same margins in a year or two.

42. Tepix ◴[29 Aug 25 19:03 UTC] No.45068071{3}[source]▶

>>45067796 #

I think you can get the server itself quite a bit cheaper than $60k. I found a barebone for around 19400€ at https://www.lambda-tek.de/Supermicro-SYS-821GE-TNHR-sh/B4760...

43. derefr ◴[29 Aug 25 19:12 UTC] No.45068175[source]▶

>>45065876 #

> There are no on-demand GPUs at this scale.

> These two factors mean that GPU utilization comes in at 10-20%.

Why don't these two factors cancel out? Why wouldn't a company building a private GPU cluster for their own use, also sit a workload scheduler (e.g. Slurm) in front of it, enable credit accounting + usage-based-billing on it, and then let validated customer partners of theirs push batch jobs to their cluster — where each such job will receive huge spot resource allocations in what would otherwise be the cluster's low-duty point, to run to completion as quickly as possible?

Just a few such companies (and universities) deciding to rent their excess inference capacity out to local SMEs, would mean that there would then be "on-demand GPUs at this scale." (You'd have to go through a few meetings to get access to it, but no more than is required to e.g. get a mortgage on a house. Certainly nothing as bad as getting VC investment.)

This has always been precisely how the commercial market for HPC compute works: the validated customers of an HPC cluster sending off their flights of independent "wide but short" jobs, that get resource-packed + fair-scheduled between other clients' jobs into a 2D (nodes, time) matrix, with everything getting executed overnight, just a few wide jobs at a time.

So why don't we see a similar commercial "GPU HPC" market?

I can only assume that the companies building such clusters are either:

- investor-funded, and therefore not concerned with dedicating effort to invent ways to minimize the TCO of their GPUs, when they could instead put all their engineering+operational labor into grabbing market share

- bigcorps so big that they have contracts with one big overriding "customer" that can suck up 100% of their spare GPU-hours: their state's military / intelligence apparatus

...or, if not, then it must turn out that these clusters are being 100% utilized by their owners themselves — however unlikely that may seem.

Because if none of these statements are true, then there's just a proverbial $20 bill sitting on the ground here. (And the best kind of $20 bill, too, from a company's perspective: rent extraction.)

replies(3): >>45068564 #>>45071087 #>>45074128 #

44. parhamn ◴[29 Aug 25 19:16 UTC] No.45068222[source]▶

>>45065876 #

Do we know how big the "batch processing" market is? I know the major providers offer 50%+ off for off-peak processing.

I assumed it was to slightly correct this problem and on the surface it seems like it'd be useful for big data places where process-eventually is enough, e.g. it could be a relatively big market. Is it?

replies(1): >>45069433 #

45. thenewwazoo ◴[29 Aug 25 19:46 UTC] No.45068564{3}[source]▶

>>45068175 #

> Why wouldn't a company ... let validated customer partners of theirs push batch jobs

A company standing up this infrastructure is presumably not in the business of selling time-shares of infrastructure, they're busy doing AI B2B pet food marketing or whatever. In order to make that sale, someone has to connect their underutilized assets with interested customers, which is outside of their core competency. Who's going to do that?

There's obviously an opportunity here for another company to be a market maker, but that's hard, and is its own speciality.

replies(3): >>45069211 #>>45069323 #>>45070325 #

46. pessimizer ◴[29 Aug 25 20:20 UTC] No.45068930{11}[source]▶

>>45067336 #

Note that this is a term whose meaning has been expanded to refer to non-VPS servers very recently. Bare-metal has traditionally meant "without an operating system." It did not mean "a server that is an actual server," because that was the default.

It also does not always "clearly" have this new meaning. Somebody who is used to running programs directly (with no intermediate OS) on hardware might not understand what you're saying, or might ask you to clarify, and you probably shouldn't feel put upon by a totally understandable misinterpretation.

edit: Especially when you keep repeating "directly on hardware" when you mean "not on a VM." VMs also run on hardware. You're saying that you're only running on one OS instead of an OS in your OS.

47. loocorez ◴[29 Aug 25 20:46 UTC] No.45069211{4}[source]▶

>>45068564 #

Sounds like prime intellect

48. mistrial9 ◴[29 Aug 25 20:59 UTC] No.45069323{4}[source]▶

>>45068564 #

Snowflake ?

49. sdesol ◴[29 Aug 25 21:11 UTC] No.45069433{3}[source]▶

>>45068222 #

I don't think you need to be big data to benefit.

A major issue we have right now is, we want the coding process to be more "Agentic", but we don't have an easy way for LLMs to determine what to pull into context to solve a problem. This is a problem that I am working on with my personal AI search assistant, which I talk about below:

https://github.com/gitsense/chat/blob/main/packages/chat/wid...

Analyzers are the "Brains" for my search, but generating the analysis is both tedious and can be costly. I'm working on the tedious part and with batch processing, you can probably process thousands of files for under 5 dollars with Gemini 2.5 Flash.

With batch processing and the ability to continuously analyze 10s of thousands of files, I can see companies wanting to make "Agentic" coding smarter, which should help with GPU utilization and drive down the cost of software development.

replies(1): >>45073205 #

50. echelon ◴[29 Aug 25 22:16 UTC] No.45070046{3}[source]▶

>>45065796 #

> cripple opportunities for Western LLM

Good! If they're not open, they're creating more lock-in. And on top of that, they're using information they don't own to do so and then renting it back to us.

replies(1): >>45070483 #

51. quacksilver ◴[29 Aug 25 22:55 UTC] No.45070325{4}[source]▶

>>45068564 #

There are services like vast.ai that act as marketplaces.

You don't know who owns the GPUs / if or when your job will complete and if the owner is sniffing what you are processing though

52. numpad0 ◴[29 Aug 25 23:17 UTC] No.45070483{4}[source]▶

>>45070046 #

I 100% support it as a consumer :P but IMO we do have to be aware that it's just a happy coincidence.

53. ttrotcre6454 ◴[29 Aug 25 23:59 UTC] No.45070743{3}[source]▶

>>45065796 #

Do you realize how silly you sound?

"Linux was a Finnish conspiracy to cripple hard-working US operating systems makers. It's not really open because I don't understand it."

Leave nationalism to those who want to use you as mere pawns in their imaginary Chess game.

replies(1): >>45070791 #

54. numpad0 ◴[30 Aug 25 00:08 UTC] No.45070791{4}[source]▶

>>45070743 #

I don't care as long as it doesn't sound sillier than to call them open source. They're literally binaries. Often lossy compressed, even.

55. fooker ◴[30 Aug 25 01:14 UTC] No.45071087{3}[source]▶

>>45068175 #

The software stack for doing what you suggest would cost about a hundred million to develop over five-ten years.

replies(1): >>45072220 #

56. fooker ◴[30 Aug 25 01:16 UTC] No.45071097{3}[source]▶

>>45065925 #

H100 was 32k three years ago.

Significantly cheaper now that most cloud providers are buying Blackwell.

57. mountainriver ◴[30 Aug 25 01:23 UTC] No.45071121{3}[source]▶

>>45065925 #

RunPod's network is the worst I’ve ever seen; their infra in general is terrible. It was started by Comcast execs, go figure.

Their GPU availability is amazing though

replies(1): >>45071901 #

58. Spooky23 ◴[30 Aug 25 02:54 UTC] No.45071538{3}[source]▶

>>45067796 #

> And what stinks is that you can't even build a Dell/HPE server like this online. You have to 'request a quote' for an 'AI Server'

The hot parts are/were on allocation to both vendors. They try to suss out your use case and redirect you to less constrained parts.

59. adam_arthur ◴[30 Aug 25 03:39 UTC] No.45071720[source]▶

>>45064819 #

I'm curious as well.

Depreciation and GPU failure rate over time must be considered, which I don't see mentioned in the article.

60. thundergolfer ◴[30 Aug 25 04:20 UTC] No.45071901{4}[source]▶

>>45071121 #

Is the network just slow, or does it have outages?

61. koliber ◴[30 Aug 25 05:52 UTC] No.45072198[source]▶

>>45065876 #

These are great points.

However, I don’t think these companies provision capacity for peak usage and let it idle during off peak. I think they provision it at something a bit above average, and aim at 100% utilization for the max number of hours in the day. When there is not enough capacity to meet demand they utilize various service degradation methods and/or load shedding.

replies(1): >>45072278 #

62. appreciatorBus ◴[30 Aug 25 05:56 UTC] No.45072220{4}[source]▶

>>45071087 #

But I was assured that this sort of stack could simply be vibed into existence?

63. adastra22 ◴[30 Aug 25 06:03 UTC] No.45072256{3}[source]▶

>>45065796 #

Open weights is the equivalent of open source here. DeepSeek is open weight.

If you have some reason to believe a different definition should be used, please provide it. Because there is no source code here.

64. adastra22 ◴[30 Aug 25 06:03 UTC] No.45072258{4}[source]▶

>>45065413 #

What does “open source” even mean when there is no source code?

replies(1): >>45075024 #

65. mcny ◴[30 Aug 25 06:08 UTC] No.45072278{3}[source]▶

>>45072198 #

Is this why I get anthropic/Claude emails every single day since I signed up for their status updates? I just assumed they were working hard with production bugs but in light of this comment, if you don't hit capacity constraints every day, you are wasting money?

replies(2): >>45072607 #>>45073590 #

66. chii ◴[30 Aug 25 07:21 UTC] No.45072607{4}[source]▶

>>45072278 #

This is true for all capital equipment - whether it's a GPU, a bore drill, or an earth mover.

You want to make use of it at as close to 100% as possible.

replies(1): >>45072945 #

67. hvb2 ◴[30 Aug 25 08:27 UTC] No.45072945{5}[source]▶

>>45072607 #

With the caveat that GPUs depreciate a bit faster obviously. A drill is still a drill next year or a decade from now.

replies(1): >>45073934 #

68. senko ◴[30 Aug 25 09:18 UTC] No.45073200[source]▶

>>45065876 #

You're not wrong.

However, this all assumes realtime requirements. For batching, you can smooth over the demand curve, and you don't care about latency.

69. saagarjha ◴[30 Aug 25 09:19 UTC] No.45073205{4}[source]▶

>>45069433 #

You sound like you are talking about something completely different.

replies(2): >>45073477 #>>45075958 #

70. guerrilla ◴[30 Aug 25 09:42 UTC] No.45073311[source]▶

>>45064329 (OP) #

Now if only it would stop prefacing all its output with "Of course!" ;)

This is why I use DS though. I think it's the only ethical option due to its efficiency. I think that outweighs all other considerations at this point.

replies(1): >>45075832 #

71. koliber ◴[30 Aug 25 10:46 UTC] No.45073590{4}[source]▶

>>45072278 #

Just like at an all-you-can eat buffet.

72. apetrov ◴[30 Aug 25 12:09 UTC] No.45073934{6}[source]▶

>>45072945 #

Yes, but the capital is still tied up in it. You want it to have a meaningful ROI, not to sit in a warehouse.

73. reachableceo ◴[30 Aug 25 12:39 UTC] No.45074128{3}[source]▶

>>45068175 #

That is what I’m doing with my excess compute, excess fabrication, CNC, laser, 3D printing, reflow oven, etc. capacity in between hardware revs for my main product. I also bill out my trusted subcontractors.

I validate the compute renters because of ITAR. Lots of hostile foreign powers are trying to access compute.

My main business is ITAR-related, so I have incredibly high security in place already.

We are multi-tenant from day zero and have Slurm etc. in place for accounting reasons for federal contracts. We are actually spinning up federal contracting as a service and will do a Show HN when that launches.

Riches in the niches and the business of business :)

74. aurareturn ◴[30 Aug 25 13:39 UTC] No.45074618[source]▶

>>45066036 #

Isn't Blackwell optimized for FP4? This blog post runs DeepSeek at FP8, which is probably the sweet spot, but new models with native FP4 training and inference would be drastically faster than FP8 on Blackwell.

75. SV_BubbleTime ◴[30 Aug 25 14:35 UTC] No.45075024{5}[source]▶

>>45072258 #

There is a source, it would be the training data. There is also kind of the training code.

Almost absolutely no one releases their training data.

76. 7thpower ◴[30 Aug 25 16:18 UTC] No.45075832[source]▶

>>45073311 #

The only ethical option? Please help me understand the argument here.

replies(1): >>45076220 #

77. sdesol ◴[30 Aug 25 16:37 UTC] No.45075958{5}[source]▶

>>45073205 #

No, what I am saying is that there are more applications for batch processing that will help with utilization. I can see developers and companies using off-hours processing to prep their data for agentic coding.

78. guerrilla ◴[30 Aug 25 17:06 UTC] No.45076220{3}[source]▶

>>45075832 #

Everything else uses more energy for both training and inference. Reducing the energy footprint is our highest priority in this domain. It outweighs the other considerations like it being Chinese, run by a hedge fund, etc. None of that matters if we destroy our ability to live on this planet. DeepSeek is not good enough, but we need to choose it in order to encourage competition on this front specifically. It's more important that companies focus on that than spend time improving other metrics.
