212 points by pella | 12 comments
mk_stjames No.42749299
So the MI300A is an accelerator coupled with a full 24-core EPYC and 128GB of HBM, all on a single chip (or packaged chiplets, whatever).

Why is it I can't buy a single one of these, on a motherboard, in a workstation-format case, to use as an insane workstation? Assuming you could program for the accelerator part, there is an entire world of x86-bound CAD, engineering, and entertainment-industry software (rendering, etc.) where people want a single desktop machine with 128GB+ of fast RAM to number-crunch.
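
For the curious, "programming for the accelerator part" in AMD's world means ROCm/HIP. Here's a minimal sketch of enumerating one of these from code, assuming a working ROCm install (the comment about shared HBM reflects the MI300A's advertised design, not something I've measured):

    // Minimal HIP device query -- a sketch; build with `hipcc query.cpp`.
    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
        int n = 0;
        if (hipGetDeviceCount(&n) != hipSuccess || n == 0) {
            std::printf("no HIP devices visible\n");
            return 1;
        }
        for (int i = 0; i < n; ++i) {
            hipDeviceProp_t p;
            hipGetDeviceProperties(&p, i);
            // On an MI300A, the reported global memory is the HBM pool
            // shared by the CPU and GPU halves of the package.
            std::printf("device %d: %s (%s), %.0f GB\n", i, p.name,
                        p.gcnArchName, p.totalGlobalMem / 1073741824.0);
        }
        return 0;
    }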

There are Blender artists out there who build dual- and quad-RTX 4090 machines with Threadrippers for $20k+ in components all day, because their render jobs pay for it.

There are engineering companies that would not bat an eye at dropping $30k on a workstation if it meant they could spin around 80-gigabyte CATIA models of cars or aircraft loaded in RAM quicker. I know this at least because I sure as hell did, with several HP Z-series machines costing whole-Toyota-Corolla prices over the years...

But these combined APU chips are relegated to these server units. In the end, is this a driver problem? Just a software problem? A chicken-and-egg problem where no one is developing the support because the hardware isn't on the market, and the hardware isn't on the market because AMD thinks there is no use case?

Edit: and note that my use cases don't really depend on latency the way gamers need to hit frame rates. The cache-miss latency mentioned in the article matters less for these types of compute applications, where the main problem is just loading and unloading massive amounts of data. Think offline renders and post-processing of CFD simulations, not a video-output frame rate.
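
To make the pattern concrete, here's a rough HIP sketch of double-buffered streaming; process() is a placeholder kernel and the chunk sizes are made up. As long as the next chunk uploads while the current one computes, you're paying for bandwidth, not latency:

    // Stream a large dataset through the GPU in chunks, overlapping copy
    // and compute across two streams. For real overlap the host buffer
    // should be pinned (hipHostMalloc); plain memory keeps the sketch short.
    #include <hip/hip_runtime.h>
    #include <vector>

    __global__ void process(float* d, size_t n) {
        size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;  // stand-in for real work
    }

    int main() {
        const size_t chunk = 16u << 20;  // floats per chunk (~64MB)
        const int nchunks = 4;
        std::vector<float> host(nchunks * chunk, 1.0f);
        float* dev[2];
        hipStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            hipMalloc(&dev[b], chunk * sizeof(float));
            hipStreamCreate(&s[b]);
        }
        for (int k = 0; k < nchunks; ++k) {
            int b = k % 2;               // ping-pong between buffers
            hipStreamSynchronize(s[b]);  // wait until buffer b is free
            hipMemcpyAsync(dev[b], host.data() + k * chunk,
                           chunk * sizeof(float), hipMemcpyHostToDevice, s[b]);
            hipLaunchKernelGGL(process, dim3((chunk + 255) / 256), dim3(256),
                               0, s[b], dev[b], chunk);
        }
        for (int b = 0; b < 2; ++b) hipStreamSynchronize(s[b]);
        return 0;
    }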

replies(4): >>42749843 >>42752447 >>42757529 >>42762774
1. latchkey No.42749843
(I run a company that buys MI300x.)

> Why is it I can't buy a single one of these, on a motherboard, in a workstation-format case, to use as an insane workstation?

AMD doesn't have the resources to support end users for something like this. They are a public company; look at their spending. They are pouring everything they've got into trying to keep up with Nvidia's release cycle for AI chips.

These chips are cutting edge; they are not perfect. They are still working through the hardware and software issues. It is hard enough to deal with all the public opinion on things as it is. Why would they add another layer of potential abuse?

replies(1): >>42752446
2. AnthonyMouse No.42752446
The people who buy stuff like that are professionals. They often know something about the tools they're using, and if there are any problems, they provide bug reports that actually describe what's happening instead of some non-descriptive mush like "I have your GPU and Windows crashes sometimes." That is extremely helpful if you're trying to get rid of those bugs.

This is the same reason software shops have found it useful to support Linux, even if not many people use it. The people who do will make your product suck less, which in turn makes it easier to sell to the mass market, who will get upset and think unfavorably of you when they hit the same problems but won't be as good at telling you about them.

replies(2): >>42752455 >>42752538
3. Aurornis No.42752455
> provide bug reports that actually describe what's happening

It doesn't matter whether the bug reports are good or bad. Supporting low-volume applications is a bad business move when the alternative is nine-figure data center contracts.

The data center business is orders of magnitude larger. Trying to support individual developers would be a huge business mistake when they already can't keep up with data center demand.

replies(1): >>42752502
4. AnthonyMouse No.42752502 {3}
It's the same hardware running the same software. You want the bug reports so you can fix them and then your data center customers don't encounter them when they're evaluating your product.

What they can keep up with is basically a matter of how much capacity they order from TSMC. If they underestimated demand for some generation, that's the sort of thing you fix with the next contract or you're just throwing money away.

5. latchkey No.42752538
Groq is a good example here:

https://www.eetimes.com/groq-ceo-we-no-longer-sell-hardware/

Our users give them plenty of feedback. They just RMA'd a whole bunch of our GPUs over this issue so that they could take them back to the mothership and figure out what's up...

https://github.com/ROCm/ROCm/issues/4021

It takes a lot of coordination across ourselves (with customers), our DC, AMD, and Dell to make that happen.

replies(2): >>42752706 >>42774377
6. AnthonyMouse No.42752706 {3}
It's not that you don't get bug reports from data center customers; it's that data center customers have scale in a bad way. They buy thousands of GPUs, they do whatever they're going to do with them, they hit a problem, they report the bug. That's one bug report across thousands of GPUs, because they're all being used for the same thing by that customer, so you only see the problems that one workload exposes. Another data center buys thousands of GPUs, does something else that is extremely common and well supported, has no issues, and you get zero bug reports from them.

Compare that to selling a thousand GPUs to a thousand professionals, where 10% of them hit some problem, but each a different one. You get 100 bug reports, you fix 100 bugs instead of just one, and things improve much faster.
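
Back-of-the-envelope version (the numbers are invented, purely to illustrate): if bugs surface per distinct workload rather than per GPU sold, buyer diversity beats buyer volume by a wide margin.

    // Toy model: n buyers each draw one workload uniformly from w possible
    // workloads; expected distinct workloads hit = w * (1 - (1 - 1/w)^n).
    #include <cmath>
    #include <cstdio>

    double distinct(double w, double n) {
        return w * (1.0 - std::pow(1.0 - 1.0 / w, n));
    }

    int main() {
        // Two data centers, each homogeneous: ~2 of 1000 workloads exercised.
        std::printf("2 DC buyers:     %.0f workloads hit\n", distinct(1000, 2));
        // A thousand professionals, each doing their own thing: ~632 of 1000.
        std::printf("1000 pro buyers: %.0f workloads hit\n", distinct(1000, 1000));
        return 0;
    }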

replies(1): >>42753127
7. latchkey No.42753127 {4}
We have 136 of these things. Not thousands. AMD is intentionally keeping the number of providers limited [0] (bottom of page).

No two providers have the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about have jobs that rent this compute.

These GPUs are enterprise; they only come in one form factor. It is a 350 lb box that draws 10 kW of power and needs some pretty serious cooling. It costs as much as an expensive Ferrari.

If you're now also suggesting that AMD release another product that is easier for developers to get their hands on and deploy, then you've totally lost me. You're exponentially increasing the amount of work and money they'd have to spend, and for what? Some feedback?

[0] https://www.amd.com/en/products/accelerators/instinct.html

replies(2): >>42755030 >>42760534
8. AnthonyMouse No.42755030 {5}
> We have 136 of these things. Not thousands.

That's a number within an order of magnitude, and you're presumably not the largest provider.

> No two providers has the same customers, meaning the workloads vary quite a lot, and a lot of the "professional" developers you're talking about all have jobs that rent this compute.

If you own something and you're having problems with it, you're more inclined to try to solve them. If you're renting something and you have problems with it, you're more inclined to rent something else instead.

> These GPUs are enterprise, they only come in one form factor. It is a 350lbs box that takes 10kW of power and some pretty serious cooling. It costs as much as an expensive Ferrari.

Making only 4-socket systems was a choice.

You're also acting like multiple SKUs are something weird. Start offering Ryzen APUs with some on-package GDDR or HBM. Make something that fits in the Threadripper socket and uses PCIe power connectors for extra power. People would buy these things.

The point is to create lots of systems in the hands of lots of people that use the same general hardware architecture so that you're improving its software support.

10. _zoltan_ No.42760534 {5}
I think you underestimate the people here when you throw around things like "it costs as much as an expensive Ferrari." A lot of us work with systems like these, so we understand why they cost so much and what they can do. On Reddit this works; here, I feel it's pretty condescending.

"Intentionally limiting" is just koolaid. It's ok to drink it, it's your business, but it's koolaid. You think if AWS wanted to deploy a couple hundred thousand of these systems, AMD would be sad? I bet Lisa would be happy.

I tried renting a system, and putting in a credit card was not enough. That's a red flag for me. I don't want to email or chat with sales; I just want to put in a card number. That works even for GH200 systems over at Lambda.

As for the number of SKUs: for Blackwell there are a lot, if you believe Jensen, and why wouldn't you? He stated at CES that almost every data center they go into is a bit bespoke, with modifications.

AMD seems unable to execute on this, which is reflected in its share price.

replies(1): >>42760676
11. latchkey No.42760676 {6}
> I feel this is pretty condescending

Apologies, not my intention.

> I bet Lisa would be happy.

I bet she would! I was referring to neoclouds, not tier 1.

> I tried renting a system, and putting in a credit card is not enough.

You truly don't need to talk to anyone; CC and go: https://www.shadeform.ai/

> AMD seems unable to execute on this, which is reflected in its share price.

I agree, they haven't been doing the best job [0]. Let's hope they can show action and turn it around.

[0] https://x.com/HotAisle/status/1880679135875362839

replies(1): >>42761247
12. _zoltan_ No.42761247 {7}
OK, maybe it works now with just a CC. Glad that's sorted.

AMD is tone-deaf, unfortunately, but I liked your reply on X.