Most active commenters
  • Dylan16807(9)
  • (8)
  • Rinzler89(5)
  • bayindirh(4)
  • dangus(4)
  • numpad0(4)
  • bearjaws(4)
  • nsteel(4)
  • kanbankaren(3)
  • chaostheory(3)

172 points marban | 144 comments
1. InTheArena ◴[] No.40051885[source]
While everyone has focused on the power efficiency of Apple's M-series chips, one thing that has been very interesting is how powerful the unified memory model (by having the memory on-package with the CPU) with large bandwidth to the memory actually is. Hence a lot of people in the local LLaMA community are really going after high-memory Macs.

It's great to see NPUs here with the new Ryzen cores - but I wonder how effective they will be with off-die memory versus the Apple approach.

That said, it's nothing but great to see these capabilities in something other than an expensive NVIDIA card. Local NPUs may really help with deploying more inference capabilities at the edge.

Edited - sorry, meant on-package.

replies(8): >>40051950 #>>40052032 #>>40052167 #>>40052857 #>>40053126 #>>40054064 #>>40054570 #>>40054743 #
2. vbezhenar ◴[] No.40051950[source]
Apple does not make on-die RAM.
3. chaostheory ◴[] No.40052032[source]
What Apple has is theoretically great on paper, but it fails to live up to expectations. What's the point of having the RAM for running an LLM locally when the performance is abysmal compared to running it on even a consumer Nvidia GPU? It's a missed opportunity that I hope either the M4 or M5 addresses.
replies(8): >>40052327 #>>40052344 #>>40052929 #>>40053695 #>>40053835 #>>40054577 #>>40054855 #>>40056153 #
4. Havoc ◴[] No.40052140[source]
Are these NPUs addressable with a standard PyTorch LLM stack?

NPU seems to mean very different things depending on device/vendor

5. bearjaws ◴[] No.40052158[source]
The focus on TOPS seems a bit out of line with reality for LLMs. TOPS don't matter for LLMs if your memory bandwidth can't keep up. Since quad-channel memory isn't mentioned, I guess it's still dual channel?

Even top-of-the-line DDR5 is around 128GB/s vs an M1 at 400GB/s.
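To make the bandwidth point concrete, here is a back-of-envelope sketch (illustrative numbers only, not benchmarks): during decode, every weight is read roughly once per generated token, so memory bandwidth divided by model size caps tokens/sec regardless of TOPS. The 3.9 GB model size is an assumed ~7B-parameter model at 4-bit quantization.

  # Rough ceiling on decode tokens/sec: weights are streamed once per token,
  # so bandwidth / model size bounds throughput. Figures are illustrative.
  def decode_ceiling_tps(bandwidth_gb_s: float, model_size_gb: float) -> float:
      return bandwidth_gb_s / model_size_gb
  model_gb = 3.9  # assume a ~7B-parameter model at 4-bit quantization
  for name, bw in [("dual-channel DDR5", 128), ("M1 Max", 400)]:
      print(f"{name}: <= {decode_ceiling_tps(bw, model_gb):.0f} tokens/sec")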

At the end of the day, it still seems like AI in consumer chips is chasing a buzzword, what is the killer feature?

On mobile there are image processing benefits and voice to text, translation... but on desktop those are nowhere near common use cases.

replies(3): >>40052204 #>>40052260 #>>40052353 #
6. atty ◴[] No.40052167[source]
Apples memory is on package, not on die.
7. postalrat ◴[] No.40052204[source]
https://www.neatvideo.com/blog/post/m3

That says M1 is 68.25 GB/s

replies(3): >>40052258 #>>40052307 #>>40052631 #
8. givinguflac ◴[] No.40052258{3}[source]
The OP was obviously talking about the M1 Max.
replies(1): >>40052494 #
9. VHRanger ◴[] No.40052260[source]
The killer feature is presumably inference at the edge, but I don't see that being used on desktop much at all right now.

Especially since most desktop applications people use are web apps. Of the native apps people use that leverage this sort of stuff, almost all are GPU accelerated already (eg. image and video editing AI tools)

replies(1): >>40052360 #
10. bearjaws ◴[] No.40052307{3}[source]
M1 Max, sorry. I don't mean to compare a 4-year-old tablet processor to the latest generation of laptop CPUs.
11. InTheArena ◴[] No.40052327{3}[source]
The performance of Ollama on my M1 Max is pretty solid, and it does things that my 2070 GPU can't do because of memory.
replies(1): >>40052675 #
12. bearjaws ◴[] No.40052344{3}[source]
It's a 25W processor. How will it ever live up to a 400W GPU? Also, you can't even run large models on a single 4090, but you can on M-series laptops with enough RAM.

The fact that a laptop can run 70B+ parameter models is a miracle; it's not what the chip was built to do at all.

replies(3): >>40052910 #>>40056761 #>>40059613 #
13. futureshock ◴[] No.40052353[source]
Upscaling for gaming or video.

Local context aware search

Offline Speech to text and TTS

Offline generation of clip art or stock images for document editing

Offline LLM that can work with your documents as context and access application and OS APIs

Improved enemy AI in gaming

Webcam effects like background removal or filters.

Audio upscaling and interpolation like for bad video call connections.

replies(2): >>40052491 #>>40052497 #
14. jzig ◴[] No.40052360{3}[source]
What does “at the edge” mean here?
replies(4): >>40052515 #>>40052529 #>>40052531 #>>40052991 #
15. phkahler ◴[] No.40052383[source]
Am I missing something? These look just like the APUs with the addition of "management" and "security" features and without the iGPU. Is that right?
replies(1): >>40052780 #
16. bearjaws ◴[] No.40052491{3}[source]
> Upscaling for gaming or video.

Already exists on all three major GPU manufacturers, and it definitely makes sense as a GPU workload.

> Local context aware search

You don't need an AI processor to do this, Windows search used to work better and had even less compute resources to work with.

> Offline Speech to text and TTS

See my point about not a very common use case for desktops & laptops vs cell phones.

> Offline LLM that can work with your documents as context and access application and OS APIs

Maybe for some sort of background task or only using really small models <13B parameters. Anything real time is going to run at 1-2t/s with a large model.

Small models are pretty terrible though, I doubt people want even more incorrect information and hallucinations.

> Improved enemy AI in gaming

See Ageia PhysX

> Webcam effects like background removal or filters.

We already have this without NPUs.

> Audio upscaling and interpolation like for bad video call connections.

I could see this, or noise cancellation.

replies(3): >>40052653 #>>40052700 #>>40052949 #
17. postalrat ◴[] No.40052494{4}[source]
How is it obvious? Anyone reading that could assume that any M1 gets that bandwidth.
18. kanbankaren ◴[] No.40052497{3}[source]
All of this (except upscaling) is possible with an iGPU/CPU without breaking a sweat?
replies(1): >>40052666 #
19. georgeecollins ◴[] No.40052515{4}[source]
Not using AI in the cloud. So if your connection is uncertain, or you want to use your bandwidth for something else -- like video conferencing or gaming. Probably the killer app is something that wants to use AI but doesn't involve paying a cloud provider. I was talking to a vendor about their chat bot built to put into MMOs or mobile games. It would be killer to have a character hold a lifelike conversation in those kinds of experiences. But the last thing you want to do is increase your server costs the way this AI would. Edge computing could solve that.
20. PeterSmit ◴[] No.40052529{4}[source]
Not in the cloud.
21. Zach_the_Lizard ◴[] No.40052531{4}[source]
I'm guessing "the edge" is doing inference work in the browser, etc. as opposed to somewhere in the backend of the web app.

Maybe your local machine can run, I don't know, a model to make suggestions as you're editing a Google Doc, which frees up the Big Machine in the Sky to do other things.

As this becomes more technically feasible, it reduces the effective cost of inference for a new service provider, since you, the client, are now running their code.

The Jevons paradox might kick in, causing more and more uses of LLMs for use cases that were too expensive before.

22. kokonoko ◴[] No.40052543[source]
I hope (but I doubt) that this will be more than a marketing stunt to include something "AI" in their product line. Every vendor has their own hardware that is badly supported by tools, and even when it is supported, only a fraction of the available software uses it. In the meantime it takes up precious die area and resources.
23. ◴[] No.40052631{3}[source]
24. bayindirh ◴[] No.40052653{4}[source]
It's about power management, and doing more things with less power. These specialized IP blocks on CPUs allow these things to be done with less power and less latency.

Intel's bottom of the barrel N95 & N100 CPUs have Gaussian & Neural accelerators for simple image processing and object detection tasks, plus a voice processor for low power voice based activation and command capture and process.

You can always add more power hungry, general purpose components to add capabilities. Heck, video post processing entered hardware era with ATI Radeon 8500. But doing these things with negligible power costs is the new front.

Apple is not adding coprocessors to their iPhones because it looks nice. All of these coprocessors reduce CPU wake-up cycles tremendously and allow the device to monitor tons of things out of band with negligible power costs.

25. bayindirh ◴[] No.40052666{4}[source]
The things which don't make the GPU break a sweat have their own specialized (or semi-specialized) processing blocks on the GPU, too.
replies(1): >>40052781 #
26. dangus ◴[] No.40052675{4}[source]
Not that I don’t believe you but the 2070 is two generations and 5 years old. Maybe a comparison to a 4000 series would be more appropriate?
replies(2): >>40052731 #>>40052773 #
27. pdpi ◴[] No.40052700{4}[source]
>> Upscaling for gaming or video.

> Already exists on all three major GPU manufacturers, and it definitely makes sense as a GPU workload.

"makes sense as a GPU workload" is underselling it a bit. Doing it on the CPU is basically insane. Games typically upscale only the world view (the expensive part to render) while rendering the UI at full res. So to do CPU-side upscaling we're talking about a game rendering a surface on the GPU, sending it to the CPU, upscaling it there, sending it back to the GPU, then compositing with the UI. It's just needlessly complicated.

28. Kirby64 ◴[] No.40052731{5}[source]
The M1 Max is also 2 generations old, and ~3 years old at this point. Seems like a fair comparison to me.
replies(2): >>40052845 #>>40052863 #
29. Aissen ◴[] No.40052746[source]
A quick search into it shows that this Ryzen AI NPU's support isn't integrated into upstream inference frameworks yet — so right now it's just useless silicon surface you pay for :-/
replies(3): >>40052844 #>>40053100 #>>40060474 #
30. Teever ◴[] No.40052773{5}[source]
Well, you know that it would still be able to do more than a 4000-series GPU from Nvidia, because you can have more system memory in a Mac than you can have video RAM in a 4000-series GPU.
replies(1): >>40052862 #
31. c0l0 ◴[] No.40052780[source]
They also support ECC UDIMMs with ECC enabled, which has been the "PRO" series APU killer feature on AM4 for me. The non-"PRO" APUs will run fine with ECC UDIMMs, but cannot make use of the extra parity information (maybe for reasons of market segmentation - I don't know if anyone outside of AMD knows). This is probably less of a concern with DDR5 platforms and their "on-DIE ECC" (which you cannot monitor for Correctable Errors at least, afaik), but it's still gonna matter for me.
replies(2): >>40052887 #>>40055219 #
32. kanbankaren ◴[] No.40052781{5}[source]
I meant the current generation of GPUs that don't have any AI acceleration blocks.
replies(1): >>40052810 #
33. bayindirh ◴[] No.40052810{6}[source]
They are MATMUL machines by design already. They do not need to "accelerate" AI to begin with.

Their cores/shaders can be programmed to do that.

Also, name a current-gen GPU which doesn't have video encoding/decoding capabilities in silicon, even ones which do not allow shaders to be used in this process for post-processing. It's impossible not to have these capabilities at this point in time.

replies(1): >>40053026 #
34. Rinzler89 ◴[] No.40052844[source]
Some AMD laptops haven't even enabled the NPU in firmware yet, even on the 7000 series, which is about a year old. Meaning it's still useless.

I was kinda bummed out that they released the 8000 series right after I bought a laptop with the 7000 series, but I think I actually dodged a bullet here, since it doesn't look like much of an upgrade and the AI silicon screams very early first-gen product to me, as if they rushed it out the door because everyone else was doing "AI" and they needed to cash in on the hype too, kinda like the first-gen RTX cards.

I think that by the time I actually upgrade, the AI/NPU tech will have matured considerably and actually be useful.

replies(2): >>40055719 #>>40055799 #
35. dangus ◴[] No.40052845{6}[source]
The 4000 series still represents a bigger generational leap.

The M3 Max has something like 33% faster overall graphics performance than the M1 Max (average benchmark), while the 4090 is something like 138% faster than the 2080 Ti.

Depending on which 2070 and 4070 models you compare, the difference is similar, close to or exceeding a 100% uplift.

replies(1): >>40055156 #
36. thsksbd ◴[] No.40052857[source]
Old becomes new: the SGI O2 had a unified memory model (off-chip) for performance reasons.

Not a CS guy, but it seems to me that a NUMA-like architecture has to come back: large RAM on chip (balancing a thermal budget between number of cores vs. RAM), a much larger RAM off chip, and even more RAM through a fast interconnect, all under a single kernel image. Like the Origin 300 had.

replies(3): >>40053224 #>>40054054 #>>40055112 #
37. dangus ◴[] No.40052862{6}[source]
Yes, obviously I’m aware that you can throw more RAM at an M-series GPU.

But of course that’s only helpful for specific workflows.

38. talldayo ◴[] No.40052863{6}[source]
Maybe it's controversial, but I don't think comparing 5nm mobile hardware from 2021 against 12nm desktop hardware from 2018 is a fair fight.

And still, performance-wise, the 2070 wins out by a ~33% margin: https://browser.geekbench.com/opencl-benchmarks

replies(2): >>40053373 #>>40055651 #
39. nwah1 ◴[] No.40052887{3}[source]
Yes. And it actually does have an iGPU, depending on which model.
replies(1): >>40053492 #
40. wongarsu ◴[] No.40052910{4}[source]
It's a valid comparison in the very limited sense of "I have $2000 to spend on a way to run LLMs, should I get an RTX4090 for the computer I have or should I get a 24GB MacBook", or "I have $5000, should I get an RTX A6000 48GB or a 96GB MacBook".

Those comparisons are unreasonable in a sense, but they are implied by statements like the GP's "Hence a lot of people in the local LLaMA community are really going after high-memory Macs".

replies(3): >>40053367 #>>40054359 #>>40054554 #
41. zitterbewegung ◴[] No.40052929{3}[source]
Buying an M3 Max with 128GB of RAM will underperform any consumer NVIDIA GPU, but it will be able to load larger models in practice, just slowly.

I think a way for the M-series chips to aggressively target GPU inference or training would need a strategy that increases the speed of the RAM to start to match GDDR6 or HBM3, or uses those directly.

replies(2): >>40053002 #>>40056817 #
42. futureshock ◴[] No.40052949{4}[source]
> Upscaling for gaming or video.

> Already exists on all three major GPU manufacturers, and it definitely makes sense as a GPU workload.

These AMD chips are APUs that are often the only GPU; not every user will have a dedicated GPU.

> Local context aware search

> You don't need an AI processor to do this, Windows search used to work better and had even less compute resources to work with.

You could still improve it with increased natural language understanding instead of simple keywords: "Give me all documents about dogs" instead of searching for each breed as a keyword.

> Offline Speech to text and TTS

> See my point about not a very common use case for desktops & laptops vs cell phones.

Maybe not for you, but accessibility is a key feature for many users. You think blind users should suffer through bad TTS?

> Offline LLM that can work with your documents as context and access application and OS APIs

> Maybe for some sort of background task or only using really small models <13B parameters. Anything real time is going to run at 1-2t/s with a large model.

> Small models are pretty terrible though, I doubt people want even more incorrect information and hallucinations.

Small models have been improving, and better capabilities in consumer chips will allow larger models to run faster.

> Improved enemy AI in gaming

> See Ageia PhysX

Surely you're not suggesting that enemy AI is a solved problem in gaming?

> Webcam effects like background removal or filters.

> We already have this without NPUs.

Sure, but it could go from obvious and distracting to seamless and convincing.

> Audio upscaling and interpolation like for bad video call connections.

> I could see this, or noise cancellation.

43. VHRanger ◴[] No.40052991{4}[source]
Edge is doing computing on the client (eg. browser, phone, laptop, etc.) instead of the server
replies(1): >>40055563 #
44. ◴[] No.40053002{4}[source]
45. myself248 ◴[] No.40053020[source]
What does "commercial market" mean here? The article says these features were already available in the "consumer" versions of these chips -- were those given away for free or something? What about them was not "commercial"?

I'm sure there's some market segmentation thing at work here, but this just sounds like a Hallmark-holiday excuse to rehash an old press-release and pretend it's another revolution all over again.

replies(2): >>40053048 #>>40053077 #
46. kanbankaren ◴[] No.40053026{7}[source]
I was talking about AI blocks and you moved the goal post to video codec blocks.
replies(1): >>40053078 #
47. dhruvdh ◴[] No.40053048[source]
You can't buy these pro variants from Microcenter for example, but you can buy them from pre-built OEM desktops. Mostly meant for enterprise customers who buy in bulk, I think.
48. transpute ◴[] No.40053077[source]
> the article says these features were already available

Actually, the article says they were not available:

  the Pro series is based on AMD's existing consumer-oriented processor models but comes with additional features
Pro (enterprise) CPUs include remote management, memory encryption and other security features: https://www.amd.com/en/ryzen-pro
49. bayindirh ◴[] No.40053078{8}[source]
No. I didn't move anything.

I said that the core (3D rendering hardware) of a GPU with shaders is the AI block already, and said that other tasks like video encoders have their own blocks, but still pull capabilities from the "core" to improve things.

50. dhruvdh ◴[] No.40053100[source]
There is a VitisAI execution provider for ONNX, and you can use ONNX backends for inference frameworks that support it. More info here - https://ryzenai.docs.amd.com/en/latest/
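For reference, wiring a model through that execution provider looks roughly like the Python sketch below. Treat it as an assumption-laden outline rather than the official recipe: the model file and input shape are placeholders, and the exact provider options (e.g. the Vitis AI config file) vary by Ryzen AI release, so check the linked docs.

  import numpy as np
  import onnxruntime as ort
  # Hedged sketch: route inference through the Vitis AI EP, falling back to CPU.
  # "model.onnx" and the input shape are hypothetical placeholders.
  session = ort.InferenceSession(
      "model.onnx",
      providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
  )
  dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
  outputs = session.run(None, {session.get_inputs()[0].name: dummy})
  print(outputs[0].shape)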

But regardless, 16 TOPS is no good for LLMs. Though there is a Ryzen AI demo that shows Llama 7B running on these at 8 tokens/sec. A sub-par experience for a sub-par LLM.

replies(3): >>40054182 #>>40054664 #>>40142456 #
51. v1sea ◴[] No.40053126[source]
edit: I was wrong.
replies(3): >>40053352 #>>40053365 #>>40053618 #
52. Rinzler89 ◴[] No.40053224{3}[source]
UMA in the SGI machines (and gaming consoles) made sense because all the memory chips at that time were equally slow, or fast, depending on how you wanna look at it.

PC hardware split video memory from system memory once GDDR became so much faster than system RAM, but GDDR has too high latency for CPUs and DDR has too low bandwidth for GPUs, so the separation made sense for each one's strengths and still does to this day. Unifying it again, like with AMD's APUs, means compromises either for the CPU or for the GPU. There's no free lunch.

Currently AMD APUs on the PC use unified DDR RAM, so CPU performance is top but GPU/NPU performance is bottlenecked. If they were to use unified GDDR like in the PS5/Xbox, then GPU/NPU performance would be top and CPU performance would be bottlenecked.

replies(3): >>40053947 #>>40054567 #>>40056075 #
53. ◴[] No.40053352{3}[source]
54. oflordal ◴[] No.40053365{3}[source]
You can do that with both HIP and CUDA through e.g. hipHostMalloc and the CUDA equivalent (not officially supported on the AMD APUs, but works in practice). With a discrete GPU, the GPU will access memory across PCIe, but on an APU it will go at full speed to RAM as far as I can tell.
55. fckgw ◴[] No.40053367{5}[source]
No, it is not a valid comparison to make between an entire laptop and a single PC part. "The computer I have" is doing a ton of heavy lifting here.
replies(2): >>40053642 #>>40053845 #
56. chessgecko ◴[] No.40053373{7}[source]
For this comparison the generation of chip doesn’t really matter, because LLM decode (which is the costly step) barely uses any of the compute performance and just needs the model weights to fit in memory.
57. c0l0 ◴[] No.40053492{4}[source]
In case you don't need an integrated GPU (that's somewhat powerful/potent), you can go with any other Ryzen AM5 CPU to receive proper ECC-enabled ECC UDIMM support, afaik :)
58. smallmancontrov ◴[] No.40053618{3}[source]
> low latency results each frame

What does that do to your utilization?

I've been out of this space for a while, but in game dev any backwards information flow (GPU->CPU) completely murdered performance. "Whatever you do, don't stall the pipeline." Instant 50%-90% performance hit. Even if you had to awkwardly duplicate calculations on the CPU, it was almost always worth it, and not by a small amount. The caveat to "almost" was that if you were willing to wait 2-4 frames to get data back, you could do that without stalling the pipeline.

I didn't think this was a memory architecture thing, I thought it was a data dependency thing. If you have to finish all calculations before readback and if you have to readback before starting new calculations, the time for all cores to empty out and fill back up is guaranteed to be dead time, regardless of whether the job was rendering polygons or finite element calculations or neural nets.

Does shared memory actually change this somehow? Or does it just make it more convenient to shoot yourself in the foot?

EDIT: or is the difference that HPC operates in a regime where long "frame time" dwarfs the pipeline empty/refill "dead time"?

replies(1): >>40053853 #
59. wongarsu ◴[] No.40053642{6}[source]
"Should I upgrade what I have or buy something new" is a completely normal everyday decision. Of course it doesn't apply to everyone since it presumes you have something compatible to upgrade, but it is a real decision lots of people are making
replies(1): >>40053750 #
60. evilduck ◴[] No.40053695{3}[source]
That completely depends on your expectations and uses.

I have a gaming rig with a 4080 with 16GB of RAM and it can't even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized. Yeah it's fast when something fits on it, but I don't see much point in very fast generation of bad output. A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM and for ~$300 less, and it's a whole computer instead of a single part that still needs a computer around it. Compared to my 4080, that GPU and the surrounding computer get you half the VRAM capacity for greater cost than the Mac.
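Some rough numbers behind that, assuming Mixtral 8x7B's ~47B total parameters and ignoring KV cache and runtime overhead:

  params_b = 47e9  # Mixtral 8x7B total parameters (approximate)
  for label, bytes_per_param in [("fp16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
      print(f"{label}: ~{params_b * bytes_per_param / 1e9:.0f} GB of weights")
  # fp16 ~94 GB, 8-bit ~47 GB, 4-bit ~24 GB: all beyond a 16 GB card,
  # hence heavy quantization and/or CPU offload.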

If you're building a rig with multiple GPUs to support many users or for internal private services and are willing to drop more than $3k then I think the equation swings back in favor of Nvidia, but not until then.

replies(4): >>40053808 #>>40053912 #>>40053916 #>>40059721 #
61. fckgw ◴[] No.40053750{7}[source]
But it's also assuming everyone has a desktop PC capable of this stuff that can be upgraded.
62. soupbowl ◴[] No.40053808{4}[source]
Just buy another 16GB of RAM for $80...
replies(2): >>40053856 #>>40054038 #
63. ◴[] No.40053835{3}[source]
64. michaelt ◴[] No.40053845{6}[source]
Sorta yes, sorta no.

You're certainly right that with a macbook you get a whole computer, so you're getting more for your money. And it's a luxury high-end computer too!

But personally, I've never seen anyone step directly from not-even-having-a-PC to buying a 4090 for $1800. Folks that aren't technically inclined by and large stick with hosted models like ChatGPT.

More common in my experience is for technical folks with, say, an 8GB GPU to experiment with local ML, decide they're interested in it, then step up to a 4090 or something.

65. v1sea ◴[] No.40053853{4}[source]
It was probably because my workloads were relatively small that I could get away with a 90Hz read on the CPU side. I'll need to dig deeper into it. The metrics I was seeing showed 200-300 microseconds of GPU time for physics calculations, and within the same frame the CPU reading from that buffer. Maybe I'm wrong, need to test more.
replies(1): >>40054488 #
66. elzbardico ◴[] No.40053856{5}[source]
GPU ram?
67. cjk2 ◴[] No.40053912{4}[source]
Yeah this.

Also, as an anecdote: my daily driver machine is a bottom-end M2 Mac mini because I am a cheapass. I paid less than it for the 4070 card in my desktop PC. The M2 Mac does a dehaze from RAW in Lightroom in 18 seconds. My 4070 takes 9 seconds. So the GPU is twice as fast, but the Mac has a whole free computer stuck to it.

68. ◴[] No.40053916{4}[source]
69. Dalewyn ◴[] No.40053947{4}[source]
>Unifying it again, like with AMD's APUs, means either compromises for the CPU or for the GPU. There's no free lunch.

I think the lunch here (it still ain't free) is that RAM speed means nothing if you don't have enough RAM in the first place, and this is a compromise solution to that practical problem.

replies(2): >>40054000 #>>40055288 #
70. Rinzler89 ◴[] No.40054000{5}[source]
>if you don't have enough RAM in the first place

Enough RAM for what task exactly? System RAM is plentiful and cheap nowadays (unless you buy Apple). I got a new laptop with 32GB of RAM for about 750 euros. But the speeds are too low for high-end gaming or LLM training on the poor APU.

replies(1): >>40054101 #
71. evilduck ◴[] No.40054038{5}[source]
Running your larger-than-your-GPU-VRAM LLM model on regular DDR ram will completely slaughter your token/s speed to the point that the Mac comes out ahead again.
replies(1): >>40056550 #
72. aidenn0 ◴[] No.40054054{3}[source]
Several PC graphics standards attempted to offer high-speed access to system memory (though I think only VLB offered direct access to the memory controller at the same speed as the CPU). Not having to have dedicated GPU memory has obvious advantages, but it's hard to get right.
73. numpad0 ◴[] No.40054064[source]
Note that while UMA is great in the sense that they allow LLM models to be run at all, M-series chips aren't faster[1] when the model fits in VRAM.

  1: screenshot from[2]: https://www.igorslab.de/wp-content/uploads/2023/06/Apple-M2-ULtra-SoC-Geekbench-5-OpenCL-Compute.jpg
  2: https://wccftech.com/apple-m2-ultra-soc-isnt-faster-than-amd-intel-last-year-desktop-cpus-50-slower-than-nvidia-rtx-4080/
replies(2): >>40054237 #>>40055676 #
74. numpad0 ◴[] No.40054101{6}[source]
Enough RAM for LLM. There are GPUs faster than M2 Ultra but can't run LLMs normally, which make that speed a moot point for LLM use-cases.
75. Aissen ◴[] No.40054182{3}[source]
Thanks, I was looking for information on this. It seems to be slower than pure-CPU inference on an M2, and probably much worse than a ROCm GPU-based solution?
replies(1): >>40055622 #
76. cstejerean ◴[] No.40054237{3}[source]
The problem is you're limited to 24 GB of VRAM unless you pay through the nose for datacenter GPUs, whereas you can get an M-series chip with 128 GB or 192 GB of unified memory.
replies(1): >>40054469 #
77. oceanplexian ◴[] No.40054359{5}[source]
The answer depends on what you plan to do with it.

Do you need to do fine tuning on a smaller model and need the highest inference performance with smaller models? Are you planning to use it as a lab to learn how to work with tools that are used in Big Tech (i.e. CUDA)? Or do you just want to do slow inference on super huge models (e.g Grok)?

Personally, I chose the Nvidia route because as a backend engineer, Macs aren’t seriously used in datacenters. The same frameworks I use to develop on a 3090 are transferable to massive, infiniband-connected clusters with TB of VRAM.

78. numpad0 ◴[] No.40054469{4}[source]
Surely! The point is that they're not million-times-faster magic chips that will make NVIDIA go bankrupt tomorrow. That's all. A laptop with up to 128GB of "VRAM" is a great option, absolutely no doubt about that.
replies(1): >>40054716 #
79. smallmancontrov ◴[] No.40054488{5}[source]
> edit: I was wrong.

If the only reason you were "wrong" was because you intuitively understood that it wasn't worth a large amount of valuable human time to save a small amount of cheap machine time, you were right in the way that matters (time allocation) and should keep it up :)

80. 0x457 ◴[] No.40054554{5}[source]
I think there are two communities:

- the "hobbyists" with $5k GPUs

- People that work in the industry who have never used anything but a Mac; and even if they did, explaining to IT that you need a PC with an RTX A6000 48GB instead of a Mac like literally everyone else in the company is a losing battle.

replies(1): >>40054806 #
81. numpad0 ◴[] No.40054567{4}[source]
I suspect there are difficulties with DRAM latency and/or signal integrity with APUs and RAM-expandable GPUs. Wasted ALUs are wasted if you'd be stalling deep SIMD pipelines all the time.
82. spamizbad ◴[] No.40054570[source]
My understanding is the unified RAM on the M-series die does not contribute significantly to their performance. You get a little bit better latency but not much. The real value to Apple is likely it greatly simplifies your design since you don't have to route out tons of DRAM signaling and power management on your logic board. Might make DRAM training easier too but that's way beyond my expertise.
replies(2): >>40054690 #>>40058366 #
83. instagib ◴[] No.40054577{3}[source]
One thing I would consider is throttling on a MacBook Pro under sustained load. Would repeated LLM usage run into throttling?

No idea what specifically everyone is pulling their performance data from or what task(s).

Here is a video to help visualize the differences with a maxed out m3 max vs 16gbm1 pro vs 4090 on llm 7B/13b/70b llama 2. https://youtu.be/jaM02mb6JFM

Here’s a Reddit comparison of 4090 vs M2 Ultra 96gb with tokens/s

https://old.reddit.com/r/LocalLLaMA/comments/14319ra/rtx_409...

M3 Pro memory BW: 150 GB/s
M3 Max (10/30): 300 GB/s
M3 Max (12/40): 400 GB/s

“Llama models are mostly limited by memory bandwidth.
rtx 3090 has 935.8 gb/s
rtx 4090 has 1008 gb/s
m2 ultra has 800 gb/s
m2 max has 400 gb/s
so 4090 is 10% faster for llama inference than 3090 and more than 2x faster than apple m2 max https://github.com/turboderp/exllama
using exllama you can get 160 tokens/s in 7b model and 97 tokens/s in 13b model while m2 max has only 40 tokens/s in 7b model and 24 tokens/s in 13b apple 40/s
Memory bandwidth cap is also the reason why llamas work so well on cpu (…)

buying second gpu will increase memory capacity to 48gb but has no effect on bandwidth so 2x 4090 will have 48gb vram and 1008 gb/s bandwidth and 50% utilization”

84. markdog12 ◴[] No.40054664{3}[source]
Wow, that's simply embarrassing.
85. john_alan ◴[] No.40054690{3}[source]
It also provides the GPU with a serious RAM allocation; a 64GB M3 chip comes with more RAM for the GPU than a 4090.
86. john_alan ◴[] No.40054716{5}[source]
They are powerful, but I agree with you, it's nice to be able to run Goliath locally, but it's a lot slower than my 4070.
87. AceJohnny2 ◴[] No.40054743[source]
> unified memory model (by having the memory on-package with CPU)

That's not what "unified memory model" means.

It means that the CPU and GPU (and ANE!) have access to the same banks of memory, unlike PC GPUs, which have their own memory, separated from the CPU's by the PCIe bottleneck (as fast as that is, its bandwidth is still smaller than direct shared DRAM access).

It allows the hardware more flexibility in how the single pool of memory is allocated across devices, and faster sharing of data across devices (throughput/latency depends on the internal system bus ports and how many each device has access to).

The Apple M-series chips also have the memory on-package with the CPU (technically the SoC, "System-on-Chip"), but that provides different benefits.

replies(2): >>40055425 #>>40056022 #
88. wongarsu ◴[] No.40054806{6}[source]
There is also an important third group:

- people that work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle. So you just submit your request for an RTX A6000 48GB to be added to your existing workstation.

Those people are the intended target customer of the A6000, and there are a lot of them.

replies(1): >>40067455 #
89. john_alan ◴[] No.40054855{3}[source]
What are you talking about? It's literally the fastest single-core retail CPU globally, and multicore is close too - https://browser.geekbench.com/processor-benchmarks
replies(1): >>40063026 #
90. Dylan16807 ◴[] No.40055112{3}[source]
> Old becomes new

I disagree. This is like pointing at a 2024 model hatchback and saying "old becomes new" because you can cite a hatchback from 50 years ago.

There's a bevy of ways to isolate or combine memory pools, and mainstream hardware has consistently used many of these methods the entire time.

91. whizzter ◴[] No.40055156{7}[source]
Googling power draw, the 4090 goes up to 450W whilst the 2080 Ti was at 250W; adjusting for power consumption, the increase is somewhere around 32%. Some architectural gains and probably optimizations in how the chips work, but we're not seeing as many amazing generational leaps anymore, regardless of manufacturer/designer.
replies(1): >>40077726 #
92. pedrocr ◴[] No.40055219{3}[source]
Since in AM5 all CPUs have a basic iGPU, for a home server all the normal CPUs already work fine. The advantage of the APUs is they're on a monolithic die so should have quite a bit lower idle power usage which is important if you have a NAS or other homelab server running 24/7.
93. Dylan16807 ◴[] No.40055288{5}[source]
The real problem is a lack of competition in GPU production.

GDDR is not very expensive. You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered. Instead, please pay triple or quadruple the price of a high end gaming GPU to get a model with double the memory and basically the same core.

The level of markup is astonishing. I can go from 8GB to 16GB on AMD for $60, but going from 24GB to 48GB costs $3000. And nvidia isn't better.

replies(4): >>40055629 #>>40055989 #>>40056434 #>>40069867 #
94. baarsh ◴[] No.40055305[source]
How come the new series comes with options only up to 8 cores, while the 5900X and 5950X already had 12 and 16 cores a few years ago?
replies(2): >>40055539 #>>40055945 #
95. cmovq ◴[] No.40055425{3}[source]
Having separated GPU memory also has its benefits. Once the data makes it through the PCIe bus, graphics memory typically has much higher bandwidth which also doesn’t need to split with the CPU.
replies(2): >>40056425 #>>40062642 #
96. Arrath ◴[] No.40055539[source]
Better sustained performance with the thermal envelope afforded to only 8 cores vs 12+?
97. Dylan16807 ◴[] No.40055563{5}[source]
Half the definitions I see of edge include client devices, and half of them don't include client devices.

I like the latter. Why even use a new word if it's just going to be the same as "client"?

98. p_l ◴[] No.40055622{4}[source]
Because the NPU isn't for high-end inferencing. It's a relatively small coprocessor that is supposed to do a bunch of tasks with high TOPS/watt without engaging the way more power-hungry GPU.

At release time, the Windows driver for example included a few video processing offloads used by Windows frameworks, for example by MS Teams for background removal - so that such tasks use less battery on laptops and free up the CPU/GPU for other tasks on desktop.

For higher-end processing you can use the same AIE-ML coprocessors in the various chips previously available from Xilinx and now under the AMD brand.

replies(1): >>40055788 #
99. zozbot234 ◴[] No.40055629{6}[source]
> You should be able to get a GPU with a mid-level chip and tons of memory, but it's just not offered.

Apple unified memory is the easiest way to get exactly that. There is a markup on memory upgrades but it's quite reasonable, not at all extreme.

replies(1): >>40055688 #
100. JudasGoat ◴[] No.40055651{7}[source]
I found it interesting that the Apple M3 scored nearly identically to the Radeon 780M. I know the memory bandwidth is slower, but you can add two 32GB SODIMMs to the AMD APU for short money.
101. paulmd ◴[] No.40055676{3}[source]
that's openCL compute, LLM models ideally should be hitting the neural accelerator, not running on generalized gpu compute shaders.
102. Dylan16807 ◴[] No.40055688{7}[source]
But I can't put that GPU into my existing machine, so I'm still paying $3000 extra if I don't want that Apple machine to be my computer.
103. robocat ◴[] No.40055719{3}[source]
Does anyone have any mental heuristics for judging how "useless" a feature is?

Over the decades I have developed a growing antipathy towards products with too many features, especially new versions/models where the vaunted features of the previous version/model seem never to have been used by anyone.

replies(1): >>40056186 #
104. fpgamlirfanboy ◴[] No.40055788{5}[source]
> the same AIE-ML coprocessors

they're not the same - Versal ACAPs (whatever you want to call them) have the AIE1 arch while Phoenix has the AIE2 arch. There are significant differences between the two arches (local memory, bfloat16, etc.)

replies(1): >>40056288 #
105. sva_ ◴[] No.40055799{3}[source]
> Some AMD laptops haven't even yet enabled the NPU in firmware

This is entirely the fault of the OEMs though, not AMD. It is activated on mine for example. But pretty much unusable under Linux at the moment (unless you're willing to run a custom kernel for it[0].)

0. https://github.com/amd/xdna-driver

replies(1): >>40056110 #
106. wmf ◴[] No.40055945[source]
Desktop Ryzen = up to 16 cores

Laptop Ryzen = up to 8 cores (the 8000G are laptop CPUs in a desktop socket)

replies(1): >>40057103 #
107. ◴[] No.40055989{6}[source]
108. fulafel ◴[] No.40056022{3}[source]
Most x86 machines have integrated GPUs and hardware-wise are UMA.
replies(1): >>40056137 #
109. fulafel ◴[] No.40056075{4}[source]
The fancier and more expensive SGIs had higher bandwisth non UMA memory systems on the GPUs.

The thing about bandwidth is that you can just make wider buses with more of the same memory chips in parallel.

110. Rinzler89 ◴[] No.40056110{4}[source]
>This is entirely the fault of the OEMs though, not AMD.

Not true. AMD can dictate how OEMs integrate and use their chips in their products as part of the sales agreement, the same way Nvidia does.

AMD could have told every system integrator buying 7000-series chips and up that the NPU must be active in the final product.

So if the end products suck, AMD bears most of the blame for not ensuring a minimum level of QA with its integrators, who release half-assed stuff, since it all reflects poorly on them in the end. It's one of the reasons why Nvidia keeps such a tight grip over its integrators on how their chips are to be used.

111. alacritas0 ◴[] No.40056137{4}[source]
integrated GPUs are not powerful in comparison to dedicated GPUs
replies(1): >>40060861 #
112. jwr ◴[] No.40056153{3}[source]
Hmm. I'm running decent LLMs locally (deepseek-coder:33b-instruct-q8_0, mistral:7b-instruct-v0.2-q8_0, mixtral:8x7b-instruct-v0.1-q4_0) on my MacBook Pro and they respond pretty quickly. At least for interactive use they are fine and comparable to Anthropic Opus in speed.

That MacBook has an M3 Max and 64GB RAM.

I'd say it does live up to my expectations, perhaps even slightly exceeds them.

113. Rinzler89 ◴[] No.40056186{4}[source]
>Does anyone have any mental heuristics for judging how "useless" a feature is?

My favorite example is the story I lived through of the first generations of consumer 64-bit CPUs.

When the first AMD Athlon 64 came out, everyone I knew was buying them because they thought they were getting something totally future-proof by jumping early on the 64-bit bandwagon, in 2003, when nobody yet had 4GB+ of RAM and neither Windows nor any software would see 64-bit releases until several years later, when Vista came out, which everyone avoided, staying on Windows XP 32-bit and waiting for Windows 7.

And by the time RAM sizes over 4GB and 64-bit software became even remotely mainstream, we already had dual- and quad-core CPUs miles ahead of those original 64-bit CPUs, which were by then obsolete (tech progress back then was wild).

So just like 64-bit silicon was a useless feature on consumer CPUs back then, and like the first GPUs with raytracing, I feel like we're now in the same boat with AI silicon in PCs: not much SW support for it, and when that support does come, these early chips will be obsolete. It's the price of being an early adopter.

replies(1): >>40061793 #
114. p_l ◴[] No.40056288{6}[source]
Phoenix has AIE-ML (what you call AIE2), Versal has choice of AIE (AIE1) and AIE-ML (AIE2) depending on chip you buy.

Essentially, AMD is making two tile designs optimized for slightly different computations and claims that they are going to offer both in Versal, but NPUs use exclusively the ML-optimized ones.

115. JonChesterfield ◴[] No.40056382[source]
Zen4 is very pretty, see https://news.ycombinator.com/item?id=32983406.

I like their APUs a lot. Using a 4800U in the cable tray under the desk to drive screens on which I'm writing this. One in a laptop for whenever I'm away from the desk.

If you're sufficiently determined the compute units on these things are totally usable for running arbitrary code. As in a program that spawns a bunch of threads to work stuff out could have some of those "threads" running on the GPU cores.

I wouldn't say the software stack is totally there for out of the box convenience. As in you'll be writing in freestanding ~C and maybe a bit of assembly. I got partway through implementing that and got sidetracked. The GPU libc in LLVM is roughly the production version of some of that hacking. Between these machines coming out and the MI300A landing I really should put something up on github which looks like a pthread_create that executes on the GPU instead.

116. crawshaw ◴[] No.40056425{4}[source]
An M2 Ultra has 800GB/s of memory bandwidth, an Nvidia 4090 has 1008GB/s. Apple have chosen to use relatively little system memory at unusually high bandwidth.
117. nsteel ◴[] No.40056434{6}[source]
This isn't my area, but won't it be quite expensive to use that much GDDR? The PHY and the controller are complicated, and you've got 2GB devices at most, so if you want more memory you need a wider bus. That requires more beachfront and therefore a bigger die. That must make it expensive once you go beyond what you can fit on your small, cheap chip. Do their 24GB+ cards really use GDDR?*

And you need to ensure you don't shoot yourself in the foot by making anything (relatively) cheap that could be useful for AI...

*Edit: wow yeah, they do! A 384-bit interface on some! Sounds hot.

replies(1): >>40059573 #
118. programd ◴[] No.40056550{6}[source]
Depends on what you're doing. Just chatting with the AI?

I'm getting about 7 tokens per second for Mistral with the Q6_K quant on a bog-standard Intel i5-11400 desktop with 32GB of memory and no discrete GPU (the CPU has Intel UHD Graphics 730 built in). A 2-year-old low-end CPU that goes for, what, $150 these days? As far as I'm concerned that's conversational speed. Pop in some 8-core modern CPU and I'm betting you can double that, without even involving any GPU.

People way overestimate what they need in order to play around with models these days. Use llama.cpp and buy that extra $80 worth of RAM and pay about half the price of a comparable Mac all in. Bigger models? Buy more RAM, which is very cheap these days.

There's a $487 special on Newegg today with an i7-12700KF, motherboard and 32GB of RAM. Add another $300 worth of case, power supply, SSD and more RAM and you're under the price of a MacBook Air. There's your LLM inference machine (not for training, obviously), which can run even the 70B models at home at acceptable conversational speed.
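A CPU-only setup like that is only a few lines with llama.cpp's Python bindings. This is a minimal sketch, with the GGUF file name as a placeholder for whatever quantized model you have on disk:

  from llama_cpp import Llama  # pip install llama-cpp-python
  # Runs on the CPU by default; the GGUF path is a placeholder for your model.
  llm = Llama(model_path="mistral-7b-instruct-v0.2.Q6_K.gguf", n_ctx=2048)
  out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
  print(out["choices"][0]["text"])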

119. chaostheory ◴[] No.40056761{4}[source]
The problem is that it extends to both Mac Studio and Mac Pro.
120. chaostheory ◴[] No.40056817{4}[source]
You summed up my point better than I did
121. protastus ◴[] No.40057103{3}[source]
Dragon Range (e.g., 7945HX) is a laptop/mobile workstation Ryzen with up to 16 cores, but power inefficient compared to the 8000-series due to the chiplet design. Already in market, mostly in gaming laptops.
122. rowanG077 ◴[] No.40058366{3}[source]
So why are you proposing Apple did that, if not for performance, as you claim? They waste a lot of silicon on those extra memory controllers - basically a comparable amount to all the CPU cores in an M1 Max.
123. Dylan16807 ◴[] No.40059573{7}[source]
The upgrade is actually remarkably similar.

RX7600 and RX7600XT have 8GB or 16GB attached to a 128-bit bus.

RX7900XT and W7900 have 24GB or 48GB attached to a 384-bit bus.

Neither upgrade changes the bus width, only the memory chips.

The two upgraded models even use the same memory chips, as far as I can tell! GDDR6, 18000MT/s, 2GB per 16 pins. I couldn't confirm chip count on the W7900 but it's the same density.

replies(1): >>40064467 #
124. dragonwriter ◴[] No.40059613{4}[source]
> The fact a laptop can run 70B+ parameter models is a miracle

Most laptops with 64+GB of RAM can run a 70B model at 4-bit quantization. It’s not a miracle, it’s just math. M2 can do it faster than systems with slower memory bandwidth.

replies(1): >>40064678 #
125. dragonwriter ◴[] No.40059721{4}[source]
> I have a gaming rig with a 4080 with 16GB of RAM and it can’t even run Mixtral (kind of the minimum bar of a useful generic LLM in my opinion) without being heavily quantized.

Which Mixtral? What does “heavily quantized” mean?

> A refurbished M1 Max with 32GB of RAM will enable you to generate better quality LLM output than even a 4090 with 24GB of VRAM

Not if we assume the 4090 is in a computer with non-trivial system RAM (which it kind of needs to be), since models can be split between GPU and CPU. The M1 Max might have better performance for models that take >24GB and <=32GB than the 4090 machine with 24GB of VRAM, but assuming the 4090 is in a machine with 16GB+ of system RAM, it will be able to run bigger models than the M2, as well as outperforming it for models requiring up to 24GB of RAM.

The system I actually use is a gaming laptop with a 16GB 3080Ti and 64GB of system RAM, and Mixtral 8x7B @ 4-bit (~30GB RAM) works.

126. ◴[] No.40060474[source]
127. fulafel ◴[] No.40060861{5}[source]
Generally so, but the gap is closing (like Apple is showing).
128. nercury ◴[] No.40061793{5}[source]
If AMD had not come up with the 64-bit extension, we would be saying goodbye to the x86 architecture.
129. mmaniac ◴[] No.40062642{4}[source]
The benefit when having an ecosystem of discrete GPUs is that CPUs can get away with having low bandwidth memory. This is great if you want motherboards with socketed CPU and socketed RAM which are compatible with the whole range of product segments.

CPUs don't really care about memory bandwidth until you get to extreme core counts (Threadripper/Xeon territory). Mainstream desktop and laptop CPUs are fine with just two channels of reasonably fast memory.

This would bottleneck an iGP, but those are always weak anyway. The PC market has told users who need more to get a discrete GPU and to pay the extra costs involved with high bandwidth soldered memory only if they need it.

The calculation Apple has made is different. You'll get exactly what you need as a complete package. You get CPU, GPU, and the bandwidth you need to feed both as a single integrated SoC all the way to the high end. Modularity is something PC users love but doing away with it does have advantages for integration.

130. edward28 ◴[] No.40063026{4}[source]
I would take those benchmarks with a grain of salt, given that they show a 96-core EPYC losing in multi-thread to a 64-core EPYC and a 32-core Xeon.
131. nsteel ◴[] No.40064467{8}[source]
Perhaps I should have been clearer and said "if you want more memory than the 16GB model you need a wider bus", but this confirms what I described above.

  - cheap gfx die with a 128-bit memory interface.
  - vastly more expensive gfx die with a 384-bit memory interface.
Essentially there's a cheap upgrade option available for each die: swapping the 1GB memory chips for 2GB memory chips (GDDR doesn't support a mix). If you have the 16GB model and you want more memory, there are no bigger memory chips available, so you need a wider bus, and that's going to cost significantly more to produce, hence they charge more.

As a side note, I would expect the GDDR chips to be x32, rather than x16.

replies(1): >>40068306 #
132. ◴[] No.40064678{5}[source]
133. 0x457 ◴[] No.40067455{7}[source]
While there are many people "that work outside Silicon Valley, where the entire company uses Windows centrally managed through Active Directory, and explaining to IT that you need a Mac is an uphill battle", I think these companies either don't care about AI or get into it by acquiring a startup (that runs on Macs).

Companies that run Windows and AD are too busy making sure you move your mouse every 5 minutes while you're on the clock. At least that is my experience.

replies(1): >>40076476 #
134. Dylan16807 ◴[] No.40068306{9}[source]
> Essentially there's a cheap upgrade option available for each die

Hmm, I think you missed what my actual point was.

You can buy a card with the cheap die for $270.

You can buy a card with the expensive die for $1000.

So far, so good.

The cost of just the memory upgrade for the cheap die is $60.

The cost of just the memory upgrade for the expensive die is $3000.

None of that $3000 is going toward upgrading the bus width. Maybe one percent of it goes toward upgrading the circuit board. It's a huge market segmentation fee.

> As a side note, I would expect the the GDDR chips to be x32, rather than 16.

The pictures I've found show the 7600 using 4 ram chips and the 7600 XT using 8 ram chips.

replies(1): >>40069347 #
135. nsteel ◴[] No.40069347{10}[source]
I did miss your point - I hadn't understood that the hefty price tag was just for the memory. Thank you for not biting my head off! That is pretty mad. I'm struggling to think of any explanation other than yours. Even the few supporting upgrades that may be required for the extra memory (power supplies etc.) wouldn't be anywhere near that cost.

I couldn't find much to actually confirm how many chips are used, but https://www.techpowerup.com/review/sapphire-radeon-rx-7600-x... says H56G42AS8DX-014, which is a x32 part (https://product.skhynix.com/products/dram/gddr/gddr6.go?appT...). But either way it can't explain that pricing!

replies(1): >>40071488 #
136. paulmd ◴[] No.40069867{6}[source]
what exactly do you expect them to do about the routing? 384-bit GPUs are as big as memory buses get in the modern era, and that gives you 24GB capacity, or 48GB when clamshelled. Higher-density 3GB modules that would allow 36GB/72GB have been repeatedly pushed back, which has kinda screwed over consoles as well - they are in the same bind with "insufficient" VRAM on the MS side, and while Sony is less awful, they didn't give a VRAM increase with the PS5 Pro either.

GB202 might be going to a 512-bit bus, which is unprecedented in the modern era (nobody has done it since the Hawaii/GCN2 days), but that's really as big as anybody cares to route right now. What do you propose for going past that?

Ironically, AMD actually does have the capability to experiment and put bigger memory buses on smaller dies. And the MCM packaging actually does give you some physical fanout that makes the routing easier. But again, ultimately there is just no appetite for going to 512-bit or 768-bit buses at a design level, for very good technical reasons.

like it just is what it is - GDDR is just not very dense, and this is what you can get out of it. Production of HBM-based GPUs is constrained by stacking capacity, which is the same reason people can't get enough datacenter cards in the first place.

Higher-density LPDDR does exist, but you still have to route it, and bandwidth goes down quite a bit. That's the Apple Silicon approach, which solves your problem, but unfortunately a lot of people just flatly reject any offering from the Fruit Company for interpersonal reasons.

replies(1): >>40072886 #
137. Dylan16807 ◴[] No.40071488{11}[source]
The first two pictures on that page show 4 RAM chips on each side of the board.

https://media-www.micron.com/-/media/client/global/documents...

This document directly talks about splitting a 32 data bit connection across two GDDR6 chips, on page 7.

"Allows for a doubling of density. Two 8Gb devices appear to the controller as a single, logical 16Gb device with two 16-bite wide channels."

Do that with 16Gb chips and you match the GPUs we're talking about.
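A small sketch of the capacity arithmetic in this sub-thread, assuming 2GB (16Gb) x32 GDDR6 chips, with clamshell mode giving each chip half of a 32-bit channel:

  # Capacity follows from bus width, chip density, and whether clamshell is used.
  def capacity_gb(bus_bits: int, clamshell: bool = False, chip_gb: int = 2) -> int:
      bits_per_chip = 16 if clamshell else 32
      return (bus_bits // bits_per_chip) * chip_gb
  for bus in (128, 384):
      print(bus, capacity_gb(bus), capacity_gb(bus, clamshell=True))
  # 128-bit: 8 GB, or 16 GB clamshell; 384-bit: 24 GB, or 48 GB clamshell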

replies(1): >>40074037 #
138. Dylan16807 ◴[] No.40072886{7}[source]
> what exactly do you expect them to do about the routing? 384b gpus are as big as memory buses get in the modern era and that gives you 24gb capacity or 48gb when clamshelled.

I'm not asking for more than that. I'm asking for that to be available on a mainstream consumer model.

Most of the mid-tier GPUS from AMD and nVidia have 256 bit memory busses. I want 32GB on that bus at a reasonable price. Let's say $150 or $200 more than the 16GB version of the same model.

I appreciate the information about higher densities, though.

139. nsteel ◴[] No.40074037{12}[source]
Of course - clamshell mode! They only connect half the data bus of each chip. That also explains how they fit it on the card so easily (and cheaply).
replies(1): >>40078610 #
140. MichaelZuo ◴[] No.40076476{8}[source]
There are some genuine, non-Dilbert-esque uses for AD. Where there really isn't a viable, similarly performant, alternative.
141. dangus ◴[] No.40077726{8}[source]
I’m still seeing over a 100% uplift comparing mobile to mobile on Nvidia products: https://gpu.userbenchmark.com/Compare/Nvidia-RTX-4090-Laptop...

As far as desktop products, power consumption is irrelevant.

142. Dylan16807 ◴[] No.40078610{13}[source]
Yeah, though a way to connect fewer data pins to each chip doesn't particularly have to be clamshell, it just requires a little bit of flexibility in the design.
143. imtringued ◴[] No.40142456{3}[source]
In the benchmark you have linked, you can clearly see that the performance of the CPU-only implementation and the NPU implementation is identical.

https://github.com/amd/RyzenAI-SW/blob/main/example/transfor...

What this should tell you is that "15 TOPS" is an irrelevant number in this benchmark. There are exactly two FLOPs per parameter per token. Loading the parameters takes more time than processing them.

There are people with less than 8GB of VRAM who can't load these models into their GPU and end up with the exact same performance as on the CPU. The 12 TFLOPS of the 3060 Ti 8GB are "no good" for LLMs, because the bottleneck for token generation is memory bandwidth.

My Ryzen 2700 gets 7 tokens per second at 50 GFLOPS. What does this tell you? The NPU can saturate the memory bandwidth of the system.

Now here is the gotcha: have you tried inputting very large prompts? That is where the speedup is going to be extremely noticeable. Instead of waiting minutes on a 2000-token prompt, it will be just as fast as on GPUs, because the initial prompt processing is compute-bound.
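To put rough, heavily assumption-laden numbers on that, using a 7B model, a 2000-token prompt, and the throughput figures quoted above:

  # Illustrative prefill arithmetic; assumes a 7B model, a 2000-token prompt,
  # ~2 FLOPs per parameter per token, ~50 GFLOPS effective on the CPU (the
  # decode-time figure above) and ~16 TOPS on the NPU. Real numbers vary.
  prefill_flops = 2 * 7e9 * 2000            # ~2.8e13 operations
  print(prefill_flops / 50e9, "s on CPU")   # ~560 s, i.e. "waiting minutes"
  print(prefill_flops / 16e12, "s on NPU")  # ~1.8 s if the NPU is fully utilized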

Also, before calling something subpar, you're going to have to tell me how you are going to fit larger models like Goliath 70B or 120B models onto your GPU.