Evaluating the Infinity Cache in AMD Strix Halo

(chipsandcheese.com)

141 points zdw | 1 comments | 22 Oct 25 04:20 UTC

andrewstuart ◴[22 Oct 25 05:17 UTC] No.45665124[source]▶

Despite this APU being deeply interesting to people who want to do local AI, anecdotally I hear that it’s hard to get models to run on it.

Why would AMD not have focused everything it possibly has on demonstrating and documenting and fixing and showing and smoothing the path for AI on their systems?

Why does AMD come across as so generally clueless when it comes to giving developers what they want, compared to Nvidia?

AMD should do whatever it takes to avoid this sort of situation:

https://youtu.be/cF4fx4T3Voc?si=wVmYmWVIya4DQ8Ut

replies(10): >>45665138 #>>45665148 #>>45665186 #>>45665215 #>>45665736 #>>45665755 #>>45665858 #>>45665962 #>>45667229 #>>45671834 #

pella ◴[22 Oct 25 05:21 UTC] No.45665148[source]▶

>>45665124 #

"The AMD Ryzen™ AI Max+ processor is the first (and only) Windows AI PC processor capable of running large language models up to 235 Billion parameters in size. This includes support for popular models such as: Open AI's GPT-OSS 120B and Z.ai Org's GLM 4.5 Air. The large unified memory pool also allows models (up to 128 Billion parameters) to run at their maximum context length (which is a memory intensive feature) - enabling and empowering use cases involving tool-calling, MCP and agentic workflows - all available today. "

  GPT-OSS 120B MXFP4              : up to 44 tk/s
  GPT-OSS 20B MXFP4               : up to 62 tk/s
  Qwen3 235B A22B Thinking Q3_K_L : up to 14 tk/s
  Qwen3 Coder 30B A3B Q4_K_M      : up to 66 tk/s
  GLM 4.5 Air Q4_K_M              : up to 16 tk/s

(performance in tk/s): https://www.amd.com/en/blogs/2025/amd-ryzen-ai-max-personal-...

replies(2): >>45665211 #>>45668203 #

storus ◴[22 Oct 25 12:43 UTC] No.45668203[source]▶

>>45665148 #

Strix Halo can only allocate 96GB of RAM to the GPU, so GPT-OSS 120B can be run at Q6 at best (and activations would then need to be partially stored in CPU memory).

replies(3): >>45668309 #>>45669073 #>>45670029 #

1. ondra ◴[22 Oct 25 12:51 UTC] No.45668309[source]▶

>>45668203 #

GPT-OSS 120B uses a native 4-bit representation (MXFP4), so it fits fine.
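
A rough back-of-the-envelope check of the sizes being argued about. This is only a sketch: it assumes a nominal 120 billion parameters (read off the model name), a flat bits-per-weight figure with no quantization scale overhead, and it ignores KV cache and activations entirely. The 96 figure is the GPU allocation limit quoted upthread, treated here as GiB; `weight_gib` and `fits` are names made up for this example.

```python
GPU_BUDGET_GIB = 96  # limit quoted upthread; GB vs GiB is an assumption here


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         budget_gib: float = GPU_BUDGET_GIB) -> bool:
    """True if the weights alone fit in the budget (no KV cache, no activations)."""
    return weight_gib(params_billion, bits_per_weight) <= budget_gib


# Nominal 120B model at a few common weight widths.
for bits in (4, 6, 8, 16):
    size = weight_gib(120, bits)
    print(f"{bits:>2}-bit: {size:6.1f} GiB  fits={fits(120, bits)}")
```

Under these assumptions the 4-bit and 6-bit weights come in under the 96 limit and the 8-bit weights do not, which is consistent with both comments above. How much headroom is left for context depends on things this sketch leaves out.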
