    468 points speckx | 18 comments
    Aurornis ◴[] No.45302320[source]
    I thought the conclusion should have been obvious: A cluster of Raspberry Pi units is an expensive nerd indulgence for fun, not an actual pathway to high performance compute. I don’t know if anyone building a Pi cluster actually goes into it thinking it’s going to be a cost effective endeavor, do they? Maybe this is just YouTube-style headline writing spilling over to the blog for the clicks.

    If your goal is to play with or learn on a cluster of Linux machines, the cost effective way to do it is to buy a desktop consumer CPU, install a hypervisor, and create a lot of VMs. It’s not as satisfying as plugging cables into different Raspberry Pi units and connecting them all together if that’s your thing, but once you’re in the terminal the desktop CPU, RAM, and flexibility of the system will be appreciated.

    replies(11): >>45302356 #>>45302424 #>>45302433 #>>45302531 #>>45302676 #>>45302770 #>>45303057 #>>45303061 #>>45303424 #>>45304502 #>>45304568 #
    1. glitchc ◴[] No.45302424[source]
    I did some calculations on this. Procuring a Mac Studio with the latest Mx Ultra processor and maxing out the memory seems to be the most cost effective way to break into 100b+ parameter model space.
    replies(8): >>45302483 #>>45302490 #>>45302620 #>>45302698 #>>45302777 #>>45302916 #>>45302937 #>>45304489 #
    2. Palomides ◴[] No.45302483[source]
    even a single new mac mini will beat this cluster on any metric, including cost
    3. randomgermanguy ◴[] No.45302490[source]
    Depends on how heavy one wants to go with the quants (for Q6-Q4 the AMD Ryzen AI MAX chips seem better/cheaper way to get started).

    Also, the Mac Studio is a bit hampered by its low compute power, meaning you really can't use a 100b+ dense model; only MoE is feasible without multi-minute prompt-processing times (assuming 500+ tokens etc.)

    replies(2): >>45303116 #>>45303242 #
    4. the8472 ◴[] No.45302620[source]
    You could try getting a DGX Thor devkit with 128GB unified memory. Cheaper than the 96GB mac studio and more FLOPs.
    replies(1): >>45305690 #
    5. eesmith ◴[] No.45302698[source]
    Geerling links to last month's essay on a Framework cluster, at https://www.jeffgeerling.com/blog/2025/i-clustered-four-fram... . In it he writes 'An M3 Ultra Mac Studio with 512 gigs of RAM will set you back just under $10,000, and it's way faster, at 16 tokens per second.' for 671B parameters; that is, the M3 Ultra has at least 3x the performance of the other three systems.
    6. GeekyBear ◴[] No.45302777[source]
    Now that we know that Apple has added tensor units to the GPU cores the M5 series of chips will be using, I might be asking myself if I couldn't wait a bit.
    replies(1): >>45304691 #
    7. teleforce ◴[] No.45302916[source]
    Not quite. As it stands now, the most cost-effective way is most likely a Framework Desktop or a similar system, for example the HP G1a laptop/PC [1],[2].

    [1] The Framework Desktop is a beast:

    https://news.ycombinator.com/item?id=44841262

    [2] HP ZBook Ultra:

    https://www.hp.com/us-en/workstations/zbook-ultra.html

    8. llm_nerd ◴[] No.45302937[source]
    The next generation M5 should bring the matmul functionality seen on the A19 Pro to the desktop SoC's GPU -- "tensor" cores, in essence -- and will dramatically improve the running of most AI models on those machines.

    Right now the Macs are viable purely because you can get massive amounts of unified memory. Be pretty great when they have the massive matrix FMA performance to complement it.

    9. GeekyBear ◴[] No.45303116[source]
    Given the RAM limitations of the first gen Ryzen AI MAX, you have no choice but to go heavy on the quantization of the larger LLMs on that hardware.
    10. mercutio2 ◴[] No.45303242[source]
    Huh? My maxed out Mac Studio gets 60-100 tokens per second on 120B models, with latency on the order of 2 seconds.

    It was expensive, but slow it is not for small queries.

    Now, if I want to bump the context window to something huge, it does take 10-20 seconds to respond for agent tasks, but it’s only 2-3x slower than paid cloud models, in my experience.

    Still a little annoying, and the models aren’t as good, but the gap isn’t nearly as big as you imply, at least for me.

    replies(3): >>45303597 #>>45304594 #>>45304642 #
    11. zargon ◴[] No.45303597{3}[source]
    GPT OSS 120B only has 5B active parameters. GP specifically said dense models, not MoE.
    12. encom ◴[] No.45304489[source]
    >Mac

    >cost effective

    lmao

    13. ◴[] No.45304594{3}[source]
    14. EnPissant ◴[] No.45304642{3}[source]
    I think the Mac Studio is a poor fit for gpt-oss-120b.

    On my 96 GB DDR5-6000 + RTX 5090 box, I see ~20s prefill latency for a 65k prompt and ~40 tok/s decode, even with most experts on the CPU.

    A Mac Studio will decode faster than that, but prefill will be tens of times slower due to much lower raw compute vs a high-end GPU. For long prompts that can make it effectively unusable. That’s what the parent was getting at. You will hit this long before 65k context.

    If you have time, could you share numbers for something like:

    llama-bench -m <path-to-gpt-oss-120b.gguf> -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096

    Edit: The only Mac Studio pp65536 datapoint I’ve found is this Reddit thread:

    https://old.reddit.com/r/LocalLLaMA/comments/1jq13ik/mac_stu ...

    They report ~43.2 minutes prefill latency for a 65k prompt on a 2-bit DeepSeek quant. Gpt-oss-120b should be faster than that, but still very slow.
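
    To compare that figure with llama-bench output, the latency can be turned into a prompt-processing rate. A rough sketch, assuming "65k" means the 65,536-token pp65536 test; the function name is only for illustration:

```python
# Convert a reported prefill latency into prompt-processing throughput
# (tokens/s), the unit llama-bench prints for its pp tests.
def pp_tokens_per_s(prompt_tokens: int, latency_s: float) -> float:
    return prompt_tokens / latency_s

# Figures from the Reddit datapoint: 65,536-token prompt (assumed),
# ~43.2 minutes of prefill.
reddit_pp = pp_tokens_per_s(65536, 43.2 * 60)
print(f"{reddit_pp:.1f} tokens/s")
```

    That is for a different, much larger model (a 2-bit DeepSeek quant), so it is only a loose lower reference point for gpt-oss-120b.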

    replies(1): >>45311625 #
    15. t1amat ◴[] No.45304691[source]
    This is the right take. You might be able to get decent (2-3x less than a GPU rig) token generation, which is adequate, but your prompt processing speeds are more like 50-100x slower. A hardware solution is needed to make long context actually usable on a Mac.
    16. glitchc ◴[] No.45305690[source]
    Yeah, but its memory is slower than the M3 Ultra's. There's a big difference in memory bandwidth, which seems to be a driving factor for inference. For training, on the other hand, it's probably a lot faster.
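
    The bandwidth argument can be put as a back-of-the-envelope ceiling: each generated token has to stream the active weights from memory once, so decode speed is at most bandwidth divided by bytes read per token. A minimal sketch under loud assumptions: the 819 GB/s figure is an assumed M3 Ultra spec to check against the vendor sheet, 5B active parameters is the gpt-oss-120b number quoted elsewhere in this thread, and 4 bits per weight is an assumed quantization. It ignores KV cache reads, attention, and all overhead, so real numbers land well below it.

```python
# Upper bound on decode tokens/s if generation were purely limited by
# streaming the active weights from memory once per token.
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_b: float,
                         bits_per_weight: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Assumed inputs (not measured): 819 GB/s, 5B active params, 4-bit weights.
ceiling = decode_ceiling_tok_s(819, 5, 4)
print(f"ceiling: {ceiling:.0f} tokens/s")
```

    Swapping in another machine's bandwidth shows why a unit with more FLOPs but slower memory can still decode more slowly.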
    17. int_19h ◴[] No.45311625{4}[source]
    This is a Mac Studio M1 Ultra with 128 GB of RAM.

      > llama-bench -m ./gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 999 -fa 1 --mmap 0 -p 65536 -b 4096 -ub 4096       
                                                                                                 
      | model                          |       size |     params | backend    | threads | n_batch | n_ubatch | fa | mmap |            test |                  t/s |
      | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | -------: | -: | ---: | --------------: | -------------------: |
      | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |         pp65536 |       392.37 ± 43.91 |
      | gpt-oss 120B MXFP4 MoE         |  59.02 GiB |   116.83 B | Metal,BLAS |      16 |    4096 |     4096 |  1 |    0 |           tg128 |         65.47 ± 0.08 |
      
      build: a0e13dcb (6470)
    replies(1): >>45314457 #
    18. EnPissant ◴[] No.45314457{5}[source]
    Thanks. That’s better than I expected: about 167 s of prefill latency, which is only ~8.3x worse than a 5090 + CPU.
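
    The 167 s figure follows from the pp65536 row in the llama-bench table: prompt length divided by prompt-processing throughput. A quick sketch of the arithmetic; the function name is illustrative, and the 20 s baseline is the approximate ~20 s prefill quoted earlier for the 5090 box, so the ratio is only approximate too.

```python
# Prefill latency = prompt tokens / prompt-processing throughput.
def prefill_latency_s(prompt_tokens: int, pp_tokens_per_s: float) -> float:
    return prompt_tokens / pp_tokens_per_s

# M1 Ultra row from the table: pp65536 at 392.37 tokens/s.
m1_ultra_s = prefill_latency_s(65536, 392.37)

# Approximate 5090 + CPU baseline from earlier in the thread (~20 s).
ratio = m1_ultra_s / 20.0

print(f"M1 Ultra prefill: {m1_ultra_s:.0f} s, about {ratio:.1f}x the 5090 box")
```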