
63 points fidotron | 24 comments
1. mika6996 ◴[] No.43574293[source]
Are the Tenstorrent Blackhole cards at all competitive?
replies(3): >>43574938 #>>43574986 #>>43579794 #
2. wincy ◴[] No.43574612[source]
I thought this was somehow BitTorrent related. I read this article and am not really sure if this is something to be excited about? Can I run Deepseek R1 with whatever this is?
replies(1): >>43574905 #
3. lostmsu ◴[] No.43574710[source]
It's funny that they sell a tensor accelerator but don't mention its TOPS anywhere.
replies(1): >>43579888 #
4. rasengan0 ◴[] No.43574720[source]
Nice filter of supported models: https://tenstorrent.com/developers but it needs another filter for the Blackhole RISC-V p100a/p150. My guess, going by memory (https://tenstorrent.com/hardware/blackhole), is that it's up there with the n300: https://docs.tenstorrent.com/aibs/wormhole/specifications.ht...
5. krasin ◴[] No.43574905[source]
DeepSeek R1 has 671B parameters. If you want to run it at 8 bits/parameter (the native format DeepSeek R1 was trained in, so the best quality possible for the model), you would need about 32 Blackhole p150a cards ([1]), which comes to around $42k and 10 kW of power consumption.

So, yes, you can run DeepSeek R1 on it, but there are cheaper options (if we only talk about inference).

1. https://tenstorrent.com/hardware/blackhole
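A quick sketch of that sizing arithmetic. The $1,300 and 300 W per-card figures are from the Tenstorrent Blackhole page linked above; rounding the card count up to a power of two (for an even tensor-parallel split plus KV-cache headroom) is an assumption, since the comment doesn't say how it gets from 671 GB of weights to 32 cards:

```python
import math

params_b = 671          # DeepSeek R1 parameter count, in billions
bytes_per_param = 1     # 8 bits/parameter, the native format
vram_per_card_gb = 32   # Blackhole p150a
price_per_card = 1300   # USD, per the Tenstorrent page
watts_per_card = 300    # TDP

weights_gb = params_b * bytes_per_param               # 671 GB of weights alone
min_cards = math.ceil(weights_gb / vram_per_card_gb)  # 21 cards just for weights

# Assumption: round up to a power of two for an even tensor-parallel
# split and to leave room for KV cache / activations.
cards = 2 ** math.ceil(math.log2(min_cards))          # 32

print(cards, cards * price_per_card, cards * watts_per_card / 1000)
# 32 cards, $41,600, 9.6 kW -- matching the ~$42k / 10 kW estimate
```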

6. krasin ◴[] No.43574938[source]
They seem to be showing very decent performance results for diffusion transformers. Not so much for the autoregressive transformers (the "regular" ones).
7. sprash ◴[] No.43574986[source]
A test [1] by a random dude with the older Wormhole n150 delivers half the performance (as in tokens/s) of an RTX 4090 in generic Llama tests. The new p150 should have double the performance according to the specs, but who knows. I'd call it somewhat competitive.

1.: https://youtu.be/WibEx3jfKu0?t=957

8. littlestymaar ◴[] No.43575210[source]
Re-using a comment I wrote some time ago:

Tenstorrent really needs to put more VRAM on their cards.

If Chinese companies can hack Nvidia GPUs with 48 or 96 GB of VRAM at a competitive price, surely Tenstorrent can too.

Variants of n300d at $2500 for 48GB and $3900 for 96GB would be instant hits.

~~24GB for $1500 simply isn't gonna do it.~~ (This part of the comment referred to the old n300 and can be updated to: 32GB for $1400 still isn't enough for success. There's some progress, but that's still too low considering it's exotic hardware that will come with tons of compatibility issues.)

replies(4): >>43575246 #>>43575326 #>>43575727 #>>43579793 #
9. krasin ◴[] No.43575246[source]
It's 32GB for $1300 for the Blackhole p150a ([1]). The rest of your point is very true.

1. https://tenstorrent.com/hardware/blackhole

replies(1): >>43575403 #
10. aseipp ◴[] No.43575326[source]
The new p150 cards linked in the OP have 32GB GDDR6 @ 512GB/s for $1,300. Which isn't bad on paper, I guess. They're meant to be networked (quad 800G QSFP-DD) like Nvidia GPUs, so two of them would get you 64GB of VRAM at $2600 for ~600W, which is basically what you're asking for? The power usage isn't good enough yet at scale, I think, but for a workstation it's quite manageable.

Real workloads remain to be seen, but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...

replies(1): >>43575428 #
11. littlestymaar ◴[] No.43575403{3}[source]
I updated my comment accordingly; I had just copy-pasted a comment of mine from Reddit from a few days ago, but this part needed an update.
12. littlestymaar ◴[] No.43575428{3}[source]
> so two of them would get you 64GB of VRAM at $2600 for ~600W which is basically what you're asking for?

Almost, except with respect to space in the box and power usage, which are critical IMHO.

> but if they can actually get a working build of vLLM and their cards remain actually buyable, well, they're doing better than some of the competition...

That's a big if, though: poor software support is to be expected and you'll need to factor that in, IMHO, and that's why they need to beef up the memory. Of course, if software support turns out to be stellar, then it may be a good enough deal.

13. bigyabai ◴[] No.43575727[source]
Dedicated memory isn't the issue. Increase DRAM on your card and your bandwidth goes down; increase the bandwidth and your price increases reciprocally. The solution isn't to just solder more memory anywhere it fits; these are well-paid engineers working to optimize a complex problem space. The Chinese board fluxers are working with a different class of hardware that regularly ships with dark silicon, binned hardware and die-chopped configurations.

You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory.

replies(1): >>43580765 #
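The capacity/bandwidth coupling above can be made concrete with a rough GDDR6 model: bandwidth is bus width times per-pin data rate, so adding capacity via denser or clamshell chips on the same bus adds zero bandwidth, while widening the bus means more memory controllers, die area, and board layers. The 256-bit/16 Gbit/s split for the p150's quoted 512 GB/s is an assumption; Tenstorrent doesn't publish the bus width in this thread.

```python
def gddr6_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbit: float) -> float:
    """Peak bandwidth in GB/s: pins x per-pin rate, divided by 8 bits/byte."""
    return bus_width_bits * pin_rate_gbit / 8

base = gddr6_bandwidth_gb_s(256, 16.0)             # 512 GB/s, matches the p150 spec
# Doubling capacity with denser/clamshell chips on the same 256-bit bus:
doubled_capacity = gddr6_bandwidth_gb_s(256, 16.0) # still 512 GB/s
# Doubling bandwidth requires doubling the bus width -- this is where
# the cost and board complexity go up:
doubled_bus = gddr6_bandwidth_gb_s(512, 16.0)      # 1024 GB/s
```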
14. mixmastamyk ◴[] No.43575882[source]
Not clear how RISC-V is an AI accelerator? Is this a special build of a CPU, to look like a GPU?

Found answer: https://tenstorrent.com/faq

16. jauntywundrkind ◴[] No.43576107[source]
Data goes in, AI emits hawking radiation out, in fancy patterns. Apt enough metaphor I guess.
17. PinkiesBrain ◴[] No.43579793[source]
It's not meant as a workstation/tinkering system, the card without networking is not the main aim. If you're willing to pay 4k for 96GB, just get 3 with networking.

That said, it missed the boat on MoE. The future is two tiered memory systems, NVIDIA has already announced they are doing that. Ideally these cards should have 4-8 DIMM slots for a couple channels of DDR5.

That would also make them far more useful for workstations/tinkering.
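The two-tier argument above comes down to a capacity/bandwidth asymmetry in MoE models: total capacity scales with all experts, but per-token reads only touch the active ones. A toy sizing sketch, using loosely DeepSeek-R1-shaped numbers (671B total / ~37B active parameters) rather than any specific Tenstorrent target:

```python
total_params_b = 671    # all experts (capacity problem)
active_params_b = 37    # experts touched per token (bandwidth problem)
bytes_per_param = 1     # 8-bit weights

capacity_needed_gb = total_params_b * bytes_per_param   # ~671 GB: cheap DDR5 DIMM territory
per_token_read_gb = active_params_b * bytes_per_param   # ~37 GB actually moved per token

# If the hot experts / KV cache sit in the card's 512 GB/s GDDR6, the
# bandwidth-bound decode ceiling is roughly:
tokens_per_s_bound = 512 / per_token_read_gb            # ~13.8 tok/s upper bound
```

The point: DIMM slots buy cheap capacity for the cold experts without needing GDDR6-class bandwidth, since only the active slice is bandwidth-critical.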

18. imtringued ◴[] No.43579794[source]
Tenstorrent's Wormhole was garbage for anything other than development.

Blackhole is actually usable. They cost slightly more than a used 3090 Ti with 24 GB VRAM, but they come with 32GB GDDR6 and 4x 800G networking (apparently only Blackhole to Blackhole).

Nvidia's datacenter GPUs are at least 4x faster, but they also cost 20 times as much. There's also the fact that they have as much SRAM as Groq LPUs.

There is also another present: you get 16 Ascalon cores per Blackhole card. Yes, you read that right. You are getting 16 of the fastest RISC-V cores ever developed for a measly $999 to $1,299.

My only complaint is that they have these insane 300W TDPs.

replies(1): >>43580900 #
19. imtringued ◴[] No.43579888[source]
72 Tensix cores -> 140 Tensix cores. They doubled their TFLOPS along with VRAM and memory bandwidth, thereby maintaining the same TFLOPS-to-memory ratio.
replies(1): >>43580452 #
20. lostmsu ◴[] No.43580452{3}[source]
So what's the number before and after?
21. littlestymaar ◴[] No.43580765{3}[source]
> Dedicated memory isn't the issue.

To run large MoE models it is.

> Increase DRAM on your card and your bandwidth goes down

Why would it?

> You'll note that Apple didn't just immediately resume shipping systems with 1.5TB of RAM when they revised their own system architecture. It's taken them half a decade to recoup a third of that capacity at the VRAM-level speeds they require to unify the GPU and CPU's memory

I fail to see how a unified architecture on a general purpose CPU is a good illustration when we're discussing PCIe accelerator cards. The problems they face have little in common.

22. camel-cdr ◴[] No.43580900{3}[source]
> You get 16 Ascalon cores per Blackhole card.

No you don't; you get licensed SiFive X280 cores, which are slow in-order cores with 512-bit vector registers (dual-issue 256-bit ALUs).

See: https://docs.tenstorrent.com/aibs/blackhole/specifications.h...

and the SiFive X280 page: https://www.sifive.com/cores/intelligence-x280

23. tucnak ◴[] No.43583219[source]
Since the n300s came out, they have publicly shared their roadmap, so I've been waiting for next-generation hardware ever since. They also announced the p300 yesterday (which puts two Blackhole chips on one card, akin to what the n300 did before).

So that would put a p300 unit at 64 GB GDDR6 and 1 TB/s of bandwidth. Very competitive, considering that Tenstorrent is now the only vendor offering scale-out at a reasonable price point. Whoever figures out how to make it work with Corundum [1] for unlimited K/V-cache offloading is going to make a lot of money: as agents spend more time executing tool code, decoupled from chats, individual jobs will take more and more time, so scheduling will become more important. How do you manage TBs of K/V cache concurrently?

People complaining about bandwidth are not seeing the bigger picture, probably because they're unaware NVMe-oF exists and never kept up with modern network topologies; the hyperscaler Kool-Aid doesn't include it.

[1] https://github.com/corundum/corundum
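To put a number on "TBs of K/V cache": the standard per-token cache size is 2 (K and V) x layers x KV heads x head dim x bytes per element. A sketch with a generic Llama-70B-like shape (80 layers, GQA with 8 KV heads, head_dim 128, fp16 cache) — an assumption for illustration, not any model named in the thread:

```python
# K/V cache back-of-the-envelope: why concurrent long-context agent
# jobs reach terabytes. Model shape is a generic 70B-dense/GQA
# assumption, not a specific Tenstorrent target.
layers, kv_heads, head_dim, bytes_per_el = 80, 8, 128, 2

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_el  # 320 KiB/token
per_job_gb = kv_bytes_per_token * 131_072 / 2**30   # one 128k-token job: 40 GiB
fleet_tb = per_job_gb * 100 / 1024                  # 100 concurrent jobs: ~3.9 TiB
```

At that scale the cache for a fleet of long-running jobs can't all sit in GDDR6, which is the opening for NVMe-oF / remote-memory offload schemes like the one described above.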