(cerebras.ai)

427 points benchmarkist | 1 comments | 19 Nov 24 00:15 UTC

zackangelo ◴[19 Nov 24 02:11 UTC] No.42179476[source]▶

This is astonishingly fast. I’m struggling to get over 100 tok/s on my own Llama 3.1 70B implementation on an 8x H100 cluster.

I’m curious how they’re doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won’t get you close. It seems like at a minimum you’d have to do multi-node inference and maybe some kind of sparse attention mechanism?
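
For context on why 100 tok/s is hard to beat on GPUs, here is a rough memory-bandwidth ceiling for single-stream decoding. It is only a sketch under stated assumptions: 16-bit weights, every weight read once per generated token, perfect tensor-parallel scaling across all 8 GPUs, and roughly 3.35 TB/s of HBM bandwidth per H100 (the SXM spec figure, not something stated in this thread). KV-cache traffic and interconnect overhead are ignored, so real numbers land well below this.

```python
def decode_ceiling_tok_per_s(params_billion, bytes_per_param, num_gpus, hbm_tb_per_s):
    """Upper bound on batch-1 decode speed if weight reads are the only cost.

    All arguments are assumptions supplied by the caller, not measured values.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param   # total model size in bytes
    aggregate_bw = num_gpus * hbm_tb_per_s * 1e12            # combined HBM bandwidth, bytes/s
    return aggregate_bw / weight_bytes                       # full weight passes per second


# Llama 3.1 70B, FP16/BF16 weights (assumed), 8x H100 at ~3.35 TB/s each (assumed)
ceiling = decode_ceiling_tok_per_s(70, 2, 8, 3.35)
print(f"bandwidth-bound ceiling: {ceiling:.0f} tok/s")
```

Under these assumptions the ceiling comes out to about 191 tok/s per sequence, so getting "over 100 tok/s" is already a large fraction of what the memory system allows before tricks like speculative decoding or quantization.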

replies(9): >>42179489 #>>42179493 #>>42179501 #>>42179503 #>>42179754 #>>42179794 #>>42180035 #>>42180144 #>>42180569 #

modeless ◴[19 Nov 24 02:16 UTC] No.42179493[source]▶

>>42179476 #

Cerebras is a chip company. They are not using GPUs. Their chip uses wafer-scale integration, which means it's the physical size of a whole wafer: dozens of GPUs in one.

They have limited memory on chip (all SRAM) and it's not clear how much HBM bandwidth they have per wafer. It's a completely different optimization problem than running on GPU clusters.

replies(2): >>42180735 #>>42190988 #

why_only_15 ◴[19 Nov 24 07:05 UTC] No.42180735[source]▶

>>42179493 #

They have about 125 GB/s of off-chip bandwidth.
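
To put that figure in perspective, here is a back-of-the-envelope sketch of what would happen if the 405B model's weights had to be streamed over a 125 GB/s link for every token. The 16-bit weight size and the "read every weight once per token" model are assumptions; the 969 tok/s target is the number from the submission title.

```python
def seconds_per_weight_pass(params_billion, bytes_per_param, link_gb_per_s):
    """Time to move the full set of weights once over a link (assumed inputs)."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return weight_bytes / (link_gb_per_s * 1e9)


# Llama 3.1 405B at 2 bytes/param (assumed) over ~125 GB/s of off-chip bandwidth
stream_s = seconds_per_weight_pass(405, 2, 125)

# Per-token time budget implied by the headline 969 tok/s
budget_s = 1 / 969

print(f"one weight pass over the link: {stream_s:.2f} s")
print(f"per-token budget at 969 tok/s: {budget_s * 1e3:.2f} ms")
print(f"streaming would be ~{stream_s / budget_s:,.0f}x too slow")
```

If these assumptions are roughly right, off-chip streaming misses the target by three to four orders of magnitude, which suggests the weights stay resident in on-chip SRAM (presumably spread over several wafers) rather than being fetched per token.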

replies(1): >>42180792 #

saagarjha ◴[19 Nov 24 07:19 UTC] No.42180792[source]▶

>>42180735 #

Do they just not do HBM at all or

replies(1): >>42182712 #

why_only_15 ◴[19 Nov 24 12:30 UTC] No.42182712[source]▶

>>42180792 #

I'm not too up to date, but as I recall there are a lot of oddities that come from how big their chip is (e.g. thermal expansion being a problem). I believe they have a single giant line in the middle of the chip for this reason. Maybe this makes HBM etc. hard? Certainly their chip would be more appealing if they cut the number of cores by 10x, added matrix units, and added HBM, but it looks like they're not going to go this way.

Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference