
604 points andy99 | 2 comments
isusmelj ◴[] No.44536509[source]
I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.

replies(5): >>44536696 #>>44536809 #>>44537201 #>>44539869 #>>44541746 #
asjir ◴[] No.44541746[source]
I'd be more concerned about the chosen size being 70B (DeepSeek R1 has 671B), which makes catching up with SOTA harder to begin with.
replies(1): >>44541843 #
1. zettabomb ◴[] No.44541843{3}[source]
SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g., Llama 3.3), it could be quite useful. Not everyone has the VRAM to run the full-fat DeepSeek R1 (rough numbers sketched below).
replies(1): >>44543323 #
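A back-of-the-envelope sketch of the VRAM gap being discussed. This is a minimal illustration, assuming fp16/bf16 weights at 2 bytes per parameter; it ignores KV cache, activations, and runtime overhead, and the model sizes are the rounded figures from the thread:

    # Rough VRAM needed just to hold the weights of a dense model.
    # 2 bytes/param assumes fp16/bf16; quantization would shrink this.
    def weight_vram_gb(params_billion, bytes_per_param=2):
        return params_billion * 1e9 * bytes_per_param / 1024**3

    for name, size_b in [("70B dense", 70), ("DeepSeek R1, 671B total", 671)]:
        print(f"{name}: ~{weight_vram_gb(size_b):.0f} GB for weights alone")
    # 70B dense: ~130 GB for weights alone
    # DeepSeek R1, 671B total: ~1250 GB for weights alone

Even heavily quantized, the full R1 weights are far beyond typical consumer hardware, which is the "not everyone has the VRAM" point above.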
2. tough ◴[] No.44543323[source]
Also, isn't DeepSeek's model a Mixture of Experts? Meaning not all params ever get activated on a single forward pass? (Toy illustration below.)

70B feels like the best balance between being runnable locally and being decent for regular use.

Maybe not SOTA, but a great first step.
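On the Mixture-of-Experts point: a toy routing sketch of why not all parameters are active per forward pass. The expert count, top-k, and layer sizes here are made up for illustration and are not DeepSeek's actual configuration:

    import numpy as np

    # Toy MoE feed-forward layer: n_experts expert MLPs, but each token
    # only passes through the top_k experts chosen by a router.
    n_experts, top_k = 16, 2                     # illustrative, not DeepSeek's config
    d_model, d_ff = 1024, 4096

    expert_params = 2 * d_model * d_ff           # up- and down-projection per expert
    total_params = n_experts * expert_params
    active_params = top_k * expert_params        # only top_k experts run per token

    router_logits = np.random.randn(n_experts)   # stand-in for a learned router
    chosen = np.argsort(router_logits)[-top_k:]  # indices of the top-k experts

    print(f"total expert params:     {total_params:,}")
    print(f"active params per token: {active_params:,} ({active_params/total_params:.0%})")
    print(f"experts used for this token: {sorted(chosen.tolist())}")

That is why a 671B-total MoE can have per-token compute closer to a much smaller dense model, even though you still need memory for all of the experts.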