
111 points | galeos | 1 comment
Havoc | No.43715393
Is there a reason why the 1.58-bit models are always quite small? I think I've seen an 8B one, but that's about it.

Is there a technical reason for it, or is it just research convenience?

londons_explore | No.43715453
I suspect it's because current GPU hardware can't efficiently train such low-bit-depth models. You end up needing activations of 8 or 16 bits in all the data paths, and you don't get any more throughput per cycle on the multiplications than you would with FP32.
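As a rough sketch of what that looks like in practice (PyTorch; the absmean scaling follows the BitNet b1.58 paper, but the function names and shapes here are just illustrative), even a ternary layer still ends up doing an ordinary full-precision GEMM on today's hardware:

    import torch

    def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
        # Absmean scaling: round weights to {-1, 0, +1} times a scalar scale.
        scale = w.abs().mean().clamp(min=eps)
        w_q = (w / scale).round().clamp(-1, 1)
        return w_q, scale

    def bitlinear_forward(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        w_q, scale = ternary_quantize(w)
        # Straight-through estimator: use quantized weights in the forward pass,
        # but let gradients flow to the full-precision master weights.
        w_ste = w + (w_q * scale - w).detach()
        # The multiply itself is still a dense FP32/BF16 GEMM -- there are no
        # ternary tensor cores, so no per-cycle throughput win during training.
        return x @ w_ste.t()

    x = torch.randn(4, 512)
    w = torch.randn(1024, 512, requires_grad=True)
    y = bitlinear_forward(x, w)
    y.sum().backward()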

Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.

Havoc | No.43715606
Makes sense. It might still be good for memory-throughput-constrained devices, though, so I'm hoping it'll pick up.
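To put rough numbers on the memory argument (the 2-bit packing below is a hypothetical scheme, just to show the arithmetic): ternary weights stored at 2 bits each need about 8x fewer bytes streamed from DRAM per weight than FP16, which is exactly what matters on bandwidth-limited devices:

    import numpy as np

    def pack_ternary(w_q: np.ndarray) -> np.ndarray:
        # Pack ternary values {-1, 0, +1} into 2 bits each (4 weights per byte).
        codes = (w_q + 1).astype(np.uint8).reshape(-1, 4)   # map to {0, 1, 2}
        return (codes[:, 0] | (codes[:, 1] << 2) |
                (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

    def unpack_ternary(packed: np.ndarray) -> np.ndarray:
        codes = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
        return codes.reshape(-1).astype(np.int8) - 1        # back to {-1, 0, +1}

    w_q = np.random.randint(-1, 2, size=4096 * 4096).astype(np.int8)
    packed = pack_ternary(w_q)
    assert np.array_equal(unpack_ternary(packed), w_q)
    print(f"fp16: {w_q.size * 2 / 2**20:.0f} MiB, packed 2-bit: {packed.nbytes / 2**20:.0f} MiB")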