Could anyone break down the steps further?
I mean, they could be better (to put it nicely), but there is a legitimate use-case for them and I'd love to see more work in this space.
https://machinelearning.apple.com/research/introducing-apple...
From an article I have in draft, experimenting with open-source text embeddings:
./match venture capital
purchase 0.74005488647684
sale 0.80926752301733
place 0.81188663814236
positive sentiment 0.90793311875207
negative sentiment 0.91083707598925
time 0.9108697315425
./store sillicon valley
./match venture capital
sillicon valley 0.7245139487301
purchase 0.74005488647684
sale 0.80926752301733
place 0.81188663814236
positive sentiment 0.90793311875207
negative sentiment 0.91083707598925
time 0.9108697315425
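For reference, here is a minimal sketch of how a store/match tool like the one above could work. This is not the original script: the embedding model ("all-MiniLM-L6-v2" via sentence-transformers) and the metric (cosine distance, smaller = closer) are assumptions, so the absolute scores won't match the transcript.

    # store_match.py -- hypothetical reconstruction, not the original tool
    # pip install sentence-transformers numpy
    import sys, json, os
    import numpy as np
    from sentence_transformers import SentenceTransformer

    MODEL = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, not the author's
    DB = "store.json"                                # assumed storage format

    def load():
        if not os.path.exists(DB):
            return {}
        with open(DB) as f:
            return json.load(f)

    def store(term):
        db = load()
        db[term] = MODEL.encode(term).tolist()  # embed the term and persist it
        with open(DB, "w") as f:
            json.dump(db, f)

    def match(query):
        q = MODEL.encode(query)
        # cosine distance: 0 = same direction, larger = less similar
        dists = {}
        for term, vec in load().items():
            v = np.array(vec)
            dists[term] = 1 - float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        for term, d in sorted(dists.items(), key=lambda kv: kv[1]):
            print(term, d)

    if __name__ == "__main__":
        cmd, text = sys.argv[1], " ".join(sys.argv[2:])
        store(text) if cmd == "store" else match(text)

Usage would be something like "python store_match.py store sillicon valley" followed by "python store_match.py match venture capital"; the ranking should behave like the transcript above, but the numbers depend entirely on which embedding model is used.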
Of course you need to figure out what these black boxes understand. For sentiment analysis, for example, instead of having it match against "positive" and "negative" you might make the matching terms "kawaii" and "student debt", depending on how the text embedding internalized positives and negatives from its training data.
Custom silicon would solve that, but nobody wants to build custom silicon for a data format that will go out of fashion before the production run is done.
I know it's not ChatGPT-4, but I've tried other very small models that run on CPU only and had better results.
Maybe you can share some comparative examples?
Here's the same prompt given to smollm2:135m.
The quality of the second set of results is not fantastic: the data isn't public, and it repeats itself, mentioning income a few times. I don't think I would use either of these models for accurate data, but I was surprised at the truncated results from bitnet.
Smollm2:360M returned better-quality results with no repetition, but it did suggest things which didn't fit the brief exactly (public data given location only).
Edit:
I tried the same query on the live demo site and got much better results. Maybe something went wrong on my end?
>Marine Le Pen, a prominent figure in France, won the 2017 presidential election despite not championing neoliberalism. Several factors contributed to her success: (…)
What data did they train their model on?
By using 4 ternary weights per 8 bits, the model is not quite as space-efficient as it could be in terms of information density: each ternary weight carries log2(3) ≈ 1.58 bits, so (4*1.58)/8 ≈ 0.79, versus (5*1.58)/8 ≈ 0.99 for the densest packing (3^5 = 243 values fit in one byte). There is currently no hardware acceleration for operating on 5 trits packed into 8 bits, so the weights have to be packed and unpacked in software, and packing 5 weights into 8 bits requires slower, more complex packing/unpacking algorithms than the simple 2-bits-per-weight layout.
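To make the trade-off concrete, here is a small illustration (not bitnet's actual code) of the two schemes: 4 trits per byte in fixed 2-bit fields, which unpack with shifts and masks, versus 5 trits per byte encoded as a base-3 number, which needs a divide/modulo per weight.

    # Hypothetical illustration of the two packing schemes; not bitnet's actual code.
    # Ternary weights {-1, 0, +1} are stored here as {0, 1, 2}.

    def pack4(trits):
        # 4 trits -> 1 byte, 2 bits each; the value 3 is wasted, hence ~0.79 density
        assert len(trits) == 4
        b = 0
        for i, t in enumerate(trits):
            b |= (t & 0b11) << (2 * i)  # plain shift/mask
        return b

    def unpack4(b):
        return [(b >> (2 * i)) & 0b11 for i in range(4)]

    def pack5(trits):
        # 5 trits -> 1 byte as a base-3 number; 3**5 = 243 <= 256, ~0.99 density
        assert len(trits) == 5
        b = 0
        for t in reversed(trits):
            b = b * 3 + t
        return b

    def unpack5(b):
        out = []
        for _ in range(5):
            out.append(b % 3)  # per-weight divide/modulo, no SIMD fast path
            b //= 3
        return out

    assert unpack4(pack4([0, 1, 2, 1])) == [0, 1, 2, 1]
    assert unpack5(pack5([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]

The base-3 layout buys one extra weight per byte (the jump from ~0.79 to ~0.99 density above), but the per-weight division, or a lookup table standing in for it, is what current hardware has no fast path for, whereas the 2-bit layout maps directly onto shift-and-mask operations that existing SIMD units handle well.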
Context: https://web.archive.org/web/20030830105202/http://www.catb.o...