166 points by galeos | 34 comments
1. zamadatix ◴[] No.41879524[source]
For anyone that hasn't read the previous papers: the "1.58-bit" part comes from using 3 values (-1, 0, 1), and log2(3) = 1.58...
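A quick sanity check of that number (a minimal standalone C++ sketch, not from the repo):

    #include <cmath>
    #include <cstdio>

    int main() {
        // Information content of one ternary weight drawn from {-1, 0, +1}.
        std::printf("log2(3) = %f bits per weight\n", std::log2(3.0));  // ~1.585
        return 0;
    }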
2. newfocogi ◴[] No.41879780[source]
I'm enthusiastic about BitNet and the potential of low-bit LLMs - the papers show impressive perplexity scores matching full-precision models while drastically reducing compute and memory requirements. What's puzzling is we're not seeing any major providers announce plans to leverage this for their flagship models, despite the clear efficiency gains that could theoretically enable much larger architectures. I suspect there might be some hidden engineering challenges around specialized hardware requirements or training stability that aren't fully captured in the academic results, but would love insights from anyone closer to production deployment of these techniques.
replies(6): >>41879903 #>>41880200 #>>41880375 #>>41881054 #>>41881230 #>>41882202 #
3. strangescript ◴[] No.41879903[source]
I find it a little confusing as well. I wonder if it's because so many of these companies have gone all in on the "traditional" approach that deviating now seems like a big shift?
4. lostmsu ◴[] No.41879915[source]
No GPU inference support?
replies(1): >>41879986 #
5. diggan ◴[] No.41879986[source]
> that support fast and lossless inference of 1.58-bit models on CPU (with NPU and GPU support coming next).
6. wwwtyro ◴[] No.41880073[source]
Can anyone help me understand how this works without special bitnet precision-specific hardware? Is special hardware unnecessary? Maybe it just doesn't reach the full bitnet potential without it? Or maybe it does, with some fancy tricks? Thanks!
replies(3): >>41880204 #>>41880283 #>>41881707 #
7. swfsql ◴[] No.41880200[source]
I think that since training must happen on a non-bitnet architecture, tuning towards bitnet is always a downgrade on its capabilities, so they're not really interested in it. But maybe they could be if they'd offer cheaper plans, since its efficiency is relatively good.

I think the real market for this is for local inference.

8. hansvm ◴[] No.41880204[source]
I haven't checked this one out yet, but a common trick is using combinations of instructions and data invariants that allow you to work in "lanes".

The easiest example is xor, which can trivially be interpreted as either xoring one large integer or xoring a vector of smaller integers.

Take a look at the SWAR example here [0] as a pretty common/easy example of that technique being good for something in the real world.

Dedicated hardware is almost always better, but you can still get major improvements with a little elbow grease.

[0] https://nimrod.blog/posts/algorithms-behind-popcount/
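
To make the "lanes" idea concrete (a minimal sketch, not taken from bitnet.cpp): a single 64-bit XOR is simultaneously eight independent 8-bit XORs, because XOR never carries between bit positions.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Eight 8-bit lanes packed into one 64-bit word.
        uint64_t a = 0x0102030405060708ULL;
        uint64_t b = 0x1111111111111111ULL;
        uint64_t c = a ^ b;  // one instruction, eight lane-wise XORs
        for (int i = 0; i < 8; ++i)
            std::printf("lane %d: 0x%02x\n", i, (unsigned)((c >> (8 * i)) & 0xFF));
        return 0;
    }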

replies(1): >>41880274 #
9. 15155 ◴[] No.41880274{3}[source]
This is extremely easy to implement in-FPGA.
10. eightysixfour ◴[] No.41880283[source]
While fancy hardware would make it faster, what you are comparing it to is a bunch of floating-point and large-number multiplication. I believe in this case they just use a lookup table:

If one value is 0, it is 0.

If the signs are different, it is -1.

If the signs are the same, it is 1.

I’m sure those can be done with relatively few instructions using far less power hungry hardware.
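
A minimal sketch of those rules (assuming both operands are already in {-1, 0, +1}; the real kernels pack weights and operate on whole vectors):

    #include <cstdio>

    // Product of two ternary values per the rules above: no multiplier needed.
    static int ternary_mul(int a, int b) {
        if (a == 0 || b == 0) return 0;  // if one value is 0, the product is 0
        return (a == b) ? 1 : -1;        // same signs -> 1, different signs -> -1
    }

    int main() {
        std::printf("%d %d %d\n", ternary_mul(1, -1), ternary_mul(-1, -1), ternary_mul(0, 1));  // -1 1 0
        return 0;
    }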

11. faragon ◴[] No.41880304[source]
I'm glad Microsoft uses Bash in the example, instead of their own Windows shells. As a user, I would like to have something like "Git Bash" for Windows built into the system as the default shell.
replies(3): >>41880730 #>>41883028 #>>41886203 #
12. waynenilsen ◴[] No.41880375[source]
I suppose hardware support would be very helpful: new instructions for bitpacked operations?
13. alkh ◴[] No.41880605[source]
Sorry for a stupid question, but to clarify: even though it is a 1-bit model, is it supposed to work with any type of embeddings, even ones taken from larger LLMs (in their example, they use HF1BitLLM/Llama3-8B-1.58-100B-tokens)? I.e., it doesn't have an embedding layer built in and relies on embeddings provided separately?
replies(1): >>41881253 #
14. not_a_bot_4sho ◴[] No.41880730[source]
WSL is where it's at today. It's not quite what you're asking for, as it is a separate virtual OS, but the integration is so tight that it feels like you're using your favorite shell natively in Windows.
replies(1): >>41880840 #
15. diggan ◴[] No.41880840{3}[source]
> integration is so tight that it feels like you're using your favorite shell natively in Windows

WSL1 certainly felt that way; WSL2 just feels like any other virtualization manager and basically works the same. Not sure why people sing the praises of WSL2. I gave it a serious try for months, but there is a seemingly endless list of compatibility issues which I never had with VMware or VirtualBox, so I just went back to those instead, and the experience is more or less the same.

replies(1): >>41880994 #
16. Scene_Cast2 ◴[] No.41880929[source]
Neat. Would anyone know where the SDPA kernel equivalent is? I poked around the repo, but only saw some form of quantization code with vectorized intrinsics.
17. throwaway314155 ◴[] No.41880994{4}[source]
Probably because it has relatively painless GPU sharing with pass through. As far as I know that sort of feature requires a hypervisor-level VM, which is not something you get with VirtualBox.
replies(1): >>41881111 #
18. diggan ◴[] No.41881111{5}[source]
Someone correct me if I'm wrong, but I think you can use a KVM or QEMU backend for VirtualBox and that way get GPU pass-through. Probably not out of the box though.
replies(2): >>41883686 #>>41884904 #
19. danielmarkbruce ◴[] No.41881230[source]
People are almost certainly working on it. The people who are actually serious and think about things like this are less likely to just spout out "WE ARE BUILDING A CHIP OPTIMIZED FOR 1-BIT" or "WE ARE TRAINING A MODEL USING 1-BIT" etc, before actually being quite sure they can make it work at the required scale. It's still pretty researchy.
20. danielmarkbruce ◴[] No.41881253[source]
No. You can't put any type of embedding in.
21. summerlight ◴[] No.41881707[source]
The major benefit would be its significant decrease in memory consumption, rather than the compute itself. The major bottleneck of the current LLM infra is typically memory bandwidth and that's the reason why those chip industries are going crazy on HBM. Surely compute optimization helps but this is useful even without any hardware changes.
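
To put rough numbers on the memory savings (an illustrative back-of-the-envelope calculation, not figures from the paper):

    #include <cstdio>

    int main() {
        // Approximate weight storage for an 8B-parameter model.
        const double params = 8e9;
        const double fp16_gb    = params * 16.0 / 8.0 / 1e9;  // 16 bits per weight
        const double ternary_gb = params * 1.58 / 8.0 / 1e9;  // ~1.58 bits per weight
        std::printf("fp16:    ~%.0f GB\n", fp16_gb);     // ~16 GB
        std::printf("ternary: ~%.1f GB\n", ternary_gb);  // ~1.6 GB
        return 0;
    }
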
replies(1): >>41882331 #
22. ◴[] No.41882202[source]
23. az226 ◴[] No.41882331{3}[source]
Inference speeds go brrrr as well.
24. layer8 ◴[] No.41883028[source]
Just install Cygwin.

Not sure what you mean by “default shell”. The default shell on Windows is this: https://en.wikipedia.org/wiki/Windows_shell. I don’t suppose you mean booting into Bash. Windows doesn’t have any other notion of a default shell.

replies(1): >>41883960 #
25. delegate ◴[] No.41883348[source]
I assume it is not as powerful at some tasks as a full-sized model, so what would one use this model for?
26. trebligdivad ◴[] No.41883477[source]
Has someone made an FPGA or ASIC implementation yet? It feels like it should be easy (and people would snap it up for inference).
27. Datagenerator ◴[] No.41883686{6}[source]
Close Windows, many doors open
28. faragon ◴[] No.41883960{3}[source]
I used Cygwin for more than a decade. I prefer Git Bash (msys-based).
replies(1): >>41884494 #
29. hhejrkrn ◴[] No.41884494{4}[source]
I thought msysgit was also Cygwin-based.
30. EgoIncarnate ◴[] No.41884904{6}[source]
The WSL2 GPU passthrough is more like a virtual GPU than KVM-style device passthrough. I believe it's effectively a device-specific Linux userland driver talking to a device-specific Windows kernel driver, with a Linux kernel shim bridging the two. If I recall correctly, the Linux userland drivers are actually provided by the Windows driver.
31. ein0p ◴[] No.41884930[source]
1.58bpw is not “1 bit”.
32. aithrowawaycomm ◴[] No.41886203[source]
They are using Windows shells, just not on Macs. This is what the caption for the video says:

> A demo of bitnet.cpp running a BitNet b1.58 3B model on Apple M2:

And this is in their Windows build instructions:

> Important! If you are using Windows, please remember to always use a Developer Command Prompt / PowerShell for VS2022 for the following commands

33. sheerun ◴[] No.41903851[source]
When will AIs learn to not bla bla bla by default?