Most active commenters
  • dailykoder(4)
  • UncleOxidant(3)

157 points galeos | 33 comments
1. ttyprintk ◴[] No.42147278[source]
Later, a4.8 quantization by some of the same team:

https://news.ycombinator.com/item?id=42092724

https://arxiv.org/abs/2411.04965

replies(1): >>42191454 #
4. skavi ◴[] No.42191454[source]
and the repo for this project: https://github.com/microsoft/BitNet
replies(1): >>42192368 #
5. dailykoder ◴[] No.42191742[source]
I first read about this a few weeks ago and found it very interesting.

Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new: some computation-heavy task that could benefit from an FPGA. Does anyone here know how feasible it'd be to implement something like that on an FPGA? I only have rather small chips (an Artix-7 35T and a PolarFire SoC with 95k logic slices), so I know I won't be able to squeeze a full LLM into them, but something should be possible.

Maybe I should refresh the fundamentals, though, and start with MNIST. But the real question is: what is a realistic goal that I could reach with these small FPGAs? Performance is secondary; I'm more interested in what's possible in terms of complexity/features on a small device.

Also, has anyone here compiled OpenCL (or GL?) kernels for FPGAs and can give me a starting point? I was wondering if it's possible to build a working backend for something like tinygrad[1]. I think this would be a good way to learn the different layers of how such frameworks actually work.

- [1] https://github.com/tinygrad/tinygrad

replies(4): >>42192593 #>>42192595 #>>42194578 #>>42197158 #
6. WiSaGaN ◴[] No.42191930[source]
I would expect research along these lines to pick up quite a bit if we confirm that the pretraining stage is not scaling as previously expected; the scale and architecture would then be more stable in the near future, especially if the focus shifts to inference-time scaling.
7. sinuhe69 ◴[] No.42192368{3}[source]
The demo they showed was full of repeated sentences. The 3B model looks quite dense, TBH. Did they just want to show the speed?
replies(1): >>42193481 #
8. svantana ◴[] No.42192593[source]
Couldn't you implement a bitnet kernel, and use that as a co-processor to a PC? Or is the I/O bandwidth so low that it won't be worth it?
replies(1): >>42193162 #
9. verytrivial ◴[] No.42192595[source]
You gain potential parallelism with an FPGA, so with very small "at the edge" models they could speed things up, right? But the models are always going to be large, so memory bandwidth is going to be a bottleneck unless some very fancy FPGA memory "fabric" is possible. Perhaps for extremely low-latency classification tasks? I'm having trouble picturing that application, though.

The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days. The CPU-only archive is about 8 MB on GitHub, and there is no memory allocation during runtime. My ancient laptop (as in 2014!) is sweating but producing spookily good output with quantized 7B models.

(I'm mainly commenting to have someone correct me, by the way, since I'm interested in this question too!)
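A rough back-of-envelope for the memory-bandwidth point above, as a sketch (the model size and bandwidth figures are illustrative assumptions, not measurements):

```python
def max_tokens_per_sec(model_bytes, bandwidth_bytes_per_sec):
    # Autoregressive decoding streams essentially all weights from memory
    # once per generated token, so bandwidth sets a hard ceiling on speed.
    return bandwidth_bytes_per_sec / model_bytes

# Illustrative numbers: a 7B model at ~4 bits is roughly 3.5 GB of weights,
# and a 2014-era dual-channel DDR3 laptop manages on the order of 20 GB/s.
print(round(max_tokens_per_sec(3.5e9, 20e9), 1))  # ~5.7 tokens/s ceiling
```

This is why halving the bits per weight roughly doubles the decode-speed ceiling, independent of compute.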

replies(2): >>42193146 #>>42197224 #
10. sva_ ◴[] No.42192729[source]
It seems like arXiv replaced 'bitnet.cpp' with a 'this http url' link, even though '.cpp' is clearly not a TLD. A poor regex?
replies(2): >>42192782 #>>42195206 #
11. bc569a80a344f9c ◴[] No.42192782[source]
Sort of. And not on the author’s side.

https://academia.stackexchange.com/questions/132315/how-to-a...

12. dailykoder ◴[] No.42193146{3}[source]
> Perhaps for extremely low latency classification tasks? I'm having trouble picturing that application though.

Possibly, yes. I have no concrete plans yet. Maybe language models are the wrong area, though. Some general image classification or object detection would be neat (say, lane detection with a camera, or something like that).

replies(1): >>42196445 #
13. dailykoder ◴[] No.42193162{3}[source]
Since I don't have a board with a PCIe port, the fastest link I could get is 100 Mbit Ethernet, I think. Or I could use the Microchip board, which has a hard RISC-V quad-core processor connected to the FPGA fabric via an AXI bus. The CPU itself runs at only 625 MHz, so there is huge potential to speed up some fancy computation.
replies(1): >>42194838 #
14. js8 ◴[] No.42193249[source]
It's technically not 1-bit, but 2-bit.

Anyway, I wonder: is there any HW support in modern CPUs/GPUs for linear algebra (like matrix multiplication) over Z_2^n? I think it would be useful for SAT solving.
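For illustration, a matrix-vector product over Z_2 reduces to bitwise AND plus popcount parity, which is the operation any such hardware support would accelerate (a minimal sketch, with rows stored as integer bitmasks):

```python
def gf2_matvec(rows, x):
    # rows: matrix rows over Z_2, each packed into an int bitmask.
    # x: the input vector, also packed into an int bitmask.
    # A dot product over Z_2 is the parity of popcount(row & x).
    return [bin(r & x).count("1") & 1 for r in rows]

# The 2x3 matrix [[1,0,1],[0,1,1]] times the vector [1,1,1] over Z_2:
print(gf2_matvec([0b101, 0b011], 0b111))  # [0, 0]
```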

replies(4): >>42193552 #>>42193565 #>>42195167 #>>42197753 #
15. newswasboring ◴[] No.42193481{4}[source]
3B models, especially in a quantized state, almost always behave like this.
16. almostgotcaught ◴[] No.42193552[source]
Not on CPU/GPU, but on FPGAs finite-field arithmetic is a thing; there's plenty of work like this around: https://ieeexplore.ieee.org/document/4392002
17. scarmig ◴[] No.42193565[source]
There's carry-less multiplication (https://en.m.wikipedia.org/wiki/CLMUL_instruction_set), introduced by Intel in 2010.
18. nickpsecurity ◴[] No.42194578[source]
This submission should help you:

https://news.ycombinator.com/item?id=41470074

replies(1): >>42195697 #
19. mysteria ◴[] No.42194838{4}[source]
Even with a PCIe FPGA card you're still going to be memory-bound during inference. When running llama.cpp on a straight CPU, memory bandwidth, not CPU power, is always the bottleneck.

Now if the FPGA card had a large amount of GPU tier memory then that would help.

20. JKCalhoun ◴[] No.42195167[source]
Or, technically, 1.58 bit. ;-)
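(That figure is just log2(3), the information content of a single ternary weight:)

```python
import math

# One weight drawn from {-1, 0, +1} carries log2(3) bits of information.
print(round(math.log2(3), 2))  # 1.58
```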
21. Joker_vD ◴[] No.42195206[source]
> '.cpp' is clearly not a tld.

Is it that clear? Because e.g. .app and .cpa are TLDs. So are .py and .so.

replies(1): >>42198579 #
22. dailykoder ◴[] No.42195697{3}[source]
Thanks!
23. tgv ◴[] No.42196445{4}[source]
Real-time translation or speech transcription for the hearing-impaired onto AR-glasses? Now you've got a good reason to make it look like a Star Trek device.

Or glasses that can detect threats/opportunities in the environment and call them out via ear plugs, for the vision-impaired.

24. yalok ◴[] No.42196490[source]
So basically the idea is to pack 3 ternary weights (-1, 0, +1) into 5 bits instead of 6, but they compare the results against an fp16 model, which would use 48 bits for those 3 weights…

And the speedup comes from the memory I/O, offset a bit by the need to unpack these weights before using them…

Did I get this right?
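For reference, the packing works because 3 ternary weights have 3^3 = 27 states, which fit in 5 bits (2^5 = 32). A minimal sketch of one plausible base-3 encoding (the actual bitnet.cpp layout may differ):

```python
def pack3(w):
    # Three ternary weights in {-1, 0, +1} -> one 5-bit code in 0..26,
    # by treating (w + 1) as base-3 digits.
    assert len(w) == 3 and all(v in (-1, 0, 1) for v in w)
    return (w[0] + 1) * 9 + (w[1] + 1) * 3 + (w[2] + 1)

def unpack3(code):
    # Inverse: recover the three ternary weights from the 5-bit code.
    return [(code // 9) % 3 - 1, (code // 3) % 3 - 1, code % 3 - 1]

print(pack3([-1, 0, 1]))   # 5
print(unpack3(5))          # [-1, 0, 1]
```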

replies(1): >>42197301 #
25. hidelooktropic ◴[] No.42196523[source]
Does anyone have the actual "this http url"?
replies(1): >>42196660 #
26. dkrajews ◴[] No.42196660[source]
https://github.com/microsoft/BitNet
27. UncleOxidant ◴[] No.42197158[source]
I've had the same idea. One way to go about it would be to modify an existing RISC-V CPU to include ternary math ops that accelerate bitnet operations, plus vector/matrix extensions based on them. Then your LLM is implemented in RISC-V assembly using those extensions. (It would be possible to do some work on the LLVM backend so you could use a C implementation of the LLM, but that starts to be a lot of work. Also, we'd need 2-bit signed int types in C.)

A completely different approach is differentiable logic networks. You end up with a logic-gate network after training, which would be very easy to translate into Verilog or VHDL. https://github.com/Felix-Petersen/difflogic
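To illustrate the first idea: with ternary weights, every "multiply" collapses to an add, a subtract, or a skip, which is exactly what a custom ALU or ISA extension would exploit. A minimal software sketch of such a ternary matrix-vector product (illustrative only, not the BitNet kernel):

```python
def ternary_matvec(W, x):
    # W: rows of weights in {-1, 0, +1}; x: input activations.
    # No multiplier needed: each term is an add, a subtract, or nothing,
    # which is what dedicated ternary hardware would take advantage of.
    out = []
    for row in W:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
        out.append(acc)
    return out

print(ternary_matvec([[1, -1, 0], [0, 1, 1]], [10, 20, 30]))  # [-10, 50]
```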

28. UncleOxidant ◴[] No.42197224{3}[source]
Lower latency, but also much lower power. This sort of thing would be of great interest to companies running AI datacenters (which is why Microsoft is doing this research, I'd think). Low latency is also quite useful for real-time tasks.

> The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days.

Is there a bitnet model that runs on llama.cpp? (Looks like it: https://www.reddit.com/r/LocalLLaMA/comments/1dmt4v7/llamacp...) Which bitnet model did you use?

29. UncleOxidant ◴[] No.42197301[source]
Yeah, that seems to be the case. Though I suspect Microsoft is interested in implementing something like a custom RISC-V CPU with an ALU tuned for this ternary math, plus custom vector/matrix instructions. Something like that could save them a lot of power in their data centers.

If it were to catch on, then perhaps we'd see Intel, AMD, and ARM adding math ops optimized for ternary math?

replies(1): >>42200664 #
30. meindnoch ◴[] No.42197753[source]
https://en.m.wikipedia.org/wiki/CLMUL_instruction_set
31. Natfan ◴[] No.42198579{3}[source]
and .com is a TLD[0] and also a file type[1], to further complicate matters.

---

[0]: https://en.wikipedia.org/wiki/.com

[1]: https://en.wikipedia.org/wiki/COM_file

32. yalok ◴[] No.42200664{3}[source]
My dream is to see ternary support at the HW wire level; that'd be even more power-efficient, and the transistor count might be lower...