Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new: some computation-heavy workload that could benefit from an FPGA. Does anyone here know how feasible it'd be to implement something like that on an FPGA? I only have rather small chips (Artix-7 35T and a PolarFire SoC with 95k logic slices). So I know I won't be able to press a full LLM into that, but something should be possible.
Maybe I should refresh the fundamentals though and start with MNIST. But the question is rather: What is a realistic goal that I could possibly reach with these small FPGAs? Performance might be secondary, I am rather interested in what's possible regarding complexity/features on a small device.
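For scale, a rough back-of-the-envelope check (my own numbers, assuming 8-bit quantization and the Artix-7 35T's roughly 225 KB of block RAM; worth double-checking against the datasheet):

```python
# Back-of-the-envelope: does a small quantized MNIST MLP fit in on-chip BRAM?
# Assumes 8-bit weights and an Artix-7 35T with 50 x 36Kb BRAMs (~225 KB total).
layers = [(784, 32), (32, 10)]  # hypothetical 784 -> 32 -> 10 MLP

params = sum(n_in * n_out + n_out for n_in, n_out in layers)  # weights + biases
kib = params / 1024  # 8-bit weights -> one byte per parameter
print(f"{params} parameters, ~{kib:.1f} KiB at 8 bits/weight")
```

So even without external DRAM, a small fully on-chip MNIST network looks comfortably within reach, and leaves room for something bigger.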
Also, has anyone here compiled OpenCL (or GL?) kernels for FPGAs and can give me a starting point? I was wondering if it's possible to have a working backend for something like tinygrad[1]. I think this would be a good way to learn all the different layers of how such frameworks actually work.
The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days. The CPU-only archive is like 8 MB on GitLab, and there is no memory allocation during runtime. My ancient laptop (as in 2014!) is sweating but producing spookily good output with quantized 7B models.
(I'm mainly commenting to have someone correct me, by the way, since I'm interested in this question too!)
https://academia.stackexchange.com/questions/132315/how-to-a...
Possibly, yes. I have no concrete plans yet. Maybe language models are the wrong area though. Some general image classification or object detection would be neat (say, lane detection with a camera or something like that).
Now if the FPGA card had a large amount of GPU tier memory then that would help.
Or glasses that can detect threats/opportunities in the environment and call them out via ear plugs, for the vision-impaired.
And the speed-up comes from reduced memory I/O, offset a bit by the need to unpack these weights before using them…
Did I get this right?
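As I understand it (a sketch of the idea, not the actual bitnet/llama.cpp encoding): ternary weights in {-1, 0, +1} fit in 2 bits each, so four pack into a byte and memory traffic shrinks roughly 4x versus int8, at the cost of an unpack step before the multiply-accumulate:

```python
# Sketch: pack ternary weights {-1, 0, +1} at 2 bits each, 4 per byte,
# then unpack before use. The encoding (0 -> 0b00, +1 -> 0b01, -1 -> 0b10)
# is my own choice here, not necessarily what bitnet/llama.cpp actually use.
ENC = {0: 0b00, 1: 0b01, -1: 0b10}
DEC = {v: k for k, v in ENC.items()}

def pack(weights):
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i+4]):
            b |= ENC[w] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(packed, n):
    return [DEC[(byte >> (2 * j)) & 0b11] for byte in packed for j in range(4)][:n]

ws = [1, 0, -1, 1, -1, -1, 0, 1]
assert unpack(pack(ws), len(ws)) == ws  # round-trips; 8 weights -> 2 bytes
```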
A completely different approach is differentiable logic networks. You end up with a logic-gate network after training. This logic gate network would be very easy to translate into Verilog or VHDL. https://github.com/Felix-Petersen/difflogic
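Once training has settled on one discrete gate per node, the export really is mechanical. A toy sketch (my own data structure, not difflogic's actual export API) that turns a tiny gate list into Verilog assigns:

```python
# Toy sketch: a "trained" logic-gate network as (gate, input_a, input_b)
# triples, emitted as Verilog wires/assigns. The network here is made up;
# in difflogic each node would have learned one of the 16 two-input gates.
GATES = {"and": "{a} & {b}", "or": "{a} | {b}", "xor": "{a} ^ {b}",
         "nand": "~({a} & {b})"}

def to_verilog(net, n_inputs):
    lines = [f"module net(input [{n_inputs-1}:0] x, output y);"]
    for i, (gate, a, b) in enumerate(net):
        # indices below n_inputs are primary inputs; the rest are earlier wires
        sig = lambda k: f"x[{k}]" if k < n_inputs else f"w{k - n_inputs}"
        lines.append(f"  wire w{i} = {GATES[gate].format(a=sig(a), b=sig(b))};")
    lines.append(f"  assign y = w{len(net)-1};")
    lines.append("endmodule")
    return "\n".join(lines)

net = [("xor", 0, 1), ("and", 1, 2), ("or", 3, 4)]  # nodes 3, 4 are w0, w1
print(to_verilog(net, n_inputs=3))
```

Since every node is a plain 2-input gate, each one maps onto a fraction of a LUT, which is why this approach is such a natural fit for small FPGAs.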
> The code itself is surprisingly small/tight. I've been playing with llama.cpp for the last few days.
Is there a bitnet model that runs on llama.cpp? (Looks like it: https://www.reddit.com/r/LocalLLaMA/comments/1dmt4v7/llamacp...) Which bitnet model did you use?
If it were to catch on, then perhaps we'd see Intel, AMD, and ARM adding math ops optimized for ternary math?
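The appeal, as I understand it: with weights restricted to {-1, 0, +1}, a dot product needs no multiplier at all, only adds, subtracts, and skips (which is also what makes it attractive on FPGA fabric, where hard DSP multipliers are scarce):

```python
# Sketch: ternary dot product with no multiplications -- each weight in
# {-1, 0, +1} just adds, subtracts, or skips the corresponding activation.
def ternary_dot(weights, acts):
    acc = 0
    for w, x in zip(weights, acts):
        if w == 1:
            acc += x
        elif w == -1:
            acc -= x
        # w == 0: contributes nothing
    return acc

assert ternary_dot([1, -1, 0, 1], [3, 5, 7, 2]) == 3 - 5 + 2
```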
---