157 points galeos | 2 comments
dailykoder ◴[] No.42191742[source]
I first read about it quite a few weeks ago and found it very interesting.

Now that I have done more than enough CPU design inside FPGAs, I wanted to try something new: some computation-heavy things that could benefit from an FPGA. Does anyone here know how feasible it'd be to implement something like that on an FPGA? I only have rather small chips (Artix-7 35T and PolarFire SoC with 95k logic slices). So I know I won't be able to fit a full LLM into that, but something should be possible.

Maybe I should refresh the fundamentals though and start with MNIST. But the question is rather: What is a realistic goal that I could possibly reach with these small FPGAs? Performance might be secondary, I am rather interested in what's possible regarding complexity/features on a small device.

Also, has anyone here compiled OpenCL (or GL?) kernels for FPGAs and can give me a starting point? I was wondering if it's possible to have a working backend for something like tinygrad[1]. I think this would be a good way to learn how the different layers of such frameworks actually work.

- [1] https://github.com/tinygrad/tinygrad

replies(4): >>42192593 #>>42192595 #>>42194578 #>>42197158 #
svantana ◴[] No.42192593[source]
Couldn't you implement a bitnet kernel, and use that as a co-processor to a PC? Or is the I/O bandwidth so low that it won't be worth it?
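
To make the "bitnet kernel" idea concrete, here is a minimal software sketch, assuming ternary weights in {-1, 0, +1} as in BitNet b1.58. The function name `ternary_matvec` is hypothetical and not taken from any library. It only shows why such a kernel could map well onto FPGA fabric: the inner loop needs adds and subtracts, no multipliers.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product where every weight is -1, 0, or +1.

    weights: list of rows, each a list of ints in {-1, 0, 1} (assumed format)
    x:       list of integer activations
    """
    out = []
    for row in weights:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract the activation
            # 0 weight: skip, contributes nothing
        out.append(acc)
    return out


if __name__ == "__main__":
    print(ternary_matvec([[1, 0, -1], [-1, -1, 1]], [3, 5, 7]))
```

In hardware each row would presumably become an adder tree rather than a loop, but that mapping is an assumption on my part, not something tested on these boards.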
replies(1): >>42193162 #
1. dailykoder ◴[] No.42193162[source]
Since I don't have a board with a PCIe port, the fastest I could get is 100 Mbit Ethernet, I think. Or I could rather use the Microchip board, which has a hard RISC-V quad-core processor on it, connected to the FPGA fabric via an AXI bus. The CPU itself runs at only 625 MHz, so there is huge potential to speed up some fancy computation.
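
A rough back-of-the-envelope for the Ethernet option, as a sketch: it assumes the full 100 Mbit/s line rate is usable and ignores protocol overhead, so real throughput would be lower. The helper name `transfer_seconds` is made up for illustration.

```python
def transfer_seconds(num_bytes, link_bits_per_s):
    """Best-case time to move num_bytes over a link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_s


if __name__ == "__main__":
    link = 100e6  # 100 Mbit/s, i.e. at most 12.5 MB/s
    # Time to ship 1 MB of data to the board in the best case.
    print(transfer_seconds(1_000_000, link))
```

So anything that has to stream weights or large activations over that link for every inference step would be limited by the link long before the fabric is.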
replies(1): >>42194838 #
2. mysteria ◴[] No.42194838[source]
Even with a PCIe FPGA card you're still going to be memory bound during inference. When running llama.cpp on straight CPU, memory bandwidth, not CPU power, is always the bottleneck.

Now if the FPGA card had a large amount of GPU-tier memory, then that would help.
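
To put a number on "memory bound", here is a simple upper-bound estimate. It assumes a dense model where every weight byte is read once per generated token, with no batching and no caching effects. The function name and the example figures are illustrative assumptions, not measurements.

```python
def max_tokens_per_second(bandwidth_bytes_per_s, model_bytes):
    """Upper bound on decode speed if each token must stream all weights once."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Illustrative only: 50 GB/s of memory bandwidth, 4 GB of weights.
    print(max_tokens_per_second(50e9, 4e9))
```

Under those assumptions, adding compute does nothing for this bound; only more bandwidth or a smaller model moves it.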