Does it support flash attention? Use tensor cores? Can I write custom kernels?
Update: I found no evidence that it supports tensor cores, so it's going to be many times slower than implementations that do.
replies(1):
# Check the TornadoVM installation and list available devices
tornado --devices
tornado --version

# Navigate to the project directory
cd GPULlama3.java

# Source the project-specific environment paths
source set_paths

# Build the project using Maven (skip tests for faster build)
mvn clean package -DskipTests
# or just use make
make

# Run the model (make sure you have downloaded the model file first - see below)
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
https://github.com/m4rs-mt/ILGPU/compare/master...lostmsu:IL...
Good article: https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-M...