Does it support flash attention? Use tensor cores? Can I write custom kernels?
Update: I found no evidence that it supports tensor cores, so it's going to be many times slower than implementations that do.
replies(1):
# Check the TornadoVM installation and list available devices
tornado --devices
tornado --version

# Navigate to the project directory
cd GPULlama3.java

# Source the project-specific environment paths
source set_paths

# Build the project using Maven (skip tests for faster build)
mvn clean package -DskipTests
# or just use make
make

# Run the model (make sure you have downloaded the model file first - see below)
./llama-tornado --gpu --verbose-init --opencl --model beehive-llama-3.2-1b-instruct-fp16.gguf --prompt "tell me a joke"
https://github.com/m4rs-mt/ILGPU/compare/master...lostmsu:IL...
Good article: https://alexarmbr.github.io/2024/08/10/How-To-Write-A-Fast-M...