Ollama lets you deploy LLMs locally on laptops and edge servers; Cactus lets you deploy them on phones. Running directly on the phone makes it possible to build AI apps and agents capable of phone use without compromising privacy, supports real-time inference with no network round-trip, and we have already seen personalised RAG pipelines for individual users, among other things.
Apple and Google have both moved into local AI models recently with the launch of Apple's Foundation Models framework and Google AI Edge respectively. However, both are platform-specific and only support each company's own models. In contrast, Cactus:
- Is available in Flutter, React-Native & Kotlin Multiplatform for cross-platform developers, since most apps are built with these today.
- Supports any GGUF model you can find on Huggingface: Qwen, Gemma, Llama, DeepSeek, Phi, Mistral, SmolLM, SmolVLM, InternVLM, Jan Nano, etc.
- Accommodates everything from FP32 down to 2-bit quantized models, for better efficiency and less strain on the device.
- Offers MCP tool-calls to make models truly helpful on the phone (set reminders, search the gallery, reply to messages) and more.
- Falls back to big cloud models for complex, constrained, or large-context tasks, ensuring robustness and high availability (a rough sketch of this pattern is just below).
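To make the fallback idea concrete, here is a minimal sketch of the local-first / cloud-fallback flow. The `LocalModel` interface, function names, and token estimate are illustrative, not the actual Cactus API, and the cloud endpoint is assumed to speak an OpenAI-style chat-completions protocol:

```typescript
// Sketch only: LocalModel and completeWithFallback are illustrative names,
// not the real Cactus API.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

interface LocalModel {
  contextLength: number;
  complete(messages: ChatMessage[]): Promise<string>;
}

// Crude token estimate (~4 chars per token); real code would use the model's tokenizer.
const estimateTokens = (messages: ChatMessage[]): number =>
  messages.reduce((n, m) => n + Math.ceil(m.content.length / 4), 0);

async function completeWithFallback(
  local: LocalModel,
  messages: ChatMessage[],
  cloudUrl: string,
  apiKey: string,
): Promise<string> {
  // Prefer on-device inference whenever the prompt fits the local context window.
  if (estimateTokens(messages) < local.contextLength) {
    try {
      return await local.complete(messages);
    } catch (err) {
      console.warn("local inference failed, falling back to cloud:", err);
    }
  }
  // Otherwise (or on failure) route to a hosted model; assumes an
  // OpenAI-style chat endpoint and response shape.
  const res = await fetch(cloudUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model: "some-cloud-model", messages }),
  });
  if (!res.ok) throw new Error(`cloud fallback failed: HTTP ${res.status}`);
  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? "";
}
```

The routing rule itself can be anything that fits the app: prompt length, task complexity, battery state, or simply whether local inference throws.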
It's completely open source. Would love to have more people try it out and tell us how to make it great!
The core distinction is in the ecosystem: Google AI Edge runs tflite models, whereas Cactus is built for GGUF. This is a critical difference for developers who want to use the latest open-source models.
One major consequence of this is model availability. New open-source models are released in GGUF format almost immediately, while finding or reliably converting them to tflite is often a pain. With Cactus, you can run a new GGUF model the day it drops on Huggingface.
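To make that concrete: a GGUF model is a single file you can pull straight off the Hub via its resolve URL, no conversion step involved. The repo and filename below are placeholders:

```typescript
import { createWriteStream } from "node:fs";
import { Readable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Placeholder repo and filename; any GGUF on the Hub follows the same URL shape.
const repo = "some-org/some-model-GGUF";
const file = "model.Q4_K_M.gguf";
const url = `https://huggingface.co/${repo}/resolve/main/${file}`;

async function downloadGguf(dest: string): Promise<void> {
  const res = await fetch(url);
  if (!res.ok || !res.body) throw new Error(`download failed: HTTP ${res.status}`);
  // Stream the (often multi-GB) file to disk instead of buffering it in memory.
  await pipeline(Readable.fromWeb(res.body as any), createWriteStream(dest));
}

downloadGguf(file).then(() => console.log(`saved ${file}`));
```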
Quantization level also plays a role. GGUF has mature support for quantization well below 8-bit, which is effectively essential for mobile, whereas sub-8-bit support in TFLite is still highly experimental and not broadly applicable.
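The arithmetic is why: weight memory is roughly parameters × bits-per-weight ÷ 8, ignoring KV cache and runtime overhead. For a 4B-parameter model:

```typescript
// Approximate weight memory for a 4B-parameter model at different precisions
// (weights only; KV cache and runtime overhead come on top).
const params = 4e9;
const gib = (bitsPerWeight: number): number => (params * bitsPerWeight) / 8 / 2 ** 30;

for (const [label, bits] of [["FP32", 32], ["FP16", 16], ["Q8", 8], ["Q4", 4], ["Q2", 2]] as const) {
  console.log(`${label}: ~${gib(bits).toFixed(1)} GiB`);
}
// FP32: ~14.9 GiB, FP16: ~7.5 GiB, Q8: ~3.7 GiB, Q4: ~1.9 GiB, Q2: ~0.9 GiB
```

Only at around 4-bit and below does a model of that size leave reasonable headroom on a typical 6-12 GB phone once the OS and other apps take their share.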
Finally, Cactus excels at CPU inference. tflite is great, but its peak performance often depends on specific hardware accelerators (GPUs, DSPs). The GGUF ecosystem is built for strong performance on plain CPUs, which gives a more consistent baseline across the wide variety of devices that app developers have to support.
I have not looked at OP's work yet, but if it makes the task easier, I would opt for it over Google's MediaPipe API.
GGUF is more suitable for the latest open-source models, I agree there. Q2/Q4 quantization will probably be critical as well, if we don't see a jump in RAM. But then again, I wonder when/if MediaPipe will support GGUF as well.
PS: I see you're in the latest YC batch? (Below you mentioned BF.) Good luck and have fun!
> Why lie?
Whoa—that's way too aggressive for this forum and definitely against the site guidelines. Could you please review them (https://news.ycombinator.com/newsguidelines.html) and take the spirit of this site more to heart? We'd appreciate it. You can always make your substantive points while doing that.
Note this one: "Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith."