217 points HenryNdubuaku | 2 comments

Hey HN, Henry and Roman here - we've been building a cross-platform framework for deploying LLMs, VLMs, Embedding Models and TTS models locally on smartphones.

Ollama enables deploying LLMs locally on laptops and edge servers; Cactus enables deploying them on phones. Running models directly on the phone makes it possible to build AI apps and agents capable of phone use without compromising privacy, and supports real-time inference without network round-trip latency. We have also seen personalised RAG pipelines built for users, and more.

Apple and Google both moved into local AI models recently, with the launch of Apple's Foundation Models framework and Google AI Edge respectively. However, both are platform-specific and only support specific models from each company. To this end, Cactus:

- Is available in Flutter, React Native & Kotlin Multiplatform for cross-platform developers, since most apps are built with these today.

- Supports any GGUF model you can find on Hugging Face: Qwen, Gemma, Llama, DeepSeek, Phi, Mistral, SmolLM, SmolVLM, InternVLM, Jan Nano etc.

- Accommodates models from FP32 down to 2-bit quantization, for better efficiency and less device strain.

- Has MCP tool calls to make models performant and truly helpful (set reminders, search the gallery, reply to messages), and more.

- Falls back to big cloud models for complex, constrained or large-context tasks, ensuring robustness and high availability.
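
To give a rough feel for why the quantization range above matters on a phone, here is a back-of-the-envelope sketch (plain Python, not Cactus code). It estimates the size of the weights alone as parameters x bits / 8. The 1-billion-parameter model and the helper name `weight_size_gb` are assumptions for illustration; real GGUF files also carry metadata and per-block scales, so actual sizes are somewhat larger than these nominal numbers.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 1-billion-parameter model at a few nominal bit widths.
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_size_gb(1e9, bits):.2f} GB")
```

Under these assumptions the same model goes from roughly 4 GB at FP32 to roughly 0.25 GB at 2-bit, before counting runtime memory such as the KV cache.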

It's completely open source. Would love to have more people try it out and tell us how to make it great!

Repo: https://github.com/cactus-compute/cactus

throw777373 ◴[] No.44525807[source]
Ollama runs on Android just fine via Termux. I use it with 5GB models. They even recently added an ollama package, so there is no longer any need to compile it from source.
replies(2): >>44525911 #>>44530291 #
rshemet ◴[] No.44525911[source]
True - but Cactus is not just an app.

We are a dev toolkit to run LLMs cross-platform locally in any app you like.

replies(2): >>44526025 #>>44528646 #
1. jadbox ◴[] No.44526025[source]
How does it work? How does one model on the device get shared across many apps? Does each app have its own inference SDK running, or is there one inference engine shared by many apps (like Ollama does)? If it's the latter, what's the communication protocol to the inference engine?
replies(1): >>44526336 #
2. rshemet ◴[] No.44526336[source]
Great question. Currently, each app is sandboxed - so each model file is downloaded inside each app's sandbox. We're working on enabling file sharing across multiple apps so you don't have to redownload the model.
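
As a minimal sketch of what that per-app behavior amounts to (Python for brevity, not the Cactus SDK): each app looks for the model file in its own private directory and downloads it once if it is missing. The names `ensure_model`, the `models` subfolder and the `download` callback are hypothetical, chosen only for illustration.

```python
from pathlib import Path
from typing import Callable


def ensure_model(app_dir: Path, filename: str,
                 download: Callable[[Path], None]) -> Path:
    """Return the model path inside this app's own directory.

    Calls `download(target)` only when the file is not already present,
    so repeated launches of the same app reuse the cached copy.
    """
    target = app_dir / "models" / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        download(target)  # caller-supplied; assumed to write the file to `target`
    return target
```

Because `app_dir` differs per app, two apps calling this with the same `filename` each end up with their own copy, which is the redownload cost that cross-app file sharing would remove.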

With respect to the inference SDK, yes, you'll need to install the (React Native/Flutter) framework inside each app you're building.

The SDK is very lightweight (our own iOS app is <30MB, which includes the inference SDK and a ton of other stuff).