The Tomoul CLI

A single static binary. Run open-weights models on your laptop with an OpenAI-compatible API — chat, embeddings, and speech in one process. Same model catalog as `api.tomoul.ai`. Same engine.

Install

brew install tomoul                              # macOS
scoop install tomoul                             # Windows
curl -fsSL https://tomoul.ai/install.sh | sh     # Linux

Full install matrix: Installation.

The 60-second flow

$ tomoul serve inkubalm-0.4b
⬇  Downloading inkubalm-0.4b (240 MB Q8_K)...
✓  Loaded. Listening on http://127.0.0.1:8080 (OpenAI-compatible)
 
# In another shell:
$ curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"inkubalm-0.4b","messages":[{"role":"user","content":"Habari"}]}'

That's the entire UX target. Every other subcommand exists in service of this one flow.
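
The same request can be made from application code. Below is a minimal sketch using only the Python standard library. It assumes the server from the flow above is listening on `127.0.0.1:8080` and that the response follows the usual OpenAI chat-completions shape (`choices[0].message.content`), which this page implies but does not spell out. The helper names `build_chat_request`, `extract_reply`, and `chat` are illustrative and not part of Tomoul.

```python
import json
import urllib.request

# Address printed by `tomoul serve` in the flow above.
BASE_URL = "http://127.0.0.1:8080"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions carrying a single user message."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response: dict) -> str:
    """Pull the assistant text out of a response.

    Assumption: the response uses the OpenAI layout,
    i.e. choices[0].message.content.
    """
    return response["choices"][0]["message"]["content"]


def chat(model: str, prompt: str) -> str:
    """Send one prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("inkubalm-0.4b", "Habari"))` should print the model's reply. Keeping request construction separate from the network call makes the payload easy to inspect without a server.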

Subcommands

Command	What it does
`tomoul serve <model>`	Resolve → download → load → start an OpenAI-compat server.
`tomoul run <model> -i "..."`	One-shot generation, no server. Pipe-friendly.
`tomoul pull <model>`	Download + cache. Don't serve.
`tomoul models`	List cached and available catalog models.
`tomoul rm <model>`	Free disk.
`tomoul doctor`	Diagnose GPU, drivers, RAM, network.
`tomoul auth login`	Log in to bridge to the cloud.
`tomoul serve <model> --cloud`	Same UX, but inference runs on `api.tomoul.ai`.
`tomoul version`	Print the version.
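
Because `tomoul run` is described as pipe-friendly, it can also be driven from a script. The sketch below wraps the `tomoul run <model> -i "..."` form from the table with Python's `subprocess` module. It assumes the generated text is written to standard output, which the table suggests but does not state; `run_argv` and `generate` are hypothetical helper names.

```python
import subprocess


def run_argv(model: str, prompt: str) -> list[str]:
    """Argument vector for `tomoul run <model> -i "..."` as listed in the table."""
    return ["tomoul", "run", model, "-i", prompt]


def generate(model: str, prompt: str) -> str:
    """Run one generation without starting a server.

    Assumption: the model output arrives on stdout. A non-zero exit
    status raises subprocess.CalledProcessError because check=True.
    """
    result = subprocess.run(
        run_argv(model, prompt),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains quotes or spaces.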

How it compares to Ollama

Ollama is excellent at one thing — a curated LLM catalog with a great install experience. Tomoul CLI matches Ollama on install and wins on three dimensions Ollama doesn't serve:

Dimension	Ollama	Tomoul
Static binary, no runtime	Go runtime	Zig static
OpenAI-compat by default	Partial	Full
`/v1/audio/transcriptions` (Whisper)	—	✓
`/v1/embeddings`	Preview	✓
Browser + iOS/Android (via release artifacts)	—	✓
LLM catalog breadth	200+ models	~10 launch, growing
Underserved / African models	—	✓

We don't match Ollama on LLM catalog breadth. We win on multimodal serving in a single binary, non-server platforms (browser, iOS, Android), and underserved models.
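
To illustrate the `/v1/embeddings` row, here is a rough sketch of an embeddings call against a local server, using only the Python standard library. Several things are assumptions rather than documented behavior: the request body (`model`, `input`) and response layout (`data[i].embedding`) are taken from the OpenAI embeddings format, the port is the one shown in the 60-second flow, and the model name must be replaced with an embedding model from the catalog. All function names are illustrative.

```python
import json
import math
import urllib.request

# Port taken from the 60-second flow.
BASE_URL = "http://127.0.0.1:8080"


def build_embeddings_request(model: str, texts: list[str]) -> urllib.request.Request:
    """Build a POST to /v1/embeddings (assumed OpenAI-style body)."""
    body = {"model": model, "input": texts}
    return urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response: dict) -> list[list[float]]:
    """Assumption: vectors are returned under data[i].embedding."""
    return [item["embedding"] for item in response["data"]]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embed(model: str, texts: list[str]) -> list[list[float]]:
    """Fetch one embedding vector per input text from the local server."""
    with urllib.request.urlopen(build_embeddings_request(model, texts)) as resp:
        return extract_vectors(json.load(resp))
```

A typical use is to embed two strings with `embed(...)` and compare the resulting vectors with `cosine_similarity`.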

Bridging to the cloud

Add `--cloud` to any `serve` or `run` invocation and the same command routes to `api.tomoul.ai` instead of your local GPU. Your app keeps calling `localhost:8080`; the CLI proxies the requests. See --cloud bridge.


Last updated 13 May 2026