tomoul serve
Run any model from Tomoul's catalog locally. One binary, one command. Chat, embeddings, and audio transcription served from the same OpenAI-compatible endpoint.
Usage
$ tomoul serve inkubalm-0.4b
⬇ Downloading inkubalm-0.4b (240 MB Q8_K)...
✓ Loaded. Listening on http://127.0.0.1:8080 (OpenAI-compatible)
Then point any OpenAI client at http://127.0.0.1:8080/v1. If the model
isn't cached, serve pulls it first.
Multimodal in one process
Unlike Ollama (chat-only), one tomoul serve exposes every modality the
model class supports:
| Endpoint | When it's live |
|---|---|
/v1/chat/completions | LLMs (Llama / Phi / Qwen / InkubaLM / gpt-oss). |
/v1/completions | Same models, classic completion shape. |
/v1/embeddings | Embedding models (bge-m3, mxbai-embed, sentence-transformer). |
/v1/audio/transcriptions | Whisper variants. |
/v1/models | Lists what this process is serving. |
/health | Liveness probe. |
You can run multiple serve processes on different ports (one per modality)
or, in Phase 2, a single process holds multiple models and routes by the
model field in the request.
Model aliases
Short names resolve to full slugs:
tomoul serve llama3 # → meta-llama/Llama-3.2-3B
tomoul serve bge-m3 # → baai/bge-m3
tomoul serve whisper # → openai/whisper-large-v3
Aliases are curated and baked into the binary. Full slugs (provider/model)
always work.
Defaults
- Port
8080. - One model per
serveinvocation (multi-model serving is Phase 2). - Q8_K on CPU and Apple Silicon, Q4_0 on NVIDIA/AMD GPU (override with
--quant). - Logs to stdout; telemetry off by default (opt in with
--telemetry).
Flags
| Flag | Default | Notes |
|---|---|---|
--port | 8080 | Bind port. |
--host | 127.0.0.1 | Bind host. Use 0.0.0.0 for LAN access. |
--quant | auto | fp16, int8, int4, q8_k, q4_0. |
--device | auto | cpu, cuda, metal, vulkan. |
--cache | ~/.cache/tomoul | Model weight cache dir. |
--cloud | off | Route through api.tomoul.ai instead of local GPU. |
--telemetry | off | Anonymous usage pings — off by default. |
GPU & quantization
On NVIDIA / AMD GPUs we default to Q4_0 via Vulkan compute shaders. On Apple
Silicon we use Metal + Q8_K. On CPU we use SIMD (AVX2 / NEON / WASM-SIMD)
with the bundled pure-Zig zblas fallback. Run tomoul doctor
to see what your machine supports.