Tomoul Engine

A minimalist Zig inference engine. No Python runtime, no ONNX, no containers — just weights on disk and a binary that runs. Powers `api.tomoul.ai` and `tomoul serve` internally; importable into any Zig project. MIT-licensed.

What we actually focus on

The engine's first focus is not LLMs. It's the models that sit around LLMs and do the unglamorous work in production pipelines: detecting when someone is actually talking, fixing punctuation on raw transcripts, generating embeddings, transcribing speech. These tend to be the hardest part of a pipeline to deploy, and they don't get a Zig-native runtime anywhere else.

LLMs are in scope — InkubaLM and Qwen3.5 ship today, more on the roadmap — but the differentiator is the surrounding stack.

Three layers

Layer 3 — Stack       src/models/          VAD, punctuation, Whisper, embeddings, LLMs
Layer 2 — Bridge      tools/               Python export scripts: PyTorch → .tl
Layer 1 — Core        src/core/, src/gpu/  Tensor, ops, GPU HAL, loader, quantization

The .tl format is a tiny binary container — magic bytes, dimensions, raw floats. Convert once with the Python bridge in tools/; from that point forward everything is pure Zig.
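
To make "magic bytes, dimensions, raw floats" concrete, here is a minimal Python sketch of a container with that shape. The real .tl layout is not specified on this page, so the magic value `b"TLSK"`, the little-endian `u32` rank and dimension fields, and the function names `write_tensor` and `read_tensor` are all assumptions for illustration, not the actual format.

```python
import io
import struct

# Hypothetical layout, NOT the real .tl spec:
#   4 bytes  magic
#   u32      rank (number of dimensions), little-endian
#   u32 * n  dimensions
#   f32 * k  raw values, k = product of the dimensions
MAGIC = b"TLSK"  # placeholder magic, assumed for this sketch


def write_tensor(stream, dims, values):
    """Serialize one float32 tensor in the assumed layout."""
    count = 1
    for d in dims:
        count *= d
    if count != len(values):
        raise ValueError("dims do not match number of values")
    stream.write(MAGIC)
    stream.write(struct.pack("<I", len(dims)))
    stream.write(struct.pack(f"<{len(dims)}I", *dims))
    stream.write(struct.pack(f"<{count}f", *values))


def read_tensor(stream):
    """Parse one tensor back; returns (dims, values)."""
    if stream.read(4) != MAGIC:
        raise ValueError("bad magic bytes")
    (rank,) = struct.unpack("<I", stream.read(4))
    dims = list(struct.unpack(f"<{rank}I", stream.read(4 * rank)))
    count = 1
    for d in dims:
        count *= d
    values = list(struct.unpack(f"<{count}f", stream.read(4 * count)))
    return dims, values
```

A round trip through an in-memory buffer (`io.BytesIO`) is enough to try it out. A format this small is easy to read back from any language, which fits the convert-once workflow described above.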

What the engine provides

Core (src/core/)

  • Tensor + ops with SIMD (AVX2 / NEON / WASM-SIMD).
  • BLAS via OpenBLAS / Apple Accelerate, or the bundled pure-Zig zblas fallback. WASM builds use zblas automatically.
  • Quantization: F32 and F16 (unquantized), plus block-quantized Q8_0, Q4_0, and Q8_K with per-block scales.
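
The "per-block scales" idea can be sketched in a few lines of Python. This follows the common Q8_0 convention from GGML-style formats (blocks of 32 values, one scale per block, signed 8-bit integers). Whether the engine's Q8_0 uses exactly this block size and rounding is an assumption, and the function names are illustrative only. Real implementations typically store the scale as F16; this sketch keeps it as a Python float.

```python
BLOCK_SIZE = 32  # assumed block size (the GGML Q8_0 convention)


def quantize_q8_0(values):
    """Quantize floats into a list of (scale, int8 values) blocks."""
    if len(values) % BLOCK_SIZE != 0:
        raise ValueError("length must be a multiple of BLOCK_SIZE")
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0  # one scale per block
        if scale == 0.0:
            quants = [0] * BLOCK_SIZE
        else:
            # Round to the nearest integer and clamp to the int8 range used here.
            quants = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_q8_0(blocks):
    """Reconstruct approximate floats: value = quant * block scale."""
    out = []
    for scale, quants in blocks:
        out.extend(q * scale for q in quants)
    return out
```

Because each block carries its own scale, an outlier only degrades precision within its own 32 values rather than across the whole tensor.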

GPU HAL (src/gpu/)

  • Vulkan, Metal, WebGPU, and CPU fallback — one interface, four backends. Builds for desktop, mobile, and browser.

Architecture families (src/arch/)

Six families, each parameterised by config: llama, deltanet, transformer_encoder, transformer_decoder, encoder_decoder, lstm. Full details: Architectures.

Format readers (src/format/)

  • safetensors, HF config.json, and the in-house .tl format. GGUF is on the roadmap.
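
As a point of reference for what a format reader has to do, the safetensors layout is simple enough to parse with the Python standard library: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw tensor bytes. The sketch below illustrates the file format only and is not the engine's Zig reader; `read_safetensors_header` and `tensor_bytes` are names made up for this example.

```python
import io
import json
import struct


def read_safetensors_header(stream):
    """Return (header_dict, data_start_offset) for a safetensors stream."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(header_len).decode("utf-8"))
    # Optional free-form metadata lives under this reserved key.
    header.pop("__metadata__", None)
    return header, 8 + header_len


def tensor_bytes(stream, header, data_start, name):
    """Slice one tensor's raw bytes using its data_offsets entry."""
    begin, end = header[name]["data_offsets"]  # relative to data_start
    stream.seek(data_start + begin)
    return stream.read(end - begin)
```

Each header entry carries `dtype`, `shape`, and `data_offsets`, so a reader can locate any tensor without scanning the payload.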

Model registry (src/models/)

Model                                           Family               Notes
Silero VAD                                      lstm                 ~2.2 MB WASM. Voice activity detection.
XLM-RoBERTa Punctuation                         transformer_encoder  Restores punctuation on raw transcripts.
sentence-transformer / bge-m3                   transformer_encoder  Multilingual embeddings.
Whisper (tiny / distil-small / large-v3-turbo)  encoder_decoder      Speech-to-text.
InkubaLM                                        llama                African-language LLM. Tomoul exclusive.
Qwen3.5                                         llama                General-purpose LLM.

When the engine adds a new architecture family or model, both tomoul-cloud and tomoul-cli inherit it. That's the compounding bet.

Who it's for

  • Zig developers building inference into a non-AI app. A Zig CLI that wants embeddings, VAD, or transcription — take the engine + a catalog model.
  • Researchers prototyping new architectures. Tensor / ops / GPU primitives are free; you write the forward pass and a validation harness.
  • Other inference-engine builders. The modules are sharp enough that a fork is cheaper than greenfield.
  • Educators. The source is small (~30k LOC of Zig) and modular enough to be a teaching artifact.

Calling from a non-Zig language?

If you're calling the Tomoul cloud API from Python, Node, Go, or any other language, you do not need the engine. Use any OpenAI SDK pointed at https://api.tomoul.ai/v1. The engine is for Zig consumers and contributors.
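
For non-Zig callers, pointing a client at the base URL is the whole integration. The sketch below builds an embeddings request using only the Python standard library. It assumes the API follows the OpenAI `/embeddings` request shape, which "use any OpenAI SDK" implies but this page does not spell out. The model id `bge-m3` and the `TOMOUL_API_KEY` environment variable are placeholders, so check the API documentation for the real values.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.tomoul.ai/v1"  # base URL given on this page


def build_embeddings_request(text, model="bge-m3", api_key=None):
    """Build (but do not send) an OpenAI-style embeddings request.

    The /embeddings path, the model id, and the env var name are assumptions.
    """
    api_key = api_key or os.environ.get("TOMOUL_API_KEY", "")
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/embeddings",
        data=body,
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it. With the official OpenAI Python SDK, the equivalent setup is constructing the client with `base_url` set to the same URL.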

What it's not

  • Not a per-language SDK matrix. No npm / PyPI / crates.io packages. Non-Zig consumers integrate via HuggingFace release artifacts (.wasm, .a, .so, .dylib, .h, .tl) directly.
  • Not a training framework. Inference only.
  • Not a generic ML toolkit. Curated architectures, not every PyTorch op.
  • Not stable yet. Pre-1.0. Pin a commit. See Stability & versioning.

Source & license

Last updated 13 May 2026