GPU & quantization
One GPU HAL over Vulkan, Metal, WebGPU, and a CPU fallback. Quantization is per-tensor, per-block, and chosen at load time.
Backend selection
var model = try tomoul.models.bge_m3.load(a, .{ .device = .auto });
// .auto, .cpu, .vulkan, .metal, .webgpu
Backends live in src/gpu/: vulkan.zig, metal.zig, webgpu.zig, plus the shared hal.zig interface that every backend implements.
| Platform | Default backend |
|---|---|
| Linux NVIDIA / AMD | Vulkan |
| macOS (Apple Silicon) | Metal |
| macOS (Intel) | Metal (limited) or CPU |
| Windows | Vulkan |
| Browser / WASM | WebGPU (falls back to CPU) |
NVIDIA hardware is targeted through Vulkan compute shaders: one codepath, with no NVIDIA-specific backend to build or maintain. ROCm isn't required for AMD either.
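As a rough illustration of how `.auto` could map onto the table above, here is a minimal sketch. The `Device` enum and `resolveDevice` function are hypothetical names invented for this example, not the engine's actual API. The sketch only encodes the static defaults; it does not model runtime probing such as WebGPU falling back to CPU or the limited Metal support on Intel Macs.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical mirror of the `.device` options listed above.
const Device = enum { auto, cpu, vulkan, metal, webgpu };

/// Resolve `.auto` to a concrete backend using the defaults table.
/// Explicit choices pass through unchanged.
fn resolveDevice(
    requested: Device,
    os: std.Target.Os.Tag,
    arch: std.Target.Cpu.Arch,
) Device {
    if (requested != .auto) return requested;
    // Browser / WASM builds default to WebGPU whatever the OS tag says.
    if (arch == .wasm32 or arch == .wasm64) return .webgpu;
    return switch (os) {
        .linux, .windows => .vulkan,
        // Intel Macs are "Metal (limited) or CPU" in the table;
        // this sketch does not distinguish them.
        .macos => .metal,
        else => .cpu,
    };
}

pub fn main() void {
    const dev = resolveDevice(.auto, builtin.os.tag, builtin.cpu.arch);
    std.debug.print("auto resolves to: {s}\n", .{@tagName(dev)});
}
```

Keeping the mapping in one pure function makes the defaults easy to unit-test for every target without needing the hardware.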
CPU path
On the CPU backend, matrix math goes through BLAS. Two options at build time:
- OpenBLAS or Apple Accelerate: link the system library. Pass -Dblas=true.
- zblas: the bundled pure-Zig fallback. No system dependency. WASM builds use it automatically.
The fallback isn't a stub — zblas is fast enough that small models
(Silero VAD, sentence-transformer, bge-m3 at smaller dims) run well on CPU
without any system BLAS.
Quantization formats
| Format | Bits | Use case |
|---|---|---|
| f32 | 32 | Reference / validation only. |
| f16 | 16 | Default for embeddings + small models. |
| q8_0 | 8 | Solid quality / size tradeoff. |
| q4_0 | 4 | Compact, mid-large LLMs. |
| q8_k | 8 (per-block) | Higher fidelity than q8_0 at near-equal size. |
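To make "per-block" concrete, the following is a minimal sketch of generic symmetric 8-bit block quantization. This page does not describe the engine's actual layout, so the block length of 32, the f32 scale, and the names `BlockQ8`, `quantizeBlock`, and `dequantizeBlock` are all assumptions made for illustration.

```zig
const std = @import("std");

// Assumed block length, chosen for illustration only.
const block_len = 32;

/// One quantized block: a single scale plus 32 signed 8-bit values.
const BlockQ8 = struct {
    scale: f32,
    q: [block_len]i8,
};

/// Symmetric quantization: scale = max|x| / 127, q = round(x / scale).
fn quantizeBlock(x: *const [block_len]f32) BlockQ8 {
    var amax: f32 = 0;
    for (x) |v| {
        const a = if (v < 0) -v else v;
        amax = @max(amax, a);
    }
    // An all-zero block gets scale 1 to avoid dividing by zero.
    const scale: f32 = if (amax == 0) 1 else amax / 127.0;
    var out = BlockQ8{ .scale = scale, .q = undefined };
    for (x, 0..) |v, i| {
        out.q[i] = @intFromFloat(@round(v / scale));
    }
    return out;
}

/// Recover approximate floats: x ≈ q * scale.
fn dequantizeBlock(b: BlockQ8) [block_len]f32 {
    var out: [block_len]f32 = undefined;
    for (b.q, 0..) |q, i| {
        out[i] = @as(f32, @floatFromInt(q)) * b.scale;
    }
    return out;
}
```

Because each block carries its own scale, one outlier weight only coarsens the 32 values around it rather than the whole tensor, which is the usual reason per-block schemes keep more fidelity than a single per-tensor scale.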
Pick at load time:
var model = try tomoul.arch.llama.LlamaModel.load(a, .{
.weights = "phi-4.tl",
.quant = .q4_0,
});
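When choosing a format, a back-of-the-envelope size estimate can help. The sketch below multiplies a parameter count by the nominal bit widths from the table. The `Quant` enum and `approxWeightBytes` are hypothetical names, and the estimate ignores per-block scales and file metadata, so real files will be somewhat larger.

```zig
const std = @import("std");

// Hypothetical mirror of the formats table, with its nominal bit widths.
const Quant = enum {
    f32,
    f16,
    q8_0,
    q4_0,
    q8_k,

    fn bits(self: Quant) u64 {
        return switch (self) {
            .f32 => 32,
            .f16 => 16,
            .q8_0, .q8_k => 8,
            .q4_0 => 4,
        };
    }
};

/// Approximate weight storage in bytes: parameters * bits / 8.
fn approxWeightBytes(n_params: u64, quant: Quant) u64 {
    return n_params * quant.bits() / 8;
}

pub fn main() void {
    // Illustrative parameter count, not a claim about any specific model.
    const n_params: u64 = 1_000_000_000;
    const formats = [_]Quant{ .f32, .f16, .q8_0, .q4_0 };
    for (formats) |q| {
        std.debug.print("{s}: ~{d} MB\n", .{
            @tagName(q),
            approxWeightBytes(n_params, q) / 1_000_000,
        });
    }
}
```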
Browser / WASM
The engine cross-compiles to wasm32 with the WebGPU backend. Smaller
embedding models (bge-m3, mxbai-embed) run in the browser. LLMs are
size-prohibitive in WASM today — use the cloud API instead.