Audio transcription

`POST /v1/audio/transcriptions` — speech to text. OpenAI-Whisper-compatible. Multilingual, with explicit support for Swahili, Hausa, Yoruba, Amharic.

Endpoint

POST https://api.tomoul.ai/v1/audio/transcriptions

Multipart form upload.

Request body (multipart)

Field	Type	Notes
`file`	binary	mp3, wav, flac, m4a, ogg, webm. Max 25 MB.
`model`	string	e.g. `openai/whisper-large-v3`
`language`	string (opt.)	ISO 639-1 code, e.g. `sw`, `ha`, `yo`
`response_format`	enum (opt.)	`json` (default), `text`, `srt`, `vtt`, `verbose_json`
`temperature`	float (opt.)	0–1
`prompt`	string (opt.)	Vocabulary hints

curl https://api.tomoul.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOMOUL_KEY" \
  -F file=@meeting.mp3 \
  -F model=openai/whisper-large-v3 \
  -F language=sw

Response

Default json:

{ "text": "Habari za asubuhi, karibu Tomoul." }

Models

Slug	Notes
`openai/whisper-large-v3`	Default. Strong multilingual.
`openai/whisper-large-v3-turbo`	4× faster, slightly lower quality.
`nvidia/parakeet-tdt-1.1b`	English-only, fastest.

Language hints

Whisper auto-detects, but the language hint cuts cost and improves accuracy on short clips. Day-1 supported codes include en, fr, ar, pt, sw, ha, yo, am, zu, xh, so.

Output formats

json — default. Just { "text": "..." }.
text — raw text, no JSON wrapping.
srt / vtt — subtitle files with timestamps.
verbose_json — word-level timestamps and confidence scores.

← Previous

Files

Usage

Last updated 13 May 2026Edit this page on GitHub