Audio transcription

`POST /v1/audio/transcriptions` — speech to text. OpenAI-Whisper-compatible. Multilingual, with explicit support for Swahili, Hausa, Yoruba, Amharic.

Endpoint

POST https://api.tomoul.ai/v1/audio/transcriptions

Multipart form upload.

Request body (multipart)

| Field | Type | Notes |
|---|---|---|
| file | binary | mp3, wav, flac, m4a, ogg, webm. Max 25 MB. |
| model | string | e.g. openai/whisper-large-v3 |
| language | string (opt.) | ISO 639-1 code, e.g. sw, ha, yo |
| response_format | enum (opt.) | json (default), text, srt, vtt, verbose_json |
| temperature | float (opt.) | 0–1 |
| prompt | string (opt.) | Vocabulary hints |

Example request:

```shell
curl https://api.tomoul.ai/v1/audio/transcriptions \
  -H "Authorization: Bearer $TOMOUL_KEY" \
  -F file=@meeting.mp3 \
  -F model=openai/whisper-large-v3 \
  -F language=sw
```
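The constraints in the request-body table can be checked on the client before any bytes are uploaded. The sketch below is one way to do that in Python. The function name `check_upload` is hypothetical, and reading "25 MB" as 25 × 1024 × 1024 bytes is an assumption, since the table does not say whether the limit is binary or decimal.

```python
from pathlib import Path
from typing import List, Optional

# Values copied from the request-body table.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "vtt", "verbose_json"}
# Assumption: "25 MB" is interpreted as 25 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_upload(filename: str, size_bytes: int,
                 response_format: str = "json",
                 temperature: Optional[float] = None) -> List[str]:
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unknown response_format: {response_format}")
    if temperature is not None and not 0.0 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside the range 0 to 1")
    return problems
```

A caller would typically pass `os.path.getsize(path)` as `size_bytes` and refuse to upload if the returned list is non-empty. The check only looks at the file extension, not the actual container format, so the server may still reject a mislabeled file.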

Response

Default json:

```json
{ "text": "Habari za asubuhi, karibu Tomoul." }
```
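For callers who are not using curl, the same request and response can be sketched with only the Python standard library. The helper names (`build_request`, `parse_transcription`, `transcribe`) are illustrative and not part of any Tomoul SDK. The sketch assumes the default `json` response format, and it does not handle error responses because their shape is not documented here.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.tomoul.ai/v1/audio/transcriptions"


def build_request(api_key, filename, audio_bytes, fields):
    """Build (but do not send) a multipart/form-data POST request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text form fields such as model and language.
    for name, value in fields.items():
        head = (f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n')
        parts.append(head.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    # The audio file itself.
    file_head = (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="file"; '
                 f'filename="{filename}"\r\n'
                 "Content-Type: application/octet-stream\r\n\r\n")
    parts.append(file_head.encode("utf-8") + audio_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def parse_transcription(body: bytes) -> str:
    """Extract the transcript from the default `json` response."""
    return json.loads(body.decode("utf-8"))["text"]


def transcribe(api_key, path, model="openai/whisper-large-v3", language=None):
    """Upload one audio file and return the transcript text."""
    with open(path, "rb") as fh:
        audio = fh.read()
    fields = {"model": model}
    if language:
        fields["language"] = language
    req = build_request(api_key, os.path.basename(path), audio, fields)
    with urllib.request.urlopen(req) as resp:
        return parse_transcription(resp.read())
```

A call mirroring the curl example would be `transcribe(os.environ["TOMOUL_KEY"], "meeting.mp3", language="sw")`. Building the request separately from sending it keeps the multipart encoding easy to inspect without touching the network.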

Models

| Slug | Notes |
|---|---|
| openai/whisper-large-v3 | Default. Strong multilingual. |
| openai/whisper-large-v3-turbo | 4× faster, slightly lower quality. |
| nvidia/parakeet-tdt-1.1b | English-only, fastest. |
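One way to turn the notes in the models table into a selection rule is sketched below. The function `pick_model` and its `prefer_speed` flag are hypothetical, and the heuristic uses only the trade-offs listed in the table; it is not an official recommendation.

```python
from typing import Optional


def pick_model(language: Optional[str] = None, prefer_speed: bool = False) -> str:
    """Illustrative heuristic based only on the notes in the models table."""
    if prefer_speed and language == "en":
        # English-only and the fastest option, so it is used only when the
        # caller has stated that the audio is English.
        return "nvidia/parakeet-tdt-1.1b"
    if prefer_speed:
        # Faster than the default, at slightly lower quality.
        return "openai/whisper-large-v3-turbo"
    # Default: strong multilingual quality.
    return "openai/whisper-large-v3"
```

When the language is unknown, the sketch never picks the English-only model, because a wrong guess would send non-English audio to a model that cannot transcribe it.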

Language hints

Whisper auto-detects the spoken language when no hint is given, but passing the `language` field cuts cost and improves accuracy on short clips. Day-1 supported codes include en, fr, ar, pt, sw, ha, yo, am, zu, xh, so.
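Applications often hold a full locale tag such as `sw-KE` instead of a bare two-letter code. The sketch below reduces a locale tag to a `language` hint and falls back to auto-detection when the code is not in the day-1 list. The helper name `language_field` is hypothetical, and treating unlisted codes as "omit the hint" is a conservative assumption, since the list above is described as a sample of supported codes and may not be complete.

```python
from typing import Optional

# Codes copied from the day-1 list in this section.
DAY_ONE_CODES = {"en", "fr", "ar", "pt", "sw", "ha", "yo", "am", "zu", "xh", "so"}


def language_field(locale: Optional[str]) -> Optional[str]:
    """Map a locale tag like 'sw-KE' or 'pt_BR' to a language hint.

    Returns None when no usable hint is found, in which case the caller
    should omit the `language` field and let the model auto-detect.
    """
    if not locale:
        return None
    # Keep only the primary language subtag, e.g. "pt" from "pt_BR".
    code = locale.replace("_", "-").split("-")[0].lower()
    return code if code in DAY_ONE_CODES else None
```

For example, `language_field("sw-KE")` yields `"sw"`, while `language_field("de")` yields `None`, so the request is sent without a hint.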

Output formats

  • json — default. Just { "text": "..." }.
  • text — raw text, no JSON wrapping.
  • srt / vtt — subtitle files with timestamps.
  • verbose_json — word-level timestamps and confidence scores.
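When `response_format` is `srt`, the body is a subtitle file instead of JSON. The sketch below parses standard SubRip text into timed cues. It assumes the endpoint emits conventional SRT (an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then the text), which this page does not spell out, and the names `Cue` and `parse_srt` are illustrative.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    text: str


_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS,mmm' to seconds."""
    hours, minutes, secs, millis = map(int, _TIME.search(stamp).groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def parse_srt(srt: str) -> List[Cue]:
    """Split SRT text into cues, skipping blocks without a timing line."""
    cues = []
    normalized = srt.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        cues.append(Cue(_seconds(start), _seconds(end), "\n".join(lines[2:])))
    return cues
```

The same approach works for `vtt` output after skipping the leading `WEBVTT` header block, which this parser already ignores because that block has no timing line in its second row.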

Last updated 13 May 2026