Chat completions
`POST /v1/chat/completions`: OpenAI-compatible chat. Streaming, function calling, and JSON mode are all supported.
Endpoint
POST https://api.tomoul.ai/v1/chat/completions
Request body
Same shape as OpenAI's /v1/chat/completions. Required: model, messages.
Common optional: temperature, top_p, max_tokens, stream, tools,
response_format, seed.
{
  "model": "microsoft/phi-4",
  "messages": [
    { "role": "system", "content": "You answer in 1 sentence." },
    { "role": "user", "content": "What is bge-m3?" }
  ],
  "temperature": 0.2,
  "max_tokens": 256
}
For exhaustive parameter docs, OpenAI's reference applies in full: platform.openai.com/docs/api-reference/chat/create.
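As a minimal sketch of assembling the body described above, the helper below checks the two required fields and passes any optional parameters through unchanged. `build_chat_request` is a hypothetical name used for illustration, not part of any SDK. It only builds the payload and does not send it.

```python
import json


def build_chat_request(model, messages, **options):
    """Return a JSON-serialisable body for POST /v1/chat/completions.

    Hypothetical helper: only `model` and `messages` are checked, because
    those are the two fields this page lists as required.
    """
    if not model:
        raise ValueError("model is required")
    if not messages:
        raise ValueError("messages is required")
    # Optional parameters (temperature, max_tokens, ...) are passed through as-is.
    return {"model": model, "messages": list(messages), **options}


body = build_chat_request(
    "microsoft/phi-4",
    [
        {"role": "system", "content": "You answer in 1 sentence."},
        {"role": "user", "content": "What is bge-m3?"},
    ],
    temperature=0.2,
    max_tokens=256,
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same fields as the request example above.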
Response
{
  "id": "chatcmpl_01HV...",
  "object": "chat.completion",
  "model": "microsoft/phi-4",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "A multilingual embedding model from BAAI." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 32, "completion_tokens": 12, "total_tokens": 44 }
}
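A short sketch of reading the fields most callers need from a response like the one above. `summarize` is a hypothetical helper, and the sample string simply repeats the example response (with its elided id left as-is) so the snippet runs without a network call.

```python
import json

# The example response from this page, kept verbatim (the id is elided there too).
SAMPLE = """
{
  "id": "chatcmpl_01HV...",
  "object": "chat.completion",
  "model": "microsoft/phi-4",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "A multilingual embedding model from BAAI." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 32, "completion_tokens": 12, "total_tokens": 44 }
}
"""


def summarize(response):
    """Pull the assistant text, finish reason, and token total from a response dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize(json.loads(SAMPLE))
print(summary["text"])
```

In OpenAI's schema, a `finish_reason` of `length` rather than `stop` means the reply was cut off at `max_tokens`, so it is usually worth checking before using the text.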
Streaming
Set "stream": true and read the response as server-sent events (SSE). Each
event carries a chat.completion.chunk object; the stream ends with data: [DONE].
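import os
from openai import OpenAI

# Assumption: the API key is stored in an environment variable named TOMOUL_API_KEY.
client = OpenAI(
    base_url="https://api.tomoul.ai/v1",
    api_key=os.environ["TOMOUL_API_KEY"],
)
stream = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "Count to 10."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no text (for example, a role-only first chunk).
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)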
Full pattern, including cancellation: Streaming completions guide.
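For clients that read the raw SSE stream instead of using an SDK, the sketch below shows one way to turn `data:` lines into text. It assumes chunks follow OpenAI's `chat.completion.chunk` layout (`choices[0].delta.content`). `iter_content` and the sample lines are illustrative, not captured server output.

```python
import json


def iter_content(lines):
    """Yield content deltas from SSE lines, stopping at `data: [DONE]`."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        delta = choices[0].get("delta", {}).get("content")
        if delta:
            yield delta


# Hand-written sample lines in the assumed chunk layout.
sample_lines = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "1, "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "2"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(iter_content(sample_lines))
print(text)
```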
Tools / function calling
Send tools: [...] in the request and parse choices[0].message.tool_calls
in the response. The schema matches OpenAI's tools spec. Models that support
tools advertise capabilities.tools: true in
GET /v1/models. Full pattern:
Function calling guide.
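The sketch below illustrates both halves of that flow: checking the capability flag, then decoding `tool_calls`. It makes three assumptions. The `GET /v1/models` response is taken to be an OpenAI-style list under `data`, with `capabilities` on each entry. Tool-call arguments are taken to arrive as a JSON string, as in OpenAI's spec. The `get_weather` tool, the model ids, and the sample payloads are all made up for illustration.

```python
import json

# Hypothetical tool definition in OpenAI's tools format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def supports_tools(models_response, model_id):
    """Return True if the model entry advertises capabilities.tools."""
    for model in models_response.get("data", []):
        if model.get("id") == model_id:
            return bool(model.get("capabilities", {}).get("tools"))
    return False


def extract_tool_calls(response):
    """Return (name, arguments) pairs from choices[0].message.tool_calls."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls


# Made-up payloads in the assumed shapes.
sample_models = {"data": [
    {"id": "example/tool-model", "capabilities": {"tools": True}},
    {"id": "example/plain-model", "capabilities": {"tools": False}},
]}
sample_response = {"choices": [{
    "index": 0,
    "message": {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_weather", "arguments": json.dumps({"city": "Oslo"})},
        }],
    },
    "finish_reason": "tool_calls",
}]}

calls = extract_tool_calls(sample_response)
print(supports_tools(sample_models, "example/tool-model"), calls)
```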
JSON mode
Set response_format: { "type": "json_object" } to force valid JSON. For
structured output against a schema, pass
{ "response_format": { "type": "json_schema", "json_schema": { ... } } }
on models that advertise capabilities.structured_output: true.
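A hedged sketch of building both `response_format` variants and decoding the reply. The inner `json_schema` fields (`name`, `schema`) are assumed to follow OpenAI's structured-output layout, since this page elides them. The helper names and the sample schema are hypothetical.

```python
import json


def json_object_format():
    """response_format for plain JSON mode."""
    return {"type": "json_object"}


def json_schema_format(name, schema):
    """response_format for schema-constrained output (assumed OpenAI layout)."""
    return {"type": "json_schema", "json_schema": {"name": name, "schema": schema}}


def parse_json_content(response):
    """Decode the assistant message content as JSON."""
    return json.loads(response["choices"][0]["message"]["content"])


# Hypothetical schema for illustration.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "dims": {"type": "integer"}},
    "required": ["name", "dims"],
}
request_body = {
    "model": "microsoft/phi-4",
    "messages": [{"role": "user", "content": "Describe bge-m3 as JSON."}],
    "response_format": json_schema_format("model_card", schema),
}

# Made-up reply in the response shape shown on this page.
sample_response = {"choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "{\"name\": \"bge-m3\", \"dims\": 1024}"},
    "finish_reason": "stop",
}]}
parsed = parse_json_content(sample_response)
print(parsed["name"], parsed["dims"])
```

Wrapping the decode in a `try`/`except json.JSONDecodeError` is a reasonable precaution, for example if a reply is truncated by `max_tokens`.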
Tomoul-specific notes
seed is honoured deterministically on Tomoul-exclusive models. On third-party models it is best-effort.
X-Tomoul-Region: <code> pins the call to a specific region. See Regions & residency.
X-Tomoul-Cache: no-store opts out of prompt caching for the call. See Pricing & metering.
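One way to attach these headers per call is sketched below. `tomoul_headers` is a hypothetical helper, the `Authorization: Bearer` scheme is assumed from OpenAI compatibility, and the region code `eu-west` is a placeholder rather than a documented value.

```python
def tomoul_headers(api_key, region=None, no_store=False):
    """Build request headers, adding the Tomoul-specific ones only when asked."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    if region:
        headers["X-Tomoul-Region"] = region  # pin the call to one region
    if no_store:
        headers["X-Tomoul-Cache"] = "no-store"  # opt out of prompt caching
    return headers


# "eu-west" is a placeholder region code, not a documented value.
headers = tomoul_headers("sk-example", region="eu-west", no_store=True)
print(headers)
```

With the OpenAI Python SDK, a dict like this can be passed per request through the `extra_headers` argument of `client.chat.completions.create`.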