Chat completions

`POST /v1/chat/completions` is an OpenAI-compatible chat endpoint. Streaming, function calling, and JSON mode are all supported.

Endpoint

POST https://api.tomoul.ai/v1/chat/completions

Request body

Same shape as OpenAI's /v1/chat/completions. Required: model, messages. Common optional: temperature, top_p, max_tokens, stream, tools, response_format, seed.

{
  "model": "microsoft/phi-4",
  "messages": [
    { "role": "system", "content": "You answer in 1 sentence." },
    { "role": "user",   "content": "What is bge-m3?" }
  ],
  "temperature": 0.2,
  "max_tokens": 256
}

For exhaustive parameter docs, OpenAI's reference applies in full: platform.openai.com/docs/api-reference/chat/create.
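
As a rough illustration of the request described above, the sketch below builds the same call with only the Python standard library. The helper names `build_chat_request` and `send` are made up for this example, and the Bearer authorization header is an assumption that mirrors OpenAI's convention; this page does not confirm how Tomoul authenticates requests.

```python
import json
import urllib.request

ENDPOINT = "https://api.tomoul.ai/v1/chat/completions"


def build_chat_request(api_key, model, messages, **options):
    """Return a urllib Request for a chat completion (hypothetical helper)."""
    # model and messages are required; everything else is optional
    # (temperature, top_p, max_tokens, stream, tools, response_format, seed).
    payload = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # Assumption: Bearer auth, as in OpenAI's API.
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `send(build_chat_request(key, "microsoft/phi-4", messages, temperature=0.2, max_tokens=256))` would then issue the request shown above. Keeping construction separate from sending makes the payload easy to inspect before any network traffic happens.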

Response

{
  "id":      "chatcmpl_01HV...",
  "object":  "chat.completion",
  "model":   "microsoft/phi-4",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "A multilingual embedding model from BAAI." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 32, "completion_tokens": 12, "total_tokens": 44 }
}
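
To show how the fields above are typically read, here is a minimal sketch that pulls the reply text, finish reason, and token count out of a decoded response. The function name `summarize_completion` is hypothetical. The `truncated` flag relies on OpenAI's convention that `finish_reason` is `"length"` when generation stops at `max_tokens`.

```python
def summarize_completion(response):
    """Extract the commonly used fields from a chat.completion dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means the reply was cut off by max_tokens (OpenAI convention).
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }


# The sample response from this page, as a Python dict.
sample = {
    "id": "chatcmpl_01HV...",
    "object": "chat.completion",
    "model": "microsoft/phi-4",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "A multilingual embedding model from BAAI."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 32, "completion_tokens": 12, "total_tokens": 44},
}

summary = summarize_completion(sample)
print(summary["text"])
```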

Streaming

Set "stream": true and read SSE chunks. Each chunk is chat.completion.chunk; the stream ends with data: [DONE].

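import os
from openai import OpenAI

# Point the OpenAI SDK at Tomoul. TOMOUL_API_KEY is an assumed variable name.
client = OpenAI(
    base_url="https://api.tomoul.ai/v1",
    api_key=os.environ["TOMOUL_API_KEY"],
)
stream = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "Count to 10."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)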

Full pattern, including cancellation: Streaming completions guide.
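
For clients that do not use the SDK, the SSE framing described above can be sketched as a small parser over already-decoded lines. The function name `iter_sse_content` and the sample lines are illustrative only; the chunk layout is assumed to follow OpenAI's `chat.completion.chunk` format, with text carried in `choices[].delta.content`.

```python
import json


def iter_sse_content(lines):
    """Yield content deltas from an iterable of decoded SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {}).get("content")
            if delta:
                yield delta


# Hand-written sample stream, not captured from the API.
sample_lines = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "1, "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "2"}}]}',
    "",
    "data: [DONE]",
]

text = "".join(iter_sse_content(sample_lines))
print(text)  # prints "1, 2"
```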

Tools / function calling

Send tools: [...] in the request and parse choices[0].message.tool_calls in the response. The schema matches OpenAI's tools spec. Models that support tools advertise capabilities.tools: true in GET /v1/models. Full pattern: Function calling guide.
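
The round trip described above might look like the following sketch. The `get_weather` tool, the helper names, and the sample message are all hypothetical. The layout of a model entry (a dict with a `capabilities` object) is an assumption about what `GET /v1/models` returns, and the JSON-string encoding of `arguments` follows OpenAI's tools spec.

```python
import json

# Hypothetical tool definition in OpenAI's tools format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def model_supports_tools(model_entry):
    """Check capabilities.tools on one entry from GET /v1/models (assumed shape)."""
    return bool(model_entry.get("capabilities", {}).get("tools"))


def parse_tool_calls(message):
    """Return (call_id, name, arguments) tuples from an assistant message."""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # In OpenAI's format, arguments arrive as a JSON-encoded string.
        calls.append((call["id"], fn["name"], json.loads(fn["arguments"])))
    return calls


# Hand-written example of choices[0].message when the model calls a tool.
sample_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Oslo\"}"},
    }],
}

print(parse_tool_calls(sample_message))
```

The request side only needs `"tools": [WEATHER_TOOL]` added to the body. Checking `model_supports_tools` first avoids sending tool definitions to a model that does not advertise the capability.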

JSON mode

Set response_format: { "type": "json_object" } to force valid JSON. For structured output against a schema, pass

{ "response_format": { "type": "json_schema", "json_schema": { ... } } }

on models that advertise capabilities.structured_output: true.
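
One way to combine the two modes is to prefer schema-constrained output where the model advertises it and fall back to plain JSON mode otherwise. The sketch below assumes a model entry shaped like a dict with a `capabilities` object, and it assumes the inner `json_schema` fields (`name`, `schema`, `strict`) follow OpenAI's layout; neither is spelled out on this page. The helper name and the example schema are hypothetical.

```python
import json


def choose_response_format(model_entry, name, schema):
    """Pick json_schema when supported, otherwise fall back to json_object."""
    caps = model_entry.get("capabilities", {})
    if caps.get("structured_output"):
        return {
            "type": "json_schema",
            # Assumption: inner fields follow OpenAI's json_schema layout.
            "json_schema": {"name": name, "schema": schema, "strict": True},
        }
    return {"type": "json_object"}


# Hypothetical schema for a one-field answer.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,
}

fmt = choose_response_format(
    {"id": "microsoft/phi-4", "capabilities": {"structured_output": True}},
    "answer",
    ANSWER_SCHEMA,
)
print(json.dumps(fmt, indent=2))

# In either mode the reply content is a JSON string that still needs decoding.
parsed = json.loads('{"answer": "A multilingual embedding model from BAAI."}')
print(parsed["answer"])
```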

Tomoul-specific notes

seed is honoured deterministically on Tomoul-exclusive models. On third-party models, seed is honoured on a best-effort basis.
X-Tomoul-Region: <code> pins the call to a specific region. See Regions & residency.
X-Tomoul-Cache: no-store opts out of prompt caching for the call. See Pricing & metering.
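
The two headers above can be attached per call. The sketch below assembles them in a plain dict; the helper name `tomoul_headers`, the Bearer authorization header, and the region code `eu-west` are assumptions made for illustration, so take real region codes from Regions & residency.

```python
def tomoul_headers(api_key, region=None, cache=True):
    """Build request headers, adding the Tomoul-specific ones only when asked."""
    headers = {
        # Assumption: Bearer auth, as in OpenAI's API.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    if region is not None:
        headers["X-Tomoul-Region"] = region  # pin the call to one region
    if not cache:
        headers["X-Tomoul-Cache"] = "no-store"  # opt out of prompt caching
    return headers


# "eu-west" is a placeholder region code, not a documented value.
pinned = tomoul_headers("sk-example", region="eu-west", cache=False)
print(pinned["X-Tomoul-Region"], pinned["X-Tomoul-Cache"])
```

With the OpenAI Python SDK, the same two headers could likely be passed through the `extra_headers` argument of `client.chat.completions.create`, which the SDK adds on top of its own defaults.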

Last updated 13 May 2026