RAG with bge-m3
End-to-end retrieval pipeline using Tomoul's embeddings, reranker, and a small chat model. ~50 lines, ~$0 in credits.
Setup
pip install openai numpy requests
export TOMOUL_KEY=tomoul_sk_...
Indexing
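import os

from openai import OpenAI

# Read the key from the environment; Python does not expand "$TOMOUL_KEY" inside a string.
client = OpenAI(api_key=os.environ["TOMOUL_KEY"], base_url="https://api.tomoul.ai/v1")

DOCS = [
    "bge-m3 is BAAI's multilingual embedding model.",
    "Tomoul hosts bge-m3 in EU regions today.",
    "Helsinki is the capital of Finland.",
]

# One request embeds the whole list; the vectors stay in the same order as DOCS.
vecs = client.embeddings.create(model="baai/bge-m3", input=DOCS).data
vectors = [v.embedding for v in vecs]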
Store vectors alongside DOCS in your vector DB of choice: pgvector, Qdrant, and LanceDB all work.
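Before wiring up a real vector DB, a local file is often enough for prototyping. The sketch below is one way to keep documents and vectors together on disk with NumPy; save_index and load_index are hypothetical helper names, not part of any Tomoul SDK.

```python
import json

import numpy as np


def save_index(path, docs, vectors):
    """Write documents and their vectors to one .npz file (hypothetical helper)."""
    if len(docs) != len(vectors):
        raise ValueError("need exactly one vector per document")
    np.savez(
        path,
        vectors=np.asarray(vectors, dtype=np.float32),
        docs=json.dumps(list(docs)),  # stored as a plain string, no pickling
    )


def load_index(path):
    """Return (docs, vectors) in the same order they were saved."""
    with np.load(path) as data:
        docs = json.loads(data["docs"].item())
        vectors = data["vectors"]
    return docs, vectors
```

Pass a path ending in .npz; np.savez appends that suffix itself when it is missing, which would make the load path differ from the save path.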
Querying
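import numpy as np

def search(query, k=5):
    # Embed the query with the same model used for the documents.
    qv = client.embeddings.create(
        model="baai/bge-m3",
        input=query,
    ).data[0].embedding
    # Dot product against every stored vector, highest score first.
    scores = np.dot(vectors, qv)
    idx = np.argsort(-scores)[:k]
    return [DOCS[i] for i in idx]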
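A raw dot product ranks by cosine similarity only when every vector has unit length. This page does not say whether the endpoint returns normalized embeddings, so the sketch below assumes it might not and normalizes explicitly. cosine_top_k is a hypothetical helper that returns document indices rather than text.

```python
import numpy as np


def cosine_top_k(doc_vectors, query_vector, k=5):
    """Return indices of the k documents with the highest cosine similarity."""
    m = np.asarray(doc_vectors, dtype=np.float64)
    q = np.asarray(query_vector, dtype=np.float64)
    # Scale every row and the query to unit length so the dot product is the cosine.
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    q = q / np.linalg.norm(q)
    scores = m @ q
    order = np.argsort(-scores)[:k]
    return [int(i) for i in order]
```

If the embeddings are already unit length, this returns the same ranking as the plain dot product, so the extra step costs little.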
Rerank step
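import os

import requests

def rerank(query, candidates, top_n=3):
    r = requests.post(
        "https://api.tomoul.ai/v1/rerank",
        # Read the key from the environment; an f-string does not expand "$TOMOUL_KEY".
        headers={"Authorization": f"Bearer {os.environ['TOMOUL_KEY']}"},
        json={
            "model": "baai/bge-reranker-v2-m3",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
    ).json()
    # Each result carries the index of the candidate it refers to.
    return [candidates[hit["index"]] for hit in r["results"]]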
The rerank step routinely lifts answer quality by 15–30 points on retrieval benchmarks, which makes it the cheapest quality win in the stack.
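The size of the lift depends on your corpus, so it is worth measuring on your own queries. The sketch below computes hit rate at k, one common retrieval metric, for ranked lists produced with and without the rerank step. hit_rate_at_k and the labeled data it expects are assumptions for illustration, not something this page provides.

```python
def hit_rate_at_k(ranked_lists, relevant, k=3):
    """Fraction of queries whose known-relevant document appears in the top k.

    ranked_lists: one ranked list of documents per query
    relevant:     the single relevant document for each query, in the same order
    """
    if len(ranked_lists) != len(relevant):
        raise ValueError("need one relevant document per query")
    if not ranked_lists:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, relevant) if gold in ranked[:k]
    )
    return hits / len(ranked_lists)
```

Run it once on the output of search alone and once on the reranked output for the same labeled queries; the difference between the two numbers is the lift on your data.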
Generate the answer
def answer(query):
    # Retrieve broadly (k=10), then keep only the 3 best after reranking.
    hits = rerank(query, search(query, k=10), top_n=3)
    context = "\n".join(hits)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="microsoft/phi-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
Production tips
- Batch embed calls. Up to 2048 inputs per request — see Embeddings batching.
- Pin a region with X-Tomoul-Region to keep latency predictable.
- Cache user-query embeddings in your app: same prompt twice = same vector, no need to re-embed.
- For multilingual corpora (Swahili + English mixed), bge-m3 beats most English-only models at retrieval recall; don't fragment your index across models. See Multilingual retrieval.
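The batching and caching tips can be sketched with two small helpers. Both take the actual embedding call as a parameter, so they make no assumption about the client beyond the 2048-input limit quoted above. embed_in_batches and make_cached_embedder are hypothetical names.

```python
from functools import lru_cache

MAX_BATCH = 2048  # per-request input limit quoted in the tips above


def embed_in_batches(texts, embed_batch, batch_size=MAX_BATCH):
    """Embed texts in chunks of at most batch_size, preserving input order.

    embed_batch: callable taking a list of strings and returning one vector per string
    """
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors


def make_cached_embedder(embed_one, maxsize=10_000):
    """Wrap a single-query embed call so repeated queries skip the API."""

    @lru_cache(maxsize=maxsize)
    def cached(query):
        # Return a tuple so callers cannot mutate the cached vector.
        return tuple(embed_one(query))

    return cached
```

With the client from the Indexing section, embed_batch could be a lambda that calls client.embeddings.create(model="baai/bge-m3", input=batch) and collects the embedding of each returned item. The cache lives in process memory, so a multi-worker deployment would need a shared store instead.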
Last updated 13 May 2026