Multilingual retrieval
A practical recipe for retrieval across languages — including Swahili, Yoruba, Amharic, and other African languages that monolingual embedders miss.
The problem
English-only embedders (text-embedding-3-large, mxbai-embed-large) cluster
Swahili or Yoruba text into near-identical regions of the vector space.
Recall on non-English queries collapses below 30% on real corpora.
The fix isn't to translate every document first — that's a tax on every query. The fix is to embed everything with a model that was trained multilingual.
Pick the right embedder
baai/bge-m3— the default for >95% of multilingual retrieval. 100+ languages, 1024 dims, 8192-token context.jinaai/jina-embeddings-v3— when you also need image + text.- Tomoul partner embedders (Lelapa, Masakhane) for specialist
African-language retrieval — when they list, they appear in
GET /v1/modelswith thetomoul/prefix.
Indexing a mixed corpus
Embed every doc with the same multilingual model. Don't embed English docs with one model and Swahili docs with another — your vectors won't share a space and similarity is meaningless across them.
texts = ["Good morning.", "Habari za asubuhi.", "Bonjour."]
resp = client.embeddings.create(model="baai/bge-m3", input=texts)
The three vectors above will land within ~0.15 cosine distance of each other.
Cross-lingual queries
A query in English will retrieve relevant Swahili docs and vice versa — no translation step needed.
search("morning greetings") # finds "Habari za asubuhi."
search("salutations du matin") # finds the same
African-language specifics
| Language | Code | bge-m3 quality |
|---|---|---|
| Swahili | sw | Strong out of the box. |
| Yoruba | yo | Acceptable; specialist embedders recover 5–15 MRR points when listed. |
| Hausa | ha | Acceptable; same caveat. |
| Igbo | ig | Acceptable; same caveat. |
| Amharic | am | Ge'ez script. Test recall before shipping. |
| Tigrinya | ti | Ge'ez script. Test recall before shipping. |
| Code-switched | — | Sheng / Naija Pidgin work well in practice. |
On multilingual corpora, the relevance gap between embedding-top-K and cross-encoder rerank is wider than on English-only corpora. See RAG with bge-m3 for the full pipeline.