BIPI

Choosing an Embedding Model: A Practitioner's Comparison

OpenAI text-embedding-3, Cohere embed-v3, Voyage, and the open-source contenders. We benchmarked all four on multilingual retrieval, domain documents, and cost per million tokens. The right answer depends on what you are retrieving.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 28, 2024 · 7 min read

#embeddings #rag #vector-search

Embedding model selection is one of those decisions where the wrong choice is invisible at first and expensive to reverse. Switch embedding models and you have to re-embed your entire corpus. Some of our clients have hundreds of millions of vectors. Get the choice right the first time and you save a re-indexing cycle that takes weeks.

The four mainstream options

Our shortlist for new projects in 2024: OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, Voyage voyage-3-large, and open-source BGE-M3 or Nomic embed-v1.5. Each has a different sweet spot.

Benchmark setup

We tested on three corpora from real client projects: 2 million English support tickets, 800k mixed-language product reviews across 11 languages, and 60k specialised technical documents in chemistry. Retrieval task: mean reciprocal rank at 10 (MRR@10) against a held-out set of 1,000 labelled queries per corpus.

  • Voyage: 0.71 MRR@10 on the technical chemistry corpus
  • Cohere multilingual: 0.68 MRR@10 on the mixed-language reviews
  • OpenAI text-embedding-3-large: 0.64 MRR@10 on the English support tickets

Domain specialisation matters more than benchmark score

Voyage's domain-tuned models won on our technical chemistry corpus by a wide margin. The general-purpose embeddings struggled with chemical nomenclature like 2,3-dihydroxybutanoic acid where punctuation and notation carry meaning. Voyage-3-large handled it natively. The lesson: published benchmark scores tell you less than testing on your own corpus.

Multilingual reality check

Cohere embed-multilingual-v3 is the strongest multilingual option we have shipped. It handled cross-lingual retrieval cleanly: a query in French returned relevant English documents and vice versa. OpenAI's multilingual support is competent, but its cross-lingual retrieval was weaker. BGE-M3 is the best open-source multilingual option and genuinely competitive, but you are running your own inference.
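
The pattern in code is short. A minimal sketch against Cohere's embed API; the strings and API key placeholder are illustrative. Note that the v3 models require an input_type tag: "search_document" at index time, "search_query" at query time.

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # illustrative placeholder

# Index-time embedding: tag documents with input_type="search_document".
docs = co.embed(
    texts=["The pump housing is rated to 80 bar."],
    model="embed-multilingual-v3.0",
    input_type="search_document",
).embeddings

# Query-time embedding: a French query against English documents.
query = co.embed(
    texts=["Quelle est la pression maximale du corps de pompe ?"],
    model="embed-multilingual-v3.0",
    input_type="search_query",
).embeddings[0]

q, d = np.array(query), np.array(docs[0])
print(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))  # cosine similarity
```

In our experience, mixing up the two input_type values quietly degrades retrieval, so it is worth asserting the tag at the call site.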

Dimension tradeoffs

Dimensions drive storage cost. OpenAI text-embedding-3-large is 3072 dimensions natively but supports Matryoshka truncation down to 256. Cohere embed-multilingual-v3, Voyage-3-large, and BGE-M3 are all 1024. For 100 million vectors at 4 bytes per float, the storage difference between 3072 and 1024 dimensions is roughly 800 GB. At pgvector pricing on Aurora, that is real money. A short truncation sketch follows the list below.

  • Truncating Matryoshka embeddings to 1024 dimensions costs around 2 to 3 points of recall on most corpora
  • Truncating to 512 dimensions costs 5 to 8 points and is rarely worth it
  • Quantisation to int8 saves 4x storage at roughly 1 point of recall, often a better deal than dimension truncation
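
If you do truncate, re-normalise: a Matryoshka prefix is only a usable embedding once it is a unit vector again. A minimal sketch, assuming float32 inputs; truncate_matryoshka and quantise_int8 are illustrative helpers, and a production setup would normally lean on the vector store's built-in quantisation:

```python
import numpy as np

def truncate_matryoshka(vecs: np.ndarray, dims: int = 1024) -> np.ndarray:
    """Keep the first `dims` components, then re-normalise to unit length."""
    out = vecs[:, :dims].astype(np.float32)
    return out / np.linalg.norm(out, axis=1, keepdims=True)

def quantise_int8(vecs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-vector int8 quantisation: 4x smaller than float32."""
    scale = np.abs(vecs).max(axis=1, keepdims=True) / 127.0
    return np.round(vecs / scale).astype(np.int8), scale

# Sanity check on the storage claim above:
# (3072 - 1024) dims * 4 bytes * 100e6 vectors ~= 819 GB saved.
```

OpenAI will also truncate server-side if you pass the dimensions parameter at embed time, which saves shipping the full 3072 floats over the wire.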

Cost per million tokens

As of mid 2024: OpenAI text-embedding-3-large is 0.13 dollars per million input tokens, Cohere embed-v3 is 0.10, and Voyage-3-large is 0.18. Open-source on your own GPU depends on hardware, but we have measured Nomic embed at roughly 0.04 dollars per million tokens on an A100, including amortised GPU cost.

For a corpus of 500 million tokens, total embedding cost at these rates runs from about 20 dollars (self-hosted Nomic) to 90 dollars (Voyage): rounding error. For a corpus of 50 billion tokens, the same spread becomes meaningful. We have one client where the open-source self-hosted setup pays back in three months despite the operational overhead.
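
The arithmetic is worth running per corpus rather than trusting intuition. A throwaway calculator at the prices quoted above; the dictionary keys are our shorthand, not official model ids:

```python
# Dollars per million input tokens, as quoted above (mid-2024).
PRICE = {
    "openai-3-large": 0.13,
    "cohere-embed-v3": 0.10,
    "voyage-3-large": 0.18,
    "nomic-self-hosted": 0.04,
}

def embed_cost(corpus_tokens: float) -> dict[str, float]:
    return {model: rate * corpus_tokens / 1e6 for model, rate in PRICE.items()}

print(embed_cost(500e6))  # 500M tokens: 20 to 90 dollars across providers
print(embed_cost(50e9))   # 50B tokens: 2,000 to 9,000 dollars
```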

Embedding cost is usually rounding error compared to inference cost. Optimise for retrieval quality first, cost second.

When to fine-tune your own embeddings

Fine-tuning embeddings used to be exotic. With sentence-transformers and a few thousand labelled triples, it has become tractable. We have fine-tuned embeddings for two clients with highly specialised domains: legal contracts and medical device documentation. In both cases the fine-tuned BGE-base model beat every commercial offering by 5 to 9 MRR points.

The catch: you need labelled triples. Anchor, positive, hard negative. For most teams the labelling cost outweighs the quality gain. We recommend it only when your domain has terminology that is rare in general training data and you have at least 5,000 labelled examples.
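
Mechanically the loop is small. A minimal sketch using the classic sentence-transformers fit API with MultipleNegativesRankingLoss over (anchor, positive, hard negative) triples; the two inline triples are invented for illustration, not client data:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Illustrative triples; real training data is mined and labelled
# from your own corpus.
triples = [
    ("What caps the vendor's liability?",
     "Clause 9.2 limits aggregate liability to fees paid in the prior year.",
     "Clause 9.4 requires written notice of claims within thirty days."),
    ("When can the customer terminate for convenience?",
     "Either party may terminate on ninety days written notice.",
     "The agreement renews automatically for successive one-year terms."),
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train = [InputExample(texts=list(t)) for t in triples]
loader = DataLoader(train, shuffle=True, batch_size=2)

# With three texts per example, the loss treats the third as a hard
# negative and other in-batch positives as additional negatives.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("bge-base-domain-tuned")
```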

Reranking is non-negotiable

Whichever embedding model you pick, layer a reranker on top. Cohere rerank-3 is our default. It adds 200 ms of latency and 2 to 4 points of MRR. On the technical chemistry corpus it lifted Voyage's 0.71 to 0.79. Reranker cost is per query, not per document, so it does not scale with corpus size.
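
Wiring it in is a few lines. A minimal sketch against Cohere's rerank endpoint; the query, candidates, and key placeholder are illustrative, with candidates standing in for the top-k hits from the embedding retriever:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # illustrative placeholder

# Candidates from the first-stage embedding retriever.
candidates = [
    "Synthesis of 2,3-dihydroxybutanoic acid via asymmetric dihydroxylation.",
    "Safety data sheet for butanoic acid and common derivatives.",
    "Quarterly maintenance schedule for the chromatography column.",
]

response = co.rerank(
    model="rerank-english-v3.0",
    query="preparation of 2,3-dihydroxybutanoic acid",
    documents=candidates,
    top_n=3,
)
for hit in response.results:
    print(hit.index, round(hit.relevance_score, 3), candidates[hit.index])
```

There is also a multilingual variant, rerank-multilingual-v3.0, for the mixed-language case.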

Our default recommendation

For new projects without domain specialisation: Cohere embed-multilingual-v3 plus rerank-3. It is competitive on English, strongest on multilingual, well-priced, and operationally simple. For domain-specialised projects: test Voyage first, then consider fine-tuning a BGE base if Voyage does not have a model for your domain. Open-source self-hosted: only if you have GPU operations expertise already or you are at a scale where it pays back.

Whichever you pick, write down why. Embedding model choices outlive the engineers who made them, and the next engineer is going to ask why they cannot just swap to whatever is on top of the leaderboard this month.

Read more field notes, explore our services, or get in touch at info@bipi.in.