BIPI

Fine-Tune or RAG: A Decision Framework Based on Real Project Cost

Agentic AI

The fine-tune versus RAG debate is usually framed badly. The right answer depends on whether your problem is behavioural or knowledge-bound, and most production systems need both. We share the framework we use with clients.

By Arjun Raghavan, Security & Systems Lead, BIPI · April 19, 2024 · 7 min read

#fine-tuning #rag #llm

A health-tech client asked us to fine-tune GPT-4o on their clinical documentation. They had 12,000 example notes, a six-figure budget, and a deadline. We pushed back. The actual problem was not that the model needed to learn clinical terminology. The problem was that the model needed to cite specific guideline documents that were updated quarterly. Fine-tuning would have baked in stale content. We shipped a RAG system with a custom embedding model in six weeks and the client kept the fine-tune budget.

The behavioural versus knowledge axis

The first question we ask: is the problem behavioural or knowledge-bound? Behavioural means the model needs to learn a format, a tone, a style of reasoning. Knowledge-bound means the model needs to retrieve specific facts that exist in a corpus. Fine-tuning is for behaviour. RAG is for knowledge. Most teams confuse the two.

  • Behavioural: emit JSON in a specific schema, write in a specific tone, follow a multi-step reasoning pattern, classify into a fixed taxonomy (see the training-example sketch after this list).
  • Knowledge-bound: answer questions about your product, cite policy documents, surface customer records, ground responses in fresh data.
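To make the behavioural side concrete, here is what one training example might look like in OpenAI's chat fine-tuning JSONL format. The extraction schema and field names are invented for this sketch; only the message structure is the real contract.

```python
# One behavioural fine-tuning example in OpenAI's chat JSONL format.
# The extraction schema and field names are made up for this sketch.
import json

example = {
    "messages": [
        {"role": "system", "content": "Extract visit details. Reply with JSON only."},
        {"role": "user", "content": "Patient seen 2024-03-02 for follow-up; BP 128/84."},
        {"role": "assistant", "content": json.dumps({
            "visit_date": "2024-03-02",
            "visit_type": "follow-up",
            "blood_pressure": "128/84",
        })},
    ]
}

with open("train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")  # JSONL: one example per line
```

The model is learning the shape of the assistant turn, not the facts inside it. That is the whole point of the behavioural bucket.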

Cost comparison from real projects

We ran a side-by-side on a structured extraction task last spring. Fine-tuning GPT-4o-mini on 4,000 labelled examples cost 1,200 dollars in training, took two weeks of data preparation, and produced a model that hit 94 percent task accuracy. RAG with a 50-shot dynamic prompt and a Cohere reranker hit 91 percent accuracy, took four days to build, and cost 80 dollars in setup plus higher per-request inference. At the client's volume the fine-tune paid back its upfront cost in 11 months; below that volume it never paid back.

  • 94% — fine-tune accuracy on the extraction task
  • 91% — RAG accuracy on the same task
  • 11 months — break-even point for fine-tuning at the client's volume
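A back-of-the-envelope version of that break-even sum. The upfront costs are the real project numbers; the per-request costs and volume are stand-in values we have chosen to be consistent with the 11-month figure above.

```python
# Back-of-the-envelope break-even for the extraction task above.
# Upfront costs are the project numbers; per-request costs and volume
# are hypothetical placeholders consistent with the 11-month figure.
FT_UPFRONT = 1_200.0   # fine-tune training cost, dollars
RAG_UPFRONT = 80.0     # RAG setup cost, dollars

FT_PER_REQ = 0.001     # assumed: short prompt, no 50-shot block
RAG_PER_REQ = 0.003    # assumed: 50-shot dynamic prompt plus reranker call
VOLUME = 50_000        # assumed requests per month

monthly_saving = (RAG_PER_REQ - FT_PER_REQ) * VOLUME           # $100/month
months_to_break_even = (FT_UPFRONT - RAG_UPFRONT) / monthly_saving
print(f"fine-tune breaks even after {months_to_break_even:.1f} months")  # ~11.2
```

The structure of the sum matters more than the placeholder numbers: the fine-tune trades a fixed upfront cost for a lower marginal cost, so volume decides everything.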

Latency tradeoffs

RAG adds retrieval latency. For our typical setup with pgvector and an HNSW index on 2 million chunks, retrieval takes 40 to 90 ms. Reranking with Cohere rerank-3 adds another 200 ms. Total overhead: under 300 ms. Fine-tuned models save context tokens, so they often have lower time to first token (TTFT). If you are building a voice agent where every 100 ms matters, the fine-tune latency advantage is real. For chat and document workflows it is invisible.
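For reference, the retrieve-then-rerank path we are timing looks roughly like this. Table, column, and connection names are illustrative, and the Cohere model id is a stand-in for whichever rerank-3 variant you deploy.

```python
# Sketch of the retrieve-then-rerank path timed above. Table, column,
# and connection names are illustrative; the model id is a stand-in.
import psycopg
import cohere

conn = psycopg.connect("dbname=docs")
co = cohere.Client()  # assumes the API key is set in the environment

def retrieve(query_embedding: list[float], query_text: str, k: int = 5) -> list[str]:
    with conn.cursor() as cur:
        # ANN search over the HNSW index; <=> is pgvector's distance operator.
        # This is the 40-90 ms step on our 2M-chunk corpus.
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 50",
            (str(query_embedding),),
        )
        candidates = [row[0] for row in cur.fetchall()]
    # Rerank the shortlist: roughly +200 ms, but much better top-k precision.
    ranked = co.rerank(model="rerank-english-v3.0", query=query_text,
                       documents=candidates, top_n=k)
    return [candidates[r.index] for r in ranked.results]
```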

Data freshness

This is where fine-tuning gets dangerous. Fine-tuned models freeze your knowledge at training time. If your domain changes weekly, you are committing to a retraining cadence that most teams do not actually maintain. We have seen fine-tuned support bots cite product features that were deprecated months earlier. RAG over a freshly indexed corpus does not have this problem because the index is the source of truth.

Fine-tune things that do not change. Retrieve things that do.

The hybrid approach

Most of our production systems are hybrid. We fine-tune for format, structure, and tone. We RAG for facts, freshness, and citations. The fine-tuned model learns to expect a retrieval block in the prompt and to format the output correctly. The retrieval system handles the actual content. This separates the two concerns cleanly and lets each one be improved independently.
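The seam between the two halves is just a prompt contract. A minimal sketch of the request-time assembly, where the delimiters, helper name, and fine-tune id are all placeholders:

```python
# Minimal hybrid assembly: RAG fills the facts, the fine-tune owns the
# format. Delimiters, helper name, and model id here are illustrative.
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[str]) -> str:
    # The fine-tuned model was trained on prompts with this exact shape,
    # so the retrieval block is part of its learned input contract.
    retrieval_block = "\n\n".join(f"[doc {i}] {c}" for i, c in enumerate(chunks))
    prompt = f"<context>\n{retrieval_block}\n</context>\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # placeholder fine-tune id
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Because the contract is fixed, you can swap the retriever or retrain the model without touching the other half.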

Example: a legal-tech client. We fine-tuned a small model to emit a specific JSON schema with citation markers in the right place. We then RAG over their case law corpus to fill the citations. Fine-tuning the format gave us 99 percent schema compliance. RAG kept the citations current. Neither approach alone hit either bar.
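The output contract from that engagement looked roughly like this; the shape and marker syntax are illustrative, not the client's actual schema. The fine-tune guarantees the structure and marker positions, and a post-processing step resolves each marker against the retrieved cases.

```python
import re

# Illustrative output contract: the fine-tune owns the structure and the
# [CITE:n] marker positions; retrieval resolves each marker afterwards.
draft = {
    "holding": "The limitation period runs from discovery of the defect [CITE:1].",
    "analysis": "Later authority narrows this to latent defects [CITE:2].",
    "citations": {},  # filled from the RAG layer in post-processing
}

def resolve_citations(draft: dict, retrieved: dict[str, str]) -> dict:
    markers = re.findall(r"\[CITE:(\d+)\]", draft["holding"] + draft["analysis"])
    draft["citations"] = {m: retrieved[m] for m in markers}
    return draft
```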

When to skip fine-tuning entirely

Skip fine-tuning if any of these apply:

  • Your training set is under 1,000 high-quality examples.
  • Your task changes more than once a quarter.
  • Your accuracy gap from few-shot prompting is under 5 points.
  • Your team does not have an eval suite that you trust.

Especially the last one. Fine-tuning without a held-out eval is gambling with cash. We have seen teams ship fine-tunes that scored worse than the base model and only noticed when production complaints rolled in.
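The eval does not need to be elaborate. A minimal harness along these lines, with model ids and an exact-match metric as stand-ins for whatever your task actually needs, is enough to catch a fine-tune that regresses below the base model:

```python
# Minimal held-out eval: run base and fine-tuned models on the same set
# and compare. Model ids and the exact-match metric are stand-ins.
import json
from openai import OpenAI

client = OpenAI()

def accuracy(model: str, path: str = "heldout.jsonl") -> float:
    hits = total = 0
    for line in open(path):
        case = json.loads(line)  # {"prompt": ..., "expected": ...}
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        )
        hits += resp.choices[0].message.content.strip() == case["expected"]
        total += 1
    return hits / total

base = accuracy("gpt-4o-mini")
tuned = accuracy("ft:gpt-4o-mini-2024-07-18:acme::abc123")  # placeholder id
print(f"base={base:.2%} tuned={tuned:.2%}")  # ship only if tuned clears base
```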

The decision in five questions

We ask clients five questions before recommending an approach:

  • How often does the underlying data change?
  • How big is the corpus?
  • Is the gap in behaviour or in knowledge?
  • What is the latency budget?
  • What is the per-request cost ceiling?

The answers usually pick the approach automatically. The places we end up with fine-tuning are narrower than people expect: format compliance, classification at scale, and tone control for high-volume customer touchpoints.
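Encoded as a rough first pass, with thresholds that are our rules of thumb rather than hard cut-offs:

```python
# The five questions as a first-pass decision rule. Thresholds are rough
# rules of thumb, not hard cut-offs; real engagements get human judgement.
def recommend(data_changes_per_quarter: int,
              corpus_docs: int,
              gap_is_behavioural: bool,
              latency_budget_ms: int,
              cost_ceiling_per_req: float) -> str:
    needs_rag = data_changes_per_quarter >= 1 or corpus_docs > 1_000
    needs_ft = (gap_is_behavioural
                or latency_budget_ms < 300       # retrieval overhead won't fit
                or cost_ceiling_per_req < 0.001)  # tight unit cost at volume
    if needs_rag and needs_ft:
        return "hybrid"
    if needs_rag:
        return "rag"
    if needs_ft:
        return "fine-tune"
    return "few-shot prompting"
```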

Read more field notes, explore our services, or get in touch at info@bipi.in.