Database Replication: Picking the Right Strategy for the Failure Mode You Care About.
Sync, async, semi-sync, multi-region, multi-master. The replication choice you make is really a choice about which failures you want to tolerate. A practitioner's guide without the marketing.
By Arjun Raghavan, Security & Systems Lead, BIPI · February 15, 2026 · 8 min read
Database replication is one of those topics where the marketing and the reality diverge sharply. Vendor docs say 'high availability and disaster recovery.' What you actually have to choose between is which kinds of failures you can survive without losing data, and at what cost in latency and operational complexity.
We have helped enough teams untangle replication choices gone wrong in production that the patterns are clear. The right choice depends on the specific failures you are protecting against, not on a generic 'best practice.'
Asynchronous replication: the default
How it works: the primary commits transactions locally and acknowledges the commit to the client. Replicas pull or receive the changes shortly after. Replication lag is typically 10ms to a few seconds within a region, longer across regions.
What you get: read scaling, a hot standby for failover, and the ability to query historical state if your engine supports time-travel queries. Operational simplicity is the killer feature. Most managed services (RDS, Cloud SQL, Aurora) default to this.
What you lose on failure: any committed-but-not-yet-replicated transactions when the primary dies. Typical exposure: 100ms-2s of writes. For most applications, this is acceptable. For payments, ledgers, or anything where an acknowledged write must never disappear, it is not.
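To make that exposure measurable, here is a minimal lag-monitoring sketch in Python, assuming PostgreSQL 10+ and the psycopg2 driver; pg_stat_replication and pg_last_xact_replay_timestamp() are standard Postgres, while the DSNs are placeholders for your own setup.

```python
# Minimal sketch: observing async replication lag on PostgreSQL 10+.
# Assumes psycopg2 and a role allowed to read pg_stat_replication.
import psycopg2

def primary_view(dsn: str):
    """Per-replica write/flush/replay lag, as seen from the primary."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT application_name, write_lag, flush_lag, replay_lag "
            "FROM pg_stat_replication"
        )
        return cur.fetchall()

def replica_lag_seconds(dsn: str) -> float:
    """Apply lag as measured on the replica itself."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT COALESCE(EXTRACT(EPOCH FROM "
            "now() - pg_last_xact_replay_timestamp()), 0)"
        )
        return float(cur.fetchone()[0])

for name, write, flush, replay in primary_view("dbname=app host=primary"):
    print(f"{name}: write={write} flush={flush} replay={replay}")
print(f"replica-side lag: {replica_lag_seconds('dbname=app host=replica'):.2f}s")
```

The replica-side number is the one to alert on: it keeps working when the primary is the thing that just died.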
Synchronous replication: when you cannot lose writes
How it works: the primary writes the transaction locally, sends it to the replica, waits for the replica's acknowledgement, then acknowledges the commit to the client. The transaction is durable on at least two nodes before the client sees success.
What you get: zero-RPO failover, true durability against single-node loss. Required for some compliance scenarios (PCI under certain interpretations, financial data under SOX).
What it costs: every transaction now includes a network round trip to the replica. Same-AZ: a few hundred microseconds. Same-region, different-AZ: 1-2ms. Cross-region: 50-200ms. Cross-region sync is rarely worth it; same-AZ or cross-AZ sync usually is fine.
Cross-region synchronous replication sounds like a great idea until your write latency triples and your throughput drops by 40 percent.
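If you would rather measure that cost than take our numbers for it, a rough benchmark is easy, assuming PostgreSQL with a synchronous standby already configured through synchronous_standby_names, psycopg2, and a throwaway bench_kv table. Comparing synchronous_commit = 'local' against 'on' isolates the replica wait from the local flush:

```python
# Rough sketch: per-commit cost of waiting on a synchronous standby.
# 'local' waits for the local WAL flush only; 'on' also waits for the
# standby's flush. The difference is the price of sync replication.
import time
import psycopg2

def ms_per_commit(dsn: str, level: str, n: int = 200) -> float:
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS bench_kv (k serial, v text)")
    conn.commit()
    start = time.perf_counter()
    for i in range(n):
        with conn.cursor() as cur:
            # SET LOCAL scopes the setting to this one transaction.
            cur.execute("SET LOCAL synchronous_commit = %s", (level,))
            cur.execute("INSERT INTO bench_kv (v) VALUES (%s)", (f"row-{i}",))
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed / n * 1000

dsn = "dbname=app host=primary"  # placeholder DSN
print(f"local: {ms_per_commit(dsn, 'local'):.2f} ms/commit")
print(f"on:    {ms_per_commit(dsn, 'on'):.2f} ms/commit")
```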
Semi-sync: the middle ground
How it works: primary commits locally and waits for at least one replica to acknowledge receipt (not full apply). The transaction is durable on at least two nodes' transaction logs but might not yet be queryable on the replica.
What you get: most of the durability benefit of sync without the apply-latency cost. The replica eventually catches up.
Where it falls short: if the replica's apply queue is backed up during a primary failure, recovery takes longer. Also, semi-sync degrades to async if the replica is slow, which means the durability guarantee is conditional. Read the documentation carefully.
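Because the fallback is silent, it is worth alerting on directly. Here is a sketch for MySQL's semi-sync plugin, assuming the pymysql driver; Rpl_semi_sync_master_status is the real status variable, though newer MySQL releases spell it Rpl_semi_sync_source_status, and the host and credentials are placeholders:

```python
# Sketch: detect when MySQL semi-sync has silently degraded to async.
# When the replica stalls past rpl_semi_sync_master_timeout, the primary
# keeps committing but flips Rpl_semi_sync_master_status to OFF.
import pymysql

def semisync_degraded(host: str, user: str, password: str) -> bool:
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW STATUS LIKE 'Rpl_semi_sync_master_status'")
            row = cur.fetchone()
            # No row at all means the semi-sync plugin is not loaded.
            return row is None or row[1] != "ON"
    finally:
        conn.close()

if semisync_degraded("primary.db.internal", "monitor", "..."):  # placeholders
    print("ALERT: semi-sync degraded to async; durability is now conditional")
```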
Multi-region active-passive
Replicas sit in a second region with async replication. Failover requires DNS or routing changes, accepts losing in-flight writes within the cross-region replication lag window (often 100ms-1s), and recovery time runs from minutes to hours depending on automation.
Best for: regional disaster recovery on workloads that can tolerate the RPO and the RTO. The standard pattern for most SaaS that has to claim 'multi-region.'
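The promotion and routing steps are stack-specific, but the decision logic is portable. A sketch of the control loop follows, with the probe, promotion, and DNS steps left as hypothetical callables you would wire to your own infrastructure:

```python
# Sketch: active-passive failover decision loop. probe_primary,
# replica_lag_s, promote_replica, and repoint_dns are hypothetical
# hooks; wire them to your health checks, pg_promote() or your managed
# service's API, and your routing layer.
import time

PROBE_INTERVAL_S = 5
FAILURES_BEFORE_FAILOVER = 3
MAX_AUTO_FAILOVER_LAG_S = 5.0  # above this, require a human decision

def failover_loop(probe_primary, replica_lag_s, promote_replica, repoint_dns):
    failures = 0
    while True:
        failures = 0 if probe_primary() else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            lag = replica_lag_s()
            # Promoting a lagging replica converts that lag directly
            # into lost writes, so surface the expected RPO first.
            if lag > MAX_AUTO_FAILOVER_LAG_S:
                print(f"lag {lag:.1f}s too high for auto-failover; paging a human")
                return
            print(f"failing over; expected loss window ~{lag:.1f}s")
            promote_replica()
            repoint_dns()
            return
        time.sleep(PROBE_INTERVAL_S)
```

The lag gate is the design choice worth copying: automation handles the common case, and anything that would lose meaningful data gets escalated to a human instead.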
Multi-region active-active (multi-master)
Writes accepted in multiple regions, replicated bidirectionally. Aurora Global Database with write-forwarding, CockroachDB, Spanner. Different products, very different tradeoffs.
What you have to deal with: write conflicts. Two regions update the same row at the same time. The system has to pick a winner. Strategies range from 'last writer wins' (silent data loss) to consensus protocols (write latency includes cross-region round trips). There is no free lunch.
Best for: applications where regional partitioning is natural (each user is mostly served from one region) and conflict rates are low. Bad for: anything with global state where any user can update any row.
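A toy merge function makes the last-writer-wins hazard concrete. This is illustrative Python, not any product's actual resolution code; real systems use vector clocks, hybrid logical clocks, or consensus rather than raw wall clocks:

```python
# Sketch: why last-writer-wins loses data. Two regions update the same
# row inside one replication window; a timestamp merge discards one write.
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    wall_clock_ms: int
    region: str

def last_writer_wins(a: Version, b: Version) -> Version:
    # Tie-break on region name so the merge is at least deterministic.
    winner = max(a, b, key=lambda v: (v.wall_clock_ms, v.region))
    loser = a if winner is b else b
    print(f"discarding {loser.region}'s write {loser.value!r}")  # silent in real systems
    return winner

# Concurrent updates to the same account from two regions:
us = Version("balance=90", 1_700_000_000_100, "us-east-1")
eu = Version("balance=80", 1_700_000_000_105, "eu-west-1")
print("merged:", last_writer_wins(us, eu).value)  # the us-east-1 debit is gone
```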
What we recommend by use case
- Standard SaaS, most workloads: async replication, single-region primary, multi-AZ replicas. Multi-region async for DR. RPO measured in seconds, RTO in tens of minutes.
- Financial transactions, ledgers, regulated data: semi-sync or sync replication within the primary region, async cross-region for DR. RPO of zero on node or AZ failure, seconds on regional failure.
- Globally distributed user base with regional partitioning: multi-region active-active with careful schema design. Most users only touch rows that belong to their region. Conflicts are rare.
- Read-heavy analytics, ML feature stores: aggressive read replicas, possibly across regions. Writes go through a single primary because consistency is more important than write availability for these workloads.
What goes wrong
- Replication lag spikes during high-write periods. Read replicas serve stale data. Reports look weird. Need monitoring and alerting on lag, plus query routing that falls back to the primary above a threshold (sketched after this list).
- Failover happens and the replica was 12 seconds behind. Lost transactions. Need monitoring on lag plus a runbook for partial-loss recovery.
- Async replication looked fine in dev (sub-millisecond) and falls apart in prod (sustained 5-second lag during peaks). Need load testing at production write rate.
- Multi-master conflict rate looked low in synthetic tests and is high in production because real users hit the same hot keys. Need conflict detection and a clear policy for resolution.
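As promised in the first item, here is a lag-aware routing sketch, assuming PostgreSQL and psycopg2; the router class and DSNs are placeholders for whatever connection handling you already run:

```python
# Sketch: route reads to the replica only while its lag is under a
# threshold; otherwise fall back to the primary.
import psycopg2

LAG_THRESHOLD_S = 2.0

class LagAwareRouter:
    def __init__(self, primary_dsn: str, replica_dsn: str):
        self.primary_dsn = primary_dsn
        self.replica_dsn = replica_dsn

    def _replica_lag_s(self) -> float:
        # Caveat: on an idle primary this grows even with zero real lag,
        # because no new transactions arrive to be replayed.
        with psycopg2.connect(self.replica_dsn) as conn, conn.cursor() as cur:
            cur.execute(
                "SELECT COALESCE(EXTRACT(EPOCH FROM "
                "now() - pg_last_xact_replay_timestamp()), 0)"
            )
            return float(cur.fetchone()[0])

    def dsn_for_read(self) -> str:
        try:
            if self._replica_lag_s() <= LAG_THRESHOLD_S:
                return self.replica_dsn
        except psycopg2.Error:
            pass  # unreachable replica: fall through to the primary
        return self.primary_dsn

router = LagAwareRouter("host=primary dbname=app", "host=replica dbname=app")
conn = psycopg2.connect(router.dsn_for_read())  # reads see bounded staleness
```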
Closing
There is no 'best' replication strategy. There is only the strategy that aligns with which failures you can absorb and which ones would kill you. Most teams over-engineer for cross-region durability they do not need and under-engineer for regional disaster recovery they do. The fix is to write down your actual recovery requirements first, then pick the replication topology that meets them. Doing it the other way (pick a topology, hope it covers the requirements) is how outages get expensive.
Read more field notes, explore our services, or get in touch at info@bipi.in.