BIPI

Event-Driven Architecture: The Mistakes That Bite You at Year Three

Digital Engineering

Event-driven systems work on day one and fail on day 800. The mistakes are predictable: events as commands, no schema versioning, no idempotency, no replay. Here are the patterns that survive five years.

By Arjun Raghavan, Security & Systems Lead, BIPI · July 13, 2024 · 7 min read

#event-driven #architecture #distributed

Every event-driven system we audit at year three has the same five problems. The team that built it has either left or moved on. The events have grown organically. Nobody can answer 'what happens when we replay the last 24 hours of events' with confidence. The Kafka cluster has 400 topics, half of which are dead, and rebalancing takes an hour.

The architecture is not wrong. The execution accumulated debt that nobody amortized. Here are the mistakes we keep finding and the patterns that survive contact with five years of organic growth.

Mistake 1: events as commands

An event describes a fact that happened: OrderPlaced, PaymentCaptured, InventoryReserved. A command tells someone to do something: PlaceOrder, CapturePayment, ReserveInventory. Mixing these is the most common architectural failure we see, and it is fatal because it couples the producer to the consumer's behavior.

When OrderPlaced is an event, twelve services can subscribe and react in their own way. When OrderPlaced is actually 'PleaseChargeTheCard' in disguise, you have built RPC with extra steps. The producer now cares whether the consumer succeeded, retries become weird, and replaying the topic does interesting things to your customers' credit cards.

The rule we enforce: event names are past tense. Always. If a name does not naturally end in -ed, it is probably a command. Make it a command, send it through a different channel, and own that it is RPC.
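The past-tense rule is mechanical enough to lint in CI. A minimal sketch, assuming a naive check on the final CamelCase word (a real linter would use a verb dictionary; `is_event_name` is a hypothetical helper, not part of any library):

```python
# Flag names that read as instructions (PlaceOrder) rather than facts
# (OrderPlaced) by checking whether the last CamelCase word ends in "ed".
import re

def is_event_name(name: str) -> bool:
    """Return True if the last CamelCase word of `name` ends in 'ed'."""
    words = re.findall(r"[A-Z][a-z]*", name)
    return bool(words) and words[-1].endswith("ed")

assert is_event_name("OrderPlaced")
assert is_event_name("PaymentCaptured")
assert not is_event_name("PlaceOrder")      # a command in disguise
assert not is_event_name("CapturePayment")  # send this through an RPC channel
```

It will miss irregular verbs, but as a cheap CI gate it catches the common case: someone publishing a command to an event topic.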

Mistake 2: no schema versioning

Year one: the team agrees on a JSON shape and writes it in Confluence. Year two: someone adds a field. Year three: there are seven slight variants in production, three of which crash a downstream consumer if encountered. The team is now afraid to change the schema because they do not know who is reading it.

What works:

  • Avro or Protobuf with a schema registry, not raw JSON
  • Backward and forward compatibility checked in CI on every schema change
  • Schema versions in the event envelope, not negotiated implicitly
  • A 'schema owner' field so consumers know who to ask when something breaks
  • Deprecation windows of at least 90 days before removing a field
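To make the envelope points concrete, here is a sketch of an event envelope that carries the schema version and owner explicitly. Field names (`schema_version`, `schema_owner`, `event_id`) are illustrative, not a standard, and serialization is shown as JSON only for brevity; in production you would put the payload behind Avro or Protobuf and a registry as above:

```python
# An event envelope where the schema version travels with the event
# instead of being negotiated implicitly between producer and consumer.
from dataclasses import dataclass, field
import json, time, uuid

@dataclass
class EventEnvelope:
    event_type: str       # e.g. "OrderPlaced" -- past tense, always
    schema_version: int   # bumped on every compatible schema change
    schema_owner: str     # the team consumers page when parsing breaks
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    produced_at: float = field(default_factory=time.time)

    def to_bytes(self) -> bytes:
        return json.dumps(self.__dict__).encode()

env = EventEnvelope("OrderPlaced", 3, "checkout-team", {"order_id": "o-1"})
assert json.loads(env.to_bytes())["schema_version"] == 3
```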

Mistake 3: no idempotency

Kafka delivers at-least-once. Your consumer will see the same event twice. Sometimes ten times. If processing the event is not idempotent, you have built a system that occasionally double-charges customers and you will not find out until accounting reconciles three weeks later.

The pattern: every event has an event_id. Every consumer maintains a deduplication store keyed by (consumer_name, event_id) with a TTL longer than your maximum reasonable replay window. Postgres works fine for this at moderate scale. Redis works at higher scale. The TTL conversation is real: too short and you lose dedup on replay, too long and you have a giant table.
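The dedup store described above can be sketched in a few lines. SQLite stands in for Postgres here purely for illustration; `first_seen` and `sweep` are hypothetical names, and the TTL sweep would run as a scheduled job in practice:

```python
# Deduplication keyed by (consumer_name, event_id): first_seen() returns
# True only on the first delivery of a given event to a given consumer.
import sqlite3, time

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE processed (
    consumer_name TEXT, event_id TEXT, seen_at REAL,
    PRIMARY KEY (consumer_name, event_id))""")

def first_seen(consumer: str, event_id: str) -> bool:
    cur = db.execute(
        "INSERT OR IGNORE INTO processed VALUES (?, ?, ?)",
        (consumer, event_id, time.time()))
    db.commit()
    return cur.rowcount == 1  # one row inserted means first delivery

def sweep(ttl_seconds: float) -> None:
    """The TTL: drop entries older than the replay window."""
    db.execute("DELETE FROM processed WHERE seen_at < ?",
               (time.time() - ttl_seconds,))
    db.commit()

assert first_seen("billing", "evt-42") is True   # process the event
assert first_seen("billing", "evt-42") is False  # duplicate: skip it
```

Note that the insert and the business-side write should share a transaction in the real version; otherwise a crash between them reintroduces the double-charge.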

Mistake 4: no replay strategy

Eventually you will need to replay events. A bug got out, a downstream service was down, a compliance audit needs a re-derivation. The teams that have not designed for this end up doing it manually with bash scripts, getting it 80 percent right, and explaining to the regulator why one customer's history looks weird.

Design for replay from day one. That means:

  1. Event retention measured in months, not days, on critical topics
  2. Consumer offsets resettable to a timestamp without redeploying
  3. Idempotency that holds across the full retention window
  4. A documented replay runbook with named owners
  5. A tier-1 'replay this topic from this timestamp to that one' tool that someone has actually run in production
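The control flow of points 2 and 3 can be sketched without a broker. With Kafka the `since` timestamp maps to an offsets-for-times lookup followed by a seek; the in-memory log below is a toy stand-in that just shows how replay and idempotency interact:

```python
# Toy replay: re-deliver every event at or after a timestamp to an
# idempotent consumer, skipping events already recorded as processed.
from dataclasses import dataclass

@dataclass
class Event:
    event_id: str
    timestamp: float
    payload: dict

def replay(log: list, since: float, handle, seen: set) -> int:
    """Re-deliver events with timestamp >= since; dedup via `seen`."""
    delivered = 0
    for ev in log:
        if ev.timestamp >= since and ev.event_id not in seen:
            handle(ev)
            seen.add(ev.event_id)
            delivered += 1
    return delivered

log = [Event("e1", 100.0, {}), Event("e2", 200.0, {}), Event("e3", 300.0, {})]
seen = {"e2"}          # e2 was already processed before the outage
handled = []
n = replay(log, since=150.0, handle=handled.append, seen=seen)
assert n == 1 and handled[0].event_id == "e3"  # only e3 is re-processed
```

The sketch makes point 3 visible: replay is only safe because the dedup set survives across the whole retention window. Shrink the TTL below the window and `seen` forgets `e2`, and the replay re-fires it.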

Mistake 5: every team owns every topic

When ownership is fuzzy, the topic dies of neglect. Schema rot, stale consumers, unclear retention, no monitoring. We have audited Kafka clusters with 400 topics where the producer team genuinely could not name three of the consumers.

Topic ownership is a first-class artifact. Each topic has one owning team. The team is on the dashboard. The team is paged when consumer lag breaks SLO. The team owns schema changes. If a topic does not have a clear owner, it should not exist.
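One way to make ownership a first-class artifact is a registry that CI validates, so no topic ships without an owning team. The shape below is illustrative (teams often keep the equivalent in YAML next to the topic definitions); `unowned_topics` is a hypothetical check, not a real tool:

```python
# Topic ownership as a checked artifact: CI fails if any topic lacks
# an owning team, which forces the "who owns this?" conversation at
# creation time instead of at year three.
TOPIC_REGISTRY = {
    "orders.order-placed":      {"owner": "checkout-team", "retention_days": 180},
    "billing.payment-captured": {"owner": "billing-team",  "retention_days": 365},
}

def unowned_topics(registry: dict) -> list:
    """Return topics with no owning team; CI fails if any exist."""
    return [t for t, meta in registry.items() if not meta.get("owner")]

assert unowned_topics(TOPIC_REGISTRY) == []
assert unowned_topics({"dead.topic": {"owner": ""}}) == ["dead.topic"]
```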

Event-driven architecture is genuinely powerful. The systems that survive five years are not the ones with the cleverest pipelines. They are the ones where someone owned the operational discipline: schemas, idempotency, replay, ownership. That is the work most teams skip and pay for later.

Read more field notes, explore our services, or get in touch at info@bipi.in.