Event-Driven Architecture

Event-Driven Architecture (EDA) is a design paradigm where services communicate by producing and consuming events — immutable facts about things that have happened. Instead of Service A calling Service B directly (synchronous RPC), Service A publishes an event and moves on. Any number of services can react independently.

This decoupling is the foundation of scalable microservice systems: producers don’t know or care who consumes their events, and new consumers can be added without modifying the producer.

Events vs Commands

This distinction is fundamental and frequently tested in interviews.

|  | Event | Command |
| --- | --- | --- |
| Semantics | A fact — something that happened | A request — something that should happen |
| Tense | Past tense: OrderPlaced, PaymentCharged, UserRegistered | Imperative: PlaceOrder, ChargePayment, RegisterUser |
| Ownership | Owned by the producer (the service that knows the fact) | Owned by the consumer (the service that will execute it) |
| Coupling | Loose — producer doesn’t know who listens | Tight — sender knows the target service |
| Failure | Producer succeeds regardless of consumer failures | Sender must handle target failure (retry, timeout) |
| Mutability | Immutable — a fact cannot be changed after it happened | Mutable — can be retried, modified, cancelled |
| Fan-out | Natural — any number of consumers | Typically one target |
sequenceDiagram
    participant OS as Order Service
    participant K as Kafka / Event Bus
    participant PS as Payment Service
    participant IS as Inventory Service
    participant NS as Notification Service

    Note over OS,NS: Command-driven (tight coupling)
    OS->>PS: ChargePayment (command)
    PS->>OS: PaymentResult
    OS->>IS: ReserveInventory (command)
    IS->>OS: ReservationResult
    OS->>NS: SendConfirmation (command)

    Note over OS,NS: Event-driven (loose coupling)
    OS->>K: OrderPlaced (event)
    K->>PS: OrderPlaced → charges payment independently
    K->>IS: OrderPlaced → reserves inventory independently
    K->>NS: OrderPlaced → sends confirmation independently

In the command-driven model, the Order Service orchestrates every downstream step and must handle each service’s failures. In the event-driven model, the Order Service publishes one fact and downstream services react autonomously.
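The distinction can be made concrete in code. A minimal sketch (class and field names are illustrative, not from any framework):

```python
from dataclasses import dataclass

# An event is an immutable, past-tense fact owned by the producer.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    user_id: str
    total_cents: int

# A command is an imperative request owned by the service that executes it;
# it can carry retry state and be modified or cancelled before execution.
@dataclass
class ChargePayment:
    order_id: str
    amount_cents: int
    attempt: int = 1

event = OrderPlaced(order_id="o-1", user_id="u-42", total_cents=75000)
command = ChargePayment(order_id="o-1", amount_cents=75000)
command.attempt += 1      # commands are mutable: retries are normal
# event.total_cents = 0   # would raise FrozenInstanceError: facts don't change
```

The `frozen=True` flag is doing the conceptual work here: once a fact is constructed, no code path can alter it.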

⚠️

Events and commands are not mutually exclusive. Most real systems use both. Within a saga orchestrator, commands are appropriate (the orchestrator explicitly tells services what to do). Between bounded contexts, events are preferred (loose coupling). The question in an interview is always “what coupling do you want here?”

Pub/Sub Pattern

Publish/Subscribe is the messaging pattern that implements event-driven communication. Producers publish to a topic (or channel); consumers subscribe to topics they care about.

flowchart LR
    subgraph Producers
        A[Order Service]
        B[Payment Service]
        C[User Service]
    end

    subgraph Event Bus / Kafka
        T1[Topic: orders]
        T2[Topic: payments]
        T3[Topic: users]
    end

    subgraph Consumers
        D[Analytics]
        E[Search Indexer]
        F[Notification Service]
        G[Fraud Detection]
    end

    A --> T1
    B --> T2
    C --> T3

    T1 --> D
    T1 --> E
    T1 --> F
    T1 --> G
    T2 --> D
    T2 --> G
    T3 --> E
    T3 --> F

Key Properties

Producer ignorance: The Order Service publishes OrderPlaced without knowing that Analytics, Search, Notifications, and Fraud Detection all consume it. When the Fraud Detection team adds their consumer next quarter, zero changes are needed on the Order Service.

Consumer independence: Each consumer processes events at its own pace. If the Search Indexer falls behind (Elasticsearch is slow), it doesn’t affect Notifications or Analytics. Each consumer group tracks its own offset in Kafka.

Temporal decoupling: The producer and consumer don’t need to be running at the same time. Events are durably stored in the broker (Kafka, or SNS fanned out into SQS queues, which hold messages until consumed). A consumer that was down for maintenance catches up when it restarts.
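Producer ignorance and fan-out can be sketched with a toy in-memory bus (a stand-in for Kafka; the API is illustrative, not a real client library):

```python
from collections import defaultdict

class EventBus:
    """Toy pub/sub bus: topics map to lists of subscriber callbacks."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> [handler, ...]

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer never sees who is subscribed.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
seen = []
bus.subscribe("orders", lambda e: seen.append(("analytics", e)))
bus.subscribe("orders", lambda e: seen.append(("search", e)))
bus.publish("orders", {"type": "OrderPlaced", "orderId": "o-1"})
# Both consumers receive the same event; adding a third subscriber
# requires no change to the publishing code.
```

A real broker adds what this toy lacks: durable storage (temporal decoupling) and per-consumer-group offsets (consumer independence).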

Pub/Sub Implementations

| System | Model | Retention | Best For |
| --- | --- | --- | --- |
| Kafka | Persistent log, consumer groups | Configurable (days/forever) | High-throughput event backbone, replay |
| AWS SNS → SQS | SNS fans out to SQS queues | SQS: 14 days max | AWS-native, simple fan-out |
| Google Pub/Sub | Persistent, per-subscription delivery | 7 days default | GCP-native, auto-scaling |
| Redis Pub/Sub | Fire-and-forget (no persistence) | None — message lost if no subscriber is connected | Real-time notifications where loss is acceptable |
| RabbitMQ (fanout exchange) | Exchange copies to all bound queues | Until consumed | Low-latency, small-scale fan-out |

Event Sourcing

Event sourcing takes the event-driven idea to its extreme: instead of storing the current state of an entity, you store the complete sequence of events that led to that state. Current state is derived by replaying events.

Traditional CRUD vs Event Sourcing

Traditional (state-based):
  Account table: { id: 42, balance: 750, name: "Alice" }
  → You know the current balance but not HOW it got there

Event Sourcing:
  Event store for account:42:
    [1] AccountOpened    { balance: 0 }
    [2] MoneyDeposited   { amount: 1000 }
    [3] MoneyWithdrawn   { amount: 200 }
    [4] MoneyWithdrawn   { amount: 50 }
    → Current balance: replay → 0 + 1000 - 200 - 50 = 750
    → You know every transition that led to this state
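Replaying the stream above to derive current state can be sketched as a left fold over the events (event shapes are illustrative):

```python
def replay(events):
    """Derive current balance by folding over the event stream."""
    balance = 0
    for e in events:
        if e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
        # AccountOpened carries no balance change in this sketch
    return balance

stream = [
    {"type": "AccountOpened"},
    {"type": "MoneyDeposited", "amount": 1000},
    {"type": "MoneyWithdrawn", "amount": 200},
    {"type": "MoneyWithdrawn", "amount": 50},
]
balance = replay(stream)  # 0 + 1000 - 200 - 50 = 750
```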
sequenceDiagram
    participant C as Client
    participant S as Account Service
    participant ES as Event Store
    participant P as Projection (Read Model)

    C->>S: Withdraw $50 from account:42

    S->>ES: Load events for account:42
    ES->>S: [AccountOpened, Deposited(1000), Withdrawn(200)]

    S->>S: Replay events → current balance = 800
    S->>S: Validate: 800 ≥ 50 → OK

    S->>ES: Append: MoneyWithdrawn { amount: 50, accountId: 42 }

    ES->>P: New event → update read model
    P->>P: UPDATE balance = 750 WHERE id = 42

    C->>P: GET /accounts/42/balance
    P->>C: { balance: 750 }

Event Store

The event store is an append-only log where events are immutable. Once written, an event is never modified or deleted.

Event Store (per aggregate):

account:42 stream:
┌────────┬──────────────────┬───────────────────────────┬──────────────┐
│ seq    │ event_type       │ payload                   │ timestamp    │
├────────┼──────────────────┼───────────────────────────┼──────────────┤
│ 1      │ AccountOpened    │ { owner: "Alice" }        │ 2025-01-15   │
│ 2      │ MoneyDeposited   │ { amount: 1000 }          │ 2025-01-15   │
│ 3      │ MoneyWithdrawn   │ { amount: 200 }           │ 2025-02-01   │
│ 4      │ MoneyWithdrawn   │ { amount: 50 }            │ 2025-03-10   │
└────────┴──────────────────┴───────────────────────────┴──────────────┘

Concurrency control: optimistic locking on sequence number.
  Append event at seq=5 succeeds only if current seq is still 4.
  If another write happened first (seq=5 already exists) → conflict → retry.

Dedicated event stores: EventStoreDB, Axon Server. General-purpose alternatives: PostgreSQL with an append-only events table + sequence constraint, or Kafka topics with unlimited retention (note that log compaction keeps only the latest event per key, discarding history, so compacted topics are unsuitable for full event streams).
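The optimistic-concurrency check described above can be sketched with an in-memory stand-in for the event store (the API is illustrative):

```python
class ConflictError(Exception):
    """Another writer appended first; caller should reload and retry."""

class EventStore:
    def __init__(self):
        self._streams = {}  # stream_id -> list of events

    def append(self, stream_id, event, expected_seq):
        stream = self._streams.setdefault(stream_id, [])
        # Optimistic lock: the append succeeds only if no other write
        # has advanced the stream since we loaded it.
        if len(stream) != expected_seq:
            raise ConflictError(
                f"expected seq {expected_seq}, found {len(stream)}")
        stream.append(event)
        return len(stream)  # new sequence number

store = EventStore()
store.append("account:42", {"type": "AccountOpened"}, expected_seq=0)
store.append("account:42", {"type": "MoneyDeposited", "amount": 1000},
             expected_seq=1)
# A second writer that also loaded at seq=1 would now get ConflictError.
```

In PostgreSQL the same effect comes from a `UNIQUE (stream_id, seq)` constraint: the losing writer's INSERT violates it and retries.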

CQRS (Command Query Responsibility Segregation)

Event sourcing naturally splits reads and writes into separate models:

flowchart LR
    subgraph Write Side
        CMD[Command] --> AGG[Aggregate<br/>validates + emits events]
        AGG --> ES[(Event Store<br/>append-only)]
    end

    ES --> PROJ[Projector<br/>builds read models]

    subgraph Read Side
        PROJ --> RM[(Read Model DB<br/>denormalized, query-optimized)]
        Q[Query] --> RM
    end

| Side | Optimized For | Storage |
| --- | --- | --- |
| Write (Command) | Consistency, validation, event generation | Event store (append-only, sequential) |
| Read (Query) | Fast queries, complex aggregations | Denormalized tables, materialized views, Elasticsearch |

The read model is a projection — a materialized view derived from events. You can build multiple projections from the same event stream: one for the user-facing API, one for admin dashboards, one for analytics. Each is optimized for its specific query patterns.
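A projector is just another fold over the event stream, producing a query-optimized view instead of aggregate state. A minimal sketch (event shapes are illustrative):

```python
def project_balances(events):
    """Build a denormalized read model: account_id -> current balance."""
    read_model = {}
    for e in events:
        acct = e["accountId"]
        if e["type"] == "AccountOpened":
            read_model[acct] = 0
        elif e["type"] == "MoneyDeposited":
            read_model[acct] += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            read_model[acct] -= e["amount"]
    return read_model

events = [
    {"type": "AccountOpened", "accountId": 42},
    {"type": "MoneyDeposited", "accountId": 42, "amount": 1000},
    {"type": "MoneyWithdrawn", "accountId": 42, "amount": 250},
]
view = project_balances(events)  # {42: 750}
```

The same `events` list could feed a second projector building, say, a per-day transaction count for a dashboard; the projections are independent.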

Snapshots

For aggregates with many events (e.g., an account with 100,000 transactions), replaying all events on every command is expensive. Snapshots store the aggregate’s state at a point in time:

Events:    [1] [2] [3] ... [9999] [10000] [10001] [10002]
                                    ↑
                            Snapshot at seq 10000:
                            { balance: 42350, ... }

To load current state:
  1. Load snapshot (seq 10000, balance 42350)
  2. Replay only events 10001, 10002
  Instead of replaying all 10002 events
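The snapshot-plus-tail load can be sketched as follows (field names follow the diagram above and are illustrative):

```python
def load_state(snapshot, events_after_snapshot):
    """Restore state from the latest snapshot, then replay only the tail."""
    balance = snapshot["balance"]
    for e in events_after_snapshot:
        if e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

snapshot = {"seq": 10000, "balance": 42350}
tail = [
    {"type": "MoneyWithdrawn", "amount": 100},  # seq 10001
    {"type": "MoneyDeposited", "amount": 50},   # seq 10002
]
balance = load_state(snapshot, tail)  # 42350 - 100 + 50 = 42300
```

Snapshots are an optimization only: they can be deleted and rebuilt from the event stream at any time, so they are never the source of truth.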

Benefits of Event-Driven Architecture

Auditability

Every state change is recorded as an event. You have a complete, immutable audit trail without any additional logging infrastructure.

"Why does user:42 have balance $750?"
→ Replay event stream:
  AccountOpened(0) → Deposited(1000) → Withdrawn(200) → Withdrawn(50)
  Each event has timestamp, actor, metadata.
  Full provenance of every state transition.

Temporal Queries

With event sourcing, you can answer “what was the state at time T?” by replaying events up to T. A traditional state-based database can only tell you the current state unless you maintain separate history or audit tables.

"What was user:42's balance on February 15?"
→ Replay events with timestamp ≤ Feb 15:
  AccountOpened(0) → Deposited(1000) → Withdrawn(200)
  Balance was $800 on February 15.
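The temporal query above amounts to a replay with a timestamp cutoff. A sketch (event shapes are illustrative):

```python
from datetime import date

def balance_at(events, as_of):
    """Replay only events with timestamp <= as_of."""
    balance = 0
    for e in events:
        if e["ts"] > as_of:
            break  # the stream is stored in order; stop at the cutoff
        if e["type"] == "MoneyDeposited":
            balance += e["amount"]
        elif e["type"] == "MoneyWithdrawn":
            balance -= e["amount"]
    return balance

stream = [
    {"type": "AccountOpened", "ts": date(2025, 1, 15)},
    {"type": "MoneyDeposited", "amount": 1000, "ts": date(2025, 1, 15)},
    {"type": "MoneyWithdrawn", "amount": 200, "ts": date(2025, 2, 1)},
    {"type": "MoneyWithdrawn", "amount": 50, "ts": date(2025, 3, 10)},
]
feb_balance = balance_at(stream, date(2025, 2, 15))  # 800
```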

Easy Fan-Out

Adding a new consumer requires zero changes to existing services. The new team subscribes to the relevant topics and starts consuming.

Month 1: Order Service → Kafka → [Analytics, Search]
Month 3: Fraud team adds a consumer → [Analytics, Search, Fraud Detection]
Month 5: ML team adds a consumer → [Analytics, Search, Fraud, ML Training Pipeline]

Order Service code: unchanged through all of this.

Temporal Decoupling

Producers and consumers don’t need to be available simultaneously. Events are durably stored and consumed when the consumer is ready.

Challenges

Eventual Consistency

In EDA, the write (event published) and the read (projection updated) are not synchronous. There is always a window where the read model is stale.

T=0:    User places order → OrderPlaced event published
T=5ms:  Event reaches Kafka
T=50ms: Analytics consumer processes event → dashboard updated
T=200ms: Search consumer processes event → Elasticsearch updated

Between T=0 and T=200ms, searching for the order returns nothing.

Mitigation:

  • Accept the delay (most systems can tolerate 100ms–1s of staleness)
  • Return the write result directly to the user (read-your-writes on the write path)
  • Use causal consistency tokens for critical read-after-write scenarios

Schema Evolution

Events are immutable — you can’t change old events. But your schema will evolve. A consumer written today must handle events written a year ago.

Version 1 (Jan 2025):  OrderPlaced { orderId, userId, total }
Version 2 (Jun 2025):  OrderPlaced { orderId, userId, total, currency, discountCode }

Consumers must handle both versions.

Strategies:

| Strategy | How | Trade-off |
| --- | --- | --- |
| Additive-only changes | Only add new optional fields; never remove or rename | Safest; limits schema flexibility |
| Schema registry (Confluent) | Enforce compatibility rules (backward, forward, full) on Avro/Protobuf schemas | Requires infrastructure; enforces discipline |
| Upcasting | Consumer transforms old events to current schema on read | Logic lives in consumers; each consumer handles migration independently |
| Event versioning | OrderPlaced.v1, OrderPlaced.v2 — different event types | Explicit but proliferates event types |
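Upcasting can be sketched for the OrderPlaced example above. The chosen defaults (USD, no discount) are assumptions for illustration:

```python
def upcast_order_placed(event):
    """Transform a v1 OrderPlaced event into the current (v2) shape on read."""
    e = dict(event)                     # never mutate the stored event
    e.setdefault("currency", "USD")     # assumed default: v1 predates multi-currency
    e.setdefault("discountCode", None)  # assumed default: v1 had no discounts
    return e

v1 = {"orderId": "o-1", "userId": "u-42", "total": 99}
v2 = upcast_order_placed(v1)
# v2 now carries all v2 fields; downstream code handles one shape only.
```

Because the upcaster copies the event, the immutable stored record is untouched; only the in-memory representation is migrated.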

Debugging Complexity

In a synchronous system, a stack trace shows the full call chain. In EDA, there is no call stack — events flow asynchronously through topics and consumer groups.

Synchronous: Order → Payment → Inventory → Shipping
  One stack trace shows the entire flow. One log correlation ID.

Event-driven: OrderPlaced → (Kafka) → PaymentCharged → (Kafka) → InventoryReserved → (Kafka) → ...
  Each step is a separate process. No shared stack trace.
  Must correlate via traceId/correlationId in event metadata.

Mitigation:

  • Correlation ID: Every event carries a correlationId (set by the initial request) and a causationId (the event that caused this one). Distributed tracing tools (Jaeger, Zipkin) visualize the full flow.
  • Event catalog: A central registry documenting every event type, its schema, producers, and consumers. Without this, large EDA systems become opaque.
  • Dead Letter Queues: Events that fail processing N times go to a DLQ for inspection rather than blocking the consumer.
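A sketch of an event envelope carrying correlation metadata (the field names are common conventions, not a standard):

```python
import uuid

def make_event(event_type, payload, correlation_id=None, caused_by=None):
    """Wrap a payload in an envelope with tracing metadata."""
    return {
        "eventId": str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
        # Same correlationId for every event in one business flow.
        "correlationId": correlation_id or str(uuid.uuid4()),
        # causationId points at the specific event that triggered this one.
        "causationId": caused_by["eventId"] if caused_by else None,
    }

order_placed = make_event("OrderPlaced", {"orderId": "o-1"})
payment_charged = make_event(
    "PaymentCharged", {"orderId": "o-1"},
    correlation_id=order_placed["correlationId"],
    caused_by=order_placed,
)
# Filtering logs by correlationId reconstructs the whole flow;
# following causationId links reconstructs the causal chain.
```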

Ordering Across Topics

Events within a single Kafka partition are ordered, but events across different topics are not. If OrderPlaced and PaymentCharged are on separate topics, a consumer may see PaymentCharged before OrderPlaced.

Mitigation:

  • Put causally related events on the same topic + partition key
  • Use timestamps or sequence numbers to reorder at the consumer
  • Design consumers to handle out-of-order events (buffer and wait, or process idempotently)
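The idempotent-processing option can be sketched as a consumer that tracks processed event IDs, which also makes broker redelivery harmless (names are illustrative; a real system would persist the seen-ID set):

```python
class IdempotentConsumer:
    """Processing the same event twice has no additional effect."""

    def __init__(self):
        self._processed_ids = set()
        self.total = 0

    def handle(self, event):
        if event["eventId"] in self._processed_ids:
            return  # duplicate or redelivered event: skip
        self._processed_ids.add(event["eventId"])
        self.total += event["amount"]

consumer = IdempotentConsumer()
e = {"eventId": "e-1", "amount": 100}
consumer.handle(e)
consumer.handle(e)  # redelivered duplicate is ignored
```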

EDA Patterns Summary

| Pattern | When | How |
| --- | --- | --- |
| Simple pub/sub | Loose fan-out notification | Producer → Topic → N consumers |
| Event sourcing + CQRS | Full audit trail, temporal queries, separate read/write optimization | Event store → projections → read models |
| Saga (choreography) | Cross-service coordination without orchestrator | Services react to each other’s events |
| Saga (orchestration) | Complex multi-step workflows | Orchestrator sends commands, receives events |
| Event notification | “Something happened, look it up yourself” | Thin event (just ID + type), consumer fetches details via API |
| Event-carried state transfer | Consumer needs the data, not just a notification | Fat event (full entity payload), consumer doesn’t need to call back |
ℹ️

Interview framing: “I’d design this as an event-driven system. The Order Service publishes OrderPlaced to Kafka — that’s an immutable fact, not a command. Payment, Inventory, and Notifications each subscribe independently as separate consumer groups. This gives us loose coupling: adding Fraud Detection next quarter requires zero changes to the Order Service. The trade-off is eventual consistency — the search index may be a few hundred milliseconds behind — which is acceptable for this use case. Events carry a correlationId for distributed tracing across the async flow.”