Batch vs Streaming

Every night at 2 AM, a Spark job reads the day’s 500 million click events from S3, joins them with user profiles, computes per-user engagement scores, and writes the results to a data warehouse. The marketing team queries these scores at 9 AM to plan the day’s campaigns. This works — until the fraud team needs to block a stolen credit card within 5 seconds of the first suspicious transaction. They can’t wait until 2 AM. They need the same data, processed the same way, but in real-time. This tension — high-throughput scheduled processing vs. low-latency continuous processing — is the fundamental trade-off between batch and stream processing.

Bounded vs. Unbounded Data

The distinction isn’t really about “batch” and “stream” as technologies — it’s about the nature of the data:

flowchart LR
    subgraph "Bounded Data (Batch)"
        BD["Known start and end, e.g., yesterday's logs, a CSV export, last month's orders"]
        BD --> BP["Process entire dataset at once, on a schedule"]
        BP --> BR["Complete result available after job finishes"]
    end
    subgraph "Unbounded Data (Stream)"
        UD["No end — data arrives continuously forever, e.g., click events, GPS, sensor readings"]
        UD --> SP["Process each event as it arrives"]
        SP --> SR["Results available continuously, incrementally"]
    end
| Property | Bounded (Batch) | Unbounded (Stream) |
|---|---|---|
| Dataset size | Known and finite | Infinite and growing |
| When processing starts | After all data is collected (or on a schedule) | As soon as data arrives |
| Processing model | Read → Transform → Write (entire dataset) | Event arrives → Process → Emit (one at a time or micro-batch) |
| Result availability | After the job completes (minutes to hours) | Continuously (milliseconds to seconds) |
| Example | “Compute yesterday’s revenue” | “Alert if revenue drops 20% in the last 5 minutes” |
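
The same aggregation looks different against bounded and unbounded input. Here is a minimal, library-free Python sketch of the contrast; the function names are illustrative, not from any framework:

def count_clicks_batch(events: list[dict]) -> dict[str, int]:
    """Bounded: the whole dataset is in hand; produce one final result."""
    counts: dict[str, int] = {}
    for event in events:
        counts[event["url"]] = counts.get(event["url"], 0) + 1
    return counts  # complete answer, available once the loop finishes

def count_clicks_stream(events):
    """Unbounded: `events` never ends; emit an updated result per event."""
    counts: dict[str, int] = {}
    for event in events:  # this loop never terminates
        counts[event["url"]] = counts.get(event["url"], 0) + 1
        yield event["url"], counts[event["url"]]  # incremental, always current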

Batch Processing

The Problem It Solves

You have a large volume of historical data and need to compute aggregations, joins, transformations, or ML features over the entire dataset. Correctness and throughput matter more than latency. The results are consumed by downstream systems (dashboards, APIs, ML models) that don’t need real-time freshness.

How It Works: MapReduce to Spark

MapReduce (Hadoop) was the original distributed batch framework. It reads data from HDFS, processes it in two phases (Map and Reduce), and writes results back to HDFS.

flowchart LR
    subgraph "MapReduce Job: Count Page Views per URL"
        Input["Input Files (HDFS / S3)"] --> Map1["Mapper 1: emit (url, 1)"]
        Input --> Map2["Mapper 2: emit (url, 1)"]
        Input --> Map3["Mapper 3: emit (url, 1)"]
        Map1 & Map2 & Map3 --> Shuffle["Shuffle & Sort: group by url"]
        Shuffle --> Red1["Reducer 1: sum counts for urls A-M"]
        Shuffle --> Red2["Reducer 2: sum counts for urls N-Z"]
        Red1 & Red2 --> Output["Output Files (HDFS / S3)"]
    end
# Conceptual MapReduce for counting page views per URL

def mapper(line):
    """Map phase: parse log line, emit (url, 1)."""
    timestamp, user_id, url, status = line.split("\t")
    yield (url, 1)

def reducer(url, counts):
    """Reduce phase: sum all counts for this URL."""
    yield (url, sum(counts))

# MapReduce framework handles:
# 1. Distributing input splits across mapper tasks
# 2. Shuffling mapper output by key to the correct reducer
# 3. Calling reducer for each unique key
# 4. Writing results to output files

MapReduce’s problem: every job spills its shuffle data to disk between the Map and Reduce phases and writes its final output to HDFS. A multi-step pipeline (filter → join → aggregate → sort) becomes four chained jobs with roughly two disk writes each, about eight disk round-trips in total. This makes iterative algorithms (ML training, graph processing) extremely slow.

Apache Spark: In-Memory Batch

Spark solved MapReduce’s disk I/O problem by keeping intermediate data in memory across pipeline stages.

flowchart LR
    subgraph "Spark Pipeline (in-memory)"
        S3In[S3 / HDFS] --> RDD1["Stage 1: Filter (in memory)"]
        RDD1 --> RDD2["Stage 2: Join (in memory)"]
        RDD2 --> RDD3["Stage 3: Aggregate (in memory)"]
        RDD3 --> S3Out[S3 / Data Warehouse]
    end
    subgraph "MapReduce Pipeline (disk at every step)"
        H1[HDFS] --> MR1["Job 1: Filter → write to HDFS"]
        MR1 --> MR2["Job 2: Join → write to HDFS"]
        MR2 --> MR3["Job 3: Aggregate → write to HDFS"]
        MR3 --> H2[HDFS]
    end
# Spark equivalent: entire pipeline in one program
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailyEngagement").getOrCreate()

# Read yesterday's click events (bounded dataset)
clicks = (
    spark.read.parquet("s3://data-lake/clicks/date=2026-04-24/")
    .filter(F.col("status") == 200)
)

# Join with user profiles
users = spark.read.parquet("s3://data-lake/users/")

engagement = (
    clicks
    .groupBy("user_id")
    .agg(
        F.count("*").alias("click_count"),
        F.countDistinct("url").alias("unique_pages"),
        F.avg("dwell_time_ms").alias("avg_dwell_ms"),
    )
    .join(users, "user_id", "left")
    .withColumn("engagement_score",
        F.col("click_count") * 0.3
        + F.col("unique_pages") * 0.5
        + F.col("avg_dwell_ms") / 1000 * 0.2
    )
)

# Write to data warehouse
engagement.write.mode("overwrite").parquet(
    "s3://data-warehouse/engagement/date=2026-04-24/"
)

Batch Processing Characteristics

| Property | Value |
|---|---|
| Latency | Minutes to hours (job startup + full dataset scan) |
| Throughput | Very high — optimized for scanning TB/PB of data |
| Fault tolerance | Simple — if a task fails, rerun it from the input partition (input is immutable on disk) |
| Exactly-once | Trivial — rerun the failed task, overwrite the output partition |
| State management | Easy — state is the entire dataset; shuffle handles grouping |
| Resource usage | Bursty — cluster idle between runs, full utilization during runs |

Stream Processing

The Problem It Solves

You need to react to events as they happen: detect fraud in real-time, update a live dashboard, trigger a notification within seconds of an event, or maintain a continuously-updated materialized view. Waiting for a batch job is unacceptable.

How It Works

A stream processor consumes events from a source (Kafka topic, Kinesis stream), applies transformations event by event (or in small windows), and writes results to a sink (another Kafka topic, database, dashboard).

flowchart LR
    Source["Kafka Topic: click-events"] --> SP["Stream Processor (Flink / Kafka Streams)"]
    SP --> Sink1["Redis: real-time counters"]
    SP --> Sink2["Kafka Topic: enriched-events"]
    SP --> Sink3["Alert Service: fraud alerts"]
    style SP fill:#f96,stroke:#333,stroke-width:2px
# Conceptual stream processing: real-time fraud detection
# (pseudocode representing Flink/Kafka Streams logic;
#  emit_alert() and geodistance() are assumed helpers defined elsewhere)

class FraudDetector:
    """Process each transaction event as it arrives."""

    def __init__(self):
        # Per-user state: recent transaction history
        self.user_state = {}  # user_id → list of recent transactions

    def process_event(self, event: dict):
        user_id = event["user_id"]
        amount = event["amount"]
        location = event["location"]
        timestamp = event["timestamp"]

        history = self.user_state.get(user_id, [])

        # Rule 1: velocity check — more than 5 transactions in 1 minute
        recent = [t for t in history if timestamp - t["ts"] < 60]
        if len(recent) >= 5:
            emit_alert(user_id, "high_velocity", event)

        # Rule 2: impossible travel — transaction in NYC, then London
        #          within 1 hour
        if history:
            last = history[-1]
            distance = geodistance(last["location"], location)
            time_diff = timestamp - last["ts"]
            if distance > 5000 and time_diff < 3600:  # 5000km in <1hr
                emit_alert(user_id, "impossible_travel", event)

        # Rule 3: amount anomaly — 10x above user's average
        avg_amount = sum(t["amount"] for t in history) / max(len(history), 1)
        if amount > avg_amount * 10 and avg_amount > 0:
            emit_alert(user_id, "amount_anomaly", event)

        # Update state
        history.append({"ts": timestamp, "amount": amount,
                        "location": location})
        # Keep only last 1 hour of history
        self.user_state[user_id] = [
            t for t in history if timestamp - t["ts"] < 3600
        ]

Windowing: Aggregating Over Time

Batch processing aggregates over a complete dataset — easy because boundaries are known. Stream processing aggregates over time windows — harder because data arrives continuously and out of order.

flowchart TB
    subgraph "Tumbling Window (fixed, non-overlapping)"
        TW1["00:00–00:05, count=42"] --> TW2["00:05–00:10, count=38"]
        TW2 --> TW3["00:10–00:15, count=51"]
    end
    subgraph "Sliding Window (overlapping)"
        SW1["00:00–00:05, count=42"]
        SW2["00:01–00:06, count=45"]
        SW3["00:02–00:07, count=39"]
    end
    subgraph "Session Window (gap-based)"
        SS1["User A: 3 clicks within 30s gap"] --> Gap1["60s inactivity"]
        Gap1 --> SS2["User A: 5 clicks within 30s gap"]
    end
| Window type | How it works | Use case |
|---|---|---|
| Tumbling | Fixed-size, non-overlapping (e.g., every 5 min) | “Requests per 5-minute bucket” — simple counts |
| Sliding | Fixed-size, overlapping (e.g., 5 min window, slides every 1 min) | “Average latency over the last 5 minutes, updated every minute” |
| Session | Variable-size, defined by inactivity gap | “Group a user’s clicks into browsing sessions” |
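
To make the tumbling case concrete, here is a minimal sketch in plain Python (not any framework's API) of bucketing events into fixed 5-minute windows keyed by event time:

from collections import defaultdict

WINDOW_SIZE_S = 300  # 5-minute tumbling windows

windows: dict[int, int] = defaultdict(int)  # window start (epoch seconds) → count

def assign_tumbling_window(event_time_s: int) -> int:
    """Round the event time down to the start of its 5-minute window."""
    return event_time_s - (event_time_s % WINDOW_SIZE_S)

def process(event: dict) -> None:
    window_start = assign_tumbling_window(event["event_time_s"])
    windows[window_start] += 1

# A sliding window differs only in assignment: an event falls into every
# window whose range covers it, so it is counted once per overlapping window.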

The Late Data Problem

Events don’t always arrive in order. A mobile app sends an event at 10:00:00, but network delays mean it reaches Kafka at 10:00:07. If the 10:00:00–10:00:05 window has already been computed and emitted, this late event is missed.

Event time:      10:00:00  10:00:01  10:00:03  10:00:04  [window closes at 10:00:05]
Processing time: 10:00:00  10:00:01  10:00:03  10:00:04  10:00:05 → emit result

Late event arrives: event_time=10:00:02, processing_time=10:00:07
→ This event belongs to the already-closed window

Without watermarks: event is dropped → inaccurate count
With watermarks:    allow up to 10s of lateness → wait until 10:00:15
                    before finalizing the window → accurate but higher latency

Watermarks are the mechanism stream processors use to track progress in event time. A watermark of t=10:00:05 means “I believe all events with event_time ≤ 10:00:05 have arrived.” The processor can close windows up to that point. Late events beyond the watermark are either dropped or handled by a side output.
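
A simplified sketch of that mechanism in plain Python, assuming a watermark derived as "max event time seen minus allowed lateness" (real engines like Flink generate watermarks per source and propagate them through the dataflow; emit_result() and handle_late_event() are placeholders):

ALLOWED_LATENESS_S = 10
WINDOW_SIZE_S = 5

max_event_time_seen = 0            # drives the watermark
open_windows: dict[int, int] = {}  # window start → count

def on_event(event_time_s: int) -> None:
    global max_event_time_seen
    window_start = event_time_s - (event_time_s % WINDOW_SIZE_S)

    watermark = max_event_time_seen - ALLOWED_LATENESS_S
    if window_start + WINDOW_SIZE_S <= watermark:
        handle_late_event(event_time_s)  # window already finalized: side output or drop
        return

    open_windows[window_start] = open_windows.get(window_start, 0) + 1
    max_event_time_seen = max(max_event_time_seen, event_time_s)

    # Close (emit and remove) every window now entirely behind the watermark.
    new_watermark = max_event_time_seen - ALLOWED_LATENESS_S
    for start in [s for s in open_windows if s + WINDOW_SIZE_S <= new_watermark]:
        emit_result(start, open_windows.pop(start))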

Exactly-Once in Stream Processing

The Problem

A stream processor reads an event from Kafka, updates an internal counter, and writes the result to a sink. If the processor crashes after updating the counter but before committing the Kafka offset, it will re-process the event on restart — the counter is incremented twice.

Normal:    read event → update state → commit offset → ✓
Crash:     read event → update state → CRASH → restart
Restart:   read event (again, old offset) → update state (again) → double count!

The Solution: Checkpointing + Transactional Sinks

Stream processors like Flink solve this with coordinated checkpoints: periodically snapshot the processor’s internal state and the current Kafka offsets atomically. On recovery, restore from the last checkpoint — both state and offset are consistent.

sequenceDiagram
    participant K as Kafka
    participant F as Flink Processor
    participant S as State Store
    participant Sink as Output Sink

    F->>K: Read event (offset=100)
    F->>S: Update state (count=42)
    F->>Sink: Write result

    Note over F: Checkpoint triggered
    F->>S: Snapshot state (count=42)
    F->>K: Record offset=100 in checkpoint

    F->>K: Read event (offset=101)
    F->>S: Update state (count=43)
    Note over F: CRASH before next checkpoint

    Note over F: Recovery: restore checkpoint
    F->>S: Restore state (count=42)
    F->>K: Resume from offset=100
    F->>K: Re-read event (offset=100) → re-process → count=42+1=43 ← correct
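
The sequence above boils down to one rule: the offset and the state must be snapshotted together, so recovery can never see one without the other. A stripped-down sketch of that idea (not Flink's actual API; the checkpoint path and read_event() are placeholders):

import json
import os

CHECKPOINT_PATH = "/tmp/checkpoint.json"  # illustrative; real engines use durable storage

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return json.load(f)
    return {"offset": 0, "count": 0}

def save_checkpoint(offset: int, count: int) -> None:
    # Write-then-rename so the (offset, state) pair lands on disk atomically.
    tmp = CHECKPOINT_PATH + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"offset": offset, "count": count}, f)
    os.replace(tmp, CHECKPOINT_PATH)

def run(read_event):
    """`read_event(offset)` stands in for reading one event from the log."""
    state = load_checkpoint()
    offset, count = state["offset"], state["count"]
    while True:
        event = read_event(offset)   # after a crash, re-reads from the restored offset
        count += 1                   # update state
        offset += 1
        if offset % 1000 == 0:       # periodic checkpoint, like Flink's interval
            save_checkpoint(offset, count)
        # A crash between checkpoints rolls both offset and count back together,
        # so re-processed events are reflected exactly once in the restored state.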

Stream Processing Characteristics

| Property | Value |
|---|---|
| Latency | Milliseconds to seconds |
| Throughput | High — but less than batch for the same data volume (per-event overhead) |
| Fault tolerance | Complex — checkpointing, exactly-once requires coordination |
| Exactly-once | Hard — needs checkpointed state + transactional sinks |
| State management | Complex — must handle per-key state, expiry, and recovery |
| Resource usage | Continuous — cluster runs 24/7, must handle traffic peaks |

Lambda Architecture

The Problem It Solves

You want both: accurate historical analytics from batch processing and low-latency results from stream processing. The batch layer gives you correctness (it can reprocess the entire history), and the speed layer gives you freshness (results within seconds).

How It Works

flowchart TB
    Source["Event Source (Kafka / Kinesis)"] --> Batch["Batch Layer (Spark / Hadoop)"]
    Source --> Speed["Speed Layer (Flink / Storm)"]
    Batch -->|"accurate, delayed (runs hourly/nightly)"| Serving["Serving Layer: merge results"]
    Speed -->|"approximate, real-time (continuous)"| Serving
    Serving --> Query["Query API: Dashboard / Application"]
Example: real-time view counts + accurate daily totals

Speed layer (stream):
  - Counts views in real-time as events arrive
  - Result: "Video X has ~1,042,387 views" (approximate, up-to-date)
  - May have double-counts from retries or missed events

Batch layer (nightly):
  - Reprocesses all events from the raw log
  - Result: "Video X had exactly 1,042,351 views yesterday" (precise)
  - Overwrites the speed layer's approximation for completed time ranges

Serving layer merges:
  - For yesterday: use batch result (1,042,351) ← authoritative
  - For today (in progress): use speed layer result ← approximate but fresh
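
The serving-layer merge from this example, sketched in Python. The in-memory stores and the "completed day" cutoff are assumptions for illustration; in practice these would be warehouse and key-value lookups:

from datetime import date

def get_view_count(video_id: str, day: date,
                   batch_store: dict, speed_store: dict) -> int:
    """Prefer the batch result for completed days; fall back to the speed layer."""
    if day < date.today():
        # A finished day should have an authoritative batch count; if the nightly
        # job has not landed yet, fall back to the speed layer's approximation.
        return batch_store.get((video_id, day),
                               speed_store.get((video_id, day), 0))
    # Today is still in progress: only the speed layer has anything.
    return speed_store.get((video_id, day), 0)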

Why Lambda Is Problematic

The operational cost of Lambda:

  1. You write EVERY piece of business logic TWICE:
     - Once in Spark (Python/Scala, batch semantics)
     - Once in Flink (Java, streaming semantics)

  2. You maintain TWO codebases that must produce the SAME results

  3. When a bug is found:
     - Fix it in the batch code
     - Fix it in the streaming code (different language, different API)
     - Verify both produce the same output
     - Deploy both independently

  4. Two sets of infrastructure to operate, monitor, and scale

  Result: 2× engineering cost for every feature
| Advantage | Disadvantage |
|---|---|
| Batch layer serves as a “source of truth” that corrects stream approximations | Two codebases to maintain with identical logic |
| Stream layer provides real-time freshness | Different languages/frameworks for batch and stream |
| Batch can reprocess history when logic changes | Serving layer must merge two result sets — complex |
| Simpler to reason about batch correctness | Operational overhead of two separate systems |

Kappa Architecture

The Problem It Solves

Lambda’s core pain is maintaining two codebases. Kappa eliminates the batch layer entirely: one streaming pipeline handles both real-time processing and historical reprocessing. If the processing logic changes, you replay the event log from the beginning through the new version of the streaming pipeline.

How It Works

flowchart TB
    Source["Event Log: Kafka with long retention (30 days / infinite)"] --> V2["Stream Processor v2 (current logic)"]
    V2 -->|"real-time results"| Serving["Serving Layer: single source of truth"]
    Serving --> Query[Query API]
    subgraph "Reprocessing (when logic changes)"
        Source -->|"replay from offset 0"| V3["Stream Processor v3 (new logic)"]
        V3 -->|"rebuild output"| NewServing[New Serving Table]
    end
    NewServing -.->|"swap when caught up"| Serving
Kappa reprocessing workflow:

  1. Current pipeline (v2) processes events in real-time → writes to output_v2
  2. Business logic changes → deploy new pipeline (v3)
  3. v3 starts consuming from the BEGINNING of the Kafka log
  4. v3 processes historical + real-time events → writes to output_v3
  5. When v3 catches up to real-time: swap serving table from output_v2 → output_v3
  6. Shut down v2

  Result: one codebase, one framework, reprocessing via replay
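
One way step 3 might look with a plain Kafka consumer (a sketch using the confluent_kafka client; the broker address, topic, and group id are illustrative). Giving v3 a fresh consumer group with auto.offset.reset=earliest makes it start from the oldest retained offset:

from confluent_kafka import Consumer

# v3 uses a NEW consumer group, so "earliest" means the start of the retained log.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",      # illustrative address
    "group.id": "engagement-pipeline-v3",   # fresh group → no committed offsets yet
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})
consumer.subscribe(["click-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    process_with_v3_logic(msg.value())  # placeholder for the new pipeline's logic
    # Results go to output_v3; the serving table is swapped once consumer lag ≈ 0.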

Prerequisites for Kappa

Kappa only works if:

  1. The event log is retained long enough — Kafka must retain events for the full reprocessing window (30 days, 1 year, or infinite with tiered storage)
  2. Replay throughput is sufficient — the stream processor must be able to catch up from history within a reasonable time. Processing 1 year of events might take hours even at 10× speed
  3. Processing logic is deterministic — replaying the same events must produce the same results (no external API calls during processing that might return different results)
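
The determinism requirement in practice: enrichment must come from the event itself or from reference data keyed by event time, never from a live lookup whose answer changes over time. A hypothetical illustration (fx_service is an invented stand-in for any external API):

# NON-deterministic: replaying a year later asks today's FX service about a
# year-old transaction, so the rebuilt output differs from the original run.
def enrich_bad(event: dict) -> dict:
    rate = fx_service.get_current_rate(event["currency"])  # hypothetical live call
    return {**event, "amount_usd": event["amount"] * rate}

# Deterministic: the rate valid at event time is carried in the event (or read
# from an as-of lookup keyed by event time), so replay reproduces the result.
def enrich_good(event: dict) -> dict:
    return {**event, "amount_usd": event["amount"] * event["fx_rate_at_event_time"]}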

Lambda vs Kappa

| Property | Lambda | Kappa |
|---|---|---|
| Codebases | Two (batch + stream) | One (stream only) |
| Reprocessing | Rerun batch job on historical data | Replay event log through new stream processor |
| Correctness guarantee | Batch layer is the “source of truth” | Stream processor must be correct — no batch safety net |
| Operational complexity | High — two systems, merging layer | Lower — one system, but replay infrastructure needed |
| Event log requirement | Short retention OK (batch reads from data lake) | Long retention required (Kafka stores all history) |
| Reprocessing speed | Fast (Spark reads Parquet from S3 — optimized for bulk) | Slower (stream processor replays sequentially) |
| Best for | When batch and stream have fundamentally different requirements | When one pipeline can serve both real-time and historical |
⚠️

Kappa doesn’t mean “no batch at all.” Some workloads are inherently batch: ML model training on historical data, one-time data migrations, complex graph algorithms. Kappa means the primary serving pipeline is stream-based with replay capability — not that batch processing is banned from the entire organization.

When to Choose Each

flowchart TB
    Start([Data Processing Decision]) --> Q1{Latency requirement?}
    Q1 -->|"Minutes to hours is acceptable"| Batch["Batch Processing (Spark / Hadoop / dbt)"]
    Q1 -->|"Seconds to milliseconds needed"| Q2{Need historical reprocessing?}
    Q2 -->|"Yes — logic changes frequently"| Q3{Team can maintain two codebases?}
    Q2 -->|"No — real-time only"| Stream["Stream Processing (Flink / Kafka Streams)"]
    Q3 -->|"Yes — separate batch + stream teams"| Lambda[Lambda Architecture]
    Q3 -->|"No — one team, one codebase"| Kappa[Kappa Architecture]
    style Batch fill:#bfb
    style Stream fill:#bfb
    style Lambda fill:#ffb
    style Kappa fill:#bfb

Decision Matrix

| Use case | Latency | Best approach | Why |
|---|---|---|---|
| Nightly revenue report | Hours | Batch (Spark) | Full-table aggregation, correctness matters, no urgency |
| ML model training | Hours | Batch (Spark) | Reads terabytes of historical features, iterative algorithms |
| ETL to data warehouse | Hours | Batch (Spark / dbt) | Transform and load bounded datasets on a schedule |
| Fraud detection | Milliseconds | Stream (Flink) | Must block transaction before it completes |
| Real-time dashboard | Seconds | Stream (Flink / KStreams) | Live metrics updated every few seconds |
| Recommendation feed refresh | Minutes | Micro-batch (Spark Streaming) | Near-real-time but can tolerate ~1 min delay |
| Event-driven microservices | Milliseconds | Stream (Kafka Streams) | React to domain events as they occur |
| Log aggregation + alerting | Seconds | Stream → batch | Stream for alerting, batch for historical analysis |
| Search index updates | Seconds | Stream (CDC → Flink → ES) | Keep Elasticsearch in sync with DB changes |
| Ad click attribution | Mixed | Lambda or Kappa | Real-time counts for billing + batch for accurate reconciliation |

Real-World Architecture: Combining Both

Most production systems don’t choose one — they use both batch and stream processing for different parts of the pipeline:

flowchart TB
    subgraph "Event Sources"
        App[App Events] --> Kafka[Kafka]
        DB[(Database)] -->|CDC| Kafka
    end

    subgraph "Stream Processing (real-time)"
        Kafka --> Flink[Apache Flink]
        Flink --> Redis[Redis
real-time counters] Flink --> ES[Elasticsearch
live search index] Flink --> Alerts[Alert Service
fraud, anomalies] end subgraph "Batch Processing (scheduled)" Kafka -->|"sink to S3"| S3[(S3 Data Lake
Parquet files)] S3 --> Spark[Apache Spark
nightly jobs] Spark --> DW[(Snowflake / BigQuery
data warehouse)] Spark --> ML[ML Training
feature pipelines] end subgraph "Serving" Redis --> API[API / Dashboard] DW --> API ES --> API end

The event log (Kafka) is the shared backbone. Stream processors consume in real-time. Batch processors consume the same events archived to S3 in Parquet format. Both produce outputs that feed into the serving layer. This is not Lambda — there’s no merging of batch and stream results for the same query. Instead, batch and stream serve different use cases from the same data source.

Head-to-Head Comparison

| Property | Batch | Stream | Micro-Batch |
|---|---|---|---|
| Latency | Minutes–hours | Milliseconds–seconds | Seconds–minutes |
| Throughput | Highest (bulk I/O) | High (per-event overhead) | High (small bulk I/O) |
| Exactly-once | Trivial (rerun failed task) | Hard (checkpointing + transactional sinks) | Moderate |
| State management | Shuffle-based (no long-lived state) | Per-key state stores, watermarks, expiry | Per-batch state |
| Fault tolerance | Rerun from input partition | Checkpoint + replay from offset | Rerun micro-batch |
| Late data handling | N/A (all data present) | Watermarks, allowed lateness, side outputs | Per-batch cutoff |
| Reprocessing | Rerun job on historical data | Replay event log (if retained) | Rerun on historical micro-batches |
| Cost model | Pay for compute during job runs | Pay for 24/7 running cluster | Pay for frequent small jobs |
| Tools | Spark, Hadoop, dbt, Presto | Flink, Kafka Streams, Storm | Spark Structured Streaming |
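
Micro-batch sits between the other two columns, and Spark Structured Streaming is its usual home. A sketch, assuming an illustrative topic name, broker address, and event schema, of a 5-minute windowed count over a Kafka source emitted roughly once a minute:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ClicksPerUrl").getOrCreate()

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "click-events")
    .load()
    .select(F.from_json(F.col("value").cast("string"),
                        "url STRING, event_time TIMESTAMP").alias("e"))
    .select("e.*")
)

counts = (
    clicks
    .withWatermark("event_time", "10 minutes")             # allowed lateness
    .groupBy(F.window("event_time", "5 minutes"), "url")   # tumbling windows
    .count()
)

query = (
    counts.writeStream
    .outputMode("update")
    .trigger(processingTime="1 minute")   # each micro-batch runs about once a minute
    .format("console")                    # sink is illustrative
    .start()
)
query.awaitTermination()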
ℹ️

Interview tip: When data processing comes up in a system design interview, say: “I’d use batch for scheduled analytics and ML training — a Spark job reads from the data lake nightly, computes aggregates, and writes to the warehouse. For real-time requirements like fraud detection or live dashboards, I’d use stream processing with Flink consuming from Kafka — each event is processed with sub-second latency. Both read from the same Kafka event log: streams in real-time, batch from Parquet archives on S3. I wouldn’t use Lambda architecture because maintaining the same logic in two frameworks doubles engineering cost. If I need historical reprocessing, I’d take the Kappa approach — replay the Kafka log through a new version of the stream processor and swap the serving table when it catches up.”