Back-Pressure & Load Shedding

Back-Pressure & Load Shedding

A producer generates events faster than the consumer can process them. Without intervention, the in-between buffer grows without bound — first exhausting memory, then triggering OOM kills, cascading failures, or silent data loss when the buffer is eventually dropped. Back-pressure and load shedding are two complementary strategies for handling overload: back-pressure slows down the producer; load shedding drops excess work to protect the system.

Back-Pressure

Back-pressure is a flow control mechanism where a downstream component signals the upstream component to slow down. The signal propagates backward through the pipeline, reducing the input rate to match the processing capacity.

sequenceDiagram
    participant P as Producer (1000 msg/s)
    participant Q as Buffer (capacity: 500)
    participant C as Consumer (200 msg/s)

    Note over P,C: Without back-pressure
    P->>Q: 1000 msg/s
    Q->>C: 200 msg/s
    Note over Q: Buffer grows: 500 → 1300 → 2100 → OOM

    Note over P,C: With back-pressure
    P->>Q: 1000 msg/s
    Q-->>P: SLOW DOWN (buffer 80% full)
    P->>Q: 200 msg/s (matches consumer rate)
    Q->>C: 200 msg/s
    Note over Q: Buffer stable at ~400

TCP Flow Control

TCP has built-in back-pressure via the receive window. The receiver advertises how many bytes it can accept; the sender cannot exceed this. When the receiver’s buffer fills, the window shrinks to zero — the sender blocks.

Sender sends 64KB → Receiver buffer: 64KB / 128KB
Sender sends 64KB → Receiver buffer: 128KB / 128KB (full)
Receiver advertises window = 0 → Sender pauses
Receiver processes 32KB → Receiver advertises window = 32KB
Sender sends 32KB → flow continues

This is why a slow HTTP client causes the server’s write() call to block — TCP back-pressure propagates from the client through the kernel to the application.

Reactive Streams

The Reactive Streams specification standardizes back-pressure for async pipelines. The consumer requests N items from the producer — the producer sends at most N, then waits for the next request.

Subscriber → Publisher: request(10)     // I can handle 10 items
Publisher → Subscriber: onNext(item) × 10
Subscriber → Publisher: request(5)      // processed some, ready for 5 more
Publisher → Subscriber: onNext(item) × 5

Implementations: Java Flow (JDK 9+), Project Reactor (Flux/Mono), RxJava (Flowable), Akka Streams.

// Reactor example: consumer controls the pace
Flux.range(1, 1_000_000)
    .onBackpressureBuffer(1000, dropped -> log.warn("Dropped: {}", dropped))
    .publishOn(Schedulers.boundedElastic())
    .subscribe(item -> {
        process(item);  // slow consumer
    });

Message Queue Back-Pressure

Kafka: consumers control their own pace — they call poll() when ready. If a consumer falls behind, its lag grows but the broker is not affected. Back-pressure is implicit: the consumer simply doesn’t poll faster than it can process.

RabbitMQ: prefetch_count limits how many unacknowledged messages a consumer holds. The broker stops sending when the prefetch limit is reached — back-pressure from consumer to broker.

# RabbitMQ: consumer-level back-pressure
channel.basic_qos(prefetch_count=10)  # max 10 unacked messages

def callback(ch, method, properties, body):
    process(body)          # slow processing
    ch.basic_ack(method.delivery_tag)  # ack → broker sends next

channel.basic_consume(queue='tasks', on_message_callback=callback)

Load Shedding

When back-pressure alone cannot reduce the incoming rate (e.g., user-facing HTTP traffic — you can’t tell browsers to slow down), the system must shed load: intentionally reject or drop requests to protect core functionality.

flowchart TD
    R[Incoming Requests
5000 rps] --> AC{Admission Control} AC -->|Within capacity
3000 rps| Process[Process Request] AC -->|Over capacity
2000 rps| Shed[Reject: 503 + Retry-After] Process --> Response[200 OK]

Priority-Based Shedding

Not all requests are equal. Shed lowest-priority traffic first:

PriorityTraffic TypeShed First?
CriticalCheckout, payment, authenticationNever shed (last resort)
HighProduct page, search, cartShed under extreme overload
MediumRecommendations, reviews, analyticsShed early
LowHealth checks, background sync, telemetryShed first
def admission_control(request, current_load, capacity):
    priority = classify_priority(request)

    if current_load < capacity * 0.7:
        return ALLOW  # under 70% — allow everything

    if current_load < capacity * 0.85:
        if priority in ("low",):
            return REJECT  # shed low-priority
        return ALLOW

    if current_load < capacity * 0.95:
        if priority in ("low", "medium"):
            return REJECT  # shed low + medium
        return ALLOW

    # Over 95% — only critical traffic
    if priority != "critical":
        return REJECT
    return ALLOW

Where to Shed

Shed as early as possible in the request path — before the request consumes resources:

User → CDN → Load Balancer → API Gateway → Service → Database
                                  ↑
                        Shed here (cheapest rejection point)

API Gateway / reverse proxy: NGINX limit_req with burst and nodelay, Envoy’s adaptive concurrency limiter, AWS ALB connection limits.

Application-level: track in-flight request count; reject with 503 when count exceeds threshold. This is more precise than infrastructure-level shedding because it can account for request cost (a simple GET vs an expensive aggregation query).

CoDel (Controlled Delay)

A sophisticated load shedding algorithm used by Google’s Stubby/gRPC framework. Instead of shedding based on queue depth, CoDel tracks how long requests spend in the queue. If the minimum queue delay exceeds a threshold for a sustained period, CoDel starts dropping requests from the head of the queue.

Request enters queue at t=1000
Request dequeued at t=1015 → queue delay = 15ms

If min queue delay > 5ms for the last 100ms interval:
  → start dropping requests
  → drop rate increases over time until queue delay drops below 5ms

Why head-of-queue drops? The oldest request in the queue is the most likely to have already timed out at the client — returning a response for it is wasted work.

Back-Pressure vs Load Shedding

PropertyBack-PressureLoad Shedding
MechanismSlow down the producerDrop excess requests
Data lossNone — all messages eventually processedYes — dropped requests are lost
Works whenProducer is controllable (internal pipeline, message queue)Producer is uncontrollable (user HTTP traffic, external API)
Latency impactIncreases producer-side latency (waiting)Keeps latency low for admitted requests
Failure modeCascading slowdown if back-pressure propagates too farPartial unavailability (some users get 503)

In practice, both are used together:

  • Back-pressure within internal async pipelines (Kafka consumers, stream processing, database write queues)
  • Load shedding at the system boundary (API gateway, load balancer) for external traffic
ℹ️

Interview tip: When discussing system overload, say: “At the API gateway, I’d use load shedding with priority classification — health checks and analytics are shed first, checkout traffic is protected. Inside the system, the Kafka consumer uses back-pressure implicitly — it polls at its own pace, and consumer lag is monitored. If lag exceeds 5 minutes, we scale the consumer group horizontally. The key principle is: reject early at the boundary, propagate back-pressure through internal pipelines.” This shows you understand the distinction between controllable and uncontrollable traffic sources.

Test Your Understanding

Kafka consumer lag grows to 10 million messages. The consumer is processing but can’t keep up. What are your options?

Kafka has built-in back-pressure — the consumer polls at its own pace. But lag means it’s falling behind.

Options: (1) Scale the consumer group (add instances, up to partition count). (2) Optimize processing (batch DB writes, async I/O). (3) Increase poll batch size (max.poll.records). (4) For non-critical consumers, skip old messages and jump to latest offset. Don’t increase partition count mid-stream — re-hashes keys and breaks ordering.

API gateway receives 100K RPS. Backend handles 50K. Without load shedding, latency climbs to 30 seconds. How do you decide which 50K to accept?
Priority-based shedding. Classify by criticality: checkout/payment (P0, never shed) → product pages (P1, shed under severe overload) → recommendations/analytics (P2, shed first). Reject P2 with 503 + Retry-After when QPS exceeds capacity. A 503 in 1ms is better than a timeout in 30s.
A reactive pipeline uses back-pressure to slow the producer. But the producer is an external API receiving user requests — it can’t slow down. What do you do?
You can’t back-pressure uncontrollable sources. Use bounded buffering + load shedding: accept into a bounded queue, shed excess when the queue is full (503, drop, or sample). The buffer size = your latency tolerance. For user-facing APIs, a small buffer (or none — shed immediately at capacity) is usually better.