Presence & Online Status
Your chat app shows a green dot next to each contact who is currently online. User A opens the app, sees 47 friends online, and starts a conversation. Meanwhile, user B’s phone loses signal in an elevator — 30 seconds later, their dot turns gray. This seems simple until you realize: 500 million users, each with hundreds of contacts, and the system must detect presence changes within seconds while not drowning the infrastructure in unnecessary updates.
Heartbeat-Based Presence Detection
The most common approach: each connected client sends a lightweight heartbeat message to the server at a fixed interval. If the server stops receiving heartbeats, it marks the user as offline after a grace period.
Client sends heartbeat every 5 seconds.
Server expects heartbeat within 15 seconds (3 missed heartbeats = offline).
Timeline:
t=0 heartbeat ✓ → online
t=5 heartbeat ✓ → online
t=10 heartbeat ✓ → online
t=15 (missed) → still online (grace period)
t=20 (missed) → still online (grace period)
t=25 (missed) → OFFLINE (3 missed = 15s since last heartbeat)Why Not Just Use WebSocket Connection State?
A WebSocket disconnect event seems like a natural signal. The moment the TCP connection drops, the server knows the user is gone. But in practice:
| Signal | Problem |
|---|---|
| TCP FIN received | Works for clean disconnects (user closes app). Does not work for abrupt network loss — TCP keepalive timeout is minutes, not seconds. |
| TCP RST received | Only if the OS sends a reset. Many mobile disconnects leave the TCP state dangling on the server for 30–120 seconds. |
| WebSocket close frame | Only sent on graceful shutdown. Crash, kill, or network loss = no close frame. |
Heartbeats solve all of these: if the client stops heartbeating for any reason — app crash, network loss, battery death — the server detects absence within the grace period.
Heartbeat Interval Trade-offs
| Interval | Grace period (3×) | Detection speed | Overhead per user |
|---|---|---|---|
| 3s | 9s | Fast | High (0.33 msg/s) |
| 5s | 15s | Moderate | Moderate (0.2 msg/s) |
| 10s | 30s | Slow | Low (0.1 msg/s) |
| 30s | 90s | Very slow | Minimal (0.03 msg/s) |
At 500M concurrent users with a 5s heartbeat: 100 million heartbeats per second. This is a significant load. The heartbeat must be as lightweight as possible — a single byte or a WebSocket ping frame, not a full JSON payload.
import time
import asyncio
class PresenceTracker:
"""Server-side presence tracking using heartbeat timestamps."""
def __init__(self, redis, heartbeat_timeout=15):
self.redis = redis
self.heartbeat_timeout = heartbeat_timeout
async def heartbeat(self, user_id: str):
"""Record heartbeat — set key with TTL so it auto-expires."""
key = f"presence:{user_id}"
now = int(time.time())
pipe = self.redis.pipeline()
pipe.set(key, now)
pipe.expire(key, self.heartbeat_timeout)
await pipe.execute()
async def is_online(self, user_id: str) -> bool:
"""Check if user has a non-expired presence key."""
return await self.redis.exists(f"presence:{user_id}") == 1
async def get_last_seen(self, user_id: str) -> int | None:
"""Return last heartbeat timestamp, or None if expired."""
val = await self.redis.get(f"presence:{user_id}")
return int(val) if val else NoneThe key insight: Redis TTL is the heartbeat timeout. If the user heartbeats within the TTL window, the key gets refreshed. If not, Redis automatically deletes the key — no background cleanup job needed. Checking if a user is online is a single EXISTS call — O(1).
Last-Seen Storage
Not every application needs real-time “green dot” presence. Many show “last seen 5 minutes ago” — a simpler problem that requires only storing the most recent activity timestamp.
# Redis hash: single key, one field per user
# More memory-efficient than individual keys for large user sets
await redis.hset("last_seen", user_id, int(time.time()))
# Read last seen
ts = await redis.hget("last_seen", user_id)
last_seen = datetime.fromtimestamp(int(ts)) if ts else None| Storage | Pros | Cons |
|---|---|---|
Redis key per user (presence:{user_id}) | TTL-based auto-expiry, O(1) lookup | Memory overhead per key (50–80 bytes overhead each) |
Redis hash (last_seen → user_id → timestamp) | Compact for millions of users, single key | No per-field TTL — need cleanup job for stale entries |
Database column (users.last_seen_at) | Persistent, queryable, no extra infra | Too slow for real-time heartbeat writes (disk I/O per heartbeat) |
Production pattern: Use Redis key-per-user with TTL for real-time presence detection, and asynchronously update a last_seen_at column in the database when the user goes offline (on TTL expiry via Redis keyspace notifications or a periodic sync job).
The Fan-Out Problem
When user A comes online, who needs to know? All of A’s contacts who are currently online and have the chat app open. For a user with 500 contacts, that’s potentially 500 notifications. For a celebrity with 10 million followers, it’s a storm.
Naive Approach: Broadcast to All Contacts
User A comes online
→ Fetch A's contact list: [B, C, D, E, ..., N] (500 contacts)
→ For each contact:
→ Check if contact is online
→ If online, find their WebSocket connection server
→ Push "A is now online" notificationProblem: This is O(contacts) work per status change. At 500M users each changing status several times per day, the volume of fan-out messages is enormous. Worse: users who rapidly toggle between online/offline (flapping) cause amplified fan-out.
Practical Solution: Subscribe on View, Not Globally
Instead of pushing presence to all contacts at all times, only push presence updates to users who are actively viewing a screen that shows the contact’s status.
sequenceDiagram
participant A as User A
participant S as Presence Service
participant PS as Pub/Sub Layer
participant B as User B
Note over B: B opens chat list
showing contacts including A
B->>S: Subscribe to presence of [A, C, D, ...]
S->>PS: Add B to channel "presence:A"
Note over A: A opens the app → heartbeat starts
A->>S: Heartbeat (online)
S->>PS: Publish "A: online" to channel "presence:A"
PS->>B: "A is now online"
Note over B: B navigates away from chat list
B->>S: Unsubscribe from presence of [A, C, D, ...]
S->>PS: Remove B from channel "presence:A"This limits fan-out to only the users who care right now — typically a small subset of any user’s contact list. Users who have the app in the background or are on a different screen don’t receive unnecessary updates.
Cross-Server Routing
With millions of WebSocket connections spread across hundreds of connection servers, user A’s status change on server 1 must reach user B on server 47. The presence service cannot maintain a direct mapping of every user to every server.
flowchart TB
subgraph Connection Layer
WS1[WebSocket Server 1
Users: A, X, Y]
WS2[WebSocket Server 2
Users: B, Z]
WS3[WebSocket Server 3
Users: C, D]
end
subgraph Pub/Sub Layer
R1[Redis Pub/Sub
or Kafka]
end
subgraph Presence Service
PS[Presence Tracker
Redis TTL keys]
end
WS1 -->|"A heartbeat"| PS
PS -->|"publish: A online"| R1
R1 -->|"A online"| WS2
R1 -->|"A online"| WS3
WS2 -->|"deliver to B
(subscribed to A)"| WS2
WS3 -->|"deliver to C
(subscribed to A)"| WS3How it works:
- Each WebSocket server subscribes to the pub/sub channels for the users whose presence its connected clients care about
- When the presence service detects a status change (online → offline or vice versa), it publishes to the user’s presence channel
- Only WebSocket servers with subscribers on that channel receive the message
- The receiving WebSocket server pushes the update to the specific connected clients
Redis Pub/Sub vs Kafka for Presence
| Criteria | Redis Pub/Sub | Kafka |
|---|---|---|
| Latency | Sub-millisecond | 5–50ms (batching) |
| Delivery | Fire-and-forget (no persistence) | Persistent log, replay possible |
| Missed messages | Lost if subscriber is down | Consumer reads from offset after recovery |
| Scale | Single Redis node: ~1M messages/s | Partitioned: millions of messages/s |
| Use for presence | Good — presence is ephemeral, loss is acceptable | Overkill unless you need audit log of status changes |
Redis Pub/Sub is the typical choice for presence: the data is ephemeral (a missed “online” notification is harmless — the client can poll on next screen load), and the latency is lower.
Flapping Prevention
A user on a flaky mobile connection might alternate between online and offline every few seconds. Without mitigation, this generates a storm of status change notifications.
Without debounce:
t=0 online → notify 200 subscribers
t=3 offline → notify 200 subscribers
t=5 online → notify 200 subscribers
t=8 offline → notify 200 subscribers
...
= 800 notifications in 10 seconds for one userSolution: Debounce offline transitions. When heartbeats stop, wait an extra grace period before declaring the user offline and notifying subscribers.
class DebouncedPresence:
"""Delay offline notification to prevent flapping."""
def __init__(self, redis, offline_delay=30, heartbeat_timeout=15):
self.redis = redis
self.offline_delay = offline_delay
self.heartbeat_timeout = heartbeat_timeout
async def heartbeat(self, user_id: str):
was_pending_offline = await self.redis.exists(
f"pending_offline:{user_id}"
)
pipe = self.redis.pipeline()
pipe.set(f"presence:{user_id}", int(time.time()))
pipe.expire(f"presence:{user_id}", self.heartbeat_timeout)
# Cancel any pending offline notification
pipe.delete(f"pending_offline:{user_id}")
await pipe.execute()
if was_pending_offline:
# User came back before offline was broadcast — no notification
pass
async def mark_pending_offline(self, user_id: str):
"""Called when heartbeat TTL expires (via keyspace notification).
Schedule offline notification after delay."""
await self.redis.set(
f"pending_offline:{user_id}", 1, ex=self.offline_delay
)
# When pending_offline TTL expires → truly offline → notify subscribers
async def confirm_offline(self, user_id: str):
"""Called when pending_offline TTL expires.
User did not heartbeat during the delay — genuinely offline."""
await self.publish_status_change(user_id, "offline")The pattern: heartbeat TTL expires → set pending_offline with a 30-second TTL → if heartbeat resumes, delete pending_offline (no notification sent) → if pending_offline expires, publish offline status.
Scaling Presence to Hundreds of Millions
Architecture Overview
flowchart TB
subgraph Clients
C1([Mobile App])
C2([Web App])
C3([Desktop App])
end
subgraph Edge
LB[Load Balancer
Sticky Sessions]
end
subgraph Connection Tier
WS1[WS Server 1]
WS2[WS Server 2]
WSN[WS Server N]
end
subgraph Presence Tier
PT[Presence Tracker
Redis Cluster]
end
subgraph Pub/Sub Tier
PS[Redis Pub/Sub
Sharded by user_id]
end
C1 & C2 & C3 --> LB
LB --> WS1 & WS2 & WSN
WS1 & WS2 & WSN -->|heartbeat| PT
PT -->|status change| PS
PS -->|fan-out| WS1 & WS2 & WSNScaling Dimensions
| Component | Scaling strategy |
|---|---|
| WebSocket servers | Horizontal — add more servers behind sticky load balancer; each holds N connections in memory |
| Presence store (Redis) | Redis Cluster with hash slots; shard by user_id; each shard handles ~1M users |
| Pub/Sub | Shard pub/sub channels across Redis instances by user_id hash; avoids single-node bottleneck |
| Heartbeat ingestion | WebSocket servers batch heartbeats locally (e.g., every 1s flush) before writing to Redis — reduces Redis ops |
Batching Heartbeats
At 500M users × 0.2 heartbeats/s = 100M Redis writes/s. A single Redis cluster cannot absorb this.
Local batching: each WebSocket server collects heartbeats from its local connections and flushes them in a single pipeline command every 1 second:
class HeartbeatBatcher:
def __init__(self, redis, flush_interval=1.0):
self.redis = redis
self.flush_interval = flush_interval
self.pending = set()
def record(self, user_id: str):
self.pending.add(user_id)
async def flush(self):
"""Called every flush_interval seconds."""
if not self.pending:
return
pipe = self.redis.pipeline()
now = int(time.time())
for user_id in self.pending:
key = f"presence:{user_id}"
pipe.set(key, now)
pipe.expire(key, 15)
await pipe.execute()
self.pending.clear()If a WebSocket server holds 100K connections, this reduces 20K individual Redis commands/s (100K × 0.2) down to a single pipeline of 100K commands every 1s — orders of magnitude fewer round-trips to Redis.
Consistency Trade-offs
Presence is one of the rare cases where eventual consistency is genuinely acceptable. A user appearing “online” for 15 extra seconds after closing the app is a negligible UX issue. A user appearing “offline” for 5 seconds after opening the app is similarly harmless.
| Consistency level | Behavior | Cost |
|---|---|---|
| Strong | All observers see the exact same status at the same instant | Requires distributed consensus per status change — unacceptable latency and throughput |
| Eventual (with bounded staleness) | Observers converge within heartbeat interval + propagation delay (~5–15s) | Pub/sub + Redis TTL — scalable, simple |
| Best-effort | Observers may miss transient status changes entirely | Fire-and-forget pub/sub — cheapest, acceptable for “last seen” |
Production systems (WhatsApp, Telegram, Discord) use eventual consistency with bounded staleness. Users tolerate a few seconds of stale presence. The system optimizes for throughput and simplicity over perfect accuracy.
Do not use strong consistency for presence. Requiring linearizable reads for “is user X online?” would force every status check through a consensus protocol — destroying throughput for a feature where staleness is perfectly acceptable. This is a textbook case of matching the consistency model to the business requirement.
Multi-Device Presence
Modern users are logged in on multiple devices simultaneously — phone, tablet, laptop. The presence system must handle this correctly.
Rule: A user is online if any of their devices is online. A user is offline only when all devices are offline.
async def heartbeat_multi_device(redis, user_id: str, device_id: str):
"""Track presence per device. User is online if any device is active."""
key = f"presence:{user_id}:devices"
await redis.hset(key, device_id, int(time.time()))
await redis.expire(key, 30) # overall key TTL as safety net
async def is_online_multi_device(redis, user_id: str, timeout=15) -> bool:
"""User is online if any device heartbeated within timeout."""
devices = await redis.hgetall(f"presence:{user_id}:devices")
now = int(time.time())
for device_id, last_seen in devices.items():
if now - int(last_seen) < timeout:
return True
return False
async def cleanup_stale_devices(redis, user_id: str, timeout=15):
"""Remove devices that haven't heartbeated recently."""
devices = await redis.hgetall(f"presence:{user_id}:devices")
now = int(time.time())
stale = [d for d, ts in devices.items() if now - int(ts) >= timeout]
if stale:
await redis.hdel(f"presence:{user_id}:devices", *stale)The per-device hash also enables richer status like “online on mobile” vs “online on desktop” — useful for apps like Slack that show device-specific indicators.
Comparison of Approaches
| Approach | Detection speed | Scalability | Complexity | Best for |
|---|---|---|---|---|
| WebSocket disconnect event | Unreliable (seconds to minutes) | High (no extra traffic) | Low | Supplement to heartbeats, not a replacement |
| Client heartbeat + Redis TTL | Configurable (5–30s) | High with batching + sharding | Moderate | Real-time presence (chat apps, collaboration tools) |
| Periodic poll (“pull” model) | Depends on poll interval (30–60s) | High (amortized) | Low | “Last seen” display, non-real-time presence |
| Pub/Sub fan-out | Near-instant propagation | Moderate (fan-out cost) | High | Live status updates to active viewers |
Interview tip: When designing presence in a system design interview, say: “Each client heartbeats every 5 seconds. I store the heartbeat as a Redis key with a 15-second TTL — when the key expires, the user is offline. For notifying contacts, I use pub/sub scoped to active viewers: when user B opens a chat screen showing A, B subscribes to A’s presence channel. This limits fan-out to only the users who care right now. For flapping prevention, I debounce offline transitions with a 30-second delay before broadcasting.” This covers detection, storage, fan-out optimization, and edge case handling — the four things interviewers evaluate on this topic.
Test Your Understanding
Your presence system uses a 5-second heartbeat with 15-second TTL. A user is in an elevator with intermittent connectivity — heartbeats succeed at t=0, fail at t=5 and t=10, succeed at t=15. The TTL expires at t=15 (last heartbeat at t=0 + 15s TTL). But the heartbeat at t=15 refreshes the key. Does the user ever appear offline?
No — it’s a race condition. The heartbeat at t=15 and the TTL expiry at t=15 race each other. If the heartbeat arrives before Redis expires the key, the user stays online. If Redis expires the key first (and you use keyspace notifications to trigger offline), the user briefly appears offline before the heartbeat re-creates the key.
In practice: Redis TTL expiry is lazy (checked on access) and active (sampled periodically). The heartbeat at t=15 will almost always win because it’s an explicit SET with a new TTL. But the 30-second offline debounce solves this entirely — even if the key expires, the pending_offline delay gives the next heartbeat time to cancel it.
User A has 10,000 contacts. A comes online and you use pub/sub to notify all contacts who are currently viewing their chat list. Worst case, 5,000 contacts have the app open. That’s 5,000 pub/sub deliveries for one status change. How does WhatsApp handle this at 2 billion users?
WhatsApp doesn’t push presence to 5,000 contacts. The “subscribe on view” pattern limits fan-out:
- When user B opens a chat with A, B subscribes to A’s presence channel. B’s contact list might show 200 contacts, but only the visible ones on screen (~15-20) have active subscriptions.
- Presence updates are not broadcast to all contacts — only to users with an active subscription. This reduces fan-out from O(contacts) to O(active_viewers), which is typically 5-50, not 5,000.
- For the contact list screen, presence is polled in batch (“are any of these 200 users online?”) on screen load, not streamed. Only when B opens a 1:1 chat does real-time presence subscription activate.
This distinction between polling for lists and subscribing for active chats is what makes presence scalable.
A server holding 100K WebSocket connections crashes. All 100K users’ heartbeats stop simultaneously. The presence system marks all of them offline and publishes 100K status-change events through pub/sub. This creates a thundering herd of notifications. How do you mitigate this?
Three layers of defense:
- Offline debounce (30 seconds). The crashed users’ TTLs expire, but
pending_offlinekeys delay the offline broadcast by 30 seconds. If the users reconnect to other servers within 30 seconds (via load balancer), the pending-offline is canceled — no notification sent. - Staggered TTL expiry. Not all 100K users heartbeated at the exact same moment — heartbeats are spread across the 5-second interval. TTLs expire over a 5-second window, not all at once.
- Batch offline processing. Instead of publishing 100K individual status changes, detect the server crash (via health check) and process the users in batches with rate limiting on the pub/sub layer. Alternatively, mark the server’s users as “status unknown” and let the debounce timer resolve each one.
The combination of debounce + reconnection window means most users never appear offline during a server crash — they reconnect to another server before the debounce expires.