Uber-Style Location Indexing

A rider opens the Uber app and taps “Request Ride.” Within 2 seconds, the app shows 8 nearby drivers on the map. Behind the scenes, the system must answer: “Which available drivers are within 3km of latitude 40.7128, longitude -74.0060?” — and answer it in under 100ms, while simultaneously ingesting millions of GPS updates per second from drivers moving in real-time. This is a fundamentally different problem from static nearby search (Yelp restaurants don’t move). The write rate is extreme, the data is constantly stale, and reads must still be fast.

Access Patterns

PatternCharacteristics
WritesEvery active driver sends GPS coordinates every 4 seconds. 5 million concurrent drivers = ~1.25 million writes/second. Each write updates a single driver’s position.
ReadsRider requests nearby drivers. Need all available drivers within radius (e.g., 3km). Must return in <100ms. Read rate depends on active riders — typically 10-100K reads/second.
StalenessA driver’s stored position is already 0-4 seconds old when used. Acceptable — at city speeds (30 km/h), 4 seconds = 33 meters of error.

The write rate dominates. Any storage system for this problem must handle millions of point updates per second where each update invalidates the previous position.

Design: GeoHash + Redis Sorted Sets

Core Idea

  1. Hash each driver’s location to a GeoHash cell at a fixed precision
  2. Store drivers in Redis sorted sets, one per GeoHash cell
  3. To find nearby drivers: compute the 9 surrounding GeoHash cells, query all 9 sorted sets
flowchart TB
    subgraph Driver Updates
        D1([Driver 1
lat: 40.7128, lng: -74.006]) -->|every 4s| GW[Location Gateway] D2([Driver 2
lat: 40.7135, lng: -74.005]) -->|every 4s| GW D3([Driver 3
lat: 40.7580, lng: -73.985]) -->|every 4s| GW end GW -->|compute GeoHash| Enc[GeoHash Encode
precision=6] Enc -->|"dr5ru7 → driver_1"| R1[Redis: cell:dr5ru7
ZADD] Enc -->|"dr5ru7 → driver_2"| R1 Enc -->|"dr5rue → driver_3"| R2[Redis: cell:dr5rue
ZADD]

Write Path: Driver Location Update

When a driver sends a GPS update:

import geohash2 as geohash
import time

def update_driver_location(driver_id, lat, lng):
    new_cell = geohash.encode(lat, lng, precision=6)
    
    # Get driver's previous cell (stored in a separate hash)
    old_cell = redis.hget("driver:cell", driver_id)
    
    if old_cell and old_cell != new_cell:
        # Driver moved to a different cell — remove from old
        redis.zrem(f"cell:{old_cell}", driver_id)
    
    # Add/update driver in new cell's sorted set
    # Score = current timestamp (for TTL-based cleanup)
    redis.zadd(f"cell:{new_cell}", {driver_id: time.time()})
    
    # Remember which cell this driver is in
    redis.hset("driver:cell", driver_id, new_cell)
    
    # Store exact coordinates for distance calculation
    redis.hset(f"driver:loc:{driver_id}", mapping={
        "lat": lat, "lng": lng, "ts": time.time()
    })

Why sorted set with timestamp as score? It enables stale location cleanup (see below). The score is not used for ranking in queries — it’s used for TTL-like expiry.

Read Path: Find Nearby Drivers

When a rider requests nearby drivers:

def find_nearby_drivers(rider_lat, rider_lng, radius_km=3):
    # 1. Compute rider's GeoHash cell + 8 neighbors
    center = geohash.encode(rider_lat, rider_lng, precision=6)
    neighbors = geohash.neighbors(center)
    cells = [center] + neighbors  # 9 cells
    
    # 2. Fetch all drivers from these 9 cells
    candidates = []
    now = time.time()
    for cell in cells:
        # Get all drivers in this cell with timestamps
        drivers = redis.zrangebyscore(
            f"cell:{cell}",
            min=now - 30,  # only drivers updated in last 30 seconds
            max=now,
            withscores=True
        )
        candidates.extend(drivers)
    
    # 3. Fetch exact coordinates and compute distance
    nearby = []
    for driver_id, last_seen in candidates:
        loc = redis.hgetall(f"driver:loc:{driver_id}")
        dist = haversine(rider_lat, rider_lng, float(loc["lat"]), float(loc["lng"]))
        if dist <= radius_km:
            nearby.append({
                "driver_id": driver_id,
                "distance_km": round(dist, 2),
                "last_seen": last_seen
            })
    
    # 4. Sort by distance
    return sorted(nearby, key=lambda d: d["distance_km"])

GeoHash Precision Choice

The GeoHash precision must match the search radius. Too fine = too many cells to query. Too coarse = too many false positives to filter.

Search RadiusGeoHash PrecisionCell SizeCells QueriedResult
3 km5 (~5km cells)4.9 × 4.9 km9 cells = ~220 km²Too many false positives
3 km6 (~1km cells)1.2 × 0.6 km9 cells = ~6.5 km²Good balance
3 km7 (~150m cells)153 × 153 m9 cells = ~0.2 km²Misses drivers 200m-3km away

Precision 6 is the sweet spot for km-scale radius queries. For finer searches (within a building), use precision 7 or 8.

Stale Location Cleanup

When a driver closes the app or loses connectivity, their location becomes stale. Without cleanup, ghost drivers appear in search results.

Approach: use the sorted set scores (timestamps) for implicit TTL.

def cleanup_stale_drivers():
    """Run periodically or triggered per cell on read."""
    cutoff = time.time() - 30  # 30 seconds = stale
    
    # For each active cell, remove drivers with score < cutoff
    for cell_key in redis.scan_iter("cell:*"):
        removed = redis.zremrangebyscore(cell_key, "-inf", cutoff)

The read path already filters by timestamp (zrangebyscore min=now-30), so stale entries don’t appear in results even before cleanup runs. The cleanup job is for memory reclamation.

Alternative: driver heartbeat resets TTL. Each GPS update refreshes the score. If a driver’s score falls more than 30 seconds behind, they’re treated as offline.

H3: Uber’s Production System

Uber’s actual production system uses H3 — a hexagonal hierarchical spatial index — instead of GeoHash. The core design pattern (cell-based indexing + sorted sets) remains the same, but the cell geometry is better.

Why Hexagons?

GeoHash (rectangles):         H3 (hexagons):
┌────┬────┬────┐              ⬡ ⬡ ⬡ ⬡ 
│    │    │    │             ⬡ ⬡ ⬡ ⬡ ⬡
├────┼────┼────┤              ⬡ ⬡ ⬡ ⬡
│    │ ★  │    │             ⬡ ⬡ ★ ⬡ ⬡
├────┼────┼────┤              ⬡ ⬡ ⬡ ⬡
│    │    │    │             ⬡ ⬡ ⬡ ⬡ ⬡
└────┴────┴────┘              ⬡ ⬡ ⬡ ⬡

Rectangle neighbors: 8        Hexagon neighbors: 6
Corner neighbors are farther   All neighbors are equidistant
  than edge neighbors            from center
PropertyGeoHash (Rectangle)H3 (Hexagon)
Neighbor distanceCorner neighbors are √2× farther than edge neighborsAll 6 neighbors are equidistant from center
Cell area uniformityVaries with latitude (see GeoHash)Near-uniform globally (icosahedron projection)
Cells to search for radius9 (including corners that barely overlap)7 (center + 6 neighbors, all meaningful)
Resolution levelsDetermined by string length16 predefined resolutions (0-15)
ImplementationSimple string encodingLibrary required (h3-py, h3-java)
import h3

# Encode lat/lng to H3 cell at resolution 9 (~0.1 km² per cell)
cell = h3.latlng_to_cell(40.7128, -74.0060, res=9)
# cell = "892a1008003ffff"

# Get the ring of neighboring cells at distance 1
neighbors = h3.grid_disk(cell, k=1)
# Returns 7 cells: center + 6 surrounding hexagons

# For a larger radius, increase k
ring_2 = h3.grid_disk(cell, k=2)
# Returns 19 cells: center + 2 rings of hexagons

The rest of the architecture (Redis sorted sets per cell, write/read paths, stale cleanup) works identically — just swap GeoHash encoding for H3 encoding.

Full Architecture

flowchart TB
    subgraph Drivers
        D1([Driver Apps])
    end

    subgraph Riders
        R1([Rider App])
    end

    D1 -->|GPS every 4s| LG[Location Gateway
Kafka Producer] LG --> K[Kafka Topic:
driver-locations] K --> LW[Location Workers
Consume + GeoHash/H3] LW -->|update cells| Redis[(Redis Cluster
Sorted Sets per Cell)] LW -->|update exact coords| Redis R1 -->|"find nearby"| NS[Nearby Service] NS -->|query 9 cells| Redis NS -->|filter by distance| NS NS -->|ranked drivers| R1 subgraph Cleanup CJ[Cleanup Job] -->|remove stale| Redis end

Why Kafka in the Write Path?

Drivers produce ~1.25M location updates/second. Writing all of them directly to Redis risks overwhelming it during spikes. Kafka acts as a buffer:

  • Absorbs write spikes (surge events, concerts ending)
  • Enables multiple consumers: location index, trip tracking, analytics, surge pricing all consume the same stream
  • Provides replay: if the Redis cluster fails and rebuilds, replay the last 30 seconds of locations to repopulate

Redis Cluster Sharding

With millions of drivers, a single Redis node can’t hold all cells. Shard by cell key:

cell:dr5ru7 → Redis node A  (consistent hashing)
cell:dr5ru8 → Redis node B
cell:dr5rue → Redis node A

The 9-cell read query fans out to at most 9 different Redis nodes (usually fewer, since neighboring cells often hash to the same node). A scatter-gather pattern collects results from all nodes before merging and filtering.

⚠️

Hot cells: a stadium with 50,000 ride requests in 5 minutes means one GeoHash cell gets hammered. Mitigation: for hot cells, replicate the sorted set to multiple Redis nodes and load-balance reads across replicas. Alternatively, use a finer GeoHash precision for the hot area to spread load across more cells.

ℹ️

Interview tip: “Drivers send GPS every 4 seconds — that’s ~1.25M writes/sec for 5M concurrent drivers. I’d buffer through Kafka and have workers encode each update to a GeoHash cell (or H3 hex), storing driver IDs in Redis sorted sets keyed by cell. The score is the timestamp, which gives implicit TTL for stale cleanup. For ‘find nearby drivers,’ I compute the 9 surrounding cells, fetch from Redis, and post-filter by exact Haversine distance. In production, Uber uses H3 hexagons for uniform cell areas. Hot cells near stadiums or airports get replicated across Redis nodes to handle read spikes.”