Bloom Filter

System Design

Hashing

Bloom Filter

Scenario: You’re building a user registration system for a large-scale web service (e.g., Twitter or GitHub).

🤔 The Problem

When a new user enters a desired username:

You must check if it’s already taken.
The username store could contain millions of entries.
A direct DB lookup for every attempted username = unnecessary load.

❌ Without Optimization:

Most usernames may already exist, triggering expensive DB reads.
You want to avoid checking usernames that definitely don’t exist.

✅ With Bloom Filter:

Store all existing usernames in a Bloom filter.
When a new user signs up: Check the Bloom filter.
- If “definitely not present”: proceed with registration.
- If “might be present”: check the DB for confirmation.

This cuts down DB lookups drastically when users try creative or uncommon usernames.

Scenario: Caching with a Backend Database

Let’s say you’re building a large-scale service like YouTube or Facebook. You use a caching layer (like Redis or Memcached) in front of your persistent database to reduce query latency.

🤔 The Problem

Every time a client requests data:

Check Cache → If hit, return.
If Miss → Query database.

But here’s the catch:

Many cache misses go on to query the DB for items that don’t exist at all (e.g., someone asks for a video that’s been deleted).
These misses lead to wasted DB lookups, increasing load and latency.

✅ The Bloom Filter Solution

You use a Bloom filter to pre-check if an item is definitely not in the dataset:

If the Bloom filter says “no” ➜ skip cache and DB lookup entirely.
If it says “maybe” ➜ proceed to cache and DB.

This drastically reduces expensive DB hits for missing keys.

⚙️ How Bloom Filters Work

A Bloom filter is a probabilistic data structure that answers: “Is element x possibly in the set?”

Core Components:

A bit array of size m, initialized with all bits as 0.
k independent hash functions that map an element to k positions in the bit array.

✅ Insertion:

Hash the input using all k hash functions.
Set all the corresponding k bits in the array to 1.

❓ Query (Membership Check):

Hash the input using the same k functions.
Check if all the corresponding bits are 1.
- If any bit is 0 ➜ definitely not in the set.
- If all bits are 1 ➜ element may be in the set.

🔁 No Deletion (in standard Bloom filters)

To remove items, you’d need a Counting Bloom Filter, which uses counters instead of bits.

Example

Bit array size = 10, 3 hash functions: H1, H2, H3.
alice hashes to bits 1, 4, 7
bob hashes to bits 2, 4, 6
carol hashes to bits 0, 4, 7

graph
  subgraph Bit Array
    B0["0"]
    B1["1"]
    B2["0"]
    B3["0"]
    B4["1"]
    B5["0"]
    B6["0"]
    B7["1"]
    B8["0"]
    B9["0"]
  end
   %% Connections for Alice's Hashes (say H1=1, H2=4, H3=7)
  InsertAlice["Insert 'alice'"]
  InsertAlice -->|H1 = 1| B1
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7

graph
  subgraph Bit Array
    B0["0"]
    B1["1"]
    B2["1"]
    B3["0"]
    B4["1"]
    B5["0"]
    B6["1"]
    B7["1"]
    B8["0"]
    B9["0"]
  end
InsertBob["Insert 'bob'"]
  InsertBob -->|H1 = 2| B2
  InsertBob -->|H2 = 4| B4
  InsertBob -->|H3 = 6| B6

graph
  subgraph Bit Array
    B0["0"]
    B1["✅ 1"]
    B2["1"]
    B3["0"]
    B4["✅ 1"]
    B5["0"]
    B6["1"]
    B7["✅ 1"]
    B8["0"]
    B9["0"]
  end
  InsertAlice["Query 'alice'"]
  InsertAlice -->|H1 = 1| B1
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7

graph
  subgraph Bit Array
    B0["❌ 0"]
    B1["1"]
    B2["1"]
    B3["0"]
    B4["✅ 1"]
    B5["0"]
    B6["1"]
    B7["✅ 1"]
    B8["0"]
    B9["0"]
  end
  InsertAlice["Query 'Carol'"]
  InsertAlice -->|H1 = 0| B0
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7

📉 Probabilistic Nature

False negatives: Never happen. If it’s in the filter, it won’t say it’s not.
False positives: Can happen. The filter may wrongly say an element exists due to overlapping bits from different insertions.

Sample Implementation

import math
import hashlib
import bitarray  # Install via: pip install bitarray

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        """
        Initialize a Bloom filter with given capacity and desired error rate.
        """
        self.n = expected_items
        self.p = false_positive_rate

        # Calculate optimal size (m) and number of hashes (k)
        self.m = self.optimal_bit_array_size(self.n, self.p)
        self.k = self.optimal_num_hash_functions(self.m, self.n)

        self.bit_array = bitarray.bitarray(self.m)
        self.bit_array.setall(0)

    def _hashes(self, item):
        """
        Generate k hash values using double hashing.
        """
        item = item.encode('utf-8')
        h1 = int(hashlib.sha256(item).hexdigest(), 16)
        h2 = int(hashlib.md5(item).hexdigest(), 16)

        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        """
        Insert an item into the Bloom filter.
        """
        for idx in self._hashes(item):
            self.bit_array[idx] = 1

    def check(self, item):
        """
        Check if an item might be in the Bloom filter.
        Returns:
            True — Possibly present (may be false positive)
            False — Definitely not present
        """
        return all(self.bit_array[idx] for idx in self._hashes(item))

    @staticmethod
    def optimal_bit_array_size(n, p):
        """
        m = -(n * ln(p)) / (ln(2)^2)
        """
        return int(-n * math.log(p) / (math.log(2) ** 2))

    @staticmethod
    def optimal_num_hash_functions(m, n):
        """
        k = (m / n) * ln(2)
        """
        return int((m / n) * math.log(2))

✨ Best Practices Recap

Precompute capacity: (n) to avoid overfilling the filter, which increases false positives.
Choose optimal k: Use $$k = \frac{m}{n} \ln 2$$ to balance accuracy vs performance.
Use fast, uniform hash functions: e.g., MurmurHash, xxHash.
Watch load factor: Too many insertions = higher false positives.
Store Bloom filters in memory (e.g. Redis bloom filter module) for fast access in distributed systems.
Don’t use for critical exactness (e.g., login auth, password checks).
Combine with other structures: Use Bloom filters for a quick “definitely not here” and then fall back to DB/cache.

🔁 Alternatives

Structure	Supports Deletion?	Memory Efficient	False Positives	Notes
Bloom Filter	No	✅ Yes	✅ Yes	Fast, scalable, simple
Counting Bloom	✅ Yes	⚠️ Slightly More	✅ Yes	Use counters instead of bits
Cuckoo Filter	✅ Yes	✅ Yes	✅ Lower rate	Better for frequent insert/delete
Quotient Filter	✅ Yes	✅ Cache-friendly	✅ Yes	Better for cache locality, newer structure
HashSet/Trie	✅ Yes	❌ No	❌ No	Precise but space-heavy

Invertible Bloom Filter