Bloom Filter

Scenario: You’re building a user registration system for a large-scale web service (e.g., Twitter or GitHub).

πŸ€” The Problem

When a new user enters a desired username:

  • You must check if it’s already taken.
  • The username store could contain millions of entries.
  • A direct DB lookup for every attempted username = unnecessary load.

❌ Without Optimization:

  • Most usernames may already exist, triggering expensive DB reads.
  • You want to avoid checking usernames that definitely don’t exist.

βœ… With Bloom Filter:

  • Store all existing usernames in a Bloom filter.
  • When a new user signs up: Check the Bloom filter.
    • If β€œdefinitely not present”: proceed with registration.
    • If β€œmight be present”: check the DB for confirmation.

This cuts down DB lookups drastically when users try creative or uncommon usernames.

Scenario: Caching with a Backend Database

Let’s say you’re building a large-scale service like YouTube or Facebook. You use a caching layer (like Redis or Memcached) in front of your persistent database to reduce query latency.

πŸ€” The Problem

Every time a client requests data:

  1. Check Cache β†’ If hit, return.
  2. If Miss β†’ Query database.

But here’s the catch:

  • Many cache misses go on to query the DB for items that don’t exist at all (e.g., someone asks for a video that’s been deleted).
  • These misses lead to wasted DB lookups, increasing load and latency.

βœ… The Bloom Filter Solution

You use a Bloom filter to pre-check if an item is definitely not in the dataset:

  • If the Bloom filter says “no” ➜ skip cache and DB lookup entirely.
  • If it says “maybe” ➜ proceed to cache and DB.

This drastically reduces expensive DB hits for missing keys.

βš™οΈ How Bloom Filters Work

A Bloom filter is a probabilistic data structure that answers: β€œIs element x possibly in the set?”

Core Components:

  • A bit array of size m, initialized with all bits as 0.
  • k independent hash functions that map an element to k positions in the bit array.

βœ… Insertion:

  1. Hash the input using all k hash functions.
  2. Set all the corresponding k bits in the array to 1.

❓ Query (Membership Check):

  1. Hash the input using the same k functions.
  2. Check if all the corresponding bits are 1.
    • If any bit is 0 ➜ definitely not in the set.
    • If all bits are 1 ➜ element may be in the set.

πŸ” No Deletion (in standard Bloom filters)

To remove items, you’d need a Counting Bloom Filter, which uses counters instead of bits.

Example

  • Bit array size = 10, 3 hash functions: H1, H2, H3.
  • alice hashes to bits 1, 4, 7
  • bob hashes to bits 2, 4, 6
  • carol hashes to bits 0, 4, 7
graph
  subgraph Bit Array
    B0["0"]
    B1["1"]
    B2["0"]
    B3["0"]
    B4["1"]
    B5["0"]
    B6["0"]
    B7["1"]
    B8["0"]
    B9["0"]
  end
   %% Connections for Alice's Hashes (say H1=1, H2=4, H3=7)
  InsertAlice["Insert 'alice'"]
  InsertAlice -->|H1 = 1| B1
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7
graph
  subgraph Bit Array
    B0["0"]
    B1["1"]
    B2["1"]
    B3["0"]
    B4["1"]
    B5["0"]
    B6["1"]
    B7["1"]
    B8["0"]
    B9["0"]
  end
InsertBob["Insert 'bob'"]
  InsertBob -->|H1 = 2| B2
  InsertBob -->|H2 = 4| B4
  InsertBob -->|H3 = 6| B6
graph
  subgraph Bit Array
    B0["0"]
    B1["βœ… 1"]
    B2["1"]
    B3["0"]
    B4["βœ… 1"]
    B5["0"]
    B6["1"]
    B7["βœ… 1"]
    B8["0"]
    B9["0"]
  end
  InsertAlice["Query 'alice'"]
  InsertAlice -->|H1 = 1| B1
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7
graph
  subgraph Bit Array
    B0["❌ 0"]
    B1["1"]
    B2["1"]
    B3["0"]
    B4["βœ… 1"]
    B5["0"]
    B6["1"]
    B7["βœ… 1"]
    B8["0"]
    B9["0"]
  end
  InsertAlice["Query 'Carol'"]
  InsertAlice -->|H1 = 0| B0
  InsertAlice -->|H2 = 4| B4
  InsertAlice -->|H3 = 7| B7

πŸ“‰ Probabilistic Nature

  • False negatives: Never happen. If it’s in the filter, it won’t say it’s not.
  • False positives: Can happen. The filter may wrongly say an element exists due to overlapping bits from different insertions.

Sample Implementation

import math
import hashlib
import bitarray  # Install via: pip install bitarray

class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        """
        Initialize a Bloom filter with given capacity and desired error rate.
        """
        self.n = expected_items
        self.p = false_positive_rate

        # Calculate optimal size (m) and number of hashes (k)
        self.m = self.optimal_bit_array_size(self.n, self.p)
        self.k = self.optimal_num_hash_functions(self.m, self.n)

        self.bit_array = bitarray.bitarray(self.m)
        self.bit_array.setall(0)

    def _hashes(self, item):
        """
        Generate k hash values using double hashing.
        """
        item = item.encode('utf-8')
        h1 = int(hashlib.sha256(item).hexdigest(), 16)
        h2 = int(hashlib.md5(item).hexdigest(), 16)

        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item):
        """
        Insert an item into the Bloom filter.
        """
        for idx in self._hashes(item):
            self.bit_array[idx] = 1

    def check(self, item):
        """
        Check if an item might be in the Bloom filter.
        Returns:
            True β€” Possibly present (may be false positive)
            False β€” Definitely not present
        """
        return all(self.bit_array[idx] for idx in self._hashes(item))

    @staticmethod
    def optimal_bit_array_size(n, p):
        """
        m = -(n * ln(p)) / (ln(2)^2)
        """
        return int(-n * math.log(p) / (math.log(2) ** 2))

    @staticmethod
    def optimal_num_hash_functions(m, n):
        """
        k = (m / n) * ln(2)
        """
        return int((m / n) * math.log(2))

✨ Best Practices Recap

  • Precompute capacity: (n) to avoid overfilling the filter, which increases false positives.
  • Choose optimal k: Use $$k = \frac{m}{n} \ln 2$$ to balance accuracy vs performance.
  • Use fast, uniform hash functions: e.g., MurmurHash, xxHash.
  • Watch load factor: Too many insertions = higher false positives.
  • Store Bloom filters in memory (e.g. Redis bloom filter module) for fast access in distributed systems.
  • Don’t use for critical exactness (e.g., login auth, password checks).
  • Combine with other structures: Use Bloom filters for a quick “definitely not here” and then fall back to DB/cache.

πŸ” Alternatives

StructureSupports Deletion?Memory EfficientFalse PositivesNotes
Bloom FilterNoβœ… Yesβœ… YesFast, scalable, simple
Counting Bloomβœ… Yes⚠️ Slightly Moreβœ… YesUse counters instead of bits
Cuckoo Filterβœ… Yesβœ… Yesβœ… Lower rateBetter for frequent insert/delete
Quotient Filterβœ… Yesβœ… Cache-friendlyβœ… YesBetter for cache locality, newer structure
HashSet/Trieβœ… Yes❌ No❌ NoPrecise but space-heavy