Counting Bloom Filter

Counting Bloom Filter

❓ The Membership Problem

In large-scale systems, we often need to answer:
“Is this item in the set?”
Examples:

  • Is this URL malicious?
  • Has this user already seen this ad?
  • Is this file already cached?

⚠️ Challenge with Traditional Bloom Filters

Bloom filters are great for space-efficient membership testing, but they don’t support deletions. Once a bit is set, you can’t unset it without risking false negatives.

💡 Enter Counting Bloom Filters

CBFs solve this by replacing bits with counters, allowing:

  • Insertions (increment counters)
  • Deletions (decrement counters)
  • Membership checks (verify counters > 0)

🏗️ Architecture & Internal Working

🔧 Components

  • Counter Array: Instead of a bit array, CBF uses an array of small integers (e.g., 4-bit counters).
  • Hash Functions: k independent hash functions map each item to k positions in the array.

🔁 Operations

✅ Insertion

graph TD
    A[Element x] --> B[Hash x with k functions]
    B --> C[Get k indices]
    C --> D[Increment counters at those indices]

🔍 Lookup

graph TD
    A[Element x] --> B[Hash x with k functions]
    B --> C[Get k indices]
    C --> D[Check if all counters > 0]
    D --> E[Return "Probably Present"]

❌ Deletion

graph TD
    A[Element x] --> B[Hash x with k functions]
    B --> C[Get k indices]
    C --> D[Decrement counters at those indices]

📊 Example

Let’s say we have:

  • Counter array of size m = 10
  • k = 3 hash functions
  • Insert element "cat"

Assume hash functions map "cat" to indices [2, 5, 7].
We increment counters at those positions:

Before: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  
After : [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]

To delete "cat", we decrement the same counters.

Implementation

import mmh3         # pip install mmh3
from collections import defaultdict

class CountingBloomFilter:
    def __init__(self, size: int, hash_count: int):
        self.size = size                    # Size of counter array
        self.hash_count = hash_count        # Number of hash functions
        self.counters = [0] * size          # Counter array

    def _hashes(self, item: str):
        return [mmh3.hash(item, seed) % self.size for seed in range(self.hash_count)]

    def add(self, item: str):
        for i in self._hashes(item):
            self.counters[i] += 1

    def remove(self, item: str):
        for i in self._hashes(item):
            if self.counters[i] > 0:
                self.counters[i] -= 1

    def contains(self, item: str) -> bool:
        return all(self.counters[i] > 0 for i in self._hashes(item))

    def __str__(self):
        return str(self.counters)

🧠 Best Practices

  • Choose optimal k and m: Use formulas to minimize false positives:
    • k ≈ (m/n) * ln(2)
  • Use good hash functions: FNV, MurmurHash are popular.
  • Avoid counter overflow: Use 4–8 bits per counter depending on expected frequency.
  • Monitor false positives: They increase as the filter saturates.
  • Use CBFs for cache eviction, deduplication, or stream filtering.

🔄 Alternatives

AlternativeDescriptionProsCons
Standard Bloom FilterBit array with no deletion supportVery space-efficientNo deletions
Count-Min SketchEstimates frequency of items in a streamTracks countsHigher error rate
Cuckoo FilterStores fingerprints with support for deletionLower false positivesMore complex insert/delete
d-left Counting BloomOptimized CBF using d-left hashingSpace-efficientMore complex implementation

Interview Tip

If asked to design a distributed cache, you can use CBFs to:

  • Track which items are cached
  • Evict items safely
  • Share compact cache summaries across nodes