Counting Bloom Filter
❓ The Membership Problem
In large-scale systems, we often need to answer:
“Is this item in the set?”
Examples:
- Is this URL malicious?
- Has this user already seen this ad?
- Is this file already cached?
⚠️ Challenge with Traditional Bloom Filters
Bloom filters are great for space-efficient membership testing, but they don’t support deletions. Once a bit is set, you can’t unset it without risking false negatives.
💡 Enter Counting Bloom Filters
CBFs solve this by replacing bits with counters, allowing:
- Insertions (increment counters)
- Deletions (decrement counters)
- Membership checks (verify counters > 0)
🏗️ Architecture & Internal Working
🔧 Components
- Counter Array: Instead of a bit array, CBF uses an array of small integers (e.g., 4-bit counters).
- Hash Functions:
kindependent hash functions map each item tokpositions in the array.
🔁 Operations
✅ Insertion
graph TD
A[Element x] --> B[Hash x with k functions]
B --> C[Get k indices]
C --> D[Increment counters at those indices]🔍 Lookup
graph TD
A[Element x] --> B[Hash x with k functions]
B --> C[Get k indices]
C --> D[Check if all counters > 0]
D --> E[Return "Probably Present"]❌ Deletion
graph TD
A[Element x] --> B[Hash x with k functions]
B --> C[Get k indices]
C --> D[Decrement counters at those indices]📊 Example
Let’s say we have:
- Counter array of size
m = 10 k = 3hash functions- Insert element
"cat"
Assume hash functions map "cat" to indices [2, 5, 7].
We increment counters at those positions:
Before: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
After : [0, 0, 1, 0, 0, 1, 0, 1, 0, 0]To delete "cat", we decrement the same counters.
Implementation
import mmh3 # pip install mmh3
from collections import defaultdict
class CountingBloomFilter:
def __init__(self, size: int, hash_count: int):
self.size = size # Size of counter array
self.hash_count = hash_count # Number of hash functions
self.counters = [0] * size # Counter array
def _hashes(self, item: str):
return [mmh3.hash(item, seed) % self.size for seed in range(self.hash_count)]
def add(self, item: str):
for i in self._hashes(item):
self.counters[i] += 1
def remove(self, item: str):
for i in self._hashes(item):
if self.counters[i] > 0:
self.counters[i] -= 1
def contains(self, item: str) -> bool:
return all(self.counters[i] > 0 for i in self._hashes(item))
def __str__(self):
return str(self.counters)🧠 Best Practices
- Choose optimal
kandm: Use formulas to minimize false positives:k ≈ (m/n) * ln(2)
- Use good hash functions: FNV, MurmurHash are popular.
- Avoid counter overflow: Use 4–8 bits per counter depending on expected frequency.
- Monitor false positives: They increase as the filter saturates.
- Use CBFs for cache eviction, deduplication, or stream filtering.
🔄 Alternatives
| Alternative | Description | Pros | Cons |
|---|---|---|---|
| Standard Bloom Filter | Bit array with no deletion support | Very space-efficient | No deletions |
| Count-Min Sketch | Estimates frequency of items in a stream | Tracks counts | Higher error rate |
| Cuckoo Filter | Stores fingerprints with support for deletion | Lower false positives | More complex insert/delete |
| d-left Counting Bloom | Optimized CBF using d-left hashing | Space-efficient | More complex implementation |
Interview Tip
If asked to design a distributed cache, you can use CBFs to:
- Track which items are cached
- Evict items safely
- Share compact cache summaries across nodes