Invertible Bloom Filter
✅ Problem Statement
Traditional Bloom Filters are great for membership testing (“Is element X in this set?”), but:
- They don’t allow deletion (removal of items from the filter).
- They don’t support listing elements of the set.
- They can’t compute set difference or recover exact contents.
These limitations are critical in distributed systems, data synchronization, database reconciliation, etc.
✅ Use Case
Invertible Bloom Filters (IBFs) are designed to:
- Reconcile sets across systems (e.g. detecting differences between two replicas).
- Track inserted and deleted items.
- Support decoding the set (retrieve actual items from filter).
Example: Reconciling missing or extra records in two databases efficiently over a network.
🧠 How IBFs Solve the Problem
IBFs extend standard Bloom Filters with invertibility—they not only test membership but can reconstruct the elements that caused differences between sets.
👇 Key Features
- Supports deletion
- Supports set subtraction
- Supports decoding of element list under some constraints
- Efficient in space and computation
🏗️ Architecture & Working (In Deep Detail)
An Invertible Bloom Filter is essentially an array of cells. Each cell holds:
- A count (
count
) - An XOR of element IDs (
idSum
) - An XOR of hash values of those IDs (
hashSum
)
You insert and delete items by updating the cells at multiple locations (using multiple hash functions). Crucially, each element “touches” multiple cells (like Bloom filters).
🧬 Data Structure of a Cell
Cell:
count → Integer
idSum → XOR of all inserted (or deleted) IDs
hashSum → XOR of hashes of IDs (to verify uniqueness)
💡 Operations
1. Insert Operation
- Hash item with k hash functions.
- For each cell:
count += 1
idSum ^= itemID
hashSum ^= hash(itemID)
2. Delete Operation
- Reverse the insert:
count -= 1
idSum ^= itemID
hashSum ^= hash(itemID)
3. Decode Operation (Invert)
- Scan for cells with
count = ±1
. - For each:
- Confirm
idSum
matcheshashSum
- If match → recover element
- Remove element by reversing insert/delete
- Confirm
- Repeat until no more recoverable cells
Example
- Cell array size: 10
- Hash functions: H1, H2, H3
- Each cell stores:
count
(int)idSum
(bitwise XOR of item IDs)hashSum
(bitwise XOR of item hashes)
Let’s assign dummy values:
Item IDs:
- alice = 0b0011
- bob = 0b0101
- carol = 0b0110
Hash(itemID):
- alice → 0b1110
- bob → 0b1011
- carol → 0b0111
Hash outputs:
Item | H1 | H2 | H3 |
---|---|---|---|
alice | 1 | 4 | 7 |
bob | 2 | 4 | 6 |
carol | 0 | 4 | 7 |
🧩 Insert Alice
graph TD subgraph Insert 'alice' A0["Cell 1"] -->|count++| A1["count = 1"] A0 -->|idSum ^= 0011| A2["idSum = 0011"] A0 -->|hashSum ^= 1110| A3["hashSum = 1110"] B0["Cell 4"] -->|count++| B1["count = 1"] B0 -->|idSum ^= 0011| B2["idSum = 0011"] B0 -->|hashSum ^= 1110| B3["hashSum = 1110"] C0["Cell 7"] -->|count++| C1["count = 1"] C0 -->|idSum ^= 0011| C2["idSum = 0011"] C0 -->|hashSum ^= 1110| C3["hashSum = 1110"] end
🧩 Insert Bob
graph TD subgraph Insert 'bob' A0["Cell 2"] -->|count++| A1["count = 1"] A0 -->|idSum ^= 0101| A2["idSum = 0101"] A0 -->|hashSum ^= 1011| A3["hashSum = 1011"] B0["Cell 4"] -->|count++| B1["count = 2"] B0 -->|idSum ^= 0101| B2["idSum = 0011 ^ 0101 = 0110"] B0 -->|hashSum ^= 1011| B3["hashSum = 1110 ^ 1011 = 0101"] C0["Cell 6"] -->|count++| C1["count = 1"] C0 -->|idSum ^= 0101| C2["idSum = 0101"] C0 -->|hashSum ^= 1011| C3["hashSum = 1011"] end
🧩 Insert Carol
graph TD subgraph Insert 'carol' A0["Cell 0"] -->|count++| A1["count = 1"] A0 -->|idSum ^= 0110| A2["idSum = 0110"] A0 -->|hashSum ^= 0111| A3["hashSum = 0111"] B0["Cell 4"] -->|count++| B1["count = 3"] B0 -->|idSum ^= 0110| B2["idSum = 0110 ^ 0110 = 0000"] B0 -->|hashSum ^= 0111| B3["hashSum = 0101 ^ 0111 = 0010"] C0["Cell 7"] -->|count++| C1["count = 2"] C0 -->|idSum ^= 0110| C2["idSum = 0011 ^ 0110 = 0101"] C0 -->|hashSum ^= 0111| C3["hashSum = 1110 ^ 0111 = 1001"] end
🔍 Decode Phase
We now try to find cells with count = ±1
and where idSum
== hash(itemID)
. These are “pure” cells.
Step-by-Step Decoding:
Cell 0:
count = 1
,idSum = 0110
,hashSum = 0111
- hash(0110) = 0111 → ✅ Match → Recover carol
- Delete carol’s entries from cells 0, 4, 7
Repeating: now Cell 2 has
count = 1
,idSum = 0101
,hashSum = 1011
- ✅ Match → Recover bob
- Delete bob’s entries from cells 2, 4, 6
Again: Cell 1,
count = 1
,idSum = 0011
,hashSum = 1110
- ✅ Match → Recover alice
- Delete alice from cells 1, 4, 7
Eventually all cells reach neutral state — decoding complete. ✅
🚀 Summary Table (For Quick Review)
Action | Cells Affected | Change |
---|---|---|
Insert Alice | 1, 4, 7 | count += 1, XOR idSum , XOR hashSum |
Insert Bob | 2, 4, 6 | count += 1, XOR idSum , XOR hashSum |
Insert Carol | 0, 4, 7 | count += 1, XOR idSum , XOR hashSum |
Decode Phase | Any cell with count=1 and idSum==hashSum | Recovered → Deleted → Cascading decode |
Implementaion
import hashlib
import random
class Cell:
def __init__(self):
self.count = 0
self.idSum = 0
self.hashSum = 0
def is_pure(self):
return abs(self.count) == 1 and self.hashSum == self.hash_item(self.idSum)
@staticmethod
def hash_item(item_id):
h = hashlib.sha256(str(item_id).encode()).hexdigest()
return int(h, 16)
class IBF:
def __init__(self, size=10, num_hashes=3):
self.size = size
self.num_hashes = num_hashes
self.cells = [Cell() for _ in range(size)]
self.seeds = [random.randint(0, 10000) for _ in range(num_hashes)]
def _hashes(self, item_id):
return [
hash((seed, item_id)) % self.size
for seed in self.seeds
]
def insert(self, item_id):
h = Cell.hash_item(item_id)
for idx in self._hashes(item_id):
cell = self.cells[idx]
cell.count += 1
cell.idSum ^= item_id
cell.hashSum ^= h
def delete(self, item_id):
h = Cell.hash_item(item_id)
for idx in self._hashes(item_id):
cell = self.cells[idx]
cell.count -= 1
cell.idSum ^= item_id
cell.hashSum ^= h
def decode(self):
decoded_items = []
changed = True
while changed:
changed = False
for cell in self.cells:
if cell.is_pure():
item_id = cell.idSum
decoded_items.append(item_id)
self.delete(item_id) # simulate removal
changed = True
break # restart scan after mutation
return decoded_items
# 👀 Example usage
if __name__ == "__main__":
ibf = IBF(size=10, num_hashes=3)
# Demo items
alice = 13
bob = 21
carol = 34
# Insert items
ibf.insert(alice)
ibf.insert(bob)
ibf.insert(carol)
# Decode
recovered = ibf.decode()
print("Decoded Items:", recovered)
🧪 Best Practices
- ✅ Choose good hash functions — low collision probability.
- ✅ Ensure enough cells to reduce chance of undecodable filters.
- ✅ Keep
count
,idSum
, andhashSum
as compact as possible — space-efficient yet reliable. - ✅ Use IBFs when exact reconciliation matters and bandwidth is limited.
🔄 Alternatives
Technique | Pros | Cons |
---|---|---|
Standard Bloom Filter | Simple, fast membership testing | No deletion, no recovery |
Counting Bloom Filter | Allows deletion | No set listing or decoding |
Cuckoo Filter | Supports deletion, compact | Limited set operations |
Merkle Trees | Exact reconciliation | More overhead, less flexible |
IBFs strike a balance between efficiency and invertibility, ideal for set reconciliation problems in distributed systems.