Invertible Bloom Filter

Invertible Bloom Filter

✅ Problem Statement

Traditional Bloom Filters are great for membership testing (“Is element X in this set?”), but:

  • They don’t allow deletion (removal of items from the filter).
  • They don’t support listing elements of the set.
  • They can’t compute set difference or recover exact contents.

These limitations are critical in distributed systems, data synchronization, database reconciliation, etc.

✅ Use Case

Invertible Bloom Filters (IBFs) are designed to:

  • Reconcile sets across systems (e.g. detecting differences between two replicas).
  • Track inserted and deleted items.
  • Support decoding the set (retrieve actual items from filter).

Example: Reconciling missing or extra records in two databases efficiently over a network.

🧠 How IBFs Solve the Problem

IBFs extend standard Bloom Filters with invertibility—they not only test membership but can reconstruct the elements that caused differences between sets.

👇 Key Features

  • Supports deletion
  • Supports set subtraction
  • Supports decoding of element list under some constraints
  • Efficient in space and computation

🏗️ Architecture & Working (In Deep Detail)

An Invertible Bloom Filter is essentially an array of cells. Each cell holds:

  • A count (count)
  • An XOR of element IDs (idSum)
  • An XOR of hash values of those IDs (hashSum)

You insert and delete items by updating the cells at multiple locations (using multiple hash functions). Crucially, each element “touches” multiple cells (like Bloom filters).

🧬 Data Structure of a Cell

Cell:
  count     → Integer
  idSum     → XOR of all inserted (or deleted) IDs
  hashSum   → XOR of hashes of IDs (to verify uniqueness)

💡 Operations

1. Insert Operation

  • Hash item with k hash functions.
  • For each cell:
    • count += 1
    • idSum ^= itemID
    • hashSum ^= hash(itemID)

2. Delete Operation

  • Reverse the insert:
    • count -= 1
    • idSum ^= itemID
    • hashSum ^= hash(itemID)

3. Decode Operation (Invert)

  • Scan for cells with count = ±1.
  • For each:
    • Confirm idSum matches hashSum
    • If match → recover element
    • Remove element by reversing insert/delete
  • Repeat until no more recoverable cells

Example

  • Cell array size: 10
  • Hash functions: H1, H2, H3
  • Each cell stores:
    • count (int)
    • idSum (bitwise XOR of item IDs)
    • hashSum (bitwise XOR of item hashes)

Let’s assign dummy values:

Item IDs:
- alice = 0b0011
- bob   = 0b0101
- carol = 0b0110

Hash(itemID):
- alice → 0b1110
- bob   → 0b1011
- carol → 0b0111

Hash outputs:

ItemH1H2H3
alice147
bob246
carol047

🧩 Insert Alice

graph TD
  subgraph Insert 'alice'
    A0["Cell 1"] -->|count++| A1["count = 1"]
    A0 -->|idSum ^= 0011| A2["idSum = 0011"]
    A0 -->|hashSum ^= 1110| A3["hashSum = 1110"]

    B0["Cell 4"] -->|count++| B1["count = 1"]
    B0 -->|idSum ^= 0011| B2["idSum = 0011"]
    B0 -->|hashSum ^= 1110| B3["hashSum = 1110"]

    C0["Cell 7"] -->|count++| C1["count = 1"]
    C0 -->|idSum ^= 0011| C2["idSum = 0011"]
    C0 -->|hashSum ^= 1110| C3["hashSum = 1110"]
  end

🧩 Insert Bob

graph TD
  subgraph Insert 'bob'
    A0["Cell 2"] -->|count++| A1["count = 1"]
    A0 -->|idSum ^= 0101| A2["idSum = 0101"]
    A0 -->|hashSum ^= 1011| A3["hashSum = 1011"]

    B0["Cell 4"] -->|count++| B1["count = 2"]
    B0 -->|idSum ^= 0101| B2["idSum = 0011 ^ 0101 = 0110"]
    B0 -->|hashSum ^= 1011| B3["hashSum = 1110 ^ 1011 = 0101"]

    C0["Cell 6"] -->|count++| C1["count = 1"]
    C0 -->|idSum ^= 0101| C2["idSum = 0101"]
    C0 -->|hashSum ^= 1011| C3["hashSum = 1011"]
  end

🧩 Insert Carol

graph TD
  subgraph Insert 'carol'
    A0["Cell 0"] -->|count++| A1["count = 1"]
    A0 -->|idSum ^= 0110| A2["idSum = 0110"]
    A0 -->|hashSum ^= 0111| A3["hashSum = 0111"]

    B0["Cell 4"] -->|count++| B1["count = 3"]
    B0 -->|idSum ^= 0110| B2["idSum = 0110 ^ 0110 = 0000"]
    B0 -->|hashSum ^= 0111| B3["hashSum = 0101 ^ 0111 = 0010"]

    C0["Cell 7"] -->|count++| C1["count = 2"]
    C0 -->|idSum ^= 0110| C2["idSum = 0011 ^ 0110 = 0101"]
    C0 -->|hashSum ^= 0111| C3["hashSum = 1110 ^ 0111 = 1001"]
  end

🔍 Decode Phase

We now try to find cells with count = ±1 and where idSum == hash(itemID). These are “pure” cells.

Step-by-Step Decoding:

  1. Cell 0: count = 1, idSum = 0110, hashSum = 0111

    • hash(0110) = 0111 → ✅ Match → Recover carol
    • Delete carol’s entries from cells 0, 4, 7
  2. Repeating: now Cell 2 has count = 1, idSum = 0101, hashSum = 1011

    • ✅ Match → Recover bob
    • Delete bob’s entries from cells 2, 4, 6
  3. Again: Cell 1, count = 1, idSum = 0011, hashSum = 1110

    • ✅ Match → Recover alice
    • Delete alice from cells 1, 4, 7

Eventually all cells reach neutral state — decoding complete. ✅

🚀 Summary Table (For Quick Review)

ActionCells AffectedChange
Insert Alice1, 4, 7count += 1, XOR idSum, XOR hashSum
Insert Bob2, 4, 6count += 1, XOR idSum, XOR hashSum
Insert Carol0, 4, 7count += 1, XOR idSum, XOR hashSum
Decode PhaseAny cell with count=1 and idSum==hashSumRecovered → Deleted → Cascading decode

Implementaion

import hashlib
import random

class Cell:
    def __init__(self):
        self.count = 0
        self.idSum = 0
        self.hashSum = 0

    def is_pure(self):
        return abs(self.count) == 1 and self.hashSum == self.hash_item(self.idSum)

    @staticmethod
    def hash_item(item_id):
        h = hashlib.sha256(str(item_id).encode()).hexdigest()
        return int(h, 16)

class IBF:
    def __init__(self, size=10, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.cells = [Cell() for _ in range(size)]
        self.seeds = [random.randint(0, 10000) for _ in range(num_hashes)]

    def _hashes(self, item_id):
        return [
            hash((seed, item_id)) % self.size
            for seed in self.seeds
        ]

    def insert(self, item_id):
        h = Cell.hash_item(item_id)
        for idx in self._hashes(item_id):
            cell = self.cells[idx]
            cell.count += 1
            cell.idSum ^= item_id
            cell.hashSum ^= h

    def delete(self, item_id):
        h = Cell.hash_item(item_id)
        for idx in self._hashes(item_id):
            cell = self.cells[idx]
            cell.count -= 1
            cell.idSum ^= item_id
            cell.hashSum ^= h

    def decode(self):
        decoded_items = []
        changed = True

        while changed:
            changed = False
            for cell in self.cells:
                if cell.is_pure():
                    item_id = cell.idSum
                    decoded_items.append(item_id)
                    self.delete(item_id)  # simulate removal
                    changed = True
                    break  # restart scan after mutation

        return decoded_items

# 👀 Example usage
if __name__ == "__main__":
    ibf = IBF(size=10, num_hashes=3)

    # Demo items
    alice = 13
    bob = 21
    carol = 34

    # Insert items
    ibf.insert(alice)
    ibf.insert(bob)
    ibf.insert(carol)

    # Decode
    recovered = ibf.decode()
    print("Decoded Items:", recovered)

🧪 Best Practices

  • ✅ Choose good hash functions — low collision probability.
  • ✅ Ensure enough cells to reduce chance of undecodable filters.
  • ✅ Keep count, idSum, and hashSum as compact as possible — space-efficient yet reliable.
  • ✅ Use IBFs when exact reconciliation matters and bandwidth is limited.

🔄 Alternatives

TechniqueProsCons
Standard Bloom FilterSimple, fast membership testingNo deletion, no recovery
Counting Bloom FilterAllows deletionNo set listing or decoding
Cuckoo FilterSupports deletion, compactLimited set operations
Merkle TreesExact reconciliationMore overhead, less flexible

IBFs strike a balance between efficiency and invertibility, ideal for set reconciliation problems in distributed systems.