Bloom Filters: Fast Membership Testing With Zero False Negatives

TL;DR

A bloom filter uses multiple hash functions and a bit array to test set membership. False positives are possible; false negatives are not. Use them when you need fast 'definitely not in set' checks at massive scale.

We had a crawler that needed to check whether a URL had already been visited. At 500 million URLs, storing them in a hash set meant 40GB of RAM. We switched to a bloom filter and got the same answers, at the cost of a small, tunable false positive rate, in 600MB.

Here's what a bloom filter is and why it works.

The Problem

You have a huge set of things. You need to answer the question: "Have I seen this before?"

A hash set does this in O(1), but it stores every element. At scale, that's expensive.

A bloom filter answers "definitely not seen" or "probably seen" using a fixed-size bit array, regardless of how many elements you add.
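A rough back-of-the-envelope comparison makes the gap concrete. This is a sketch, not a benchmark: the exact hash-set numbers depend on the Python version and string lengths, and the 9.6-bits-per-element figure comes from the sizing formula covered in "The Math Behind It" below.

```python
import math
import sys

n = 100_000
urls = {f"https://site-{i}.example/page" for i in range(n)}

# Rough lower bound for the hash set: the table itself plus each string object.
set_bytes = sys.getsizeof(urls) + sum(sys.getsizeof(u) for u in urls)

# A bloom filter sized for the same n at a 1% error rate needs about
# 9.6 bits per element: m = -n*ln(0.01)/ln(2)^2 bits.
filter_bytes = math.ceil(-n * math.log(0.01) / math.log(2) ** 2 / 8)

print(f"hash set: ~{set_bytes / 1e6:.1f} MB, bloom filter: ~{filter_bytes / 1e6:.2f} MB")
```

For 100,000 short URLs the filter comes out around 120 KB, versus megabytes for the set, and the gap only widens with longer keys.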

How It Works

A bloom filter is a bit array of m bits, initialized to all zeros. You choose k hash functions.

Adding an element:

element = "example.com"

hash1("example.com") = 3  → set bit[3] = 1
hash2("example.com") = 7  → set bit[7] = 1
hash3("example.com") = 14 → set bit[14] = 1

Checking membership:

check("example.com")
  hash1 → bit[3]  = 1 ✓
  hash2 → bit[7]  = 1 ✓
  hash3 → bit[14] = 1 ✓
  → "probably in set"

check("other.com")
  hash1 → bit[2]  = 0 ✗
  → "definitely NOT in set" (stop here)

If any bit is 0, the element is definitely not in the set. If all bits are 1, it's probably in the set (but could be a false positive from other elements setting those bits).

False Positives, Never False Negatives

False positive: filter says "probably seen" but we haven't
False negative: filter says "not seen" but we have
               → IMPOSSIBLE with bloom filters

Bits are only set, never cleared. Once you add an element, its bits stay set forever. So if something is in the filter, all its bits are 1, and the check always returns "probably in set."
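The guarantee is easy to check empirically. A minimal standalone toy filter (deliberately simpler than the full implementation later in this post; it wastes a byte per bit for clarity):

```python
import hashlib
import random
import string

m, k = 200_000, 5
bits = bytearray(m)  # one byte per bit position, for simplicity

def positions(item: str):
    # Double hashing: derive k positions from two base hashes.
    h1 = int(hashlib.md5(item.encode()).hexdigest(), 16)
    h2 = int(hashlib.sha1(item.encode()).hexdigest(), 16)
    return [(h1 + i * h2) % m for i in range(k)]

def add(item: str):
    for p in positions(item):
        bits[p] = 1

def contains(item: str) -> bool:
    return all(bits[p] for p in positions(item))

random.seed(42)
added = [''.join(random.choices(string.ascii_lowercase, k=12)) for _ in range(10_000)]
for item in added:
    add(item)

# Every added element is still found: no false negatives, ever.
false_negatives = sum(1 for item in added if not contains(item))
print(f"false negatives: {false_negatives} / {len(added)}")  # always 0
```

The zero is structural, not statistical: an added element's bits are all 1 by construction, so its lookup cannot fail.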

False positive rate depends on how full the filter is. An empty filter has 0% false positives. A completely full one returns "probably in set" for everything.
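How quickly the rate climbs can be computed from the standard formula, p = (1 - e^(-kn/m))^k, derived in "The Math Behind It" below. A quick sketch, using the sizes a 1M-element, 1%-error filter would get:

```python
import math

m = 9_585_059   # bits, sized for 1M elements at a 1% target error rate
k = 7           # hash functions

def fp_rate(n: int) -> float:
    """Theoretical false positive rate after inserting n elements."""
    return (1 - math.exp(-k * n / m)) ** k

for n in [0, 250_000, 500_000, 1_000_000, 2_000_000]:
    print(f"{n:>9} inserted -> {fp_rate(n):.4%} false positive rate")
```

At the design capacity of 1M elements the rate sits right at the 1% target; at double the capacity it has blown past 15%.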

Implementation From Scratch

import hashlib
import math

class BloomFilter:
    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Optimal bit array size
        self.m = math.ceil(
            -(capacity * math.log(error_rate)) / (math.log(2) ** 2)
        )
        # Optimal number of hash functions
        self.k = math.ceil((self.m / capacity) * math.log(2))

        self.bits = bytearray(math.ceil(self.m / 8))
        self.count = 0

        print(f"Bloom filter: {self.m} bits ({self.m/8/1024:.1f} KB), "
              f"{self.k} hash functions, "
              f"target error rate {error_rate}")

    def _hashes(self, item: str):
        """Generate k hash positions for item."""
        encoded = item.encode('utf-8')
        h1 = int(hashlib.md5(encoded).hexdigest(), 16)
        h2 = int(hashlib.sha1(encoded).hexdigest(), 16)
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def _get_bit(self, pos: int) -> bool:
        byte_idx, bit_idx = divmod(pos, 8)
        return bool(self.bits[byte_idx] & (1 << bit_idx))

    def _set_bit(self, pos: int):
        byte_idx, bit_idx = divmod(pos, 8)
        self.bits[byte_idx] |= (1 << bit_idx)

    def add(self, item: str):
        for pos in self._hashes(item):
            self._set_bit(pos)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(self._get_bit(pos) for pos in self._hashes(item))


# Usage
bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
# Bloom filter: 9585059 bits (1170.1 KB), 7 hash functions, target error rate 0.01

bf.add("example.com")
bf.add("github.com")
bf.add("google.com")

"example.com" in bf   # True (probably in set)
"missing.com" in bf   # False (definitely NOT in set)

The Math Behind It

For n elements, target false positive rate p, with bit array of size m:

Optimal bit array size:
  m = -(n * ln(p)) / (ln(2)^2)

Optimal number of hash functions:
  k = (m / n) * ln(2)

Actual false positive rate after inserting n elements:
  p = (1 - e^(-kn/m))^k

For 1 million elements at 1% false positive rate:

m = 9,585,059 bits ≈ 1.1 MB
k = 7 hash functions

Compare to storing 1M strings in a hash set: at ~50 bytes per URL that's 50MB minimum, plus hash table overhead. The bloom filter uses roughly 2% of the memory.
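Plugging the numbers in confirms the figures above, and also shows the rate actually achieved once k is rounded up to an integer:

```python
import math

n = 1_000_000   # expected elements
p = 0.01        # target false positive rate

m = math.ceil(-(n * math.log(p)) / (math.log(2) ** 2))  # optimal bit array size
k = math.ceil((m / n) * math.log(2))                    # optimal hash count

print(f"m = {m} bits ({m / 8 / 1e6:.2f} MB), k = {k}")

# Achieved rate after n insertions with these (integer) m and k:
actual = (1 - math.exp(-k * n / m)) ** k
print(f"achieved false positive rate: {actual:.4%}")
```

Rounding k from 6.64 up to 7 nudges the achieved rate a hair above the 1% target; rounding down to 6 would be equally valid and trades a touch of accuracy for one fewer hash per operation.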

When to Use Them

URL deduplication in crawlers:

crawler_seen = BloomFilter(capacity=100_000_000, error_rate=0.001)

def should_crawl(url: str) -> bool:
    if url in crawler_seen:
        return False  # Probably already crawled — skip
    crawler_seen.add(url)
    return True

# ~0.1% of new URLs will be wrongly skipped (false positives)
# 0% of URLs will be crawled twice (no false negatives)

Spam/bad actor filtering:

blocked_ips = BloomFilter(capacity=10_000_000, error_rate=0.0001)

# Load from blocklist
for ip in load_blocklist():
    blocked_ips.add(ip)

def check_request(ip: str):
    if ip in blocked_ips:
        # Check real database (bloom filter might have false positive)
        if db.is_blocked(ip):
            raise PermissionError("Blocked IP")
    # Fast path: ip definitely not in bloom filter = definitely not blocked

Cache stampede prevention:

recently_cached = BloomFilter(capacity=1_000_000, error_rate=0.01)

def get_value(key: str):
    if key in recently_cached:
        cached = cache.get(key)  # Probably in cache — try it
        if cached is not None:
            return cached
        # False positive or expired entry — fall through to the DB

    value = db.get(key)
    cache.set(key, value, ttl=300)
    recently_cached.add(key)
    return value

Username/password breach detection:

A bloom filter built from hashes of breached passwords lets a client answer "is this password probably breached?" locally, contacting a server only to confirm hits. (HaveIBeenPwned itself uses a k-anonymity range query over hash prefixes rather than a bloom filter, but the goal is the same: test membership without revealing the full credential.)

Limitations

No deletions. Once a bit is set, you can't safely clear it — it might be shared by another element. Counting bloom filters (a small counter per position instead of a single bit) allow deletion but use several times more memory.
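A counting bloom filter is a small extension of the structure above: each position holds a counter that add increments and remove decrements. A minimal sketch, reusing the same double-hashing scheme as the implementation earlier (real implementations pack the counters into 4-bit fields rather than a Python list):

```python
import hashlib


class CountingBloomFilter:
    """Sketch: one small counter per position instead of one bit,
    so elements can be removed."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m  # real implementations pack these into 4-bit fields

    def _positions(self, item: str):
        data = item.encode('utf-8')
        h1 = int(hashlib.md5(data).hexdigest(), 16)
        h2 = int(hashlib.sha1(data).hexdigest(), 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str):
        # Caller must ensure the item was actually added: removing an
        # absent item corrupts the counts of other elements.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1024, k=4)
cbf.add("example.com")
print("example.com" in cbf)   # True
cbf.remove("example.com")
print("example.com" in cbf)   # False — its counters dropped back to 0
```

The false positive behavior is unchanged; the extra memory buys you only the ability to delete.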

Fixed size. You set capacity upfront. Overfill it and the false positive rate climbs. Scalable bloom filters chain multiple filters together, but add complexity.

No retrieval. It's membership testing only — you can't get elements back out.

Redis Has One Built In

import redis

r = redis.Redis()

# Redis bloom filter (requires the RedisBloom module, bundled with Redis Stack)
r.bf().create('seen_urls', 0.01, 10_000_000)  # error rate, capacity

# Add URLs
r.bf().add('seen_urls', 'https://example.com')
r.bf().madd('seen_urls', 'https://a.com', 'https://b.com')

# Check membership (the server returns 1/0)
r.bf().exists('seen_urls', 'https://example.com')  # 1 (probably in set)
r.bf().exists('seen_urls', 'https://new.com')      # 0 (definitely NOT in set)

A distributed bloom filter, shared across services, with no per-process copy to build or keep in sync.

The Bottom Line

Bloom filters trade certainty for memory. You can never get a false negative, so "definitely not in the set" is reliable. "Probably in the set" is a hint: the check was cheap, so verify against the real store when a false positive would matter.

The rules:

  • Use when membership testing at scale is too memory-expensive for a hash set
  • False negative rate is 0% — things in the filter are always detected
  • False positive rate is tunable — set it based on the cost of a false positive
  • Can't delete elements from a standard bloom filter
  • Redis, Cassandra, and HBase all have bloom filter implementations ready to use

If you need "have I seen this before?" at millions of queries per second, bloom filters are the answer.