What is Bloom Filter?

Hello! In this article I will try to describe what the Bloom filter is, explain its purpose and show the scenarios in which it can be used. I am also implementing the Bloom filter in Python from scratch to make it easier to understand its internals.

Purpose of the Bloom filter

Bloom filter is a data structure whose purpose is to quickly check that an element is NOT included in a set (for those who are familiar with O large notation, the complexity of inserting and checking whether an element belongs to a set using Bloom’s filter is O(1)). It can be very useful to prevent unnecessary execution of computationally intensive tasks by simply checking that the item definitely not is included in the set. It is important to understand that Bloom’s filter is a probabilistic data structure: it can tell you with 100% probability that an item is absent from the dataset, but it cannot tell you with 100% probability that an item is in the set (false positive results are possible) … Let’s talk about the scenarios in which you can use the Bloom filter with a detailed explanation of its internals and implementation in Python, and later you will understand how the Bloom filter has such indicators!

Bloom filter is usually used before searching in a set with rather slow access to elements. Due to its use, both the number of search operations over the set and the search time in general can be reduced.

Usage scenarios

Let’s think about scenarios where such a data structure could be very useful to speed up the computation of some tasks. For example, we might start with a backbone router (not the one you might find in your home). These routers may require uplink speeds of more than 100 Gbps. The administrator may need to create a blacklist of IP addresses to block their access to the network. This means that every time a router receives a packet at more than 100 Gbps, it must access its memory and perform at best a logarithmic lookup (O(log(n))) to check if an IP is blocked, given that most IPs are not blocked and that the search will not return results for most packets. In this case, a Bloom filter can be implemented just before memory accesses to ensure that most packets do not have to wait for an IP address to be sent to the network.

Another scenario is databases. When the database is accessed millions of times per second, and most of the accesses are searches for a key that is not in the database, it can be important to minimize the load of such calls to the database for two reasons: if you reduce the number of searches, the database kernel data will respond faster to other requests; if the client might not wait for a database search and get a result (for example, out of memory) without having to access the database, the speedup achieved can be significant.

Finally, to speed up the process of finding a file in a folder with a lot of files, you can use the Bloom filter to first check if the file is missing in that folder.

More typical usage scenarios for the Bloom filter can be found here

What is the Bloom filter?

To illustrate the structure of the Bloom filter, we will use the first scenario. Imagine blacklisting 100 IP addresses. The easiest way to mark if an IP is blacklisted or not is to create a 100-bit list, where each bit is one IP. If the IP address is blacklisted, we mark its position as “1”, otherwise – “0”.

In this Bloom filter, the 4th IP address is blacklisted and all others are not.

How many IP addresses are there in total?

This implementation works if only 100 IPs are used. In reality, each IPv4 address has 32 bits, which means that there are 4,294,967,296 (2 ^ 32) possible addresses (some of them are reserved for private, broadcast, multicast and other special networks, but there are still a huge number of remaining addresses )! And the number of blacklisted IP addresses will probably not exceed a few hundred in the most extreme case. We cannot afford to compile such a large list to use for such a relatively small number of entries. We need to find a way to match the IP address and the entries in the list. And this is where hash functions come to the rescue.

Hash function

A hash function is a function that converts an arbitrary length input to a fixed size value. This way, we can create a fixed size array and compute the hash function for a specific IP address, and it will always generate a number less than or equal to the size of the array. The hash function is not a random function, which means the output will always be the same for the same input.

The hash function takes input, which can be any string (in this case, IP), and calculates a numeric representation. In this case, the numeric representation will be the position in the Bloom filter that matches the input.

But wait … Something is wrong. Let’s go back to our scenario. Imagine we blacklisted 100 IP addresses. How does the hash function accurately map our 100 out of a possible 2 ^ 32 IP addresses to 100 different values ​​without storing any information about them? The truth is, you can’t. There will be collisions. The hash function ensures that each IP address has its own number mapping, but since there may be 4,294,967,296 (2 ^ 32) possible IP addresses, it is impossible to map them all to just a hundred different values. All the hash function can guarantee is that it will scramble the bits of the input so that the output conforms to a uniform distribution. This means that if you change the input of the hash function from 192.168.1.1 to 192.168.1.2, the output is likely to be completely different, seemingly random (but not really random, since each input will always match the same output).

An example of a collision. Two different IP addresses have the same hash, which means that their index in the Bloom filter will be the same.

Ok, now from the beginning: we blacklist 100 IP addresses. Each IP address will go through a hash function, and the result of the hash function will return a number less than or equal to the size of the array. This number will be an array index that marks whether the IP address was blacklisted or not. But there will be collisions – how can we deal with this?

Let’s assume the IP addresses 178.23.12.63 and 112.64.90.12 have the same hash. The first IP was blacklisted, the second was not. When we check if the hash of the second IP address is in the Bloom filter, we will find it there, even if that IP address has never been blacklisted. Does this mean we have a bug?

Remember that I warned in the beginning that the purpose of the Bloom filter is to check that an item is NOT part of the dataset. If the position of an element in the Bloom filter is 0, this element is definitely NOT part of the set. However, if the position of an element in the Bloom filter is 1, then either this element can still be in the set, or it is just a collision. All we can do is reduce the chances of collisions in order to reduce the number of memory accesses required to check if an IP is indeed blacklisted.

Reducing the likelihood of collisions

There are two main ways to reduce the chance of collisions, and both are nuanced. One possibility is to increase the size of the array. If we increase the size of the array (and therefore force the hash function to return a number less than or the same size as the new size of the array), the chance of collisions decreases. In particular, the probability of a false positive (the Bloom filter returns 1 when an item is not in the set) is (1-e^(m / n)), where m is the number of elements to be added to the filter, and n is the filter size.

Another way to reduce the chance of collisions is to increase the number of hash functions. This means that our scenario will use several different hash functions for one IP address, i.e. several different places in the array will be marked as 1. If we use k hash functions, the probability of a false positive is now (1-e^(mk/n))^k, which means that the optimal number of hash functions is (n/m)*ln(2) (you can read more about these equations here).

An example of a Bloom filter with two hash functions. In one of the hashes of these IP addresses, we observe a collision, but you can still check that IP 112.64.90.12 is not included in the set, because one of its Bloom filter positions is not equal to 1.

Well, now let’s implement the Bloom filter in Python and see the result! It only takes us about 50 lines of code.

Let’s start by creating a class BloomFilter (in the next code snippet). The constructor gets the size of the Bloom filter and (optional) the number of expected items that this Bloom filter will store. To create an array of bits, we will use the library bitarrayand we will set them all to zero. Finally, we set the number of hash functions into an equation that returns the optimal number of hash functions given the size of the Bloom filter and the number of elements expected.

import math
from bitarray import bitarray

class BloomFilter(object):

    def __init__(self, size, number_expected_elements=100000):
        self.size = size
        self.number_expected_elements = number_expected_elements

        self.bloom_filter = bitarray(self.size)
        self.bloom_filter.setall(0)

        self.number_hash_functions = round((self.size / self.number_expected_elements) * math.log(2))

Now let’s define a hash function for the Bloom filter. The implementation used (taken from here) implements the DJB2 algorithm. We will use it as a black box for now, since explaining this algorithm is beyond the scope of this article.

def _hash_djb2(self, s):
        hash = 5381
        for x in s:
            hash = ((hash << 5) + hash) + ord(x)
        return hash % self.size

We now have a hash function, but how do we create K hash functions? We can do one simple trick. Instead of creating different hash functions, we’ll just add a number to each input to the hash function. The number will represent the number of the called hash function. Since any small difference in the input of the hash function will result in a completely different hash, the result can be thought of as a different hash function. Cool, isn’t it?

def _hash(self, item, K):
        return self._hash_djb2(str(K) + item)

Now let’s create a function to add an element to the Bloom filter. To do this, let’s iterate through all the hash functions, calculating the hash for the item and finally putting 1 (or True) in the hash index.

def add_to_filter(self, item):
        for i in range(self.number_hash_functions):
            self.bloom_filter[self._hash(item, i)] = 1

It remains only to create a function that checks that the element NOT is in the Bloom filter. To do this, let’s iterate over all the hash functions again. If any of the Bloom filter positions are 0, we can say that the item is definitely missing from the set. Otherwise, if all positions are 1, we cannot say that the item is not in the set.

 def check_is_not_in_filter(self, item):
        for i in range(self.number_hash_functions):
            if self.bloom_filter[self._hash(item, i)] == 0:
                return True
        return False

And that’s it! We have implemented our Bloom filter. Let’s test it!

We will create a simple test to see if it works. Let’s create a Bloom filter with 1 million entries, and then set the expected cardinality to 100,000. We’re going to add the “192.168.1.1” element to our Bloom filter as a blocked IP address.

bloom_filter = BloomFilter(1000000, 100000)
base_ip = "192.168.1."
bloom_filter.add_to_filter(base_ip + str(1))

To test this, we will iterate i from 1 to 100,000 and check if IP 192.168.1.i is in the Bloom filter (there are no IP addresses where i> 254, like 192.168.289, but in this case we just test). We will output elements that we do not know about if they are part of the set; all other elements that will not be printed are definitely not included in the set.

for i in range(1, 100000):
    if not bloom_filter.check_is_not_in_filter(base_ip + str(i)):
        print(base_ip+str(i))

192.168.1.1

Yes! Our Bloom filter says that out of 100,000 IP addresses, the only item that can be blocked is indeed our blocked IP address. No false positives!

Here is the complete code for our Bloom filter:

import math
from bitarray import bitarray


class BloomFilter(object):

    def __init__(self, size, number_expected_elements=100000):
        self.size = size
        self.number_expected_elements = number_expected_elements

        self.bloom_filter = bitarray(self.size)
        self.bloom_filter.setall(0)

        self.number_hash_functions = round((self.size / self.number_expected_elements) * math.log(2))


    def _hash_djb2(self, s):
        hash = 5381
        for x in s:
            hash = ((hash << 5) + hash) + ord(x)
        return hash % self.size


    def _hash(self, item, K):
        return self._hash_djb2(str(K) + item)


    def add_to_filter(self, item):
        for i in range(self.number_hash_functions):
            self.bloom_filter[self._hash(item, i)] = 1


    def check_is_not_in_filter(self, item):
        for i in range(self.number_hash_functions):
            if self.bloom_filter[self._hash(item, i)] == 0:
                return True
        return False


bloom_filter = BloomFilter(1000000, 100000)
base_ip = "192.168.1."
bloom_filter.add_to_filter(base_ip + str(1))

for i in range(1, 100000):
    if not bloom_filter.check_is_not_in_filter(base_ip + str(i)):
        print(base_ip+str(i))

That’s it for Bloom filters. I hope you were curious about what a Bloom filter is and how to implement it. Thank you for attention!


The translation of the article was prepared in advance of the start Data Engineer course

We also invite everyone to free demo lesson on ML in Spark. In the lesson, participants, together with an expert, will learn about the features of ML in Spark, review the model development process and learn how to translate trained models into production.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *