**I just announced the new *** Learn Spring * course, focused on the fundamentals of Spring 5 and Spring Boot 2:

*Learn Spring*course, focused on the fundamentals of Spring 5 and Spring Boot 2:

Last modified: October 26, 2018

The* HyperLogLog (HLL) *data structure is a probabilistic data structure used to **estimate the cardinality of a data set**.

Suppose that we have millions of users and we want to calculate the number of distinct visits to our web page. A naive implementation would be to store each unique user id in a set, and then the size of the set would be our cardinality.

When we are dealing with very large volumes of data, counting cardinality this way will be very inefficient because the data set will take up a lot of memory.

But if we are fine with an estimation within a few percent and don’t need the exact number of unique visits, then we can use the *HLL*, as it was designed for exactly such a use case – **estimating the count of millions or even billions of distinct values**.

To get started we’ll need to add the Maven dependency for the *hll* library:

<dependency> <groupId>net.agkn</groupId> <artifactId>hll</artifactId> <version>1.6.0</version> </dependency>

Jumping right in – the *HLL* constructor has two arguments that we can tweak according to our needs:

*log2m (log base 2) –*this is the number of registers used internally by*HLL*(note: we are specifying the*m*)*regwidth –*this is the number of bits used per register

If we want a higher accuracy, we need to set these to higher values. Such a configuration will have additional overhead because our *HLL *will occupy more memory. If we’re fine with lower accuracy, we can lower those parameters, and our *HLL *will occupy less memory.

Let’s create an *HLL *to count distinct values for a data set with 100 million entries. We will set the *log2m* parameter equal to *14* and *regwidth *equal to *5* – reasonable values for a data set of this size.

**When each new element is inserted to the HLL, it needs to be hashed beforehand.** We will be using

HashFunction hashFunction = Hashing.murmur3_128(); long numberOfElements = 100_000_000; long toleratedDifference = 1_000_000; HLL hll = new HLL(14, 5);

Choosing those parameters should give us an **error rate below one percent** (1,000,000 elements). We will be testing this in a moment.

Next, let’s insert the 100 million elements:

LongStream.range(0, numberOfElements).forEach(element -> { long hashedValue = hashFunction.newHasher().putLong(element).hash().asLong(); hll.addRaw(hashedValue); } );

Finally, we can test that **the cardinality returned by the HLL is within our desired error threshold**:

long cardinality = hll.cardinality(); assertThat(cardinality) .isCloseTo(numberOfElements, Offset.offset(toleratedDifference));

We can calculate how much memory our *HLL* from the previous section will take by using the following formula: *numberOfBits = 2 ^ log2m * regwidth*.

In our example that will be *2 ^ 14 * 5 *bits (roughly* 81000 *bits or* 8100 *bytes). So estimating the cardinality of a 100-million member set using *HLL *occupied only 8100 bytes of memory.

Let’s compare this with a naive set implementation. In such an implementation, we need to have a *Set* of 100 million *Long* values, which would occupy *100,000,000 * 8 *bytes* = 800,000,000 *bytes*.*

We can see the difference is astonishingly high. Using *HLL*, we need only 8100 bytes, whereas using the naive *Set* implementation we would need roughly 800 megabytes.

When we consider bigger data sets, the difference between *HLL* and the naive *Set* implementation becomes even higher.

*HLL* has one beneficial property when performing unions*.* When we take the union of two *HLLs* created from distinct data sets and measure its cardinality, **we will get the same error threshold for the union that we would get if we had used a single HLL and calculated the hash values for all elements of both data sets from the beginning**.

Note that when we union two HLLs, both should have the same *log2m *and *regwidth *parameters to yield proper results.

Let’s test that property by creating two *HLLs –* one is populated with values from 0 to 100 million, and the second is populated with values from 100 million to 200 million:

HashFunction hashFunction = Hashing.murmur3_128(); long numberOfElements = 100_000_000; long toleratedDifference = 1_000_000; HLL firstHll = new HLL(15, 5); HLL secondHLL = new HLL(15, 5); LongStream.range(0, numberOfElements).forEach(element -> { long hashedValue = hashFunction.newHasher() .putLong(element) .hash() .asLong(); firstHll.addRaw(hashedValue); } ); LongStream.range(numberOfElements, numberOfElements * 2).forEach(element -> { long hashedValue = hashFunction.newHasher() .putLong(element) .hash() .asLong(); secondHLL.addRaw(hashedValue); } );

Please note that we tuned the configuration parameters of the *HLLs*, increasing the *log2m* parameter from 14, as seen in the previous section, to 15 for this example, since the resulting *HLL* union will contain twice as many elements.

Next, let’s union the *firstHll *and *secondHll *using the *union() *method. As you can see, the estimated cardinality is within an error threshold as if we had taken the cardinality from one *HLL *with 200 million elements:

firstHll.union(secondHLL); long cardinality = firstHll.cardinality(); assertThat(cardinality) .isCloseTo(numberOfElements * 2, Offset.offset(toleratedDifference * 2));

In this tutorial, we had a look at the *HyperLogLog* algorithm.

We saw how to use the *HLL *to estimate the cardinality of a set. We also saw that *HLL* is very space-efficient compared to the naive solution. And we performed the union operation on two *HLLs* and verified that the union behaves in the same way as a single *HLL*.

The implementation of all these examples and code snippets can be found in the GitHub project ; this is a Maven project, so it should be easy to import and run as it is.