Guide to the HyperLogLog Algorithm in Java

1. Overview

The HyperLogLog (HLL) data structure is a probabilistic data structure used to estimate the cardinality of a data set.

Suppose that we have millions of users and we want to calculate the number of distinct visits to our web page. A naive implementation would be to store each unique user id in a set, and then the size of the set would be our cardinality.

When we are dealing with very large volumes of data, counting cardinality this way will be very inefficient because the data set will take up a lot of memory.

But if we are fine with an estimation within a few percent and don’t need the exact number of unique visits, then we can use the HLL, as it was designed for exactly such a use case – estimating the count of millions or even billions of distinct values.
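For comparison, the naive exact approach described above can be sketched in a few lines. This is only an illustration: the class name, method name, and sample ids are hypothetical and not part of any library.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.LongStream;

class NaiveDistinctCounter {

    // Exact distinct count: every unique id has to be kept in memory.
    static long countDistinct(LongStream userIds) {
        Set<Long> seen = new HashSet<>();
        userIds.forEach(id -> seen.add(id));
        return seen.size();
    }

    public static void main(String[] args) {
        // 1,000 visits coming from only 100 distinct (made-up) user ids
        long distinct = countDistinct(LongStream.range(0, 1_000).map(id -> id % 100));
        System.out.println(distinct); // prints 100
    }
}
```

The result is exact, but the set grows with the number of distinct ids, which is the cost that HLL avoids.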

2. Maven Dependency

To get started we’ll need to add the Maven dependency for the hll library:

<dependency>
    <groupId>net.agkn</groupId>
    <artifactId>hll</artifactId>
    <version>1.6.0</version>
</dependency>

3. Estimating Cardinality Using HLL

Jumping right in – the HLL constructor has two arguments that we can tweak according to our needs:

log2m – the base-2 logarithm of the number of registers used internally by HLL (for example, log2m = 14 means 2^14 = 16,384 registers)
regwidth – the number of bits used per register

If we want a higher accuracy, we need to set these to higher values. Such a configuration will have additional overhead because our HLL will occupy more memory. If we’re fine with lower accuracy, we can lower those parameters, and our HLL will occupy less memory.
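As a rough guide to this trade-off, the relative standard error of the HyperLogLog estimator is commonly quoted as about 1.04 / sqrt(m), where m = 2^log2m is the number of registers. The sketch below only evaluates that formula. The class and method names are illustrative and not part of the hll library, and the figures are theoretical expectations rather than guarantees.

```java
class HllErrorEstimate {

    // Approximate relative standard error: 1.04 / sqrt(m), with m = 2^log2m registers.
    static double relativeStandardError(int log2m) {
        double registers = Math.pow(2, log2m);
        return 1.04 / Math.sqrt(registers);
    }

    public static void main(String[] args) {
        // Each step of +2 in log2m quadruples the registers and halves the expected error.
        for (int log2m = 10; log2m <= 16; log2m += 2) {
            System.out.printf("log2m=%d -> ~%.2f%% standard error%n",
              log2m, relativeStandardError(log2m) * 100);
        }
    }
}
```

By this formula, log2m = 14 gives an expected error of roughly 0.8 percent.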

Let’s create an HLL to count distinct values for a data set with 100 million entries. We will set the log2m parameter equal to 14 and regwidth equal to 5 – reasonable values for a data set of this size.

Before each new element is inserted into the HLL, it needs to be hashed. We will be using Hashing.murmur3_128() from the Guava library (included with the hll dependency) because it is both accurate and fast.

HashFunction hashFunction = Hashing.murmur3_128();
long numberOfElements = 100_000_000;
long toleratedDifference = 1_000_000;
HLL hll = new HLL(14, 5);

Choosing those parameters should give us an error rate below one percent, that is, an estimate within 1,000,000 of the true count of 100 million. We will be testing this in a moment.

Next, let’s insert the 100 million elements:

LongStream.range(0, numberOfElements).forEach(element -> {
    long hashedValue = hashFunction.newHasher().putLong(element).hash().asLong();
    hll.addRaw(hashedValue);
});

Finally, we can test that the cardinality returned by the HLL is within our desired error threshold:

long cardinality = hll.cardinality();
assertThat(cardinality)
  .isCloseTo(numberOfElements, Offset.offset(toleratedDifference));

4. Memory Size of HLL

We can calculate how much memory the registers of our HLL from the previous section will take by using the following formula: numberOfBits = (2 ^ log2m) * regwidth.

In our example that will be (2 ^ 14) * 5 = 81,920 bits, or 10,240 bytes. So estimating the cardinality of a 100-million member set using HLL occupied only about 10 KB of memory.

Let’s compare this with a naive set implementation. In such an implementation, we need to have a Set of 100 million Long values, which would occupy at least 100,000,000 * 8 bytes = 800,000,000 bytes, even before counting any object or collection overhead.

We can see the difference is astonishingly high. Using HLL, we need only about 10 KB, whereas using the naive Set implementation we would need roughly 800 megabytes.

When we consider bigger data sets, the difference between HLL and the naive Set implementation becomes even higher.
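The arithmetic above can be checked with a small sketch. It covers only the register array given by the formula and ignores any bookkeeping overhead a real implementation adds; the class and method names are illustrative.

```java
class HllMemoryEstimate {

    // Bits needed for the registers alone: 2^log2m registers of regwidth bits each.
    static long registerBits(int log2m, int regwidth) {
        return (1L << log2m) * regwidth;
    }

    public static void main(String[] args) {
        long bits = registerBits(14, 5);
        long bytes = bits / 8;
        System.out.println(bits + " bits = " + bytes + " bytes"); // prints 81920 bits = 10240 bytes

        // Lower bound for the naive approach: 8 bytes per long value, no object overhead.
        long naiveBytes = 100_000_000L * Long.BYTES;
        System.out.println("naive set: at least " + naiveBytes + " bytes");
    }
}
```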

5. Union of Two HLLs

HLL has one beneficial property when performing unions. When we take the union of two HLLs created from distinct data sets and measure its cardinality, we will get the same error threshold for the union that we would get if we had used a single HLL and calculated the hash values for all elements of both data sets from the beginning.

Note that when we union two HLLs, both should have the same log2m and regwidth parameters to yield proper results.

Let’s test that property by creating two HLLs – one is populated with values from 0 to 100 million, and the second is populated with values from 100 million to 200 million:

HashFunction hashFunction = Hashing.murmur3_128();
long numberOfElements = 100_000_000;
long toleratedDifference = 1_000_000;
HLL firstHll = new HLL(15, 5);
HLL secondHLL = new HLL(15, 5);

LongStream.range(0, numberOfElements).forEach(element -> {
    long hashedValue = hashFunction.newHasher()
      .putLong(element)
      .hash()
      .asLong();
    firstHll.addRaw(hashedValue);
});

LongStream.range(numberOfElements, numberOfElements * 2).forEach(element -> {
    long hashedValue = hashFunction.newHasher()
      .putLong(element)
      .hash()
      .asLong();
    secondHLL.addRaw(hashedValue);
});

Please note that we tuned the configuration parameters of the HLLs, increasing the log2m parameter from 14, as seen in the previous section, to 15 for this example, since the resulting HLL union will contain twice as many elements.

Next, let’s union firstHll and secondHLL using the union() method. As we can see, the estimated cardinality is within the same error threshold as if we had taken the cardinality from one HLL with 200 million elements:

firstHll.union(secondHLL);
long cardinality = firstHll.cardinality();
assertThat(cardinality)
  .isCloseTo(numberOfElements * 2, Offset.offset(toleratedDifference * 2));
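To see why a union loses no accuracy, recall that in the standard HyperLogLog formulation two sketches with the same parameters are merged by taking the register-wise maximum. The toy class below illustrates that idea in plain Java. It is not the hll library's implementation: ToyHll, its methods, and the SplitMix64-style mixer standing in for Murmur3 are all assumptions made for this sketch, and it omits cardinality estimation entirely.

```java
import java.util.Arrays;
import java.util.stream.LongStream;

class ToyHll {
    final int log2m;
    final byte[] registers;

    ToyHll(int log2m) {
        this.log2m = log2m;
        this.registers = new byte[1 << log2m];
    }

    // SplitMix64-style finalizer used as a stand-in 64-bit hash (assumption, not Murmur3).
    static long mix(long z) {
        z += 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        long hash = mix(value);
        int index = (int) (hash >>> (64 - log2m)); // top log2m bits select the register
        long rest = hash << log2m;                 // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1, 64 - log2m + 1);
        registers[index] = (byte) Math.max(registers[index], rank);
    }

    // Merge: keep the larger value of each register pair.
    void union(ToyHll other) {
        if (other.log2m != log2m) {
            throw new IllegalArgumentException("log2m must match");
        }
        for (int i = 0; i < registers.length; i++) {
            registers[i] = (byte) Math.max(registers[i], other.registers[i]);
        }
    }

    // Builds two sketches over disjoint ranges plus one sketch over both, then compares.
    static boolean unionMatchesSingleSketch(int log2m, long n) {
        ToyHll first = new ToyHll(log2m);
        ToyHll second = new ToyHll(log2m);
        ToyHll single = new ToyHll(log2m);
        LongStream.range(0, n).forEach(v -> { first.add(v); single.add(v); });
        LongStream.range(n, 2 * n).forEach(v -> { second.add(v); single.add(v); });
        first.union(second);
        return Arrays.equals(first.registers, single.registers);
    }

    public static void main(String[] args) {
        System.out.println(unionMatchesSingleSketch(10, 50_000)); // prints true
    }
}
```

Because the merged registers are identical to those of a single sketch fed with all the values, any estimate computed from them is identical as well. This is also why both sketches need matching parameters.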

6. Conclusion

In this tutorial, we had a look at the HyperLogLog algorithm.

We saw how to use the HLL to estimate the cardinality of a set. We also saw that HLL is very space-efficient compared to the naive solution. And we performed the union operation on two HLLs and verified that the union behaves in the same way as a single HLL.

The implementation of all these examples and code snippets can be found in the GitHub project; this is a Maven project, so it should be easy to import and run as it is.
