Generic Top

Get started with Spring 5 and Spring Boot 2, through the Learn Spring course:

>> CHECK OUT THE COURSE

1. Introduction

HashMap is a powerful data structure that has a broad application, especially when fast lookup time is needed. Yet, if we don't pay attention to details, it can get suboptimal.

In this tutorial, we'll take a look at how to make HashMap as fast as possible.

2. HashMap‘s Bottleneck

HashMap‘s optimistic constant time of element retrieval (O(1)) comes from the power of hashing. For each element, HashMap computes the hash code and puts the element in the bucket associated with that hash code. Because non-equal objects can have the same hash codes (a phenomenon called hash code collision), buckets can grow in size.

The bucket is actually a simple linked list. Finding elements in the linked list isn't very fast (O(n)) but that's not a problem if the list is very small. Problems start when we have a lot of hash code collisions, so instead of a big number of small buckets, we have a small number of big buckets.

In the worst-case scenario, in which we put everything inside one bucket, our HashMap is downgraded to a linked list. Consequently, instead of O(1) lookup time, we get a very unsatisfactory O(n).

3. Tree Instead of LinkedList

Starting from Java 8, one optimization is built-in in HashMapWhen buckets are getting too large, they're transformed into trees, instead of linked lists. That brings the pessimistic time of O(n) to O(log(n)), which is much better. For that to work, the keys of HashMap need to implement the Comparable interface.

That's a nice and automatic solution, but it's not perfect. O(log(n)) is still worse than desired constant time, and transforming and storing trees takes additional power and memory.

4. Best hashCode Implementation

There are two factors we need to take into consideration when choosing a hashing function: quality of produced hash codes and speed.

4.1. Measuring hashCode Quality

Hash codes are stored inside int variables, so the number of possible hashes is limited to the capacity of the int type. It must be so because hashes are used to compute indexes of an array with buckets. That means there's also a limited number of keys that we can store in a HashMap without hash collision.

To avoid collisions as long as we can, we want to spread hashes as evenly as possible. In other words, we want to achieve uniform distribution. That means that each hash code value has the same chance of occurring as any other.

Similarly, a bad hashCode method would have a very unbalanced distribution. In the very worst-case scenario, it would always return the same number.

4.2. Default Object‘s hashCode

In general, we shouldn't use the default Object's hashCode method because we don't want to use object identity in the equals method. However, in that very unlikely scenario in which we really want to use object identity for keys in a HashMap, the default hashCode function will work fine. Otherwise, we'll want a custom implementation.

4.3. Custom hashCode

Usually, we want to override the equals method, and then we also need to override hashCode. Sometimes, we can take advantage of the specific identity of the class and easily make a very fast hashCode method.

Let's say our object's identity is purely based on its integer id. Then, we can just use this id as a hash function:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    MemberWithId that = (MemberWithId) o;

    return id.equals(that.id);
}

@Override
public int hashCode() {
    return id;
}

It will be extremely fast and won't produce any collisions. Our HashMap will behave like it has an integer key instead of a complex object.

The situation will get more complicated if we have more fields that we need to take into account. Let's say we want to base equality on both id and name:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (o == null || getClass() != o.getClass()) return false;

    MemberWithIdAndName that = (MemberWithIdAndName) o;

    if (!id.equals(that.id)) return false;
    return name != null ? name.equals(that.name) : that.name == null;
}

Now, we need to somehow combine hashes of id and name.

First, we'll get id‘s hash the same as before. Then, we'll multiple it by some carefully chosen number and add the name‘s hash:

@Override
public int hashCode() {
    int result = id.hashCode();
    result = PRIME * result + (name != null ? name.hashCode() : 0);
    return result;
}

How to choose that number isn't an easy question to answer sufficiently. Historically, the most popular number was 31. It's prime, it results in a good distribution, it's small, and multiplying by it can be optimized using a bit-shift operation:

31 * i == (i << 5) - i

However, now that we don't need to fight for every CPU cycle, some bigger primes can be used. For example, 524287 also can be optimized:

524287 * i == i << 19 - i

And, it may provide a hash of better quality resulting in a lesser chance of collision. Mind that these bit-shift optimizations are done automatically by the JVM, so we don't need to obfuscate our code with them.

4.4. Objects Utility Class

The algorithm we just implemented is well established, and we don't usually need to recreate it by hand every time. Instead, we can use the helper method provided by the Objects class:

@Override
public int hashCode() {
    return Objects.hash(id, name);
}

Under-the-hood, it uses exactly the algorithm described earlier with the number 31 as a multiplier.

4.5. Other Hash Functions

There are many hash functions that provide a lesser collision chance than the one described earlier. The problem is that they're computationally heavier and thus don't provide the speed gain we seek.

If for some reason we really need quality and don't care much for speed, we can take a look at the Hashing class from the Guava library:

@Override
public int hashCode() {
    HashFunction hashFunction = Hashing.murmur3_32();
    return hashFunction.newHasher()
      .putInt(id)
      .putString(name, Charsets.UTF_8)
      .hash().hashCode();
}

It's important to choose a 32-bit function because we can't store longer hashes anyway.

5. Conclusion

Modern Java's HashMap is a powerful and well-optimized data structure. Its performance can, however, be worsened by a badly designed hashCode method. In this tutorial, we looked at possible ways to make hashing fast and effective.

As always, the code examples for this article are available over on GitHub.

Generic bottom

Get started with Spring 5 and Spring Boot 2, through the Learn Spring course:

>> CHECK OUT THE COURSE
Generic footer banner
Comments are closed on this article!