 If you have a few years of experience in Computer Science or research, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

## 1. Introduction

Hashing is widely used in algorithms, data structures, and cryptography.

In this tutorial, we’ll discuss hashing and its application areas in detail.

First, we’ll discuss the core concepts and principles of hashing.

Second, we’ll analyze cryptographic hash functions.

Then, we’ll define a few hashing algorithms and possible attacks on them.

Finally, we’ll look over common hash-based data structures.

## 2. Hashing

### 2.1. Hash Functions

Hash functions take variable-length input data and produce a fixed-length output value. We usually refer to that as hash code, digest, hash value, or just hash. There are a few important properties that characterize hash functions:

• Hashing is a one-directional process. Thus, we can’t retrieve the original data from its hash.
• Hash functions are deterministic. Hence, when we pass the same input to the hash function, it always generates the same output hash code, e.g. SHA-1 hashes are 160 bits long.
• Uniformity – hashes should be distributed evenly across possible values.
• The specific hash function always produces fixed-size output.
• It should be complex enough to minimize the risk of collision.

As an example, let’s analyze a hash function used in Java’s String class:

``````public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}``````

In Effective Java Joshua Bloch explains that values 31 was chosen intentionally because it’s an odd prime. Using even numbers can cause information loss, as multiplication by 2 is analogous to shifting. In simple words, using a value that has multiple divisors can result in more collisions. Thus, using a number with fewer divisors is safer.

Let’ see a hashcode() method in action:

``````public static void main(String[] args) {
String anotherTest = "java";
String test = "test";
String oneMoreTest = "dev";
System.out.println(test.hashCode()); // output: 3556498
System.out.println(test.hashCode()); // output: 3556498
System.out.println(anotherTest.hashCode()); // output: 3254818
System.out.println(oneMoreTest.hashCode()); // output: 99349
}``````

Analyzing the results, we can see some of the mentioned hash function’s properties.

First, invoking hashcode() is a one-directional process. There is no way that we can retrieve the original string from the generated hash (int).

Second, the method is deterministic. We invoked it twice on a test variable and we obtained the same hash.

Finally, it always produces fixed-length hashcodes, because Java’s int type is a fixed size as we can read in Java VM documentation.

### 2.2. Cryptographic Hash Functions

Cryptographic hash functions are a specialized group of hash functions. They provide an increased level of security. Thus, they are used for cryptography purposes like password verification, data integrity validation, blockchain (cryptocurrencies).

Besides the properties of the standard hash functions, they could also fulfill the following criteria, depending on their purpose:

• Collision resistance: The cryptographic hash function must be fully collision-resistant. We already know that standard hash functions should minimize the risk of collisions. However, minimizing doesn’t mean that they can’t occur. Before, we analyzed a hashcode() function. Because it returns an int value the variations of hashes are limited by the int range -2147483648 to 2147483647. Consequently, when we run out of all possible values collisions will occur (they can occur even before that in some cases). Hence, in cryptographic hash functions there can’t be any possible case of a collision occurring.
• Preimage resistance: For any cryptographic hash function , having a digest , there shouldn’t be any quick way to find a message where .
• Second preimage resistance: If a given function is collision-resistant it is always second preimage resistant, but it could not be preimage resistant. A hash function that lacks preimage resistance is considered insecure for cryptographic purposes.
• “Indifferentiability from random oracles”: Finding messages (input values) with two slightly different hashes should be impossible.
• Computing hash for any message should be quick.
• If a message is changed, even marginally, the new hash should be significantly different from the old one. In other words, we shouldn’t be able to find any correlation between old and new hash codes.
• Pseudorandomness.

In brief, cryptographic hash functions should be secure, effective, and reliable.

### 2.3. Hashing Algorithms

Let’s describe a few popular hashing algorithms.

Message-Digest algorithm 5 (MD5) was introduced by Ron Rives in 1991. MD5 produces a 128 bits length digest for any length input. Unfortunately, collisions in MD5 can be found within seconds. Thus, it should no longer be used for cryptographic purposes. It’s usually used as a checksum for data integrity verification.

Secure Hash Algorithm 2 (SHA-2) consists of a few hashing functions, namely SHA-224, SHA-256, SHA-384, SHA-512. National Security Agency (NSA) designed the SHA-2 and then the National Institute of Standards and Technology (NIST) published it in 2001 as a Federal Information Processing Standard (FIPS). Some modern security applications and protocols use SHA-2 including TLS, SSL, SSH, Bitcoin.

BLAKE3 is the most recent one on our list, published on the 9th of January, 2020. This algorithm generates 256 bits long digests, arbitrarily extensible.

BLAKE3 is a single algorithm, build internally using Markle Tree. It’s also general-purpose, fast, and parallel. Those properties make it ideal for checking file integrity, inputs for cryptographic signatures, message authentication. However, the official documentation on GitHub states that it’s not recommended for password hashing.

## 3. Cryptographic Attacks

In this section, we’ll see a few cryptographic attacks that can affect hash functions.

### 3.1. Brute Force

As we already know, hash functions are one-directional, thus there is no way to retrieve the original message from its digest. They are also uniform, hence the given algorithm always produces the same hash for a specific message (e.g. password). On the other hand, uniformity makes it possible to guess a message for a given hash.

This property can be exploited by a brute force attack which is checking all possible messages to find the one that fits the given hash. Theoretically, all hash functions are vulnerable to this type of attack. In practice, the computational complexity of a brute force attack is very high. Therefore, using sufficiently long hashes makes finding the message that fits a specific hash almost impossible.

### 3.2. Birthday Attack

Another method, the birthday attack, relies on a statistic problem called the birthday paradox. Let’s describe it briefly. The first question is: how many people should we gather in one room to obtain at least 50% probability that we’ll find a person born on a specific date e.g. 1st January? The answer is 253.

The second question, which is the core of the problem is: how many people must we gather in one room to obtain at least 50% probability that we’ll find at least two persons born on a specific date e.g. 1st January? The answer is astonishing: 23. A group of 23 persons makes 253 different pairs of people

Thus, for 128 bits long hash we need to check 2^128 messages to find the one that fits the specific hash. However, it takes only 2^64 checks to find a collision.
To illustrate, let’s assume that we have a machine on which finding a message for a specific hash takes 600 000 years. On the same machine, finding a collision (a second message with the same hash) will take only an hour!

Why collisions are so dangerous? As an example, some algorithms authenticate the user by comparing an entered password with its hash stored in the database (i.e. during registration). If there was a simple and quick way to find a collision, a collided phrase could be used as a password instead of the original one.

### 3.3. Denial of Service

Denial of service is a popular cryptographic attack used to overload the server. The hash functions can be also used in data structures.

In brief, a lot of them work efficiently until a collision occurs. Adding a lot of collided data (inputs with the same hash) can slightly impact the time-complexity of operations on such a data structure. Thus, it can make the server unable to perform desired functions.

This is especially dangerous on servers responsible for security features like firewalls or SSH.

## 4. Hash Table

One of the most common data structure that uses hashing is a hash table. This stores data as key-value pairs, and is especially useful when we need fast access to the data.

The time complexity of accessing the element stored in a hash table is constant, so it doesn’t depend on table size or element’s location. A good example of the implementation of a hash table in a specific programming language is Java’s HashMap.

In this section, we’ll discuss the hash table in detail.

### 4.1. Hashing Entries

As we’ve mentioned, the entries in the hash table are key-value pairs. Internally, a hash table stores data in an array of buckets. A key must be unique within a hash table. What happens if we try to add new value with an already existing key? It depends on the implementation, but in most cases, the value will be overridden.

Let’s explain how adding an entry to a hash table works. Given an entry, the hash function implemented in the hash table calculates a digest from a key. The applied hashing algorithm is arbitrary. In previous sections, we defined what properties an effective hash function should have. Generated hash code is an index that indicates the entry’s value location in the array of buckets.

We can access the value by using the corresponding key. During lookup, a hash for the passed key will be calculated and the corresponding value’s location will be found (because a digest is a value’s index). Passing a key that wasn’t used to store any value will result in nothing (null in most programming languages).

It’s important to use immutable data types as keys. As an example, let’s assume that we used a mutable object as a key. For instance, we used an object that consists of two properties: name and surname. After adding value to the hash table, we changed the key state outside of the hash table. For example, we changed the name from Peter to John.

What will happen if we use this key after changing its state to access stored value? The digest for a key with the new state will be different from the original one. Thus, we could obtain nothing or a different value which could have that different hash as an index (because e.g. we added it earlier).

### 4.2. Collisions

We already know that for two different messages a hash function can calculate the same hash values and we refer to that as a collision. Such a situation can also happen in a hash table.  There are various ways of handling collisions in a hash table. Let’s introduce a few common ones.

The first one is called separate chaining. In this technique, collided elements are stored within a separate data structure in the same bucket. Most often, the separate data structure is a list. In such a situation, the total time complexity for hash table operation is the time needed for bucket lookup (constant) + the time for the separate data structure operation.

For instance, the time-complexity for accessing an element in a linked list is . So, in the case of collision, the hash table constant time of accessing the element could be reduced to . Therefore, the more efficient the data structure is used in the bucket the better. Sometimes different kinds of trees are used instead of lists.

The second approach is open addressing. In this strategy, the hash table stores collided elements in separate buckets like normal ones. The difference lies in calculating the digest. When a hash function finds a collision during calculating a hash the result value of increment function is being added to the hash function result, where is a probe number.

During lookup in case of collision, a passed key is compared until the right key is found. Thus, the worst-case time complexity is . There are different variants of open addressing depending on the increment function used:

• linear probing, for • quadratic probing, for • double hashing, for , where is additional increment function for key 