How to Calculate Levenshtein Distance in Java?

Last updated: January 8, 2024

Written by: Deep Jain

Reviewed by: Zenko

Algorithms

Learn in

Azure Container Apps is a fully managed serverless container service that enables you to build and deploy modern, cloud-native Java applications and microservices at scale. It offers a simplified developer experience while providing the flexibility and portability of containers.

Of course, Azure Container Apps has really solid support for our ecosystem, from a number of build options, managed Java components, native metrics, dynamic logger, and quite a bit more.

To learn more about Java features on Azure Container Apps, visit the documentation page.

You can also ask questions and leave feedback on the Azure Container Apps GitHub page.

Of course, Azure Container Apps has really solid support for our ecosystem, from a number of build options, managed Java components, native metrics, dynamic logger, and quite a bit more.

To learn more about Java features on Azure Container Apps, you can get started over on the documentation page.

And, you can also ask questions and leave feedback on the Azure Container Apps GitHub page.

Modern software architecture is often broken. Slow delivery leads to missed opportunities, innovation is stalled due to architectural complexities, and engineering resources are exceedingly expensive.

Orkes is the leading workflow orchestration platform built to enable teams to transform the way they develop, connect, and deploy applications, microservices, AI agents, and more.

With Orkes Conductor managed through Orkes Cloud, developers can focus on building mission critical applications without worrying about infrastructure maintenance to meet goals and, simply put, taking new products live faster and reducing total cost of ownership.

Try a 14-Day Free Trial of Orkes Conductor today.

Orkes is the leading workflow orchestration platform built to enable teams to transform the way they develop, connect, and deploy applications, microservices, AI agents, and more.

Try a 14-Day Free Trial of Orkes Conductor today.

Traditional keyword-based search methods rely on exact word matches, often leading to irrelevant results depending on the user's phrasing.

By comparison, using a vector store allows us to represent the data as vector embeddings, based on meaningful relationships. We can then compare the meaning of the user’s query to the stored content, and retrieve more relevant, context-aware results.

Explore how to build an intelligent chatbot using MongoDB Atlas, Langchain4j and Spring Boot:

>> Building an AI Chatbot in Java With Langchain4j and MongoDB Atlas

Accessibility testing is a crucial aspect to ensure that your application is usable for everyone and meets accessibility standards that are required in many countries.

By automating these tests, teams can quickly detect issues related to screen reader compatibility, keyboard navigation, color contrast, and other aspects that could pose a barrier to using the software effectively for people with disabilities.

Learn how to automate accessibility testing with Selenium and the LambdaTest cloud-based testing platform that lets developers and testers perform accessibility automation on over 3000+ real environments:

Automated Accessibility Testing With Selenium

1. Introduction

In this article, we describe the Levenshtein distance, alternatively known as the Edit distance. The algorithm explained here was devised by a Russian scientist, Vladimir Levenshtein, in 1965.

We’ll provide an iterative and a recursive Java implementation of this algorithm.

2. What Is the Levenshtein Distance?

The Levenshtein distance is a measure of dissimilarity between two Strings. Mathematically, given two Strings x and y, the distance measures the minimum number of character edits required to transform x into y.

Typically three type of edits are allowed:

Insertion of a character c
Deletion of a character c
Substitution of a character c with c‘

Example: If x = ‘shot’ and y = ‘spot’, the edit distance between the two is 1 because ‘shot’ can be converted to ‘spot’ by substituting ‘h‘ to ‘p‘.

In certain sub-classes of the problem, the cost associated with each type of edit may be different.

For example, less cost for substitution with a character located nearby on the keyboard and more cost otherwise. For simplicity, we’ll consider all costs to be equal in this article.

Some of the applications of edit distance are:

Spell Checkers – detecting spelling errors in text and find the closest correct spelling in dictionary
Plagiarism Detection (refer – IEEE Paper)
DNA Analysis – finding similarity between two sequences
Speech Recognition (refer – Microsoft Research)

3. Algorithm Formulation

Let’s take two Strings x and y of lengths m and n respectively. We can denote each String as x[1:m] and y[1:n].

We know that at the end of the transformation, both Strings will be of equal length and have matching characters at each position. So, if we consider the first character of each String, we’ve got three options:

Substitution:
1. Determine the cost (D1) of substituting x[1] with y[1]. The cost of this step would be zero if both characters are same. If not, then the cost would be one
2. After step 1.1, we know that both Strings start with the same character. Hence the total cost would now be the sum of the cost of step 1.1 and the cost of transforming the rest of the String x[2:m] into y[2:n]
Insertion:
1. Insert a character in x to match the first character in y, the cost of this step would be one
2. After 2.1, we have processed one character from y. Hence the total cost would now be the sum of the cost of step 2.1 (i.e., 1) and the cost of transforming the full x[1:m] to remaining y (y[2:n])
Deletion:
1. Delete the first character from x, the cost of this step would be one
2. After 3.1, we have processed one character from x, but the full y remains to be processed. The total cost would be the sum of the cost of 3.1 (i.e., 1) and the cost of transforming remaining x to the full y

The next part of the solution is to figure out which option to choose out of these three. Since we do not know which option would lead to minimum cost at the end, we must try all options and choose the best one.

4. Naive Recursive Implementation

We can see that the second step of each option in section #3 is mostly the same edit distance problem but on sub-strings of the original Strings. This means after each iteration we end up with the same problem but with smaller Strings.

This observation is the key to formulate a recursive algorithm. The recurrence relation can be defined as:

D(x[1:m], y[1:n]) = min {

D(x[2:m], y[2:n]) + Cost of Replacing x[1] to y[1],

D(x[1:m], y[2:n]) + 1,

D(x[2:m], y[1:n]) + 1

}

We must also define base cases for our recursive algorithm, which in our case is when one or both Strings become empty:

When both Strings are empty, then the distance between them is zero
When one of the Strings is empty, then the edit distance between them is the length of the other String, as we need that many numbers of insertions/deletions to transform one into the other:
- Example: if one String is “dog” and other String is “” (empty), we need either three insertions in empty String to make it “dog”, or we need three deletions in “dog” to make it empty. Hence the edit distance between them is 3

A naive recursive implementation of this algorithm:

public class EditDistanceRecursive {

   static int calculate(String x, String y) {
        if (x.isEmpty()) {
            return y.length();
        }

        if (y.isEmpty()) {
            return x.length();
        } 

        int substitution = calculate(x.substring(1), y.substring(1)) 
         + costOfSubstitution(x.charAt(0), y.charAt(0));
        int insertion = calculate(x, y.substring(1)) + 1;
        int deletion = calculate(x.substring(1), y) + 1;

        return min(substitution, insertion, deletion);
    }

    public static int costOfSubstitution(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static int min(int... numbers) {
        return Arrays.stream(numbers)
          .min().orElse(Integer.MAX_VALUE);
    }
}

This algorithm has the exponential complexity. At each step, we branch-off into three recursive calls, building an O(3^n) complexity.

In the next section, we’ll see how to improve upon this.

5. Dynamic Programming Approach

On analyzing the recursive calls, we observe that the arguments for sub-problems are suffixes of the original Strings. This means there can only be m*n unique recursive calls (where m and n are a number of suffixes of x and y). Hence the complexity of the optimal solution should be quadratic, O(m*n).

Lets look at some of the sub-problems (according to recurrence relation defined in section #4):

Sub-problems of D(x[1:m], y[1:n]) are D(x[2:m], y[2:n]), D(x[1:m], y[2:n]) and D(x[2:m], y[1:n])
Sub-problems of D(x[1:m], y[2:n]) are D(x[2:m], y[3:n]), D(x[1:m], y[3:n]) and D(x[2:m], y[2:n])
Sub-problems of D(x[2:m], y[1:n]) are D(x[3:m], y[2:n]), D(x[2:m], y[2:n]) and D(x[3:m], y[1:n])

In all three cases, one of the sub-problems is D(x[2:m], y[2:n]). Instead of calculating this three times like we do in the naive implementation, we can calculate this once and reuse the result whenever needed again.

This problem has a lot of overlapping sub-problems, but if we know the solution to the sub-problems, we can easily find the answer to the original problem. Therefore, we have both of the properties needed for formulating a dynamic programming solution, i.e., Overlapping Sub-Problems and Optimal Substructure.

We can optimize the naive implementation by introducing memoization, i.e., store the result of the sub-problems in an array and reuse the cached results.

Alternatively, we can also implement this iteratively by using a table based approach:

static int calculate(String x, String y) {
    int[][] dp = new int[x.length() + 1][y.length() + 1];

    for (int i = 0; i <= x.length(); i++) {
        for (int j = 0; j <= y.length(); j++) {
            if (i == 0) {
                dp[i][j] = j;
            }
            else if (j == 0) {
                dp[i][j] = i;
            }
            else {
                dp[i][j] = min(dp[i - 1][j - 1] 
                 + costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)), 
                  dp[i - 1][j] + 1, 
                  dp[i][j - 1] + 1);
            }
        }
    }

    return dp[x.length()][y.length()];
}

This algorithm performs significantly better than the recursive implementation. However, it involves significant memory consumption.

This can further be optimized by observing that we only need the value of three adjacent cells in the table to find the value of the current cell.

6. Conclusion

In this article, we described what is Levenshtein distance and how it can be calculated using a recursive and a dynamic-programming based approach.

Levenshtein distance is only one of the measures of string similarity, some of the other metrics are Cosine Similarity (which uses a token-based approach and considers the strings as vectors), Dice Coefficient, etc.

The code backing this article is available on GitHub. Once you're logged in as a Baeldung Pro Member, start learning and coding on the project.

Of course, Azure Container Apps has really solid support for our ecosystem, from a number of build options, managed Java components, native metrics, dynamic logger, and quite a bit more.

To learn more about Java features on Azure Container Apps, visit the documentation page.

You can also ask questions and leave feedback on the Azure Container Apps GitHub page.

Of course, Azure Container Apps has really solid support for our ecosystem, from a number of build options, managed Java components, native metrics, dynamic logger, and quite a bit more.

To learn more about Java features on Azure Container Apps, visit the documentation page.

You can also ask questions and leave feedback on the Azure Container Apps GitHub page.

Orkes is the leading workflow orchestration platform built to enable teams to transform the way they develop, connect, and deploy applications, microservices, AI agents, and more.

Try a 14-Day Free Trial of Orkes Conductor today.

Orkes is the leading workflow orchestration platform built to enable teams to transform the way they develop, connect, and deploy applications, microservices, AI agents, and more.

Try a 14-Day Free Trial of Orkes Conductor today.