String Similarity Metrics: Sequence Based | Baeldung on Computer Science

1. Introduction

This is a continuation of an overview article on string similarity metrics. This tutorial concentrates on the sequence-based methods mentioned:

Longest Common Substring
Longest Common Subsequence
Gestalt Pattern Matching

The longest common substring and the longest common subsequence differ only in that in the longest common substring algorithm. The ‘sequence’ is restricted to consecutive characters. The longest common subsequence finds groups of consecutive substrings.

In these two methods, the similarity of two strings is based on the common length.

The gestalt pattern matching uses the longest common substring recursively to establish a similarity measure.

In the following sections, we’ll go into more detail.

2. Longest Common Substring

One tool we can use towards string similarity is finding the longest common substring (LCS). The longer the longest common substring is, the more similar the two strings are. We define the longest common substring as:

Given two strings, ‘X’ and ‘Y’, find the length of the longest common substring.

Finding the longest common substring can be used as the basis of a string similarity method by recursively finding the LCS in the first and subsequent rest strings (those characters not in the LCS). This is the basis of the Gestalt Pattern Matching.

To find the LCS of two strings, the conceptually simplest approach is to first find all substrings of one (the smallest) string, $X$ , of length $m$ . This can be done in $O(m^2)$ time. For each of these strings, determine whether it is a substring of the second string, $Y$ , of length $n$ . This can be done in $O(n)$ time. The total time complexity is $O(n m^2)$ .

An implementation of this method, also showing the complexity, can be seen in our article on finding the longest substring without repeating characters. One method to improve the complexity of the algorithm is to use dynamic programming. This method builds a matrix of the character overlaps.

2.1. Dynamic Programming

The complexity of using dynamic programming is much better than the simple approach. Namely, $O(n \times m)$ , where $m$ and $n$ are the lengths of the strings to be compared. The heart of this method is to build a $(n+1)$ by $(m+1)$ matrix, where the rows and columns are the characters in each string, respectively. In this method, we find the longest common suffix, meaning the longest substring in both strings with the same ending.

To do this for a string $X$ of size $m$ and a string $Y$ of size $n$ , we first build a matrix, $LCS$ , of size $m+1$ by $n+1$ where $LCS(i,j)$ have the following properties:

For example, if we have two strings, $X=(ABABC)$ and $Y=(BABCA)$ , the matrix we would build is:

Since the largest number is $4$ , we know that the longest common substring is four characters long. To find the substring itself, start at the largest number, $4$ in this case, and then count down to $1$ :

We can see that the complexity of this algorithm is dominated by the looping through the characters, so $O(n \times m)$ .

3. Longest Common Subsequence

The difference between the longest common sequence and longest common substring is that the ‘sequences’, unlike substrings, are not required to occupy consecutive positions within the original sequences. There can be gaps in characters that don’t match. For example, in the two sentences:

The fox jumped over the high fence

The quick brown fox jumped over the fence.

The longest substring is just:

fox jumped over the

But the longest subsequence is:

The fox jumped over the fence

The dynamic programming methodology to find the longest common sequence is similar to the longest common substring in that in both, a character overlap matrix is built. The matrix is built in the same way. The difference lies in how we traverse this matrix to find a common subsequence. We count down as before, but to find the largest subsequence, traversing to the next lowest count does not have to be adjacent in the matrix.

4. Gestalt Pattern Matching

The gestalt pattern matching, also known as Ratcliff / Obershelp pattern recognition, is a pattern similarity method published originally in the article Pattern Matching: The Gestalt Approach in Dr. Dobb’s Journal. The word Gestalt, meaning the idea that natural systems and their properties should be viewed as wholes, not as loose collections of parts, is used to emphasize that the entire character string is analyzed together. The Gestalt approach is meant to be a method closer to human analysis.

The key formula, given two strings, $S_1$ and $S_2$ , is:

$Similarity = \frac{2 K_m}{|S_1| + |S_2|}$

where $K_m$ is the number of matching characters in the string and $|S_1|$ and $|S_2|$ are the lengths of the two strings. The similarity metric is a value between $0$ , no matches between the string, and $1$ , identical match. The matching characters are determined, first by finding the longest common substring (LCS) and then, recursively, finding the matching characters (using LCS again) in the non-matching regions of both strings.

We illustrate this algorithm with the example used in the original algorithm using the words ‘Pennsylvania’ and ‘Pencilvaneya’. The LCS of the two strings is ‘lvan’: ‘Pennsylvania’, ‘Pencilvaneya’. Thus, the initial $K_m$ is $2*4=8$ .

Now, we continue, with the two remaining rest pairs to the right and left of the matched portion, [‘Pennsy’, ‘Penci’] and [‘ia’,’eya’]. We find the words ‘Pennsy’, ‘Penci’, match with ‘Pen’ in common, and we increase $K_m$ to $8 + 2*3 = 14$ .

The remaining pair [nsy,ci] have no characters in common, so we are done with this side. Continuing with [‘ia‘,’eya‘], we see that there is one character, ‘a’, in common and we increase $K_m$ to $14 + 2*1=16$ . We see that the remaining string, [‘i’,’ey’], has no characters, so we are done. So the score in total is $0.66$ :

$Similarity = \frac{2 K_m}{|S_1| + |ß_2|} = \frac{2*(4+3+1)}{12+12} = \frac{16}{24} = \frac{2}{3} = 0.66$

Implementations of this algorithm can be found in C and assembler in Ratcliff’s article, but there are also implementations in Python based on the original algorithm and even implementation in Scala.

We can see the root of the complexity of the Gestalt algorithm is dependent on the longest common substring algorithm. In effect, this puts the complexity in the worse case, $O(n^2)$ to the best case of $O(n)$ .

5. Summary

In this article, we examined sequence-based methods in the general field of analyzing the similarity of two strings. We saw that the longest common substring method is a similarity method in itself, but also the key algorithm within the Gestalt pattern matching algorithm. We also saw that using dynamic programming methods can greatly improve the complexity of the algorithms.

Full Archive

About Baeldung

Core Concepts

Operating Systems

Artificial Intelligence

Graph Theory

Latex