1. Introduction

In this tutorial, we elaborate on the Token-Based string similarity metrics mentioned in String Similarity Metrics.

Algorithms falling under this category are, more or less, set similarity algorithms, where the sets are made up of string tokens. In this tutorial, we’ll explore a bit more the following algorithms:

  • Q-Grams
  • Overlap Coefficient
  • Jaccard Similarity
  • Dice Coefficient

The latter three measures are based on set similarity.

2. Token Methods

The set of token methods for string similarity measures has basically these three steps:

  1. Tokens: Examine the text strings to be compared and define a set of tokens, meaning a set of character strings
  2. Count: Count the number of these tokens within each of the strings to be compared
  3. Measure: Using these counts to compute a distance or similarity measure

Some methods, such as q-grams, have a specific way to generate the tokens, and in the other methods, we can define the tokens in any way we see fit. For example, in a normal text, one could choose individual words.

There are several ways to ‘count’ tokens with the texts to be compared. In one way, we count the number of occurrences of that token within the text. The similarity of the texts is expressed in terms of these differences.

We can also use set theory to account for the usage of tokens within the texts. The measures here are in the class of set similarity. The measure computes the similarity between two sets. In our case, the sets are made of tokens.

How the counting of the tokens is combined to form the final similarity measure is the main way to differentiate between the methods. All methods produce a single number between 0 and 1.

3. Approximation of Text

One property that is common to all token methods is based on the fact that we look at the text as a set of tokens and the consequence is that we do not care where these tokens are to be found in the text.

For example, if we have a text:

Hello World

and we use as tokens, “Hello” and “World“. The text:

World Hello

would receive the same score. The same tokens can be found in both texts. It is just where they are found in the text that changes. This is the price we pay for an easily implemented and relatively fast method of text comparison and similarity.

4. Q-Grams

In this method, the creation of the tokens is very specific and well-defined. We count these tokens and look at the difference in occurrence between the texts.

The token we use in this method is a q-gram, a character string of length \mathbf {q}. In this method, we look at two strings and ask the question, how many times does a particular string of length q occur in each? Then we ask, how do these two numbers compare, i.e., we take the absolute distance and sum up these distances. That is our measure of similarity or distance between the two strings. The smaller the number is, the more similar the strings are.

If the two strings are exactly the same, then the token counts would be the same, and the difference would be zero. If they have no tokens that are the same, then the difference is a large number (basically the sum of appearances of the token in each string).

We see this is an approximate method when we see that if the two strings are not the same, but have the set of tokens in a different order, then the sum would still be zero.

Put a bit more mathematically, let P_q(A)[i] be the number of occurrences of the i^{th} q-gram in text string A. The distance measure between two strings A and B is:

dist_{q-gram}(A,B) = \sum_{i_1}^{\sigma^{q}}\abs{P_q(A)[i] - P_q(B)[i]}

5. Set Similarity

One set of measures for similarity are based on examining the token assets. We compare the sets of tokens in each string we’re examining and compute a measure. These measures are set similarity measures.

The set similarity measures do not specify exactly how the tokens are to be defined. How the tokens are created is left for the implementation, and we could even use n-grams as the set of tokens.

5.1. Set Computations

Once we have defined the tokens, we count them in a set-theoretical way to find the difference between the sets. There are several computations based on sets that are used by the measures.

From the entire set of tokens, we can count how many of these tokens are actually found in the text, and this can be mathematically represented by:

\mid A \mid

which represents the size of the set A, the set of tokens found in the text A.

Another important computation is the overlap between the two sets. We ask the question, how many tokens do the two sets have in common. Given two sets, A and B, we represent this mathematically as:

\mid A  \cap  B \mid

Where the symbol \cap represents the intersection between the sets.

A third quantity is to count the set of tokens that were used by both sets. Given two sets, A and B, this is represented as:

\mid A  \cup  B \mid

Where the symbol \cup represents the union between the sets. Tokens not used by either set are not included in this count.

5.2. Set Similarity

The set similarity measures, for the most part, determine how many tokens are in common between the two sets, the intersection between the sets, \mid A  \cap  B \mid. The measures basically only differ in how they are ‘normalized’, in other words, how we bring the count to a number between 0 and 1. The similarity measures between string A and string B are basically how many tokens they have in common divided by a normalization factor.

The set similarity measure of overlap coefficient is:

overlap(X,Y) = \frac{ \mid A  \cap  B \mid }{min(\mid A \mid , \mid B \mid )}

where min(\mid A \mid , \mid B \mid ) is the minimum of the number of tokens in A and the number of tokens in B.

The Jaccard index is similar to the coefficient overlap measure, with the difference being how the terms are normalized. For the Jaccard index, we normalize (divide) the overlap of tokens by the number of unique tokens by both combined:

J(A,B) = \frac{ \mid A \cap B \mid }{\mid A  \cup  B \mid} = \frac{ \mid A \cap B \mid }{\mid A \mid + \mid B \mid + \mid A  \cap  B \mid}

In a sense, we can think of this as the percentage of overlapping tokens.

There is also the Jaccard distance which is defined as:

distance(A,B) = 1-J(A,B) = \frac{ \mid A \cup B \mid - \mid A \cap B \mid}{\mid A \cup B \mid}

The dice coefficient (or sometimes also referred to as the Sørensen–Dice coefficient) is again similar to the Jaccard index:

DSC(A,B)= \frac{ 2 \mid A \cap B \mid }{\mid A \mid + \mid  B \mid}

Once again differing in the normalization of the \mid A \cap B \mid term.

6. Example

Suppose we have two simple texts:

A: “brave new world

B: “hello world

C:hello new world

We’ll use two tokenizing techniques. For the first one, we isolate the words:

brave, \space new, \space world, \space hello

and for the second, we use a 2-gram:

br,\space ra, \space av, \space ve,\space ne, \space ew, \space wo, \space or, \space rl, \space ld, \space he, \space el, \space ll, \space lo

First, we make a table of counts for the words:

text brave new world hello
A 1 1 1 0
B 0 0 1 1
C 0 1 1 1

We also make a table of the n-grams:

text br ra av ve ne ew wo or rl ld he el ll lo
A 1 1 1 1 1 1 1 1 1 1 0 0 0 0
B 0 0 0 0 0 0 1 1 1 1 1 1 1 1
C 0 0 0 0 1 1 1 1 1 1 1 1 1 1

Calculating the quantities needed for the measures, first for the words:

Counts tokens A B C
\mid X \mid words 3 2 3
\mid Y \mid 2-grams 10 8 10
Counts A,B A,C B,C
\mid X \cap Y\mid words 1 2 2
\mid X \cap Y\mid 2-grams 4 6 8
\mid X \cup Y\mid words 1 2 2
\mid X \cup Y\mid 2-grams 14 14 10
dist_q-gram(X,Y) words 3 4 2
dist_q-gram(X,Y) n-grams 10 7 3
Overlap Coefficient words 0.5 0.66 1.0
Overlap Coefficient n-grams 0.5 0.60 1.0
Jaccard Index words 0.25 0.5 0.66
Jaccard Index n-grams 0.28 0.42 0.80
Dice Coefficient words 0.25 0.66 0.80
Dice Coefficient n-grams 0.44 0.60 0.89

We can see that all measures, though they differ by the exact number, give the same trend. All predicted that “brave new world” and “hello new world” were the closest and that “hello world” is the most different.

7. Summary

The token similarity measures (and methods) are a special case of string similarity methods. It is based on dividing the strings to compare into tokens. The calculations involved are relatively simple counting algorithms.

Several string similarity measures using tokens were examined and compared. Three of the methods, overlap coefficient, Jaccard index, and dice coefficient, were based on set similarity measures. They differ in the actual result in that they all use a different normalization factor. These methods did not specify how the tokens should be created from the text.

The n-gram method gave a specific way to create tokens and a specific measure. The tokenization method, however, could be used with the other measures.

In our example, it was seen that all measures correctly produced the expected similarity result.