Tree Edit Distance | Baeldung on Computer Science

1. Introduction

In this tutorial, we’ll talk about the tree edit distance (TED) and how to calculate it. TED is similar to string edit distance but works on trees.

2. What Is a Tree Edit Distance (TED)?

TED is the minimal number of select operations necessary to transform one tree ( $\boldsymbol{T_1}$ ) to another ( $\boldsymbol{T_2}$ ). Usually, those operations are: relabel, delete, and insert a node. However, we can also define edge-based operations such as the removal and creation of an edge between two nodes. In either case, they act only on $T_1$ .

We can also associate different costs to those operations. Then, TED will be the cost of the least expensive sequence of actions transforming $\boldsymbol{T_1}$ to $\boldsymbol{T_2}$ , where the sequence’s cost is the sum of the individual actions’ costs.

2.1. Example

Let’s take a look at these two trees:

We can define the transformation costs as follows:

$cost(relabel) = 2 \quad cost(delete) = cost(insert) = 3$

The least expensive sequence is:

$\begin{aligned} \text{delete } & c, e, g \\ \text{relabel } & f \rightarrow g, d \rightarrow f, b \rightarrow e, a \rightarrow x \\ \end{aligned}$

Its cost is $17$ . So, $TED(T_1, T_2) = 17$ .

2.2. Interpretation of TED

Since it depends on the transforming operations and their costs, TED can be interpreted only concerning those settings. Using different operations and costs can give different distances. Therefore, we should choose them in a way that makes sense in our application.

Anyhow, if TED fulfills the metric axioms, it induces a metric space of trees. That allows us to analyze them rigorously.

So, let’s now define a general TED and the way to calculate it.

3. General Recursive Calculation

Let $\mathcal{A}(T)$ represent the set of operations we can perform on tree $T$ , and let $cost(a)$ be the cost of operation $a \in \mathcal{A}(T)$ . The costs can be negative, but we’ll focus only on the case where all are positive.

Our general recursive definition of TED is:

(1) $\begin{equation*} TED(T_1, T_2) = \begin{cases} 0,& T_1 = T_2 \\ \min_{a \in \mathcal{A}(T)} \{cost(a) + TED(a(T_1), T_2)\}, & \text{otherwise} \end{cases} \end{equation*}$

We can’t say more about its complexity other than it may be exponential, depending on $\mathcal{A}$ and $c$ .

3.1. Memoization

However, if we find an efficient way to hash trees, we may reduce the complexity. The first time we calculate $TED(T, T_2)$ , we hash $T$ and insert the pair $(hash(T), TED(T, T_2))$ into a hash map. That way, if $TED(T, T_2)$ appears again in the recursion, we don’t have to compute it again.

This particular technique is known as memoization. Some authors consider it to be a tool of dynamic programming.

4. A Node-Based TED

The recursion (1) covers a general case of TED. Here, we’ll focus on more specific operations and costs. In particular, on the node-based relabel, insert, and delete.

4.1. Operations

Let’s explain the operations in more detail. While deleting a node, we don’t leave its children hanging in the air. Instead, we add them as the new children of the node’s parent:

Relabeling a node means changing its label. For example:

Unlike deletion and relabeling, insertions have more than one argument. We specify the node to insert and the node that will be its parent:

4.2. Assumptions and Remarks

Let’s first state an important assumption: the nodes in $\boldsymbol{T_1}$ and $\boldsymbol{T_2}$ have a left-to-right ordering. So, for any two nodes in a tree, we can say which one is to the left of the other.

Another assumption is that we can insert, delete, or relabel a node at most once during the transformation. For instance, we can’t delete a node after inserting it. Otherwise, the space of possible transformation sequences to consider would become infinite.

The costs of the operations are constant. So, $delete(T, u)$ has the same cost regardless of the node we’re deleting ( $u$ ) and the state of the tree we’re deleting it from ( $T$ ). The same goes for insertion and relabeling. Let $c_{del}, c_{ins}, c_{rel}$ be their costs.

Since inserting a node to $T_1$ is the same as deleting it from $T_2$ , we need not apply insertion to $T_1$ in our implementation. Instead, we can delete the node from $T_2$ and consider it inserted into $T_1$ . Analogously, when relabeling $u \in T_1$ to match $v \in T_2$ , we don’t need to actually change the label. We can delete both nodes from the trees and consider $u$ relabeled to $v$ . To ease the notation, we’ll denote deletions with a minus sign: $T-u$ instead of $delete(T, u)$ .

Finally, the cost function in our algorithm is restricted to be a distance metric. That makes the TED that it computes a distance metric too. A non-negative constant cost function we opted for fulfills this condition.

4.3. Computing TED Using Forests

In its intermediate steps, the algorithm operates on forests. It does so by considering the post-order numbering of the nodes. Let $T[i]$ be the $i$ -th node of $T$ in the post-order traversal. Then, $T[i..j]$ is the forest consisting of the nodes numbered $i, i+1, \ldots, j-1, j$ in the post-order traversal of $T$ . For instance:

Although it may seem that dealing with a more general structure will complicate things, it isn’t so. Instead, it makes the problem easier. By dividing the trees into sub-forests, we can solve the problem using dynamic programming.

4.4. The Recurrence

Let $F_1 =T_1[i..j]$ be the post-order sub-forest of $T_1$ and let $r_1$ denote its rightmost root. Also, let $R_1$ be the rightmost tree of $F_1$ (the one rooted at $r_1$ ). The same notation holds for $T_2$ . Then, we calculate TED as follows:

(2)

We can calculate the recurrence with the memoization technique. For instance, we can hash each forest (tree) as a pair $(i, j)$ , where $i$ and $j$ are the post-order bounds of the nodes in the original tree. So, the first time we calculate the TED of $\boldsymbol{T_1[i_1..j_1]}$ and $\boldsymbol{T_2[i_2, j_2]}$ , we use $\boldsymbol{(i_1, j_1, i_2, j_2)}$ as the key and insert the corresponding value into a map. Later, we reuse the stored result to avoid repeated calculation.

The forest and tree structures should be supplied with efficient methods for evaluating $r_k, R_k, F_k-R_k$ , and $R_k-r_k$ ( $k=1,2$ ).

4.5. Tree Operations

First, let’s observe that the root of the rightmost tree of the forest $\boldsymbol{F_1=T_1[i_1..j_1]}$ is the node number $\boldsymbol{j_1}$ , so $r_1=j_1$ for $F_1=T_1[i_1..j_1]$ . From there, $F_1 - r_1$ amounts to $T_1[i_1..(j_1-1)]$ . The difference of $F_1$ and $R_1$ is the forest consisting of all the nodes that are in $F_1$ but not $R_1$ . The same goes for the other tree $F_2=T_2[i_2..j_2]$ ..

Finding $R_k$ ( $k=1,3$ ), however, requires a more careful consideration. The right boundary of $R_k$ is the post-order index of its root, that is, $r_k$ . We can compute the left boundary on the fly by running a post-order traversal routine, but this approach will slow down the algorithm. Instead, it’s more efficient to preprocess the tree and compute each sub-tree’s post-order left boundary so that it’s available in $O(1)$ time during the computation of TED. We can do it in linear time using a post-order traversal algorithm.

So, we’ll assume that $\boldsymbol{\ell(i)}$ , the left post-order boundary of the tree whose root is $\boldsymbol{i}$ , is available for the nodes in both trees. Then, $R_k$ is $T_k[\ell(r_k).. r_k]$ , and $R_k - r_k$ is $T_k\left[\ell(r_k)..r_k-1\right]$ . If $F_k \neq R_k$ (the forest doesn’t contain only one tree), then $F_k - R_k$ is $(i, \ell(r)-1)$ . Otherwise, $F_k-R_k = \emptyset \equiv T_k[0, 0]$ .

4.6. The Memoization Algorithm

Here’s how we can calculate $TED(T_1, T_2)$ with a memoization algorithm:

algorithm TEDMemoized(T1, T2, n, m):
    // INPUT
    //    T1 = the tree to transform
    //    T2 = the tree to transform T1 into
    //    n = the number of nodes in T1
    //    m = the number of nodes in T2
    //    T1 and T2 have been preprocessed to determine the left boundaries ℓ(·) for all the nodes
    //    results = the storage for the results, initially empty
    // OUTPUT
    //    The tree edit distance between T1 and T2

    // Initially called as TED(T1[1 : n], T2[1 : m])

    // Apply the memoization technique
    if results[T1, T2] is known:
        return results[T1, T2]

    // Cover the base cases
    if T1 = empty set and T2 = empty set:
        return 0
    if T1 != empty set and T2 = empty set:
        r1 <- find the rightmost root of T1
        return cost_del + TED(T1 - r1, empty set)
    if T1 = empty set and T2 != empty set:
        r2 <- find the rightmost root of T2
        return cost_ins + TED(empty set, T2 - r2)
    
    // Cover the recursive case (both T1, T2 != empty set)
    Determine r1, R1, r2, R2
    
    ted_del <- TED(T1 - r1, T2) + cost_del
    ted_ins <- TED(T1, T2 - r2) + cost_ins
    ted_rel <- TED(T1 - R1, T2 - R2) + cost_rel + TED(R1 - r1, R2 - r2)
    
    // Store and return the result
    results[T1, T2] <- min{ted_del, ted_ins, ted_rel}
    return results[T1, T2]

This algorithm is a memoized version of Zhang and Shasha‘s iterative TED algorithm. So, its complexity is the same.

4.7. Complexity

We’ll derive it by considering the relevant (sub)forests of $T_1$ and $T_2$ .

A forest is relevant if it appears in the recursive calculation of $\bodsymbol{TED(T_1, T_2)}$ . Zhang and Shasha define a keyroot as a tree’s root or a node with a left sibling. We can prove that the post-order numbering of a relevant forest is a prefix of the post-order traversal of a keyroot’s tree.

Let the keyroot depth $d_k$ of $T_k$ ( $k=1,2$ ) be the highest number of keyroot nodes on the path from the root to any other node in the tree. Then, Zhang and Shasha’s algorithm’s complexity (and consequently of ours too) is $O(d_1d_2 nm)$ .

We can improve the result even further. Let $n_{\mathrm{height}}$ and $n_{\mathrm{leaves}}$ denote the height and the number of leaves in $T_1$ , and let $m_{\mathrm{height}}$ and $m_{\mathrm{leaves}}$ do the same for $T_2$ . Zhang and Shasha prove that $d_1 \leq \min\{n_{\mathrm{height}, n_{\mathrm{leaves}}}\}$ and that $d_2 \leq \min\{m_{\mathrm{height}, m_{\mathrm{leaves}}}\}$ . So, the overall complexity is:

$O\left( \min\{n_{\mathrm{height}}, n_{\mathrm{leaves}}\} \min\{m_{\mathrm{height}}, m_{\mathrm{leaves}}\} n m \right)$

If the trees are balanced, then $n_{\mathrm{height}} \in O(\log n)$ and $m_{\mathrm{\height}} \in O(\log m)$ , so the overall complexity becomes:

$O \left( (n \log n) (m \log m) \right)$

or $O\left(n^2 \log^2 n \right)$ if $m \in \Theta(n)$ . However, if the trees are degenerate and represent linked lists, the complexity is $O(n^2 m^2)$ . Therefore, the worst-case complexity stays the same as in the version without the keyroots. In practice, we expect the improved algorithm to perform faster.

4.8. Tracing the Sequence

We can trace the minimal sequence of operations that transform $T_1$ into $T_2$ the same way we trace the paths in the search algorithms.

When we determine the minimum of $ted_{ins}$ , $ted_{del}$ , and $ted_{rel}$ , we remember which operation leads to that results. When we compute the complete TED, we can follow the operations in the reversed order.

5. Conclusion

In this article, we talked about tree edit distance. We showed how to define and how to calculate it in polynomial time.

Core Concepts

Operating Systems

Artificial Intelligence

Graph Theory

Latex

Full Archive

About Baeldung