In this tutorial, we’ll talk about the tree edit distance (TED) and how to calculate it. TED is similar to string edit distance but works on trees.
2. What Is a Tree Edit Distance (TED)?
TED is the minimal number of select operations necessary to transform one tree () to another (). Usually, those operations are: relabel, delete, and insert a node. However, we can also define edge-based operations such as the removal and creation of an edge between two nodes. In either case, they act only on .
We can also associate different costs to those operations. Then, TED will be the cost of the least expensive sequence of actions transforming to , where the sequence’s cost is the sum of the individual actions’ costs.
Let’s take a look at these two trees:
We can define the transformation costs as follows:
The least expensive sequence is:
Its cost is . So, .
2.2. Interpretation of TED
Since it depends on the transforming operations and their costs, TED can be interpreted only concerning those settings. Using different operations and costs can give different distances. Therefore, we should choose them in a way that makes sense in our application.
Anyhow, if TED fulfills the metric axioms, it induces a metric space of trees. That allows us to analyze them rigorously.
So, let’s now define a general TED and the way to calculate it.
3. General Recursive Calculation
Let represent the set of operations we can perform on tree , and let be the cost of operation . The costs can be negative, but we’ll focus only on the case where all are positive.
Our general recursive definition of TED is:
We can’t say more about its complexity other than it may be exponential, depending on and .
However, if we find an efficient way to hash trees, we may reduce the complexity. The first time we calculate , we hash and insert the pair into a hash map. That way, if appears again in the recursion, we don’t have to compute it again.
This particular technique is known as memoization. Some authors consider it to be a tool of dynamic programming.
4. A Node-Based TED
The recursion (1) covers a general case of TED. Here, we’ll focus on more specific operations and costs. In particular, on the node-based relabel, insert, and delete.
Let’s explain the operations in more detail. While deleting a node, we don’t leave its children hanging in the air. Instead, we add them as the new children of the node’s parent:
Relabeling a node means changing its label. For example:
Unlike deletion and relabeling, insertions have more than one argument. We specify the node to insert and the node that will be its parent:
4.2. Assumptions and Remarks
Let’s first state an important assumption: the nodes in and have a left-to-right ordering. So, for any two nodes in a tree, we can say which one is to the left of the other.
Another assumption is that we can insert, delete, or relabel a node at most once during the transformation. For instance, we can’t delete a node after inserting it. Otherwise, the space of possible transformation sequences to consider would become infinite.
The costs of the operations are constant. So, has the same cost regardless of the node we’re deleting () and the state of the tree we’re deleting it from (). The same goes for insertion and relabeling. Let be their costs.
Since inserting a node to is the same as deleting it from , we need not apply insertion to in our implementation. Instead, we can delete the node from and consider it inserted into . Analogously, when relabeling to match , we don’t need to actually change the label. We can delete both nodes from the trees and consider relabeled to . To ease the notation, we’ll denote deletions with a minus sign: instead of .
Finally, the cost function in our algorithm is restricted to be a distance metric. That makes the TED that it computes a distance metric too. A non-negative constant cost function we opted for fulfills this condition.
4.3. Computing TED Using Forests
In its intermediate steps, the algorithm operates on forests. It does so by considering the post-order numbering of the nodes. Let be the -th node of in the post-order traversal. Then, is the forest consisting of the nodes numbered in the post-order traversal of . For instance:
Although it may seem that dealing with a more general structure will complicate things, it isn’t so. Instead, it makes the problem easier. By dividing the trees into sub-forests, we can solve the problem using dynamic programming.
4.4. The Recurrence
Let be the post-order sub-forest of and let denote its rightmost root. Also, let be the rightmost tree of (the one rooted at ). The same notation holds for . Then, we calculate TED as follows:
We can calculate the recurrence with the memoization technique. For instance, we can hash each forest (tree) as a pair , where and are the post-order bounds of the nodes in the original tree. So, the first time we calculate the TED of and , we use as the key and insert the corresponding value into a map. Later, we reuse the stored result to avoid repeated calculation.
The forest and tree structures should be supplied with efficient methods for evaluating , and ().
4.5. Tree Operations
First, let’s observe that the root of the rightmost tree of the forest is the node number , so for . From there, amounts to . The difference of and is the forest consisting of all the nodes that are in but not . The same goes for the other tree ..
Finding (), however, requires a more careful consideration. The right boundary of is the post-order index of its root, that is, . We can compute the left boundary on the fly by running a post-order traversal routine, but this approach will slow down the algorithm. Instead, it’s more efficient to preprocess the tree and compute each sub-tree’s post-order left boundary so that it’s available in time during the computation of TED. We can do it in linear time using a post-order traversal algorithm.
So, we’ll assume that , the left post-order boundary of the tree whose root is , is available for the nodes in both trees. Then, is , and is . If (the forest doesn’t contain only one tree), then is . Otherwise, .
4.6. The Memoization Algorithm
Here’s how we can calculate with a memoization algorithm:
This algorithm is a memoized version of Zhang and Shasha‘s iterative TED algorithm. So, its complexity is the same.
We’ll derive it by considering the relevant (sub)forests of and .
A forest is relevant if it appears in the recursive calculation of . Zhang and Shasha define a keyroot as a tree’s root or a node with a left sibling. We can prove that the post-order numbering of a relevant forest is a prefix of the post-order traversal of a keyroot’s tree.
Let the keyroot depth of () be the highest number of keyroot nodes on the path from the root to any other node in the tree. Then, Zhang and Shasha’s algorithm’s complexity (and consequently of ours too) is .
We can improve the result even further. Let and denote the height and the number of leaves in , and let and do the same for . Zhang and Shasha prove that and that . So, the overall complexity is:
If the trees are balanced, then and , so the overall complexity becomes:
or if . However, if the trees are degenerate and represent linked lists, the complexity is . Therefore, the worst-case complexity stays the same as in the version without the keyroots. In practice, we expect the improved algorithm to perform faster.
4.8. Tracing the Sequence
We can trace the minimal sequence of operations that transform into the same way we trace the paths in the search algorithms.
When we determine the minimum of , , and , we remember which operation leads to that results. When we compute the complete TED, we can follow the operations in the reversed order.
In this article, we talked about tree edit distance. We showed how to define and how to calculate it in polynomial time.