In this tutorial, we’ll discuss several ways of quantifying the extent to which an array is sorted.
We have an array and want to quantify how close it is to its sorted permutation.
So, whatever we define this measure, a completely sorted array such as:
should have a better score than a partially sorted one:
The metric should be defined in a way that allows for comparing two partially sorted arrays and concluding which one is more sorted.
It should also apply to any data type with an ordering relation, e.g., numbers and or strings and lexicographic ordering. We’ll cover the general case and denote the order as .
An inversion in an array is a pair whose elements are out of order:
A natural way to define sortedness is to count inversions:
For example, contains three inversions: , , and . We consider it more sorted than that has four inversions and less than with two.
By definition, the metric’s higher values correspond to less sorted arrays. So, a completely sorted array has zero inversions. An array sorted in the opposite direction has inversions: inversions containing the first element, inversions with the second, and down to one inversion with .
We can compare any two arrays by their inversion counts, but the comparison will be fair only if they have the same number of elements. For instance, three inversions in a -element array don’t denote the same degree of sortedness as three in a -element sequence.
To consider different sizes, we can divide the count with its maximum value of . This way, we get a metric whose values are in the range , where corresponds to a completely sorted array and 1 to a completely inverted sequence.
Let’s check out a few ways of computing this metric.
4. The Exact Approach
The straightforward way is to traverse the array with two loops:
We compare each with the elements to the right of it.
This approach is intuitive but has a quadratic time complexity. So, it isn’t practical for very large arrays.
5. The Sampling Approach
One way to avoid the performance issue is to count the inversions stochastically. That means sampling a predefined number of pairs and counting inversions among them:
When sampling, we need to ensure we don’t draw the same index twice and get as a pair to check. To see why, let’s recall that the condition is in the definition of an inversion, which implies that . This seems obvious, but it’s easy to overlook.
In practice, this approach will be faster than the solution but has several drawbacks:
- it can return a different result each time with run it on the same array
- we need to fine-tune since a value too small is likely to miss many inversions, whereas a too-large will make the execution longer
- the algorithm can get stuck in a loop if it never draws different indices, which is very unlikely but still possible
- we can end up checking the same pair or counting the same inversion multiple times
The first issue can be addressed by running the algorithm several times and returning the average result. Alternatively, we can use a larger if it doesn’t cause the method to run too long. Although the algorithm can run forever, it rarely happens in practice, especially with large values of and .
To fix the last drawback, we can sample the pairs without repetition.
5.2. Sampling Without Repetition
That means we should store the sampled pairs and ignore them if we draw them again:
Whenever we draw a new index pair , we check if we considered it before. If yes, we’ll skip it. Otherwise, we’ll insert it into to ignore it in the future.
For this approach to be efficient, the storage should be a data structure with efficient insertions and lookups, such as a hash table. Because of the overhead, this will be slower than sampling with repetition but more accurate.
6. Merge Count
Another way is to tweak a sorting algorithm to count inversions.
Not all algorithms will be easy to modify since we aren’t supposed to count swaps done during sorting but only the inversions in the original array.
Here’s a modification of Merge Sort:
To scale the count, we divide it with . The trick is to realize that the total number of inversions in is the sum of the counts of three types of inversions:
- when and are in the left half
- when and are in the right half
- is in the left and in the right halves (cross-inversions)
6.1. The Merge Step
We count cross-inversions in the merge step. Since and are sorted after the recursion, if and , it holds that is in inversions with and its predecessors in :
The algorithm’s time complexity is like that of Merge Sort, which beats the straightforward method.
It’s usually slower than the sampling approach but returns the same value over multiple runs. Therefore, this implementation might be the middle ground between the sampling’s performance and the quadratic algorithm’s exactness.
Unlike the previous two approaches, this one solves an additional problem (sorting an array) to handle the original one. So, it does more work than necessary.
7. Alternative Measures
A computationally lightweight approach would be to consider only the inversions with successive elements:
There are at most such inversions, and we need only one traversal over the array to count them, checking the successive pairs. So, the time complexity will be , but this measure might not give us a clear picture of an array’s sortedness. The reason is that out-of-order elements at the beginning of can participate in more inversions than those near the end, but this metric treats all pairs equally.
Another approach is to calculate the rank distance. Let be the sorted permutation of , and let be the rank of in the , which we can find by looking it up in (we use averages for non-unique elements). Then, the rank distance can be computed as the Euclidean distance between the original indices and ranks:
or the average absolute difference:
A shortcoming of this approach is that we need to sort to find the ranks.
In this article, we showed how to define and compute the sortedness of an array. A natural measure of it is the count of inversions. There are stochastic and deterministic ways to calculate it.
We should use the sampling approaches if we’re more concerned about performance.