Authors Top

If you have a few years of experience in Computer Science or research, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.


1. Introduction

In this tutorial, we’ll show several ways to find the k smallest numbers in an array. We’ll also discuss those methods’ complexity and relative performance to one another.

All those techniques can be adapted to the converse problem of finding the k largest numbers in an array.

2. Problem Description

Given an array \boldymbol{a} of \boldsymbol{n} numbers, our goal is to find the \boldsymbol{k} smallest elements and return them in the non-decreasing order.

For example, if a = [1, 20, -30, 0, 11, 1, 503] and k=3, the solution is the array [-30, 0, 1] because those are the three smallest numbers in a.

When k=1, the problem reduces to finding the minimum of an array.

3. Naïve Solutions

We can loop over the array k times and find and remove the smallest number in each pass. This solution’s time complexity is O(n\times k). It’s inefficient because it traverses the whole array repeatedly.

Another naïve solution is to sort the whole a non-decreasingly and then take the first k elements of the sorted array. The problem with this approach is that it sorts all the n elements even though we need only the k smallest. This method’s complexity depends on the sorting algorithm but will be inefficient whenever k is much smaller than n. For instance, it doesn’t make sense to sort the whole array of 100 000 numbers if we need only the five smallest.

4. The Solution Based on a Min-Heap

We can do better with a min-heap. Thought of as a binary tree, a min-heap has the property that each node is less than or equal to its children. So, the root node is the minimum of a min-heap.

Once we build a min-heap out of the array, we can pop its root to get the smallest element in it. Then, we process the rest of the heap to preserve the heap property and pop the new root to get the second smallest number.

Repeating the process k times gives us the desired output:

 

cs1

Using Floyd’s algorithm, we can build the heap in O(n) time. Popping the root and heapifying the rest takes O(\log n) time, so k such steps can be done in O(k \log n) time in total. Hence, the total complexity is:

    \[O(n+k \log n)\]

If k is a linear function of n (for example, k=\frac{n}{10} when we look for the bottom 10\%), then the complexity becomes

    \[O(n\log n)\]

which is the lower bound of the complexity of sorting algorithms that use binary comparisons only.

5. The Solution Based on a Max-Heap

We can use a max-heap to solve the problem. A max-heap is like a min-heap, the only difference being that each max-heap’s node is greater than or equal to its children. So, the root of a max-heap contains its maximum.

In the beginning, we take the first k elements of the unsorted array and put them into a max-heap. Then, we iterate over the remaining n-k elements and compare them to the root.

If the current element (\boldsymbol{a[i]}) is lower than the root, then we know that there are at least \boldsymbol{k} elements not greater than the root (k-1 currently in our heap and the one with which we’ve just compared), so the root can’t be among the k smallest numbers. Hence, we replace the root with a[i], process the heap to preserve its property, and move to the next element in the array.

Otherwise, if \boldsymbol{a[i}</strong><strong>] is not lower than the root, we know that it’s greater than or equal to at least \boldsymbol{k} other numbers in the array, so it can’t be one of the k smallest, and we can discard it:

 

 

cs2

Using Floyd’s algorithm, we can construct the heap in O(k) time. In the worst case, each of the n-k elements initially out of the heap will be lower than the root, so we’ll perform the O(\log k) root replacement operation n-k times. As a result, the worst-case complexity is

    \[O(k+(n-k)\log k)=O(n \log k)\]

If k is a linear function of n, then the complexity is

    \[O(n\log n)\]

6. QuickSelSort

This approach combines QuickSelect with QuickSort.

First, we use QuickSelect to find the k-th smallest number. It will place it at the k-th position in the input array a and put the k-1 smaller numbers on positions 1 through k-1 but won’t sort them.

To sort those numbers from the smallest to the largest, we apply QuickSort to a[1], a[2], \ldots, a[k-1]. We don’t need to process a[k] with QuickSort because we already know it’s the k-th smallest:

 

 

cs3

Although the worst-case complexity of the approach is

    \[O(n^2)\]

its average complexity is

    \[O(n+k\log k)\]

which is rather good. However, if k is a linear function of n, then the average complexity turns out to be O(n\log n).

A notable advantage of QuickSelSort is that it uses two algorithms that are available in many libraries and tuned for performance, so we don’t have to implement it from scratch.

7. Partial QuickSort

Partial QuickSort is a modification of the usual QuickSort. The difference is that the former makes recursive calls only on the parts of the array that provably contain the k smallest elements.

In the usual QuickSort, we get the subarray a[i:j]=a[i,2,\ldots,j], select a pivot and partition a[i:j] into two smaller subarrays a[i:(p-1)] and a[(p+1):j]. Since the goal of QuickSort is to sort the whole array, we must make two recursive calls: one for the left and the other for the right subarrays.

In contrast, Partial QuickSort doesn’t always sort the right subarray.

If p + 1 > k, then Partial QuickSort completely ignores the right subarray since none of the k smallest numbers is in it. All the numbers we’re after are in the left subarray, so we make the recursive call only on a[i:(p-1)].

If p + 1 \leq k, then some of the sought numbers are in a[(p+1):i], so we have to make a recursive call on it as well:

 

 

cs4 cs5

Rendered by QuickLaTeX.com

If we select the pivot randomly, then the worst-case complexity of Partial QuickSort is O(n^2), and the expected complexity is O(n+k\log k), just as it’s the case with QuickSelSort. However, partial QuickSort makes fewer comparisons on average.

8. Partial QuickSort with Median of Medians

There’s a way to reduce the worst-case time complexity of Partial QuickSort to O(n+k\log k). We can achieve that by choosing the pivot element with an algorithm known as Median of Medians.

8.1. Median of Medians

The QuickSort and Partial QuickSelect algorithms show the best asymptotic performance if we reduce the array size by a constant factor before the recursive calls, as the size of the array to process will drop exponentially with the recursion depth.

The Median of Medians (MoM) is a pivot selection algorithm that does precisely so.

MoM returns an element between the bottom 30\% and the top 30\% of the input array. Therefore, the returned pivot splits the input array in a ratio between 30\%:\70% and 70\%:30\%. In the worst case, we reduce the size of the input by a factor of 70\%. Since that’s a reduction by a known constant, the input size drops exponentially.

8.2. Complexity Analysis

MoM runs in O(n) time. As Partial QuickSort does O(n) comparisons in each call, the overhead caused by MoM doesn’t change the asymptotic complexity of the call. However, the worst-case time complexity reduces to

    \[O(n+k\log k)\]

That’s the average-case complexity of the generic version of Partial QuickSort.

9. Hybrid Partial QuickSort

Even though Partial QuickSort with MoM has an O(n+k\log k) worst-case time complexity, it runs slower than the asymptotically inferior version of Partial QuickSort on most arrays in practice. MoM improves the worst-case complexity but makes the algorithm run slower on most inputs due to the additional overhead.

There’s a way to keep the practical efficiency and reduce the worst-case complexity at the same time. We draw inspiration from IntroSelect.

The idea is to start with the usual Partial QuickSort, which chooses pivots at random. We assume that the input at hand won’t cause the algorithm’s worst-case behavior and that it’s safe to use random selection. A good deal of time, we’ll be right. The average–case model will apply to most input arrays. However, if we notice that random selection results in inefficient partitioning, we fall back on MoM to avoid the O(n^2) worst-case and have it O(n+k\log k) instead.

We can switch to MoM with two simple rules:

  • We track the input array sizes in all the recursive calls. If any m consecutive calls do not halve the input’s length, where m is a constant set in advance, then we switch to MoM.
  • We track the sum of the input array sizes in recursive calls. If we notice that the sum in a call exceeds m\times n, where m>1, then we switch to MoM.

These rules limit the recursion depth to \Theta (\log n) ensuring that the worst-case complexity is O(n+k\log k). In addition, they’ll keep the algorithm efficient on most inputs. Let’s see how the second strategy works:

Rendered by QuickLaTeX.com

10. Complexity Comparison

Rendered by QuickLaTeX.com

We see that all the methods based on QuickSort and QuickSelect have the same average–case time complexity of O(n+k\log k). However, QuickSelSort and Partial QuickSort have an O(n^2) worst-case time complexity. Though asymptotically equivalent, the latter is more efficient in practice.

If we use MoM to choose the pivot in Partial QuickSort, we reduce the worst-case complexity to O(n+k\log k), but end up with an inefficient algorithm because of the overhead caused by MoM. Hybrid Partial QuickSort has the same asymptotic complexity as Partial QuickSort with MoM. But, it’s faster in practice because it uses MoM only when random selection results in inefficient partitioning.

Heaps offer interesting alternatives. However, their performance in practice depends on how much n is larger than k.

If k is a linear function of n, all the approaches become log-linear in n in the average case. Partial QuickSort with MoM, Hybrid Partial QuickSort, and the heap-based solutions become O(n\log n) in the worst case as well.

11. Other Approaches

Instead of QuickSort, we can adapt other sorting algorithms, such as MergeSort.

We can also base our solutions on the Floyd–Rivest selection algorithm. It finds the k-th smallest element in an array just as QuickSelect but is faster on average.

12. Conclusion

In this tutorial, we showed several ways to find the k smallest numbers in an array. We based most of our solutions on QuickSort but also presented two heap-inspired approaches and mentioned other algorithms that we can adapt to solve the problem.

Authors Bottom

If you have a few years of experience in Computer Science or research, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

Comments are closed on this article!