1. Introduction

In this tutorial, we’ll present two algorithms for calculating the median of a data stream.

2. The Median of a Stream

A median of a random variable X is a value {med}_X that splits the X‘s range in half: one half lower than the {med}_X and the other greater than it:

    \[\Pr\left(X \geq {med}_X\right) = \Pr\left( X \leq {med}_X \right) = \frac{1}{2}\]

To determine the median of a sample \{x_i\}_{1}^{n}=\{x_1, x_2, \ldots, x_n\} drawn from X, we sort it to get the sample order statistics x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)} and select the middle value as the sample’s median:

    \[med_n = \begin{cases} x_{\left\lceil n/2 \right\rceil},& n \text{ is odd} \\ \frac{ x_{ n/2 } + x_{ n/2 + 1} }{2},& n \text{ is even} \end{cases}\]

This formula, however, assumes that the sample is sortable. That means that we can store it in the memory and make multiple passes over it.

2.1. Data Streams

That’s not the case when the sample is a big data stream. In such cases, we read the elements one by one from the stream as soon as they become available. Additionally, were we to keep them in the memory, we’d run out of it before the stream would end. So, we can’t effectively store or sort the whole stream, and we can’t revisit an element after reading it unless we save it. Unfortunately, we don’t have enough memory to keep the whole stream. For those reasons, we can’t determine the median in the usual way.

Instead, we use the streaming algorithms specialized for data streams. By design, they run in low-memory environments that can store only a portion of the stream at any point during an algorithm’s execution. To respect those constraints, the streaming algorithms sacrifice precision for low memory complexity. So, they don’t return the estimates the offline algorithms would. The goal is to have low time and space complexities while retaining as much precision as possible.

3. Symmetric Data

If the data are distributed symmetrically, then their median is equal to their mean. So, we can find the median by calculating the streaming mean:

Rendered by QuickLaTeX.com

This algorithm has O(1) memory complexity and runs in O(n) time. However, if we can’t guarantee that the data follow a symmetric distribution such as normal or uniform, we shouldn’t estimate the median with the streaming mean. Instead, we can use a quantile estimation streaming algorithm such as P^2.

4. The \boldsymbol{P^2} Algorithm

A p-quantile of X (p\in [0, 1]) is the value x_p greater than 100p\% of X‘s distribution:

    \[\Pr\left(X \leq x_p\right) = p \quad (p \in [0, 1])\]

The median is a 0.5-quantile, so we can use any method for streaming quantile estimation, such as the P^2 algorithm. We’ll present it for the median case (p=0.5).

4.1. Five Markers

To find the median, P^2 maintains five markers while reading the stream:

  • q_1, the minimimum
  • q_2, the 0.25-quantile
  • q_3, the 0.5-quantile
  • q_4, the 0.75-quantile
  • q_5, the maximum (the 1-quantile)

The q_2 and q_4 are called the middle markers because they’re between the median and the extreme values. The idea is to update the markers as we read the stream. In the end, we return the current value of q_3.

4.2. The Update Step

Each q_i is associated with n_i, the number of elements observed so far not greater than the current q_i (i=1,2,3,4,5). Ideally, n_i will be approximately equal to n'_i = p_i n at any point, where n is the number of elements we’ve read from the stream, and p_i determines which quantile the marker q_i represents (p_1 = 1/n, p_2=0.25, p_3=0.5, p_4=0.75, p_5=1). More precisely:

    \[\begin{aligned} n'_1 &= 1 \\ n'_2 &= 0.25(n-1) + 1 \\ n'_3 &= 0.5(n-1)+1 \\ n'_4 &= 0.75(n-1) + 1\\ n'_5 &= n \end{aligned}\]

When reading a new element, x_n, we increase by 1 the n_i of all markers that are not lower than x_n. If the new n_i doesn’t differ from its ideal value n'_i by at least 1, we don’t update q_i. Otherwise, we update q_i using the piecewise parabolic (PP or \boldsymbol{P^2}) formula (which is how the algorithm got its name). If we adjust q_i, we also modify n_i as explained in the next section.

4.3. Piecewise Parabolic Adjustment

We assume that the curve passing through any three adjacent markers is a parabola:

    \[q_i = a n_i^2 + b n_i + c\]

That is, we assume that a second-order function is a good enough approximation of the underlying CDF between any three markers.

The ideal n_i's are on the real scale, but the actual n_is are integers. If the difference between an updated n_i and n'_i is < 1, we do nothing. Otherwise, we update q_i as follows:

(1)   \begin{equation*} q_i \leftarrow q_i + \frac{d}{n_{i+1}-n_{i-1}}\left((n_i - n_{i-1} + d) \frac{q_{i+1} - q_i}{n_{i+1} - n_i} +(n_{i+1} - n_i - d) \frac{q_i - q_{i-1}}{n_{i} - n_{i-1}}\right)\end{equation*}

where d=sgn(n_i - n'_i), that is, d=1 if the difference is \geq 1, and d=-1 if the difference is \leq -1.

If q_i is to be adjusted, we adjust n_i as well:

    \[n_i \leftarrow n_i + d\]

4.4. Technical Remarks

The adjustment formula applies only to \boldsymbol{i=2,3,4}. The first marker, which corresponds to the minimum, is replaced by x_n if it’s the new minimum. The same goes for the maximum marker, q_5. We don’t adjust their values as we do for q_2, q_3, and q_4.

All the \boldsymbol{n_i}s must be different. So, if adjustment leads to any two being equal, we don’t apply it. Also, q_is must be non-decreasing (q_1 \leq q_2 \leq q_3 \leq q_4 \leq q_5). If the parabolic adjustment violates the ordering, we use the linear alternative instead:

(2)   \begin{equation*} q_i \leftarrow q_i + d\frac{q_{i+d} - q_i}{n_{i+d} - n_i}\end{equation*}

Last but not least, we use the first five elements x_1, x_2, x_3, x_4, x_5 to initialize the markers, so we don’t update or adjust them. So the main part of the algorithm starts with n>5.

4.5. Pseudocode

Here we present the pseudocode of P^2:

Rendered by QuickLaTeX.com

4.6. Possible Improvements and Trade-Offs

It turns out that P^2 performs rather well. The reason is that second-degree curves are sufficiently precise piecewise approximations of the actual CDFs the data follow in practice. Higher-order curves may offer even better estimates, but improved precision may not be worth the increased mathematical complexity and computational burden.

Just as five markers work better than three (the minimum q_1, the target q_2, and the maximum q_3), seven markers would probably allow more precise estimation. However, using more markers would slow down the algorithm.

4.7. Complexity

The space complexity of P^2 is O(1) because it uses the constant number of markers. Time complexity is O(n) because it reads the stream once.

5. Example

Let’s demonstrate P^2 on the following stream:

    \[x = [50,30,20,40,10,10,20, \ldots]\]

5.1. Initialization

We read and sort the first five numbers to initialize the q_is and n_is (i=1,2,3,4,5):

    \[\begin{aligned} && q_1 = 10 && q_2 = 20 &&  q_3 = 30 &&  q_4 =40 && q_5 = 50 \\ && n_1 = 1 && n_2 = 2 && n_3 = 3 && n_4 = 4 && n_5 = 5 \end{aligned}\]

Since n=5, the ideal n_i' values are:

    \[\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(5-1) + 1 = 0.25\cdot 4 + 1 = 2 \\ n_3' &= 0.5(5-1)+1 = 0.5 \cdot 4 + 1 = 3 \\ n'_4 &= 0.75(5-1) + 1 = 0.75\cdot 4 + 1 = 4 \\ n'_5 &= 5 \end{aligned}\]

5.2. Reading the Sixth Element

Now, we read x_6=10. Since n=6, the ideal values change:

    \[\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(6-1) + 1 = 0.25\cdot 5 + 1 = 2.25 \\ n_3' &= 0.5(6-1)+1 = 0.5 \cdot 5 + 1 = 3.5 \\ n'_4 &= 0.75(6-1) + 1 = 0.75\cdot 5 + 1 = 4.75 \\ n'_5 &= 6 \end{aligned}\]

and so do the actual values:

    \[\begin{aligned} && n_1 = 2 && n_2 = 3 && n_3 = 4 && n_4 = 5 && n_5 = 6 \end{aligned}\]

No absolute difference |n'_i-n_i| is \geq 1, so we don’t update q_i.

5.3. Reading the Seventh Element

Now, we read x_7 = 20. As before, the ideal values change:

    \[\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(7-1) + 1 = 0.25\cdot 6 + 1 = 2.5 \\ n_3' &= 0.5(7-1)+1 = 0.5 \cdot 6 + 1 = 4 \\ n'_4 &= 0.75(7-1) + 1 = 0.75\cdot 6 + 1 = 5.5 \\ n'_5 &= 7 \end{aligned}\]

and so do the actual ones:

    \[\begin{aligned} n_1 = 2 && n_2 = 4 && n_3 = 5 && n_4 = 6 && n_5 = 7 \end{aligned}\]

Since d'_2 = 2.5-4=-1.5 \leq -1 and d'_3 = 4-5=-1 \leq -1, we need to update q_2 and q_3. Using the parabolic formula for q_2 (and d=-1), we get:

    \[\begin{aligned} q'_2 &= 20 +\frac{-1}{5-2}((4-2-1)\cdot \frac{30-20}{5-4} + (5-4+1)\cdot\frac{20-10}{4-2}) \\ &= 20- \frac{1}{3}(1\cdot 10 + 2\cdot 5) \\ &=20 - \frac{20}{3} \approx 13.33 \end{aligned}\]

Since 10< 13.33 < 30, we accept the parabolic adjustment. We also update n_2 \leftarrow 4 - 1 = 3. The same logic applies to q_3.

6. Conclusion

In this article, we presented two algorithms for finding the median of a big data stream. First, if the data are symmetrical, we run the streaming mean algorithm because the median and the mean of symmetric distributions are the same. Otherwise, we can use P^2, one of the many methods designed to estimate the streaming quantiles, of which the median is a special case.

Comments are closed on this article!