Streaming Median | Baeldung on Computer Science

1. Introduction

In this tutorial, we’ll present two algorithms for calculating the median of a data stream.

2. The Median of a Stream

A median of a random variable $X$ is a value ${med}_X$ that splits the $X$ ‘s range in half: one half lower than the ${med}_X$ and the other greater than it:

$\Pr\left(X \geq {med}_X\right) = \Pr\left( X \leq {med}_X \right) = \frac{1}{2}$

To determine the median of a sample $\{x_i\}_{1}^{n}=\{x_1, x_2, \ldots, x_n\}$ drawn from $X$ , we sort it to get the sample order statistics $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$ and select the middle value as the sample’s median:

$med_n = \begin{cases} x_{\left\lceil n/2 \right\rceil},& n \text{ is odd} \\ \frac{ x_{ n/2 } + x_{ n/2 + 1} }{2},& n \text{ is even} \end{cases}$

This formula, however, assumes that the sample is sortable. That means that we can store it in the memory and make multiple passes over it.

2.1. Data Streams

That’s not the case when the sample is a big data stream. In such cases, we read the elements one by one from the stream as soon as they become available. Additionally, were we to keep them in the memory, we’d run out of it before the stream would end. So, we can’t effectively store or sort the whole stream, and we can’t revisit an element after reading it unless we save it. Unfortunately, we don’t have enough memory to keep the whole stream. For those reasons, we can’t determine the median in the usual way.

Instead, we use the streaming algorithms specialized for data streams. By design, they run in low-memory environments that can store only a portion of the stream at any point during an algorithm’s execution. To respect those constraints, the streaming algorithms sacrifice precision for low memory complexity. So, they don’t return the estimates the offline algorithms would. The goal is to have low time and space complexities while retaining as much precision as possible.

3. Symmetric Data

If the data are distributed symmetrically, then their median is equal to their mean. So, we can find the median by calculating the streaming mean:

algorithm StreamingMeanAlgorithm(stream):
    // INPUT
    //    stream = a data stream
    // OUTPUT
    //    The mean of the stream

    n <- 0
    mean <- 0

    while stream is not empty:
        n <- n + 1
        x <- get the next (n-th) element of stream
        mean <- ((n - 1) * mean + x) / n

    return mean

This algorithm has $O(1)$ memory complexity and runs in $O(n)$ time. However, if we can’t guarantee that the data follow a symmetric distribution such as normal or uniform, we shouldn’t estimate the median with the streaming mean. Instead, we can use a quantile estimation streaming algorithm such as $P^2$ .

4. The $\boldsymbol{P^2}$ Algorithm

A $p$ -quantile of $X$ ( $p\in [0, 1]$ ) is the value $x_p$ greater than $100p\%$ of $X$ ‘s distribution:

$\Pr\left(X \leq x_p\right) = p \quad (p \in [0, 1])$

The median is a $0.5$ -quantile, so we can use any method for streaming quantile estimation, such as the $P^2$ algorithm. We’ll present it for the median case ( $p=0.5$ ).

4.1. Five Markers

To find the median, $P^2$ maintains five markers while reading the stream:

$q_1$ , the minimimum
$q_2$ , the $0.25$ -quantile
$q_3$ , the $0.5$ -quantile
$q_4$ , the $0.75$ -quantile
$q_5$ , the maximum (the $1$ -quantile)

The $q_2$ and $q_4$ are called the middle markers because they’re between the median and the extreme values. The idea is to update the markers as we read the stream. In the end, we return the current value of $q_3$ .

4.2. The Update Step

Each $q_i$ is associated with $n_i$ , the number of elements observed so far not greater than the current $q_i$ ( $i=1,2,3,4,5$ ). Ideally, $n_i$ will be approximately equal to $n'_i = p_i n$ at any point, where $n$ is the number of elements we’ve read from the stream, and $p_i$ determines which quantile the marker $q_i$ represents ( $p_1 = 1/n, p_2=0.25, p_3=0.5, p_4=0.75, p_5=1$ ). More precisely:

$\begin{aligned} n'_1 &= 1 \\ n'_2 &= 0.25(n-1) + 1 \\ n'_3 &= 0.5(n-1)+1 \\ n'_4 &= 0.75(n-1) + 1\\ n'_5 &= n \end{aligned}$

When reading a new element, $x_n$ , we increase by $1$ the $n_i$ of all markers that are not lower than $x_n$ . If the new $n_i$ doesn’t differ from its ideal value $n'_i$ by at least 1, we don’t update $q_i$ . Otherwise, we update $q_i$ using the piecewise parabolic (PP or $\boldsymbol{P^2}$ ) formula (which is how the algorithm got its name). If we adjust $q_i$ , we also modify $n_i$ as explained in the next section.

4.3. Piecewise Parabolic Adjustment

We assume that the curve passing through any three adjacent markers is a parabola:

$q_i = a n_i^2 + b n_i + c$

That is, we assume that a second-order function is a good enough approximation of the underlying CDF between any three markers.

The ideal $n_i'$ s are on the real scale, but the actual $n_i$ s are integers. If the difference between an updated $n_i$ and $n'_i$ is $< 1$ , we do nothing. Otherwise, we update $q_i$ as follows:

(1)

where $d=sgn(n_i - n'_i)$ , that is, $d=1$ if the difference is $\geq 1$ , and $d=-1$ if the difference is $\leq -1$ .

If $q_i$ is to be adjusted, we adjust $n_i$ as well:

$n_i \leftarrow n_i + d$

4.4. Technical Remarks

The adjustment formula applies only to $\boldsymbol{i=2,3,4}$ . The first marker, which corresponds to the minimum, is replaced by $x_n$ if it’s the new minimum. The same goes for the maximum marker, $q_5$ . We don’t adjust their values as we do for $q_2$ , $q_3$ , and $q_4$ .

All the $\boldsymbol{n_i}$ s must be different. So, if adjustment leads to any two being equal, we don’t apply it. Also, $q_i$ s must be non-decreasing ( $q_1 \leq q_2 \leq q_3 \leq q_4 \leq q_5$ ). If the parabolic adjustment violates the ordering, we use the linear alternative instead:

(2)

Last but not least, we use the first five elements $x_1, x_2, x_3, x_4, x_5$ to initialize the markers, so we don’t update or adjust them. So the main part of the algorithm starts with $n>5$ .

4.5. Pseudocode

Here we present the pseudocode of $P^2$ :

algorithm P2Algorithm(stream):
    // INPUT
    //    stream = a data stream
    // OUTPUT
    //    The median of stream

    for n <- 1, 2, 3, 4, 5:
        x_n <- get the next element of stream

    Sort x_1, x_2, x_3, x_4, x_5

    Use the sorted five elements to initialize the markers q_i
    for i <- 1 to 5:
       Set n_i to i

    n <- 5

    while stream is not empty:
        n <- n + 1
        Update the ideal positions n_i` for each i=1,2,4,5
        x_n <- get the next element of stream
        Find the lowest j such that x_n >= q_j
        for i <- j to 5:
            n_i <- n_i + 1
        for i <- 2, 3, 4:
            d_i <- n_i` - n_i
            if |d_i| >= 1:
                d <- sgn(d_i)
                Calculate the adjusted marker q_i` using the parabolic formula
                if q_{i-1} < q_i` < q_{i+1}:
                    q_i <- q_i`
                else:
                    Use the linear formula to calculate the adjusted marker q_i`
                    q_i <- q_i`
                n_i <- n_i + d

    return q_3

4.6. Possible Improvements and Trade-Offs

It turns out that $P^2$ performs rather well. The reason is that second-degree curves are sufficiently precise piecewise approximations of the actual CDFs the data follow in practice. Higher-order curves may offer even better estimates, but improved precision may not be worth the increased mathematical complexity and computational burden.

Just as five markers work better than three (the minimum $q_1$ , the target $q_2$ , and the maximum $q_3$ ), seven markers would probably allow more precise estimation. However, using more markers would slow down the algorithm.

4.7. Complexity

The space complexity of $P^2$ is $O(1)$ because it uses the constant number of markers. Time complexity is $O(n)$ because it reads the stream once.

5. Example

Let’s demonstrate $P^2$ on the following stream:

$x = [50,30,20,40,10,10,20, \ldots]$

5.1. Initialization

We read and sort the first five numbers to initialize the $q_i$ s and $n_i$ s ( $i=1,2,3,4,5$ ):

$\begin{aligned} && q_1 = 10 && q_2 = 20 && q_3 = 30 && q_4 =40 && q_5 = 50 \\ && n_1 = 1 && n_2 = 2 && n_3 = 3 && n_4 = 4 && n_5 = 5 \end{aligned}$

Since $n=5$ , the ideal $n_i'$ values are:

$\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(5-1) + 1 = 0.25\cdot 4 + 1 = 2 \\ n_3' &= 0.5(5-1)+1 = 0.5 \cdot 4 + 1 = 3 \\ n'_4 &= 0.75(5-1) + 1 = 0.75\cdot 4 + 1 = 4 \\ n'_5 &= 5 \end{aligned}$

5.2. Reading the Sixth Element

Now, we read $x_6=10$ . Since $n=6$ , the ideal values change:

$\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(6-1) + 1 = 0.25\cdot 5 + 1 = 2.25 \\ n_3' &= 0.5(6-1)+1 = 0.5 \cdot 5 + 1 = 3.5 \\ n'_4 &= 0.75(6-1) + 1 = 0.75\cdot 5 + 1 = 4.75 \\ n'_5 &= 6 \end{aligned}$

and so do the actual values:

$\begin{aligned} && n_1 = 2 && n_2 = 3 && n_3 = 4 && n_4 = 5 && n_5 = 6 \end{aligned}$

No absolute difference $|n'_i-n_i|$ is $\geq 1$ , so we don’t update $q_i$ .

5.3. Reading the Seventh Element

Now, we read $x_7 = 20$ . As before, the ideal values change:

$\begin{aligned} n_1' &= 1 \\ n_2' &= 0.25(7-1) + 1 = 0.25\cdot 6 + 1 = 2.5 \\ n_3' &= 0.5(7-1)+1 = 0.5 \cdot 6 + 1 = 4 \\ n'_4 &= 0.75(7-1) + 1 = 0.75\cdot 6 + 1 = 5.5 \\ n'_5 &= 7 \end{aligned}$

and so do the actual ones:

$\begin{aligned} n_1 = 2 && n_2 = 4 && n_3 = 5 && n_4 = 6 && n_5 = 7 \end{aligned}$

Since $d'_2 = 2.5-4=-1.5 \leq -1$ and $d'_3 = 4-5=-1 \leq -1$ , we need to update $q_2$ and $q_3$ . Using the parabolic formula for $q_2$ (and $d=-1$ ), we get:

$\begin{aligned} q'_2 &= 20 +\frac{-1}{5-2}((4-2-1)\cdot \frac{30-20}{5-4} + (5-4+1)\cdot\frac{20-10}{4-2}) \\ &= 20- \frac{1}{3}(1\cdot 10 + 2\cdot 5) \\ &=20 - \frac{20}{3} \approx 13.33 \end{aligned}$

Since $10< 13.33 < 30$ , we accept the parabolic adjustment. We also update $n_2 \leftarrow 4 - 1 = 3$ . The same logic applies to $q_3$ .

6. Conclusion

In this article, we presented two algorithms for finding the median of a big data stream. First, if the data are symmetrical, we run the streaming mean algorithm because the median and the mean of symmetric distributions are the same. Otherwise, we can use $P^2$ , one of the many methods designed to estimate the streaming quantiles, of which the median is a special case.

Learn Java Collections

Learn Spring

Learn Maven

View All Courses

Core Concepts

Operating Systems

Neural Networks

Graph Theory

Latex

Full Archive

About Baeldung