1. Overview

When we manage files in Linux, we may want to extract statistical data from a list of numbers or from a file with numeric content.

In this tutorial, we’ll focus on obtaining the minimum and maximum values, the median, and the mean of such a dataset with the help of several Linux tools.

2. Creating Our Sample File

Before we start, let’s create a file called data.txt that we’ll use to test our strategies:

$ cat << eof > data.txt
10
32
4.23
4
223
-53
eof

3. Using awk

awk is a text-processing language that reads input record by record and lets us compute values and build reports from it. Several implementations exist; the asort function used in this section is specific to GNU awk (gawk).

An advantage of awk is that it comes pre-installed on many Linux distributions.

3.1. Using Only awk

Let’s create an awk script that computes all four values and save it in a file called calculate.awk:

#!/usr/bin/gawk -f
# asort() is a gawk extension, so this script needs GNU awk
# (adjust the interpreter path if gawk is installed elsewhere)
{ 
    sum += $1
    nums[NR] = $1  # We store the input records
}
END {
    if (NR == 0) exit  # To avoid division by zero
 
    asort(nums)  # Here, we sort the array that contains the stored input
 
    # Even count: average the two middle values; odd count: take the middle one.
    # Let's be careful with the ternary syntax when splitting it across lines
    median = (NR % 2 == 0) ? \
        ( nums[NR / 2] + nums[(NR / 2) + 1] ) / 2 \
        : \
        nums[int(NR / 2) + 1]
 
    # We wrote "(NR / 2) + 1" instead of "NR/2 + 1" purely for clarity
 
    mean = sum/NR
 
    # Let's beautify the output
    printf \
        "min = %s, max = %s, median = %s, mean = %s\n",\
        nums[1],\
        nums[NR],\
        median,\
        mean
}

Let’s take a closer look at the code.

The NR variable holds the number of input records read so far. We use NR inside the END block, which runs only after all the input has been consumed. At that point, NR equals the total number of records, that is, the size of our dataset.
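As a quick check of that behavior, here is a minimal sketch that feeds three made-up values to awk and prints NR from the END block. The variable name record_count is only illustrative.

```shell
# NR grows by one per record and keeps its final value inside END.
record_count=$(printf '10\n32\n4.23\n' | awk 'END { print NR }')
echo "$record_count"   # prints 3
```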

The script relies on a simple fact: once nums is sorted in ascending order, its first element is the minimum and its last element is the maximum.

We read nums[1] and nums[NR] because asort re-indexes the sorted array from 1 to n, where n = NR in our case.

Now, with the awk script in place, let’s run:

$ awk -f calculate.awk data.txt
min = -53, max = 223, median = 7.115, mean = 36.705

3.2. Combining awk with sort

In the script from the previous section, we used gawk’s asort function, but we can do without it.

Let’s create a script called calculate2.awk that’s identical to calculate.awk except that the asort call is removed. We then sort the values with the sort command before piping them to awk:

$ sort -n data.txt | awk -f calculate2.awk
min = -53, max = 223, median = 7.115, mean = 36.705

In the sort command, the -n option sorts by numeric value instead of comparing the lines as text.
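The text doesn’t list calculate2.awk, so the following is only a sketch of what it might look like, assuming it’s calculate.awk with the asort call dropped. It uses a plain if/else in place of the multi-line ternary, and it inlines the sample values so that it runs on its own:

```shell
# Hypothetical listing of calculate2.awk: calculate.awk minus the asort call.
cat << 'eof' > calculate2.awk
{
    sum += $1
    nums[NR] = $1    # input is expected to arrive already sorted
}
END {
    if (NR == 0) exit    # no input: avoid division by zero

    if (NR % 2 == 0)
        median = (nums[NR / 2] + nums[NR / 2 + 1]) / 2
    else
        median = nums[int(NR / 2) + 1]

    printf "min = %s, max = %s, median = %s, mean = %s\n",
        nums[1], nums[NR], median, sum / NR
}
eof

# Same values as data.txt, sorted numerically before awk sees them.
result=$(printf '10\n32\n4.23\n4\n223\n-53\n' | sort -n | awk -f calculate2.awk)
echo "$result"
```

Since this version contains no gawk extension, it should also run under other awk implementations.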

3.3. Having All the Data We Need

What if we have all the data we need? In other words, what if we know the size of the dataset, and it’s also sorted (from lowest to highest)?

In that case, we need neither sorting nor an array that stores every value. Let’s create an awk script called calculate3.awk that uses only simple arithmetic and comparisons:

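#!/usr/bin/awk -f
# Expects sorted input and the dataset size passed with -v size=N
BEGIN {
    # Record numbers that hold the median: two for an even size, one for an odd size
    if (size % 2 == 0) {
        median_position[0] = size/2
        median_position[1] = (size/2) + 1
    }
    else
        median_position[0] = int(size/2) + 1
}
NR == 1 { min = $1 }  # The first record of sorted input is the minimum
NR == median_position[0] { a_median[0] = $1 }
NR == median_position[1] { a_median[1] = $1 }
{ sum += $1 }
END {
    if (NR == 0) exit

    # median_position[1] is only set when the size is even
    median = (median_position[1]) ? \
        (a_median[0] + a_median[1]) / 2 \
        : \
        a_median[0]

    max = $1  # In END, $1 still holds the last record read
    mean = sum/size

    printf \
        "min = %s, max = %s, median = %s, mean = %s\n",\
        min,\
        max,\
        median,\
        mean
}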

Let’s break this down. Since the data is already ordered from lowest to highest, the first record is the minimum value, and the last record is the maximum. That’s why we set min = $1 when NR equals 1 and max = $1 inside the END block, where $1 still holds the value of the last record read.
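The BEGIN block of calculate3.awk decides which record numbers hold the median. The arithmetic can be checked in isolation with a small sketch; the median_positions helper is hypothetical and exists only for this illustration:

```shell
# Print the record number(s) that hold the median for a dataset of the given size.
median_positions() {
    awk -v size="$1" 'BEGIN {
        if (size % 2 == 0) print size / 2, size / 2 + 1
        else               print int(size / 2) + 1
    }'
}

median_positions 6   # prints: 3 4
median_positions 5   # prints: 3
```

For our six-value file, the median is therefore the average of the third and fourth sorted records, 4.23 and 10.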

Finally, we can run the script in a pipeline:

$ size=$(wc -l < data.txt); sort -n data.txt | awk -v size="$size" -f calculate3.awk
min = -53, max = 223, median = 7.115, mean = 36.705

Let’s take a closer look at the pipeline:

  • With wc -l < data.txt, we count the lines of data.txt, which equals the number of values because each value sits on its own line
  • With the -v option of awk, we pass that count to the script as the size variable; the script uses it to decide whether the dataset size is odd or even and which records hold the median
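One caveat worth keeping in mind: wc -l counts newline characters, so a file whose last line lacks a trailing newline is reported one line short. The sketch below compares it with counting records in awk, using made-up inline data and illustrative variable names:

```shell
# A three-value dataset whose last line has no trailing newline.
lines_wc=$(printf '1\n2\n3' | wc -l | tr -d ' ')
lines_awk=$(printf '1\n2\n3' | awk 'END { print NR }')
echo "wc -l: $lines_wc, awk NR: $lines_awk"   # prints "wc -l: 2, awk NR: 3"
```

Our data.txt comes from a here-document, which ends every line with a newline, so wc -l reports the correct size there.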

In all our examples, we can see one of the strengths of awk: it gives us fine-grained control over the data at every stage of the process.

4. Using datamash

GNU datamash is a command-line tool that performs numeric, textual, and statistical operations on text input.

This tool usually doesn’t come pre-installed on Linux distributions, so we need to install it before we start writing our one-liners.

If we’re using a Debian-based distribution, we can use:

$ sudo apt-get install -y datamash

Or, if we use yum:

$ sudo yum install datamash

And for other distributions and installation options, we can consult the download page for further instructions.

Fortunately, datamash provides all the operations that we want to apply to our dataset:

$ datamash min 1 max 1 median 1 mean 1 < data.txt
-53     223     7.115   36.705

Here, we list each operation followed by the number of the column to apply it to. The results are printed on a single line, separated by tabs, in the same order as the operations.

5. Conclusion

In this tutorial, we reviewed several approaches that we can use on the Linux command line to get the minimum, maximum, median, and mean values of a dataset.

We also created a variety of awk scripts to calculate these values and then looked at an example using the datamash tool.
