## 1. Overview

In this tutorial, we’ll focus on tools that print aggregate statistics for the numbers in a file. We’ll be evaluating statistics such as the mean, median, mode, and standard deviation, among others.

## 2. Setup

Let’s create a *sample.txt* file containing a list of numbers separated by newlines:

`$ echo '1 2 3 4 5 6 7 8 9 10' | tr ' ' '\n' > sample.txt`

Here, we’re using *echo* to output all the numbers from 1 to 10, separated by spaces. Next, we’re piping the results to the *tr* command, which converts all spaces to newlines.
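Alternatively, if the numbers form a simple range, the *seq* command generates them one per line directly, with no *tr* conversion needed:

```
$ seq 1 10 > sample.txt
```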

We can use the *cat* command to view the contents of the *sample.txt* file:

```
$ cat sample.txt
1
2
3
4
5
6
7
8
9
10
```

Next, we’ll evaluate statistics like the mean, median, mode, and standard deviation of these numbers.

## 3. Using *awk*

*awk* is a powerful scripting language designed for text processing, extraction, and generation of data reports.

*awk* doesn’t require compilation and allows us to use logical operators, variables, string functions, and numeric functions.

Let’s print the mean of the numbers in the *sample.txt* file:

```
$ awk '{a+=$1} END{print "mean = " a/NR}' sample.txt
5.5
```

Here, we create a variable named *a* and add up all the numbers in the first field of our file. In *awk*, the first field of each input line is represented as *$1*. Afterward, in the *END* block, we divide the value of *a* by the total number of records (*NR*) and print the result.
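The same pattern extends to other statistics. For example, here’s a sketch that tracks the minimum and maximum while summing:

```
$ awk 'NR==1{min=max=$1} {sum+=$1; if($1<min)min=$1; if($1>max)max=$1} END{print "min = " min, "max = " max, "mean = " sum/NR}' sample.txt
min = 1 max = 10 mean = 5.5
```

We initialize *min* and *max* from the first record, then update them as each line is read.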

To get the median, we use *gawk*, the GNU implementation of *awk*. *gawk* has extra built-in functions, such as *asort()*, that aren’t available in the standard *awk* utility.

First, let’s install *gawk*:

`$ sudo apt install gawk`

Once installed, let’s get the median:

```
$ gawk -v max=100 '
function median(r, s) {
    asort(s, t)
    if (r % 2) return t[(r+1)/2]
    else return (t[r/2+1] + t[r/2]) / 2
}
{
    count++
    values[count] = $1
    if (count >= max) {
        print median(count, values); count = 0
    }
}
END {
    print "median = " median(count, values)
}
' sample.txt
median = 5.5
```

Here, we’re using the *-v* flag to set the value of *max* to *100*. Whenever the number of buffered values reaches *max*, the script prints an intermediate median and resets the counter, which keeps memory usage bounded for large inputs.

We’re also defining a *median()* function that sorts the collected values with *asort()* and returns the middle value, or the average of the two middle values when the count is even.
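If *gawk* isn’t available, a common workaround is to sort the input numerically first, so that plain *awk* can pick out the middle value(s):

```
$ sort -n sample.txt | awk '{a[NR]=$1} END{ if(NR%2) print "median = " a[(NR+1)/2]; else print "median = " (a[NR/2] + a[NR/2+1])/2 }'
median = 5.5
```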

Let’s also get the standard deviation of the numbers in the *sample.txt* file:

```
$ awk '{total+=$1; totalsq+=$1*$1} END {print "stdev = " sqrt(totalsq/NR - (total/NR)**2)}' sample.txt
stdev = 2.87228
```

We’re getting the sum of the numbers and the sum of their squares, then using them to calculate the standard deviation.
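Note that this formula divides by *NR*, which yields the population standard deviation. Tools like *ministat* and *datamash* (with *sstdev*) report the sample standard deviation, which divides by *NR - 1* instead; a variant of the same one-liner produces it:

```
$ awk '{total+=$1; totalsq+=$1*$1} END {print "sample stdev = " sqrt((totalsq - total*total/NR)/(NR-1))}' sample.txt
sample stdev = 3.02765
```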

## 4. Using *ministat*

*ministat* is a statistics utility that calculates core statistical properties of numerical data from input files or standard input.

It’s a tool from FreeBSD but also packaged for popular distributions like Debian and Ubuntu.

On Linux, we can install *ministat* using the package manager:

`$ sudo apt install ministat`

Alternatively, we can download, build and install it.

Once it’s installed, let’s print statistical data based on our *sample.txt* file:

```
$ cat sample.txt | awk '{print $1}' | ministat -w 70
x <stdin>
+--------------------------------------------------------------------------+
|x x x x x x x x x x|
| |________________________A___M___________________| |
+--------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 10 1 10 6 5.5 3.0276504
```

Here, we’re printing the data in the *sample.txt* file with the *cat* command. Next, we’re piping the result to *awk*, which prints the first column of numbers. Finally, we’re piping the results to *ministat*, which performs the statistical calculations.

We’ve used the *-w* flag to set the width of the plot to *70* characters. By default, *ministat* uses the terminal width, or *74* if standard output is not a terminal.
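*ministat*’s main strength is comparing datasets: given two files, it reports whether the difference between them is statistically significant. As a sketch, *sample2.txt* here is a hypothetical second dataset:

```
$ seq 11 20 > sample2.txt
$ ministat -w 70 sample.txt sample2.txt
```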

## 5. Using *perl*

**perl stands for Practical Extraction and Report Language. It’s very effective at printing reports based on data input through a file or standard input.** It has grown into a general-purpose language widely used for everything from quick one-liners to full-scale applications.

Let’s print the aggregated statistical data of the numbers in the *sample.txt* file:

```
$ cat sample.txt | perl -e '
use List::Util qw(max min sum);
@r=();
while(<>){
    chomp;
    $sqtotal+=$_*$_; push(@r,$_);
}
$count=@r; $total=sum(@r); $average=$total/@r; $m_num=max(@r); $mm_num=min(@r);
$stdev=sqrt($sqtotal/$count-($total/$count)*($total/$count));
$middle_num=int @r/2; @srtd=sort {$a <=> $b} @r;
if(@r%2){
    $median=$srtd[$middle_num];
}
else{
    $median=($srtd[$middle_num-1]+$srtd[$middle_num])/2;
}
print "records:$count\nsum:$total\navg:$average\nstd:$stdev\nmed:$median\nmax:$m_num\nmin:$mm_num\n";'
records:10
sum:55
avg:5.5
std:2.87228132326901
med:5.5
max:10
min:1
```

We’re using the *-e* flag to execute our Perl code directly from the command line. Here’s a breakdown of some parts of the script:

- *use List::Util qw(max min sum)*: imports the *max*, *min*, and *sum* functions from the *List::Util* module
- *@r=()*: defines an array variable named *@r* and initializes it to an empty list
- *while(<>)…*: a loop that adds the square of each number in our *sample.txt* file to a running total, and pushes each number onto the *@r* array (note the numeric sort later, since a plain *sort* would compare the values as strings)

Then, we’re creating and evaluating variables that represent the number of records (*$count*), sum (*$total*), average (*$average*), standard deviation (*$stdev*), median (*$median*), max (*$m_num*) and min (*$mm_num*).

## 6. Using *datamash*

**GNU datamash is a command-line utility that performs textual, numerical, and statistical operations on data files or standard input.** It’s portable and aids in automating analysis pipelines without writing code or short scripts.

Let’s install *datamash* from the local package manager:

`$ sudo apt install datamash`

Once installed, let’s print aggregated statistical data based on the numbers in the *sample.txt* file:

```
$ cat sample.txt | datamash sum 1 mean 1 median 1 mode 1 sstdev 1
55 5.5 5.5 1 3.0276503540975
```

Here, we’re using *datamash* to print the sum, mean, median, mode, and sample standard deviation.
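*datamash* supports many more operations, such as *min*, *max*, and *range* (the difference between them). The output fields are tab-separated:

```
$ datamash min 1 max 1 range 1 < sample.txt
1	10	9
```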

## 7. Using *st*

*st* is a simple command-line utility to display statistics of numbers from standard input or a file.

To install it, we first download it from its repository on GitHub:

`$ git clone https://github.com/nferraz/st.git`

Then, let’s navigate into the directory and use the *perl* command to generate the build files:

```
$ cd st && perl Makefile.PL
Generating a Unix-style Makefile
Writing Makefile for App::St
Writing MYMETA.yml and MYMETA.json
```

Finally, we use the *make* command to build and install *st*:

```
$ sudo make install
Manifying 1 pod document
Manifying 1 pod document
Appending installation info to /usr/local/lib/x86_64-linux-gnu/perl/5.30.0/perllocal.pod
```

After installation, we can navigate back to our working directory to generate aggregated statistical data:

```
$ st sample.txt
N min max sum mean stddev
10 1 10 55 5.5 3.02765
```

It’s also possible to print individual statistics using some of the available options:

```
$ st --sum sample.txt
55
```
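Per-statistic flags such as *--mean* and *--stddev* can also be combined; check *st --help* for the full list supported by your version:

```
$ st --mean --stddev sample.txt
```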

## 8. Using *clistats*

*clistats* is a command-line utility for the computation of statistical data from a set of delimited input numbers.

The numbers can be separated by either commas or tabs. However, the default delimiter is a comma.

We can pass input from a file, redirected pipes, or standard input.

To use *clistats*, let’s first download it from its repository:

`$ git clone https://github.com/dpmcmlxxvi/clistats.git`

Next, we can navigate into the downloaded directory and run the *make* command to build *clistats*:

```
$ cd clistats && make
g++ -O2 src/clistats.cpp -o clistats
```

This creates a file named *clistats* in the directory. We’ll be using this file to generate the reports.

Finally, let’s copy the *sample.txt* file to the *clistats* directory and then generate aggregated statistical data:

```
$ ./clistats < sample.txt
#=================================================================
# Statistics
#=================================================================
# Dimension Count Minimum Mean Maximum Stdev
#-----------------------------------------------------------------
1 10 1.000000 5.500000 10.000000 2.872281
```
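Since *clistats* understands comma-delimited input, multi-column data produces one row of statistics per column. This hypothetical two-column input illustrates the idea:

```
$ printf '1,10\n2,20\n3,30\n' | ./clistats
```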

## 9. Conclusion

In this article, we’ve looked at some Linux tools that are useful in generating aggregated statistical reports. These reports include statistics like max, min, median, mode, standard deviation, and many more.