
Last updated: March 18, 2024
awk is a convenient and powerful command-line utility for processing text. Sometimes, we need to read and process multiple input files.
In this tutorial, we’ll learn how to process multiple input files using the awk command.
Sometimes, we want to process a collection of data files and generate some output.
For example, suppose we have three input files containing user scores:
$ head score*.txt
==> score1.txt <==
Tom 20
Jerry 40
Mark 25
Amanda 37
==> score2.txt <==
Mark 75
Tom 70
Jerry 7
Amanda 40
==> score3.txt <==
Mark 73
Amanda 47
Jerry 79
Tom 40
Notice that all files share the same format: each line contains a name and a score, separated by whitespace.
Let’s calculate the sum of scores for each user from the files above:
$ awk '{ sum[$1]+=$2 } END { for(user in sum) print user, sum[user] }' score*.txt
Tom 130
Jerry 126
Mark 173
Amanda 124
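In the code above, we created an associative array, sum, that accumulates the scores of each user. Finally, in the END block, we printed every element in the array. Note that awk doesn’t guarantee the traversal order of for (user in sum), so the order of the output lines may differ between awk implementations.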
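If a stable order matters, one option is to pipe the result through sort. The following is only a minimal sketch: it recreates the three sample files in a scratch directory (the workdir and totals variables are names chosen for this illustration) and orders the users by total score, highest first.

```shell
# Recreate the three sample score files in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'Tom 20' 'Jerry 40' 'Mark 25' 'Amanda 37' > score1.txt
printf '%s\n' 'Mark 75' 'Tom 70' 'Jerry 7' 'Amanda 40' > score2.txt
printf '%s\n' 'Mark 73' 'Amanda 47' 'Jerry 79' 'Tom 40' > score3.txt

# Sum the scores per user, then sort numerically on column 2, descending.
totals=$(awk '{ sum[$1] += $2 } END { for (user in sum) print user, sum[user] }' score*.txt |
    sort -k2,2nr)
printf '%s\n' "$totals"
```

With the sample data, Mark (173) comes first and Amanda (124) last, regardless of how awk traverses the array.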
When our input files share the same format, we can treat multiple input files as a single merged input. This is a relatively simple situation.
However, in practice, we often need to handle the associations between input files. In the following sections, we’ll see these situations in detail.
In our next example, we’ll show how to process two associated input files using line numbers and awk’s built-in NR and FNR variables.
NR and FNR are two built-in awk variables. NR tells us the total number of records that we’ve read so far, while FNR gives us the number of records we’ve read in the current input file.
Let’s understand the two variables through an example. First, let’s create two files:
$ head file1.txt file2.txt
==> file1.txt <==
file1-1
file1-2
file1-3
file1-4
file1-5
==> file2.txt <==
file2-1
file2-2
file2-3
file2-4
file2-5
Then we create a simple awk one-liner, which takes the two files above as input and prints lines in each file together with the values of NR and FNR:
$ awk '{ printf "Line:%s, NR:%d, FNR:%d\n", $0, NR, FNR}' file1.txt file2.txt
Line:file1-1, NR:1, FNR:1
Line:file1-2, NR:2, FNR:2
Line:file1-3, NR:3, FNR:3
Line:file1-4, NR:4, FNR:4
Line:file1-5, NR:5, FNR:5
Line:file2-1, NR:6, FNR:1
Line:file2-2, NR:7, FNR:2
Line:file2-3, NR:8, FNR:3
Line:file2-4, NR:9, FNR:4
Line:file2-5, NR:10, FNR:5
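The output above shows us that NR keeps increasing across both files, while FNR restarts at 1 as soon as awk begins reading file2.txt.
In the next section, we’ll see how to use NR and FNR to distinguish between the input files and handle the relationship between them.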
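One way to make the relationship between the two variables concrete is to print NR - FNR, which is the number of records awk read from earlier files. The sketch below uses shortened versions of the sample files; the offsets variable is just a name picked for this example.

```shell
# Shortened versions of file1.txt and file2.txt in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'file1-%d\n' 1 2 3 > file1.txt
printf 'file2-%d\n' 1 2 > file2.txt

# NR - FNR is 0 while reading the first file and 3 (the size of file1.txt)
# while reading the second one.
offsets=$(awk '{ print $0, NR - FNR }' file1.txt file2.txt)
printf '%s\n' "$offsets"
```

The difference is 0 exactly while NR equals FNR, that is, while awk is still reading the first file.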
Let’s start with an example.
We prepared two files:
$ head all_lines.txt lines_to_show.txt
==> all_lines.txt <==
line-01
line-02
line-03
line-04
line-05
line-06
line-07
line-08
line-09
line-10
==> lines_to_show.txt <==
2
3
4
5
7
In the file all_lines.txt, we have ten lines of text, while the file lines_to_show.txt stores line numbers. Now, we want to output a line from the all_lines.txt file only if its line number is defined in the file lines_to_show.txt.
Let’s have a look at the solution, then understand how it works:
$ awk 'NR==FNR { out[$1]=1; next } { if (out[FNR]==1) print $0 }' lines_to_show.txt all_lines.txt
line-02
line-03
line-04
line-05
line-07
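We solved this problem in two steps: first, we read lines_to_show.txt and stored the line numbers it contains in an array; then, we read all_lines.txt and printed only the lines whose line numbers are in that array.
Now, let’s take a closer look at the awk code above to understand how it works.
Step 1: NR==FNR { out[$1]=1; next }
NR==FNR is true only while awk reads the first file, lines_to_show.txt. For each of its lines, we store the line number as a key of the array out with the value 1, and next skips the remaining rules.
Step 2: { if (out[FNR]==1) print $0 }
This block runs only for the second file, all_lines.txt, because next has already consumed every line of the first file. Here, FNR is the line number within all_lines.txt, so we print the line only if that number is stored in out.
It’s worthwhile to mention that, in awk, a non-zero number evaluates to true, and a pattern without an action prints the current line by default.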
Therefore, we can write the awk one-liner solution to this problem more compactly:
$ awk 'NR==FNR { out[$1]=1; next } out[FNR]' lines_to_show.txt all_lines.txt
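A possible variation, shown here only as a sketch, tests membership with the in operator. Merely referencing out[FNR] creates an empty element for every line number that is checked, whereas (FNR in out) leaves the array untouched. The selected variable is a name introduced for this illustration.

```shell
# Recreate the two sample files in a scratch directory.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'line-%02d\n' 1 2 3 4 5 6 7 8 9 10 > all_lines.txt
printf '%s\n' 2 3 4 5 7 > lines_to_show.txt

# First file: referencing out[$1] is enough to create the key.
# Second file: print the line when its number is a key of out.
selected=$(awk 'NR == FNR { out[$1]; next } (FNR in out)' lines_to_show.txt all_lines.txt)
printf '%s\n' "$selected"
```

The output is the same five lines as before; only the way the array is queried changes.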
In this section, we’ll see another practical example. As usual, let’s first take a look at the two input files:
$ head price.txt purchasing.txt
==> price.txt <==
Product Price(USD/Kg) Supplier
Apple 3.20 Supplier_X
Orange 3.00 Supplier_Y
Peach 5.35 Supplier_Y
Pear 5.00 Supplier_X
Mango 12.00 Supplier_Y
Pineapple 7.70 Supplier_X
==> purchasing.txt <==
Product Volume(Kg) Date
Orange 120 2020-04-02
Apple 400 2020-04-03
Peach 70 2020-04-05
Pear 50 2020-04-17
We want to generate a cost report containing Product, Date, and a new column, Cost, where Cost = Price * Volume.
Let’s look at the solution first:
$ awk 'BEGIN { print "Product Cost Date" }
FNR>1 && NR==FNR { price[$1]=$2; next }
FNR>1 { printf "%s $%.2f %s\n",$1, price[$1]*$2, $3}' price.txt purchasing.txt
Product Cost Date
Orange $360.00 2020-04-02
Apple $1280.00 2020-04-03
Peach $374.50 2020-04-05
Pear $250.00 2020-04-17
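Now let’s take a closer look at the code and understand how it works. The BEGIN block prints the header of the report before any input is read. The second rule is guarded by NR==FNR, so it runs only for price.txt: FNR>1 skips the header line, price[$1]=$2 stores each price keyed by product name, and next moves on to the following line. The last rule therefore runs only for purchasing.txt: it also skips the header line and prints the product, the cost computed as price[$1]*$2, and the date.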
If we need to handle two input files using awk, we can consider using this typical pattern to solve the problem:
awk 'NR==FNR {
# read lines from the first input file
# do calculation and save required value
# in variables or arrays
next
}
{
# process the lines from the second file
# with the variables or arrays we prepared above
}' inputFile1 inputFile2
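As a concrete instance of this pattern, the sketch below rebuilds shortened versions of price.txt and purchasing.txt and guards against a product that has no price. The Kiwi purchase and the N/A placeholder are hypothetical additions made up for this illustration; they aren’t part of the original data.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > price.txt <<'EOF'
Product Price(USD/Kg) Supplier
Apple 3.20 Supplier_X
Orange 3.00 Supplier_Y
EOF
cat > purchasing.txt <<'EOF'
Product Volume(Kg) Date
Orange 120 2020-04-02
Kiwi 10 2020-04-09
EOF

report=$(awk '
    # First file: remember each price (skipping the header), then move on.
    NR == FNR { if (FNR > 1) price[$1] = $2; next }
    # Second file: skip the header line.
    FNR == 1 { next }
    # Products without a known price get a placeholder instead of a cost of 0.
    !($1 in price) { printf "%s N/A %s\n", $1, $3; next }
    { printf "%s $%.2f %s\n", $1, price[$1] * $2, $3 }
' price.txt purchasing.txt)
printf '%s\n' "$report"
```

Without the guard, awk would treat the missing price as 0 and silently report a cost of $0.00 for Kiwi.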
We’ve learned the compact way to handle two input files by comparing the values of FNR and NR.
However, if we have more than two input files, this method doesn’t work on its own.
NR==FNR is true only for the first file, and since FNR is reset to 1 whenever the input file changes, comparing the two variables can’t tell us whether we’re reading the second file or the third.
FILENAME is a built-in variable that stores the name of the input file the awk command is currently processing:
$ awk '{ print $0 " => " FILENAME}' file1.txt file2.txt file3.txt
file1-1 => file1.txt
file1-2 => file1.txt
file1-3 => file1.txt
file1-4 => file1.txt
file1-5 => file1.txt
file2-1 => file2.txt
file2-2 => file2.txt
file2-3 => file2.txt
file2-4 => file2.txt
file2-5 => file2.txt
file3-1 => file3.txt
file3-2 => file3.txt
file3-3 => file3.txt
file3-4 => file3.txt
file3-5 => file3.txt
We can make use of this variable to distinguish the input files and apply different processing logic.
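For instance, FILENAME can serve as an array key to collect per-file statistics. The sketch below counts the lines of each input file; the file contents are shortened, and the lines array and counts variable are names chosen for this example.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf 'file1-%d\n' 1 2 3 > file1.txt
printf 'file2-%d\n' 1 2 > file2.txt
printf 'file3-%d\n' 1 > file3.txt

# Count records per file, keyed by the name of the file being read.
# The result is piped through sort because array traversal order is unspecified.
counts=$(awk '{ lines[FILENAME]++ } END { for (f in lines) print f, lines[f] }' \
    file1.txt file2.txt file3.txt | sort)
printf '%s\n' "$counts"
```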
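In an earlier section, we generated a report on the fruit purchasing cost.
Let’s review the example quickly. We have two input files: price.txt, which lists the price and supplier of each product, and purchasing.txt, which records the volume and date of each purchase.
Thanks to our good partnership with the suppliers, they’ve agreed to offer us some discounts. Now, we’ll add a third file, discount.txt: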
$ cat discount.txt
Supplier Discount
Supplier_X 0.10
Supplier_Y 0.20
Let’s generate a new report on purchasing cost from the three input files:
$ awk 'fname != FILENAME { fname = FILENAME; idx++ }
FNR > 1 && idx == 1 { discount[$1] = $2 }
FNR > 1 && idx == 2 { price[$1] = $2 * ( 1 - discount[$3] ) }
FNR > 1 && idx == 3 { printf "%s $%.2f %s\n",$1, price[$1]*$2, $3 }
' discount.txt price.txt purchasing.txt
Orange $288.00 2020-04-02
Apple $1152.00 2020-04-03
Peach $299.60 2020-04-05
Pear $225.00 2020-04-17
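In the code above, we used FNR>1 to skip the header line of each input file. Also, we created associative arrays to pass data from one file’s processing to the next. This is why the order of the files on the command line matters: the discounts must be loaded before the prices, and the prices before the purchases.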
However, the key to distinguishing between input files is this line of code:
fname != FILENAME{ fname = FILENAME; idx++ }
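Now, let’s understand how it works. The variable fname starts out empty, so when awk reads the first record, fname != FILENAME is true: we store the current filename in fname and increment idx to 1. The condition stays false for the rest of that file and becomes true again on the first record of each following file. As a result, idx holds 1, 2, and 3 while awk reads the first, second, and third file, respectively.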
This is one of the common techniques for handling multiple input files.
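A related variant, sketched below under the assumption that every input file starts with a header line, increments the index whenever FNR returns to 1 instead of comparing filenames. Like the filename comparison, it never sees an empty file, because an empty file produces no records. The report variable is a name introduced for this illustration.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'Supplier Discount' 'Supplier_X 0.10' 'Supplier_Y 0.20' > discount.txt
printf '%s\n' 'Product Price(USD/Kg) Supplier' 'Apple 3.20 Supplier_X' \
    'Orange 3.00 Supplier_Y' 'Peach 5.35 Supplier_Y' 'Pear 5.00 Supplier_X' \
    'Mango 12.00 Supplier_Y' 'Pineapple 7.70 Supplier_X' > price.txt
printf '%s\n' 'Product Volume(Kg) Date' 'Orange 120 2020-04-02' \
    'Apple 400 2020-04-03' 'Peach 70 2020-04-05' 'Pear 50 2020-04-17' > purchasing.txt

report=$(awk '
    FNR == 1 { idx++; next }   # first line of a file: bump the index, skip the header
    idx == 1 { discount[$1] = $2 }
    idx == 2 { price[$1] = $2 * (1 - discount[$3]) }
    idx == 3 { printf "%s $%.2f %s\n", $1, price[$1] * $2, $3 }
' discount.txt price.txt purchasing.txt)
printf '%s\n' "$report"
```

The report matches the one produced by the filename comparison. The trade-off is that this version also skips the first line of every file, so it only fits inputs that all carry a header.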
We’ve seen that the built-in FILENAME variable stores the name of the current input file. While reading the code in the previous section, we may wonder why we distinguish between input files by their index instead of comparing the filename directly, as in this example:
FNR > 1 && FILENAME == "discount.txt" {...}
FNR > 1 && FILENAME == "price.txt" {...}
FNR > 1 && FILENAME == "purchasing.txt" {...}
Comparing the FILENAME variable with the filename works for this example, too. However, it has some disadvantages.
Most notably, it brings hardcoded filenames into our awk script. That is, when we change the name of a file, we must update the code, too.
For example, if we change the second file price.txt to “/full/path/to/price.txt”, we’d have to change our script.
Sometimes, we have to pass the filename with shell variables, such as “$PWD/price.txt”. In this case, we don’t know the exact value of the FILENAME variable.
A workaround is using the regular expression match operator ~ instead of == as in:
FNR > 1 && FILENAME ~ /\/price[.]txt$/ {...}
However, the workaround fails when we pass a process substitution to the awk command as an input “file”.
With a process substitution, the shell generates the name of the input “file” automatically, typically a path such as /dev/fd/N or /proc/self/fd/N that refers to a pipe. The filename is therefore dynamic, and we can’t predict it in our script.
Let’s see an example of this case:
$ echo "a dummy line" > dummy.txt
$ awk '{print FILENAME}' dummy.txt <(cat dummy.txt )
dummy.txt
/proc/self/fd/11
Therefore, we prefer to distinguish between input files using the index of an input file over the filename.
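To illustrate, the sketch below feeds the price list through a pipe and passes - (standard input) as the second operand, so that input has no meaningful filename at all. The index logic from the report example still works, because it only reacts to FILENAME changing. The data is shortened, and the report variable is a name chosen for this example.

```shell
workdir=$(mktemp -d) && cd "$workdir" || exit 1
printf '%s\n' 'Supplier Discount' 'Supplier_X 0.10' 'Supplier_Y 0.20' > discount.txt
printf '%s\n' 'Product Volume(Kg) Date' 'Orange 120 2020-04-02' \
    'Apple 400 2020-04-03' > purchasing.txt

# The price list arrives on standard input; "-" places it second in the input order.
report=$(printf '%s\n' 'Product Price(USD/Kg) Supplier' 'Apple 3.20 Supplier_X' \
        'Orange 3.00 Supplier_Y' |
    awk 'fname != FILENAME { fname = FILENAME; idx++ }
         FNR > 1 && idx == 1 { discount[$1] = $2 }
         FNR > 1 && idx == 2 { price[$1] = $2 * (1 - discount[$3]) }
         FNR > 1 && idx == 3 { printf "%s $%.2f %s\n", $1, price[$1] * $2, $3 }
    ' discount.txt - purchasing.txt)
printf '%s\n' "$report"
```

Nothing in the script mentions a filename, so renaming the files or changing how the price list is delivered requires no change to the awk code.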
In this article, we’ve discussed how to handle multiple input files when we work with the awk command.