
1. Overview

In data processing, merging files is not an uncommon task. For instance, we may need to combine the information from two files that share a common field into a single output. One powerful tool we can use to merge the two files is the awk command.

In this tutorial, we’ll explore how to merge two files based on a common column using awk.

2. Understanding the Problem

First, let’s explore the structure of the two files with the help of the cat command.

2.1. File Contents

file1 contains five columns, namely A, B, C, D, and E:

$ cat file1
 A     B      C      D     E
4050  S00001  31228  3286   0
4050  S00012  31227  4251   0
4049  S00001  28342  3021   1
4048  S00001  46578  4210   0
4048  S00113  31221  4250   0
4047  S00122  31225  4249   0
4046  S00344  31322  4000   1

file2 contains three columns, namely A, F, and G:

$ cat file2
 A     F      G
4050  12.1   23.6
4049  14.4   47.8
4048  23.2   43.9
4047  45.5   21.6

We’ll use column A for the merge since it’s common to both file1 and file2. Our goal is to add the F and G columns from file2 to file1.

2.2. Reasons for Using awk

awk enables us to search for patterns and perform operations based on the matched patterns. Additionally, it can handle large files efficiently. In this scenario, awk loads only file2 into memory, which keeps memory usage low since file2 is smaller than file1. However, if the file we load into memory is very large, we can monitor the memory usage or use gawk, the GNU implementation of awk, which offers additional features when dealing with large datasets.

Here’s the general syntax for the awk command:

$ awk 'command' file1 file2
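
If we want to keep an eye on memory usage, as mentioned earlier, and GNU time is installed on the system, we can wrap the invocation with /usr/bin/time -v. This is only an illustrative sketch; the Maximum resident set size line in the resulting report shows the peak memory the run needed:

$ /usr/bin/time -v awk 'command' file1 file2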

Now, let’s use the syntax above to write an awk script to carry out the task at hand.

3. Merging the Files

Let’s now explore the awk script:

$ awk 'NR==FNR { f[$1] = $2; g[$1] = $3; next } { print $1, $2, $3, $4, $5, f[$1], g[$1] }' file2 file1
A B C D E F G
4050 S00001 31228 3286 0 12.1 23.6
4050 S00012 31227 4251 0 12.1 23.6
4049 S00001 28342 3021 1 14.4 47.8
4048 S00001 46578 4210 0 23.2 43.9
4048 S00113 31221 4250 0 23.2 43.9
4047 S00122 31225 4249 0 45.5 21.6
4046 S00344 31322 4000 1  

The script appends the values of F and G from file2 to each line in file1 based on the value of A. To explain, it first reads and processes file2, loading the F and G columns into memory keyed by A. After that, it processes file1 and displays its rows along with the corresponding F and G values from file2.

3.1. Processing file2

Now, awk executes the first block only while the condition NR==FNR is true. awk reads multiple files on its own; this condition simply enables us to tell file2 apart from file1, so we can store data from file2 before merging it with file1. To understand it, let's look at the two variables involved:

  • NR – a built-in awk variable that tracks the number of lines read across all files
  • FNR – a built-in awk variable that tracks the number of lines read in the current file and resets when a new file starts

NR==FNR is true only while awk reads the first file on the command line, which in our case is file2. Once awk moves on to file1, NR keeps growing while FNR resets to 1, so the condition becomes false.
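
To see the difference between the two variables, we can print them for every line of both files. This is just a quick illustrative check, separate from the merge itself:

$ awk '{ print FILENAME, NR, FNR }' file2 file1
file2 1 1
file2 2 2
...
file2 5 5
file1 6 1
file1 7 2
...

The two variables match only while file2 is being read; as soon as awk starts file1, FNR resets to 1 while NR keeps counting.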

Meanwhile, f[$1] = $2; g[$1] = $3; creates two associative arrays, f and g:

  • $1 – represents the key for the arrays f and g
  • $2 – stores the value of F
  • $3 – stores the value of G

In awk, associative arrays use keys to store data, enabling mapping between the key (value of A) and corresponding values (F and G from file2). This part stores the values of F and G from file2 in arrays where the keys are the values of A. For instance, the value for a row in file2 such as 4050 12.1 23.6 is stored as f[4050] = 12.1 and g[4050] = 23.6. This makes it easy to retrieve these values when processing file1.
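
As a quick sanity check, we can run only the first block on file2 and dump both arrays in an END block. Note that the for-in loop visits keys in no guaranteed order, so the lines may appear shuffled:

$ awk '{ f[$1] = $2; g[$1] = $3 } END { for (k in f) print k, f[k], g[k] }' file2
A F G
4050 12.1 23.6
4049 14.4 47.8
4048 23.2 43.9
4047 45.5 21.6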

Further, the next statement instructs awk to skip the rest of the script for the current line and move on to the next one. In other words, once we store F and G from a line of file2, we jump straight to the next line of file2 instead of falling through to the second block, which is meant for processing file1.

3.2. Processing file1

Here, the condition NR==FNR is no longer true since awk has finished processing file2 and started processing file1. Consequently, awk skips the first block and executes the second block.

Now, for each line in file1, awk prints the original A, B, C, D, and E column values and appends the corresponding F and G values. The block print $1, $2, $3, $4, $5, f[$1], g[$1] enables us to achieve this:

  • $1, $2, $3, $4, $5 – represent the columns A, B, C, D, and E from file1
  • f[$1], g[$1] – utilizes A as the key to get the corresponding F and G values we stored

For example, for the line 4050 S00001 31228 3286 0, awk appends f[4050] and g[4050]:

4050  S00001  31228  3286   0 12.1 23.6

In this example, we display each line from file1 and then append the corresponding F and G values from file2.

3.3. Handling Missing Rows and Delimiters

awk is a versatile tool we can use to handle missing rows as well as merge files with different delimiters.

We may encounter a scenario where file1 contains records that don’t have corresponding matches in file2. If this happens, the merged output will contain empty values for columns F and G. Thus, we need to modify the awk script to provide default values for these cases:

$ awk 'NR==FNR { f[$1] = $2; g[$1] = $3; next } { print $1, $2, $3, $4, $5, (f[$1] ? f[$1] : "NA"), (g[$1] ? g[$1] : "NA") }' file2 file1
A B C D E F G
...
4046 S00344 31322 4000 1 NA NA

Above, we see that NA is printed if the values of F or G are missing.
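
The ternary test above treats a stored value of 0 or an empty string as missing. If such values could legitimately occur in our data, a slightly more robust sketch uses the in operator to check whether the key exists at all:

$ awk 'NR==FNR { f[$1] = $2; g[$1] = $3; next } { print $1, $2, $3, $4, $5, ($1 in f ? f[$1] : "NA"), ($1 in g ? g[$1] : "NA") }' file2 file1

For our sample data, this prints the same output, but it won't mistake a stored 0 for a missing value.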

Aside from handling missing rows, another common scenario involves files that use delimiters other than whitespace. For instance, a file may use commas, tabs, or semicolons to separate fields. By default, awk splits fields on runs of whitespace, so this default behavior applies here since both files are space-delimited.

However, if the files use a different delimiter, we can specify it using the -F option:

$ awk -F',' 'NR==FNR { f[$1] = $2; g[$1] = $3; next } { print $1, $2, $3, $4, $5, f[$1], g[$1] }' file2 file1

The command above works for files that are comma-delimited. Similarly, we can use '\t' with the -F option for tab-delimited files.
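
For example, assuming hypothetical tab-separated versions of our files named file1.tsv and file2.tsv, a sketch of the same merge could look like this, with OFS set so the output stays tab-separated as well:

$ awk -F'\t' 'BEGIN { OFS = "\t" } NR==FNR { f[$1] = $2; g[$1] = $3; next } { print $1, $2, $3, $4, $5, f[$1], g[$1] }' file2.tsv file1.tsv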

4. Conclusion

In this article, we demonstrated how to merge two files using awk based on a common key column.

We efficiently merged the files and produced the desired output by loading the smaller file into memory and storing the necessary data in associative arrays.

First, we showed the contents of both files and discussed the reasons for using awk. Next, we walked through how awk processes both files and examined the merged output, appending the values stored in the associative arrays to each line of the larger file. Finally, we looked at how to handle missing rows and different delimiters.