1. Overview

The awk command is a powerful text processing tool that can solve a wide range of text processing problems on the Linux command line.

In this tutorial, we’ll look at how to call an external program from an awk script.

2. Call External Command From awk

Even though awk is a powerful utility, sometimes we need the assistance of external commands to solve some problems.

For example:

  • awk + sendmail: Read a CSV file containing email addresses and messages, and process and send each message
  • awk + cp: Read input of a file list, and copy the files to a required destination with a defined name pattern
  • awk + md5sum: Read input containing a list of filenames, output the filename and the MD5 hash of the file

When we call an external command from awk, we usually want either its exit status or its output, depending on the requirement.

For instance, in the awk + cp example above, we want the exit status of the cp command to know whether the copy succeeded. In the awk + md5sum example, by contrast, we need the output of the md5sum command so that our script can produce the right result.

In later sections, we’ll see how to handle these two cases through examples.

3. Get the Execution Status of an External Command

We’ll address how to get the returned status of an external command by solving a file backup problem. Let’s have a look at our input file:

$ cat /tmp/test/file_list.csv 

As the name file_list.csv says, it is a CSV file. It contains filenames in the second field. According to this file, we have the files under the /tmp/test/source directory:

$ ls -l /tmp/test/source
-rw-r--r-- 1 kent kent   30 Jun  7 00:13 file1.txt
-rw-r--r-- 1 kent kent   36 Jun  7 00:13 file2.pdf
-rwx------ 1 root root 1752 Jun  7 00:10 file3.txt
-rw-r--r-- 1 kent kent   37 Jun  7 00:13 file4.zip

The requirement is to copy the files listed in the second field to /tmp/test/backup and to add a new field called “Backup_status” to the CSV file, recording whether the backup of the corresponding file was a “Success” or “Failed”.

If we read the ls output above carefully, we’ll notice that file3.txt is owned by root and has the permission 700. A regular user who attempts to read it will get a “permission denied” error. Therefore, we expect the backup status of file3.txt to be “Failed” in the output.

awk’s system(cmd) function can call an external command and get the exit status. This function is the key to solving the problem.
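As a minimal sketch of this behavior, the snippet below calls two trivial commands, true and false (stand-ins chosen only for illustration), and maps the value returned by system() to the labels used in this tutorial:

```shell
# Minimal sketch: system() returns the exit status of the command it runs.
# "true" and "false" are stand-in commands chosen only for illustration.
out=$(awk 'BEGIN {
    ok  = system("true")     # exits with status 0
    bad = system("false")    # exits with a non-zero status
    print "true:",  (ok  == 0 ? "Success" : "Failed")
    print "false:", (bad == 0 ? "Success" : "Failed")
}')
printf '%s\n' "$out"
```

Comparing the status against 0 rather than against a specific non-zero value keeps the check independent of the exact error code a command reports.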

First, let’s have a look at how the problem gets solved:

kent$ awk -F',' -v OFS=',' -v toDir="/tmp/test/backup"     \
        'NR==1{print $0,"Backup_status"; next}
        { backup_cmd = "cp " $2 " " toDir " >/dev/null 2>&1"
          st = system(backup_cmd)
          print $0, ( st==0? "Success" : "Failed" ) }' /tmp/test/file_list.csv

After we executed the awk command, all files with backup status “Success” have been copied to the expected directory:

$ ls /tmp/test/backup 
file1.txt  file2.pdf  file4.zip

Now, let’s go through the awk code line by line to understand how it works:

  • Line #1: Run the awk command as the regular user kent and set the required variables: FS (through -F), OFS, and toDir
  • Line #2: Extend the title by adding a new field: Backup_status and print the title
  • Line #3: Construct the copy command and discard all outputs by redirecting both stdout and stderr to /dev/null
  • Line #4: Call the system() function and hold the exit status in a variable st
  • Line #5: Print the output with the backup status information (st==0 means Success)
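To try the steps above without touching /tmp/test, we can reproduce them in a throwaway directory. The following is only a sketch under stated assumptions: the CSV header, the file names, and the dates are made up for illustration, and the failure is provoked by a missing file (b.txt) instead of a permission problem:

```shell
# Sketch under assumptions: a throwaway directory and made-up file names
# stand in for /tmp/test; b.txt is deliberately missing so that cp fails.
work=$(mktemp -d)
mkdir "$work/source" "$work/backup"
echo "hello" > "$work/source/a.txt"
printf '%s\n' 'Id,File,Date' \
    "1,\"$work/source/a.txt\",2020-06-01" \
    "2,\"$work/source/b.txt\",2020-06-02" > "$work/file_list.csv"

out=$(awk -F',' -v OFS=',' -v toDir="$work/backup" '
    NR == 1 { print $0, "Backup_status"; next }
    {
        # Field 2 already carries double quotes, so the shell sees a quoted path
        backup_cmd = "cp " $2 " \"" toDir "\" >/dev/null 2>&1"
        st = system(backup_cmd)
        print $0, (st == 0 ? "Success" : "Failed")
    }' "$work/file_list.csv")
copied=$(ls "$work/backup")

printf '%s\n' "$out"       # the CSV with the new Backup_status field
printf '%s\n' "$copied"    # the files that were actually copied
rm -rf "$work"
```

Quoting the destination directory inside the constructed command is a small precaution in case the path contains spaces.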

4. Get the Output of an External Command

We’ve seen how to call an external program and get the exit status from awk code. However, sometimes we’d like to use the output of an external command to do further processing.

A command can produce a single line output or multiple lines of output. In this section, we’ll discuss how to handle both cases.

4.1. Get a Single Line Output From an External Command

Again, let’s start with a problem.

We’ll reuse the same input file /tmp/test/file_list.csv. This time, we want to add a new column, “MIME_type”, to the CSV file to show the MIME type of each file.

To get the MIME type, we can make use of the file command. For example, we can get the MIME type of the file /tmp/test/source/file1.txt in this way:

$ file -b --mime-type file1.txt

Good. Now, the only missing piece is how to call the file command from our awk script and get its output.

In awk, there is a multi-functional command called getline. We can pipe a constructed command to the getline command and save the output of the command to a variable with this syntax:

"an external command" | getline variable

For example, if we want to save the output of our file command to an awk variable result, we can write in this way:

"file -b --mime-type file1.txt" | getline result

Now, let’s put the pieces together and solve our MIME type problem:

kent$ awk -F',' -v OFS=','                     \
        'NR==1{print $0,"MIME_type"; next}
        { cmd = "file -b --mime-type " $2
          cmd | getline result
          print $0, result }' /tmp/test/file_list.csv
3,"/tmp/test/source/file3.txt",2020-06-03,regular file, no read permission

Since we executed the command as the regular user kent, we got an error message when we called the file command on file3.txt. This error message is also added to the output.

Getting a single line output by piping the command to the getline command is pretty straightforward.
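The same pattern can be tried on any system with a command whose output is predictable. In the sketch below, basename stands in for the file command and the two paths are made up for illustration. It also checks the value returned by getline and closes each pipe once it has been read:

```shell
# Sketch: read one line of command output per input record.
# basename is a stand-in for the file command; the two paths are made up.
out=$(printf '%s\n' '1,/tmp/demo/a.txt' '2,/tmp/demo/b.pdf' |
    awk -F',' -v OFS=',' '{
        cmd = "basename " $2
        result = ""                   # do not carry over the previous value
        if ((cmd | getline result) <= 0)
            result = "N/A"            # nothing could be read
        close(cmd)                    # release the pipe before the next record
        print $0, result
    }')
printf '%s\n' "$out"
```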

Can we still get multi-line output using the same method? Let’s find out in the next section.

4.2. Get Multi-Line Output From an External Command

Some commands may produce multi-line output. Let’s see whether we can get the complete output using the cmd | getline v approach:

$ awk 'BEGIN{cmd="seq 10"; cmd | getline out; close(cmd); print out}'
1

Oops! We know that the command seq 10 produces ten lines of output. However, our awk command only fetched the first line. This is because this form of the getline command reads one record at a time from the pipe.

The getline command itself has a return value: it returns 1 if it read a record from the pipe, 0 once there’s no more output, and -1 if an error occurred:

$ awk 'BEGIN{cmd="seq 10";
        for(i=1;i<=11;i++) {
            retValue = (cmd | getline out)
            printf "getline returns: %s; cmd output: %s\n", retValue, retValue?out:"Null"
        }
        close(cmd) }'
getline returns: 1; cmd output: 1
getline returns: 1; cmd output: 2
getline returns: 1; cmd output: 3
getline returns: 1; cmd output: 4
getline returns: 1; cmd output: 5
getline returns: 1; cmd output: 6
getline returns: 1; cmd output: 7
getline returns: 1; cmd output: 8
getline returns: 1; cmd output: 9
getline returns: 1; cmd output: 10
getline returns: 0; cmd output: Null

We write a loop to run getline 11 times. We know that the command seq 10 will output ten lines. Therefore, we have ten getline returns: 1 in the output above.

However, after the cmd output: 10 is printed, the pipe doesn’t contain data anymore. Now, if we run the getline command and try to read from the pipe once again, the command will return a 0.

Therefore, we can write a while loop to get the complete output from an external command:

$ awk 'BEGIN{cmd="seq 10";
       while((cmd | getline step_out) > 0){
          cmd_out=cmd_out (cmd_out=="" ? "" : "\n") step_out
       }
       close(cmd)
       print cmd_out }'
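If we need the complete output of several commands, the read loop can be wrapped in a small helper function. The following is only a sketch: run_cmd is a hypothetical name, not a built-in awk function, and seq 3 is an arbitrary command used for illustration:

```shell
# Sketch: wrap the read loop in a helper so any command can be captured.
# run_cmd is a hypothetical name, not a built-in awk function.
out=$(awk '
    function run_cmd(cmd,    line, all) {   # line and all are local variables
        # "> 0" stops on end of output (0) and on a read error (-1)
        while ((cmd | getline line) > 0)
            all = all (all == "" ? "" : "\n") line
        close(cmd)
        return all
    }
    BEGIN { print run_cmd("seq 3") }')
printf '%s\n' "$out"
```

Testing the return value with > 0 instead of using it directly as the loop condition avoids an endless loop if getline ever returns -1.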

4.3. Don’t Forget the close(cmd)

We’ve seen examples of getting the output of an external command using the cmd | getline variable form.

It’s worthwhile to mention that we must call close(cmd) to close the pipe after calling the cmd | getline. Otherwise, our awk script may produce the wrong result.

Let’s have a look at what would happen if we don’t close the pipe. For example, say we have a text file:

$ cat close_test.txt
"Awk is cool!"
"Sed is cool!"
"Awk is cool!"
"Sed is cool!"

In the file, the first line and the third line are identical, so are the second line and the last line.

Now, for each line in the input file, we want to call the external md5sum command and append the MD5 hash value to the line. We expect identical lines to get the same MD5 hash value. In the first try, we don’t call close(cmd) to close the pipe:

$ awk '{cmd="md5sum <<<"$0 ; cmd|getline md5; print $0,"MD5:" md5}' close_test.txt
"Awk is cool!" MD5:04cbd36582f5c11cce032ec44ec476d8  -
"Sed is cool!" MD5:f1844ba1dd262ecbbf798f7c38180693  -
"Awk is cool!" MD5:f1844ba1dd262ecbbf798f7c38180693  -
"Sed is cool!" MD5:f1844ba1dd262ecbbf798f7c38180693  -

The output shows that the MD5 hash values for the last three lines are the same. This is clearly wrong.

Let’s briefly explain why this happened.

awk identifies a pipe by its command string. If we don’t close the pipe, piping the same command string to getline again doesn’t execute the command once more. Instead, getline attempts to read the next record from the output of the earlier execution. In our input, the third line builds exactly the same command string as the first line, and the fourth line the same as the second. Since those pipes are still open and contain no more data, cmd | getline md5 returns 0 and leaves the md5 variable untouched, so it keeps the hash that was read for the second line.
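We can observe this behavior in isolation with a short sketch. Here, echo hi is an arbitrary command that prints a single line, and the three variables simply record what getline returns on each attempt:

```shell
# Sketch: awk identifies a pipe by its command string.
# "echo hi" is an arbitrary command that prints a single line.
out=$(awk 'BEGIN {
    cmd = "echo hi"
    first  = (cmd | getline a)   # runs the command and reads "hi"
    second = (cmd | getline b)   # same open pipe, nothing left to read
    close(cmd)
    third  = (cmd | getline c)   # the command runs again after close()
    close(cmd)
    print first, second, third
}')
printf '%s\n' "$out"    # prints: 1 0 1
```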

Let’s reset the md5 variable for each input line and print the getline status to make the problem easier to see:

$ awk '{md5=""; cmd="md5sum <<<"$0 
      status=cmd|getline md5; 
      print $0,"getline status:"status,"MD5:" md5}' close_test.txt
"Awk is cool!" getline status:1 MD5:04cbd36582f5c11cce032ec44ec476d8  -
"Sed is cool!" getline status:1 MD5:f1844ba1dd262ecbbf798f7c38180693  -
"Awk is cool!" getline status:0 MD5:
"Sed is cool!" getline status:0 MD5:

The fix is simply to call close(cmd) after we’ve read the output from the pipe:

$ awk '{cmd="md5sum <<<"$0 ; cmd|getline md5;close(cmd); print $0,"MD5:" md5}' close_test.txt
"Awk is cool!" MD5:04cbd36582f5c11cce032ec44ec476d8  -       
"Sed is cool!" MD5:f1844ba1dd262ecbbf798f7c38180693  -
"Awk is cool!" MD5:04cbd36582f5c11cce032ec44ec476d8  -
"Sed is cool!" MD5:f1844ba1dd262ecbbf798f7c38180693  -

5. Conclusion

In this article, we’ve learned how to call an external program using awk.

Depending on the requirement, we can call the system(cmd) to get the exit code of the external command, or get the output using the cmd | getline form.

We also discussed why we shouldn’t forget to call close(cmd) to close the pipe.
