Last updated: September 25, 2024
awk is a handy tool for manipulating and analyzing structured text data efficiently. One of awk's features is its ability to work with columns and perform conditional processing.
In this tutorial, we’ll explore how to use awk with column value conditions to extract, transform, and filter data.
To show the functionality of each snippet, we use the same file.
So, let’s look at the content of data.txt:
$ cat data.txt
Chinedu,35,Lagos
Amaka,28,Abuja
Olumide,42,Ibadan
Chinedu,51,Kano
Amaka,63,Lagos
Olumide,19,Port Harcourt
Amaka,47,Benin City
Chinedu,72,Enugu
Now, we can continue with specific examples.
Before we can work with columns in awk, we need to define how the data splits into columns in the first place. For that, awk uses a field separator to know where each column starts and ends.
By default, awk assumes columns are separated by whitespace. This means it splits each line of input into columns wherever it encounters runs of spaces or tabs. However, if the data uses a different delimiter, such as a comma, we specify it with the -F option.
For example, we can tell awk that columns are separated by commas instead of spaces:
$ awk -F ',' '{ ... }' input.csv
In this command, -F ',' sets the field separator to a comma, '{ ... }' represents the action we want awk to perform on each line, and input.csv is the input file containing comma-separated values.
Once awk knows how the columns are separated, we can work with each column via its position. For that, we use the $ sign followed by the column number.
For instance, $1 means the first column, $2 means the second column, and so on.
As an example, let’s print the second column from the example file:
$ awk -F ',' '{ print $2 }' data.txt
35
28
42
51
63
19
47
72
As a result, awk prints the second column value from each line of the data.txt file.
We can also get values from multiple columns at once by listing the column numbers we want.
For example, we can print the first and third columns from the same file:
$ awk -F ',' '{ print $1, $3 }' data.txt
Chinedu Lagos
Amaka Abuja
Olumide Ibadan
Chinedu Kano
Amaka Lagos
Olumide Port Harcourt
Amaka Benin City
Chinedu Enugu
Thus, awk prints the first and third column values from each line, separated by a space.
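Notably, awk joins the printed fields with the output field separator (OFS), which defaults to a single space. If we'd rather keep the commas in the output, we can set OFS with the -v option; here's a quick sketch against the same data.txt:

```shell
$ awk -F ',' -v OFS=',' '{ print $1, $3 }' data.txt
Chinedu,Lagos
Amaka,Abuja
Olumide,Ibadan
Chinedu,Kano
Amaka,Lagos
Olumide,Port Harcourt
Amaka,Benin City
Chinedu,Enugu
```

Setting OFS only affects how the printed fields are joined; input parsing is still controlled by -F.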
When manipulating data with awk, conditional processing makes it possible to perform actions based on specific criteria within column values.
We can use comparison operators to check if a column value matches a certain condition. The most commonly used operators are == (equal), != (not equal), < (less than), <= (less than or equal), > (greater than), and >= (greater than or equal).
For example, we can display only the lines where the value in the second column exceeds 40:
$ awk -F ',' '$2 > 40' data.txt
Olumide,42,Ibadan
Chinedu,51,Kano
Amaka,63,Lagos
Amaka,47,Benin City
Chinedu,72,Enugu
In this command, awk examines each line in data.txt, checks if the second field is greater than 40, and prints the line if the condition is true.
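The other comparison operators work the same way. For instance, <= selects the complement of the previous example, keeping only the lines where the second column is at most 40:

```shell
$ awk -F ',' '$2 <= 40' data.txt
Chinedu,35,Lagos
Amaka,28,Abuja
Olumide,19,Port Harcourt
```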
To refine the data selection further, we can combine multiple conditions using the logical operators && (AND), || (OR), and ! (NOT).
Suppose we want to print lines where the first column equals Amaka and the second column is greater than 40:
$ awk -F ',' '$1 == "Amaka" && $2 > 40' data.txt
Amaka,63,Lagos
Amaka,47,Benin City
Here, awk processes each line and checks whether both conditions are met: the first column equals Amaka and the second column exceeds 40. It then prints only the lines satisfying both criteria.
Alternatively, if we’re interested in lines where the first column is Amaka or the second column exceeds 40, we can use the logical OR operator:
$ awk -F ',' '$1 == "Amaka" || $2 > 40' data.txt
Amaka,28,Abuja
Olumide,42,Ibadan
Chinedu,51,Kano
Amaka,63,Lagos
Amaka,47,Benin City
Chinedu,72,Enugu
Thus, this command prints lines that meet at least one of the specified conditions.
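The logical NOT operator ! inverts a condition. For example, wrapping the previous OR condition in !( ... ) keeps exactly the lines that the OR query filtered out:

```shell
$ awk -F ',' '!($1 == "Amaka" || $2 > 40)' data.txt
Chinedu,35,Lagos
Olumide,19,Port Harcourt
```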
awk can also use regular expressions to match patterns in column values. To leverage this, we employ the ~ operator:
$ awk '$3 ~ /ERROR/' log.txt
2023-04-01 10:32:15 ERROR Unable to connect to database
2023-04-01 10:50:10 ERROR File not found
This syntax tells awk to check whether the third column value of each line contains the word ERROR and, if so, print that line. Notably, we needn't set the field separator here, since log.txt uses the default whitespace separator.
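The same regular expression matching works with any field separator. For instance, we can match the rows of our comma-separated data.txt whose third column contains Lagos:

```shell
$ awk -F ',' '$3 ~ /Lagos/' data.txt
Chinedu,35,Lagos
Amaka,63,Lagos
```

Conversely, the negated operator !~ selects the lines whose field doesn't match the pattern.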
Now that we know how to work with columns and conditions in awk, we can combine them to perform even more useful and specific manipulations.
We can use conditions to pick out certain rows based on their column values.
For example, we can print the second and third columns, but only for rows where the first column is Amaka:
$ awk -F ',' '$1 == "Amaka" { print $2, $3 }' data.txt
28 Abuja
63 Lagos
47 Benin City
This command checks the first column value for each row, and if it’s Amaka, it prints the second and third column values for that row.
awk also enables the changing of column values based on conditions.
For example, we can make the third column uppercase if the first column is Chinedu:
$ awk -F ',' '$1 == "Chinedu" { $3 = toupper($3) }; { print }' data.txt
Chinedu 35 LAGOS
Amaka,28,Abuja
Olumide,42,Ibadan
Chinedu 51 KANO
Amaka,63,Lagos
Olumide,19,Port Harcourt
Amaka,47,Benin City
Chinedu 72 ENUGU
As a result, this basic script checks if the first column is Chinedu for each line. If so, it converts the third column to uppercase using the toupper() function. Finally, the { print } part at the end tells awk to print every line, including the changed ones. Notably, when awk modifies a field, it rebuilds the whole line using the output field separator, which defaults to a single space. That's why the changed lines appear with spaces instead of commas.
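If we want the modified lines to keep their commas, we can set the output field separator to a comma as well, so awk rebuilds changed lines with the same delimiter:

```shell
$ awk -F ',' -v OFS=',' '$1 == "Chinedu" { $3 = toupper($3) } { print }' data.txt
Chinedu,35,LAGOS
Amaka,28,Abuja
Olumide,42,Ibadan
Chinedu,51,KANO
Amaka,63,Lagos
Olumide,19,Port Harcourt
Amaka,47,Benin City
Chinedu,72,ENUGU
```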
We can use awk to calculate totals or other aggregate values based on conditions, too.
For instance, we can add up the values in the second column, but only for lines where the first column is Chinedu:
$ awk -F ',' '$1 == "Chinedu" { sum += $2 } END { print sum }' data.txt
158
Thus, we tell awk to keep a running total in the sum variable, adding the second column value each time the first column is Chinedu. The END part runs after all the lines are processed and prints the final total.
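Building on the same idea, we can track a second variable to compute an average instead of a plain total; for example, the mean of the second column over the Chinedu lines:

```shell
$ awk -F ',' '$1 == "Chinedu" { sum += $2; n++ } END { print sum / n }' data.txt
52.6667
```

awk formats the non-integral result with its default output format OFMT, %.6g, hence the six significant digits.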
We can use variables and arrays in awk to store and work with column values.
To illustrate, let’s keep track of all the different values in the second column:
$ awk -F ',' '{ unique[$2] = 1 } END { for (val in unique) print val }' data.txt
42
72
63
35
28
51
47
19
This code uses an array called unique to remember each value it sees in the second column. The END part prints out all the unique values at the end.
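A slight variation counts how often each value occurs, by using the array entry as a counter rather than a flag. Note that for (val in array) visits keys in an unspecified order, so the lines may come out in a different order depending on the awk implementation:

```shell
$ awk -F ',' '{ count[$1]++ } END { for (name in count) print name, count[name] }' data.txt
Amaka 3
Chinedu 3
Olumide 2
```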
In this article, we explored how to use awk to process text files based on column value conditions.
In conclusion, awk provides a flexible way to access, filter, and modify columns in structured data.