1. Overview

In an awk program, the pattern-action pair is the fundamental structure that decides how an input line is processed. Two of these pairs, the BEGIN and END blocks, are special: they execute exactly once, before and after all input is read, respectively, while every other block runs for each input line that matches its pattern.

In this tutorial, we’ll explore the uses of the BEGIN and END rules in awk through a variety of use cases.

2. Initialization and Setup Tasks

Since the BEGIN block executes only once, before all other rules, it’s the right place for initialization and setup tasks in an awk program. In this section, we’ll look at some of the most common use cases the BEGIN block can solve.

2.1. Initializing Variables

Let’s start by creating a simple mental model of an awk program:

execute_begin_block()

for line in input_lines
do
    execute_main_block()
done

execute_end_block()

We can infer that the main block executes for each input line. So, initializing variables in the main block is wasteful because we’d be reinitializing them on every line. Worse, we could inadvertently overwrite a value we’re accumulating. It’s more efficient, and safer, to initialize the variables in the BEGIN block and then use them within the main block.
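To see this ordering in action, here’s a minimal example run against a two-line input, with the BEGIN, main, and END blocks each announcing themselves:

$ printf 'a\nb\n' | awk 'BEGIN { print "setup" } { print "processing:", $0 } END { print "teardown" }'
setup
processing: a
processing: b
teardown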

Now, let’s write a one-liner awk program to show the contents of our sample text file, numbers.txt:

$ awk '{print}' numbers.txt
1
2
3
4
5

We can see that our program doesn’t have a BEGIN block, so the print statement executes for each input line as it’s read.

Next, we’ll extend this one-liner to print only a specified number of lines. For this purpose, let’s write our logic in the print_with_limit.awk script, adding a BEGIN block that initializes the limit variable from the USER_DEFINED_LIMIT argument passed to the script:

$ cat print_with_limit.awk
BEGIN {
    limit=USER_DEFINED_LIMIT
}

NR<=limit {
    print
}

The other change is the NR<=limit condition that guards the main block, so it runs only for the first limit lines.

Finally, let’s execute our script to print the first three lines from the numbers.txt file:

$ awk -v USER_DEFINED_LIMIT=3 -f print_with_limit.awk numbers.txt
1
2
3

Perfect! Our script works as expected.
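As a side note, since a pattern without an action defaults to printing the matching record, and -v assignments take effect before any input is read, we could express the same logic as a one-liner:

$ awk -v limit=3 'NR<=limit' numbers.txt
1
2
3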

2.2. Filter Patterns

Let’s explore another common usage of the BEGIN block, where we define patterns and use them later in the main block. For instance, let’s imagine that we want to filter the numbers in the numbers.txt file to be in the range [1-3]. Further, let’s solve our use case by writing the filter_nums.awk script that uses a pattern definition in the BEGIN block:

$ cat filter_nums.awk
BEGIN {
    pattern="^[1-3]$"
}

$0 ~ pattern {
    print $0
}

Next, let’s go ahead and execute our filter_nums.awk script:

$ awk -f filter_nums.awk numbers.txt
1
2
3

Great! We’ve got this one right.
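Alternatively, we could skip the BEGIN block entirely and inject the pattern with a -v assignment, which is also available before the first input line is processed:

$ awk -v pattern='^[1-3]$' '$0 ~ pattern' numbers.txt
1
2
3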

2.3. Data Conversion

By default, awk uses whitespace as the input field separator (FS) and a newline as the input record separator (RS). Similarly, the default output field separator (OFS) is a single space, and the default output record separator (ORS) is a newline.

For situations where we want to use custom values, we can define them explicitly in the BEGIN block. Let’s understand this with the help of the sample_text.txt file:

$ cat sample_text.txt
a1,a2,a3
b1,b2,b3
c1,c2,c3

We can see that we’ve got comma-separated fields in each record. Further, let’s say we want to read individual fields and then print them such that the output fields are separated with a colon (:) and output records are separated with two newlines. So, let’s go ahead and write the field_record_separator.awk script that explicitly defines the FS, OFS, RS, and ORS variables to solve our use case:

$ cat field_record_separator.awk
BEGIN {
    FS=",";
    OFS=":";
    RS="\n";
    ORS="\n\n";
}

{
    print $1,$2,$3
}

Lastly, we’ll execute our awk script to process the sample_text.txt file:

$ awk -f field_record_separator.awk sample_text.txt
a1:a2:a3

b1:b2:b3

c1:c2:c3

Again, the output format looks correct, so we’ve learned one more use case for the BEGIN block.
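As an aside, the same transformation works as a one-liner: the -F option sets FS, and -v assignments (which interpret escape sequences such as \n) set the other separators before any input is read:

$ awk -F',' -v OFS=':' -v ORS='\n\n' '{print $1,$2,$3}' sample_text.txt
a1:a2:a3

b1:b2:b3

c1:c2:c3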

3. Arithmetic Operations

In this section, we’ll extend what we learned about variable initialization to a frequent use case: performing arithmetic operations in an awk script. For this purpose, we’ll process the numbers.txt file, which contains the numbers 1 through 5, and for each input line compute the running sum of all numbers and the count of even numbers seen so far.

Let’s go ahead and write the logic in the arithmetic.awk script to solve our use case:

$ cat arithmetic.awk
BEGIN {
    sum=0
    even_count=0
}

{
    sum+=$0
    if ( $0%2 == 0 ) {
        even_count++
    }
    print "line no." NR, "sum so far:", sum, "count of even numbers", even_count
}

To summarize the logic in our script: we initialize the sum and even_count variables within the BEGIN block, while the core arithmetic lives in the main block, as it needs to execute for each input line.

Now, let’s go ahead and execute our arithmetic.awk script and verify the results:

$ awk -f arithmetic.awk numbers.txt
line no.1 sum so far: 1 count of even numbers 0
line no.2 sum so far: 3 count of even numbers 1
line no.3 sum so far: 6 count of even numbers 1
line no.4 sum so far: 10 count of even numbers 2
line no.5 sum so far: 15 count of even numbers 2

Perfect! We’ve solved this use case correctly.
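Incidentally, awk automatically initializes scalar variables to zero (or the empty string) on first use, so the explicit BEGIN block above mainly documents our intent. The same logic also works as a one-liner, here with the output trimmed to the line number, running sum, and even count:

$ awk '{ sum += $0; if ($0 % 2 == 0) even_count++; print NR, sum, even_count }' numbers.txt
1 1 0
2 3 1
3 6 1
4 10 2
5 15 2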

4. Report Generation

awk is often used to process text files and generate reports or analysis summaries. For this purpose, we can put the logic that prints the report summary in the END block, as it executes only once, after all input lines have been read.

Our earlier arithmetic use case printed a running total and even-number count after each line because the print statement sat in the main block. So, let’s extend it to produce a single summary report by writing the report.awk script:

$ cat report.awk
BEGIN {
    sum=0
    even_count=0
}

{
    sum+=$0
    if ( $0%2 == 0 ) {
        even_count++
    }
}

END {
    print "total lines:", NR ", sum:", sum ", count of even numbers", even_count
}

We can observe that the key difference between the arithmetic.awk and report.awk scripts is that the print statement moves from the main block to the END block, with minor changes to the wording.

Lastly, let’s generate the summary report by executing the report.awk script:

$ awk -f report.awk numbers.txt
total lines: 5, sum: 15, count of even numbers 2

As expected, we get the summary report after processing all the values from the numbers.txt input file.
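If we wanted the report to include an average as well, a minimal sketch could divide the sum by NR in the END block, since NR still holds the total record count there (assuming the input file isn’t empty, to avoid dividing by zero):

$ awk '{ sum += $0 } END { print "total lines:", NR ", sum:", sum ", average:", sum/NR }' numbers.txt
total lines: 5, sum: 15, average: 3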

5. Sorting

We can also use the END block in an awk program to sort a file. In this section, let’s learn how to read numbers from the unsorted_numbers.txt input file and then sort them with the bubble sort algorithm:

$ cat unsorted_numbers.txt
4
3
2
9
6

Next, let’s write the sort.awk script and implement the bubble sort algorithm in the END block:

$ cat sort.awk
# Collect every input line into the lines array
{
    lines[NR] = $0
}

END {
    # Bubble sort: repeatedly swap adjacent out-of-order values;
    # adding 0 forces a numeric comparison
    for (i = 1; i <= NR-1; i++) {
        for (j = 1; j <= NR-i; j++) {
            if (lines[j]+0 > lines[j+1]+0) {
                temp = lines[j]
                lines[j] = lines[j+1]
                lines[j+1] = temp
            }
        }
    }

    # Print the sorted lines
    for (i = 1; i <= NR; i++) {
        print lines[i]
    }
}

We can see that we’ve used the main block only to populate the lines array with the input numbers. The rest of the logic lives in the END block: sorting the array and then printing the sorted numbers.

Finally, let’s put our sort.awk script in action and sort the numbers from the unsorted_numbers.txt file:

$ awk -f sort.awk unsorted_numbers.txt
2
3
4
6
9

Perfect! The numbers are in the correct order, so our script works as expected.
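As a side note, if we’re using GNU awk (gawk), we can delegate the sorting to the built-in asort() function instead of hand-rolling the algorithm. Here’s a minimal sketch, assuming gawk is installed, since asort() is a gawk extension and not part of POSIX awk:

$ gawk '{ lines[NR] = $0 } END { n = asort(lines); for (i = 1; i <= n; i++) print lines[i] }' unsorted_numbers.txt
2
3
4
6
9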

6. Data Validation

We often want to validate input data against some criteria, and awk is a perfect tool for the job. Since the overall verdict depends on the entire input, we’ll check each line in the main block and place the summary logic in the END block. In this section, we’ll walk through a scenario where we want to verify that all the values in the input file are numeric.

Let’s add the validation logic in the number_validation.awk script and look at the script in its entirety:

$ cat number_validation.awk
BEGIN {
    valid = 1
}

{
    # Validate each line of the input
    if (!isValid($0)) {
        print "Invalid input:", $0
        valid = 0
    }
}

END {
    if (valid) {
        print "All input is valid."  # Display success message
    } else {
        print "Validation failed."  # Display error message
        exit 1  # Exit with non-zero status to indicate error
    }
}

function isValid(line) {
    # Check if line is a number
    if (line ~ /^[0-9]+$/) {
        return 1  # line is a number
    } else {
        return 0  # line is not a number
    }
}

Let’s break this down to understand the nitty-gritty of the logic. In the BEGIN block, we set the valid variable to 1, which is equivalent to a true value in awk. The isValid() function checks whether an input line is numeric. In the main block, we set valid to 0 as soon as we encounter a non-numeric value; otherwise, it remains 1. Finally, the END block prints the validation verdict for the entire input file and exits with a non-zero status on failure.

Next, let’s use the number_validation.awk script to validate the numbers.txt file:

$ awk -f number_validation.awk numbers.txt
All input is valid.

Lastly, let’s also see if it works correctly for a file containing non-numeric data:

$ cat num_str.txt
t
#
42
$ awk -f number_validation.awk num_str.txt
Invalid input: t
Invalid input: #
Validation failed.

Great! We’ve verified that our script is working correctly for both scenarios.
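Since the END block exits with a non-zero status when validation fails, we can also branch on the script’s exit code from the shell. For example, discarding the normal output and checking the status:

$ awk -f number_validation.awk num_str.txt > /dev/null; echo $?
1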

7. Line Deduplication

In this section, we’ll learn another use case of the END block, wherein we want to remove duplicate lines from a text file.

First, let’s take a look at the duplicates.txt file that contains duplicate numeric values:

$ cat duplicates.txt
1
8
2
3
1
3
2
8
2
1

Next, let’s write the dedup.awk script, where we use the lines associative array to count the occurrences of each unique line; the array’s keys are the lines themselves:

$ cat dedup.awk
{
    lines[$0]++    # the array keys become the unique lines
}

END {
    for (line in lines) {
        print line
    }
}

We must note that the for loop in the END block iterates over the keys of the lines associative array, which are exactly the unique lines.

Finally, let’s execute the dedup.awk script to remove duplicate values from the duplicates.txt file:

$ awk -f dedup.awk duplicates.txt
3
8
2
1

It looks like we’ve nailed this!
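We should also note that the for (line in lines) loop visits the array keys in an unspecified order, which is why the output above isn’t sorted. If we’d rather keep each line’s first occurrence in its original input order, the classic awk idiom below does that without an END block, printing a line only when its counter is still zero:

$ awk '!seen[$0]++' duplicates.txt
1
8
2
3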

8. Conclusion

In this article, we learned the use of BEGIN and END blocks in an awk program. Further, we solved several interesting use cases, including report generation, line deduplication, sorting, data validation, and data conversion.
