1. Overview

In data processing and analysis, it’s often necessary to extract specific fields or columns from structured text. Sometimes, the desired field is specified in reference to another one.

In this tutorial, we’ll explore the technique of extracting a field from a line of text by counting fields from the last. We’ll discuss several processes step-by-step and looks at an example to illustrate the concept.

2. Sample Task

In text data, fields are typically separated by delimiters such as commas, tabs, or spaces. The cut command in Linux and Unix-based systems provides a convenient way to extract fields based on their position. By default, cut uses a tab delimiter, but we can configure it to use any specified delimiter.

Let’s suppose we wish to extract the field preceding the last one from a comma-separated input file named file.txt:

$ cat file.txt

It’s important to note that a CSV file can consist of field values containing a comma. In such cases, we can employ specialized tools such as csvtool or Python‘s csv module to correctly separate the fields. However, in our current example, the field values don’t include the delimiter.

Since there are five columns in total, we can use the cut command with a comma delimiter to extract the fourth column, the one before the last:

$ cut -d ',' -f 4 file.txt

We use the -d option with cut to specify the delimiter, while the -f option selects the fields to extract. In this case, extracting the fourth column has the expected result.

However, when dealing with large datasets with many fields, manually counting the number of columns until the desired field becomes impractical. In such cases, it could be advantageous to be able to specify the required field number, counting from the end.

Now, let’s explore various methods to accomplish this task in Bash.

3. Using rev and cut

One of the simplest approaches to enable specifying a field by counting from the end involves combining the use of rev and cut:

$ rev file.txt | cut -d ',' -f 2 | rev

We use the rev command to reverse the sequence of characters within each line of file.txt. Then, we pipe the output to the cut command while specifying that we want to extract the second field from each line. Finally, we use rev once again to restore the extracted characters to their original order.

4. Using Total Column Count With cut

If the given data has the same number of fields for each entry, then we can use an alternative approach. This solution involves first determining the total number of columns in the data. By doing so, we can subsequently specify the desired field based on its position relative to the total number of columns:

$ n=$(head -1 file.txt | tr ',' '\n' | wc -l)
$ cut -d ',' -f "$((n-1))" file.txt

To find the number of fields, we employ a three-step procedure:

  1. use the head -1 command to extract the first line from the input file
  2. apply the tr command to translate each comma (,) or delimiter character into a newline character (\n)
  3. use wc -l to count the number of lines

The second step effectively transforms the comma-separated values into separate lines, while the third step counts those lines. Since we aim to assign the count of fields to the variable n, we execute these commands within a subshell using command substitution.

Once the count of fields is in n, we proceed with the extraction process using the cut command. We use the -d option to specify a comma delimiter and the -f option to indicate the target field number. In this case, the field to extract is specified as the value of n-1, where n represents the number of fields. To compute this value, we use arithmetic expansion, denoted by $(()).

By employing this approach, we ensure that the second-to-last field is selected for extraction. By adjusting the field number value used in the cut command, we can extend this method to extract any field relative to the last. For example, using $((n-2)) corresponds to extracting the third field counting from the end, while $((n-3)) corresponds to extracting the fourth field counting from the end, and so forth.

5. Using Bash Arrays

Another approach involves processing the input file line by line within a while loop. In this method, we store the fields of each line in an array and then echo the desired element from the array:

$ while IFS=, read -ra line_fields; do echo "${line_fields[-2]}"; done < file.txt

We employ a while loop to iterate over the lines from file.txt. We also set the Internal Field Separator (IFS) to a comma. The IFS variable determines how the line will be split into fields during the execution of the read command.

The read -ra line_fields command reads a line from the input file, splits it into fields using the comma as the delimiter, and assigns the fields as elements in the line_fields array variable.

Finally, for each line we read from the file, we echo ${line_fields[-2]}, which prints the second-to-last element of the line_fields array variable. In Bash, we use negative indices to access elements counting from the end of the array.

It’s worth noting that the -r option with read prevents the backslash character (\) from being treated as an escape. This enables reading input without interpreting any special characters. We also use the -a option to assign the values we read to an array.

Alternatively, we can access the same element in another way:

$ while IFS=, read -ra line_fields; do echo "${line_fields[@]: -2:1}"; done < file.txt

The expression ${line_fields[@]: -2:1} extracts a slice of the array line_fields using the ${line_fields[@]} notation. It specifies the range of elements to extract, where -2 indicates the second-to-last element and 1 specifies that only one element should be extracted.

6. Using awk

We can also use GNU awk to extract the required field, counting from the end of each line:

$ awk -F ',' '{print $(NF-1)}' file.txt

The -F option specifies the field separator for awk, which in this case is a comma. Since no pattern is provided, awk executes the print action for every line in the file. Hence, the expression print $(NF-1) directs awk to print the value of the second-to-last field for each line. The variable NF-1 references the field preceding the last one. In awk, the built-in NF variable represents the total number of fields present in an input line.

7. Using Python

Python also provides various methods to select a field number by counting from the end of a line. In particular, we can use the sys or pandas modules to accomplish this task.

7.1. Using the sys Module

By importing the sys module, we can read input from the standard input (stdin). We can extract the required field from each line:

$ cat extract_second_last_column.py
import sys
for line in sys.stdin:

The Python script implements several steps:

  1. iterate over each line of input from stdin, and assign it to the line variable
  2. split the line into a list of fields using the split() method with comma as the delimiter
  3. access the second-to-last element of the resulting list using negative indexing, denoted by [-2]
  4. use the print() function to display the extracted value

To run the script, we’d typically provide input via stdin redirection:

$ python3 extract_second_last_column.py < file.txt

Equivalently, we can provide input through a pipe from another command:

$ cat file.txt | python3 extract_second_last_column.py

Here, the contents of file.txt are piped as input to the Python script via stdin.

7.2. Using the pandas Module

If the given dataset has a consistent number of fields for each row, we can use Python’s pandas module to extract any field, counting backward:

$ cat extract_second_last_column.py
import pandas as pd
df = pd.read_csv('./file.txt', header=None)

We first import the pandas module and assign it the pd alias. Then, we use the read_csv() method to read the contents of the CSV file and create a DataFrame object called df. By specifying the header=None parameter, we indicate that the CSV file lacks a header row, ensuring that the first row is treated as data.

Following that, we print the second-to-last column of the DataFrame using df.iloc[:,-2]. The latter syntax employs the .iloc indexer to select all rows (:) and the second-to-last column (-2) of the DataFrame. To display the selected column as a string, we use the to_string() method. Additionally, we use the index=False parameter to exclude the row index from the output.

Next, we run the script:

$ python3 extract_second_last_column.py

Alternatively, instead of using a separate script file, we have the option of executing the commands within a single line:

$ python3 -c '
import pandas as pd; df = pd.read_csv("./file.txt", header=None); print(df.iloc[:,-2].to_string(index=False))

The -c option specifies the command to execute enclosed within single quotes. In this case, the command actually consists of multiple commands separated by semicolons.

8. Conclusion

Extracting a field from a text line by counting from the end is a practical requirement when working with structured data. In this article, we’ve explored several methods for accomplishing this task. The solutions involved various approaches such as using rev and cut, working with the total number of columns, as well as using Bash arrays, awk, and Python.

Comments are closed on this article!