1. Overview

When we process files under the Linux command line, we often need to manipulate each line of an input file, such as removing the last character from each line.

This time, let’s take a look at another problem: extracting the last word from each line.

2. Introduction to the Problem

2.1. The Example Input

Examples are always helpful to understand a problem quickly.

First of all, let’s see an input file:

$ cat input.txt
Linux rocks!
Next line is an empty line:

I have trailing spaces:     
I have a number: 42

Our input.txt has several lines of text.

Additionally, the file contains an empty line and one line with trailing spaces. However, this information is not so obvious in the output above.

The cat command with the -e option prints a “$” sign at the end of each line:

$ cat -e input.txt
Linux rocks!$
Next line is an empty line:$
$
I have trailing spaces:     $
I have a number: 42$

Now, we can clearly see the trailing spaces in the output.
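Trailing spaces are easily lost when copying text from a page, so it can help to generate the sample file instead of pasting it. The following is a minimal sketch, assuming a POSIX shell and five trailing spaces on the fourth line:

```shell
# Recreate the sample file; printf keeps the empty line and the
# trailing spaces (assumed to be five) exactly as written.
printf '%s\n' \
  'Linux rocks!' \
  'Next line is an empty line:' \
  '' \
  'I have trailing spaces:     ' \
  'I have a number: 42' > input.txt

# Confirm the layout: every line end is marked with "$".
cat -e input.txt
```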

Let’s revisit our goal, “extracting the last word from each line” — that seems clear enough. However, there are a couple of things we need to pay attention to.

2.2. The Definition of a Word

A word can have different definitions:

  • A word could mean an English word — a string like “ab_cd_1234” doesn’t count.
  • A word is a string that matches the regex “\w+”. That is, it contains only alphanumeric characters (letters or digits, regardless of case) or the underscore character (“_”). For example, “ab_cd_1234” is a regex word, but “ab.cd#1234” isn’t.
  • A word is a combination of any non-whitespace characters. For example, both “ab_cd_1234” and “ab.cd#1234” are words.

Our definition of “word” affects the solution to the problem. So, in this tutorial, we’ll take the last one in the list above as the definition of a “word”.
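To make the difference between the last two definitions concrete, here is a small sketch. It assumes a grep that supports the -o option (GNU and BSD grep do, although POSIX doesn't require it), and it spells “\w” out as “[[:alnum:]_]” for portability:

```shell
sample='ab_cd_1234 ab.cd#1234'

# Regex-word definition: runs of letters, digits, or underscores.
# "ab.cd#1234" is broken into the three pieces ab, cd, and 1234.
printf '%s\n' "$sample" | grep -oE '[[:alnum:]_]+'

# Definition used in this tutorial: runs of non-whitespace characters.
# "ab.cd#1234" stays in one piece.
printf '%s\n' "$sample" | grep -oE '[^[:space:]]+'
```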

2.3. Handling Trailing Spaces

Depending on the requirement, the problem may have two different variants when a line contains trailing spaces:

  • Return an empty string as a result.
  • Return the last non-whitespace character sequence. If the whole line is blank or empty, we would like to have an empty string as a result.

In this tutorial, we’ll cover both variants and address two approaches to solve the problem:

  • Using the sed command
  • Using the awk command

Next, let’s see them in action.

3. Using the sed Command

sed is a non-interactive stream editing utility. Let’s see how to solve the problem using this great tool.

3.1. Trailing Spaces: Taking an Empty String

One idea for solving the problem with sed is to remove everything up to and including the last horizontal whitespace character in the line, such as a space or a tab.

sed’s “s/pattern/replacement/” command is well suited to this:

$ sed 's/.*[[:blank:]]//' input.txt | cat -e
rocks!$
line:$
$
$
42$

As the example above shows, we’ve piped sed’s output to the cat -e command to make whitespace characters easier to check.

The output is what we expect. Since “.*” is greedy, the pattern matches everything up to and including the last blank on the line, so only the text after that blank remains. As a result, we get an empty string for the empty line and for the line with trailing spaces.

It’s also worth mentioning that [:blank:] is a POSIX standard character class that matches spaces and tabs.

We’re using GNU sed in this tutorial, so “\s” would work in place of [[:blank:]] as well. However, using the POSIX standard character class makes the solution more portable.
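Since [[:blank:]] covers tabs as well as spaces, the same substitution also handles tab-separated lines. A quick sketch with made-up sample text:

```shell
# The last word follows a tab rather than a space.
printf 'first\tsecond\tlast\n' | sed 's/.*[[:blank:]]//'   # prints: last

# Mixed separators are handled the same way.
printf 'a b\tc d\n' | sed 's/.*[[:blank:]]//'              # prints: d
```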

3.2. Trailing Spaces: Taking the Last Non-Whitespace Character Sequence

Once we’ve solved the first variant of the problem, this one is straightforward.

We can extend the first solution by adding a preprocessing step: removing all trailing spaces.

Simply put, we can first right-trim the line and then apply the “s/.*[[:blank:]]//” substitution command:

$ sed 's/[[:blank:]]*$//; s/.*[[:blank:]]//' input.txt | cat -e
rocks!$
line:$
$
spaces:$
42$

Again, we piped sed’s output to the cat -e command to verify whitespace characters.

As the output above shows, we’ve got an empty string for the empty line in the input file, while for the line with trailing spaces, we’ve extracted the last non-whitespace character sequence (“spaces:”) as a result.
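To see what each substitution contributes, we can run the right-trim on its own and then both steps together. This is only a sketch: last_word_sed is a hypothetical helper name, and the whitespace-only line is an extra test case that isn't part of input.txt:

```shell
# Hypothetical helper that wraps the two-step substitution.
last_word_sed() {
  sed 's/[[:blank:]]*$//; s/.*[[:blank:]]//'
}

# Step 1 alone: only the trailing blanks are removed.
printf 'I have trailing spaces:     \n' | sed 's/[[:blank:]]*$//' | cat -e
# prints: I have trailing spaces:$

# Both steps on a line with trailing spaces, a whitespace-only line,
# and an empty line.
printf '%s\n' 'I have trailing spaces:     ' '     ' '' | last_word_sed | cat -e
# prints: spaces:$ then $ then $
```

The whitespace-only line is reduced to an empty string by the right-trim, so the second substitution has nothing left to match.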

4. Using the awk Command

awk is another powerful text processing tool under the Linux command line.

Similar to sed, the awk command provides substitution functions sub() and gsub(). Therefore, we can certainly take the same ideas here to solve the problem.
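As a rough sketch of that idea, the two sed substitutions can be translated to sub() calls. Here, “[ \t]” is written out in place of [[:blank:]] on the assumption that some older awk builds lack POSIX character classes:

```shell
# Variant 1: drop everything up to and including the last blank.
printf '%s\n' 'Linux rocks!' 'I have trailing spaces:     ' |
  awk '{ sub(/.*[ \t]/, ""); print }' | cat -e
# prints: rocks!$ then $

# Variant 2: right-trim first, then apply the same substitution.
printf '%s\n' 'Linux rocks!' 'I have trailing spaces:     ' |
  awk '{ sub(/[ \t]*$/, ""); sub(/.*[ \t]/, ""); print }' | cat -e
# prints: rocks!$ then spaces:$
```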

However, awk has built-in support for field-based input. For example, we can treat each word in a line as a field.

So, if the requirement is to extract the last word, we can simply ask awk to print the last field.

Before we look at the awk solutions to the problem, let’s take a closer look at awk’s FS variable.

4.1. The awk FS Variable in a Nutshell

awk treats the values of the FS variable differently depending on how we define it, and we can define FS in three different ways:

  • as the empty string
  • as a single character
  • as more than one character

Let’s see how awk treats each case.

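First, if FS is empty, each character in the input record will be a field. GNU awk and most other current implementations behave this way, although POSIX leaves the result of an empty FS unspecified: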

$ awk 'BEGIN{FS=""}{print $1,$2,$3}' <<< "AWK"
A W K

Second, if FS is a single character, the literal character will be the separator:

$ awk 'BEGIN{FS="*"}{print $1,$2,$3}' <<< "A*W*K"
A W K

However, in this case, there is an exception.

When FS is a single space character, which is also the default value, awk splits the record on runs of spaces, tabs, and newlines, much like the regex separator “[ \t\n]+”. In addition, it ignores leading and trailing whitespace in the record:

$ awk 'BEGIN{FS=" "}{print $1,$2,$3}' <<< "    A  W    K    "
A W K
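The single-space FS isn't quite the same as an explicit whitespace regex, and counting fields shows the difference. In this sketch, the sample line is made up and “[ \t]+” stands in for a regex separator:

```shell
line='  A  W  '

# Default FS (a single space): leading and trailing whitespace is ignored.
printf '%s\n' "$line" | awk '{ print NF }'              # prints: 2

# Regex FS: the leading and trailing separators delimit empty fields.
printf '%s\n' "$line" | awk -F'[ \t]+' '{ print NF }'   # prints: 4
```

With the regex separator, the first and the last fields are empty strings, which is why the choice of FS decides what the last field of a line with trailing spaces is.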

Third, if FS’s value is longer than a single character, awk treats it as a regex:

$ awk 'BEGIN{FS="[#@]"}{print $1,$2,$3}' <<< "A#W@K"
A W K

Now, let’s see how to solve the problem by adjusting the FS variable.

4.2. Trailing Spaces: Taking an Empty String

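If we would like to have an empty string as a result when a line has trailing whitespace characters or is empty, we can set awk’s built-in FS variable to the horizontal whitespace character class. Since “[[:blank:]]” is longer than one character, awk treats it as a regex, so every single blank is a separator, and a trailing blank leaves an empty last field: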

$ awk -F'[[:blank:]]' '{print $NF}' input.txt | cat -e
rocks!$
line:$
$
$
42$

We should note that setting FS to “\s” works in some awk implementations, such as the widely used GNU awk.

However, we need to escape the backslash: awk -F'\\s' '{print $NF}' input.txt. Otherwise, awk treats “\s” as a literal “s”.

4.3. Trailing Spaces: Taking the Last Non-Whitespace Character Sequence

To solve this variant of the problem, we can write a more compact awk one-liner:

$ awk '{print $NF}' input.txt | cat -e
rocks!$
line:$
$
spaces:$
42$

As we can see, we’ve got the expected output.

Notably, we didn’t set the FS variable in the awk command above. That is, we used the default value of FS.

As we’ve learned, with the default FS, awk ignores leading and trailing whitespace when it splits a record into fields. Therefore, the short one-liner does the job.
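One edge case that input.txt doesn't exercise is a line that contains only whitespace. For such a line, NF is 0, so $NF refers to $0 and the whitespace is printed unchanged. If an empty string is wanted there too, a possible guard is sketched below, where last_word_awk is a hypothetical helper name:

```shell
# Hypothetical helper: print the last field, or an empty string
# when the record has no fields at all.
last_word_awk() {
  awk '{ print (NF ? $NF : "") }'
}

# Without the guard, a whitespace-only line is printed as is.
printf '     \n' | awk '{ print $NF }' | cat -e     # prints: "     $"

# With the guard, it becomes an empty line.
printf '%s\n' 'I have trailing spaces:     ' '     ' '' | last_word_awk | cat -e
# prints: spaces:$ then $ then $
```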

5. Conclusion

In this article, we’ve addressed two ways to get the last word from each line of a file.

With sed, we can use its substitution command to solve the problem. With awk, we can adjust the FS variable and simply print the last field.
