If you have a few years of experience in the Linux ecosystem, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

1. Overview

Extracting a substring from a string is a fundamental and common operation of text processing in Linux.

In this tutorial, we’re going to have a look at various ways to extract substrings using the Linux command line.

2. Introduction to the Problem

As the name suggests, a substring is a part of a string. The problem is pretty straightforward: we want to extract a part of a given string. However, there are two different types of extraction requirements: index-based and pattern-based.

Let’s understand the two different requirements through a couple of examples.

An index-based substring is defined by its start and end indexes in the original string. Let’s look at the scenario of extracting an index-based substring.

Given the input string “0123Linux9”, we want to extract the substring from index 4 through index 8, counting from 0. The expected result is “Linux”.

Next, let’s see an example of the pattern-based substring.

For instance, we have an input string, “Eric,Male,28,USA”. It’s a string of comma-separated values (Name,Gender,Age,Country).

Now, let’s say we want to extract the third field, 28, which is Eric’s age. In this case, we cannot predict the start index of the target substring since the Name and Gender fields vary in length. Therefore, the implementation will be different from the index-based extraction.

In this article, we’ll address some common ways to extract substrings in the Linux command line. Of course, we’ll cover both extraction types.

3. Extracting an Index-Based Substring

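First, let’s have a look at how to extract index-based substrings. We’ll introduce four ways to do that:

  • Using the cut command
  • Using the awk command
  • Using Bash’s substring expansion
  • Using the expr command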

Next, we’ll see them in action.

3.1. Using the cut Command

We can extract the Nth through the Mth character of the input string using the cut command: cut -c N-M.

As we’ve discussed in an earlier section, our requirement is to take the substring from index 4 through index 8.

Here, the indexes are 0-based, as in Bash’s substring expansion. The cut command, however, counts characters from 1.

Therefore, if we want to solve the problem using the cut command, we need to add one to both the start and the end index. Thus, the range becomes 5-9.

Now, let’s see if the cut command can solve the problem:

$ cut -c 5-9 <<< '0123Linux9'
Linux

As the output shows, we’ve got the expected substring, “Linux” — problem solved.

In the example above, we passed the input string to the cut command via a here-string instead of piping it from echo, which saves an extra process.
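
The index conversion described above can be wrapped in a small helper. The following is a minimal sketch, assuming Bash; the function name substr_cut is hypothetical and only serves the illustration. It takes 0-based start and end indexes and adds one to each before calling cut -c:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the characters of $1 from 0-based index $2 through $3 (inclusive).
substr_cut() {
    local str="$1" start="$2" end="$3"
    # cut counts characters from 1, so both indexes are shifted by one
    cut -c "$((start + 1))-$((end + 1))" <<< "$str"
}

substr_cut '0123Linux9' 4 8   # Linux
```

Keeping the conversion in one place means the caller can keep thinking in 0-based indexes.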

3.2. Using the awk Command

When we need to solve some text processing problem in Linux, we shouldn’t forget the Swiss army knife: awk.

awk has a built-in substr() function, so we can call it directly to get the substring.

The substr(s, i, n) function accepts three arguments. Let’s take a closer look at them:

  • s – The input string
  • i – The start index of the substring (awk uses the 1-based index system)
  • n – The length of the substring. If it’s omitted, awk will return from index i until the last character in the input string as the substring

Now, let’s see if awk’s substr() function can give us the expected result:

$ awk '{print substr($0, 5, 5)}' <<< '0123Linux9'
Linux

Good! The awk command works as expected.

Here, we pass i=5 because awk needs the 1-based index, that is, 4+1. The third argument, 5, is the length of the target substring, and we get it by 8-4+1.
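
The same arithmetic can be left to awk by passing the 0-based indexes in with -v. This is only a sketch, assuming Bash for the here-string; the function name substr_awk and the variable names first and last are made up for this example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the substring of $1 between 0-based indexes $2 and $3 (inclusive).
substr_awk() {
    # first + 1 converts to awk's 1-based index; last - first + 1 is the length
    awk -v first="$2" -v last="$3" \
        '{ print substr($0, first + 1, last - first + 1) }' <<< "$1"
}

substr_awk '0123Linux9' 4 8   # Linux
```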

3.3. Using Bash’s Substring Expansion

We’ve seen how cut and awk can easily extract index-based substrings.

Alternatively, Bash is sufficient to solve the problem since it supports substring expansion via ${VAR:start_index:length}. 

Today, Bash is the default shell for many modern Linux distros. In other words, we can solve the problem without using any external command:

$ STR="0123Linux9"
$ echo "${STR:4:5}"
Linux

As we can see in the output above, we’ve solved the problem using pure Bash.
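
Substring expansion also has a couple of variations that may be handy. The short sketch below, which assumes Bash, shows what happens when the length is omitted and when the offset is negative:

```shell
#!/usr/bin/env bash
STR="0123Linux9"

# Omitting the length returns everything from the offset to the end
echo "${STR:4}"       # Linux9

# A negative offset counts from the end of the string.
# The space before "-" is required, otherwise Bash parses ":-" as a default-value expansion.
echo "${STR: -6:5}"   # Linux
```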

3.4. Using the expr Command

Even though Bash is available on most Linux distros, there are still a few Linux systems that ship without Bash, particularly in the embedded Linux world.

The expr command is a member of the GNU Coreutils package. Therefore, it’s available on most Linux systems.

Further, GNU expr also has a substr operator, an extension that POSIX doesn’t specify, which we can use to extract index-based substrings easily:

expr substr <input_string> <start_index> <length>

It’s worth mentioning that the expr command uses the 1-based index system.

Let’s use expr with the substr command to solve our problem:

$ expr substr "0123Linux9" 5 5
Linux

The output above shows that the expr command has solved the problem.
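
The substr operator isn’t part of the POSIX specification of expr, so it may be missing on some non-GNU implementations. As a possible fallback, and a different technique from the one above, we can use expr’s POSIX regex matching operator (:). The following is a sketch under that assumption; the pattern is anchored at the start of the string and uses basic regular expression syntax:

```shell
#!/bin/sh
STR="0123Linux9"

# Skip the first 4 characters, then capture the next 5.
# With a \( \) group, expr prints the captured text instead of the match length.
expr "$STR" : '.\{4\}\(.\{5\}\)'   # Linux
```

Here, the 4 is the number of characters to skip, which happens to equal the 0-based start index, and the 5 is the length.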

4. Extracting a Pattern-Based Substring

We’ve learned several ways to extract index-based substrings. Next, in this section, let’s look into the pattern-based substrings.

The solutions may look different from the index-based ones, but they’re also pretty straightforward to learn.

We’ll address two approaches to solve our problem:

  • Using the cut command
  • Using the awk command

Further, we’ll have a look at a different pattern-based substring extraction problem.

4.1. Using the cut Command

The cut command is a handy tool for working with field-based data.

Let’s review our problem quickly. Our input string consists of comma-separated values: “Eric,Male,28,USA”. And our goal is to extract the third field, “28”.

To solve the problem, we can tell cut that the string is separated by commas (-d ,), and ask cut to give us the third field (-f 3):

$ cut -d , -f 3 <<< "Eric,Male,28,USA"
28

We got the expected result and solved the problem.
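
The -f option accepts lists and ranges as well as a single field number. As a brief sketch, assuming Bash for the here-string and using a made-up variable name LINE, we can pull out several fields at once:

```shell
#!/usr/bin/env bash
LINE="Eric,Male,28,USA"

# A comma-separated list selects several fields (Name and Age here)
cut -d , -f 1,3 <<< "$LINE"   # Eric,28

# An open-ended range selects from a field through the last one
cut -d , -f 3- <<< "$LINE"    # 28,USA
```

The selected fields are joined again with the same delimiter in the output.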

4.2. Using the awk Command

awk is also good at handling field-based data. A compact awk one-liner can solve the problem:

$ awk -F',' '{print $3}' <<< "Eric,Male,28,USA"
28

Moreover, since awk’s field separator (FS) supports regex, we can build more general solutions with awk.

For instance, if we change the input string by adding a space after each comma, we have “Eric, Male, 28, USA”. This is a common format we can see in the real world.

In this case, the cut command won’t be a good choice to solve the problem. This is because the cut command only supports a single character as the field delimiter.

However, it’s still a piece of cake for awk:

$ awk -F', ' '{print $3}' <<< "Eric, Male, 28, USA"
28

We can even write one awk command to work for both cases. This could be a useful trick in the real world:

$ awk -F', ?' '{print $3}' <<< "Eric, Male, 28, USA"
28
$ awk -F', ?' '{print $3}' <<< "Eric,Male,28,USA"
28
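
The regex idea can be pushed a little further. The sketch below assumes Bash for the here-string and uses a made-up input with inconsistent spacing on both sides of the commas; the separator regex allows any run of spaces before and after each comma:

```shell
#!/usr/bin/env bash
# Made-up input with inconsistent spacing around the commas
LINE="Eric , Male,28 ,  USA"

# FS regex: any run of spaces, a comma, any run of spaces
awk -F' *, *' '{ print $3 }' <<< "$LINE"   # 28
```

Note that this only trims spaces around the separators. Leading spaces before the first field or trailing spaces after the last one would remain part of those fields.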

4.3. A Different Pattern-Based Substring Case

So far, we’ve solved our “Eric’s age” problem. In this problem, our input is a field-based value.

However, in practice, the pattern-based substring may not always be located in a CSV entry. Let’s see another example.

Given the input string “whatever dataBEGIN:Interesting dataEND:something else”, our goal is to extract the substring between “BEGIN:” and “END:”, that is, between two patterns.

Obviously, the cut command cannot help us in this case. But it’s still not a challenge for awk. It can solve this problem in different ways.

Next, let’s see how awk solves it. We save the input string in a variable $STR to make the commands easier to read:

$ STR="whatever dataBEGIN:Interesting dataEND:something else"
$ awk -F'BEGIN:|END:' '{print $2}' <<< "$STR"
Interesting data

$ awk '{ sub(/.*BEGIN:/, ""); sub(/END:.*/, ""); print }' <<< "$STR"
Interesting data

The first awk command defines “BEGIN:” or “END:” as the field separator and takes the second field.

However, the second awk solution doesn’t tweak the field separator. Instead, it applies two regex substitutions to achieve the goal:

  • sub(/.*BEGIN:/, "") – Removes everything from the beginning of the string through “BEGIN:”
  • sub(/END:.*/, "") – Removes everything from “END:” through the end of the input string

After the execution of these two substitutions, we’ll have our expected result. All we need to do is print it out.
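
The two-substitution idea can also be expressed without any external command. The following is a sketch of an alternative technique, the shell’s pattern-removal parameter expansions, rather than something awk provides; the variable names TMP and RESULT are made up for the example:

```shell
#!/usr/bin/env bash
STR="whatever dataBEGIN:Interesting dataEND:something else"

# Remove the longest prefix ending in "BEGIN:" (similar to the greedy /.*BEGIN:/)
TMP="${STR##*BEGIN:}"
# Remove the longest suffix starting with "END:" (similar to /END:.*/)
RESULT="${TMP%%END:*}"

echo "$RESULT"   # Interesting data
```

These expansions use shell glob patterns rather than regular expressions, so * here matches any run of characters.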

5. Conclusion

Extracting a substring is a fundamental technique of text processing in Linux. Depending on the requirement, the substring extraction can be index-based or pattern-based.

In this article, we’ve addressed how to extract substrings of both types through examples.

Also, we’ve felt the power of the handy text processing utility awk.
