Replace the First n Matched Instances in a File

1. Overview

In this tutorial, we’ll explore how to replace the first n matched instances in a file using sed and awk.

2. Introduction to the Problem

We may encounter scenarios where we want to modify only the first n occurrences of a particular pattern, leaving the rest unchanged.

For example, let’s say we have the file file.txt:

$ cat file.txt
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

This file has 15 “Linux” instances, five per line. Next, we’ll use different approaches to replace the first n “Linux” instances with “MacOS”.
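
To follow along, we can recreate the sample file and double-check the number of matches with a short snippet. This is only a minimal sketch: it assumes a grep that supports the -o option (GNU and BSD grep do), and it overwrites any existing file.txt in the current directory.

```shell
# Recreate the three-line sample file (overwrites file.txt in the current directory).
line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt

# Count every occurrence of the pattern, not just the matching lines:
# grep -o prints each match on its own line, and wc -l counts those lines.
total=$(grep -o 'Linux' file.txt | wc -l)
echo "total matches: $total"
```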

3. Using sed

sed is a handy tool for processing text on the command line. Next, let’s replace the first three “Linux” instances with “MacOS” using sed.

3.1. Performing n Substitutions on the File Level

By default, sed reads and processes its input line by line. So, a substitution command (s/../../) without the g flag replaces only the first match, but it does so on every line:

$ sed 's/Linux/MacOS/' file.txt  
MacOS_Linux_Linux_Linux_Linux
MacOS_Linux_Linux_Linux_Linux
MacOS_Linux_Linux_Linux_Linux

So the first thing to do is ask sed to perform actions per file instead of per line:

$ sed ':a;$!{N;ba}; s/Linux/MacOS/' file.txt 
MacOS_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

As we can see, this time, the substitution is executed only once on the file level. Let’s quickly understand what :a;$!{N;ba}; does:

:a – Define a label named “a” that branch commands can jump to
$!{…} – If it’s not the last line in the file, then do {…}
{N;ba} – Append the next line to the pattern space (N), and go to label “a” (ba)

In this way, the pattern space holds the entire file, so the s/../../ command replaces the first occurrence of “Linux” in the file.

Also, since the whole file is loaded into sed’s pattern space, memory usage can be high when the input file is large.
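
The same loop can also be written with one command per line and a comment for each step, which may be easier to read than the one-liner. The sketch below assumes GNU sed and recreates the sample file first so that it runs on its own:

```shell
line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt

result=$(sed '
# Define the label "a" as a jump target.
:a
# On every line except the last: append the next input line to the
# pattern space (N), then jump back to label "a" (ba).
$!{
    N
    ba
}
# Here the pattern space holds the whole file, so a substitution
# without the g flag changes only the first match in the file.
s/Linux/MacOS/
' file.txt)
printf '%s\n' "$result"
```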

The next step is to perform this substitution three times. As we cannot use variables, such as a counter, in the sed command, if we want to execute the same substitution n times, we must repeat it n times:

$ sed ':a;$!{N;ba}; s/Linux/MacOS/; s/Linux/MacOS/; s/Linux/MacOS/' file.txt
MacOS_MacOS_MacOS_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

One may ask, what if n=10? Do we have to repeat the s/../../ command ten times? In this case, we can write a script to generate the s/../../ commands:

$ s_cmd=; for ((i=1; i<=10; i++)) ;do s_cmd="$s_cmd s/Linux/MacOS/;"; done
$ echo "$s_cmd"
 s/Linux/MacOS/; s/Linux/MacOS/; s/Linux/MacOS/; ...

Now, we can put the $s_cmd variable in our sed command to replace the first ten “Linux” instances:

$ sed ":a;\$!{N;ba}; $s_cmd" file.txt
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
Linux_Linux_Linux_Linux_Linux

We should note that the sed script must now be in double quotes so that the shell expands $s_cmd. For the same reason, the “$” address must be escaped as \$ so that the shell doesn’t treat it as the start of an expansion.
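
The generate-then-run steps can be wrapped in a small shell function. The following is only a sketch: the function name replace_first_n and its argument order are hypothetical, it assumes GNU sed, and it assumes that neither the pattern nor the replacement contains “/”, “&”, backslashes, or newlines.

```shell
# Hypothetical helper: replace_first_n N PATTERN REPLACEMENT FILE
# Assumes GNU sed, and that PATTERN and REPLACEMENT contain no "/", "&",
# backslashes, or newlines.
replace_first_n() {
    n=$1 pattern=$2 replacement=$3 file=$4
    s_cmd=
    i=0
    # Append one s/../../ command per requested replacement.
    while [ "$i" -lt "$n" ]; do
        s_cmd="$s_cmd s/$pattern/$replacement/;"
        i=$((i + 1))
    done
    # Double quotes let the shell expand $s_cmd; \$ keeps the "last line"
    # address from being treated as a shell expansion.
    sed ":a;\$!{N;ba}; $s_cmd" "$file"
}

line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt
replace_first_n 7 Linux MacOS file.txt
```

Like the one-liner, this helper still loads the whole file into the pattern space, so the memory caveat applies here as well.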

3.2. Using the 0, /Pattern/ Address

We know sed supports range addresses: addr1, addr2. For example, we can process the lines between two patterns in this way: ‘/PAT1/, /PAT2/{…}’. GNU sed additionally accepts ‘0’ as the first address: ‘0, /PAT/’ covers the lines from the beginning of the file up to and including the first line that matches /PAT/. So, if we apply the substitution to this range, the first /PAT/ occurrence will be replaced:

$ sed '0, /Linux/s/Linux/MacOS/' file.txt
MacOS_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

Again, we repeat this action n times to apply the substitution n times:

$ sed '0, /Linux/s/Linux/MacOS/; 0, /Linux/s/Linux/MacOS/; 0, /Linux/s/Linux/MacOS/' file.txt
MacOS_MacOS_MacOS_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
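
It’s easy to confuse this address with a range that starts at line 1. With 1,/Linux/, sed starts looking for the end of the range only from line 2 on, so the substitution also runs on the second line of our file. A small comparison, sketched under the assumption that GNU sed is in use:

```shell
line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt

# Range starting at line 1: the end pattern is first tested on line 2,
# so the substitution runs on lines 1 and 2.
from_one=$(sed '1,/Linux/s/Linux/MacOS/' file.txt)

# Range starting at "line 0": the end pattern may already match on
# line 1, so the substitution runs on line 1 only.
from_zero=$(sed '0,/Linux/s/Linux/MacOS/' file.txt)

printf '1,/Linux/:\n%s\n\n0,/Linux/:\n%s\n' "$from_one" "$from_zero"
```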

3.3. An Edge Case

We’ve seen two sed solutions to the problem. But one case can break both of them: when the replacement text itself contains a match for the search pattern.

Let’s say we want to replace the first three “Linux” instances with “myLinux”. Solution 1 produces this output:

$ sed ':a;$!{N;ba}; s/Linux/myLinux/; s/Linux/myLinux/; s/Linux/myLinux/' file.txt
mymymyLinux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

Solution 2 outputs the same:

$ sed '0, /Linux/s/Linux/myLinux/; 0, /Linux/s/Linux/myLinux/; 0, /Linux/s/Linux/myLinux/' file.txt
mymymyLinux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

This is because, after the first substitution, each subsequent s command finds the “Linux” inside the newly inserted “myLinux” as its first match. If we stick to sed, this problem has no simple fix.
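
A workaround is possible at the cost of an extra step: first swap the leading matches for a marker that can’t match the pattern, then turn every marker into the real replacement. The sketch below assumes GNU sed, and the marker string @@MARK@@ is an arbitrary choice that must not occur anywhere in the input:

```shell
line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt

# Stage 1: replace the first three matches with a marker that does not
#          contain "Linux", so later s commands cannot re-match it.
# Stage 2: replace every marker with the real replacement text.
# Assumption: the marker @@MARK@@ never appears in the input.
result=$(sed ':a;$!{N;ba}
s/Linux/@@MARK@@/; s/Linux/@@MARK@@/; s/Linux/@@MARK@@/
s/@@MARK@@/myLinux/g' file.txt)
printf '%s\n' "$result"
```

Since the result depends on picking a marker that is safe for the given input, this is better seen as a stopgap than as a general solution.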

4. Using awk

awk is another powerful text-processing utility. It’s a C-like scripting language that supports variables, loops, functions, and so on. So, in many cases, it’s more flexible than sed.

So next, let’s see how to solve the problem using awk.

4.1. Implementing a Counter of Successful Replacements

The first idea is defining a counter and putting the substitution operation in a loop. We increment the counter variable after each successful substitution.

Also, if either condition below is satisfied, we jump out of the loop:

The counter variable has reached the limit (n)
No match found in the current record (line)

Next, let’s implement this logic using awk:

$ awk -v limit=3 '{
    while(replaced < limit){
        x = sub(/Linux/, "MacOS")
        if (!x) break
        replaced += x
    }
    print
}' file.txt
MacOS_MacOS_MacOS_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

The output above shows that the awk command does the job.

The script is pretty straightforward. But we should note a couple of points:

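replaced – An uninitialized awk variable evaluates to 0 in a numeric context, so the counter needs no initialization, and it keeps its value across records
sub(/PAT/, Replacement) – Replace the first match in the current record and return either 1 (success) or 0 (pattern not found)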
if (!x) … – Zero is evaluated as False in awk. Therefore, if (!x) is the same as if (x == 0)
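
A quick way to see the return value of sub() is to print it next to the modified record. This is a minimal sketch that should behave the same in any POSIX awk:

```shell
# sub() changes at most one match in $0 and returns the number of changes.
hit=$(echo 'Linux_Linux' | awk '{ n = sub(/Linux/, "MacOS"); print n, $0 }')
miss=$(echo 'Windows' | awk '{ n = sub(/Linux/, "MacOS"); print n, $0 }')
echo "$hit"    # 1 MacOS_Linux
echo "$miss"   # 0 Windows
```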

It’s worth noting that with the -v limit=n and the while loop, we can easily set the limit value without repeating the substitutions n times. For example, we can simply set limit=10 to replace the first ten “Linux” instances:

$ awk -v limit=10 '{while ... }' file.txt
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
Linux_Linux_Linux_Linux_Linux

However, since each sub() call searches the entire record from the start, this approach doesn’t work either when the replacement contains the search pattern:

$ awk -v limit=3 '{
    while(replaced < limit){
        x = sub(/Linux/, "myLinux")
        if (!x) break
        replaced += x
    }
    print
}' file.txt
mymymyLinux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

Next, let’s see another awk approach to handle this case.

4.2. Setting RS="PATTERN"

awk’s RS variable defines the record separator, which is a newline character by default. The idea is to set RS="PATTERN" so that the input is split into records at each occurrence of the search pattern. Then, we go through the records and append either the replacement or RT (the text that matched RS and terminated the current record) to each one, depending on whether the limit has been reached.

Furthermore, since this approach never runs a substitution over text that has already been replaced, it works even when the replacement contains the search pattern.

Notably, since RT is GNU Awk’s extension, the solution works only with gawk.

Next, let’s implement it:

$ gawk -v RS='Linux' -v replacement="myLinux" -v limit=3 '{printf "%s%s", $0, NR<=limit? replacement: RT}' file.txt
myLinux_myLinux_myLinux_Linux_Linux
Linux_Linux_Linux_Linux_Linux
Linux_Linux_Linux_Linux_Linux

However, one case can still break this solution – when the limit exceeds the total number of the /PATTERN/ occurrences:

$ gawk -v RS='Linux' -v replacement="MacOS" -v limit=30 '{printf "%s%s", $0, NR<=limit? replacement: RT}' file.txt
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS

This time, limit (30) exceeds the total number of occurrences (15). As we can see, there is one extra “MacOS” at the end of the output. This is because the text after the last match (here, the final newline) forms one more record, and we only checked NR <= limit without considering that this last record isn’t terminated by the pattern. One way to fix it is to first get the total number of records so that we know when we reach the last one. There are different ways to get the number of records. Here, let’s take this approach:

Read the input file twice
The first time, get the total number of records and store it in a variable, say total
The second time, extend the current solution to append an empty string instead of the replacement when we’re at the last record

Finally, let’s implement it:

$ gawk -v RS='Linux' -v replacement="MacOS" -v limit=30 'NR==FNR{total++; next} {printf "%s%s", $0, FNR==total? "":(FNR<=limit? replacement: RT)}' file.txt file.txt
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
MacOS_MacOS_MacOS_MacOS_MacOS
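
If gawk isn’t available, a similar effect can be sketched in POSIX awk by scanning each line from left to right with match(), which sets RSTART and RLENGTH. Text that has already been handled is moved into an output buffer and never searched again, so the replacement may contain the pattern, and no second pass is needed when limit exceeds the number of matches. The variable names pat, rep, out, and rest are illustrative, and the sketch assumes that the pattern never matches an empty string and that rep contains no backslashes (awk interprets escape sequences in -v assignments).

```shell
line='Linux_Linux_Linux_Linux_Linux'
printf '%s\n' "$line" "$line" "$line" > file.txt

result=$(awk -v pat='Linux' -v rep='myLinux' -v limit=3 '
{
    out = ""      # text that is finished and will not be searched again
    rest = $0     # text that still has to be searched
    while (replaced < limit && match(rest, pat)) {
        # Keep everything before the match, then add the replacement.
        out = out substr(rest, 1, RSTART - 1) rep
        # Continue searching after the end of the match.
        rest = substr(rest, RSTART + RLENGTH)
        replaced++
    }
    print out rest
}' file.txt)
printf '%s\n' "$result"
```

Because this version works line by line, it also avoids holding the whole file in memory.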

5. Conclusion

In this article, we’ve explored how to replace only the first n pattern occurrences in a file using sed and awk.

Further, we discussed a few edge cases and saw how to handle them using awk.
