1. Overview
When manipulating text, especially in large datasets, removing only the last occurrence of a pattern can be challenging because common tools don’t offer a direct option for it.
In this tutorial, we’ll learn how to remove the last occurrence of a pattern in a file using command-line utilities, such as sed, awk, and tac, as well as the Vim editor.
2. Scenario Setup
Let’s start by looking at the items.txt sample text file:
$ cat items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9,
The file contains comma-delimited values.
Each line ends with a trailing comma (,). We aim to remove only the last comma in the items.txt file, which is the one after the item9 value.
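To follow along, we can recreate the sample file from the shell. This is a minimal sketch that assumes a scratch directory created with mktemp, which is our own choice and not part of the original setup:

```shell
# Work in a throwaway directory so that no existing file is overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# printf reuses the format for each argument, so this writes three lines
printf '%s\n' 'item1,item2,item3,' 'item4,item5,item6,' 'item7,item8,item9,' > items.txt

# Count the commas to confirm the starting point: three per line
comma_count=$(tr -cd ',' < items.txt | wc -c)
echo "commas: $comma_count"
```

After a successful removal, the same count should drop by exactly one.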
3. Using sed
In this section, let’s explore how to solve our use case using the sed command-line utility.
3.1. With Greedy Approach
First, let’s write one of the common sed idioms that reads the entire file into the pattern space:
$ sed -E ':a;N;$!ba' items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9,
Now, let’s break this down to understand the logic. First, we define the a label to serve as the loop target. Then, the N command appends the next input line to the pattern space, and $!ba branches (b) back to the a label on every line except the last one ($). Lastly, the entire file is displayed because sed prints the pattern space by default when the script ends.
If we look closely, there are just two commands other than the label definition:
:a
N
$!ba
Finally, we can add a substitution (s) command that uses the (.*),(.*) pattern. Because the first .* is greedy, it consumes as much of the pattern space as possible, so the comma between the two groups is the last one in the file. The replacement then joins the \1 and \2 groups without that comma:
$ sed -E ':a;N;$!ba; s/(.*),(.*)/\1\2/' items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9
Great! We’ve got the correct results.
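To see that the greedy match targets the last comma in the whole file rather than a comma on the last line, we can try a different input. The following is a small sketch with a hypothetical three-line sample whose final line has no comma, assuming GNU sed:

```shell
# Hypothetical sample: the last comma in the file sits on the second line
sample=$(mktemp)
printf '%s\n' 'a,b,' 'c,d,' 'end' > "$sample"

# Slurp the file, then let the greedy first group run up to the final comma
result=$(sed -E ':a;N;$!ba; s/(.*),(.*)/\1\2/' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Here, the first group ends at c,d and the second group holds the newline followed by end, so only the comma after d is dropped.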
3.2. With tac
Our greedy approach of reading the entire file into the pattern space works fine for smaller datasets. However, we’ll start noticing performance issues with large datasets because the whole file is held in memory at once.
To optimize memory utilization, we can use the tac command to reverse the order of lines, remove the pattern, and then reverse the order of lines back:
tac <file> | <sed script to remove> | tac
Since tac outputs the last line first and the first line last, the last occurrence of the pattern in the file now sits on the first line that contains it. So, we’ll use sed to remove the last occurrence of the pattern on that line only.
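Before adding sed to the pipeline, it can help to look at what tac alone produces. As a quick sketch, we feed the sample lines directly through a pipe instead of reading items.txt:

```shell
# Feed the three sample lines to tac and capture the reversed output
reversed=$(printf '%s\n' 'item1,item2,item3,' 'item4,item5,item6,' 'item7,item8,item9,' | tac)
printf '%s\n' "$reversed"
```

The line that held the last comma of the file is now the first line of the stream, which is where the sed script can act on it.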
Let’s see the entire series of commands in action:
$ tac items.txt \
| sed -n -E ':remove_and_print;s/(.*),(.*)/\1\2/;t print_only; p; n; b remove_and_print; :print_only; p; n; b print_only;' \
| tac
item1,item2,item3,
item4,item5,item6,
item7,item8,item9
Fantastic! Our approach works fine.
Now, let’s break down the logic, particularly for the sed commands.
We’ve defined two labels for flow control, namely, remove_and_print and print_only. Within the remove_and_print block, we try to substitute the last occurrence of the pattern on the current line. If the substitution succeeds, the t command transfers the flow to print_only. Otherwise, we print the line (p), read the next one (n), and try again:
:remove_and_print
s/(.*),(.*)/\1\2/;
t print_only;
p;
n;
b remove_and_print;
Moreover, within the print_only block, we continue to take the next (n) line and print (p) it:
:print_only;
p;
n;
b print_only;
Lastly, let’s acknowledge that the advantage of this approach is that we’re keeping a single line in the pattern space, so it doesn’t use much memory.
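The one-line script can be hard to read, so we can also keep the same commands in a script file with one command per line and load it with the -f option. This is a sketch under the assumption of GNU sed and tac, and the remove_last.sed filename and the scratch directory are hypothetical:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'item1,item2,item3,' 'item4,item5,item6,' 'item7,item8,item9,' > "$workdir/items.txt"

# Same commands as the one-liner, one per line (hypothetical file name)
cat > "$workdir/remove_last.sed" <<'EOF'
:remove_and_print
s/(.*),(.*)/\1\2/
t print_only
p
n
b remove_and_print
:print_only
p
n
b print_only
EOF

# Reverse, remove the comma on the first matching line, reverse back
result=$(tac "$workdir/items.txt" | sed -n -E -f "$workdir/remove_last.sed" | tac)
printf '%s\n' "$result"

rm -rf "$workdir"
```

Writing one command per line also avoids any doubt about where a label name ends, since the newline terminates it.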
4. Using awk
In this section, let’s learn how we can use the awk utility to remove the last occurrence of a comma in the items.txt file.
4.1. With Buffer Array
Let’s start by looking at the remove_comma.awk script in its entirety:
$ cat remove_comma.awk
function sub_at_position(line, position) {
len = length(line);
pre = substr(line, 1, position-1);
post = substr(line, position+1);
return pre post;
}
{
buffer[NR] = $0;
n = split($0, a, ",");
if (n > 1) {
last_occurrence = NR;
position_last_comma = length($0) - length(a[n]);
}
}
END {
for (i = 1; i <= NR; i++) {
if (i == last_occurrence) {
buffer[i]=sub_at_position(buffer[i], position_last_comma);
}
print buffer[i];
}
}
Now, let’s understand the code flow within our script.
First, we’ve defined the sub_at_position() helper function, which accepts two parameters, line and position. It splits the line into pre and post, the text before and after the position, and returns their concatenation, which drops the character at that position.
Then, we store each line in the buffer array. Additionally, we keep track of the last line number containing a comma with the last_occurrence variable. For this line, we define the position_last_comma variable as the last comma position.
Eventually, we print each line from the buffer array in the END block. Only for the last_occurrence line, we use the sub_at_position() function to remove the comma marked by the position_last_comma index.
Finally, let’s execute the remove_comma.awk script to remove the last occurrence of comma (,) in the items.txt file:
$ awk -f remove_comma.awk items.txt
item1,item2,item3,
item4,item5,item6,
item7,item8,item9
Perfect! It looks like we nailed this one.
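The position arithmetic is easier to trust after checking it in isolation on a line where the last comma isn’t the final character. The following sketch uses a hypothetical input string and a compact version of the helper that calls substr() without a length argument, so it returns everything up to the end of the line:

```shell
result=$(awk '
function sub_at_position(line, position) {
    # Keep everything before the position and everything after it
    return substr(line, 1, position - 1) substr(line, position + 1)
}
BEGIN {
    line = "item7,item8,item9"
    n = split(line, parts, ",")
    # The last comma sits immediately before the final field
    position_last_comma = length(line) - length(parts[n])
    print sub_at_position(line, position_last_comma)
}')
printf '%s\n' "$result"
```

The line is 17 characters long and the final field is 5 characters long, so the computed position is 12, which is the comma between item8 and item9.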
4.2. With tac
Like our greedy approach with the sed utility, the buffer-based approach with awk keeps the entire file in memory, so it scales poorly to large datasets. However, we can optimize our approach by using tac.
Our approach would involve reversing the items.txt file, removing the comma from the first matching line, and reversing it back:
$ tac items.txt | awk -f remove_comma_optimized.awk | tac
Now, let’s see the remove_comma_optimized.awk script in its entirety:
$ cat remove_comma_optimized.awk
function sub_at_position(line, position) {
len = length(line);
pre = substr(line, 1, position-1);
post = substr(line, position+1);
return pre post
}
BEGIN {
is_done=0
}
{
if (!is_done) {
n = split($0, a, ",");
if (n > 1) {
last_occurrence = NR;
position_last_comma = length($0) - length(a[n]);
$0=sub_at_position($0, position_last_comma)
is_done=1
}
}
print $0
}
Now, let’s understand the optimizations done in the remove_comma_optimized.awk script over the remove_comma.awk script. First, we’ve reused the sub_at_position() function from the remove_comma.awk script, so there’s no change there. Then, we can see that we no longer use the buffer array, so we’ve removed the END block.
Furthermore, we’ve defined the is_done variable in the BEGIN block to track the remove operation. We use this for performing a one-time removal operation.
Lastly, let’s execute the remove_comma_optimized.awk script in combination with tac:
$ tac items.txt | awk -f remove_comma_optimized.awk | tac
item1,item2,item3,
item4,item5,item6,
item7,item8,item9
It works as expected!
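As a more compact alternative, the same one-time removal can be sketched with awk’s match() function, which sets RSTART to the start of the match. This swaps the split()-based position lookup for a regular expression, and the sample input is hypothetical:

```shell
# Hypothetical sample: the last comma in the file is in the middle of line 2
sample=$(mktemp)
printf '%s\n' 'a,b,' 'c,d' 'end' > "$sample"

result=$(tac "$sample" | awk '
!is_done && match($0, /,[^,]*$/) {
    # Only the last comma on a line can be followed by non-commas up to
    # the end of the line, so RSTART points at that comma
    $0 = substr($0, 1, RSTART - 1) substr($0, RSTART + 1)
    is_done = 1
}
{ print }
' | tac)
printf '%s\n' "$result"

rm -f "$sample"
```

Just like in the script above, the is_done flag ensures that only the first matching line of the reversed stream is modified.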
5. Using Vim Editor
Vim is a versatile text editor that can be used for effective text manipulation. In this section, we’ll write a vim script to solve our use case of removing the last occurrence of a pattern in a file.
5.1. Vim Script
We can automate text editing operations using a vim script and use them repeatedly. So, let’s write the remove_last_pattern.vim to solve our use case:
$ cat remove_last_pattern.vim
function! RemoveLastPattern(pattern)
" Go to last line
normal G
" Add empty line at the end
normal o
" Search backward from the appended empty line, so that a match
" at the very end of the original last line is also found
execute "normal ?" . a:pattern . "\<CR>"
" Delete the pattern
let l:patternLength = strlen(a:pattern)
execute "normal " . l:patternLength . "x"
" Delete the last line
normal Gdd
endfunction
" Map the function to a command for ease of use
command! -nargs=1 RemoveLast :call RemoveLastPattern(<q-args>)
Initially, the script could look overwhelming. However, it’s just a series of vim commands. Let’s take a closer look to understand the complete logic within the RemoveLastPattern() function.
We begin by moving the cursor to the last line using the G command in normal mode. Then, we open a new empty line at the end of the file with the o command, which leaves the cursor on that line. Because a backward search only finds matches located before the cursor, starting from this empty line ensures that a pattern at the very end of the original last line is still found.
Then, we use the execute command to evaluate a string as a vim command. For creating the vim expression, we use the string concatenation operator (.) to create a backward search (?) for the pattern. Furthermore, we must note that the a: prefix indicates that the variable is an argument to the parent function.
Next, we build a delete expression by prefixing the x (delete character) command with patternLength as a count. We must note that this assumes a literal pattern, where the matched text is as long as the pattern string itself. At the end, we delete the empty line that we added using the Gdd command in normal mode.
Lastly, we define the custom RemoveLast command, which calls the RemoveLastPattern() function with exactly one argument (-nargs=1). Further, we must note that <q-args> gets replaced by the argument given to the RemoveLast command, quoted so that it reaches the function as a string.
5.2. Vim Script in Action
Firstly, let’s open the items.txt file using the vim command:
$ vim items.txt
Now, we need to source the remove_last_pattern.vim script so that we get access to the RemoveLast custom command:
:source remove_last_pattern.vim
Next, we can call the RemoveLast command with a comma (,) as its argument:
:RemoveLast ,
That’s it! We’ve successfully removed the last occurrence of a comma (,) in the items.txt file.
Finally, after verifying the changes, we can choose to save the file:
:wq
Perfect! We’ve now got a convenient way to solve our use case.
6. Conclusion
In this article, we learned how to remove the last occurrence of a pattern in a file. Furthermore, we explored command-line utilities, such as sed, awk, and tac, along with the Vim editor, to solve our use case.