When we talk about removing duplicate lines in the Linux command line, many of us think of the uniq command and the sort command with the -u option.
Indeed, both commands can remove duplicate lines from input, for example, a text file. However, the uniq command only removes adjacent duplicates, so it requires the input to be sorted, and sort -u first sorts the lines in the file.
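As a quick illustration, uniq leaves non-adjacent duplicates in place unless we sort the input first:
$ printf 'b\na\nb\n' | uniq
b
a
b
$ printf 'b\na\nb\n' | sort | uniq
a
b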
In this tutorial, we’ll explore a method to remove duplicate lines from an input file without sorting.
Before we come to the solution, let's discuss the scenarios in which we can't or shouldn't sort a file before removing duplicates.
The first reason to avoid sorting the input file is performance. If our final goal is only to remove the duplicate lines, the sorting step isn't necessary. Moreover, sorting is relatively expensive, especially for huge input files.
Additionally, sorting a file may change the original order of the lines. Therefore, we shouldn’t sort the file when we want to preserve the order of the lines.
A simple example can explain it clearly. Let’s say we have a file called input.txt:
$ cat input.txt
Linux
is
Linux
nice
is
In the input file, we have duplicate lines, such as “Linux” and “is”. If we remove the duplicate lines and keep them in their original order, we should get:
Linux
is
nice
However, if we first sort the file and then remove duplicates, we’ll have:
$ sort -u input.txt
is
Linux
nice
As the output above shows, the duplicate lines are removed. However, the lines’ order is not what we expect. Further, it’s pretty hard to restore the original order.
Next, let’s see how to remove duplicate lines from a file without sorting.
First, let’s say we have a file called price_log.txt, which holds products’ price updates:
$ cat price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 150, 2019-10-10
Table, 150, 2019-10-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Chair, 57, 2019-05-05
Chair, 57, 2020-02-04
Bed, 400, 2020-07-07
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
Bed, 420, 2020-07-10
As the output above shows, the records are not sorted since the file is maintained manually. Apart from that, there are some duplicate records in the file.
Now, we would like to keep the records’ original order and remove duplicate lines.
We’ll use the awk command to solve the problem. Let’s first see the solution, then understand how it works:
$ awk '!a[$0]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 150, 2019-10-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Chair, 57, 2020-02-04
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
Bed, 420, 2020-07-10
As the output above shows, such a compact awk one-liner has solved the problem. Next, let’s see how it works.
First, in awk, a pattern that evaluates to a non-zero number is treated as true. Further, a true pattern triggers the default action: print. So, for example, awk '42' input_file will print all lines in input_file.
Conversely, a false pattern does nothing. For instance, awk '0' input_file outputs nothing, no matter how many lines the file input_file has.
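We can verify this behavior quickly in a shell, for example with a two-line input generated by printf:
$ printf 'line1\nline2\n' | awk '42'
line1
line2
$ printf 'line1\nline2\n' | awk '0'
$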
Now, let's have a look at how the command awk '!a[$0]++' works. In awk, a is an associative array, and $0 holds the current input line, so a[$0] uses the whole line as the key of the array a. An uninitialized array element evaluates to 0 in a numeric context. Further, a[$0]++ is a post-increment: it returns the current value of a[$0] first and then increments it by one. Therefore, the first time a line, say “A LINE”, appears, a[“A LINE”]++ returns 0, and !0 evaluates to true, so awk prints the line. For every later occurrence of the same line, a[“A LINE”] already holds a non-zero value, so the negation evaluates to false.
In this way, only the first “A LINE” line gets printed by the awk command, as all “A LINE” lines coming later will make !a[“A LINE”]++ be evaluated as false.
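We can reproduce this behavior with a quick test:
$ printf 'A LINE\nB LINE\nA LINE\n' | awk '!a[$0]++'
A LINE
B LINE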
Once we understand how the awk solution works, we can easily adjust the solution to fit new requirements. Next, let’s look at some examples.
We've learned we can use the compact one-liner awk '!a[$0]++' input to remove duplicate lines from an input file. Here, $0 refers to the whole line.
Let’s say now we’ve got a new requirement. In our price_log.txt file, for the same product, we would like to only leave price-unique records in the file. In other words, we need to check the combination of Product and Price for duplicates.
As we've understood how the awk one-liner works, the key to solving this problem is to use the combination of Product and Price as the key of the associative array:
$ awk -F', ' '!a[$1 FS $2]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Table, 170, 2020-12-10
Chair, 57, 2019-05-05
Bed, 400, 2020-07-07
Bed, 420, 2020-08-08
As we can see in the command above, we set ', ' as the field separator (FS) and use $1 FS $2 as the key of the associative array a.
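Following the same pattern, we can adjust the key to fit other requirements. For instance, to keep only the first record for each product, regardless of its price, we could key the array on $1 alone; here's a quick sketch against the same price_log.txt:
$ awk -F', ' '!a[$1]++' price_log.txt
Product, Price, Last Update
Table, 150, 2020-11-10
Chair, 57, 2019-05-05
Bed, 400, 2020-07-07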
In this article, we first discussed when we might want to remove duplicate lines from a file without sorting. Then, we addressed the compact awk one-liner solution through an example.
Further, we've shown how to adjust the awk solution to solve similar problems, for example, when the duplicate check is based on a combination of several fields instead of the whole line.