We know that the uniq command is a handy utility to remove duplicated adjacent lines from input.
However, when we handle column-based input files, for example, CSV files, we may want to remove lines with a duplicated column.
The uniq command cannot help us in this case because it removes lines only if the entire line is the same.
In this tutorial, we’ll learn how to do the “uniq” operation on a column instead of the entire line.
2. Introduction to the Problem
First of all, let’s create a CSV file price.csv as an example:
Product,Price,Update Monitor,218,2019-01-01 Keybord,20,2019-02-02 Wireless Mouse,25,2019-02-02 Keybord,22,2019-03-02 Wireless Mouse,30,2019-03-02 Keybord,18,2019-04-02 Monitor,208,2019-02-01 Keybord,25,2019-05-02 Monitor,230,2019-03-01
In the price.csv file, we have three columns, each line records when the price of each product is updated.
We would like to remove lines with duplicated products.
Therefore, we’re going to do the “uniq” operation by the product column.
Let’s look at two ways to solve the problem and compare the two approaches: sort and awk.
3. Using sort
The sort command can sort lines by a specific field and remove the duplicates from the sort result. For duplicates, only the first instance will be kept.
In this section, we’ll solve the problem using the sort command.
In the price.csv file, we notice that the first line is a header line, and we don’t want the header to participate in sorting when we sort the data.
To do that, we can use the tail command to cut the first line and pipe the rest to the sort command:
$ tail -n+2 price.csv | sort -u -t, -k1,1 Keybord,20,2019-02-02 Monitor,218,2019-01-01 Wireless Mouse,25,2019-02-02
Let’s take a closer look at the command above and understand each part of it:
- tail -n+2: take from the 2nd line till the end of the file
- sort -u: sort and remove duplicates
- -t,: define comma as the field separator
- -k1,1: sort by the first column
Very likely, we still want to keep the header line in our output.
To achieve that, we can let the head command print the first line first and then execute the command above:
$ head -n1 price.csv && tail -n+2 price.csv | sort -u -t, -k1,1 Product,Price,Update Keybord,20,2019-02-02 Monitor,218,2019-01-01 Wireless Mouse,25,2019-02-02
4. Using awk
awk is a very powerful command-line text processing tool in *nix systems.
It can address this problem succinctly:
$ awk -F, 'NR==1 || !a[$1]++' price.csv Product,Price,Update Monitor,218,2019-01-01 Keybord,20,2019-02-02 Wireless Mouse,25,2019-02-02
Compared to the solution with the sort command, the awk command is much more compact. Let’s take a closer look.
-F, sets comma as the field separator.
The NR==1 expression handles the header line: awk will print the first line in the output.
Now comes the tricky part: !a[$1]++, let’s understand how it works.
Back to our example:
- when a line with “Keyboard” comes the first time to awk, awk creates an associative array element: a[“Keyboard”] with the default value: 0
- a[“Keyboard”]++ returns the original value and then increments its value by 1, so the expression returns 0 and then we have a[“Keyboard”]=1
- !a[“Keyboard”]++ will become !0, as we’ve learned, it is evaluated as true, thus, triggers the default action: print the current line
- when a line with “Keyboard” comes again, the array element exists already, a[“Keyboard”]++ will return 1 and hold 2
- !a[“Keyboard”]++ this time will become !1, therefore we have false: do nothing.
In this way, after the first “Keyboard” gets printed, the array element a[“Keyboard”] holds a positive number. All of the following lines with “Keyboard” will make !a[“Keyboard”]++ evaluate as false. Thus, the duplicates are removed.
5. awk vs. sort
So far we’ve seen two different approaches to solve this problem. If we compare their outputs, we’ll find that they are different:
$ diff -y <(head -n1 price.csv && tail -n+2 price.csv | sort -u -t, -k1,1) <(awk -F, 'NR==1 || !a[$1]++' price.csv) Product,Price,Update Product,Price,Update Keybord,20,2019-02-02 < Monitor,218,2019-01-01 Monitor,218,2019-01-01 > Keybord,20,2019-02-02 Wireless Mouse,25,2019-02-02 Wireless Mouse,25,2019-02-02
This is because the sort command sorts the lines by the key column first, which may change the original order of the lines, but awk doesn’t.
Let’s review and compare the two solutions:
- If there is a header line in the input, awk can handle the header line much easier than the sort approach
- awk solves the problem with a single process while the sort approach needs three processes: head, tail, and sort
- awk would be faster than the sort approach because the awk solution doesn’t sort the file
- the sort approach would be the right choice if a sorted result is required
In this article, we first introduced the “uniq” by column problem. Then we’ve learned how to use the sort utility and the awk command to solve it.
Finally, we compared the two approaches so that we know how to choose the proper one depending on different requirements.