In this tutorial, we’ll explore how to use the uniq command.
The uniq command provides us with an easy way to filter text files and remove duplicate lines from a stream of data.
We can use uniq in a few ways. We can print out either unique lines or the repeated lines. Additionally, uniq can print out each distinct line with a count of how many times that line appears within a file.
An important aspect that we need to keep in mind is that uniq works with adjacent lines. This means we often need to first sort our data before uniq can work on processing the file. Luckily, in Linux, we can use the sort command to achieve that.
Let’s try using uniq on a list of countries of visitors to our web server. First, we’ll create a file called countries.txt:
$ cat << EOF > countries.txt
Germany
South Africa
Japan
USA
England
Spain
Italy
Cameroon
Japan
EOF
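Since uniq only compares adjacent lines, it's worth seeing what happens if we skip the sort step. The sketch below recreates the same sample data with printf (an equivalent alternative to the heredoc above) so it can be run on its own:

```shell
# Recreate the sample data from above; the two "Japan" lines are not adjacent
printf '%s\n' Germany 'South Africa' Japan USA England \
    Spain Italy Cameroon Japan > countries.txt

# Without sorting, uniq finds no adjacent duplicates,
# so all nine lines come back out unchanged
uniq countries.txt

# Sorting first places the duplicate lines next to each other,
# so uniq now emits only eight lines
sort countries.txt | uniq
```

This is why nearly every example that follows pipes the data through sort first.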
3. Printing Duplicate Lines
In this example, we’ll use the uniq command to print the duplicate lines in our file. Let’s sort our data and pipe it through uniq to see how this works:
$ sort countries.txt | uniq -d
Japan
Here we’ve sorted the data and piped it into uniq. The -d flag presents us with just one instance of each duplicated line. We’re presented with the output of “Japan” since that’s the only duplicate.
Now let’s take a look at a quick variation of this:
$ sort countries.txt | uniq -D
Japan
Japan
In our variation, we passed the -D flag to uniq, which prints all instances of the duplicate lines.
4. Counting Duplicate Lines
Let’s have a look at how we can get a quick and easy count of the duplicates in our data:
$ sort countries.txt | uniq -c
      1 Cameroon
      1 England
      1 Germany
      1 Italy
      2 Japan
      1 South Africa
      1 Spain
      1 USA
Using the -c flag, uniq prefixes each line with the number of times it appears in the file and prints it to the screen.
5. Removing Duplicate Lines
Now we’re going to use uniq to remove the duplicated lines entirely, leaving just those countries that occur only once in our countries.txt file.
We accomplish that with the -u flag to uniq:
$ sort countries.txt | uniq -u
Cameroon
England
Germany
Italy
South Africa
Spain
USA
As expected, “Japan” is not in the output because it occurs more than once in our file and is therefore not considered a unique record.
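It's worth contrasting -u with uniq's default behavior: with no flags at all, uniq collapses each run of repeated lines down to a single copy rather than discarding the run entirely. A quick sketch of the difference, recreating the sample data so it stands alone:

```shell
# Recreate the sample data
printf '%s\n' Germany 'South Africa' Japan USA England \
    Spain Italy Cameroon Japan > countries.txt

# Default behavior: one copy of every line is kept, so "Japan" still appears
sort countries.txt | uniq

# With -u, repeated lines are dropped entirely, so "Japan" disappears
sort countries.txt | uniq -u
```

So plain uniq answers “which countries appeared?”, while uniq -u answers “which countries appeared exactly once?”.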
6. Case Sensitivity
In the real world, our data might be more inconsistent. Let’s update our sample data file and use a mix of different cases as a test:
$ cat << EOF > countries.txt
GERMANY
South AFRICA
Japan
USA
england
Spain
ItaLY
CaMeRoon
JAPAN
EOF
Now let’s attempt to print the duplicates in this file:
$ sort countries.txt | uniq -D
Oddly, our output is blank. We know that Japan is duplicated and should be printed, so the inconsistent casing is likely the issue: uniq compares lines byte by byte, and “Japan” and “JAPAN” don’t match.
Let’s see how we can account for that in uniq using the -i flag:
$ sort countries.txt | uniq -D -i
Japan
JAPAN
We can get further confirmation by counting how many times Japan appears in the file:
$ sort countries.txt | uniq -c -i -d
      2 Japan
By using -i, we’ve asked uniq to perform a case-insensitive comparison when searching for duplicates.
7. Skipping Characters
Sometimes we might want to skip over or ignore a certain number of characters while looking for duplicate values. We can achieve this in uniq with the -s flag.
First, let’s create some sample data for this example:
$ cat << EOF > visitors.txt
Visitor from Cameroon
Visitor from England
Visitor from Germany
Visitor from Italy
Visitor from Japan
Visitor from Japan
Visitor from South Africa
Visitor from Spain
Visitor from USA
EOF
Now that we’ve created our data, we’ll pass -s the number of characters from the start of the line to skip over:
$ uniq -s 13 -c visitors.txt
      1 Visitor from Cameroon
      1 Visitor from England
      1 Visitor from Germany
      1 Visitor from Italy
      2 Visitor from Japan
      1 Visitor from South Africa
      1 Visitor from Spain
      1 Visitor from USA
In this example, we’ve used the -s flag to tell uniq to skip over the first 13 characters of each line. Doing this leaves uniq with just the country names to filter and, as expected, it’s just Japan that appears twice in our visitors.txt file.
8. First n Characters
We’re able to limit the number of characters that uniq uses for comparison when searching for duplicates.
Let’s take a look at how the -w option can be used to compare the first seven characters of each line in our visitors.txt file:
$ uniq -w 7 -c visitors.txt
      9 Visitor from Cameroon
We must take care not to misread uniq’s output when using the -w flag. Since the first seven characters of every line are identical (“Visitor”), uniq treats all nine lines as duplicates of the first one and reports a count of nine against it.
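One way to make -w more useful here is to widen the comparison so it reaches into the country name. The width of 20 below is just an illustrative choice: it covers the 13-character “Visitor from ” prefix plus the first 7 characters of each country, which is enough to tell all of our sample countries apart:

```shell
# Recreate the sample data so the snippet stands alone
printf 'Visitor from %s\n' Cameroon England Germany Italy \
    Japan Japan 'South Africa' Spain USA > visitors.txt

# Compare only the first 20 characters of each line:
# "Visitor from " (13 chars) plus the first 7 of the country name
uniq -w 20 -c visitors.txt
```

With this width, only the genuinely repeated Japan lines compare equal, matching the result we got with -s 13.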
9. Ignoring Fields
We may want uniq to ignore a certain number of fields on each line when performing duplicate searches, and this is where the -f option comes into play:
$ uniq -f 2 -D visitors.txt
Visitor from Japan
Visitor from Japan
We’ve asked uniq to ignore the first two fields on each line in this example. A field is a run of characters separated by whitespace, so effectively we’re ignoring the “Visitor from” text on each line.
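As with the other modes, -f combines freely with the flags we’ve already seen. For instance, we can skip the fields and count at the same time; this sketch reuses the same visitors.txt data:

```shell
# Recreate the sample data so the snippet stands alone
printf 'Visitor from %s\n' Cameroon England Germany Italy \
    Japan Japan 'South Africa' Spain USA > visitors.txt

# Ignore the first two fields, then count each remaining country
uniq -f 2 -c visitors.txt
```

This gives the same per-country counts we produced earlier with -s 13, but without having to know how many characters the prefix occupies.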
In this tutorial, we explored the uniq command and listed some of its common uses. We then used uniq in a few examples to highlight how it works.
As always, we can refer to the man page for more information about it.