1. Overview

Comparing text files and finding unique content between them is a common task in Linux. Software developers often examine changes between versions of source code and use the differences between files to generate kernel or application patches that transform one file version into another. Similarly, system administrators compare different versions of configuration files to check for changes.

The task of finding unique text between two files can be accomplished using several methods. In this tutorial, we’ll look at some of the most common methods we can use to do this.

2. Comparing the Content of Two Files

Suppose we have two files, file1 and file2, and we want to find lines of text that are unique to one file.

Let’s first display each file content with cat:

$ cat file1
A
B
C
D
$ cat file2
A
B
B
C
C
C

Since a file may contain duplicate lines, as file2 does, we should consider two different cases when comparing files.

The first case consists of finding unique text in one file while allowing duplicate lines in the files. In such a case, unique lines in file2 compared to file1 include one occurrence of B and two occurrences of C. This is because we subtract the number of matching instances in file1 from the number of instances in file2.
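
To make these per-line counts visible, we can pair sort with uniq -c on our sample files:

$ sort file1 | uniq -c
      1 A
      1 B
      1 C
      1 D
$ sort file2 | uniq -c
      1 A
      2 B
      3 C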

The second case involves finding unique text in one file while disallowing duplicate lines in the files. Under this condition, file2 will have no unique text compared to file1, whereas file1 will have the letter D as unique.

3. Finding Unique Text, Allowing Duplicates

Let’s consider the case of finding unique text in one file compared to another while allowing duplicate lines in the files.

The GNU tools in Linux provide two commands that are useful for comparing file content: comm and diff.

3.1. Using comm

The comm command compares two files line-by-line. It returns three columns: one showing lines unique to the first file, another showing lines unique to the second file, and a third column showing lines common to both files.

The comm command requires that both files be sorted. We can suppress any of the three output columns by passing the corresponding option: -1, -2, or -3.

Let’s use comm to find lines of text appearing only in file1:

$ sort file1 > file1_sorted
$ sort file2 > file2_sorted
$ comm -23 file1_sorted file2_sorted
D

Here, we first sorted the files as required and then used comm while suppressing the second and third columns. The result shows that file1 has only one unique line compared to file2, namely the line containing the letter D.
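
For comparison, here’s the default three-column output on the same sorted files, where lines unique to file1 start at the margin, lines unique to file2 are indented by one tab, and common lines by two tabs:

$ comm file1_sorted file2_sorted
		A
		B
	B
		C
	C
	C
D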

We may also shorten the procedure into one line:

$ comm -23 <(sort file1) <(sort file2)
D

The <() syntax denotes process substitution. It allows us to pass the output of one process as a file argument to another process, in this case, the comm command.
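
In Bash, for example, each substitution expands to a file descriptor path that the receiving command can open like a regular file (the exact descriptor number may vary):

$ echo <(sort file1)
/dev/fd/63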

Let’s now find lines unique to file2 by suppressing the first and third columns:

$ comm -13 <(sort file1) <(sort file2)
B
C
C

There are three lines unique to file2 after accounting for a matching A, B, and C in file1. Recall that, in this case, we’re allowing duplicate lines in the files and the final result.

3.2. Using diff

The diff command compares two files and outputs the lines that are unique to each file.
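
To see what we’ll be filtering, let’s first run diff on its own. The exact grouping of the change hunks can vary between diff versions and algorithms, but for our sample files the output looks similar to this:

$ diff file1 file2
2a3,5
> B
> C
> C
4d6
< D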

Let’s find lines unique to file1 using diff:

$ diff file1 file2 | grep '^<' | cut -c 3-
D

Entries marked with ‘<’ represent lines that would have to be removed from file1 so that it resembles file2. Therefore, these are lines unique to file1. We use cut -c 3- to remove the first two characters because these are added by diff and are not part of the original file.

To find lines unique to file2 instead, we only need to change the pattern used with grep to ‘^>’:

$ diff file1 file2 | grep '^>' | cut -c 3-
B
C
C

Entries that begin with ‘>’ represent lines that would have to be added to file1 so that it resembles file2. Therefore, these are lines unique to file2 which are missing in file1.

4. Finding Unique Text, Disallowing Duplicates

If we wish to find lines in one file that don’t appear at all in the other file, then we should discard duplicates. We can still use the comm and diff commands by simply adjusting their inputs.

4.1. Using comm

The only adjustment we should make to comm is to sort the input files uniquely via sort -u. This way, we discard any duplicate entries after sorting.
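
For instance, sorting file2 uniquely collapses its repeated lines:

$ sort -u file2
A
B
C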

Let’s find lines unique to file1 by suppressing the second and third columns:

$ comm -23 <(sort -u file1) <(sort -u file2)
D

Let’s now find lines unique to file2 by suppressing the first and third columns:

$ comm -13 <(sort -u file1) <(sort -u file2)

In this case, the result is empty because file2 does not contain any unique lines. All three letters A, B, and C appearing in file2 also exist in file1.

4.2. Using diff

As with comm, the only adjustment we need for diff is to sort the input files uniquely in order to avoid duplicate lines in each file.

Let’s find lines unique to file1 that don’t appear at all in file2:

$ diff <(sort -u file1) <(sort -u file2) | grep '^<' | cut -c 3-
D

Likewise, let’s find lines unique to file2 instead:

$ diff <(sort -u file1) <(sort -u file2) | grep '^>' | cut -c 3-

Here again, we see that file2 does not contain any unique lines compared to file1 when duplicates are removed.

4.3. Using grep

We can use the grep command with the -v option to print the lines of one file that aren’t present in the other.

Let’s find lines unique to file1 with grep:

$ grep -Fxvf file2 file1
D

We use the -F option to interpret the patterns as fixed strings and the -x option to match entire lines, while -f specifies the pattern file. Finally, -v inverts the match, excluding the lines found in the pattern file, which in this case is file2.
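
With GNU grep, we can write the same command using long options, which makes the role of each flag more explicit:

$ grep --fixed-strings --line-regexp --invert-match --file=file2 file1
D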

To find lines unique to file2 instead, we specify file1 as the pattern file this time:

$ grep -Fxvf file1 file2

We see that file2 has no unique lines because all of them already appear in file1.

4.4. Using awk

It’s also possible to extract unique lines between two files via GNU awk. While we may construct various awk expressions for this purpose, we’ll look at one which works even if either file is empty.

Let’s find lines unique to file1 via awk:

$ n_lines=$(wc -l < file2)
$ awk -v n="$n_lines" 'NR<=n {a[$0]++} !a[$0]' file2 file1
D

Here, we first find the number of lines in file2 and assign this value to a variable, n, that we define in awk using the -v option. As long as the number of records (NR) is less than or equal to n, we’re still reading lines from file2, so we save each line as a key of array a and increment its value. Once the number of records exceeds n, we’re reading lines from file1. If the value of a[$0] for a line is zero, we haven’t encountered that line in file2, so we print it.

Printing the whole line is the default action in awk when not explicitly stated. In other words, when !a[$0] is true, then the line is printed, and it’ll appear in the result.
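
For comparison, a more common idiom for this task relies on NR==FNR, as in the sketch below. However, when the first file is empty, NR==FNR also remains true while awk reads the second file, so every line would be stored in the array and skipped instead of printed. That’s the corner case the explicit line count avoids:

$ awk 'NR==FNR {a[$0]++; next} !a[$0]' file2 file1
D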

Similarly, let’s now find lines unique to file2 instead:

$ n_lines=$(wc -l < file1)
$ awk -v n="$n_lines" 'NR<=n {a[$0]++} !a[$0]' file1 file2

This time we switch the order of the files, and we define variable n as the number of lines in file1.

5. Conclusion

In this article, we learned how to find unique text between two files in Linux using a variety of methods, including command line tools like comm, diff, grep, and awk. The best method depends on the specific requirements and the size of the files being compared.
