1. Introduction

Finding the intersection of lines between two files is a fundamental task when managing and processing data in Linux. It lets us compare datasets efficiently, filter out irrelevant information, and validate data integrity by identifying the elements the files share, which in turn supports further analysis or manipulation of the data.

In this tutorial, we’ll explore various methods to identify identical lines that can appear anywhere within two files in Linux.

First, we’ll examine the comm command to identify intersecting lines. Then, we’ll explore the awk, join, and grep commands to accomplish similar outcomes. Next, we’ll discuss the uniq -d command used for the same objective. Additionally, we’ll explore using IFS to find intersecting lines. Lastly, we’ll look into the variance in result order and benchmarking of each command that is used to identify intersecting lines between files.

2. Dataset and Expected Results

Let’s start by inspecting the contents of the two files with the cat command:

$ cat file1.txt
34
67
89
92
102
180
blue2
3454
6678
87356
14
255

Now, let’s view the file2.txt contents:

$ cat file2.txt
23
56
67
87
2574
69
d23d
180
245
92
200

Intersecting file1.txt and file2.txt should yield the following common lines, though the order of the output can vary by method, as we’ll see later:

67
180
92

Next, let’s proceed through the tutorial and explore the various approaches to producing this result.

3. Using comm

To begin with, we can use the comm command to find the intersecting lines between two files in Linux.

The comm command compares two sorted files line by line and can report lines unique to the first file, lines unique to the second file, and lines common to both. This makes it useful for spotting similarities and differences between datasets, validating data integrity, and supporting scripting and automation tasks in Linux environments.

Now, let’s execute the command in the terminal:

$ comm -12 <(sort file1.txt) <(sort file2.txt)
180
67
92

This command displays the common lines between the two files. The process substitutions <(sort ...) sort each file on the fly, since comm requires sorted input. By default, comm prints three columns: lines unique to the first file, lines unique to the second file, and lines common to both. The -12 option suppresses the first two columns, leaving only the intersecting lines.
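
The same suppression logic extends to other column combinations. For instance, here’s a quick sketch that lists the lines unique to file1.txt by suppressing columns 2 and 3:

$ comm -23 <(sort file1.txt) <(sort file2.txt)
102
14
255
34
3454
6678
87356
89
blue2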

4. Using awk

Similarly, we can use the awk command in Linux to find the lines common to two files.

Now, let’s take a look at the code and execute it:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1.txt file2.txt
67
180
92

Here, the awk command identifies and prints the lines common to file1.txt and file2.txt.

Next, let’s break down each part of the command and its functionality:

  • NR==FNR { lines[$0] += 1; next } – This block runs only while awk reads file1.txt (when the overall record number NR equals the per-file record number FNR); it stores each line as a key in the lines array and increments its count, so duplicates are counted. next then skips the second block for these lines.
  • lines[$0] – This pattern checks whether the current line from file2.txt exists in the lines array with a nonzero count.
  • {print; lines[$0] -= 1} – When a common line is found, it’s printed and its count is decremented, so a line is printed at most as many times as it appears in file1.txt.

Thus, the awk command efficiently identifies and prints the common lines: it builds the lines array to track line occurrences in file1.txt, then checks each line of file2.txt against the array, printing matches and decrementing their counts.

Hence, this approach effectively determines intersecting data points, showcasing the shared lines between the two files as the command output.
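
Because the first block counts occurrences instead of merely recording presence, duplicate lines are handled sensibly. Here’s a quick sketch using made-up inline data fed through process substitution:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' \
    <(printf '67\n67\n180\n') <(printf '67\n67\n67\n180\n')
67
67
180

The third 67 in the second input isn’t printed because its count in the lines array has already dropped to zero.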

5. Using awk With a Loop and delete

Alternatively, we can use the awk command with a for loop and the delete statement to get similar results.

Now, let’s run the command and view the results:

$ awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1.txt file2.txt
67
180
92

This awk command finds and prints the lines common to file1.txt and file2.txt.

Let’s break down each part of the command and its functionality:

  • NR==FNR – This condition is true only while awk reads the first file, since the overall record number NR equals the per-file record number FNR only there.
  • { p[NR]=$0; next; } – This stores each line from file1.txt in the array p, using the line number NR as the key and the line content $0 as the value; next then skips the remaining rules for these lines.
  • { for(val in p) if($0==p[val]) { delete p[val]; print; } } – This part is executed for the second file file2.txt. It loops through each line and checks if it matches any line stored in array p from the first file. If there’s a match ($0==p[val]), it deletes that line from array p and prints the matching line.

Thus, this awk command compares lines between two files. It stores lines from the first file in an array and then checks for matches in the second file. If a match is found, it prints the matching line from the second file.
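
As a design note, the inner for loop scans the whole array for every line of file2.txt, which gets slow on large files. If each matching line only needs to be reported once, a minimal sketch of a variant that keys the array by the line itself avoids the loop entirely:

$ awk 'NR==FNR { p[$0]; next } $0 in p { print; delete p[$0] }' file1.txt file2.txt
67
180
92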

6. Using join

Another method to find the intersecting lines between files is the join command.

Let’s run the command in the terminal to observe its functionality and obtain the desired output:

$ join <(sort file1.txt) <(sort file2.txt)
180
67
92

The command join <(sort file1.txt) <(sort file2.txt) merges lines from file1.txt and file2.txt based on a common join field. The process substitutions sort each file first, which join requires for accurate merging. By default, join matches on the first whitespace-delimited field of each line; since every line in our files consists of a single field, this amounts to comparing whole lines.

Hence, this command helps merge and analyze data from multiple files in Linux.
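
For multi-field data, however, this first-field matching behaves quite differently from the whole-line comparison the other tools perform: join pairs up lines that merely share a key and concatenates their remaining fields. A small sketch with made-up inline data illustrates this:

$ join <(printf 'a 1\na 2\n') <(printf 'a 9\n')
a 1 9
a 2 9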

7. Using grep

We can use the grep command to find the intersecting lines between two files in a Linux system.

Now, let’s execute the command in the terminal to see how it works and generate the expected output:

$ grep -Fxf file1.txt file2.txt
67
180
92

The command grep -Fxf file1.txt file2.txt searches file2.txt for exact line matches from file1.txt. The -f file1.txt option reads the search patterns from file1.txt, -F treats those patterns as fixed strings rather than regular expressions, and -x requires each pattern to match a whole line.

Therefore, this command is helpful for tasks like data validation and extracting specific content from files in Linux.
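
Adding the -v option inverts the match and turns the same command into a set difference. For example, this lists the lines of file2.txt that don’t appear in file1.txt:

$ grep -Fxvf file1.txt file2.txt
23
56
87
2574
69
d23d
245
200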

8. Using uniq -d

Furthermore, we can employ the uniq -d command to identify the lines common to two files in a Linux system.

Next, we’ll run the command in the terminal to demonstrate its functionality and produce the desired result:

$ sort file1.txt file2.txt | uniq -d
180
67
92

The above command identifies and displays the duplicate lines across file1.txt and file2.txt. sort concatenates the two files and sorts the combined lines so that identical lines end up adjacent, and uniq -d then outputs only the lines that appear more than once. Note the caveat: if a line is duplicated within a single file, uniq -d reports it even though it isn’t common to both files, so this approach is only reliable when each file is free of internal duplicates.

This pipeline is particularly useful when working with datasets or text files, as it compares and extracts the common elements, such as duplicate entries across lists, in a single short command.
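
To see the caveat from above in action, here’s a quick sketch using two throwaway files (dup.txt and other.txt are hypothetical names):

$ printf '42\n42\n' > dup.txt
$ printf '99\n' > other.txt
$ sort dup.txt other.txt | uniq -d
42

Here, 42 is reported even though it exists only in dup.txt. Deduplicating each file with sort -u and merging the already-sorted streams with sort -m avoids the false positive:

$ sort -m <(sort -u dup.txt) <(sort -u other.txt) | uniq -d

This time the pipeline correctly prints nothing, since no line is common to both files.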

9. Using IFS

An alternative approach involves using IFS with read in nested while loops to identify the lines common to two files in a Linux environment.

IFS stands for Internal Field Separator in Linux. Therefore, we can use it as a special variable to define word boundaries or field separators in text within the shell.

Additionally, it’s used to parse text by splitting it into fields based on a specified delimiter, for instance when looping over the elements of a list. Combined with commands like sed and grep, and with shell loops, it enables efficient text processing, making IFS a versatile tool for data manipulation and automation tasks in Linux shell scripting.
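
In the script we’re about to look at, each read call uses IFS= (an empty IFS) so the shell doesn’t trim leading or trailing whitespace from the lines it reads, and -r so backslashes aren’t treated as escape characters. A minimal sketch shows the difference the empty IFS makes:

$ printf '  indented\n' | while IFS= read -r line; do echo "[$line]"; done
[  indented]
$ printf '  indented\n' | while read -r line; do echo "[$line]"; done
[indented]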

Now, let’s view the IFS_exp.sh script:

$ cat IFS_exp.sh
#!/bin/bash 
# Read each line from file1.txt 
while IFS= read -r line1; do 
    # Read each line from file2.txt 
    while IFS= read -r line2; do 
        # Compare lines from both files 
        if [[ "$line1" == "$line2" ]]; then 
            echo "$line1" 
            break # Exit the inner loop once a match is found 
        fi 
    done < file2.txt # Redirect file2.txt to the inner loop 
done < file1.txt # Redirect file1.txt to the outer loop

Next, we make the script executable and run it:

$ chmod +x IFS_exp.sh
$ ./IFS_exp.sh
67
92
180

The above Bash script compares the lines of file1.txt and file2.txt in a nested-loop fashion. The outer loop reads each line from file1.txt into the variable line1, while the inner loop reads each line from file2.txt into line2. Inside the loops, a conditional statement checks whether line1 matches line2 and, if so, echoes the common line to the terminal.

The break statement exits the inner loop as soon as a match is found, avoiding unnecessary comparisons. Even so, the nested loops compare lines pairwise, so this approach scales poorly to large files compared with the hash-based awk and grep solutions.

Hence, this script effectively identifies and prints common lines between the two files, providing a straightforward approach to data comparison tasks in a Linux environment.

10. Variance in Result Order and Benchmarking in Tools

The order of the results, and the runtime, can vary depending on the tool or technique used to find the common lines between two files. Let’s break down why each tool orders its output the way it does and how the tools compare in speed:

10.1. comm Command

When used with -12, the comm command displays the common lines in sorted order: comm requires sorted input, and the process substitutions sort both files before the comparison, so the output follows that sorted order.

Let’s view the comm command benchmark:

$ time comm -12 <(sort file1.txt) <(sort file2.txt)
180
67
92

real    0.03s
user    0.00s
sys     0.00s
cpu     11%

The output shows that the command took 0.03 seconds of real time with a CPU utilization of just 11%. No significant CPU time was spent in user or system mode, indicating efficient execution with low resource usage.

10.2. awk Command

The awk command preserves the order in which the common lines appear in file2.txt, since it prints each match while scanning that file; this is why its output (67, 180, 92) differs from the sorted output of comm.

Next, let’s take a closer look at the benchmark results for this awk command to better understand its performance:

$ time awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1.txt file2.txt
67
180
92

real    0.01s
user    0.00s
sys     0.00s
cpu     74%

The above command executed swiftly in 0.01 seconds with a high CPU utilization of 74%, indicating efficient execution that made substantial use of the CPU.

10.3. awk With a Loop and delete

The awk command with a loop and delete likewise follows the order of file2.txt, since it prints a line the moment a match is found while scanning that file; only the traversal order of the associative array (for (val in p)) is unordered, which matters solely when file1.txt contains duplicates.

Next, we’ll look at the benchmark of the awk command with a loop and delete:

$ time awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1.txt file2.txt
67
180
92

real    0.01s
user    0.00s
sys     0.00s
cpu     91%

The above command produced results in just 0.01 seconds, showing a significant CPU usage of 91%. This indicates effective performance and substantial utilization of CPU resources.

10.4. join Command

The join command likewise requires sorted input, so, as with comm, the process substitutions sort both files first and the common lines appear in sorted order.

Let’s explore the benchmarking analysis of the join command to gain insights into its performance metrics:

$ time join <(sort file1.txt) <(sort file2.txt)
180
67
92

real    0.01s
user    0.00s
sys     0.00s
cpu     46%

The time command measures the execution time and CPU utilization of the join command on the sorted files. It completed in 0.01 seconds with a CPU utilization of 46%, indicating efficient execution with moderate resource usage.

10.5. grep Command

The grep -Fxf command processes file2.txt in its original order, so the common lines are printed exactly as they appear in that file.

Let’s analyze the grep command benchmark to understand its performance and efficiency:

$ time grep -Fxf file1.txt file2.txt
67
180
92

real    0.00s
user    0.00s
sys     0.00s
cpu     83%

The above command completed almost instantly (0.00 seconds) with a high CPU utilization of 83%, indicating that grep processed the files efficiently during its brief run.

10.6. uniq Command

The sort file1.txt file2.txt | uniq -d pipeline sorts the combined lines before finding duplicates, so the common lines appear in sorted order rather than in the order of either original file.

Let’s dive into the benchmark of the sort file1.txt file2.txt | uniq -d pipeline to understand how it performs:

$ time sort file1.txt file2.txt | uniq -d
180
67
92

real    0.01s
user    0.01s
sys     0.00s
cpu     91%

real    0.03s
user    0.00s
sys     0.00s
cpu     11%

Two timing blocks appear here because the pipeline consists of two commands, sort and uniq. The first block completed in 0.01 seconds with a high CPU utilization of 91%, indicating quick, CPU-intensive processing, while the second took slightly longer at 0.03 seconds and used far less CPU at 11%.

10.7. IFS Script

The IFS script reads file1.txt sequentially in its outer loop, so the common lines are printed in the order they appear in file1.txt.

Now, let’s look at the benchmark of the IFS script to understand its performance:

$ time ./IFS_exp.sh   
67
92
180

real    0.01s
user    0.01s
sys     0.00s
cpu     89%

The above script finished in just 0.01 seconds and utilized 89% of the CPU, demonstrating efficient execution and significant utilization of CPU resources.

10.8. Summary

Below is a summarized table of the benchmark timings and CPU utilization for each tool:

| Tool/Command  | Execution Time (real) | CPU Utilization (%) |
|---------------|-----------------------|---------------------|
| comm          | 0.03s                 | 11%                 |
| awk           | 0.01s                 | 74%                 |
| awk with loop | 0.01s                 | 91%                 |
| join          | 0.01s                 | 46%                 |
| grep          | 0.00s                 | 83%                 |
| uniq          | 0.01s / 0.03s         | 91% / 11%           |
| IFS Script    | 0.01s                 | 89%                 |

This table summarizes the execution time and CPU utilization for each tool when finding the common lines between file1.txt and file2.txt, offering insight into the speed and resource usage of each approach. Keep in mind that with inputs this small, the timings sit close to measurement noise, so the rankings are only indicative; still, they can aid in choosing the most suitable tool for a given task.

11. Conclusion

In this tutorial, we explored various methods to identify identical lines that can appear anywhere within two files in Linux.

We began by examining the comm command to identify intersecting lines. After that, we explored the awk, join, and grep commands to accomplish similar outcomes. Next, we discussed the uniq -d command used for the same objective. Additionally, we explored the use of IFS to find intersecting lines. Finally, we looked into the variance in result order and benchmarking of each command used to identify intersecting lines between files.
