In this tutorial, we’ll explore how to use the join command and illustrate some examples.
The join command provides us with the ability to merge two files together using a common field in each file as the link between related lines in the files.
We can think of the Linux join command the same way we think of SQL joins when we want to join two or more tables in a relational database.
To do this, we need a common field in each table that we can use to form the relationship between them. The same holds true when we’re joining files in Linux – we need to have a way of letting join know how the lines in our files related to each other.
We can expect inaccurate results if we don’t take care to sort our input files. We must sort out input files by the common field.
Let’s first create some sample files to use in our examples. We’ll create a scenario in which we want to combine two log files using a device’s IP address as the common field.
Our first file will contain the device’s IP address and machine name:
$ cat << EOF > device_names.log 10.0.1.10 WINSHARE01 10.0.1.13 WEBSERVER03 10.0.1.15 FINSERVER02 188.8.131.52 WEBSERVER02 184.108.40.206 HRSERVER01 192.168.8.28 MYWORKSTATION 192.168.10.4 PRINTER02 EOF
For our second file, we’ll add an internet browsing policy that defines each device’s level of internet access:
$ cat << EOF > device_policies.log 10.0.1.10 RED 10.0.1.13 YELLOW 10.0.1.15 RED 220.127.116.11 YELLOW 18.104.22.168 BLUE 192.168.8.28 GREEN 192.168.10.4 RED EOF
Let’s discuss our sample data before getting into some examples. Each file starts with the IP address as the first field and we do this because by default, join expects that the first field is the common field between the files. Also, we’re using a space character as a separator between fields which again is join‘s default.
3.1. Joining Sorted Files
Since we’ve taken care to sort for both log files by IP address and use the default space delimiter, we’re able to run the simplest invocation of the join command:
$ join device_names.log device_policies.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 22.214.171.124 WEBSERVER02 YELLOW 126.96.36.199 HRSERVER01 BLUE 192.168.8.28 MYWORKSTATION GREEN 192.168.10.4 PRINTER02 RED
Our resulting output displays the IP address, device name, and internet policy for each related device per line.
3.2. Joining Unsorted Files
Let’s create a copy of device_policies.log with the order changed slightly to see how we can handle unsorted files:
$ cat << EOF > device_policies_unordered.log 10.0.1.10 RED 10.0.1.13 YELLOW 10.0.1.15 RED 188.8.131.52 YELLOW 192.168.8.28 GREEN 192.168.10.4 RED <strong>184.108.40.206 BLUE</strong> EOF
Now let’s issue the basic join command this time using our device_policies_unordered.log file instead:
$ join device_names.log device_policies_unordered.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 220.127.116.11 WEBSERVER02 YELLOW <strong>join: device_names.log:7: is not sorted: 192.168.10.4 PRINTER02 join: device_policies_unordered.log:6: is not sorted: 192.168.10.4 RED</strong> 192.168.8.28 MYWORKSTATION GREEN 192.168.10.4 PRINTER02 RED
Our output looks different now. and we see some informational messages as well. Our files are not sorted the same, and that’s why we see error messages in our output. join continues to match the lines in the files that do correspond correctly.
Fortunately, we have a way of testing our input files prior to performing a merge. We pass the –check-order flag so that join checks that our input files are correctly sorted. Let’s see what happens if we attempt joining our unmatched files again, this time using the –check-order flag:
$ join --check-order device_names.log device_policies_unordered.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 18.104.22.168 WEBSERVER02 YELLOW join: device_names.log:7: is not sorted: 192.168.10.4 PRINTER02
In this example, join started merging the records but stopped upon finding the first instance of records not correctly sorted.
There might be times, for example, when using join as a step in a script, where we’d like to ignore unmatched records and continue with joining records that are correctly sorted. The –nocheck-order flag gives us that exact behavior.
Let’s again attempt to join our mismatched files without worrying about the lines that don’t match up:
$ join --nocheck-order device_names.log device_policies_unordered.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 22.214.171.124 WEBSERVER02 YELLOW 192.168.8.28 MYWORKSTATION GREEN 192.168.10.4 PRINTER02 RED
Our output shows that by using the –nocheck-order flag with join, we’ve successfully processed the files and receive all matching lines while ignoring those that don’t without unexpected failures or error messages.
3.3. Joining Files Using a Custom Separator
We need a way to provide join with a custom separator. By passing a value into join using the -t option, we’re giving join the custom separator to use.
Let’s update our log files and change use a comma as the separator between fields:
$ cat << EOF > device_names_comma.log 10.0.1.10,WINSHARE01 10.0.1.13,WEBSERVER03 10.0.1.15,FINSERVER02 126.96.36.199,WEBSERVER02 188.8.131.52,HRSERVER01 192.168.8.28,MYWORKSTATION 192.168.10.4,PRINTER02 EOF
$ cat << EOF > device_policies_comma.log 10.0.1.10,RED 10.0.1.13,YELLOW 10.0.1.15,RED 184.108.40.206,YELLOW 220.127.116.11,BLUE 192.168.8.28,GREEN 192.168.10.4,RED EOF
Now let’s attempt to merge these two new comma-separated log files:
$ join -t , device_names_comma.log device_policies_comma.log 10.0.1.10,WINSHARE01,RED 10.0.1.13,WEBSERVER03,YELLOW 10.0.1.15,FINSERVER02,RED 18.104.22.168,WEBSERVER02,YELLOW 22.214.171.124,HRSERVER01,BLUE 192.168.8.28,MYWORKSTATION,GREEN 192.168.10.4,PRINTER02,RED
We’ve successfully merged our records using our custom delimiter.
3.4. Joining Files Using a Different Field in the First File
Our common field may not always be the first field in each file. We need to be able to join files using fields that appear in a different order. Let’s change the order of fields in our device_names.log file:
$ cat << EOF > device_names_reverse.log WINSHARE01 10.0.1.10 WEBSERVER03 10.0.1.13 FINSERVER02 10.0.1.15 WEBSERVER02 126.96.36.199 HRSERVER01 188.8.131.52 MYWORKSTATION 192.168.8.28 PRINTER02 192.168.10.4 EOF
If we attempt to join device_names_reverse.log and device_policies.log we can expect the operation to fail as the common field – IP address – is no longer the first field in both files.
Fortunately, we can override join‘s default behavior and pass in the number of the field in the first file to use as the common field:
$ join -1 2 device_names_reverse.log device_policies.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 184.108.40.206 WEBSERVER02 YELLOW 220.127.116.11 HRSERVER01 BLUE 192.168.8.28 MYWORKSTATION GREEN 192.168.10.4 PRINTER02 RED
We’re able to join files where the order of the common field is not the same in both files by using the -1 option. We simply pass the position of our common field in the first file as the value for the -1 option. The -1 option very simply refers to the first file listed in the arguments to the join command.
3.5. Joining Files Using a Different Field in the Second File
Let’s now change the order of the arguments around a bit and see how we can achieve the same behavior for the second file:
$ join -2 2 device_policies.log device_names_reverse.log 10.0.1.10 RED WINSHARE01 10.0.1.13 YELLOW WEBSERVER03 10.0.1.15 RED FINSERVER02 18.104.22.168 YELLOW WEBSERVER02 22.214.171.124 BLUE HRSERVER01 192.168.8.28 GREEN MYWORKSTATION 192.168.10.4 RED PRINTER02
We’ve changed from using the -1 option to using -2, which refers to the second file in the arguments list. We pass in the position of the common field as the value for the -2 option.
Let’s make another quick change and reverse the order of the fields in the device_policies.log file:
$ cat << EOF > device_policies_reverse.log RED 10.0.1.10 YELLOW 10.0.1.13 RED 10.0.1.15 YELLOW 126.96.36.199 BLUE 188.8.131.52 GREEN 192.168.8.28 RED 192.168.10.4 EOF
3.6. Joining Files Using a Different Field in Both Files
We’ve switched the order of the file, placing the policy group as the first field and the IP address as the second. This is the same order as our device_names_reverse.log file. So now our common field – IP address – is in the same position in both files.
Let’s look at how we can invoke join to cater for this scenario:
$ join -j 2 device_names_reverse.log device_policies_reverse.log 10.0.1.10 WINSHARE01 RED 10.0.1.13 WEBSERVER03 YELLOW 10.0.1.15 FINSERVER02 RED 184.108.40.206 WEBSERVER02 YELLOW 220.127.116.11 HRSERVER01 BLUE 192.168.8.28 MYWORKSTATION GREEN 192.168.10.4 PRINTER02 RED
We use the -j option to instruct join to use the reference the same field position for the common field in each of the respective files.
3.7. Customizing the Output
In the past examples, we’ve seen how join prints the records of the first file, followed by the corresponding records in the second file. Through the -o option, we’re able to provide join with a custom output format to use:
$ join -t , -o 1.1,2.2,1.2 device_names_comma.log device_policies_comma.log 10.0.1.10,RED,WINSHARE01 10.0.1.13,YELLOW,WEBSERVER03 10.0.1.15,RED,FINSERVER02 18.104.22.168,YELLOW,WEBSERVER02 22.214.171.124,BLUE,HRSERVER01 192.168.8.28,GREEN,MYWORKSTATION 192.168.10.4,RED,PRINTER02
By using the notation of <file_number>.<field_number> we’re able to define how we’d like the output to be presented to us. We refer to the first file in our argument list to join is as 1 and reference the second file as 2. Our field numbering starts from 1. Our chosen delimiter separates the output.
In this tutorial, we discovered the join command and by way of examples, explored some of its uses. We also looked at how to override join‘s default behaviors to give it greater usability.
As always, more information about the join command can be found over on its man page.