1. Overview

Adding new columns into a CSV file can be very useful in various scenarios, such as enriching existing datasets, validating data, and so on.

In this tutorial, we’ll explore multiple Linux utilities for adding new columns to a CSV file. Though we’ve tested each approach on Ubuntu Linux, they should work fine on most other Linux distributions without any modification.

2. Understanding the Scenario

Let’s start by looking at the sample employees.csv file:

$ cat employees.csv
id,name,age
1,John,20
2,Nathan,23
3,Mary,16
4,Peter,34
5,Amanda,57

The employees.csv file contains three columns containing the values for the id, name, and age fields. Further, we must note that the first row has the name of the fields.

Next, let’s check out the salaries.csv file that contains the salary data for the employees:

$ cat salaries.csv
id,salary
1,100000
2,130000
3,87000
4,95000
5,120000

We’ve got one common field, id, between the employees.csv and salaries.csv files.

In addition, we’ve got the departments.csv file that contains the department names:

$ cat departments.csv
department
engineering
finance
human resources
product
sales

It’s interesting to note that departments.csv doesn’t share any common field with employees.csv or salaries.csv. However, the values follow the same order as the other two files.

Lastly, our goal is to add two new columns to employees.csv, namely, salary and department, using the salaries.csv and departments.csv data sources.

3. Using awk

Awk is one of the most popular text processing tools for working with structured text files, such as CSVs. Let’s learn how to use awk to solve our use case of adding new columns to a CSV file.

For our use case, we want to add new columns from the departments.csv and salaries.csv files to the employees.csv file.

Let’s go ahead and write the add_columns_csv.awk script:

$ cat add_columns_csv.awk
BEGIN {
    FS=",";
    OFS=",";
}
{
    if (FILENAME == "salaries.csv") {
        salary[FNR]=$2
    }
    if (FILENAME == "departments.csv") {
        department[FNR]=$1
    }
    if (FILENAME == "employees.csv") {
        print $1,$2,$3,salary[FNR],department[FNR]
    }
}

Now, let’s break this down to understand the logic. In the BEGIN block, we set FS and OFS to commas as our input, and our output format is CSV. Next, we use the FILENAME and FNR variables to identify the input file and corresponding record for tracking the salary and department information in the salary and department arrays, respectively. Lastly, we use these arrays to combine with the fields ($1, $2, $3) from the employees.csv file.

Finally, let’s execute the add_columns_csv.awk script and see it in action:

$ awk -f add_columns_csv.awk salaries.csv departments.csv employees.csv
id,name,age,salary,department
1,John,20,100000,engineering
2,Nathan,23,130000,finance
3,Mary,16,87000,human resources
4,Peter,34,95000,product
5,Amanda,57,120000,sales

Perfect! We’ve got the expected output, wherein we’ve appended new columns to the employees.csv file.

4. Using cut and paste

Another approach to adding new columns to an existing CSV file is using the cut and paste utilities. While the former helps us extract columns, the latter helps us append the extracted columns.

First, let’s use the cut command with the -d and -f options to extract the second column from the salaries.csv file using comma delimiter:

$ cut -d',' -f2 salaries.csv
salary
100000
130000
87000
95000
120000

As expected, we’ve got the values from the second column containing the salary information.

Next, let’s use the paste command with the -d option to paste this content alongside the employees.csv file:

$ paste -d',' employees.csv <(cut -d',' -f2 salaries.csv)
id,name,age,salary
1,John,20,100000
2,Nathan,23,130000
3,Mary,16,87000
4,Peter,34,95000
5,Amanda,57,120000

It’s important to note that the paste command accepts multiple files as the argument. So, we had to use the process substitution syntax, <(), to treat the output from the cut command as if it were coming from a file.

Finally, let’s also append the department information from departments.csv to this output:

$ paste -d',' employees.csv <(cut -d',' -f2 salaries.csv) \
| paste -d',' - departments.csv
id,name,age,salary,department
1,John,20,100000,engineering
2,Nathan,23,130000,finance
3,Mary,16,87000,human resources
4,Peter,34,95000,product
5,Amanda,57,120000,sales

Fantastic! It looks like we’ve nailed this one. Further, it’s important to note that we used the dash () notation as the paste command needs to read the content received through the pipe from stdin rather than a file.

5. Using join and paste

Alternatively, we can use the join and paste commands to combine content from multiple data sources and add new columns to an existing CSV file.

Let’s start by using the paste command to combine the columns from the salaries.csv and departments.csv files:

$ paste -d"," salaries.csv departments.csv
id,salary,department
1,100000,engineering
2,130000,finance
3,87000,human resources
4,95000,product
5,120000,sales

We used the -d option to append the comma while joining the contents of these two files. Further, it’s important to note that the first column contains the id field. So, the output has one field in common with the employees.csv file.

Now, let’s use the join command to add the output from the paste command as new columns to the employees.csv file:

$ join -t',' -1 1 -2 1 employees.csv <(paste -d"," salaries.csv departments.csv)
id,name,age,salary,department
1,John,20,100000,engineering
2,Nathan,23,130000,finance
3,Mary,16,87000,human resources
4,Peter,34,95000,product
5,Amanda,57,120000,sales

Great! We got the expected output. Furthermore, it’s important to note that we used -1 and -2 options to specify the join field from the first and second files, respectively.

6. Using csvtool

In this section, we’ll explore an interesting utility, csvtool, that works well with CSV files.

Since csvtool doesn’t come preinstalled in Linux distributions, let’s first install the utility:

$ apt-get install csvtool

Now, let’s use the csvtool command with the paste sub-command to concatenate the columns of the employees.csv, salaries.csv, and departments.csv files:

$ csvtool paste employees.csv salaries.csv departments.csv
id,name,age,id,salary,department
1,John,20,1,100000,engineering
2,Nathan,23,2,130000,finance
3,Mary,16,3,87000,human resources
4,Peter,34,4,95000,product
5,Amanda,57,5,120000,sales

We can notice that the id column is present twice in the 1st and 4th positions.

Lastly, let’s pipe this output to csvtool, where we drop the fourth column from the output using the col sub-command:

$ csvtool paste employees.csv salaries.csv departments.csv | csvtool col 1-3,5- -
id,name,age,salary,department
1,John,20,100000,engineering
2,Nathan,23,130000,finance
3,Mary,16,87000,human resources
4,Peter,34,95000,product
5,Amanda,57,120000,sales

We got the expected output.

7. Using mlr

mlr is yet another versatile utility to work with CSV files. Since it doesn’t come preinstalled, let’s start by installing it:

$ apt-get install miller

We can use the join sub-command to join two CSV files, namely, file1.csv and file2.csv, based on the common_field:

$ mlr --csv join --ul -j <common_field> -f <file1.csv> then unsparsify <file2.csv>

We retrieve the unpaired records from the left file using the –ul option. Then, we apply the unsparsify operation to include any missing columns from file2.csv that aren’t present in file1.csv.

Now, we need a common field for all the files for the join operation using mlr. Further, employees.csv and salaries.csv have the id field, but departments.csv doesn’t. So, let’s inject this field by using the put sub-command:

$ mlr --csv put '$id = NR' departments.csv
department,id
engineering,1
finance,2
human resources,3
product,4
sales,5

It’s important to note that NR is a special variable in mlr, denoting the line number of the current record. Further, we used the ‘$id = NR’  expression to add a 1-based row value for the id field.

Finally, we can now perform a chained join operation based on the id field:

$ mlr --csv join --ul -j id -f employees.csv then unsparsify \
<(mlr --csv join --ul -j id -f salaries.csv then unsparsify \
<(mlr --csv put '$id = NR' departments.csv))
id,name,age,salary,department
1,John,20,100000,engineering
2,Nathan,23,130000,finance
3,Mary,16,87000,human resources
4,Peter,34,95000,product
5,Amanda,57,120000,sales

Excellent! We’ve solved our use case. Further, we must note that we used the <() process substitution syntax to treat the output of an intermediate command as if it were a file.

8. Conclusion

In this article, we learned how to add new columns to a CSV file. Furthermore, we explored multiple Linux utilities, such as awkcut, paste, and join, while solving the use case.

Lastly, we also solved our use case using tools, such as mlr and csvtool, which are specifically designed to work with CSV data.

Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments