1. Overview

When we work on the Linux command line, we often need to process text files. In this tutorial, we’ll look at different ways to remove the first line from an input file.

Also, we’ll discuss the performance of those approaches.

2. Introduction to the Example

Let’s first create an example text file to process. We’ll go with a CSV file for our use case since these often contain column names in the first line. If we can remove the first line from the CSV file, it can make later processing easier.

So, let’s create the file books.csv for our example:

$ cat books.csv 
ID, BOOK_TITLE, AUTHOR, PRICE($)
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99

In this tutorial, we’ll remove the first line from books.csv using three techniques:

  • Using the sed command
  • Using the awk command
  • Using the tail command

As the example shows, our books.csv contains only six lines. However, in the real world, we might face much bigger files.

Therefore, after we address the solutions, we’ll discuss the performance of the solutions and find out which is the most efficient approach to the problem.

3. Using the sed Command

sed is a common text-processing utility on the Linux command line. Removing the first line from an input file using the sed command is pretty straightforward.

Let’s see how to solve the problem with sed:

$ sed '1d' books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99

The sed command in the example above isn’t hard to understand. The parameter ‘1d’ tells the sed command to apply the ‘d’ (delete) action on line number ‘1’.

It’s worth mentioning that if we use GNU sed, we can add the -i (in-place) option to write the change back to the input file instead of printing the result to stdout:

sed -i '1d' books.csv
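
The -i option isn’t part of POSIX sed, and the BSD sed found on macOS expects a backup suffix after it. As a minimal sketch, assuming a backup copy is acceptable, we can attach the suffix directly to the option, a form that both GNU and BSD sed accept. The file name demo.csv and the .bak suffix are illustrative choices, not part of the original example:

```shell
# Create a small throwaway file (demo.csv is a hypothetical name for this sketch)
printf 'ID, TITLE\n1, First\n2, Second\n' > demo.csv

# Delete line 1 in place; the untouched original is kept as demo.csv.bak
sed -i.bak '1d' demo.csv

# Show the result
cat demo.csv
```

Once we’ve checked the result, we can delete the backup file.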

4. Using the awk Command

awk is another powerful Linux command-line text processing tool. A short awk one-liner can solve our problem:

$ awk 'NR>1' books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99

The awk command above prints the line from the input file if its line number (NR) is greater than 1.
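
The one-liner works because an awk pattern without an action falls back to the default action, which prints the current line. As a small illustration, the sketch below spells that action out; the file name awk_demo.txt is a hypothetical sample:

```shell
# Create a three-line sample file (hypothetical name)
printf 'header\nfirst\nsecond\n' > awk_demo.txt

# Same logic as 'NR>1', with the default print action written explicitly
awk 'NR > 1 { print }' awk_demo.txt
```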

Since version 4.1.0, GNU awk supports the “inplace” extension to emulate the -i (in-place) option of GNU sed:

gawk -i inplace 'NR>1' books.csv

If our awk implementation doesn’t ship with the “in-place” feature, we can always do the “in-place” change using a temp file:

awk 'NR>1' books.csv > tmp.csv && mv tmp.csv books.csv

5. Using the tail Command

Usually, we use the “tail -n x file” command to get the last x lines from an input file. If we prepend a “+” sign to the “x”, the “tail -n +x file” command will print from the xth line to the end of the file.

Therefore, we can convert our “removing the first line from a file” problem into “get the second line until the end of the file”:

$ tail -n +2 books.csv
1, A Knock at Midnight, Brittany K. Barnett, 13.99
2, Migrations: A Novel, Charlotte McConaghy, 13.99
3, Winter Counts, David Heska, 27.99
4, The Hour of Fate, Susan Berfield, 30.00
5, The Moon and Sixpence, W. Somerset Maugham, 6.99
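
To make the effect of the “+” sign concrete, here’s a small side-by-side sketch of the two forms of the -n option. The file tail_demo.txt is a hypothetical four-line sample:

```shell
# Create a four-line sample file (hypothetical name)
printf 'one\ntwo\nthree\nfour\n' > tail_demo.txt

# Without "+": print the last 2 lines (three, four)
tail -n 2 tail_demo.txt

# With "+": print everything from line 2 onward (two, three, four)
tail -n +2 tail_demo.txt
```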

Similarly, we can write the change back to the input file through a temp file:

tail -n +2 books.csv > tmp.csv && mv tmp.csv books.csv
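
A fixed name such as tmp.csv can clash with an existing file. As a slightly more defensive sketch, assuming the mktemp utility is available, we can let mktemp pick a unique name and replace the original only when tail succeeds. The file name demo2.csv is a hypothetical sample:

```shell
# Sample input (demo2.csv is a hypothetical file name)
printf 'ID, TITLE\n1, First\n2, Second\n' > demo2.csv

# Create a unique temp file next to the input file
tmp=$(mktemp ./demo2.csv.XXXXXX)

# Replace the original only if tail succeeded; otherwise clean up
if tail -n +2 demo2.csv > "$tmp"; then
    mv "$tmp" demo2.csv
else
    rm -f "$tmp"
fi
```

Since mv swaps in a new file, the result carries the temp file’s permissions rather than those of the original, so we may need to restore them with chmod afterward.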

6. Performance

Our books.csv has only six lines, so all the commands we’ve seen finish almost instantly.

However, in the real world, we usually need to process bigger files. Let’s discuss the performance of our approaches and find the most efficient solution to the problem.

First of all, we’ll create a big input file with 100 million lines:

$ wc -l big.txt 
100000000 big.txt
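
The article doesn’t say how big.txt was produced. One possible way, assuming the seq utility is available, is to write one number per line. The sketch below uses a smaller, hypothetical line count and file name so that it runs quickly; setting lines to 100000000 would give a file of the size used here:

```shell
# Number of lines to generate (assumption: reduced from 100000000 for a quick run)
lines=1000000

# seq prints the integers from 1 to $lines, one per line
seq 1 "$lines" > big_demo.txt

# Confirm the line count
wc -l < big_demo.txt
```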

Then, we’ll test each solution on our big input file to remove the first line.

To benchmark their performance, we’ll use the time command:

  • The sed solution: time sed '1d' big.txt > /dev/null
  • The awk solution: time awk 'NR>1' big.txt > /dev/null
  • The tail solution: time tail -n +2 big.txt > /dev/null
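
Before comparing timings, it can be worth confirming that the three commands really produce identical output. The following is a minimal sketch of such a check using the POSIX cksum utility on a small, hypothetical file named sample.txt; the same comparison could be pointed at the big input file:

```shell
# Small sample input (hypothetical file name)
printf 'header\nline 1\nline 2\n' > sample.txt

# Checksum the output of each solution
sed_sum=$(sed '1d' sample.txt | cksum)
awk_sum=$(awk 'NR>1' sample.txt | cksum)
tail_sum=$(tail -n +2 sample.txt | cksum)

# Report whether all three outputs are byte-for-byte identical
if [ "$sed_sum" = "$awk_sum" ] && [ "$awk_sum" = "$tail_sum" ]; then
    echo "outputs match"
else
    echo "outputs differ"
fi
```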

Now, let’s have a look at the result:

Solution            real         user         sys
The sed solution    0m6.630s     0m6.053s     0m0.559s
The awk solution    0m15.799s    0m15.282s    0m0.499s
The tail solution   0m0.582s     0m0.097s     0m0.474s

As the table shows, the tail command is the most efficient solution to the problem. Going by the real time, it’s about 11 times faster than the sed command and approximately 27 times faster than the awk command.

This is because the tail command only scans the input for newline characters until it reaches the target line and then copies the rest of the file to the output unchanged. Thus, it never needs to pre-process or hold the text of each line.

On the other hand, the sed and awk commands read and pre-process every line of the input file. For example, the awk command splits the input into records and fields according to the given RS and FS and sets internal variables such as NF. This adds a lot of overhead that our problem doesn’t need.

Although the sed and awk solutions are much slower than the tail solution for our problem, it’s still worthwhile to learn sed and awk because they’re much more powerful and extensible than the tail command.

7. Conclusion

In this article, we’ve addressed different ways to remove the first line from an input file. After that, we discussed the performance of the solutions.

If we need to solve this problem on a large input file, the tail solution will give us the best performance.
