1. Overview

When we work with a large file, we sometimes need to break it into parts and process them separately. This is known as “splitting a file”.

The convenient split command can help us to split a file in most cases. However, in this tutorial, we are going to discuss a particular file splitting scenario: how to split a file at given line numbers.

2. Introduction to the Problem

When we split a file using the split command, we can split the file by size or the number of lines. However, sometimes we want to split a file at given line numbers.

An example file will help us to understand the problem quickly. Let’s say we have a text file called input.txt:

$ cat input.txt
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.
13 is my line number.
14 is my line number.
15 is my line number.

The file has 15 lines. Now, let’s split the file at three line numbers: 4, 7, and 12. That is, after the splitting, we’ll get four files:

  • file1 will contain lines 1-4 of input.txt (4 lines)
  • file2 contains lines 5-7 of input.txt (3 lines)
  • file3 holds lines 8-12 of input.txt (5 lines)
  • file4 has lines 13-15 of input.txt (3 lines)
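The mapping from split points to line ranges can be sketched as a small helper. The function name print_ranges and its argument order are assumptions made for illustration only; it just does the arithmetic and doesn’t touch any file:

```bash
#!/bin/bash
# print_ranges TOTAL N1 N2 ...: show which lines each output file receives
# when a TOTAL-line file is split after lines N1, N2, ...
print_ranges() {
    local total=$1 start=1 idx=1 n
    shift
    for n in "$@"; do
        echo "file$idx: lines $start-$n ($(( n - start + 1 )) lines)"
        start=$(( n + 1 ))   # the next chunk begins right after the split point
        idx=$(( idx + 1 ))
    done
    # Whatever remains after the last split point goes into the final file
    echo "file$idx: lines $start-$total ($(( total - start + 1 )) lines)"
}

print_ranges 15 4 7 12
```

For our example, this prints the same four ranges as the list above. The "next start = split point + 1" step is the bookkeeping the scripts in the following sections rely on.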

As the resulting files contain different numbers of lines, we can’t solve the problem with the split command’s fixed-size chunking (by size or by number of lines).

We’ll address solutions to the problem using three approaches:

  • A shell script using the head and tail commands
  • A shell script based on the sed command
  • Using the awk command

Usually, when we need to split a file into chunks, we are very likely facing a large file. Therefore, the performance of the solutions does matter.

We’ll discuss the performance of the solutions and find out which is the most efficient approach.

3. Using the head and tail Commands

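Using the head and tail commands together with their -n options, we can extract a range of lines from an input file: tail -n +K prints everything from line K onward, and head -n N keeps only the first N lines of that output.

Let’s extract lines 3-7 from input.txt: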

$ tail -n +3 input.txt | head -n $(( 7-3+1 ))
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.
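The same pipeline can be wrapped in a reusable function. This is only a sketch: the name extract_lines and the seq-generated sample file are assumptions for illustration, not part of the original scripts:

```bash
#!/bin/bash
# extract_lines START END FILE: print lines START..END of FILE (inclusive)
extract_lines() {
    local start=$1 end=$2 file=$3
    # tail skips to line START; head keeps END-START+1 lines from there
    tail -n +"$start" "$file" | head -n "$(( end - start + 1 ))"
}

# Example: lines 3-7 of a 15-line sample file (one number per line)
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
seq 15 > "$sample"
extract_lines 3 7 "$sample"
```

Quoting the arguments keeps the function working for file names that contain spaces.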

Therefore, we can create a shell script to wrap the tail | head command to split the file at given line numbers:

$ cat head_and_tail.sh
#!/bin/bash
INPUT_FILE="input.txt"  # The input file
LINE_NUMBERS=( 4 7 12 ) # The given line numbers (array)
START=1                 # The first line number of the current chunk
IDX=1                   # The index used in the name of generated files: file1, file2 ...

for i in "${LINE_NUMBERS[@]}"
do
    # Extract the lines using the head and tail commands
    tail -n +$START "$INPUT_FILE" | head -n $(( i-START+1 )) > "file$IDX.txt"
    (( IDX++ ))
    START=$(( i+1 ))
done
# Extract the remaining lines: from the line after the last given number to the end of the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"

Now, let’s run the script and check if it can split the input.txt into expected chunks:

$ ./head_and_tail.sh
$ head file*
==> file1.txt <==
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.

==> file2.txt <==
05 is my line number.
06 is my line number.
07 is my line number.

==> file3.txt <==
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.

==> file4.txt <==
13 is my line number.
14 is my line number.
15 is my line number.

As the output above shows, our problem gets solved.
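As an extra sanity check, we can confirm that the chunks, concatenated in order, reproduce the original file. The sketch below is self-contained: it regenerates the sample input in a temporary directory (an assumption made so it can run anywhere) and repeats the tail | head logic before comparing with cmp:

```bash
#!/bin/bash
# Work in a scratch directory so no existing files are touched
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%02d is my line number.\n' $(seq 15) > input.txt

# Same splitting logic as head_and_tail.sh, with the split points inlined
START=1; IDX=1
for i in 4 7 12; do
    tail -n +"$START" input.txt | head -n "$(( i - START + 1 ))" > "file$IDX.txt"
    IDX=$(( IDX + 1 ))
    START=$(( i + 1 ))
done
tail -n +"$START" input.txt > "file$IDX.txt"

# Concatenate the chunks in order and compare byte-for-byte with the input
if cat file1.txt file2.txt file3.txt file4.txt | cmp -s - input.txt; then
    echo "OK: the chunks reassemble to input.txt"
else
    echo "MISMATCH: the chunks differ from input.txt" >&2
fi
```

The files are listed explicitly rather than with file*.txt because a glob sorts names lexically, which would put file10.txt before file2.txt once there are ten or more chunks.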

4. Using the sed Command

The sed command supports address ranges defined by two line numbers.

For example, we can write a short sed one-liner to extract lines 3-7 from input.txt:

$ sed -n '3,7p; 8q' input.txt 
03 is my line number.
04 is my line number.
05 is my line number.
06 is my line number.
07 is my line number.

In the command above, “8q” tells sed to quit as soon as it reaches line 8. This way, sed doesn’t read the rest of the file after printing line 7, which improves performance.
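With shell variables, the same idea can be sketched as a function. Note that this is a variant rather than the exact command above: it quits on the end line itself (for example, “7q” instead of “8q”), which works because with -n the q command exits without printing anything further. The name sed_extract and the seq-generated sample are assumptions for illustration:

```bash
#!/bin/bash
# sed_extract START END FILE: print lines START..END, then stop reading
sed_extract() {
    local start=$1 end=$2 file=$3
    # On line END, the range's p prints the line first, then q quits
    sed -n "${start},${end}p; ${end}q" "$file"
}

sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
seq 15 > "$sample"
sed_extract 3 7 "$sample"
```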

As we can see, extracting lines using the sed command’s address range is more straightforward than the head and tail combination. Therefore, to solve our problem, we just need to calculate the boundaries of each address range and pass them to the sed command:

$ cat using_sed.sh  
#!/bin/bash
INPUT_FILE="input.txt"  # The input file
LINE_NUMBERS=( 4 7 12 ) # The given line numbers (array)
START=1                 # The start line number
IDX=1                   # The index used in the name of generated files: file1, file2 ...

for i in "${LINE_NUMBERS[@]}"
do
    # Extract the lines using sed command
    NEXT_LINE=$(( i+1 ))
    sed -n "$START, $i p; $NEXT_LINE q" "$INPUT_FILE" > "file$IDX.txt"
    (( IDX++ ))
    START=$NEXT_LINE
done

# Extract the remaining lines: from the line after the last given number to the end of the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"

Now, let’s run the script and check the files it created:

$ ./using_sed.sh
$ head file*    
==> file1.txt <==
01 is my line number.
02 is my line number.
03 is my line number.
04 is my line number.

==> file2.txt <==
05 is my line number.
06 is my line number.
07 is my line number.

==> file3.txt <==
08 is my line number.
09 is my line number.
10 is my line number.
11 is my line number.
12 is my line number.

==> file4.txt <==
13 is my line number.
14 is my line number.
15 is my line number.

Great! The problem has been solved.
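As an aside, sed’s w command can write each address range to its own file in a single pass. This is a different technique from the loop in using_sed.sh, it isn’t part of the benchmarks in this article, and the ranges are hard-coded here purely for illustration:

```bash
#!/bin/bash
# Work in a scratch directory with a 15-line sample (one number per line)
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 15 > input.txt

# One sed process, one pass: every range is written to its own file.
# The file name after w extends to the end of each -e expression.
sed -n \
    -e '1,4w file1.txt' \
    -e '5,7w file2.txt' \
    -e '8,12w file3.txt' \
    -e '13,$w file4.txt' \
    input.txt

head file*.txt
```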

5. Using the awk Command

Since awk itself supports arrays, loops, redirection, and many other features, we don’t need to wrap the awk command in a shell script to solve the problem.

We could even solve the problem using an awk one-liner. However, we’ll break it into multiple indented lines of code so that it’s easier to understand:

awk -v nums="4 7 12" '
    BEGIN {        
        c=split(nums,b)
        for(i=1; i<=c; i++) a[b[i]]
        j=1; out = "file1.txt"
    } 
    { print > out }
    NR in a {
        close(out)
        out = "file" ++j ".txt"
    }' input.txt

If we run the above awk command, we’ll get the four files with expected data in each.

Now, let’s understand how it works:

  • -v nums="4 7 12": We assign the given line numbers to a variable nums
  • BEGIN { … }: The code in the BEGIN block runs only once, before the first line of the input file is read
    • c=split(nums,b): Using the split() function, we split the three numbers into an array (b[]), and the variable c holds the length of the array (3)
    • for(i=1; i<=c; i++) a[b[i]]: We create another associative array a[] holding the elements of b[] as keys. For example: b[1]=4 -> a[4], b[2]=7 -> a[7], and so on
    • j=1; out = "file1.txt": Here we initialize a variable (out) to contain the filename of the output file and a variable (j) to hold the index of each output file
  • { print > out }: We print the current line to the output file
  • NR in a { close(out); out = "file" ++j ".txt" }: If the current line number exists in the associative array a[], we close the current output file and increment the index in the filename
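To see the "keys only" array and the NR in a test in isolation, here’s a minimal sketch that writes no files and only reports where each output file would end. The wrapper function show_boundaries is an assumption for illustration:

```bash
#!/bin/bash
# show_boundaries: read lines on stdin and report the split points
show_boundaries() {
    awk -v nums="4 7 12" '
        BEGIN {
            c = split(nums, b)                 # b[1]=4, b[2]=7, b[3]=12; c=3
            for (i = 1; i <= c; i++) a[b[i]]   # merely referencing a[key] creates the key
        }
        # True only when the current line number is one of the keys of a[]
        NR in a { print "line " NR " ends file" (++j) ".txt" }'
}

seq 15 | show_boundaries
```

The membership test doesn’t care about the values stored in a[], which is why the BEGIN block never assigns any.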

6. Performance

So far, we’ve learned three different ways to solve the problem. Now it’s time to discuss their performance.

Before we benchmark the scripts, let’s review the three approaches and estimate the result.

Let’s say we need to split an input file into n chunks:

  • head_and_tail.sh – requires 2n-1 processes (n-1 tail | head pipelines plus a final tail) and processes the input file n times
  • using_sed.sh – starts n processes (sed) and processes the input n times
  • awk command – creates a single process (awk) and processes the input only once

Based on the analysis above, the awk solution seems to have the lowest cost and should perform best. Conversely, head_and_tail.sh should be the slowest.

Next, let’s verify if our estimation is correct.

6.1. Creating a Big Input File

Our input.txt has only 15 lines and isn’t suitable for performance testing. Let’s create a big.txt input file with 100 million lines:

$ seq 100000000 > big.txt
$ du -h big.txt
848M	big.txt

$ wc -l big.txt
100000000 big.txt

We’ll use big.txt as our input file for performance benchmarking.

6.2. Benchmarking the Performance

We’ll use the time command to benchmark each script or command.

Before we start the testing:

  • We’ve changed the INPUT_FILE variable to point to big.txt
  • Since our input file now has 100 million lines, we’ve changed the LINE_NUMBERS array to ( 400000 50000000 70000000 )
  • We run all tests in the /tmp directory, which is on a tmpfs filesystem, to avoid any influence from filesystem caching
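The repeated "clean up, then time" steps could be wrapped in a small helper. This is only a sketch: the name bench is hypothetical, and it assumes the scripts and the generated file*.txt outputs live in the current directory:

```bash
#!/bin/bash
# bench LABEL CMD [ARGS...]: remove earlier output files, then time one run
bench() {
    local label=$1
    shift
    rm -f file*.txt      # start every run from a clean state (current directory only)
    echo "== $label =="
    time -p "$@"         # -p prints POSIX-style "real", "user", and "sys" lines
}

# Usage, assuming the scripts from the previous sections are executable:
#   bench head_and_tail ./head_and_tail.sh
#   bench sed           ./using_sed.sh
```

The timing lines go to standard error, so they need a redirection such as 2>&1 if we want to capture them in a file.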

First, let’s test our head_and_tail.sh script:

$ time ./head_and_tail.sh 
real 1.40
user 1.11
sys 1.00

Second, we’ll see how fast the using_sed.sh script will run:

$ rm file* ; time ./using_sed.sh
real 10.80
user 10.08
sys 0.68

Finally, let’s test the awk script:

$ rm file* ; time awk -v nums="400000 50000000 70000000" ' .... '  big.txt
real 18.73
user 18.33
sys 0.38

6.3. Understanding the Result

The result is surprising!

Even though the head_and_tail.sh script starts seven processes (three tail | head pipelines plus a final tail) and reads the big input file four times, it’s the fastest solution.

However, the awk command, which we thought might be the fastest solution, was about 13 times slower than the head_and_tail.sh script and the slowest of the three approaches.

The sed solution sits in between, but it’s still approximately eight times slower than the head_and_tail.sh.
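The slowdown factors can be recomputed from the "real" times reported in the benchmark output above. The wrapper function ratios is just an illustrative name:

```bash
#!/bin/bash
# ratios: slowdown of each solution relative to head_and_tail.sh ("real" times)
ratios() {
    awk 'BEGIN {
        base = 1.40                          # head_and_tail.sh
        printf "sed: %.1fx\n", 10.80 / base  # using_sed.sh
        printf "awk: %.1fx\n", 18.73 / base  # awk command
    }'
}

ratios
```

This prints 7.7x for sed and 13.4x for awk.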

Now, the question comes up: Why is the awk command, which reads the input file only once, so much slower than head_and_tail.sh?

This is because the awk command processes every line it reads: it initializes internal attributes such as the records, the fields, and NF according to the given RS and FS. Then, it checks each line against our awk script to see if it should do something with the text. In our case, we do nothing but redirect the line to a file, so awk then writes the text to the file. All of this adds overhead that the problem at hand doesn’t need.

On the other hand, the head and tail commands only need to count newline characters. They don’t parse or hold the text of each line. They scan until they reach the target line number, and then they simply dump the contents to the output.

The sed command reads and holds every line of the input file as well. Therefore, it’s much slower than the head_and_tail solution, too.

However, the sed command does less initialization than the awk command, so the sed script is faster than the awk solution.

Moreover, the sed command’s “q” command lets it stop reading early, which improves its performance. If we remove “$NEXT_LINE q” from the using_sed.sh script and test again, it’ll be slower:

$ time ./using_sed_without_q.sh
real 15.69
user 14.69
sys 0.99

7. Conclusion

In this article, we’ve addressed three different ways to split a file at given line numbers.

The solution based on the sed command is the most straightforward.

However, if we have to work on some large files, the head and tail solution will give us the best performance.

The awk command is a very powerful text-processing utility that can solve the problem in a one-liner. However, it is the slowest solution among the three approaches.
