How to Split a Text File Based on a Regular Expression

1. Overview

Splitting the contents of a text file is a common text-processing operation.

In this tutorial, we’ll learn how to split a text file based on a regular expression with the help of a few popular text-processing utilities in Linux.

2. Sample Text File

Let’s take a sample text file named tables.txt containing multiplication tables of a few numbers:

$ cat tables.txt
Multiplication Table of n=2
2x1=2
2x2=4

Multiplication Table of n=3
3x1=3
3x2=6

Multiplication Table of n=4
4x1=4
4x2=8

For simplicity, we’ve included only two rows for each number. We’ll use this file as input and split it into multiple files, each containing the multiplication table of a single number.

3. Split Using awk

awk is a robust text-processing utility that offers programming constructs such as loops and if-else blocks, giving us a great deal of control while working with text files. In this section, we’ll see how we can split a file with awk, first based on a regex for the delimiter and then based on a regex for the text block itself.

3.1. Regex for Delimiter

When we can identify a regular expression for a delimiter that separates the individual text blocks, we can specify the record separator (RS) to split the file into individual records directly. In this case, we can notice that an empty line separates two consecutive multiplication tables.

Let’s create the split-using-delimiter.awk script and add the BEGIN block for initializing the record separator (RS) and a counter to keep track of the parts:

BEGIN {
    RS="\n\n"
    part=0
}

Next, we’ll add the main block, where we write the contents of each part to a separate file:

{
    file="out-" part ".txt"
    print $0 > file
    part=part+1
}

Finally, let’s execute the split-using-delimiter.awk script to split the file and verify the output:

$ awk -f split-using-delimiter.awk tables.txt

$ find . -type f -name "out*" -print -exec cat {} \;
./out-2.txt
Multiplication Table of n=4
4x1=4
4x2=8
./out-1.txt
Multiplication Table of n=3
3x1=3
3x2=6
./out-0.txt
Multiplication Table of n=2
2x1=2
2x2=4

It looks like we’ve got this right!
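
Treating a multi-character RS such as "\n\n" as a regex is an extension that gawk and mawk provide; POSIX leaves the behavior unspecified. As a more portable sketch, we could rely on awk’s paragraph mode instead, where an empty RS makes blank lines separate the records. The output names below reuse the out-N.txt scheme from our script, and the first lines only recreate the sample file so that the snippet runs on its own:

```shell
# Recreate the sample input so the snippet is self-contained
cat > tables.txt <<'EOF'
Multiplication Table of n=2
2x1=2
2x2=4

Multiplication Table of n=3
3x1=3
3x2=6

Multiplication Table of n=4
4x1=4
4x2=8

EOF

# RS="" switches awk to paragraph mode: one or more blank lines end a record
awk 'BEGIN { RS = "" }
{
    file = "out-" (NR - 1) ".txt"   # NR starts at 1, so subtract 1 to begin at out-0.txt
    print $0 > file
    close(file)                      # release the descriptor once the part is written
}' tables.txt
```

Unlike RS="\n\n", paragraph mode also tolerates several consecutive blank lines between two tables.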

3.2. Regex for Text Block

When we can identify each text block using a regex, we can consider reading the input until the regex pattern matches the content read so far. Looking at the content, we can figure out the regex for the text block:

Multiplication Table of n=.*\n([0-9]*x[0-9]*=[0-9]*\n){2}

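We can notice that the regex starts with the title of the multiplication table. Further, we’ve matched the numbers with the [0-9]* pattern and used {2} as a repetition quantifier to specify that there are exactly two lines of equations in any multiplication table. Strictly speaking, [0-9]* also matches an empty string, so [0-9]+ would be the tighter choice if we wanted to require at least one digit.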

Now, let’s create the split-using-regex-text-block.awk script and add the main block, which initializes a few variables and starts a reading loop:

{
    content=$0
    part=0
    pending=1

    while(pending) {
        # logic for splitting
    }
}

We store individual parts in the content variable and use the part variable to generate different output filenames for each part. Further, we use the pending variable to terminate the reading loop.

Next, let’s add the logic of reading a line into the tmp variable using the getline function and populating the content variable:

if ((getline tmp) > 0) {
    if(length(content)>0) {
        content=content "\n" tmp
    } else {
        content=tmp
    }
} else {
    pending=0
}

We intend to reset the content variable after each pattern match. So, we’ve added a conditional block that appends tmp to content on a new line if content is non-empty; otherwise, it sets content to tmp. Additionally, we set the pending variable to 0 (false) when the getline function can’t read any further.

Moving on, let’s add the logic to match the content against the text block and write it to individual files:

if( content ~ /Multiplication Table of n=.*\n([0-9]*x[0-9]*=[0-9]*\n){2}/) {
    file_name="out-" part ".txt"
    print content > file_name
    content=""
    part=part+1
}
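
Putting the fragments together, the complete script might look like the following sketch. It differs from the fragments above in one respect, which is our own assumption: the {2} interval is written out as two repeated line patterns, because some awk builds (older mawk releases, for instance) may not handle interval expressions. With gawk, the original {2} form should work as well:

```shell
# Write the assembled script to disk
cat > split-using-regex-text-block.awk <<'EOF'
{
    content = $0
    part = 0
    pending = 1

    while (pending) {
        # Read the next line and append it to the text collected so far
        if ((getline tmp) > 0) {
            if (length(content) > 0) {
                content = content "\n" tmp
            } else {
                content = tmp
            }
        } else {
            pending = 0
        }

        # Once the collected text forms a full block, write it out and reset
        # (the two equation lines are spelled out instead of using {2})
        if (content ~ /Multiplication Table of n=.*\n[0-9]*x[0-9]*=[0-9]*\n[0-9]*x[0-9]*=[0-9]*\n/) {
            file_name = "out-" part ".txt"
            print content > file_name
            content = ""
            part = part + 1
        }
    }
}
EOF
```

The main block runs only once, for the first line of the input; the getline loop then consumes the rest of the file.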

Finally, let’s see our script in action:

$ awk -f split-using-regex-text-block.awk tables.txt

$ find . -type f -name "out*" -print -exec cat {} \;
./out-2.txt
Multiplication Table of n=4
4x1=4
4x2=8

./out-1.txt
Multiplication Table of n=3
3x1=3
3x2=6

./out-0.txt
Multiplication Table of n=2
2x1=2
2x2=4

We’ve successfully split the file into separate parts based on the regex for the text block.

3.3. A Word About the “Too many open files” Problem

Let’s revisit the key steps of our awk solutions:

Creating a regex to match the content for a single file
Constructing a new filename, such as out-1.txt, out-2.txt, and so on
Using the print statement to redirect the content to the file: print content > newFilename

A split script that follows these steps works in most cases. However, it may fail when we use it to split a huge file into many parts, with the error message “Too many open files”.

This is because every time we perform “print something > newFileName” with a new filename, the awk process opens a file descriptor, and most awk implementations won’t automatically close it until the awk process is finished.

There’s a limit to how many files a process – awk, in our case – can open. This is decided by the system settings and the awk implementation. Once the number of files opened by awk exceeds the limit, the execution is aborted, and we see the mentioned error message.
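
On Linux, the per-process limit on open file descriptors is usually visible through the shell’s ulimit builtin. As a quick check, assuming a shell such as bash, we can print the soft limit, which a newly started awk process would inherit, along with the hard limit that caps it:

```shell
# Soft limit on open file descriptors for processes started from this shell
ulimit -n

# Hard limit, the ceiling up to which the soft limit may be raised
ulimit -Hn
```

The values differ between systems, so we should treat any number we see here as specific to our own machine.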

Therefore, to make our script robust, we should close the files that we don’t need anymore. awk provides the close(filename) function for that. So, for example, we can add the close() function call after the print statement:

...
{
  file="out-" part ".txt"
  print $0 > file
  close(file) # <-- after writing content to the file, close it
  part++
}
...
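
To convince ourselves that the close() call scales, we could try a rough experiment like the one below. The input file big-tables.txt, the count of 3000 tables, and the scratch directory are all made up for this illustration. Whether a version without close() would actually fail depends on the awk implementation and on the limit in effect, so the sketch only exercises the variant that closes each file:

```shell
# Work in a scratch directory so the many output files are easy to remove
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate 3000 blank-line-separated tables (hypothetical test data)
awk 'BEGIN {
    for (n = 1; n <= 3000; n++)
        printf "Multiplication Table of n=%d\n%dx1=%d\n%dx2=%d\n\n", n, n, n, n, 2 * n
}' > big-tables.txt

# RS="" is awk paragraph mode: blank lines separate the records.
# Each output file is closed right after it is written.
awk 'BEGIN { RS = "" }
{
    file = "out-" (NR - 1) ".txt"
    print $0 > file
    close(file)
}' big-tables.txt

# Count the generated parts
ls out-*.txt | wc -l
```

If everything went well, the final command should report 3000 files, one per table.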

4. Split Using csplit

In this section, we’ll explore the csplit utility of the GNU coreutils package for breaking the sample input file into sections. Unfortunately, csplit matches its patterns against one line at a time, so a regex can’t span multiple lines, and we can’t use the “\n” character as part of the regex. Let’s keep this limitation in mind for our use case.

The standard use case for the csplit utility is splitting text files that contain headers or titles, using those lines as delimiters. Let’s split our input file using the regex for the header:

$ csplit tables.txt '/^Multiplication Table of n=[0-9]*/' '{2}' -f outfile
0
41
41
41

We must note that the ‘{2}’ argument repeats the preceding pattern two more times, so the pattern is applied three times in total, once for each header in our file. Moreover, the command printed the size of each of the four output files: one file has zero bytes, and the remaining three have a size of 41 bytes. The first file is empty because there’s no content before the first occurrence of the delimiter.

Next, let’s verify the contents of the output files with the filename prefix of outfile as specified with the -f option of the csplit command earlier:

$ find . -type f -name "outfile*" -print -exec cat {} \;
./outfile01
Multiplication Table of n=2
2x1=2
2x2=4

./outfile00
./outfile02
Multiplication Table of n=3
3x1=3
3x2=6

./outfile03
Multiplication Table of n=4
4x1=4
4x2=8

As expected, outfile00 is empty, and the rest of the files have multiplication tables of individual numbers.
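
When we don’t know in advance how many headers the file contains, or we’d rather not keep the empty first piece, GNU csplit offers two documented helpers: the ‘{*}’ repeat count, which repeats the pattern as many times as possible, and the -z (--elide-empty-files) option. A minimal sketch, assuming GNU coreutils and the same sample file, which the first lines recreate so that the snippet runs on its own:

```shell
# Recreate the sample input so the snippet is self-contained
cat > tables.txt <<'EOF'
Multiplication Table of n=2
2x1=2
2x2=4

Multiplication Table of n=3
3x1=3
3x2=6

Multiplication Table of n=4
4x1=4
4x2=8

EOF

# -z drops zero-length pieces; '{*}' repeats the pattern until the input ends
csplit -z -f outfile tables.txt '/^Multiplication Table of n=[0-9]*/' '{*}'

# List the generated pieces
ls outfile*
```

With -z, the output files are still numbered consecutively from outfile00, so we’d expect three files here, one per table.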

5. Conclusion

In this article, we learned how to split a text file based on a regex using the awk and csplit utilities. We also saw that awk gives us more control for advanced use cases with its programming constructs. However, we can write one-liners using the csplit command for straightforward use cases.
