## 1. Overview

Splitting the contents of a text file is a common text-processing operation.

In this tutorial, we’ll learn how to **split a text file based on a regular expression with the help of a few popular text-processing utilities** in Linux.

## 2. Sample Text File

Let’s take a **sample text file named tables.txt containing multiplication tables** of a few numbers:

```
$ cat tables.txt
Multiplication Table of n=2
2x1=2
2x2=4

Multiplication Table of n=3
3x1=3
3x2=6

Multiplication Table of n=4
4x1=4
4x2=8

```

For simplicity, we’ve taken only two rows of multiplications for each number, and an empty line follows each table. We’ll use this file as input and split it into multiple files, each containing the multiplication table of a single number.
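To follow along, the sample file can be recreated with a here-document. This is a minimal sketch that assumes each table, including the last one, is followed by an empty line, which is the layout the delimiter-based split in the next section relies on:

```shell
# Recreate the sample input; the quoted 'EOF' prevents any shell expansion.
cat > tables.txt <<'EOF'
Multiplication Table of n=2
2x1=2
2x2=4

Multiplication Table of n=3
3x1=3
3x2=6

Multiplication Table of n=4
4x1=4
4x2=8

EOF

# Count the empty lines as a quick sanity check.
grep -c '^$' tables.txt
```

The count should equal the number of tables, since one empty line follows each of them.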

## 3. Split Using *awk*

*awk* is a robust text-processing utility that offers sophisticated programming constructs such as loops and if-else blocks to give a great deal of control while working with text files. In this section, we’ll see how we can split a file based on a regex for either the delimiter or the text block.

### 3.1. Regex for Delimiter

When we can identify a regular expression for a delimiter that separates the individual text blocks, we can specify the record separator (*RS*) to split the file into individual records directly. In this case, we can notice that **an empty line separates two consecutive multiplication tables**.

Let’s create the *split-using-delimiter.awk* script and add the *BEGIN* block for initializing the record separator (*RS*) and a counter to keep track of the parts:

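```
BEGIN {
    # An empty line (two consecutive newlines) ends a record.
    # A multi-character RS isn't specified by POSIX; gawk and mawk treat it as a regex.
    RS="\n\n"
    part=0    # index used in the output filenames
}
```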

Next, we’ll add the main block, where we write the contents of each part to a separate file:

```
{
    file="out-" part ".txt"
    print $0 > file
    part=part+1
}
```

Finally, let’s execute the *split-using-delimiter.awk* script to split the file and verify the output:

```
$ awk -f split-using-delimiter.awk tables.txt
$ find . -type f -name "out*" -print -exec cat {} \;
./out-2.txt
Multiplication Table of n=4
4x1=4
4x2=8
./out-1.txt
Multiplication Table of n=3
3x1=3
3x2=6
./out-0.txt
Multiplication Table of n=2
2x1=2
2x2=4
```

It looks like we’ve got this right!
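A multi-character *RS* such as `"\n\n"` is treated as a regex by *gawk* and *mawk*, but POSIX leaves that behavior unspecified. As a hedged alternative, the sketch below relies on the POSIX paragraph mode (an empty *RS*), where one or more empty lines separate the records. The scratch directory, the use of *NR* for numbering, and the *close()* call are our own choices for illustration:

```shell
# Work in a scratch directory so we don't clobber existing files.
cd "$(mktemp -d)"

# Sample input: each table is followed by an empty line.
printf '%s\n' \
    'Multiplication Table of n=2' '2x1=2' '2x2=4' '' \
    'Multiplication Table of n=3' '3x1=3' '3x2=6' '' \
    'Multiplication Table of n=4' '4x1=4' '4x2=8' '' > tables.txt

# RS="" enables paragraph mode: records are separated by empty lines.
awk 'BEGIN { RS = "" }
{
    file = "out-" (NR - 1) ".txt"   # NR starts at 1, so subtract 1 to get out-0.txt
    print $0 > file
    close(file)                     # release the file descriptor right away
}' tables.txt

ls out-*.txt
```

In paragraph mode, a newline also acts as a field separator, but *$0* still holds the whole block unchanged, so the output files match the ones produced earlier.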

### 3.2. Regex for Text Block

When we can identify each text block using a regex, we can consider reading the input until the regex pattern matches the content read so far. **Looking at the content, we can figure out the regex for the text block**:

`Multiplication Table of n=.*\n([0-9]*x[0-9]*=[0-9]*\n){2}`

We can notice that the regex starts with the title for the multiplication table. Further, we’ve matched the numbers with the `[0-9]*` pattern and used `{2}` as a repetition quantifier to specify that there are exactly two lines of equations in any multiplication table.
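To see why the trailing `\n` matters, we can test the pattern against a complete and an incomplete block. The sketch below is only an illustration: it unrolls the `{2}` quantifier into two copies of the line pattern, because some *awk* builds (older *mawk* versions, for example) don’t support interval expressions, and all variable names are our own:

```shell
result="$(awk 'BEGIN {
    line  = "[0-9]*x[0-9]*=[0-9]*\n"                    # one equation line
    regex = "Multiplication Table of n=.*\n" line line  # header plus two equation lines

    complete   = "Multiplication Table of n=2\n2x1=2\n2x2=4\n"
    incomplete = "Multiplication Table of n=2\n2x1=2\n"

    print "complete: "   ((complete ~ regex)   ? "match" : "no match")
    print "incomplete: " ((incomplete ~ regex) ? "match" : "no match")
}')"
echo "$result"
```

Only the block that already contains both equation lines, each terminated by a newline, satisfies the pattern.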

Now, let’s start the *split-using-regex-text-block.awk* script with a main block that initializes a few variables and contains a reading loop:

```
{
    content=$0
    part=0
    pending=1
    while(pending) {
        # logic for splitting
    }
}
```

We **store individual parts in the content variable** and use the *part* variable to generate different output filenames for each part. Further, we use the *pending* variable to terminate the reading loop.

Next, let’s add the logic of **reading a line into the tmp variable using the getline function** and populating the *content* variable:

```
if ((getline tmp) > 0) {
    if(length(content)>0) {
        content=content "\n" tmp
    } else {
        content=tmp
    }
} else {
    pending=0
}
```

**We intend to reset the content variable after each pattern match**. So, we’ve added a conditional block that appends *tmp* to *content*, with a newline in between, if *content* is non-empty; otherwise, it sets *content* to *tmp*. Additionally, we set the *pending* variable to *0* (false) when the *getline* function can’t read any further.

Moving on, let’s **add the logic to match the content against the text block and write it to individual files**:

```
if( content ~ /Multiplication Table of n=.*\n([0-9]*x[0-9]*=[0-9]*\n){2}/) {
    file_name="out-" part ".txt"
    print content > file_name
    content=""
    part=part+1
}
```

Finally, let’s see our script in action:

```
$ awk -f split-using-regex-text-block.awk tables.txt
$ find . -type f -name "out*" -print -exec cat {} \;
./out-2.txt
Multiplication Table of n=4
4x1=4
4x2=8
./out-1.txt
Multiplication Table of n=3
3x1=3
3x2=6
./out-0.txt
Multiplication Table of n=2
2x1=2
2x2=4
```

We’ve successfully split the file into separate parts based on the regex for the text block.
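Putting the fragments together, the whole script could look like the sketch below. It’s an assembly of the pieces above rather than a verbatim listing: we’ve unrolled `{2}` into two explicit line patterns so that it also runs on *awk* builds without interval-expression support. It assumes that an empty line follows every table, including the last one; otherwise, the final block never gets its trailing `\n` and isn’t written:

```shell
# Work in a scratch directory and recreate the sample input.
cd "$(mktemp -d)"
printf '%s\n' \
    'Multiplication Table of n=2' '2x1=2' '2x2=4' '' \
    'Multiplication Table of n=3' '3x1=3' '3x2=6' '' \
    'Multiplication Table of n=4' '4x1=4' '4x2=8' '' > tables.txt

cat > split-using-regex-text-block.awk <<'EOF'
{
    content = $0    # first line of the input
    part = 0
    pending = 1
    while (pending) {
        # Read the next line and append it to the current block.
        if ((getline tmp) > 0) {
            if (length(content) > 0) {
                content = content "\n" tmp
            } else {
                content = tmp
            }
        } else {
            pending = 0    # end of input
        }
        # A block is complete once the header and two equation lines,
        # each terminated by a newline, have been read.
        if (content ~ /Multiplication Table of n=.*\n[0-9]*x[0-9]*=[0-9]*\n[0-9]*x[0-9]*=[0-9]*\n/) {
            file_name = "out-" part ".txt"
            print content > file_name
            content = ""
            part = part + 1
        }
    }
}
EOF

awk -f split-using-regex-text-block.awk tables.txt
ls out-*.txt
```

Because the main block consumes the rest of the input through *getline*, it runs only once, which is why initializing *part* and *pending* inside it is safe.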

### 3.3. A Word About the *“Too many open files”* Problem

Let’s revisit the key steps of our *awk* solutions:

- Creating a regex to match the content for one single file
- Constructing a new filename, such as *out-1.txt*, *out-2.txt*, and so on
- Using the *print* statement to redirect the content to the file: *print content > newFilename*

When we implement a split script using *awk* following the above steps, the script works for most cases. However, it may fail one day when we run it to split a huge file. We might **see the error message “Too many open files”**.

This is because **every time we perform “print something > newFileName”, the awk process opens a file descriptor**, and most *awk* implementations won’t automatically close it until the *awk* process is finished.

There’s a limit to how many files a process – *awk*, in our case – can open. This is decided by the system settings and the *awk* implementation. Once the number of files opened by *awk* exceeds the limit, the execution is aborted, and we see the mentioned error message.
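On Linux, the per-process limit is usually visible through the shell’s *ulimit* builtin. As a quick check (the actual values depend on the system configuration):

```shell
# Soft limit on open file descriptors for processes started from this shell
ulimit -n

# Hard limit, the ceiling up to which the soft limit can be raised
ulimit -Hn
```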

Therefore, to make our script robust, **we should close the files that we don’t need anymore**. *awk* provides the *close(filename)* function for that. So, for example, we can add the *close()* function call after the *print* statement:

```
...
{
    file="out-" part ".txt"
    print $0 > file
    close(file) # <-- after writing content to the file, close it
    part++
}
...
```
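To convince ourselves that *close()* keeps the number of open descriptors small, we can lower the limit in a subshell and still create hundreds of files. This is only a sketch; the limit of 64, the 500 generated records, and the *part-N.txt* names are arbitrary choices for the demonstration:

```shell
# Work in a scratch directory so we don't clutter the current one.
cd "$(mktemp -d)"

(
    ulimit -n 64    # lower the open-files limit for this subshell only
    seq 1 500 | awk '{
        file = "part-" NR ".txt"
        print $0 > file
        close(file)    # only one output file is open at any time
    }'
)

# Count the files that were created.
ls part-*.txt | wc -l
```

One caveat: if a script writes to the same filename again after closing it, `>` truncates the file when it’s reopened, so `>>` is the operator to use for appending in that case.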

## 4. Split Using *csplit*

In this section, we’ll explore the *csplit* utility of the GNU *coreutils* package for breaking the sample input file into sections. Unfortunately, *csplit* matches its patterns, which are POSIX basic regular expressions, against one line at a time, so we can’t use the “*\n*” character to make a regex span multiple lines. Let’s keep this limitation in mind for our use case.

**The standard use case for the csplit utility is text files containing headers or titles that can serve as delimiters**. Let’s split our input file using the regex for the header:

```
$ csplit tables.txt '/^Multiplication Table of n=[0-9]*/' '{2}' -f outfile
0
41
41
41
```

We must note that the *‘{2}’* argument repeats the preceding pattern two more times, so *csplit* splits the file at three occurrences of the delimiter in total. Moreover, the command generated four output files and printed the size of each: one file has zero bytes, and the remaining three have a size of *41* bytes each. The first file is empty because there’s no content before the first occurrence of the delimiter.

Next, let’s verify the contents of the output files with the filename prefix of *outfile*, as specified with the *-f* option of the *csplit* command earlier:

```
$ find . -type f -name "outfile*" -print -exec cat {} \;
./outfile01
Multiplication Table of n=2
2x1=2
2x2=4
./outfile00
./outfile02
Multiplication Table of n=3
3x1=3
3x2=6
./outfile03
Multiplication Table of n=4
4x1=4
4x2=8
```

As expected, *outfile00* is empty, and the rest of the files contain the multiplication tables of individual numbers.
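If the empty first file is unwanted, GNU *csplit* can drop it with the *-z* (*--elide-empty-files*) option, and the `'{*}'` repeat count avoids hard-coding the number of tables. Both are GNU extensions, so this sketch assumes GNU *coreutils*; the *part-* prefix is our own choice:

```shell
# Work in a scratch directory and recreate the sample input.
cd "$(mktemp -d)"
printf '%s\n' \
    'Multiplication Table of n=2' '2x1=2' '2x2=4' '' \
    'Multiplication Table of n=3' '3x1=3' '3x2=6' '' \
    'Multiplication Table of n=4' '4x1=4' '4x2=8' '' > tables.txt

# -z skips zero-length pieces, -s silences the byte counts,
# and '{*}' repeats the pattern as many times as the input allows.
csplit -z -s -f part- tables.txt '/^Multiplication Table of n=[0-9]*/' '{*}'

ls part-*
```

With *-z*, the sequence numbers still start at 0, so the first table lands in *part-00*.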

## 5. Conclusion

In this article, we learned how to split a text file based on a regex using the *awk* and *csplit* utilities. Moreover, we saw that *awk* gives us more control for advanced use cases with its programming constructs. However, we can write one-liners using the *csplit* command for straightforward use cases.