1. Overview

Many text-processing programs read an entire line into memory at a time. If the input file contains a very long line, such a program may crash when there isn’t enough memory to hold that line.

In this tutorial, we’ll see how we can replace a string in a very large one-line text file. For example, we may want to manipulate a 50GB file containing only one line of text. As some programs can’t handle very large one-line text files, we’ll see what alternatives we have.

2. Target File

Some modern JavaScript libraries compress all of their code into a single line. Let’s suppose we have a one-line JavaScript file called original.js with a typo in it: it calls the “fliter” method instead of “filter”. We’ll fix this typo in the following sections.
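To follow along without a real minified library, a small stand-in for original.js can be created as sketched below. The JavaScript content is a hypothetical sample made up for illustration; only the “.fliter(” typo and the absence of newlines matter.

```shell
# Hypothetical sample content; a real minified file would be far larger.
# printf '%s' writes the text without adding a trailing newline.
printf '%s' 'var a=[1,2,3];var b=a.fliter(function(x){return x>1;});console.log(b);' > original.js

# wc -l counts newline characters, so a file with no newline at all reports 0.
wc -l < original.js
```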

3. Using tr and sed

We can use tr and sed to replace the string. This is a two-step process: first we split up the line, and then we replace the string.

3.1. Splitting Long Lines

We usually use sed to replace a string, but sed will try to load the whole line into memory. To overcome this, we’re going to split the line into multiple smaller lines. Then, we’ll feed sed with this new input to replace the string. Finally, we’ll rejoin the output into one line.

In Linux, by default, lines are separated with the newline character “\n”. In our case, we’ll replace another character with “\n” and feed sed with this new input. We have to choose a character that is not in the string we want to replace. Also, the character has to occur often enough in the input file that replacing it with “\n” produces relatively short lines.

To split the single line into several lines, we can use the program tr to substitute a character with “\n”. tr processes one character at a time, meaning it can handle large files with very long lines without issue.

In its basic form, tr takes two parameters, each a set of characters. It replaces every character from the first set with the character at the same position in the second set. Let’s see how to replace “;” with “\n”:

$ echo "line one;line two" | tr ";" "\n"
line one
line two
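Before settling on a delimiter, it can help to check how often it occurs in the input. The sketch below counts the occurrences of “;” with tr and wc, both of which process the input as a stream. The sample text and the variable names are assumptions made for illustration.

```shell
# Hypothetical sample standing in for a much larger one-line file.
sample='var a=[1,2,3];var b=a.fliter(function(x){return x>1;});console.log(b);'

# -c complements the set and -d deletes, so tr removes everything except ";".
# wc -c then reports how many delimiters the input contains.
delimiter_count=$(printf '%s' "$sample" | tr -cd ';' | wc -c)

# Total size in bytes; dividing by the delimiter count estimates the
# average length of the lines produced by the split.
total_bytes=$(printf '%s' "$sample" | wc -c)

echo "delimiters: $((delimiter_count)), bytes: $((total_bytes))"
```

A delimiter count near zero relative to the file size suggests the split lines would still be too long, and another character may be a better choice.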

In case there are newlines in our file, we should replace both “;” with “\n” and “\n” with “;”. Doing this preserves the original newline characters. To do this, we’ll run tr ";\n" "\n;". This way, tr replaces the first character of the first parameter (;) with the first character of the second parameter (\n), and the second character of the first parameter (\n) with the second character of the second parameter (;).

As we are adding newlines to the input, we’ll then rejoin the lines so that the output is consistent with the input. This can be easily done by swapping the tr parameters to produce the inverse replacement. Let’s use tr ";\n" "\n;" to split the line, and then tr "\n;" ";\n" to rejoin it:

$ echo "line one;line two" | tr ";\n" "\n;" | tr "\n;" ";\n"
line one;line two

As we can see, the output is identical to the input.
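The same round trip can be checked on a file that already contains newlines, which is the case the swap is meant to handle. The sketch below uses a temporary directory and hypothetical file names, and relies on cmp to compare the result with the input byte by byte.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# Hypothetical sample containing both ";" and real newlines.
printf 'a;b\nc;d\n' > "$workdir/input.txt"

# Split on ";" by swapping it with "\n", then apply the inverse swap.
tr ';\n' '\n;' < "$workdir/input.txt" | tr '\n;' ';\n' > "$workdir/roundtrip.txt"

# cmp exits with status 0 only when both files are identical.
cmp "$workdir/input.txt" "$workdir/roundtrip.txt" && echo "round trip preserved the input"
```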

3.2. Replacing the String

After splitting the input into several lines, we can run sed to replace the string. Let’s see how to run sed to replace the string “Alan Turing” with “Alan Mathison Turing”:

$ echo "Alan Turing was born in London." | sed 's/Alan Turing/Alan Mathison Turing/'
Alan Mathison Turing was born in London.

So far, we’ve seen how to split long lines, replace a string, and rejoin the lines. Now we can combine these steps to replace a string in a file with very long lines.

If we want to substitute “.fliter(” with “.filter(” in our target file, we can choose the character “;” to split the lines. The “;” character is not in the “.fliter(” string, and it is a character usually present in a JavaScript file, so it should produce short lines. Let’s see how to fix original.js and write the result to fixed.js:

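$ tr ";\n" "\n;" < original.js | sed 's/\.fliter(/.filter(/g' | tr "\n;" ";\n" > fixed.js

Notice we have to escape the dot character when we replace “.fliter(” with “.filter(”. This is because sed’s substitution command takes a regular expression as its first parameter, and an unescaped dot matches any character. We also add the g flag so that sed replaces every occurrence on a split line, not just the first one.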
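One way to confirm the fix is to count how many split lines still contain the typo. The sketch below runs the pipeline with sed’s g flag on a hypothetical sample in which the typo appears twice within one statement, then reuses the tr split so that grep only ever reads short lines. The file contents and the remaining variable are assumptions made for illustration.

```shell
# Work in a temporary directory so no existing files are touched.
workdir=$(mktemp -d)

# Hypothetical one-line sample with two typos between the same pair of ";".
printf '%s' 'var b=a.fliter(f).fliter(g);console.log(b);' > "$workdir/original.js"

# Split, replace every match on each split line (g flag), and rejoin.
tr ';\n' '\n;' < "$workdir/original.js" \
    | sed 's/\.fliter(/.filter(/g' \
    | tr '\n;' ';\n' > "$workdir/fixed.js"

# grep -c prints the number of matching lines; it exits non-zero when the
# count is 0, so "|| true" keeps the assignment from reporting a failure.
remaining=$(tr ';\n' '\n;' < "$workdir/fixed.js" | grep -c '\.fliter(' || true)
echo "split lines still containing the typo: $remaining"
```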

4. Using awk

There are other programs that can replace a string in a text file. Instead of sed, we can use awk and its gsub function. This is a two-step process: we configure awk’s line delimiter, and then we substitute the string.

4.1. Changing the Line Delimiter

With awk, we can change the character used to delimit lines. Then, instead of using the default “\n” line delimiter, we can use another character that produces smaller lines. As we saw in the previous section, we have to choose a character that is not in the string we want to substitute.

To change the line delimiter used in awk, we’ll set the RS variable to the desired character inside the BEGIN block. For instance, if we choose “;” as the line delimiter, we set RS=";". Let’s see how it works:

$ echo "line one;line two" | awk 'BEGIN{RS=";"}{print}'
line one
line two

As mentioned in the previous section, the output has to be consistent with the input: even though awk splits the input on the “;” character, the result has to match the original. Here, however, awk’s print statement ends each record with a newline that was not in the original input.

Let’s use the printf function instead, so no newlines are added:

$ echo "line one;line two" | awk 'BEGIN{RS=";"}{printf "%s", $0}'
line oneline two

Now we’re only missing the “;” character. In the input, every line except the first one is preceded by a line delimiter. So, let’s prepend the “;” character to every line unless it is the first one:

$ echo "line one;line two" | awk 'BEGIN{RS=";"}{
    if (NR != 1) {
        printf "%c", RS
    }
    printf "%s", $0
}'
line one;line two

Notice we used the NR variable to get the current line number and ignore the first line. Also, we used the RS variable to print the line delimiter.
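One edge case worth checking is an input whose very last character is the delimiter: awk does not produce an empty record after a trailing delimiter, so the loop above never prints that final “;”. The sketch below is one possible workaround that assumes GNU awk (gawk) is installed. It relies on RT, a gawk extension that holds the text that terminated the current record. The function name rejoin_exact is hypothetical.

```shell
# Assumes GNU awk (gawk): RT is a gawk extension, not part of POSIX awk.
# RT is ";" for records terminated by the delimiter and empty for a final
# record that ends at end of input, so printing it after each record
# reproduces the input exactly.
rejoin_exact() {
    gawk 'BEGIN { RS = ";" } { printf "%s%s", $0, RT }'
}

# Usage: the trailing ";" survives because it is echoed back from RT.
if command -v gawk >/dev/null 2>&1; then
    printf 'line one;line two;' | rejoin_exact
fi
```

With this variant, the NR check is no longer needed, at the cost of depending on gawk rather than any awk implementation.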

4.2. Replacing the String

We saw how to use awk to process a file by splitting lines on any character other than “\n”. So, we can now replace a string in a file with very long lines.

To replace a string with awk, we’ll use the gsub function. This function works similarly to sed’s substitute command with the g flag: it treats its first parameter as a regular expression and replaces every match in the current line with the second parameter. We’ll call gsub to make the substitution, and then use the code from the last example to print the line.

We’ll repeat the same idea from the previous section. Let’s fix our target file by replacing “.fliter(” with “.filter(”:

$ awk 'BEGIN{RS=";"} {
    gsub("\\.fliter\\(", ".filter(")
    if (NR != 1) {
        printf "%c", RS
    }
    printf "%s", $0
}' < original.js > fixed.js

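Notice the difference from sed when we escape characters. awk uses extended regular expressions, in which “(” is a special character, so we have to escape it as well. Also, because we pass the pattern as a string, we have to double the backslashes: awk first processes the string, turning “\\.” into “\.”, and then interprets the result as a regular expression.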
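As an alternative sketch, the pattern can be written as a regex literal between slashes instead of a string. awk reads a regex literal directly as a regular expression, so a single backslash is enough. The function name fix_filter and the sample input are hypothetical.

```shell
# Same logic as the script above, with the pattern as a regex literal
# (/.../), which needs only one backslash before "." and "(".
fix_filter() {
    awk 'BEGIN { RS = ";" } {
        gsub(/\.fliter\(/, ".filter(")
        if (NR != 1) {
            printf "%c", RS
        }
        printf "%s", $0
    }'
}

# Usage on a small hypothetical sample:
printf '%s' 'a.fliter(f);b.fliter(g)' | fix_filter
```

The string form remains useful when the pattern has to be built at run time. For a fixed pattern like this one, the literal form avoids the double escaping.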

5. Conclusion

In this tutorial, we saw two methods to replace a string inside a very large one-line file.

On the one hand, we saw how to use sed. In this case, we had to use tr to split the one-line file into several lines. On the other hand, we saw we can also use awk by setting the RS variable with a character that will split the line.
