1. Overview

In this article, we’ll discuss the negative impacts of using a shell loop to process text contents. First, we’ll cover the concept behind how the shell loops process strings and text files by example. Afterward, we’ll go through the negative side-effects of using a shell loop to process text data.

2. The Concept

When we write programs in a language like C or Python, we have multiple ways to accomplish a task. Similarly, in shell scripting, we can automate a given task in various ways. However, that doesn’t mean that every method is effective.

Many Linux beginners try to project concepts from languages like C and Python onto shell scripting. As a result, they often end up writing "bad code": solutions that could be written more efficiently, sometimes with less code.

For instance, let’s take a look at an example:

while IFS= read -r line; do
  echo "$line" | cut -c3
done < file

This code reads a text file line by line and prints the third character of each line. For a few lines, that's fine. But what happens when we have a million lines? Would we invoke the cut binary on a million text lines individually? We'll get to that.

2.1. Why Use Shell Scripts?

The shell’s job is just to execute commands. When we collect a bunch of logically related commands and put them in a source file, we create an organized set of activities that we can use over and over. This increases productivity and makes life easier.
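For example, a tiny script can package a task we would otherwise retype every time. The snippet below is a minimal sketch; the backup function name and its arguments are our own invention, not a standard utility:

```shell
#!/bin/bash
# A small reusable activity: archive a directory into a tarball.
backup() {
  local src=$1 archive=$2
  tar -czf "$archive" "$src" && echo "archived $src into $archive"
}

# Once saved in a file, the task is repeatable with a single command:
# backup /var/log /tmp/logs.tar.gz
```

Collecting related commands like this into a function or script is exactly the kind of reuse that makes shell scripting productive.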

Shell merely interprets the commands that we give it. These commands can be built-in or just a stand-alone binary written in a high-level language like C.

On the other hand, the shell is a higher-level language (some might argue that it’s not even a language), and it’s the commands that do the actual jobs. The shell is only meant to orchestrate these commands.

Command orchestration is done through shell features like pipes, streams, and redirections. These features are so convenient that, even after half a century, we still use shells. We simply write a set of commands, and the shell invokes them and lets them work at their own pace.
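As a small illustration of this orchestration, the pipeline below (a minimal sketch) lets two independent commands cooperate: printf feeds lines into the pipe while grep filters them concurrently.

```shell
# printf writes three lines to stdout; the pipe streams them into grep,
# which keeps only the lines starting with 'a'.
printf 'apple\nbanana\navocado\n' | grep '^a'
```

The shell's only job here is to wire the two processes together; grep starts consuming input as soon as printf produces it.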

In shells, we invoke as few commands as possible for a given task. Why? Because invoking a command has its cost, which we’ll cover in the next section.

3. Performance Issues

3.1. Behind the Scenes

In the shell, when we invoke an external command, a lot of things happen behind the scenes. The command binary is loaded into the memory, and a process is created and initialized. Then, hundreds of instructions are executed for a simple task. Finally, the process has to be destroyed and cleaned out from the memory.
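We can make this cost visible by timing many invocations of an external no-op binary against the shell's built-in no-op. This is a rough sketch; exact numbers vary by system and shell:

```shell
# Each call to /bin/true forks a new process and execs the binary;
# ':' is a built-in no-op that never leaves the shell process.
run_external() { for i in $(seq 1 1000); do /bin/true; done; }
run_builtin()  { for i in $(seq 1 1000); do :; done; }

time run_external   # dominated by process creation and teardown
time run_builtin    # no per-iteration process is created
```

On a typical system, the external-binary loop is slower by a couple of orders of magnitude, purely because of per-process setup and teardown.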

Now, imagine if we have a text file that contains a million lines, and we execute this block on it:

while IFS= read -r line; do
  echo "$line" | cut -c3
done < file

It means the cut binary goes through that entire process a million times, just to extract a single character from each line. That's bad code, isn't it?

3.2. Shell Commands Are Not UNIX Utilities

Functions in programming languages like C usually run in a single process. The standard library keeps track of its memory space and internal buffers to avoid costly system calls.

In contrast, when we chain commands in a shell pipeline, each external command runs in its own process. Therefore, they don't share a common memory space or internal buffers.

The read command, for instance, is only meant to read a single line. It must be careful not to read past the newline character, because there is no shared buffer through which it could hand extra bytes back to the next command; anything it over-reads would simply be lost. Thus, read consumes its input cautiously, and the next command has to wait for it to finish.

For that reason, commands in a shell script proceed largely sequentially: each one can only work as fast as the data handed to it.
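We can observe this behavior directly. In the snippet below (a small sketch), read consumes exactly the first line, and the rest of the stream remains available to the next command in the same group:

```shell
# Both commands run in one group sharing a single input stream:
# read takes only the first line; cat receives everything left over.
printf 'first\nsecond\nthird\n' | { IFS= read -r line; echo "got: $line"; cat; }
```

The output shows "got: first" followed by the remaining two lines, confirming that read stops precisely at the first newline.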

3.3. Using a Shell Built-in Instead

Most shells, like bash and zsh, provide built-in features to perform a lot of common tasks. Running a shell built-in, as opposed to an external binary, is far more performant because no new process has to be created.

For instance, let’s rewrite the above code, but let’s use a shell built-in instead of cut:

while IFS= read -r line; do
    echo "${line:2:1}"
done < file

In Bash, the built-in version outperforms the cut version by a ratio of around 600:1. That's huge!

Similarly, the following code, which spawns wc once per line to print each line's length:

while IFS= read -r line; do
    echo "Length: $(printf '%s' "$line" | wc -c)"
done < file

can be rewritten with the ${#line} parameter expansion, which involves no external command at all:

while IFS= read -r line; do
    echo "Length: ${#line}"
done < file

4. Avoiding Loops Where Possible

As we saw above, there’s always a more efficient way to do a given task in the shell. Finding a more efficient solution to the problem can take time, but it pays off in the long run.

For example, let’s use a loop to process a huge text file containing around 30,000 lines. We’ll print the first character from each line:


while IFS= read -r line; do
  echo "$line" | head -c1
done < data.json

Now, we’re going to use the time command to benchmark the time it takes:

$ time ./extract.sh
./extract.sh  36.96s user 15.43s system 97% cpu 53.834 total

It takes an insane amount of time just to process this file. However, if we drop the loop and use a single awk command, the same job finishes in a fraction of a second:

$ time awk '{print substr($0,1,1)}' data.json
awk '{print substr($0,1,1)}' data.json  0.01s user 0.06s system 89% cpu 0.081 total

As we can see, there’s no need to use a loop when we can accomplish the same task more efficiently. However, that doesn’t mean that we should avoid using loops. Sometimes, we need loops for specific tasks like making several subsequent requests to a web server.
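As a sketch of such a legitimately sequential task, the retry loop below polls until a condition holds. The is_server_up function is a stand-in of our own invention for what would, in practice, be something like a curl call against a real server:

```shell
# Hypothetical probe: succeeds once a counter reaches 3, simulating
# a server that comes up after a couple of failed attempts.
counter_file=$(mktemp)
echo 0 > "$counter_file"

is_server_up() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ "$n" -ge 3 ]
}

# Each iteration depends on the previous one having failed, so a loop
# is the right tool here; no single command can replace it.
for attempt in 1 2 3 4 5; do
  if is_server_up; then
    echo "up after $attempt attempts"
    break
  fi
  sleep 0  # in real code: sleep a bit between retries
done
rm -f "$counter_file"
```

Unlike per-line text processing, the iterations here are inherently ordered, so the cost of the loop is the price of the task itself.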

5. Conclusion

In this article, we touched on the subject of the negative impacts of using loops to process text. First, we got familiar with the concept of how the shell works. Then, we discussed the performance issue that comes with processing text in loops.

Finally, with the help of an example, we illustrated how we can avoid loops and adopt a simpler and more efficient approach.
