1. Overview

In Linux, rearranging lines in a text file is a common operation. Sometimes, we want to put the lines in a specific order, and the sort command can help us with that.

However, sometimes we would like to randomize the order instead. In other words, we want to shuffle the lines in a file.

In this tutorial, we’re going to see different ways to shuffle lines in a text file. Also, we’ll compare those approaches and discuss their pros and cons.

2. Example Input File

Before we start shuffling files, let’s first prepare a text file as the input for all examples in later sections:

$ cat input.txt
The original line number: 1
The original line number: 2
The original line number: 3
The original line number: 4
The original line number: 5
The original line number: 6
The original line number: 7
A line with some text.
A line with some text.
A line with some text.

As the output above shows, we’ve created a file called input.txt, which has ten lines. In the first seven lines, we put the original line number in each line, so that we can easily see the result after the shuffling.

It’s worth mentioning that the last three lines are duplicates: they contain exactly the same text. We added these three lines because we would like to observe how repeated lines are distributed in the shuffled result.

In this tutorial, let’s shuffle the input file in three ways:

  • Using the shuf command
  • Using the sort command
  • Using a random number generator and the sort command

3. Using the shuf Command

The shuf utility is a member of the GNU Coreutils package. It outputs a random permutation of the input lines.

The shuf command loads all input data into memory before shuffling, so it may fail if the input is larger than the available memory.

It’s pretty straightforward to shuffle lines in a file using this command:

$ shuf input.txt
The original line number: 7
The original line number: 4
A line with some text.
The original line number: 1
A line with some text.
The original line number: 5
A line with some text.
The original line number: 3
The original line number: 6
The original line number: 2

The output above shows that the lines are re-ordered randomly, including the three lines containing identical text.

If we run the command multiple times, we’ll most likely get a different result each time.
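
As a quick sanity check, the following sketch verifies that shuf only reorders lines and never adds or drops any. It recreates the example file in a temporary directory; the variable names and the temporary paths are assumptions made for this illustration. It also shows the -n option, which limits the output to the given number of randomly chosen lines:

```shell
# Recreate the example input in a scratch directory (paths are illustrative).
workdir=$(mktemp -d)
input="$workdir/input.txt"
seq -f "The original line number: %g" 7 > "$input"
for i in 1 2 3; do
    echo "A line with some text." >> "$input"
done

# Shuffle the file, then compare the sorted input with the sorted output.
shuf "$input" > "$workdir/shuffled.txt"
sort "$input" > "$workdir/input.sorted"
sort "$workdir/shuffled.txt" > "$workdir/shuffled.sorted"
if cmp -s "$workdir/input.sorted" "$workdir/shuffled.sorted"; then
    echo "shuffled output contains exactly the original lines"
fi

# Pick three lines at random instead of shuffling the whole file.
shuf -n 3 "$input" > "$workdir/sample.txt"
cat "$workdir/sample.txt"
```

Sorting both files before comparing them removes the effect of the ordering, so only the content is compared.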

4. Using the sort Command

Most of the time, we use the sort command to rearrange lines in a file in a certain predefined order. However, we can also use the sort command with the -R option to do a random sort:

$ sort -R input.txt
A line with some text.
A line with some text.
A line with some text.
The original line number: 3
The original line number: 6
The original line number: 1
The original line number: 2
The original line number: 5
The original line number: 4
The original line number: 7

The output above shows that the lines are rearranged randomly, as we expected. But if we check the three lines with the same text, we’ll find that they are consecutive.

Well, it probably happened by accident. Let’s run the command once again, to see what comes:

$ sort -R input.txt
The original line number: 6
The original line number: 5
A line with some text.
A line with some text.
A line with some text.
The original line number: 7
The original line number: 1
The original line number: 4
The original line number: 3
The original line number: 2

This time, the order of lines in the output is different from the first run. However, the three lines with the same text are still together.

In fact, no matter how many times we run the command, these three lines will always be consecutive in the output. This is because sort -R doesn’t shuffle the lines directly. Instead, it computes a hash key for each line, using a hash function chosen at random for each run, and then sorts by the generated hash keys. Within a run, identical lines always produce the same hash key. Therefore, the three lines are always together in our output.
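
To make this grouping visible, here is a small check, assuming GNU sort. It pipes the output of sort -R through uniq -c, which collapses only adjacent identical lines. If the duplicates were ever separated, the repeated text would show up in more than one group. The file path and variable names are illustrative:

```shell
workdir=$(mktemp -d)
input="$workdir/input.txt"
seq -f "The original line number: %g" 7 > "$input"
for i in 1 2 3; do
    echo "A line with some text." >> "$input"
done

# uniq -c merges adjacent duplicates and prefixes each group with its size.
sort -R "$input" | uniq -c > "$workdir/groups.txt"
cat "$workdir/groups.txt"

# Count how many separate groups the duplicated text forms.
dup_groups=$(grep -c "A line with some text." "$workdir/groups.txt")
echo "groups of duplicated lines: $dup_groups"
```

With sort -R, the duplicated text should form a single group of three on every run, whereas the same check applied to the output of shuf would usually report more than one group.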

5. A Three-Step Approach

In addition to shuf and sort -R, we can also shuffle lines in a file ourselves. As the title indicates, we’ll do it in three steps:

  1. Add a random number as a prefix on each line.
  2. Sort the lines by the prefix number.
  3. Remove the prefix.

5.1. Get a Random Number

Generating a random number in Linux is an interesting and relatively complicated topic. In this tutorial, we won’t dive into the random number generation analysis. Instead, we’ll just introduce several ways to get a random number in Linux.

The most straightforward way to get a random number in Linux would be reading the $RANDOM “variable”:

$ echo $RANDOM 
12587
$ echo $RANDOM
26089
$ echo $RANDOM
17442

In fact, $RANDOM is a special shell variable provided by Bash rather than a regular one: each time we read it, Bash generates a pseudorandom integer between 0 and 32767, which is a 15-bit range.
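
Since a single read of $RANDOM covers only 15 bits, we sometimes need to scale or combine values. The following Bash sketch shows two common idioms; the variable names are made up for this example, and the modulo approach carries a slight bias whenever the range size doesn’t divide 32768 evenly:

```shell
#!/usr/bin/env bash
# $RANDOM is provided by Bash (and some other shells), not by POSIX sh.

# A number between 1 and 6, like rolling a die.
# 32768 is not a multiple of 6, so faces 1 and 2 are very slightly favored.
die=$(( RANDOM % 6 + 1 ))
echo "die: $die"

# Combine two 15-bit reads into one 30-bit value (0 to 1073741823).
wide=$(( (RANDOM << 15) | RANDOM ))
echo "30-bit value: $wide"
```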

Another way to get a random number is to use the device /dev/random.

/dev/random is a special device file. It uses the noise collected from device drivers and other sources to generate random data. We can use the od (octal dump) command to read a number of bytes from it and print their decimal equivalent. In the following example, -N2 limits the read to two bytes:

$ od -An -N2 -i /dev/random
   25824
$ od -An -N2 -i /dev/random
   56167
$ od -An -N2 -i /dev/random
   42527
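
For scripts, /dev/urandom is often used instead, since it’s fed by the same kernel generator and doesn’t block. Here is a minimal sketch that reads four bytes and prints them as one unsigned 32-bit integer; the -tu4 output type and the variable name are choices made for this example:

```shell
# -An drops the offset column, -N4 reads four bytes,
# and -tu4 prints them as a single unsigned 4-byte decimal integer.
num=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
echo "$num"
```

The tr call only strips the padding spaces that od puts in front of the number.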

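/dev/random is the interface to the kernel’s random number generator. The kernel reports the size of its entropy pool, in bits, in the file /proc/sys/kernel/random/poolsize. The value depends on the kernel version, and on the system used here it’s 4096: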

$ cat /proc/sys/kernel/random/poolsize
4096

We can also get a random number using a scripting language such as Python or Perl, or using awk:

$ awk 'BEGIN{srand();for(i=1;i<=3;i++)print rand()}'
0.714734
0.174336
0.369674
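
One caveat: when srand() is called without an argument, POSIX awk seeds the generator from the time of day, so two runs started within the same second can print the same numbers. A possible workaround, sketched here under the assumption that /dev/urandom is available, is to pass an explicit seed. The variable name seed is arbitrary:

```shell
# Read a 32-bit seed from the kernel and hand it to awk via -v.
seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
awk -v seed="$seed" 'BEGIN { srand(seed); for (i = 1; i <= 3; i++) print rand() }'

# A fixed seed makes the sequence repeatable, which is handy for testing.
first=$(awk 'BEGIN { srand(42); print rand() }')
second=$(awk 'BEGIN { srand(42); print rand() }')
if [ "$first" = "$second" ]; then
    echo "same seed, same number"
fi
```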

5.2. Assemble the Command

Finally, we’ll build our command following this pattern:

cmd-to-prepend-random-number-on-each-line | sort -n -k 1 | cmd-to-remove-random-number-prefix-from-each-line

The sort command in the middle sorts the prepared lines numerically (-n) by the key starting at the first field (-k 1), which contains the random numbers we prepended.

Let’s use the powerful awk command to prepend a random number on each line:

$ awk 'BEGIN{srand()}{print rand(), $0}' input.txt
0.118435 The original line number: 1
0.277674 The original line number: 2
0.139113 The original line number: 3
0.351707 The original line number: 4
0.178648 The original line number: 5
0.128693 The original line number: 6
0.625488 The original line number: 7
0.179445 A line with some text.
0.100277 A line with some text.
0.584702 A line with some text.

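To remove our random number prefix, we can delete the text from the beginning of each line up to and including the first space. We have many options to do that. We’ll use the awk command again, this time with the sub() function and the pattern /\S* /. Note that \S is a GNU awk extension, so on other awk implementations a bracket expression such as /^[^ ]* / is the portable equivalent.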

Now, let’s build the three parts together:

$ awk 'BEGIN{srand()}{print rand(), $0}' input.txt \
    | sort -n -k 1 \
    | awk 'sub(/\S* /,"")'
The original line number: 4
A line with some text.
The original line number: 1
The original line number: 2
A line with some text.
The original line number: 7
A line with some text.
The original line number: 6
The original line number: 3
The original line number: 5

The result above shows that all lines, including the three duplicated lines, are rearranged randomly.
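
The same pipeline can be written a bit more defensively. The following variant is a sketch, not the only way to do it: it seeds awk explicitly, separates the random prefix from the line with a tab so that cut can strip it, restricts the sort key to the first field with -k1,1, and sets LC_ALL=C so that the decimal point is parsed consistently. All variable names and paths are illustrative:

```shell
export LC_ALL=C
workdir=$(mktemp -d)
input="$workdir/input.txt"
seq -f "The original line number: %g" 7 > "$input"
for i in 1 2 3; do
    echo "A line with some text." >> "$input"
done

seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
tab=$(printf '\t')

# 1. Prefix each line with a random number and a tab.
# 2. Sort numerically on the first tab-separated field only.
# 3. Drop the first field, keeping the rest of the line untouched.
awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.15f\t%s\n", rand(), $0 }' "$input" \
    | sort -t "$tab" -k1,1n \
    | cut -f2- > "$workdir/shuffled.txt"

cat "$workdir/shuffled.txt"
```

Printing the random number with more digits than awk’s default output format makes ties between prefixes less likely on large inputs, and using cut for the last step avoids any dependency on regular expression extensions.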

6. Performance Comparison

So far, we have seen three different ways to shuffle lines in a file. We might ask, which one is faster?

Let’s first create a big file using the seq command for the performance tests:

$ seq -f "The original line number: %g" 3000000 > big.txt

$ ls -l big.txt 
-rw-r--r-- 1 kent kent 104M May 23 23:06 big.txt

$ wc -l big.txt 
3000000 big.txt

As the output above shows, we’ve created a file with the name big.txt, which contains three million lines.

Now, we’ll compare the performance of our different approaches using the time command.

First, let’s test the shuf command:

$ time shuf big.txt > result.txt
real 0.73
user 0.64
sys 0.08

And now, the sort -R approach:

$ time sort -R big.txt > result.txt
real 35.73
user 117.17
sys 0.17

And finally, our three-step approach:

$ time awk 'BEGIN{srand()}{print rand(), $0}' big.txt \
    | sort -n -k 1 \
    | awk 'sub(/\S* /,"")' > result.txt
real 2.99
user 1.67
sys 0.05

The result shows that the sort -R approach is much slower than the one with shuf: about 50 times in this test. That’s because sort -R must compute a hash key for every single line and then sort by those keys, which is quite an expensive operation, while the shuf command doesn’t need this calculation. Moreover, shuf does all the work in memory.

Our three-step solution is also pretty fast. However, it’s still about four times slower than the shuf approach. This is because the three-step approach starts three processes and passes the whole data set through each of them, with a full sort in the middle.

7. Pros and Cons

After the performance test, it’s time to discuss the pros and cons of these three approaches.

7.1. The shuf Command

Pros:

  • Straightforward
  • Fast, since all processing is done in memory
  • Duplicated lines can be shuffled as well

Cons:

  • File size limited to the amount of free memory

When we need to shuffle a file, and the file can be loaded into memory, the shuf command would be our first choice.

However, if we want to use the shuf command to shuffle lines of a huge file, whose size is greater than the memory size, we may have to first split it into small files and then merge them after the shuffling.
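
One possible way to do this, sketched below as an assumption rather than a recipe from the shuf documentation, is to scatter the lines into randomly chosen chunk files first, shuffle each chunk separately with shuf, and concatenate the results. Because every line lands in a random chunk, the combined output is still a proper shuffle, while each chunk can be kept small enough to fit into memory. The chunk count, the file names, and the small demo file are placeholders:

```shell
workdir=$(mktemp -d)
big="$workdir/big.txt"
seq -f "The original line number: %g" 1000 > "$big"   # stand-in for a huge file

chunks=4   # in practice, pick a count that keeps each chunk below free memory
seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')

# Pass 1: send every line to one of the chunk files, chosen at random.
awk -v n="$chunks" -v dir="$workdir" -v seed="$seed" '
    BEGIN { srand(seed) }
    { print > (dir "/chunk." int(rand() * n)) }
' "$big"

# Pass 2: shuffle each chunk in memory and append it to the result.
for f in "$workdir"/chunk.*; do
    shuf "$f"
done > "$workdir/result.txt"

wc -l < "$workdir/result.txt"
```

Simply cutting the file into consecutive pieces and shuffling each piece wouldn’t be enough, since lines from the beginning of the file could never end up near the end.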

7.2. The sort -R Approach

Pros:

  • Straightforward
  • No limitation on file size

Cons:

  • Slow on big files due to the hash calculation mechanism
  • Duplicated lines will always stick together

7.3. The Three-Step Approach

Pros:

  • Reasonably fast
  • Duplicated lines can be shuffled as well
  • No limitation on file size
  • Flexible and extensible

Cons:

  • The script is more complicated than the other solutions
  • Starts three processes and processes the input data three times

When we need to shuffle a huge file that can’t be loaded into memory completely, this approach is simpler than splitting the file for the shuf command and much faster than the sort -R approach.

8. Conclusion

In this article, we’ve learned three different ways to shuffle lines of a file. We’ve also compared their performance and discussed their pros and cons.

Apart from that, we’ve seen how to get a random number in Linux as well.
