1. Overview

When we work on the Linux command line, we often process text files. In this tutorial, we’ll discuss how to find the longest lines in a file.

2. Introduction to the Problem

First of all, let’s understand the problem through an example:

$ cat lines.txt
Hi there,
How are you?
Recently I have to process many text files.
I like it.
Sometimes files could have very long lines.
My task is finding the longest lines.
For example, this is a really long... line.
I love Linux.

Bye

We want to find the longest lines from the file lines.txt.

To identify the longest lines, let’s prepend each line with its length using a short awk one-liner:

$ awk '{printf "%2d| %s\n",length,$0}' lines.txt
 9| Hi there,
12| How are you?
43| Recently I have to process many text files.
10| I like it.
43| Sometimes files could have very long lines.
37| My task is finding the longest lines.
43| For example, this is a really long... line.
13| I love Linux.
 0| 
 3| Bye

The output above shows that the maximum line length in the file is 43 and that three lines have this length. Our goal is to find these three longest lines.

Sometimes, we may not want all the longest lines. We may have a requirement such as: if there are multiple longest lines, print only the first or the last one.

However, once we have all the longest lines, there are many ways to pick the first or the last of them. For example, we can use the head and tail commands to do that.
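As a minimal sketch of that last step, assume all the longest lines have already been collected, one per line. Here a hypothetical variable, all_longest, simply holds the three lines from our example; head -n 1 keeps the first and tail -n 1 keeps the last:

```shell
# Hypothetical stand-in for the output of any command that prints all longest lines.
all_longest='Recently I have to process many text files.
Sometimes files could have very long lines.
For example, this is a really long... line.'

# Keep only the first of the longest lines.
first_longest=$(printf '%s\n' "$all_longest" | head -n 1)

# Keep only the last of the longest lines.
last_longest=$(printf '%s\n' "$all_longest" | tail -n 1)

printf 'first: %s\n' "$first_longest"
printf 'last:  %s\n' "$last_longest"
```

In practice, the same head or tail call can be appended with a pipe to whichever command produces the longest lines.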

Therefore, in this tutorial, we’ll study two approaches to finding all the longest lines from a file:

  • Using the wc and grep commands
  • Using the awk command

Now, let’s see how to solve the problem.

3. Using wc and grep

One way to solve the problem is to combine the wc and grep commands.

We know that the grep command can match the lines of a text file against a regex pattern. If we know the maximum line length, MAX_LEN, all the longest lines will match the ERE pattern “^.{MAX_LEN}$”.
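To see the pattern in isolation, here is a small sketch with a hard-coded length. The sample file and the max_len variable are assumptions made for illustration; in the real solution, the value comes from a separate step:

```shell
# Hypothetical sample input written to a temporary file.
sample=$(mktemp)
printf '%s\n' 'short' 'exactly12chr' 'also 12 long' 'tiny' > "$sample"

# Assumed to be known already: the length of the longest line above.
max_len=12

# Lines with exactly max_len characters match the anchored ERE ^.{12}$.
matches=$(grep -E "^.{$max_len}\$" "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Because the pattern is anchored at both ends, only the two 12-character lines are printed. Shorter lines can't match, and no line is longer than the maximum.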

Finding the maximum line length would then be the task of the wc command.

3.1. The Pitfall of Using the wc Command

Let’s first have a look at the wc command.

The wc command has an option -L (--max-line-length), which we can use to print the maximum line length:

$ wc -L lines.txt 
43 lines.txt

As the output shows, 43 is the maximum line length in the lines.txt file — so far, so good.

However, wc -L will surprise us if there are TABs in the input. Let’s have a look at some examples:

$ echo -e "\t" | wc -L
8
$ echo -e "a\t" | wc -L
8
$ echo -e "abc\t" | wc -L
8
$ echo -e "abcde\t" | wc -L
8

This is because wc -L prints the maximum display width rather than the maximum number of characters in a line, even though the long option is called --max-line-length.

Let’s append a “$” character after the TAB in similar inputs and check that they all have the same display width:

$ echo -e "a\t$"
a       $
$ echo -e "ab\t$"
ab      $
$ echo -e "abc\t$"
abc     $
$ echo -e "abcde\t$"
abcde   $

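The wc command advances the position to the next tab stop, which falls on every eighth column, so a single TAB can count for anywhere from one to eight columns. So far, no option can change this.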

If we want to take a closer look at it, we can find how a TAB is handled in the wc source code:

switch (wide_char)
   {
 ...
   case '\t':
     linepos += 8 - (linepos % 8);
 ...
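To make the arithmetic concrete, the sketch below applies the same formula in awk. The display_width function is a hypothetical helper, not part of wc, and it assumes the input contains only TABs and ordinary single-column characters:

```shell
# Sketch: compute the display width of each input line the way the C excerpt
# above does, advancing to the next multiple of 8 for every TAB.
display_width() {
    awk '{
        pos = 0
        for (i = 1; i <= length($0); i++) {
            if (substr($0, i, 1) == "\t")
                pos += 8 - (pos % 8)   # jump to the next tab stop
            else
                pos++                  # ordinary character: one column
        }
        print pos
    }'
}

printf 'abc\t\n'      | display_width   # 8
printf 'abcdefgh\t\n' | display_width   # 16
```

The second call shows that a TAB doesn't always land on column 8: after eight characters, the next tab stop is column 16.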

We can solve this problem using the tr command to convert TABs to spaces before we pass the input to the wc command:

$ echo -e "a\t" | tr '\t' ' ' | wc -L 
2
$ echo -e "abcde\t" | tr '\t' ' ' | wc -L 
6

Therefore, a reliable wc command for our problem is:

$ tr '\t' ' ' <lines.txt | wc -L
43
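As a cross-check that doesn't depend on wc -L at all, the maximum length can also be computed with awk's length function, which counts a TAB as a single character. This is a sketch of an alternative, not the approach used in this section, and the temporary file is a hypothetical sample:

```shell
sample=$(mktemp)
printf 'abcde\tz\n' > "$sample"   # seven characters: abcde, TAB, z
printf 'abc\n' >> "$sample"       # three characters

# Track the largest line length seen; "+ 0" forces numeric output for empty input.
max_len=$(awk 'length($0) > max { max = length($0) } END { print max + 0 }' "$sample")
echo "$max_len"

rm -f "$sample"
```

For this sample, the awk command reports 7, whereas the display width of the first line is 9 because the TAB advances to column 8 before the final z.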

3.2. Assemble the wc and grep Commands

Now we can combine the tr, wc -L, and grep commands to find all the longest lines:

$ grep -E "^.{$(tr '\t' ' ' <lines.txt | wc -L)}$" lines.txt
Recently I have to process many text files.
Sometimes files could have very long lines.
For example, this is a really long... line.

Good! All three longest lines are printed out.

The command is straightforward. We used the command substitution $(tr '\t' ' ' <lines.txt | wc -L) to get the output of the wc command, 43, so grep effectively runs with the pattern “^.{43}$”.
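For repeated use, the two steps could be wrapped in a small shell function. The sketch below is one possible arrangement: longest_lines is a hypothetical name, and it assumes GNU wc (for -L) and single-width characters:

```shell
# Hypothetical helper: print every line of a file that has the maximum length.
# Assumes GNU wc (for -L) and single-width characters.
longest_lines() {
    file=$1
    # Step 1: maximum line length, with TABs flattened to single spaces.
    max_len=$(tr '\t' ' ' < "$file" | wc -L)
    # Step 2: -x matches whole lines only, so no ^ and $ anchors are needed.
    grep -E -x ".{$max_len}" "$file"
}

sample=$(mktemp)
printf '%s\n' 'Hi there,' 'How are you?' 'I love Linux.' 'Bye, everyone' > "$sample"
longest_lines "$sample"
rm -f "$sample"
```

Storing the length in a variable first keeps the grep call readable, and the -x option is simply another way to require that the whole line matches.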

4. Using the awk Command

Let’s have a look at how awk solves the problem:

$ awk '{ln=length}
       ln>max{delete result; max=ln}
       ln==max{result[NR]=$0} 
       END{for(i in result) print result[i] }' lines.txt
Recently I have to process many text files.
Sometimes files could have very long lines.
For example, this is a really long... line.

The output above shows the three longest lines we want. Now, let’s understand how the awk command works:

  • {ln=length}: length here is short for length($0). We save the line length in the variable ln
  • ln>max{…}: The max variable holds the greatest ln seen so far. We compare the ln of each incoming line with max
  • ln>max{delete result; max=ln}: If the current ln is greater than max, we empty the result array and make ln the new max
  • ln==max{result[NR]=$0}: If ln equals max, the current line is one of the longest lines so far, so we add it to the result array, indexed by its line number
  • END{for(i in result) print result[i]}: In the END block, we print all elements in the result array. Note that awk doesn’t guarantee the traversal order of for(i in result), so the lines may not come out in file order with every awk implementation
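If the output order matters, one option is to index the array with a running counter and print it with a numeric loop. This is a sketch of a variant rather than the command used above; the counter n and the sample file are assumptions for illustration:

```shell
sample=$(mktemp)
printf '%s\n' 'bb' 'a' 'cc' 'dd' > "$sample"

# n counts the stored lines; resetting it to 0 discards older, shorter lines.
longest=$(awk '
    { ln = length($0) }
    ln > max  { n = 0; max = ln }          # new maximum: start over
    ln == max { result[++n] = $0 }         # remember lines in input order
    END { for (i = 1; i <= n; i++) print result[i] }
' "$sample")
printf '%s\n' "$longest"

rm -f "$sample"
```

Since the END loop runs from 1 to n, the lines are printed in the order they appeared in the file. Stale entries beyond n are never printed, so the array doesn't need to be deleted.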

5. Benchmarking Performance

So far, we have two different solutions to our problem. We may want to know which approach is faster.

Before we compare their performance, let’s create a larger input file with very long lines:

$ for n in {1..10}; do rand=$RANDOM;echo "$(tr -dc A-Za-z0-9 </dev/urandom | head -c$rand)" >>big.txt;done
$ file big.txt 
big.txt: ASCII text, with very long lines

The big.txt file used in this test has eleven lines: an empty first line followed by ten lines of random characters read from /dev/urandom. Let’s check the line lengths:

$ awk '{printf "line #%d has length:%d\n",NR,length}' big.txt
line #1 has length:0  
line #2 has length:4998
line #3 has length:5880
line #4 has length:9487
line #5 has length:18474
line #6 has length:28352
line #7 has length:4441
line #8 has length:6502
line #9 has length:21588
line #10 has length:5051
line #11 has length:15307
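Because the generation loop appends with >>, rerunning it keeps adding lines to an existing big.txt. A minimal sketch of a variant that rebuilds the file from scratch is shown below. The out variable and the temporary path are assumptions for illustration, and awk's rand() stands in for $RANDOM so that the snippet doesn't depend on Bash:

```shell
out=$(mktemp)   # hypothetical output path; use big.txt for the real test

# Ten random lengths between 1 and 32767, one per line.
lengths=$(awk 'BEGIN { srand(); for (i = 1; i <= 10; i++) print int(rand() * 32767) + 1 }')

# Redirecting the whole loop with ">" truncates the file once, then fills it.
for len in $lengths; do
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$len"
    echo
done > "$out"

wc -l < "$out"
```

The final wc -l call should report ten lines, however many times the snippet is run against the same path.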

We’re going to benchmark the performance of the two approaches using the time command.

First, let’s test the wc and grep solution:

$ time grep -E "^.{$(tr '\t' ' ' <big.txt | wc -L)}$" big.txt > /dev/null
real 4.12 
user 4.06
sys 0.05

We waited more than four seconds to find the longest lines.

Now, let’s see how the awk solution performs:

$ time awk '{ln=length}ln>max{delete result; max=ln}
     ln==max{result[NR]=$0} END{for(i in result) print result[i] }' big.txt > /dev/null
real 0.00  
user 0.00
sys 0.00

The awk command finished immediately!

The test shows that the awk solution is much faster than the wc and grep solution. This is because:

  • The wc and grep solution goes through the input three times, once each for tr, wc, and grep, while the awk command passes through the file only once
  • The tr and wc commands inspect every single character in the input
  • grep does regex matching on each line, which is a pretty expensive operation, especially with a repetition count as large as {28352}
  • On the other hand, the awk command only needs the length of each line and does no pattern matching at all

6. Conclusion

In this article, we addressed two different ways to find the longest lines from an input file.

Also, we benchmarked their performance and discussed why the awk solution is much faster than the wc and grep approach.

Apart from that, we took a closer look at a pitfall of the wc command that we must be aware of when we need to use the -L option.
