1. Overview

Sometimes, we may need to remove whitespace characters to sanitize the content of some files. We can do this a few different ways from the Linux command line.

In this tutorial, we’ll cover how to remove all types of whitespace, including Unicode. We’ll also look at how to manage line breaks separately.

2. Introduction to the Problem

2.1. What Are Whitespace Characters?

Whitespace is usually the spacing between printable characters. This can either be within a line (horizontal) or separating lines (vertical).

Sometimes, we want to remove all whitespace characters from a file. However, we often face the requirement of removing just the horizontal whitespace characters. In other words, we may want to remove all whitespaces from each line of a file, but still keep them as separate lines.

In this tutorial, we’ll explore both scenarios.

We should also note that the Unicode character set defines additional whitespace characters beyond the ASCII ones, for example, the “figure space” (U+2007) character and the “ideographic space” (U+3000) character.

2.2. The Input Example

Let’s start with an example of horizontal and vertical whitespace:

$ cat -n raw_file.txt
     1	     We Have Leading Spaces.
     2	Now We Have Two Tabs:		And An Empty Line:
     3	
     4	And We Have A Couple Of Trailing Blank Lines:
     5		      
     6		 

Here, we’ve used the cat command with the -n option to print the file content with line numbers. In this way, we can clearly see empty lines in the output.

As the output above shows, our raw_file.txt contains different whitespace characters, such as spaces, tabs, and line breaks. Our goal is to remove them all.
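
To follow along, we can recreate a similar file ourselves. The snippet below is only a sketch: the exact number of spaces and tabs on each line is an assumption, since the listing above doesn't reveal it. It also uses od -c to make every tab and line break visible:

```shell
# Recreate an approximation of raw_file.txt; the exact number of
# spaces and tabs on each line is an assumption.
{
  printf '     We Have Leading Spaces.\n'
  printf 'Now We Have Two Tabs:\t\tAnd An Empty Line:\n'
  printf '\n'
  printf 'And We Have A Couple Of Trailing Blank Lines:\n'
  printf '\t      \n'
  printf '\t \n'
} > raw_file.txt

# od -c shows tabs as \t and line breaks as \n, so nothing stays invisible.
od -c raw_file.txt | head -n 4
```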

In this tutorial, we’ll look at three commands:

  • tr
  • sed
  • awk

These are very common and should be found in most Linux distros.

3. Using the tr Command

The tr command reads a byte stream from standard input (stdin), translates or deletes characters, then writes the result to standard output (stdout).

We can use the tr command’s -d option – for deleting specific characters – to remove whitespace characters. The syntax is: tr -d SET1

So, depending on the requirement, passing the right SET1 characters to tr becomes the key to either removing only horizontal whitespace characters or all whitespace.
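
SET1 doesn't have to be a named character class. As a minimal sketch, with a made-up input string, we can also list the characters to delete literally:

```shell
# Delete every space and tab; '\t' is an escape that tr itself interprets.
result=$(printf 'a b\tc\n' | tr -d ' \t')
printf '%s\n' "$result"   # prints: abc
```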

3.1. Removing Horizontal Whitespace Only

First, let’s remove all horizontal whitespace from the input file. tr defines the “[:blank:]” character set for all horizontal whitespace.

Also, we should keep in mind that the tr command only reads data from stdin. Therefore, we need to redirect the content of raw_file.txt to stdin:

$ tr -d "[:blank:]" < raw_file.txt | cat -n
     1	WeHaveLeadingSpaces.
     2	NowWeHaveTwoTabs:AndAnEmptyLine:
     3	
     4	AndWeHaveACoupleOfTrailingBlankLines:
     5	
     6	

In the example, we’ve also piped the result of tr to cat -n to verify empty lines.

So, as the output shows, we’ve removed all horizontal whitespace but kept the line breaks.
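
Since tr only writes to stdout, saving the result takes one more step. The following sketch uses a scratch directory and hypothetical file names (input.txt, output.txt) to show one way of doing it:

```shell
# Work in a scratch directory so the sketch touches no real files.
workdir=$(mktemp -d)
printf '  two  leading\tand one tab\n' > "$workdir/input.txt"

# tr cannot edit in place: write to a second file, then replace the original.
# Redirecting to the input file itself would truncate it before tr reads it.
tr -d '[:blank:]' < "$workdir/input.txt" > "$workdir/output.txt" &&
  mv "$workdir/output.txt" "$workdir/input.txt"

cat "$workdir/input.txt"   # prints: twoleadingandonetab
```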

3.2. Removing All Whitespace Characters

Next, let’s remove all whitespace characters from the file.

The “[:space:]” character set means all horizontal and vertical whitespace:

$ tr -d "[:space:]" < raw_file.txt
WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:

Here, we don’t need to pipe the output to cat to see that there are no line breaks!
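
Because the final line break is deleted as well, the result no longer ends with a newline. As a sketch, with a hypothetical sample helper standing in for a real file, we can confirm this with wc -l and append a newline ourselves if we need one:

```shell
# Hypothetical helper that emits a small sample with spaces, a tab, and line breaks.
sample() { printf 'one two\nthree\tfour\n'; }

# wc -l counts newline characters, so 0 means no line break survived.
newlines=$(sample | tr -d '[:space:]' | wc -l)
echo "$newlines"

# If the result should still end with a newline, append one after tr finishes.
{ sample | tr -d '[:space:]'; echo; } | wc -l
```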

4. Using the sed Command

sed is a widely used, non-interactive stream editing utility.

4.1. Removing Horizontal Whitespace Only

First, let’s remove all horizontal whitespace characters. “[:blank:]” is also a POSIX standard character class that stands for horizontal whitespace.

sed works with regular expressions. To use this character class within the regular expression, it becomes “[[:blank:]]”:

$ sed 's/[[:blank:]]//g' raw_file.txt | cat -n
     1	WeHaveLeadingSpaces.
     2	NowWeHaveTwoTabs:AndAnEmptyLine:
     3	
     4	AndWeHaveACoupleOfTrailingBlankLines:
     5	
     6	

4.2. Removing All Whitespace Characters

Similarly, “[:space:]” is a POSIX standard character class for horizontal and vertical whitespace.

However, unlike the tr command, we cannot simply replace the [[:blank:]] with [[:space:]] in the sed command to remove all whitespace.

By default, the sed command reads, processes, and outputs line by line. When it reads a line, it strips the trailing newline before placing the line in the pattern space. When it writes the pattern space to the output, it appends the newline again.

Therefore, even if we replace every [[:space:]] match with an empty string, the line break comes back when sed outputs the line.
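
A quick experiment with a made-up two-line input illustrates this behavior:

```shell
# Two input lines; each contains whitespace.
out=$(printf 'a b\nc\td\n' | sed 's/[[:space:]]//g')
printf '%s\n' "$out"
# The space and the tab are gone, but the output still has two lines:
# ab
# cd
```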

If we want sed to remove vertical whitespace, such as line breaks, we need to tell sed to keep reading and removing whitespace until the end of the file and then output only once:

$ sed ':a; N; s/[[:space:]]//g; ta' raw_file.txt
WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:

4.3. Understanding the sed Command

The sed command above is pretty compact. However, it might not be that straightforward to understand. Let’s break it down quickly and see how it works:

  • :a; – this is not a command. It merely defines a label called “a”.
  • N; – appends the next input line to the pattern space, separated by a newline.
  • s/[[:space:]]//g; – as before, the s command removes all whitespace from the text in the current pattern space.
  • ta – branches back to the label “a” if the preceding s command has made a substitution.

In the sed command, ‘:a; … ta‘ works like a loop. Whenever the N; command appends a new line to the pattern space, the pattern space contains at least one whitespace character: the line break. So the substitution always succeeds, ta always branches, and sed keeps appending the next line and removing whitespace characters until it reaches the last line in the file.

When there is no next line left, the N; command hits the end of the input. GNU sed then prints the current pattern space and terminates processing. The GNU sed manual notes that some other implementations exit at this point without printing anything, so the command is best treated as relying on GNU sed behavior.

In this way, sed has removed all whitespace characters, including line breaks, from the input file.
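
One edge case is worth knowing: if the input has only a single line, N finds no next line right away, so GNU sed prints that line unchanged before the s command ever runs. A possible alternative, shown here as a sketch with a hypothetical helper name strip_all, first collects all lines and then substitutes once:

```shell
# Hypothetical helper: read the whole input into the pattern space, then
# strip whitespace once. '$!' restricts the N/ba loop to non-final lines.
strip_all() {
  sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/[[:space:]]//g' "$@"
}

printf 'a b\nc\td\n' | strip_all        # prints: abcd
printf 'only one line\n' | strip_all    # prints: onlyoneline
```

As with the original command, sed still ends its output with a single newline, and the whole file is held in memory while it's processed.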

Many sed implementations support writing the result back to the input file. For example, the widely used GNU sed provides the -i option to make “in-place” changes.
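
As a sketch of in-place editing, the example below works on a scratch file with the hypothetical name notes.txt. Attaching a suffix to -i asks sed to keep a backup copy of the original:

```shell
workdir=$(mktemp -d)
printf ' keep\tlines \n  but drop blanks\n' > "$workdir/notes.txt"

# -i.bak rewrites notes.txt and keeps the original as notes.txt.bak.
sed -i.bak 's/[[:blank:]]//g' "$workdir/notes.txt"

cat "$workdir/notes.txt"
# keeplines
# butdropblanks
```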

5. Using the awk Command

awk is another powerful text-processing utility. It defines its own C-like scripting language and plenty of built-in variables and functions to control the processing flexibly.

5.1. Removing Horizontal Whitespace Only

awk supports regular expressions as well, and POSIX-compliant implementations such as GNU awk support the POSIX standard character classes, such as [:blank:] and [:space:].

We can just call the gsub function to remove all horizontal whitespace:

$ awk '{gsub(/[[:blank:]]/,""); print}' raw_file.txt | cat -n
     1	WeHaveLeadingSpaces.
     2	NowWeHaveTwoTabs:AndAnEmptyLine:
     3	
     4	AndWeHaveACoupleOfTrailingBlankLines:
     5	
     6	

As the output above shows, we’ve solved the problem.
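
When called with two arguments, gsub works on the current record ($0) and returns the number of replacements it made. The sketch below, with a made-up input and assuming an awk that supports POSIX character classes, prints that count next to each cleaned line:

```shell
# gsub() edits $0 in place and returns how many matches it replaced.
report=$(printf ' a b\tc\nxyz\n' |
  awk '{ n = gsub(/[[:blank:]]/, ""); print n ": " $0 }')
printf '%s\n' "$report"
# 3: abc
# 0: xyz
```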

5.2. Removing All Whitespace Characters

Similar to sed, by default, awk also reads, processes, and outputs line by line.

When awk prints records, it separates them by the built-in ORS variable. The default value of the ORS variable is one single line break.

Therefore, we can make two modifications to the awk command above to ask it to remove all whitespace, including line breaks:

  • Replace the character class with [:space:]
  • Set an empty string as the value of the ORS variable

Next, let’s see it in action:

$ awk -v ORS="" '{gsub(/[[:space:]]/,""); print}' raw_file.txt | cat -n
     1	WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:
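
With an empty ORS, the output has no terminating newline at all. If we want exactly one at the end, one option is an END block. This is a sketch with a hypothetical helper name, strip_and_terminate, and it assumes an awk that supports POSIX character classes:

```shell
# Hypothetical helper wrapping the awk command plus an END block
# that emits a single final newline.
strip_and_terminate() {
  awk -v ORS='' '{ gsub(/[[:space:]]/, ""); print } END { printf "\n" }' "$@"
}

printf 'a b\nc\td\n\n' | strip_and_terminate           # prints: abcd
printf 'a b\nc\td\n\n' | strip_and_terminate | wc -l   # prints: 1
```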

6. Unicode Whitespaces

So far, we’ve learned several approaches to remove whitespace characters from input files. These solutions will work for all ASCII text files.

In our day-to-day work, most text files we need to work with are ASCII text files. However, whitespace isn’t limited to ASCII: Unicode defines additional, non-ASCII whitespace characters.

Now, let’s discuss the handling of Unicode characters. We assume our default locale is en_US.UTF-8.

6.1. An Input Example

First of all, let’s see an input file that contains non-ASCII Unicode characters:

$ cat raw_unicode.txt
Some Non-whitespace Unicode Characters:
[Check Mark]: U+2714 (✔)
[Cross Mark]: U+2716 (✖)

Some Unicode Whitespace Characters:
[Figure Space]: U+2007 ( )
[Thin Space]: U+2009 ( )
[Paragraph Separator]: U+2029 (
)
[Ideographic Space]: U+3000 ( )

In this file, we have six Unicode characters in the format: [Name]: Code_In_Hex (The Character)

Now, let’s try to remove horizontal whitespaces using our tr solution from this raw_unicode.txt file:

$ tr -d "[:blank:]" < raw_unicode.txt
SomeNon-whitespaceUnicodeCharacters:
[CheckMark]:U+2714(✔)
[CrossMark]:U+2716(✖)

SomeUnicodeWhitespaceCharacters:
[FigureSpace]:U+2007( )
[ThinSpace]:U+2009( )
[ParagraphSeparator]:U+2029(
)
[IdeographicSpace]:U+3000( )

As the output shows, all ASCII whitespaces have been removed, such as spaces. However, the non-ASCII Unicode whitespaces in the parentheses are still there.

This illustrates that things work a little differently when files contain Unicode characters. Commands or scripts that we’ve tested on ASCII input may suddenly stop working on such files.
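
A likely explanation is that GNU tr processes its input byte by byte, while UTF-8 encodes each of these whitespace characters as a multi-byte sequence. The following sketch, which writes the bytes as octal escapes so that it doesn't depend on the locale, counts what's left after tr has run:

```shell
# U+2007 (figure space) is the three bytes E2 80 87 in UTF-8,
# written here as octal escapes so the sketch does not depend on the locale.
figure_space=$(printf '\342\200\207')

# One character, but three bytes:
bytes=$(printf '%s' "$figure_space" | wc -c)
echo "$bytes"

# [:blank:] removes the ASCII space but none of those three bytes
# (behavior of GNU tr; other implementations may differ).
left=$(printf 'a %sb' "$figure_space" | tr -d '[:blank:]' | wc -c)
echo "$left"
```

With GNU tr, the second count covers the letter a, the three bytes of the figure space, and the letter b.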

Therefore, before we focus on removing Unicode whitespaces, it’s worthwhile to test whether our file contains Unicode characters.

6.2. A Tip: Checking Unicode Characters in a Text File

First of all, we can use the file command to check whether a text file is pure ASCII or contains Unicode characters:

$ file raw_file.txt 
raw_file.txt: ASCII text

$ file raw_unicode.txt 
raw_unicode.txt: Unicode text, UTF-8 text

The output shows us which file contains Unicode characters.

So, in practice, if our scripts suddenly don’t work on a particular file, we may want first to check if the file contains Unicode characters.
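
If the file command isn't at hand, a rough alternative is to count the bytes outside the ASCII range. This is only a sketch: the helper name count_non_ascii and the sample files are made up for illustration, and the octal range syntax is assumed to be supported by the installed tr (it is by GNU tr):

```shell
# Hypothetical helper: count the bytes outside the ASCII range (0-127).
# tr deletes every ASCII byte, so whatever remains is non-ASCII.
count_non_ascii() {
  LC_ALL=C tr -d '\000-\177' < "$1" | wc -c
}

workdir=$(mktemp -d)
printf 'plain text\n' > "$workdir/ascii.txt"
printf 'thin\342\200\211space\n' > "$workdir/unicode.txt"   # U+2009 as UTF-8 bytes

count_non_ascii "$workdir/ascii.txt"     # 0
count_non_ascii "$workdir/unicode.txt"   # 3
```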

6.3. Removing Unicode Whitespaces

Unfortunately, there is no standard character class to match all Unicode whitespaces. However, Unicode currently assigns the White_Space property to only 25 characters, six of which are the familiar ASCII ones.

Therefore, we can build our own “character class” to contain all these characters:

SPACES=$(printf "%b" "\U00A0\U1680\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U2028\U2029\U202F\U205F\U3000")

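As the statement above shows, we’ve saved the non-ASCII Unicode whitespaces in a shell variable called SPACES. The \U escapes need a printf that understands them, such as the Bash builtin. Note that the list leaves out U+0085 (next line), which also has the White_Space property.

Then, if we want to remove all Unicode whitespaces, we can build a regex bracket expression “[$SPACES]” to do the substitution.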

Next, let’s remove all horizontal whitespaces, including non-ASCII ones, from the raw_unicode.txt file using the sed command:

$ sed "s/[[:blank:]$SPACES]//g" raw_unicode.txt 
SomeNon-whitespaceUnicodeCharacters:
[CheckMark]:U+2714(✔)
[CrossMark]:U+2716(✖)

SomeUnicodeWhitespaceCharacters:
[FigureSpace]:U+2007()
[ThinSpace]:U+2009()
[ParagraphSeparator]:U+2029()
[IdeographicSpace]:U+3000()

As we can see in the output above, the sed command has removed all horizontal whitespaces, including the non-ASCII ones in the parentheses. Since SPACES also contains the paragraph separator (U+2029), that character is gone as well. The Unicode characters ‘✔’ and ‘✖’ are still there.

Finally, let’s see another example to remove all whitespaces from the file using the awk command:

$ awk -v ORS="" -v uspaces="$SPACES" '{gsub("[[:space:]"uspaces"]",""); print}' raw_unicode.txt 
SomeNon-whitespaceUnicodeCharacters:[CheckMark]:U+2714(✔)[CrossMark]:U+2716(✖)SomeUnicodeWhitespaceCharacters:[FigureSpace]:U+2007()[ThinSpace]:U+2009()[ParagraphSeparator]:U+2029()[IdeographicSpace]:U+3000()
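
The bracket expressions above rely on a UTF-8 locale, so that each multi-byte character counts as a single element. Where such a locale isn't available, one possible fallback is to run sed in the C locale and match the exact UTF-8 byte sequence of each character. The sketch below handles only two whitespace characters, and all variable names are made up for illustration:

```shell
# UTF-8 byte sequences, written as octal escapes (locale-independent).
figure_space=$(printf '\342\200\207')   # U+2007
thin_space=$(printf '\342\200\211')     # U+2009
check_mark=$(printf '\342\234\224')     # U+2714, must survive

# In the C locale sed matches bytes, so each sequence is given as a
# literal string rather than as a member of a bracket expression.
clean=$(printf 'a %sb%sc %s\n' "$figure_space" "$thin_space" "$check_mark" |
  LC_ALL=C sed -e 's/[[:blank:]]//g' \
               -e "s/$figure_space//g" \
               -e "s/$thin_space//g")

printf '%s\n' "$clean"   # prints: abc followed by the check mark
```

Putting these byte sequences inside a bracket expression in the C locale would instead match their individual bytes, which could damage other characters such as the check mark.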

7. Conclusion

In this article, we learned how to remove whitespace from a text file using the Linux command line.

Though the requirement looks pretty simple, it can have a few variations. We looked at how to solve horizontal and vertical whitespace removal with a few common Linux tools.

We also saw that when a file contains non-ASCII Unicode whitespaces, we need to handle it differently.

Additionally, we’ve learned how to check if a text file contains Unicode characters using the file command.
