
Last updated: March 18, 2024
Sometimes, we may need to remove whitespace characters to sanitize the content of some files. We can do this in a few different ways from the Linux command line.
In this tutorial, we’ll cover how to remove all types of whitespace, including Unicode. We’ll also look at how to manage line breaks separately.
Whitespace is usually the spacing between printable characters. This can either be within a line (horizontal) or separating lines (vertical).
Sometimes, we want to remove all whitespace characters from a file. However, we often face the requirement of removing just the horizontal whitespace characters. In other words, we may want to remove all whitespaces from each line of a file, but still keep them as separate lines.
In this tutorial, we’ll explore both scenarios.
We should also note that whitespace isn’t limited to spaces, tabs, and line breaks. ASCII itself includes less common ones, such as the vertical tab (U+000B), and Unicode defines additional whitespace characters, for example, the “figure space” (U+2007) character.
Let’s start with an example of horizontal and vertical whitespace:
$ cat -n raw_file.txt
1 We Have Leading Spaces.
2 Now We Have Two Tabs: And An Empty Line:
3
4 And We Have A Couple Of Trailing Blank Lines:
5
6
Here, we’ve used the cat command with the -n option to print the file content with line numbers. In this way, we can clearly see empty lines in the output.
As the output above shows, our raw_file.txt contains different whitespace characters, such as spaces, tabs, and line breaks. Our goal is to remove them all.
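Because whitespace is invisible in normal output, it can help to print it in an unambiguous form before and after cleaning. The following is a minimal sketch, assuming a POSIX-compatible sed; the helper name show_ws and the sample text are made up for illustration and aren’t taken from raw_file.txt:

```shell
# show_ws: hypothetical helper that makes whitespace visible.
# sed's "l" command prints each line in an unambiguous form:
# tabs appear as \t and every line end is marked with $.
show_ws() {
  sed -n 'l' "$@"
}

# Illustrative input: a leading space, a tab, a trailing space, and an empty line.
printf ' a\tb \n\n' | show_ws
```

With this kind of view, a trailing space shows up right before the $ marker, and an empty line shows up as a lone $.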
In this tutorial, we’ll look at three commands: tr, sed, and awk.
These are very common and should be found in most Linux distros.
The tr command reads a byte stream from standard input (stdin), translates or deletes characters, then writes the result to standard output (stdout).
We can use the tr command’s -d option – for deleting specific characters – to remove whitespace characters. The syntax is: tr -d SET1
So, depending on the requirement, passing the right SET1 characters to tr becomes the key to either removing only horizontal whitespace characters or all whitespace.
First, let’s remove all horizontal whitespace from the input file. tr defines the “[:blank:]” character set for all horizontal whitespace.
Also, we should keep in mind that the tr command only reads data from stdin. Therefore, we need to redirect the content of raw_file.txt to stdin:
$ tr -d "[:blank:]" < raw_file.txt | cat -n
1 WeHaveLeadingSpaces.
2 NowWeHaveTwoTabs:AndAnEmptyLine:
3
4 AndWeHaveACoupleOfTrailingBlankLines:
5
6
In the example, we’ve also piped the result of tr to cat -n to verify empty lines.
So, as the output shows, we’ve removed all horizontal whitespace but kept the line breaks.
Next, let’s remove all whitespace characters from the file.
The “[:space:]” character set means all horizontal and vertical whitespace:
$ tr -d "[:space:]" < raw_file.txt
WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:
Here, we don’t need to pipe the output to cat to see that there are no line breaks!
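For longer files, reading the output by eye isn’t a reliable check. One possible way to verify the result is to count the whitespace characters that remain. This is only a sketch: the helper names count_class and sample are hypothetical, and the input is a made-up ASCII string rather than raw_file.txt:

```shell
# count_class: hypothetical helper that counts the characters on stdin
# that belong to the tr character class given as $1.
# -c complements the set, so -cd deletes everything *except* that class.
count_class() {
  tr -cd "$1" | wc -c
}

# sample: illustrative input with spaces, one tab, and three line breaks.
sample() {
  printf ' a\tb\n\n c \n'
}

sample | tr -d '[:blank:]' | count_class '[:blank:]'   # horizontal whitespace left
sample | tr -d '[:blank:]' | count_class '[:space:]'   # only the line breaks are left
sample | tr -d '[:space:]' | count_class '[:space:]'   # no whitespace left at all
```

Since wc -c counts bytes, this check is meant for ASCII input only.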
sed is a widely used, non-interactive stream editing utility.
First, let’s remove all horizontal whitespace characters. “[:blank:]” is also a POSIX standard character class that stands for horizontal whitespace.
sed works with regular expressions. To use this character class within the regular expression, it becomes “[[:blank:]]”:
$ sed 's/[[:blank:]]//g' raw_file.txt | cat -n
1 WeHaveLeadingSpaces.
2 NowWeHaveTwoTabs:AndAnEmptyLine:
3
4 AndWeHaveACoupleOfTrailingBlankLines:
5
6
Similarly, “[:space:]” is a POSIX standard character class for horizontal and vertical whitespace.
However, unlike the tr command, we cannot replace the [[:blank:]] with [[:space:]] in the sed command to remove all whitespace.
By default, the sed command reads, processes, and outputs line by line. When it reads a line into the pattern space, it strips the trailing newline character, and when it writes the pattern space to the output, it adds the newline back.
Therefore, the substitution never sees the line break within a single line, and the line break comes back when sed outputs the line.
If we want sed to remove vertical whitespace, such as line breaks, we need to tell sed to keep reading and removing whitespace until the end of the file and then output only once:
$ sed ':a; N; s/[[:space:]]//g; ta' raw_file.txt
WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:
The sed command above is pretty compact. However, it might not be that straightforward to understand. Let’s break it down quickly and see how it works: “:a” defines a label named a, “N” appends the next input line to the pattern space (separated by a newline), “s/[[:space:]]//g” removes every whitespace character from the pattern space, and “ta” jumps back to the label a if that substitution changed anything.
In the sed command, ‘:a … ta’ works like a loop. Every time the N command appends a new line to the pattern space, there’s at least one whitespace character to remove: the line break. Therefore, the substitution always succeeds, and sed keeps appending the next line and removing whitespace characters until the last line in the file.
When sed reaches the end of the input file, the N command has no next line to append. In that case, GNU sed prints the current pattern space and terminates processing.
In this way, sed has removed all whitespace characters, including line breaks, from the input file.
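The loop can also be written with one -e option per command, which some sed implementations handle more reliably than a label followed by a semicolon. The sketch below assumes GNU sed behavior at the end of input; the function name strip_all_ws is hypothetical:

```shell
# strip_all_ws: hypothetical wrapper around the same loop,
# with each sed command passed as its own -e expression.
strip_all_ws() {
  sed -e ':a' \
      -e 'N' \
      -e 's/[[:space:]]//g' \
      -e 'ta' "$@"
}

# Several lines: joined into one line with all whitespace removed.
printf ' a b\n\tc\n\nd \n' | strip_all_ws

# A single line: N finds no next line before any substitution has run,
# so GNU sed prints the line unchanged.
printf ' a b\n' | strip_all_ws
```

The second call points at an edge case worth keeping in mind: for one-line input, this loop doesn’t remove anything, so a plain substitution may be the better choice there.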
Many sed implementations support writing the result back to the input file. For example, the widely used GNU Sed provides the -i option to do “in-place” changes.
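As a cautious illustration of in-place editing, the sketch below works on a temporary file and keeps a backup. It assumes a sed that accepts a backup suffix attached to -i (as GNU sed does); the variable name tmpfile and the file content are made up for the example:

```shell
# Work on a throwaway file so the sketch has no side effects.
tmpfile=$(mktemp)
printf ' one two\n\tthree \n' > "$tmpfile"

# -i.bak edits the file in place and keeps the original as "$tmpfile.bak".
sed -i.bak 's/[[:blank:]]//g' "$tmpfile"

cat "$tmpfile"        # cleaned content
cat "$tmpfile.bak"    # untouched original
```

Keeping the backup is a deliberate choice here: once whitespace is removed in place, the original spacing can’t be reconstructed from the result.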
awk is another powerful text-processing utility. It has its own C-like scripting language and plenty of built-in variables and functions to manipulate the processing flexibly.
awk supports regular expressions as well, including the POSIX standard character classes, such as [:blank:] and [:space:].
We can just call the gsub function to remove all horizontal whitespace:
$ awk '{gsub(/[[:blank:]]/,""); print}' raw_file.txt | cat -n
1 WeHaveLeadingSpaces.
2 NowWeHaveTwoTabs:AndAnEmptyLine:
3
4 AndWeHaveACoupleOfTrailingBlankLines:
5
6
As the output above shows, we’ve solved the problem.
Similar to sed, by default, awk also reads, processes, and outputs line by line.
When awk prints records, it separates them by the built-in ORS variable. The default value of the ORS variable is one single line break.
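Therefore, we can make two modifications to the awk command above to ask it to remove all whitespace, including line breaks: replace [[:blank:]] with [[:space:]] in the gsub call, and set the ORS variable to an empty string using -v ORS="".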
Next, let’s see it in action:
$ awk -v ORS="" '{gsub(/[[:space:]]/,""); print}' raw_file.txt | cat -n
1 WeHaveLeadingSpaces.NowWeHaveTwoTabs:AndAnEmptyLine:AndWeHaveACoupleOfTrailingBlankLines:
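With ORS set to an empty string, the output doesn’t end with a line break at all, which some tools don’t expect from a text file. If a single final newline is wanted, one option is an END rule. This is a sketch only, assuming an awk that supports POSIX character classes; the function name strip_ws_awk is hypothetical:

```shell
# strip_ws_awk: hypothetical wrapper; the same command as above plus an
# END rule that emits exactly one newline after all records are printed.
strip_ws_awk() {
  awk -v ORS="" '{ gsub(/[[:space:]]/, ""); print } END { printf "\n" }' "$@"
}

printf ' a b\n\tc \n' | strip_ws_awk
```

Because printf in awk doesn’t append ORS, the END rule adds the newline without reintroducing line breaks between records.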
So far, we’ve learned several approaches to remove whitespace characters from input files. These solutions will work for all ASCII text files.
In our day-to-day work, most text files we need to work with are ASCII text files. However, Unicode also defines whitespace characters outside the ASCII range.
Now, let’s discuss the handling of Unicode characters. We assume our default locale is en_US.utf-8.
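Since the behavior of text tools can depend on the active locale, it may be worth confirming the character encoding before running Unicode-aware commands. The following sketch uses the locale command; the function name is_utf8_locale is hypothetical, and the accepted spellings of the encoding name are an assumption:

```shell
# is_utf8_locale: hypothetical helper; succeeds if the active locale's
# character encoding is UTF-8. "locale charmap" prints that encoding.
is_utf8_locale() {
  case "$(locale charmap 2>/dev/null)" in
    UTF-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

if is_utf8_locale; then
  echo "UTF-8 locale: multibyte characters can be matched as single characters"
else
  echo "non-UTF-8 locale: tools may process the input byte by byte"
fi
```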
First of all, let’s see an input file that contains non-ASCII Unicode characters:
$ cat raw_unicode.txt
Some Non-whitespace Unicode Characters:
[Check Mark]: U+2714 (✔)
[Cross Mark]: U+2716 (✖)
Some Unicode Whitespace Characters:
[Figure Space]: U+2007 ( )
[Thin Space]: U+2009 ( )
[Paragraph Separator]: U+2029 (
)
[Ideographic Space]: U+3000 ( )
In this file, we have six Unicode characters in the format: [Name]: Code_In_Hex (The Character)
Now, let’s try to remove horizontal whitespaces using our tr solution from this raw_unicode.txt file:
$ tr -d "[:blank:]" < raw_unicode.txt
SomeNon-whitespaceUnicodeCharacters:
[CheckMark]:U+2714(✔)
[CrossMark]:U+2716(✖)
SomeUnicodeWhitespaceCharacters:
[FigureSpace]:U+2007( )
[ThinSpace]:U+2009( )
[ParagraphSeparator]:U+2029(
)
[IdeographicSpace]:U+3000( )
As the output shows, all ASCII horizontal whitespace, such as spaces, has been removed. However, the non-ASCII Unicode whitespaces in the parentheses are still there.
This illustrates that when files contain Unicode characters, things work a little differently. When working with Unicode files in Linux, it’s common for previously tested commands or scripts to suddenly stop working as expected.
Therefore, before we focus on removing Unicode whitespaces, it’s worthwhile to test whether our file contains Unicode characters.
First of all, we can use the file command to test if a text file contains ASCII or Unicode:
$ file raw_file.txt
raw_file.txt: ASCII text
$ file raw_unicode.txt
raw_unicode.txt: Unicode text, UTF-8 text
The output shows us which file contains Unicode characters.
So, in practice, if our scripts suddenly don’t work on a particular file, we may want first to check if the file contains Unicode characters.
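When a yes/no answer is needed inside a script, a byte-level check can complement the file command. The sketch below only detects non-ASCII bytes, not whitespace specifically; the function name has_non_ascii and the two temporary files are hypothetical:

```shell
# has_non_ascii: hypothetical helper; succeeds if the file named by $1
# contains at least one byte outside the ASCII range.
has_non_ascii() {
  # In the C locale, tr works on single bytes. Deleting octal 000-177
  # (all of ASCII) leaves only non-ASCII bytes, which wc -c then counts.
  [ "$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)" -gt 0 ]
}

ascii_file=$(mktemp)
utf8_file=$(mktemp)
printf 'plain text\n' > "$ascii_file"
printf 'figure\342\200\207space\n' > "$utf8_file"   # U+2007 as UTF-8 bytes

has_non_ascii "$ascii_file" || echo "ASCII only"
has_non_ascii "$utf8_file" && echo "contains non-ASCII bytes"
```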
Unfortunately, there is no standard character class that is guaranteed to match all Unicode whitespaces. However, Unicode assigns the White_Space property to only 25 characters in total, and most of them are outside the ASCII range.
Therefore, we can build our own “character class” to contain all these characters:
SPACES=$(printf "%b" "\U00A0\U1680\U2000\U2001\U2002\U2003\U2004\U2005\U2006\U2007\U2008\U2009\U200A\U2028\U2029\U202F\U205F\U3000")
As the statement above shows, we saved the non-ASCII Unicode whitespaces in a shell variable called $SPACES. The list leaves out one rarely used control character, next line (U+0085).
Then, if we want to remove all Unicode whitespaces, we can build a Regex character class “[$SPACES]” to do the substitution.
Next, let’s remove all horizontal whitespaces, including non-ASCII ones, from the raw_unicode.txt file using the sed command:
$ sed "s/[[:blank:]$SPACES]//g" raw_unicode.txt
SomeNon-whitespaceUnicodeCharacters:
[CheckMark]:U+2714(✔)
[CrossMark]:U+2716(✖)
SomeUnicodeWhitespaceCharacters:
[FigureSpace]:U+2007()
[ThinSpace]:U+2009()
[ParagraphSeparator]:U+2029()
[IdeographicSpace]:U+3000()
As we can see in the output above, the sed command has removed all horizontal whitespaces, including those non-ASCII ones in the parentheses. Also, the Unicode characters ‘✔’ and ‘✖’ are still there.
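The \U escapes used to build $SPACES aren’t understood by every shell’s printf. As a hedged alternative, the same 18 characters can be spelled as raw UTF-8 bytes with octal escapes, which POSIX printf supports. In this sketch, the names USPACES and strip_blank_unicode are hypothetical, and a UTF-8 locale is assumed:

```shell
# USPACES: the same characters as $SPACES, written as UTF-8 bytes in octal.
USPACES=$(
  printf '\302\240'                                          # U+00A0
  printf '\341\232\200'                                      # U+1680
  printf '\342\200\200\342\200\201\342\200\202\342\200\203'  # U+2000 to U+2003
  printf '\342\200\204\342\200\205\342\200\206\342\200\207'  # U+2004 to U+2007
  printf '\342\200\210\342\200\211\342\200\212'              # U+2008 to U+200A
  printf '\342\200\250\342\200\251'                          # U+2028 and U+2029
  printf '\342\200\257'                                      # U+202F
  printf '\342\201\237'                                      # U+205F
  printf '\343\200\200'                                      # U+3000
)

# strip_blank_unicode: hypothetical wrapper around the sed command above.
strip_blank_unicode() {
  sed "s/[[:blank:]$USPACES]//g" "$@"
}

# Illustrative input containing U+2007, a space, a tab, and U+3000.
printf 'a\342\200\207b \tc\343\200\200d\n' | strip_blank_unicode
```

In a non-UTF-8 locale, sed may treat the bracket expression as a set of single bytes, which can damage other multibyte characters, so checking the locale first is advisable.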
Finally, let’s see another example to remove all whitespaces from the file using the awk command:
$ awk -v ORS="" -v uspaces="$SPACES" '{gsub("[[:space:]"uspaces"]",""); print}' raw_unicode.txt
SomeNon-whitespaceUnicodeCharacters:[CheckMark]:U+2714(✔)[CrossMark]:U+2716(✖)SomeUnicodeWhitespaceCharacters:[FigureSpace]:U+2007()[ThinSpace]:U+2009()[ParagraphSeparator]:U+2029()[IdeographicSpace]:U+3000()
In this article, we learned how to remove whitespace from a text file using the Linux command line.
Though the requirement looks pretty simple, it can have a few variations. We looked at how to solve horizontal and vertical whitespace removal with a few common Linux tools.
We also saw that when a file contains non-ASCII Unicode whitespaces, we need to handle them differently.
Additionally, we’ve learned how to check if a text file contains Unicode characters using the file command.