 
Last updated: March 18, 2024
Often, tasks require working on a text file without breaking its formatting. For example, we might want to remove most characters of a certain group while preserving some of them, depending on their role.
In this tutorial, we explore character notations and ways to remove whitespace characters from a text file without modifying or deleting any newlines. First, we talk about whitespace and line endings. Next, we dive deep into character notation and grouping. Finally, by leveraging our knowledge, we solve the task at hand efficiently.
For clarity, we use the <> angle bracket notation to represent characters. Importantly, we must have a way to write out each character we need.
We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments unless otherwise specified.
Perhaps one of the biggest differences between text and binary files is that text files are usually formatted by rows.
For example, we might have a text file with three lines:
$ cat --show-all textfile
Text^ILine 1.$
Text^ILine 2.$
Text^ILine 3.$
Indeed, text files can have only one line. Even then, they usually still end with the standard newline:
$ cat --show-all textfile
Text^ILine 1.$
In both examples above, we use cat with --show-all to display whitespace characters other than spaces in a special representation: ^I stands for a <TAB>, while $ marks the end of a line, right before the <LF>.
Such whitespace characters are one of the main ways to format textual data.
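To make the whitespace in such a file visible without relying on GNU-specific cat options, we can also dump its bytes. The snippet below is a minimal sketch that assumes Bash and a throwaway file created with mktemp; the variable names are only illustrative:

```shell
#!/usr/bin/env bash
# Create a throwaway sample so that no real file gets overwritten.
sample=$(mktemp)
printf 'Text\tLine 1.\nText\tLine 2.\n' > "$sample"

# od -c prints one token per byte, so <TAB> appears as \t and <LF> as \n.
od -An -c "$sample"

# Count the characters of interest: tabs and newlines.
tabs=$(tr -cd '\t' < "$sample" | wc -c)
newlines=$(wc -l < "$sample")
echo "tabs: $((tabs)), newlines: $((newlines))"

rm -f "$sample"
```

Counting newlines before and after any cleanup is a quick way to confirm that the line structure stays intact.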
Moreover, the default newline character in Linux is a <LF> line feed. However, since they can be hard to input directly or insert into expressions, many tools accept different notations for whitespace characters.
To remove certain characters from any text, we need to know what tool to use and how to designate them:
+---------------------------------------------------------------------------------------------------------+
|     POSIX     | PCRE |  ANSI-C | Name(s)      | Group or Range                   | Description           |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \0   | \0      | <NUL>        | [\x00]                           | null                  |
| -             | \t   | \t      | <TAB>        | [\x09]                           | horizontal tabulation |
| -             | \n   | \n *    | <LF>         | [\x0A]                           | line feed             |
| -             | -    | -       | <VT>         | [\x0B]                           | vertical tabulation   |
| -             | \f   | \f      | <FF>         | [\x0C]                           | form feed             |
| -             | \r   | \r      | <CR>         | [\x0D]                           | carriage return       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:blank:]     | \h   | \t, " " | <TAB>, <SP>  | [\x09\x20] or [\t ]              | horizontal whitespace |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:blank:]]  | \H   | not     | not          | [^\x09\x20] or [^\t ]            | not                   |
|               |      | \t, " " | <TAB>, <SP>  |                                  | horizontal whitespace |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \v   | -       | <LF>, <VT>,  | [\x0A-\x0D]                      | vertical whitespace   |
|               |      |         | <FF>, <CR>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \V   | -       | not          | [^\x0A-\x0D]                     | not                   |
|               |      |         | <LF>, <VT>,  |                                  | vertical whitespace   |
|               |      |         | <FF>, <CR>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:space:]     | \s   | \t, \n, | <TAB>, <LF>, | [\x09-\x0D\x20]                  | all whitespace        |
|               |      | *v, \f, | <VT>, <FF>,  |                                  |                       |
|               |      | \r, " " | <CR>, <SP>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:space:]]  | \S   | not     | not          | [^\x09-\x0D\x20]                 | not whitespace        |
|               |      | \t, \n, | <TAB>, <LF>, |                                  |                       |
|               |      | *v, \f, | <VT>, <FF>,  |                                  |                       |
|               |      | \r, " " | <CR>, <SP>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:cntrl:]     | -    | -       | <NUL>-<US>,  | [\x00-\x1F\x7F]                  | control character     |
|               |      |         | <DEL>        |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:punct:]     | -    | -       | -            | [!"\#$%&'()*+,\-./:;<=>?@        | punctuation, symbol   |
|               |      |         |              |  \[\\\]^_`{|}~] or               |                       |
|               |      |         |              | [\x21-\x2f\x3a-\x40              |                       |
|               |      |         |              |  \x5b-\x60\x7b-\x7e]             |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:digit:]     | \d   | -       | -            | [0-9] or [\x30-\x39]             | decimal digit         |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:digit:]]  | \D   | -       | -            | [^0-9] or [^\x30-\x39]           | not decimal digit     |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:xdigit:]    | -    | -       | -            | [0-9a-fA-F] or                   | hexadecimal digit     |
|               |      |         |              | [\x30-\x39\x61-\x66\x41-\x46]    |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:xdigit:]] | -    | -       | -            | [^0-9a-fA-F] or                  | not hexadecimal digit |
|               |      |         |              | [^\x30-\x39\x61-\x66\x41-\x46]   |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:alpha:]     | -    | -       | -            | [a-zA-Z] or [\x61-\x7A\x41-\x5A] | letter                |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:lower:]     | -    | -       | -            | [a-z] or [\x61-\x7A]             | lowercase letter      |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:upper:]     | -    | -       | -            | [A-Z] or [\x41-\x5A]             | uppercase letter      |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:alnum:]     | -    | -       | -            | [0-9a-zA-Z] or                   | alphanumeric          |
|               |      |         |              | [\x30-\x39\x61-\x7A\x41-\x5A]    | character             |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:graph:]     | -    | -       | not <SP>,    | [^\x20\x00-\x1F\x7F]             | visible character     |
|               |      |         | <NUL>-<US>,  | [\x21-\x7E]                      | all except control    |
|               |      |         | <DEL>        | [!-~]                            | characters and space  |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:print:]     | -    | -       | not          | [^\x00-\x1F\x7F]                 | printable character   |
|               |      |         | <NUL>-<US>,  | [\x20-\x7E]                      | all except control    |
|               |      |         | <DEL>        | [ -~]                            | characters            |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:word:]      | \w   | -       | -            | [_0-9a-zA-Z] or                  | word character        |
|               |      |         |              | [\x5F\x30-\x39                   | alphanumeric or _     |
|               |      |         |              |  \x61-\x7A\x41-\x5A]             |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:word:]]   | \W   | -       | -            | [^_0-9a-zA-Z] or                 | not word character    |
|               |      |         |              | [^\x5F\x30-\x39                  | not alphanumeric or _ |
|               |      |         |              |  \x61-\x7A\x41-\x5A]             |                       |
+----------------------------------------------------------------------------------------------------------+
Importantly, these may not be one-to-one equivalents, mainly due to restrictions in encoding, but they should match the same on most platforms. In fact, for robust results, it’s usually best to use Group or Range, as we see below.
Naturally, we can directly type out the characters that our keyboard provides, which includes some whitespace such as <SP> and <TAB>.
Notably, due to the limited number of keys on a regular keyboard, this usually isn’t possible for all whitespace characters, let alone every known character.
In addition, the method isn’t recommended due to potential problems with functional expression formatting. In particular, directly inserting a newline within a replacement command can be confusing, error-prone, or even wrong, depending on the tool.
To enable the entry of any character despite input method limitations, the ANSI-C standard supports \ or \x combinations.
In short, we can follow a \ backslash with an octal or \x with a hexadecimal number from the encoding table to represent an actual character:
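As a brief illustration, the following sketch assumes Bash, whose $'...' quoting understands these escapes. It writes the same characters in octal, hexadecimal, and named form and checks that they are identical; the variable names are purely illustrative:

```shell
#!/usr/bin/env bash
# Three spellings of <TAB> using Bash ANSI-C quoting ($'...').
tab_octal=$'\011'   # backslash plus octal value
tab_hex=$'\x09'     # \x plus hexadecimal value
tab_named=$'\t'     # named character constant

if [ "$tab_octal" = "$tab_hex" ] && [ "$tab_hex" = "$tab_named" ]; then
  echo "all three are the same character"
fi

# The same idea for <LF>: octal 012 equals hexadecimal 0A.
lf_octal=$'\012'
lf_hex=$'\x0A'
if [ "$lf_octal" = "$lf_hex" ]; then
  echo "line feed matches too"
fi
```

Other shells and tools may support only a subset of these forms, so octal tends to be the most portable choice.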
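An extension of this notation is \u, which takes a Unicode code point and can therefore represent characters beyond a single byte, regardless of whether the encoding in use is UTF-8, UTF-16, or UTF-32.
The ANSI-C standard also defines a number of named character constants, such as \t, \n, \f, and \r.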
While these aren’t universally accepted, the ANSI-C standard is common enough for them to be recognized by many tools.
Further, languages like Python and Java guarantee that \n is \x0A and \r is \x0D. However, UNIX, C, Perl, and others only stipulate two rules for \n and \r:
Because of this, we may get confusing results, especially with <CR><LF> protocols and platforms like the Internet Protocol (IP) and Microsoft Windows.
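To check whether a given file uses <CR><LF> endings before processing it, we can count the lines that contain a <CR>. This is only a sketch under the assumption that Bash and grep are available; count_cr_lines is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Report how many lines of a file contain a <CR> (\x0D).
# A count equal to the total line count suggests <CR><LF> line endings.
count_cr_lines() {
  grep -c $'\r' "$1"
}

demo=$(mktemp)
printf 'unix line\nwindows line\r\n' > "$demo"
count_cr_lines "$demo"
rm -f "$demo"
```

Since <CR> belongs to [:space:], any removal based on that class also strips it, which effectively converts <CR><LF> endings to plain <LF>.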
At a higher level, we can have groupings of characters.
On the one hand, there are the POSIX Basic Regular Expression (BRE) character class bracket expressions, such as [:blank:], [:space:], and [:cntrl:] from the POSIX column of the table above.
In particular, all of these character groups and classes depend on the current locale, encompass only ASCII characters in the default C locale, and usually appear in double [] square brackets as [[:CHARCLASS:]].
On the other hand, we can use specific Perl Compatible Regular Expressions (PCRE) character types, such as \h, \v, \s, and \d from the PCRE column of the same table.
While POSIX BRE doesn’t recognize the above, PCRE can understand both PCRE native character types and POSIX BRE character classes.
Depending on the regular expression (regex) flavor, some tools and languages might be able to do so as well. Now, let’s see utilities for removing whitespace from a file except for newline.
Since we want to match all whitespace except the newline, we use regular expressions that combine the horizontal and vertical whitespace characters other than <LF>: <TAB>, <SP>, <VT>, <FF>, and <CR>.
Of course, \h and \v contain all of the above by definition. However, we can’t define these groups exactly as ranges with POSIX BRE, but we can get close with [[:space:]] as long as we don’t process the text as a whole and only get it line by line. Alternatively, we can use [[:blank:]], but that only works for <TAB> and <SP>.
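The difference between the two classes shows up as soon as a line contains <VT> or <FF>. The sketch below assumes a sed that supports POSIX character classes; strip_blank and strip_space are hypothetical helper names:

```shell
#!/usr/bin/env bash
# Remove only horizontal whitespace: <TAB> and <SP>.
strip_blank() { sed 's/[[:blank:]]//g'; }

# Remove all whitespace that sed sees within a line; the trailing <LF>
# is not part of the pattern space, so it survives.
strip_space() { sed 's/[[:space:]]//g'; }

# One line with <TAB>, <SP>, <VT> (\013), and <FF> (\014) between letters.
printf 'a\tb c\013d\014e\n' | strip_blank | od -An -c
printf 'a\tb c\013d\014e\n' | strip_space | od -An -c
```

With [[:blank:]], the <VT> and <FF> bytes remain in the output, whereas [[:space:]] removes them as well.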
In most examples below, we either use redirection or pipes. In all cases, we assume the file textfile contains the whitespace we want to remove without touching the newlines.
Using bash alone, we can achieve our goal:
$ cat textfile | while IFS= read -r line; do
  while [[ $line =~ (.*)[[:space:]]+(.*) ]]; do
    line=${BASH_REMATCH[1]}${BASH_REMATCH[2]}
  done
  printf '%s\n' "$line"
done
First, we pipe the contents from cat to a while loop with read, which consumes the newline separators. Here, IFS= and -r prevent read from trimming the line or interpreting backslashes. After that, we match each $line for any amount of white[[:space:]] within it with (.*)[[:space:]]+(.*). Moreover, the =~ operator fills the $BASH_REMATCH variable as an array, starting with the whole match, followed by each matching group in order. Thus, the assignment in the nested loop skips over any found whitespace.
After we process the line, we print it back via printf with an explicit newline suffix, which also works for lines that look like echo options.
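As an alternative that isn't part of the approach above, Bash pattern substitution can drop the whitespace in a single expansion instead of a nested regex loop. This is a minimal sketch with a hypothetical function name, strip_ws, which reads from standard input:

```shell
#!/usr/bin/env bash
strip_ws() {
  local line
  # IFS= and -r keep each line intact; the || clause also handles
  # a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # Replace every character of the [[:space:]] class with nothing.
    printf '%s\n' "${line//[[:space:]]/}"
  done
}

printf '  Text\tLine 1. \n' | strip_ws   # prints "TextLine1."
```

Note that this version always terminates the last line with a newline, even if the input didn't.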
With tr alone, we can leverage octal escape values for any characters:
$ tr --delete '\011\013-\015\040' < textfile
In particular, we use --delete or -d and the range for [:space:] excluding the UNIX newline: \x0A or \012.
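To confirm that such a tr invocation leaves the line structure alone, we can compare line counts before and after. The sketch below uses the short -d option for portability; strip_with_tr and the temporary file are illustrative assumptions:

```shell
#!/usr/bin/env bash
strip_with_tr() {
  # <TAB> (\011), <VT> through <CR> (\013-\015), and <SP> (\040);
  # <LF> (\012) is deliberately left out of the set.
  tr -d '\011\013-\015\040'
}

demo=$(mktemp)
printf 'Text\tLine 1. \r\n  Text Line 2.\n' > "$demo"

before=$(wc -l < "$demo")
after=$(strip_with_tr < "$demo" | wc -l)
echo "lines before: $((before)), after: $((after))"

rm -f "$demo"
```

Because tr processes the stream byte by byte rather than line by line, leaving \012 out of the set is what protects the newlines.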
The awk command can leverage [[:space:]], since it reads the input record by record and the newline separators are never part of the text it processes:
$ awk '{gsub(/[[:space:]]/,""); print}' textfile
On the other hand, we can also employ ranges:
$ awk '{gsub(/[\x09\x0B-\x0D\x20]/,""); print}' textfile
As expected, this only covers the ASCII table.
Similar to awk, sed understands [[:space:]]:
$ sed 's/[[:space:]]//g' textfile
Also, we can again use ranges:
$ sed 's/[\x09\x0B-\x0D\x20]//g' textfile
Like before, only ASCII is allowed.
Of course, we can employ the perl interpreter itself:
$ perl -pe 's/[^\S\n]//mg;' textfile
We [-e]xecute this one-liner, [p]rinting each line by default. The code itself is a basic [g]lobal regular expression [s]ubstitution command over [m]ultiple lines.
In particular, the regular expression leverages double negation by using an ^ inverse group match with an inverse character type \S. This is a way to use a class but exclude some characters from it. In this case, we exclude \n from the match, but still remove all other [\h]orizontal and [\v]ertical whitespace.
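Since all of these tools should produce the same result, a small cross-check can catch typos in ranges or escapes. The following sketch assumes that tr, awk, sed, and perl are installed; the via_* and compare_tools names are hypothetical:

```shell
#!/usr/bin/env bash
via_tr()   { tr -d '\011\013-\015\040' < "$1"; }
via_awk()  { awk '{gsub(/[[:space:]]/,""); print}' "$1"; }
via_sed()  { sed 's/[[:space:]]//g' "$1"; }
via_perl() { perl -pe 's/[^\S\n]//g' "$1"; }

# Compare each tool against the tr output for the same file.
# Command substitution drops trailing newlines, so only content is compared.
compare_tools() {
  local file=$1 reference tool
  reference=$(via_tr "$file")
  for tool in via_awk via_sed via_perl; do
    if [ "$("$tool" "$file")" = "$reference" ]; then
      echo "$tool: matches tr"
    else
      echo "$tool: differs from tr"
    fi
  done
}

demo=$(mktemp)
printf 'Text\tLine 1. \n  Text\tLine 2.\n' > "$demo"
compare_tools "$demo"
rm -f "$demo"
```

A mismatch usually points to a tool or locale that interprets a class or escape differently, which is worth checking before running any of the commands on real data.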
In this article, we talked about removing whitespace while preserving newlines.
In conclusion, although a simple task in general, we can leverage our knowledge about regular expressions to optimize the solutions, regardless of the tool we use.