1. Introduction

Often, tasks require working on a text file without breaking its formatting. For example, we might want to remove most characters of a certain group while preserving some of them, depending on their role.

In this tutorial, we explore character notations and ways to remove whitespace characters from a text file without modifying or deleting any newlines. First, we talk about whitespace and line endings. Next, we dive deep into character notation and grouping. Finally, by leveraging our knowledge, we solve the task at hand efficiently.

For clarity, we use the <> angle bracket notation to represent characters. Importantly, we must have a way to write out each character we need.

We tested the code in this tutorial on Debian 11 (Bullseye) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments unless otherwise specified.

2. Whitespace and Line Endings

Perhaps one of the biggest differences between text and binary files is that text files are usually formatted by rows.

For example, we might have a text file with three lines:

$ cat --show-all textfile
Text^ILine 1.$
Text^ILine 2.$
Text^ILine 3.$

Indeed, text files can have only one line. Even then, they usually still end with the standard newline:

$ cat --show-all textfile
Text^ILine 1.$

In both examples above, we use cat with --show-all to display whitespace characters, except spaces, in a special character representation:

  • <TAB>, horizontal tabulation, ^I
  • <LF>, line feed (newline), $ and an actual newline
  • <VT>, vertical tab, ^K
  • <FF>, form feed (new page), ^L
  • <CR>, carriage return, ^M
  • <SP>, space

Such whitespace characters are one of the main ways to format textual data.

Moreover, the default newline character in Linux is a <LF> line feed. However, since whitespace characters can be hard to input directly or insert into expressions, many tools accept different notations for them.
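To experiment with these representations, we can build a small file ourselves. The following is a minimal sketch: the temporary file and its contents are hypothetical, and -A (the short form of --show-all) assumes GNU cat:

```shell
# Hypothetical sample: a tab, a space followed by a form feed, and two newlines
sample=$(mktemp)
printf 'Text\tLine 1.\nText \fLine 2.\n' > "$sample"

# GNU cat: -A is the short form of --show-all
shown=$(cat -A "$sample")
printf '%s\n' "$shown"

# Clean up the temporary file
rm -f "$sample"
```

Here, the tab shows up as ^I, the form feed as ^L, and each line feed as a $ at the end of its line, while the space stays as it is.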

3. Character Notation and Grouping

To remove certain characters from any text, we need to know what tool to use and how to designate them:

+----------------------------------------------------------------------------------------------------------+
|     POSIX     | PCRE |  ANSI-C | Name(s)      | Group or Range                   | Description           |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \0   | \0      | <NUL>        | [\x00]                           | null                  |
| -             | \t   | \t      | <TAB>        | [\x09]                           | horizontal tabulation |
| -             | \n   | \n *    | <LF>         | [\x0A]                           | line feed             |
| -             | -    | \v      | <VT>         | [\x0B]                           | vertical tabulation   |
| -             | \f   | \f      | <FF>         | [\x0C]                           | form feed             |
| -             | \r   | \r      | <CR>         | [\x0D]                           | carriage return       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:blank:]     | \h   | \t, " " | <TAB>, <SP>  | [\x09\x20] or [\t ]              | horizontal whitespace |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:blank:]]  | \H   | not     | not          | [^\x09\x20] or [^\t ]            | not                   |
|               |      | \t, " " | <TAB>, <SP>  |                                  | horizontal whitespace |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \v   | -       | <LF>, <VT>,  | [\x0A-\x0D]                      | vertical whitespace   |
|               |      |         | <FF>, <CR>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| -             | \V   | -       | not          | [^\x0A-\x0D]                     | not                   |
|               |      |         | <LF>, <VT>,  |                                  | vertical whitespace   |
|               |      |         | <FF>, <CR>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:space:]     | \s   | \t, \n, | <TAB>, <LF>, | [\x09-\x0D\x20]                  | all whitespace        |
|               |      | \v, \f, | <VT>, <FF>,  |                                  |                       |
|               |      | \r, " " | <CR>, <SP>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:space:]]  | \S   | not     | not          | [^\x09-\x0D\x20]                 | not whitespace        |
|               |      | \t, \n, | <TAB>, <LF>, |                                  |                       |
|               |      | \v, \f, | <VT>, <FF>,  |                                  |                       |
|               |      | \r, " " | <CR>, <SP>   |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:cntrl:]     | -    | -       | <NUL>-<US>,  | [\x00-\x1F\x7F]                  | control character     |
|               |      |         | <DEL>        |                                  |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:punct:]     | -    | -       | -            | [!"\#$%&'()*+,\-./:;<=>?@        | punctuation, symbol   |
|               |      |         |              |  \[\\\]^_`{|}~] or               |                       |
|               |      |         |              | [\x21-\x2f\x3a-\x40              |                       |
|               |      |         |              |  \x5b-\x60\x7b-\x7e]             |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:digit:]     | \d   | -       | -            | [0-9] or [\x30-\x39]             | decimal digit         |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:digit:]]  | \D   | -       | -            | [^0-9] or [^\x30-\x39]           | not decimal digit     |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:xdigit:]    | -    | -       | -            | [0-9a-fA-F] or                   | hexadecimal digit     |
|               |      |         |              | [\x30-\x39\x61-\x66\x41-\x46]    |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:xdigit:]] | -    | -       | -            | [^0-9a-fA-F] or                  | not hexadecimal digit |
|               |      |         |              | [^\x30-\x39\x61-\x66\x41-\x46]   |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:alpha:]     | -    | -       | -            | [a-zA-Z] or [\x61-\x7A\x41-\x5A] | letter                |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:lower:]     | -    | -       | -            | [a-z] or [\x61-\x7A]             | lowercase letter      |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:upper:]     | -    | -       | -            | [A-Z] or [\x41-\x5A]             | uppercase letter      |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:alnum:]     | -    | -       | -            | [0-9a-zA-Z] or                   | alphanumeric          |
|               |      |         |              | [\x30-\x39\x61-\x7A\x41-\x5A]    | character             |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:graph:]     | -    | -       | not <SP>,    | [^\x20\x00-\x1F\x7F]             | visible character     |
|               |      |         | <NUL>-<US>,  | [\x21-\x7E]                      | all except control    |
|               |      |         | <DEL>        | [!-~]                            | characters and space  |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:print:]     | -    | -       | not          | [^\x00-\x1F\x7F]                 | printable character   |
|               |      |         | <NUL>-<US>,  | [\x20-\x7E]                      | all except control    |
|               |      |         | <DEL>        | [ -~]                            | characters            |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [:word:]      | \w   | -       | -            | [_0-9a-zA-Z] or                  | word character        |
|               |      |         |              | [\x5F\x30-\x39                   | alphanumeric or _     |
|               |      |         |              |  \x61-\x7A\x41-\x5A]             |                       |
|---------------+------+---------+--------------+----------------------------------+-----------------------|
| [^[:word:]]   | \W   | -       | -            | [^_0-9a-zA-Z] or                 | not word character    |
|               |      |         |              | [^\x5F\x30-\x39                  | not alphanumeric or _ |
|               |      |         |              |  \x61-\x7A\x41-\x5A]             |                       |
+----------------------------------------------------------------------------------------------------------+

Importantly, these may not be one-to-one equivalents, mainly due to restrictions in encoding, but they should match the same on most platforms. In fact, for robust results, it’s usually best to use Group or Range, as we see below.
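As a quick check of that equivalence, the sketch below deletes horizontal whitespace once with the [:blank:] class and once with explicit octal values. The sample string is hypothetical, and we assume an ASCII-compatible locale:

```shell
# Hypothetical input: letters separated by a tab, a space, and a vertical tab
input=$(printf 'a\tb c\013d')

# The same deletion expressed as a POSIX class and as explicit octal values
by_class=$(printf '%s' "$input" | tr -d '[:blank:]')
by_range=$(printf '%s' "$input" | tr -d '\011\040')

# Both leave the vertical tab in place, since it isn't horizontal whitespace
[ "$by_class" = "$by_range" ] && echo "same result"
```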

3.1. Direct Insertion

Naturally, we can directly type out the characters that we can, which includes some whitespace:

  • Tab key – <TAB>, horizontal tabulation
  • Return key – <LF>, line feed (newline)
  • Space key – <SP>, space

Notably, due to the limited number of keys on a regular keyboard, this usually isn’t possible for all whitespace characters, let alone every known character.

In addition, the method isn’t recommended due to potential problems with functional expression formatting. In particular, directly inserting a newline within a replacement command can be confusing, error-prone, or even wrong, depending on the tool.

3.2. Escape Sequences

To enable the entry of any character despite input method limitations, the ANSI-C standard supports \ or \x combinations.

In short, we can follow a \ backslash with an octal or \x with a hexadecimal number from the encoding table to represent an actual character:

  • \x09, \011 – <TAB>, horizontal tabulation
  • \x0A, \012 – <LF>, line feed (newline)
  • \x0B, \013 – <VT>, vertical tab
  • \x0C, \014 – <FF>, form feed (new page)
  • \x0D, \015 – <CR>, carriage return
  • \x20, \040 – <SP>, space

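An extension of this notation is \u, which takes a Unicode code point instead of a single byte value, so it can represent characters beyond ASCII regardless of whether the text is encoded as UTF-8, UTF-16, or UTF-32.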
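Bash itself understands these sequences inside $'...' quoting, so we can confirm that the hexadecimal and octal forms denote the same character. This is a small sketch that assumes Bash, and the variable names are only illustrative:

```shell
# ANSI-C quoting in Bash expands backslash escapes inside $'...'
tab_hex=$'\x09'
tab_oct=$'\011'
space_hex=$'\x20'

# Hexadecimal and octal notations yield the same single character
[[ $tab_hex == "$tab_oct" ]] && echo "tab: identical"
[[ $space_hex == ' ' ]] && echo "space: identical"
```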

3.3. ANSI-C Escape Codes (\n)

The ANSI-C standard defines a number of character constants:

  • \t – <TAB>, horizontal tabulation
  • \n – <LF>, line feed (newline)
  • \f – <FF>, form feed (new page)
  • \r – <CR>, carriage return
  • \b – <BS>, backspace
  • \a – <BEL>, alarm bell
  • \0 – <NUL>, null character
  • \\ – backslash
  • \' – single quote

While these aren’t universally accepted, the ANSI-C standard is common enough for them to be recognized by many tools.

Further, languages like Python and Java guarantee that \n is \x0A and \r is \x0D. However, UNIX, C, Perl, and others only stipulate two rules for \n and \r:

  • internally, both are single-character values
  • text mode operations implicitly convert \n to the native newline of the platform (possibly more than one character), while binary mode uses the internal value

Because of this, we may get confusing results, especially with <CR><LF> conventions, such as those of Internet protocols like HTTP and SMTP, and platforms like Microsoft Windows.
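The <CR><LF> case is easy to reproduce. In this sketch, which uses a hypothetical one-line sample, a Windows-style line keeps its <CR> after the <LF> is gone, so we have to delete it explicitly:

```shell
# Hypothetical Windows-style line; command substitution drops only the final <LF>
crlf_line=$(printf 'Line 1.\r\n')

# The carriage return is still part of the value: 7 visible characters plus <CR>
echo "${#crlf_line}"     # 8

# Deleting \r explicitly restores the expected length
clean_line=$(printf '%s' "$crlf_line" | tr -d '\r')
echo "${#clean_line}"    # 7
```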

3.4. POSIX BRE Regular Expression Groups

At a higher level, we can have groupings of characters.

On the one hand, there are the POSIX Basic Regular Expression (BRE) character class bracket expressions:

  • [:blank:] – <SP> and <TAB>
  • [:space:] – whitespace character
  • [:cntrl:] – control character
  • [:punct:] – punctuation or symbol
  • [:digit:] – decimal digit
  • [:xdigit:] – hexadecimal digit
  • [:alpha:] – letter
  • [:lower:] – lowercase letter
  • [:upper:] – uppercase letter
  • [:alnum:] – alphanumeric character ([:alpha:], [:digit:])
  • [:graph:] – visible character (all except [:space:] and [:cntrl:])
  • [:print:] – visible character and <SP> (all except [:cntrl:])
  • [:word:] – word character ([:alpha:], [:digit:], and _ underscore), a Perl and PCRE extension rather than a standard POSIX class

In particular, all of these character groups and classes depend on the current locale, encompass only ASCII characters in the default C (POSIX) locale, and usually appear in double [] square brackets as [[:CHARCLASS:]].
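For instance, a class only has its special meaning inside a bracket expression, where we can also combine it with other classes or negate it. The following sketch uses sed on a hypothetical string and pins the locale to C so that only ASCII applies:

```shell
# Class inside a bracket expression: delete every decimal digit
no_digits=$(printf 'a1_b2\n' | LC_ALL=C sed 's/[[:digit:]]//g')
echo "$no_digits"     # a_b

# Two classes in one negated bracket expression: keep only letters and digits
alnum_only=$(printf 'a1_b2\n' | LC_ALL=C sed 's/[^[:alpha:][:digit:]]//g')
echo "$alnum_only"    # a1b2
```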

3.5. PCRE Regular Expression Types

On the other hand, we can use specific Perl Compatible Regular Expressions (PCRE) character types:

  • \h – <SP> and <TAB>, along with some Unicode characters
  • \H – all except \h
  • \v – <LF>, <VT>, <FF> and <CR>, along with some Unicode characters
  • \V – all except \v
  • \s – whitespace character
  • \S – all except \s
  • \d – decimal digits
  • \D – all except \d
  • \w – word character ([:alpha:], [:digit:], and _ underscore)
  • \W – all except \w

While POSIX BRE doesn’t recognize the above, PCRE can understand both PCRE native character types and POSIX BRE character classes.

Depending on the regular expression (regex) flavor, some tools and languages might be able to do so as well. Now, let’s see utilities for removing whitespace from a file except for newline.

4. Replace All Whitespace Except Newline

Since we want to match everything but a newline, we use regular expressions in combination with horizontal and vertical whitespace characters:

  • horizontal whitespace – \x20 and \x09 from POSIX, adding \xA0, \u1680, \u180E, \u2000-\u200A, \u202F, \u205F, and \u3000 from PCRE
  • vertical whitespace – \x0A-\x0D from POSIX, adding \u0085, \u2028, and \u2029 from PCRE

Of course, \h and \v contain all of the above by definition. However, we can’t define these groups exactly as ranges with POSIX BRE, but we can get close with [[:space:]] as long as we don’t process the text as a whole and only get it line by line. Alternatively, we can use [[:blank:]], but that only works for <TAB> and <SP>.

In most examples below, we either use redirection or pipes. In all cases, we assume the file textfile contains the whitespace we want to remove without touching the newlines.

4.1. Using Bash

Using bash alone, we can achieve our goal:

$ cat textfile | while IFS= read -r line; do
  # each pass removes the whitespace matched between the two capture groups
  while [[ $line =~ (.*)[[:space:]]+(.*) ]]; do
    line=${BASH_REMATCH[1]}${BASH_REMATCH[2]}
  done
  printf '%s\n' "$line"
done

First, we pipe the contents from cat to a while loop with read, which consumes the newline separators. Here, IFS= and -r prevent read from trimming the line or interpreting backslashes. After that, we match each $line against (.*)[[:space:]]+(.*) to find whitespace within it. Moreover, the =~ operator fills the $BASH_REMATCH array, starting with the whole match, followed by each capture group in order. Thus, the assignment in the nested loop joins the text before and after the matched whitespace, and the loop repeats until none is left.

After we strip the line, we print it back with printf, which appends the newline.
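As an alternative to the regex loop, Bash pattern substitution can strip a whole line in one step. The sketch below isn't part of the approach above: strip_ws is a hypothetical helper name, and the extra [[ -n $line ]] test is an assumption we add to keep a final line that lacks a trailing newline:

```shell
# Hypothetical helper: remove whitespace from each line of standard input
strip_ws() {
  local line
  # IFS= and -r keep read from trimming the line or interpreting backslashes
  while IFS= read -r line || [[ -n $line ]]; do
    # ${var//pattern/} deletes every match of the glob pattern
    printf '%s\n' "${line//[[:space:]]/}"
  done
}

# Sample input whose last line has no trailing newline
result=$(printf 'Text\tLine 1.\nText  Line 2.' | strip_ws)
printf '%s\n' "$result"
```

With the function defined, we can also call it as strip_ws < textfile.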

4.2. Using tr

With tr alone, we can leverage octal escape values for any characters:

$ tr --delete '\011\013-\015\040' < textfile

In particular, we use --delete or -d and the range for [:space:] excluding the UNIX newline: \x0A or \012.
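One way to check a delete set of this kind is to feed tr a hypothetical sample that contains every ASCII whitespace character and inspect the result with od. Here, the set is written in octal as <TAB>, the <VT> to <CR> range, and <SP>:

```shell
# Hypothetical sample with <TAB>, <SP>, <VT>, <FF>, <CR>, and two <LF>
stripped=$(printf 'Text\tLine 1.\r\nText \013\014Line 2.\n' |
  tr -d '\011\013-\015\040')

# od -c makes any leftover whitespace visible; only \n should remain
printf '%s\n' "$stripped" | od -An -c
```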

4.3. Using awk

The awk command can leverage [[:space:]] safely, since it reads the input record by record and the newline separators never become part of the text that gsub() processes:

$ awk '{gsub(/[[:space:]]/,""); print}' textfile

On the other hand, we can also employ ranges:

$ awk '{gsub(/[\x09\x0B-\x0D\x20]/,""); print}' textfile

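As expected, this only covers the ASCII table. Also, \x escapes in regular expressions are an extension that POSIX awk doesn't require, so octal escapes like \011 are the more portable choice.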
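Since awk splits its input into records before gsub() ever runs, empty lines survive as well. Here's a small sketch on hypothetical input, assuming an awk that supports POSIX character classes, as gawk and recent mawk versions should:

```shell
# Hypothetical input: padded text, an empty line, and more padded text
out=$(printf ' a\tb \n\n c d\n' | awk '{ gsub(/[[:space:]]/, ""); print }')
printf '%s\n' "$out"

# The number of lines is unchanged
printf '%s\n' "$out" | wc -l
```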

4.4. Using sed

Similar to awk, sed understands [[:space:]]:

$ sed 's/[[:space:]]//g' textfile

Also, we can again use ranges:

$ sed 's/[\x09\x0B-\x0D\x20]//g' textfile

Like before, this only covers ASCII. Also, the \x escapes here are a GNU sed extension.
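The choice between [[:space:]] and [[:blank:]] matters once other whitespace is present. In this sketch, a hypothetical <CR><LF> line shows that [[:blank:]] leaves the carriage return behind, while [[:space:]] removes it:

```shell
# Hypothetical Windows-style line: space, tab, then <CR><LF>
blank=$(printf 'a \tb\r\n' | sed 's/[[:blank:]]//g')
space=$(printf 'a \tb\r\n' | sed 's/[[:space:]]//g')

echo "${#blank}"    # 3: a, b, and the remaining <CR>
echo "${#space}"    # 2: only a and b
```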

4.5. Using Perl

Of course, we can employ the perl interpreter itself:

$ perl -pe 's/[^\S\n]//mg;' textfile

We [-e]xecute this one-liner, [p]rinting each line by default. The code itself is a basic [g]lobal regular expression [s]ubstitution command. Although the [m]ultiline flag is present, it only changes how the ^ and $ anchors behave, so it's optional here.

In particular, the regular expression leverages double negation by using an ^ inverse group match with an inverse character type \S. This is a way to use a class but exclude some characters from it. In this case, we omit \n from the match, while all other [\h]orizontal and [\v]ertical whitespace still matches and gets removed.

5. Summary

In this article, we talked about removing whitespace while preserving newlines.

In conclusion, although the task is simple in general, we can leverage our knowledge of regular expressions to optimize the solution, regardless of the tool we use.
