Converting UTF-8 to ASCII | Baeldung on Linux

1. Overview

In this tutorial, we’ll discuss how to convert text from one character encoding to another, specifically from UTF-8 to ASCII. Character encoding plays a crucial role in software because it determines how text is stored and whether it displays correctly across systems.

ASCII is one of the oldest character encodings, and its 128 characters form the first 128 code points of Unicode. UTF-8 encodes these code points with the same single-byte values as ASCII, so any pure ASCII file is also a valid UTF-8 file. UTF-8 is a byte-oriented Unicode encoding with no byte-order issues, and it’s the most commonly used character encoding. It can represent every Unicode code point, which covers over 140,000 characters.

Since UTF-8 can represent far more characters than ASCII, the non-ASCII characters must be transliterated, escaped, or removed during conversion. We’ll look at several ways to do this: iconv, uni2ascii, and text transformation tools such as sed and awk.
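To make the relationship between the two encodings concrete, the short sketch below prints the bytes of an ASCII character and of a non-ASCII character as UTF-8 encodes them. The hex_bytes helper is a hypothetical name introduced for illustration; it relies only on the standard od and tr utilities.

```shell
# hex_bytes (hypothetical helper): print stdin as one contiguous hex string.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
    echo
}

# 'A' is an ASCII character: a single byte below 0x80, identical in ASCII and UTF-8.
printf 'A' | hex_bytes            # 41

# 'é' (U+00E9) is not in ASCII: UTF-8 encodes it as the two bytes 0xC3 0xA9.
# Octal escapes are used so the snippet does not depend on this file's encoding.
printf '\303\251' | hex_bytes     # c3a9
```

Any byte with a value of 0x80 or above therefore signals a character that has no direct ASCII representation.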

2. iconv

iconv is the preferred command on Linux systems due to its versatility. It decodes the byte sequences of the source encoding into characters and then re-encodes them using the conversion tables of the target encoding.

2.1. Installation

If iconv is missing from the system, we can install it on Debian-based distributions with the command below:

$ sudo apt install -y libc-bin

The basic syntax for converting a file from a source encoding to a target encoding is:

$ iconv -f [SOURCE_ENCODING] -t [TARGET_ENCODING] [INPUT_FILE] -o [OUTPUT_FILE]

Using the iconv command, we can convert a text file encoded in UTF-8 to a text file encoded in ASCII by running:

$ iconv -f UTF-8 -t ASCII input_utf8.txt -o output_ascii.txt
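One point worth checking before relying on the plain command: when the input contains characters that ASCII can’t represent, iconv stops and returns a non-zero exit status (this is the behavior of the glibc implementation, which is assumed here). The sketch below uses that status to test whether a file converts without loss. The is_ascii_convertible helper and the sample file names are hypothetical.

```shell
# is_ascii_convertible (hypothetical helper): succeeds only if every character
# in the UTF-8 file "$1" has an exact ASCII representation.
is_ascii_convertible() {
    iconv -f UTF-8 -t ASCII "$1" > /dev/null 2>&1
}

# Two small sample files; the 'é' in "café" is written as octal escapes.
printf 'plain text\n' > sample_plain.txt
printf 'caf\303\251\n' > sample_accented.txt

for f in sample_plain.txt sample_accented.txt; do
    if is_ascii_convertible "$f"; then
        echo "$f: converts to ASCII without loss"
    else
        echo "$f: contains characters outside ASCII"
    fi
done
```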

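iconv provides two suffixes for the target encoding that control how characters without an ASCII equivalent are handled: //TRANSLIT and //IGNORE. Without either of them, iconv stops with an error at the first such character.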

2.2. //TRANSLIT Option

The //TRANSLIT option tells iconv to transliterate. Transliteration is the process of representing characters from one script with the closest approximate characters of another script. When a character can’t be represented in the target encoding, iconv replaces it with an equivalent or similar-looking character. This keeps the output as close to the original text as the target encoding allows.

$ iconv -f UTF-8 -t ASCII//TRANSLIT input_utf8.txt -o output_ascii.txt

The output is an ASCII file in which the non-ASCII characters are transliterated. For example, every ‘ç’ becomes ‘c’.
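The transliteration step can be tried on a short string before converting a whole file. This is a minimal sketch, assuming an iconv that supports the //TRANSLIT suffix; to_ascii_translit is a hypothetical helper name. The exact replacements depend on the iconv implementation and the active locale, so no expected output is shown.

```shell
# to_ascii_translit (hypothetical helper): transliterate UTF-8 on stdin to ASCII.
to_ascii_translit() {
    iconv -f UTF-8 -t ASCII//TRANSLIT
}

# "façade, piñata" with the accented letters written as octal escapes.
# Characters without a known approximation may come out as a placeholder
# such as '?', depending on the implementation and locale.
if ! printf 'fa\303\247ade, pi\303\261ata\n' | to_ascii_translit; then
    echo "iconv reported characters it could not transliterate" >&2
fi
```

Checking a sample like this first shows which replacements the local setup actually produces.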

2.3. //IGNORE Option

Appending //IGNORE to the target encoding takes a different approach: iconv omits the characters that can’t be represented (the non-ASCII characters).

$ iconv -f UTF-8 -t ASCII//IGNORE input_utf8.txt -o output_ascii.txt

The output contains only ASCII characters. Non-ASCII characters such as é, ö, ç, ñ, and å, along with any other character outside the 128 characters of ASCII, are removed.
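The effect of //IGNORE is easy to see on a one-line sample. The following sketch assumes glibc’s iconv; drop_non_ascii is a hypothetical helper name. Some iconv versions may still exit with a non-zero status after dropping characters, so the status is deliberately ignored here.

```shell
# drop_non_ascii (hypothetical helper): remove every character that ASCII
# cannot represent from the UTF-8 text on stdin.
drop_non_ascii() {
    # The exit status and any diagnostics are ignored on purpose: iconv may
    # report an error even though the remaining text was written.
    iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null || true
}

# "café olé" with each 'é' written as the octal escape \303\251.
# The accented letters are expected to disappear from the output.
printf 'caf\303\251 ol\303\251\n' | drop_non_ascii
```

Unlike transliteration, this loses information, so it suits cases where the non-ASCII characters are not needed at all.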

3. uni2ascii Command

The uni2ascii command is mainly intended for working with data interchange formats. It converts non-ASCII characters to ASCII escape sequences, such as hexadecimal character references, so that they can be represented in ASCII. If uni2ascii isn’t present on the system, we can install it:

$ sudo apt-get install uni2ascii

The -a option selects the output format. For example, the H format converts non-ASCII characters to HTML-style hexadecimal numeric character references:

$ uni2ascii -a H < input.txt > output.txt
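To see what a given format produces, it can help to run uni2ascii on a single short line first. The sketch below is guarded so that it only runs when the tool is installed; to_html_hex_refs is a hypothetical helper name, and the description of the H format follows the uni2ascii manual page rather than a verified run, so no expected output is shown.

```shell
# to_html_hex_refs (hypothetical helper): convert stdin with the H format.
to_html_hex_refs() {
    uni2ascii -a H
}

if command -v uni2ascii > /dev/null 2>&1; then
    # "café" with 'é' written as octal escapes; per the manual page, the H
    # format should replace the 'é' with a hexadecimal character reference.
    printf 'caf\303\251\n' | to_html_hex_refs
else
    echo "uni2ascii is not installed; skipping the demonstration" >&2
fi
```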

Besides this format, uni2ascii supports several other conversion formats, including the ones below.

To convert to \u-escaped hexadecimal numbers, we use:

$ uni2ascii -a U < input.txt > output.txt

To convert to XML/HTML-style decimal character references, we use:

$ uni2ascii -a D < input.txt > output.txt

To convert to standard hexadecimal numbers with a 0x prefix, we use:

$ uni2ascii -a X < input.txt > output.txt

To convert to TeX format, we use:

$ uni2ascii -a T < input.txt > output.txt

To convert to \u-escaped hexadecimal for characters within the Basic Multilingual Plane and \U-escaped hexadecimal for characters outside it, we use:

$ uni2ascii -a L < input.txt > output.txt

Lastly, to convert to octal escapes, we use:

$ uni2ascii -a O < input.txt > output.txt

By default, only the non-ASCII characters are converted. The -p (pure) option converts the ASCII characters as well:

$ uni2ascii -a O -p < input.txt > output.txt

In the output, the ASCII characters of the input are written as octal escapes along with the non-ASCII ones.

4. Text Transformation

A less commonly used approach is text transformation with tools such as sed and awk. Like iconv with //IGNORE, they can remove the non-ASCII characters from the UTF-8 file entirely:

$ LC_ALL=C sed 's/[^[:print:]]//g' input_utf8.txt > output_ascii.txt

In this snippet, sed deletes every byte that isn’t a printable ASCII character. Setting LC_ALL=C matters because, in a UTF-8 locale, [:print:] also matches printable non-ASCII characters, so they wouldn’t be removed. In contrast to iconv, this approach also removes some ASCII characters, such as tabs and other control characters, so it leaves more room for error.
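Because [:print:] does not match tabs, the sed pattern deletes them along with the non-ASCII bytes. Where tabs need to survive, a variant that keeps tab, newline, and the printable ASCII range may be preferable. The following is a minimal sketch that uses tr for this rather than sed; keep_ascii_text and strip_non_printable are hypothetical helper names, and LC_ALL=C is set so that both tools work on bytes regardless of the current locale.

```shell
# keep_ascii_text (hypothetical helper): keep tab, newline, and the printable
# ASCII range (space through tilde); delete every other byte.
keep_ascii_text() {
    LC_ALL=C tr -cd '\t\n -~'
}

# strip_non_printable (hypothetical helper): the byte-wise sed variant,
# which deletes tabs as well as non-ASCII bytes.
strip_non_printable() {
    LC_ALL=C sed 's/[^[:print:]]//g'
}

# Sample: "a<TAB>b", then 'ñ' (octal \303\261), then "c".
printf 'a\tb\303\261c\n' | keep_ascii_text       # a<TAB>bc
printf 'a\tb\303\261c\n' | strip_non_printable   # abc
```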

We can also transliterate characters, but with this approach, each character and its replacement must be specified explicitly:

$ sed 's/ç/c/g' input_utf8.txt > output_ascii.txt

Here, the sed command replaces all the ‘ç’ characters with ‘c’ characters.
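Several such substitutions can be chained in a single sed call. This is a minimal sketch with a hypothetical helper name, translit_basic, and a deliberately short mapping list; it assumes the script itself is saved as UTF-8. Any character that is not listed passes through unchanged, so the result is only guaranteed to be ASCII if every non-ASCII character in the input has an entry.

```shell
# translit_basic (hypothetical helper): replace a few accented letters with
# unaccented ASCII counterparts; extend the list as needed.
translit_basic() {
    sed -e 's/ç/c/g' \
        -e 's/ñ/n/g' \
        -e 's/é/e/g'
}

printf 'façade, piñata, café\n' | translit_basic   # facade, pinata, cafe
```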

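The awk command works similarly, but it also lets us restrict the replacement to a specific column. We specify the character to be replaced, its replacement, and the column as follows:

$ awk -F, -v OFS=, '{gsub(/ñ/, "n", $4); print}' input_utf8.csv > output_ascii.csv

Here, we replace the ‘ñ’ characters with ‘n’ characters in the fourth column only. Setting OFS to a comma keeps the output comma-separated, because awk rebuilds the line with the output field separator whenever a field is modified.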
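One detail to watch with the column-based approach: when awk modifies a field, it rebuilds the line using the output field separator (OFS), which defaults to a space. The sketch below sets both separators to a comma so that the CSV structure is preserved; fix_fourth_column is a hypothetical helper name, and the sample row is made up for illustration.

```shell
# fix_fourth_column (hypothetical helper): replace 'ñ' with 'n' in field 4 only.
# FS and OFS are both set to a comma so the rebuilt line stays comma-separated.
fix_fourth_column() {
    awk 'BEGIN { FS = OFS = "," } { gsub(/ñ/, "n", $4); print }'
}

# 'ñ' (octal \303\261) appears in fields 1 and 4; only field 4 is changed.
printf 'Mu\303\261oz,34,Madrid,Espa\303\261a\n' | fix_fourth_column
# Muñoz,34,Madrid,Espana
```

Since the other columns are left untouched, the output is only pure ASCII if the non-ASCII characters are confined to the processed column.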

5. Conclusion

In this article, we discussed several ways to convert a UTF-8 file to an ASCII file.

Each method suits a different purpose, and each one produces a readable ASCII file. Of the methods covered, iconv is the most versatile and commonly used, and it’s the least error-prone.
