Converting Unicode to UTF-8 Using Console Tools in Linux

1. Overview

Unicode escape sequences of the form \uXXXX represent a Unicode character by its hexadecimal code point. To standardize our data, we might need to convert these sequences, or text stored in another Unicode encoding such as UTF-16, into UTF-8.

In this tutorial, we’ll discuss converting different kinds of Unicode data to UTF-8 in Linux using iconv and echo. We’ll also learn how to save our results into a text file.

2. The iconv Command

iconv is a command that converts text from one character encoding to another. It’s therefore useful for converting text files of different encodings to UTF-8 and writing the output to another text file.

Moreover, we can convert Unicode escape sequences to UTF-8 by combining iconv with echo.

2.1. Basic Conversion

Here’s the syntax for converting a file from another encoding, such as UTF-16, to UTF-8:

$ iconv -f [source_encoding] -t UTF-8 [input_file] -o [output_file]

We have to declare the source encoding of our input file and the file in which to save the result. If we don’t know the current encoding, we can use the file command to detect it:

$ file [input_file]
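As a small illustration of the detection step, the sketch below creates a throwaway sample and asks file for its MIME type and character set. The sample content and the variable names are assumptions made for this example, and the check is skipped if file isn't installed:

```shell
# Create a temporary sample containing Ĕ (U+0114) as the UTF-8 bytes c4 94.
sample=$(mktemp)
printf '\304\224\n' > "$sample"

if command -v file > /dev/null 2>&1; then
    # -b omits the file name, -i prints the MIME type and charset.
    detected=$(file -bi "$sample")
else
    detected='file command not installed'
fi

printf '%s\n' "$detected"
```

The charset reported here is the value we'd pass to iconv with -f.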

As an example, we can convert a UTF-16 text file to UTF-8 encoding and save our output to formatted.txt, a new text file:

$ iconv -f UTF-16 -t UTF-8 input.txt -o formatted.txt
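To try this conversion without an existing UTF-16 file, we can sketch a round trip: build a small UTF-16 sample, convert it back to UTF-8, and inspect the bytes. The temporary directory and the sample text are assumptions for illustration, and shell redirection is used in place of -o:

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)

# Create a small UTF-16 sample: the octal bytes \304\224 are Ĕ (U+0114) in UTF-8.
printf '\304\224 is U+0114\n' | iconv -f UTF-8 -t UTF-16 > "$workdir/input.txt"

# Convert the sample to UTF-8; shell redirection stands in for -o here.
iconv -f UTF-16 -t UTF-8 "$workdir/input.txt" > "$workdir/formatted.txt"

# Dump the result as hex bytes; Ĕ should show up as c4 94.
od -An -tx1 "$workdir/formatted.txt"
```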

Additionally, we can append //IGNORE or //TRANSLIT to the target encoding. //IGNORE discards characters that can’t be represented in the target encoding, while //TRANSLIT replaces them with similar characters where possible:

$ iconv -f UTF-16 -t UTF-8//IGNORE input.txt -o formatted.txt

Let’s see an example of using //TRANSLIT:

$ iconv -f UTF-16LE -t UTF-8//TRANSLIT input.txt -o formatted.txt

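These two suffixes help with error handling during character encoding conversion. Because UTF-8 can represent every Unicode character, they matter most when the target encoding is narrower, such as ASCII.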
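The effect of //TRANSLIT and //IGNORE is easiest to see with a target encoding that can't hold every character. The following sketch converts a sample word to ASCII instead of UTF-8. That target is an assumption chosen purely for illustration, and the exact result depends on the iconv implementation and the active locale:

```shell
# "café" as UTF-8 bytes: \303\251 is é, which plain ASCII cannot represent.
sample='caf\303\251'

# //TRANSLIT substitutes an approximation; the exact replacement depends on
# the iconv implementation and the active locale.
translit_out=$(printf "$sample\n" | iconv -f UTF-8 -t ASCII//TRANSLIT 2>/dev/null) || true

# //IGNORE drops the character; some iconv versions still exit non-zero,
# hence the "|| true".
ignore_out=$(printf "$sample\n" | iconv -f UTF-8 -t ASCII//IGNORE 2>/dev/null) || true

printf 'TRANSLIT: %s\nIGNORE:   %s\n' "$translit_out" "$ignore_out"
```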

2.2. Converting \uXXXX to UTF-8

Piping echo into iconv gives us an easy way to convert a Unicode escape sequence to UTF-8:

$ echo -e '\u0114' | iconv -f [source_encoding] -t UTF-8

We replace [source_encoding] with the default encoding of our *nix system, since that is the encoding echo uses for its output. If the system already uses UTF-8, the iconv step simply verifies that the output is valid UTF-8.
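Rather than typing the system encoding by hand, we can ask the locale command for it. This is a minimal sketch that assumes a shell whose echo understands -e and \uXXXX, such as Bash. The variable names are made up for the example:

```shell
# Ask the system which character encoding the current locale uses
# (falls back to UTF-8 if the locale command is unavailable).
source_encoding=$(locale charmap 2>/dev/null || echo UTF-8)

# What the shell emits on its own.
raw=$(echo -e '\u0114')

# The same output passed through iconv, declared as the locale's encoding.
converted=$(echo -e '\u0114' | iconv -f "$source_encoding" -t UTF-8)

printf '%s\n' "$converted"
```

On a system with a UTF-8 locale, both variables should hold the same bytes, because iconv has nothing to change.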

3. Using the echo Command Independently

The echo command displays lines of text in Linux. Adding the -e flag makes it interpret backslash escapes. In shells such as Bash, this includes Unicode escape sequences, which echo prints in the system’s encoding:

$ echo -e '\uXXXX'

The output is encoded in our system’s default encoding, which is UTF-8 on most Linux systems. As an example, let’s convert \u0114, which is the character ‘Ĕ’, to UTF-8:

$ echo -e '\u0114'
Ĕ

Furthermore, we can convert several Unicode escape sequences at once and save our output to a file if needed:

$ echo -e '\u0114\u0123\u0134' > output.txt

Subsequently, we use the file command to verify that our output file is encoded in UTF-8 after the conversion:

$ file output.txt
output.txt: UTF-8 Unicode text

This check confirms that our output file is encoded as expected. The exact wording that file prints can vary between versions.
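As an alternative check that doesn't rely on how file words its result, we can let iconv validate the data, since it exits with an error when the input isn't valid in the declared source encoding. The sketch below assumes a temporary output file and a made-up status variable:

```shell
# Write the escape sequences to a temporary file.
out=$(mktemp)
echo -e '\u0114\u0123\u0134' > "$out"

# iconv exits non-zero if the file contains bytes that are not valid UTF-8.
if iconv -f UTF-8 -t UTF-8 "$out" > /dev/null; then
    utf8_status=valid
else
    utf8_status=invalid
fi

echo "$utf8_status"

# Show the raw bytes for a closer look.
od -An -tx1 "$out"
```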

4. Conclusion

In this short article, we discussed a few ways of converting Unicode escape sequences and files encoded in Unicode to UTF-8.

We found iconv well suited to converting files from other Unicode encodings, such as UTF-16, to UTF-8.

For Unicode escape sequences, the echo command is the quickest option.
