Authors Top

If you have a few years of experience in the Linux ecosystem, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

1. Overview

First, let’s look at how computers interpret data. Computers work only with binary values, so all text is converted to and stored as a binary sequence, then converted back to a human-readable form when it is displayed. The process of converting the characters of a human language into a binary sequence is called encoding.
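
To make the idea concrete, the sketch below prints the bytes that the same character occupies under two encodings. The helper name hex_bytes is hypothetical and not part of iconv; the sketch assumes od and tr are available, as they are on most Linux systems.

```shell
# hex_bytes ENCODING: read UTF-8 text on stdin, re-encode it as ENCODING,
# and print the resulting bytes as one hexadecimal string.
hex_bytes() {
    iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

printf 'A' | hex_bytes UTF-8            # 41
printf 'A' | hex_bytes UTF-16LE         # 4100
# \303\251 is the UTF-8 byte sequence for U+00E9 (e with acute accent)
printf '\303\251' | hex_bytes UTF-8     # c3a9
printf '\303\251' | hex_bytes UTF-16LE  # e900
```

The character is the same in each pair of calls; only the byte sequence that represents it changes with the encoding scheme.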

ASCII, UTF-16LE, and UTF-8 are commonly used encoding schemes. Because so many encoding schemes exist, we’ll often need to convert a file from one encoding to another to suit the target system.

In this tutorial, let’s learn how to convert a file’s encoding on any Linux system using the iconv tool.

2. How to Check the Encoding Scheme of Any Given File

The iconv tool converts data from one encoding scheme to another. Before converting the encoding scheme of any file, the first step is identifying the current encoding scheme and verifying that both the target and the source encoding schemes are compatible with the iconv tool.

First, let’s learn how to check the encoding format of any given file. The file utility determines the properties of files, and its -i option prints the MIME type and character set it detects. To identify the encoding scheme of a file:

$ file -i test.csv
test.csv: text/plain; charset=utf-8
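
In a script, it is often handier to get just the character set rather than the full MIME string. The following is a minimal sketch that assumes a file implementation supporting the -b (brief) and --mime-encoding options; the helper name charset_of and the sample file names are hypothetical.

```shell
# charset_of FILE: print only the character set that file(1) detects.
charset_of() {
    file -b --mime-encoding -- "$1"
}

sample_dir=$(mktemp -d)
printf 'plain text\n' > "$sample_dir/ascii.txt"
# \303\251 is the UTF-8 byte sequence for U+00E9
printf 'caf\303\251\n' > "$sample_dir/utf8.txt"

# Only call the helper when file(1) is actually installed.
if command -v file >/dev/null 2>&1; then
    charset_of "$sample_dir/ascii.txt"   # typically us-ascii
    charset_of "$sample_dir/utf8.txt"    # typically utf-8
fi
```

Keep in mind that file inspects the content and makes an educated guess, so the reported character set is best treated as a hint, especially for short files.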

3. List of Supported Encoding Schemes

Before we learn to convert encoding schemes, let’s see how to list all the encoding schemes that the iconv tool supports. Either iconv -l or its long form, iconv --list, prints them:

$ iconv -l
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
...
CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949

In short, we should make sure that iconv supports both our source and our target encoding schemes before converting.
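
As an alternative to searching the list by eye, a script can simply ask iconv to open a conversion on empty input and look at the exit status. This is a sketch under that assumption; the helper name iconv_supports is hypothetical, and it only checks the name as a source encoding.

```shell
# iconv_supports ENCODING: succeed if iconv accepts ENCODING as a source
# encoding. Empty input is converted, so no data is actually processed.
iconv_supports() {
    iconv -f "$1" -t UTF-8 </dev/null >/dev/null 2>&1
}

for enc in UTF-8 UTF-16LE NOT-A-REAL-ENCODING; do
    if iconv_supports "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not supported"
    fi
done
```

Swapping the -f and -t arguments inside the helper would check the name as a target encoding instead.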

4. Converting the Encoding Scheme of Any Given File

Finally, let’s learn how to convert the encoding schemes of any given file. After finding the current encoding schemes (using the file tool) and ensuring that both the target and source encoding schemes are compatible with the iconv utility, we shall proceed with this step of actual conversion. The steps to convert the UTF-16LE encoded file to UTF-8 are as follows.

First, we find the input encoding scheme of the file:

$ file -i input.csv 
input.csv: text/plain; charset=utf-16le

Secondly, let’s verify that both the target (UTF-8) and the source encoding scheme (UTF-16LE) are compatible with the iconv tool:

$ iconv -l | grep -i utf-16le 
UTF-16LE// 
$ iconv -l | grep -i utf-8 
ISO-10646/UTF-8/ 
UTF-8//

Finally, let’s convert the input file to our target UTF-8 format and also verify the result file’s encoding scheme:

$ iconv -f utf-16le -t utf-8 input.csv -o result.csv
$ file -i result.csv
result.csv: text/plain; charset=utf-8
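
The same conversion can be checked without relying on file's guess by converting the result back and comparing bytes. The sketch below builds its own small UTF-16LE sample in a temporary directory (the sample content is made up for illustration) and writes output with shell redirection, which is interchangeable with -o here.

```shell
workdir=$(mktemp -d)

# Build a small UTF-16LE input; \303\251 is the UTF-8 form of U+00E9.
printf 'caf\303\251,1\n' | iconv -f UTF-8 -t UTF-16LE > "$workdir/input.csv"

# The conversion itself: UTF-16LE to UTF-8.
iconv -f UTF-16LE -t UTF-8 "$workdir/input.csv" > "$workdir/result.csv"

# Round-trip check: converting back must reproduce the input byte for byte.
if iconv -f UTF-8 -t UTF-16LE "$workdir/result.csv" | cmp -s - "$workdir/input.csv"; then
    echo "round trip OK"
else
    echo "round trip FAILED"
fi
```

A successful round trip shows that no character was lost or altered in either direction.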

5. Input and Output Encoding Options

The iconv tool takes the encoding format of the source via the -f option and the target encoding format via the -t option.

If both the -f and -t options are absent, both encodings default to the same value, so the output file is simply a copy of the input file with no conversion performed:

$ file -i test.csv 
test.csv: text/plain; charset=utf-8
$ iconv test.csv -o result.csv
$ file -i result.csv
result.csv: text/plain; charset=utf-8
$ diff test.csv result.csv

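If the -f option is absent, the iconv tool does not detect the input file’s encoding. It assumes the current locale’s character encoding and converts from that to the target encoding scheme. This works here because test.csv is UTF-8 and, presumably, so is the session’s locale: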

$ iconv -t utf-16le test.csv -o result.csv
$ file -i result.csv 
result.csv: text/plain; charset=utf-16le

In short, if either the from-encoding or the to-encoding parameter is absent, its default is derived from the current locale’s character encoding.
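
Because these defaults depend on the locale of whoever runs the command, a script may prefer to state both encodings explicitly. The sketch below first prints the locale's character encoding with locale charmap (assuming the locale utility is installed), then performs a conversion that gives the same bytes in any locale. ISO-8859-1 is used purely as an illustrative target.

```shell
# Show the character encoding that iconv would fall back to.
if command -v locale >/dev/null 2>&1; then
    locale charmap
fi

# With -f and -t both given, the result no longer depends on the locale.
# \303\251 is U+00E9 in UTF-8; in ISO-8859-1 it is the single byte e9.
latin1_hex=$(printf 'caf\303\251\n' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')
echo "$latin1_hex"   # 636166e90a
```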

6. Redirecting Output

Standard output is the default output option:

$ iconv -t utf-16le test.csv
��,,,
,,,
,,,
EMPLOYEE CLAIM FORM,,,
,,,
LEAVE TRAVEL ALLOWANCE,,,
,,,
,,,
,,,
,,,
SL.NO.,PARTICULARS,,REMARKS

To send the output to a file instead, we can use either the shell’s output redirection operator (>) or the -o option. With redirection:

$ iconv -t utf-16le test.csv > result.csv
$ cat result.csv 
��,,,
,,,
,,,
EMPLOYEE CLAIM FORM,,,
,,,
LEAVE TRAVEL ALLOWANCE,,,
,,,
,,,
,,,
,,,
SL.NO.,PARTICULARS,,REMARKS
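
One caveat with redirection is a general shell behavior rather than anything specific to iconv: redirecting to the same file that is being read truncates it before iconv can read it. A minimal sketch of a workaround, using a temporary file, follows; the function name convert_in_place and the sample file are hypothetical.

```shell
# convert_in_place FROM TO FILE: re-encode FILE, replacing its content only
# after the conversion has succeeded.
convert_in_place() {
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t "$2" "$3" > "$tmp"; then
        cat "$tmp" > "$3"    # overwrite content, keep the file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"         # leave the original untouched on failure
        return 1
    fi
}

sample=$(mktemp)
printf 'hi\n' > "$sample"
convert_in_place UTF-8 UTF-16LE "$sample"
od -An -tx1 "$sample" | tr -d ' \n'; echo   # 680069000a00
```

Writing to the temporary file first means a failed conversion never leaves a half-written original behind.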

7. Omit Invalid Characters

The input file can contain characters that are invalid in the declared source encoding, for example because of memory corruption or an improper transfer. In such cases, we can tell the iconv tool to drop the invalid characters using the -c option:

$ cat input_invalid 
hi😀😀This is not a valid char
$ iconv -f us-ascii -t utf-8 input_invalid -c -o output
$ cat output 
hiThis is not a valid char

The above shows that the iconv tool completely converted all the valid characters, ignoring only the invalid ones.
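
To see what -c changes, the sketch below runs the same input with and without it. The sample text is made up for illustration: it embeds the four UTF-8 bytes of an emoji in otherwise plain ASCII. The exit status of the -c run is deliberately ignored, on the assumption that some iconv implementations still report a non-zero status when they have dropped something.

```shell
# Emit "hi", the 4-byte UTF-8 sequence for U+1F600, then "there".
make_sample() {
    printf 'hi\360\237\230\200there\n'
}

# Without -c, bytes that are not valid US-ASCII make the conversion fail.
if make_sample | iconv -f US-ASCII -t UTF-8 >/dev/null 2>&1; then
    echo "converted cleanly"
else
    echo "conversion failed: input is not valid US-ASCII"
fi

# With -c, the invalid bytes are dropped and the rest is converted.
cleaned=$(make_sample | iconv -c -f US-ASCII -t UTF-8 2>/dev/null) || true
echo "$cleaned"
```

Since -c discards data silently, it is worth comparing the input and output afterwards whenever the dropped characters might matter.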

8. Conclusion

In this article, we have learned how to check encoding schemes for a file and use the iconv tool to convert files to another encoding format.
