Let's begin with how computers interpret data. Computers work only with binary, so all information is converted to and stored in a binary format, then converted back to a human-readable form for display. The process of converting the characters of a human language into a binary sequence is called encoding.
ASCII, UTF-16LE, and UTF-8 are commonly used encoding schemes. Because many encoding schemes exist, we often need to convert a file from one encoding to another to suit the target system.
In this tutorial, let's learn how to convert a file's encoding on any Linux system using the iconv tool.
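To make the idea of an encoding concrete, the following sketch prints the bytes that the same character produces under two different schemes. The helper name show_bytes is our own, not part of iconv, and the sketch assumes iconv, od, and tr are available:

```shell
# Print, as hex, the bytes of the text in $1 after encoding it as $2.
# Assumes $1 is UTF-8 (true for the ASCII and octal-escaped text used below).
show_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" | od -An -tx1 | tr -d ' \n'
    echo
}

e_acute=$(printf '\303\251')    # the two UTF-8 bytes of "é", written in octal

show_bytes 'A' UTF-8            # prints 41
show_bytes 'A' UTF-16LE         # prints 4100
show_bytes "$e_acute" UTF-8     # prints c3a9
show_bytes "$e_acute" UTF-16LE  # prints e900
```

The text is the same in each pair, but the stored bytes differ, which is why a file has to be read with the encoding it was written in.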
2. How to Check the Encoding Scheme of Any Given File
The iconv tool converts data from one encoding scheme to another. Before converting a file, we first identify its current encoding scheme and verify that iconv supports both the source and the target encoding schemes.
Let's start by checking the encoding of a given file. The file utility reports the properties of a file, and its -i option prints the MIME type together with the character set:
$ file -i test.csv
test.csv: text/plain; charset=utf-8
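When the charset is needed inside a script, parsing the full line above is awkward. A minimal sketch, assuming a version of file that supports the -b and --mime-encoding options, prints only the encoding name. The helper detect_charset and the file sample_utf8.csv are hypothetical names created for this illustration:

```shell
# Print only the charset that file(1) detects for the file named by $1.
# -b omits the leading file name; --mime-encoding omits the MIME type.
detect_charset() {
    file -b --mime-encoding "$1"
}

# Hypothetical sample containing one non-ASCII character (octal-escaped UTF-8).
printf 'id,name\n1,Zo\303\253\n' > sample_utf8.csv

detect_charset sample_utf8.csv
```

Note that file guesses the encoding from the file's content, so the result is a best effort rather than a guarantee.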
3. List of Supported Encoding Schemes
Before we convert anything, let's see how to check which encoding schemes the iconv tool supports. Running iconv -l (or iconv --list) prints all of the supported encoding schemes:
$ iconv -l
437, 500, 500V1, 850, 851, 852, 855, 856, 857, 860, 861, 862, 863, 864, 865,
...
CP920, CP921, CP922, CP930, CP932, CP933, CP935, CP936, CP937, CP939, CP949
In short, we should ensure that iconv supports both our source and target encoding schemes before converting.
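Rather than searching the long list by eye, a script can ask iconv directly whether it accepts an encoding name. The following is a sketch, not the only way to do this: the helper iconv_supports is our own name, and it relies on iconv exiting with a non-zero status when it can't open a converter for the given name:

```shell
# Succeed if iconv accepts $1 as both a source and a target encoding.
# Converting empty input makes iconv open the converter without needing a file.
iconv_supports() {
    iconv -f "$1" -t "$1" < /dev/null > /dev/null 2>&1
}

for enc in UTF-16LE UTF-8 NOT-A-REAL-ENCODING; do
    if iconv_supports "$enc"; then
        echo "$enc: supported"
    else
        echo "$enc: not supported"
    fi
done
```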
4. Converting the Encoding Scheme of Any Given File
Finally, let's convert the encoding scheme of a file. After finding the current encoding scheme (using the file tool) and ensuring that iconv supports both the source and target encoding schemes, we can perform the actual conversion. The steps to convert a UTF-16LE encoded file to UTF-8 are as follows.
First, we find the encoding scheme of the input file:
$ file -i input.csv
input.csv: text/plain; charset=utf-16le
Next, let's verify that iconv supports both the source (UTF-16LE) and the target (UTF-8) encoding schemes:
$ iconv -l | grep -i utf-16le
UTF-16LE//
$ iconv -l | grep -i utf-8
ISO-10646/UTF-8/
UTF-8//
Finally, let's convert the input file to UTF-8 and verify the encoding scheme of the result:
$ iconv -f utf-16le -t utf-8 input.csv -o result.csv
$ file -i result.csv
result.csv: text/plain; charset=utf-8
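The conversion step can be wrapped in a small function that also cleans up after a failure. This is a minimal sketch under our own assumptions: the name to_utf8 is hypothetical, and the UTF-16LE input is generated on the spot so the example is self-contained:

```shell
# Convert file $2 from encoding $1 to UTF-8, writing the result to $3.
# If iconv reports an error, remove the partial output and return non-zero.
to_utf8() {
    if ! iconv -f "$1" -t UTF-8 "$2" > "$3"; then
        rm -f "$3"
        return 1
    fi
}

# Hypothetical UTF-16LE input built from octal-escaped UTF-8 text.
printf 'id,name\n1,Zo\303\253\n' | iconv -f UTF-8 -t UTF-16LE > input.csv

to_utf8 UTF-16LE input.csv result.csv && cat result.csv
```

Checking the exit status matters because a failed conversion can otherwise leave a truncated result file behind.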
5. Input and Output Encoding Options
The iconv tool takes the encoding format of the source via the -f option and the target encoding format via the -t option.
If both the -f and -t options are absent, iconv uses the same default for each, so the output file is simply a copy of the input file:
$ file -i test.csv
test.csv: text/plain; charset=utf-8
$ iconv test.csv -o result.csv
$ file -i result.csv
result.csv: text/plain; charset=utf-8
$ diff test.csv result.csv
If the -f option is absent, iconv doesn't detect the input file's encoding. It assumes the input uses the current locale's character encoding and converts it to the target encoding scheme:
$ iconv -t utf-16le test.csv -o result.csv
$ file -i result.csv
result.csv: text/plain; charset=utf-16le
In short, if either the from-encoding or the to-encoding parameter is absent, its default is derived from the current locale's character encoding.
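To see which encoding the locale supplies, we can run locale charmap. The sketch below assumes a glibc-based system, where iconv accepts the name that locale charmap prints. It compares a conversion that omits -f with one that passes the locale's encoding explicitly, using the hypothetical file sample.txt:

```shell
# Show the character encoding of the current locale. This is the default
# for whichever of -f and -t is left out.
locale charmap

# Plain ASCII sample, so it is valid in any common locale encoding.
printf 'plain ASCII text\n' > sample.txt

# Omitting -f and naming the locale's encoding explicitly should give the same bytes.
iconv -t UTF-16LE sample.txt > implicit.out
iconv -f "$(locale charmap)" -t UTF-16LE sample.txt > explicit.out

cmp implicit.out explicit.out && echo "same output"
```

Because the default depends on the environment, passing -f and -t explicitly is the safer choice in scripts that run on more than one machine.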
6. Redirecting Output
By default, iconv writes the converted data to standard output:
$ iconv -t utf-16le test.csv
��,,,
,,,
,,,
EMPLOYEE CLAIM FORM,,,
,,,
LEAVE TRAVEL ALLOWANCE,,,
,,,
,,,
,,,
,,,
SL.NO.,PARTICULARS,,REMARKS
To write the output to a file instead, we can either pass the -o option, as in the earlier examples, or use the shell's redirection operator (>):
$ iconv -t utf-16le test.csv > result.csv
$ cat result.csv
��,,,
,,,
,,,
EMPLOYEE CLAIM FORM,,,
,,,
LEAVE TRAVEL ALLOWANCE,,,
,,,
,,,
,,,
,,,
SL.NO.,PARTICULARS,,REMARKS
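One caveat with shell redirection is that the shell truncates the target file before iconv starts, so redirecting onto the input file itself would destroy the data. A minimal sketch of a workaround goes through a temporary file. The function name convert_in_place and the file claim.csv are our own, hypothetical names:

```shell
# Convert file $3 from encoding $1 to encoding $2 and replace it in place.
# A temporary file is used because "iconv ... f > f" would empty f
# before iconv had a chance to read it.
convert_in_place() {
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t "$2" "$3" > "$tmp"; then
        mv "$tmp" "$3"
    else
        rm -f "$tmp"
        return 1
    fi
}

printf 'SL.NO.,PARTICULARS,,REMARKS\n' > claim.csv   # hypothetical sample
convert_in_place UTF-8 UTF-16LE claim.csv
wc -c < claim.csv    # byte count doubles for this ASCII-only content
```

The replaced file takes on the permissions of the temporary file, so a script that cares about permissions would need to restore them afterwards.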
7. Omit Invalid Characters
The input file can contain characters that are invalid in the source encoding, for example because of memory corruption or a faulty transfer. In such cases, we can tell the iconv tool to omit the invalid characters from the output using the -c option:
$ cat input_invalid
hi😀😀This is not a valid char
$ iconv -f us-ascii -t utf-8 input_invalid -c -o output
$ cat output
hiThis is not a valid char
The output shows that the iconv tool converted all the valid characters and dropped only the ones that are invalid in the declared source encoding (here, the emoji, whose bytes aren't valid US-ASCII).
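To contrast the behavior with and without -c, the following sketch builds a small input containing bytes that aren't valid US-ASCII. The bytes are written as octal escapes so that the example works in any POSIX shell:

```shell
# "hi", then the four UTF-8 bytes of an emoji (none is valid US-ASCII), then "ok".
printf 'hi\360\237\230\200ok\n' > input_invalid

# Without -c, iconv stops at the first invalid byte and exits with an error.
if ! iconv -f US-ASCII -t UTF-8 input_invalid > /dev/null 2>&1; then
    echo "conversion failed without -c"
fi

# With -c, the invalid bytes are dropped. Some iconv versions still exit
# non-zero in this case, so the status is deliberately ignored here.
iconv -c -f US-ASCII -t UTF-8 input_invalid > output || true
cat output
```

Since -c discards data silently, it's best reserved for cases where losing the invalid bytes is acceptable.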
In this article, we learned how to check the encoding scheme of a file and how to use the iconv tool to convert files to another encoding format.