Last updated: November 12, 2021
We sometimes have a file that contains invalid characters, or text saved in a different encoding, that makes our program crash with an “invalid characters” error.
In this tutorial, we’re going to take a deeper dive into this topic and find out what non-UTF-8 characters are and how we can automatically remove all invalid characters from our files.
UTF-8 is an encoding for Unicode that maps every Unicode character to a unique binary string, and maps those binary strings back to their characters. The “UTF” prefix stands for Unicode Transformation Format.
The “-8” suffix refers to the 8-bit (one-byte) code units that UTF-8 is built from. Each character is encoded as a sequence of one to four of these bytes.
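As a small illustration, the following Python sketch encodes a few characters and counts the bytes UTF-8 uses for each. The helper name utf8_length is ours and only serves the example.

```python
def utf8_length(character: str) -> int:
    """Return how many bytes UTF-8 needs to encode the given character."""
    return len(character.encode("utf-8"))


# ASCII letter, Latin letter with tilde, euro sign, emoji (U+1F600)
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]

for character in samples:
    encoded = character.encode("utf-8")
    print(f"U+{ord(character):04X} -> {encoded.hex(' ')} ({utf8_length(character)} byte(s))")
```

ASCII characters take a single byte, while characters further up the Unicode range take two, three, or four.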
Non-UTF-8 characters are byte sequences that aren’t valid UTF-8. They typically appear when text was saved in another encoding, such as a legacy encoding for a particular language, or when a file has been corrupted.
Let’s take a look at some strings containing non-UTF-8 characters:
İnanç Esasları
İnanç Esasları
��� ����
We may get an error if we attempt to read these bytes as UTF-8 or run a file that contains them.
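To see where such bytes can come from, here’s a minimal Python sketch. It assumes the text was originally saved in ISO-8859-9 (a legacy Turkish encoding), which is our guess for illustration rather than something we know about the sample strings above. Decoding those bytes as UTF-8 with the replace error handler turns every invalid byte into the replacement character U+FFFD:

```python
text = "\u0130nan\u00e7 Esaslar\u0131"  # "İnanç Esasları"

# Assumption: the file was written with a legacy single-byte encoding.
legacy_bytes = text.encode("iso-8859-9")

# Strict decoding as UTF-8 fails, so we substitute U+FFFD for each bad byte.
replaced = legacy_bytes.decode("utf-8", errors="replace")

print(legacy_bytes)
print(replaced)
```

The three non-ASCII letters each become a single invalid byte, so the decoded result contains three replacement characters.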
Files that contain non-UTF-8 characters produce errors when processed by utilities or when opened by some text editors. Let’s take a look at the kind of errors to expect in different languages.
Here’s an error we can expect in Python:
#### Truncated ####
UnicodeDecodeError: 'utf-8' codec cannot decode byte 0xf1 in position 933: invalid continuation byte
None
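A comparable error can be reproduced with a few bytes. In this sketch, the input is the word “mañana” encoded as Latin-1, which is an assumption chosen for the example, and the helper describe_decode_error is a hypothetical name:

```python
from typing import Optional


def describe_decode_error(data: bytes) -> Optional[str]:
    """Return a short description of the first UTF-8 decoding error, or None if data is valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        bad_byte = data[error.start]
        return f"byte 0x{bad_byte:02x} at position {error.start}: {error.reason}"
    return None


latin1_bytes = "ma\u00f1ana".encode("latin-1")  # 0xf1 is not valid UTF-8 here
print(describe_decode_error(latin1_bytes))
print(describe_decode_error("ma\u00f1ana".encode("utf-8")))
```

The second call returns None because the same word encoded as UTF-8 decodes cleanly.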
Let’s take a look at the error to expect in JavaScript:
#### Truncated ####
Uncaught SyntaxError: Unexpected identifier
Finally, let’s see the error in Perl:
Malformed UTF-8 character (fatal)
We can easily find all non-UTF-8 characters in a file using grep, assuming we’ve set up our locale to UTF-8.
Let’s type in the following command in our terminal to print out all lines containing non-UTF-8 characters:
grep -axv '.*' FILE
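Here’s what each part of this command represents:
-a: treat the file as text, even if grep would otherwise consider it binary
-x: match only whole lines
-v: invert the match, so that only non-matching lines are printed
'.*': a pattern that matches any sequence of characters that are valid in the current locale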
Let’s create a file named test.txt and add some random text to it with invalid characters:
$ touch test.txt
Then let’s add the following text to it:
2.3.1 U-0000D7FF = ed 9f bf = "������"
This just some random text
More random text. Baeldung is awesome!
Let’s now use our grep command to find all invalid characters in our newly created test file:
$ grep -axv '.*' test.txt
2.3.1 U-0000D7FF = ed 9f bf = "������"
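The same check can be sketched in Python for cases where grep or a UTF-8 locale isn’t available. The function name find_invalid_lines and the sample bytes are ours, and Python’s strict decoder may not agree with grep on every edge case:

```python
from typing import List, Tuple


def find_invalid_lines(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (line number, raw line) pairs for lines that are not valid UTF-8."""
    invalid = []
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            invalid.append((number, line))
    return invalid


# Hypothetical file content: the second line holds two bytes that can never occur in UTF-8.
sample = b"This is just some random text\nbroken \xff\xfe bytes\nBaeldung is awesome!\n"

for number, line in find_invalid_lines(sample):
    print(number, line)
```

For a real file, we’d read it in binary mode first, for example with open("test.txt", "rb"), and pass the bytes to the function.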
But this is only useful to us when we need to find invalid characters. In the next section, we’ll find out how we can find and delete invalid characters in our file.
To automatically find and delete non-UTF-8 characters, we’re going to use the iconv command. It is used in Linux systems to convert text from one character encoding to another.
Let’s look at how we can use this command and a combination of other flags to remove invalid characters:
$ iconv -f utf-8 -t utf-8 -c FILE
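We can break down the command above to find out what each part is doing:
-f utf-8: the encoding of the input
-t utf-8: the encoding of the output
-c: silently omit characters that can’t be converted from the output
FILE: the file we want to clean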
By default, the cleaned data is written to standard output on our terminal. To save it, we need to specify a file to write to. We can use either of the following commands:
$ iconv -f utf-8 -t utf-8 -c FILE.txt -o NEW_FILE
or
$ iconv -f utf-8 -t utf-8 -c FILE.txt > NEW_FILE
Let’s use the test file we created above to remove all invalid characters and save the changes to a different file named “test_clean.txt”:
$ iconv -f utf-8 -t utf-8 -c test.txt > test_clean.txt
or
$ iconv -f utf-8 -t utf-8 -c test.txt -o test_clean.txt
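For comparison, here’s a minimal Python sketch of a similar cleanup. It relies on the ignore error handler, which drops bytes that can’t be decoded. The names strip_invalid_utf8 and clean_file are hypothetical, and the result is only roughly comparable to iconv -c, since the two tools may treat some edge cases differently:

```python
from pathlib import Path


def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop every byte that is not part of a valid UTF-8 sequence."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_file(source: str, destination: str) -> None:
    """Read source as raw bytes and write the cleaned bytes to destination."""
    cleaned = strip_invalid_utf8(Path(source).read_bytes())
    Path(destination).write_bytes(cleaned)


# Valid text (including a two-byte character) is kept; the stray bytes are removed.
print(strip_invalid_utf8(b"caf\xc3\xa9 \xff\xfeok"))
```

Calling clean_file("test.txt", "test_clean.txt") would write the cleaned copy to a new file and leave the original untouched.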
We took a closer look at what UTF-8 characters are and how non-UTF-8 characters can cause compatibility issues. We also looked at how we can find invalid characters with grep and how we can automatically delete them from our file using the iconv command.