Removing the BOM From a UTF-8 Encoded File

1. Overview

The Unicode Standard defines a special signature that may appear at the beginning of a text file: the byte order mark, or BOM for short.

The Unicode Standard permits a BOM in the UTF-8 encoding scheme but neither requires nor recommends it.

In this tutorial, we’ll discuss how to remove the BOM from a text file with the UTF-8 encoding scheme.

2. What Is BOM?

The BOM is a signature that tells programs reading the text file which byte order and encoding the file uses.

For example, a text file beginning with the bytes 0xFE 0xFF means that the file is encoded in big-endian UTF-16. The BOM for UTF-8, on the other hand, consists of the bytes 0xEF 0xBB 0xBF (357 273 277 in octal).

Byte order matters for encoding schemes like UTF-16 because UTF-16 uses 16-bit code units, and the two bytes of each unit can be stored in either order. However, little-endianness or big-endianness has no meaning in UTF-8 since UTF-8 is byte-oriented. Therefore, the BOM can be omitted from a UTF-8 encoded file.

The Unicode code point of the BOM is U+FEFF (ZERO WIDTH NO-BREAK SPACE). Encoding this code point in UTF-8 yields the three bytes 0xEF 0xBB 0xBF.
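
As a quick illustration of the byte sequence described above, the following sketch writes the three BOM bytes with printf and dumps them with od. It uses octal escapes because those are the form POSIX printf guarantees. The variable name bom_hex is only an illustrative choice:

```shell
# Emit the UTF-8 BOM (octal 357 273 277) and capture it as a hex string.
# tr strips the spaces and the newline that od adds between bytes.
bom_hex=$(printf '\357\273\277' | od -An -tx1 | tr -d ' \n')
echo "$bom_hex"   # prints: efbbbf
```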

3. Example File

We’ll use the text file bom_example.txt in our examples:

$ cat bom_example.txt
GÖKMEN

We can check the encoding scheme of the file using the file command:

$ file bom_example.txt
bom_example.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

The output shows that the encoding scheme of bom_example.txt is UTF-8 Unicode with BOM.

We can check the existence of the BOM in a file using the hexdump command:

$ hexdump -c bom_example.txt 
0000000 357 273 277   G 303 226   K   M   E   N  \r  \n                
000000c

The -c option of hexdump displays the output as one-byte characters. The first column in the output contains the starting offsets in the file.

The first three octal numbers in the first line, 357 273 277, correspond to the BOM. The part corresponding to GÖKMEN is “G 303 226 K M E N”. The special character Ö is represented by two bytes, 303 226. The characters at the end, \r \n, are the carriage return and newline characters, respectively.

We use the character Ö in bom_example.txt because it isn't an ASCII character and takes two bytes in UTF-8. Therefore, once we remove the BOM from bom_example.txt, the file command still reports the file as UTF-8 rather than ASCII.
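
To follow along without the original file, we can recreate an equivalent one and test for the BOM in a script. This is a minimal sketch: the helper name has_bom and the temporary directory are assumptions made for illustration, and head -c is assumed to be available (it is in GNU and BSD userlands):

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
tmp=$(mktemp -d)

# Recreate a file like bom_example.txt: BOM, "GÖKMEN", then CRLF.
# Octal escapes: 357 273 277 is the BOM, 303 226 is Ö in UTF-8.
printf '\357\273\277G\303\226KMEN\r\n' > "$tmp/bom_example.txt"

# has_bom (hypothetical helper): succeeds if the file starts with EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

if has_bom "$tmp/bom_example.txt"; then
    echo "BOM found"
else
    echo "no BOM"
fi
```

Because has_bom only reports through its exit status, it can guard any of the removal commands in the following sections.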

4. Using sed

We can use the sed command to remove the BOM:

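$ sed -i '1s/^\xef\xbb\xbf//' bom_example.txt

The -i option of sed is for in-place editing, i.e., sed writes the result back to the original file. The '1s/^\xef\xbb\xbf//' part of the command replaces the byte sequence 0xef 0xbb 0xbf, which is the BOM for UTF-8, with an empty string. The 1 address and the ^ anchor restrict the substitution to the start of the first line, so the same bytes appearing elsewhere in the file stay untouched. The \x escapes and -i without a suffix argument are GNU sed features, so other sed implementations may behave differently.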

Let’s display the contents of bom_example.txt using hexdump:

$ hexdump -c bom_example.txt 
0000000   G 303 226   K   M   E   N  \r  \n                            
0000009

As the output of hexdump shows, the three BOM bytes at the beginning of the file are gone, and the file size has dropped from 12 (0xc) to 9 bytes. Therefore, we've successfully removed the BOM.

Additionally, let’s check the encoding of bom_example.txt:

$ file bom_example.txt
bom_example.txt: UTF-8 Unicode text, with CRLF line terminators

The encoding of the file is UTF-8 Unicode without BOM as expected.

We can also remove the BOM by referring to its Unicode code point, U+FEFF:

$ sed -i $'1s/^\uFEFF//' bom_example.txt

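The sed script in the command, $'1s/^\uFEFF//', replaces the Unicode character U+FEFF at the start of the first line with an empty string.

The leading dollar sign in $'1s/^\uFEFF//' enables Bash's ANSI-C quoting, so the shell expands \uFEFF into the character itself before passing the script to sed. In a UTF-8 locale, this expansion yields the same three bytes, 0xEF 0xBB 0xBF. The \u escape requires Bash 4.2 or later.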
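
When processing many files, it helps to know that a first-line substitution is safe to run on files that have no BOM. The sketch below checks this, assuming GNU sed. The file names are made up for the demonstration, and LC_ALL=C is set so that sed treats the input as raw bytes regardless of the current locale:

```shell
# Assumes GNU sed (for \xHH escapes and -i without a suffix argument).
tmp=$(mktemp -d)
printf '\357\273\277first\nsecond\n' > "$tmp/with_bom.txt"
printf 'first\nsecond\n' > "$tmp/plain.txt"

# Strip the BOM only when it starts the first line; a file without one is left as-is.
for f in "$tmp/with_bom.txt" "$tmp/plain.txt"; do
    LC_ALL=C sed -i '1s/^\xef\xbb\xbf//' "$f"
done

# Both files should now be byte-for-byte identical.
cmp "$tmp/with_bom.txt" "$tmp/plain.txt" && echo "identical"
```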

5. Using vi

The vi editor is another option for removing the BOM from a file. The bomb option is a Vim feature, so on systems where vi is provided by Vim, we can use the :set nobomb command for this purpose.

However, instead of opening the file with vi, and then calling :set nobomb in the vi editor, we can achieve the same thing from the command line:

$ vi -c ":set nobomb" -c ":wq" bom_example.txt

The -c option executes an editor command after opening the file. The -c ":set nobomb" part of the command removes the BOM, while the -c ":wq" part saves the modification and quits vi.

Let’s check the modified file’s encoding scheme after removing the BOM:

$ file bom_example.txt
bom_example.txt: UTF-8 Unicode text, with CRLF line terminators

The encoding of the file is UTF-8 Unicode without BOM as expected.

6. Using tail

Another option for removing the BOM is the tail command. Normally, tail prints the last ten lines of the input file.

However, it’s possible to skip the beginning of a file and print the remaining part starting from an offset byte using the -c option. For example, -c +4 prints a file starting at byte 4:

$ tail -c +4 bom_example.txt > without_bom.txt

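Since the BOM consists of three bytes, we skip it using -c +4. We redirect the output to the file without_bom.txt. Note that tail drops the first three bytes unconditionally, so we should only run this command on a file that actually starts with a BOM.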

Let’s check the encoding of without_bom.txt:

$ file without_bom.txt
without_bom.txt: UTF-8 Unicode text, with CRLF line terminators

The encoding of the output file without_bom.txt is UTF-8 Unicode without BOM as expected.
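
Because tail -c +4 skips three bytes whether or not they're a BOM, a script may want to check first. The following is one possible sketch. The function name strip_bom and the temporary paths are hypothetical, and head -c is assumed to be available:

```shell
# strip_bom (hypothetical helper): copy $1 to $2, dropping a leading UTF-8 BOM if present.
strip_bom() {
    if [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        tail -c +4 "$1" > "$2"    # BOM present: start output at byte 4
    else
        cat "$1" > "$2"           # no BOM: copy the file unchanged
    fi
}

tmp=$(mktemp -d)
printf '\357\273\277G\303\226KMEN\r\n' > "$tmp/bom_example.txt"
strip_bom "$tmp/bom_example.txt" "$tmp/without_bom.txt"

# Show the size of the result in bytes.
wc -c < "$tmp/without_bom.txt"
```

Running strip_bom a second time on its own output leaves the content unchanged, which the bare tail command wouldn't guarantee.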

7. Using dos2unix

The dos2unix command is useful for converting a text file in DOS/MAC format to UNIX format. It removes the BOM in the input file during this conversion:

$ dos2unix bom_example.txt
dos2unix: converting file bom_example.txt to Unix format...

The dos2unix command writes over the input file. Let’s check the encoding scheme of bom_example.txt:

$ file bom_example.txt
bom_example.txt: UTF-8 Unicode text

As the output of the file command shows, the encoding of the file is UTF-8 Unicode without BOM. Additionally, dos2unix converts the CRLF line endings to LF, so the "with CRLF line terminators" note from the previous examples no longer appears.

8. Using Perl

The Perl programming language is well suited to string parsing and regular expressions. We can use the command-line interpreter, perl, to remove the BOM from a file:

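$ perl -pi -e 's/^\o{357}\o{273}\o{277}// if $. == 1' bom_example.txt
$ file bom_example.txt
bom_example.txt: UTF-8 Unicode text, with CRLF line terminators

The -p option of perl places a read-and-print loop around the Perl code. The code in our case is 's/^\o{357}\o{273}\o{277}// if $. == 1'. It replaces the BOM, specified by its octal byte values, with an empty string. The $. == 1 condition limits the substitution to the first line, and the single quotes keep the shell from interpreting the backslashes and the dollar sign.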

On the other hand, the -i option is for in-place editing. Therefore, with the combination -pi, perl writes each processed line back to the input file bom_example.txt instead of printing it to standard output.

The -e option of perl is for running a single-line Perl command from the command line.

The output of the command file bom_example.txt shows that we’re successful in removing the BOM from the input file.

9. Conclusion

In this article, we discussed how to remove the BOM from a file with the UTF-8 encoding scheme.

First, we discussed briefly what BOM is. Then, we learned how to remove the BOM using the sed, vi, tail, dos2unix, and perl commands.
