In this tutorial, we’re going to briefly introduce what Unicode is and then look at different ways to encode it. This will include the most common encodings, some less common ones, and why you would select one over another.
2. What Is Unicode?
Unicode is a standard for representing characters in computer systems. As of version 15.0, it defines 149,186 characters, spanning 161 scripts along with numerous special symbols and control characters. However, the standard has room for far more: its code space runs from U+0000 to U+10FFFF, giving 1,114,112 possible code points.
Unicode represents each individual character as a unique code point with a unique number. For example, the character “A” – Latin Capital Letter A – is represented as U+0041, whereas the character “⡊” – Braille Pattern Dots-247 – is represented as U+284A.
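To make the idea of a code point concrete, here is a minimal sketch using Python's built-in `ord` and the standard `unicodedata` module. The helper name `describe` is an illustrative assumption, not part of any standard.

```python
import unicodedata


def describe(character: str) -> str:
    """Return the U+XXXX notation and official name of a single character."""
    code_point = ord(character)  # the character's Unicode number
    return f"U+{code_point:04X} {unicodedata.name(character)}"


print(describe("A"))
print(describe("\u284A"))  # the Braille character from the text
```

Going the other way, `chr(0x284A)` turns the number back into the character.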
By contrast, ASCII – the next most popular character set – can represent 128 characters, covering unaccented English letters, digits, punctuation, and control codes. The Extended ASCII character set – ISO-8859-1 – expands this to 256 characters but still covers only Western European languages. This is a tiny subset of what Unicode is capable of representing.
However, ASCII and Extended ASCII fit each character into an 8-bit byte. Unicode, conversely, has far too many characters to fit into a single byte, so it needs special consideration for how to encode these characters as bytes.
The most obvious way to encode our Unicode characters is to store each code point’s number directly, using the same fixed number of bytes for every character. This is the idea behind the UTF-32 encoding system.
The full range of Unicode characters can fit inside a 21-bit number. However, most programming languages have built-in support for both 16-bit numbers and 32-bit numbers, but nothing in between. As such, we can comfortably represent every single Unicode character in a 32-bit number, which most languages can handle easily.
This has the significant benefit that we can represent every single character in exactly the same way, with no special encoding work or anything. We store the number as-is as a standard integer type. For example:
- U+0041 – This encodes as 0x41, or 00000000 00000000 00000000 01000001.
- U+284A – This encodes as 0x284A, or 00000000 00000000 00101000 01001010.
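The two examples above can be reproduced with a short sketch. It assumes big-endian byte order, which matches the bit layout shown; the function name `utf32_encode` is hypothetical.

```python
def utf32_encode(code_point: int) -> bytes:
    """Store the code point as-is in a 4-byte big-endian integer."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point.to_bytes(4, "big")


print(utf32_encode(0x0041).hex(" "))  # 00 00 00 41
print(utf32_encode(0x284A).hex(" "))  # 00 00 28 4a
```

Python's built-in codec agrees: `"A".encode("utf-32-be")` yields the same four bytes.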
This all seems very straightforward, so why isn’t this the standard used everywhere? Simply because it’s very wasteful. Around 1/3 of the bits are entirely wasted space – 11 out of 32 bits. However, it’s even worse than this. Most human languages will fit within the first 16 bits, meaning that the other 16 bits are wasted. And the entirety of the English language – which is still the most commonly used in computing – will fit within the first 8 bits.
This means that if we use this encoding, we’re wasting a significant amount of space – potentially as much as 75% – in trade for ease of encoding/decoding.
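As a rough illustration of that 75% figure, the sketch below compares a short piece of English text stored as UTF-32 against the same text stored at one byte per character. The sample string is an arbitrary assumption.

```python
text = "Hello, world"  # 12 characters, all within the ASCII range

utf32_size = len(text.encode("utf-32-be"))  # 4 bytes for every character
single_byte_size = len(text.encode("ascii"))  # 1 byte for every character

wasted = 1 - single_byte_size / utf32_size
print(utf32_size, single_byte_size, f"{wasted:.0%}")  # 48 12 75%
```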
UTF-8 is one of the most commonly used encoding systems. It’s a variable-length encoding, meaning that we represent characters by between 1 and 4 bytes. This makes it notably more complicated to work with, but it’s much more efficient regarding space usage.
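UTF-8 encodes different ranges of Unicode code points differently. Within each of these ranges, the encoded bytes have a special prefix to identify which range we’re working with, and then the remaining bits come from the actual character itself:
| First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- | --- |
| U+0000 | U+007F | 0xxxxxxx | | | |
| U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
| U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Here, each x bit gets replaced with the appropriate bit from the character code point.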
This looks complicated, so let’s look at some real examples.
If we wanted to encode U+0041, we first need to determine which row of our table to use. In this case, it fits into the first one. Next, we need to work out the bit sequence that represents our Unicode character. In this case, it’s 00000000 00000000 00000000 01000001.
From our table, we have a single byte that has 7 replaceable bits. This means that we take the 7 lowest bits from our Unicode character and substitute them in. Note that all of the other bits are 0, and so we can safely discard them. This gives us a UTF-8 encoding of 01000001 or 0x41.
If, instead, we wanted to encode U+284A – 00000000 00000000 00101000 01001010 – then we follow the exact same process. This falls into the third row of our table, meaning that we’ll be encoding it into 3 bytes with 16 replaceable bits.
As before, we now substitute our lowest 16 bits into our pattern and can safely discard the rest. This gives us an encoding of 11100010 10100001 10001010, or 0xE2 0xA1 0x8A.
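The same substitution can be sketched in code. This is a simplified illustration rather than a production encoder: the function name `utf8_encode` is hypothetical, and it does not reject the surrogate range U+D800 to U+DFFF as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling prefix patterns with its bits."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:  # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:  # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:  # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


print(utf8_encode(0x0041).hex(" "))  # 41
print(utf8_encode(0x284A).hex(" "))  # e2 a1 8a
```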
3.1. Decoding UTF-8
Given that we’ve seen how UTF-8 is more complicated to encode, we also need to be able to decode it reliably. Perhaps surprisingly, this is comparatively easy. Our byte prefixes uniquely tell us the number of bytes that are involved in a given character, so we just need to strip this off and correctly combine what’s left.
As before, this is easiest to see from a worked example.
Let’s look at 0xE2 0xA1 0x8A, or 11100010 10100001 10001010. Our first byte starts with the prefix 1110, which tells us we’re looking at 3 bytes. Next, we can strip off the prefixes, and what we’re left with is 0010 100001 001010. These bits are then combined into a single value, giving us 0x284A.
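A minimal sketch of this decoding step follows. The name `utf8_decode_char` is hypothetical, and the function only handles a single character; it checks prefixes but does not reject overlong encodings, which a strict decoder would.

```python
def utf8_decode_char(data: bytes) -> int:
    """Decode the first UTF-8 character in data and return its code point."""
    first = data[0]
    if first >> 7 == 0b0:  # 0xxxxxxx: 1 byte
        length, value = 1, first
    elif first >> 5 == 0b110:  # 110xxxxx: 2 bytes
        length, value = 2, first & 0b00011111
    elif first >> 4 == 0b1110:  # 1110xxxx: 3 bytes
        length, value = 3, first & 0b00001111
    elif first >> 3 == 0b11110:  # 11110xxx: 4 bytes
        length, value = 4, first & 0b00000111
    else:
        raise ValueError("not a valid leading byte")
    if len(data) < length:
        raise ValueError("truncated character")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        # Strip the 10 prefix and append the remaining 6 bits.
        value = (value << 6) | (byte & 0b00111111)
    return value


print(f"U+{utf8_decode_char(bytes([0xE2, 0xA1, 0x8A])):04X}")  # U+284A
```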
3.2. Detecting Encoding Errors
Because UTF-8 is a variable-length encoding, we need to know exactly where every character starts to be able to decode them correctly. If the characters are all correctly encoded, and there’s no corruption, then this is easy: we just start each one immediately after the previous one. However, even if we have a corrupted data stream, we can still salvage data from it.
Every byte in a UTF-8 stream has a distinct prefix. The leading byte’s prefix – 0, 110, 1110, or 11110 – tells us how many bytes are in the character. Every subsequent byte starts with 10, a prefix that never appears on a leading byte.
This means if we’re decoding a corrupted byte stream, we can use these byte prefixes to detect when a valid character starts and that the subsequent bytes are all correctly part of it. While not perfect, it gives some indication of what each byte means when we’re decoding our characters. For example, given the byte stream 0x92 0xE2 0xA1 0x8A, we can tell that 0x92 is not a valid starting byte because it fits the pattern 10xxxxxx instead of any of the others.
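One possible way to resynchronize on the example stream is to skip bytes that carry the continuation prefix until a plausible leading byte appears. The helper name `skip_to_character_start` is an assumption for illustration, and real decoders apply more checks than this.

```python
def skip_to_character_start(data: bytes, position: int = 0) -> int:
    """Return the index of the first byte at or after position
    that does not match the continuation pattern 10xxxxxx."""
    while position < len(data) and data[position] >> 6 == 0b10:
        position += 1
    return position


stream = bytes([0x92, 0xE2, 0xA1, 0x8A])
start = skip_to_character_start(stream)  # skips the stray 0x92
recovered = stream[start:].decode("utf-8")
print(start, recovered)
```

Here the stray byte is dropped and the remaining three bytes decode to the Braille character U+284A.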
3.3. Comparison to ASCII
We saw earlier that the UTF-8 encoding for U+0041 happens to be 0x41. This isn’t a coincidence but rather a deliberate design choice.
The Unicode standard is designed so that the first 128 code points – U+0000 to U+007F – exactly correspond to the 128 characters that make up the ASCII character set. Furthermore, the UTF-8 encoding is designed so that these 128 code points all correspond to a 1-byte encoding with the same value.
This pair of decisions means that all valid ASCII byte streams are also valid UTF-8 byte streams with the exact same meaning. This then allows any UTF-8-aware decoders to also understand ASCII streams. It also means that any generated UTF-8 byte streams that only consist of these characters can be understood as valid ASCII streams.
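This compatibility is easy to check with Python's built-in codecs. The sample strings below are arbitrary assumptions; the last part shows that the second direction only holds while the text stays within ASCII.

```python
text = "Hello"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

same_bytes = ascii_bytes == utf8_bytes  # identical byte streams
ascii_read_as_utf8 = ascii_bytes.decode("utf-8")  # ASCII data, UTF-8 decoder
utf8_read_as_ascii = utf8_bytes.decode("ascii")  # UTF-8 data, ASCII decoder

# A character outside ASCII breaks the second direction.
try:
    "caf\u00E9".encode("utf-8").decode("ascii")
    non_ascii_decodes = True
except UnicodeDecodeError:
    non_ascii_decodes = False

print(same_bytes, ascii_read_as_utf8, utf8_read_as_ascii, non_ascii_decodes)
```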
The other common encoding format for Unicode is UTF-16. This encodes every character into one or two 16-bit values – meaning that every character takes 2 or 4 bytes of space.
Many platforms – including Java – use UTF-16. It never takes more space than UTF-32, it’s more space efficient than UTF-8 for text dominated by characters between U+0800 and U+FFFF, and it’s less complicated to implement for most characters. However, it does always require UTF-16-aware tooling, even for the characters that overlap with ASCII, since they are encoded differently.
All characters between U+0000 and U+D7FF, and then again between U+E000 and U+FFFF, encode directly as a 16-bit number. Together with the gap between them – U+D800 to U+DFFF, which is reserved for the surrogate pairs described below – this is the entire Basic Multilingual Plane, which includes the majority of human languages. For example:
- U+0041 – This encodes as 0x0041, or 00000000 01000001.
- U+284A – This encodes as 0x284A, or 00101000 01001010.
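For these directly encoded characters, the encoding step is little more than a range check. The sketch below assumes big-endian byte order, and the name `utf16_encode_bmp` is hypothetical.

```python
def utf16_encode_bmp(code_point: int) -> bytes:
    """Encode a code point that fits in a single 16-bit value."""
    directly_encodable = (
        0x0000 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0xFFFF
    )
    if not directly_encodable:
        raise ValueError("needs a surrogate pair or is a surrogate itself")
    return code_point.to_bytes(2, "big")


print(utf16_encode_bmp(0x0041).hex(" "))  # 00 41
print(utf16_encode_bmp(0x284A).hex(" "))  # 28 4a
```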
UTF-8 encodes a significant portion of this range as 3 bytes as compared to only needing 2 bytes with UTF-16 – everything between U+0800 and U+FFFF, so 63,488 code points, of which 2,048 are reserved for surrogates. Furthermore, everything that UTF-16 encodes with 4 bytes will also be encoded with 4 bytes in UTF-8.
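To see how the two encodings compare across ranges, the sketch below measures one sample character from each UTF-8 length class using Python's built-in codecs. The choice of sample characters is an assumption.

```python
samples = {
    "U+0041": "\u0041",  # ASCII range
    "U+00E9": "\u00E9",  # U+0080 to U+07FF
    "U+284A": "\u284A",  # U+0800 to U+FFFF
    "U+1F0A1": "\U0001F0A1",  # above U+FFFF
}

# Map each label to (UTF-8 size, UTF-16 size) in bytes.
sizes = {
    label: (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    for label, ch in samples.items()
}

for label, (utf8_size, utf16_size) in sizes.items():
    print(f"{label}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

UTF-8 is smaller for the first sample, the two tie for the second and fourth, and UTF-16 is smaller only for the third.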
4.1. Encoding Surrogate Pairs
The process for encoding a character that is greater than U+FFFF is as follows:
- First, we subtract 0x10000 from the code point. This gives us a value between 0x00000 and 0xFFFFF, which is 20 bits in size.
- We now take the high 10 bits of this as a number and add it to 0xD800. This gives a value between 0xD800 and 0xDBFF.
- Next, we take the low 10 bits and add it to 0xDC00. This gives a value between 0xDC00 and 0xDFFF.
These two values are our encoding for this character. Let’s see this in practice by encoding the Unicode character “🂡” – Playing Card Ace of Spades – which is U+1F0A1.
Firstly we subtract 0x10000 from 0x1F0A1. This leaves us with 0xF0A1, or 0000 11110000 10100001 when represented as a 20-bit binary number.
We now take the high 10 bits of this – 0000111100 – and add this to 0xD800. This comes out as 0xD83C.
We can also take the low 10 bits – 0010100001 – and add this to 0xDC00. This comes out as 0xDCA1.
As such, we encode U+1F0A1 as 0xD83C 0xDCA1 when using UTF-16.
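The three steps translate almost line for line into code. This is a sketch under the assumption that the input lies above U+FFFF; the name `to_surrogate_pair` is hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000  # 20-bit value
    high = 0xD800 + (offset >> 10)  # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F0A1)
print(f"0x{high:04X} 0x{low:04X}")  # 0xD83C 0xDCA1
```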
4.2. Decoding Surrogate Pairs
Now we know how to encode high characters as surrogate pairs, we need to be able to go the other way.
Firstly we need to ensure that we have valid values. This means that we have two 16-bit values, where the first is between 0xD800 and 0xDBFF whilst the second is between 0xDC00 and 0xDFFF. If this isn’t the case, then we have corrupted bytes and can’t continue.
Next, we subtract the appropriate values from each – 0xD800 from the first and 0xDC00 from the second. This will leave us with two 10-bit numbers. We then combine these to produce a single 20-bit number and finally add 0x10000 to produce the correct Unicode codepoint.
Let’s see this in action. Our starting values are 0xD83C 0xDCA1. We can see that this is valid because both are within the appropriate ranges.
Subtracting 0xD800 from 0xD83C gives us 0x3C or 0000111100.
Subtracting 0xDC00 from 0xDCA1 gives us 0xA1 or 0010100001.
Combining these together gives us 0000 11110000 10100001, or 0xF0A1. Finally, we add 0x10000 to get our answer of U+1F0A1.
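The reverse direction can be sketched the same way, including the range checks described above. The name `from_surrogate_pair` is hypothetical.

```python
def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    top_bits = high - 0xD800  # 10-bit number
    bottom_bits = low - 0xDC00  # 10-bit number
    return 0x10000 + ((top_bits << 10) | bottom_bits)


print(f"U+{from_surrogate_pair(0xD83C, 0xDCA1):X}")  # U+1F0A1
```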
We’ve seen here how the most common Unicode encodings work, as well as some of the tradeoffs that are made between them. Next time we need to encode character data, we can think about exactly how this should work and, therefore, which encoding is best to use.