Unicode is a standard for representing characters within our programs. It supports a huge number of characters: as of Unicode 15.0, it defines 149,186 of them. However, supporting this many characters means that we need a way to represent every one of them in memory.
In this tutorial, we’re going to explore different ways Unicode characters can be represented and how much space each takes.
2. ASCII Encoding
Before Unicode, there were several different character encodings to use. Of these, ASCII was one of the more common ones.
The official ASCII standard represents characters in 7 bits, allowing for 128 characters. However, this includes 33 control characters, which represent things such as tabs, new lines, and some special constructs used by early terminal controls. This leaves 95 printable characters, including the space. Obviously, this is nowhere near as many as Unicode supports. In fact, it’s only really enough to support English.
There are then various extensions to ASCII that use the eighth bit to add some extra characters. For example, ISO-8859-1, also known as Latin-1, uses the 128 additional byte values to add 96 printable characters and supports most western European languages.
So far, all of this fits into a single byte, which makes it very convenient for representing in memory, storing on disk or transmitting between systems. However, it’s a very limited number of characters: the 191 printable characters of Latin-1 are only about 0.1% of what Unicode can represent! In particular, it’s only able to support western European languages, with no support for other scripts such as Arabic, Cyrillic or Tamil.
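As a rough illustration of these single-byte limits, the sketch below uses Python's built-in "ascii" and "latin-1" codecs. The helper name can_encode is made up for this example and isn't part of any library.

```python
# Encode text with the single-byte encodings discussed above.
ascii_bytes = "Hello".encode("ascii")
latin1_bytes = "caf\u00e9".encode("latin-1")   # "cafe" with an acute accent on the e

print(ascii_bytes.hex(" "))    # 48 65 6c 6c 6f
print(latin1_bytes.hex(" "))   # 63 61 66 e9


def can_encode(text: str, encoding: str) -> bool:
    """Hypothetical helper: True if every character fits the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("caf\u00e9", "ascii"))     # False: U+00E9 is outside ASCII
print(can_encode("\u0ba4", "latin-1"))      # False: a Tamil letter is outside Latin-1
```

Every character that these encodings support takes exactly one byte, and anything outside their repertoire simply can't be stored.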
3. What is Unicode?
Unicode is an alternative way to represent characters, with a clear separation between the characters themselves and how they are stored. Every character in the Unicode specification is assigned a code point, ranging from U+0000 up to U+10FFFF. In this range, we have support for a significant number of characters, covering a large number of different scripts from current and historical languages, as well as many other things such as emojis, mathematical symbols, music, braille and so on.
Obviously, it’s not possible to represent this number of characters in a single byte. In fact, Unicode currently requires 21 bits to represent every possible character, which in turn means that we need 3 bytes. However, this will mean that all text content suddenly takes three times as much space to store, which isn’t ideal. As such, there are several different encodings we can use. These have different benefits for use in different situations.
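To make code points and the 21-bit figure concrete, here's a minimal Python sketch. It relies only on the built-in ord, chr and int.bit_length; the variable names are our own.

```python
square_root = "\u221a"          # SQUARE ROOT
ace_of_spades = "\U0001F0A1"    # PLAYING CARD ACE OF SPADES

# ord() gives the code point of a character, chr() goes the other way.
print(f"U+{ord(square_root):04X}")     # U+221A
print(f"U+{ord(ace_of_spades):04X}")   # U+1F0A1

# The highest code point tells us how many bits a fixed-width scheme needs.
highest_code_point = 0x10FFFF
bits_needed = highest_code_point.bit_length()
bytes_needed = (bits_needed + 7) // 8   # round up to whole bytes
print(bits_needed, bytes_needed)        # 21 3
```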
Unicode also has special cases whereby multiple Unicode characters generate a single glyph on the screen. These are often referred to as “combining characters”. In some cases, these are just ways of adjusting other characters slightly. For example, there are combining characters that add accents to other characters. In other cases, these can produce entirely new glyphs on the screen. For example, some of the emoji characters can be modified in this way to produce entirely new symbols.
In all of the below discussions, we are only considering individual characters. Glyphs made up of multiple characters aren’t discussed, but it can be assumed that those individual characters are each treated in the correct way for the appropriate encoding schemes.
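As a small illustration of the combining characters mentioned above, the sketch below builds an accented letter from two code points and then uses Python's standard unicodedata module to normalise it into a single one. This is only meant to show that one glyph and one character aren't the same thing.

```python
import unicodedata

# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT: one glyph, two characters.
decomposed = "e\u0301"

# NFC normalisation replaces the pair with the precomposed character U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))              # 2
print(len(composed))                # 1
print(f"U+{ord(composed):04X}")     # U+00E9
```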
4. UTF-32 Encoding
The most obvious encoding to use is UTF-32. This is simply a 32-bit encoding scheme that will represent every single possible Unicode character in the same way.
Having a fixed encoding length for every character makes some operations easier. For example, we always know the number of characters in a string by simply dividing the number of bytes by four. We can also easily jump to any character in a string because the byte offset of the character at index n is always 4 × n.
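The fixed width can be sketched in a few lines of Python. We use the built-in "utf-32-be" codec so that no byte order mark is added; the helper char_at is a name invented for this example.

```python
text = "a\u221a\U0001F0A1"          # "a", SQUARE ROOT, PLAYING CARD ACE OF SPADES
data = text.encode("utf-32-be")     # explicit byte order, so no byte order mark

# Every character is exactly 4 bytes, so the length is a simple division.
char_count = len(data) // 4
print(char_count)                   # 3


def char_at(buf: bytes, index: int) -> str:
    """Hypothetical helper: jump straight to the character at the given index."""
    start = index * 4
    return buf[start:start + 4].decode("utf-32-be")


print(f"U+{ord(char_at(data, 2)):04X}")   # U+1F0A1
```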
However, this also means that every character will take 4 bytes (32 bits) of space to represent. This means that anything that uses it to represent text takes four times as much memory, storage, bandwidth, etc., for the content as needed for the ASCII equivalent. For anything that ASCII can represent, this is obviously hugely inefficient.
We also have to be careful of the endianness of the bytes. Without knowing this, it’s impossible to know what the bytes actually mean. UTF-32 is represented as a single 32-bit integer. For example, the character “PLAYING CARD ACE OF SPADES” (🂡) is code point U+1F0A1. Represented in UTF-32, this would be 0x0001F0A1. However, this will be stored in memory as either “00 01 F0 A1” in a big-endian system or “A1 F0 01 00” in a little-endian system.
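We can check both byte orders with Python's built-in "utf-32-be" and "utf-32-le" codecs. This is only a quick sketch to confirm the byte sequences above.

```python
ace_of_spades = "\U0001F0A1"

big_endian = ace_of_spades.encode("utf-32-be")
little_endian = ace_of_spades.encode("utf-32-le")

print(big_endian.hex(" "))      # 00 01 f0 a1
print(little_endian.hex(" "))   # a1 f0 01 00

# The same bytes fall out of treating the code point as a plain 32-bit integer.
same_as_integer = (0x1F0A1).to_bytes(4, "big") == big_endian
print(same_as_integer)          # True
```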
Note that the decision to use 4 bytes instead of 3 was made before Unicode was officially restricted to being a 21-bit scheme. However, there are some other benefits to using 4 bytes as well. Many computers are optimised for working with 32-bit numbers and can do so significantly more efficiently than they can with other structures. As such, a 24-bit number (3 bytes) would actually be less efficient for computers to process, even though it would save on storage space.
5. UTF-8 Encoding
Possibly the most popular encoding system for Unicode characters is UTF-8. This is a variable length encoding system, where we represent every character with between 8 and 32 bits. Of note, the UTF-8 encoding for any character will never take more bytes than the UTF-32 encoding for the same character but may take fewer bytes.
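UTF-8 works by using prefix bits on the first byte to indicate the number of bytes that the character uses:
- 0xxx xxxx – 8-bit (1 byte), for code points U+0000 – U+007F
- 110x xxxx – 16-bit (2 bytes), for code points U+0080 – U+07FF
- 1110 xxxx – 24-bit (3 bytes), for code points U+0800 – U+FFFF
- 1111 0xxx – 32-bit (4 bytes), for code points U+10000 – U+10FFFF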
All subsequent bytes in the same character have a different prefix, “10xx xxxx”, so we can tell these are in the middle of the character and not the start of a new one. This, in turn, means that if we randomly end up in the middle of a character, we know enough to find either the start of the current character or the start of the next one.
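This resynchronisation property can be sketched as follows. The function names are our own, and the code assumes the buffer holds well-formed UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the 10xx xxxx prefix of a continuation byte."""
    return (byte & 0b1100_0000) == 0b1000_0000


def start_of_char(buf: bytes, pos: int) -> int:
    """Walk backwards from pos to the first byte of the character containing it."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos


def start_of_next_char(buf: bytes, pos: int) -> int:
    """Walk forwards from pos to the first byte of the following character."""
    pos += 1
    while pos < len(buf) and is_continuation(buf[pos]):
        pos += 1
    return pos


# Bytes: 61 | E2 88 9A | F0 9F 82 A1
data = "a\u221a\U0001F0A1".encode("utf-8")

print(start_of_char(data, 6))        # 4: index 6 is inside the 4-byte character
print(start_of_next_char(data, 2))   # 4: index 2 is inside the 3-byte character
```

With a fixed-width encoding we would calculate these offsets directly, but here a short scan over at most three continuation bytes is enough.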
5.1. Worked Example
This sounds complicated, so let’s look at a real example. Our “PLAYING CARD ACE OF SPADES” character is code point U+1F0A1. This is above U+FFFF, which puts it in the “32-bit” category. This, in turn, means that we know the byte pattern is going to be:
- First byte – 1111 0xxx
- Second byte – 10xx xxxx
- Third byte – 10xx xxxx
- Fourth byte – 10xx xxxx
So this gives us 21 bits into which we need to fit our Unicode code point. This works out as “0 0001 1111 0000 1010 0001”.
If we fit these into our bit pattern, then we end up with the UTF-8 encoding as being “1111 0000”, “1001 1111”, “1000 0010”, “1010 0001”. This, in turn, is “F0 9F 82 A1”.
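The same bit-fitting can be written as a short Python function. This is a sketch for illustration only: utf8_encode is our own name, and in real code we would simply call str.encode("utf-8"), which also rejects surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point into UTF-8 bytes by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point < 0x80:          # 0xxx xxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110x xxxx, 10xx xxxx
        return bytes([0b1100_0000 | (code_point >> 6),
                      0b1000_0000 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110 xxxx, 10xx xxxx, 10xx xxxx
        return bytes([0b1110_0000 | (code_point >> 12),
                      0b1000_0000 | ((code_point >> 6) & 0x3F),
                      0b1000_0000 | (code_point & 0x3F)])
    # 1111 0xxx, 10xx xxxx, 10xx xxxx, 10xx xxxx
    return bytes([0b1111_0000 | (code_point >> 18),
                  0b1000_0000 | ((code_point >> 12) & 0x3F),
                  0b1000_0000 | ((code_point >> 6) & 0x3F),
                  0b1000_0000 | (code_point & 0x3F)])


print(utf8_encode(0x1F0A1).hex(" "))   # f0 9f 82 a1
```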
We can also go in the opposite direction, converting a set of UTF-8 bytes into a character. For example, let’s take the bytes E2 88 9A. Our first byte is “1110 0010”, so we immediately know this is a 24-bit character. Our second byte is “1000 1000”, and our third is “1001 1010”, both of which have our expected prefix of “10”.
If we then apply our patterns to these bytes, we get left with the bit sequence “0010 0010 0001 1010”, which is U+221A, or “SQUARE ROOT” (√).
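Decoding can be sketched in the same way. The function utf8_decode_one is a hypothetical helper that reads a single character from the start of a byte string. It checks the prefixes but, unlike a real decoder, it doesn't reject overlong encodings or surrogates.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode the first UTF-8 character in data and return its code point."""
    first = data[0]
    if first < 0x80:                    # 0xxx xxxx
        length, value = 1, first
    elif first >> 5 == 0b110:           # 110x xxxx
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:          # 1110 xxxx
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:         # 1111 0xxx
        length, value = 4, first & 0x07
    else:
        raise ValueError("not a valid first byte")
    if len(data) < length:
        raise ValueError("truncated character")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)   # append the 6 payload bits
    return value


print(f"U+{utf8_decode_one(bytes.fromhex('E2889A')):04X}")   # U+221A
```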
This all seems very complicated, so why bother? What are the benefits of this scheme?
Because of the variable-length encoding, most characters take less storage space than with UTF-32, and no character ever takes more. UTF-8 is therefore always at least as space-efficient as UTF-32.
However, it’s more than this. The first 128 Unicode code points deliberately map onto the 128 ASCII code points. This means that all ASCII characters are encoded exactly the same in UTF-8. This, in turn, means that any text data stored in ASCII can be treated as if it were in UTF-8 with no problems. Note that this doesn’t apply to the various versions of extended ASCII, only the basic characters.
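A quick way to see this compatibility, and where it stops, is to compare the bytes produced by Python's built-in codecs. The variable names below are our own.

```python
text = "Hello, world"

# Pure ASCII text produces identical bytes in both encodings.
same_bytes = text.encode("ascii") == text.encode("utf-8")
print(same_bytes)                        # True

# Extended ASCII isn't compatible: U+00E9 is one byte in Latin-1, two in UTF-8.
latin1_bytes = "\u00e9".encode("latin-1")
utf8_bytes = "\u00e9".encode("utf-8")
print(latin1_bytes.hex(" "))             # e9
print(utf8_bytes.hex(" "))               # c3 a9

# Reading the Latin-1 byte as UTF-8 fails because a continuation byte is missing.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)              # False
```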
This means most English language texts are stored in the most efficient form possible with UTF-8. This applies not only to prose but also to source code, HTML, JSON and XML documents, and so on.
However, many other languages suffer as a result of this. For example, the Tamil language uses code points U+0B80 – U+0BFF. These will always be 3 bytes in the UTF-8 encoding.
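To see the cost for Tamil, we can compare a short Tamil word with its English counterpart. The sample word is our own choice, written with escapes so that the code points are visible.

```python
tamil = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"   # the word "Tamil" in Tamil script, 5 code points
english = "Tamil"                          # 5 ASCII characters

tamil_size = len(tamil.encode("utf-8"))
english_size = len(english.encode("utf-8"))
print(tamil_size, english_size)            # 15 5

# Every code point in the Tamil block needs exactly 3 bytes.
block_sizes = {len(chr(cp).encode("utf-8")) for cp in range(0x0B80, 0x0C00)}
print(block_sizes)                         # {3}
```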
6. UTF-16 Encoding
Another common encoding system that we see is UTF-16. For example, this is how Java represents strings. This works by representing every character with a code point between U+0000 and U+FFFF as a single 16-bit number. Any characters with code points above this range will instead be stored as two 16-bit numbers, referred to as a “Surrogate Pair”.
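This means that UTF-16 is never going to take more bytes for the same character than UTF-32 does. In fact, the majority of characters will take fewer bytes. It also means that the majority of characters will take no more bytes in UTF-16 than they do in UTF-8, and in fact, many will take fewer:
- U+0000 – U+007F: 1 byte in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32
- U+0080 – U+07FF: 2 bytes in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32
- U+0800 – U+FFFF: 3 bytes in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32
- U+10000 – U+10FFFF: 4 bytes in UTF-8, 4 bytes in UTF-16, 4 bytes in UTF-32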
So only the first 128 characters are less efficient in UTF-16 than in UTF-8. However, these happen to be our ASCII characters again. This means in this encoding ASCII characters are represented differently – so we can’t treat ASCII text files as UTF-16 like we can with UTF-8 – but also that they will take twice the storage space.
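We can measure these sizes directly. The sketch below picks one sample character from each range (the choice of samples is ours) and encodes it with Python's built-in codecs, using the "-be" variants so that no byte order mark is counted.

```python
samples = {
    "U+0041": "\u0041",        # LATIN CAPITAL LETTER A
    "U+00E9": "\u00e9",        # LATIN SMALL LETTER E WITH ACUTE
    "U+221A": "\u221a",        # SQUARE ROOT
    "U+1F0A1": "\U0001F0A1",   # PLAYING CARD ACE OF SPADES
}


def encoded_sizes(ch: str) -> tuple:
    """Return the byte counts for ch in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))


for name, ch in samples.items():
    print(name, encoded_sizes(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+221A (3, 2, 4)
# U+1F0A1 (4, 4, 4)
```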
6.1. Surrogate Pairs
The majority of Unicode characters are represented as exactly their code point in UTF-16, so no conversion is necessary. For example, our “SQUARE ROOT” character is U+221A, which gets represented in UTF-16 simply as “22 1A”.
However, when we reach the Unicode characters in the upper ranges, anything that is U+10000 and above, this no longer works. In this case, we have the concept of Surrogate Pairs. These characters get encoded as two 16-bit values taken from a special range of code points reserved for this purpose, U+D800 – U+DFFF. As such, these characters will take 4 bytes of storage, but as it happens, they also take 4 bytes in both UTF-8 and UTF-32, so it’s no less space efficient here.
So how does this work?
- Subtract 0x10000 from our code point since this is implied by the fact that we are encoding in this way.
- Generate the high surrogate by shifting this number right 10 bits and adding the resulting number to 0xD800.
- Generate the low surrogate by taking the lower 10 bits and adding this to 0xDC00.
For example, let’s look at our “PLAYING CARD ACE OF SPADES” character again. This was U+1F0A1, so we:
- Subtract 0x10000 from our code point, leaving us with 0xF0A1.
- Shift 0xF0A1 right 10 bits to leave us with 0011 1100 (0x3C), and add this to 0xD800 to give us 0xD83C.
- Take the lower 10 bits of 0xF0A1, 00 1010 0001 (0xA1), and add this to 0xDC00 to give us 0xDCA1.
Thus, our character gets represented in UTF-16 as 0xD83C 0xDCA1.
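The three steps above translate almost directly into code. The function to_surrogates is a name chosen for this sketch; the result is checked against Python's built-in "utf-16-be" codec.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000 to U+10FFFF use surrogates")
    offset = code_point - 0x10000          # step 1: remove the implied 0x10000
    high = 0xD800 + (offset >> 10)         # step 2: upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # step 3: lower 10 bits
    return high, low


high, low = to_surrogates(0x1F0A1)
print(f"0x{high:04X} 0x{low:04X}")         # 0xD83C 0xDCA1

# The built-in codec produces the same two 16-bit values.
print("\U0001F0A1".encode("utf-16-be").hex(" "))   # d8 3c dc a1
```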
As always, we can also go in the other direction. If we see a 16-bit value in the range 0xD800 – 0xDBFF, then we know this is a high surrogate, and 0xDC00 – 0xDFFF is a low surrogate. As long as we have both of these, then we can determine the encoded character:
- Subtract 0xD800 from the high surrogate, and 0xDC00 from the low surrogate.
- Shift the remaining high surrogate value 10 bits to the left.
- Add the remaining low surrogate value to this.
- Add 0x10000 to this.
So, for our example above, we have:
- Subtract 0xD800 from 0xD83C, and 0xDC00 from 0xDCA1. This gives us 0x3C and 0xA1.
- Shift 0x3C 10 bits to the left to give us 0xF000.
- Add 0xA1 to this to get 0xF0A1.
- Add 0x10000 to this to get 0x1F0A1.
And we can see that this is our original character again.
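The reverse direction can be sketched in the same way. Again, from_surrogates is a hypothetical helper written for this example.

```python
def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    value = (high - 0xD800) << 10     # upper 10 bits
    value += low - 0xDC00             # lower 10 bits
    return value + 0x10000            # restore the implied 0x10000


code_point = from_surrogates(0xD83C, 0xDCA1)
print(f"U+{code_point:04X}")          # U+1F0A1
```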
6.2. Benefits and Drawbacks
UTF-16 has some obvious benefits over both UTF-8 and UTF-32. It’s never any less space efficient than UTF-32 and rarely less space efficient than UTF-8. It’s also notably easier to work with than UTF-8 for most characters.
So why is UTF-8 the de-facto standard and not UTF-16? There are two notable drawbacks that UTF-16 has compared to UTF-8:
- We need to think about the endianness of our bytes again. Because we are representing values as 16-bit numbers, we need to know if we are storing them as big-endian or little-endian. UTF-8 avoids this by always using 8-bit numbers.
- ASCII characters don’t map directly onto UTF-16 bytes the same way they do for UTF-8. This means that we can’t treat anything stored in ASCII as UTF-16 in the same way we can with UTF-8.
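Both drawbacks are easy to demonstrate with Python's built-in "utf-16-be" and "utf-16-le" codecs. This is only a small sketch, and the variable names are our own.

```python
square_root = "\u221a"

big_endian = square_root.encode("utf-16-be")
little_endian = square_root.encode("utf-16-le")
print(big_endian.hex(" "))       # 22 1a
print(little_endian.hex(" "))    # 1a 22

# Guessing the wrong byte order silently yields a different character.
misread = big_endian.decode("utf-16-le")
print(f"U+{ord(misread):04X}")   # U+1A22

# ASCII text isn't byte-compatible with UTF-16.
ascii_bytes = "A".encode("ascii")
utf16_bytes = "A".encode("utf-16-be")
print(ascii_bytes.hex(" "))      # 41
print(utf16_bytes.hex(" "))      # 00 41
```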
7. Conclusion
We’ve looked at some major encoding systems for the Unicode character set and seen how they represent the characters when stored on disk, held in memory or transmitted over the network.
We’ve also seen some of the drawbacks to these schemes. Specifically, the more space-efficient encodings pay for their smaller storage footprint with extra runtime complexity.
Next time you need to work with storing and transmitting text, consider which encoding system is best for your needs.