1. Overview
In this tutorial, we’ll learn how to encode a string in UTF-8 in Kotlin.
An encoding maps each character to a sequence of bytes. UTF-8 is one of the most common encodings for text, and often the default. It’s usually the recommended encoding for communication and storage. Kotlin supports various other encodings, represented as Charset instances, and the most common ones are exposed as constants on the Charsets object.
Let’s first go through an example to see why we would use UTF-8 and how it differs from others, and then demonstrate a few ways to ensure UTF-8 encoding.
In these examples, we’ll assume the input is of an arbitrary encoding, even though, in many cases, it’ll be UTF-8.
2. The Difference Between Encodings
To better understand the use case, let’s look at the code below, which shows the difference between UTF-8 and ASCII encoding:
val originalString = "That will cost €10."
val stringAsByteArray = originalString.toByteArray()
val utf8String = String(stringAsByteArray, Charsets.UTF_8)
val asciiString = String(stringAsByteArray, Charsets.US_ASCII)
Assertions.assertEquals(originalString, utf8String)
Assertions.assertEquals("That will cost ���10.", asciiString)
As we can see, certain characters cannot be represented accurately in basic ASCII: decoding the three UTF-8 bytes of the Euro symbol as ASCII yields three replacement characters. UTF-8 is a Unicode encoding while basic ASCII isn’t, and, ignoring extended ASCII, representing Unicode characters like the Euro symbol requires something more advanced. Many other special characters behave similarly, especially in non-English languages. Different Charsets encode the same character differently, so it’s important to use the right encoding for the use case. In this tutorial, we’re focusing on UTF-8 for the reasons described below.
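To make the point about differing byte representations concrete, here is a minimal sketch that prints the bytes of the Euro symbol under a few Charsets. The helper name hexBytes is hypothetical and only used for illustration; the expected values in the comments assume the standard JVM charset implementations, which substitute '?' for characters a charset can't represent.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: encodes the text with the given charset and renders the bytes as hex.
fun hexBytes(text: String, charset: Charset): String =
    text.toByteArray(charset).joinToString(" ") { "%02X".format(it) }

fun main() {
    val euro = "\u20AC" // the Euro sign
    println("UTF-8:      " + hexBytes(euro, Charsets.UTF_8))      // E2 82 AC
    println("UTF-16BE:   " + hexBytes(euro, Charsets.UTF_16BE))   // 20 AC
    println("ISO-8859-1: " + hexBytes(euro, Charsets.ISO_8859_1)) // 3F, i.e. '?', since the charset lacks this character
    println("US-ASCII:   " + hexBytes(euro, Charsets.US_ASCII))   // 3F
}
```

The same character produces three different byte sequences here, and two of the Charsets can't represent it at all, which is why decoding with the wrong Charset garbles the text.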
UTF-8 is a variable-length encoding that uses between 1 and 4 bytes per character. The “8” in its name refers to its 8-bit (1-byte) code units. UTF-16 uses 2 or 4 bytes per character, and UTF-32 always uses 4 bytes. In addition to consuming less memory for mostly-ASCII text, UTF-8 is also ASCII-compatible, representing ASCII characters with the same bytes ASCII does. Legacy programs can usually handle UTF-8 files even if they contain some non-ASCII characters. These are some of the reasons UTF-8 is commonly used and recommended.
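As a rough illustration of these size differences, the sketch below counts the bytes needed for an ASCII letter and for the Euro symbol. The helper encodedSize is a hypothetical name, and the big-endian Charset variants are chosen on the assumption that we don't want a byte order mark included in the count.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: number of bytes needed to encode the text with the given charset.
fun encodedSize(text: String, charset: Charset): Int = text.toByteArray(charset).size

fun main() {
    // The BE variants are used so that no byte order mark is added to the output.
    for (sample in listOf("A", "\u20AC")) {
        println(
            "'$sample': UTF-8=${encodedSize(sample, Charsets.UTF_8)}, " +
                "UTF-16=${encodedSize(sample, Charsets.UTF_16BE)}, " +
                "UTF-32=${encodedSize(sample, Charsets.UTF_32BE)} bytes"
        )
    }

    // ASCII compatibility: pure ASCII text yields identical bytes in both charsets.
    val ascii = "That will cost"
    val sameBytes = ascii.toByteArray(Charsets.UTF_8)
        .contentEquals(ascii.toByteArray(Charsets.US_ASCII))
    println("ASCII-compatible: $sameBytes")
}
```

For "A" this reports 1, 2, and 4 bytes, and for the Euro symbol 3, 2, and 4 bytes, so UTF-8 isn't always the smallest option for non-ASCII characters.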
To simplify our tests, we’ll use the following setup, which is purely for demonstration purposes. In practice, it would be redundant: we’d typically receive the input we need to encode rather than construct it ourselves, in which case we could simply create the String in UTF-8 to begin with. Here’s the setup code we’ll use for the rest of our test cases:
val byteArray = byteArrayOf(84, 104, 97, 116, 32, 119, 105, 108, 108, 32, 99, 111, 115, 116, 32, -30, -126, -84, 49, 48, 46)
val charArray = charArrayOf('T', 'h', 'a', 't', ' ', 'w', 'i', 'l', 'l', ' ', 'c', 'o', 's', 't', ' ', '€', '1', '0', '.')
val expectedString = "That will cost €10."
Feel free to replace these values with others you’d like to test.
3. From a ByteArray
Given a ByteArray, we can convert it to a String with UTF-8 encoding in a few ways.
3.1. Using the String Constructor
First, we can use the String constructor without a Charset, since Kotlin’s String(ByteArray) decodes the bytes as UTF-8 by default:
val utf8String = String(byteArray)
Second, we can explicitly specify the Charset in the constructor:
val utf8String = String(byteArray, Charsets.UTF_8)
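The two constructor calls can be checked side by side, as in the sketch below. It also uses the constructor overload that takes an offset and a length, which is handy when only part of a buffer holds text. The names sampleBytes and decodeSlice are hypothetical, and the byte values are copied from the setup above.

```kotlin
// The UTF-8 bytes of "That will cost €10.", as in the setup above.
val sampleBytes = byteArrayOf(
    84, 104, 97, 116, 32, 119, 105, 108, 108, 32,
    99, 111, 115, 116, 32, -30, -126, -84, 49, 48, 46
)

// Hypothetical helper: decodes `length` bytes starting at `offset` as UTF-8.
fun decodeSlice(bytes: ByteArray, offset: Int, length: Int): String =
    String(bytes, offset, length, Charsets.UTF_8)

fun main() {
    println(String(sampleBytes))                 // implicit UTF-8
    println(String(sampleBytes, Charsets.UTF_8)) // explicit UTF-8
    println(decodeSlice(sampleBytes, 15, 3))     // only the three bytes of the Euro sign
}
```

Passing the Charset explicitly is slightly more verbose, but it documents the intent and keeps the call correct if the code is later ported to an API with a different default.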
3.2. Using ByteArray.toString(Charset)
Additionally, we can convert a ByteArray to a UTF-8 String by using the extension function toString(Charset), which takes the Charset as a parameter:
val utf8StringDefault = byteArray.toString() // not decoded: returns something like "[B@1a2b3c"
val utf8StringExplicit = byteArray.toString(Charsets.UTF_8)
Note that calling toString() without the Charset parameter won’t decode the bytes. It returns the array’s type and hash code (for example, [B@1a2b3c) rather than its contents.
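The difference between the two toString() calls is easy to observe, as the sketch below shows. It also uses the standard library function ByteArray.decodeToString(), which isn’t covered in this tutorial but always decodes as UTF-8 and can optionally reject malformed input. The helper decodeUtf8OrNull is a hypothetical name introduced for illustration.

```kotlin
import java.nio.charset.CharacterCodingException

// Hypothetical helper: returns the decoded text, or null if the bytes aren't valid UTF-8.
fun decodeUtf8OrNull(bytes: ByteArray): String? =
    try {
        bytes.decodeToString(throwOnInvalidSequence = true)
    } catch (e: CharacterCodingException) {
        null
    }

fun main() {
    val euroBytes = byteArrayOf(-30, -126, -84) // UTF-8 bytes of the Euro sign
    val truncated = byteArrayOf(-30, -126)      // incomplete multi-byte sequence

    println(euroBytes.toString())               // array type and hash code, not the text
    println(euroBytes.toString(Charsets.UTF_8)) // decoded text
    println(euroBytes.decodeToString())         // decoded text, always UTF-8
    println(decodeUtf8OrNull(truncated))        // null, because the input is malformed
    println(truncated.decodeToString())         // malformed input is replaced with U+FFFD
}
```

A strict check like this can be useful when the input’s encoding is unknown, since the lenient variants silently substitute replacement characters instead of failing.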
4. From a CharArray
4.1. Using Charset.encode(CharBuffer) and Charset.decode(ByteBuffer)
To convert a CharArray to a UTF-8 String, we can encode it to a ByteBuffer first, then decode that back to a CharBuffer, and finally turn it into a String. A CharArray always holds UTF-16 code units, so this round trip mainly ensures the result contains only text that can be represented in UTF-8. Let’s take the CharArray from our setup and convert it as described:
val encodedByteBuffer = Charsets.UTF_8.encode(CharBuffer.wrap(charArray))
val utf8String = Charsets.UTF_8.decode(encodedByteBuffer).toString()
As we can see, we wrap the CharArray in a CharBuffer and leverage Charset.encode() to do the encoding, then use Charset.decode() on the next line to translate the ByteBuffer back to a CharBuffer, which we convert to a String. Converting directly with String(CharArray) copies the chars as they are, including any unpaired surrogates, whereas the round trip through Charset.encode() replaces input that can’t be encoded in UTF-8.
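The sketch below wraps the round trip in a function and compares it with String(CharArray). For well-formed text both give the same result. A difference only appears for input that can’t be encoded, such as an unpaired surrogate, which Charset.encode() replaces with '?'. The helper names roundTripUtf8 and utf8Bytes are hypothetical.

```kotlin
import java.nio.CharBuffer

// Hypothetical helper: the encode/decode round trip described above.
fun roundTripUtf8(chars: CharArray): String {
    val encoded = Charsets.UTF_8.encode(CharBuffer.wrap(chars))
    return Charsets.UTF_8.decode(encoded).toString()
}

// Hypothetical helper: the UTF-8 bytes of a CharArray, copied out of the ByteBuffer.
fun utf8Bytes(chars: CharArray): ByteArray {
    val encoded = Charsets.UTF_8.encode(CharBuffer.wrap(chars))
    return ByteArray(encoded.remaining()).also { encoded.get(it) }
}

fun main() {
    val valid = charArrayOf('\u20AC', '1', '0')  // Euro sign followed by "10"
    val broken = charArrayOf('a', '\uD800', 'b') // contains an unpaired high surrogate

    println(roundTripUtf8(valid) == String(valid)) // true
    println(utf8Bytes(valid).size)                 // 5: three bytes for the Euro sign, one each for 1 and 0
    println(roundTripUtf8(broken))                 // a?b
}
```

If the goal is the UTF-8 bytes themselves rather than a String, stopping after the encode step, as utf8Bytes does, avoids the extra decoding pass.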
5. Conclusion
In this tutorial, we explored the differences between encoding options, their memory usage, and their use cases.
We discussed that UTF-8 is one of the most common text encodings. We then explored ways to produce UTF-8 strings from various representations. Usually, this requires converting to a byte representation, then explicitly decoding those bytes as a UTF-8 String. In some cases, we leverage additional library classes and APIs such as Charset, ByteBuffer, and CharBuffer to translate to intermediate representations that we can transform into the desired String.
As always, the example code is available over on GitHub.