UTF-8 Validation in Java

1. Overview

In data transmission, we often need to handle byte data. When those bytes represent text rather than binary content, the text is usually encoded with a Unicode encoding. Unicode Transformation Format-8 (UTF-8) is a variable-length encoding that represents every Unicode code point using one to four bytes.
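To make the variable-length property concrete, here's a minimal sketch that prints how many bytes UTF-8 needs for a few sample characters. The class and method names (Utf8ByteLengths, utf8Length) are ours for illustration only, and the non-ASCII characters are written as Unicode escapes so the result doesn't depend on the source file encoding:

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteLengths {

    // Returns how many bytes UTF-8 needs to encode the given text
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter: prints 1
        System.out.println(utf8Length("\u00E9"));       // e with acute accent: prints 2
        System.out.println(utf8Length("\u4F60"));       // CJK ideograph: prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // emoji U+1F600: prints 4
    }
}
```

ASCII characters keep their single-byte form, which is why ASCII-only data is also valid UTF-8.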

In this tutorial, we’ll explore the conversion between UTF-8 encoded bytes and string. After that, we’ll dive into the crucial aspects of conducting UTF-8 validation on byte data in Java.

2. UTF-8 Conversion

Before we jump into the validation sections, let’s review how to convert a string into a UTF-8 encoded byte array and vice versa.

To convert a string into a byte array, we simply call its getBytes() method and pass the target encoding:

String UTF8_STRING = "Hello 你好";
byte[] UTF8_BYTES = UTF8_STRING.getBytes(StandardCharsets.UTF_8);

For the reverse, the String class provides a constructor that creates a String instance from a byte array and its source encoding:

String decodedStr = new String(array, StandardCharsets.UTF_8);
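Putting both directions together, a round trip might look like the sketch below. The class Utf8RoundTrip and its helper methods are hypothetical names used only for this illustration:

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {

    // String -> UTF-8 bytes
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> String (malformed input is silently replaced, not rejected)
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "Hello \u4F60\u597D"; // "Hello " followed by two CJK characters
        byte[] bytes = encode(original);
        String decoded = decode(bytes);

        System.out.println(bytes.length);             // 6 ASCII bytes + 2 * 3 bytes: prints 12
        System.out.println(original.equals(decoded)); // prints true
    }
}
```

Passing the charset explicitly on both sides avoids depending on the platform default encoding.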

This constructor gives us little control over the decoding process. Whenever the byte array contains malformed byte sequences, it silently replaces them with the default replacement character �:

@Test
void whenDecodeInvalidBytes_thenReturnReplacementChars() {
    byte[] invalidUtf8Bytes = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};
    // Decode with the String constructor; each malformed byte becomes U+FFFD
    String decodedStr = new String(invalidUtf8Bytes, StandardCharsets.UTF_8);
    assertEquals("�����", decodedStr);
}

Because this decoding never fails, we cannot use the constructor to validate whether a byte array is encoded in UTF-8.

3. Byte Array Validation

Java provides a simple way to validate whether a byte array is UTF-8 encoded using CharsetDecoder:

CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
CharBuffer decodedCharBuffer = charsetDecoder.decode(ByteBuffer.wrap(UTF8_BYTES));

If the decoding process succeeds, we consider those bytes as valid UTF-8. Otherwise, the decode() method throws MalformedInputException:

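@Test
void whenDecodeInvalidUTF8Bytes_thenThrowsMalformedInputException() {
    // The same malformed byte sequence we used in the conversion section
    byte[] invalidUtf8Bytes = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};
    CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
    assertThrows(MalformedInputException.class,
      () -> charsetDecoder.decode(ByteBuffer.wrap(invalidUtf8Bytes)));
}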
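In application code, we may prefer a boolean check over catching the exception at every call site. Here's a minimal sketch of such a helper. The class Utf8ArrayValidator and the method isValidUtf8() are hypothetical names, and the error actions are set explicitly only to make the intent visible, since REPORT is already the default for a decoder obtained from newDecoder():

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8ArrayValidator {

    // Returns true only if the whole array decodes as well-formed UTF-8
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // MalformedInputException is a subclass of CharacterCodingException
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "Hello \u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};

        System.out.println(isValidUtf8(valid));   // prints true
        System.out.println(isValidUtf8(invalid)); // prints false
    }
}
```

A new decoder is created on each call because CharsetDecoder instances are stateful and not safe for concurrent use.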

4. Byte Stream Validation

When our source data is a byte stream rather than a byte array, we can read the InputStream and put its content into a byte array. Subsequently, we can apply the encoding validation on the byte array.

However, our preference is to directly validate the InputStream. This avoids creating an extra byte array and reduces the memory footprint in our application. It’s particularly important when we process a large stream.

In this section, we’ll define the following constant as our source UTF-8 encoded InputStream:

InputStream UTF8_INPUTSTREAM = new ByteArrayInputStream(UTF8_BYTES);
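Before turning to third-party libraries, it's worth noting that the JDK alone can validate a stream in fixed-size chunks. The sketch below is our own illustration rather than part of the library-based approaches that follow: it wraps the stream in an InputStreamReader built with a strict CharsetDecoder and reads until the end. The class Utf8StreamValidator and the 8192-character buffer size are assumptions made for this example:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Utf8StreamValidator {

    // Reads the stream to its end; the caller remains responsible for closing it
    static boolean isValidUtf8(InputStream in) throws IOException {
        // A decoder from newDecoder() reports malformed input instead of replacing it
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8.newDecoder());
        char[] buffer = new char[8192];
        try {
            while (reader.read(buffer) != -1) {
                // Decoded characters are discarded; we only care whether decoding fails
            }
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] valid = "Hello \u4F60\u597D".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};

        System.out.println(isValidUtf8(new ByteArrayInputStream(valid)));   // prints true
        System.out.println(isValidUtf8(new ByteArrayInputStream(invalid))); // prints false
    }
}
```

Unlike encoding detection, this approach consumes the stream, so it suits cases where we only need a yes-or-no answer or can reopen the source afterward.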

4.1. Validation Using Apache Tika

Apache Tika is an open-source content analysis library that provides a set of classes for detecting and extracting text content from different file formats.

We need to include the following Apache Tika core and standard parser dependencies in pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>

To conduct a UTF-8 validation with Apache Tika, we instantiate a UniversalEncodingDetector and use it to detect the encoding of the InputStream. The detector returns the encoding as a Charset instance. We simply verify whether the Charset instance is a UTF-8 one:

@Test
void whenDetectEncoding_thenReturnsUtf8() throws IOException {
    EncodingDetector encodingDetector = new UniversalEncodingDetector();
    // detect() declares IOException, so the test method must propagate it
    Charset detectedCharset = encodingDetector.detect(UTF8_INPUTSTREAM, new Metadata());
    assertEquals(StandardCharsets.UTF_8, detectedCharset);
}

It’s worth noting that when the stream contains only ASCII characters, that is, the first 128 Unicode code points, the detect() method returns ISO-8859-1 instead of UTF-8.

ISO-8859-1 is a single-byte encoding whose first 128 code points are the ASCII characters, and UTF-8 encodes those same characters as identical single bytes. Because of this overlap, we can still treat ASCII-only data as UTF-8 encoded when the method returns ISO-8859-1.
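The ISO-8859-1 case only implies valid UTF-8 when every byte is below 0x80, because ISO-8859-1 bytes in the range 0x80-0xFF aren't well-formed UTF-8 on their own. Here's a minimal sketch of a check that accepts the fallback only for ASCII-only data. It assumes we still have the raw bytes at hand, and the class DetectedCharsetCheck and its methods are hypothetical names that don't belong to Apache Tika:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DetectedCharsetCheck {

    // True when every byte is in the 7-bit ASCII range (0x00-0x7F)
    static boolean isAsciiOnly(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // Java bytes are signed, so values 0x80-0xFF are negative
                return false;
            }
        }
        return true;
    }

    // Accepts UTF-8 directly, and ISO-8859-1 or US-ASCII only for ASCII-only data
    static boolean isUtf8Compatible(Charset detected, byte[] bytes) {
        if (StandardCharsets.UTF_8.equals(detected)) {
            return true;
        }
        boolean asciiSuperset = StandardCharsets.ISO_8859_1.equals(detected)
          || StandardCharsets.US_ASCII.equals(detected);
        return asciiSuperset && isAsciiOnly(bytes);
    }

    public static void main(String[] args) {
        byte[] ascii = "Hello".getBytes(StandardCharsets.US_ASCII);
        byte[] latin1 = {(byte) 0xE9}; // e with acute accent in ISO-8859-1, malformed as UTF-8

        System.out.println(isUtf8Compatible(StandardCharsets.ISO_8859_1, ascii));  // prints true
        System.out.println(isUtf8Compatible(StandardCharsets.ISO_8859_1, latin1)); // prints false
    }
}
```

For a large stream, the same ASCII check could be applied chunk by chunk instead of on a full byte array.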

4.2. Validation Using ICU4J

ICU4J stands for International Components for Unicode for Java, a Java library originally developed by IBM and still published under the com.ibm.icu group ID. It provides Unicode and globalization support for software applications. We need the following ICU4J dependency in our pom.xml:

<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>74.1</version>
</dependency>

In ICU4J, we create a CharsetDetector instance to detect the charset of the InputStream. Similar to the validation using Apache Tika, we verify whether the charset is UTF-8 or not:

@Test
void whenDetectEncoding_thenReturnsUtf8() throws IOException {
    CharsetDetector detector = new CharsetDetector();
    // setText(InputStream) declares IOException and requires a stream that supports mark/reset
    detector.setText(UTF8_INPUTSTREAM);
    CharsetMatch charsetMatch = detector.detect();
    assertEquals(StandardCharsets.UTF_8.name(), charsetMatch.getName());
}

ICU4J exhibits the same behavior as Apache Tika here: when the stream contains only ASCII characters, the detection returns ISO-8859-1 instead of UTF-8.

5. Conclusion

In this article, we’ve explored the conversion between UTF-8 encoded bytes and strings, and then looked at UTF-8 validation for both byte arrays and byte streams. For byte arrays, we used the JDK's CharsetDecoder, and for streams, we detected the encoding with Apache Tika and ICU4J.

As always, the sample code is available over on GitHub.
