Compact Strings in Java

Refactor Java code safely — and automatically — with OpenRewrite.

Refactoring big codebases by hand is slow, risky, and easy to put off. That’s where OpenRewrite comes in. The open-source framework for large-scale, automated code transformations helps teams modernize safely and consistently.

Each month, the creators and maintainers of OpenRewrite at Moderne run live, hands-on training sessions — one for newcomers and one for experienced users. You’ll see how recipes work, how to apply them across projects, and how to modernize code with confidence.

Join the next session, bring your questions, and learn how to automate the kind of work that usually eats your sprint time.

Regression testing is an important step in the release process, to ensure that new code doesn't break the existing functionality. As the codebase evolves, we want to run these tests frequently to help catch any issues early on.

The best way to ensure these tests run frequently on an automated basis is, of course, to include them in the CI/CD pipeline. This way, the regression tests will execute automatically whenever we commit code to the repository.

In this tutorial, we'll see how to create regression tests using Selenium, and then include them in our pipeline using GitHub Actions:, to be run on the LambdaTest cloud grid:

>> How to Run Selenium Regression Tests With GitHub Actions

1. Overview

Strings in Java are internally represented by a char[] containing the characters of the String. And, every char is made up of 2 bytes because Java internally uses UTF-16.

For instance, if a String contains a word in the English language, the leading 8 bits will all be 0 for every char, as an ASCII character can be represented using a single byte.

Many characters require 16 bits to represent them but statistically most require only 8 bits — LATIN-1 character representation. So, there is a scope to improve the memory consumption and performance.

What’s also important is that Strings typically usually occupy a large proportion of the JVM heap space. And, because of the way they’re stored by the JVM, in most cases, a String instance can take up double space it actually needs.

In this article, we’ll discuss the Compressed String option, introduced in JDK6 and the new Compact String, recently introduced with JDK9. Both of these were designed to optimize memory consumption of Strings on the JMV.

**2. Compressed String – Java 6**

The JDK 6 update 21 Performance Release, introduced a new VM option:

-XX:+UseCompressedStrings

When this option is enabled, Strings are stored as byte[], instead of char[] – thus, saving a lot of memory. However, this option was eventually removed in JDK 7, mainly because it had some unintended performance consequences.

**3. Compact String – Java 9**

Java 9 has brought the concept of compact Strings back.

This means that whenever we create a String if all the characters of the String can be represented using a byte — LATIN-1 representation, a byte array will be used internally, such that one byte is given for one character.

In other cases, if any character requires more than 8-bits to represent it, all the characters are stored using two bytes for each — UTF-16 representation.

So basically, whenever possible, it’ll just use a single byte for each character.

Now, the question is – how will all the String operations work? How will it distinguish between the LATIN-1 and UTF-16 representations?

Well, to tackle this issue, another change is made to the internal implementation of the String. We have a final field coder, that preserves this information.

**3.1. String Implementation in Java 9**

Until now, the String was stored as a char[]:

private final char[] value;

From now on, it’ll be a byte[]:

private final byte[] value;

The variable coder:

private final byte coder;

Where the coder can be:

static final byte LATIN1 = 0;
static final byte UTF16 = 1;

Most of the String operations now check the coder and dispatch to the specific implementation:

public int indexOf(int ch, int fromIndex) {
    return isLatin1() 
      ? StringLatin1.indexOf(value, ch, fromIndex) 
      : StringUTF16.indexOf(value, ch, fromIndex);
}  

private boolean isLatin1() {
    return COMPACT_STRINGS && coder == LATIN1;
}

With all the info the JVM needs ready and available, the CompactString VM option is enabled by default. To disable it, we can use:

+XX:-CompactStrings

**3.2. How coder Works**

In Java 9 String class implementation, the length is calculated as:

public int length() {
    return value.length >> coder;
}

If the String contains only LATIN-1, the value of the coder will be 0 so the length of the String will be the same as the length of the byte array.

In other cases, if the String is in UTF-16 representation, the value of coder will be 1, and hence the length will be half the size of the actual byte array.

Note that all the changes made for Compact String, are in the internal implementation of the String class and are fully transparent for developers using String.

4. Compact Strings vs. Compressed Strings

In case of JDK 6 Compressed Strings, a major problem faced was that the String constructor accepted only char[] as an argument. In addition to this, many String operations depended on char[] representation and not a byte array. Due to this, a lot of unpacking had to be done, which affected the performance.

Whereas in case of Compact String, maintaining the extra field “coder” can also increase the overhead. To mitigate the cost of the coder and the unpacking of bytes to chars (in case of UTF-16 representation), some of the methods are intrinsified and the ASM code generated by the JIT compiler has also been improved.

This change resulted in some counter-intuitive results. The LATIN-1 indexOf(String) calls an intrinsic method, whereas the indexOf(char) does not. In case of UTF-16, both of these methods call an intrinsic method. This issue affects only the LATIN-1 String and will be fixed in future releases.

Thus, Compact Strings are better than the Compressed Strings in terms of performance.

To find out how much memory is saved using the Compact Strings, various Java application heap dumps were analyzed. And, while results were heavily dependent on the specific applications, the overall improvements were almost always considerable.

4.1. Difference in Performance

Let’s see a very simple example of the performance difference between enabling and disabling Compact Strings:

long startTime = System.currentTimeMillis();
 
List strings = IntStream.rangeClosed(1, 10_000_000)
  .mapToObj(Integer::toString) 
  .collect(toList());
 
long totalTime = System.currentTimeMillis() - startTime;
System.out.println(
  "Generated " + strings.size() + " strings in " + totalTime + " ms.");

startTime = System.currentTimeMillis();
 
String appended = (String) strings.stream()
  .limit(100_000)
  .reduce("", (l, r) -> l.toString() + r.toString());
 
totalTime = System.currentTimeMillis() - startTime;
System.out.println("Created string of length " + appended.length() 
  + " in " + totalTime + " ms.");

Here, we are creating 10 million Strings and then appending them in a naive manner. When we run this code (Compact Strings are enabled by default), we get the output:

Generated 10000000 strings in 854 ms.
Created string of length 488895 in 5130 ms.

Similarly, if we run it by disabling the Compact Strings using: -XX:-CompactStrings option, the output is:

Generated 10000000 strings in 936 ms.
Created string of length 488895 in 9727 ms.

Clearly, this is a surface level test, and it can’t be highly representative – it’s only a snapshot of what the new option may do to improve performance in this particular scenario.