LS Price Increase Launch

The Price of all “Learn Spring” course packages will increase by $40 on next Friday:

>> GET ACCESS NOW

1. Overview

Many alphabets contain accent and diacritical marks. To search or index data reliably, we might want to convert a string with diacritics to a string containing only ASCII characters. Unicode defines a text normalization procedure that helps do this.

In this tutorial, we’ll see what Unicode text normalization is, how we can use it to remove diacritical marks, and the pitfalls to watch out for. Then, we will see some examples using the Java Normalizer class and Apache Commons StringUtils.

2. The Problem at a Glance

Let's say that we're working with text containing a range of diacritical marks that we want to remove:

āăąēîïĩíĝġńñšŝśûůŷ

After reading this article, we'll know how to get rid of diacritics and end up with:

aaaeiiiiggnnsssuuy

3. Unicode Fundamentals

Before jumping straight into code, let's learn some Unicode basics.

To represent a character with a diacritical or accent mark, Unicode can use different sequences of code points. The reason for that is historical compatibility with older character sets.

Unicode normalization is the decomposition of characters using equivalence forms defined by the standard.

3.1. Unicode Equivalence Forms

To compare sequences of code points, Unicode defines two terms: canonical equivalence and compatibility.

Canonically equivalent code points have the same appearance and meaning when displayed. For example, the letter “ś” (Latin letter “s” with acute) can be represented with one code point, U+015B, or with two code points, U+0073 (Latin letter “s”) and U+0301 (combining acute accent).

On the other hand, compatible sequences can have distinct appearances but the same meaning in some contexts. For instance, the code point U+013F (Latin capital letter L with middle dot, “Ŀ”) is compatible with the sequence U+004C (Latin letter “L”) and U+00B7 (middle dot, “·”). Moreover, some fonts show the middle dot inside the L, and some show it following the L.

Canonically equivalent sequences are compatible, but the opposite is not always true.
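
To make the two equivalence forms concrete, here is a minimal sketch using the JDK's java.text.Normalizer. The class and method names (EquivalenceDemo, canonicallyEquivalent, compatibilityEquivalent) are made up for illustration; the idea is simply that two strings are treated as equivalent when they are identical after the corresponding decomposition:

```java
import java.text.Normalizer;

class EquivalenceDemo {

    // Equal after canonical decomposition (NFD)
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Equal after compatibility decomposition (NFKD)
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        String precomposed = "\u015B"; // s with acute as a single code point
        String combined = "s\u0301";   // s followed by the combining acute accent

        System.out.println(precomposed.equals(combined));                 // false
        System.out.println(canonicallyEquivalent(precomposed, combined)); // true

        String lWithDot = "\u013F";  // L with middle dot
        String lThenDot = "L\u00B7"; // L followed by a middle dot

        System.out.println(canonicallyEquivalent(lWithDot, lThenDot));   // false
        System.out.println(compatibilityEquivalent(lWithDot, lThenDot)); // true
    }
}
```

Note that a plain equals() call sees the two spellings of “ś” as different strings, which is exactly why normalization matters for searching and indexing.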

3.2. Character Decomposition

Character decomposition replaces a composite character with the code point of a base letter, followed by the code points of its combining characters (according to the equivalence form). For example, this procedure will decompose the letter “ā” (U+0101) into the letter “a” (U+0061) and the combining macron (U+0304).
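
As a small illustration, the sketch below decomposes “ā” with canonical decomposition and then recomposes it. The helper names decompose and compose are our own, not part of the JDK:

```java
import java.text.Normalizer;

class DecompositionDemo {

    // Canonical decomposition: base letter followed by combining marks
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Canonical composition: the reverse direction
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String original = "\u0101"; // a with macron as one code point
        String decomposed = decompose(original);

        System.out.println(original.length());                   // 1
        System.out.println(decomposed.length());                 // 2
        System.out.println(decomposed.charAt(0));                // a
        System.out.println((int) decomposed.charAt(1) == 0x0304); // true
        System.out.println(compose(decomposed).equals(original)); // true
    }
}
```

The decomposed string looks the same on screen but is one char longer, because the mark now lives in its own code point.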

3.3. Matching Diacritical and Accent Marks

Once we have separated the base character from the diacritical mark, we must create an expression matching unwanted characters. We can use either a character block or a category.

The most popular Unicode code block for this purpose is Combining Diacritical Marks. It is fairly small and contains just the 112 most common combining characters. Alternatively, we can use the Unicode category Mark. It consists of code points that are combining marks and is divided further into three subcategories:

  • Nonspacing_Mark: this category includes 1,839 code points
  • Enclosing_Mark: contains 13 code points
  • Spacing_Combining_Mark: contains 443 code points

The major difference between a Unicode character block and a category is that a character block contains a contiguous range of characters, whereas a category can span many character blocks. Combining Diacritical Marks is a good example: all code points belonging to this block are also included in the Nonspacing_Mark category.
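
The practical difference shows up once we use both in a regular expression. The following sketch (class and method names are hypothetical) strips marks from already decomposed text, once by block and once by category. As a test character it uses U+20D7 (combining right arrow above), a nonspacing mark that lives in a different block, Combining Diacritical Marks for Symbols:

```java
class MarkMatchingDemo {

    // Removes only code points from the Combining Diacritical Marks block
    static String stripByBlock(String decomposed) {
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}", "");
    }

    // Removes every code point in the Mark category
    static String stripByCategory(String decomposed) {
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        String acute = "a\u0301"; // combining acute accent: inside the block
        String arrow = "a\u20D7"; // combining right arrow above: outside the block

        System.out.println(stripByBlock(acute).length());    // 1
        System.out.println(stripByCategory(acute).length()); // 1
        System.out.println(stripByBlock(arrow).length());    // 2, the mark survives
        System.out.println(stripByCategory(arrow).length()); // 1
    }
}
```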

4. Algorithm

Now that we understand the base Unicode terms, we can plan the algorithm to remove diacritical marks from a String.

First, we will separate base characters from accent and diacritical marks using the Normalizer class. We will use compatibility decomposition, represented by the Java enum constant Normalizer.Form.NFKD, because it decomposes more ligatures than the canonical form does (for example, the ligature “ﬁ”, U+FB01).

Second, we will remove all characters matching the Unicode Mark category using the regular expression \p{M}. We pick this category because it offers the broadest range of marks.
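
To see why the choice of decomposition form matters, here is a brief sketch comparing canonical (NFD) and compatibility (NFKD) decomposition on the “ﬁ” ligature. The helper names are ours. It also shows a side effect worth keeping in mind: NFKD rewrites other compatibility characters too, such as the superscript two:

```java
import java.text.Normalizer;

class DecompositionFormDemo {

    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    static String compatibility(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // LATIN SMALL LIGATURE FI

        System.out.println(canonical(ligature).equals(ligature)); // true, NFD leaves it alone
        System.out.println(compatibility(ligature));              // fi

        String squared = "m\u00B2"; // m followed by SUPERSCRIPT TWO
        System.out.println(compatibility(squared));               // m2
    }
}
```

If such rewrites are unwanted, canonical decomposition (NFD) is the more conservative choice, at the cost of leaving compatibility ligatures untouched.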

5. Using Core Java

Let's start with some examples using core Java.

5.1. Check if a String Is Normalized

Before we perform a normalization, we might want to check that the String isn't already normalized:

assertFalse(Normalizer.isNormalized("āăąēîïĩíĝġńñšŝśûůŷ", Normalizer.Form.NFKD));
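
One way to put this check to use is a guard that skips normalization when the input is already in the target form. The helper below is a hypothetical sketch, not part of the article's StringNormalizer class:

```java
import java.text.Normalizer;

class NormalizationCheckDemo {

    // Returns the input unchanged when it is null or already in NFKD form
    static String normalizeIfNeeded(String input) {
        if (input == null || Normalizer.isNormalized(input, Normalizer.Form.NFKD)) {
            return input;
        }
        return Normalizer.normalize(input, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        // Plain ASCII is already normalized
        System.out.println(Normalizer.isNormalized("abc", Normalizer.Form.NFKD));    // true
        // A precomposed letter is not
        System.out.println(Normalizer.isNormalized("\u0101", Normalizer.Form.NFKD)); // false
        // After normalization the check passes
        String result = normalizeIfNeeded("\u0101");
        System.out.println(Normalizer.isNormalized(result, Normalizer.Form.NFKD));   // true
    }
}
```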

5.2. String Decomposition

If our String is not normalized, we proceed to the next step. To separate ASCII characters from diacritical marks, we will perform Unicode text normalization using compatibility decomposition:

private static String normalize(String input) {
    return input == null ? null : Normalizer.normalize(input, Normalizer.Form.NFKD);
}

After this step, both letters “â” and “ä” will be reduced to “a” followed by their respective diacritical marks.

5.3. Removal of Code Points Representing Diacritical and Accent Marks

Once we have decomposed our String, we want to remove unwanted code points. Therefore, we will use the Unicode regular expression \p{M}:

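static String removeAccents(String input) {
    // normalize() returns null for null input, so guard before calling replaceAll()
    String normalized = normalize(input);
    return normalized == null ? null : normalized.replaceAll("\\p{M}", "");
}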

5.4. Tests

Let's see how our decomposition works in practice. First, let's pick characters that have a decomposition mapping defined by Unicode and expect all diacritical marks to be removed:

@Test
void givenStringWithDecomposableUnicodeCharacters_whenRemoveAccents_thenReturnASCIIString() {
    assertEquals("aaaeiiiiggnnsssuuy", StringNormalizer.removeAccents("āăąēîïĩíĝġńñšŝśûůŷ"));
}

Second, let's pick a few characters without a decomposition mapping:

@Test
void givenStringWithNondecomposableUnicodeCharacters_whenRemoveAccents_thenReturnOriginalString() {
    assertEquals("łđħœ", StringNormalizer.removeAccents("łđħœ"));
}

As expected, our method was unable to decompose them.

Additionally, we can create a test to validate the hex codes of decomposed characters:

@Test
void givenStringWithDecomposableUnicodeCharacters_whenUnicodeValueOfNormalizedString_thenReturnUnicodeValue() {
    assertEquals("\\u0066 \\u0069", StringNormalizer.unicodeValueOfNormalizedString("\uFB01")); // the "ﬁ" ligature, U+FB01
    assertEquals("\\u0061 \\u0304", StringNormalizer.unicodeValueOfNormalizedString("ā"));
    assertEquals("\\u0069 \\u0308", StringNormalizer.unicodeValueOfNormalizedString("ï"));
    assertEquals("\\u006e \\u0301", StringNormalizer.unicodeValueOfNormalizedString("ń"));
}
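
The unicodeValueOfNormalizedString() helper used in this test isn't shown above. A minimal sketch of how it could look, assuming it applies NFKD and prints each resulting char as a lowercase, four-digit hex escape separated by spaces (the class name here is our own):

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class UnicodeValueDemo {

    // Assumed behavior: NFKD-normalize, then render every char as a backslash-u escape
    static String unicodeValueOfNormalizedString(String input) {
        String normalized = Normalizer.normalize(input, Normalizer.Form.NFKD);
        return normalized.chars()
                .mapToObj(c -> String.format("\\u%04x", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Prints the escapes for "a" and the combining macron: 0061 and 0304
        System.out.println(unicodeValueOfNormalizedString("\u0101"));
    }
}
```

Since it iterates over char values, this sketch would print surrogate pairs separately for characters outside the Basic Multilingual Plane.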

5.5. Compare Strings Including Accents Using Collator

Package java.text includes another interesting class – Collator. It enables us to perform locale-sensitive String comparisons. An important configuration property is the Collator's strength. This property defines the minimum level of difference considered significant during a comparison.

Java provides four strength values for a Collator:

  • PRIMARY: comparison omitting case and accents
  • SECONDARY: comparison omitting case but including accents and diacritics
  • TERTIARY: comparison including case and accents
  • IDENTICAL: all differences are significant

Let's check some examples, first with primary strength:

Collator collator = Collator.getInstance();
collator.setDecomposition(Collator.FULL_DECOMPOSITION);
collator.setStrength(Collator.PRIMARY);
assertEquals(0, collator.compare("a", "a"));
assertEquals(0, collator.compare("ä", "a"));
assertEquals(0, collator.compare("A", "a"));
assertEquals(1, collator.compare("b", "a"));

Secondary strength turns on accent sensitivity:

collator.setStrength(Collator.SECONDARY);
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(0, collator.compare("A", "a"));
assertEquals(0, collator.compare("a", "a"));

Tertiary strength includes case:

collator.setStrength(Collator.TERTIARY);
assertEquals(1, collator.compare("A", "a"));
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(0, collator.compare("a", "a"));
assertEquals(0, collator.compare(valueOf(toChars(0x0001)), valueOf(toChars(0x0002))));

Identical strength makes all differences significant. The penultimate assertion below is interesting: unlike with tertiary strength, we can now detect the difference between the Unicode control code points U+0001 (“Start of Heading”) and U+0002 (“Start of Text”):

collator.setStrength(Collator.IDENTICAL);
assertEquals(1, collator.compare("A", "a"));
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(-1, collator.compare(valueOf(toChars(0x0001)), valueOf(toChars(0x0002))));
assertEquals(0, collator.compare("a", "a"));

One last example worth mentioning shows that if a character doesn't have a defined decomposition rule, it won't be considered equal to another character with the same base letter, because Collator can't perform the Unicode decomposition:

collator.setStrength(Collator.PRIMARY);
assertEquals(1, collator.compare("ł", "l"));
assertEquals(1, collator.compare("ø", "o"));

6. Using Apache Commons StringUtils

Now that we've seen how to use core Java to remove accents, we'll check what Apache Commons Lang offers. As we'll soon learn, it's easier to use, but we have less control over the decomposition process. Under the hood, it uses the Normalizer.normalize() method with the NFD decomposition form and the \p{InCombiningDiacriticalMarks} regular expression:

static String removeAccentsWithApacheCommons(String input) {
    return StringUtils.stripAccents(input);
}

6.1. Tests

Let's see this method in practice — first, only with decomposable Unicode characters:

@Test
void givenStringWithDecomposableUnicodeCharacters_whenRemoveAccentsWithApacheCommons_thenReturnASCIIString() {
    assertEquals("aaaeiiiiggnnsssuuy", StringNormalizer.removeAccentsWithApacheCommons("āăąēîïĩíĝġńñšŝśûůŷ"));
}

As expected, we got rid of all the accents.

Let's try a string containing a ligature and letters with a stroke:

@Test 
void givenStringWithNondecomposableUnicodeCharacters_whenRemoveAccentsWithApacheCommons_thenReturnModifiedString() {
    assertEquals("lđħœ", StringNormalizer.removeAccentsWithApacheCommons("łđħœ"));
}

As we can see, the StringUtils.stripAccents() method manually defines a translation rule for the Latin ł and Ł characters. Unfortunately, it leaves the other letters with a stroke and the ligature untouched.

7. Limitations of Character Decomposition in Java

To sum up, we saw that some characters do not have defined decomposition rules. More specifically, Unicode doesn't define decomposition rules for some ligatures (such as “œ”) or for characters with a stroke (such as “ł”, “đ”, and “ħ”). Because of that, Java won't be able to normalize them, either. If we want to get rid of these characters, we have to define the transcription mapping manually.
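
A manual mapping can be as simple as a lookup table applied after the regular normalization step. The following is only a sketch: the class name, the method name, and the deliberately incomplete table are assumptions for illustration, and the right replacement for a given letter depends on the language and use case:

```java
import java.text.Normalizer;
import java.util.Map;

class ManualTranscriptionDemo {

    // Hypothetical, incomplete table for characters without a Unicode decomposition
    private static final Map<Character, String> REPLACEMENTS = Map.of(
            '\u0142', "l",  // l with stroke
            '\u0141', "L",  // L with stroke
            '\u0111', "d",  // d with stroke
            '\u0127', "h",  // h with stroke
            '\u00F8', "o",  // o with stroke
            '\u0153', "oe"  // oe ligature
    );

    static String removeAccentsWithFallback(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: decompose and drop combining marks, as in the core Java approach
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "");
        // Step 2: replace whatever the decomposition could not handle
        StringBuilder result = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            result.append(REPLACEMENTS.getOrDefault(c, String.valueOf(c)));
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAccentsWithFallback("\u0142\u0111\u0127\u0153")); // ldhoe
    }
}
```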

Finally, it's worth considering whether we need to get rid of accents and diacritics at all. For some languages, a letter stripped of its diacritical marks won't make much sense. In such cases, a better idea is to use the Collator class and compare two Strings, including locale information.
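
As a sketch of that alternative, the helper below compares two strings at primary strength with an explicitly chosen locale instead of the JVM default. The class and method names are hypothetical, and the result can differ between locales, since each locale brings its own collation rules:

```java
import java.text.Collator;
import java.util.Locale;

class AccentInsensitiveCompareDemo {

    // Compares at PRIMARY strength, so case and accent differences are ignored
    static boolean equalsIgnoringAccentsAndCase(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "r\u00E9sum\u00E9" is "resume" written with two acute accents
        System.out.println(equalsIgnoringAccentsAndCase("r\u00E9sum\u00E9", "Resume", Locale.ENGLISH)); // true
        System.out.println(equalsIgnoringAccentsAndCase("resume", "rosume", Locale.ENGLISH));           // false
    }
}
```

This keeps the original text intact and moves the accent handling into the comparison itself.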

8. Conclusion

In this article, we looked into removing accents and diacritical marks using core Java and the popular Java utility library, Apache Commons. We also saw a few examples and learned how to compare text containing accents, as well as a few things to watch out for when working with text containing accents.

As always, the full source code of the article is available over on GitHub.
