1. Overview

It’s a common practice in text processing and analysis to eliminate punctuation from a string.

In this quick tutorial, let’s explore how to easily remove punctuation from a given string.

2. Introduction to the Problem

Let’s say we have a string:

static final String INPUT = "It's 1 W o r d (!@#$%^&*{}[];':\")<>,.";

As we can see, the string INPUT contains digits, letters, whitespace, and various punctuation marks.

Our goal is to remove punctuation marks from the string only and leave letters, digits, and whitespace in the result:

static final String EXPECTED = "Its 1 W o r d ";

In this tutorial, we’ll mainly use the String.replaceAll() method from the Java standard library to solve the problem.

For simplicity, we’ll use unit test assertions to verify whether the result is as expected.

So next, let’s see how the punctuation marks get removed.

3. Using the Regex Patterns “[^\sa-zA-Z0-9]” and “\p{Punct}”

We’ve mentioned using the String.replaceAll() method to remove punctuation from the input string. The replaceAll() method performs regex-based string substitution: it scans the input string and replaces every substring that matches our regex pattern with a replacement string.

Therefore, the regex pattern is the key to solving this problem.
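Since replaceAll() interprets its first argument as a regex, metacharacters such as the dot have a special meaning. The following minimal sketch contrasts it with the literal String.replace() method. The class and method names are made up for illustration only:

```java
// Hypothetical demo class; the names are illustrative and not part of the tutorial's code base.
class RegexVsLiteralDemo {

    // replaceAll() treats "." as a regex that matches any character except line terminators
    static String removeWithRegex(String input) {
        return input.replaceAll(".", "");
    }

    // replace() treats "." as a literal dot
    static String removeLiteral(String input) {
        return input.replace(".", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + removeWithRegex("a.b.c") + "]"); // prints "[]"
        System.out.println("[" + removeLiteral("a.b.c") + "]");   // prints "[abc]"
    }
}
```

This is why a carefully chosen pattern matters: a pattern that's too broad silently removes more than we intend.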

As we want to leave letters, digits, and whitespace in the result, we can replace any character that’s not a digit, a letter, or a whitespace character with an empty string. We can match these unwanted characters with the negated character class [^\sa-zA-Z0-9].

Next, let’s create a test to check if it works:

String result = INPUT.replaceAll("[^\\sa-zA-Z0-9]", "");
assertEquals(EXPECTED, result);

The test passes if we execute it. The regex pattern is pretty straightforward. For those not familiar with the syntax, it may be helpful to note a couple of points:

  • [^…] – Not one of the characters in […]. For example, [^0-9] matches any non-digit.
  • \s – Matches any whitespace character, such as space and TAB. In a Java string literal, the backslash must be escaped, so we write "\\s".
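To see these two building blocks in isolation, here’s a small sketch. The class name, method names, and sample input are assumptions chosen for illustration:

```java
// Hypothetical demo class showing a negated character class and the \s shorthand.
class CharacterClassDemo {

    // [^0-9] matches every character that is NOT an ASCII digit, so only digits survive
    static String keepDigits(String input) {
        return input.replaceAll("[^0-9]", "");
    }

    // \s matches whitespace such as space and TAB, so all whitespace is removed
    static String stripWhitespace(String input) {
        return input.replaceAll("\\s", "");
    }

    public static void main(String[] args) {
        String sample = "a1 b2\tc3";
        System.out.println(keepDigits(sample));      // prints "123"
        System.out.println(stripWhitespace(sample)); // prints "a1b2c3"
    }
}
```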

Moreover, Java’s regex engine supports POSIX character classes. Therefore, we can directly use the \p{Punct} character class, which matches any one of the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~. As before, we escape the backslash in the Java string literal:

String result = INPUT.replaceAll("\\p{Punct}", "");
assertEquals(EXPECTED, result);

When we run the test above, it passes too.
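String.replaceAll() compiles the regex on every call. If we remove punctuation from many strings, we may prefer to compile the pattern once and reuse it. Here’s a minimal sketch of that idea; the PunctuationRemover class and its removePunctuation() method are hypothetical names, not part of any library:

```java
import java.util.regex.Pattern;

// Hypothetical helper class; it compiles the pattern once and reuses it for every call.
class PunctuationRemover {

    // Pattern instances are immutable and safe to share between threads
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}");

    static String removePunctuation(String input) {
        return PUNCT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "It's 1 W o r d (!@#$%^&*{}[];':\")<>,.";
        // Brackets make the trailing space visible
        System.out.println("[" + removePunctuation(input) + "]"); // prints "[Its 1 W o r d ]"
    }
}
```

The result is the same as with String.replaceAll(); only the place where the pattern gets compiled differs.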

4. When the Input Is a Unicode String

We’ve seen two approaches to removing punctuation from the input string successfully. If we take a closer look at the INPUT string, we realize that it consists of ASCII characters.

A question may come up – will the solutions still work if we receive a string like this:

static final String UNICODE_INPUT = "3 March März 三月 březen маршировать (!@#$%^&*{}[];':\")<>,.";

Apart from the digit ‘3’, whitespace characters, and punctuation marks, this input includes the word “March” in English, German, Chinese, Czech, and Russian. So, unlike the previous INPUT string, the UNICODE_INPUT variable contains non-ASCII Unicode characters.

After removing punctuation, the expected result should look like this:

static final String UNICODE_EXPECTED = "3 March März 三月 březen маршировать ";

So next, let’s test if our two solutions still work with this input:

String result1 = UNICODE_INPUT.replaceAll("[^\\sa-zA-Z0-9]", "");
assertNotEquals(UNICODE_EXPECTED, result1);

The test above passes. But we should note that the assertion is assertNotEquals(). So the “removing [^\sa-zA-Z0-9]” approach doesn’t produce the expected result. Let’s see what result it actually produces:

String actualResult1 = "3 March Mrz  bezen  ";
assertEquals(actualResult1, result1);

So, all non-ASCII characters have been removed together with punctuation marks. Apparently, the “removing [^\sa-zA-Z0-9]” approach doesn’t work for Unicode strings.

But we can fix it by replacing the “a-zA-Z” range with “\p{L}”:

String result3 = UNICODE_INPUT.replaceAll("[^\\s\\p{L}0-9]", "");
assertEquals(UNICODE_EXPECTED, result3);

It’s worth mentioning that \p{L} matches any character in the Unicode “Letter” category, so it covers letters from every script, not just the ASCII ranges a-z and A-Z.
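Similarly, the “0-9” range only covers ASCII digits. If the input may contain digits from other scripts, one option is the Unicode “Number” category, \p{N}. The sketch below illustrates the difference under that assumption; the class name, method names, and sample input are made up for this example:

```java
// Hypothetical demo class comparing the ASCII digit range with the Unicode Number category.
class UnicodeLetterDigitDemo {

    // Keeps whitespace, any Unicode letter, and ASCII digits only
    static String keepAsciiDigits(String input) {
        return input.replaceAll("[^\\s\\p{L}0-9]", "");
    }

    // Keeps whitespace, any Unicode letter, and any Unicode number character
    static String keepUnicodeNumbers(String input) {
        return input.replaceAll("[^\\s\\p{L}\\p{N}]", "");
    }

    public static void main(String[] args) {
        // \u0663 is the Arabic-Indic digit three; \u00E4 is the letter a with diaeresis
        String input = "\u0663 M\u00E4rz!";
        System.out.println("[" + keepAsciiDigits(input) + "]");    // the non-ASCII digit and "!" are removed
        System.out.println("[" + keepUnicodeNumbers(input) + "]"); // only "!" is removed
    }
}
```

Whether digits from other scripts should be kept depends on the use case, so this is a variation to consider rather than a required change.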

On the other hand, the “removing \p{Punct}” approach still works with Unicode inputs:

String result2 = UNICODE_INPUT.replaceAll("\\p{Punct}", "");
assertEquals(UNICODE_EXPECTED, result2);

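This is because, by default, \p{Punct} matches only the ASCII punctuation characters listed earlier, so it leaves letters from every script untouched. By the same token, it doesn’t remove non-ASCII punctuation marks such as “ or 。.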
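If the input may also contain non-ASCII punctuation, we could turn to the Unicode general categories instead: \p{P} matches punctuation, and \p{S} matches symbols such as + and =. The following sketch compares the three variants. The class name, method names, and sample string are assumptions for illustration:

```java
// Hypothetical demo class comparing POSIX \p{Punct} with the Unicode categories \p{P} and \p{S}.
class UnicodePunctuationDemo {

    // Default behavior: removes ASCII punctuation only
    static String removeAsciiPunct(String input) {
        return input.replaceAll("\\p{Punct}", "");
    }

    // Removes every character in the Unicode Punctuation category, but keeps symbols like + and =
    static String removeUnicodePunct(String input) {
        return input.replaceAll("\\p{P}", "");
    }

    // Removes both Unicode punctuation and Unicode symbols
    static String removePunctAndSymbols(String input) {
        return input.replaceAll("[\\p{P}\\p{S}]", "");
    }

    public static void main(String[] args) {
        // \u201C and \u201D are curly double quotes; \u2026 is the horizontal ellipsis
        String input = "\u201CHello\u201D, world\u2026 1+1=2!";
        System.out.println(removeAsciiPunct(input));      // curly quotes and the ellipsis remain
        System.out.println(removeUnicodePunct(input));    // prints "Hello world 1+1=2"
        System.out.println(removePunctAndSymbols(input)); // prints "Hello world 112"
    }
}
```

Note that \p{P} on its own keeps characters such as $, +, and ^, because Unicode classifies them as symbols rather than punctuation.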

5. Conclusion

In this article, we’ve learned how to remove punctuation from a string using the standard String.replaceAll() method:

  • String.replaceAll("[^\\sa-zA-Z0-9]", "") – works only for input strings with ASCII characters
  • String.replaceAll("\\p{Punct}", "") – removes ASCII punctuation marks and works for both ASCII and Unicode strings
  • String.replaceAll("[^\\s\\p{L}0-9]", "") – works for both ASCII and Unicode strings

As usual, all code snippets presented here are available over on GitHub.
