Split a String Into Digit and Non-Digit Substrings

Last updated: January 8, 2024

Written by: Kai Yuan

Reviewed by: Eric Martin



1. Overview

Working with strings is a fundamental task in Java programming, and at times, we need to split a string into multiple substrings for further processing. Whether it’s parsing user input or processing data files, knowing how to break strings effectively is essential.

In this tutorial, we’ll explore different approaches and techniques for breaking an input string into a string array or list containing digit and non-digit string elements in the original order.

2. Introduction to the Problem

As usual, let’s understand the problem through examples.

Let’s say we have two input strings:

String INPUT1 = "01Michael Jackson23Michael Jordan42Michael Bolton999Michael Johnson000";
String INPUT2 = "Michael Jackson01Michael Jordan23Michael Bolton42Michael Johnson999Great Michaels";

As the examples above show, both strings consist of alternating runs of digit and non-digit characters. For example, the digit substrings in INPUT1 are “01”, “23”, “42”, “999”, and “000”. The non-digit substrings are “Michael Jackson”, “Michael Jordan”, “Michael Bolton”, and so on.

INPUT2 is similar. The difference is that it starts with a non-digit substring. From these two examples, we can infer two input characteristics:

The length of digit or non-digit substrings is dynamic.
The input string can start with a digit or non-digit substring.

We aim to break the input string into an array or list of these string elements:

String[] EXPECTED1 = new String[] { "01", "Michael Jackson", "23", "Michael Jordan", "42", "Michael Bolton", "999", "Michael Johnson", "000" };
List<String> EXPECTED_LIST1 = Arrays.asList(EXPECTED1);

String[] EXPECTED2 = new String[] { "Michael Jackson", "01", "Michael Jordan", "23", "Michael Bolton", "42", "Michael Johnson", "999", "Great Michaels" };
List<String> EXPECTED_LIST2 = Arrays.asList(EXPECTED2);

In this tutorial, we’ll solve this problem using both regex-based and non-regex-based approaches. Then, we’ll compare their performance at the end.

For simplicity, we’ll use unit test assertions to verify whether each approach works as expected.

3. Using the String.split() Method

First, let’s solve this problem using a regex-based approach. We know that the String.split() method is a handy tool for splitting a String into an array. For example, "a, b, c, d".split(", ") returns the string array {"a", "b", "c", "d"}.

So, split() is probably the first idea we come up with to solve our problem. We then need a regex pattern that acts as the separator and guides split() to the expected result. However, we notice one difficulty when we think it through.

Let’s revisit the "a, b, c, d".split(", ") example. We used ", " as the separator regex pattern and got the string elements "a", "b", "c", and "d" in the result array. If we look at these elements, we’ll see that none of the matched separators (", ") appear in the result array.
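To make the point about consumed separators concrete, here's a minimal sketch. The class name ConsumedSeparatorDemo and the helper splitOnDigits() are hypothetical names used only for illustration. It treats each run of digits as the separator, which shows what we lose when the separator pattern matches actual characters:

```java
import java.util.Arrays;

class ConsumedSeparatorDemo {

    // Hypothetical helper: uses every run of digits as the separator.
    // split() drops whatever the separator pattern matched.
    static String[] splitOnDigits(String input) {
        return input.split("\\d+");
    }

    public static void main(String[] args) {
        // The ", " separators are consumed, so only the letters remain.
        System.out.println(Arrays.toString("a, b, c, d".split(", ")));

        // The digit runs are consumed in the same way, so "01" and "23" are lost.
        String input = "Michael Jackson01Michael Jordan23Michael Bolton";
        System.out.println(Arrays.toString(splitOnDigits(input)));
    }
}
```

With this separator, the result only contains the three names. The digit substrings we want to keep are gone, which is why a pattern that consumes characters can't solve our problem.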

However, if we look at the inputs and expected outputs of our problem, every character in the input appears in the result array or list. Therefore, if we want to use split() to solve the problem, we need a pattern built from zero-length assertions, such as lookahead and lookbehind (lookaround) assertions, which match a position without consuming any characters. Next, let’s analyze our input string:

01[!]Michael Jackson[!]23[!]Michael Jordan[!]42[!]Michael Bolton...

To make it clear, we marked the desired separators with '[!]' in the input above. Each separator sits either between a \d (digit character) and a \D (non-digit character) or between a \D and a \d. If we translate this into a lookaround regex pattern, it’s (?<=\D)(?=\d)|(?<=\d)(?=\D).
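To see where this zero-length pattern matches, we can list the match positions on a short string. This is only a sketch: the class BoundaryPositions and the method boundaries() are hypothetical names, and the input "01ab23" is a shortened stand-in for our real inputs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {

    // The lookaround pattern from the text: a position between \D and \d, or between \d and \D.
    private static final Pattern BOUNDARY = Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

    // Collects the index of every zero-length match in the input.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = BOUNDARY.matcher(input);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Boundaries sit between '1' and 'a', and between 'b' and '2'.
        System.out.println(boundaries("01ab23")); // prints [2, 4]
    }
}
```

Each match has a length of zero, so no character is consumed. The pattern only identifies the positions where the string should be cut.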

Next, let’s write a test to verify if using split(), with this pattern, on the two inputs produces the desired results:

String splitRE = "(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)";
String[] result1 = INPUT1.split(splitRE);
assertArrayEquals(EXPECTED1, result1);

String[] result2 = INPUT2.split(splitRE);
assertArrayEquals(EXPECTED2, result2);

The test passes if we give it a run. So, we’ve solved the problem using the split() method.

Next, let’s solve the problem using a non-regex approach.

4. A Non-Regex-Based Approach

We’ve seen how to solve the problem using the regex-based split() approach. Alternatively, we can solve it without using pattern matching.

The idea is to walk through the input string character by character from the beginning and detect each transition between digit and non-digit characters. Let’s first look at the implementation and then understand how it works:

enum State {
    INIT, PARSING_DIGIT, PARSING_NON_DIGIT
}

List<String> parseString(String input) {
    List<String> result = new ArrayList<>();
    int start = 0;
    // enum constants are qualified so the snippet compiles without a static import
    State state = State.INIT;
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        if (c >= '0' && c <= '9') {
            if (state == State.PARSING_NON_DIGIT) { // non-digit to digit, get the substring as an element
                result.add(input.substring(start, i));
                start = i;
            }
            state = State.PARSING_DIGIT;
        } else {
            if (state == State.PARSING_DIGIT) { // digit to non-digit, get the substring as an element
                result.add(input.substring(start, i));
                start = i;
            }
            state = State.PARSING_NON_DIGIT;
        }
    }
    result.add(input.substring(start)); // add the last part
    return result;
}

Now, let’s walk through the code above quickly and understand how it works:

First, we initialize an empty ArrayList called result to store the extracted elements.
The start variable, initialized to 0, keeps track of the start index of the current substring during the iteration.
The state variable holds a State enum value, which records whether we’re currently inside a digit or a non-digit substring.
Then, we use a for loop to iterate through the input string characters and check each character’s type.
If the current character is a digit (0–9) and the state is PARSING_NON_DIGIT, we’ve hit a non-digit to digit transition, which means an element has ended. So, we add the substring from start (inclusive) to i (exclusive) to the result list and update start to the current index i. Whether or not a transition occurred, we then set state to PARSING_DIGIT.
The else block follows the same logic and handles the digit to non-digit transition.
After the for loop ends, we shouldn’t forget to add the last part of the string to the result list by using input.substring(start).
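The steps above can be tried out on a few edge cases with a self-contained sketch. It repeats the same logic as parseString() inside a hypothetical wrapper class, ParseStringDemo, so that it compiles on its own. The short inputs are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class ParseStringDemo {

    enum State {
        INIT, PARSING_DIGIT, PARSING_NON_DIGIT
    }

    // Same state-machine logic as the parseString() method described above.
    static List<String> parseString(String input) {
        List<String> result = new ArrayList<>();
        int start = 0;
        State state = State.INIT;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            State next = digit ? State.PARSING_DIGIT : State.PARSING_NON_DIGIT;
            // A change from one parsing state to the other closes the current element.
            if (state != State.INIT && state != next) {
                result.add(input.substring(start, i));
                start = i;
            }
            state = next;
        }
        result.add(input.substring(start)); // add the last part
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseString("ab12cd")); // prints [ab, 12, cd]
        System.out.println(parseString("123"));    // prints [123]
        // An empty input yields a list with one empty string, because the last part is always added.
        System.out.println(parseString("").size()); // prints 1
    }
}
```

The empty-input case is worth noting: the final substring() call always adds one element, so callers that need an empty list for empty input would have to handle that themselves.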

Next, let’s test the parseString() method with our two inputs:

List<String> result1 = parseString(INPUT1);
assertEquals(EXPECTED_LIST1, result1);

List<String> result2 = parseString(INPUT2);
assertEquals(EXPECTED_LIST2, result2);

If we run the test, it passes. So, our parseString() method does the job.

5. Performance

So far, we’ve covered two solutions to the problem: regex-based and non-regex-based. The regex-based split() solution is pretty straightforward, as it’s just a single method call. In contrast, our self-made parseString() method takes about twenty lines and has to handle every single character in the input on its own. So, some of us may ask why we’d introduce or even use the self-made method to solve the problem.

The answer is “performance.”

Although our parseString() solution looks lengthy and requires manual control of each character, it’s faster than the regex-based solution. Let’s understand the reasons for this:

The split() solution requires compiling the regex pattern and applying pattern matching. These operations are comparatively expensive, especially for complex patterns. In contrast, the parseString() method uses a simple enum-based state machine to track transitions between digit and non-digit characters. It relies on direct character comparisons and avoids the overhead of regex pattern matching and lookarounds.
In the parseString() method, substrings are extracted directly with the substring() method using known indices. This avoids the intermediate objects, such as the compiled Pattern and its Matcher, that the regex-based split() creates along the way.

However, the difference in performance may be negligible if the input string isn’t considerably long.
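Since pattern compilation is part of the regex cost mentioned above, one option is to compile the pattern once and reuse it through Pattern.split(). The following is only a sketch under that assumption. The class PrecompiledSplit and the constant DIGIT_BOUNDARY are hypothetical names, and this variant isn't part of the benchmark in this article, so any gain would need to be measured:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PrecompiledSplit {

    // Compiled once and reused, instead of passing the regex string to String.split() on each call.
    private static final Pattern DIGIT_BOUNDARY = Pattern.compile("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)");

    // Splits the input at every digit/non-digit boundary using the precompiled pattern.
    static String[] split(String input) {
        return DIGIT_BOUNDARY.split(input);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("01ab23"))); // prints [01, ab, 23]
    }
}
```

This still performs the lookaround matching on every call, so it removes only the compilation step, not the matching cost itself.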

Next, let’s benchmark the performance of these two approaches. We’ll use JMH (the Java Microbenchmark Harness) to do that. This is because JMH allows us to easily handle benchmarking factors, such as JVM warm-up, dead code elimination, and so on:

@State(Scope.Benchmark)
@Threads(1)
@BenchmarkMode(Mode.Throughput)
@Fork(warmups = 1, value = 1)
@Warmup(iterations = 2, time = 10, timeUnit = TimeUnit.MILLISECONDS)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class BenchmarkLiveTest {
    private static final String INPUT = "01Michael Jackson23Michael Jordan42Michael Bolton999Michael Johnson000";

    @Param({ "10000" })
    public int iterations;

    @Benchmark
    public void regexBased(Blackhole blackhole) {
        blackhole.consume(INPUT.split("(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)"));
    }

    @Benchmark
    public void nonRegexBased(Blackhole blackhole) {
        blackhole.consume(parseString(INPUT));
    }

    @Test
    public void benchmark() throws Exception {
        String[] argv = {};
        org.openjdk.jmh.Main.main(argv);
    }
}

As the above class shows, we benchmark the two approaches using the same input. The class also declares an iterations parameter with the value 10000. Neither benchmark method reads it, so it shows up in the report only as a column. We won’t dive into JMH and each annotation’s meaning here. But two annotations are important for understanding the final report: @OutputTimeUnit(TimeUnit.MILLISECONDS) and @BenchmarkMode(Mode.Throughput). This combination means we measure how many times each approach can run per millisecond.

Next, let’s take a look at the result JMH generates:

Benchmark                        (iterations)   Mode  Cnt     Score     Error   Units
BenchmarkLiveTest.nonRegexBased         10000  thrpt    5  3880.989 ± 134.021  ops/ms
BenchmarkLiveTest.regexBased            10000  thrpt    5   297.282 ±  24.818  ops/ms

As we can see, the non-regex-based solution’s throughput is about 13 times (3880/297 ≈ 13.06) that of the regex-based solution. Therefore, when we need to handle long strings in a performance-critical application, we should prefer parseString() over the split() solution.
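To relate the ops/ms scores to the time a single call takes, we can do the arithmetic in a few lines. This is a small sketch with hypothetical names (ThroughputMath, speedup(), nanosPerOp()). The two scores are copied from the report above:

```java
class ThroughputMath {

    // Ratio of two throughputs measured in the same unit.
    static double speedup(double fasterOpsPerMs, double slowerOpsPerMs) {
        return fasterOpsPerMs / slowerOpsPerMs;
    }

    // Average time per operation in nanoseconds, given a throughput in operations per millisecond.
    static double nanosPerOp(double opsPerMs) {
        return 1_000_000.0 / opsPerMs; // one millisecond is 1,000,000 nanoseconds
    }

    public static void main(String[] args) {
        double nonRegexOpsPerMs = 3880.989; // score of nonRegexBased in the report
        double regexOpsPerMs = 297.282;     // score of regexBased in the report

        System.out.printf("speedup: %.2fx%n", speedup(nonRegexOpsPerMs, regexOpsPerMs));
        System.out.printf("non-regex: %.0f ns/op%n", nanosPerOp(nonRegexOpsPerMs));
        System.out.printf("regex: %.0f ns/op%n", nanosPerOp(regexOpsPerMs));
    }
}
```

In other words, a single call takes roughly 0.26 microseconds for parseString() and roughly 3.4 microseconds for split() on this input. That gap only starts to matter when the method is called very often or on much longer strings.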

6. Conclusion

In this article, we’ve explored regex-based (split()) and non-regex-based (parseString()) approaches to breaking an input string into a string array or list containing digit elements and non-digit string elements in the original order.

The split() solution is compact and straightforward. However, when dealing with long input strings, it can be significantly slower than the self-made parseString() solution.

The code backing this article is available on GitHub.
