I just announced the new Learn Spring course, focused on the fundamentals of Spring 5 and Spring Boot 2:

>> CHECK OUT THE COURSE

1. Overview

According to Wikipedia, an anagram is a word or phrase formed by rearranging the letters of a different word or phrase.

We can generalize this in string processing by saying that an anagram of a string is another string with exactly the same quantity of each character in it, in any order.

In this tutorial, we're going to look at detecting whole string anagrams where the quantity of each character must be equal, including non-alpha characters such as spaces and digits. For example, “!low-salt!” and “owls-lat!!” would be considered anagrams as they contain exactly the same characters.

2. Solution

Let's compare a few solutions that can decide if two strings are anagrams. Each solution will check at the start whether the two strings have the same number of characters. This is a quick way to exit early since inputs with different lengths cannot be anagrams.

For each possible solution, let's look at the implementation complexity for us as developers. We'll also look at the time complexity for the CPU, using big O notation.

3. Check by Sorting

We can sort the characters of each string, which produces two normalized arrays of characters.

If two strings are anagrams, their normalized forms should be the same.

In Java, we can first convert the two strings into char[] arrays. Then we can sort these two arrays and check for equality:

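import java.util.Arrays;

boolean isAnagramSort(String string1, String string2) {
    // Strings of different lengths can never be anagrams
    if (string1.length() != string2.length()) {
        return false;
    }
    char[] a1 = string1.toCharArray();
    char[] a2 = string2.toCharArray();
    // Sorting puts both character arrays into the same normalized order
    Arrays.sort(a1);
    Arrays.sort(a2);
    return Arrays.equals(a1, a2);
}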

This solution is easy to understand and implement. However, the overall running time of this algorithm is O(n log n) because sorting an array of n characters takes O(n log n) time.

For the algorithm to function, it must make a copy of both input strings as character arrays, using a little extra memory.
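
To make the normalized form concrete, here is a small self-contained sketch that applies the sorting idea to the example strings from the overview. The class name AnagramSortDemo and the normalize helper are hypothetical names introduced only for this illustration:

```java
import java.util.Arrays;

class AnagramSortDemo {

    // Hypothetical helper: returns the characters of the input in sorted order
    static String normalize(String source) {
        char[] chars = source.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Same idea as the sorting check: equal lengths, then equal normalized forms
    static boolean isAnagramSort(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        return normalize(string1).equals(normalize(string2));
    }

    public static void main(String[] args) {
        System.out.println(normalize("!low-salt!"));                   // !!-allostw
        System.out.println(normalize("owls-lat!!"));                   // !!-allostw
        System.out.println(isAnagramSort("!low-salt!", "owls-lat!!")); // true
        System.out.println(isAnagramSort("Listen", "silent"));         // false, the check is case sensitive
    }
}
```

Both inputs normalize to the same string, so the check reports them as anagrams. The last call shows that upper and lower case letters are treated as different characters.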

4. Check by Counting

An alternative strategy is to count the number of occurrences of each character in our inputs. If these histograms are equal between the inputs, then the strings are anagrams.

To save a little memory, let's build only one histogram. We'll increment the counts for each character in the first string, and decrement the count for each character in the second. If the two strings are anagrams, then the result will be that everything balances to 0.

The histogram needs a fixed-size table of counts, with a size defined by the size of the character set. For example, if every character in our inputs fits in one byte (values 0 to 255), then we can use a counting array of size 256 to count the occurrences of each character:

private static final int CHARACTER_RANGE = 256;

public boolean isAnagramCounting(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    // Assumes every character value is below CHARACTER_RANGE;
    // a larger value would throw ArrayIndexOutOfBoundsException
    int[] count = new int[CHARACTER_RANGE];
    for (int i = 0; i < string1.length(); i++) {
        count[string1.charAt(i)]++;
        count[string2.charAt(i)]--;
    }
    // For anagrams, every increment is cancelled by a decrement
    for (int i = 0; i < CHARACTER_RANGE; i++) {
        if (count[i] != 0) {
            return false;
        }
    }
    return true;
}

This solution is faster, with a time complexity of O(n). However, it needs extra space for the counting array. At 256 integers, which is enough for ASCII and other single-byte character sets, that's not too bad.

However, if we need to increase CHARACTER_RANGE to cover larger character sets, such as the full Unicode range, the table becomes much more memory hungry. Therefore, this approach is only really practical when the number of possible characters is small.
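
To put a number on that trade-off, the following sketch sizes the table for every possible Java char value rather than 256. This is an assumption made for illustration, not a recommendation, and the class name AnagramCountingFullRange is hypothetical. Note that it counts UTF-16 code units, so characters outside the Basic Multilingual Plane are counted as two separate units:

```java
class AnagramCountingFullRange {

    // Assumption for this sketch: one slot per possible char value (0 to 65535)
    static final int CHARACTER_RANGE = Character.MAX_VALUE + 1;

    static boolean isAnagramCounting(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        int[] count = new int[CHARACTER_RANGE];
        for (int i = 0; i < string1.length(); i++) {
            count[string1.charAt(i)]++;
            count[string2.charAt(i)]--;
        }
        // Any non-zero slot means the character counts differ
        for (int c : count) {
            if (c != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Size of the element data in the table: 65536 slots * 4 bytes each
        System.out.println(CHARACTER_RANGE * Integer.BYTES); // 262144
        System.out.println(isAnagramCounting("\u65e5\u672c\u8a9e", "\u8a9e\u672c\u65e5")); // true
    }
}
```

The table grows from about 1 KB to about 256 KB of element data, and every call also has to scan all 65,536 slots, regardless of how short the inputs are.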

From a development point of view, this solution contains more code to maintain and makes less use of Java library functions.

5. Check with Multiset

We can simplify the counting and comparing process by using Multiset. Multiset is a collection that supports order-independent equality with duplicate elements. For example, the multisets {a, a, b} and {a, b, a} are equal.

To use Multiset, we first need to add the Guava dependency to our project pom.xml file:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.1-jre</version>
</dependency>

We will convert each of our input strings into a Multiset of characters. Then we'll check if they're equal:

import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

boolean isAnagramMultiset(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    Multiset<Character> multiset1 = HashMultiset.create();
    Multiset<Character> multiset2 = HashMultiset.create();
    // Both strings have the same length, so one loop can fill both multisets
    for (int i = 0; i < string1.length(); i++) {
        multiset1.add(string1.charAt(i));
        multiset2.add(string2.charAt(i));
    }
    return multiset1.equals(multiset2);
}

This algorithm solves the problem in O(n) time without having to declare a big counting array.

It's similar to the previous counting solution. However, rather than using a fixed-size table to count, we take advantage of the Multiset class to simulate a variable-sized table, with a count for each character.

The code for this solution makes more use of high-level library capabilities than our counting solution.
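
To illustrate what a variable-sized table of counts looks like, here is a rough sketch that uses only the JDK, with a HashMap standing in for the multiset. It is not how Guava implements Multiset, and the names AnagramMapDemo, histogram and isAnagramMap are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

class AnagramMapDemo {

    // Builds a character-to-count table that only holds characters that occur
    static Map<Character, Integer> histogram(String source) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < source.length(); i++) {
            // Start at 1 for a new character, otherwise add 1 to the existing count
            counts.merge(source.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    static boolean isAnagramMap(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        // Map equality compares keys and counts, independent of insertion order
        return histogram(string1).equals(histogram(string2));
    }

    public static void main(String[] args) {
        System.out.println(histogram("aab").equals(histogram("aba")));  // true
        System.out.println(isAnagramMap("!low-salt!", "owls-lat!!"));   // true
        System.out.println(isAnagramMap("aab", "abb"));                 // false
    }
}
```

The table only grows with the number of distinct characters in the input, which is why this style of counting does not need a large array declared up front.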

6. Letter-based Anagram

The examples so far do not strictly adhere to the linguistic definition of an anagram. This is because they consider punctuation characters part of the anagram, and they are case sensitive.

Let's adapt the algorithms to enable a letter-based anagram. Let's only consider the rearrangement of case-insensitive letters, ignoring other characters such as whitespace and punctuation. For example, “A decimal point” and “I’m a dot in place.” would be anagrams of each other.

To solve this problem, we can first preprocess the two input strings to filter out unwanted characters and convert the remaining letters to lowercase. Then we can use one of the above solutions (say, the Multiset solution) to check for anagrams on the processed strings:

String preprocess(String source) {
    return source.replaceAll("[^a-zA-Z]", "").toLowerCase();
}

boolean isLetterBasedAnagramMultiset(String string1, String string2) {
    return isAnagramMultiset(preprocess(string1), preprocess(string2));
}

This approach can be a general way to solve all variants of the anagram problems. For example, if we also want to include digits, we just need to adjust the preprocessing filter.
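
As a sketch of that adjustment, the filter below keeps digits as well as letters. To stay free of external dependencies it delegates to the sorting check rather than the Multiset one. The names AnagramPreprocessDemo, preprocessLettersAndDigits and isLetterOrDigitAnagram are hypothetical, and the use of Locale.ROOT is an added assumption so that lowercasing does not depend on the default locale:

```java
import java.util.Arrays;
import java.util.Locale;

class AnagramPreprocessDemo {

    // Keeps only ASCII letters and digits, then lowercases the result
    static String preprocessLettersAndDigits(String source) {
        return source.replaceAll("[^a-zA-Z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Sorting check, repeated here so the sketch is self-contained
    static boolean isAnagramSort(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        char[] a1 = string1.toCharArray();
        char[] a2 = string2.toCharArray();
        Arrays.sort(a1);
        Arrays.sort(a2);
        return Arrays.equals(a1, a2);
    }

    static boolean isLetterOrDigitAnagram(String string1, String string2) {
        return isAnagramSort(preprocessLettersAndDigits(string1), preprocessLettersAndDigits(string2));
    }

    public static void main(String[] args) {
        System.out.println(preprocessLettersAndDigits("Room 101!"));                        // room101
        System.out.println(isLetterOrDigitAnagram("Room 101!", "1 moor, 01"));               // true
        System.out.println(isLetterOrDigitAnagram("Room 101", "Room 102"));                  // false
        System.out.println(isLetterOrDigitAnagram("A decimal point", "I'm a dot in place.")); // true
    }
}
```

Only the regular expression changed compared with the letter-only version. The anagram check itself is untouched, which is the point of separating preprocessing from comparison.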

7. Conclusion

In this article, we looked at three algorithms for checking whether a given string is an anagram of another, character for character. For each solution, we discussed the trade-offs between the speed, readability, and size of memory required.

We also looked at how to adapt the algorithms to check for anagrams in the more traditional linguistic sense. We achieved this by preprocessing the inputs into lowercase letters.

As always, the source code for the article is available over on GitHub.
