Extract Text From an HTML Tag with Regex

Last updated: May 17, 2024

Written by: Mohamed Helmy

Reviewed by: Michal Aibin

1. Introduction

When working with HTML content in Java, we often need to extract specific text from HTML tags. Parsing HTML with regular expressions (regex) is generally discouraged because HTML's structure is too complex for regex to handle reliably, but it can be sufficient for simple tasks.

In this tutorial, we’ll see how to extract text from HTML tags using regex in Java.

2. Using Pattern and Matcher Classes

Java provides the Pattern and Matcher classes from java.util.regex, allowing us to define and apply regular expressions to extract text from strings. Below is an example of how to extract text from a specified HTML tag using regex:

@Test
void givenHtmlContentWithBoldTags_whenUsingPatternMatcherClasses_thenExtractText() {
    String htmlContent = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";
    String tagName = "b";
    String patternString = "<" + tagName + ">(.*?)</" + tagName + ">";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(htmlContent);

    List<String> extractedTexts = new ArrayList<>();
    while (matcher.find()) {
        extractedTexts.add(matcher.group(1));
    }

    assertEquals("Baeldung", extractedTexts.get(0));
    assertEquals("extracting text", extractedTexts.get(1));
}

Here, we first define the HTML content, htmlContent, which contains two <b> tags. We also set tagName to “b” so that we extract text from <b> tags.

Then, we compile the regex pattern using the compile() method. Here, patternString evaluates to “<b>(.*?)</b>”, where the lazy group (.*?) captures the text between the opening and closing <b> tags. Afterward, we use a while loop with the find() method to iterate over all matches and add the captured group of each match to the list named extractedTexts.

Finally, we assert that two texts (“Baeldung” and “extracting text”) are extracted from the <b> tags.
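To see why the pattern uses the lazy quantifier .*? rather than the greedy .*, the following minimal sketch runs both variants on the same input. The class TagTextExtractor and its method names are hypothetical and only serve the illustration. Wrapping the tag name in Pattern.quote() is also an addition not present in the original example; it guards against tag names that happen to contain regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the article's code base.
final class TagTextExtractor {

    // Collects capture group 1 of every match of the regex in the input.
    static List<String> extract(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    // Lazy quantifier: each match stops at the nearest closing tag.
    static List<String> extractLazy(String tagName, String html) {
        String tag = Pattern.quote(tagName); // treat the tag name as a literal
        return extract("<" + tag + ">(.*?)</" + tag + ">", html);
    }

    // Greedy quantifier: a single match runs to the last closing tag on the line.
    static List<String> extractGreedy(String tagName, String html) {
        String tag = Pattern.quote(tagName);
        return extract("<" + tag + ">(.*)</" + tag + ">", html);
    }

    public static void main(String[] args) {
        String html = "<div>This is a <b>Baeldung</b> article for <b>extracting text</b> from HTML tags.</div>";

        System.out.println(extractLazy("b", html));
        // prints [Baeldung, extracting text]

        System.out.println(extractGreedy("b", html));
        // prints [Baeldung</b> article for <b>extracting text]
    }
}
```

With the greedy variant, the two <b> elements collapse into one match that includes the markup between them, which is why the lazy form is the better fit when a tag appears more than once.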

To handle cases where tag contents may contain newlines, we can modify the pattern string by adding (?s) as follows:

String patternString = "(?s)<" + tagName + ">(.*?)</" + tagName + ">";

Here, patternString evaluates to “(?s)<b>(.*?)</b>”. The (?s) flag enables dotall mode, in which the dot also matches line terminators, so the content of a <b> tag can span multiple lines.
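The effect of dotall mode can be checked with a small sketch like the one below. The class name DotallDemo and the sample input are assumptions made for this illustration. It also shows that passing the Pattern.DOTALL flag to compile() is an equivalent alternative to the inline (?s) flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input string is made up for this example.
final class DotallDemo {

    // Collects capture group 1 of every match of the pattern in the input.
    static List<String> extract(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<b>first line\nsecond line</b>";

        // Default mode: the dot does not match '\n', so the tag is not found.
        Pattern defaultMode = Pattern.compile("<b>(.*?)</b>");
        System.out.println(extract(defaultMode, html).size()); // prints 0

        // Inline flag, as used in the pattern string above.
        Pattern inlineFlag = Pattern.compile("(?s)<b>(.*?)</b>");
        System.out.println(extract(inlineFlag, html).size()); // prints 1

        // Equivalent: pass the flag as the second argument of compile().
        Pattern compileFlag = Pattern.compile("<b>(.*?)</b>", Pattern.DOTALL);
        System.out.println(extract(compileFlag, html).size()); // prints 1
    }
}
```

Which form to use is mostly a matter of taste: the inline flag keeps the whole pattern in one string, while the compile() argument keeps the regex itself shorter.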

3. Using JSoup for HTML Parsing and Extraction

For more complex HTML parsing tasks, especially those involving nested tags, we recommend a dedicated library like JSoup. Let’s demonstrate how to use JSoup to extract text from <p> tags, including handling nested tags:

@Test
void givenHtmlContentWithNestedParagraphTags_thenExtractAllTextsFromHtmlTag() {
    String htmlContent = "<div>This is a <p>multiline\nparagraph <strong>with nested</strong> content</p> and <p>line breaks</p>.</div>";

    Document doc = Jsoup.parse(htmlContent);
    Elements paragraphElements = doc.select("p");

    List<String> extractedTexts = new ArrayList<>();
    for (Element paragraphElement : paragraphElements) {
        String extractedText = paragraphElement.text();
        extractedTexts.add(extractedText);
    }

    assertEquals(2, extractedTexts.size());
    assertEquals("multiline paragraph with nested content", extractedTexts.get(0));
    assertEquals("line breaks", extractedTexts.get(1));
}

Here, we use the parse() method to parse the htmlContent string, converting it into a Document object. Next, we call the select() method on the doc object to select all <p> elements within the parsed document.

Subsequently, we iterate over the selected paragraphElements collection, extracting the text content of each <p> element using the paragraphElement.text() method. As the assertions show, the returned text no longer contains the nested <strong> markup or the newline.
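For comparison, here is a rough sketch of what the regex approach from Section 2 returns for the same kind of input. The class RegexLimitsDemo is hypothetical, and the second input, a <p> tag with a class attribute, is an assumption added for illustration. It uses only java.util.regex, so it runs without JSoup on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing where the simple tag regex falls short.
final class RegexLimitsDemo {

    // Same idea as the Section 2 pattern, with dotall mode enabled.
    static List<String> extractWithRegex(String tagName, String html) {
        String tag = Pattern.quote(tagName);
        Pattern pattern = Pattern.compile("(?s)<" + tag + ">(.*?)</" + tag + ">");
        Matcher matcher = pattern.matcher(html);
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            results.add(matcher.group(1));
        }
        return results;
    }

    public static void main(String[] args) {
        String nested = "<div>This is a <p>multiline\nparagraph <strong>with nested</strong> content</p>"
          + " and <p>line breaks</p>.</div>";

        // Two matches are found, but the first one still contains the raw
        // <strong> markup and the newline instead of clean text.
        List<String> fromNested = extractWithRegex("p", nested);
        System.out.println(fromNested.size()); // prints 2
        System.out.println(fromNested.get(0).contains("<strong>")); // prints true

        // An opening tag with an attribute does not match the pattern at all.
        String withAttribute = "<p class=\"intro\">Hello</p>";
        System.out.println(extractWithRegex("p", withAttribute).isEmpty()); // prints true
    }
}
```

Cases like these can be patched with more elaborate patterns, but each patch adds complexity, which is the main reason a parser is the safer choice for anything beyond simple, predictable markup.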

4. Conclusion

In this article, we explored two approaches to extracting text from HTML tags in Java. First, we used the Pattern and Matcher classes for regex-based text extraction, which works well for simple, predictable markup. Then, we used JSoup for more complex HTML parsing tasks, such as handling nested tags.

The code backing this article is available on GitHub.