1. Overview

In this tutorial, we'll look briefly at the different ways of preserving line breaks when using Jsoup to parse HTML to plain text. We will cover how to preserve line breaks associated with newline (\n) characters, as well as those associated with <br> and <p> tags.

2. Preserving \n While Parsing HTML Text

By default, Jsoup normalizes whitespace in the HTML text, so each newline character (\n) is replaced with a space.

To prevent Jsoup from removing the newline characters, we can create a Document.OutputSettings instance and disable pretty-print. With pretty-print disabled, the HTML output methods do not reformat the output, so it looks like the input:

Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);

Next, we use Jsoup#clean with Whitelist#none to remove all the HTML tags from the string (in jsoup 1.14.2 and later, the Whitelist class is named Safelist):

String strHTML = "<html><body>Hello\nworld</body></html>";
String strWithNewLines = Jsoup.clean(strHTML, "", Whitelist.none(), outputSettings);

Let's see what our output string strWithNewLines looks like:

assertEquals("Hello\nworld", strWithNewLines);

So, by calling Jsoup#clean with Whitelist#none and disabling the pretty-print output setting, we preserve the line breaks associated with the newline character.

Let's see what else we can do!

3. Preserving Line Breaks Associated with <br> and <p> Tags

When we clean HTML text with the Jsoup#clean method, the line breaks implied by tags like <br> and <p> are lost, because the tags themselves are removed.

To preserve the line breaks associated with these tags, we first need to create a Jsoup Document from our HTML string:

String strHTML = "<html><body>Hello<br>World<p>Paragraph</p></body></html>";
Document jsoupDoc = Jsoup.parse(strHTML);

Next, we insert a literal \n marker (a backslash followed by the letter n) before each <br> and <p> tag. Once again, we disable the pretty-print output setting as well:

Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
jsoupDoc.outputSettings(outputSettings);
// Insert a literal "\n" marker (backslash + n) before every <br> and <p>
jsoupDoc.select("br").before("\\n");
jsoupDoc.select("p").before("\\n");

Here, we used the select method of the Jsoup Document along with the before method to insert the literal \n text before each matching element.

After that, we get the HTML string from jsoupDoc and replace each literal \n marker with a real newline character:

String str = jsoupDoc.html().replaceAll("\\\\n", "\n");
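The four backslashes in the replaceAll call above are easy to misread, so here is a small sketch in plain Java (no Jsoup involved) of what that step does. The PlaceholderDemo class, its method names, and the hand-written input string are hypothetical and only stand in for the value returned by jsoupDoc.html(). It also shows String#replace as a possible alternative: because that method treats its argument literally instead of as a regular expression, two backslashes are enough.

```java
public class PlaceholderDemo {

    // The marker inserted before <br> and <p>: two characters, a backslash and an 'n'.
    static final String PLACEHOLDER = "\\n";

    // Regex variant, as in the tutorial: the pattern has to match a literal
    // backslash, which takes four backslashes in Java source code.
    static String restoreWithRegex(String html) {
        return html.replaceAll("\\\\n", "\n");
    }

    // Literal variant: String#replace does no regex parsing.
    static String restoreLiteral(String html) {
        return html.replace(PLACEHOLDER, "\n");
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by jsoupDoc.html()
        String html = "Hello" + PLACEHOLDER + "<br>World" + PLACEHOLDER + "<p>Paragraph</p>";

        System.out.println(PLACEHOLDER.length()); // prints 2
        System.out.println(restoreWithRegex(html).equals(restoreLiteral(html))); // prints true
    }
}
```

One caveat of any marker-based approach: if the original page text already contains a backslash followed by n, that text is turned into a line break as well.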

Finally, we call Jsoup#clean with Whitelist#none and the pretty-print output setting disabled:

String strWithNewLines = Jsoup.clean(str, "", Whitelist.none(), outputSettings);

And our output string strWithNewLines looks like:

assertEquals("Hello\nWorld\nParagraph", strWithNewLines);

Thus, by prepending <br> and <p> HTML tags with the newline character, and disabling the pretty-print output setting of Jsoup, we can preserve the line breaks associated with them.

4. Conclusion

In this short article, we learned how to preserve line breaks associated with newline (\n) characters and the <br> and <p> tags when parsing HTML into plain text with Jsoup.

As always, all these code samples are available over on GitHub.
