1. Introduction

In this article, we’ll introduce HtmlUnit, a tool that lets us interact with and test an HTML site programmatically, using Java APIs.

2. About HtmlUnit

HtmlUnit is a GUI-less browser – a browser intended to be used programmatically and not directly by a user.

The browser supports JavaScript (via the Mozilla Rhino engine) and can be used even for websites with complex AJAX functionality. While doing so, it can simulate a typical GUI-based browser like Chrome or Firefox.

The name HtmlUnit could lead you to think that it’s a testing framework. While it can definitely be used for testing, it can do much more than that.

It has also been integrated into Spring 4 and can be used seamlessly together with the Spring MVC Test framework.

3. Download and Maven Dependency

HtmlUnit can be downloaded from SourceForge or from the official website. You can also include it through your build tool (Maven or Gradle, among others). For instance, this is the Maven dependency you can include in your project:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.23</version>
</dependency>

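The newest version can be found on Maven Central. Note that releases from 3.0 onward are published under the org.htmlunit group ID, with the Java packages renamed accordingly.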

4. Web Testing

There are many ways in which you can test a web application – most of which we covered here on the site at one point or another.

With HtmlUnit you can directly parse the HTML of a site, interact with it just as a normal user would from the browser, check JavaScript and CSS syntax, submit forms and parse the responses to see the content of their HTML elements, all using pure Java code.

Let’s start with a simple test: create a WebClient and get the first page of the navigation of www.baeldung.com:

private WebClient webClient;

@Before
public void init() throws Exception {
    webClient = new WebClient();
}

@After
public void close() throws Exception {
    webClient.close();
}

@Test
public void givenAClient_whenEnteringBaeldung_thenPageTitleIsOk()
  throws Exception {
    HtmlPage page = webClient.getPage("https://www.baeldung.com/");
    
    Assert.assertEquals(
      "Baeldung | Java, Spring and Web Development tutorials",
        page.getTitleText());
}

You may see warnings or errors when running this test if the website has JavaScript or CSS problems. You should correct them.

Sometimes, if you know what you’re doing (for instance, if the only errors come from third-party JavaScript libraries that you should not modify), you can prevent these errors from making your test fail by calling setThrowExceptionOnScriptError with false:

@Test
public void givenAClient_whenEnteringBaeldung_thenPageTitleIsCorrect()
  throws Exception {
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    HtmlPage page = webClient.getPage("https://www.baeldung.com/");
    
    Assert.assertEquals(
      "Baeldung | Java, Spring and Web Development tutorials",
        page.getTitleText());
}

5. Web Scraping

You don’t need to use HtmlUnit just for your own websites. It’s a browser, after all: you can use it to navigate any website you like, sending and retrieving data as needed.

Fetching, parsing, storing and analyzing data from websites is the process known as web scraping and HtmlUnit can help you with the fetching and parsing parts.

The previous example shows how we can enter any website and navigate through it, retrieving all the info we want.

For instance, let’s go to Baeldung’s full archive of articles, navigate to the latest article and retrieve its title (first <h1> tag). For our test, that will be enough; but, if we wanted to store more info, we could, for instance, retrieve the headings (all <h2> tags) as well, thus having a basic idea of what the article is about.

It’s easy to get elements by their ID, but generally, if you need to find an element it’s more convenient to use XPath syntax. HtmlUnit allows us to use it, so we will.

@Test
public void givenBaeldungArchive_whenRetrievingArticle_thenHasH1() 
  throws Exception {
    webClient.getOptions().setCssEnabled(false);
    webClient.getOptions().setJavaScriptEnabled(false);

    String url = "https://www.baeldung.com/full_archive";
    HtmlPage page = webClient.getPage(url);
    String xpath = "(//ul[@class='car-monthlisting']/li)[1]/a";
    HtmlAnchor latestPostLink 
      = (HtmlAnchor) page.getByXPath(xpath).get(0);
    HtmlPage postPage = latestPostLink.click();

    List<HtmlHeading1> h1  
      = (List<HtmlHeading1>) postPage.getByXPath("//h1");
 
    Assert.assertTrue(h1.size() > 0);
}

Note first that, in this case, we are interested in neither CSS nor JavaScript and just want to parse the HTML layout, so we turned both off.
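
To see what the XPath expression in that test selects, here is a minimal sketch that evaluates it with the JDK’s built-in javax.xml.xpath API instead of HtmlUnit, so it runs without any extra dependency. The markup and the names ArchiveXPathSketch, firstLinkHref and countMatches are hypothetical stand-ins; the real archive page may be structured differently.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ArchiveXPathSketch {

    // Hypothetical, well-formed stand-in for the archive page (not the real markup).
    static final String ARCHIVE_MARKUP =
        "<html><body>"
      + "<ul class='car-monthlisting'>"
      + "<li><a href='/latest-post'>Latest post</a></li>"
      + "<li><a href='/older-post'>Older post</a></li>"
      + "</ul>"
      + "<ul class='car-monthlisting'>"
      + "<li><a href='/oldest-post'>Oldest post</a></li>"
      + "</ul>"
      + "</body></html>";

    // Parses the markup and evaluates the expression with the JDK's XPath engine.
    private static Object evaluate(String markup, String expression, QName type) {
        try {
            return XPathFactory.newInstance().newXPath()
              .evaluate(expression, new InputSource(new StringReader(markup)), type);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath or markup", e);
        }
    }

    // Returns the href of the first matching element, or null if nothing matches.
    // Assumes the expression selects elements (such as <a>), not attributes or text.
    static String firstLinkHref(String markup, String expression) {
        Element anchor = (Element) evaluate(markup, expression, XPathConstants.NODE);
        return anchor == null ? null : anchor.getAttribute("href");
    }

    // Counts how many nodes the expression selects.
    static int countMatches(String markup, String expression) {
        NodeList nodes = (NodeList) evaluate(markup, expression, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) {
        String firstOverall = "(//ul[@class='car-monthlisting']/li)[1]/a";
        String firstPerList = "//ul[@class='car-monthlisting']/li[1]/a";
        System.out.println(firstLinkHref(ARCHIVE_MARKUP, firstOverall)); // /latest-post
        System.out.println(countMatches(ARCHIVE_MARKUP, firstOverall));  // 1
        System.out.println(countMatches(ARCHIVE_MARKUP, firstPerList));  // 2
    }
}
```

The parentheses matter: with them, [1] picks the first list item across the whole document, whereas li[1] without them matches the first item of every list.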

In a real web scraping scenario, you could take, for example, the h1 and h2 titles, and the outcome would be something like this:

Java Web Weekly, Issue 135
1. Spring and Java
2. Technical and Musings
3. Comics
4. Pick of the Week

Comparing this with the site itself confirms that the retrieved info does correspond to the latest article on Baeldung.
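
As a rough illustration of how such an outline could be collected, the sketch below gathers the text of all h1 and h2 elements in document order. It uses the JDK’s javax.xml.xpath API on a hypothetical, well-formed snippet rather than HtmlUnit and the live page; ARTICLE_MARKUP, HeadingOutlineSketch and headings are made-up names for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class HeadingOutlineSketch {

    // Hypothetical, well-formed stand-in for an article page (not the real markup).
    static final String ARTICLE_MARKUP =
        "<html><body>"
      + "<h1>Java Web Weekly, Issue 135</h1>"
      + "<h2>1. Spring and Java</h2><p>Some text</p>"
      + "<h2>2. Technical and Musings</h2><p>Some text</p>"
      + "<h2>3. Comics</h2>"
      + "<h2>4. Pick of the Week</h2>"
      + "</body></html>";

    // Returns the trimmed text of every <h1> and <h2>, in document order.
    static List<String> headings(String markup) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
              .evaluate("//h1 | //h2",
                new InputSource(new StringReader(markup)),
                XPathConstants.NODESET);
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                titles.add(nodes.item(i).getTextContent().trim());
            }
            return titles;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid markup", e);
        }
    }

    public static void main(String[] args) {
        // Prints one heading per line, title first.
        headings(ARTICLE_MARKUP).forEach(System.out::println);
    }
}
```

With HtmlUnit, the same union expression should be usable with postPage.getByXPath, reading each element’s text content in the same way.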

6. What About AJAX?

AJAX functionalities can be a problem because HtmlUnit will usually retrieve the page before the AJAX calls have finished. Many times you need them to finish to properly test your website or to retrieve the data you want. There are some ways to deal with them:

  • You can use webClient.setAjaxController(new NicelyResynchronizingAjaxController()). This controller resynchronizes AJAX calls made from the main thread, performing them synchronously so that there is a stable state to test.
  • When entering a page of a web application, you can wait a few seconds so that AJAX calls have enough time to finish. To achieve this, you can use webClient.waitForBackgroundJavaScript(MILLIS) or webClient.waitForBackgroundJavaScriptStartingBefore(MILLIS). You should call them after retrieving the page, but before working with it.
  • You can wait until some expected condition related to the execution of the AJAX call is met. For instance:
for (int i = 0; i < 20; i++) {
    if (condition_to_happen_after_js_execution) {
        break;
    }
    synchronized (page) {
        page.wait(500);
    }
}
  • Instead of using new WebClient(), which defaults to the best-supported web browser, try other browser versions, since they might work better with your JavaScript or AJAX calls. For instance, this creates a WebClient that simulates Chrome:
WebClient webClient = new WebClient(BrowserVersion.CHROME);
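
The wait-for-a-condition loop from the list above can be wrapped in a small reusable helper. This is only a sketch under a few assumptions: PollingWaitSketch and waitUntil are hypothetical names, it pauses with Thread.sleep rather than waiting on the page object, and it checks the condition one last time after the final pause. It uses nothing beyond the JDK.

```java
import java.util.function.BooleanSupplier;

final class PollingWaitSketch {

    // Polls the condition up to maxAttempts times, pausing between checks.
    // Returns true as soon as the condition holds, false if it never does
    // or if the waiting thread is interrupted.
    static boolean waitUntil(BooleanSupplier condition, int maxAttempts, long pauseMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        // One last check after the final pause.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true once roughly 50 ms have passed.
        boolean met = waitUntil(
          () -> System.currentTimeMillis() - start >= 50, 20, 10);
        System.out.println("Condition met: " + met);
    }
}
```

In an HtmlUnit test, the condition would typically inspect the page, for example by checking that an element filled in by the AJAX call is present. The values 20 and 500 from the loop above would then be passed as maxAttempts and pauseMillis.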

7. An Example With Spring

If we’re testing our own Spring application, then things get a little bit easier – we no longer need a running server.

Let’s implement a very simple example app: just a controller with a method that receives a text, and a single HTML page with a form. The user can input a text into the form, submit the form, and the text will be shown below that form.

In this case, we’ll use a Thymeleaf template for that HTML page. The test for the app looks like this:

@RunWith(SpringJUnit4ClassRunner.class)
@WebAppConfiguration
@ContextConfiguration(classes = { TestConfig.class })
public class HtmlUnitAndSpringTest {

    @Autowired
    private WebApplicationContext wac;

    private WebClient webClient;

    @Before
    public void setup() {
        webClient = MockMvcWebClientBuilder
          .webAppContextSetup(wac).build();
    }

    @Test
    public void givenAMessage_whenSent_thenItShows() throws Exception {
        String text = "Hello world!";
        HtmlPage page;

        String url = "http://localhost/message/showForm";
        page = webClient.getPage(url);
            
        HtmlTextInput messageText = page.getHtmlElementById("message");
        messageText.setValueAttribute(text);

        HtmlForm form = page.getForms().get(0);
        HtmlSubmitInput submit = form.getOneHtmlElementByAttribute(
          "input", "type", "submit");
        HtmlPage newPage = submit.click();

        String receivedText = newPage.getHtmlElementById("received")
            .getTextContent();

        // JUnit expects the expected value first, then the actual one
        Assert.assertEquals(text, receivedText);
    }
}

The key here is building the WebClient object using MockMvcWebClientBuilder from the WebApplicationContext. With the WebClient, we can get the first page of the navigation (notice how it’s served by localhost), and start browsing from there.

As you can see, the test loads the form, enters a message (in a field with ID “message”), submits the form and, on the new page, asserts that the received text (field with ID “received”) is the same as the text we submitted.

8. Conclusion

HtmlUnit is a great tool that allows you to test your web applications easily, filling in form fields and submitting them just as if you were using the site in a browser.

It integrates seamlessly with Spring 4, and together with the Spring MVC Test framework it gives you a very powerful environment for integration tests of all your pages, even without a web server.

Also, using HtmlUnit you can automate any task related to web browsing, such as fetching, parsing, storing and analyzing data (web scraping).

You can get the code over on GitHub.
