1. Overview

In some applications, we may need to download a webpage from the internet and extract its content as a string. Common use cases include web scraping and content parsing.

In this tutorial, we’ll use Jsoup and HttpURLConnection to download a sample webpage.

2. Download a Webpage Using HttpURLConnection

HttpURLConnection is a subclass of URLConnection that adds HTTP-specific features. We obtain an instance by opening a connection to a Uniform Resource Locator (URL) that uses HTTP as its protocol. The class provides methods for configuring the request and inspecting the response.

Let’s download a sample webpage using HttpURLConnection:

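@Test
void givenURLConnection_whenRetrieveWebpage_thenWebpageIsNotNullAndContainsHtmlTag() throws IOException {
    
    URL url = new URL("https://example.com");
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.setRequestMethod("GET");
    
    // Decode the response explicitly, assuming the page is UTF-8 encoded,
    // instead of relying on the platform default charset
    try (BufferedReader reader = new BufferedReader(
      new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
        StringBuilder responseBuilder = new StringBuilder();
        String line;
        // readLine() strips line terminators, so the lines are joined without line breaks
        while ((line = reader.readLine()) != null) {
            responseBuilder.append(line);
        }
    
        assertNotNull(responseBuilder);
        // Match "<html" rather than "<html>" so the check still passes if the tag carries attributes
        assertTrue(responseBuilder.toString()
          .contains("<html"));
    }
}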

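Here, we create a URL object that represents the address of the webpage. Next, we invoke the openConnection() method on the URL object and cast the result to HttpURLConnection. Note that this doesn’t send anything over the network yet. The request is made when we read the response, for example, by calling getInputStream(). We also set the request method to GET, which is the default, to make it explicit that we’re retrieving the content of the webpage.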
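To see these steps end to end without depending on an external site, here’s a minimal sketch that serves a page from a throwaway local server and downloads it with HttpURLConnection. The class and method names, the five-second timeouts, and the use of the JDK’s built-in com.sun.net.httpserver.HttpServer as a stand-in for a real website are all assumptions made for illustration. The sketch also uses URI.create(...).toURL() rather than the URL constructor and requires Java 9 or later for readAllBytes():

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class LocalPageDownload {

    // Hypothetical helper: downloads a page and fails fast on any non-200 response
    static String download(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(address).toURL().openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(5_000); // assumed value, in milliseconds
        connection.setReadTimeout(5_000);    // assumed value, in milliseconds
        try {
            // The request is actually sent here, not in openConnection()
            int status = connection.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status: " + status);
            }
            try (InputStream body = connection.getInputStream()) {
                // Assumes the page is UTF-8 encoded
                return new String(body.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            connection.disconnect();
        }
    }

    // Starts a local server on a free port, downloads its only page, then stops the server
    static String downloadFromLocalServer(String html) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        byte[] payload = html.getBytes(StandardCharsets.UTF_8);
        server.createContext("/", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, payload.length);
            exchange.getResponseBody().write(payload);
            exchange.close();
        });
        server.start();
        try {
            return download("http://127.0.0.1:" + server.getAddress().getPort() + "/");
        } finally {
            server.stop(0);
        }
    }
}
```

Checking the status code before reading the body lets us report an error page as a failure instead of returning its content as if it were the requested page.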

Then, we wrap the connection’s input stream in an InputStreamReader and a BufferedReader to read the data from the webpage. The InputStreamReader class decodes raw bytes into characters, and BufferedReader buffers them so that we can read the response line by line.

Finally, we convert the webpage to a String by reading from BufferedReader and concatenating the lines. We use a StringBuilder to concatenate the lines efficiently. Since readLine() strips line terminators, the resulting String doesn’t contain the original line breaks.
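If we want to keep the line breaks and control how bytes are decoded, the reading step can be pulled into a small helper. The following is a sketch rather than part of the original example: the class and method names are hypothetical, and it uses BufferedReader.lines() with Collectors.joining() in place of the StringBuilder loop. It works on any InputStream, so we can try it on an in-memory byte array without a network connection:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class StreamToString {

    // Hypothetical helper: decodes the stream with the given charset
    // and joins the lines with "\n" so that line breaks are preserved
    static String readAll(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stand-in for connection.getInputStream()
        byte[] page = "<html>\n<body>caf\u00e9</body>\n</html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

With an HttpURLConnection, we’d pass connection.getInputStream() as the first argument and the charset of the page as the second.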

3. Download a Webpage Using Jsoup

Jsoup is a popular open-source Java library for working with HTML. It helps fetch URLs and extract their data. One of its major strengths is scraping HTML from a URL using HTML DOM methods and CSS selectors.

To use Jsoup, we need to add its dependency to our project. Let’s add the Jsoup dependency to the pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>

Here’s an example of downloading a webpage using Jsoup:

@Test
void givenJsoup_whenRetrievingWebpage_thenWebpageDocumentIsNotNullAndContainsHtmlTag() throws IOException {
        
    Document document = Jsoup.connect("https://www.example.com").get();
    String webpage = document.html();
        
    assertNotNull(webpage);
    // Match "<html" rather than "<html>" so the check still passes if the tag carries attributes
    assertTrue(webpage.contains("<html"));
}

In this example, we pass the address of the sample site to Jsoup.connect(). This method returns a Connection object that we can use to configure and execute the request.

Next, we invoke the get() method, which sends a GET request to the specified URL, parses the response, and returns it as a Document.

Finally, we invoke the html() method on the Document object and store the resulting HTML in the String variable webpage.
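Since Jsoup’s strength is extracting data with CSS selectors, here’s a short sketch of what that can look like once we have the HTML. To keep it runnable offline, it parses an inline HTML string with Jsoup.parse() instead of downloading a page. The class name, the helper method, and the sample markup are made up for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupExtraction {

    // Hypothetical helper: returns the text of the first element
    // matching the CSS selector, or an empty string if nothing matches
    static String firstText(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        Element match = document.selectFirst(cssSelector);
        return match == null ? "" : match.text();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a downloaded page
        String html = "<html><head><title>Sample</title></head>"
          + "<body><h1>Example Domain</h1><p class=\"intro\">Hello</p></body></html>";

        System.out.println(firstText(html, "h1"));      // prints "Example Domain"
        System.out.println(firstText(html, "p.intro")); // prints "Hello"
    }
}
```

The same selectFirst() call works on the Document returned by Jsoup.connect(...).get(), so we don’t need to convert the page to a String first.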

4. Conclusion

In this article, we learned two ways of downloading a webpage in Java: the HttpURLConnection class from the JDK and the Jsoup library. Both approaches work. HttpURLConnection needs no extra dependency, while Jsoup requires less code and also parses the HTML for us.

As always, the complete source code for the examples is available over on GitHub.
