Parsing HTML Table in Java With Jsoup

1. Overview

Jsoup is an open-source library used to scrape HTML pages. It provides an API for data parsing, extraction, and manipulation using DOM API methods.

In this article, we will see how to parse an HTML table using Jsoup. We will retrieve and update data in the table, and we will also add and delete rows.

2. Dependencies

To use the Jsoup library, add the following dependency to the project:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>

We can find the latest version of the Jsoup library in the Maven central repository.

3. Table Structure

To illustrate parsing HTML tables with Jsoup, we will use a sample HTML structure. The complete HTML structure is available in the code base provided in the GitHub repository mentioned at the end of the article. Here, we show a table with only a header row and one data row for representational purposes:

<table>
    <thead>
        <tr>
            <th>Name</th>
            <th>Maths</th>
            <th>English</th>
            <th>Science</th>
         </tr>
    </thead>
    <tbody>
        <tr>
            <td>Student 1</td>
            <td>90</td>
            <td>85</td>
            <td>92</td>
        </tr>
     </tbody>
</table>

As we can see, the table has a header row inside the thead tag, followed by data rows inside the tbody tag. We are assuming that the table in the HTML document will be in the above format.
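The tests later in the article call a loadFromFile method that is not shown here. As a minimal sketch of the loading step, assuming the markup is available as a string, the snippet below parses the sample structure with Jsoup.parse and checks that it has the expected shape. The class name SampleTableLoader and the helper loadFromString are hypothetical names used only for illustration:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class SampleTableLoader {

    // Sample markup mirroring the structure shown above (one header row, one data row)
    static final String SAMPLE_HTML =
        "<table>"
      + "<thead><tr><th>Name</th><th>Maths</th><th>English</th><th>Science</th></tr></thead>"
      + "<tbody><tr><td>Student 1</td><td>90</td><td>85</td><td>92</td></tr></tbody>"
      + "</table>";

    // Hypothetical helper: parses an HTML string into a jsoup Document
    static Document loadFromString(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = loadFromString(SAMPLE_HTML);

        // Header cells live under thead, data rows under tbody
        Elements headerCells = doc.select("table thead th");
        Elements dataRows = doc.select("table tbody tr");

        System.out.println(headerCells.eachText()); // header labels as a list of strings
        System.out.println(dataRows.size());        // number of data rows
    }
}
```

To read the markup from disk instead, Jsoup also offers a parse overload that takes a File and a charset name.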

4. Parsing Table

Firstly, to select an HTML table from the parsed document, we can use the code snippet below:

// select() returns an Elements collection, so take the first matching table
Element table = doc.select("table").first();
// All rows of the table, header row included
Elements rows = table.select("tr");
// Cells of the first row, whether they are th or td
Elements first = rows.get(0).select("th,td");

As we can see, select() returns a collection of matching elements, so we take the first match to get the table element. From the table element, we select the tr elements to get its rows, and from the first row we select the th or td elements to get its cells. Using these calls, we can write the function below to parse table data.

Here, we are assuming no colspan or rowspan elements are used in the table, and the first row is present with header th tags.
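These assumptions can be checked before parsing. The following is a small sketch, not part of the original code base, that rejects tables using colspan or rowspan attributes or lacking th cells in the first row. The class TableShapeCheck and the method isSimpleTable are hypothetical names:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableShapeCheck {

    // Hypothetical helper: true when the table matches the assumptions stated above.
    // Any colspan or rowspan attribute counts as a span here, even with a value of 1.
    static boolean isSimpleTable(Element table) {
        Element firstRow = table.selectFirst("tr");
        if (firstRow == null) {
            return false; // no rows at all
        }
        boolean hasHeaderCells = !firstRow.select("th").isEmpty();
        boolean hasSpans = !table.select("[colspan], [rowspan]").isEmpty();
        return hasHeaderCells && !hasSpans;
    }

    public static void main(String[] args) {
        Document simple = Jsoup.parse(
            "<table><tr><th>Name</th></tr><tr><td>Student 1</td></tr></table>");
        Document spanned = Jsoup.parse(
            "<table><tr><th colspan=\"2\">Name</th></tr></table>");

        System.out.println(isSimpleTable(simple.selectFirst("table")));  // plain header row
        System.out.println(isSimpleTable(spanned.selectFirst("table"))); // uses colspan
    }
}
```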

Following is the code for parsing the table:

public List<Map<String, String>> parseTable(Document doc, int tableOrder) {
    // Pick the table by its zero-based position in the document
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    // The first row of the table is expected to hold the column headers
    Elements headerRow = table.select("tr")
      .get(0)
      .select("th,td");

    List<String> headers = new ArrayList<>();
    for (Element header : headerRow) {
        headers.add(header.text());
    }

    List<Map<String, String>> parsedDataRows = new ArrayList<>();
    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");

        // Map each cell to the header at the same column position
        int colCount = 0;
        Map<String, String> dataRow = new HashMap<>();
        for (Element colVal : colVals) {
            dataRow.put(headers.get(colCount++), colVal.text());
        }
        parsedDataRows.add(dataRow);
    }
    return parsedDataRows;
}

In this function, the doc parameter is the HTML document loaded from the file, and tableOrder is the zero-based position of the table element in the document. We use a List<Map<String, String>> to store the data rows found under the tbody element. Each element of the list is a Map representing one data row, with the column name as the key and the cell text for that column as the value. Using a list of Maps makes it easy to access the retrieved data.

The list index represents row numbers, and we can get specific cell data by its map key.
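To make this access pattern concrete, here is a short sketch that works on the same List<Map<String, String>> shape. It picks a row by its list index and reads several cells by column name. The class ParsedRowAccess and the method totalMarks are hypothetical and assume that the selected columns hold integer text:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParsedRowAccess {

    // Hypothetical helper: sums the given numeric columns of one parsed row
    static int totalMarks(List<Map<String, String>> rows, int rowIndex, String... subjects) {
        Map<String, String> row = rows.get(rowIndex); // list index = row number
        int total = 0;
        for (String subject : subjects) {
            // map key = column header, value = cell text
            total += Integer.parseInt(row.get(subject).trim());
        }
        return total;
    }

    public static void main(String[] args) {
        // Same shape as the result of parseTable for the sample row
        Map<String, String> student1 = new HashMap<>();
        student1.put("Name", "Student 1");
        student1.put("Maths", "90");
        student1.put("English", "85");
        student1.put("Science", "92");

        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(student1);

        System.out.println(totalMarks(rows, 0, "Maths", "English", "Science")); // prints 267
    }
}
```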

We can verify if table data is retrieved correctly using the test case below:

@Test
public void whenDocumentTableParsed_thenTableDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("90", tableData.get(0).get("Maths")); 
}

The JUnit test case confirms that the cell text is parsed correctly. The result is an ArrayList of HashMap objects, where each element of the list represents a data row in the table, and each HashMap uses the column header as the key and the cell text as the value. Using this structure, we can easily access table data.

5. Update Elements of the Parsed Table

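To insert or update cell content while parsing, we can call one of the methods below on the td element retrieved from the row. The text() method sets the cell content as plain text, while html() sets it as HTML markup: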

colVals.get(colCount++).text(updateValue);

colVals.get(colCount++).html(updateValue);
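The difference between the two calls shows up when the new value contains markup. The sketch below, with the hypothetical class TextVersusHtml and helper firstCell, sets the same string through both methods and inspects the result:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class TextVersusHtml {

    // Hypothetical helper: parses the markup and returns its first td element
    static Element firstCell(String html) {
        return Jsoup.parse(html).selectFirst("td");
    }

    public static void main(String[] args) {
        String tableHtml = "<table><tbody><tr><td>90</td></tr></tbody></table>";

        // text() treats the value as plain text, so the tags are escaped
        Element viaText = firstCell(tableHtml);
        viaText.text("<b>50</b>");
        System.out.println(viaText.children().size()); // no child elements were created
        System.out.println(viaText.text());            // the literal string, tags included

        // html() parses the value, so the cell gains a <b> child element
        Element viaHtml = firstCell(tableHtml);
        viaHtml.html("<b>50</b>");
        System.out.println(viaHtml.children().size()); // one child element
        System.out.println(viaHtml.text());            // only the text inside the markup
    }
}
```

For values that come from untrusted input, text() is usually the safer choice because it never creates new elements.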

The function to update values in the parsed table looks like this:

public void updateTableData(Document doc, int tableOrder, String updateValue) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");

    for (int row = 0; row < dataRows.size(); row++) {
        Elements colVals = dataRows.get(row).select("th,td");

        for (int colCount = 0; colCount < colVals.size(); colCount++) {
            colVals.get(colCount).text(updateValue);
        }
    }
}

In the above function, we get the data rows from the tbody element of the table. The function traverses each cell of these rows and sets its text to the value of the updateValue parameter. It sets all cells to the same value only to demonstrate that cell values can be updated using Jsoup. We can update an individual cell instead by specifying its row and column index.

The test below verifies the update function:

@Test
public void whenTableUpdated_thenUpdatedDataReturned() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    jsoParser.updateTableData(doc, 0, "50");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    assertEquals("50", tableData.get(2).get("Maths"));
}

The JUnit test case confirms that the update operation updates all table cell values to 50. Here we are verifying data from the third data row of the Maths column.

Similarly, we can set desired values for specific cells of the table.
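As a sketch of such a targeted update, the hypothetical method updateCell below takes a zero-based row and column index and changes only that cell. It follows the same selection steps as the function above and is an illustration, not code from the article's repository:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class CellUpdater {

    // Hypothetical helper: updates one data cell, addressed by zero-based row and column
    static void updateCell(Document doc, int tableOrder, int rowIndex, int colIndex, String value) {
        Element table = doc.select("table").get(tableOrder);
        Element tbody = table.select("tbody").get(0);
        Elements dataRows = tbody.select("tr");
        Elements cells = dataRows.get(rowIndex).select("th,td");
        cells.get(colIndex).text(value); // only this cell changes
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<table><thead><tr><th>Name</th><th>Maths</th></tr></thead>"
          + "<tbody><tr><td>Student 1</td><td>90</td></tr></tbody></table>");

        updateCell(doc, 0, 0, 1, "95"); // first data row, Maths column
        System.out.println(doc.select("tbody td").eachText());
    }
}
```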

6. Adding Row to the Table

We can add a row to the table using the following function:

public void addRowToTable(Document doc, int tableOrder) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);

    Elements rows = table.select("tr");
    Elements headerCols = rows.get(0).select("th,td");
    int numCols = headerCols.size();

    Elements colVals = new Elements(numCols);
    for (int colCount = 0; colCount < numCols; colCount++) {
        Element colVal = new Element("td");
        colVal.text("11");
        colVals.add(colVal);
    }
    Elements dataRows = tbody.select("tr");
    Element newDataRow = new Element("tr");
    newDataRow.appendChildren(colVals);
    dataRows.add(newDataRow);
    tbody.html(dataRows.toString());
}

In the above function, we are getting the number of columns from the header row and the data rows from the tbody element of the table. After adding a new row to the dataRows list, we are updating the tbody HTML content with the dataRows.

We can verify row addition using the following test case:

@Test
public void whenTableRowAdded_thenRowCountIncreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeAdd = tableData.size();
    jsoParser.addRowToTable(doc, 0);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeAdd + 1, tableData.size());
}

We can confirm from the JUnit test case that the addRowToTable operation on the table increases the number of rows in the table by 1. This operation adds a new row at the end of the list.

Similarly, we can add a row at any position by specifying the index while adding it to the row elements collection.
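One possible way to insert at a position, sketched here as an alternative to rebuilding the tbody content, is to attach the new row directly to the DOM with before() or appendChild(). The class RowInserter and the method insertRow are hypothetical, and the sketch assumes position is zero or greater:

```java
import java.util.Arrays;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class RowInserter {

    // Hypothetical helper: inserts a data row at the given zero-based position
    static void insertRow(Document doc, int tableOrder, int position, List<String> values) {
        Element table = doc.select("table").get(tableOrder);
        Element tbody = table.select("tbody").get(0);
        Elements dataRows = tbody.select("tr");

        // Build the new row, one td per value
        Element newRow = new Element("tr");
        for (String value : values) {
            newRow.appendChild(new Element("td").text(value));
        }

        if (position >= dataRows.size()) {
            tbody.appendChild(newRow);               // past the end: append as last row
        } else {
            dataRows.get(position).before(newRow);   // otherwise: place before the existing row
        }
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<table><thead><tr><th>Name</th><th>Maths</th></tr></thead>"
          + "<tbody><tr><td>Student 1</td><td>90</td></tr></tbody></table>");

        insertRow(doc, 0, 0, Arrays.asList("Student 0", "70"));
        System.out.println(doc.select("tbody tr").eachText());
    }
}
```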

7. Delete the Row From the Table

We can delete a row from the table using the following function:

public void deleteRowFromTable(Document doc, int tableOrder, int rowNumber) {
    Element table = doc.select("table").get(tableOrder);
    Element tbody = table.select("tbody").get(0);
    Elements dataRows = tbody.select("tr");
    if (rowNumber >= 0 && rowNumber < dataRows.size()) {
        dataRows.remove(rowNumber);
    }
}

In the above function, we get the tbody element of the table and, from it, the list of data rows. We then delete the row at the rowNumber position from that list. In the jsoup version used here, removing an element from the row collection also removes it from the document, which is why the test below sees one row fewer. We can verify row deletion using the following test case:

@Test
public void whenTableRowDeleted_thenRowCountDecreased() {
    JsoupTableParser jsoParser = new JsoupTableParser();
    Document doc = jsoParser.loadFromFile("Students.html");
    List<Map<String, String>> tableData = jsoParser.parseTable(doc, 0);
    int countBeforeDel = tableData.size();
    jsoParser.deleteRowFromTable(doc, 0, 2);
    tableData = jsoParser.parseTable(doc, 0);
    assertEquals(countBeforeDel - 1, tableData.size());
}

The JUnit test case confirms that the deleteRowFromTable operation on the table reduces the number of rows in the table by 1.

Similarly, we can delete a row at any position by specifying the index while removing it from the row elements collection.
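A variation on the same idea is to call remove() on the row element itself, which detaches the node from its parent in the document. The sketch below uses the hypothetical class RowRemover and method deleteRow, and returns a boolean so that callers can tell whether the index was in range:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class RowRemover {

    // Hypothetical helper: removes the data row at the given zero-based position.
    // Returns false when the position is out of range and nothing was removed.
    static boolean deleteRow(Document doc, int tableOrder, int rowNumber) {
        Element table = doc.select("table").get(tableOrder);
        Elements dataRows = table.select("tbody").get(0).select("tr");
        if (rowNumber < 0 || rowNumber >= dataRows.size()) {
            return false;
        }
        dataRows.get(rowNumber).remove(); // detaches the tr element from the document
        return true;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<table><thead><tr><th>Name</th></tr></thead>"
          + "<tbody><tr><td>Student 1</td></tr><tr><td>Student 2</td></tr></tbody></table>");

        System.out.println(deleteRow(doc, 0, 0));              // row existed and was removed
        System.out.println(doc.select("tbody tr").eachText()); // remaining data rows
    }
}
```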

8. Conclusion

In this article, we have seen how to use Jsoup to parse HTML tables from HTML documents. We have also seen how to update the cell data and change the table structure by adding and deleting rows. As always, the source for these examples is available over on GitHub.
