1. Overview

Portable Document Format (PDF) is a common file format for documents. It’s used to distribute electronic documents that need to preserve their original format.

In this tutorial, we’ll explore two of the most popular libraries for reading PDF files in Java: Apache PDFBox and iText.

2. Setup

We’ll use Maven to manage dependencies.

We’ll also add a sample PDF file named sample.pdf to the project root directory. The file contains the single phrase “Hello World!”.
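
Before handing a file to a PDF library, it can help to confirm that it at least looks like a PDF. A minimal sketch, assuming the file starts with the standard %PDF- header; PdfSignature and looksLikePdf are hypothetical names for illustration and not part of PDFBox or iText:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or iText.
final class PdfSignature {

    // A PDF file is expected to begin with "%PDF-" followed by a version number.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSignature() {
    }

    // Returns true if the first bytes of the file match the PDF header.
    static boolean looksLikePdf(Path file) throws IOException {
        byte[] head = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // file is shorter than the header
                }
                read += n;
            }
        }
        return Arrays.equals(head, MAGIC);
    }
}
```

This only inspects the header, so a file that passes may still be corrupt or encrypted; the libraries report those problems when they parse the document.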

Next, we’ll read the sample PDF file and test the extracted text against an expected result.

3. Using Apache PDFBox

Apache PDFBox is a free and open-source Java library for processing and manipulating PDF documents. Its capabilities include extracting text, rendering PDFs to images, and merging and splitting PDFs.

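Let’s add the Apache PDFBox dependency to the pom.xml. The example uses PDDocument.load(), which belongs to the PDFBox 2.x API, so the pdfbox.version property should resolve to a 2.x release: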

<dependency> 
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
</dependency>

Here’s a simple example of using Apache PDFBox to read text from a PDF file:

@Test
public void givenSamplePdf_whenUsingApachePdfBox_thenCompareOutput() throws IOException {
    // PDDocument: org.apache.pdfbox.pdmodel, PDFTextStripper: org.apache.pdfbox.text (PDFBox 2.x)
    String expectedText = "Hello World!\n";
    File file = new File("sample.pdf");
    String text;
    // try-with-resources closes the document even if text extraction throws
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        text = stripper.getText(document);
    }

    assertEquals(expectedText, text);
}

In this example, we called the static PDDocument.load() method to load the PDF file into memory. Then, we created a new instance of PDFTextStripper and invoked getText() to extract the text from the document. The extracted text ends with a line break, which is why the expected string does too.
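
Extracted text often differs from the expected value only in whitespace, such as the line-ending style or a trailing line break, so it can be more robust to normalize it before comparing. A minimal sketch; ExtractedText and normalize are hypothetical names for illustration and not part of PDFBox:

```java
// Hypothetical helper for illustration; not part of PDFBox or iText.
final class ExtractedText {

    private ExtractedText() {
    }

    // Converts CRLF and lone CR line endings to LF, then drops trailing whitespace.
    static String normalize(String raw) {
        String unified = raw.replace("\r\n", "\n").replace('\r', '\n');
        return unified.replaceAll("\\s+\\z", "");
    }
}
```

With this helper, the assertion could compare against "Hello World!" without the trailing newline, for example assertEquals("Hello World!", ExtractedText.normalize(text)).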

4. Using iText

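iText is an open-source library for generating and manipulating PDF files in Java. It also provides a simple API for reading text from PDF files.

First, let’s include the iText dependency in the pom.xml. The itextpdf artifact is iText 5, so the itextpdf.version property should resolve to a 5.x release: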

<dependency> 
    <groupId>com.itextpdf</groupId> 
    <artifactId>itextpdf</artifactId> 
    <version>${itextpdf.version}</version>
</dependency>

Let’s see a simple example of using the iText PDF library to extract text from a PDF file:

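@Test
public void givenSamplePdf_whenUsingiTextPdf_thenCompareOutput() throws IOException {
    // PdfReader: com.itextpdf.text.pdf, PdfTextExtractor: com.itextpdf.text.pdf.parser (iText 5)
    String expectedText = "Hello World!";
    PdfReader reader = new PdfReader("sample.pdf");
    StringBuilder text = new StringBuilder();
    // close the reader in finally so the file is released even if extraction throws
    try {
        int pages = reader.getNumberOfPages();
        // iText page numbers start at 1
        for (int i = 1; i <= pages; i++) {
            text.append(PdfTextExtractor.getTextFromPage(reader, i));
        }
    } finally {
        reader.close();
    }

    assertEquals(expectedText, text.toString());
}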

In this example, we created a new instance of PdfReader to open the PDF file. Then, we invoked getNumberOfPages() to get the page count. Finally, we looped through the pages, which iText numbers starting from 1, and invoked the static PdfTextExtractor.getTextFromPage() method to extract the text of each page.
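
The loop appends each page’s text directly, so for a multi-page document the end of one page can run into the start of the next. A minimal sketch of joining pages with a separator; PageText and PageExtractor are hypothetical names that stand in for the iText calls so the snippet compiles on its own:

```java
import java.io.IOException;

// Hypothetical helper for illustration; not part of iText.
final class PageText {

    // Stand-in for a per-page extraction call that may throw IOException.
    @FunctionalInterface
    interface PageExtractor {
        String extract(int pageNumber) throws IOException;
    }

    private PageText() {
    }

    // Extracts pages 1..pageCount and places the separator between consecutive pages.
    static String join(int pageCount, PageExtractor extractor, String separator) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++) {
            if (page > 1) {
                text.append(separator);
            }
            text.append(extractor.extract(page));
        }
        return text.toString();
    }
}
```

With iText, the call could look like PageText.join(reader.getNumberOfPages(), page -> PdfTextExtractor.getTextFromPage(reader, page), "\n"). For the single-page sample file, the result is the same as without a separator.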

5. Conclusion

In this article, we learned two different ways of reading PDF files in Java. We used the Apache PDFBox and iText libraries to extract text from a sample PDF file. Both libraries offer simple and effective APIs for extracting text from PDF documents.

As usual, the complete source code for the examples is available over on GitHub.
