Reading PDF File Using Java

1. Overview

Portable Document Format (PDF) is a common file format for distributing electronic documents that need to preserve their original formatting.

In this tutorial, we’ll explore two of the most popular libraries for reading PDF files in Java: Apache PDFBox and iText.

2. Setup

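We’ll use Maven to manage dependencies. The dependency snippets in the following sections reference the version properties pdfbox.version and itextpdf.version, which need to be defined in the properties section of the pom.xml.

We’ll also add a sample PDF file, sample.pdf, to the project root directory. The file contains the single phrase “Hello World!”.

Next, we’ll read the sample PDF file with each library and test the extracted text against the expected result.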

3. Using Apache PDFBox

Apache PDFBox is a free and open-source Java library for processing and manipulating PDF documents. Its capabilities include extracting text, rendering PDFs to images, and merging and splitting PDFs.

Let’s add the Apache PDFBox dependency to the pom.xml:

<dependency> 
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>${pdfbox.version}</version>
</dependency>

Here’s a simple example of using Apache PDFBox to read text from a PDF file:

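import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

@Test
public void givenSamplePdf_whenUsingApachePdfBox_thenCompareOutput() throws IOException {

    // PDFTextStripper ends the extracted line with a line break
    String expectedText = "Hello World!\n";
    File file = new File("sample.pdf");

    // PDDocument.load() is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF() instead
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);

        assertEquals(expectedText, text);
    }
}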

In this example, we called the static PDDocument.load() method to load the PDF file into memory as a PDDocument. Then, we created a new instance of PDFTextStripper and invoked getText() to extract the text from the document. Finally, we closed the document to release its resources.
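Note that the expected string in the test above ends with a line break. PDFTextStripper writes the platform line separator by default, so the extracted text may end in "\r\n" rather than "\n" on Windows. As a minimal sketch, assuming we only care about the text itself and not its trailing line breaks, a small helper could normalize the output before comparing it. The class name ExtractedText and its normalize() method are hypothetical and not part of PDFBox:

```java
// Hypothetical helper, not part of PDFBox or iText.
final class ExtractedText {

    private ExtractedText() {
    }

    // Converts Windows (\r\n) and old Mac (\r) line breaks to \n,
    // then drops any line breaks at the end of the text.
    static String normalize(String raw) {
        String unified = raw.replace("\r\n", "\n").replace('\r', '\n');
        int end = unified.length();
        while (end > 0 && unified.charAt(end - 1) == '\n') {
            end--;
        }
        return unified.substring(0, end);
    }
}
```

With such a helper, the assertion could compare "Hello World!" against ExtractedText.normalize(text), which would make the test independent of the platform line separator.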

4. Using iText

iText is a library for creating and manipulating PDF files in Java, available under the AGPL or a commercial license. It provides a simple API for reading text from PDF files.

First, let’s include the iText dependency in the pom.xml:

<dependency> 
    <groupId>com.itextpdf</groupId> 
    <artifactId>itextpdf</artifactId> 
    <version>${itextpdf.version}</version>
</dependency>

Let’s see a simple example of using the iText PDF library to extract text from a PDF file:

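import java.io.IOException;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

@Test
public void givenSamplePdf_whenUsingiTextPdf_thenCompareOutput() throws IOException {

    String expectedText = "Hello World!";
    PdfReader reader = new PdfReader("sample.pdf");
    StringBuilder text = new StringBuilder();
    try {
        // iText page numbers start at 1
        int pages = reader.getNumberOfPages();
        for (int i = 1; i <= pages; i++) {
            text.append(PdfTextExtractor.getTextFromPage(reader, i));
        }
    } finally {
        // release the reader even if extraction throws
        reader.close();
    }

    assertEquals(expectedText, text.toString());
}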

In this example, we created a new instance of PdfReader to open the PDF file. Then, we invoked the getNumberOfPages() method to get the page count. Finally, we looped through the pages, which iText numbers starting at 1, and invoked the static getTextFromPage() method of PdfTextExtractor to extract the text of each page before closing the reader.
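The loop above appends each page’s text directly to the previous one, which works for our single-page sample but would run the last word of one page into the first word of the next in a longer document. As a rough sketch, assuming a line break between pages is acceptable and blank pages can be skipped, the per-page strings could be collected into a list and joined afterwards. The class name PageTexts and its join() method are hypothetical and not part of iText:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of iText or PDFBox.
final class PageTexts {

    private PageTexts() {
    }

    // Trims each page's text, skips pages that are empty after trimming,
    // and joins the remaining pages with a single line break.
    static String join(List<String> pages) {
        return pages.stream()
            .map(String::trim)
            .filter(page -> !page.isEmpty())
            .collect(Collectors.joining("\n"));
    }
}
```

To use it, we would add the result of each getTextFromPage() call to a List<String> inside the loop and pass that list to PageTexts.join() once the reader is closed.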

5. Conclusion

In this article, we learned two different ways of reading PDF files in Java. We used the Apache PDFBox and iText libraries to extract text from a sample PDF file. Both libraries offer simple and effective APIs for extracting text from PDF documents.

The code backing this article is available on GitHub.