Get Information About a PDF in Java

Last updated: January 8, 2024

Written by: Hamid Reza Sharifi

Reviewed by: David Martinez

1. Overview

In this tutorial, we’ll look at different ways of getting information about a PDF file, namely its page count, metadata, and password protection, using the iText and PDFBox libraries in Java.

2. Using the iText Library

iText is a library for creating and manipulating PDF documents. It also provides an easy way to read information about an existing document.

2.1. Maven Dependency

Let’s start by declaring the itextpdf dependency in our pom.xml:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>

2.2. Getting the Number of Pages

Let’s create a PdfInfoIText class with a getNumberOfPages() method that returns the number of pages in a PDF document:

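import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;

public class PdfInfoIText {

    public static int getNumberOfPages(final String pdfFile) throws IOException {
        PdfReader reader = new PdfReader(pdfFile);
        try {
            return reader.getNumberOfPages();
        } finally {
            // close the reader even if reading the page count fails
            reader.close();
        }
    }
}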

In this example, we first create a PdfReader from the path of the PDF file. Then we call the getNumberOfPages() method, and finally we close the PdfReader object. Let’s declare a test case for it:

@Test
public void givenPdf_whenGetNumberOfPages_thenOK() throws IOException {
    Assert.assertEquals(4, PdfInfoIText.getNumberOfPages(PDF_FILE));
}

In our test case, we validate the number of pages in a given PDF file stored in the test resources folder.
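
The test refers to a PDF_FILE constant that isn't shown here. As a minimal sketch, assuming the sample document sits on the test classpath (for example under src/test/resources), a small helper could resolve it to a file system path. The class name TestResources and the file name input.pdf are hypothetical and not taken from the original project:

```java
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;

// Hypothetical helper: resolves a classpath resource to an absolute file path.
final class TestResources {

    private TestResources() {
    }

    static String pathOf(String resourceName) {
        URL url = TestResources.class.getClassLoader().getResource(resourceName);
        if (url == null) {
            // fail early with a clear message instead of a later "file not found"
            throw new IllegalArgumentException("Resource not found on classpath: " + resourceName);
        }
        try {
            // assumes a plain file: URL, which is the usual case when tests run from a build directory
            return Paths.get(url.toURI()).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Unusable resource URL: " + url, e);
        }
    }
}
```

A test class could then declare something like `private static final String PDF_FILE = TestResources.pathOf("input.pdf");`, again with input.pdf standing in for the real file name.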

2.3. Getting the PDF Metadata

Next, let’s look at how to read the document’s metadata. The getInfo() method of PdfReader returns the entries of the document’s information dictionary, such as the title, author, creation date, creator, and producer. Let’s add a getInfo() method to our PdfInfoIText class:

public static Map<String, String> getInfo(final String pdfFile) throws IOException {
    PdfReader reader = new PdfReader(pdfFile);
    try {
        // keys follow the PDF info dictionary, e.g. "Title", "Author", "Producer"
        return reader.getInfo();
    } finally {
        reader.close();
    }
}

Now, let’s write a test case for fetching the creator and producer of the document:

@Test
public void givenPdf_whenGetInfo_thenOK() throws IOException {
    Map<String, String> info = PdfInfoIText.getInfo(PDF_FILE);
    Assert.assertEquals("LibreOffice 4.2", info.get("Producer"));
    Assert.assertEquals("Writer", info.get("Creator"));
}
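
The info map holds plain strings, so date entries such as CreationDate typically arrive in the PDF date format (D:YYYYMMDDHHmmSSOHH'mm') rather than as date objects. The following is a rough sketch of converting such a string using only the JDK. The PdfDates class and the sample value are illustrative assumptions, and treating a missing time zone as UTC is a simplification chosen here:

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parses a PDF date string into an OffsetDateTime.
final class PdfDates {

    // Everything after the year is optional: D:YYYY[MM[DD[HH[mm[SS]]]]][O[HH'[mm']]]
    private static final Pattern PDF_DATE = Pattern.compile(
        "^(?:D:)?(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:([Zz+-])(?:(\\d{2})'?)?(?:(\\d{2})'?)?)?$");

    private PdfDates() {
    }

    static OffsetDateTime parse(String raw) {
        if (raw == null) {
            return null; // the entry is simply absent from the map
        }
        Matcher m = PDF_DATE.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a PDF date: " + raw);
        }
        // assumption: no time zone information is treated as UTC
        ZoneOffset offset = ZoneOffset.UTC;
        String sign = m.group(7);
        if ("+".equals(sign) || "-".equals(sign)) {
            int seconds = intOr(m.group(8), 0) * 3600 + intOr(m.group(9), 0) * 60;
            offset = ZoneOffset.ofTotalSeconds("-".equals(sign) ? -seconds : seconds);
        }
        return OffsetDateTime.of(
            Integer.parseInt(m.group(1)), // year
            intOr(m.group(2), 1),         // month defaults to January
            intOr(m.group(3), 1),         // day defaults to the 1st
            intOr(m.group(4), 0),         // hour
            intOr(m.group(5), 0),         // minute
            intOr(m.group(6), 0),         // second
            0,
            offset);
    }

    private static int intOr(String value, int fallback) {
        return value == null ? fallback : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        // illustrative value, not read from the sample document
        System.out.println(parse("D:20140527101635+02'00'")); // prints 2014-05-27T10:16:35+02:00
    }
}
```

In a test, this could be applied to `info.get("CreationDate")`; the null check covers documents that don't carry the entry.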

2.4. Checking for Password Protection

Next, we want to know whether the document is password protected. For this, let’s add an isPasswordRequired() method to the PdfInfoIText class. It relies on the isEncrypted() method of PdfReader:

public static boolean isPasswordRequired(final String pdfFile) throws IOException {
    PdfReader reader = new PdfReader(pdfFile);
    try {
        return reader.isEncrypted();
    } finally {
        reader.close();
    }
}

Now, let’s create a test case to see how this method behaves:

@Test
public void givenPdf_whenIsPasswordRequired_thenOK() throws IOException {
    Assert.assertFalse(PdfInfoIText.isPasswordRequired(PDF_FILE));
}

In the next section, we’ll do the same work using the PDFBox library.

3. Using the PDFBox Library

Another way of getting information about a PDF file is by using the Apache PDFBox library.

3.1. Maven Dependency

We need to include the pdfbox Maven dependency in our project:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.0</version>
</dependency>

3.2. Getting the Number of Pages

PDFBox is a library for working with PDF documents. To get the number of pages, we use the Loader class and its loadPDF() method to load the document from a File object. After that, we call the getNumberOfPages() method of the PDDocument class:

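import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

public class PdfInfoPdfBox {

    public static int getNumberOfPages(final String pdfFile) throws IOException {
        // try-with-resources closes the document even if an exception is thrown
        try (PDDocument document = Loader.loadPDF(new File(pdfFile))) {
            return document.getNumberOfPages();
        }
    }
}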

Let’s create a test case for it:

@Test
public void givenPdf_whenGetNumberOfPages_thenOK() throws IOException {
    Assert.assertEquals(4, PdfInfoPdfBox.getNumberOfPages(PDF_FILE));
}

3.3. Getting the PDF Metadata

Getting the PDF metadata is also straightforward. We need to use the getDocumentInformation() method. This method returns document metadata (such as the author of the document or its creation date) as a PDDocumentInformation object:

public static PDDocumentInformation getInfo(final String pdfFile) throws IOException {
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument document = Loader.loadPDF(new File(pdfFile))) {
        return document.getDocumentInformation();
    }
}

Let’s write a test case for it:

@Test
public void givenPdf_whenGetInfo_thenOK() throws IOException {
    PDDocumentInformation info = PdfInfoPdfBox.getInfo(PDF_FILE);
    Assert.assertEquals("LibreOffice 4.2", info.getProducer());
    Assert.assertEquals("Writer", info.getCreator());
}

In this test case, we just validate the producer and creator of the document.
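
PDDocumentInformation also exposes date entries. Its getCreationDate() method, for instance, returns a java.util.Calendar, or null when the entry is absent. As a small sketch, such a value could be converted to java.time for easier comparison and formatting. The helper class CalendarDates is hypothetical and uses only the JDK, and the Calendar below is built by hand in place of a value read from a real document:

```java
import java.time.OffsetDateTime;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper: converts a Calendar (as returned for date entries) to OffsetDateTime.
final class CalendarDates {

    private CalendarDates() {
    }

    static OffsetDateTime toOffsetDateTime(Calendar calendar) {
        if (calendar == null) {
            return null; // the document has no such date entry
        }
        // keep the calendar's own time zone so the original offset is preserved
        return calendar.toInstant()
            .atZone(calendar.getTimeZone().toZoneId())
            .toOffsetDateTime();
    }

    public static void main(String[] args) {
        // hand-built stand-in for a value such as info.getCreationDate()
        Calendar sample = new GregorianCalendar(TimeZone.getTimeZone("GMT+02:00"));
        sample.clear();
        sample.set(2014, Calendar.MAY, 27, 10, 16, 35);
        System.out.println(toOffsetDateTime(sample)); // prints 2014-05-27T10:16:35+02:00
    }
}
```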

3.4. Checking for Password Protection

We can check if the PDF is password protected using the isEncrypted() method of the PDDocument class:

public static boolean isPasswordRequired(final String pdfFile) throws IOException {
    // try-with-resources closes the document even if an exception is thrown
    try (PDDocument document = Loader.loadPDF(new File(pdfFile))) {
        return document.isEncrypted();
    }
}

Let’s create a test case for the validation of password protection:

@Test
public void givenPdf_whenIsPasswordRequired_thenOK() throws IOException {
    Assert.assertFalse(PdfInfoPdfBox.isPasswordRequired(PDF_FILE));
}

4. Conclusion

In this article, we learned how to get the page count, metadata, and encryption status of a PDF file using two popular Java libraries, iText and PDFBox.

The code backing this article is available on GitHub.
