The EPUB Format and Converting EPUB Files to PDF

1. Overview

EPUB, short for Electronic Publication, is a popular ebook format supported by many kinds of ebook readers. PDF, on the other hand, is an older and well-established document format. Both EPUB and PDF have their own advantages and disadvantages, and certain documents may be better suited to one format over the other.

In this tutorial, we’ll explore EPUB, compare it with PDF, and learn how to convert EPUB files to PDF using Pandoc and Calibre.

All commands and tools in this guide have been tested on Debian 12 (Bookworm) with Pandoc 2.17.1.1 and Calibre 6.13.

2. What Is EPUB?

EPUB is an ebook format initially developed by the International Digital Publishing Forum (IDPF) and released in 2007. Since the IDPF merged into the World Wide Web Consortium (W3C) in 2017, the W3C has maintained the standard.

EPUB is an open standard, which means no single entity owns it, so anyone can freely use it without paying a license fee. In addition, we can use the format across different devices and platforms since most hardware readers support EPUB.

2.1. EPUB File Format

EPUB files are essentially ZIP archives containing:

XHTML files
CSS stylesheets
media files such as images (GIF, JPEG, PNG, SVG) and audio or video clips

This combination enables us to create flexible layouts and styling. Further, we can support accessibility features, such as text-to-speech, adjustable fonts, and navigation aids, for people with disabilities.

In other words, an EPUB is, to a certain degree, a packaged website, making it possible to create rich and interactive ebooks. This explains why we can adjust the font properties (type, size, spacing) when reading the file on a device such as an iPad.

Additionally, the text and images can adapt to different screen sizes and orientations (reflowability), providing a flexible reading experience. Moreover, we can choose to read the ebook either as a single long page or as multiple pages.

2.2. EPUB File Structure

Let’s explore the structure of an EPUB file. There’s a free EPUB file titled Anatomy of the State (a great book, by the way) available on Mises Institute’s website that we can experiment with:

$ wget https://cdn.mises.org/anatomy_of_the_state_0.epub

The wget command above downloads the EPUB file.

Since EPUBs are essentially ZIP archives, let’s unzip the file:

$ unzip anatomy_of_the_state_0.epub
$ tree
.
├── anatomy_of_the_state_0.epub
├── META-INF
│   └── container.xml
├── mimetype
└── OEBPS
    ├── content.opf
    ├── Images
    │   ├── cover.jpg
    │   └── LVMIHeader.png
    ├── Styles
    │   ├── MOBITweaks.css
    │   └── stylesheet.css
    ├── Text
    │   ├── Chapter01.xhtml
    │   ├── Chapter02.xhtml
...

As we can see from the tree command output above, the archive contains a mimetype file and two main directories: META-INF, which holds container.xml (the pointer to the package document), and OEBPS, which holds the content itself.
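The roles of these entries can be checked by hand. As a minimal sketch, assuming the archive was extracted into a directory and that container.xml keeps its rootfile element on a single line, the helper functions below (check_mimetype and opf_path are hypothetical names, not part of any tool) verify the mimetype marker and print the path of the package document that META-INF/container.xml points to:

```shell
# Sketch only: check_mimetype and opf_path are illustrative helper names.

# Succeed if DIR/mimetype holds the media type that the EPUB
# specification requires for the archive.
check_mimetype() {
    [ "$(tr -d '\r\n' < "$1/mimetype")" = "application/epub+zip" ]
}

# Print the full-path attribute of the first <rootfile> element in
# DIR/META-INF/container.xml, i.e. the location of the package document.
# Assumes the <rootfile .../> element is not split across lines.
opf_path() {
    sed -n 's/.*<rootfile[^>]*full-path="\([^"]*\)".*/\1/p' \
        "$1/META-INF/container.xml" | head -n 1
}

# Example (not run here), from the directory where we unzipped the file:
# check_mimetype . && opf_path .
```

For our sample file, we'd expect the second function to report a path inside OEBPS, since that's where content.opf lives.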

3. EPUB Compared With PDF

In the previous section, we learned about EPUB’s characteristics such as its layout, reflowability, interactivity, compatibility, and accessibility. Everything that EPUB offers may seem very impressive, which could lead us to wonder if we ever need to convert EPUB to PDF.

However, there are reasons why the PDF format might be more suitable for certain documents than EPUB. For instance, PDFs maintain a fixed-layout design, which can be advantageous when we want a document to appear or be printed exactly as intended. This also gives us control over the quality of the resources (e.g., images) that we insert into the document.

Moreover, PDFs support different protection and authentication mechanisms, such as password-based encryption and digital signatures.

This combination of fixed-layout and security features makes PDFs well-suited for legal documents.

Since an EPUB is basically a website container, it can reference some files, such as images, on the Internet and display them in the document. Consequently, we need to be connected to the Internet to be able to view these images. Meanwhile, PDFs are typically static documents that contain all the text and resources they need for offline viewing. This can be useful when access to online content may be limited.
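Whether a particular EPUB relies on remote resources can be checked after extraction. The sketch below assumes the archive was unpacked into a directory and that remote resources appear as double-quoted http(s) URLs in src attributes. The function name list_remote_refs is hypothetical, and the -o and -h options of grep are extensions found in GNU and BSD grep rather than in POSIX:

```shell
# Sketch only: list_remote_refs is an illustrative helper name.

# Print every distinct http(s) URL used in a src="..." attribute by the
# XHTML/HTML files under DIR. No output means no such reference was found.
list_remote_refs() {
    find "$1" -type f \( -name '*.xhtml' -o -name '*.html' \) \
        -exec grep -ohE 'src="https?://[^"]*"' {} + |
        sed 's/^src="//; s/"$//' | sort -u
}

# Example (not run here), with the archive extracted into ./anatomy:
# list_remote_refs anatomy
```

This check only covers src attributes. Stylesheets can also pull in remote files, so an empty result isn't a guarantee that the book is fully self-contained.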

Lastly, PDF has ISO-standardized variants available for various purposes. For example, there’s PDF Archive (PDF/A), specifically designed for the long-term preservation of electronic documents. It ensures that documents remain accessible and unchanged over time, making it suitable for archiving important records and documents. There are also other variants, such as PDF/X for print production, PDF/E for engineering documents, and PDF/UA for accessibility.

These strict variations of PDF make it a reliable standard for information storage.

4. Converting EPUB to PDF

Now that we’ve learned the differences between EPUB and PDF and their characteristics, let’s explore some of the common tools that a Linux environment may provide for converting EPUB to PDF.

4.1. Using Pandoc

Pandoc is, as its website states, a universal document converter. It’s available in the official repositories of many Linux distributions.

Let’s install Pandoc:

$ sudo apt install pandoc texlive-xetex texlive-latex-base

Pandoc uses the pdflatex engine to convert documents to PDF by default. On Debian, the pdflatex command comes with the texlive-latex-base package, while texlive-xetex provides xelatex, an alternative engine that we can select explicitly.

For further information, we can refer to the README file at /usr/share/doc/pandoc/README.Debian about additional packages that Pandoc might require for optional features.

Additionally, we can tell Pandoc to use a different engine by passing the --pdf-engine=<engine_name> option with a value such as xelatex, lualatex, or wkhtmltopdf. We may need to install the engine and its supporting package separately.
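Since not every engine is installed on every system, a script can probe for one before calling Pandoc. Here's a minimal sketch, where first_available is a hypothetical helper and the preference order in the example is an arbitrary assumption:

```shell
# Sketch only: first_available is an illustrative helper name.

# Print the first name from the argument list that resolves to a command
# in the current shell. Return non-zero if none of them is available.
first_available() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example (not run here): pick an engine, then hand it to Pandoc.
# engine=$(first_available xelatex lualatex pdflatex wkhtmltopdf) &&
#     pandoc anatomy_of_the_state_0.epub --pdf-engine="$engine" \
#         -o anatomy_of_the_state_0.pdf
```

The helper only confirms that the command exists. It doesn't check whether the supporting LaTeX packages the engine needs are installed.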

Let’s convert our EPUB document to PDF using Pandoc:

$ pandoc anatomy_of_the_state_0.epub -o anatomy_of_the_state_0.pdf

The resulting PDF should have the same font properties, resources (images), and interactivity (hyperlinks). Additionally, we can generate a table of contents by using the --toc option if we want to.

However, the PDF may not match the EPUB file in terms of page size, length, table of contents (TOC), and page order. To get a closer match, one workaround is to extract the EPUB file, convert each XHTML page to PDF individually, and convert the EPUB cover image (located at OEBPS/Images/cover.jpg) to PDF with ImageMagick’s convert tool. Afterward, we can merge them all in order via pdftk:

$ unzip anatomy_of_the_state_0.epub -d anatomy
$ cd anatomy/OEBPS/Text/
$ for file in ./*.xhtml; do pandoc "$file" -o "${file%.xhtml}.pdf"; done
$ convert ../Images/cover.jpg cover.pdf
$ pdftk cover.pdf titlepage.pdf copyright.pdf dedication.pdf TOC.pdf \
    Chapter01.pdf Chapter02.pdf Chapter03.pdf Chapter04.pdf Chapter05.pdf \
    Chapter06.pdf Chapter07.pdf Index.pdf LvMI.pdf \
    cat output anatomy.pdf

We can find the output anatomy.pdf in the anatomy/OEBPS/Text directory.
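The file list passed to pdftk above is written by hand. The reading order is also recorded in the package document (OEBPS/content.opf), where the manifest maps ids to files and the spine lists those ids in order. As a rough sketch, assuming each item and itemref element sits on its own line and the manifest precedes the spine, the hypothetical spine_files function below prints the content files in spine order:

```shell
# Sketch only: spine_files is an illustrative helper name. This is a
# line-based approximation, not a real XML parser.

# Print the href of every manifest item referenced by the spine of the
# given OPF file, in spine (reading) order.
spine_files() {
    awk '
        # Return the value of attribute NAME on the current line, or "".
        function attr(name,    re, s) {
            re = "[ \t]" name "=\"[^\"]*\""
            if (!match($0, re)) return ""
            s = substr($0, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", s)   # drop everything up to the opening quote
            sub(/"$/, "", s)        # drop the closing quote
            return s
        }
        # Manifest entries: remember which file each id refers to.
        /<item[ \t]/    { href[attr("id")] = attr("href") }
        # Spine entries: print the file for each referenced id.
        /<itemref[ \t]/ { id = attr("idref"); if (id in href) print href[id] }
    ' "$1"
}

# Example (not run here), from the extracted anatomy directory:
# spine_files OEBPS/content.opf
```

The printed paths are relative to the directory that contains the OPF file. Swapping each .xhtml extension for .pdf would give the order in which to pass the converted pages to pdftk, with the cover handled separately if it isn't in the spine.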

While this workaround offers greater flexibility, it might be too troublesome for some. For instance, it requires the use of various tools or scripts. Additionally, it requires updating the LaTeX template, TOC, and hyperlinks.

4.2. Using Calibre

Calibre is an ebook manager commonly included in many Linux distributions. It provides a command-line interface tool called ebook-convert for performing conversions:

$ ebook-convert anatomy_of_the_state_0.epub anatomy_of_the_state.pdf

The default page size of the output may vary depending on our system or settings. In this case, the command above generated the PDF file anatomy_of_the_state.pdf with a default page size of letter.

If we want a different page size for the PDF, we can specify it by using the --paper-size=<size> option of the tool. A number of standard formats are supported:

a0, a1, a2, a3, a4, a5, a6
b0, b1, b2, b3, b4, b5, b6
legal
letter
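In a script, it can help to reject an unsupported size before starting a conversion. The following sketch checks a value against the sizes listed above. The function name is_valid_paper_size is hypothetical, and the check reflects only this list, not whatever a given Calibre version actually accepts:

```shell
# Sketch only: is_valid_paper_size is an illustrative helper name.

# Succeed if SIZE is one of the paper sizes listed above (lowercase names).
is_valid_paper_size() {
    case "$1" in
        a[0-6] | b[0-6] | legal | letter) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not run here): convert only if the requested size is in the list.
# size=b5
# is_valid_paper_size "$size" &&
#     ebook-convert anatomy_of_the_state_0.epub anatomy_of_the_state.pdf \
#         --paper-size="$size"
```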

In addition, ebook-convert automatically follows the font properties (size, color, type) from the original EPUB file. Furthermore, it also adjusts the page order, TOC, and hyperlinks in the PDF file.

Notably, our EPUB file includes an index page listing word occurrences with their page numbers. Based on that, the most suitable page size option for the PDF version of the EPUB file is B5. Additionally, there’s a PDF version of the file on the website, and we can see that the generated PDF file is quite similar to the original PDF file.
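One way to compare the two PDFs beyond a visual check is to look at their page count and page size. The sketch below is an assumption on our part rather than a step from the conversion itself: it relies on the pdfinfo tool from the poppler-utils package, which prints one "Field: value" pair per line, and pdf_field is a hypothetical helper that extracts a single field from that output:

```shell
# Sketch only: pdf_field is an illustrative helper name, and pdfinfo
# (from poppler-utils) is assumed to be installed separately.

# Print the value of one field (e.g. "Pages" or "Page size") from pdfinfo
# output read on standard input.
pdf_field() {
    sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}

# Example (not run here): compare the generated PDF with the one
# downloaded from the website (original.pdf is a placeholder name).
# pdfinfo anatomy_of_the_state.pdf | pdf_field "Page size"
# pdfinfo original.pdf | pdf_field "Page size"
# pdfinfo anatomy_of_the_state.pdf | pdf_field "Pages"
```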

5. Conclusion

In this article, we explored the EPUB format, its properties, and its structure, and compared it with the PDF format. While EPUB is very popular, the PDF format may be more suitable for specific document types.

Afterward, we learned how to convert EPUB files to PDF using Pandoc and Calibre. Both tools support many format conversions, with Calibre usually being more effective at converting EPUBs to PDFs.
