How to Automate HTML-to-PDF Conversions

1. Overview

In this article, we’ll first look at the characteristics of the HTML and PDF document formats. We’ll then discuss why converting from HTML to PDF is useful and feasible. Lastly, we’ll study several command-line tools that perform HTML-to-PDF conversions.

2. HTML vs. PDF Document Format

HTML (HyperText Markup Language) is the code that is used to structure a web page and its content.

PDF stands for “Portable Document Format”. The format is useful when we need documents that aren’t meant to be edited but still need to be easily shared and printed. A PDF lets pages, that is, a fixed layout of text and graphics, be shared with full fidelity to the author’s intent. The need for a shareable electronic document drove the fundamental design of PDF. Converting an HTML page to PDF therefore lets us archive or distribute it in a form that looks the same wherever it’s opened.

3. Tools for HTML-to-PDF Conversions

So, how do we go from HTML to PDF? Without Adobe Acrobat or another PDF creation program, converting HTML to PDF can be hard. Let’s discuss command-line tools that perform HTML-to-PDF conversions.

3.1. wkhtmltopdf

wkhtmltopdf is a simple and effective open-source command-line utility that converts a given HTML file or web page to a PDF document.

Let’s look at the syntax for running wkhtmltopdf with some of its more widely used options:

$ wkhtmltopdf --margin-bottom 20mm --margin-top 20mm --minimum-font-size 16 ...

The default page size of the rendered document is A4, but the --page-size option can change this to almost anything else, such as A3:

$ wkhtmltopdf --page-size A3 https://doc.qt.io/archives/qt-4.8/qapplication.html qapplication.pdf

A table of contents can be added to the document by adding a toc object to the command line:

$ wkhtmltopdf toc https://doc.qt.io/archives/qt-4.8/qstring.html qstring.pdf

wkhtmltopdf renders pages with the Qt WebKit engine, so how faithfully a page is rendered depends on the Qt and WebKit versions the tool was built against.
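To automate the conversion, we can wrap wkhtmltopdf in a small shell function. The following is a minimal sketch, not a definitive implementation: convert_dir is a hypothetical helper name, and it assumes that all input files sit in one directory and end in .html. It only uses options shown above:

```shell
# convert_dir SRC_DIR OUT_DIR
# Hypothetical helper: converts every *.html file in SRC_DIR to a PDF
# with the same base name in OUT_DIR.
convert_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for html in "$src_dir"/*.html; do
        # Skip the literal pattern when the directory has no .html files
        [ -e "$html" ] || continue
        pdf="$out_dir/$(basename "$html" .html).pdf"
        # Report a failed conversion but keep going with the remaining files
        wkhtmltopdf --page-size A4 --margin-top 20mm --margin-bottom 20mm \
            "$html" "$pdf" || echo "failed: $html" >&2
    done
}
```

After sourcing the function, a call such as convert_dir ./site ./pdf would write one PDF per HTML file into ./pdf.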

3.2. weasyprint

WeasyPrint produces PDFs with selectable text and hyperlinks. The command syntax to obtain a PDF from the HTML file is:

$ weasyprint [options] <input> <output>

The input is a filename or URL to an HTML document, or “-” to read HTML from stdin. The output is a filename, or “-” to write to stdout.
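Reading from stdin is handy when a script generates the HTML itself. As a rough sketch, the following pipes generated markup straight into weasyprint. Both render_html and report_to_pdf are hypothetical helper names introduced only for illustration:

```shell
# render_html TITLE
# Hypothetical helper: prints a tiny HTML page to stdout.
render_html() {
    printf '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
    printf '<title>%s</title></head>\n' "$1"
    printf '<body><h1>%s</h1></body></html>\n' "$1"
}

# report_to_pdf TITLE OUTPUT
# Passes "-" as the input so that weasyprint reads the HTML from stdin.
report_to_pdf() {
    render_html "$1" | weasyprint - "$2"
}
```

For instance, report_to_pdf "Daily report" report.pdf would produce report.pdf without creating an intermediate HTML file.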

Options can be mixed anywhere before, between, or after the input and output. We can force the input character encoding using -e utf-8 or --encoding utf-8:

$ weasyprint -e utf-8 docs.html docs.pdf

We can also add the filename or URL of a user cascading stylesheet to the document with -s print.css or --stylesheet print.css:

$ weasyprint -s print.css docs.html docs.pdf

To set tiny margins, we can pass an inline stylesheet through process substitution, which requires a shell such as Bash:

$ weasyprint docs.html docs.pdf -s <(echo '@page { margin: 0.5cm; }')

We can install weasyprint using a package manager such as apt-get:

$ sudo apt-get -y install weasyprint

3.3. ebook-convert

The ebook-convert command-line utility, which ships with Calibre, converts an HTML document, together with the pages it links to, into a single PDF.

Regular usage of this utility would be:

$ ebook-convert index.html book.pdf

We can also use the --input-encoding option to specify the character encoding of the input document.

Another useful option is --max-levels. It sets the maximum level of recursion when following links in HTML files. The value must be non-negative and defaults to 5; a value of 0 means that no links in the root HTML file are followed.
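The two options can be combined in a script. The following is a minimal sketch in which html_book_to_pdf is a hypothetical wrapper name. It checks that the recursion depth is a non-negative integer before calling ebook-convert and assumes UTF-8 input:

```shell
# html_book_to_pdf INPUT OUTPUT [MAX_LEVELS]
# Hypothetical wrapper around ebook-convert.
html_book_to_pdf() {
    input=$1
    output=$2
    max_levels=${3:-5}   # 5 mirrors the default described above
    case $max_levels in
        ''|*[!0-9]*)
            # Reject anything that isn't made up of digits only
            echo "max-levels must be a non-negative integer: $max_levels" >&2
            return 1
            ;;
    esac
    ebook-convert "$input" "$output" \
        --input-encoding utf-8 --max-levels "$max_levels"
}
```

Calling html_book_to_pdf index.html book.pdf 0 would then convert only index.html itself, without following any of its links.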

3.4. unoconv

We can use unoconv in standalone mode, which means that in the absence of a LibreOffice listener, it will start its own:

$ unoconv -f pdf some-document.html

Also, we can start unoconv as a listener (by default on localhost:2002) to let other unoconv instances connect to it:

$ unoconv --listener & 
$ unoconv -f pdf some-document.html 
$ kill -15 %-
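When converting several files in one run, the listener steps above can be wrapped in a function so that the listener is always stopped afterward. This is only a sketch: convert_with_listener and LISTENER_WAIT are hypothetical names, and the start-up delay is a guess that may need tuning:

```shell
# convert_with_listener FILE...
# Hypothetical helper: starts one listener, converts each file to PDF,
# then stops the listener.
convert_with_listener() {
    unoconv --listener &
    listener_pid=$!
    # Give the listener time to start (assumed delay, in seconds)
    sleep "${LISTENER_WAIT:-5}"
    status=0
    for doc in "$@"; do
        unoconv -f pdf "$doc" || status=1
    done
    # Stop the listener with SIGTERM, as in the manual steps
    kill -15 "$listener_pid" 2>/dev/null || true
    wait "$listener_pid" 2>/dev/null || true
    return "$status"
}
```

The function returns a non-zero status if any of the conversions failed, so a calling script can react to errors.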

This also works on a remote host:

$ unoconv --listener --server 1.2.3.4 --port 4567

Then, from another system, we can connect to that listener to convert documents:

$ unoconv --server 1.2.3.4 --port 4567 -f pdf mypage.html

We can install it on most Linux distributions via the package manager, for example with apt-get on Debian or Ubuntu:

$ sudo apt-get -y install unoconv

3.5. act Converter

act is a tool that provides a simplified interface for performing common actions. Using it, we can convert an HTML file to PDF format:

$ act convert index.html -o index.pdf -w 2000px -h 3000px

This will create a new PDF file from the HTML file.

4. Conclusion

In this article, we discussed the underlying characteristics of the HTML and PDF document formats and the feasibility of HTML-to-PDF conversions. We then saw how command-line tools ease this conversion, which is otherwise difficult without a PDF creation program.
