Split and Recombine PDF Files and Pages in Linux

1. Introduction

Portable Document Format (PDF) files are ubiquitous due to their inherent portability and universality. Because of this, methods to edit a PDF file can be a valuable asset in the toolkits of users and administrators alike. One way to modify a PDF is to split it or rearrange its pages.

In this tutorial, we go over different ways to split or redistribute, and combine PDF files and pages. First, we briefly explore paging. After that, we check an online method for working with PDF files. Next, we turn to printing as a way to split PDF files. Finally, we enumerate multiple packages for handling PDF operations such as splitting, merging, and more, providing examples for each.

We tested the code in this tutorial on Debian 12 (Bookworm) with GNU Bash 5.1.4. It should work in most POSIX-compliant environments unless otherwise specified.

2. Pages

Pages are visually distinct canvases that hold and bind file content within a given document. When it comes to PDF files, PDF pages are top-level objects.

Pages have numbers, which provide for navigation, subsetting, and searching. For example, we can extract a number of pages from a document by specifying them:

as a range, e.g., 13-666, -666 (start to 666), -10–1 (last 10 pages)
as a list, e.g., 13,17,100-666

This way, we get a subset from a given document. By combining subsets together, we can produce new and rearranged documents. However, although PDF code can be made fairly readable, manipulating it by hand, even for rearranging, can easily result in corruption.

So, let’s explore safe ways to perform page subsetting and recombining.

3. Working on PDF Files Online

When we don’t want to or can’t install any packages, using an online service is often a convenient choice for many tasks, including the one at hand.

Since the interface of most online tools is intuitive and self-explanatory, we select ILovePDF as an entirely free one with many features:

PDF conversion to and from Word, PowerPoint, Excel, JPEG, HTML, PDF/A
merge and split PDF
freely rearrange PDF pages
compress PDF
free edit PDF
sign and watermark PDF
protect or unlock PDF
rotate PDF
add and customize PDF page numbers
scan PDF or perform PDF OCR
repair PDF

In our case, we can use both the split and rearrange options. They produce either single PDF files or ZIP archives with the resulting data.

Still, if we prefer a local offline solution, there are plenty to choose from.

4. Generate Subset to PDF With PDF Printer

One of the most common ways to extract a number of pages from a PDF file involves just printing it to a virtual printer. Most contemporary operating systems, Web browsers, and even environments provide a so-called PDF printer. Even if we don’t have one, we can use the standard Common UNIX Printing System (CUPS) to provide a PDF printer via its cups-pdf package.

In essence, a PDF printer is software that uses code similar to that of a printer control language (PCL) to discern how a document looks and capture that in a separate PDF file.

For example, we might want to save a given page of a website for later viewing. To do so, we can print it to a file and then use an offline reader. Further, we can do the same with a PDF document, thus producing a PDF from a PDF.

In fact, as with most printing dialogs, we can usually adjust different settings:

destination
layout (landscape or portrait)
(virtual) paper size
pages per sheet
margins
scale
headers and footers
quality
page range

Of course, the last part is of interest to us in this case, as it enables the extraction of a custom subset from a document. Some printers even provide settings for extracting based on odd numbers and even numbers, easing collation.

One fairly big downside of this method is the fact that the resulting file doesn’t contain a selectable text layer, as everything is converted to images simply arranged within a PDF.

5. PDF Toolkit (PDFtk) Split and Recombine

As its name implies, the PDF Toolkit (PDFtk) has a number of features, one of which is the cat subcommand:

$ pdftk full.pdf cat 13-666 output subset_p13-666.pdf

Here, we con[cat]enate pages 13-666 from full.pdf into the output file subset_p13-666.pdf.

Alternatively, we can specify pages and ranges multiple times to recombine a PDF:

$ pdftk full.pdf cat 1 13 13 1 5-77 output subset_custom.pdf

Now, we have the subset_custom.pdf file:

page 1 is page 1 from full.pdf
page 2 is page 13 from full.pdf
page 3 is page 13 from full.pdf
page 4 is page 1 from full.pdf
pages 5-77 are pages 5-77 from full.pdf

In addition, there’s the burst subcommand:

$ pdftk full.pdf burst

The burst subcommand creates a number of new PDF files equal to the number of pages in the original document, each containing a single page from it. By default, each filename starts with pg_ and continues with a zero-padded four-digit page number according to the printf integer format specifier %04d.

To modify the output naming scheme, we can append another format to the end of the command via output:

$ pdftk full.pdf burst output prefix_%02d.pdf

Now, we get files with the prefix_ prefix and a two-digit zero padding.

Finally, we even get a report about the procedure in doc_data.txt:

$ cat doc_data.txt
InfoBegin
InfoKey: ModDate
InfoValue: D:20230810080808-04'00'
InfoBegin
InfoKey: Creator
InfoValue: pdftk-java 3.3.2
InfoBegin
InfoKey: CreationDate
InfoValue: D:20230810080808-04'00'
InfoBegin
InfoKey: Producer
InfoValue: x ([email protected])
PdfID0: 91135e76263a891ba6b946390eb03b1b
PdfID1: a632bbacee2b642321c3119bee2e9ac7
NumberOfPages: 667
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 666.001 1666.667
PageMediaDimensions: 666.001 1000.667
[...]

After bursting, we can use the cat subcommand again to recombine some of the output files:

$ pdftk pg_0666.pdf pg_0667.pdf cat subset_p666-667.pdf

Being a full-fledged PDF parser, PDFtk aims to preserve all aspects of the original document.

6. MuPDF mutool Split and Recombine

While the MuPDF mutool has many features, direct page splitting by range isn’t a separate subcommand.

Still, we can (paradoxically) extract pages from a PDF file using the merge subcommand:

$ mutool merge -o subset_p13-666.pdf full.pdf 13-666

In this case, we get pages 13-666 from full.pdf into the [-o]utput file subset_p13-666.pdf.

Let’s recombine a number of pages within a new document:

$ mutool merge -o subset_custom.pdf full.pdf 1,13,13,1,5-77

This way, we get the same subset_custom.pdf page mapping we had previously.

Still, mutool does some hidden transformations, which might end up changing the internal code of the PDF to conform to the MuPDF standards.

7. QPDF Split and Recombine

Naturally, the versatile qpdf tool has the –pages option for page manipulation as one of its features:

$ qpdf full.pdf --pages . 13-666 -- subset_p13-666.pdf

In this case, we extract –pages in the range 13-666 from the current document as denoted by the . period. After that, we place the extracted pages in the subset_p13-666.pdf file as specified in the output section after —.

On the other hand, the –split-pages option is similar to the burst subcommand. However, instead of the constant 1 of pdftk burst, –split-pages has a variable per-file page count:

$ qpdf --split-pages=100 full.pdf subset_100p.pdf

This command splits full.pdf at every 100 pages, placing the result in .pdf files with a subset_100p- prefix followed by the page ranges in order, e.g., subset_100p-001-100.pdf.

To get other formats, we can again use printf format specifiers in the filename we append to the end of the command:

$ qpdf --split-pages=100 full.pdf subset_100p-%d.pdf

In this case, the output naming scheme doesn’t change relative to our last example. Also, since we deal with ranges, we can’t use explicit zero padding.

Naturally, the –pages option can also combine pages and files:

$ qpdf --empty --pages subset_100p-001-100.pdf 51-100 s
ubset_100p-201-300.pdf 51-100 -- subsets_extract.pdf

For this operation, we start with a new –empty PDF file. In it, we place several ranges in order:

pages 51-100 (last 50 pages) of subset_100p-001-100.pdf
pages 51-100 (last 50 pages) of subset_100p-201-300.pdf

QPDF also offers page range modifiers like even and odd.

Importantly, unless we start with an –empty file, the metadata and original structure of any PDF document usually remain intact.

8. Ghostscript Split and Recombine

The universal Ghostscript toolset contains gs, a PDF interpreter with a vast feature set.

Let’s see how we can leverage gs for our purposes:

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_p13-666.pdf' -dFirstPage=13 -dLastPage=666 'full.pdf'

Just like before, we produce subset_p13-666.pdf as the -sOutputFile containing all pages from the -dFirstPage=13 to the -dLastPage=666 of the input file full.pdf. Critically, -sDEVICE=pdfwrite makes the interpreter use pdfwrite as the output device, thereby enabling the features above.

Further, we add -dNOPAUSE and -dBATCH to ensure an uninterrupted execution that doesn’t expect user input to exit.

To change the output file name according to the page range, we can use format specifiers:

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_custom-%06d.pdf' -dFirstPage=13 -dLastPage=666 'full.pdf'

With this code, we get the page number as a six-digit zero-padded number between the – dash and the .pdf file extension.

Alternatively, we can use -dPageList for a more versatile way to extract pages:

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_custom.pdf' -sPageList=1,13,13,1,5-77 'full.pdf'

Similar to our earlier examples, we get the pages in the list in order within our new document, subset_custom.pdf. The -sPageList option supports more complex specifiers like negative indices (-1 is the last page) and empty range indices for from beginning or until end.

The gs command also supports merging files via basic enumeration:

$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='full-p13-15.pdf' 'subset_custom-000013.pdf' 'subset_custom-000014.pdf' 'subset_custom-000015.pdf'

This way, we get the files one after the other in a new PDF called full-p13-15.pdf. Naturally, the order is extracted from that of the arguments list. Still, when many files are involved, this scheme is more rigid than using ranges, which Ghostscript doesn’t support for merging.

9. Poppler Tools Split and Recombine

The famous Poppler toolkit also provides utilities for PDF manipulation that support splitting and rearranging pages and files. However, the operations are atomic in this case, meaning we must get each page in a separate file first and then recombine them as desired.

First, pdfseparate extracts pages between the [-f]irst and [-l]ast supplied:

$ pdfseparate -f 13 -l 666 full.pdf full-p%d.pdf

The result is similar to that of pdftk burst and qpdf –split-pages – one page per file. However, the last argument is mandatory in this case to set the naming scheme. Just like before, the printf-compatible format specifiers can add the page number with or without zero padding as necessary.

After having the pages we want as separate files, we can use pdfunite to join them in a single file:

$ pdfunite full-*.pdf subset_p13-666.pdf

This command creates subset_p13-666.pdf from all full-*.pdf files in ascending (default) order by using globbing. Since the order in which we supply the files decides the final page numbers, we can rearrange freely by specifying pages as desired. Again, this scheme can quickly become tedious.

Finally, unlike other tools, the merge operation can’t include encrypted files.

10. PDFsam (PDF Split and Merge)

The aptly named PDF Split And Merge (PDFsam) package and tool has an open-source Basic version.

Let’s install it via apt:

$ apt install pdfsam

Now, we should have access to the pdfsam command that starts the actual graphical user interface (GUI) tool:

The controls and steps are fairly intuitive for splitting:

Press Split
Press Select PDF or drag and drop to select a single PDF file
Choose Split settings (Every page, every “n” pages, Odd pages, Even pages, given pages)
Set name prefix or leave default (PDFsam_)
Press Run

After the above procedure, we end up with the respective page range files.

To combine files, we select the respective option and continue from there:

Press Merge
Press Add or Drag and drop to select PDF files
Select desired Merge settings
Select Destination file via Browse
Press Run

PDFsam also has other features, such as bookmark splitting, collation, rotation, and extraction.

The tool supports compression, which might change the internals of resulting PDF files.

11. Summary

In this article, we talked about splitting and recombining PDF files and pages with different tools.

In conclusion, the Linux environment provides multiple ways to work with PDF files, many of which are free and open source.

Full Archive

About Baeldung

Administration

Filesystems

Processes

Files

Scripting

Installation

Networking

Security