Portable Document Format (PDF) files are ubiquitous due to their inherent portability and universality. Because of this, methods to edit a PDF file can be a valuable asset in the toolkits of users and administrators alike. One way to modify a PDF is to split it or rearrange its pages.
In this tutorial, we go over different ways to split or redistribute, and combine PDF files and pages. First, we briefly explore paging. After that, we check an online method for working with PDF files. Next, we turn to printing as a way to split PDF files. Finally, we enumerate multiple packages for handling PDF operations such as splitting, merging, and more, providing examples for each.
Pages are visually distinct canvases that hold and bind file content within a given document. When it comes to PDF files, PDF pages are top-level objects.
Pages have numbers, which provide for navigation, subsetting, and searching. For example, we can extract a number of pages from a document by specifying them:
- as a range, e.g., 13-666, -666 (start to 666), -10–1 (last 10 pages)
- as a list, e.g., 13,17,100-666
This way, we get a subset from a given document. By combining subsets together, we can produce new and rearranged documents. However, although PDF code can be made fairly readable, manipulating it by hand, even for rearranging, can easily result in corruption.
So, let’s explore safe ways to perform page subsetting and recombining.
3. Working on PDF Files Online
When we don’t want to or can’t install any packages, using an online service is often a convenient choice for many tasks, including the one at hand.
Since the interface of most online tools is intuitive and self-explanatory, we select ILovePDF as an entirely free one with many features:
- PDF conversion to and from Word, PowerPoint, Excel, JPEG, HTML, PDF/A
- merge and split PDF
- freely rearrange PDF pages
- compress PDF
- free edit PDF
- sign and watermark PDF
- protect or unlock PDF
- rotate PDF
- add and customize PDF page numbers
- scan PDF or perform PDF OCR
- repair PDF
In our case, we can use both the split and rearrange options. They produce either single PDF files or ZIP archives with the resulting data.
Still, if we prefer a local offline solution, there are plenty to choose from.
4. Generate Subset to PDF With PDF Printer
One of the most common ways to extract a number of pages from a PDF file involves just printing it to a virtual printer. Most contemporary operating systems, Web browsers, and even environments provide a so-called PDF printer. Even if we don’t have one, we can use the standard Common UNIX Printing System (CUPS) to provide a PDF printer via its cups-pdf package.
In essence, a PDF printer is software that uses code similar to that of a printer control language (PCL) to discern how a document looks and capture that in a separate PDF file.
For example, we might want to save a given page of a website for later viewing. To do so, we can print it to a file and then use an offline reader. Further, we can do the same with a PDF document, thus producing a PDF from a PDF.
In fact, as with most printing dialogs, we can usually adjust different settings:
- layout (landscape or portrait)
- (virtual) paper size
- pages per sheet
- headers and footers
- page range
Of course, the last part is of interest to us in this case, as it enables the extraction of a custom subset from a document. Some printers even provide settings for extracting based on odd numbers and even numbers, easing collation.
One fairly big downside of this method is the fact that the resulting file doesn’t contain a selectable text layer, as everything is converted to images simply arranged within a PDF.
5. PDF Toolkit (PDFtk) Split and Recombine
$ pdftk full.pdf cat 13-666 output subset_p13-666.pdf
Here, we con[cat]enate pages 13-666 from full.pdf into the output file subset_p13-666.pdf.
Alternatively, we can specify pages and ranges multiple times to recombine a PDF:
$ pdftk full.pdf cat 1 13 13 1 5-77 output subset_custom.pdf
Now, we have the subset_custom.pdf file:
- page 1 is page 1 from full.pdf
- page 2 is page 13 from full.pdf
- page 3 is page 13 from full.pdf
- page 4 is page 1 from full.pdf
- pages 5-77 are pages 5-77 from full.pdf
In addition, there’s the burst subcommand:
$ pdftk full.pdf burst
The burst subcommand creates a number of new PDF files equal to the number of pages in the original document, each containing a single page from it. By default, each filename starts with pg_ and continues with a zero-padded four-digit page number according to the printf integer format specifier %04d.
To modify the output naming scheme, we can append another format to the end of the command via output:
$ pdftk full.pdf burst output prefix_%02d.pdf
Now, we get files with the prefix_ prefix and a two-digit zero padding.
Finally, we even get a report about the procedure in doc_data.txt:
$ cat doc_data.txt InfoBegin InfoKey: ModDate InfoValue: D:20230810080808-04'00' InfoBegin InfoKey: Creator InfoValue: pdftk-java 3.3.2 InfoBegin InfoKey: CreationDate InfoValue: D:20230810080808-04'00' InfoBegin InfoKey: Producer InfoValue: x ([email protected]) PdfID0: 91135e76263a891ba6b946390eb03b1b PdfID1: a632bbacee2b642321c3119bee2e9ac7 NumberOfPages: 667 PageMediaBegin PageMediaNumber: 1 PageMediaRotation: 0 PageMediaRect: 0 0 666.001 1666.667 PageMediaDimensions: 666.001 1000.667 [...]
After bursting, we can use the cat subcommand again to recombine some of the output files:
$ pdftk pg_0666.pdf pg_0667.pdf cat subset_p666-667.pdf
Being a full-fledged PDF parser, PDFtk aims to preserve all aspects of the original document.
6. MuPDF mutool Split and Recombine
Still, we can (paradoxically) extract pages from a PDF file using the merge subcommand:
$ mutool merge -o subset_p13-666.pdf full.pdf 13-666
In this case, we get pages 13-666 from full.pdf into the [-o]utput file subset_p13-666.pdf.
Let’s recombine a number of pages within a new document:
$ mutool merge -o subset_custom.pdf full.pdf 1,13,13,1,5-77
This way, we get the same subset_custom.pdf page mapping we had previously.
Still, mutool does some hidden transformations, which might end up changing the internal code of the PDF to conform to the MuPDF standards.
7. QPDF Split and Recombine
$ qpdf full.pdf --pages . 13-666 -- subset_p13-666.pdf
In this case, we extract –pages in the range 13-666 from the current document as denoted by the . period. After that, we place the extracted pages in the subset_p13-666.pdf file as specified in the output section after —.
On the other hand, the –split-pages option is similar to the burst subcommand. However, instead of the constant 1 of pdftk burst, –split-pages has a variable per-file page count:
$ qpdf --split-pages=100 full.pdf subset_100p.pdf
This command splits full.pdf at every 100 pages, placing the result in .pdf files with a subset_100p- prefix followed by the page ranges in order, e.g., subset_100p-001-100.pdf.
To get other formats, we can again use printf format specifiers in the filename we append to the end of the command:
$ qpdf --split-pages=100 full.pdf subset_100p-%d.pdf
In this case, the output naming scheme doesn’t change relative to our last example. Also, since we deal with ranges, we can’t use explicit zero padding.
Naturally, the –pages option can also combine pages and files:
$ qpdf --empty --pages subset_100p-001-100.pdf 51-100 s ubset_100p-201-300.pdf 51-100 -- subsets_extract.pdf
For this operation, we start with a new –empty PDF file. In it, we place several ranges in order:
- pages 51-100 (last 50 pages) of subset_100p-001-100.pdf
- pages 51-100 (last 50 pages) of subset_100p-201-300.pdf
QPDF also offers page range modifiers like even and odd.
Importantly, unless we start with an –empty file, the metadata and original structure of any PDF document usually remain intact.
8. Ghostscript Split and Recombine
Let’s see how we can leverage gs for our purposes:
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_p13-666.pdf' -dFirstPage=13 -dLastPage=666 'full.pdf'
Just like before, we produce subset_p13-666.pdf as the -sOutputFile containing all pages from the -dFirstPage=13 to the -dLastPage=666 of the input file full.pdf. Critically, -sDEVICE=pdfwrite makes the interpreter use pdfwrite as the output device, thereby enabling the features above.
Further, we add -dNOPAUSE and -dBATCH to ensure an uninterrupted execution that doesn’t expect user input to exit.
To change the output file name according to the page range, we can use format specifiers:
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_custom-%06d.pdf' -dFirstPage=13 -dLastPage=666 'full.pdf'
With this code, we get the page number as a six-digit zero-padded number between the – dash and the .pdf file extension.
Alternatively, we can use -dPageList for a more versatile way to extract pages:
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='subset_custom.pdf' -sPageList=1,13,13,1,5-77 'full.pdf'
Similar to our earlier examples, we get the pages in the list in order within our new document, subset_custom.pdf. The -sPageList option supports more complex specifiers like negative indices (-1 is the last page) and empty range indices for from beginning or until end.
The gs command also supports merging files via basic enumeration:
$ gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile='full-p13-15.pdf' 'subset_custom-000013.pdf' 'subset_custom-000014.pdf' 'subset_custom-000015.pdf'
This way, we get the files one after the other in a new PDF called full-p13-15.pdf. Naturally, the order is extracted from that of the arguments list. Still, when many files are involved, this scheme is more rigid than using ranges, which Ghostscript doesn’t support for merging.
9. Poppler Tools Split and Recombine
The famous Poppler toolkit also provides utilities for PDF manipulation that support splitting and rearranging pages and files. However, the operations are atomic in this case, meaning we must get each page in a separate file first and then recombine them as desired.
$ pdfseparate -f 13 -l 666 full.pdf full-p%d.pdf
The result is similar to that of pdftk burst and qpdf –split-pages – one page per file. However, the last argument is mandatory in this case to set the naming scheme. Just like before, the printf-compatible format specifiers can add the page number with or without zero padding as necessary.
After having the pages we want as separate files, we can use pdfunite to join them in a single file:
$ pdfunite full-*.pdf subset_p13-666.pdf
This command creates subset_p13-666.pdf from all full-*.pdf files in ascending (default) order by using globbing. Since the order in which we supply the files decides the final page numbers, we can rearrange freely by specifying pages as desired. Again, this scheme can quickly become tedious.
Finally, unlike other tools, the merge operation can’t include encrypted files.
10. PDFsam (PDF Split and Merge)
Let’s install it via apt:
$ apt install pdfsam
Now, we should have access to the pdfsam command that starts the actual graphical user interface (GUI) tool:
The controls and steps are fairly intuitive for splitting:
- Press Split
- Press Select PDF or drag and drop to select a single PDF file
- Choose Split settings (Every page, every “n” pages, Odd pages, Even pages, given pages)
- Set name prefix or leave default (PDFsam_)
- Press Run
After the above procedure, we end up with the respective page range files.
To combine files, we select the respective option and continue from there:
- Press Merge
- Press Add or Drag and drop to select PDF files
- Select desired Merge settings
- Select Destination file via Browse
- Press Run
PDFsam also has other features, such as bookmark splitting, collation, rotation, and extraction.
The tool supports compression, which might change the internals of resulting PDF files.
In this article, we talked about splitting and recombining PDF files and pages with different tools.
In conclusion, the Linux environment provides multiple ways to work with PDF files, many of which are free and open source.