In this tutorial, we consider the PDF format and explore ways to view and edit its original source code. First, we take a thorough look at the portable document and another similar format. Next, we show samples of two different ways for export and storage. After that, we discuss the general idea and pitfalls behind PDF creation and editing. Finally, we delve deep into many tools that enable us to handle and repair PDF files.
2. PostScript (PS)
Before going into PDF, we start with its much older relative. PostScript (PS) is a format, page description language (PDL), and printer control language (PCL). In other words, it can specify the design elements of pages and how they appear with human-readable text:
%!PS
/Courier
10 selectfont
100 666 moveto
(Baeldung) show
showpage
This sample PS file begins with a header similar to the shebang in Linux. The next two lines choose the Courier font and select it in size 10 with selectfont. Actually, font handling is one of the most complex activities of PostScript and, by extension, PDF.
After that, moveto specifies a location, where show writes Baeldung. To display the page, we finish with showpage.
Currently, PostScript 3 (PS3) is the latest iteration from 1997, as described in the PostScript Language Reference Third Edition (ISBN 0-201-37922-8).
Since PostScript is the basis of many other formats such as Encapsulated PostScript (EPS) and PDF files, we can convert them to and from PS. In fact, the main difference between PDF and PS is the former’s lack of a general-purpose programming language backbone. One can very roughly compare PDF’s static structure to that of the HyperText Markup Language (HTML), unlike PS, which can compute graphics dynamically.
3. Portable Document Format (PDF)
The Portable Document Format (PDF) is a universal and portable way to view and transfer structured data:
- forms and form fields
- audio and video
Of course, the main aim is to have a standard that defines how all of these are to be embedded in a single file so that software on any operating system (OS) and hardware can handle them. In a way, PDF is also a PDL and PCL.
4. Sample Pure-Text PDF Structure
Essentially, PDF files consist of pages, described by objects. This collection of objects is the format’s backbone and plays a key role in the presentation, so its structure is well-defined.
Let’s create a sample pure-text PDF:
%PDF-1.1
%¥±ë

1 0 obj
  << /Type /Catalog
     /Pages 2 0 R
  >>
endobj

2 0 obj
  << /Type /Pages
     /Kids [3 0 R]
     /Count 1
     /MediaBox [0 0 300 144]
  >>
endobj

3 0 obj
  << /Type /Page
     /Parent 2 0 R
     /Resources
       << /Font
         << /F1
           << /Type /Font
              /Subtype /Type1
              /BaseFont /Times-Roman
           >>
         >>
       >>
     /Contents 4 0 R
  >>
endobj

4 0 obj
  << /Length 55 >>
stream
  BT
    /F1 18 Tf
    0 0 Td
    (Hello World) Tj
  ET
endstream
endobj

xref
0 5
0000000000 65535 f
0000000015 00000 n
0000000077 00000 n
0000000179 00000 n
0000000433 00000 n
trailer
  << /Root 1 0 R
     /Size 5
  >>
startxref
541
%%EOF
For example, this simple one-page PDF has four objects:
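- Object 1 0 – the document catalog (/Type /Catalog)
- Object 2 0 – the page tree (/Type /Pages)
- Object 3 0 – the only page (/Type /Page)
- Object 4 0 – the content stream of that page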
Each of these objects has a dictionary (<<>>) of key-value pairs like /Type /Page and /MediaBox [0 0 300 144]. Also, the latter assigns an array to /MediaBox, which is similar to a page size. Moreover, we can use object references such as 2 0 R and 3 0 R.
Further, the xref reference table is an index for objects. The first line of the table sets the first object number (0) and has the total entry count for this file (5): the four objects above plus the special entry for object 0, which is always free. Each following line defines a successive object (by number, starting from the first) and has the same structure:
- Offset in bytes from the start of the file to the beginning of the object content
- Generation number, matching the one after the object number (most often 0)
- Either n if the object is in use or f otherwise
Offsets are strict, so any changes to a PDF file may corrupt it. On the other hand, from version 1.5 onward, a PDF can use a cross-reference stream instead of the classic xref table. Further, copying the code above to a new text file creates a valid PDF for version 1.1, as defined by the mandatory first line.
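To illustrate how strict these offsets are, here’s a minimal Python sketch that assembles a small PDF and records the byte offset of every object while writing the xref table. The build_pdf helper and its compact layout are assumptions for illustration, not part of any tool we discuss here:

```python
def build_pdf(objects):
    """Assemble a PDF from object bodies, computing xref offsets on the fly."""
    out = bytearray(b"%PDF-1.1\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of the "N 0 obj" line
        out += b"%d 0 obj\n" % number
        out += body + b"\nendobj\n"
    xref_pos = len(out)                     # where the xref keyword starts
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"         # special free entry for object 0
    for offset in offsets:
        # 10-digit offset, 5-digit generation, n = in use (20 bytes per entry)
        out += b"%010d %05d n \n" % (offset, 0)
    out += b"trailer\n<< /Root 1 0 R /Size %d >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out), offsets


stream = b"BT /F1 18 Tf 0 0 Td (Hello World) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 /MediaBox [0 0 300 144] >>",
    b"<< /Type /Page /Parent 2 0 R /Resources << /Font << /F1 << /Type /Font"
    b" /Subtype /Type1 /BaseFont /Times-Roman >> >> >> /Contents 4 0 R >>",
    b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
]

pdf, offsets = build_pdf(objects)
print(offsets)
```

Since the offsets are computed rather than typed by hand, inserting or removing bytes in any object only means regenerating the table.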
5. Sample Binary PDF
In practice, a pure-text representation is unusual. For example, we can have the sample PDF file from earlier, but in binary form:
%PDF-1.1
%âãÏÓ
1 0 obj
<<
/Type /Catalog
/Pages 2 0 R
>>
endobj
2 0 obj
<<
/Kids [3 0 R]
/Type /Pages
/Count 1
/MediaBox [0 0 300 144]
>>
endobj
3 0 obj
<<
/Contents 4 0 R
/Type /Page
/Resources
<<
/Font
<<
/F1
<<
/Subtype /Type1
/Type /Font
/BaseFont /Times-Roman
>>
>>
>>
/Parent 2 0 R
/MediaBox [0 0 300 144]
>>
endobj
4 0 obj
<<
/Length 52
/Filter /FlateDecode
>>
stream
xœSPp áR }7CC…40Ï CRÀL Ôœœ|…ðü¢œM…, k +
endstream
endobj
xref
0 5
0000000000 65535 f
0000000015 00000 n
0000000066 00000 n
0000000149 00000 n
0000000331 00000 n
trailer
<<
/Root 1 0 R
/Size 5
>>
startxref
456
%%EOF
In this case, we see object 4 0 is an encoded stream in a binary representation (here, flate), which makes the file smaller.
However, we’re no longer able to directly see or edit the PDF’s previous contents:
4 0 obj
  << /Length 55 >>
stream
  BT
    /F1 18 Tf
    0 0 Td
    (Hello World) Tj
  ET
endstream
endobj
Here, we begin a text segment at BT, change the current font via Tf and write Hello World with Tj, before ending the text segment at ET. Without seeing the commands, it’s much harder to modify them.
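As an illustration of what the flate filter does, the following Python sketch compresses the same content stream with the standard zlib module and restores it again. It’s only a sketch: the exact compressed bytes depend on the compressor settings, so they needn’t match the 52 bytes that our sample file stores:

```python
import zlib

# The content stream of object 4 0 from the pure-text sample
content = b"BT /F1 18 Tf 0 0 Td (Hello World) Tj ET"

# /FlateDecode streams hold zlib (deflate) data
compressed = zlib.compress(content)
restored = zlib.decompress(compressed)

print(compressed[:2].hex())        # the two zlib header bytes
print(restored.decode("ascii"))    # the readable operators again
```

Typically, zlib data begins with the bytes 0x78 0x9C, which appear as xœ when viewed as Windows-1252 text, just like the start of the stream in our binary sample.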
Of course, other objects like media might not have a purely textual representation. Still, how can we convert the operators of a PDF file and as much of its content as possible to editable ASCII text?
6. PDF Creation and Editing
Due to the format’s ubiquity, many tools can generate PDF files. Thus, it’s up to the creator of the original content as well as their tool of choice to save or export the file in a given way.
For example, some Adobe products offer options to save a PDF decompressed or uncompressed, i.e., without compression. The same goes for many open-source tools, which use any of the libraries we discuss here. Compression is a method to reduce the size of a PDF file via specific encodings that can convert a pure-text object to a binary stream, thereby sometimes rendering the source PDF operators obfuscated. In addition, PDF-compressed objects can have their own additional encodings and compressions.
By decompressing, we end up with a file much like the pure-text sample PDF from earlier. Thus, after decompression, we can simply use an editor like vi as long as it can handle large files and preserve binary data:
$ vi /file.pdf
%PDF-1.1
%¥±ë

1 0 obj
  << /Type /Catalog
     /Pages 2 0 R
  >>
endobj
[...]
After some types of edits, we might have to repair the resulting PDF to avoid errors from strict PDF viewers due to object offsets and other references.
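For instance, a rough Python sketch can show which xref entries a hand edit has invalidated. The stale_xref_entries function is a hypothetical helper that assumes a classic xref table with a single subsection, and the tiny sample below is deliberately not a complete document:

```python
import re

ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")


def stale_xref_entries(pdf):
    """List object numbers whose xref offset no longer points at 'N G obj'."""
    # Find the last xref keyword that comes before startxref
    xref_at = pdf.rfind(b"xref\n", 0, pdf.rfind(b"startxref"))
    header, _, rest = pdf[xref_at + len(b"xref\n"):].partition(b"\n")
    first, count = (int(value) for value in header.split())
    stale = []
    entries = ENTRY.findall(rest)[:count]
    for number, (offset, generation, kind) in enumerate(entries, start=first):
        if kind == b"n":
            expected = b"%d %d obj" % (number, int(generation))
            if not pdf.startswith(expected, int(offset)):
                stale.append(number)
    return stale


good = (
    b"%PDF-1.1\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Root 1 0 R /Size 2 >>\n"
    b"startxref\n45\n%%EOF\n"
)

# Simulate a manual edit: an extra comment line shifts every later byte
edited = good.replace(b"%PDF-1.1\n", b"%PDF-1.1\n% a hand-made edit\n")

print(stale_xref_entries(good))
print(stale_xref_entries(edited))
```

Any object number that the helper reports needs a new offset, which is exactly what the repair features of the tools below take care of.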
Of course, a lot of libraries also have stand-alone tools for this and other purposes. Most are built with a command-line interface (CLI), but some have a graphical user interface (GUI) as well.
7. PDF Toolkit (PDFtk)
To begin with, we compressed our sample pure-text PDF to a part-binary PDF via PDFtk.
We can usually install the latest version of PDFtk from the pdftk package:
$ apt-get install pdftk
[...]
$ apt-cache policy pdftk
pdftk:
  Installed: 2.02-5+b1
[...]
$ pdftk --version
pdftk port to java 3.2.2 a Handy Tool for Manipulating PDF Documents
[...]
Now, let’s continue with some of its options.
7.1. PDFtk Features
This PDF toolkit includes the pdftk utility enabling us to perform many operations as subcommands:
- cat – merge, split, or rotate pages
- shuffle – collate pages
- burst – split into pages
- rotate – rotate given pages
- generate_fdf – generate FDF file for automatic form filling
- fill_form – fill forms with FDF
- background, multibackground – place watermark or watermarks under page contents
- stamp, multistamp – place watermark or watermarks over pages
- dump_data, dump_data_utf8 – report metadata, bookmarks, page metrics, and others (optionally in UTF-8)
- dump_data_fields, dump_data_fields_utf8 – get form field statistics (optionally in UTF-8)
- dump_data_annots – get link annotations
- update_info, update_info_utf8 – set metadata and bookmarks (optionally in UTF-8)
- attach_files – pack files in PDF
- unpack_files – unpack files from a PDF
Besides these, pdftk offers several security options for passwords and encryption.
7.2. Compress and Uncompress With pdftk
We can also use pdftk to compress and uncompress with the respective output option:
$ pdftk in.pdf output out.pdf compress
Actually, we compressed our sample PDF and object 4 0 in it with this exact command line. Naturally, performing the reverse operation yields a text-only PDF:
$ pdftk in.pdf output out.pdf uncompress
Moreover, pdftk offers the compress and uncompress options for many of its subcommands, which affect their output.
7.3. Repair PDF With pdftk
Repairing a PDF with pdftk is simple, but not always effective:
$ pdftk in.pdf output out.pdf
In essence, we just pass our file through the tool.
8. MuPDF Ecosystem
On most platforms, the viewer is in the mupdf package, while the command-line tools are in the mupdf-tools package:
$ apt-get install mupdf-tools
[...]
$ apt-cache policy mupdf-tools
mupdf-tools:
  Installed: 1.17.0+ds1-2
[...]
Now, let’s explore MuPDF further.
8.1. mutool Features
Among the MuPDF tools is the mutool utility with its subcommands:
- draw – convert documents to images (among others) with lots of options
- convert – convert documents into other formats simply
- trace – debugging tool for tracing
- show – show internal PDF objects
- extract – extract resources like images and embedded fonts
- clean – fix PDF by rewriting it in a potentially human-readable form
- merge – merge pages
- poster – subdivide pages into pieces
- create – use a text file with commands to create a PDF
- sign – digital signature operations
- info – get page object details
- pages – get page media box, artbox, and others
Unlike pdftk, mutool requires a subcommand on each run. By default, mutool preserves the original way a PDF is structured as long as we don’t explicitly request changes that alter it.
8.2. Compress and Uncompress With mutool
To compress or decompress, we can use clean:
$ mutool clean -d -z -gggg -i -a in.pdf out.pdf
As long as we don’t opt to preserve some, the clean command of mutool with its -d flag decompresses all streams, while potentially performing other optimizations:
- -i – compress or leave compressed image streams
- -a – use ASCII hex to encode any binary streams
- -g – remove unused objects
- -gg – -g and compact xref table
- -ggg – -gg and merge duplicate objects
- -gggg – -ggg and deduplicate streams
- -s – clean and streamline content streams
In fact, we only skip a few options:
- -l – reorder contents and objects as they are referenced by page (quick loading)
- -f – compress or leave compressed font streams
- -p – password, if needed
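To get a feel for what the -a option produces, here’s a small Python sketch of the ASCII hex encoding that PDF uses for the ASCIIHexDecode filter: every byte becomes two hexadecimal digits, and > marks the end of the data. The function names are our own and aren’t part of mutool:

```python
import binascii


def ascii_hex_encode(data):
    """Encode bytes as hexadecimal digits followed by the > end marker."""
    return binascii.hexlify(data).upper() + b">"


def ascii_hex_decode(text):
    """Decode ASCII hex data, ignoring whitespace and stopping at >."""
    digits = b"".join(text.split()).partition(b">")[0]
    if len(digits) % 2:
        digits += b"0"      # a missing final digit is treated as 0
    return binascii.unhexlify(digits)


encoded = ascii_hex_encode(b"\x78\x9c")
print(encoded.decode("ascii"))
print(ascii_hex_decode(b"78 9C>"))
```

The encoded form is safe to open in any text editor, but it doubles the size of the stream, so it’s a convenience for inspection rather than for storage.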
After any edits, we can also repair our file.
8.3. Repair PDF With mutool
The repair mechanism of clean is very effective in many circumstances, even after custom edits:
$ mutool clean in.pdf out.pdf
Considering the versatility of the MuPDF ecosystem, its licensing might be its main drawback.
9. QPDF Tool
On most platforms, we can install QPDF from the qpdf package:
$ apt-get install qpdf
[...]
$ apt-cache policy qpdf
qpdf:
  Installed: 10.1.0-1
[...]
$ qpdf --version
qpdf version 10.1.0
Run qpdf --copyright to see copyright and license information.
Next, let’s see what QPDF can offer.
9.1. QPDF Features
As the main tool in the package, qpdf performs many tasks:
- --linearize – reorder contents and objects as they are referenced by page (quick loading)
- --compress-streams=[n|y] – toggle compression of streams
- --decode-level=parameter – decompress and decode given streams
- --stream-data=parameter – preset combinations of --compress-streams and --decode-level
- --qdf – rewrite the file for viewing and editing
- --collate[=n] – collate pages, optionally by groups of n
- --split-pages – split into pages
- --overlay – overlay the pages of one file over another
- --underlay – underlay the pages of one file under another
- --rotate – rotate pages
- embedding and attaching files
- extraction of data such as media, metadata, object information, and more, even as JSON
9.2. QDF Mode and Decompression With qpdf
To get a file that’s easier to view and edit, we can use the QDF mode of qpdf:
$ qpdf --qdf in.pdf out.pdf
In fact, QDF mode is a way of processing unique to this toolkit:
- it’s incompatible with --linearize
- all streams that can be decompressed are decompressed
- content streams are normalized
- encrypted files are decrypted
- objects are restructured to be more readable, albeit less efficient
- hinting comments are added
On the other hand, we can skip --qdf in favor of just combining other options:
$ qpdf --decode-level=all --compress-streams=n in.pdf out.pdf
Here, --compress-streams=n decompresses streams or just leaves them uncompressed, while --decode-level=all ensures this is done for all streams. We can achieve a similar effect, but only for generalized streams, via the older --stream-data=uncompress option.
9.3. Repair PDF With fix-qdf
After any changes, we can use the included fix-qdf command to repair hand-edited QDF files, partially based on the changes --qdf introduces:
$ fix-qdf in.pdf > out.pdf
Still, this should work for many non-QDF PDF files as well.
10. Ghostscript
- Ghostscript PDF and PS interpreter
- GhostPDF, a PDF interpretation component, currently available as an old PostScript-based and new C-based version
- GhostPDL, the umbrella term for all Ghostscript products
- GhostPCL PCL and PXL interpreter
- GhostXPS XPS interpreter
- font information
First, let’s install Ghostscript via the ghostscript package:
$ apt-get install ghostscript
[...]
$ apt-cache policy ghostscript
ghostscript:
  Installed: 9.53.3~dfsg-7+deb11u2
[...]
$ ghostscript --version
9.53.3
Now, let’s focus on the main Ghostscript tool.
10.1. Ghostscript PDF Interpreter Features
The gs (gswin32 or gswin64 on Microsoft Windows) Ghostscript interpreter works at a low level, which means it doesn’t necessarily provide single subcommands for its many abilities. Instead, gs has output devices with options, which we can combine to perform tasks. Actually, this is one of its main strengths.
For example, the pdfwrite, ps2write, and eps2write PDF and PostScript output devices are very versatile as they can do most of what we already discussed about other toolsets:
- merge files
- split files with OutputFile, %d, -dFirstPage, and -dLastPage
- rotate pages
- change PDF version
- change PDF type
- embed fonts
- compress and decompress fonts
- compress and decompress streams
- compress and decompress pages
- convert colors
- change resolution
- change page position
- many other options
Critically, due to its low-level operation, Ghostscript doesn’t preserve the original input files but instead creates a new one through the requested virtual device. While file appearance might be the same, gs changes the way it’s achieved. On the other hand, commands like mutool do preserve the contents as long as we don’t explicitly request such structure modifications.
10.2. Compress and Decompress PDF With gs and Ghostscript
Just like other tools, gs can decompress many elements of a PDF:
$ gs -dNOSAFER -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.7 -dEmbedAllFonts=true -dCompressEntireFile=false -dCompressStreams=false -dCompressPages=false -dCompressFonts=false -sOutputFile="out.pdf" -f "in.pdf"
Let’s break down this command:
- -dNOSAFER – enable changes to the filesystem when allowed
- -dNOPAUSE – disable pausing on each page
- -dBATCH – exit after processing all files
- -sDEVICE=pdfwrite – select pdfwrite as the initial output device
- -dCompatibilityLevel – set format version of output
- -dEmbedAllFonts=true – embed all fonts
- -dCompressEntireFile=false – apply no additional compression
- -dCompressStreams=false – decompress non-font and non-page streams
- -dCompressPages=false – decompress page content streams
- -dCompressFonts=false – decompress embedded fonts
- -sOutputFile – select output file for the output device
- -f – makes supplying input file name(s) safer
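Since the gs command line is long, it can help to assemble it programmatically. The sketch below builds the same argument list in Python. The gs_decompress_args function and its parameters are hypothetical names of our own, and the block only prints the command instead of running it:

```python
import shlex


def gs_decompress_args(src, dst, compatibility="1.7"):
    """Build the gs argument list for writing an uncompressed PDF."""
    return [
        "gs",
        "-dNOSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-dCompatibilityLevel={compatibility}",
        "-dEmbedAllFonts=true",
        "-dCompressEntireFile=false",
        "-dCompressStreams=false",
        "-dCompressPages=false",
        "-dCompressFonts=false",
        f"-sOutputFile={dst}",
        "-f", src,                  # input file names go last
    ]


# Show the resulting command line with shell quoting applied
print(shlex.join(gs_decompress_args("in.pdf", "out.pdf")))
```

To actually run the conversion, we could pass the list to subprocess.run() with check=True, provided that Ghostscript is installed.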
Since Ghostscript generates a new PDF file, it’s possible that the uncompressed streams don’t contain their exact original data.
Further, the decompression mechanisms of gs are generally not as advanced or universal as those of other solutions.
10.3. Repair PDF With gs and Ghostscript
Importantly, Ghostscript has no repair facilities and, unlike most readers and other tools, is highly intolerant of syntax and specification problems. What’s more, gs itself can sometimes introduce issues:
- forms not working
- incomplete fonts
- missing characters or glyphs
- missing ligatures
Of course, there are other tools that can perform a repair but can’t toggle compression.
11. Poppler Tools
To install Poppler, we can use the poppler-utils package on most distributions and the xpdf package on some:
$ apt-get install poppler-utils
[...]
$ apt-cache policy poppler-utils
poppler-utils:
  Installed: 20.09.0-3.1+deb11u1
11.1. Poppler Features
While Poppler doesn’t provide a way to control PDF file compression, it does provide stable stand-alone utilities when it comes to many other PDF operations:
- pdfattach – embed attachments
- pdfdetach – extract attachments
- pdffonts – get font information
- pdfimages – extract all images
- pdfinfo – get metadata like page sizes, numbers, encryption, and others
- pdfseparate – extract pages
- pdftocairo – convert to PostScript, vector, and bitmap via Cairo, handling aspects of the conversion
- pdftohtml – convert to HTML
- pdftoppm – convert to bitmap
- pdftops – convert to PostScript
- pdftotext – extract text
- pdfunite – merge files
11.2. Repair PDF With pdftocairo and Poppler
In addition to enabling conversions, pdftocairo can be very helpful when it comes to standardizing PDF files:
$ pdftocairo -pdf in.pdf out.pdf
Similar to the output of qpdf --qdf, the -pdf option of pdftocairo reliably produces PDF files with common characteristics such as structure and object specifics.
Because of this and the PDF specification tolerance of the tools, we can employ pdftocairo -pdf as a stable and comprehensive way to fix and repair a problematic PDF.
12. Xpdf Tools
The xpdf PDF reader is usually in the xpdf package, while the xpdf-utils package contains the CLI utilities. Yet, depending on the Linux distribution, this xpdf-utils package can often just be an alias for poppler-utils. In such cases, we can download the original tools directly:
$ wget https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
After that, we can simply unpack the archive with tar, then copy and use the tools as necessary. Since Xpdf provides its own versions of pdftops, pdftotext, pdftohtml, pdfinfo, pdffonts, pdfdetach, pdftoppm, pdftopng, and pdfimages, this approach also avoids conflicts with Poppler.
Further, the Xpdf tools don’t include the coveted pdftocairo, and the features they offer are a subset of their Poppler relatives. Hence, most Linux distributions replace xpdf-utils with poppler-utils.
In this article, we discussed the PDF file format, how to view its contents, as well as tools that can handle and manipulate it under Linux.
In conclusion, while we can open PDF files with a text editor, pre- and postprocessing can be critical for reading all contents and properly performing any edits to leave a valid file.