Linux Optical Character Recognition (OCR) Tools

1. Introduction

In a digital world where information is key, Optical Character Recognition (OCR) technology plays a pivotal role in transforming printed or handwritten text into editable, searchable, and shareable digital content.

OCR is the process that converts an image or Portable Document Format (PDF) of text into machine-readable text format.

For Linux users, there’s a wealth of OCR tools available to choose from, each with its unique features and capabilities.

In this tutorial, we’ll delve into the world of OCR tools tailored for Linux, shedding light on some of the best options available to help us harness the transformative capabilities of text recognition.

2. Tesseract OCR

Tesseract OCR is an open-source optical character recognition engine available for various operating systems.

Originally developed by Hewlett-Packard, Tesseract is now maintained by Google, making it a robust and reliable choice for Linux users. In addition, it supports over 100 languages, which ensures its versatility around the world.

Tesseract’s strength lies in several main points:

accuracy, especially with clear and well-formatted text
handling of a variety of images and document formats
command-line and GUI methods for both beginners and advanced users

Now, let’s delve deeper into the use of this OCR engine.

2.1. Installation

We start by installing the tesseract command on our system via the local package manager.

On Ubuntu, we can use the APT package manager:

$ sudo apt-get install tesseract-ocr

Alternatively, on Arch Linux, we can use Pacman:

$ sudo pacman -S tesseract

Finally, on Fedora Linux, we can employ DNF:

$ sudo dnf install tesseract

After the installation, let’s use Tesseract OCR to extract text from an image. In addition, we’ll explore some of the available options.

2.2. Basic Usage

Tesseract has a basic usage syntax:

$ tesseract input_image output_text_file

The input_image option represents the image or document with text that we want to extract and the output_text_file represents the output text file after extraction.

Assuming we have a sample image called harvard_first_page.png, let’s use Tesseract to extract text from the image:

$ tesseract harvard_first_page.png havard_first_page_text
Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 110

As a result, the command above extracts a text file named havard_first_page_text.txt that we can modify further with any text editor.

2.3. Language

By default, Tesseract uses English for text recognition. However, in case we have text in a different language, we can specify the language using the -l option:

$ tesseract harvard_first_page.png havard_first_page_text_french -l fra

The option above extracts all French text in the input image.

Moreover, we can control the output file type with the -c option:

$ tesseract harvard_first_page.png havard_first_page_text -c PDF

In this case, the output file is in the PDF format.

3. OCRmyPDF

If we’re looking to perform OCR specifically on PDF files, OCRmyPDF is a specialized tool worth considering. It combines the power of OCR with PDF processing capabilities. In fact, OCRmyPDF employs Tesseract under the hood and automates the OCR process for image-only PDF files.

OCRmyPDF preserves the original PDF formatting while adding searchable text layers and is especially valuable for digitizing scanned documents or making existing image-only PDFs more accessible.

3.1. Installation

We can install the ocrmypdf command through pip, Python‘s package manager:

$ pip install ocrmypdf

Of course, we need Python installed before using pip.

3.2. Basic Usage

We can use OCRmyPDF to add searchable text layers to the input PDF:

$ ocrmypdf input.pdf output.pdf

Consequently, the output.pdf file should now have searchable text layers that we can copy and paste.

3.3. Language

OCRmyPDF also works on image-only PDF files in different languages. We just download the appropriate language model from Tesseract’s GitHub repository or any other reliable sources:

$ wget https://github.com/tesseract-ocr/tessdata/raw/main/spa.traineddata -P /usr/share/tesseract-ocr/4.00/tessdata/

In this case, we’re using wget to download the Spanish language model from Tesseract’s GitHub repo.

Now, we can use OCRmyPDF to add searchable OCR text to our Spanish PDF file:

$ ocrmypdf --language spa input.pdf output.pdf

The –language spa flag specifies that we want to use the Spanish language model for OCR.

3.4. Logs

We can also generate a log file during processing that contains information about each page’s OCR results:

$ ocrmypdf --logfile ocrmypdf.log input.pdf output.pdf

The information includes any errors or warnings that we can review to identify any potential issues.

4. gImageReader

gImageReader is a user-friendly, graphical OCR tool for Linux that offers a simple and intuitive interface, making it accessible and beginner-friendly.

It relies on the Tesseract OCR engine but packages it in a user-friendly interface.

gImageReader offers several main features:

open scanned images or PDF files and extract text
provides options for manual corrections
supports multiple languages

Now, let’s go over the basic installation and usage.

4.1. Installation

The tool is available on most Linux distros, and we can install it from the local package manager.

On Debian-based Distros, we can use the APT package manager:

$ sudo apt-get install gimagereader

Alternatively, to install gImageReader on Arch Linux, we can use the Arch User Repository (AUR) with an AUR helper like yay:

$ sudo pacman -S yay

Once yay is installed, we can use it to search for and install gImageReader:

$ yay -S gimagereader

Finally, on Fedora Linux, we can use DNF:

$ sudo dnf install gimagereader

Let’s explore how we can use gImageReader to extract text from an image and then explore some of the features available.

4.2. Usage

After installing gImageReader, we can search and open it from the applications menu by typing gImageReader:

Once open, we’ll see the main window of gImageReader:

We can import one or multiple images from the screen above by clicking on the relevant option from the left sidebar:

Next, we can use the Recognize option on the top navbar to scan and extract text from the image:

Using Recognize option to scan and extract text from the input image

Moreover, we can perform manual corrections by using the find and replace option on the right sidebar:

Finally, to get the final text file, we use the export option on the right sidebar:

Exporting extracted text with gImageReader

gImageReader supports exporting the output in plain text, ODT, or PDF format.

5. CuneiForm

CuneiForm is an open-source OCR tool available to Linux users. While it may not be as well-known as Tesseract, it offers accuracy and multilingual support.

CuneiForm is designed to convert scanned documents and images containing text into machine-readable and editable text. It handles complex layouts well and supports a wide range of fonts and output file formats.

5.1. Installation

Let’s start by installing CuneiForm.

On Debian-based distros, we can use the APT package manager:

$ sudo apt-get install -y cuneiform

On Arch Linux, we can use an Arch User Repository (AUR) package and an AUR helper like yay:

$ yay -S cuneiform-linux

Finally, on Fedora Linux, we can employ the DNF package manager:

$ dnf install cuneiform-qt

After the installation, let’s go over the basic usage.

5.2. Usage

We can use CuneiForm to extract text from an image in the terminal:

$ cuneiform -l <language_code> -o output_file.txt sample_image.png

Of course, we replace <language_code> above with the corresponding language on the image. Currently, CuneiForm supports over 20 different languages.

What’s more, we can change the name of the output file output_file.txt to match any desired name.

We can also employ the -f option to specify the output file:

$ cuneiform -l <language_code> -f pdf -o output.pdf input_image.png

As a result, the output file is in the PDF format.

6. Paperwork

Paperwork is an open-source document manager and note-taking application that helps users organize, search, and manage their documents in digital format.

It simplifies the process of digital organization by offering a comprehensive set of features that enable us to scan, categorize, search, and manage documents.

Paperwork was designed as a scan-and-forget tool, meaning we can scan a new document and forget about it until the day we’ll need it again because it keeps a record.

6.1. Installation

We can install Paperwork on Debian from the APT package manager:

$ sudo apt install paperwork-gtk

On Arch Linux, we can use Pacman to install Paperwork:

$ pacman -S paperwork

Finally, on Fedora Linux, we can use the DNF package manager:

$ sudo dnf -y install paperwork

After installation, let’s find out how we can use Paperwork to extract text from documents.

6.2. Usage

At this point, we can open Paperwork by searching for it in the applications menu:

Alternatively, we can open it by using the paperwork command in the terminal:

$ paperwork

The main Paperwork window shows a left sidebar that contains all the previously scanned documents. Further, the main section shows a preview of the selected document and the top navigation contains options to scan, import, export, and print documents:

Paperwork scans and converts all the text from an image or a scanned document into text that we can select, copy, and paste if necessary. It adds searchable text layers on the document uploaded in the same way as OCRmyPDF:

Paperwork searchable text layers preview

Moreover, we can use the search on the left sidebar to find a specific word or set of words from all previously scanned and imported documents:

Paperwork keyword search across all documents

In fact, Paperwork can highlight all matches of a word or phrase in all documents and pages.

7. Conclusion

In this article, we’ve explored different OCR tools that we can use to convert printed documents into searchable and editable text. Linux users have no shortage of options when it comes to command-line or GUI-based OCR tools.

All the tools we discussed offer distinct features and capabilities making them some of the best OCR tools available for Linux systems. Consequently, the tool to use ultimately depends on personal preferences, level of comfort with different interfaces, and the task at hand.

With these tools at our disposal, we can unlock the power of text recognition on our system. This makes digitization and data extraction more accessible and efficient than ever before.

Full Archive

About Baeldung

Administration

Filesystems

Processes

Files

Scripting

Installation

Networking

Security