Thomas Fischer's Weblog

Life, Linux, LaTeX

Archive for November 2008

Searchable PDFs with Linux


Recently, I came across a news posting about ArchivistaBox 2008/IX, an open source document management system that can create searchable PDFs from scanned documents. The core components of this software package are Cuneiform (an OCR system) and hocr2pdf (a special PDF generator from ExactCODE).

Using these two programs (both are open source; Cuneiform is BSD-licensed, and hocr2pdf, which is part of ExactImage, is GPL-2), anyone can generate searchable PDFs, as I will demonstrate in the following example.

Lacking a scanned document, I created a LaTeX document using a sample text from Project Gutenberg and generated a TIFF file using Ghostscript:

pdflatex mammalia.tex
gs -r320 -dBATCH -dNOPAUSE -sOutputFile=mammalia.tiff -sDEVICE=tiffgray mammalia.pdf

Tip: When scanning or generating TIFF images, try different image resolutions to find one where the recognition rate is sufficient and the image size is still acceptably small.
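
The tip above can be turned into a small experiment. The following is a minimal sketch, assuming Ghostscript is installed as gs. The function names tiff_name and render_at, the output naming scheme, and the resolution values in the usage comment are illustrative assumptions. -dNOPAUSE is passed so that gs does not wait for a keypress after each page.

```shell
# Build the output name for a given PDF and resolution,
# e.g. mammalia.pdf at 200 dpi -> mammalia-200.tiff (naming scheme is an assumption).
tiff_name() {
    printf '%s-%s.tiff\n' "${1%.pdf}" "$2"
}

# Render a PDF ($1) to a grayscale TIFF at the given resolution ($2, in dpi)
# and print the name of the file that was written.
render_at() {
    out=$(tiff_name "$1" "$2")
    gs -r"$2" -dBATCH -dNOPAUSE -sOutputFile="$out" -sDEVICE=tiffgray "$1" >/dev/null || return 1
    printf '%s\n' "$out"
}

# Usage: list the file sizes, then run the OCR step on each TIFF and compare.
# for dpi in 150 200 300; do ls -l "$(render_at mammalia.pdf "$dpi")"; done
```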

Generating a searchable PDF is a two-step process. First, cuneiform is used to generate a special HTML document that records where letters and words are located on the TIFF image.
This HTML document uses the suffix .hocr:

cuneiform -f hocr -o mammalia.hocr mammalia.tiff

Tip: cuneiform can also write its output in other formats such as normal HTML or plain text. Run cuneiform -f to get a list of formats.
Tip: When linked against ImageMagick, cuneiform can read a large number of image formats, not only TIFF.
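
As a rough illustration of the first tip, the same invocation can target another output format. This is a sketch only: the helper name ocr_as is hypothetical, and the format name passed to it should be taken from the list that cuneiform -f prints on your system (the name text in the usage comment is an assumption).

```shell
# Run cuneiform with a caller-chosen output format.
# $1 = format name as listed by "cuneiform -f"
# $2 = output file
# $3 = input image
ocr_as() {
    cuneiform -f "$1" -o "$2" "$3"
}

# Usage (the format name "text" is an assumption, check the list first):
# ocr_as text mammalia.txt mammalia.tiff
```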

Once mammalia.hocr has been generated, the searchable PDF document is generated using hocr2pdf:

hocr2pdf -i mammalia.tiff -o mammalia-ocr.pdf <mammalia.hocr

Here, the TIFF image is used for the PDF’s visual content, but when you search for text, the meta information from the .hocr file is used to find and highlight the search hits in the document.
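
Both steps can be wrapped in one helper. This is a minimal sketch, assuming cuneiform and hocr2pdf are on the PATH. The function name searchable_pdf and the way output names are derived from the image name are illustrative assumptions, not part of either tool.

```shell
# Turn one image into a searchable PDF using the two steps described above.
# $1 = input image, e.g. mammalia.tiff -> mammalia.hocr and mammalia-ocr.pdf
searchable_pdf() {
    image=$1
    base=${image%.*}    # strip the file extension
    # Step 1: OCR. Tool messages are sent to stderr to keep stdout clean.
    cuneiform -f hocr -o "$base.hocr" "$image" >&2 || return 1
    # Step 2: combine the image and the hOCR data into the PDF.
    hocr2pdf -i "$image" -o "$base-ocr.pdf" < "$base.hocr" >&2 || return 1
    # Print the name of the resulting PDF.
    printf '%s\n' "$base-ocr.pdf"
}

# Usage: searchable_pdf mammalia.tiff
```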

The above example is rather artificial, as the generated TIFF image is of much better quality than a typical scanned document. If results degrade with real scans (not all letters are recognized, or some word boundaries are detected incorrectly), you may want to try hocr2pdf's optional switch -s, which uses a sloppier approach to detecting words.
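
To judge whether -s helps for a given scan, one option is to build both variants from the same .hocr file and compare search results in a PDF viewer. This is a minimal sketch: the function name compare_sloppy and the -sloppy file suffix are hypothetical.

```shell
# Build a normal and a "sloppy" PDF from one image and its existing .hocr file.
# $1 = input image; the matching .hocr file is expected next to it.
compare_sloppy() {
    image=$1
    base=${image%.*}    # strip the file extension
    # Default word handling
    hocr2pdf -i "$image" -o "$base-ocr.pdf" < "$base.hocr" || return 1
    # Same input with the -s switch
    hocr2pdf -s -i "$image" -o "$base-ocr-sloppy.pdf" < "$base.hocr" || return 1
}

# Usage: compare_sloppy scan.tiff, then search for the same word in both PDFs.
```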

Now you can use the above tools to run your own document management system at home, e.g. to scan incoming letters. Happy OCRing… 🙂

Note: Gentoo Linux users can use ebuilds from bug reports for cuneiform and exactimage.

Written by Thomas Fischer

November 26, 2008 at 22:23

Posted in Linux
