Thomas Fischer's Weblog

Life, Linux, LaTeX

Searchable PDFs with Linux

with 15 comments

Recently, I came across a news posting announcing ArchivistaBox 2008/IX, an open-source document management system that can create searchable PDFs from scanned documents. Its core components are Cuneiform (an OCR system) and hocr2pdf (a specialized PDF generator from ExactCODE).

Using these two programs (both licensed under the GPL-2), anyone can generate searchable PDFs, as I will demonstrate in the following example.

Lacking a scanned document, I created a LaTeX document using a sample text from Project Gutenberg and generated a TIFF file using GhostScript:

pdflatex mammalia.tex
gs -r320 -dBATCH -dNOPAUSE -sOutputFile=mammalia.tiff -sDEVICE=tiffgray mammalia.pdf

Tip: When scanning or generating TIFF images, try different resolutions to find one where the recognition rate is sufficient and the file size is still acceptably small.
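
To make that comparison easier, here is a rough sketch (not part of the original workflow) that renders the same PDF at several resolutions; you can then run cuneiform on each TIFF to judge recognition quality against file size:

# Render mammalia.pdf at several resolutions and list the resulting file sizes
for dpi in 150 200 300 320; do
  gs -r${dpi} -dBATCH -dNOPAUSE -sOutputFile=mammalia-${dpi}.tiff -sDEVICE=tiffgray mammalia.pdf
done
ls -lh mammalia-*.tiff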

Generating a searchable PDF is a two-step process. First, cuneiform generates a special HTML document (in hOCR format) that records where letters and words are located on the TIFF image.
This HTML document uses the suffix .hocr:

cuneiform -f hocr -o mammalia.hocr mammalia.tiff
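
If you are curious what hocr2pdf will later consume, you can peek at the bounding boxes stored in the generated file. A small sketch (the exact hOCR markup varies between cuneiform versions, but the bbox entries give pixel coordinates on the TIFF):

# Show the first few bounding boxes cuneiform wrote into the hOCR file
grep -o "bbox [0-9 ]*" mammalia.hocr | head -n 5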

Tip: cuneiform can write its output in various other formats, such as plain HTML or plain text. Run cuneiform -f to get a list of supported formats.
Tip: When linked against ImageMagick, cuneiform can read a large number of image formats, not just TIFF.

Once mammalia.hocr is available, the searchable PDF is created with hocr2pdf:

hocr2pdf -i mammalia.tiff -o mammalia-ocr.pdf <mammalia.hocr

Here, the TIFF image is used for the PDF’s visual content, but when you search for text, the meta information from the .hocr file is used to find and highlight the search hits in the document.

The above example is rather artificial, as the generated TIFF image has much better quality than a typical scanned document. If recognition results degrade (not all letters are recognized and some word boundaries are detected incorrectly), you may want to try hocr2pdf's optional -s switch, which uses a sloppier approach to detecting words.
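
For reference, the same call with sloppy word detection enabled differs only in the additional -s switch:

# Same as above, plus -s for sloppier word detection
hocr2pdf -s -i mammalia.tiff -o mammalia-ocr.pdf <mammalia.hocr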

Now you can use the above tools to run your own document management system at home, e.g. to scan incoming letters. Happy OCRing… 🙂
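
As a starting point, a minimal scan-to-searchable-PDF sketch could look like the following. It assumes a SANE-supported scanner; note that the scanimage option names and values (resolution, mode) depend on your scanner's backend:

# Scan one page to a grayscale TIFF (option names depend on the SANE backend),
# then OCR it and wrap the image plus text layer into a searchable PDF
scanimage --resolution 300 --mode Gray --format=tiff > letter.tiff
cuneiform -f hocr -o letter.hocr letter.tiff
hocr2pdf -i letter.tiff -o letter.pdf <letter.hocr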

Note: Gentoo Linux users can use ebuilds from bug reports for cuneiform and exactimage.

Written by Thomas Fischer

November 26, 2008 at 22:23

Posted in Linux

15 Responses

  1. This seems to work as you describe but only does the first page of my document. I can’t see any way to specify to cuneiform which pages to process. Am I missing something?

    Jonathan

    July 22, 2009 at 2:57

  2. For more than one page you’ll need batch processing (shell scripts).

    I wrote an article about that; you can find it with a search engine using the keywords ‘linux ocr and pdf problem solved’ (it seems I’m not allowed to post links here).

    Konrad Voelkel

    March 6, 2010 at 12:30

  3. Rodrigo Torres

    September 1, 2010 at 18:56

  4. Ubuntu – from Konsole –

    $ cuneiform -f hocr -o scan-0001.hocr scan-0001.tiff
    Cuneiform for Linux 0.7.0
    PUMA_XFinalrecognition failed.

    Any idea what the interesting error is saying?

    Thank you

    Barry Smith

    September 7, 2011 at 20:45

  5. Answered my own question… in a trial-and-error way.
    scan-0001.tiff was made at 600x600DPI.
    Created scan-0002.tiff at 150x150DPI, and it worked.

    QED.

    Barry Smith

    September 7, 2011 at 22:37

  6. New Question:
    Mr. Torres responded about extracting text from a PDF on a multi-page document. That text file can’t be used in your cuneiform & hocr2pdf process, can it?

    This time, however, I want to work on a multi-page document to create a single searchable PDF.

    An example: my resume is 23 pages long in .doc format for all of the silly recruiters out there.
    If I create a 150 dpi image file for each page and run each file through your cuneiform & hocr2pdf process, I’m left with 23 PDFs that cannot be merged… aren’t I? Each .hocr file would refer to a single-page document.

    Another wrinkle: I’m still working on KDE, but I have installed GNOME tools after switching to kdm-KDE, and they are working. xsane is working for scanning… I was able to scan a single sheet to TIFF and use your process above. Yet how do I convert the multi-page .doc to a searchable PDF, and then chain the single-sheet PDF (xsane, cuneiform, hocr2pdf) followed by the 23 individual PDFs of my resume?
    While waiting for this complex answer, I’ll continue to ponder, as I did above… but if you have the answer, please share. 🙂

    Thank you again,
    Barry

    Barry Smith

    September 8, 2011 at 11:50

  7. If you have the .doc file, you can easily create PDF files using LibreOffice or OpenOffice. If you have a multi-page PDF file that basically consists of scanned images and is not searchable, you can use the following Bash script:

    #!/bin/bash
    # Split a scanned PDF into pages, OCR each page, and merge the results
    # into a single searchable PDF.
    TEMPDIR=$(mktemp -d)
    INPUTPDF="$1"
    OUTPUTPDF="${INPUTPDF/.pdf/-index.pdf}"
    
    # Render each page of the input PDF to a grayscale TIFF
    gs -r320 -dBATCH -dNOPAUSE -sOutputFile="${TEMPDIR}/page%05d.tiff" -sDEVICE=tiffgray "${INPUTPDF}" || exit 1
    for tiff in "${TEMPDIR}"/page*.tiff ; do
      hocr=${tiff/.tiff/.hocr}
      pdf=${tiff/.tiff/.pdf}
      # OCR the page, then place the recognized text underneath the page image
      cuneiform -f hocr -o "${hocr}" "${tiff}" && \
      hocr2pdf -i "${tiff}" -o "${pdf}" <"${hocr}" || \
      exit 2
    done
    # Concatenate the per-page PDFs into the final searchable document
    pdftk "${TEMPDIR}"/page*.pdf cat output "${OUTPUTPDF}"
    
    rm -rf "${TEMPDIR}"
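
    Assuming the script is saved as, say, ocr-pdf.sh (the file name is arbitrary), it takes the scanned PDF as its only argument:

    bash ocr-pdf.sh scanned.pdf

    and writes the searchable result to scanned-index.pdf.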
    

    Thomas Fischer

    November 27, 2011 at 18:00

  8. I’m impressed, I must say. Seldom do I encounter a blog that’s equally educative and entertaining, and let me tell
    you, you’ve hit the nail on the head. The problem is something that too few people are speaking intelligently about. Now I’m
    very happy that I found this during my hunt for something concerning this.

    Burgundy.Cmmt.Ubc.Ca

    January 21, 2013 at 1:45

  9. I am in fact grateful to the owner of this site, who has shared this fantastic piece
    of writing at this time.

    http://www.bundespressecamp.De

    January 21, 2013 at 5:25

  10. It participates fully in the development strategy of Bouygues Conservatoire,
    to which it contributes by seeking innovative solutions, from an aesthetic as
    well as an environmental point of view.

    Recently, several articles have appeared on the costs and constraints of passive buildings.
    Excerpts.

    « A demanding prerequisite for positive-energy buildings,
    passive construction is cheaper in France than in Belgium.

    l'袨elle

    April 4, 2013 at 5:48

  11. , he rubbed the lighter's flint, took a drag, blowing smoke toward Katrin into the halfling's face.
    He smoked Camels, though.
    He was silent for a while. Two fake Spetsnaz men took up positions at his sides,
    the other two disappeared. Frodo could not turn his head; he could only try to
    glance sideways, producing a terrible squint.

    – Did they come? – Kirpiczew threw out to the side, not taking his eyes off.

    Katrin

    May 15, 2013 at 21:23

  12. Thanks for sharing your thoughts about Buy Youtube Subscribers.
    Regards

    Maple

    May 30, 2013 at 16:52

  13. […] there are some guides and scripts on the subject, as well as a live CD created just for this purpose. However, they are all techniques […]

  14. Thank you for your post. I had a similar need and came across a Linux command-line tool that seems to be superior to most others in terms of reliability and accuracy (especially the placement of the text below the image): https://github.com/fritz-hh/OCRmyPDF

    It requires quite a few dependencies, but the tool warns the user if the dependencies are not yet installed.

    According to the developer:

    Main features
    --------

    - Generates a searchable PDF/A file from a PDF file containing only images
    - Places OCRed text accurately below the image to ease copy / paste
    - Keeps the exact resolution of the original embedded images
    - or, if requested, oversamples the images before OCRing so as to get better results
    - If requested, deskews and/or cleans the image before performing OCR
    - Validates the generated file against the PDF/A specification using jhove
    - Provides a debug mode to enable easy verification of the OCR results
    - Processes several pages in parallel if more than one CPU core is available

    Motivation
    ----------

    I searched the web for a free command-line tool to OCR PDF files on Linux/Unix:
    I found many, but none of them were really satisfying.
    - Either they produced PDF files with misplaced text under the image (making copy/paste impossible)
    - Or they did not correctly display some escaped HTML characters located in the hOCR file produced by the OCR engine
    - Or they changed the resolution of the embedded images
    - Or they generated PDF files of ridiculously big size
    - Or they crashed when trying to OCR some of my PDF files
    - Or they did not produce valid PDF files (even though they were readable with my current PDF reader)
    - On top of that, none of them produced PDF/A files (a format dedicated to long-term storage / archiving)

    … so I decided to develop my own tool (using various existing scripts as an inspiration)
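
    A minimal invocation sketch (assuming a recent OCRmyPDF release that installs an ocrmypdf command; the entry point and flags of the early shell-script version may differ):

    # OCR a scanned PDF, deskewing pages first, and write a searchable PDF/A
    ocrmypdf --deskew input.pdf output.pdf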

    itsme

    January 11, 2014 at 22:14

  15. fuck yeah

    Thalia Lomedico

    March 6, 2016 at 12:08

