OCR Scanning

This post describes how to scan pages from a printed book and convert the images to text using Optical Character Recognition (OCR) technology.

The tools that I use are:

  1. SimpleScan
  2. tesseract

Preparation

SimpleScan is a GUI scanning application that comes pre-installed in many Linux distributions (including Debian Wheezy).

To manually install it on Debian:

 $ sudo apt-get install simple-scan 

tesseract is a command-line OCR program.

To install:

 $ sudo apt-get install tesseract-ocr 

If English is the language used, that is all you need to install. If you require some other language, you must install additional tesseract language packs. Examples are tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, and tesseract-ocr-fra for French.
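As a small sketch of the naming pattern in those three examples, the helper below builds package names from language codes. The function name langpack_packages is my own, and the tesseract-ocr-<code> pattern is an assumption generalized from the packs listed above; confirm that a given package exists in your distribution before installing it.

```shell
# Hypothetical helper: turn 3-character language codes into Debian
# package names, assuming the tesseract-ocr-<code> naming pattern.
langpack_packages() {
  pkgs=""
  for code in "$@"; do
    # Append with a single space between names (no leading space).
    pkgs="${pkgs:+$pkgs }tesseract-ocr-$code"
  done
  printf '%s\n' "$pkgs"
}
```

The output can be passed to the installer, for example: sudo apt-get install $(langpack_packages rus deu fra)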

OCR Procedure

  1. Scan the pages using SimpleScan.
  2. Save the image.
  3. Run the tesseract command:
     $ tesseract OnWritingWell.jpg out
     Tesseract Open Source OCR Engine v3.02 with Leptonica

    The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default txt extension is added to the basename, e.g., out.txt.

    If the language is not English, you need to specify the language on the command line using a 3-character language code (refer to the tesseract man page). The following command specifies the use of three languages: Russian, German and French.

     $ tesseract OnWritingWell.jpg myout  -l rus+deu+fra  
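When several pages have been scanned, the same command can be repeated for each image. The following is a minimal sketch, not part of the procedure above: the function name ocr_pages and the page filenames are hypothetical, and it assumes each image should produce a text file with the same basename.

```shell
# Hypothetical helper: run tesseract on each image given, using the
# image's basename (filename without extension) as the output basename.
# First argument: language code(s), e.g. eng or rus+deu+fra.
ocr_pages() {
  lang="$1"
  shift
  for img in "$@"; do
    # page1.jpg -> page1, so tesseract writes page1.txt
    tesseract "$img" "${img%.*}" -l "$lang" || return 1
  done
}
```

For example, ocr_pages rus+deu+fra page1.jpg page2.jpg would write page1.txt and page2.txt.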

Accuracy

In the above example, there were a total of 734 words. Within the output text file, 119 words (16% of total) required some form of manual correction. This roughly translates to 84% OCR accuracy. The sample size is too small to be scientific or statistically valid. What performance are you getting from OCR?
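To compare your own results, the percentages can be recomputed from the two word counts. This sketch uses the figures from my sample. The variable names are only illustrative, and it treats every word needing correction as one error, which is a simplification.

```shell
total=734    # words in the scanned sample
errors=119   # words needing manual correction

# awk handles the floating-point division that plain shell arithmetic lacks.
error_pct=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", 100 * e / t }')
accuracy_pct=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", 100 * (t - e) / t }')

echo "error rate: ${error_pct}%  accuracy: ${accuracy_pct}%"
```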
