OCR Scanning
This post describes how to scan pages from a printed book and convert the image to text using Optical Character Recognition (OCR) technology.
The tools that I use are:
- SimpleScan
- tesseract
Preparation
SimpleScan is a GUI scanning application that comes pre-installed in many Linux distributions (including Debian Wheezy).
To manually install it on Debian:
$ sudo apt-get install simple-scan
tesseract is a command-line OCR program.
To install:
$ sudo apt-get install tesseract-ocr
If English is the language used, that is all you need to install. If you require another language, you must install additional tesseract language packs. Examples are tesseract-ocr-rus for Russian, tesseract-ocr-deu for German, and tesseract-ocr-fra for French.
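The three package names above share a tesseract-ocr-<code> pattern. As a small sketch, the helper below builds a package name from a language code. The function name lang_package is my own invention for illustration, and I am assuming the pattern holds for the language you need, which may not be true for every pack, so check with apt-cache search tesseract-ocr first.

```shell
# Hypothetical helper: build a Debian package name from a tesseract
# language code, following the pattern seen in the examples above.
# Assumption: the pack for your language is named tesseract-ocr-<code>.
lang_package() {
  printf 'tesseract-ocr-%s\n' "$1"
}

# Example use (not run here, because it needs root):
#   sudo apt-get install "$(lang_package rus)" "$(lang_package deu)"
```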
OCR Procedure
- Scan the pages using SimpleScan.
- Save the image.
- Run the tesseract command:
$ tesseract OnWritingWell.jpg out
Tesseract Open Source OCR Engine v3.02 with Leptonica
The first parameter is the input image filename. The second parameter is the desired basename of the output text file. The default txt extension is added to the basename, e.g., out.txt.
If the language is not English, you need to specify the language on the command line using a 3-character language code (refer to the tesseract man page). The following command specifies the use of three languages: Russian, German and French.
$ tesseract OnWritingWell.jpg myout -l rus+deu+fra
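When a scan produces many pages, the same command can be run in a loop. The sketch below is one way to do that, not something from my original workflow: the function names ocr_basename and ocr_all are made up for illustration, and it assumes the pages are saved as .jpg files in a single directory. Each page.jpg produces a page.txt next to it.

```shell
# Hypothetical helper: strip the final extension from an image filename
# to get the output basename that tesseract expects.
ocr_basename() {
  printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR every .jpg file in a directory.
#   $1 = directory holding the scanned pages
#   $2 = language code(s), e.g. rus+deu+fra (defaults to eng)
ocr_all() {
  dir="$1"
  lang="${2:-eng}"
  for image in "$dir"/*.jpg; do
    [ -e "$image" ] || continue   # nothing matched the pattern
    tesseract "$image" "$(ocr_basename "$image")" -l "$lang"
  done
}
```

For example, ocr_all scans rus+deu+fra would process every JPEG under a directory named scans using the three languages from the command above.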
Accuracy
In the above example, there were a total of 734 words. Within the output text file, 119 words (16% of total) required some form of manual correction. This roughly translates to 84% OCR accuracy. The sample size is too small to be scientific or statistically valid. What performance are you getting from OCR?
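If you want to compare your own numbers, the accuracy figure is simple arithmetic: words that needed no correction divided by total words. Here is a minimal sketch using awk; the function name accuracy_percent is hypothetical, and counting the words that need correction is still a manual step.

```shell
# Hypothetical helper: word-level OCR accuracy as a percentage.
#   $1 = total words in the page
#   $2 = words that needed manual correction
accuracy_percent() {
  LC_ALL=C awk -v total="$1" -v errors="$2" \
    'BEGIN { printf "%.1f\n", (total - errors) * 100 / total }'
}

# The figures from this post: 734 words, 119 needing correction.
accuracy_percent 734 119   # prints 83.8
```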