Optical Character Recognizer.
From an image of a paper document, this software produces an electronic version (often plain text only).
The output is better presented when OCR is combined with Document Analysis techniques.
Many commercial OCR systems have a character error rate of less than 1/100. Since a line of text holds on the order of 100 characters, that still means nearly one error per text line.
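The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the function names and the figure of 80 characters per line are assumptions, not values from any OCR product, and errors are assumed to be independent from one character to the next.

```python
def expected_errors_per_line(char_error_rate: float, chars_per_line: int) -> float:
    """Average number of misrecognized characters on one line."""
    return char_error_rate * chars_per_line


def prob_line_has_error(char_error_rate: float, chars_per_line: int) -> float:
    """Probability that a line contains at least one error,
    assuming errors are independent between characters."""
    return 1.0 - (1.0 - char_error_rate) ** chars_per_line


if __name__ == "__main__":
    rate = 1 / 100   # the "less than 1/100" bound quoted above
    width = 80       # assumed line length, in characters
    print(expected_errors_per_line(rate, width))
    print(prob_line_has_error(rate, width))
```

With these assumed figures, the expected count is a little under one error per line, and more than half of all lines contain at least one error.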
OCR errors can be:
- confusion: a character (more precisely, a glyph) is recognized as another one
- insertion: a glyph is added where there should be none
- deletion: a glyph present in the image is missing from the output.
Typical errors include the replacement of "m" by "rn" and the confusion of the lowercase L with the digit 1.
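One way to count the three kinds of error is to align the OCR output with a known-correct transcription using an edit-distance (Levenshtein) alignment. The sketch below is an illustration, not the method of any particular OCR product; the function name count_ocr_errors is hypothetical, and the comparison is made on characters, not on glyph shapes.

```python
def count_ocr_errors(truth: str, ocr: str) -> dict:
    """Count confusions, insertions and deletions in `ocr` relative to
    `truth`, along one minimum edit-distance alignment."""
    n, m = len(truth), len(ocr)
    # dist[i][j] = edit distance between truth[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or confusion
                dist[i - 1][j] + 1,         # deletion (glyph lost)
                dist[i][j - 1] + 1,         # insertion (extra glyph)
            )

    # Walk back from the end of both strings to classify each edit.
    counts = {"confusion": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if truth[i - 1] == ocr[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["confusion"] += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


if __name__ == "__main__":
    print(count_ocr_errors("modern", "rnodern"))
    # → {'confusion': 1, 'insertion': 1, 'deletion': 0}
```

Note that at the character level the "m" read as "rn" shows up as one confusion plus one insertion, which is why error statistics depend on whether they are counted per character or per glyph.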
The most common OCR packages are FineReader, TextBridge and OmniPage. They all also segment the scanned images they are given into blocks of different media (at least text, table and image).