OCR - Optical Character Recognition

Brief:

OCR (Optical Character Recognition) is the recognition of printed or written text characters by a computer. This involves photo scanning of the text character-by-character, analysis of the scanned-in image, and then translation of the character image into character codes, such as ASCII, commonly used in data processing.
Put another way, optical character recognition (OCR) translates images of text, such as scanned documents, into actual text characters. Also known as text recognition, OCR makes it possible to edit and reuse text that is normally locked inside scanned images. OCR uses a form of artificial intelligence known as pattern recognition to identify the individual text characters on a page, including punctuation marks, spaces, and ends of lines.
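
To make the idea of pattern recognition concrete, here is a minimal sketch of the simplest possible approach: comparing a scanned glyph against stored templates and picking the closest one. The 3x3 templates, the TemplateMatcher class, and the Recognize method are invented for this illustration. Real OCR engines use far larger glyphs and much more robust techniques.

```csharp
using System;
using System.Collections.Generic;

public static class TemplateMatcher
{
    // Hypothetical 3x3 glyph templates, stored row by row.
    // '#' is an ink pixel, '.' is background.
    private static readonly Dictionary<char, string> Templates =
        new Dictionary<char, string>
        {
            { 'I', ".#..#..#." },
            { 'L', "#..#..###" },
            { 'T', "###.#..#." }
        };

    // Returns the character whose template differs from the glyph
    // in the fewest pixels.
    public static char Recognize(string glyph)
    {
        if (glyph == null || glyph.Length != 9)
            throw new ArgumentException("Expected a 3x3 glyph (9 pixels).");

        char best = '?';
        int bestDistance = int.MaxValue;

        foreach (KeyValuePair<char, string> template in Templates)
        {
            int distance = 0;
            for (int i = 0; i < glyph.Length; i++)
            {
                if (glyph[i] != template.Value[i]) distance++;
            }

            if (distance < bestDistance)
            {
                bestDistance = distance;
                best = template.Key;
            }
        }
        return best;
    }
}
```

For example, TemplateMatcher.Recognize("#..#..##.") returns 'L' even though one pixel is missing, and casting the result to int gives its character code, 76 in ASCII.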

Need:

What should you do if you want to convert scanned papers, books, and documents into electronic files such as Word documents, PDFs, or plain text? You need optical character recognition, usually abbreviated to OCR, which translates scanned images of handwritten, typewritten, or printed text into machine-encoded text.

OCR technology makes it possible to search for a word or phrase in an image, a scanned PDF, or another non-editable file. Once you have extracted the text from an image, you can store it more compactly, send it to a translation service such as Google Translate, feed it to a text-to-speech engine, or publish the content on your blog or website.

How It Works:

Optical Character Recognition (OCR) extracts text and layout information from document images. With the Microsoft Office Document Imaging (MODI) library, which is included in the Office 2007 package, you can integrate OCR functionality into your own applications. In combination with the MODI Document Viewer control, you get complete OCR support with only a few lines of code.

Both Office 2007 and Windows Vista support MODI. It is not installed by default, but you can add it through the installation options of Office 2007: rerun setup.exe from your Office installation and select the package.

OCR is only one step in document processing. To make full use of the information in your paper-based documents, several steps and techniques are usually required:

Scanning

Before documents are available as images, they have to be digitized. This process is called 'scanning.' There are two important standards used for interacting with the scanning hardware: TWAIN and WIA.

Although scanning devices keep getting better, several pre-processing methods can still improve image quality before recognition, for instance noise reduction and angle correction (deskewing).
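
As an illustration of noise reduction, the following sketch applies a 3x3 median filter to a grayscale image held in a two-dimensional byte array. This is one common technique, shown here only as an example. The Preprocessing class and the array representation are assumptions made for this sketch and are not part of MODI.

```csharp
using System;

public static class Preprocessing
{
    // Applies a 3x3 median filter to a grayscale image stored as [row, column].
    // Border pixels are copied unchanged to keep the sketch short.
    public static byte[,] MedianFilter3x3(byte[,] image)
    {
        int height = image.GetLength(0);
        int width = image.GetLength(1);
        byte[,] result = (byte[,])image.Clone();
        byte[] window = new byte[9];

        for (int y = 1; y < height - 1; y++)
        {
            for (int x = 1; x < width - 1; x++)
            {
                // Collect the pixel and its eight neighbours.
                int i = 0;
                for (int dy = -1; dy <= 1; dy++)
                {
                    for (int dx = -1; dx <= 1; dx++)
                    {
                        window[i++] = image[y + dy, x + dx];
                    }
                }

                // The median is the middle of the nine sorted values.
                Array.Sort(window);
                result[y, x] = window[4];
            }
        }
        return result;
    }
}
```

A single dark speck on a white background is replaced by the value of its neighbours, while larger structures such as character strokes are mostly preserved.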

OCR Itself

As the next step, OCR itself interprets pixel-based images and turns them into layout and text elements. OCR can be called the 'highest' bottom-up technology: the system has little or no knowledge of the business context. Recognizing handwritten documents is often called ICR (Intelligent Character Recognition).

Document Classification

In most business cases, you have certain target structures you want to fill with the document information. That is called 'Document Classification and Detail Extraction.' For instance, you might want to process invoices, or you have certain table structures to fill. In Document Processing Part II, you can see how this kind of content knowledge can be used.
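
A very simple form of classification can be sketched by counting keywords in the recognized text. The DocumentClassifier class, the class names, and the keyword lists below are hypothetical and chosen only for illustration. A production system would typically rely on layout information, business rules, or trained models.

```csharp
using System;
using System.Collections.Generic;

public static class DocumentClassifier
{
    // Hypothetical keyword lists per document class.
    private static readonly Dictionary<string, string[]> Keywords =
        new Dictionary<string, string[]>
        {
            { "Invoice", new[] { "invoice", "amount due", "vat" } },
            { "Order", new[] { "purchase order", "quantity", "delivery date" } }
        };

    // Returns the class with the most matching keywords,
    // or "Unknown" if no keyword is found at all.
    public static string Classify(string ocrText)
    {
        string text = ocrText.ToLowerInvariant();
        string bestClass = "Unknown";
        int bestScore = 0;

        foreach (KeyValuePair<string, string[]> entry in Keywords)
        {
            int score = 0;
            foreach (string keyword in entry.Value)
            {
                if (text.Contains(keyword)) score++;
            }

            if (score > bestScore)
            {
                bestScore = score;
                bestClass = entry.Key;
            }
        }
        return bestClass;
    }
}
```

Once a document has been classified, for example as an invoice, the detail extraction step knows which fields to look for in the OCR result.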

To use MODI, you first need to add a reference to the library to your project:
Microsoft Office Document Imaging 11.0 Type Library (located in MDIVWCTL.DLL). Supported image formats are TIFF, multi-page TIFF, and BMP.

Create a Document Instance and Assign an Image File
       // _MODIDocument is a field of type MODI.Document;
       // filename is the path to a TIFF or BMP image.
       _MODIDocument = new MODI.Document();
       _MODIDocument.Create(filename);


The OCR process is started by the Document.OCR method.
A call to Document.OCR processes all pages contained in the document. You can also run OCR on each page separately by calling the Image.OCR method in exactly the same way. The OCR method takes three parameters:
  • Language
  • AutoRotation
  • StraightenImages
Working on the result structure is pretty straightforward. If you just want to use the full text, you simply need the image's Layout.Text property.
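
Putting the steps together, a minimal sketch might look like the following. It assumes the project references the MODI type library as a COM component and that the image contains English text. The ModiOcrExample class and the ReadFirstPageText method are names invented for this sketch, and error handling is reduced to closing the document.

```csharp
using System;

public static class ModiOcrExample
{
    // Runs OCR on an image file and returns the text of its first page.
    // Requires a COM reference to the Microsoft Office Document Imaging Type Library.
    public static string ReadFirstPageText(string filename)
    {
        MODI.Document document = new MODI.Document();
        try
        {
            document.Create(filename);

            // Parameters: language, auto-rotation, straighten images.
            document.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Each page of the document is exposed as a MODI.Image.
            MODI.Image image = (MODI.Image)document.Images[0];
            return image.Layout.Text;
        }
        finally
        {
            // Close without saving changes back to the image file.
            document.Close(false);
        }
    }
}
```

For a multi-page TIFF, you would loop over document.Images instead of reading only the first entry.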
