Why Optical Character Recognition Holds Back Progress

By Adam Torab

Optical Character Recognition (OCR) has been around a long time. So long, in fact, that it has become mired in confusion and mismatched expectations.

Most organizations still use a lot of paper and PDF documents. Some industries rely on paper more than others, but all of us are a long way from eliminating the exchange of information through documents and forms.

Enterprise OCR, as provided by “OCR engines” such as Tesseract, ABBYY, OmniPage, AnyDoc, Transym, Azure, Google, and others, is only a small part of what organizations really need.

While OCR is still a relevant technology, the secret to getting accurate text from documents isn’t crafting a better recognition engine.

I talk to people every day who are looking for better OCR. What they’re really looking for is a better way to get accurate information out of the data trapped in documents. That takes two things: dealing with OCR killers, and machine reading.

Why Imperfect Documents Kill OCR

Take a skewed document image, for example. Independent research and our own observations indicate that you’ll get, at best, about 40% recognition accuracy. In practice you’ll probably get less, which is pretty much worthless. Errors in dates, numbers, amounts, names, and similar fields make the data hard to trust.

Document skewing isn’t the only OCR killer. Poor scan quality, hole punches, pictures, data in tables, mismatched font types, and text that spans pages all wreak havoc on OCR. And these are just some of the complications. If the solution is manual review, that’s simply too time-consuming and expensive for projects with hundreds of thousands or millions of pages.

Machine Reading

Machine reading is today what OCR was when it first went mainstream: a genuine step forward. It’s innovative because it combines powerful data science and image processing tools that strike back at OCR killers.

Machine reading solutions combine OCR that is more than twice as accurate with intelligent integration of the information contained within documents. It’s the heart of modern data integration from physical and electronic documents.

If your organization is performing OCR with off-the-shelf OCR products or older document capture solutions, you’re missing out on the full potential of the information contained within your documents.

And if human review is a central component of your workflows, good news!

You’re perfectly positioned to make a technology change that will free up an immense amount of valuable resources.

This article originally appeared on BIS.