I remember when OCR technology ran on a full-length computer board, not just a chip but an entire full-length board. It wasn’t until the Pentium 90 that OCR got fast enough to run on a “modern” PC.
Today, it’s all about speed and taking advantage of modern architectures. CPUs still get fully utilized during OCR processing, and we can put every resource at our disposal to work (think idle computer labs or servers).
Years ago, you OCR’d the whole page and ran fuzzy searches to find your data, or you grabbed a subset of the data for archival purposes. But you never trusted the OCR output unless you could validate it against a database, and there just weren’t many sources available to validate against. You also OCR’d only what you needed, because most systems charged extra for full-text OCR.
First, What is OCR?
Let’s back up a bit, though, and explore what OCR software is. At its core, it’s a machine learning algorithm that tries to match letters and words from pixels. Initially, it worked only from black and white pixels, which isn’t a lot of data to work with!
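To make that concrete, here is about the smallest possible OCR call, using the open source Tesseract engine through pytesseract (an assumption for the examples in this post, not the engine we ship): pixels in, guessed text out.

```python
# Smallest possible OCR example: pixels in, guessed text out.
# Uses the open source Tesseract engine via pytesseract; the file name
# is just a placeholder.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)
```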
We’ve been doing this so long, we’ve figured out some other things about OCR as well: machine learning algorithms do better when they have less to sort through. We’ve patented a process that takes advantage of that premise: OCR the document, remove what the engine was confident about, and OCR what’s left again. We actually get better results with this process. Segment the document into smaller chunks, and OCR works better too. The same OCR engine will yield different results with these methods. This is what 35 years of experience gets you.
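Here is a rough sketch of that first idea, again using the open source Tesseract engine via pytesseract. To be clear, this is only an illustration of the general approach, not our patented process, and the confidence cutoff is an arbitrary number chosen for the example.

```python
# Two-pass OCR sketch: keep what the engine was confident about, blank
# those words out of the image, and OCR what's left again.
import pytesseract
from PIL import Image, ImageDraw

CONFIDENCE_CUTOFF = 85  # assumption: treat anything at or above 85 as "confident"

def two_pass_ocr(path):
    image = Image.open(path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    confident_words = []
    masked = image.copy()
    draw = ImageDraw.Draw(masked)

    for i, word in enumerate(data["text"]):
        if word.strip() and float(data["conf"][i]) >= CONFIDENCE_CUTOFF:
            confident_words.append(word)
            # Blank out the confident word so the second pass only sees the hard parts.
            box = (data["left"][i], data["top"][i],
                   data["left"][i] + data["width"][i],
                   data["top"][i] + data["height"][i])
            draw.rectangle(box, fill="white")

    # Second pass runs on a much emptier page.
    second_pass_text = pytesseract.image_to_string(masked)
    return confident_words, second_pass_text
```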
How Accurate is OCR?
Interestingly enough, the OCR engines themselves haven’t changed a whole lot. The same documents still yield the same results. You hear “95% accurate” and higher all the time, so the results are the results. The problem with that 5% error rate is the cost of fixing the errors. There’s a data entry theory around the cost of errors: fixing an error takes roughly ten times the effort of keying the field correctly in the first place.
So when a 5% error rate translates to 50% of the data entry labor, that’s significant. Any operation running at standard manual data entry error rates, somewhere between 1% and 5%, is spending up to half of its labor fixing errors. The alternative? Double-blind keying, which DOUBLES the data entry effort.
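The arithmetic behind that is simple enough to write down. This is a back-of-the-envelope sketch assuming the ten-to-one correction cost above, which is an illustration rather than a measured figure from any one operation.

```python
# Back-of-the-envelope cost of errors: extra correction labor relative
# to the base keying labor, assuming a 10x correction-cost factor.
def correction_labor_share(error_rate, correction_cost_factor=10):
    return error_rate * correction_cost_factor

for rate in (0.01, 0.03, 0.05):
    share = correction_labor_share(rate)
    print(f"{rate:.0%} error rate -> {share:.0%} of base labor spent fixing errors")
```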
What has changed is the preparation of the document before it ever goes to the OCR engine.
What’s New in OCR?
Most companies are using the same few algorithms (some even use open source) for image cleanup. Open source is great for keeping costs down, but it isn’t necessarily the best in every domain, and that’s especially true in the imaging/OCR domain. So we took a cue from one of the last real innovations in the industry.
We scan in color so we can run better algorithms, get cleaner images, and end up with much better OCR. We wrote our own library of over 70 image cleanup algorithms; the most recent ones are for preparing microfilm. We took the research around computer vision and applied it to our industry domain.
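To give a flavor of what that kind of cleanup looks like, here is a generic example using OpenCV. It is not our library, just one of the kinds of steps such a pipeline includes: start from the color scan, denoise, then binarize adaptively so uneven lighting and faded backgrounds don’t wash out the characters.

```python
# Generic pre-OCR cleanup sketch with OpenCV (not the proprietary
# 70-algorithm library described above).
import cv2

def clean_for_ocr(path):
    color = cv2.imread(path)                        # start from the color scan
    gray = cv2.cvtColor(color, cv2.COLOR_BGR2GRAY)  # collapse to grayscale for binarization
    denoised = cv2.fastNlMeansDenoising(gray, None, 10)
    # Adaptive thresholding binarizes each neighborhood separately, which
    # helps with stained or unevenly lit documents.
    binary = cv2.adaptiveThreshold(denoised, 255,
                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                   cv2.THRESH_BINARY, 31, 15)
    return binary
```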
But we’ve gone farther still. When the results come back from the OCR engine and they’re bad, what do you do then?
In every other system I’ve worked with, when you get an OCR error, say a 5 instead of an S, or a bad character reported with high confidence, you’re stuck with it. Not with us. We’ve built a system that understands the common OCR errors and lets you tune them for your project as needed. The result is that we can get good data out of really poor OCR output.
I’m oversimplifying, but you get the point. We’ve figured out how to compensate for common OCR errors using machine learning and natural language processing (NLP). It’s a layered AI approach, built for the document domain rather than bolted on from someone else’s library.
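As a toy illustration of the idea (not our actual ML and NLP layer), here is what tunable confusion pairs look like when you validate candidate corrections against an expected pattern. The pairs and the pattern are made up for the example.

```python
# Toy confusion-pair correction: generate variants of an OCR'd value by
# swapping commonly confused characters, and keep the ones that match
# the pattern the field is supposed to follow.
import itertools
import re

# Characters OCR engines commonly confuse; tune these per project.
CONFUSION_PAIRS = {"5": "S", "S": "5", "0": "O", "O": "0",
                   "1": "I", "I": "1", "8": "B", "B": "8"}

def candidate_fixes(text, pattern):
    options = [[ch] + ([CONFUSION_PAIRS[ch]] if ch in CONFUSION_PAIRS else [])
               for ch in text]
    # Fine for short fields; a real system would prune instead of brute-forcing.
    for combo in itertools.product(*options):
        candidate = "".join(combo)
        if re.fullmatch(pattern, candidate):
            yield candidate

# Usage: an ID that should be three letters, a dash, then four digits.
print(list(candidate_fixes("IN5-1O24", r"[A-Z]{3}-\d{4}")))  # ['INS-1024']
```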
In the end, OCR’d data is just text. And guess what? So are full-text PDFs, so are emails, so are a lot of the documents that companies get. We’ve spent a lot of engineering time on being able to normalize text data. Just because I get a full-text PDF doesn’t mean the data is easily extracted, and it doesn’t mean the data will fit into my target system. We fix that. It’s extract, transform, load (ETL) for documents.
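Here is a minimal sketch of what ETL for documents means in practice. The field names and formats are hypothetical; the point is that the same extraction and normalization applies whether the text came from OCR, a full-text PDF, or an email.

```python
# ETL-for-documents sketch: extract fields from raw text and normalize
# them for a target system. Field names and formats are hypothetical.
import re
from datetime import datetime

def extract_invoice_fields(raw_text):
    record = {}

    # Extract a US-style date and transform it to ISO 8601 for the target system.
    date_match = re.search(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", raw_text)
    if date_match:
        month, day, year = map(int, date_match.groups())
        record["invoice_date"] = datetime(year, month, day).strftime("%Y-%m-%d")

    # Extract a dollar amount and strip the formatting so it loads as a number.
    amount_match = re.search(r"\$\s*([\d,]+\.\d{2})", raw_text)
    if amount_match:
        record["amount"] = float(amount_match.group(1).replace(",", ""))

    return record

print(extract_invoice_fields("Invoice dated 7/3/2024 for a total of $1,234.56"))
# {'invoice_date': '2024-07-03', 'amount': 1234.56}
```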
The last frontier in OCR technology is handprint recognition.
Until 9 months ago, I would have strongly cautioned anyone against trying anything beyond very basic structured extraction for handprint. Between the preparation we can do and layering recognition engines, we are now able to get incredible results from handprinted data.
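As a toy picture of what layering engines means (not our actual approach, and the engine outputs here are invented), imagine running several recognizers over the same handprint field and keeping the value they agree on:

```python
# Toy combiner for multiple recognition engines: majority vote across
# each engine's reading of the same handwritten field.
from collections import Counter

def combine_engine_results(readings):
    votes = Counter(r for r in readings if r)  # ignore empty reads
    if not votes:
        return None, 0.0
    value, count = votes.most_common(1)[0]
    agreement = round(count / len(readings), 2)
    return value, agreement

# e.g. three engines read the same handwritten last name field
print(combine_engine_results(["SMITH", "SM1TH", "SMITH"]))  # ('SMITH', 0.67)
```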
Still, I’m very skeptical and cautious, but we’ve sold systems in the last six months that work on unstructured handprint data. I wouldn’t believe it if I hadn’t been part of it myself.