How Automated Document Processing Software Works

Automated document processing software is the next generation of capture, combining newer technologies such as computer vision, machine learning, and natural language processing with traditional OCR tools.

BIG TIP: The biggest change that has happened in document capture is that the fundamental building blocks of the software have been completely re-worked.

Instead of data science methods like natural language processing being an add-on (or an afterthought), they are worked into every aspect of how the software works.

This, along with changes to the way software is architected to take advantage of modern compute and data storage, makes today’s automated document processing software worth a deeper look.

The 6 Fundamental Technologies in Automated Document Processing Software


1. Computer Vision

Computer vision helps computers perform actions that are extremely easy for humans, but not so easy for a machine.

In the realm of automated document processing, computer vision (CV) helps process document images for very accurate optical character recognition (OCR).

CV is applied in three broad phases:

  • Phase 1: Enhance page images to create the best possible representation of the original page, ensuring a clean version for output.
  • Phase 2: Create intermediate images that yield the best possible OCR and data extraction results. These intermediate images are also a vital resource for software architects when deciding which techniques to use for high-quality OCR and data extraction.
  • Phase 3: Apply fully automated CV and image processing techniques as needed to collect data elements such as information stored in tables, boxes, and bounded regions.
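To make Phase 2 concrete, here is a minimal sketch of one common preprocessing step, global thresholding (binarization), in plain Python. Real systems use image libraries such as OpenCV; the “page” here is a toy grayscale grid, and the threshold value is an assumption:

```python
def binarize(image, threshold=128):
    """Convert a grayscale image (rows of 0-255 pixel values) into a
    black/white intermediate image: 1 for ink (dark), 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in image]

# Toy 3x4 "page" with a dark stroke down the second column.
page = [
    [250, 40, 245, 251],
    [248, 35, 240, 249],
    [252, 30, 244, 250],
]
print(binarize(page))  # the stroke survives; the background drops out
```

An OCR engine reads much more reliably from a clean binary image like this than from a noisy grayscale scan.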


2. Optical Character Recognition

Even a technology as old as OCR has been advanced in modern automated document processing software platforms.

An OCR engine is the part of the software that performs the actual character recognition by analyzing the pixels of an image to figure out the correct character.

Gone are the days of a single OCR engine “looking” at a page from left to right, top to bottom; that method is slow and error-prone. Today, users run tens or even hundreds of concurrent “threads” of OCR.

New features in OCR make highly accurate data extraction and integration a reality on even the most complex documents.

Here’s what’s changed in OCR technology:

  1. Use multiple OCR engines at the same time
  2. Re-run OCR until desired accuracy is attained
  3. Automatic data correction for well-known OCR errors
  4. Spell correction and word splitting
  5. Globalized multi-language support
  6. Trainable OCR for custom / difficult font types
  7. Lexicon-based corrections for proprietary information
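As a toy illustration of items 1 and 3, here is a sketch that merges the output of several hypothetical OCR engines by per-character majority vote, then corrects well-known character confusions against a small lexicon. The engine readings, confusion table, and lexicon are all invented for the example:

```python
from collections import Counter

def vote(readings):
    """Per-character majority vote across equal-length engine outputs."""
    return "".join(Counter(chars).most_common(1)[0][0] for chars in zip(*readings))

# Well-known OCR confusions, applied only when the voted word is not in a lexicon.
CONFUSIONS = {"0": "O", "1": "I", "5": "S"}
LEXICON = {"INVOICE", "TOTAL"}

def correct(word):
    if word in LEXICON:
        return word
    fixed = "".join(CONFUSIONS.get(c, c) for c in word)
    return fixed if fixed in LEXICON else word

# Three engines disagree on the same word; voting plus correction recovers it.
word = vote(["INV01CE", "1NVOICE", "INVO1CE"])
print(correct(word))  # INVOICE
```

Each engine makes different mistakes, so combining their outputs (and then applying lexicon-based correction) is far more accurate than any single pass.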


3. Natural Language Processing

Natural language processing (NLP) recognizes paragraphs, sentences, and other language elements in documents. Creating this structure is vital to helping a machine understand the meaning conveyed in blocks of text.

Essentially, NLP is using computers to process human language. In the early days of NLP, document processing solutions used a standard library from Stanford (Stanford CoreNLP) to recognize text.

Today, NLP is performed on the fly using built-in advanced machine learning functionality.

Here are some examples of NLP in automated document processing:

  1. Paragraph marking – allows users to break up a document’s text flow into segments of paragraphs instead of segments of lines
  2. Flow collation – language on documents follows a standard flow (in English, left to right), so understanding the order of words and phrases makes it easier to find important information in unstructured text
  3. Field class extractors – intelligently create individual sections of targeted text within paragraphs
  4. Porter stemming – an algorithm that reduces words to their stem. For example: likes, liked, likely, and liking are all reduced to “like”
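The paragraph-marking idea in item 1 can be sketched in a few lines of Python. This simplified version relies only on blank-line separators; a real NLP pipeline would also use layout and linguistic cues:

```python
import re

def mark_paragraphs(text):
    """Split a document's text flow into paragraph segments,
    collapsing the line breaks inside each paragraph."""
    blocks = re.split(r"\n\s*\n", text.strip())
    return [" ".join(block.split()) for block in blocks]

doc = "Dear Sir,\nplease find the invoice\nattached.\n\nBest regards,\nA. Farris"
print(mark_paragraphs(doc))
# ['Dear Sir, please find the invoice attached.', 'Best regards, A. Farris']
```

Working with paragraph segments instead of raw lines lets downstream steps (like field class extractors) operate on complete thoughts.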


4. Machine Learning

Machine learning algorithms have been in development for years and are perfect for automated document processing software.

One of the most important techniques for document processing is TF-IDF, which stands for Term Frequency-Inverse Document Frequency.

It is a numerical statistic that indicates how important a word is to the document that contains it.
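A minimal sketch of the statistic, using the common raw-count / logarithmic-IDF variant (production systems typically rely on a library implementation, and the sample corpus below is invented):

```python
import math

def tf_idf(term, doc, corpus):
    """TF: how often the term appears in this document (as a fraction).
    IDF: log of (corpus size / number of documents containing the term)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

corpus = [
    ["invoice", "total", "due", "invoice"],  # an invoice
    ["patient", "name", "dob"],              # a medical form
    ["invoice", "number", "date"],           # another invoice
]
doc = corpus[0]
# "invoice" appears across the corpus, so it scores lower than
# "due", which is distinctive to this document.
print(tf_idf("invoice", doc, corpus), tf_idf("due", doc, corpus))
```

A word that is frequent in one document but rare across the corpus gets a high score, which is exactly what makes it useful for classification.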

Document Classification and Data Extraction

When people talk about training a machine learning system for document processing, they often mean building TF-IDF statistics from sample documents. TF-IDF is important for automated tasks like document classification and data extraction, and it remains popular because it is both highly effective and relatively easy to understand.

Transparency is also an important topic in any automated system. One of the key ways an automated document processing system can be “transparent” is by exposing the underlying data its machine learning algorithms create.

By looking at the results of machine learning training, users can easily see whether or not the training is creating the intended result.

BIG TIP: Automation systems that do not expose training data should be avoided because they are nearly impossible to troubleshoot and trust.


5. Data Extraction

Data extraction is the whole point of modern automated document processing software. While full-page text searching is a byproduct, the goal is training the machine to identify, locate, and extract data important to workflows and business decision-making.

There are numerous approaches to automated data extraction. The simplest methods are based on regular expressions (RegEx). RegEx is a cross-industry standard syntax designed to parse text strings. It is simply a way of finding information in a bunch of text using pattern matching.

A shortcoming of RegEx is that it is all-or-nothing: a pattern either matches a string of text or it doesn’t. If the pattern is even a single character off from the text data, you won’t get a result.

A newer method, called Fuzzy RegEx, uses Levenshtein distance to solve this problem. Users set a confidence score to find text that is, for example, 95% similar to the RegEx pattern.
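A sketch of the underlying idea: Levenshtein distance counts the minimum edits needed to turn one string into another, and a similarity score derived from it is compared against the user’s confidence threshold. The threshold and sample strings below are illustrative:

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming, one row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(pattern, text, confidence=0.95):
    """True if text is at least `confidence` similar to the pattern."""
    dist = levenshtein(pattern, text)
    similarity = 1 - dist / max(len(pattern), len(text))
    return similarity >= confidence

# One OCR confusion ("I" read as "l") no longer breaks the match.
print(fuzzy_match("Invoice Number", "lnvoice Number", confidence=0.9))  # True
```

A strict RegEx would miss “lnvoice Number” entirely; the fuzzy version tolerates it while still rejecting genuinely different text.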

More Ways to Extract Data

Other common methods of data extraction include:

  1. Table extraction – automatic identification of tables and the data contained within them
  2. Vision-assisted capture – captures information like checkboxes (also known as optical mark recognition, or OMR)
  3. Key-value pair – this approach uses the layout relationship between a “key” (e.g., “First Name”) and a nearby “value” (e.g., “Farris”)
  4. Content models – used in document classification and data extraction. These are building blocks used to capture virtually any data from a document using data models and data elements
  5. Lexicons – these are pre-defined internal or external lists of words, phrases, or other information used during extraction or Fuzzy RegEx matching
  6. Zonal extraction – this is one of the earliest data extraction methods and is primarily used on documents with layouts that don’t change, like a check
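As a sketch of the key-value pair approach (item 3), assuming a simple “Key: Value” text layout; real systems also use the spatial relationship between key and value on the page:

```python
import re

def extract_key_values(text, keys):
    """Find each known key and capture the value that follows it on the same line."""
    results = {}
    for key in keys:
        m = re.search(rf"{re.escape(key)}\s*:\s*(.+)", text)
        if m:
            results[key] = m.group(1).strip()
    return results

page_text = """First Name: Farris
Last Name: Khan
Invoice Total: $1,204.50"""

print(extract_key_values(page_text, ["First Name", "Invoice Total"]))
# {'First Name': 'Farris', 'Invoice Total': '$1,204.50'}
```

Because the key acts as an anchor, this works even when the values change from document to document.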

New data extraction techniques and technology are constantly in development and are largely driven by business requirements from increasingly complex document types.


6. Data Integration

Integrating the data provided from automated document processing software is just as important as extracting the data to begin with.

Because physical documents, digital documents, and purely text-based files are all considered “documents,” the breadth of data integration provided by document processing software is quite impressive.

After data has been classified and extracted, integration becomes a powerful step in the process. Data is “normalized,” which means that dates and numbers are formatted to match existing database requirements. Other data elements may be added by querying a database or another application to add structure to the data.
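A minimal sketch of the normalization step described above. The target formats (ISO 8601 dates, plain floats) are assumptions; a real system matches the destination database’s requirements:

```python
from datetime import datetime

def normalize_date(raw, formats=("%m/%d/%Y", "%d %B %Y", "%Y-%m-%d")):
    """Try several common input formats and emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in formats:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            pass
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_amount(raw):
    """Strip currency symbols and thousands separators; return a float."""
    return float(raw.replace("$", "").replace(",", "").strip())

print(normalize_date("3/14/2024"), normalize_amount("$1,204.50"))
# 2024-03-14 1204.5
```

Whatever messy form the date or amount takes on the page, the downstream database always receives one consistent format.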

Data is mainly integrated using the following industry standard techniques:

  1. CMIS – stands for Content Management Interoperability Services, and is used to connect with electronic content management (ECM) systems
  2. APIs – stands for Application Programming Interface, and is used to connect to both cloud and local software storage and line-of-business applications
  3. File shares – using FTP or SFTP (File Transfer Protocol / Secure File Transfer Protocol) to integrate digital documents and metadata with standard computer folders
  4. Database exports – these move data very efficiently into databases, such as SQL databases
  5. Custom file exports – using XML and JSON to integrate data and metadata to virtually any desired layout
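A sketch of the custom file export idea (item 5): extracted fields and metadata serialized to JSON in whatever layout the downstream system expects. The field names here are illustrative:

```python
import json

def export_json(doc_class, fields, source_file):
    """Wrap extracted data and metadata in a custom export layout."""
    payload = {
        "documentClass": doc_class,
        "sourceFile": source_file,
        "fields": fields,
    }
    return json.dumps(payload, indent=2)

print(export_json("Invoice",
                  {"invoiceNumber": "INV-0042", "total": 1204.5},
                  "scan_0001.tif"))
```

The same payload could just as easily be rendered as XML; the point is that the layout is under the integrator’s control.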


An Overview of Automated Document Processing Software

Automated document processing software is more than just document capture or full-page OCR. It is an entire platform that ingests virtually any type of data to intelligently analyze it and deliver the data in a way that is meaningful to an organization.

There are many more elements that play a crucial role in the automations, like establishing business rules, using subject matter expertise, and creating human-in-the-loop verification workflows.

The number of possible business outcomes using automated document processing software is nearly endless, limited only by your imagination and by choosing the most robust platform for your needs.