AMS Adviser
Volume 4 Issue 4 - July/August 1999
Welcome to a new issue of the AMS Adviser.
Kodak's New Colour Scanner
Article - Merging Document Capture and
Data Capture.
Plus we have all the usual bits (AMS
Services).
Kodak's 3590C Colour Scanner
Click here for the information
on the 3590C Colour Scanner
Merging Document Capture and Data Capture
This article has been updated and reprinted in full as it now
includes the just released Version 3.0 of Ascent Capture.



Kofax Image Products
April 1999
Document Capture and Data Capture
What is Document Capture?
What is Data Capture?
Production Capture: Merging Document and Data Capture
The Production Capture Process
The Cost of Capture
Ascent Capture
DDI Architecture
Scalability
Distributed Capture
Using Ascent Capture
Administration
Batch Management
Scanning and Importing
Recognition
Types of Data Extraction
Image Cleanup
Scripting
Indexing and Validation
QC and Rescan
Release
Documentation, Training, and Support
Table 1. Costs Associated With Production Capture
Document Capture and Data Capture
Most production scanning sites are designed either for document capture
(i.e., indexing and archiving images) or for data capture (i.e., extracting
data from forms). The differences between the two are both subtle and
far-reaching.
What Is Document Capture?
Document capture is the process of converting paper documents into digital
images and index data. Images are stored as data files (typically PDF or compressed Group
4 TIFF files) on a storage system, and the indexes are stored in a document management or
workflow system.
In a document capture environment, the primary deliverable to the back end workflow or
document management system is the image. The index information is provided so users can
retrieve the original image. Without the image, the index information has no value.
What Is Data Capture?
Data capture (historically referred to as forms processing) is the process of
automatically extracting information from forms. In a data capture environment, the
information contained on the form is the primary deliverable to a back end line of
business application or a legacy database. Oftentimes the images themselves are discarded
once the desired information has been extracted.
Most forms-oriented products on the market today require users to redesign
their forms in order to achieve acceptable recognition rates. However, newer technologies
now being released can work effectively with existing real-world forms
while maintaining adequate recognition accuracy.
Production Capture: Merging Document and Data Capture
Merging document capture and data capture into a single product is a
natural evolution of production capture. The result is a product offering that is more
intuitive to use and easier to deploy, especially for users who find themselves having to
deal with form and document requirements within the same department or enterprise. By
merging the two product segments into a single product offering, it is now possible to
deploy a single application to meet a wide range of capture applications that in the past
would have required the deployment of two or more applications.
The Production Capture Process
A production capture application is composed of several separate modules.
These modules can be run on a single workstation or, in high-volume environments, on
multiple stations on a LAN. Figure 1 shows a typical batch processing system.

Figure 1. The Production Capture Process
The Cost of Capture
Although capture is fairly inexpensive to implement (usually about 20% of the initial
system cost), it is by far the biggest ongoing labour expense of
most production imaging and document management systems. The cost of capture falls into
three areas:
- Ongoing Labour: This is by far the biggest cost. Capture consists of six primary steps,
four of which are heavily labour intensive. Estimates are that capture accounts for up to
80% of the ongoing cost of a production workflow or document management system.
- Capital Equipment: This is primarily the cost of the scanners themselves and the scan
stations they are connected to. High-speed scanners range in price from about $5,000 for a
30 ppm simplex scanner up to $50,000 or more for a 200 ppm duplex scanner.
- Integration Costs: This is the cost of integrating the capture software with the rest of
the system. This cost varies widely with system requirements, but can be quite high if the
capture software is restricted to proprietary database and design tools.
A well-implemented capture system can reduce the operating costs of an imaging system
by 20-40% or more. Table 1 shows some of the prime areas for cost reduction.
Table 1. Costs Associated With Production Capture
Operation | Definition | Cost Considerations
Document Preparation | Sort documents, remove staples, prepare batches, etc. | Primary cost: The labour cost of clerks to manually prepare the documents. Cost saving methods: Automatic form ID reduces the work of sorting different document types into batches. Grayscale VRS technology eliminates the need to batch documents based on paper colour and thickness.
Scanning | Converts paper documents into electronic files, typically PDF or Group 4 TIFF images. | Primary cost: The capital cost of the scanners and the ongoing labour cost of operators to run the scanners. Cost saving methods: Page-by-page scanning is very slow and prevents scanners from running at rated speed. Implement batch scanning to speed throughput and cut the number of scanners and scanner operators required.
Recognition | Automatically extracts data (from a form) or index information (from a document). | Primary cost: The capital cost of the recognition servers. Cost saving methods: Automated recognition can reduce or in some cases eliminate labour intensive manual keying costs. By automatically extracting data from a form or document, the validation process is simplified, and validation operators check the results of the recognition process as opposed to keying information from scratch.
Indexing and Validation | Document Capture: Assigns index keywords to all documents so that they can be retrieved later. Data Capture: Validates the results of automated recognition performed on a form. | Primary cost: The ongoing labour cost of operators to manually key or validate data fields. A typical production capture operation employs 2-4 index or validation operators for every scan operator. Cost saving methods: Use bar codes, optical character recognition, intelligent character recognition, optical mark recognition and advanced scripting techniques to automate the extraction of form data. For manual keying or data validation, make sure input screens are designed for efficient "heads up" operation and can keep up with professional keyboard operators. To ensure accuracy, use validation rules on each index field.
QC and Rescanning | Examines documents to make sure they are scanned correctly and sends badly scanned documents back to be rescanned. | Primary cost: Ongoing labour cost of operators to examine images. Slows down scanner operators who must rescan entire batches of documents. Cost saving methods: Use built-in batch integrity checks to guard against misfeeds. Image processing software can automatically correct some errors, such as skew, orientation, and so forth. Capture software should keep track of rejected pages and allow rescan of single pages within a batch. Good design will allow pages to be automatically inserted back into batches in the proper order. Grayscale technology can automate many aspects of this process.
Release | Export images to long term storage and data to a database or back end workflow or document management system. | Primary cost: Little ongoing cost. Cost saving methods: Capture software should support release of documents to standard optical systems and common SQL databases. Integration with popular workflow and document management applications should be quick and easy.
Ascent Capture
Ascent Capture is a batch-oriented production
capture application designed to process 1,000 to 100,000+ pages per day at high throughput
and low cost. Ascent Capture integrates the following features into a single application:
- Key Document Capture Features: Robust batch management, integrated quality control and
rescan, full-text OCR, heads-up key from image capabilities, and tightly integrated
support for major workflow and document management packages.
- Key Data Capture Features: Automatic form recognition, page registration, OCR, ICR, OMR,
barcode recognition, zonal image cleanup, and validation and lookup scripting
capabilities.
- Enterprise Features: Distributed capture, centralised
administration, quick and simple installation, configuration and management of remote
clients, and secure data transfer between sites.
- Scalability: Small standalone systems can be quickly and easily expanded to meet the
most demanding throughput requirements.
- Integration: Ascent Capture integrates quickly and easily into most popular workflow and
document management systems, including applications from IBM, Documentum, PC Docs, Optika,
Keyfile, Open Text, IMR, Excalibur, Eastman Software, and others.
DDI Architecture
Ascent Capture 3 is the first production capture solution that delivers a revolutionary
combination of three key capabilities (Document capture, Data capture, and
Internet-based distributed capture) in a single low-cost product. You'll have all
the powerful document capture features you've relied on in previous versions of
Ascent Capture, plus data capture and distributed capture capabilities previously
unavailable in a single application. With the largest number of licenses sold worldwide,
Ascent Capture has proven itself as the most reliable and widely used production-level
capture package in the world.
Scalability
Ascent Capture is designed to scale gracefully from a single workstation to 40 or
more workstations. Smaller capture operations (up to 3,000 pages per day) can run
scanning, OCR, indexing, validation, and release all on a single workstation, while larger
operations can run each module on a separate workstation. Capacity can be increased
further by adding additional stations to the network. Dynamic load balancing keeps every
station busy at all times. An easy to configure, low cost, low volume configuration can be
quickly expanded to process more than 100,000 forms or documents per day.
Distributed Capture
Ascent Capture's Internet-based distributed capture capability
eliminates the common and costly practice of shipping documents from remote offices to a
central site for processing. The optional Ascent Capture Internet Server (ACI Server)
enables remote scanning and indexing or validation of documents via connections ranging
from dedicated lines to inexpensive dial-up service. Now, IT and IS managers can develop
and implement an enterprise-wide capture solution that allows documents to be scanned
inexpensively at remote sites and then automatically uploaded to a central site securely
and reliably.

Figure 2. Internet-Based Distributed Capture
Using Ascent Capture
The balance of this white paper describes the operation of Ascent
Capture, from administration and batch management to scanning, data validation, and remote
operation.
Administration
Before using Ascent Capture, it is necessary to define the types of
documents that will be scanned or imported, and the specific data fields on each form to
be stored in the target database. The Ascent Capture Administration module is used for
three primary tasks:
- Defining document classes: A document class is a particular type of
document, such as a tax form or a transportation waybill. Document classes can contain up
to 100 different form types, each varying in appearance and layout, each with different
data fields.
- Defining data fields: A data field is either an index keyword that is
used to retrieve a document after it has been captured or a piece of information extracted
from a form that will eventually end up in a legacy database or line of business
application. Typical data fields are names, ID numbers, account numbers, and so forth.
Ascent Capture stores all data fields in a central pool (data dictionary), and each
document class uses a subset of the entire pool.
- Module settings: These options include such things as scanner setup,
indexing parameters, release options, etc. Options can be set up differently for each
document class if desired.
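The relationship between the data dictionary and document classes described above can be pictured with a small Python sketch. This is only an illustration of the idea, not Ascent Capture's actual data model: the class names, method names, and field types below are hypothetical, and the only figure taken from the text is the limit of 100 form types per document class.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX_FORM_TYPES = 100  # limit stated in the text above


@dataclass
class DocumentClass:
    """Hypothetical model of a document class and the fields it uses."""
    name: str
    field_names: list[str]  # subset of the central data dictionary
    form_types: list[str] = field(default_factory=list)

    def add_form_type(self, form_type: str) -> None:
        if len(self.form_types) >= MAX_FORM_TYPES:
            raise ValueError("a document class holds at most 100 form types")
        self.form_types.append(form_type)


@dataclass
class DataDictionary:
    """Hypothetical central pool of data fields (field name -> field type)."""
    fields: dict[str, str] = field(default_factory=dict)

    def define_field(self, name: str, field_type: str) -> None:
        self.fields[name] = field_type

    def make_document_class(self, name: str, field_names: list[str]) -> DocumentClass:
        # A document class may only use fields that exist in the pool.
        missing = [f for f in field_names if f not in self.fields]
        if missing:
            raise KeyError(f"fields not in data dictionary: {missing}")
        return DocumentClass(name, list(field_names))
```

Defining fields once in the pool and letting each document class pick a subset keeps field names and types consistent across every class that shares them.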

Figure 3. Ascent Capture Administration Module
Batch Management
Ascent Capture's Batch Manager module is used to check the status
or control the flow of batches in the Ascent Capture system. A batch is a stack of
documents that are scanned at a single time, and can consist of up to 100 distinct form
types and definitions, each of which can consist of one or more pages. Each batch is kept
together as it is routed through the Ascent Capture processing queues.
The system administrator can use Batch Manager to create, delete, or open batches. In
addition, the administrator can route a batch to a processing module or change the current
status of a batch. A user can be given rights to Batch Manager to perform batch creation
or other operations as permitted by the system administrator.
Batch Manager can be used to:
- Display a summary table showing the current status of all active batches in the Ascent
Capture system.
- Create new batches.
- Delete existing batches.
- Edit batch properties such as the priority, status, and processing queue.
- Display a status history of each active batch in the system.
When you invoke the Batch Manager module, the Batch Manager main window is displayed as
shown below. The batch summary table shows the current status of all batches in the Ascent
Capture system. The batches in the summary table can be sorted by selecting the button
above the associated column in the batch summary table.

Figure 4. Ascent Capture Batch Manager Module
Scanning and Importing
The Scan module is used to create batches, scan and import
documents, process bar codes and patch codes, and perform page based image cleanup and
image enhancement. Users can also edit the contents of batches before releasing them to the
next process.
Ascent Capture drives both simplex and duplex scanners at their full rated speed and
comes standard with support for high-speed SCSI and video scanners via Kofax accelerator
boards as well as mid-range and low-end SCSI scanners via software drivers.
Bell+Howell ACE, Fujitsu IPC3 and Kodak ATP image processing options are fully
supported to provide greater control over scanner settings.
Ascent Capture has also been designed to take full advantage of scanners based on Kofax's
revolutionary VirtualReScan (VRS) technology. Whether you are using a VRS ready scanner at
a scan station or at a rescan station, Ascent Capture 3 exploits all the grayscale image
processing capabilities of VRS, resulting in better recognition accuracy and reduced
labour costs due to the elimination of rescanning and document preparation.
The Scan module also supports an auto-import facility. A command line option allows the
Scan module to be placed in "auto-import" mode by another application. For
example, this feature might be used to allow a simple Visual Basic program to poll a
directory for incoming faxes and then start up Ascent Capture to automatically import
these faxes.
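The directory-polling idea can be sketched in a few lines. The text mentions Visual Basic; the sketch below uses Python purely for illustration. The white paper does not give the actual command line option, so the sketch does not guess at it: `start_import` is a hypothetical callback that you would write yourself to launch the Scan module in auto-import mode, and the function names and polling interval are assumptions.

```python
import os
import time


def find_new_files(directory, seen):
    """Return files in `directory` not yet recorded in `seen`, and record them."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def poll_for_faxes(directory, start_import, interval_seconds=30.0):
    """Poll `directory` forever, handing each batch of new files to `start_import`.

    `start_import` is a hypothetical callback supplied by the caller; it is
    where the capture application would be started in auto-import mode.
    """
    seen = set()
    while True:
        new_files = find_new_files(directory, seen)
        if new_files:
            start_import(new_files)
        time.sleep(interval_seconds)
```

Keeping the set of already-seen paths means each incoming fax is handed over once, even though the directory is scanned repeatedly.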

Figure 5. Ascent Capture Scan/Import Module
If an error occurs during the scan or import process, the Scan module displays an
appropriate error message. For example, if the number of pages or documents scanned does
not match the value entered by the Scan operator, an error is reported.
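A page-count check of this kind is simple to express. The following Python sketch shows the general idea only; the function and exception names are hypothetical and are not part of Ascent Capture.

```python
class BatchCountError(Exception):
    """Raised when the scanned page count differs from the operator's count."""


def check_batch_count(expected_pages, scanned_pages):
    """Compare the operator-entered count with the pages actually scanned."""
    actual = len(scanned_pages)
    if actual != expected_pages:
        raise BatchCountError(
            f"expected {expected_pages} pages but scanned {actual}"
        )
    return actual
```

A mismatch usually points to a misfeed (two sheets pulled through together) or a page left out of the stack, so it is worth catching before the batch moves on.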
Ascent Capture's optional Internet Server enables remote scanning of documents via
connections ranging from dedicated lines to inexpensive dial-up service. Centralised
administration settings are synchronised periodically with remote sites, and scanned
batches from remote sites are uploaded to a central server whenever a batch is finished
or, if desired, on a preprogrammed schedule each day.
Recognition
If a document or form has well defined fields, it is possible to
reduce manual keying by using automated recognition techniques such as OCR, ICR, OMR or
bar codes to read zones on the document and automatically convert them into data. If
automated recognition is specified for a document class, Ascent Capture allows you to
specify zones in the document or form and associate each zone with a data field. After
the zones have been automatically recognised, the documents can be sent to an
indexing/validation station for verification or can be sent straight to the release
server.

Figure 6. Ascent Capture Recognition Server
Types of Data Extraction
- Form ID is used to identify a particular form, resulting in specific fields being
automatically recognised and specific image cleanup being applied. This allows the
index/validation operator to simply check the accuracy of the automated recognition
results rather than manually typing the required data on the data entry form.
- Page registration is used to detect if the image has been shifted during scanning
relative to the master image used to create the form template. This shifting, which is
caused by mechanical tolerances in the scanner's document feeder, causes zones to be
misaligned and results in inaccurate recognition. Ascent Capture's built-in page
registration automatically corrects mis-registered images and improves the accuracy of
automated recognition.
- OCR is used to automatically fill data fields, thus allowing index operators to simply
check the accuracy of OCR fields rather than manually typing the required data on the data
entry form.
- ICR is similar to OCR but recognises handprinted characters. It is generally less
accurate than OCR but can produce good results if the characters are constrained within
boxes or if they are limited to numeric characters.
- OMR (Optical Mark Recognition) is used to automatically recognise checkboxes, bubbles,
and other filled in marks on a form.
- Bar code recognition is a highly reliable method for extracting data from documents. Bar
codes can be recognised either in a predefined zone or on the entire page. If page level
bar code recognition is used, bar codes don't have to be present at specific places.
Instead, every bar code on the page is recognised and then associated with data fields in
the order in which they are read.
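The page level bar code behaviour described in the last item amounts to pairing fields with values by position. The Python sketch below illustrates that pairing; it is not Ascent Capture code. The function name is hypothetical, and two behaviours are assumptions the text does not specify: surplus bar codes are ignored, and fields with no matching bar code are left as `None`.

```python
def assign_bar_codes(field_names, bar_code_values):
    """Pair each data field with the bar code read at the same position.

    Assumptions (not stated in the source): extra bar codes are ignored,
    and fields without a bar code are set to None.
    """
    assigned = {}
    for position, name in enumerate(field_names):
        if position < len(bar_code_values):
            assigned[name] = bar_code_values[position]
        else:
            assigned[name] = None
    return assigned
```

Because the association depends only on read order, the forms must carry their bar codes in a consistent sequence even though their exact placement on the page is free.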
Ascent Capture also supports full text OCR. This process performs OCR on the entire
document and produces an ASCII file of the output. The output can also be stored in a
variety of word processing formats, including Microsoft Word and WordPerfect. Ascent
Capture provides full multi-language OCR and ICR support allowing the administrator to
associate a country and a language with each document class.

Figure 7. Anatomy of a Form
Image Cleanup
As a rule of thumb, automated recognition techniques such as OCR, ICR, OMR and bar code
recognition are useful only on clean, sharp images where the recognition accuracy is
90-95% or higher. If the recognition accuracy is less than 90%, the cost of checking and
correcting errors is frequently higher than the cost of manually keying the data in
the first place.
There are several techniques that can make images more readable and increase OCR
accuracy. The most effective ones include:
- Deskew: This technique straightens pages that have been scanned slightly crooked due to
mechanical tolerances in the scanner's document feeder. Deskewing can increase
recognition accuracy by 15-20% or more, which can make the difference between using
expensive manual keying and automated recognition technology.
- Deshade: Recognition engines are unable to process words against the gray shaded
backgrounds that are common on forms. Removing shading allows you to recognise zones that
are otherwise unreadable.
- Despeckle and streak removal: These techniques remove small speckles and streaks caused
by dirt in the scanner feeder or noise in the scanner CCDs.
- Line removal: On typewritten forms, words are frequently typed so that they cross over
the lines on the form, which makes them unreadable to automated recognition processes.
Line removal erases the lines on the image and then reconstructs the characters so they
can be recognised.
- Edge enhancement: This is a multiple set of filters that sharpens the edges of
characters. The results are usually invisible to the eye, but they can increase
recognition accuracy by as much as 5-10%.
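To make one of these techniques concrete, here is a minimal despeckle sketch in Python. It is a simple isolated-pixel filter chosen for illustration, not the filter Ascent Capture uses. It assumes a bilevel image held as a list of rows, where 1 is a black pixel and 0 is white, and it clears any black pixel that has no black pixel among its eight neighbours.

```python
def despeckle(image):
    """Return a copy of `image` with isolated black pixels (1) cleared to 0."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]  # leave the input untouched
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            # Look at the up-to-eight surrounding pixels, clipped at the edges.
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                cleaned[y][x] = 0
    return cleaned
```

A filter this small only removes single-pixel specks; larger specks and streaks need a size threshold on connected regions, which is beyond this sketch.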
Scripting
The Recognition module has scripting capabilities that allow you to perform custom
operations based on individual zones within a form. For example, a custom script could
retrieve each zone snippet as it passes through the Recognition queue and send it off to a
proprietary OCR or ICR engine. This allows easy integration of highly customised
recognition engines where appropriate.
Since the Recognition module is an unattended operation, recognition scripts are also
well suited to performing lengthy data validation operations such as performing a database
lookup over a slow connection. This type of validation is better suited to an unattended
queue than it is to the Indexing and Validation queue.
Indexing and Validation
The Indexing and Validation module is used to enter data and associate it with a
scanned document. It can also be used to verify the results of automated recognition
techniques used to extract data from a form. Upon release, the field data is stored in the
user's target database.
Data validation is the most critical and labour intensive step in
the document capture process, with typical capture operations sometimes requiring as many
as four validation stations for each scanner. Ascent Capture provides several methods to
reduce operator errors and speed the indexing and validation process:
- Custom validation scripts can be configured to fill fields on the data entry form with
default values.
- Validation scripts can also be used to detect errors in both manually keyed data and OCR,
ICR, OMR or bar code data. For example, if a data field is a telephone number, a validation
script can require that all entries be numbers, which prevents OCR or ICR from mistaking the
number 1 for the letter l. Validation scripts can also be used to verify the value of a
data field against an external database.
- For data fields in which 100% accuracy is essential, secondary verification can be
specified. After a batch of documents has been validated, the batch is then routed to a
second operator, who re-enters the specified data fields. Any data fields
that don't match are flagged as errors and must be rekeyed. This method of double key
entry is the most reliable way known to ensure the accuracy of document and form data.
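Two of the checks above, the digits-only rule and double key comparison, can be sketched as follows. This is plain Python for illustration, not the scripting language or API that Ascent Capture provides, and the function names are hypothetical.

```python
DIGITS = "0123456789"


def is_all_digits(value):
    """Digits-only rule: reject empty entries and any non-digit character."""
    return value != "" and all(ch in DIGITS for ch in value)


def compare_double_key(first_pass, second_pass, verified_fields):
    """Return the verified fields whose two independently keyed values differ."""
    return [
        name
        for name in verified_fields
        if first_pass.get(name) != second_pass.get(name)
    ]
```

For example, `is_all_digits("555l234")` is false because the fourth character is the letter l rather than the number 1, and `compare_double_key` returns the names of the fields that would be flagged for rekeying.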

Figure 8. Ascent Capture Indexing and Validation Module
QC and Rescan
No scanner is perfect, and rescanning is an integral part of Ascent Capture.
Index operators can manually tag documents or individual pages for rescan, attaching
electronic notes that tell the scanner operator exactly what the problem is. In addition,
the forms processing module can reject pages when document separation or form ID fail. The
batch is then queued to a rescan workstation where the operator is prompted for the
specific pages or documents to be rescanned. Ascent Capture automatically inserts
rescanned pages in the appropriate position within the batch.
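The reinsertion step can be pictured with a short Python sketch. It is an illustration under stated assumptions, not a description of how Ascent Capture stores batches: the batch is modelled as an ordered list of pages, each rejected page keeps its slot in that list, and `rescanned` maps a slot number to the new image. The function name is hypothetical.

```python
def insert_rescanned_pages(batch_pages, rescanned):
    """Return a new page list with each rescanned page placed in its original slot.

    `rescanned` maps a zero-based page position to the replacement page.
    """
    merged = list(batch_pages)  # leave the original batch untouched
    for position, page in rescanned.items():
        if not 0 <= position < len(merged):
            raise IndexError(f"no page at position {position} in this batch")
        merged[position] = page
    return merged
```

This covers pages that were scanned badly; a page that was missing altogether would need to be inserted rather than replaced, which this sketch does not handle.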
The following are typical reasons why documents are rejected and the batch is sent back
for rescanning:
- Poorly scanned page (too light, too dark)
- Missing page
- Missing document
- Unrecognised form type
- Skewed image
- Missing corner (dog ear)
- Illegible bar code or patch code
- Too much noise
Electronic notes explaining the problem may be attached to the rejected document by the
operator who detected the problem in the validation or verification module. Before
rescanning, the rescan operator can open the note viewer and read the attached notes.

Figure 9. Ascent Capture QC/Rescan Module
Release
The final stage in the production capture process is to release each
document in the batch to a document management or workflow system. In the release process,
the image files are written to permanent storage (for example, Ascent Storage, an optical
storage management system for Windows NT) and the data is written to the target database
or a document manager. When dealing with forms, extracted form data can be released to a
back end database or line of business application.
In addition, Ascent Capture allows users to write their own custom release modules,
either to modify the standard release procedure or to release documents into a proprietary
back-end or non-ODBC database. Custom release modules are available from Kofax for a wide
variety of back end systems, including:
- IBM EDMSuite
- Documentum EDMS 98
- OpenText LiveLink
- Adobe Acrobat Capture
- Eastman Software Imaging for Windows NT
- Keyfile
- Optika eMedia
- NovaSoft Novation
- IMR Alchemy
- PC DOCS DOCS Open and DOCS Fusion
- Excalibur RetrievalWare
New release scripts are available all the time, so call Kofax for a current list or
visit our home page at www.kofax.com and go to the Ascent Capture section for the latest
information.
Documentation, Training and Support
Ascent Capture ships with a Getting Started Guide that contains a four
lesson tutorial that walks you through the basics of setting up Ascent Capture queues,
data fields, document classes, and batch classes. The Getting Started Guide is also
available in its entirety on the Kofax Web site. In addition, detailed online help is
available for all topics.
Both technical and sales training are offered to resellers and end-users at the
company's facilities in Irvine, California.
The Alliance program for Ascent offers multiple levels of technical support to meet all
the support requirements of systems integrators and resellers.

Contact AMS for more information or
click here to look at Ascent Capture Version 3.0.
AMS Services.
For the complete run down on what AMS can do for you, click on
the following link.
AMS Services
Next Month
In the next few issues we will have some new articles which will include the following:
- Alchemy Version 6.
- Articles and other topics of interest.
Plus all the usual bits & pieces.
Should you want a topic covered or need an article in full, please
feel free to contact AMS.