8 . 5 Document Imaging
People are turning to document imaging in an effort to manage the mountains of paper generated every day in this age of the so-called paperless office. In part, document imaging can be viewed as paper management in electronic form. Properly used, however, document imaging can become much more than electronic paper. Electronic documents created from document images can be indexed, searched, and accessed in a variety of ways to make their information more useful and valuable.
One particularly interesting document imaging software system is Adobe's Acrobat Capture. It scans documents and converts them into PDF (Adobe Acrobat format) files. Of course, document imaging systems can rarely recognize all the text and convert it to computer text, so Acrobat Capture takes an interesting approach. Basically, it punts: if the system can't figure out what a character is, it replaces the character with its image. With this approach, the document keeps the correct look, although, of course, searches may miss any text that was left as an image. It's a clever trade-off.
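The fallback idea can be sketched in a few lines of Python. This is only an illustration of the trade-off, not a description of how Acrobat Capture is implemented: the `Glyph` and `ImagePatch` structures, the confidence score, and the 0.9 threshold are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Glyph:
    """One character position on the scanned page (hypothetical structure)."""
    guess: str         # the recognizer's best guess for the character
    confidence: float  # 0.0 to 1.0, assumed to be reported by the recognizer
    image_id: str      # reference to the scanned bitmap of this character


@dataclass
class ImagePatch:
    """Placeholder meaning 'show the scanned bitmap here'."""
    image_id: str


def mixed_output(glyphs: List[Glyph], threshold: float = 0.9) -> List[Union[str, ImagePatch]]:
    """Keep confident guesses as text; fall back to the bitmap otherwise."""
    return [g.guess if g.confidence >= threshold else ImagePatch(g.image_id)
            for g in glyphs]


def searchable_text(items: List[Union[str, ImagePatch]]) -> str:
    """Only the text parts can be searched; image patches leave gaps."""
    return "".join(item for item in items if isinstance(item, str))


# The middle character is uncertain, so it stays an image.
page = [Glyph("c", 0.99, "g0"), Glyph("o", 0.42, "g1"), Glyph("t", 0.95, "g2")]
result = mixed_output(page)
print(result)
print(searchable_text(result))
```

The page still looks right because the uncertain character is shown as scanned, but a search for the full word would fail, which is exactly the search problem noted above.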
Before you embark on an imaging conversion project, it is useful to consider the following questions:
- Have you determined what you intend to do with the images? Your use of the documents will affect the storage and processing of the images.
- Do the images meet all your legal requirements? Some industries require documents that are valid from a legal standpoint. Documents may sometimes be required as evidence, and electronic documents may not be valid.(11)
- Does your organization have the infrastructure to support the new types of skilled staff needed to work with the new system? Your existing staff may need more technical training. A training budget should be part of the project, and retraining should be part of the project plan and organizational goals.
To become involved in document imaging, you presumably already have a large collection of paper documents. These documents must be converted into electronic images in a process known as backfile conversion. According to one particularly practical article, this process involves the following five steps:(12)
- Document Preparation: Organize and discard old documents.
- Scanning: Purchase an appropriate scanner for your documents. Autofeed scanners are a myth; hands-on feeding is a reality.
- Indexing: Create all potential identifiers, using current technology such as autoindexing, OCR, and bar codes. Where possible, use existing databases to retrieve additional information, such as employee information from an ID.
- Quality Assurance: Manage the process carefully; conversion is tedious and difficult. Get it right the first time, because reprocessing is expensive.
- Image Integration: Your goal is a system to handle daily throughput. Service bureaus may be appropriate for the backfile work while you focus on the future system.
As part of the document conversion process, paper documents are scanned, and images are converted into electronic files and kept in large electronic filing cabinets.(13) The document images may be recognized by optical character recognition (OCR) software and converted into computer interpretable text, which can be indexed and searched.
According to the news magazine Imaging World, "office workers spend 60% of the day dealing with paper documents and U.S. businesses continue to create over one billion pages of paper each day."(14) That's a lot of trees. The same article categorizes the imaging market into five components:
- image input
- image storage
- image management and processing
- image communications
- image output and display
Let's examine three aspects of these components. First, the OCR part of the input component; second, text retrieval, part of the management and processing component; finally, the media issues that are part of the storage component.
8 . 5 . 1 OCR
Paper documents consisting mainly of text represent a document conversion opportunity. It would be nice to get that text into your computer in a manipulable form such as ASCII text.(15) The two ways to do this are to retype the information or to use OCR (Optical Character Recognition). Retyping is not as absurd as it may seem at first glance. Scores of "offshore" workers provide inexpensive labor, and a number of companies provide this service. (See the appendix Resources for more information.) You can imagine that proofreading the material is excruciating. Nevertheless, rekeying is a cost-effective option.
The more civilized approach, OCR, has come a long way. Software and hardware packages are now available for all classes of computer equipment, from PCs to workstations. Recognition systems can interpret a wide variety of fonts. The accuracy of some systems can be improved by training them to recognize particular fonts and the specific characteristics of the documents.
8 . 5 . 2 Text Retrieval
A basic reason to convert paper documents into electronic files is to improve access to the information. After a set of documents has been scanned and the text recognized and converted into ASCII, what's next? Indexing and the creation of a full-text retrieval database is one possibility.
What is a full-text database, and how is one used? As the name implies, a full-text database allows you to search the entire text of a document. Every word is indexed for rapid retrieval. Often, the index takes up as much space as the text itself; this is the classic trade-off between speed and space. Storage media like CD-ROMs, with over 600 MB of available space, are perfect for these types of databases.
To build a full-text database, you go through the following steps:
1. Assemble all text into a common area, such as a single directory.
2. Identify and possibly mark up, in any required format, the headings, sections, and subsections that provide a hierarchical structure to the document. (Typically, this is used for the user interface of the text retrieval engine.)
3. Identify the "kill list" words you do not wish to search, such as "the," "and," and so on.
4. Run the database builder software to create the indexes and generate the user interface for the particular set of data.
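To make the indexing step concrete, here is a minimal Python sketch of what a database builder does at its core: it records, for every word not on the kill list, which documents contain it and where. It is a toy under stated assumptions, not any particular product. The documents are held in a dictionary rather than read from a directory, the kill list is an illustrative sample, and the function names are invented for the example.

```python
import re
from collections import defaultdict

# Illustrative "kill list" of words that are not worth indexing.
KILL_LIST = {"the", "and", "a", "of", "to"}


def tokenize(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each indexed word to {document name: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word not in KILL_LIST:
                index[word][name].append(position)
    return index


def search(index, word):
    """Return the names of the documents that contain the word."""
    return sorted(index.get(word.lower(), {}))


documents = {
    "memo.txt": "The scanner and the jukebox",
    "report.txt": "Jukebox capacity of the archive",
}
index = build_index(documents)
print(search(index, "jukebox"))  # both documents contain the word
print(search(index, "the"))      # kill-list words are never indexed
```

Because the index stores a position for every occurrence of every indexed word, it is easy to see why it can grow as large as the text itself.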
In a typical case, the textual information that originated with a set of documents is processed by the database builder software to produce a searchable database. A user interface or run-time system (as opposed to the builder system) is used to search through the text in a variety of ways. The searching flexibility is an important characteristic of full-text retrieval systems. (Please see section Text Retrieval in the appendix Resources for some references to these products.)
Textual searching can take many forms. The complexity of searching can range from a simple word search to boolean queries with proximity distance and regular expressions. OK, I'll explain that obnoxious jargon.
A boolean query (or search) is something like the following:
Find all occurrences of the words "your mother wears" AND ("army boots" OR "high heel shoes")
The query above, read as requiring the two phrases to be adjacent, would find the phrases:
"your mother wears army boots"
"your mother wears high heel shoes, what a fashion statement"
The query would not find the following, because the words "some funky" separate the two phrases:
"your mother wears some funky army boots"
A proximity search is a way of specifying words that you want to locate that are not necessarily next to each other. They need only be within some specified distance of each other, where distance is stated as a number of words.
Regular expressions are a formal way of using a pattern to represent many different strings of characters. You are probably already familiar with the concept of wildcards for file names. For example, in DOS, when you ask to list all file names that start with the letter F, you type the command:
dir F*
The * is a simple "regular expression" that means match 0 or more of any character. (Strictly speaking, this is a wildcard; in full regular expression syntax, * means zero or more repetitions of the preceding item, so the equivalent pattern is F.*) More complex regular expressions are commonly used in the UNIX operating system and as a way to specify text for retrieval and editing.
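For readers who want to try this, the following Python sketch compares the wildcard with a regular expression. The file names are made up for the example; `fnmatch` and `re` are standard Python modules.

```python
import fnmatch
import re

# Hypothetical file names, used only for illustration.
filenames = ["FILE1.TXT", "FORMAT.COM", "README.TXT", "F"]

# Wildcard matching, as in `dir F*`: * stands for any run of characters.
wildcard_hits = fnmatch.filter(filenames, "F*")

# The same search as a regular expression: "." is any character,
# and "*" repeats the item before it zero or more times.
starts_with_f = re.compile(r"F.*")
regex_hits = [name for name in filenames if starts_with_f.fullmatch(name)]

# A more specific pattern: F, any letters, exactly one digit, then ".TXT".
stricter = re.compile(r"F[A-Z]*[0-9]\.TXT")
strict_hits = [name for name in filenames if stricter.fullmatch(name)]

print(wildcard_hits)
print(regex_hits)
print(strict_hits)
```

The first two searches select the same three names; the third shows the kind of precision that regular expressions add over simple wildcards.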
These types of searches were previously used only by techno-geeks. Now, however, with the advent of the Web, doing them is becoming a more important skill. Many of the Internet starting-point services, such as Yahoo and Open Text, can take advantage of more complex queries, helping you find what you're after faster.
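The Boolean and proximity searches described earlier in this section can also be sketched in Python. This is a rough illustration, not how any particular retrieval engine works. It assumes the "your mother wears" query is read as requiring adjacent phrases, and it only measures proximity from the end of the first phrase forward to the start of the second. All function names are invented for the example.

```python
import re


def words(text):
    """Split text into lowercase words, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_phrase(tokens, phrase):
    """Return the start positions where the phrase's words occur consecutively."""
    target = words(phrase)
    n = len(target)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target]


def adjacent_query(text, head, alternatives):
    """head AND (alt1 OR alt2 ...), with an alternative required right after head."""
    tokens = words(text)
    head_len = len(words(head))
    for start in find_phrase(tokens, head):
        follow = start + head_len
        for alt in alternatives:
            alt_words = words(alt)
            if tokens[follow:follow + len(alt_words)] == alt_words:
                return True
    return False


def within(text, first, second, distance):
    """Proximity: second starts at most `distance` words after first ends."""
    tokens = words(text)
    first_len = len(words(first))
    ends = [start + first_len for start in find_phrase(tokens, first)]
    starts = find_phrase(tokens, second)
    return any(0 <= s - e <= distance for e in ends for s in starts)


footwear = ["army boots", "high heel shoes"]
print(adjacent_query("your mother wears army boots", "your mother wears", footwear))
print(adjacent_query("your mother wears some funky army boots", "your mother wears", footwear))
print(within("your mother wears some funky army boots", "your mother wears", "army boots", 2))
```

The second call fails because two extra words intervene, while the proximity search with a distance of two words succeeds on the same sentence.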
8 . 5 . 3 Storage Media
In any discussion of document imaging, we must also talk about mass storage. The images captured by scanners take a lot of space. Currently, the preferred medium is the optical disk. The advent of economical, high-capacity optical disks was one of the critical technological advances that enabled the imaging industry to become a reality. The main technology used in the imaging domain is called WORM, for Write Once Read Many. WORM disks can store from 1.2 to 10 gigabytes of data on a single cartridge. The write-once limitation may actually be an advantage. The data are physically impossible to erase, an advantage for most imaging applications with an archival function.
The other optical technologies, MO (Magneto Optical) and CD-ROM, are generally not appropriate for document imaging applications, but prices keep going down, so check the costs. MO disks allow reading and writing many times, but they are probably too expensive for the high data volumes needed. This may change in the future. CD-ROMs have traditionally required a factory to master and replicate the information, but they are the media of choice for distributing many copies of large volumes of information. Desktop CD-ROM production is now a reality, however, and low-volume production is practical.
Significant imaging applications often require terabytes of on-line storage. The cost-effective solution for keeping all this information accessible is to use optical jukeboxes. Just like the old Wurlitzer in the corner diner, the optical jukebox contains a set of cartridges. They are swapped, one at a time, into the drive. A wide variety of jukeboxes are available. They range in size from a toaster to a refrigerator. Some jukeboxes hold WORM cartridges, some hold MO (rewritable), and some contain both types. Jukebox capacities are largely dependent on the number of cartridges they can hold, from 10 to 2000. The storage capacities can go as high as 12 terabytes for a single jukebox. A jukebox this large is larger than desktop size, however.
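The arithmetic behind these figures is simple enough to check in a few lines of Python. The sketch below uses only the numbers quoted above, assumes decimal units (1 terabyte = 1000 gigabytes), and uses function names invented for the example.

```python
import math

GB_PER_TB = 1000  # decimal units, an assumption for this sketch


def jukebox_capacity_tb(cartridges, gb_per_cartridge):
    """Total jukebox capacity in terabytes."""
    return cartridges * gb_per_cartridge / GB_PER_TB


def cartridges_needed(total_tb, gb_per_cartridge):
    """Number of whole cartridges required to hold total_tb terabytes."""
    return math.ceil(total_tb * GB_PER_TB / gb_per_cartridge)


# A 2000-cartridge jukebox reaches 12 terabytes at 6 gigabytes per cartridge,
# which falls inside the 1.2 to 10 gigabyte range quoted for WORM disks.
print(jukebox_capacity_tb(2000, 6.0))

# Storing 12 terabytes on the largest (10 gigabyte) cartridges.
print(cartridges_needed(12, 10))
```

The same functions can be used to estimate how many cartridges, and therefore how large a jukebox, a given archive would require.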
Let's also remember the wonderful world of micrographics. COM (Computer Output Microform) systems are still alive and kicking. Even today, images can be stored in a cost-effective way using microfilm, microfiche, and the ever-popular aperture card. Microfilm is accepted throughout the world as a legal archival copy.
Access to images stored in these analog media is labor-intensive and basically awkward. You can't even do simple text searches. Digital information, by contrast, will be able to take advantage of new storage technologies. You can't shrink microfilm images, but as digital storage technology improves, more and more information can be packed into less and less space. Witness the new developments in next-generation CD-ROMs. (See Section 7 . 4 . 1 CD-ROM in Chapter 7 Applying Standards for more information on DVDs.) Some far-out technologies mentioned in the press are digital paper and holographic memory, with storage capacity several orders of magnitude greater than existing optical media.
The capabilities of document imaging systems are constantly expanding. Coupled with the improvements in text retrieval and networks, it seems likely that imaging systems present us with an important opportunity: to add value to the information.
| © Prentice-Hall, Inc. A Simon & Schuster Company Upper Saddle River, New Jersey 07458 |