Chapter 5: Document Standards

"A committee is a group that keeps the minutes and loses hours." -Milton Berle

Standards for electronic publishing can profoundly affect the publishing process. All aspects of the process, from design to authoring through production, can be influenced by the use or misuse of standards.

First, let's define the key terms. A standard is a set of agreed-upon procedures or data formats that you use to accomplish a task. Standards become part of the software tools you use to get your work done. This chapter will examine both de facto (informal) and formal publishing standards. It will also explore document exchange, the motivation behind a great deal of document standards work. Document interchange seeks to answer the question: How can I give you my electronic document and know that you can use it?

Publishing is evolving into one of many forms of information dissemination. On-line reading and browsing, also known as Web surfing, hypertext, hypermedia, and CD-ROM based delivery mechanisms are realities when the proper standards are implemented. Thus, the standards themselves should be viewed as enabling technologies. They lower the risk of trying a new technology. If everyone uses a particular standard, you take less risk publishing a document that uses that standard.

One long-term goal of publishers is to create customized publications from repositories of textual information. The content, stored in a database, is the raw material that is refined into information products.

The term information refineries, used to describe this process, is an apt analogy. Raw "crude" text is poured in the top of a processing chain and out flow numerous products. CD-ROMs, on-line information services, Web sites, customized textbooks, personalized newspapers, and more are all potential products of these repositories of information.

The widespread use of formal publishing standards may permit the establishment of such information refineries. However, it is also clear that not all types of text are suited to be sliced, diced, mixed, and tossed to produce an arbitrary salad bowl of products. While it is tempting to use the refinery paradigm, we must not be seduced into inappropriate applications. Sometimes, the author intended a book to be read as a whole. Similarly, sometimes the content must be read in context.

It is useful to consider two types of standards: document standards and the graphics standards commonly used to represent the graphics included as part of a document. Often, a standard refers to existing standards rather than reinventing an area they already cover. Combinations of appropriately chosen document and graphics standards can provide powerful solutions to the many complex problems of electronic publishing.

5 . 1 De Facto Standards

"There is no monument dedicated to the memory of a committee." - Lester J. Pourciau

Sometimes, when a product becomes very popular and widespread, the data formats for text or graphics used by that product become a de facto standard. When appropriate for your particular application, a de facto standard format is an easy and convenient way to exchange information. For example, in the Computer Aided Design (CAD) world, AutoCAD's DXF format is the de facto standard for CAD data on PCs. Adobe's PostScript is another de facto standard.

These specifications should not be confused with formal, official standards. True standards (formal standards) are generally developed over periods of 2 to 10 years by committees of technical experts. The committees work under the sponsorship of national or international standards-making bodies such as the American National Standards Institute (ANSI), the International Telegraph and Telephone Consultative Committee or Comité Consultatif International Télégraphique et Téléphonique (CCITT), the European Computer Manufacturers Association (ECMA), or the International Organization for Standardization (ISO). The formal standards-making process is excruciatingly painstaking and slow, but it's the best way to address all concerns. (For more discussion about formal standards, see Section 5 . 2 Formal Standards later in this chapter.)

One significant difference between de facto and formal standards is that de facto standards are often proprietary. The exact structure of a data format, even one that is a de facto standard, may well be a trade secret. PostScript Type 1 fonts were a tightly held secret until 1990. In that year, Apple and Microsoft announced the TrueType font format, a shot across the bow at Adobe's PostScript monopoly.(1) Subsequently, the Type 1 font specification was made public. (It seems as though a little competition is useful.) Yet, keep in mind that PostScript would never have come into existence through the formal standards-making process, with the political and technical compromises that are so often a reality.

5 . 1 . 1 Document Processors

The classic document processing systems, some of which are still quite popular, are batch-language oriented. The intuitive appeal of WYSIWYG systems must sometimes give way to the sheer volume of processing necessary for documents consisting of thousands of pages. In fact, sometimes documents are so mundane and routine that you don't want to look at them (for example, documentation for hundreds of similarly structured subroutines). Let's briefly examine three systems: Scribe, troff, and TeX.

Scribe was a groundbreaking document processor, the creation of Brian Reid, formerly of Carnegie-Mellon University. He single-handedly revolutionized the field of document processing with his doctoral dissertation, Scribe.(2) Along with the overall ability to format text according to markup instructions, Scribe introduced the notion of styles. A Scribe document does not contain detailed formatting instructions. Instead, documents are created and printed according to a particular format such as "Thesis," "Report," and "Letter."

From an early Scribe manual:

To use Scribe, you prepare a manuscript file using a text editor. You process this manuscript file through Scribe to generate a document file, which you then print on some convenient printing machine to get paper copy.

Scribe controls the words, lines, pages, spacing, headings, footings, footnotes, numbering, tables of contents, indexes and more. It has a data base full of document format definitions, which tell it the rules for formatting a document in a particular style. Under normal circumstances, writers need not concern themselves with the details of formatting, because Scribe does it for them.

The manuscript document an author creates has markup statements throughout. These statements describe the various components of the document to the Scribe processor. The descriptive markup the author places in the document is interpreted and formatted by the Scribe document processor. Scribe has generally been superseded by TeX and troff. Nevertheless, it remains an important document processing system.

In the UNIX world of document processing, troff is king. Actually, document processing was one of the first serious UNIX applications and one of the motivations behind its creation.(3) Created by Joseph Ossanna, troff is first and foremost a typesetting system. Troff processes the markup that an author must embed into a document as formatting instructions. The modular nature of UNIX, coupled with the power of troff, has led to a number of troff preprocessors: eqn for typesetting equations, tbl for tables, and pic for line drawings. Grap, a little language to specify graphs, is actually a pic preprocessor. Each of these preprocessors is a little language in and of itself. (See Section 4 . 1 . 3 Specialized Languages in Chapter 4 Form and Function of Document Processors for illustrations and a discussion of these preprocessors.)

It is common to see a command line such as

cat doc.txt | pic | tbl | eqn | troff -mm

to produce the printed copy of a paper. (In UNIX, the | symbol is a "pipe," which directs the output of the command on its left to the input of the command on its right.) cat doc.txt sends the file (doc.txt) as input to pic, which interprets drawing commands. The output of pic goes to tbl, which interprets table-making commands, and the output of tbl goes to eqn, which interprets equations. Each preprocessor passes along, unchanged, any input that is not meant for it. The output of the last preprocessor is input to troff, which does the actual typesetting according to the mm macro package.
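The way a chain of preprocessors like this one cooperates can be sketched in a few lines of Python. This is a toy model only, not how pic, tbl, or eqn are implemented: the names make_stage and pipeline are hypothetical, and each stage merely swaps its block for a troff comment. The block delimiters (.PS/.PE for pic, .TS/.TE for tbl, .EQ/.EN for eqn) are the conventional ones.

```python
from functools import reduce


def make_stage(name, start, end):
    """Build a toy preprocessor that rewrites only the blocks between start and end."""
    def stage(lines):
        inside = False
        for line in lines:
            if line.strip() == start:
                inside = True
                # A real preprocessor would emit troff commands here;
                # this sketch emits a troff comment as a placeholder.
                yield f'.\\" [{name} output]'
            elif line.strip() == end:
                inside = False
            elif not inside:
                # Everything outside the block passes through untouched.
                yield line
    return stage


pic = make_stage("pic", ".PS", ".PE")
tbl = make_stage("tbl", ".TS", ".TE")
eqn = make_stage("eqn", ".EQ", ".EN")


def pipeline(lines, *stages):
    """Feed the output of each stage to the next, as the shell's | does."""
    return list(reduce(lambda stream, stage: stage(stream), stages, lines))


doc = [
    "Some text.",
    ".TS",
    "a\tb",
    ".TE",
    "More text.",
    ".EQ",
    "x sup 2",
    ".EN",
]
result = pipeline(doc, pic, tbl, eqn)
```

Because every stage ignores what it does not recognize, the stages can be chained freely, which is the property that makes the troff preprocessors composable.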

TeX is one of the premier document processing systems in existence. It is arguably the most popular batch-language oriented document processing system. It is available on virtually any computing platform and can be legally obtained for free. An extensive series of books by Donald Knuth (the author of TeX) documents the source code and functionality of TeX.(4) Commercially supported implementations can also be purchased for platforms such as the IBM PC.

LaTeX, a macro package used with TeX, is the primary way documents are authored.(5) LaTeX uses style files to encapsulate the commands and processing instructions that format particular document elements.

Troff and TeX are used as the basis for internal publishing standards by a number of large organizations. AT&T's UNIX and OSF's (Open Software Foundation) software documentation originates as troff documents. TeX is used by the American Mathematical Society for a number of publications, and the electronic publishing magazine EP-ODD uses TeX and troff as the principal means for electronic submissions.

5 . 1 . 2 PostScript

PostScript is THE de facto standard page description language because of its extremely wide market penetration. It has evolved into more than simply a way of describing marks on a page. The thorough way in which it handles graphics and fonts, along with the consistency and quality of its implementations, has led PostScript into many areas. Document exchange and on-line document displays are two of the more prominent ones. PostScript, in combination with Apple's LaserWriter, effectively started the desktop publishing phenomenon.

For several years, PostScript was available only as a language that ran inside a printer. The printer's manufacturer had to license PostScript from Adobe. Close conformance to the PostScript specifications was guaranteed, because Adobe made sure that a particular implementation of PostScript worked correctly for a particular printer. This proprietary conformance testing is one way to ensure consistent implementations of software. However, it depends on the honor of a particular vendor (not that I'm implying that any vendor would lead us astray, of course).

As more and more PostScript printers became available, PostScript became a reasonable medium for document exchange. For example, if I send you a PostScript document, I have a high degree of confidence that the document you print will be correct. However, a PostScript document is not generally considered a revisable form of the document and is difficult to edit. (For a more thorough discussion of the issues involved with document exchange, see Section 7 . 2 Document Exchange in Chapter 7 Applying Standards.)

Like any commercial product, PostScript is evolving to meet new requirements and fix old problems. PostScript Level 2 addresses many past complaints, such as poor memory management and limited color support. PostScript Level 2 also offers several other interesting features. One of its significant improvements over its predecessor is in the area of color manipulation.(6) Full support for the CMYK (4-color printing) color model should make life easier for color printing.

The fundamental change in PostScript Level 2 is the incorporation of the CIE color model (see Section 6 . 3 . 1 Pure Color Models in Chapter 6 Media and Document Integration). The CIE color space specifies a mathematical relationship of color to human perception and is, therefore, independent of any output device. PostScript Level 2 provides a mechanism (called CIEBasedABC) that enables developers to map the CIE color space to a particular output device.

The extensions needed for Display PostScript have been included in PostScript Level 2, allowing the same PostScript interpreter to be used for either printing or display applications. True WYSIWYG displays are all the more likely if the same software is used both to display a document on the screen and to print it on paper.

In the area of data compression, PostScript Level 2 also offers significant improvements. Level 2 includes a new operator that accepts a compression algorithm. The supported algorithms include JPEG (Joint Photographic Experts Group) and LZW (Lempel-Ziv-Welch) compression.
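To give a feel for what LZW does, here is a minimal sketch of the textbook algorithm in Python. It illustrates the general technique only and is not PostScript's implementation, which among other things packs its codes into a bit stream rather than a list of integers. The function names lzw_encode and lzw_decode are assumptions made for this example.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Textbook LZW: emit one integer code per longest already-seen byte string."""
    # Start with one table entry for every possible single byte.
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = next_code  # learn the new string
            next_code += 1
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the same table on the fly and expand the codes back into bytes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)
```

The decoder never needs the table to be transmitted: it reconstructs each entry one step behind the encoder, which is why repetitive data such as scanned images compresses well with so little overhead.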

A second-generation PostScript is called PDF (Portable Document Format) and is the core of the Acrobat product line from Adobe. Its principal difference is optimization for display, as its primary function is the on-line display of documents. One key to this technology is the use of a new font technology called "Multiple Master." The new fonts work with Adobe Type Manager (ATM) to "mimic the style, weight, and spacing of the document's original faces automatically." PDF also stores a document as a series of randomly accessible pages, facilitating hypertext links. (See Section 7 . 4 . 2 Electronic Page Delivery in Chapter 7 Applying Standards for more information on Acrobat.)

5 . 1 . 3 Lots 'O Formats

Software vendors have defined many document and graphics formats. They have made many, but not all, of the specifications public. Vendors of open specifications correctly reason that publicizing their formats will encourage the creation of new software products that use those formats and, with them, more use of the vendors' own products. A few of the more popular formats are discussed next.

DCA/RFT

DCA/RFT, the Document Content Architecture/Revisable Form Text, commonly referred to simply as DCA, is the format used by IBM's DisplayWrite. It is capable of representing a document with one or two master formats.

Graphics are possible inside a DCA document via a special Inserted Escaped Graphic identifier. This identifier lets a document treat a graphic as a block located in the text.

DCA has an automatic numbering scheme that can be used to specify the numbering style of footnotes. Using this feature, the user can define custom numbering sequences.

RTF

Microsoft defined RTF, the Rich Text Format, for use by its principal publishing product, MS Word. On the Macintosh, it is the most commonly used document exchange format. Many products include import filters to allow input of text in this format.

MS Word has grown up to be the king of the hill of word processors. As such, RTF is a widely used interchange format. However, many word processors and document publishing systems seem to have trouble reading and writing proper RTF files. MS Word, not surprisingly, is clearly the most reliable program to read and write RTF files. There are also a number of RTF to HTML converters around. For example, check out rtftohtml at http://www.sunpack.com/RTF/rtftohtml_overview.html.

WORDPERFECT

Oh, how the mighty have fallen. A mere 3 or 4 years ago, WordPerfect was the undisputed leader of word processing packages. Now, after a buyout by Novell, which then sold it to Corel, WordPerfect is struggling to keep market share and is playing catch-up with the formidable marketing clout of Microsoft and its MS Word.

One particularly effective aspect of WordPerfect's design is the use of multiple views. A user can view the document in three ways. The normal view shows mainly the textual content with minor highlighting, color changes for font variations, and a few other visual cues. The show-codes view lets you see and edit all the hidden control codes used by the system. In this view, users can get into the nitty-gritty when required. The third view is the print-preview mode, which is most useful for displaying the relationship of inset graphics to the text. Of course, a fourth view is implicit: the printed document itself.

5 . 1 . 4 Dealing with Formats

Can I get document X into system Y? Will my WordPerfect system accept this vital MS Word document?

The answers to these questions depend on a variety of factors. The specific document processor may or may not support the import or export of a number of formats. Even if a format is reportedly supported, the import/export function often does not do a complete translation. Even a successful translation will almost always produce a document that must undergo extensive editing. Style and paragraph tags are usually lost, even if the overall formatting was translated successfully. Unfortunately, it is necessary to understand more than you might care to know when transferring a document from one format into another.
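Before any translation can start, a program has to know which format it has been handed. As a small illustration, the Python sketch below guesses a format from the opening bytes of a file. The function name sniff_format and the choice of these three formats are assumptions made for this example; the signatures are the conventional opening bytes of PDF, PostScript, and RTF files. A matching signature says nothing about how faithfully the rest of the file can be translated.

```python
def sniff_format(data: bytes) -> str:
    """Guess a document format from its opening bytes; 'unknown' if nothing matches."""
    signatures = [
        (b"%PDF-", "PDF"),       # PDF files open with %PDF- and a version number
        (b"%!", "PostScript"),   # PostScript files conventionally open with %!
        (b"{\\rtf", "RTF"),      # RTF files open with the {\rtf control word
    ]
    for prefix, name in signatures:
        if data.startswith(prefix):
            return name
    return "unknown"
```

For example, sniff_format(b"{\\rtf1\\ansi Hello}") reports "RTF", while ordinary text with no signature reports "unknown".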

ASCII is the most interchangeable format for documents. Unfortunately, the one lowest common denominator of document interchange, the text-only option, has some problems. The difficulties are not with ASCII but with the different ways in which computing platforms treat lines of text. There is no standard for the end-of-line (EOL) character. For example, UNIX computers use a line feed (LF) as the EOL. PCs use a carriage return (CR) and line feed in that order. Macintoshes use a CR, while VMS systems use a character count rather than a particular character.
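The character-based conventions are easy to reconcile in code. The following Python sketch, in which the function name normalize_eol is an assumption made for this example, converts PC, Macintosh, and UNIX line endings to a single convention. It works on raw bytes so that no automatic newline translation interferes, and it does not attempt to handle count-based records such as those on VMS.

```python
def normalize_eol(data: bytes, eol: bytes = b"\n") -> bytes:
    """Convert CR LF (PC), lone CR (Macintosh), and LF (UNIX) line endings to eol."""
    # Handle CR LF first so the pair is not counted as two line endings.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified if eol == b"\n" else unified.replace(b"\n", eol)


mixed = b"pc line\r\nmac line\runix line\n"
as_unix = normalize_eol(mixed)
as_pc = normalize_eol(mixed, b"\r\n")
```

The order of the two replacements matters: treating CR first would turn every PC line ending into two line endings and insert blank lines into the text.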

These different EOL characters are usually not that much trouble; however, in this age of networked distributed computing, with disks on server machines shared across many computing platforms, things can get ugly. Text on one platform will often not display correctly when the text file came from another platform. The networking software often takes care of these disparities, but not always. This issue becomes more significant when dealing with "write once" media or CD-ROMs, which are intended to provide data to many platforms.

When dealing with formats, it is crucial to be conscious of certain basic categories of information. Graphics (vector and bitmapped), font usage, style usage, global information, and properties are some of the basic functional categories of information that must be converted by a translator. Also, keep in mind that document fidelity is a difficult goal to attain.








© Prentice-Hall, Inc.
A Simon & Schuster Company
Upper Saddle River, New Jersey 07458
