4.3 Markup

Markup is information that is embedded in the text of a document but is not intended for printing or display. It may consist of instructions to a printing device, commands for a word processor, or even comments to a coauthor. All language-oriented document processing systems require some sort of markup; WYSIWYG systems also use markup, though it is often hidden from the user. Without markup, all you have is text with no information for the document processor.

4.3.1 Types of Markup

The three main classifications of markup discussed in the following sections have inspired the creation of a number of standards. The markup itself can also be created in a number of ways, which are discussed at the end of this section.

SPECIFIC MARKUP

Specific markup, sometimes called procedural markup, is often found in word processors and older (yet still used) typesetting systems. The function of specific markup is to tell the system how the text should look when printed. Typically, these are instructions to format a section of text bold or centered and of a particular size.

Specific markup can also be used to tell the system to perform some processing function on the text or on other items (for example, to count the number of figures). Sometimes the markup is hidden from the user; this is the case in a WYSIWYG system. TeX and troff commands embedded in a document are a form of specific markup. In effect, the markup consists of procedural commands that direct the document processing system to perform certain functions.

GENERALIZED MARKUP

Specific markup tells the system what to do with a document. In contrast, generalized markup describes the document to the system. Also called descriptive markup, generalized markup tells the system about the document elements. It does not tell the system what to do with that information. SGML is a standard for placing descriptive generalized markup in a document.
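The separation described above can be illustrated with a minimal Python sketch. It is not SGML; the element names ("title", "para", "note") and the two style tables are assumptions invented for this illustration. The document only says what each piece of text is, and two independent sets of rules decide what to do with that information.

```python
# A document carrying generalized (descriptive) markup: each piece of text
# is labeled with what it IS, not with how it should look.
# The element names are hypothetical, chosen only for this sketch.
document = [
    ("title", "Markup"),
    ("para", "Markup is information embedded in the text."),
    ("note", "WYSIWYG systems often hide it."),
]

# Two independent sets of processing rules. The document does not change.
print_style = {
    "title": lambda text: text.upper().center(50),
    "para": lambda text: text,
    "note": lambda text: "    NOTE: " + text,
}
outline_style = {
    "title": lambda text: "* " + text,
}


def render(doc, style):
    """Apply a style to each element; elements without a rule are skipped."""
    lines = []
    for element, text in doc:
        rule = style.get(element)
        if rule is not None:
            lines.append(rule(text))
    return lines


if __name__ == "__main__":
    print("\n".join(render(document, print_style)))
    print("\n".join(render(document, outline_style)))
```

With specific markup, the centering and capitalization would be commands embedded in the text itself, and producing the outline would require reinterpreting those formatting commands.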

Placing markup into a document by hand is time-consuming and unwieldy. However, several good tools make this process reasonable. (See Appendix A Resources for some tools.) Ideally, you would like to automate the markup process to the point of invisibility.

What is Markup?

One of the editors of the Text Encoding Initiative (TEI) (see Section 9.2 Text Encoding Initiative in Chapter 9 Case Studies), a project developing markup specifications for humanities scholars, says this about markup:

Why does the TEI encoding scheme matter?... It is a tool for scholars, but it has many applications, some of them commercial as when it helps to reduce the documentation for a fighter plane from three tons of printed information to a disc of easily retrievable, cross-referenced electronic data. Markup, if one needs a fancy word, is a branch of hermeneutics, a system of explication. Markup makes explicit what was not so clearly arranged before. It allows huge amounts of data to become parsed character data, that is meaningfully arranged data with tags that can help collect or arrange the data according to the needs of the retrieving user.(13)

CONTENT MARKUP

Content markup is the use of generalized markup to describe the semantic elements of a document. Strictly speaking, this is an application of generalized markup and, indeed, of SGML.

For example, you might have a recipe marked up with tags such as <INGREDIENTS>, <TEMPERATURE>, and <SERVINGS>. These tags describe the content, not the structure. You can imagine having hundreds of recipes in this form and integrating the information with a database. You could then query this database to produce, for example, a shopping list for a particular set of recipes.
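As a rough sketch of the shopping-list idea, the following Python fragment totals the ingredients of selected recipes. Several things here are assumptions made for illustration: the recipes are written as XML rather than SGML so that Python's standard XML parser can read them, and the <COOKBOOK>, <RECIPE>, and <ITEM> elements, the NAME, QTY, and UNIT attributes, and the sample recipes are all invented. Only <INGREDIENTS>, <TEMPERATURE>, and <SERVINGS> come from the text.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data; XML stands in for SGML in this sketch.
RECIPES = """
<COOKBOOK>
  <RECIPE NAME="Pancakes">
    <SERVINGS>4</SERVINGS>
    <INGREDIENTS>
      <ITEM QTY="2" UNIT="cup">flour</ITEM>
      <ITEM QTY="2" UNIT="each">egg</ITEM>
    </INGREDIENTS>
  </RECIPE>
  <RECIPE NAME="Shortbread">
    <SERVINGS>12</SERVINGS>
    <TEMPERATURE>325</TEMPERATURE>
    <INGREDIENTS>
      <ITEM QTY="2" UNIT="cup">flour</ITEM>
      <ITEM QTY="1" UNIT="cup">butter</ITEM>
    </INGREDIENTS>
  </RECIPE>
</COOKBOOK>
"""


def shopping_list(xml_text, wanted):
    """Sum ingredient quantities over the recipes whose NAME is in `wanted`."""
    root = ET.fromstring(xml_text.strip())
    totals = Counter()
    for recipe in root.iter("RECIPE"):
        if recipe.get("NAME") not in wanted:
            continue
        for item in recipe.iter("ITEM"):
            # Key on (ingredient, unit) so cups are never added to "each".
            key = (item.text.strip(), item.get("UNIT"))
            totals[key] += float(item.get("QTY"))
    return dict(totals)


if __name__ == "__main__":
    for (name, unit), qty in shopping_list(RECIPES, {"Pancakes", "Shortbread"}).items():
        print(f"{qty:g} {unit} {name}")
```

The query works only because the tags say what the text means. The same recipes marked up purely for appearance would give the program nothing to select on.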

This type of markup is the subject of a great deal of research. It gets very complicated very quickly. Often, it is difficult to clearly and unambiguously identify actions and objects in the real world.

Take the issue of naming an item. In a description of a new porch you're about to build, you could refer to a joist as (1) the 10th support from the left end, (2) the joist 150 inches from the left end, (3) the 3rd load-bearing support, (4) the corner support, or (5) the pink joist. Descriptions of objects often mix naming conventions; as a result, the markup of content is very difficult unless the text is highly structured and almost legalistic in nature. To deal with this issue, you must try to anticipate the way the document and content markup will be used. Alternatively, you must be willing to restrict the markup severely in order to do meaningful content markup. These issues are also part of the work of the Text Encoding Initiative. (See Section 9.2 Text Encoding Initiative in Chapter 9 Case Studies.)

4.3.2 Markup Creation

As you can easily imagine, the act of marking up text can be arduous. There are a number of ways to attack the problem. Approaches to creating markup range from entirely manual to fully automated.

The first markup method is brute force. You simply use a text editor to embed the markup at appropriate places in the text.

In the next method, markup is still entered by hand, but with an editing tool that knows about allowable markup. In the case of SGML generalized markup, there are a number of structure editors. (See the section Document Processors in the appendix Resources for a list of these editors.) These editors "know" what kind of markup is allowed at any particular place in the text. The user is allowed to enter only legal types of markup entities. This approach has several benefits. The mental overhead is greatly reduced, and you're assured of producing legally marked-up documents. Sometimes, however, having an electronic checker looking over your shoulder can be overly intrusive, and inevitably there are times when you need to turn off the checking. From a user interface point of view, the better systems balance markup validation with ease of use.
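The core idea of a structure editor can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real editor: the element names and the ALLOWED_CHILDREN table are invented, and a real SGML content model also constrains the order and number of elements, which this sketch ignores.

```python
# Hypothetical content model: which elements may appear directly inside which.
ALLOWED_CHILDREN = {
    "chapter": {"title", "section"},
    "section": {"title", "para", "figure"},
    "figure": {"caption"},
    "title": set(),
    "para": set(),
    "caption": set(),
}


class StructureEditor:
    """Tracks the currently open elements and rejects illegal markup."""

    def __init__(self, root="chapter"):
        self.open_elements = [root]
        self.checking = True  # can be turned off, as the text describes

    def allowed_here(self):
        """Elements the user may legally insert at the current position."""
        current = self.open_elements[-1]
        return sorted(ALLOWED_CHILDREN.get(current, set()))

    def open(self, tag):
        current = self.open_elements[-1]
        if self.checking and tag not in ALLOWED_CHILDREN.get(current, set()):
            raise ValueError(f"<{tag}> is not allowed inside <{current}>")
        self.open_elements.append(tag)

    def close(self):
        """Close the innermost element; the root always stays open."""
        if len(self.open_elements) > 1:
            self.open_elements.pop()


if __name__ == "__main__":
    editor = StructureEditor()
    editor.open("section")
    print(editor.allowed_here())
```

An editor built this way can offer the user a menu containing only the result of allowed_here(), which is where the reduced mental overhead comes from. Setting checking to False corresponds to turning the checker off.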

A semiautomated approach is another way to enter markup. A document already in one publishing system's format is used as the input to an automatic markup process. For example, a FrameMaker document could be translated into HTML by using a converter to translate FrameMaker (MIF) markup into HTML tags.(14)

Although useful, this conversion approach has limits. You must start with a highly structured document. More problematic is the reality that document structure is often implicit in specific markup systems. The fact that a sentence is bold and all capitalized may imply that it is the start of a section, but it does not state so explicitly. The structure must be inferred based on the particular style used to format the document, rather than on an explicit command that says: <THIS IS THE START OF A SECTION>. If a figure or caption also contains a sentence that is bold and all capitalized, the markup system would misinterpret it as the start of a section.
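The inference problem can be made concrete with a small Python sketch. The Run class, the rule "bold and all capitals means a section title", and the sample text are assumptions invented for illustration; they do not describe how any actual converter works.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A piece of text together with the specific markup applied to it."""
    text: str
    bold: bool = False


def looks_like_section_title(run):
    # Structure is only implied by the formatting, so this is a guess.
    return run.bold and run.text.isupper()


def infer_structure(runs):
    """Label each run with a guessed generalized element name."""
    return [
        ("section-title" if looks_like_section_title(run) else "para", run.text)
        for run in runs
    ]


# Invented sample document: a heading, body text, and a figure caption
# that happens to share the heading's formatting.
SAMPLE = [
    Run("SPECIFIC MARKUP", bold=True),
    Run("Specific markup tells the system how the text should look."),
    Run("FIGURE 4.1 TYPES OF MARKUP", bold=True),
]

if __name__ == "__main__":
    for element, text in infer_structure(SAMPLE):
        print(f"{element:14} {text}")
```

The third run is labeled a section title even though it is a caption. Nothing in the formatting distinguishes the two, so the rule cannot tell them apart without extra assumptions about the document's style.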

Generalized markup using SGML defines the structure of a document explicitly. Documents in troff, by contrast, use a form of specific markup, so you must often rely on implicit assumptions about them to complete the translation to SGML. The same is true for documents in TeX or MS Word.

Finally, you can use automated markup systems that use document images from scanners as input. The software is told what to expect and creates the markup based on those expectations. As in the previous FrameMaker example, highly structured documents can be successfully translated, but poorly structured documents are much more difficult.(15) A newsletter cannot be fed into the scanner when the software is expecting a particular kind of technical report, at least not if you expect meaningful results.








© Prentice-Hall, Inc.
A Simon & Schuster Company
Upper Saddle River, New Jersey 07458
