5.5 Concepts of Program Debugging
Errors, Faults and Failures
Debugging is difficult because its primary focus is correcting the thinking of the developer; it is only secondarily concerned with correcting the program. Setting aside problems due to simple mechanical transcription mistakes (e.g., the developer mistakenly typed "+" instead of "-"), a problem revealed by testing a system began in the mind of the developer. The developer may have formed an inaccurate mental model of the system's goals and constraints, conceived of an algorithm that is incomplete or incorrect, misunderstood a programming language feature (e.g., inheritance), or misused a library component. The term error is often defined to mean these invalid models, incorrect concepts, and misunderstandings that stem from the mind of the developer. During system development the developer's errors become manifest in the system's code; such code is said to contain one or more faults. When code containing a fault is executed, the system enters an unintended state and eventually experiences a failure, which is an observable departure of the system from its intended behavior. The failure is the outward manifestation of the problem created by the fault(s). Harmless failures affect only the appearance of the system, as when output is not formatted correctly or a button has the wrong label. A more severe failure produces incorrect results but allows the system to continue executing, and the most severe failures cause the system to terminate immediately and abnormally, with possible loss of data or corruption of other resources.
The steps in debugging a system are outlined in the figure below. First, the failure of the system is observed during a test. The term test in this sense is very broad, including an informal execution of the system by the developer during development, a formal test conducted under controlled conditions by an independent team of testers, and production use by the end user. Second, the fault(s) in the code are located. During this step other tests of the system are usually conducted to recreate the failure so as to isolate and identify the offending code. Third, the errors on the part of the developer are discovered. This step involves questioning the models and concepts employed by the developer in creating the faulty code. Fourth, once the misunderstandings are identified, a correct model or concept can be formed. The developer may reexamine the system specification or other documentation, or consult other team members, to construct a correct model or concept. Fifth, the fault(s) in the code are repaired by adding to, removing from, and/or modifying the system's code. The repairs made in this step may be very localized if the error was a mistake about details, or they may require widespread changes if the error was a fundamental mistake. Sixth, the repaired system is retested to ensure that the system does not fail as previously observed. This step also helps to check that the repairs did not themselves introduce any new faults into the system.
The code containing a fault and the code that is executing at the time of the failure may not be the same and may have different spatial and temporal relationships. Various spatial and temporal relationships between these two segments of code are shown in the figure below. The term spatial as used here refers to the relationship between the classes containing the two segments of code. The spatial relationships shown in the figure are based on whether the two segments are in the same class, in different but related classes, or in unrelated classes. Classes are related through inheritance or because objects of these classes form associations or aggregations. The term temporal as used here refers to the points in time when the two segments of code are executed. Significant temporal relationships that are shown in the figure are whether the two segments are both executed within a single method invocation, not within the same method invocation but within the same sequence of method invocations, or in different method invocation sequences. A method invocation sequence is an ordered list of method names such that at a given point in time during the execution each method in the sequence, except the last, has invoked the next method in the sequence and the invocation has not yet returned; the last method in the sequence is the method being executed at the given point in time.
Commonly occurring debugging problems can be identified by their spatial and temporal characteristics. Though by no means an exhaustive list, the figure above shows five distinct and commonly occurring situations. First, the most limited and most easily fixed cases are those in which the two segments of code are in the same class and in either the same invocation or the same invocation sequence. Often the problem is a small mistake in the detailed coding of the class's methods. Second, a slightly more difficult case is one where the two segments in the same class are executed in different execution sequences. Because different execution sequences are involved, a typical problem is the way in which the two segments are using the state information of the object of which they are both a part. This state information is the most immediate thing that ties together the two segments of code. Third, the two segments of code may lie in different classes that are related through inheritance and are executed in the same invocation sequence. These problems are often due to misunderstandings or misuse of inherited methods, unintentionally overriding a base-class virtual method so that other base-class methods no longer work correctly, or conflicts over the use of protected data that is accessed by both the base class and the derived class. Fourth, the two segments of code may lie in different classes related because their objects are parts of aggregations or associations. These problems are harder to diagnose because they involve multiple classes and, possibly, different method invocation sequences; they are frequently due to misunderstanding the behavior or responsibilities of a class. Fifth, the most difficult problems to locate and correct are those that theoretically "cannot happen." This category comprises problems that occur between unrelated classes and at unrelated points in time.
Furthermore, these problems often are not deterministic, that is, the failure happens under different conditions, or the same conditions do not always produce the failure. In sequential programs, problems of this kind are most often attributed to improper use of pointers or related memory allocation effects (e.g., using an object after it has been deleted). Similar problems can occur in programs that use multiple independent threads of execution that are not correctly synchronized.
The Role of Debugging Tools
Debugging tools are useful for efficiently completing the first two steps in debugging - observing failure and locating faults. These two steps essentially deal with reporting the occurrence of interesting or unusual events, providing snapshots of the system as it executes, and allowing the developer to interrogate and control the system's execution. Effective debugging tools provide support for all of these activities. The subsequent steps in debugging - identifying the errors, correcting them, and repairing the faults - cannot be automated.
The developer can employ one of two strategies in observing failures. The developer can simply run the system and wait for it to fail, which is often the first strategy employed because the developer may have no reason to anticipate if or when a failure will occur. The second strategy is to set watch conditions, conditions that are tested to determine if the system is in a valid state. When an invalid state is detected, the system is halted with an informative message. This second strategy is often used when the developer is attempting to recreate a previously observed failure or if the developer is wary of a particular part of the system being tested, perhaps because it is new, complex, or not well understood. The watch conditions may stop the system short of the actual failure, but at a point closer to the true fault. The watch conditions may be dynamically inserted in the system using a debugging tool or they may be preprogrammed by the developer. Inserting watch conditions dynamically depends on the details of the debugging tool. A common technique for preprogramming watch conditions in C++ is via the assert macro, an example of which is shown in the figure below. In this figure, a method receives three input arguments: an array (a[]), the length of the array (size), and an index into the array (i). In this example, the assert is used to watch for violations of the condition: "i is a valid subscript and the value of a[i] is greater than zero." The assert macro takes a single, though arbitrarily complex, condition, which is tested each time the assert is executed. If the condition evaluates to true (non-zero), no action is taken. If the condition evaluates to false (zero), the system is halted and a message is displayed that states the condition along with the name of the source-code file and the number of the line within this file that contains the assert.
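A sketch consistent with the figure's description follows; the function name checkedElement is ours, but the parameters a[], size, and i match those described above:

```cpp
#include <cassert>

// Returns the element at index i, after asserting the watch condition:
// "i is a valid subscript and the value of a[i] is greater than zero."
// If the condition is false, assert halts the program and reports the
// failed condition, the source-file name, and the line number.
int checkedElement(const int a[], int size, int i) {
    assert(i >= 0 && i < size && a[i] > 0);
    return a[i];
}
```

Note that asserts are compiled out when the program is built with NDEBUG defined, so they cost nothing in a release build.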
To accomplish the second step in debugging, locating the faults in the code, it is necessary to accumulate three different kinds of information about the system's execution at the point of failure: the state of objects in the system, the current execution sequence, and previous execution sequences. The state of each object in the system indicates what information that object currently contains and what other objects it knows about. Collectively, the entire system state is represented by the union of the states of all of the objects in the system. The current execution sequence shows the sequence of events that immediately preceded the failure. Knowing this sequence gives an indication of the processing actions that were being attempted at the point of failure. In simple cases, knowing the state of key objects and the current execution sequence is sufficient to locate the fault. In more complicated cases, where the execution of the code with the fault and the execution of the code that causes the failure are more distant in time from each other, an execution history that extends further back in time is needed.
The Debugging Environment
Understanding key components of the system's execution environment - the heap and the run-time stack - is helpful in becoming a proficient user of debugging tools and more adept at locating faults in the code. The space for dynamically created objects is allocated from a memory area referred to as the system's heap. The heap memory allocated for an object contains the object's data and possibly some other information generated by the compiler for run-time purposes. The object's this pointer points to the beginning of the object's memory area in the heap. For three objects identified as A, B, and C, the figure below shows the memory allocated for the objects and their respective this pointers.
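A minimal sketch of this idea (the class name Counter is ours, standing in for the objects A, B, and C in the figure): each new expression allocates a distinct block in the heap, and an object's this pointer is simply the address of that block.

```cpp
// A small stand-in for the heap-allocated objects A, B, and C.
class Counter {
public:
    explicit Counter(int start) : value(start) {}
    // Expose the this pointer: the address of this object's heap block.
    const Counter* self() const { return this; }
    int value;
};

// Allocate two objects on the heap and confirm that each object's this
// pointer is the address of its own, distinct heap block.
bool distinctHeapBlocks() {
    Counter* a = new Counter(1);   // each new carves a block from the heap
    Counter* b = new Counter(2);
    bool distinct = (a->self() == a) && (b->self() == b) && (a != b);
    delete a;                      // heap blocks must be released explicitly
    delete b;
    return distinct;
}
```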
The run-time stack provides the memory space for all automatically allocated objects, including local variables (named objects declared within a method and anonymous objects) and parameters. The high-level view shown in the figure below illustrates the contents of the run-time stack for an invocation sequence in which a method in object A invokes a method in object B that, in turn, invokes a method in object C. A stack frame (also known as an activation record) is pushed onto the stack each time an invocation is made. When the current invocation returns, its stack frame is popped from the run-time stack and control is returned to the method invocation corresponding to the new top of the run-time stack. Conceptually, each stack frame has two parts containing the parameters passed to the invoked method and the local variables.
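The A-to-B-to-C invocation sequence can be sketched as follows (the method names are ours). While methodC executes, the stack holds three frames, each with its own parameter and local variable; each return pops the topmost frame:

```cpp
#include <string>

// Deepest frame: pushed last, popped first.
std::string methodC(int depth) {
    std::string local = "C";            // lives in C's stack frame
    return local + std::to_string(depth);
}

// Middle frame: its parameter and local survive while C executes.
std::string methodB(int depth) {
    std::string local = "B";            // lives in B's stack frame
    return local + methodC(depth + 1);  // pushes C's frame above B's
}

// Bottom frame of this sequence.
std::string methodA() {
    int depth = 1;                      // A's local variable
    return "A" + methodB(depth + 1);    // pushes B's frame above A's
}
```

A debugger's stack-trace command displays exactly these frames, top to bottom, at any breakpoint inside methodC.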
The organization of the execution environment helps to explain some of the difficulty of debugging and the limitations of debugging tools:
The evident strength of automated debugging tools is their capacity to examine the execution sequence and existing objects at the time of the failure. In some cases, this information suffices to identify the fault. In other cases more extensive means must be used to gather information about past events in the history of the computation.
Breakpoints and logs are two strategies for gaining information about execution sequences that occur before the failure. While both strategies provide information about prior execution sequences, they do so in very different ways. Breakpoints are a means of controlling the forward execution of the program starting at a point in time before the failure occurs. Logs are a means of examining a record to unravel the sequence of events that led to the failure. Breakpoints do not require preprogramming, but logs do.
Breakpoints are used in the following manner: using the commands provided by a debugging tool, the developer dynamically inserts breakpoints into the system prior to the system's execution. A breakpoint can be inserted immediately before any executable line of code. During execution, the program will halt whenever a breakpoint is reached and allow the developer to examine the state of objects and the current execution sequence, set new breakpoints or remove existing breakpoints, and control the forward execution of the system.
The developer controls the forward execution of the system using commands provided by the debugging tool. These commands allow the developer to execute a single line of code, a fixed number of lines, or all code until the next breakpoint or failure before halting. When the line of code being executed is an invocation, the user has additional options: to trace the invocation on a line-by-line basis, to execute the invocation as a single instruction, and, after tracing an invocation, to run until the invocation returns. These commands allow the user to control the forward execution of the system and to examine the state of the system at any point in time. The difficulty of using the breakpoint approach is that the user must have a means by which to set the breakpoint so that the execution will be halted before, but still near, the point of failure. In some cases it is difficult or impossible to set a convenient breakpoint: the developer may know that the failure occurs in a given method, but if that method is executed thousands of times before the failure occurs, setting a breakpoint at the beginning of the method's code is of little use.
A log is a preprogrammed means to gain information about the past history of the execution. In this approach, the developer builds into the code output statements that write selected information to a log file. Recorded in the log file is information about the sequence of events occurring during the system's execution. The preprogrammed output statements typically write to the log file entries to record when a method has begun and/or completed execution, the values of parameters or other objects at the time an invocation occurs and/or completes, and relevant information at other points in the execution. The advantage of the log approach is that it makes available a complete history, at least as much as the developer is willing to preprogram, in a stable form. The developer may use this log to reconstruct a sequence of events that led to the failure or compare the logs of different tests to determine at what point the execution resulting in a failure diverges from a normal execution. The difficulty of the log approach is that it requires the expense of preprogramming and demands that the developer have a reasonable idea in advance of what information to capture; even then, the log file for an execution may be so large that critical information needed by the developer is lost in a mass of extraneous log entries.
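A minimal sketch of such preprogrammed logging (the file name debug.log and the entry format are illustrative assumptions, not a prescribed convention):

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Append one entry to the log file; the file name is illustrative.
void logEvent(const std::string& entry) {
    std::ofstream log("debug.log", std::ios::app);
    log << entry << "\n";
}

// A method instrumented to record its entry, its parameter value,
// and its exit, as described in the text.
int scaledValue(int x) {
    std::ostringstream enter;
    enter << "enter scaledValue x=" << x;
    logEvent(enter.str());

    int result = x * 2;            // the method's actual work

    std::ostringstream leave;
    leave << "exit scaledValue result=" << result;
    logEvent(leave.str());
    return result;
}
```

Comparing the debug.log files produced by a failing run and a normal run shows where the two executions diverge.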
Developers must have a strategy when employing debugging tools; the debugging tools must be used in a deliberate, calculated way that enhances the probability of most rapidly identifying the fault(s) in the system. As with all tools, possessing a powerful debugging tool is not a guarantee of efficiency or effectiveness in debugging. The developer must know and put into practice strategies that guide the debugging tool's use.
It may be useful to think of debugging strategies in terms of a game metaphor: the player's (developer's) objective is to navigate a robot (the system state) through a terrain that the player has created. Adding to the terrain creates hidden opponents (faults). The opponents can convincingly lie to the robot, such as by making it think that the way ahead is clear when in fact it is the edge of a cliff. A large unexplored terrain will usually contain many opponents and many hiding places. Two opponents may sometimes collaborate to conceal each other's presence or to create together a harmful effect that neither one is capable of producing by itself (perhaps one opponent distracts the robot while the other pushes it into a hole). The opponents are devious and may set a time-delay bomb and then run away, causing the robot to crash when the opponent is far gone. The player is allowed to search the terrain (use the debugger) and replay the game under different conditions so as to reveal the presence of an opponent. While this metaphor is inaccurate in some respects, it does suggest some obvious strategies that the player might use: create a terrain that is easier to examine for possible opponents; do not give the opponents many places to hide; isolate a part of the terrain and search it for opponents; follow the trail backward from the site of a crashed robot looking for clues about the opponents and their hiding places; set traps to catch opponents. Such strategies have counterparts in debugging real systems.
Debugging strategies are either proactive or reactive. Proactive strategies are practices that the developer can use in designing and building the system so as to limit the introduction of faults. Reactive strategies are practices that the developer can use in locating faults that, despite the best use of proactive strategies, have been introduced into the system. The debugging strategies considered here are shown in the table below.
Incremental development is the first proactive debugging strategy. While it may seem strange, this development practice is one of the most effective strategies for creating a debugged system, because it follows two of the strategies suggested by the game metaphor: it creates a "terrain" that is easier to examine for possible "opponents," and does not give the "opponents" many places to hide. The first step in incremental development creates the smallest system possible, one that incorporates the least amount of code, to achieve its goal. This gives faults in the program the fewest places to hide and makes the program easier to examine, simply because there are fewer places to look. Each step in development adds but a small increment of code. Thus, each debugging step is simpler than it would be if a large body of undebugged code were added in a single step.
The second proactive debugging strategy is to build systems that use scope restriction through encapsulation, information hiding, aggregation, and other design techniques to limit the visibility and accessibility of system components. This design approach creates systems that are more easily debugged because it follows the game strategy of creating a "terrain" that is easier to examine for possible "opponents." Encapsulation and other scope-restricting techniques are the equivalent of building protective walls in the terrain to trap opponents either inside or outside their bounds. Thus trapped, the opponents are easier to find.
The first reactive debugging strategy is fault isolation, in which the system is made smaller by removing (commenting out) or disabling (turning off options) the suspected part of the system. The smaller system is then tested again using the same conditions that produced the failure. If the failure still occurs, then the removed or disabled part of the system is known not to contain the fault, and the strategy is repeated by removing or disabling additional parts of the system until the fault becomes isolated. At some point the failure will not occur. When the failure does not occur, the removed or disabled code contains the fault. The most likely site for the fault is in the last code that was removed or disabled. More extensive and detailed debugging of this suspected code can now be done with the debugger. This strategy follows the game metaphor notions of "not giving the opponents many places to hide" and "isolating a part of the terrain to search it for opponents." Notice that the debugging tools did not come into play until after the strategy had yielded a small area to examine.
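One common C++ idiom for this disabling step is a compile-time switch, sketched below; the function and switch names are illustrative:

```cpp
// Fault isolation: the suspected step is controlled by a compile-time
// switch. Rebuilding with the switch set to 0 and rerunning the failing
// test shows whether the fault lies inside the disabled code.
#define ENABLE_SUSPECT_STEP 0

int process(int input) {
    int result = input + 1;        // trusted code, always executed

#if ENABLE_SUSPECT_STEP
    result = adjustResult(result); // suspected code, currently disabled
#endif

    return result;
}
```

If the failure disappears with the switch off, attention narrows to the disabled block; if it persists, that block is cleared of suspicion.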
The second reactive debugging strategy is deductive reasoning. Debugging is like detective work: starting at the point of failure, or the scene of the crime, the system state contains clues from which deductions can be made about how the failure occurred. Possible causes for the observed system state can be tested, which either reveals the fault or produces additional clues for further deduction and tests. Unlike real detectives, the developer has the advantage of using the debugging tools to replay the system's execution and examine in detail possible causes. Notice that the use of the debugging tool is, though important, secondary - the use of the tool is guided by the strategy that determines what to look for and where to look for it. Sharp deductive reasoning is better than a powerful debugging tool.
Trap setting is the third reactive debugging strategy. Cases difficult to debug are those in which the developer knows in what way the system state has been damaged but cannot gather much evidence about how or when the damage occurred. For example, the system may fail because a state variable has an incorrect value, but such a value can be the result of many actions in the system. Frequently, the execution of the code that causes the failure is found in a different execution sequence from the one that executed the faulty code. A starting point for debugging can be found by setting a trap to identify when and where the system state was damaged. Two ways of setting a trap are to use the debugging tool to set a watch condition or to use assert macros. Both of these techniques allow the developer to be notified when the damage to the system state occurs. With this starting point, other strategies can be used to locate the fault.
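An assert-based trap can be sketched as follows; the class Account and its invariant (the balance must never go negative) are hypothetical examples, not taken from the system above:

```cpp
#include <cassert>

// Hypothetical class whose state variable, balance, must never go
// negative. The assert in checkState is a trap: it halts the program
// at the first moment the state is damaged, in whichever method and
// execution sequence the damage occurs, rather than at the later and
// possibly distant point of failure.
class Account {
public:
    void deposit(int amount)  { balance += amount; checkState(); }
    void withdraw(int amount) { balance -= amount; checkState(); }
    int  current() const      { return balance; }
private:
    void checkState() const { assert(balance >= 0); }  // the trap
    int balance = 0;
};
```

When the trap fires, the debugger's stack trace identifies exactly which invocation sequence damaged the state.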
Model testing, the fourth reactive debugging strategy, comes into play in those situations where the system being debugged is too complex to apply the other strategies. If the developer has an idea of what components are involved in the failure, a scaled-down model of the system can be built and tested. The model must be able to exhibit the same failure as the real system. If such a model can be constructed, locating the fault in the simpler model can also reveal the fault in the more complex, real system. Although this may sound difficult, it is often easy to construct the model for testing. For example, in a graphical editor system, if there is a problem with the resizing of polygons, then a special test program that creates a single hard-coded polygon can be built and tested.
The following list provides some other hints, information, and suggestions about debugging:
Even with these strategies and hints, the developer must realize that there is no shortcut in the difficult work of debugging - patient determination, consistent application of effective strategies, and knowledgeable use of debugging tools provide the only real path to success.
In the next sections, two debugging tools will be considered. The toolkit approach is represented by gdb, the GNU debugger, and its variants such as xxgdb. (Other Unix toolkit debuggers are also available in the public domain.) The IDE approach is illustrated by the debugger embedded in the Visual C++ environment. Other integrated development environments similar to Visual C++ include a debugger as part of the standard set of integrated capabilities.