Fault-Tolerant Computer System Design, 1/e
Dhiraj Pradhan, College Station, TX
Published February, 1996 by Prentice Hall PTR (ECS Professional)
Copyright 1996, 560 pp.
In the ten years since the publication of the
first edition of this book, the field of fault-tolerant design has
broadened in appeal, particularly with its emerging application in
distributed computing. This new edition specifically deals with this
dynamically changing computing environment, incorporating new topics
such as fault tolerance in multiprocessor and distributed systems.
Details the latest developments in fault tolerance in multiprocessor
and distributed systems.
Describes techniques for dependability prediction and measurement.
Includes a set of exercise problems in each chapter.
(NOTE: Each chapter begins with an introduction and concludes with Problems
1. An Introduction to the Design and Analysis of Fault-Tolerant Systems.
Fundamental Terminology. Objectives of Fault Tolerance. Applications of Fault
Hardware Redundancy. Information Redundancy. Time Redundancy. Software
Redundancy. Redundancy Example.
Dependability Evaluation Techniques.
Basic Definitions. Reliability Modeling. Safety Modeling. Availability
Modeling. Maintainability Modeling.
The Design Process. Fault Avoidance in the Design Process.
2. Architecture of Fault-Tolerant Computers.
Taxonomy of Applications.
General-Purpose Computing. High Availability Systems. Long-Life Systems.
Generic Computer. VAX 8600. IBM 3090.
High Availability Systems.
AT&T. Tandem. STRATUS. VAXft 310.
Spacecraft Systems. Voyager. Galileo.
SIFT. Space Shuttle Computer.
3. Fault Tolerant Multiprocessor and Distributed
Review of Multiprocessors and Fault Tolerance.
SIMD versus MIMD. Moderate Parallel versus Massively Parallel. Fine Grain
versus Coarse Grain. Shared Memory versus Distributed Memory. Topology of Interconnect
Implications on Fault Tolerance. Fault Tolerance Through Static Redundancy.
Redundancy for Safety. Redundancy for Arbitrary Faults.
Fault Tolerance Through Dynamic or Stand-by Redundancy. Fault Detection in
Fault Detection through Duplication and Comparison. Fault Detection Using
Diagnostics and Coding Techniques.
Recovery Strategies for Multiprocessor Systems. Rollback Recovery Using
Processor-Cache-Based Checkpoints. Virtual Checkpoints.
Rollback Recovery Issues in Communicating Multiprocessors.
Shared-Memory Multiprocessors. Distributed Memory Multiprocessors. Recovery
in Distributed Shared Memory Systems. Recovery in Database Systems.
Forward Recovery Schemes.
Static Redundancy Approaches. Dynamic Redundancy Approaches. Software
Redundancy-Based Approach for Forward Recovery.
Reconfiguration in Multiprocessors.
Bus-based Systems. Crossbar-based Systems. Multistage Interconnection
Networks. Hypercube Networks. de Bruijn Networks. Mesh Networks. Tree Networks.
Appendix: Other Approaches to Fault Detection.
Algorithm-based Fault Detection.
4. Case Studies in Fault Tolerant Multiprocessor
and Distributed Systems.
Case Study 1: Tandem Multicomputer Systems.
NonStop Cyclone. Himalaya K10000.
Case Study 2: Tandem Integrity S2.
Architecture. Fault Tolerance Strategies.
Case Study 3: Stratus XA/R Series 300 Systems.
System Architecture. The Pair-and-Spare Approach. System Software.
Case Study 4: Sequoia Series 400 System.
System Architecture. Hardware Fault Tolerance. Software Fault Tolerance.
Case Study 5: The Error-Resistant Interactively Consistent Architecture
The (4,2)-Concept. The Architecture.
Case Study 6: Fault-Tolerant Parallel Processor.
Byzantine Resilience. Architecture. Prototypes.
Case Study 7: The MAFT Design for Ultra-Reliable Systems.
The MAFT Philosophy. System Model.
5. Experimental Analysis of Computer System Dependability.
Parameter Estimation. Distribution Characterization. Multivariate Analysis
Simulated Fault Injection at the Electrical Level. Simulated Fault Injection
at the Logic Level. Simulated Fault Injection at the Function Level.
Hardware-Implemented Fault Injection. Software-Implemented Fault Injection.
Radiation-Induced Fault Injection.
Measurements. Data Processing. Preliminary Analysis. Dependency Analysis.
Markov Reward Modeling. Software Dependability. Failure Prediction.
6. Reliability Estimation.
Element Reliability. System Reliability.
The Reliability Model. Coverage Models.
7. Fault Tolerance in Software.
Motivation for Fault Tolerance in Software.
Failure Experience of Current Software. Consequences of Software Failure.
Difficulties in Test and Verification. A Framework for Further Discussion.
Dealing with Faulty Programs.
Robustness. Temporal Redundancy. Software Diversity.
Design of Fault Tolerant Software Using Diversity.
N-Version Programming. Recovery Block.
Composite Designs. The Distributed Recovery Block. The Extended Distributed Recovery Block.
Reliability Models for Fault-Tolerant Software.
Terminology and Software Failure Models. Reliability Models.
Construction of Acceptance Tests.
Program Characteristics Useful for Acceptance Tests.
Fault Trees as a Design Aid. Placement of Acceptance Tests within the Program.
Design Issues. Implementation Issues.
Case Study: Implementation of the Extended Distributed Recovery Block.
Notes on the System Architecture. EDRB Networks. Response Time, Recovery
Time, and Throughput. Key Executive Layer Tasks. Implementation Details.
8. System Diagnosis.
System Diagnosis under Bounded Models.
Diagnosis under the PMC Model. Comparison-Based Diagnosis.
System Diagnosis under Probabilistic Models.
Motivation. Probabilistic Models. Results in Directed Models. Results in