Fault-Tolerant Computer System Design, 1/e

Dhiraj Pradhan, College Station, TX

Published February, 1996 by Prentice Hall PTR (ECS Professional)

Copyright 1996, 560 pp.
ISBN 0-13-057887-8


In the ten years since the publication of the first edition of this book, the field of fault-tolerant design has broadened in appeal, particularly with its emerging application in distributed computing. This new edition deals specifically with this dynamically changing computing environment, incorporating new topics such as fault tolerance in multiprocessor and distributed systems.


Details the latest developments in fault tolerance in multiprocessor and distributed systems.
Describes techniques for dependability prediction and measurement.
Includes a set of exercise problems in each chapter.

Table of Contents
(NOTE: Each chapter begins with an introduction and concludes with Problems and References).

    1. An Introduction to the Design and Analysis of Fault-Tolerant Systems.

        Fundamental Terminology. Objectives of Fault Tolerance. Applications of Fault Tolerance.

        Redundancy Techniques.

            Hardware Redundancy. Information Redundancy. Time Redundancy. Software Redundancy. Redundancy Example.

        Dependability Evaluation Techniques.

            Basic Definitions. Reliability Modeling. Safety Modeling. Availability Modeling. Maintainability Modeling.

        Design Methodology.

            The Design Process. Fault Avoidance in the Design Process.

    2. Architecture of Fault-Tolerant Computers.

        Taxonomy of Applications.

            General-Purpose Computing. High Availability Systems. Long-Life Systems. Critical Computations.

        General-Purpose Computing.

            Generic Computer. VAX 8600. IBM 3090.

        High Availability Systems.

            AT&T. Tandem. STRATUS. VAXft 310.

        Long-Life Systems.

            Spacecraft Systems. Voyager. Galileo.

        Critical Computations.

            SIFT. Space Shuttle Computer.

    3. Fault Tolerant Multiprocessor and Distributed Systems: Principles.

        Review of Multiprocessors and Fault Tolerance.

            SIMD versus MIMD. Moderate Parallel versus Massively Parallel. Fine Grain versus Coarse Grain. Shared Memory versus Distributed Memory. Topology of Interconnect. Programming Model.

        Implications on Fault Tolerance. Fault Tolerance Through Static Redundancy.

            Redundancy for Safety. Redundancy for Arbitrary Faults.

        Fault Tolerance Through Dynamic or Stand-by Redundancy. Fault Detection in Multiprocessors.

            Fault Detection through Duplication and Comparison. Fault Detection Using Diagnostics and Coding Techniques.

        Recovery Strategies for Multiprocessor Systems. Rollback Recovery Using Checkpoints.

            Processor-Cache-Based Checkpoints. Virtual Checkpoints.

        Rollback Recovery Issues in Communicating Multiprocessors.

            Shared-Memory Multiprocessors. Distributed Memory Multiprocessors. Recovery in Distributed Shared Memory Systems. Recovery in Database Systems.

        Forward Recovery Schemes.

            Static Redundancy Approaches. Dynamic Redundancy Approaches. Software Redundancy-Based Approach for Forward Recovery.

        Reconfiguration in Multiprocessors.

            Bus-based Systems. Crossbar-based Systems. Multistage Interconnection Networks. Hypercube Networks. de Bruijn Networks. Mesh Networks. Tree Networks. Theoretical Issues.

        Appendix: Other Approaches to Fault Detection.

            Algorithm-based Fault Detection.


    4. Case Studies in Fault Tolerant Multiprocessor and Distributed Systems.

        Case Study 1: Tandem Multicomputer Systems.

            NonStop Cyclone. Himalaya K10000.

        Case Study 2: Tandem Integrity S2.

            Architecture. Fault Tolerance Strategies.

        Case Study 3: Stratus XA/R Series 300 Systems.

            System Architecture. The Pair-and-Spare Approach. System Software.

        Case Study 4: Sequoia Series 400 System.

            System Architecture. Hardware Fault Tolerance. Software Fault Tolerance.

        Case Study 5: The Error-Resistant Interactively Consistent Architecture (ERICA).

            The (4,2)-Concept. The Architecture.

        Case Study 6: Fault-Tolerant Parallel Processor.

            Byzantine Resilience. Architecture. Prototypes.

        Case Study 7: The MAFT Design for Ultra-Reliable Systems.

            The MAFT Philosophy. System Model.

    5. Experimental Analysis of Computer System Dependability.

        Statistical Techniques.

            Parameter Estimation. Distribution Characterization. Multivariate Analysis. Importance Sampling.

        Design Phase.

            Simulated Fault Injection at the Electrical Level. Simulated Fault Injection at the Logic Level. Simulated Fault Injection at the Function Level.

        Prototype Phase.

            Hardware-Implemented Fault Injection. Software-Implemented Fault Injection. Radiation-Induced Fault Injection.

        Operational Phase.

            Measurements. Data Processing. Preliminary Analysis. Dependency Analysis. Markov Reward Modeling. Software Dependability. Failure Prediction.

    6. Reliability Estimation.

        Element Reliability. System Reliability.

        Behavioral Decomposition.

            The Reliability Model. Coverage Models.

        An Example.

    7. Fault Tolerance in Software.

        Motivation for Fault Tolerance in Software.

            Failure Experience of Current Software. Consequences of Software Failure. Difficulties in Test and Verification. A Framework for Further Discussion.

        Dealing with Faulty Programs.

            Robustness. Temporal Redundancy. Software Diversity.

        Design of Fault Tolerant Software Using Diversity.

            N-Version Programming. Recovery Block. Composite Designs. The Distributed Recovery Block. The Extended Distributed Recovery Block.

        Reliability Models for Fault-Tolerant Software.

            Terminology and Software Failure Models. Reliability Models.

        Construction of Acceptance Tests.

            Program Characteristics Useful for Acceptance Tests. Fault Trees as a Design Aid. Placement of Acceptance Tests within the Program.

        Exception Handling.

            Design Issues. Implementation Issues.

        Case Study: Implementation of the Extended Distributed Recovery Block.

            Notes on the System Architecture. EDRB Networks. Response Time, Recovery Time, and Throughput. Key Executive Layer Tasks. Implementation Details.

    8. System Diagnosis.

        System Diagnosis under Bounded Models.

            Diagnosis under the PMC Model. Comparison-Based Diagnosis.

        System Diagnosis under Probabilistic Models.

            Motivation. Probabilistic Models. Results in Directed Models. Results in Undirected Models.


© Prentice-Hall, Inc. A Simon & Schuster Company