Course Details
Subject {L-T-P / C} : CS4440 : Fault Tolerant Systems {3-0-0 / 3}
Subject Nature : Theory
Coordinator : Pabitra Mohan Khilar
Syllabus
Module 1: Introduction to fault tolerance, Requirement of Fault Tolerance, Goals and Characteristics of fault tolerance, Challenges for fault tolerance, Types of faults: Hard, Soft, Transient, Intermittent and Byzantine Faults, Causes of Faults: Environment, Out of range, Physical damage
Module 2: Fault Model: PMC Model, BGM, MM, MM* and comparison models, Composite Fault Models
Module 3: Algorithms for Fault Detection and Diagnosis: System-level diagnosis, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Diagnosis, Diagnosis Algorithms, Asymptotic Complexity, Diagnosable systems, t-diagnosability, k-connectivity, diagnosis parameters, Replica Management, K+1 Redundancy, Mechanisms for fault detection
Module 4: Fault Isolation and Fault Recovery: Fault tree, Isolation and Recovery Algorithms, Fault Evaluation: Generic Evaluation Parameters, Diagnosis Latency, Diagnosis Start-up Time, False Alarm Rate, Time, Space and Message Complexity of fault diagnosis algorithms
Module 5: Introduction to Fault Diagnosis in distributed systems such as Clusters, Grids, Internet, Cloud, Edge and Fog Computing Systems, IoT Systems, Multi-UAV systems, Automated Fault Diagnosis, WSN, MANET, VANET, FANET, AANET, Role of fault diagnosis in achieving fault tolerance, Fault diagnosis in distributed embedded systems
Course Objectives
- To identify the types of faults and fault behavior in distributed systems
- To develop fault detection, diagnosis and recovery algorithms
- To evaluate the fault tolerant systems using standard diagnosis parameters
- To apply the fault diagnosis algorithms to different distributed systems
Course Outcomes
- Evaluate the performance of fault tolerant systems
- Identify the policies and mechanisms for achieving fault tolerance in distributed networks
Essential Reading
- P. Jalote, Fault Tolerance in Distributed Systems, PHI, 1999
- Elena Dubrova, Fault Tolerant Design, Springer, 2013
Supplementary Reading
- Thomas H & Y. Robert, Fault Tolerance Techniques for High Performance Computing, Springer, 2015
- D. Janakiram, Grid Computing, TMH, 2005
Journal and Conferences
- P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems”, International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238.
- Sanjaya Kumar Panda and Pabitra Mohan Khilar, “A Two-Step QoS Priority for Scheduling in Grid”, Proceedings of the Second IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC), IEEE, Waknaghat, 6th-8th Dec 2012, pp. 502-507.