Course Details
Subject {L-T-P / C} : CS4440 : Fault Tolerant Systems {3-0-0 / 3}
Subject Nature : Theory
Coordinator : Prof. Pabitra Mohan Khilar
Syllabus
Module 1: Introduction to fault tolerance, Requirement of Fault Tolerance, Goals and Characteristics of fault tolerance, Challenges for fault tolerance, Types of faults: Hard, Soft, Transient, Intermittent and Byzantine Faults, Causes of Faults: Environment, Out of range, Physical damage
Module 2: Fault Model: PMC Model, BGM, MM, MM* and comparison models, Composite Fault Models
Module 3: Algorithms for Fault Detection and Diagnosis: System-level diagnosis, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Diagnosis, Diagnosis Algorithms, Asymptotic Complexity, Diagnosable systems, t-diagnosability, k-connectivity, diagnosis parameters, Replica Management, K+1 Redundancy, Mechanisms for fault detection
Module 4: Fault Isolation and Fault Recovery: Fault tree, Isolation and Recovery Algorithms, Fault Evaluation: Generic Evaluation Parameters, Diagnosis Latency, Diagnosis Start-up Time, False Alarm Rate, Time, Space and Message Complexity of fault diagnosis algorithms
Module 5: Introduction to Fault Diagnosis in distributed systems such as Clusters, Grids, Internet, Cloud, Edge and Fog Computing Systems, IoT Systems, Multi-UAV systems, WSN, MANET, VANET, FANET and AANET, Automated Fault Diagnosis, Role of fault diagnosis in achieving fault tolerance, Fault diagnosis in distributed embedded systems
Course Objectives
- To identify the types of faults and fault behavior in distributed systems
- To develop fault detection, diagnosis and recovery algorithms
- To evaluate the fault tolerant systems using standard diagnosis parameters
- To apply the fault diagnosis algorithms to different distributed systems
Course Outcomes
- Performance evaluation of fault tolerant systems
- Identify the policy and mechanisms for achieving fault tolerance in distributed networks
Essential Reading
- P. Jalote, Fault Tolerance in Distributed Systems, PHI, 1999
- Elena Dubrova, Fault Tolerant Design, Springer, 2013
Supplementary Reading
- Thomas H & Y. Robert, Fault Tolerance Techniques for High Performance Computing, Springer, 2015
- D. Janakiram, Grid Computing, TMH, 2005
Journals and Conferences
- P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems”, International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238.
- Sanjaya Kumar Panda and Pabitra Mohan Khilar, “A Two-Step QoS Priority for Scheduling in Grid”, Proceedings of the Second IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC), IEEE, Waknaghat, 6th-8th Dec 2012, pp. 502-507.