National Institute of Technology Rourkela

An Institute of National Importance

Syllabus

Course Details

Subject {L-T-P / C} : CS4440 : Fault Tolerant Systems {3-0-0 / 3}

Subject Nature : Theory

Coordinator : Pabitra Mohan Khilar

Syllabus

Module 1: Introduction to fault tolerance, Requirement of Fault Tolerance, Goals and Characteristics of fault tolerance, Challenges for fault tolerance, Types of faults: Hard, Soft, Transient, Intermittent and Byzantine Faults, Causes of Faults: Environment, Out of range, Physical damage

Module 2: Fault Model: PMC Model, BGM, MM, MM* and comparison models, Composite Fault Models

Module 3: Algorithms for Fault Detection and Diagnosis: System-level diagnosis, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Diagnosis, Diagnosis Algorithms, Asymptotic Complexity, Diagnosable systems, t-diagnosability, k-connectivity, diagnosis parameters, Replica Management, K+1 Redundancy, Mechanisms for fault detection.

Module 4: Fault Isolation and Fault Recovery: Fault tree, Isolation and Recovery Algorithms, Fault Evaluation: Generic Evaluation Parameters, Diagnosis Latency, Diagnosis Start-up Time, False Alarm Rate, Time, Space & Message Complexity of Fault diagnosis algorithms

Module 5: Introduction to Fault Diagnosis in distributed systems such as Clusters, Grids, Internet, Cloud, Edge and Fog Computing Systems, IoT Systems, Multi-UAV systems, Automated Fault Diagnosis, WSN, MANET, VANET, FANET, AANET, Role of Fault diagnosis to achieve fault tolerance. Fault diagnosis in distributed embedded systems

Course Objectives

  • To identify the types of faults and fault behavior in distributed systems
  • To develop fault detection, diagnosis and recovery algorithms
  • To evaluate the fault tolerant systems using standard diagnosis parameters
  • To apply the fault diagnosis algorithms to different distributed systems

Course Outcomes

  • Evaluate the performance of fault tolerant systems
  • Identify the policy and mechanisms for achieving fault tolerance in distributed networks

Essential Reading

  • P. Jalote, Fault Tolerance in Distributed Systems, PHI, 1999
  • Elena Dubrova, Fault Tolerant Design, Springer, 2013

Supplementary Reading

  • Thomas H & Y. Robert, Fault Tolerance Techniques for High Performance Computing, Springer, 2015
  • D. Janakiram, Grid Computing, TMH, 2005

Journal and Conferences

  • P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems”, International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238.
  • Sanjaya Kumar Panda and Pabitra Mohan Khilar, “A Two-Step QoS Priority for Scheduling in Grid”, Proceedings of the Second IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC), IEEE, Waknaghat, 6th-8th Dec 2012, pp. 502-507.