National Institute of Technology Rourkela

राष्ट्रीय प्रौद्योगिकी संस्थान राउरकेला

ଜାତୀୟ ପ୍ରଯୁକ୍ତି ପ୍ରତିଷ୍ଠାନ ରାଉରକେଲା

An Institute of National Importance

Syllabus

Course Details

Subject {L-T-P / C} : CS4440 : Fault Tolerant Systems {3-0-0 / 3}

Subject Nature : Theory

Coordinator : Prof. Pabitra Mohan Khilar

Syllabus

Module 1: Introduction to fault tolerance; Requirements, goals and characteristics of fault tolerance; Challenges for fault tolerance; Types of faults: hard, soft, transient, intermittent and Byzantine faults; Causes of faults: environment, out of range, physical damage.

Module 2: Fault Models: PMC model, BGM, MM, MM* and comparison models, composite fault models.

Module 3: Algorithms for Fault Detection and Diagnosis: System-level diagnosis, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Diagnosis, Diagnosis Algorithms, Asymptotic Complexity, Diagnosable systems, t-diagnosability, k-connectivity, diagnosis parameters, Replica Management, K+1 Redundancy, Mechanisms for fault detection.

Module 4: Fault Isolation and Fault Recovery: Fault tree, Isolation and Recovery Algorithms; Fault Evaluation: Generic Evaluation Parameters, Diagnosis Latency, Diagnosis Start-up Time, False Alarm Rate, Time, Space and Message Complexity of fault diagnosis algorithms.

Module 5: Introduction to Fault Diagnosis in distributed systems such as Clusters, Grids, Internet, Cloud, Edge and Fog Computing Systems, IoT Systems, Multi-UAV systems, WSN, MANET, VANET, FANET and AANET; Automated Fault Diagnosis; Role of fault diagnosis in achieving fault tolerance; Fault diagnosis in distributed embedded systems.

Course Objectives

  • To identify the types of faults and fault behavior in distributed systems
  • To develop fault detection, diagnosis and recovery algorithms
  • To evaluate fault tolerant systems using standard diagnosis parameters
  • To apply fault diagnosis algorithms to different distributed systems

Course Outcomes

  • Evaluate the performance of fault tolerant systems
  • Identify the policies and mechanisms for achieving fault tolerance in distributed networks

Essential Reading

  • P. Jalote, Fault Tolerance in Distributed Systems, PHI, 1999
  • Elena Dubrova, Fault Tolerant Design, Springer, 2013

Supplementary Reading

  • Thomas Herault & Yves Robert, Fault Tolerance Techniques for High Performance Computing, Springer, 2015
  • D. Janakiram, Grid Computing, TMH, 2005

Journal and Conferences

  • P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems”, International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238.
  • Sanjaya Kumar Panda and Pabitra Mohan Khilar, “A Two-Step QoS Priority for Scheduling in Grid”, Proceedings of The Second IEEE International Conference on Parallel, Distributed and Grid Computing (PDGC), IEEE, Waknaghat, 6th-8th Dec 2012, pp. 502-507.