Course Details
Subject {L-T-P / C} : CS6121 : Fault Tolerant Distributed System { 3-0-0 / 3}
Subject Nature : Theory
Coordinator : Pabitra Mohan Khilar
Syllabus
Module 1: Introduction: High Performance Computing (HPC), Grand Challenge Problems (computation and communication intensive), Classification of Parallel Architectures (SMP, MPP, NUMA, Clusters) and Components of a Parallel Machine, Conventional Supercomputers and their limitations, Multi-processor and Multi-computer based Distributed Systems, Introduction to Clusters and Grids.
Module 2: Fault Tolerance: Classification of faults, Fault detection, Fault diagnosis, Fault model, Hardware and software redundancy, Masking/Non-masking (Group and Hierarchical masking), Reliability and availability, Code protection/data protection (RAID levels 0 to 5), Dependable Clusters: high availability and high performance clusters. Dependability Concepts, Quorums, Consensus and Broadcast, View Synchronous Group Communication, Distributed Cryptography, Byzantine Agreement, Service Replication, Data Storage.
Module 3: System Level Diagnosis: Diagnosis and Diagnosability Theory, Testing Assignment, Syndrome Collection, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Fault Environment, System and Fault Model, Classification of Diagnosis Algorithms, Evaluation Metrics such as Time and Space Complexity, Bounded Correctness, Applications to Distributed Embedded Systems, Internet, DSNs, MANETs, PVN. Fault Tolerant Networks: Measures of Resilience, Graph Theoretic Measures, Computer Network Measures, Regular Networks, Ad hoc Point-to-Point Networks.
Module 4: Role of fault detection and diagnosis in achieving fault tolerance in Distributed Embedded Systems: FBW, SBW and BBW systems, Graph representation of fault tolerance, k-fault tolerant design principle, Automated fault diagnosis for UAV-enabled Distributed Embedded Systems.
Module 5: Application of soft computing principles (ANN, Clonal Selection Principles, PSO, etc.) and ML/AI to fault tolerance in distributed computing systems, Performance evaluation of diagnosis algorithms using ML/AI, Improving diagnosis accuracy.
Course Objectives
- To understand the requirements for fault tolerant distributed computing systems
- To design and develop efficient fault tolerance algorithms for distributed computing systems
- To identify the fault tolerance measures for evaluating the performance of fault tolerant algorithms
- To understand the behavior of fault tolerant systems using standard fault tolerance metrics
Course Outcomes
- Design and implement distributed fault tolerant systems
- Identify the fault tolerance requirements for designing robust distributed systems
Essential Reading
- P. Jalote, Fault Tolerance in Distributed Systems, Prentice Hall, 1994
- J. Joseph and C. Fellenstein, Grid Computing, Pearson Education, 2004
Supplementary Reading
- H. Attiya and J. Welch, Distributed Computing: Fundamentals, Wiley, 2004
- G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design, Addison Wesley, 2001
Journals and Conferences
- P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems”, International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238
- A. Mahapatra and P. M. Khilar, “Fault Diagnosis in Wireless Sensor Networks: A Survey”, IEEE Communications Surveys and Tutorials, Issue 99, pp. 1-27, April 2013