National Institute of Technology Rourkela

राष्ट्रीय प्रौद्योगिकी संस्थान राउरकेला

ଜାତୀୟ ପ୍ରଯୁକ୍ତି ପ୍ରତିଷ୍ଠାନ ରାଉରକେଲା

An Institute of National Importance

Syllabus

Course Details

Subject {L-T-P / C} : CS6121 : Fault Tolerant Distributed System {3-0-0 / 3}

Subject Nature : Theory

Coordinator : Prof. Pabitra Mohan Khilar

Syllabus

Module 1: Introduction: High Performance Computing (HPC), Grand Challenge Problems (computation- and communication-intensive), Classification of Parallel Architectures (SMP, MPP, NUMA, Clusters) and Components of a Parallel Machine, Conventional Supercomputers and their limitations, Multiprocessor- and Multicomputer-based Distributed Systems, Introduction to Clusters and Grids.

Module 2: Fault Tolerance: Classification of faults, Fault detection, Fault diagnosis, Fault model, Hardware and software redundancy, Masking/Non-masking (Group and Hierarchical masking), Reliability and availability, Code protection/data protection (RAID Levels 0-5), Dependable Clusters (high-availability and high-performance clusters), Dependability Concepts, Quorums, Consensus and Broadcast, View-synchronous Group Communication, Distributed Cryptography, Byzantine Agreement, Service Replication, Data Storage.

Module 3: System-Level Diagnosis: Diagnosis and Diagnosability Theory, Testing Assignment, Syndrome Collection, Centralized vs. Distributed Diagnosis, Static vs. Dynamic Fault Environment, System and Fault Model, Classification of Diagnosis Algorithms, Evaluation Metrics such as Time and Space Complexity, Bounded Correctness, Applications to Distributed Embedded Systems, Internet, DSNs, MANETs, PVN. Fault Tolerant Networks: Measures of Resilience, Graph Theoretic Measures, Computer Network Measures, Regular Networks, Ad hoc Point-to-Point Networks.

Module 4: Role of fault detection and diagnosis in achieving fault tolerance in Distributed Embedded Systems: FBW, SBW and BBW systems, Graph representation of fault tolerance, k-fault-tolerant design principle, Automated fault diagnosis for UAV-enabled Distributed Embedded Systems.

Module 5: Application of soft computing principles (ANN, Clonal Selection Principles, PSO, etc.) and ML/AI to fault tolerance in distributed computing systems, Performance evaluation of diagnosis algorithms using ML/AI, Improving diagnosis accuracy.

Course Objectives

  • To understand the requirements for fault tolerant distributed computing systems
  • To design and develop efficient fault tolerance algorithms for distributed computing systems
  • To identify the fault tolerance measures for evaluating the performance of fault tolerant algorithms
  • To understand the behavior of fault tolerant systems using standard fault tolerance metrics

Course Outcomes

  • Design and implement distributed fault tolerant systems
  • Identify the fault tolerance requirements for designing robust distributed systems

Essential Reading

  • P. Jalote, Fault Tolerance in Distributed Systems, Prentice Hall, 1994
  • J. Joseph and C. Fellenstein, Grid Computing, Pearson Education, 2004

Supplementary Reading

  • H. Attiya and J. Welch, Distributed Computing: Fundamentals, Wiley, 2004
  • G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems: Concepts and Design, Addison Wesley, 2001

Journal and Conferences

  • P. M. Khilar and S. Mahapatra, “Time-Constrained Fault Tolerant X-by-wire Systems,” International Journal of Computer and Applications, Vol. 31, No. 4, Oct-Dec 2009, pp. 231-238
  • A. Mahapatra and P. M. Khilar, “Fault Diagnosis in Wireless Sensor Networks: A Survey,” IEEE Communications Surveys and Tutorials, Issue 99, pp. 1-27, April 2013