Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems

Varma, Jyothish S

Files

etd.pdf (425.74 KB)

Date

2006-04-23

Authors

Varma, Jyothish S

Advisors

Dr. Tao Xie, Committee Member

Dr. Vincent Freeh, Committee Member

Dr. Frank Mueller, Committee Chair

Abstract

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean-time-to-failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This thesis presents a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems.

Keywords

BlueGene/L, Parallel Computing, High Performance, Scalable, Fault Tolerant, Group Communication

URI

http://www.lib.ncsu.edu/resolver/1840.16/2089

Degree

MS

Discipline

Computer Science

Collections

Theses
