Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems

dc.contributor.advisor: Dr. Tao Xie, Committee Member [en_US]
dc.contributor.advisor: Dr. Vincent Freeh, Committee Member [en_US]
dc.contributor.advisor: Dr. Frank Mueller, Committee Chair [en_US]
dc.contributor.author: Varma, Jyothish S [en_US]
dc.date.accessioned: 2010-04-02T18:10:17Z
dc.date.available: 2010-04-02T18:10:17Z
dc.date.issued: 2006-04-23 [en_US]
dc.degree.discipline: Computer Science [en_US]
dc.degree.level: thesis [en_US]
dc.degree.name: MS [en_US]
dc.description.abstract: Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This thesis presents a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds and single-digit milliseconds for reconfiguration using MPI over BlueGene/L and TCP over Gigabit, respectively. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems. [en_US]
dc.identifier.other: etd-04102006-132409 [en_US]
dc.identifier.uri: http://www.lib.ncsu.edu/resolver/1840.16/2089
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. [en_US]
dc.subject: BlueGene/L [en_US]
dc.subject: Parallel Computing [en_US]
dc.subject: High Performance [en_US]
dc.subject: Scalable [en_US]
dc.subject: Fault Tolerant [en_US]
dc.subject: Group Communication [en_US]
dc.title: Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems [en_US]

Files

Original bundle

Name: etd.pdf
Size: 425.74 KB
Format: Adobe Portable Document Format
