Transparent Fault Tolerance for Job Healing in HPC Environments

dc.contributor.advisor: Dr. Frank Mueller, Committee Chair [en_US]
dc.contributor.advisor: Dr. Xiaosong Ma, Committee Member [en_US]
dc.contributor.advisor: Dr. Yan Solihin, Committee Member [en_US]
dc.contributor.advisor: Dr. Nagiza Samatova, Committee Member [en_US]
dc.contributor.author: Wang, Chao [en_US]
dc.date.accessioned: 2010-04-02T18:53:58Z
dc.date.available: 2010-04-02T18:53:58Z
dc.date.issued: 2009-07-07 [en_US]
dc.degree.discipline: Computer Science [en_US]
dc.degree.level: dissertation [en_US]
dc.degree.name: PhD [en_US]
dc.description.abstract: As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses of intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in the fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing checkpointing overhead by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints. Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems.
In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run time during failures by recovering staged data from an original source; and (3) “just in time” replication of job input data so as to maximize the use of supercomputer cycles. Experimental results demonstrate the value of these advanced fault tolerance techniques in increasing fault resilience in HPC environments. [en_US]
dc.identifier.other: etd-06302009-003240 [en_US]
dc.identifier.uri: http://www.lib.ncsu.edu/resolver/1840.16/4437
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. [en_US]
dc.subject: job input data [en_US]
dc.subject: fault tolerance [en_US]
dc.subject: high-performance computing [en_US]
dc.subject: fault resilience [en_US]
dc.subject: checkpoint/restart [en_US]
dc.title: Transparent Fault Tolerance for Job Healing in HPC Environments [en_US]

Files

Original bundle

Name: etd.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format
