Transparent Fault Tolerance for Job Healing in HPC Environments
Date
2009-07-07
Abstract
As the number of nodes in high-performance computing (HPC) environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions.
This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas.
First, at the job level, novel, scalable mechanisms are built to support proactive
fault tolerance (FT) and to significantly enhance reactive FT. The contributions of this dissertation in this
area are (1) a transparent job pause mechanism, which allows a job to pause when a process
fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant
approach that combines process-level live migration with health monitoring to complement
reactive with proactive FT and to reduce the number of checkpoints when a majority of the
faults can be handled proactively; (3) a novel back migration approach to eliminate load
imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing
mechanism, which is combined with full checkpoints to explore the potential of reducing the
overhead of checkpointing by performing fewer full checkpoints interspersed with multiple
smaller incremental checkpoints.
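
The interplay of full and incremental checkpoints in contribution (4) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the class `Checkpointer`, the parameter `full_every`, the tiny `PAGE_SIZE`, and the detection of modified pages by comparing against the previous image are all assumptions made for illustration, and the memory image is assumed to keep a fixed size.

```python
from typing import Dict, List

PAGE_SIZE = 4  # hypothetical, deliberately tiny page size for illustration


def split_pages(memory: bytes) -> Dict[int, bytes]:
    """Split a memory image into fixed-size pages keyed by page index."""
    return {i // PAGE_SIZE: memory[i:i + PAGE_SIZE]
            for i in range(0, len(memory), PAGE_SIZE)}


class Checkpointer:
    """Toy checkpointer: one full checkpoint every `full_every` calls,
    incremental checkpoints (changed pages only) in between."""

    def __init__(self, full_every: int) -> None:
        self.full_every = full_every
        self.count = 0
        self.last_pages: Dict[int, bytes] = {}
        self.log: List[dict] = []

    def checkpoint(self, memory: bytes) -> dict:
        pages = split_pages(memory)
        if self.count % self.full_every == 0:
            record = {"kind": "full", "pages": pages}
        else:
            # Assumption: changed pages are found by comparison with the
            # previous image; a real system would track writes instead.
            dirty = {i: p for i, p in pages.items()
                     if self.last_pages.get(i) != p}
            record = {"kind": "incremental", "pages": dirty}
        self.log.append(record)
        self.last_pages = pages
        self.count += 1
        return record

    def restore(self) -> bytes:
        """Rebuild the latest image: newest full checkpoint, then every
        later incremental checkpoint applied in order."""
        start = max(i for i, r in enumerate(self.log) if r["kind"] == "full")
        pages = dict(self.log[start]["pages"])
        for record in self.log[start + 1:]:
            pages.update(record["pages"])
        return b"".join(pages[i] for i in sorted(pages))
```

In this sketch, a larger `full_every` means fewer full checkpoints and less data written per checkpoint on average, at the price of a longer chain of incremental checkpoints to replay on restore.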
Second, for the job input data, transparent techniques are provided to improve the
reliability, availability and performance of HPC I/O systems. In this area, the dissertation
contributes (1) a mechanism for offline job input data reconstruction to ensure availability
of job input data and to improve center-wide performance at no cost to job owners; (2)
an approach to automatically recover job input data at run-time during failures by recovering
staged data from an original source; and (3) “just in time” replication of job input data so
as to maximize the use of supercomputer cycles.
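
The idea behind contribution (2), recovering staged input data from its original source when a read fails, can be sketched at the level of whole files. This is only an illustration under stated assumptions: the function `read_with_recovery` and its parameters are hypothetical, and recovery in a parallel file system would likely operate on parts of a file rather than copying the entire file again.

```python
import shutil
from pathlib import Path


def read_with_recovery(staged: Path, source: Path) -> bytes:
    """Read the staged copy of a job input file; if it is unavailable,
    re-stage it from the original source and read it again.

    Both paths are hypothetical stand-ins for a scratch-space copy and
    the location the data was originally staged from.
    """
    try:
        return staged.read_bytes()
    except OSError:
        # Staged copy is missing or unreadable: fetch it again from the
        # original source, transparently to the caller.
        staged.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, staged)
        return staged.read_bytes()
```

The caller sees the same bytes whether or not a recovery took place, which is the sense in which such a mechanism is transparent to the job.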
Experimental results demonstrate the value of these advanced fault tolerance techniques
to increase fault resilience in HPC environments.
Keywords
job input data, fault tolerance, high-performance computing, fault resilience, checkpoint/restart
Degree
PhD
Discipline
Computer Science