System Virtualization for Proactive Fault-Tolerant Computing

dc.contributor.advisor: Dr. Xiaosong MA, Committee Member [en_US]
dc.contributor.advisor: Dr. Xiaohui (Helen) Gu, Committee Member [en_US]
dc.contributor.advisor: Dr. Frank Mueller, Committee Chair [en_US]
dc.contributor.author: Nagarajan, Arun Babu [en_US]
dc.date.accessioned: 2010-04-02T18:07:06Z
dc.date.available: 2010-04-02T18:07:06Z
dc.date.issued: 2008-05-02 [en_US]
dc.degree.discipline: Computer Science [en_US]
dc.degree.level: thesis [en_US]
dc.degree.name: MS [en_US]
dc.description: North Carolina State University Theses Computer Science.
dc.description.abstract: Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This thesis contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications, one that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring. [en_US]
dc.format: Thesis (M.S.)--North Carolina State University.
dc.identifier.other: etd-04212008-235520 [en_US]
dc.identifier.uri: http://www.lib.ncsu.edu/resolver/1840.16/1750
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. [en_US]
dc.subject: Virtualization [en_US]
dc.subject: High-Performance Computing [en_US]
dc.subject: Proactive Fault Tolerance [en_US]
dc.title: System Virtualization for Proactive Fault-Tolerant Computing [en_US]
dcterms.abstract: Keywords: Virtualization, High-Performance Computing, Proactive Fault Tolerance.
dcterms.extent: viii, 40 pages : illustrations (some color)
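
Illustrative note on the abstract: the abstract describes a proactive FT daemon that combines health monitoring, load determination, and initiation of guest OS migration. The sketch below shows one way such decision logic could look. It is not the thesis's implementation. The `NodeStatus` fields, the temperature threshold, the rule "pick the least-loaded healthy spare node", and the helper names are all assumptions made for illustration. The command built by `migration_command` assumes the `xm migrate --live` syntax of Xen's classic toolstack and is only constructed, never executed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeStatus:
    """Hypothetical per-node snapshot; all field names are illustrative."""
    name: str
    temperature_c: float   # assumed health indicator
    load: float            # assumed load metric (lower means less busy)
    guest: Optional[str]   # guest OS hosting an MPI task, or None for a spare


def is_healthy(node: NodeStatus, max_temp_c: float) -> bool:
    """Assumed health rule: a node is healthy while below the threshold."""
    return node.temperature_c < max_temp_c


def plan_migrations(
    nodes: List[NodeStatus], max_temp_c: float = 70.0
) -> List[Tuple[str, str, str]]:
    """Return (guest, source, destination) triples for unhealthy nodes.

    Destinations are healthy spare nodes, least loaded first. If the
    spares run out, the remaining guests are left in place, which is
    where a reactive checkpoint/restart scheme would take over.
    """
    spares = sorted(
        (n for n in nodes if n.guest is None and is_healthy(n, max_temp_c)),
        key=lambda n: n.load,
    )
    plan: List[Tuple[str, str, str]] = []
    for node in nodes:
        if node.guest is None or is_healthy(node, max_temp_c):
            continue
        if not spares:
            break
        destination = spares.pop(0)
        plan.append((node.guest, node.name, destination.name))
    return plan


def migration_command(guest: str, destination: str) -> List[str]:
    """Build (but do not run) a live-migration command; syntax is assumed."""
    return ["xm", "migrate", "--live", guest, destination]


if __name__ == "__main__":
    cluster = [
        NodeStatus("n1", 82.0, 1.0, "vm1"),   # overheating, hosts an MPI task
        NodeStatus("n2", 45.0, 0.9, "vm2"),   # healthy, hosts an MPI task
        NodeStatus("n3", 40.0, 0.3, None),    # healthy spare
        NodeStatus("n4", 42.0, 0.1, None),    # healthy spare, least loaded
    ]
    for guest, source, destination in plan_migrations(cluster):
        print(source, "->", destination, migration_command(guest, destination))
```

A real daemon would refresh the snapshots periodically from whatever health and load sources the cluster provides and would hand the resulting command to the virtualization toolstack; both steps are left out of this sketch.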

Files

Original bundle

Name: etd.pdf
Size: 363.81 KB
Format: Adobe Portable Document Format
