Browsing by Author "Dr. Nagiza Samatova, Committee Member"
- Remote Data Collection and Analysis using Mobile Agents and Service-Oriented Architectures (2008-08-19) Girish Chandra, Harsha; Dr. Helen Gu, Committee Member; Dr. Frank Mueller, Committee Chair; Dr. Nagiza Samatova, Committee Member
The ubiquity of wireless systems has ushered us into a new era of mobile computing. With the emergence of superior input/output and communication hardware and cheap data services, mobile phones have become a platform for offering new and exotic services. Superior GUIs and remote connectivity make mobile phones and PDAs good candidates for data collection, but their limited battery life and computing power make them poor computational devices. We introduce a novel architecture that builds on agents on mobile phones as the front end and a service-oriented architecture composed of high-performance devices as the back end. Agent-based computing, which has proved to be advantageous for desktops/servers, can also encompass hand-held devices to provide us with new service management capabilities. In this thesis, we discuss a new service deployment strategy on mobile phones based on mobile agents. A mobile agent is an agent that can migrate from one node to another in the network while preserving its state. This solves the problem of introducing new services manually and provides the advantage of on-the-fly code updates for existing services. We also discuss the challenges of mobile agent development in Java, mainly introducing code migration in Java (J2ME), which is the critical component of a mobile agent, and interoperability among different J2ME profiles and with Java Standard Edition (J2SE). As a computational backbone for the architecture, we utilize inexpensive but powerful nodes based on the IBM Cell Broadband architecture, namely PlayStation 3 (PS3) devices running Linux. Powered by a RISC-based main processing unit (PPU) and eight synergistic processing units (SPUs), a PS3 can analyze large data sets with great speed.
In this work, we also analyze the programming paradigm used on PS3 machines. We discuss the design and implementation of several high-performance kernels on the PS3 and measure the speedup obtained relative to an x86 machine. Lastly, we discuss how high-performance computing can be introduced as a service by using platform-neutral protocols such as XML-RPC to integrate the heterogeneous platform of mobile agents and service-oriented architectures (SOA).
- Transparent Fault Tolerance for Job Healing in HPC Environments (2009-07-07) Wang, Chao; Dr. Frank Mueller, Committee Chair; Dr. Xiaosong Ma, Committee Member; Dr. Yan Solihin, Committee Member; Dr. Nagiza Samatova, Committee Member
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance (FT) techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints. Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems.
In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run-time during failures by recovering staged data from an original source; and (3) “just in time” replication of job input data so as to maximize the use of supercomputer cycles. Experimental results demonstrate the value of these advanced fault tolerance techniques to increase fault resilience in HPC environments.
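
The idea of interspersing full checkpoints with smaller incremental ones, mentioned in the abstract above, can be illustrated with a toy sketch. This is not the dissertation's mechanism: the page-dictionary memory model, the `CheckpointLog` class, and the `full_every` policy knob are all hypothetical names chosen for illustration, and the sketch assumes a fixed set of pages (pages are modified but never removed).

```python
from typing import Dict, List, Tuple

# Hypothetical memory model: page number -> page contents.
State = Dict[int, bytes]


class CheckpointLog:
    """Toy checkpoint store mixing full and incremental checkpoints."""

    def __init__(self, full_every: int) -> None:
        # Assumed policy: at most `full_every` incremental checkpoints
        # are taken between two full checkpoints.
        self.full_every = full_every
        self.records: List[Tuple[str, State]] = []
        self._last: State = {}   # state as of the most recent checkpoint
        self._since_full = 0     # incremental checkpoints since last full

    def checkpoint(self, state: State) -> str:
        """Record a checkpoint and return its kind ("full" or "incr")."""
        if not self.records or self._since_full >= self.full_every:
            kind, payload = "full", dict(state)
            self._since_full = 0
        else:
            kind = "incr"
            # Save only the pages that changed since the last checkpoint.
            payload = {p: d for p, d in state.items() if self._last.get(p) != d}
            self._since_full += 1
        self.records.append((kind, payload))
        self._last = dict(state)
        return kind

    def restore(self) -> State:
        """Rebuild the latest state: last full checkpoint plus later deltas."""
        if not self.records:
            raise ValueError("no checkpoint taken yet")
        start = max(i for i, (kind, _) in enumerate(self.records) if kind == "full")
        state = dict(self.records[start][1])
        for _, delta in self.records[start + 1:]:
            state.update(delta)
        return state


log = CheckpointLog(full_every=2)
memory = {0: b"a", 1: b"b", 2: b"c"}
kinds = [log.checkpoint(memory)]      # first checkpoint is always full
memory[1] = b"B"
kinds.append(log.checkpoint(memory))  # stores only page 1
memory[2] = b"C"
kinds.append(log.checkpoint(memory))  # stores only page 2
restored = log.restore()
```

In this sketch each incremental record holds a single modified page rather than all three, which is where the reduction in checkpoint size would come from; the trade-off is that a restore has to replay every increment since the last full checkpoint.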
