Adding Coordination to the Management of High-End Storage Systems

Title: Adding Coordination to the Management of High-End Storage Systems
Author: Zhang, Zhe
Advisors: Xiaosong Ma, Committee Chair
William Stewart, Committee Co-Chair
Frank Mueller, Committee Member
Robert Handfield, Committee Member
Sudharshan Vazhkudai, Committee Member
Abstract: Today’s scientific and commercial applications rely heavily on high-end computing (HEC) facilities, including large-scale datacenters, supercomputers, and so forth. In these facilities, the storage subsystems are playing an increasingly important role in the overall computing experience perceived by users. Meanwhile, it is a challenging task to provide high performance and reliability to those high-end storage systems due to their high I/O demands, large scales, and complex architectures. We observe that in addition to the well-recognized lack of I/O resources relative to computing demands from an aggregate perspective, one main challenge faced by high-end storage systems lies in the growing scale and complexity of the entire environment. Individually developed system components or algorithms often behave with isolated local optimizations, and handle concurrent user workloads without considering inter-workload relationships. The author’s Ph.D. research focuses on three novel instances of bringing adaptive coordination to the management of commercial and scientific high-end storage systems, at different levels of the HEC storage hierarchy. Firstly, on a single storage server, we present a memory cache allocation mechanism which coordinates multiple concurrent sequential access streams with different request rates. Our work is based on the interesting observation that this problem bears a strong resemblance to situations long studied in the field of supply chain management (SCM), used by large vendors and retailers. Furthermore, in a multi-level storage architecture, we address the problem of information distortion in uncoordinated prefetching operations on different storage caches. We develop a simple information sharing mechanism, as well as a transparent hierarchy-aware optimization component named PreFetching-Coordinator (PFC), which monitors both upper- and lower-level caches, and adjusts the aggressiveness of lower-level prefetching.
Finally, we improve the data availability in an entire distributed storage system by coordinating it with the HPC job scheduler and remote data sources. We implemented the proposed techniques in real software environments, including a state-of-the-art operating system kernel, a widely used job scheduler and a popular parallel file system, as well as verified simulators. Our experimental results collected from real system experiments and simulations show that our proposed techniques can significantly improve system performance and reliability by coordinating among system components and requests.
Date: 2009-11-20
Degree: PhD
Discipline: Computer Science

Files in this item

File: etd.pdf (1.546 MB, PDF)
