Adding Coordination to the Management of High-End Storage Systems

Abstract

Today’s scientific and commercial applications rely heavily on high-end computing (HEC) facilities such as large-scale datacenters and supercomputers. In these facilities, storage subsystems play an increasingly important role in the overall computing experience perceived by users. Providing high performance and reliability for these high-end storage systems is challenging because of their high I/O demands, large scales, and complex architectures. We observe that, in addition to the well-recognized shortage of I/O resources relative to aggregate computing demands, one main challenge faced by high-end storage systems lies in the growing scale and complexity of the entire environment. Individually developed system components and algorithms often apply isolated local optimizations and handle concurrent user workloads without considering inter-workload relationships. The author’s Ph.D. research focuses on three novel instances of bringing adaptive coordination to the management of commercial and scientific high-end storage systems, at different levels of the HEC storage hierarchy. First, on a single storage server, we present a memory cache allocation mechanism that coordinates multiple concurrent sequential access streams with different request rates. This work is based on the observation that the problem bears a strong resemblance to situations long studied in the field of supply chain management (SCM), a discipline practiced by large vendors and retailers. Second, in a multi-level storage architecture, we address the problem of information distortion in uncoordinated prefetching operations on different storage caches. We develop a simple information-sharing mechanism, as well as a transparent hierarchy-aware optimization component named PreFetching-Coordinator (PFC), which monitors both upper- and lower-level caches and adjusts the aggressiveness of lower-level prefetching. Finally, we improve data availability across an entire distributed storage system by coordinating it with the HPC job scheduler and remote data sources. We implemented the proposed techniques in real software environments, including a state-of-the-art operating system kernel, a widely used job scheduler, and a popular parallel file system, as well as in verified simulators. Experimental results from both real-system experiments and simulations show that the proposed techniques can significantly improve system performance and reliability by coordinating among system components and requests.

Keywords

Data Storage, Supply Chain Management, Operating Systems

Degree

PhD

Discipline

Computer Science
