NCSU Institutional Repository >
NC State Theses and Dissertations >
Title: Analyzing and Characterizing Space and Time Sharing of the Cache Memory
Authors: Kim, Seong Beom
Advisors: Edward Gehringer, Committee Member
Suleyman Sair, Committee Member
Yan Solihin, Committee Chair
Vincent Freeh, Committee Member
Keywords: OS performance characterization
Issue Date: 23-Jul-2007
Discipline: Computer Engineering
Abstract: The first part of this dissertation presents a detailed study of concurrent space sharing of the cache memory, focusing on fairness in cache sharing between threads in a chip-multiprocessor (CMP) architecture. Prior work on CMP architectures has studied only throughput-optimization techniques for a shared cache; the issue of fairness, and its relation to throughput, has not been studied. Fairness is a critical issue because the Operating System (OS) thread scheduler's effectiveness depends on the hardware providing fair caching to co-scheduled threads. Without such hardware support, serious problems, such as thread starvation and priority inversion, can arise and render the OS scheduler ineffective.
This work makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them strongly correlate with execution-time fairness. Execution-time fairness is defined as how uniformly the execution times of co-scheduled threads change, where each change is measured relative to the execution time of the same thread running alone. Second, this work proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. Finally, this work studies the relationship between fairness and throughput in detail. We found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, our algorithms on average improve fairness by a factor of 4x while increasing throughput by 15%, compared to a non-partitioned shared cache.
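The execution-time fairness notion above can be made concrete with a small sketch. This is an illustrative metric only, assuming the simple definition given here (per-thread slowdown relative to running alone, with fairness as the uniformity of those slowdowns); the function names and the particular spread measure are hypothetical and are not necessarily among the five metrics the dissertation proposes.

```python
def slowdowns(shared_times, alone_times):
    """Per-thread slowdown: execution time when co-scheduled on the
    shared cache, divided by execution time when running alone."""
    return [s / a for s, a in zip(shared_times, alone_times)]

def unfairness(shared_times, alone_times):
    """Spread of the per-thread slowdowns; 0.0 means perfectly fair
    cache sharing (every thread is slowed by the same factor)."""
    d = slowdowns(shared_times, alone_times)
    return max(d) - min(d)

# Two co-scheduled threads: thread 0 is slowed 2.0x, thread 1 only 1.1x,
# so the shared cache is treating the pair quite unfairly (spread ~0.9).
print(unfairness([200.0, 110.0], [100.0, 100.0]))
```

A cache-partitioning algorithm that optimizes fairness would, in these terms, try to drive the slowdown spread toward zero across co-scheduled threads.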
The second part of the dissertation presents a novel simulation methodology that accelerates full-system simulation, in which the OS and application programs time-share the cache. The ongoing growth in computer hardware and software complexity has increased the complexity and overhead of cycle-accurate processor simulation, especially full-system simulation, which simulates not only user applications but also the OS and system libraries. This work addresses how to accelerate full-system simulation by studying, characterizing, and predicting the performance behavior of OS services.
Through studying the performance behavior of OS services, we found that each OS service exhibits multiple but limited behavior points that recur frequently. We exploit this observation to speed up full-system simulation. A simulation run is divided into two non-overlapping period types: learning periods, in which the performance behavior of instances of an OS service is characterized and recorded, and prediction periods, in which detailed simulation is replaced with a much faster emulation mode. During a prediction period, the behavior signature of an OS-service instance is obtained through emulation, while the instance's performance is predicted from its signature and records of the service's past performance behavior. Statistically rigorous algorithms determine when to switch between learning and prediction periods.
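The learning/prediction split above can be sketched as follows. All names here are hypothetical, and a fixed sample-count threshold stands in for the statistically rigorous switching tests the dissertation describes; this is a minimal sketch of the idea, not the dissertation's implementation.

```python
from collections import defaultdict

LEARN_SAMPLES = 5  # assumed threshold before trusting a prediction

class OSServicePredictor:
    def __init__(self):
        # (service, signature) -> observed cycle counts from detailed simulation
        self.history = defaultdict(list)

    def observe(self, service, signature, cycles):
        """Learning period: record this instance's detailed-simulation cost."""
        self.history[(service, signature)].append(cycles)

    def can_predict(self, service, signature):
        """Switch to prediction once enough instances have been characterized."""
        return len(self.history[(service, signature)]) >= LEARN_SAMPLES

    def predict(self, service, signature):
        """Prediction period: skip detailed simulation and estimate cost from
        the recorded past behavior of instances with the same signature."""
        samples = self.history[(service, signature)]
        return sum(samples) / len(samples)
```

In use, the simulator would run each OS-service invocation in detailed mode and call `observe` until `can_predict` holds for that signature, then drop to fast emulation and charge the `predict` estimate instead.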
We test the proposed scheme with a set of OS-intensive applications and a recent version of the Linux OS running on top of a detailed processor and memory-hierarchy model implemented in Simics, a popular full-system simulator. On average, the method needs learning periods to cover only 11% of OS-service invocations to produce highly accurate performance estimates. This yields an estimated simulation speedup of 4.9x, with an average performance-prediction error of only 3.2% and a worst-case error of 4.2%.
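The reported numbers can be sanity-checked with an Amdahl-style model, assuming only the learning fraction runs at full detailed-simulation cost. The emulation cost used below is an assumption solved backwards from the abstract's figures, not a number from the dissertation.

```python
def speedup(detail_fraction, emu_relative_cost):
    """Overall speedup when detail_fraction of invocations run in detailed
    simulation and the rest run in emulation at emu_relative_cost (< 1)
    of detailed-simulation speed per invocation."""
    return 1.0 / (detail_fraction + (1.0 - detail_fraction) * emu_relative_cost)

# Upper bound if emulation were free: covering only 11% in detail caps the
# speedup at about 9.1x, so the reported 4.9x is consistent with emulation
# costing roughly a tenth of detailed simulation (an assumed value).
print(round(speedup(0.11, 0.0), 1))
print(round(speedup(0.11, 0.105), 1))
```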
Appears in Collections: Dissertations
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.