Browsing by Author "Vincent Freeh, Committee Member"
- Analytical Bounding Data Cache Behavior for Real-Time Systems (2008-07-21) Ramaprasad, Harini; Frank Mueller, Committee Chair; Eric Rotenberg, Committee Member; Vincent Freeh, Committee Member; Tao Xie, Committee Member
This dissertation presents data cache analysis techniques that make it feasible to predict data cache behavior and to bound the worst-case execution time for a large class of real-time programs. Data caches are an increasingly important architectural feature in most modern computer systems: they help bridge the gap between processor speeds and memory access times. One inherent difficulty of using data caches in a real-time system is the unpredictability of memory accesses, which makes it difficult to calculate worst-case execution times of real-time tasks. This dissertation presents an analytical framework that characterizes data cache behavior in the context of independent, periodic tasks with deadlines less than or equal to their periods, executing on a single, in-order processor. The framework has three major components. 1) The first component analytically derives data cache reference patterns for all scalar and non-scalar references in a task. Using these, it produces a safe and tight upper bound on the worst-case execution time of the task without considering interference from other tasks. 2) The second component calculates the worst-case execution time and response time of a task in a multi-task, prioritized, preemptive environment. It calculates the data-cache-related preemption delay for tasks, assuming that all tasks in the system are fully preemptive. 3) In the third component, tasks are allowed to have critical sections in which they access shared resources. In this context, two analysis techniques are presented. In the first, a task executing in a critical section may not be preempted by any other task. In the second, the framework incorporates resource sharing policies to arbitrate accesses to shared resources, thereby improving the responsiveness of high-priority tasks that do not use a particular resource. All components assume a direct-mapped data cache. Experimental results demonstrate the value of these analysis techniques in the context of data cache usage in a hard real-time system.
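The second component's combination of response-time analysis with a preemption-delay charge can be illustrated with the classic iterative response-time recurrence, extended by a cache-related delay term per preemption. This is a minimal sketch of that general technique, not the dissertation's exact analysis; all task parameters below are hypothetical.

```python
# Response-time analysis with a per-preemption cache-related delay term,
# in the spirit of the second component described above. Task parameters
# (C = WCET, T = period, crpd = data-cache-related preemption delay
# charged per preemption) are hypothetical.
import math

def response_time(tasks, i):
    """Iterative response-time analysis for task i (tasks sorted by
    descending priority). Returns None if the response time exceeds
    the period (deadline = period, as in the framework above)."""
    C, T = tasks[i]["C"], tasks[i]["T"]
    R = C
    while True:
        # Each preemption by a higher-priority task j costs its WCET plus
        # a CRPD term for the cache lines it may evict.
        interference = sum(
            math.ceil(R / tasks[j]["T"]) * (tasks[j]["C"] + tasks[j]["crpd"])
            for j in range(i)
        )
        R_next = C + interference
        if R_next == R:
            return R  # fixed point reached
        if R_next > T:
            return None  # unschedulable
        R = R_next

tasks = [  # highest priority first
    {"C": 2, "T": 10, "crpd": 1},
    {"C": 4, "T": 20, "crpd": 2},
    {"C": 6, "T": 50, "crpd": 0},
]
print([response_time(tasks, i) for i in range(len(tasks))])  # [2, 7, 18]
```

The recurrence converges because the interference term is monotone in R; iteration stops at the first fixed point or when the period is exceeded.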
- Analyzing and Characterizing Space and Time Sharing of the Cache Memory (2007-07-23) Kim, Seong Beom; Edward Gehringer, Committee Member; Suleyman Sair, Committee Member; Yan Solihin, Committee Chair; Vincent Freeh, Committee Member
The first part of this dissertation presents a detailed study of concurrent space sharing of the cache memory, focusing on fairness in cache sharing between threads in a chip-multiprocessor (CMP) architecture. Prior work on CMP architectures has studied only throughput optimization techniques for a shared cache; the issue of fairness, and its relation to throughput, has not been studied. Fairness is a critical issue because the effectiveness of the Operating System (OS) thread scheduler depends on the hardware providing fair caching to co-scheduled threads. Without such hardware, serious problems such as thread starvation and priority inversion can arise and render the OS scheduler ineffective. This work makes several contributions. First, it proposes and evaluates five cache fairness metrics that measure the degree of fairness in cache sharing, and shows that two of them correlate strongly with execution-time fairness, defined as how uniformly the execution times of co-scheduled threads change, where each change is relative to the execution time of the same thread running alone. Second, it proposes static and dynamic L2 cache partitioning algorithms that optimize fairness. The dynamic partitioning algorithm is easy to implement, requires little or no profiling, has low overhead, and does not restrict the cache replacement algorithm to LRU. Finally, it studies the relationship between fairness and throughput in detail: we found that optimizing fairness usually increases throughput, while maximizing throughput does not necessarily improve fairness. Using a set of co-scheduled pairs of benchmarks, our algorithms improve fairness by a factor of 4x on average, while increasing throughput by 15%, compared to a non-partitioned shared cache.
The second part of the dissertation presents a novel simulation methodology that accelerates full-system simulation where the OS and application programs time-share the cache. The ongoing growth in computer hardware and software complexity has increased the complexity and overhead of cycle-accurate processor simulation, especially full-system simulation, which simulates not only user applications but also the OS and system libraries. This work seeks to accelerate full-system simulation by studying, characterizing, and predicting the performance behavior of OS services. We found that each OS service exhibits multiple but limited behavior points that are repeated frequently, and we exploit this observation to speed up full-system simulation. A simulation run is divided into two non-overlapping periods: a learning period, in which the performance behavior of instances of an OS service is characterized and recorded, and a prediction period, in which detailed simulation is replaced with a much faster emulation mode. During a prediction period, the behavior signature of an instance of an OS service is obtained through emulation, while the performance of the instance is predicted from its signature and records of the OS service's past performance behavior. Statistically rigorous algorithms determine when to switch between learning and prediction periods. We test the proposed scheme with a set of OS-intensive applications and a recent version of the Linux OS running on top of a detailed processor and memory hierarchy model implemented on Simics, a popular full-system simulator. On average, the method needs the learning periods to cover only 11% of OS service invocations to produce highly accurate performance estimates. This leads to an estimated simulation speedup of 4.9x, with an average performance prediction error of only 3.2% and a worst-case error of 4.2%.
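The execution-time fairness notion defined above (uniformity of per-thread slowdowns relative to running alone) can be sketched as a small metric. This is an illustration of the concept, not one of the dissertation's five metrics; the timing numbers are hypothetical.

```python
# Sketch of an execution-time fairness measure: each thread's slowdown is
# its execution time when co-scheduled divided by its time running alone;
# a perfectly fair shared cache slows all co-scheduled threads uniformly.
# All timing numbers below are hypothetical.

def slowdowns(alone, shared):
    return [s / a for a, s in zip(alone, shared)]

def unfairness(alone, shared):
    """Maximum pairwise gap between slowdowns; 0.0 means perfectly uniform."""
    sd = slowdowns(alone, shared)
    return max(sd) - min(sd)

alone  = [10.0, 20.0]   # seconds, each thread running by itself
fair   = [13.0, 26.0]   # both slowed by 1.3x -> perfectly uniform
unfair = [11.0, 34.0]   # 1.1x vs 1.7x -> one thread monopolizes the cache

print(unfairness(alone, fair))    # 0.0
print(unfairness(alone, unfair))  # ~0.6
```

A partitioning algorithm optimizing fairness would adjust per-thread cache allocations to drive such a gap toward zero.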
- Analyzing and Improving Linux Kernel Memory Protection: A Model Checking Approach.(2010-04-27) Liakh, Siarhei; Xuxian Jiang, Committee Chair; Rainer Mueller, Committee Member; Vincent Freeh, Committee Member
- Controller in Core: An Adaptive Microarchitectural Model for System-level Optimization (2007-07-19) Gao, Fei; Suleyman Sair, Committee Chair; Thomas M Conte, Committee Member; Yan Solihin, Committee Member; Vincent Freeh, Committee Member
Modern processors employ ever more complex microarchitectures to extract more Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP) and thereby improve performance. In addition to performance concerns, power and thermal issues have become important for microprocessor designers because of increasing power/heat density and the resulting cooling costs. Meanwhile, many runtime architectural optimization approaches, called adaptive microarchitectures, have been proposed to optimize system resources dynamically based on the characteristics of applications; however, most of them focus on improving a specific microarchitecture component or metric. In this dissertation, we argue that system-wide optimization is the future of adaptive microarchitectures, balancing the tradeoffs between different optimization choices to achieve maximum overall performance. We propose a runtime optimization architectural model, Controller in Core (CiC), which uses a dedicated element to synthesize and analyze system-wide runtime information and make judicious optimization decisions. To demonstrate the CiC model, we present a performance-oriented adaptive microarchitecture: an adaptive value predictor that tailors its value prediction functionality based on runtime analysis of system performance bottlenecks. We propose an event-counter-based performance model that accurately estimates the performance cost of critical system events. Based on this model, we propose the bottleneck vector as the basis of long-term performance bottleneck analysis, along with a runtime bottleneck phase tracking scheme, and we study three bottleneck phase prediction schemes. Building on this bottleneck analysis, we develop adaptation algorithms to control the adaptive value predictor. Our results show that the adaptive value predictor achieves 30% and 10% average performance gains compared to the baseline and to traditional value predictor designs, respectively.
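The value predictor at the heart of this case study can be illustrated with the textbook last-value scheme guarded by a saturating confidence counter, the kind of mechanism an adaptive controller could enable or throttle per load. The thresholds and table layout below are hypothetical, not the dissertation's design.

```python
# Minimal sketch of a last-value predictor with a saturating confidence
# counter: a prediction is issued only after the same value has repeated
# enough times at a given instruction address. Thresholds are hypothetical.

class LastValuePredictor:
    def __init__(self, confident_at=2, max_conf=3):
        self.table = {}                  # pc -> [last_value, confidence]
        self.confident_at = confident_at
        self.max_conf = max_conf

    def predict(self, pc):
        """Return the predicted value, or None if confidence is too low."""
        entry = self.table.get(pc)
        if entry and entry[1] >= self.confident_at:
            return entry[0]
        return None

    def update(self, pc, actual):
        entry = self.table.setdefault(pc, [actual, 0])
        if entry[0] == actual:
            entry[1] = min(entry[1] + 1, self.max_conf)  # reinforce
        else:
            entry[0], entry[1] = actual, 0               # mispredict: reset

p = LastValuePredictor()
for v in [7, 7, 7]:
    p.update(0x400, v)
print(p.predict(0x400))  # 7: the value repeated, confidence saturated
```

An adaptive design in the CiC spirit would consult bottleneck analysis to decide when issuing such predictions is worth the misprediction-recovery cost.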
- Core-Selectable Chip Multiprocessor Design.(2010-11-10) Hashemi, Hashem; Eric Rotenberg, Committee Chair; Gregory Byrd, Committee Member; Vincent Freeh, Committee Member; James Tuck, Committee Member
- Design of the Management Component in a Scavenged Storage Environment (2005-07-31) Tammineedi, Nandan; Xiaosong Ma, Committee Chair; Sudharshan Vazhkudai, Committee Member; Khaled Harfoush, Committee Member; Vincent Freeh, Committee Member
High-end mass storage systems are increasingly popular in supercomputing facilities for their large storage capacities and superior data delivery rates. End-users, on the other hand, face problems processing this data on their local machines due to limited disk bandwidth and memory. The FreeLoader project is based on the premise that in a LAN environment, a large number of such workstations collectively represent significant storage space and aggregate I/O bandwidth, if harnessed when idle. Aggregation of these resources is made viable by the high-speed interconnect between nodes in a LAN. FreeLoader aggregates free storage space and I/O bandwidth contributions from commodity desktops to provide a shared cache/scratch space for large, immutable data sets. Striping is used to distribute data among multiple workstations, enabling subsequent retrieval of data as parallel streams from multiple workstations. In this thesis, we present the management component of the FreeLoader project. We discuss its functionality in terms of data placement and the maintenance of information about workstations that donate storage. We show how striping of data maximizes retrieval rates and helps in load balancing. We present the choices faced in the design of the management component and how to minimize its overheads. We also model the entire FreeLoader cloud as a cache space with an eviction policy, due to the dynamic nature of space contributions and the limited amount of donated space, and discuss how the management component handles data set eviction in a manner that exploits temporal locality based on a history of accesses. Finally, we discuss experimental results showing the impact of different striping parameters on data access rates, and the viability of FreeLoader compared to traditional data retrieval from high-end storage systems.
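The two management-component mechanisms described above, striping across donors and history-based eviction of whole data sets, can be sketched as follows. The chunk size, donor names, capacities, and the round-robin/LRU policies are illustrative assumptions, not FreeLoader's actual design parameters.

```python
# Sketch of the two mechanisms the abstract describes: round-robin striping
# of a data set across donor workstations, and LRU-style eviction of whole
# data sets when donated space runs out. All parameters are hypothetical.
from collections import OrderedDict

CHUNK = 4  # MB per stripe chunk (hypothetical)

def stripe(dataset_mb, donors):
    """Assign consecutive chunks to donors round-robin, so a later read
    can pull parallel streams from all donors at once."""
    chunks = range((dataset_mb + CHUNK - 1) // CHUNK)
    return {c: donors[c % len(donors)] for c in chunks}

class ScavengedCache:
    def __init__(self, capacity_mb):
        self.capacity = capacity_mb
        self.sets = OrderedDict()  # name -> size, oldest access first

    def access(self, name, size_mb):
        if name in self.sets:
            self.sets.move_to_end(name)  # temporal locality: refresh
            return
        while sum(self.sets.values()) + size_mb > self.capacity:
            self.sets.popitem(last=False)  # evict least-recently-used set
        self.sets[name] = size_mb

layout = stripe(10, ["ws1", "ws2", "ws3"])
print(layout)  # chunks 0..2 land on ws1, ws2, ws3 in turn

cache = ScavengedCache(capacity_mb=20)
for name, size in [("A", 10), ("B", 8), ("A", 10), ("C", 6)]:
    cache.access(name, size)
print(list(cache.sets))  # B evicted to make room for C; A was refreshed
```

Evicting whole data sets (rather than blocks) matches the immutable, bulk-access workload the abstract targets.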
- EMFS: An Email Based Distributed File System.(2010-08-17) Srinivasan, Jagannath; Xiaosong Ma, Committee Chair; Ting Yu, Committee Member; Vincent Freeh, Committee Member
- METRIC: Tracking Memory Bottlenecks via Binary Rewriting (2003-07-15) Marathe, Jaydeep Prakash; Frank Mueller, Committee Chair; Vincent Freeh, Committee Member; Gregory Byrd, Committee Member
Over recent decades, computing speeds have grown much faster than memory access speeds. This differential rate of improvement has led to an ever-widening processor-memory gap, and overall computing speeds for most applications are now dominated by the cost of their memory references. Memory access costs will grow increasingly dominant as the gap widens. In this scenario, characterizing and quantifying application memory usage to isolate, identify, and eliminate memory access bottlenecks can have a significant impact on overall application performance. This thesis presents METRIC, an environment for determining memory access inefficiencies by examining access traces, and makes three primary contributions. First, we present methods to extract partial access traces from running applications by observing their memory behavior via dynamic binary rewriting. Second, we present a methodology to represent partial access traces of regular references in constant space through a novel technique for online compression of reference streams. Third, we employ offline cache simulation to derive indications of memory performance bottlenecks from the partial access traces. By examining summarized and per-reference metrics as well as cache evictor information, we can pinpoint the sources of performance problems. We validate the framework with respect to accuracy, compression, and execution overheads for several benchmarks. Finally, we demonstrate the ability to derive opportunities for optimization and assess their benefits in several case studies, resulting in up to 40% lower miss ratios.
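The constant-space representation of regular references rests on a simple idea: a run of addresses with a constant stride can be folded into a single (base, stride, count) triple. The sketch below illustrates that idea; it is not METRIC's actual encoding, and the trace is hypothetical.

```python
# Sketch of online compression of a regular reference stream: constant-
# stride runs are folded into (base, stride, count) triples, which is why
# regular references fit in constant space. Illustrative only.

def compress(addresses):
    """Fold constant-stride runs into (base, stride, count) triples."""
    runs = []
    for addr in addresses:
        if runs:
            base, stride, count = runs[-1]
            if count == 1:
                runs[-1] = (base, addr - base, 2)  # 2nd access fixes stride
                continue
            last = base + stride * (count - 1)
            if addr - last == stride:
                runs[-1] = (base, stride, count + 1)  # run continues
                continue
        runs.append((addr, 0, 1))  # start a new run
    return runs

def decompress(runs):
    return [base + stride * i for base, stride, count in runs
            for i in range(count)]

trace = [0x1000 + 8 * i for i in range(5)] + [0x9000, 0x9004, 0x9008]
runs = compress(trace)
print(runs)  # two runs: a stride-8 run of 5 and a stride-4 run of 3
assert decompress(runs) == trace  # the encoding is lossless
```

An offline cache simulator can then replay such compressed runs directly, without materializing the full trace.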
- MST: A Multi-level Storage Testbed(2008-06-02) Balik, Alexander James; Xiaosong Ma, Committee Chair; Edward Davis, Committee Member; Vincent Freeh, Committee Member
- Providing predictability for high end embedded systems (2010-01-27) Raghavendra, Raghuveer; Frank Mueller, Committee Chair; Vincent Freeh, Committee Member; Xuxian Jiang, Committee Member
Real-time systems require logical and temporal correctness. Temporal correctness implies that each task running on the system has a deadline that must be met. To ensure that deadlines are met, the scheduler of a real-time system needs information about the worst-case execution time (WCET) of each task; determining the WCET of a task on a particular architecture is called timing analysis. Analysis techniques are broadly classified as static and dynamic. Dynamic timing analysis does not provide safe WCET bounds, while static analysis cannot be used on modern processors with features like out-of-order execution, dynamic branch prediction, and speculative execution. Such features, while improving average-case performance, induce counter-intuitive timing behavior known as timing anomalies. Hence, designers of hard real-time systems are forced to use architectures with simple in-order pipelines. This thesis develops and demonstrates the benefits of a hybrid timing analysis technique, combining static and dynamic analysis, on a processor simulator and on FPGA hardware to provide tight and safe WCET bounds. The technique makes the following contributions:
* It enhances the design space for hard real-time systems by allowing designers to use complex out-of-order architectures that exhibit timing anomalies.
* It eliminates the need for complex prototyping of hardware for static timing analysis, since the analysis can be done directly on the actual hardware. This has the added advantage of eliminating timing inaccuracies arising from variations in manufacturing technology.
* It helps manufacturers protect their Intellectual Property by eliminating the need to disclose architectural details for the purpose of static timing analysis.
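One common way hybrid approaches combine the two kinds of analysis is to let static analysis contribute the set of feasible paths through a task while dynamic measurement on the actual hardware contributes observed worst-case times per code segment. The sketch below illustrates that general principle only; it is not the thesis's method, and the paths, cycle counts, and safety margin are all hypothetical.

```python
# Hedged sketch of a hybrid WCET bound: static analysis supplies feasible
# paths (sequences of code segments); measurements on real hardware supply
# per-segment observed maxima; the bound is the worst path's summed maxima,
# padded by a safety margin. All numbers below are hypothetical.

def hybrid_wcet(paths, segment_times, margin=1.2):
    """paths: list of segment-name sequences from static path analysis.
    segment_times: segment -> list of measured execution times (cycles)."""
    per_path = [
        sum(max(segment_times[seg]) for seg in path)  # worst observed per segment
        for path in paths
    ]
    return max(per_path) * margin  # pad measurements with a safety margin

paths = [["entry", "loop", "exit"], ["entry", "error", "exit"]]
segment_times = {
    "entry": [100, 110, 105],
    "loop":  [800, 950, 900],
    "exit":  [50, 55],
    "error": [300, 320],
}
print(hybrid_wcet(paths, segment_times))
```

Measuring on the deployed hardware itself is what lets such an approach sidestep both hardware prototyping for static analysis and manufacturing-variation inaccuracies, as the contributions above note.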
- Scalable Distributed Concurrency Protocol with Priority Support (2003-08-01) Desai, Nirmit Vikram; Frank Mueller, Committee Chair; Vincent Freeh, Committee Member; Gregory Byrd, Committee Member
Middleware components are becoming increasingly important as applications share computational resources in large distributed environments, such as web services, high-end clusters with ever larger numbers of processors, computational grids, and increasingly large server farms. One of the main challenges in such environments is achieving scalable synchronization. Another is posed by the requirement for shared resources with QoS and real-time support. In general, concurrency services arbitrate resource requests in distributed systems, but current concurrency protocols lack scalability and support for service differentiation based on QoS requirements. Adding such guarantees enables resource sharing and computing with distributed objects in systems with a large number of nodes, supporting a wide range of QoS metrics. The objective of this thesis is to enhance middleware services to provide scalable synchronization and to support service differentiation based on priorities. We have designed and implemented middleware protocols in support of these objectives. Their essence is a novel, peer-to-peer, fully decentralized protocol for multi-mode hierarchical locking, which is applicable to transaction-style processing and distributed agreement. We discuss the design and implementation of the protocols and demonstrate high scalability combined with low response times in high-performance cluster environments as well as TCP/IP networks, compared to a prior protocol for distributed synchronization. The prioritized version of the protocol is shown to offer differentiated response times to real-time applications, with support for protocols that bound priority inversion such as PCEP and PIP.
Our approach was originally motivated by CORBA concurrency services; beyond CORBA, its principles are shown to benefit general distributed concurrency services and transaction models. Besides its technical strengths, our approach is intriguing due to its simplicity and wide applicability, ranging from large-scale clusters to server-style computing and real-time applications. In general, the results of this thesis impact applications sharing resources across large distributed environments, ranging from hierarchical locking in real-time databases and database transactions to distributed object environments in large-scale embedded systems, including real-time applications.
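Multi-mode hierarchical locking of the kind described above is conventionally built around a mode compatibility matrix. The sketch below uses the textbook intention-mode matrix (IS, IX, S, X) from database-style hierarchical locking; these are the standard modes, not necessarily the exact modes of the thesis's protocol, and the lock manager here is centralized for simplicity rather than peer-to-peer.

```python
# Sketch of multi-mode locking via the classic compatibility matrix for
# intention modes (IS, IX, S, X). A request is granted only if its mode
# is compatible with every mode currently held. Illustrative only.

COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

class MultiModeLock:
    def __init__(self):
        self.holders = []  # (owner, mode) pairs currently granted

    def try_acquire(self, owner, mode):
        """Grant only if the mode is compatible with every current holder."""
        if all(COMPATIBLE[(held, mode)] for _, held in self.holders):
            self.holders.append((owner, mode))
            return True
        return False

lock = MultiModeLock()
print(lock.try_acquire("t1", "IS"))  # True: nothing held yet
print(lock.try_acquire("t2", "IX"))  # True: IS and IX are compatible
print(lock.try_acquire("t3", "S"))   # False: S conflicts with the held IX
```

A prioritized variant would queue incompatible requests by priority and apply a priority-inversion-bounding protocol when granting them.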
- The SILO Architecture: Exploring Future Internet Design.(2010-08-09) Wang, Anjing; Rudra Dutta, Committee Chair; Georgios Rouskas, Committee Chair; Vincent Freeh, Committee Member; Edward Gehringer, Committee Member; Ilia Baldine, Committee Member
- Trace Based Performance Characterization and Optimization (2007-06-20) Marathe, Jaydeep Prakash; Vincent Freeh, Committee Member; Yan Solihin, Committee Member; Tao Xie, Committee Member; Frank Mueller, Committee Chair
Processor speeds have increased dramatically in the recent past, but improvements in memory access latencies have not kept pace. As a result, programs that do not make efficient use of the processor caches become increasingly memory-bound and do not experience speedups with increasing processor frequency. In this thesis, we present tools to characterize and optimize the memory access patterns of software programs. Our tools use the program's memory access trace as the primary input for analysis. Our efforts encompass two broad areas: performance analysis and performance optimization. In performance analysis, our focus is on automating the analysis process as far as possible and on presenting the user with a rich set of metrics, for both single-threaded and multi-threaded programs. In performance optimization, we go one step further and perform automatic transformations based on observed program behavior. We make the following contributions. First, we explore different tracing strategies: software tracing with dynamic binary instrumentation, hardware-based tracing exploiting support found in contemporary microprocessors, and a hybrid scheme that leverages hardware support with certain software modifications. Second, we present a range of performance analysis and optimization tools based on these trace inputs and additional auxiliary instrumentation. Our first tool, METRIC, characterizes the memory performance of single-threaded programs. Our second tool, ccSIM, extends METRIC to characterize the coherence behavior of multithreaded OpenMP benchmarks. Our third tool extends ccSIM to work with hardware-generated and hybrid trace inputs. These three tools represent our performance analysis efforts. We also explore automated performance optimization with our remaining tools: our fourth tool uses hardware-generated traces for automatic page placement in cache-coherent non-uniform memory architectures (ccNUMA), and our fifth tool explores a novel trace-driven instruction-level software data prefetching strategy. Overall, we demonstrate that memory traces are a rich source of information about a program's behavior and can be used effectively for a wide range of performance analysis and optimization strategies.
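The common core of these trace-driven tools is replaying a memory access trace through a cache model. The minimal sketch below feeds a trace through a tiny direct-mapped cache and reports the miss ratio; the cache geometry and traces are hypothetical, and real tools like those above model far richer hierarchies and coherence.

```python
# Sketch of trace-driven cache analysis: replay an address trace through
# a small direct-mapped cache model and report the miss ratio. The cache
# geometry and the traces are hypothetical.

LINE = 64  # bytes per cache line
SETS = 4   # number of sets (direct-mapped: one line per set)

def miss_ratio(trace):
    cache = [None] * SETS  # tag stored per set
    misses = 0
    for addr in trace:
        block = addr // LINE
        idx, tag = block % SETS, block // SETS
        if cache[idx] != tag:
            misses += 1
            cache[idx] = tag  # fill the line on a miss
    return misses / len(trace)

# Sequential sweep within one line: one compulsory miss, then hits.
print(miss_ratio([0, 8, 16, 24]))        # 0.25
# Two addresses mapping to the same set: every access conflicts.
print(miss_ratio([0, SETS * LINE] * 4))  # 1.0
```

Distinguishing such conflict misses from compulsory ones, and identifying which reference evicted whose data, is exactly the kind of per-reference insight the tools above extract from traces.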
