Browsing by Author "Eric Rotenberg, Committee Member"
Now showing 1 - 20 of 21
- Analytical Bounding Data Cache Behavior for Real-Time Systems (2008-07-21) Ramaprasad, Harini; Frank Mueller, Committee Chair; Eric Rotenberg, Committee Member; Vincent Freeh, Committee Member; Tao Xie, Committee Member
This dissertation presents data cache analysis techniques that make it feasible to predict data cache behavior and to bound the worst-case execution time for a large class of real-time programs. Data caches are an increasingly important architectural feature in most modern computer systems. They help bridge the gap between processor speeds and memory access times. One inherent difficulty of using data caches in a real-time system is the unpredictability of memory accesses, which makes it difficult to calculate worst-case execution times of real-time tasks. This dissertation presents an analytical framework that characterizes data cache behavior in the context of independent, periodic tasks with deadlines less than or equal to their periods, executing on a single, in-order processor. The framework presented has three major components. 1) The first component analytically derives data cache reference patterns for all scalar and non-scalar references in a task. Using these, it produces a safe and tight upper bound on the worst-case execution time of the task without considering interference from other tasks. 2) The second component calculates the worst-case execution time and response time of a task in the context of a multi-task, prioritized, preemptive environment. This component calculates the data-cache-related preemption delay for tasks assuming that all tasks in the system are completely preemptive. 3) In the third component, tasks are allowed to have critical sections in which they access shared resources. In this context, two analysis techniques are presented. In the first one, a task executing in a critical section is not allowed to be preempted by any other task. In the second one, the framework incorporates resource-sharing policies to arbitrate accesses to shared resources, thereby improving responsiveness of high-priority tasks that do not use a particular resource. In all the components presented in this dissertation, a direct-mapped data cache is assumed. Experimental results demonstrate the value of all the analysis techniques described above in the context of data cache usage in a hard real-time system.
- Analyzing and Managing Shared Cache in Chip Multi-Processors (2008-08-14) Guo, Fei; Yan Solihin, Committee Chair; Eric Rotenberg, Committee Member; Edward Gehringer, Committee Member; Gregory Byrd, Committee Member
Recently, Chip Multi-Processor (CMP) or multicore design has become the mainstream architecture choice for major microprocessor makers. In a CMP architecture, some important on-chip platform resources, such as the lowest-level on-chip cache and the off-chip bandwidth, are shared by all the processor cores. As will be shown in this dissertation, resource sharing may lead to sub-optimal throughput, cache thrashing, thread starvation and priority inversion for the applications that fail to acquire sufficient resources to make good progress. In addition, resource sharing may also lead to a large performance variation for an individual application. Such performance variation is ill-suited for the future uses of CMPs in which many applications may require a certain level of performance guarantee, which we refer to as performance Quality of Service (QoS). In this dissertation, we address the resource sharing problem from two aspects. First, we propose an analytical model and several heuristic models that encapsulate and predict the impact of cache sharing. The models differ in their complexity and prediction accuracy. We validate the models against a cycle-accurate simulation. The most accurate model achieves an average error of 3.9%. Through a case study, we found that the cache sharing impact is largely affected by the temporal reuse behaviors of the co-scheduled applications. Second, we investigate a framework for providing performance Quality of Service in a CMP server. We found that the ability of a CMP to partition platform resources alone is not sufficient for fully providing QoS. We also need an appropriate way to specify a QoS target, and an admission control policy that accepts jobs only when their QoS targets can be satisfied. We also found that providing strict QoS often leads to a significant reduction in throughput due to resource fragmentation. We propose novel throughput optimization techniques that include: (1) exploiting various QoS execution modes, and (2) microarchitecture techniques that steal excess resources from a job while still meeting its QoS target. Through simulation, we found that compared to an unoptimized scheme, the throughput can be improved by up to 45%, making the throughput significantly closer to a non-QoS CMP.
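As a rough illustration of what a cache-sharing model consumes and produces, the hypothetical C sketch below splits a shared cache between two threads in proportion to their access frequencies and counts reuses whose stack distance exceeds each thread's share as misses. It is a deliberately crude heuristic in the spirit of the simpler models compared above, not the dissertation's analytical model; all histograms, sizes, and function names are made up.

```c
/* Illustrative sketch only: a crude frequency-proportional estimate of how two
 * co-scheduled threads share a cache.  All names and numbers are hypothetical. */
#include <stdio.h>

#define MAX_DIST 16          /* stack-distance histogram buckets (in cache blocks) */
#define CACHE_BLOCKS 8       /* shared cache capacity, in blocks                   */

/* Estimate misses for a thread given its reuse (stack-distance) histogram and
 * the number of cache blocks it effectively owns: any reuse whose distance
 * reaches the owned share is counted as a miss. */
static double predict_misses(const double hist[MAX_DIST], double owned_blocks) {
    double misses = 0.0;
    for (int d = 0; d < MAX_DIST; d++)
        if ((double)d >= owned_blocks)
            misses += hist[d];
    return misses;
}

int main(void) {
    /* Hypothetical reuse histograms: hist[d] = accesses with stack distance d. */
    double a[MAX_DIST] = {40, 30, 20, 10, 5, 5};        /* good temporal reuse  */
    double b[MAX_DIST] = {5, 5, 5, 5, 10, 20, 30, 40};  /* streaming-like reuse */

    double acc_a = 0, acc_b = 0;
    for (int d = 0; d < MAX_DIST; d++) { acc_a += a[d]; acc_b += b[d]; }

    /* Frequency-proportional split of the shared cache between the two threads. */
    double share_a = CACHE_BLOCKS * acc_a / (acc_a + acc_b);
    double share_b = CACHE_BLOCKS - share_a;

    printf("thread A: %.1f blocks, predicted misses %.1f\n", share_a, predict_misses(a, share_a));
    printf("thread B: %.1f blocks, predicted misses %.1f\n", share_b, predict_misses(b, share_b));
    return 0;
}
```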
- Caching Strategies to Improve Generational Garbage Collection in Smalltalk (2003-08-20) Reddy, Vimal Kodandarama; Gregory T. Byrd, Committee Member; Edward F. Gehringer, Committee Chair; Warren J. Jasper, Committee Member; Eric Rotenberg, Committee Member
Cache performance of programs is becoming increasingly important as processor speeds grow faster relative to main memory. Cache misses have become a major consideration for performance of garbage-collected systems. This thesis explores a caching strategy for generational garbage collectors, the most prevalent form in use, which takes advantage of the large caches available to modern-day processors. A portion of the cache is reserved for the youngest generation, and the page-fault manager is provided certain mapping rules that remove all conflicts to the youngest generation. The strategy can be realized completely in software, which makes it an attractive solution for increasing garbage collection performance. This "biased" cache mapping is shown to reduce cache misses and increase overall performance in the IBM VisualAge Smalltalk system, a high-quality Smalltalk implementation that employs a generational copying garbage collector. Favoring the youngest generation in the mapping strategy is advantageous for the following reasons:
1. Languages like Smalltalk, where "everything" is an object, tend to allocate furiously. This is because they encourage a programming style where objects are created, used and shortly thereafter destroyed. This large number of allocations translates to initialization write misses if the allocated region is not cached. In generational heaps, all memory is allocated in the region containing youngest-generation objects.
2. A generational garbage collector focuses collection on the youngest generation, scavenging it to reclaim most garbage. It relies on empirical knowledge that most young objects die soon. This means the scavenger runs many times during a program lifetime, scanning the youngest generation for garbage. This process can lead to a large number of read and write cache misses if the youngest generation is not in cache.
3. Youngest-generation objects form a major part of a program's working set. Making them available in the cache would also improve the mutator (i.e., user program) performance, making it immune to interference from the garbage collector.
4. Given that most young objects become garbage quickly, when an object that has become garbage is evicted from a writeback cache, an unnecessary writeback results. Caching the youngest generation would reduce traffic to memory.
We do a simulation-based study of our mapping strategies on IBM VisualAge Smalltalk, a generational copying garbage-collected system. Our results show a 45% average drop in cache miss rates at the L2 level for direct-mapped caches and a 15% average drop for 2-way set-associative caches.
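The following C sketch illustrates the general idea of a software-only "biased" mapping: a page-fault handler restricts the physical frames given to nursery (youngest-generation) pages to a reserved range of page colors of a physically indexed cache, so no other page can conflict with the nursery. The constants, the free-frame scan, and the function names are hypothetical and are not taken from the VisualAge implementation.

```c
/* Minimal sketch of "biased" cache mapping via page coloring, assuming a
 * physically indexed, direct-mapped L2 for simplicity.  Frame-selection policy
 * only; everything here is hypothetical. */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE      4096u
#define CACHE_SIZE     (512u * 1024u)
#define NUM_COLORS     (CACHE_SIZE / PAGE_SIZE)   /* 128 page colors             */
#define NURSERY_COLORS 16u                        /* cache slice kept for nursery */

static unsigned color_of(unsigned frame) { return frame % NUM_COLORS; }

/* Pick a physical frame for a virtual page: nursery pages only get colors
 * [0, NURSERY_COLORS); all other pages are kept out of that range, so nothing
 * else can conflict with the youngest generation in the cache. */
static unsigned pick_frame(bool nursery_page, unsigned next_free_frame) {
    unsigned f = next_free_frame;
    while ((color_of(f) < NURSERY_COLORS) != nursery_page)
        f++;                                      /* scan for a frame of an allowed color */
    return f;
}

int main(void) {
    unsigned nf = pick_frame(true, 1000), of = pick_frame(false, 1000);
    printf("nursery page  -> frame %u (color %u)\n", nf, color_of(nf));
    printf("ordinary page -> frame %u (color %u)\n", of, color_of(of));
    return 0;
}
```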
- Collaboration Policies: Access Control Management in SQA-Based Dynamic Collaborations (2008-04-12) Altunay, Mine; Ralph A. Dean, Committee Co-Chair; Douglas S. Reeves, Committee Member; Gregory T. Byrd, Committee Chair; Eric Rotenberg, Committee Member
- Data Allocation with Real-Time Scheduling (DARTS) (2006-12-21) Ghattas, Rony; Ralph C. Smith, Committee Member; Alexander G. Dean, Committee Chair; Thomas M. Conte, Committee Member; Eric Rotenberg, Committee Member
The problem of utilizing memory and energy efficiently is common to all computing platforms. Many studies have addressed and investigated various methods to circumvent this problem. Nevertheless, most of these studies do not scale well to real-time embedded systems, where resources may be limited and particular assumptions that are valid for general computing platforms no longer hold. First, memory has always been considered a bottleneck of system performance. It is well known that processors have been improving at a rate of about 60% per year, while memory latencies have been improving at less than 10% per year. This leads to a growing gap between processor cycle time and memory access time. To compensate for this speed mismatch, it is common to use a memory hierarchy with a fast cache that can dynamically allocate frequently used data objects close to the processor. Many embedded systems, however, cannot afford a cache for reasons presented later. Those systems opt for a cacheless design, which is particularly popular for real-time embedded applications. Data is allocated at compile time, making memory access latencies deterministic and predictable. Nevertheless, the burden of allocating the data to memory is now the responsibility of the programmer/compiler. Second, the proliferation of portable and battery-operated devices has made the efficient use of the available energy budget a vital design constraint. This is particularly true since energy storage technology is also improving at a rather slow pace. Techniques like dynamic voltage scaling (DVS) and dynamic frequency scaling (DFS) have been proposed to deal with these problems. Still, the applicability of those techniques to resource-constrained real-time systems has not been investigated. In this work we propose techniques to deal with both of the above problems. Our main contribution, the Data Allocation with Real-Time Scheduling (DARTS) framework, solves the data allocation and scheduling problems in cacheless systems with the main goals of optimizing memory utilization, energy efficiency, and overall system performance. DARTS is a synergistic, optimal approach to allocating data objects and scheduling real-time tasks for embedded systems. It optimally allocates data objects to memory through the use of an integer linear programming (ILP) formulation, which minimizes the system's worst-case execution times (WCETs), resulting in more scheduling slack. This additional slack is used by our preemption threshold scheduler (PTS) to reduce stack memory requirements while maintaining all hard real-time constraints. The memory reduction of PTS allows these steps to be repeated. The data objects now require less memory, so more can fit into faster memory, further reducing WCET and resulting in more slack time. The increased slack time can be used by PTS to reduce preemptions further, until a fixed point is reached. Using a combination of synthetic and real workloads, we show that the DARTS platform leads to optimal memory utilization and increased energy efficiency. In addition to our main contribution given by the DARTS platform, we also present several techniques to optimize a system's memory utilization in the absence of a memory hierarchy using PTS, which we enhance and improve. Furthermore, many advanced energy-saving techniques like DFS and DVS are investigated as well, and the tradeoffs in their use are presented and analyzed.
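A toy C sketch of the iterative loop described above follows: allocate data objects, recompute WCET, convert the new slack into stack-memory savings via PTS, and repeat until a fixed point. The greedy allocator stands in for the ILP formulation, and every number (access frequencies, latencies, deadline, PTS effect) is invented purely for illustration.

```c
/* Toy sketch of the allocate -> WCET -> slack -> PTS -> repeat loop.  The
 * greedy allocator below is a stand-in for the dissertation's ILP formulation;
 * all constants and helper names are hypothetical. */
#include <stdio.h>

#define NOBJ 4

/* access frequency and size (bytes) of each data object, sorted by frequency */
static const int freq[NOBJ] = {900, 500, 300, 100};
static const int size[NOBJ] = {256, 512, 256, 512};

/* Greedy stand-in for the ILP: pack the most frequently accessed objects into
 * fast memory first.  Returns an estimated WCET in cycles. */
static int allocate_and_wcet(int fast_mem_bytes) {
    int used = 0, wcet = 0;
    for (int i = 0; i < NOBJ; i++) {
        int in_fast = (used + size[i] <= fast_mem_bytes);
        if (in_fast) used += size[i];
        wcet += freq[i] * (in_fast ? 1 : 10);     /* 1 vs. 10 cycles per access */
    }
    return wcet;
}

int main(void) {
    int fast_mem = 256;                           /* bytes initially available  */
    int prev_wcet = -1, wcet = allocate_and_wcet(fast_mem);

    /* Fixed point: more slack -> PTS needs fewer preemptions -> less stack     */
    /* memory -> more fast memory for data -> lower WCET -> more slack ...      */
    while (wcet != prev_wcet) {
        prev_wcet = wcet;
        int slack = 20000 - wcet;                 /* toy deadline of 20000 cycles */
        int stack_saved = slack / 30;             /* stand-in for PTS's effect    */
        fast_mem += stack_saved;
        wcet = allocate_and_wcet(fast_mem);
        printf("fast_mem=%d bytes, WCET=%d cycles\n", fast_mem, wcet);
    }
    return 0;
}
```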
- The Effectiveness of Global Difference Value Prediction and Memory Bus Priority Schemes for Speculative Prefetch (2003-07-01) Gunal, Ugur; Thomas M. Conte, Committee Chair; Eric Rotenberg, Committee Member; Mehmet C. Ozturk, Committee Member
Processor clock speeds have drastically increased in recent years. However, the cycle time improvement in the DRAM semiconductor technology used for memories has been comparatively slow. The expanding processor-memory gap encourages developers to find aggressive techniques to reduce the latency of memory accesses. Value prediction is a powerful approach to break true data dependencies. Prefetching is another technique, which aims to reduce processor stall time by bringing data into the cache before it is accessed by the processor. The recovery-free value prediction scheme [26] combines these two techniques and uses value prediction only for prefetching, so that the need for validation of predictions and a recovery mechanism for mispredictions is eliminated. In this thesis, the effectiveness of using global difference value prediction for recovery-free speculative execution is studied. A bus model is added for modeling the buses in the memory system. Three bus priority schemes, First Come First Served (FCFS), Real Access First Served (RAFS) and Prefetch Access First Served (PAFS), are proposed and their performance potentials are evaluated when a stride predictor and a hybrid global difference predictor (hgDiff) are used. The results show that recovery-free speculative execution using value prediction is a promising technique that increases performance significantly (up to 10%), and that this increase depends on the bus priority scheme and the predictor used.
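A minimal C sketch of the prefetch-only use of value prediction is shown below: a per-PC stride predictor is trained on load values, and a confident prediction is used solely to issue a prefetch, so no validation or recovery path is needed. The table layout and the prefetch hook are assumptions for illustration, not the thesis's simulated design.

```c
/* Sketch of recovery-free value prediction: a per-PC stride predictor whose
 * predicted value only drives a prefetch, never speculative execution, so no
 * validation or rollback is required.  Table layout is hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256

typedef struct { uint64_t last; int64_t stride; int confident; } StrideEntry;
static StrideEntry table[TABLE_SIZE];

static void prefetch(uint64_t addr) { printf("  prefetch 0x%llx\n", (unsigned long long)addr); }

/* Called when a load at 'pc' produces 'value' (e.g. a pointer dereferenced later). */
static void train_and_prefetch(uint64_t pc, uint64_t value) {
    StrideEntry *e = &table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(value - e->last);
    e->confident = (stride == e->stride);         /* same stride twice in a row?  */
    e->stride = stride;
    e->last = value;
    if (e->confident)
        prefetch(value + (uint64_t)stride);       /* prediction only drives a prefetch */
}

int main(void) {
    /* A load at PC 0x400100 returning addresses that grow by 64 bytes. */
    for (uint64_t v = 0x10000; v < 0x10200; v += 64) {
        printf("load value 0x%llx\n", (unsigned long long)v);
        train_and_prefetch(0x400100, v);
    }
    return 0;
}
```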
- Enhancing dependence-based prefetching for better timeliness, coverage, and practicality (2008-12-22) Lim, Chungsoo; Eric Rotenberg, Committee Member; Vincent W. Freeh, Committee Member; Gregory T. Byrd, Committee Chair; Yan Solihin, Committee Member
This dissertation proposes an architecture that efficiently prefetches for loads whose effective addresses are dependent on previously-loaded values (dependence-based prefetching). For timely prefetches, the memory access patterns of producing loads are dynamically learned. These patterns (such as strides) are used to prefetch well ahead of the consumer load. Different prefetching algorithms are used for different patterns, and the different algorithms are combined on top of the dependence-based prefetching scheme. The proposed prefetcher is placed near the processor core and targets L1 cache misses, because removing L1 cache misses has greater performance potential than removing L2 cache misses. For higher coverage, dependence-based prefetching is extended by augmenting the dependence relation identification mechanism to include not only direct relations (y = x) but also linear relations (y = ax + b) between producer (x) and consumer (y) loads. With these additional relations, higher performance, measured in instructions per cycle (IPC), can be obtained. We also show that the space overhead for storing the patterns can be reduced by leveraging chain prefetching and focusing on frequently missed loads. We specifically examine how to capture pointers in linked data structures (LDS) with a pure hardware implementation. We find that the space requirement can be reduced, compared to previous work, if we selectively record patterns. Still, to make the prefetching scheme generally applicable, a large table is required for storing pointers. So we take one step further in order to eliminate the additional storage needed for pointers. We propose a mechanism that utilizes a portion of the L2 cache for storing the pointers. With this mechanism, the impractically large on-chip storage for pointers, which is sometimes a waste of silicon, can be removed. We show that storing the prefetch table in a partition of the L2 cache outperforms using the L2 cache conventionally for benchmarks that benefit from prefetching.
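The C sketch below illustrates how a linear relation y = ax + b between a producer load's value and a consumer load's address might be learned and used: two observed pairs propose (a, b), a later pair confirms it, and once confirmed each new producer value triggers a prefetch of a*x + b. The structure and names are hypothetical, not the dissertation's hardware tables.

```c
/* Sketch of extending dependence-based prefetching from direct relations
 * (y = x) to linear ones (y = a*x + b).  Assumes the slope divides evenly in
 * this toy stream; everything here is illustrative only. */
#include <stdint.h>
#include <stdio.h>

typedef struct { int64_t a, b; int have_prev, confirmed; int64_t px, py; } LinearCorr;

static void prefetch(int64_t addr) { printf("  prefetch 0x%llx\n", (unsigned long long)addr); }

/* Observe one (producer value x, consumer address y) pair. */
static void observe_pair(LinearCorr *c, int64_t x, int64_t y) {
    if (c->have_prev && x != c->px) {
        int64_t a = (y - c->py) / (x - c->px);    /* candidate slope  */
        int64_t b = y - a * x;                    /* candidate offset */
        c->confirmed = (a == c->a && b == c->b);  /* seen twice in a row? */
        c->a = a; c->b = b;
    }
    c->px = x; c->py = y; c->have_prev = 1;
}

/* Producer load produced value x: if the relation is confirmed, prefetch the
 * consumer's likely address without waiting for the consumer to execute. */
static void on_producer_value(const LinearCorr *c, int64_t x) {
    if (c->confirmed) prefetch(c->a * x + c->b);
}

int main(void) {
    LinearCorr c = {0};
    /* e.g. consumer address = 16 * index + 0x8000 (array of 16-byte records) */
    for (int64_t x = 0; x < 6; x++) {
        observe_pair(&c, x, 16 * x + 0x8000);
        on_producer_value(&c, x + 1);             /* next producer value seen */
    }
    return 0;
}
```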
- Exploration of High-level Synthesis Techniques to Improve Computational Intensive VLSI Designs (2009-12-07) Kim, Taemin; W. Rhett Davis, Committee Member; Eric Rotenberg, Committee Member; Xun Liu, Committee Chair; James M. Tuck, Committee Member
Optimization techniques applied during the high-level synthesis procedure are often preferred, since design decisions at early stages of a design flow are believed to have a large impact on design quality. In this dissertation, we present three high-level synthesis schemes to improve the power, speed and reliability of deep submicron VLSI systems. Specifically, we first describe a simultaneous register and functional unit (FU) binding algorithm. Our algorithm targets the reduction of multiplexer inputs, shortening the total length of global interconnects. In this algorithm, we introduce three graph parameters that guide our FU and register binding: flow dependencies, common primary inputs and common register inputs. We maximize the interconnect sharing among FUs and registers. We then present an interconnect binding algorithm for global interconnect reduction during high-level synthesis. Our scheme is based on the observation that not all FUs operate at all times. When idle, FUs can be reconfigured as pass-through logic for data transfer, reducing the interconnect requirement. Our scheme not only reduces the overall length of global interconnects but also minimizes the power overhead without introducing any timing violations. Lastly, we present a register binding algorithm with the objective of register minimization. We have observed that not all pipelined FUs are operating at all times. Idle pipelined FUs can be used to store data temporarily, reducing the number of stand-alone registers.
- Exploring Energy-Time Tradeoff in High Performance Computing (2005-05-16) Pan, Feng; Vincent Freeh, Committee Chair; Jun Xu, Committee Member; Eric Rotenberg, Committee Member
Recently, energy has become an important issue in high-performance computing. For example, low-power/energy supercomputers, such as Green Destiny, have been built; the idea is to increase the energy efficiency of nodes. However, these clusters tend to save energy at the expense of performance. Our approach is instead to use high-performance cluster nodes with frequency-scalable AMD-64 processors; energy can be saved by scaling down the CPU. Our cluster provides a different balance of power and performance than low-power machines such as Green Destiny. In particular, its performance is on par with a Pentium 4-equipped cluster. This thesis investigates the energy consumption and execution time of a wide range of applications, both serial and parallel, on a power-scalable cluster. We study via direct measurement both intra-node and inter-node effects of memory and communication bottlenecks, respectively. Additionally, we present a framework for executing a single application in several frequency-voltage settings. The basic idea is to first divide programs into phases and then execute a series of experiments, with each phase assigned a prescribed frequency. Our results show that a power-scalable cluster has the potential to save energy by scaling the processor down to lower energy levels. Furthermore, we found that for some programs, it is possible to both consume less energy and execute in less time by increasing the number of nodes and reducing the frequency-voltage setting of the nodes. Additionally, we found that our phase-detecting heuristic can find assignments of frequencies to phases that are superior to any fixed-frequency solution.
- Extending Data Prefetching to Cope with Context Switch Misses (2009-03-18) Cui, Hanyu; Edward Gehringer, Committee Member; Eric Rotenberg, Committee Member; Yan Solihin, Committee Member; Suleyman Sair, Committee Chair
Among the various costs of a context switch, its impact on the performance of L2 caches is the most significant because of the resulting high miss penalty. To mitigate the impact of context switches, several OS approaches have been proposed to reduce the number of context switches. Nevertheless, frequent context switches are inevitable in certain cases and result in severe L2 cache performance degradation. Moreover, traditional prefetching techniques are ineffective in the face of context switches, as their prediction tables are also subject to loss of content during a context switch. To reduce the impact of frequent context switches, we propose restoring a program's locality by prefetching into the L2 cache the data a program was using before it was swapped out. A Global History List (GHL) is used to record a process' L2 read accesses in LRU order. These accesses are saved along with the process' context when the process is swapped out and are loaded to guide prefetching when it is swapped in. We also propose a feedback mechanism that greatly reduces the memory traffic incurred by our prefetching scheme. In addition, we propose a phase-guided prefetching scheme to complement GHL prefetching. Experiments show significant speedup over baseline architectures, with and without traditional prefetching, in the presence of frequent context switches.
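A small C sketch of the Global History List idea follows: recently read block addresses are kept in LRU order, the most recent entries are saved with the process context at switch-out, and they are prefetched at switch-in. Sizes, the eviction details, and the prefetch hook are illustrative assumptions only.

```c
/* Sketch of a Global History List: track recently read L2 block addresses in
 * LRU order, snapshot the hottest entries at context-switch-out, prefetch them
 * at switch-in.  Sizes are tiny for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GHL_SIZE  8          /* tracked blocks                          */
#define SAVE_SIZE 4          /* entries saved with the process context  */

static uint64_t ghl[GHL_SIZE];   /* ghl[0] = most recently used          */
static int ghl_len;

/* Record an L2 read access: move the block to the MRU position. */
static void ghl_access(uint64_t block) {
    int i;
    for (i = 0; i < ghl_len && ghl[i] != block; i++) ;
    if (i == ghl_len && ghl_len < GHL_SIZE) ghl_len++;
    if (i == GHL_SIZE) i = GHL_SIZE - 1;          /* evict the LRU entry  */
    memmove(&ghl[1], &ghl[0], (size_t)i * sizeof ghl[0]);
    ghl[0] = block;
}

/* Context switch out: save the MRU entries with the process state. */
static int ghl_save(uint64_t out[SAVE_SIZE]) {
    int n = ghl_len < SAVE_SIZE ? ghl_len : SAVE_SIZE;
    memcpy(out, ghl, (size_t)n * sizeof out[0]);
    return n;
}

/* Context switch in: prefetch the saved blocks to restore locality. */
static void ghl_restore(const uint64_t saved[], int n) {
    for (int i = n - 1; i >= 0; i--)
        printf("  prefetch block 0x%llx\n", (unsigned long long)saved[i]);
}

int main(void) {
    for (uint64_t b = 0x100; b < 0x10A; b++) ghl_access(b);
    uint64_t saved[SAVE_SIZE];
    int n = ghl_save(saved);     /* process swapped out     */
    ghl_restore(saved, n);       /* process swapped back in */
    return 0;
}
```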
- Improving the Security of the Heap through Inter-Process Protection and Intra-Process Temporal Protection (2005-12-07) Kharbutli, Mazen Mahmoud; Yan Solihin, Committee Chair; Milos Prvulovic, Committee Member; Eric Rotenberg, Committee Member; Gregory Byrd, Committee Member; Edward Gehringer, Committee Member; William Boettcher, Committee Member
In most current implementations, memory allocations and deallocations are performed by user-level library code, which keeps heap meta-data (heap structure information) and the application's heap data stored in an interleaved fashion in the same address space. Such implementations are inherently unsafe: they allow attackers to use an application's vulnerabilities (e.g. lack of heap-based buffer overflow checking) to corrupt its heap meta-data in order to execute malicious code or cause denial of service. In this dissertation, we propose an approach where heap meta-data and heap data are protected separately. Our first solution exploits existing inter-process protection mechanisms through Heap Server, a separate process that maintains heap meta-data on behalf of the application and runs in parallel with it. To perform memory allocations and deallocations, the application sends requests to the Heap Server, which responds to the requests and updates the meta-data. Since the heap meta-data is kept in the Heap Server's address space, attacks on the application can no longer corrupt it. Heap Server is directly implementable in current systems because it does not require new hardware. To optimize Heap Server's performance, we explore non-blocking communication, bulk deallocation, and pre-allocation optimizations. Evaluated on a real system, a fully-optimized Heap Server performs almost identically to a base heap management library with no protection mechanisms. As an alternative solution, we propose a new User-level Temporal Intra-Process Protection (UTIPP) mechanism in which a process protects itself from its own vulnerabilities by write-protecting its own heap meta-data and removing the protection only for legitimate stores in the heap management library. Unlike existing kernel-level page protection, which can only be modified in privileged mode, UTIPP allows a process to modify the new write-protection bit with a single instruction without disrupting normal pipeline flow. Evaluated on a cycle-accurate simulator, UTIPP adds negligible overhead in most benchmarks. Another contribution of this dissertation is a new heap layout obfuscation technique which relies on randomizing the space between heap chunks and the order of chunks in the heap, making heap data attacks more difficult. This obfuscation is integrated with Heap Server and UTIPP.
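To make the request/response shape of the Heap Server approach concrete, here is a bare-bones C sketch in which the application forwards allocation and deallocation requests over a pipe to a separate process that keeps all meta-data in its own address space. A toy bump allocator stands in for real meta-data management, and none of the optimizations described above (non-blocking communication, bulk deallocation, pre-allocation) are shown.

```c
/* Bare-bones sketch of the Heap Server idea: the application asks a separate
 * process for allocations, so heap meta-data never lives in the application's
 * address space.  Pipes and a bump allocator are illustrative stand-ins. */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

typedef struct { int op; size_t size; uintptr_t addr; } Request;  /* op: 0=alloc, 1=free */

static void heap_server(int req_fd, int rsp_fd) {
    uintptr_t next = 0x10000000;                  /* meta-data lives only here   */
    Request r;
    while (read(req_fd, &r, sizeof r) == (ssize_t)sizeof r) {
        if (r.op == 0) { r.addr = next; next += r.size; }   /* toy bump allocator */
        write(rsp_fd, &r, sizeof r);
    }
    _exit(0);
}

int main(void) {
    int req[2], rsp[2];
    pipe(req); pipe(rsp);
    if (fork() == 0) heap_server(req[0], rsp[1]); /* child acts as the Heap Server */

    Request r = { 0, 128, 0 };                    /* "malloc(128)" */
    write(req[1], &r, sizeof r);
    read(rsp[0], &r, sizeof r);
    printf("allocated 128 bytes at 0x%lx\n", (unsigned long)r.addr);

    r.op = 1;                                     /* "free(...)" */
    write(req[1], &r, sizeof r);
    read(rsp[0], &r, sizeof r);
    return 0;
}
```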
- Investigation of multi-state charge-storage properties of redox-active organic molecules in silicon-molecular hybrid devices for DRAM and Flash applications (2008-01-08) Gowda, Srivardhan Shivappa; Leda Lunardi, Committee Member; Eric Rotenberg, Committee Member; Veena Misra, Committee Chair; Jonathan S. Lindsey, Committee Member
- Length Adaptive Processors: A Solution for the Energy/Performance Dilemma in Embedded Systems (2009-04-22) Iyer, Balaji Viswanathan; Eric Rotenberg, Committee Member; Dr. Thomas M. Conte, Committee Chair; W. Rhett Davis, Committee Member; S. Purushothaman Iyer, Committee Member
Embedded handheld devices are the predominant computing platform today. These devices are required to perform complex tasks yet run on batteries. Some architects use ASICs to combat this energy-performance dilemma. Even though ASICs are efficient in solving this problem, they can cause code-compatibility problems for future generations. Thus, a general-purpose solution is necessary. Furthermore, no single processor configuration provides the best energy-performance solution over a diverse set of applications, or even throughout the life of a single application. As a result, the processor needs to be adaptable to the specific workload behavior. Code generation and code compatibility are the biggest challenges in such adaptable processors. At the same time, embedded systems have a fixed energy source, such as a 1-volt battery. Thus, the energy consumption of these devices must be predicted with utmost accuracy. A gross miscalculation can cause the system to be cumbersome for the user. In this work, we provide a new paradigm of embedded processors called Dynamic Length-Adaptive Processors that have the flexibility of a general-purpose processor with the specialization of an ASIC. We create such a processor, called the Clustered Length-Adaptive Word Processor (CLAW), which is able to dynamically modify its issue width with one VLIW instruction of overhead. This processor is designed in Verilog, synthesized, DRC-checked, and placed and routed. Its energy and performance values are reported using industrial-strength transistor-level analysis tools to dispel several myths about the dominating factors in embedded systems. To compile benchmarks for the CLAW processor, we provide the necessary software tools that help produce optimized code for performance improvement and energy reduction, and we discuss some of the code-generation procedures and challenges. Second, we study the code-generation patterns of the compiler by sampling a representative application and design an ISA opcode configuration that helps minimize the energy necessary to decode the instructions with no performance loss. We discover that a well-designed opcode configuration reduces energy not only in the decoder but also in other units such as the fetch and exception units. Moreover, a sizable amount of energy reduction can be achieved across a diverse set of applications. Next, we reduce the energy consumption and power dissipation of register reads and register writes by using popular common-value register-sharing techniques that are normally used to enhance performance. We provide a power model for these structures based on the value localities of the application. Finally, we perform a case study using the IEEE 802.11n PHY transmitter and decoder and identify its energy-hungry units. We then apply our techniques and show that CLAW is a solution for such hybrid, complex algorithms, providing high performance while reducing total energy.
- Redox-active Organic Molecules on Silicon and Silicon Dioxide Surfaces for Hybrid Silicon-molecular Memory Devices (2006-11-17) Mathur, Guruvayurappan; Jonathan S. Lindsey, Committee Member; Eric Rotenberg, Committee Member; John R. Hauser, Committee Member; Veena Misra, Committee Chair
The focus of this dissertation is on creating electronic devices that utilize the unique charge storage properties of redox-active organic molecules for memory applications. A hybrid silicon-molecular approach has been adopted to make use of the advantages of the existing silicon technology, as well as to study and exploit the interaction between the organic molecules and the bulk semiconductor. As technology heads into the nano regime, this hybrid approach may prove to be the bridge between the existing Si-only technology and a future molecule-only technology. Functionalized monolayers of redox-active molecules were formed on silicon surfaces of different doping types and densities. Electrolyte-molecule-silicon test structures were electrically characterized and studied using cyclic voltammetry and impedance spectroscopy techniques. The dependence of the oxidation and reduction processes on the silicon doping type and density was analyzed and explained using voltage balance equations and surface potentials of silicon. The role played by the silicon substrate in the operation of these memory devices was identified. Multiple bits in a single cell were achieved using either molecules exhibiting multiple stable redox states or mixed monolayers of different molecules. Self-assembled monolayers of redox-active molecules were also incorporated on varying thicknesses of silicon dioxide on n- and p-type silicon substrates in an attempt to create non-volatile memory. The dependences of the read/write/erase voltages and retention times of these devices were correlated to the SiO2 thickness by using a combination of Butler-Volmer and semiconductor theories. The region of operation of the silicon surface (accumulation, depletion or inversion) and the extent of tunneling current through the silicon dioxide were found to influence the charging and discharging of the molecules in the monolayer. Increased retention times due to the presence of SiO2 can be useful in realizing non-volatile memories. Polymeric films of molecules were formed on Si and SiO2 substrates and exhibited very high surface densities. Metal films were deposited directly on these films and the resultant devices were found to exhibit redox-independent behavior. A combination of metal gate and dielectric was deposited on molecules in an attempt to create solid-state hybrid silicon-molecular devices. The metal gate and dielectric can replace the electrolyte and electrolytic double layer to create an electronic cell instead of an ionic cell. The redox properties of the molecules were retained after the deposition of dielectric and metal, which augurs well for a solid-state device. FET-type structures were fabricated and molecules incorporated on them in order to modulate the characteristics of the FETs by charging and discharging the molecules. Drain current and transfer characteristics of electrolyte-gated "moleFETs" were modulated by oxidizing and reducing molecules on the channel region. Hybrid moleFET devices may be ideal tools for creating non-volatile FLASH-type memory devices.
This work has recognized the interaction of organic molecules and bulk silicon and utilized the advantages of current CMOS technology along with the unique properties of molecules, such as discrete quantum states, low voltage operation etc., to create a class of hybrid memory devices. A way to create solid-state molecular devices retaining the inherent properties of molecules has been proposed and demonstrated. This work might be useful in providing a smooth transition from silicon electronics to molecular electronics.
- Slipstream Execution Mode for CMP-based Shared Memory Systems (2003-07-30) Ibrahim, Khaled Zakarya Moustafa; Gregory T. Byrd, Committee Chair; Thomas M. Conte, Committee Member; Eric Rotenberg, Committee Member; Frank Mueller, Committee Member
Scalability of applications on distributed shared-memory (DSM) multiprocessors is limited by communication and synchronization overheads. At some point, using more processors to increase parallelism yields diminishing returns or even degrades performance. When increasing concurrency is futile, we propose an additional mode of execution, called slipstream mode, that instead enlists extra processors to assist parallel tasks by reducing perceived overheads. We consider DSM multiprocessors built from dual-processor chip multiprocessor (CMP) nodes (e.g., IBM Power-4 CMP) with a shared L2 cache. A parallel task is allocated on one processor of each CMP node. The other processor of each node executes a reduced version of the same task. The reduced version skips shared-memory stores and synchronization, allowing it to run ahead of the true task. Even with the skipped operations, the reduced task makes accurate forward progress and generates an accurate reference stream, because branches and addresses depend primarily on private data. Slipstream execution mode yields multiple benefits. First, the reduced task prefetches data on behalf of the true task. Second, reduced tasks provide a detailed picture of future reference behavior, enabling a number of optimizations aimed at accelerating coherence events. We investigate a well-known optimization, self-invalidation. We also investigate providing a confidence mechanism for speculation after barrier synchronization. We investigate the implementation of an OpenMP compiler that supports slipstream execution mode. We discuss how each OpenMP construct can be implemented to take advantage of slipstream mode, and we present a minor extension that allows runtime or compile-time control of slipstream execution. We also investigate the interaction between slipstream mechanisms and OpenMP scheduling. Our implementation supports both static and dynamic scheduling in slipstream mode. For multiprocessor systems with up to 16 CMP nodes, slipstream mode is 12-19% faster with prefetching only. With self-invalidation also enabled, performance is improved by as much as 29%. We extend slipstream mode to provide a confidence mechanism for barrier speculation. This mechanism identifies dependencies and tries to avoid dependency violations that lead to misspeculations (and subsequent rollbacks). Rollbacks are reduced by up to 95%, and the improvement in performance is up to 13%. Slipstream execution mode enables a wide range of optimizations based on an accurate future image of the program behavior. It does not require the custom auxiliary hardware tables used by history-based predictors.
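The C fragment below conveys the flavor of a reduced task: the A-stream copy of a loop performs the same loads as the true task (warming the node's shared L2) but skips the shared-memory store and the barrier, so it can run ahead. It is an illustrative source-level analogy, not the binary-level reduction mechanism used in the dissertation.

```c
/* Flavor-only sketch of slipstream execution mode: the reduced (A-stream) copy
 * of a parallel task skips shared-memory stores and synchronization, so it runs
 * ahead of the true (R-stream) copy and effectively prefetches into the shared
 * L2 of the CMP node.  Illustrative C, not the dissertation's infrastructure. */
#include <stdio.h>

#define N 1024
static double shared_a[N], shared_b[N];
static void barrier(void) { /* stand-in for the real synchronization primitive */ }

/* True task: full stores and synchronization. */
static void r_stream(int lo, int hi) {
    for (int i = lo; i < hi; i++)
        shared_a[i] = 2.0 * shared_b[i];   /* shared-memory store              */
    barrier();
}

/* Reduced task: loads still happen (bringing lines into the shared cache) but
 * the shared store and the barrier are skipped, so it makes fast, approximate
 * forward progress ahead of the R-stream. */
static void a_stream(int lo, int hi) {
    volatile double sink = 0;
    for (int i = lo; i < hi; i++)
        sink += 2.0 * shared_b[i];         /* store skipped; load still misses */
    (void)sink;
    /* barrier() skipped */
}

int main(void) {
    a_stream(0, N);                        /* runs ahead, warming the cache    */
    r_stream(0, N);                        /* true task follows                */
    printf("a[0]=%f\n", shared_a[0]);
    return 0;
}
```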
- Software Thread Integration for Converting TLP to ILP on VLIW/EPIC Architectures (2003-01-14) So, Won; Eric Rotenberg, Committee Member; Tom Conte, Committee Member; Alexander G. Dean, Committee Chair
Multimedia applications are pervasive in modern systems. They generally require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor makers to adopt high-performance architectures like VLIW (Very Long Instruction Word) or EPIC (Explicitly Parallel Instruction Computing). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, typical utilization levels for compiler-generated VLIW/EPIC code range from one-eighth to one-half, because a single instruction stream has limited ILP. Software Thread Integration (STI) is a software technique which interleaves multiple threads at the machine instruction level. Integration of threads increases the number of independent instructions, allowing the compiler to generate a more efficient instruction schedule and hence faster runtime performance. We have developed techniques to use STI for converting thread-level parallelism (TLP) to ILP on VLIW/EPIC architectures. By focusing on the abundant parallelism at the procedure level in multimedia applications, we integrate parallel procedure calls, which can be seen as threads, by gathering work in the application. We rely on the programmer to identify parallel procedures, rather than on compiler identification. Our methods extend whole-program optimization by expanding the scope of the compiler through software thread integration and procedure cloning. The approach is effectively a superset of loop jamming, as it allows a larger variety of threads to be jammed together. This thesis proposes a methodology to integrate multiple threads in multimedia applications and introduces the concept of a 'Smart RTOS' as an execution model for utilizing integrated threads efficiently in embedded systems. We demonstrate our technique by integrating three procedures from a JPEG application at the C source code level, compiling with four compilers for the Itanium EPIC architecture and measuring the performance with the on-chip performance measurement units. Experimental results show procedure speedups of up to 18% and program speedups of up to 11%. Detailed performance analysis shows the primary bottleneck to be the Itanium's 16K instruction cache, which has limited room for the code expansion caused by STI.
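A source-level C illustration of the integration idea follows: two independent calls to the same procedure are replaced by one integrated procedure whose loop bodies are interleaved, giving a VLIW/EPIC scheduler more independent statements per iteration. The procedure is a toy stand-in, not one of the JPEG routines integrated in the thesis.

```c
/* Source-level illustration of software thread integration (STI): two
 * independent calls are fused into one integrated procedure with interleaved
 * loop bodies, exposing more independent work to the compiler's scheduler. */
#include <stdio.h>

#define N 64

static void scale(const int *in, int *out) {
    for (int i = 0; i < N; i++)
        out[i] = 3 * in[i] + 1;
}

/* Integrated version of two calls: scale(in1, out1) and scale(in2, out2). */
static void scale_x2(const int *in1, int *out1, const int *in2, int *out2) {
    for (int i = 0; i < N; i++) {
        out1[i] = 3 * in1[i] + 1;   /* copy 1: independent of copy 2         */
        out2[i] = 3 * in2[i] + 1;   /* copy 2: schedulable alongside copy 1  */
    }
}

int main(void) {
    int a[N], b[N], ra[N], rb[N], ia[N], ib[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    scale(a, ra); scale(b, rb);     /* original: two separate calls          */
    scale_x2(a, ia, b, ib);         /* integrated: one call doing both       */

    printf("%d %d %d %d\n", ra[N-1], ia[N-1], rb[N-1], ib[N-1]);
    return 0;
}
```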
- Spectral Prediction: A Signals Approach to Computer Architecture Prefetching (2006-08-10) Sharma, Saurabh; Thomas M. Conte, Committee Chair; Greg Byrd, Committee Member; Purush Iyer, Committee Member; Eric Rotenberg, Committee Member
Effective data prefetching requires accurate mechanisms to predict embedded patterns in the miss reference behavior. This dissertation introduces a novel technique, Spectral Prediction, that accurately identifies the pattern by dynamically adjusting to its frequency. The proposed technique exploits the fact that addresses in the reference stream follow definite frequencies and captures them using recurrence distance information. In so doing, the patterns are successfully detected while the random noise is filtered out. This dissertation describes two implementations of spectral prediction: the Spectral Prefetcher (SP) and the Differential-only Spectral Prefetcher (DOSP). The first implementation, SP, is adaptive in behavior and can capture either the pattern of addresses or the pattern of strides between the addresses within the cache miss stream. SP was designed as a proof of concept and provided productive insights for designing a more elegant implementation, DOSP, which is resource-efficient and offers better performance. The dissertation also includes simulation-driven performance evaluations of SP and DOSP. Our results show that these implementations of spectral prediction achieve 4% to 400% performance improvement for memory-intensive programs running on an aggressive out-of-order processor with large caches and a large branch predictor. Additionally, using a set of co-scheduled pairs of benchmarks on a dual-core CMP, we show that a 16KB on-chip implementation of DOSP provides an average throughput improvement of 10%, and up to 86% at best.
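As a highly simplified illustration of keying on recurrence, the C sketch below searches the recent stride history of a miss stream for a repeating period and, once one is found, prefetches using the stride expected one period back. It only conveys the flavor of frequency/recurrence-based prediction; it is not the SP or DOSP design, and all sizes are arbitrary.

```c
/* Simplified illustration of recurrence/frequency-based prediction: find a
 * period p at which the recent stride history repeats, then prefetch with the
 * stride expected one period back.  Not the SP/DOSP design. */
#include <stdint.h>
#include <stdio.h>

#define HIST 32
#define MAX_PERIOD 8

static int64_t hist[HIST];
static int hist_len;
static uint64_t last_addr;

/* Return the smallest period at which the whole recorded history repeats. */
static int find_period(void) {
    for (int p = 1; p <= MAX_PERIOD && 2 * p <= hist_len; p++) {
        int ok = 1;
        for (int i = p; i < hist_len; i++)
            if (hist[i] != hist[i - p]) { ok = 0; break; }
        if (ok) return p;
    }
    return 0;
}

/* Feed each cache-miss address; prefetch once a repeating stride pattern is seen. */
static void on_miss(uint64_t addr) {
    if (hist_len > 0 || last_addr) {
        if (hist_len == HIST) { for (int i = 1; i < HIST; i++) hist[i-1] = hist[i]; hist_len--; }
        hist[hist_len++] = (int64_t)(addr - last_addr);
    }
    last_addr = addr;
    int p = find_period();
    if (p)
        printf("miss 0x%llx -> prefetch 0x%llx\n", (unsigned long long)addr,
               (unsigned long long)(addr + (uint64_t)hist[hist_len - p]));
    else
        printf("miss 0x%llx\n", (unsigned long long)addr);
}

int main(void) {
    /* Miss stream with a repeating stride pattern: +64, +64, +4096, ... */
    uint64_t a = 0x100000;
    for (int r = 0; r < 4; r++) {
        on_miss(a += 64); on_miss(a += 64); on_miss(a += 4096);
    }
    return 0;
}
```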
- Static Determination of Synchronization Method for Slipstream Multiprocessors (2004-03-30) Christner, Robert K.; Eric Rotenberg, Committee Member; Gregory T. Byrd, Committee Chair; Vincent W. Freeh, Committee Member
The scalability of distributed shared-memory systems is limited largely by communication overhead, most of which can be attributed to memory latency and synchronization. In systems built with dual-processor single-chip multiprocessors (CMPs), a proposed solution to this scalability limitation is the use of slipstream execution mode for certain applications. Instead of using the second on-chip processor to run a separate slice of the parallel application, slipstream mode utilizes the second processor to run a reduced version of the same task. The reduced version, known as the advanced stream, skips certain high-latency events but continues to make accurate forward progress and provides future reference behavior for the unreduced version, known as the redundant stream. Slipstream mode provides several methods of synchronization between the advanced (A-) and redundant (R-) streams to govern how far the A-stream can advance in front of the R-stream and also to keep the A-stream on the correct control path. A current limitation of slipstream is that the method used for A-R synchronization must be specified by the user at run time and used throughout the entire execution of the program. This is because the method that results in the best performance is application-dependent and unknown beforehand. We investigate alternate procedures for determining the A-R synchronization method through static code analysis, in the form of both profile- and compiler-driven techniques. A trace profile algorithm is presented that gives insight into shared-memory access patterns that favor certain synchronization methods. We also discuss compiler integration of the synchronization-method determination. We show that techniques similar to those used for compiler-driven prefetching and data forwarding could be applied in this context as well.
- Towards Performance, System and Security Issues in Secure Processor Architectures (2010-11-08) Chhabra, Siddhartha; Yan Solihin, Committee Chair; Gregory Byrd, Committee Member; Douglas Reeves, Committee Member; Eric Rotenberg, Committee Member
- Using Performance Bounds to Guide Code Compilation and Processor Design (2003-07-10) Zhou, Huiyang; Thomas M. Conte, Committee Chair; Gregory T. Byrd, Committee Member; Eric Rotenberg, Committee Member; S. Purushothaman Iyer, Committee Member
Performance bounds represent the best achievable performance that can be delivered by target microarchitectures on specified workloads. Accurate performance bounds establish an efficient way to evaluate the performance potential of either code optimizations or architectural innovations. We advocate using performance bounds to guide code compilation. In this dissertation, we introduce a novel bound-guided approach to systematically regulate code-size-related instruction-level parallelism (ILP) optimizations, including tail duplication, loop unrolling, and if-conversion. Our approach is based on the notion of code size efficiency, which is defined as the ratio of ILP improvement over static code size increase. With such a notion, we (1) develop a general approach to selectively perform optimizations to maximize the ILP improvement while minimizing the cost in code size, (2) define the optimal tradeoff between ILP improvement and code size overhead, and (3) develop a heuristic to achieve this optimal tradeoff. We extend our performance bounds as well as code size efficiency to perform code-size-aware compilation for real-time applications. Profile-independent performance bounds are proposed to reveal the criticality of each path in a task. Code optimizations can then focus on the critical paths (even at the cost of non-critical ones) to reduce the worst-case execution time, thereby improving the overall schedulability of the real-time system. For memory-intensive applications featuring heavy pointer chasing, we develop an analytical model based on performance bounds to evaluate memory latency hiding techniques. We model the performance potential of these techniques and use the analytical results to motivate an architectural innovation, called recovery-free value prediction, to enhance memory-level parallelism (MLP). The experimental results show that our proposed technique improves MLP significantly and achieves impressive speedups.
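The notion of code size efficiency can be made concrete with a toy example: rank candidate ILP optimizations by estimated ILP gain per byte of static code growth and apply them greedily. The C sketch below does exactly that under an invented code-size budget; the candidates, numbers, and budget policy are hypothetical, and the dissertation's heuristic finds the optimal tradeoff point rather than using a fixed budget.

```c
/* Toy illustration of selecting ILP optimizations by "code size efficiency"
 * (ILP gain per byte of static code growth): sort candidates by the ratio and
 * apply them greedily under a hypothetical code-size budget. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { const char *name; double ilp_gain; int size_cost; } Opt;

/* qsort comparator: descending order of efficiency = ilp_gain / size_cost. */
static int by_efficiency(const void *x, const void *y) {
    const Opt *a = x, *b = y;
    double ea = a->ilp_gain / a->size_cost, eb = b->ilp_gain / b->size_cost;
    return (ea < eb) - (ea > eb);
}

int main(void) {
    Opt cand[] = {
        { "unroll loop A x4",       0.28, 640 },
        { "if-convert branch in B", 0.05,  48 },
        { "tail-duplicate block C", 0.12, 256 },
        { "unroll loop D x2",       0.02, 512 },
    };
    int n = sizeof cand / sizeof cand[0], budget = 1024, used = 0;

    qsort(cand, (size_t)n, sizeof cand[0], by_efficiency);
    for (int i = 0; i < n; i++) {
        if (used + cand[i].size_cost > budget) continue;   /* skip what no longer fits */
        used += cand[i].size_cost;
        printf("apply %-25s (eff %.5f, +%d bytes)\n", cand[i].name,
               cand[i].ilp_gain / cand[i].size_cost, cand[i].size_cost);
    }
    printf("static code growth: %d of %d bytes\n", used, budget);
    return 0;
}
```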