
Browsing by Author "Dr. Eric Rotenberg, Committee Chair"

Now showing 1 - 7 of 7
  • Dynamic Pipeline Scaling
    (2003-06-04) Ramrakhyani, Prakash Shyamlal; Dr. Alexander G. Dean, Committee Member; Dr. Wm. Rhett Davis, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Eric Rotenberg, Committee Chair
    The classic problem of balancing power and performance persists as technology progresses. Fortunately, high performance is not a constant requirement in a system. When the performance requirement is not at its peak, the processor can be configured to conserve power while providing just enough performance. Prior work has proposed dynamically scaling parameters such as voltage, frequency, and cache structure to conserve power. This thesis analyzes the effects of dynamically scaling a new processor parameter: pipeline depth. We propose Dynamic Pipeline Scaling, a technique to conserve energy at low frequencies when voltage is invariable. When frequency can be lowered enough, adjacent pipeline stages can be merged to form a shallow pipeline. At equal voltage and frequency, the shallow pipeline is more energy-efficient than the deep pipeline. This is because the shallow pipeline has fewer data dependence stalls and a lower branch misprediction penalty. Thus, there are fewer wasteful transitions in a shallow pipeline, which translates directly to lower energy consumption. On a variable-voltage processor, the shallow pipeline requires a higher operating voltage than the deep pipeline for the same frequency. Since energy depends on the square of voltage and depends linearly on the total number of transitions, on a variable-voltage processor, a deep pipeline is typically more energy-efficient than a shallow pipeline. However, there may be situations where variable voltage is not desired. For example, if the latency to switch voltage is large, voltage scaling may not be beneficial in a real-time system with tight deadlines. On such a system, dynamic pipeline scaling can yield energy benefits, in spite of not scaling voltage.
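    The abstract's energy argument can be captured in a toy model. All constants below, and the linear flush-cost assumption, are illustrative assumptions of this sketch, not figures from the thesis:

    ```python
    # Toy energy model for Dynamic Pipeline Scaling. Energy scales with
    # V^2 times total transitions; a deeper pipeline wastes more
    # transitions per branch misprediction because more in-flight work
    # is flushed. All numbers here are hypothetical.
    def pipeline_energy(useful_transitions, mispredictions, depth, voltage):
        wasted = mispredictions * depth  # flush cost grows with pipeline depth
        return voltage ** 2 * (useful_transitions + wasted)

    # At a fixed (invariable) voltage and frequency, merging adjacent
    # stages into a shallow pipeline reduces wasted transitions:
    deep    = pipeline_energy(1_000_000, 10_000, depth=20, voltage=1.0)
    shallow = pipeline_energy(1_000_000, 10_000, depth=10, voltage=1.0)
    print(shallow < deep)  # shallow wins when voltage cannot scale
    ```

    The same model also illustrates the abstract's caveat: if the shallow pipeline must run at a higher voltage to reach the same frequency, the quadratic voltage term can make the deep pipeline more energy-efficient again.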
  • Improving Transient Fault Tolerance of Slipstream Processors
    (2005-12-12) Parthasarathy, Sailashri; Dr. Eric Rotenberg, Committee Chair; Dr. Suleyman Sair, Committee Member; Dr. Jun Xu, Committee Member
    A slipstream processor runs two copies of a program, one slightly ahead of the other, to achieve both higher single-program performance and transient fault tolerance. The leading copy of the program, or the Advanced Stream (A-stream), is accelerated by executing only a key subset of all instructions. The partial A-stream is speculative. Therefore, a second, complete copy of the program, called the Redundant Stream (R-stream), receives and checks all A-stream outcomes. The R-stream is also accelerated in this process. Together, the A-stream and R-stream finish faster than a single program copy would. The partial redundancy between the A-stream and R-stream enables detection and recovery from transient faults. A transient fault that affects a redundantly executed instruction is easily detected, because its two instances will differ. However, a transient fault that affects a singly executed instruction (instruction removed from A-stream) is difficult to detect directly, because there is no redundant counterpart for comparison. Actually, a fault in a singly executed instruction is indirectly detectable via a redundantly executed consumer. However, such a fault is unrecoverable since the fault is attributed to the consumer. Recovery is initiated too late, from the consumer instead of the faulty producer. We propose a mechanism that conservatively attributes a detected fault, not to the redundantly executed instruction that detected it, but to its singly executed producer. Accordingly, recovery is initiated safely from the singly executed producer. Our approach works by forming a forward slice for each singly executed instruction, terminating in its direct/indirect redundantly executed consumers. Now, a consumer can mark its singly executed producer as faulty when its comparison mismatches. A singly executed branch does not have a forward slice and thus is not checkable by consumers. 
However, the branch was removed from the A-stream precisely because its branch prediction is highly confident, hence, very likely correct. This likely correct branch prediction is treated as a second execution for the corresponding singly executed branch, different from true execution but nearly as effective for detecting faults. In fact, the observation about confident branches extends to all redundantly executed instructions since the A-stream is predictive as a whole. All A-stream instructions are speculative, yet most likely correct in the fault-free case. This reveals an intriguing predictive checking paradigm. Experiments using the SPEC95 and SPEC2K benchmarks show that coverage improves from 81% for baseline slipstream to 99% with only a small decrease in speedup. To obtain the same performance as baseline slipstream, we propose a relaxed checking model, which still achieves a much higher coverage of 95%.
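The conservative attribution mechanism can be sketched in software. The thesis builds forward slices from singly executed producers to their redundant consumers in hardware; this analogue walks the equivalent backward direction from the mismatching consumer, and the graph encoding is my own illustration:

```python
# Sketch of conservative fault attribution in slipstream checking.
def attribute_fault(mismatching_insn, producers_of):
    """Return the singly executed producers a detected fault is
    conservatively attributed to, so recovery starts at the producer
    rather than the consumer. producers_of maps an instruction to a
    list of (producer, is_singly_executed) pairs."""
    stack, seen, culprits = [mismatching_insn], set(), []
    while stack:
        insn = stack.pop()
        if insn in seen:
            continue
        seen.add(insn)
        for producer, is_single in producers_of.get(insn, []):
            if is_single:
                culprits.append(producer)  # blame the unchecked producer
            stack.append(producer)
    # With no singly executed producer in the slice, the fault is
    # attributed to (and recovered at) the mismatching instruction.
    return culprits or [mismatching_insn]
```

Recovery from the singly executed producer is safe precisely because it precedes every instruction the fault could have corrupted.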
  • Preliminary Study of Trace-Cache-Based Control Independence Architecture
    (2006-05-23) Al-Otoom, Muawya Mohamed; Dr. Eric Rotenberg, Committee Chair; Dr. Suleyman Sair, Committee Member; Dr. W. Rhett Davis, Committee Member
    Conventional superscalar processors recover from a mispredicted branch by squashing all instructions after the branch. While simple, this approach needlessly re-executes many future control-independent (CI) instructions after the branch's reconvergent point. Selective recovery is possible, but is complicated by the fact that some control-independent instructions must be singled out for re-execution, namely those that depend on data influenced by the mispredicted branch. That is, control-independent data-dependent (CIDD) instructions must be singled out for re-execution, thus avoiding needless re-execution of control-independent data-independent (CIDI) instructions. To contrast different recovery models, we abstract the recovery process as constructing a "recovery sub-program" for repairing partially incorrect future state. In this conceptual framework, selective recovery constructs a shorter recovery sub-program than full recovery. In current selective recovery microarchitectures, the recovery sub-program is constructed on-the-fly after detecting a mispredicted branch, by sequencing through all CI instructions and singling out only the CIDD instructions among them. Not only is this discriminating approach complex, but the same recovery sub-program is repeatedly constructed every time this branch is mispredicted. We propose constructing the recovery sub-program for each branch once and caching it for future use. In particular, traces of CIDD instructions are pre-constructed and stored in a recovery trace cache. When a misprediction is detected, first, the branch's correct control-dependent instructions are fetched from the conventional instruction cache as usual. Then, at the reconvergent point, fetching simply switches from the instruction cache to the recovery trace cache. The appropriate recovery trace is fetched from the recovery trace cache at this time. 
In this way, fetching only the CIDD instructions is as simple as fetching all CI instructions from a conventional instruction cache. No explicit singling-out process is needed as this was done a priori, on the fill-side of the trace cache. Therefore, the recovery trace cache is efficient on multiple levels, combining the simplicity of full recovery with the performance of selective recovery. This thesis explains the proposed trace-cache-based control independence architecture, at a high level. Preliminary studies are also presented, to project the potential of exploiting control independence as well as the effectiveness of a trace-cache-based approach in particular. The results include (i) breakdowns of retired dynamic instructions into different categories, based on their control and data dependences with respect to prior mispredicted branches, (ii) contributions of individual recovery traces to total CIDI instruction savings, and (iii) hit ratios of finite recovery trace caches.
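The fill-side/hit-side division of labor can be sketched as follows. The names and the software framing are my own; the thesis describes a hardware structure accessed alongside the instruction cache:

```python
# Sketch of the recovery trace cache: construct each branch's
# recovery sub-program once (fill side), then reuse it on every
# subsequent misprediction of that branch (hit side).
recovery_trace_cache = {}

def fill_recovery_trace(branch_pc, ci_insns, is_cidd):
    """Fill side: single out the CIDD instructions a priori and
    cache the recovery trace for this branch."""
    recovery_trace_cache[branch_pc] = [i for i in ci_insns if is_cidd(i)]

def fetch_after_misprediction(branch_pc, correct_cd_path):
    """Hit side: fetch the correct control-dependent path from the
    instruction cache as usual, then switch fetching to the cached
    CIDD-only trace at the reconvergent point."""
    return correct_cd_path + recovery_trace_cache.get(branch_pc, [])
```

The per-misprediction singling-out of CIDD from CIDI instructions disappears from the recovery path: it happens once, at fill time.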
  • Reducing Frequency in Real-Time Systems via Speculation and Fall-Back Recovery
    (2003-04-22) Anantaraman, Aravindh Venkataseshadri; Dr. Eric Rotenberg, Committee Chair; Dr. Gregory T. Byrd, Committee Member; Dr. Frank Mueller, Committee Member; Dr. Alexander G. Dean, Committee Member
    In real-time systems, safe operation requires that tasks complete before their deadlines. Static worst-case timing analysis is used to derive an upper bound on the number of cycles for a task, and this is the basis for a safe frequency that ensures timely completion in any scenario. Unfortunately, it is difficult to tightly bound the number of cycles for a complex task executing on a complex pipeline, and so the safe frequency tends to be over-inflated. Power efficiency is sacrificed for safety. The situation only worsens as advanced microarchitectural techniques are deployed in embedded systems. High-performance microarchitectural techniques such as caching, branch prediction, and pipelining decrease typical execution times. At the same time, it is difficult to tightly bound the worst-case execution time of complex tasks on highly dynamic substrates. As a result, the gap between worst-case execution time and typical execution time is expected to increase. This thesis explores frequency speculation, a technique for reconciling the power/safety trade-off. Tight but unsafe bounds (derived from past task executions) are the basis for a low speculative frequency. The task is divided into multiple smaller sub-tasks and each sub-task is assigned an interim soft deadline, called a checkpoint. Sub-tasks are attempted at the speculative frequency. Continued safe progress of the task as a whole is confirmed for as long as speculative sub-tasks complete before their checkpoints. If a sub-task exceeds its checkpoint (misprediction), the system falls back to a higher recovery frequency that ensures the overall deadline is met in spite of the interim misprediction. The primary contribution of this thesis is the development of two new frequency speculation algorithms. A drawback of the original frequency speculation algorithm is that a sub-task misprediction is detected only after completing the sub-task. 
The misprediction can be detected earlier through the use of a watchdog timer that expires at the checkpoint unless the sub-task completes in time to advance it to the next checkpoint. Early detection is superior because recovery can be initiated earlier, in the middle of the mispredicted sub-task. This introduces extra slack that can be used to lower the speculative frequency even further. A new issue that arises with early detection is bounding the amount of work that remains in the mispredicted sub-task after the misprediction is detected. The two new algorithms differ in how the unfinished work is bounded. The first algorithm conservatively bounds the execution time of the unfinished portion using the worst-case execution time of the entire sub-task. The second uses more sophisticated analysis to derive a much tighter bound. Both early-detection algorithms outperform the late-detection algorithm. For tight deadlines, the sophisticated analysis of the second early-detection algorithm truly pays off. It yields 60-70% power savings for six real-time applications from the C-lab suite.
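The checkpoint reasoning can be modeled in a few lines. This is a deliberate simplification: it folds the checkpoint test into a per-sub-task admission check rather than modeling watchdog timers or the early-detection algorithms, and all frequencies and cycle counts are hypothetical:

```python
# Toy model of checkpoint-based frequency speculation.
def run_task(actual_cycles, wcet_cycles, deadline, f_spec, f_rec):
    """Attempt each sub-task at the low speculative frequency only if,
    even in the worst case, falling back to the recovery frequency for
    the remaining sub-tasks still meets the overall deadline."""
    t, chosen = 0.0, []
    for i, cycles in enumerate(actual_cycles):
        worst_if_spec = (t + wcet_cycles[i] / f_spec
                         + sum(wcet_cycles[i + 1:]) / f_rec)
        f = f_spec if worst_if_spec <= deadline else f_rec
        t += cycles / f
        chosen.append(f)
    return t, chosen

# Typical case: actual cycles are well under the worst case, so every
# sub-task runs at the speculative frequency and power is saved.
t, freqs = run_task([100, 100, 100], [200, 200, 200],
                    deadline=900, f_spec=0.5, f_rec=1.0)
assert t <= 900 and set(freqs) == {0.5}
```

The safety property mirrors the thesis's argument: the speculative frequency is only used while enough slack remains to finish the worst case at the recovery frequency.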
  • A Simple Mechanism for Detecting Ineffectual Instructions in Slipstream Processors
    (2002-05-15) Koppanalil, Jinson Joseph; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair; Dr. Thomas M. Conte, Committee Member
    The slipstream paradigm harnesses multiple processing elements in a chip multiprocessor (CMP) to speed up a single, sequential program. It does this by running two redundant copies of the program, one slightly ahead of the other. The leading program is the Advanced Stream (A-stream) and the trailing program is the Redundant Stream (R-stream). Predicted non-essential computation is speculatively removed from the A-stream. The A-stream is sped up because it fetches and executes fewer instructions than the original program. The trailing R-stream checks the control flow and data flow outcomes of the A-stream, and redirects it when it fails to make correct forward progress. The R-stream also exploits the A-stream outcomes as accurate branch and value predictions. Therefore, although the R-stream retires the same number of instructions as the original program, it fetches and executes much more efficiently. As a result, both program copies finish sooner than the original program. A slipstream component called the instruction-removal detector (IR-detector) detects past-ineffectual instructions in the R-stream and selects them for possible removal from the A-stream in the future. The IR-detector uses a two-step selection process. First, it selects key trigger instructions -- unreferenced writes, non-modifying writes, and correctly-predicted branches. A table similar to a conventional register rename table can easily detect unreferenced and non-modifying writes. The second step, called back-propagation, selects computation chains feeding the trigger instructions. In an explicit implementation of back-propagation, retired R-stream instructions are buffered and consumer instructions are connected to their producer instructions using a configurable interconnection network. Consumers that are selected because they are ineffectual use these connections to propagate their ineffectual status to their producers, so that they get selected, too. 
Explicit back-propagation is complex because it requires a configurable interconnection network. This thesis proposes a simpler implementation of back-propagation, called implicit back-propagation. The key idea is to logically monitor the A-stream instead of the R-stream. Now, the IR-detector only performs the first step, i.e., it selects unreferenced writes, non-modifying writes, and correctly-predicted branches. After building up confidence, these trigger instructions are removed from the A-stream. Once removed, their producers become unreferenced writes in the A-stream (because they no longer have consumers). After building up confidence, the freshly exposed unreferenced writes are also removed, exposing additional unreferenced writes. This process continues iteratively, until eventually entire non-essential dependence chains are removed. By logically monitoring the A-stream, back-propagation is reduced to detecting unreferenced writes. Implicit back-propagation eliminates complex hardware and performs within 0.5% of explicit back-propagation.
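The iterative removal process reduces to a fixed-point computation over a def-use graph. The graph encoding below is my own illustration; the IR-detector does this in hardware with confidence counters, which the sketch omits:

```python
# Sketch of implicit back-propagation: remove trigger instructions,
# then iteratively remove any write all of whose consumers have been
# removed (a freshly exposed unreferenced write), until nothing changes.
def implicit_back_propagation(consumers_of, triggers):
    removed = set(triggers)
    changed = True
    while changed:
        changed = False
        for insn, users in consumers_of.items():
            if insn not in removed and users and users <= removed:
                removed.add(insn)  # newly unreferenced write
                changed = True
    return removed

# A dependence chain a -> b -> c collapses once its trigger c
# (e.g., a non-modifying write) is removed:
chain = {"a": {"b"}, "b": {"c"}, "c": set()}
print(sorted(implicit_back_propagation(chain, {"c"})))  # ['a', 'b', 'c']
```

Note how back-propagation is never performed explicitly: each pass only detects unreferenced writes, yet whole non-essential chains are eliminated over successive iterations.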
  • Slipstream-Based Steering for Clustered Microarchitectures
    (2003-06-20) Gupta, Nikhil; Dr. Thomas M. Conte, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair
    To harvest increasing levels of ILP while maintaining a fast clock, clustered microarchitectures have been proposed. However, the fast clock enabled by clustering comes at the cost of multiple cycles to communicate values among clusters. A chief performance limiter of a clustered microarchitecture is inter-cluster communication between instructions. Specifically, inter-cluster communication between critical-path instructions is the most harmful. The slipstream paradigm identifies critical-path instructions in the form of effectual instructions. We propose eliminating virtually all inter-cluster communication among effectual instructions, simply by ensuring that the entire effectual component of the program executes within a cluster. This thesis proposes two execution models: the replication model and the dedicated-cluster model. In the replication model, a copy of the effectual component is executed on each of the clusters and the ineffectual instructions are shared among the clusters. In the dedicated-cluster model, the effectual component is executed on a single cluster (the effectual cluster), while all ineffectual instructions are steered to the remaining clusters. Outcomes of ineffectual instructions are not needed (in hindsight), hence their execution can be exposed to inter-cluster communication latency without significantly impacting overall performance. IPC of the replication model on dual clusters and quad clusters is virtually independent of inter-cluster communication latency. IPC decreases by 1.3% and 0.8%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, when inter-cluster communication latency increases from 2 cycles to 16 cycles. In contrast, IPC of the best-performing dependence-based steering decreases by 35% and 55%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, over the same latency range. 
For dual clusters and quad clusters with low latencies (fewer than 8 cycles), slipstream-based steering underperforms conventional steering because improved latency tolerance is outweighed by higher contention for execution bandwidth within clusters. However, the balance shifts at higher latencies. For a dual-cluster microarchitecture, dedicated-cluster-based steering outperforms the best conventional steering on average by 10% and 24% at 8 and 16 cycles, respectively. For a quad-cluster microarchitecture, replication-based steering outperforms the best conventional steering on average by 10% and 32% at 8 and 16 cycles, respectively. Slipstream-based steering desensitizes the IPC performance of a clustered microarchitecture to tens of cycles of inter-cluster communication latency. As feature sizes shrink, it will take multiple cycles to propagate signals across the processor chip. For a clustered microarchitecture, this implies that with further scaling of feature size, the inter-cluster communication latency will increase to the point where microarchitects must manage a distributed system on a chip. Thus, if individual clusters are clocked faster, at the expense of increasing inter-cluster communication latency, performance of a clustered microarchitecture using slipstream-based steering will improve considerably as compared to a clustered microarchitecture using the best conventional steering approach.
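The dedicated-cluster policy can be sketched as a simple steering rule. The cluster numbering and the effectual-predictor interface are my own illustration, not the thesis's hardware design:

```python
# Sketch of dedicated-cluster steering: the entire effectual component
# executes within one cluster, so critical dependence chains never pay
# inter-cluster communication latency; the latency-tolerant ineffectual
# instructions round-robin over the remaining clusters.
def steer(is_effectual, n_clusters):
    rr, assignment = 0, []
    for eff in is_effectual:
        if eff:
            assignment.append(0)  # dedicated effectual cluster
        else:
            assignment.append(1 + rr % (n_clusters - 1))
            rr += 1
    return assignment

print(steer([True, False, True, False, False], 4))  # [0, 1, 0, 2, 3]
```

Because ineffectual outcomes are not needed in hindsight, distributing only those instructions across clusters exposes them, rather than the critical path, to communication latency.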
  • Transparent Control Independence (TCI)
    (2007-08-14) Al-Zawawi, Ahmed Sami; Dr. Suleyman Sair, Committee Member; Dr. Warren J. Jasper, Committee Member; Dr. Eric Rotenberg, Committee Chair; Dr. Thomas M. Conte, Committee Member
