Browsing by Author "Dr. Thomas M. Conte, Committee Member"
Now showing 1 - 8 of 8
- Dynamic Pipeline Scaling (2003-06-04) Ramrakhyani, Prakash Shyamlal; Dr. Alexander G. Dean, Committee Member; Dr. Wm. Rhett Davis, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Eric Rotenberg, Committee Chair
The classic problem of balancing power and performance persists as technology progresses. Fortunately, high performance is not a constant requirement in a system. When the performance requirement is not at its peak, the processor can be configured to conserve power while providing just enough performance. Dynamic scaling of parameters such as voltage, frequency, and cache structure has been proposed to conserve power. This thesis analyzes the effects of dynamically scaling a new processor parameter: pipeline depth. We propose Dynamic Pipeline Scaling, a technique to conserve energy at low frequencies when voltage is invariable. When frequency can be lowered enough, adjacent pipeline stages can be merged to form a shallow pipeline. At equal voltage and frequency, the shallow pipeline is more energy-efficient than the deep pipeline, because it has fewer data dependence stalls and a lower branch misprediction penalty. Fewer wasteful transitions in the shallow pipeline translate directly to lower energy consumption. On a variable-voltage processor, however, the shallow pipeline requires a higher operating voltage than the deep pipeline for the same frequency. Since energy depends on the square of voltage and linearly on the total number of transitions, a deep pipeline is typically more energy-efficient than a shallow pipeline on a variable-voltage processor. Yet there may be situations where variable voltage is not desired. For example, if the latency to switch voltage is large, voltage scaling may not be beneficial in a real-time system with tight deadlines. On such a system, dynamic pipeline scaling can yield energy benefits in spite of not scaling voltage.
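The energy trade-off this abstract describes can be illustrated with a toy calculation. All numbers below are hypothetical, not results from the thesis; dynamic energy is modeled simply as voltage squared times transition count, with capacitance folded into the units.

```python
# Illustrative energy comparison; all constants are made-up, not from the thesis.
# Dynamic switching energy is modeled as E ~ V^2 * N_transitions.

def energy(voltage, transitions):
    """Dynamic switching energy, in arbitrary units."""
    return voltage ** 2 * transitions

# At a fixed (invariable) voltage, the shallow pipeline does the same work with
# fewer wasteful transitions (fewer stalls, smaller misprediction penalty).
V_FIXED = 1.2
deep_fixed    = energy(V_FIXED, transitions=1.00e9)
shallow_fixed = energy(V_FIXED, transitions=0.85e9)  # ~15% fewer (assumed)
assert shallow_fixed < deep_fixed   # shallow wins when voltage cannot scale

# On a variable-voltage processor, the deep pipeline reaches the same low
# frequency at a lower voltage, and the V^2 term dominates.
deep_scaled    = energy(0.9, transitions=1.00e9)     # lower V at same frequency
shallow_scaled = energy(1.2, transitions=0.85e9)     # shallow needs higher V
assert deep_scaled < shallow_scaled  # deep wins once voltage can scale
```

This is the crux of the argument: the shallow pipeline's advantage holds only while voltage is pinned, which is exactly the regime Dynamic Pipeline Scaling targets.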
- Hardware Realization and Implementation Issues for the Sliding-window Packet Switch (2006-04-16) Phelps, Brian Roberson; Dr. Arne A. Nilsson, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Paul D. Franzon, Committee Chair; Dr. Sanjeev Kumar, Committee Member
Shared memory packet switches are known to provide the best delay-throughput performance and to respond well to bursty traffic. However, they scale poorly due to centralized control and memory bottlenecks. The Sliding Window Packet Switch (SW) algorithm is a shared memory switch that employs decentralized control and multiple memory modules to facilitate hardware scalability. The SW algorithm is independent of the type of packet or cell. This research has two closely related goals. The first is to implement the SW algorithm in hardware such as an FPGA; this implementation is a specific case of the SW algorithm with four input ports and four output ports (i.e., a 4x4 switch). The second is to determine what scalability constraints exist in hardware for larger numbers of input and output ports (large NxN). These constraints are used to predict the overall throughput the hardware implementation can sustain.
- A Multi-gigabit CMOS Transceiver with 2x Oversampling Linear Phase Detection (2003-05-25) Vichienchom, Kasin; Dr. Griff Bilbro, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Paul D. Franzon, Committee Member; Dr. Wentai Liu, Committee Chair
This dissertation presents the design of a high-speed CMOS transceiver for serial digital data. The design is based on a parallel-architecture data recovery circuit. It uses multiple clock phases from a multi-phase phase-locked loop (MPLL) operating at low frequency to sample high-frequency input data in a time-interleaved manner, reducing the speed requirement for the transceiver. The new time-interleaved sampling technique is realized by placing analog and digital samplers alternately to sample the input data at two times the data rate (2x). This hybrid parallel sampling scheme provides the input phase error to the multi-phase PLL and simultaneously recovers and deserializes the input data. The phase detection generates a loop error signal proportional to the input phase error, giving the PLL proportional loop control. This improves the loop stability, output jitter, and bit error rate over conventional all-digital 2x oversampling, referred to as bang-bang phase detection. In addition, to investigate its operation closely, a discrete-time linear-system model and analysis of the multi-phase PLL have been developed. This model accounts for the sampling nature of the loop, providing greater insight into system behavior and constraints. The analysis shows that when the PLL loop bandwidth is much smaller than the input frequency, the system response can be approximated by the conventional continuous-time model and thus the number of phase detectors employed can be reduced.
The model predicts the stability limit of the multi-phase PLL as a function of input frequency, loop bandwidth, and the number of phase detectors. In addition, the phase noise due to the bang-bang type phase detector in PLL-based clock recovery circuits has been analyzed using this model.
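The proportional-versus-bang-bang distinction can be sketched with a generic first-order discrete-time phase-tracking loop. This is not the dissertation's multi-phase PLL model; the loop gain, frequency offset, and step counts are made-up illustrative values, and jitter is approximated as peak-to-peak error after settling.

```python
# Generic first-order discrete-time phase-tracking loop, illustrating why a
# proportional phase detector gives lower residual jitter than a bang-bang one.
# NOT the dissertation's MPLL model; all constants are made-up.
import math

def settled_jitter(detector, steps=400, k=0.1, freq_offset=0.01, tail=50):
    """Peak-to-peak phase error over the last `tail` steps (a jitter proxy)."""
    err, history = 0.5, []
    for _ in range(steps):
        err += freq_offset - k * detector(err)   # one loop update per sample
        history.append(err)
    t = history[-tail:]
    return max(t) - min(t)

prop_jitter = settled_jitter(lambda e: e)                    # proportional
bb_jitter = settled_jitter(lambda e: math.copysign(1.0, e))  # bang-bang (+/-1)

# The proportional loop settles to a constant static error, so its residual
# variation is essentially zero; the bang-bang loop limit-cycles around lock.
assert prop_jitter < bb_jitter
```

The limit cycle of the bang-bang detector is the dithering behavior whose phase noise the dissertation analyzes with its discrete-time model.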
- A Simple Mechanism for Detecting Ineffectual Instructions in Slipstream Processors (2002-05-15) Koppanalil, Jinson Joseph; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair; Dr. Thomas M. Conte, Committee Member
The slipstream paradigm harnesses multiple processing elements in a chip multiprocessor (CMP) to speed up a single, sequential program. It does this by running two redundant copies of the program, one slightly ahead of the other. The leading program is the Advanced Stream (A-stream) and the trailing program is the Redundant Stream (R-stream). Predicted non-essential computation is speculatively removed from the A-stream. The A-stream is sped up because it fetches and executes fewer instructions than the original program. The trailing R-stream checks the control flow and data flow outcomes of the A-stream, and redirects it when it fails to make correct forward progress. The R-stream also exploits the A-stream outcomes as accurate branch and value predictions. Therefore, although the R-stream retires the same number of instructions as the original program, it fetches and executes much more efficiently. As a result, both program copies finish sooner than the original program. A slipstream component called the instruction-removal detector (IR-detector) detects past-ineffectual instructions in the R-stream and selects them for possible removal from the A-stream in the future. The IR-detector uses a two-step selection process. First, it selects key trigger instructions -- unreferenced writes, non-modifying writes, and correctly-predicted branches. A table similar to a conventional register rename table can easily detect unreferenced and non-modifying writes. The second step, called back-propagation, selects computation chains feeding the trigger instructions.
In an explicit implementation of back-propagation, retired R-stream instructions are buffered and consumer instructions are connected to their producer instructions using a configurable interconnection network. Consumers that are selected because they are ineffectual use these connections to propagate their ineffectual status to their producers, so that they get selected, too. Explicit back-propagation is complex because it requires a configurable interconnection network. This thesis proposes a simpler implementation of back-propagation, called implicit back-propagation. The key idea is to logically monitor the A-stream instead of the R-stream. Now, the IR-detector only performs the first step, i.e., it selects unreferenced writes, non-modifying writes, and correctly-predicted branches. After building up confidence, these trigger instructions are removed from the A-stream. Once removed, their producers become unreferenced writes in the A-stream (because they no longer have consumers). After building up confidence, the freshly exposed unreferenced writes are also removed, exposing additional unreferenced writes. This process continues iteratively, until eventually entire non-essential dependence chains are removed. By logically monitoring the A-stream, back-propagation is reduced to detecting unreferenced writes. Implicit back-propagation eliminates complex hardware and performs within 0.5% of explicit back-propagation.
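The iterative exposure of unreferenced writes can be sketched as a toy algorithm. This is a simplification: the program, register names, and data structures below are illustrative only, and the real IR-detector also handles non-modifying writes and predicted branches and builds up confidence before removing anything.

```python
# Toy sketch of implicit back-propagation: repeatedly remove writes whose
# results are never read; each removal can expose the write's producers as
# newly unreferenced writes, so whole ineffectual chains disappear iteratively.

# Each instruction: (name, destination register, source registers). Hypothetical.
program = [
    ("i1", "r1", []),        # produces r1
    ("i2", "r2", ["r1"]),    # consumes r1, produces r2
    ("i3", "r3", ["r2"]),    # consumes r2, produces r3 -- r3 is never read
    ("i4", "r4", []),        # independent, and r4 IS read below
    ("i5", "r5", ["r4"]),    # r5 is live-out (assumed referenced externally)
]
live_out = {"r5"}

removed = set()
changed = True
while changed:               # iterate until no new unreferenced writes appear
    changed = False
    for name, dest, _srcs in program:
        if name in removed or dest in live_out:
            continue
        # A write is unreferenced if no remaining instruction reads its dest.
        referenced = any(dest in srcs
                         for n, _d, srcs in program if n not in removed)
        if not referenced:
            removed.add(name)   # its producers may become unreferenced next
            changed = True

print(sorted(removed))  # -> ['i1', 'i2', 'i3']: the whole chain is removed
```

Removing i3 leaves r2 unread, which exposes i2, which in turn exposes i1 -- the same cascade the thesis achieves in hardware with per-instruction confidence counters rather than a software loop.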
- Slipstream-Based Steering for Clustered Microarchitectures (2003-06-20) Gupta, Nikhil; Dr. Thomas M. Conte, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair
To harvest increasing levels of ILP while maintaining a fast clock, clustered microarchitectures have been proposed. However, the fast clock enabled by clustering comes at the cost of multiple cycles to communicate values among clusters. A chief performance limiter of a clustered microarchitecture is inter-cluster communication between instructions. Specifically, inter-cluster communication between critical-path instructions is the most harmful. The slipstream paradigm identifies critical-path instructions in the form of effectual instructions. We propose eliminating virtually all inter-cluster communication among effectual instructions, simply by ensuring that the entire effectual component of the program executes within a cluster. This thesis proposes two execution models: the replication model and the dedicated-cluster model. In the replication model, a copy of the effectual component is executed on each of the clusters and the ineffectual instructions are shared among the clusters. In the dedicated-cluster model, the effectual component is executed on a single cluster (the effectual cluster), while all ineffectual instructions are steered to the remaining clusters. Outcomes of ineffectual instructions are not needed (in hindsight), hence their execution can be exposed to inter-cluster communication latency without significantly impacting overall performance. IPC of the replication model on dual clusters and quad clusters is virtually independent of inter-cluster communication latency. IPC decreases by 1.3% and 0.8%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, when inter-cluster communication latency increases from 2 cycles to 16 cycles.
In contrast, IPC of the best-performing dependence-based steering decreases by 35% and 55%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, over the same latency range. For dual clusters and quad clusters with low latencies (fewer than 8 cycles), slipstream-based steering underperforms conventional steering because improved latency tolerance is outweighed by higher contention for execution bandwidth within clusters. However, the balance shifts at higher latencies. For a dual-cluster microarchitecture, dedicated-cluster-based steering outperforms the best conventional steering on average by 10% and 24% at 8 and 16 cycles, respectively. For a quad-cluster microarchitecture, replication-based steering outperforms the best conventional steering on average by 10% and 32% at 8 and 16 cycles, respectively. Slipstream-based steering desensitizes the IPC performance of a clustered microarchitecture to tens of cycles of inter-cluster communication latency. As feature sizes shrink, it will take multiple cycles to propagate signals across the processor chip. For a clustered microarchitecture, this implies that with further scaling of feature size, the inter-cluster communication latency will increase to the point where microarchitects must manage a distributed system on a chip. Thus, if individual clusters are clocked faster, at the expense of increasing inter-cluster communication latency, performance of a clustered microarchitecture using slipstream-based steering will improve considerably as compared to a clustered microarchitecture using the best conventional steering approach.
- Software Thread Integration for Instruction Level Parallelism (2007-07-05) So, Won; Dr. Eric Rotenberg, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Vincent W. Freeh, Committee Member; Dr. Alexander G. Dean, Committee Chair
Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word) or EPIC (Explicitly Parallel Instruction Computing). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy. This dissertation proposes Software Thread Integration (STI) for Instruction Level Parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. STI is essentially procedure jamming with intraprocedural code motion transformations which allow arbitrary alignment of instructions or code regions. This alignment enables code to be moved to use available execution resources better and improve the execution schedule. Parallel procedures are identified by the programmer with either annotations in conventional procedural languages or graph analysis for stream coarse-grain dataflow programming languages. We use the method of procedure cloning and integration for improving program run-time performance by integrating parallel procedures via STI. This defines a new way of converting parallelism at the thread level to the instruction level.
With filter integration we apply STI for streaming applications, exploiting explicit coarse-grain dataflow information expressed by stream programming languages. During integration of threads, various STI code transformations are applied in order to maximize the ILP and reconcile control flow differences between two threads. Different transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. This approach effectively combines ILP-improving code transformations with instruction scheduling techniques so that they complement each other. Code transformations involve code motion as well as loop transformations such as loop jamming, unrolling, splitting, and peeling. We propose a methodology for efficiently finding the best integration scenario among all possibilities. We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration. The proposed methods are automated by the STI for ILP Tool Chain targeting Texas Instruments C6x VLIW DSPs. This work contributes to the definition of an alternative development path for DSP applications. We seek to provide efficient compilation of C or C-like languages with a small amount of additional high-level dataflow information targeting popular and practical VLIW DSP platforms, reducing the need for extensive manual C and assembly code optimization and tuning.
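Loop jamming, one of the transformations listed above, can be shown in miniature. This sketch only conveys the shape of the transformation: real STI operates on C and assembly for VLIW DSPs, where the fused loop body hands the compiler more independent instructions to fill parallel functional units; the functions and data here are invented for illustration.

```python
# Toy illustration of loop jamming: two independent loops are fused so a
# single loop body covers both, widening the instruction scheduler's window.
# Function names and workloads are hypothetical.

def separate(a, b):
    """Two independent procedures run back to back."""
    out1 = [x * 2 for x in a]       # thread 1's loop
    out2 = [x + 10 for x in b]      # thread 2's loop
    return out1, out2

def integrated(a, b):
    """The same work with the two loops jammed into one loop body."""
    assert len(a) == len(b)         # jamming assumes compatible trip counts
    out1, out2 = [], []
    for x, y in zip(a, b):
        out1.append(x * 2)          # both bodies now share one iteration,
        out2.append(y + 10)         # so their instructions can be co-scheduled
    return out1, out2

# The transformation must preserve semantics exactly.
assert separate([1, 2, 3], [4, 5, 6]) == integrated([1, 2, 3], [4, 5, 6])
```

When trip counts differ, STI's splitting and peeling transformations reconcile the mismatch before jamming, which is why those transformations appear together in the list above.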
- STI Concepts for Bit-Bang Communication Protocols (2003-01-28) Kumar, Nagendra J; Dr. Thomas M. Conte, Committee Member; Dr. Eric Rotenberg, Committee Member; Dr. Alexander Dean, Committee Chair
Embedded communication networks are used in a growing number of embedded systems to improve reliability and cost effectiveness. Designers must minimize the size, weight, power consumption, cost, and design time of their products. Network controller chips are expensive, so moving functionality from hardware to software cuts costs and makes custom-fit protocols easier to implement. Traditional methods of sharing a processor are not adequate for implementing communication protocol controllers in software because of the processing required during each bit. The available idle time is fine-grain compared with the bit time and is usually too small for even fast context-switching techniques (e.g., co-routines) to run any other thread. Without some scheme to recover this fine-grain idle time, no other work in the system would make progress. Software Thread Integration (STI) provides low-cost concurrency on general-purpose microprocessors by interleaving multiple threads of control (with real-time constraints) into one. This thesis introduces new methods for implementing communication protocols in software using statically scheduled co-routines and software thread integration. With co-routines, switching between primary and secondary threads can be done without incurring a penalty as severe as a full context switch. The technique is demonstrated on the SAE J1850 communication standard used in off- and on-road land-based vehicles. These methods also minimize the number of co-routine calls needed to share the processor, enabling finer-grain idle time to be recovered for use by the secondary thread.
Recovering more compute cycles both improves the performance of the secondary thread and reduces the minimum clock speed required of the microprocessor: more secondary-thread work can be done, or a slower processor can meet the same deadlines. These factors enable embedded system designers to use processors more efficiently and with less development effort.
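The co-routine interleaving the thesis describes can be sketched with Python generators standing in for statically scheduled co-routines. This models only the scheduling idea: the microcontroller, the pin timing, and the SAE J1850 protocol itself are not modeled, and the bit pattern and thread bodies are invented for illustration.

```python
# Sketch of co-routine-style interleaving: a primary bit-bang thread yields
# its fine-grain idle time within each bit to a secondary thread. Python
# generators stand in for statically scheduled co-routines; no real protocol
# (e.g., SAE J1850) or hardware timing is modeled here.

def bit_bang_tx(bits, line):
    """Primary thread: drive one bit per 'bit time', yielding idle time."""
    for bit in bits:
        line.append(bit)   # drive the output pin for this bit time
        yield              # idle time within the bit, handed to the scheduler

def secondary_work(log, n):
    """Secondary thread: background work done in the recovered idle time."""
    for i in range(n):
        log.append(i)
        yield

line, log = [], []
tx = bit_bang_tx([1, 0, 1, 1], line)
bg = secondary_work(log, 4)

# Static schedule: each bit time runs one slice of primary, one of secondary,
# with no general context switch -- just a resumption of each co-routine.
for _ in range(4):
    next(tx)
    next(bg)

print(line, log)  # -> [1, 0, 1, 1] [0, 1, 2, 3]
```

Each `yield` corresponds to a statically known idle point in the bit waveform; because the schedule is fixed at design time, the switch costs far less than a general context switch, which is the thesis's central claim.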
- Transparent Control Independence (TCI) (2007-08-14) Al-Zawawi, Ahmed Sami; Dr. Suleyman Sair, Committee Member; Dr. Warren J. Jasper, Committee Member; Dr. Eric Rotenberg, Committee Chair; Dr. Thomas M. Conte, Committee Member
