Browsing by Author "Dr. Eric Rotenberg, Committee Member"

Now showing 1 - 18 of 18

  • Adding Rivalrous Hardware Scheduling to the First Generation FREEDM Systems Communication Platform
    (2010-02-22) Sachidananda, Subash Ghattadahalli; Dr. Alexander Dean, Committee Chair; Dr. Frank Mueller, Committee Co-Chair; Dr. Eric Rotenberg, Committee Member
    Existing power management systems are hard-wired, slow, unreliable, and insecure. By improving the communication framework for power management systems, using the Internet and modern communication protocols such as IEC 61850 and ZigBee (IEEE 802.15.4), not only can energy be managed better, but the system as a whole can also be made faster, more secure, and more reliable. Any communication network has certain important characteristics, such as delay, range, scalability, and network topology, that dictate its effectiveness and usefulness in a particular environment. Delay in various parts of the network is one of the important characteristics that needs to be studied carefully. Network delays can be reduced not only by faster hardware but also by better software. Importance also needs to be placed on monitoring the power consumption of the devices and on ways to improve the power efficiency of the system. In conjunction with that, a first-generation communication framework for a renewable energy distribution and management system was set up using a network of embedded boards and personal computers. Various communication protocols and interfaces were tried to demonstrate the versatility and reliability of the entire system. Experiments were conducted to test the range and delays in various parts of the communication network. Having established a platform for the nodes to communicate, techniques that could make the nodes more power efficient were investigated. One way to stretch battery life is through the addition of a switched-mode power supply (SMPS). However, adding an SMPS introduces power instability to the system and interferes with the normal functioning of sensitive devices such as the ADC and compass. A processor-controlled SMPS was added to the communication platform and its interference with the normal functioning of the system was studied. Possible hardware and software approaches to counter this interference are also presented.
  • Benchmark Characterization of Embedded Processors
    (2005-05-16) Dasarathan, Dinesh; Dr. Thomas M. Conte, Committee Chair; Dr. Edward F. Gehringer, Committee Member; Dr. Eric Rotenberg, Committee Member
    The design of a processor is an iterative process, with many cycles of simulation, performance analysis, and subsequent changes. The inputs to these simulation cycles are generally a selected subset of standard benchmarks. To help reduce the number of cycles involved in design, one can characterize these selected benchmarks and use those characteristics to arrive at a good initial design that will converge faster. Methods and systems to characterize benchmarks for general-purpose processors have been designed and implemented. This thesis extends these approaches and defines an abstract system to characterize benchmarks for embedded processors, taking into consideration architectural requirements, power constraints, and code compressibility. To demonstrate this method, 25 benchmarks are characterized and compared (10 from SPEC and 15 from the standard embedded benchmark suites MediaBench and NetBench). Moreover, the similarities between these benchmarks are also analyzed and presented.
  • Clock Tree Insertion and Verification for 3D Integrated Circuits
    (2005-09-26) Mineo, Christopher Alexander; Dr. W. Rhett Davis, Committee Chair; Dr. Paul Franzon, Committee Member; Dr. Eric Rotenberg, Committee Member
    The use of three dimensional chip fabrication technologies has emerged as a solution to the difficulties involved with the continued scaling of bulk silicon devices. While the technology exists, it is undervalued and underutilized largely due to the design and verification challenges a complex 3D design presents. This work presents a clock tree insertion and timing verification methodology for three dimensional integrated circuits (3DIC). It has been designed in the context of and incorporated into the 3DIC design methodology also developed within our research group. The 3DIC verification methodology serves as an efficient means to perform all setup and hold timing checks harnessing the power of existing commercial chip design and verification tools. A novel approach is presented in which the multi-die design is temporarily transformed to appear as a traditional 2D design to the commercial tools for verification purposes. Various parasitic extraction algorithms are examined, and we present a method for performing accurate 3D parasitic extraction for timing purposes. We offer theoretical insight into the optimization of a 3D clock tree for power savings and coupling-induced delay minimization. A practical example of the 3DIC design and verification flow is detailed through the explanation of our research group's test chip, a nearly 140,000 cell 3D fast Fourier transform chip currently awaiting fabrication at MIT's Lincoln Labs.
  • Compositional Static Cache Analysis Using Module-level Abstraction
    (2003-12-10) Patil, Kaustubh Sambhaji; Dr. Frank Mueller, Committee Chair; Dr. Alexander Dean, Committee Member; Dr. Eric Rotenberg, Committee Member
    Static cache analysis is utilized for timing analysis to derive worst-case execution time of a program. Such analysis is constrained by the requirement of an inter-procedural analysis for the entire program. But the complexity of cycle-level simulations for entire programs currently restricts the feasibility of static cache analysis to small programs. Computationally complex inter-procedural analysis is needed to determine caching effects, which depend on knowledge of data and instruction references. Static cache simulation traditionally relies on absolute address information of instruction and data elements. This thesis presents a framework to perform worst-case static cache analysis for direct-mapped instruction caches using a module-level and compositional approach, thus addressing the issue of complexity of inter-procedural analysis for an entire program. The module-level analysis parameterizes the data-flow information in terms of the starting offset of a module. The compositional analysis stage uses this parameterized data-flow information for each module. Thus, the emphasis here is on handling most of the complexity in the module-level analysis and performing as little analysis as possible at the compositional level. The experimental results show that the compositional analysis framework provides equally accurate predictions when compared with the simulation approach that uses complete inter-procedural analysis.
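
    To make the parameterization concrete: the sketch below (illustrative Python, not the thesis's analysis code; cache geometry and function names are assumed) shows how the cache line touched by an instruction in a direct-mapped instruction cache depends on the module's starting offset, which is exactly the quantity the module-level analysis keeps symbolic until the compositional stage fixes each module's placement.

```python
# Hypothetical sketch: how a direct-mapped instruction-cache mapping depends on a
# module's (initially unknown) starting offset.  Parameters are illustrative only.

LINE_SIZE = 16      # bytes per cache line (assumed)
NUM_LINES = 256     # lines in a direct-mapped cache (assumed)

def cache_line(module_offset: int, instr_addr_in_module: int) -> int:
    """Cache line touched by an instruction, given the module's start offset."""
    absolute = module_offset + instr_addr_in_module
    return (absolute // LINE_SIZE) % NUM_LINES

def conflicts(module_offset: int, addr_a: int, addr_b: int) -> bool:
    """Two instructions conflict iff they map to the same line but lie in different
    memory blocks; this relation shifts with module_offset, which is why the
    module-level data-flow information is kept parametric in that offset."""
    same_line = cache_line(module_offset, addr_a) == cache_line(module_offset, addr_b)
    same_block = (module_offset + addr_a) // LINE_SIZE == (module_offset + addr_b) // LINE_SIZE
    return same_line and not same_block

# The compositional stage would instantiate the parameter once a placement is known:
print(cache_line(module_offset=0x1000, instr_addr_in_module=0x24))
```
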
  • Design of DDR2 Interface for Tezzaron TSC8200A Octopus Memory intended for Chip Stacking Applications
    (2010-05-14) Bapat, Ojas Ashok; Dr. W. Rhett Davis, Committee Member; Dr. Eric Rotenberg, Committee Member; Dr. Paul D. Franzon, Committee Chair
    This document describes the design of a DDR2 controller for the Tezzaron TSC8200A (Octopus) High-Speed Self-Repairing L3 Memory, which is intended for chip stacking applications. The controller is part of a LEON3 processor architecture. The system consists of three LEON processor cores connected to all the peripherals and memory through an AMBA 2.0 AHB/APB master/slave bus interface. The development environment is the Gaisler open source library, a set of reusable IP cores designed for system-on-chip development. The advantage of using this environment is that the libraries are technology independent and can be used with various target technologies and CAD tools. The DDR2 controller acts as a slave on the AHB bus; on the other side is the Tezzaron Octopus memory. The controller consists mainly of two parts: one implements the state machines for both the AHB-side interface and the memory-side interface, and the other handles the shifting, alignment, and conversion of signals from single to double data rate. This second part also has the pads instantiated in it. As the Octopus memory has two independent ports that the host processor sees as two separate parallel memories, the design contains two instantiations of the controller. Unlike conventional DDR2 standards, the Octopus memory uses only single-ended signals; and since this memory has been specially designed for stacking, it does not support or require on-die termination or off-chip driver capability. We discuss the challenges faced in the design of the controller state machines, the physical interface, synthesis, and the functional and timing verification of the DDR2 controller, as well as the place-and-route strategy adopted to lay out the entire three-core processor architecture along with the controller and memory. Since the Tezzaron Octopus memory IP was not available at the time, a dummy .lef block was used for it. Assertion-based formal verification techniques were used to verify the outputs and internal signals of the controller. The design was synthesized in an IBM 130 nm technology library with Artisan memories and I/O pads. The total synthesized area of the entire user logic is 7.01 mm² without the macros and pads. The standard cell area for just the controller is 4.89 mm². The total die size for the user logic with the macros and pads is 7 mm x 7 mm with a core utilization of 0.7. Having memory on a separate die gives all the benefits of an on-chip memory while reducing the complexity and number of process steps. The memory and the user logic can be individually processed in different feature sizes or even different materials. The dies can then be stacked on top of each other and connected with through-silicon vias. The Octopus memory follows the IMIS interface specification, which defines a high-bandwidth, 1024-bit-wide vertical bus at the memory surface. This allows for shorter interconnects, greatly reducing latency, and improves bandwidth by allowing up to eight parallel 64-bit double-pumped data ports.
  • Development of a Cycle Level, Full System, x86 Microprocessor Simulator
    (2008-03-24) Gambhir, Mohit; Dr. Yan Solihin, Committee Chair; Dr. Eric Rotenberg, Committee Member; Dr. Vincent W. Freeh, Committee Member
    Although x86 processors are the most popular processors in commercial and scientific working environments, there is a scarcity of open source microprocessor simulators that enable researchers to experiment with new x86-based microprocessor and memory system designs. Also, most of the simulators that exist today are user-space simulators that do not profile the operating system code that gets executed when interrupts and system calls are invoked while an application is running. This work involves the development of a cycle-level, full-system, x86 microprocessor simulator called MYSim. One of the biggest challenges involved in developing an x86-based processor simulator is that the x86 instruction set is complex. Its complexities include variable-length instructions that may take a varying number of cycles to decode. Also, the operands in an x86 instruction may reside in registers, in memory, or both. These complexities make the x86 instruction set architecture (ISA) particularly hard to simulate. MYSim is an execution-driven simulator that divides the simulation into two parts: the first part is the functional simulator or emulator, which actually executes the simulated application as well as the OS code, and the second part is the timing simulator, which models the timing of the application. MYSim uses Bochs (an open source x86 emulator) as the functional simulator; it emulates x86 processors, hardware devices, memory, etc., and enables the execution of various operating systems and software within the emulation. MYSim's timing simulator is ported from SESC (SuperESCalar Simulator), a simulator that initially supported the MIPS ISA and was modified, as part of this work, to support the x86 ISA. The functional simulator executes the next x86 instruction, breaks it into μops, and feeds those μops to the timing simulator. The timing simulator models a full out-of-order pipeline with branch prediction, caches, buses, and most major components that must be simulated in order to model the timing of modern microprocessors accurately.
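
    As a rough illustration of the execution-driven split described above (a toy Python sketch, not MYSim, Bochs, or SESC code; all classes, names, and latencies are assumed), a functional front end cracks each x86 instruction into μops and hands them to a cycle-accounting timing back end:

```python
# Illustrative sketch of a functional/timing split: an "emulator" decodes
# instructions into micro-ops and a simple timing model charges cycles for them.

from dataclasses import dataclass

@dataclass
class MicroOp:
    kind: str          # e.g. "load", "alu", "store"
    latency: int       # assumed fixed latency per μop kind

def crack(x86_instr: str) -> list[MicroOp]:
    """Toy decoder: a memory-operand add becomes load + alu + store."""
    if x86_instr == "add [mem], reg":
        return [MicroOp("load", 3), MicroOp("alu", 1), MicroOp("store", 1)]
    return [MicroOp("alu", 1)]

class TimingModel:
    """Stand-in for a pipeline timing model: charges cycles per μop."""
    def __init__(self) -> None:
        self.cycles = 0
    def issue(self, uop: MicroOp) -> None:
        self.cycles += uop.latency   # a real model would track pipeline overlap

timing = TimingModel()
for instr in ["add [mem], reg", "mov reg, reg"]:   # functional side "executes" these
    for uop in crack(instr):                        # ...and feeds μops to the timing side
        timing.issue(uop)
print("simulated cycles:", timing.cycles)
```
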
  • Development of ASIC Technology Library for the TSMC 0.25 micrometers Standard Cell Library
    (2003-08-19) Sundararaman, Vishwanath; Dr. Griff Bilbro, Committee Member; Dr. Eric Rotenberg, Committee Member; Dr. Paul D. Franzon, Committee Chair
    The Synopsys synthesis tool generates the hierarchical netlist of a design using worst-case and best-case ASIC technology libraries. The worst-case library checks for setup time violations and the best-case library checks for hold time violations of the design. The worst-case library is characterized by a supply voltage of 2.25 V, an operating temperature of 125°C, and the slow process corner. The best-case library is characterized by a supply voltage of 2.75 V, an operating temperature of -55°C, and the fast process corner. The technology libraries are developed for the TSMC 0.25 μm CMOS technology. CMOS nonlinear delay models are used for delay calculations. Variations in operating temperature, supply voltage, and manufacturing process cause performance variations in electronic networks. Using different operating conditions, the timing of the design under different environmental conditions can be evaluated. The delay values specified in the cells of a technology library assume a set of nominal operating conditions. The worst-case and best-case libraries are developed by running HSPICE simulations for all 36 basic cells. The technology library contains information used for the following synthesis activities:
    • Translation — functional information for each cell
    • Optimization — area and timing information for each cell (including timing constraints on sequential cells)
    • Design rule fixing — design rule constraints on cells
  • Environment Replay for Low-End Reactive Embedded Systems
    (2005-01-06) Seetharam, Adarsh; Dr. Eric Rotenberg, Committee Member; Dr. Frank Mueller, Committee Member; Dr. Alexander Dean, Committee Chair
    Existing benchmark suites for embedded systems focus on batch processing applications to simplify portability. However, embedded systems are typically tightly coupled to the external environment through input/output (I/O) operations, resulting in reactive, real-time rather than batch behavior. Furthermore, often the environmental state is not dependent on program progress, so portions of the program may block until an environmental change occurs, limiting the impact of a faster processor or more efficient code. Existing benchmark suites ignore these important aspects, leading to one-dimensional characterizations of embedded systems. This work offers methods to record and play back environmental inputs within the limited resources available on common low-end microcontroller units (MCUs). We modify input operations in the source code at the C level, creating a record version and a replay version. For recording, input data is captured and stored as the program executes. This data is analyzed, compressed and converted off-line into a series of time-dependent events. During replay, the input operations read the compressed environmental input event data, rather than the environment. These changes allow virtualization of input operations, resulting in C code which can easily be ported to different processors for batch-mode performance evaluation, yet still react to the original event timeline. Our methods are demonstrated with a universal infrared remote control application. Environmental inputs are recorded on an 8-bit MCU, processed and replayed. We then evaluate the impact of a higher clock rate and also porting to a 16-bit MCU. We characterize memory requirements, response times and the implications of porting interrupts.
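
    The record/replay idea can be sketched as follows (illustrative Python rather than the thesis's C-level MCU instrumentation; the function names and timing scheme are assumptions): the same input call site is swapped between a record version that logs timestamped samples and a replay version that returns them on the original timeline.

```python
# Illustrative sketch of virtualizing an input operation so one call site can
# either record the environment or replay a previously captured event log.

import time

event_log: list[tuple[float, int]] = []   # (timestamp, value) pairs

def read_input_record(read_pin) -> int:
    """Record version: sample the real environment and log a timestamped event."""
    value = read_pin()
    event_log.append((time.monotonic(), value))
    return value

def make_replayer(log):
    """Replay version: return logged values, waiting until each event's time."""
    start = time.monotonic()
    base = log[0][0] if log else 0.0
    events = iter(log)
    def read_input_replay() -> int:
        t, value = next(events)
        while time.monotonic() - start < t - base:   # honor the original timeline
            pass
        return value
    return read_input_replay
```
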
  • Exploiting Computational Locality in Global Value Histories.
    (2002-05-24) Bodine, Jill T.; Dr. Eric Rotenberg, Committee Member; Dr. Greg Byrd, Committee Member; Dr. Thomas Conte, Committee Chair
    Value prediction is a speculative technique to break true data dependencies by predicting uncomputed values based on history. Previous research focused on exploiting two types of value locality (computation-based and context-based) in the local value history, which is the value sequence produced by the same instruction that is being predicted. Besides local value history, value locality also exists in global value history, which is the value sequence produced by all dynamic instructions according to their execution order. In this thesis, a new type of value locality, computational locality in global value history, is studied. A prediction scheme, called gDiff, is designed to exploit the most common special case of this computational model, stride-based computation, in the global value history. Experiments show that there exists a very strong stride type of locality in global value sequences, and ideally the gDiff predictor can achieve 73% prediction accuracy for all value-producing instructions without any hybrid scheme, much higher than local stride and local context prediction schemes. However, the ability to realistically exploit locality in global value history is greatly challenged by the value delay issue, i.e., the correlated value may not be available when the prediction is being made. The value delay issue is studied in an out-of-order (OOO) execution pipeline model, and the gDiff predictor is improved by maintaining an order in the value queue and utilizing local stride predictions when global values are unavailable, to avoid the value delay problem. This improved predictor, called hgDiff, demonstrates 88% accuracy and 69% prediction coverage on average, outperforming a local stride predictor by 2% higher accuracy and 13% higher coverage.
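
    A toy model of the global-difference idea (not the gDiff/hgDiff hardware design; the table organization and allocation policy are assumptions) might look like the following: each static instruction learns a (distance, stride) pair into the global value history, and a prediction is an older global value plus that stride.

```python
# Toy global stride predictor: predict an instruction's next value as an older
# value in the global history plus a learned per-instruction stride.

from collections import deque

HISTORY_LEN = 8
global_history = deque(maxlen=HISTORY_LEN)   # last values produced by any instruction
table = {}                                   # pc -> (distance, stride)

def predict(pc: int):
    """Return a prediction, or None if no usable (distance, stride) pair is known."""
    if pc in table and global_history:
        distance, stride = table[pc]
        if distance <= len(global_history):
            return global_history[-distance] + stride
    return None

def update(pc: int, actual: int) -> None:
    """Keep the existing pair if it still explains the new value; otherwise
    fall back to distance 1 (a deliberately naive allocation policy)."""
    if global_history:
        distance, stride = table.get(pc, (1, actual - global_history[-1]))
        if distance > len(global_history) or global_history[-distance] + stride != actual:
            table[pc] = (1, actual - global_history[-1])
        else:
            table[pc] = (distance, stride)
    global_history.append(actual)

for pc, value in [(0x40, 10), (0x44, 13), (0x40, 20), (0x44, 23), (0x40, 30)]:
    print(hex(pc), "predicted", predict(pc), "actual", value)
    update(pc, value)
```
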
  • A Feasibility Study on the application of Stream Architectures for Packet Processing Applications
    (2003-09-05) Rai, Jathin S; Dr. Gregory T. Byrd, Committee Chair; Dr. Yannis Viniotis, Committee Member; Dr. Eric Rotenberg, Committee Member
    A new breed of processors, called network processors (NPs), has emerged for packet processing applications. These processors make use of co-processors, which are dedicated hardware, to perform key computational kernels at wire speed. This reduces the flexibility of the network processor. This thesis looks into the inefficiencies of the current architecture employed and the feasibility of a stream architecture for packet processing applications. The thesis simulates the performance of an IPv4 forwarding algorithm on a generic stream architecture and characterizes the maximum sustainable throughput of the forwarding engine. The algorithm has been implemented with a focus on maximizing performance, and the architecture has been tweaked to accommodate the algorithm. The forwarding engine developed is able to sustain an OC-48 data rate running on a 500 MHz clock. It also incorporates a bit-extraction engine, which makes it flexible enough to support different routing schemes.
  • Frequency-aware Static Timing Analysis for Power-aware Embedded Architectures
    (2004-03-14) Seth, Kiran Ravi; Dr. Frank Mueller, Committee Chair; Dr. Alexander Dean, Committee Member; Dr. Eric Rotenberg, Committee Member
    Power is a valuable resource in embedded systems as the lifetime of many such systems is constrained by their battery capacity. Recent advances in processor design have added support for dynamic frequency/voltage scaling (DVS) for saving power. Recent work on real-time scheduling focuses on saving power in static as well as dynamic scheduling environments by exploiting idle and slack due to early task completion for DVS of subsequent tasks. These scheduling algorithms rely on a priori knowledge of worst-case execution times (WCET) for each task. They assume that DVS has no effect on the worst-case execution cycles (WCEC) of a task and scale the WCET according to the processor frequency. However, for systems with memory hierarchies, the WCEC typically does change under DVS due to frequency modulation. Hence, current assumptions used by DVS schemes result in a highly exaggerated WCET. The research presented contributes novel techniques for tight and flexible static timing analysis particularly well-suited for dynamic scheduling schemes. The technical contributions are as follows: (1) The problem of changing execution cycles due to scaling techniques is assessed. (2) A parametric approach towards bounding the WCET statically with respect to the frequency is proposed. Using a parametric model, the effect of changes in frequency on the WCEC can be captured and, thus, the WCET over any frequency range can be accurately modeled. (3) The design and implementation of the frequency-aware static timing analysis (FAST) tool, based on prior experience with static timing analysis, is discussed. (4) Experiments demonstrate that the FAST tool provides safe upper bounds on the WCET, which are tight. The FAST tool allows the capture of the WCET of six benchmarks using equations that overestimate the WCET by less than 1%. FAST equations can also be used to improve existing DVS scheduling schemes to ensure that the effect of frequency scaling on WCET is considered and that the WCET used is not exaggerated. (5) Three DVS scheduling schemes are leveraged by incorporating FAST into them and by showing that the power consumption further decreases.
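
    As an illustration of what a frequency-parametric WCET bound can look like (an assumed model form, not the FAST tool's actual equations), one can split the worst-case cycles into a core component that scales with the clock and a memory component whose wall-clock latency stays roughly fixed, so only the core term should be rescaled under DVS:

```python
# Illustrative parametric WCET model: core cycles shrink in wall-clock time as
# frequency rises, while memory latency stays roughly constant in seconds, so
# the memory component must not be scaled the way naive WCET/frequency scaling does.

def wcet_seconds(freq_hz: float,
                 core_cycles: int,
                 mem_accesses: int,
                 mem_latency_s: float = 60e-9) -> float:
    core_time = core_cycles / freq_hz            # scales with frequency
    mem_time = mem_accesses * mem_latency_s      # roughly frequency-independent
    return core_time + mem_time

# Naively scaling a bound measured at 1 GHz down to 100 MHz would multiply the
# whole bound by 10; the parametric form rescales only the core component.
for f in (100e6, 500e6, 1e9):
    print(f"{f/1e6:.0f} MHz -> {wcet_seconds(f, core_cycles=2_000_000, mem_accesses=50_000)*1e3:.2f} ms")
```
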
  • Memory Design for FFT Processor in 3DIC Technology
    (2009-03-18) Gonsalves, Kiran; Dr. William Rhett Davis, Committee Member; Dr. Paul Franzon, Committee Chair; Dr. Eric Rotenberg, Committee Member
    Computation of the Fast Fourier Transform (FFT) of a sequence is an integral part of Synthetic Aperture Radar (SAR) processing. FFT computation involves a large number of data modification operations (multiplications and additions). Typically, a memory (either on-chip or off-chip) would store the input data packet, and the output data would also be written to the same location. This memory would also be used as a scratch pad for storing intermediate results. As the required resolution of the image increases, the size of the input data increases. Hence, the number of computations in the butterfly structure of the FFT increases, and this results in numerous data transactions between the memory and the computation units. Each data access can be expensive in terms of power dissipation and access time. The power dissipation is proportional to the size of the memory, and the access time depends on the electrical proximity of the memory to the processing unit. Three-dimensional integrated circuits (3D ICs) enable the tight integration of this memory with the logic that operates on it. Apart from form-factor improvement, 3D IC technology's main advantage is that it significantly enhances interconnect resources. Davis et al. mention that in the best case, if the inter-tier vias are ignored, the average wire length can be expected to drop by the number of tiers raised to the power of 2. This structure is advantageous as it reduces the access time and enables quicker computation of the FFT when compared to its two-dimensional counterpart. Alternatively, when run at the same speed, the 3D version can be said to dissipate lower power than the 2D version, owing to smaller interconnect parasitics. The electrical proximity of the memory enables more interconnections (wider buses), and as a result, many small memories can be interfaced to the processing elements. This would not be possible in the conventional off-chip structure, where the number of interconnect pins would be a limiting factor due to limitations on pin-outs and printed circuit board (PCB) routing. This thesis supports the demonstration of memory on logic in a 3D IC environment by creating a full-custom memory. The two types of memories designed for the application are a Static Random Access Memory (SRAM) (for storing input, intermediate, and output data) and a Read Only Memory (ROM) (for storing twiddle factors for FFT computation). For this application, a dual-ported SRAM cell is sufficient, with one port for reads and another for writes. The FFT algorithm used ensures that no location in the memory is ever read from and written to at the same time, which eliminates the need for a design that protects against simultaneous read/write. The ROM is required to store elements that do not change during the calculation of the FFT, i.e., the twiddle factors. In this project, a 32 x 64 SRAM including multiplexers and 3D TSVs is designed. This can be readily integrated into a 3DIC flow. The area of the SRAM is 0.155 square mm, giving an area of 75.68 square microns per bit. The access time for the SRAM is 1.7 ns. The energy for a read access is 408.79 fJ/bit. The energy for a write access is 90.78 fJ/bit. A 129 x 52 ROM is designed with 3D TSVs. This can be integrated into a 3DIC flow. The area of the ROM is 0.032922 square mm, giving an area of 4.72 square microns per bit. The access time for the ROM is 1 ns. The energy per access is 165 pJ/bit.
  • A Methodology for Hardware Design and Verification of Architectures for Channel Equalization
    (2005-12-02) Patel, Virendra Rameshbhai; Dr. Winser E. Alexander, Committee Chair; Dr. Rhett W. Davis, Committee Member; Dr. Eric Rotenberg, Committee Member
    Hardware implementing wireless applications in today's cellular systems has stringent requirements, such as high speed, flexibility, and low power dissipation, resulting in complex systems. These requirements have led to the development of systems on a single chip. Although this development promises a variety of design advantages, designers are facing new difficulties and challenges while designing these complex systems. Some of the difficulties and challenges presented by the traditional design flow for these complex systems are increased simulation time, increased verification effort, increased time to market, difficulty in exploring the design space, and a widening productivity gap. In this research work, we introduce a new design flow that starts at the system level. The design flow, called the system-level design flow, promises to reduce the difficulty of exploring the design space, to reduce simulation times, to reduce verification and debugging time, to allow the definition of both hardware and software components of a design, and to allow defining the system at a high level of abstraction. To validate our design flow and its advantages, we consider a subsystem for a wireless communication system, a 'Multiple Input Multiple Output' (MIMO) wireless communication system, for analysis. We consider the designs of channel equalization architectures for the MIMO wireless communication system, using algorithms such as the least mean square and iterative conjugate gradient algorithms to implement channel equalization. We design the algorithms using SystemC and Verilog, and we consider the use of SystemVerilog to interface SystemC to the Verilog environment.
  • Non-Uniform Power Distribution in Data Centers for Safely Overprovisioning Circuit Capacity and Boosting Throughput
    (2005-04-06) Femal, Mark Edward; Dr. Eric Rotenberg, Committee Member; Dr. Vincent Freeh, Committee Chair; Dr. Frank Mueller, Committee Member
    Management of power in data centers is driven by the need to not exceed circuit capacity. Such techniques are evolving from ad hoc methods based on maximum node power usage to systematic methods that employ power-scalable components. These components allow power consumption to be controlled dynamically, with an accompanying effect on performance. Because the incremental performance gain from operating in a higher performance state is less than the increase in power, it is possible to overprovision the hardware infrastructure to increase throughput and yet still remain below an aggregate power limit. In overprovisioning, if each component operated at maximum power, the limit would be exceeded with disastrous results. However, safe overprovisioning regulates power consumption locally to meet the global power budget. This research work presents PICLE, the Power Infrastructure Controller for Limited Environments. This framework is designed to boost throughput through intelligent monitoring of server clusters by load-balancing the available aggregate power under a set of operating constraints. The solution is useful for data centers that cannot expand the number of power circuits or that seek effective usage of the available power budget despite power fluctuations. The framework is also ideally suited for environments with a heterogeneous workload and, hence, a non-uniform power allocation requirement. Synthetic benchmarks indicate overprovisioning throughput gains of nearly 6% over a statically assigned, power-managed environment and over 30% over an unmanaged environment. In addition, based on a representative workload for a two-minute period, a non-uniform power allocation scheme is shown to increase throughput by over 16% versus a uniform power allocation mechanism.
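
    A minimal sketch of non-uniform power budgeting in this spirit (illustrative Python, not PICLE's controller; the limits and policy are assumptions) splits a global circuit budget across nodes in proportion to their observed demand and clamps each share to the node's power range:

```python
# Toy non-uniform power allocation: proportional split of a global budget,
# clamped to per-node minimum and maximum power states.

def allocate(budget_w: float, demand: list[float],
             p_min: float = 80.0, p_max: float = 250.0) -> list[float]:
    """Proportional split, clamped to per-node limits.  A real controller would
    iterate so that the clamped allocations still sum to at most budget_w."""
    total = sum(demand) or 1.0
    return [min(max(budget_w * d / total, p_min), p_max) for d in demand]

# Busy nodes receive larger shares; lightly loaded nodes are held near p_min.
print(allocate(budget_w=1000.0, demand=[0.9, 0.2, 0.7, 0.1]))
```
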
  • Scheduling to Consolidate Idle Periods for Energy-Efficiency in Multicore Systems.
    (2009-09-29) Pal, Poulomi; Dr. Gregory T. Byrd, Committee Chair; Dr. Eric Rotenberg, Committee Member; Dr. Rhett Davis, Committee Member
    Power efficiency and energy savings are major drivers in the CPU design space. Design at all levels needs to be energy-aware in order to be considered seriously. With the emergence of mobile devices as a large market, energy efficiency has become of prime importance, as the emphasis now lies on making the battery last longer. Most modern cellphones are required to do much more than make and receive calls. With 4G approaching quickly, the cellphone is required to be almost as good as any general-purpose computer with respect to application complexity. This gives rise to the need for adaptive systems that can handle applications with high performance and conserve energy as well. The research presented here provides a scheduling scheme that aims to consolidate CPU idle times and to introduce some amount of inertia and determinism into the system, so as to provide the opportunity to switch CPUs into low-power modes to conserve energy. The baseline used here is the Linux 2.6.28 kernel, which implements distributed control for load balancing across CPUs. Changes are made to this kernel, and CPU activity is used as the metric for comparison. It is found that the higher-numbered CPUs get much longer idle periods, which can be used for switching to energy-efficient modes, while still maintaining comparable performance.
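
    The consolidation idea can be illustrated with a toy placement policy (illustrative Python, not the modified Linux 2.6.28 scheduler; all names and thresholds are assumed): pack work onto the lowest-numbered CPUs first so that higher-numbered CPUs accumulate long idle periods suitable for low-power states.

```python
# Toy idle-consolidation placement: fill low-numbered CPUs first instead of
# balancing load evenly, so high-numbered CPUs stay idle for long stretches.

def place(task_load: float, cpu_load: list[float], capacity: float = 1.0) -> int:
    """Return the index of the lowest-numbered CPU that can absorb the task."""
    for cpu, load in enumerate(cpu_load):
        if load + task_load <= capacity:
            return cpu
    return min(range(len(cpu_load)), key=lambda c: cpu_load[c])  # fallback: least loaded

cpus = [0.0, 0.0, 0.0, 0.0]
for load in (0.4, 0.3, 0.2, 0.5, 0.1):
    cpus[place(load, cpus)] += load
print(cpus)   # work piles onto CPU0/CPU1; CPU2/CPU3 stay idle
```
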
  • Software Thread Integration for Instruction Level Parallelism
    (2007-07-05) So, Won; Dr. Eric Rotenberg, Committee Member; Dr. Thomas M. Conte, Committee Member; Dr. Vincent W. Freeh, Committee Member; Dr. Alexander G. Dean, Committee Chair
    Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word) or EPIC (Explicitly Parallel Instruction Computing). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy. This dissertation proposes Software Thread Integration (STI) for Instruction Level Parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. STI is essentially procedure jamming with intraprocedural code motion transformations that allow arbitrary alignment of instructions or code regions. This alignment enables code to be moved to use available execution resources better and improve the execution schedule. Parallel procedures are identified by the programmer with either annotations in conventional procedural languages or graph analysis for stream coarse-grain dataflow programming languages. We use the method of procedure cloning and integration for improving program run-time performance by integrating parallel procedures via STI. This defines a new way of converting parallelism at the thread level to the instruction level. With filter integration, we apply STI to streaming applications, exploiting explicit coarse-grain dataflow information expressed by stream programming languages. During integration of threads, various STI code transformations are applied in order to maximize the ILP and reconcile control flow differences between two threads. Different transformations are selectively applied according to the control structure and the ILP characteristics of the code, driven by interactions with software pipelining. This approach effectively combines ILP-improving code transformations with instruction scheduling techniques so that they complement each other. Code transformations involve code motion as well as loop transformations such as loop jamming, unrolling, splitting, and peeling. We propose a methodology for efficiently finding the best integration scenario among all possibilities. We quantitatively estimate the performance impact of integration, allowing various integration scenarios to be compared and ranked via profitability analysis. The estimated profitability is verified and corrected by an iterative compilation approach, compensating for possible estimation inaccuracy. Our modeling methods combined with limited compilation quickly find the best integration scenario without requiring exhaustive integration. The proposed methods are automated by the STI for ILP Tool Chain targeting Texas Instruments C6x VLIW DSPs. This work contributes to the definition of an alternative development path for DSP applications. We seek to provide efficient compilation of C or C-like languages with a small amount of additional high-level dataflow information targeting popular and practical VLIW DSP platforms, reducing the need for extensive manual C and assembly code optimization and tuning.
  • STI Concepts for Bit-Bang Communication Protocols
    (2003-01-28) Kumar, Nagendra J; Dr. Thomas M. Conte, Committee Member; Dr. Eric Rotenberg, Committee Member; Dr. Alexander Dean, Committee Chair
    Today, embedded communication networks are being used in an increasing number of embedded systems to provide more reliability and cost effectiveness. Designers are forced to limit and minimize the size, weight, power consumption, cost, and design time of their products. However, network controller chips are expensive, so moving functionality from hardware to software cuts costs and also makes custom-fit protocols easier to implement. Traditional methods of sharing a processor are not adequate for implementing communication protocol controllers in software because of the processing required during each bit. The available idle time is fine-grained compared to the bit time and is usually too small for even fast context-switching techniques (e.g., co-routines) to run any other thread. Without some scheme to recover this fine-grain idle time, no other work in the system would make any progress. Software Thread Integration (STI) provides low-cost concurrency on general-purpose microprocessors by interleaving multiple threads of control (having real-time constraints) into one. This thesis introduces new methods for implementing communication protocols in software using statically scheduled co-routines and software thread integration. With co-routines, switching from primary to secondary threads and vice versa can be done without incurring a penalty as severe as a context switch. The technique is demonstrated on the SAE J1850 communication standard used in off- and on-road land-based vehicles. These methods also minimize the number of co-routine calls needed to share the processor, thereby enabling finer-grain idle time to be recovered for use by the secondary thread. The increased number of compute cycles implies improved performance of the secondary thread and a reduced minimum clock speed for the microprocessor. These factors enable embedded system designers to use processors more efficiently and with less development effort.
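
    A toy model of the co-routine interleaving (Python generators standing in for the statically scheduled co-routines the thesis implements at the C/assembly level; all names are illustrative): the primary bit-bang thread hands control back during each bit's idle time, and a secondary thread runs in that slice.

```python
# Toy cooperative interleaving of a bit-bang transmit "thread" and a secondary
# computation "thread": each yield marks fine-grain idle time handed back.

def bitbang_tx(byte: int):
    """Primary thread: drive one bit per step, then yield the idle remainder."""
    for i in range(8):
        bit = (byte >> i) & 1
        print(f"drive bit {i} = {bit}")   # stand-in for toggling an output pin
        yield

def secondary_work():
    """Secondary thread: background computation done in the recovered idle time."""
    count = 0
    while True:
        count += 1
        yield count

primary, secondary = bitbang_tx(0xA5), secondary_work()
for _ in range(8):
    next(primary)                          # send one bit...
    next(secondary)                        # ...then use the idle slice
```
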
  • Weld for Itanium Processor
    (2002-12-03) Sharma, Saurabh; Dr. Thomas M. Conte, Committee Chair; Dr. Eric Rotenberg, Committee Member; Dr. Alexander Dean, Committee Member
    This dissertation extends WELD for Itanium processors. Emre Özer presented the WELD architecture in his Ph.D. thesis. WELD integrates multithreading support into an Itanium processor to hide run-time latency effects that cannot be determined by the compiler. It also proposes a hardware technique called operation welding that merges operations from different threads to better utilize the hardware resources. Hardware contexts such as program counters and fetch units are duplicated to support multithreading. The experimental results show that dual-thread WELD attains a maximum speedup of 11% compared to the single-threaded Itanium architecture while still maintaining the hardware simplicity of the EPIC architecture.
