Browsing by Author "Gregory T. Byrd, Committee Member"
Now showing 1 - 7 of 7
- Caching Strategies to Improve Generational Garbage Collection in Smalltalk (2003-08-20) Reddy, Vimal Kodandarama; Gregory T. Byrd, Committee Member; Edward F. Gehringer, Committee Chair; Warren J. Jasper, Committee Member; Eric Rotenberg, Committee Member
  Cache performance is becoming increasingly important as processor speeds grow faster relative to main memory, and cache misses have become a major consideration for the performance of garbage-collected systems. This thesis explores a caching strategy for generational garbage collectors, the most prevalent form in use, that takes advantage of the large caches available in modern processors. A portion of the cache is reserved for the youngest generation, and the page-fault manager is given mapping rules that remove all conflicts with the youngest generation. The strategy can be realized entirely in software, which makes it an attractive way to increase garbage-collection performance. This "biased" cache mapping is shown to reduce cache misses and increase overall performance in IBM VisualAge Smalltalk, a high-quality Smalltalk implementation that employs a generational copying garbage collector. Favoring the youngest generation in the mapping strategy is advantageous for the following reasons:
  1. Languages like Smalltalk, where "everything" is an object, tend to allocate furiously, because they encourage a programming style in which objects are created, used, and shortly thereafter destroyed. This large number of allocations translates into initialization write misses if the allocated region is not cached. In generational heaps, all memory is allocated in the region containing the youngest-generation objects.
  2. A generational garbage collector focuses collection on the youngest generation, scavenging it to reclaim most garbage. It relies on the empirical observation that most young objects die soon. This means the scavenger runs many times during a program's lifetime, scanning the youngest generation for garbage, which can lead to a large number of read and write cache misses if the youngest generation is not in cache.
  3. Youngest-generation objects form a major part of a program's working set. Keeping them in the cache also improves mutator (i.e., user program) performance, making it immune to interference from the garbage collector.
  4. Given that most young objects become garbage quickly, evicting a dead object from a writeback cache causes an unnecessary writeback. Caching the youngest generation therefore reduces traffic to memory.
  We conduct a simulation-based study of our mapping strategies on IBM VisualAge Smalltalk, a generational copying garbage-collected system. Our results show a 45% average drop in cache miss rates at the L2 level for direct-mapped caches and a 15% average drop for 2-way set-associative caches.
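
The biased mapping described above can be realized purely by constraining which physical page frames back the nursery. Below is a minimal page-coloring sketch of that idea; the cache geometry, frame counts, and reserved-color fraction are illustrative assumptions, not values from the thesis.

```python
# Illustrative page-coloring sketch of a "biased" cache mapping: nursery pages
# are backed only by frames whose cache color falls in a reserved range, and all
# other allocations avoid those colors, so nothing can conflict with the nursery.
# Cache geometry and frame counts below are made-up parameters for illustration.

PAGE_SIZE      = 4096
CACHE_SIZE     = 1 << 20                     # 1 MiB direct-mapped cache (assumed)
NUM_COLORS     = CACHE_SIZE // PAGE_SIZE     # 256 page-sized slices of the cache
NURSERY_COLORS = set(range(0, 64))           # reserve 1/4 of the cache for the nursery

def color(frame: int) -> int:
    """Cache color of a physical frame: which slice of the cache it maps to."""
    return frame % NUM_COLORS

free_frames = list(range(4096))              # toy free list of physical frames

def alloc_frame(for_nursery: bool) -> int:
    """Hand out a frame whose color respects the bias rule."""
    for i, frame in enumerate(free_frames):
        in_reserved = color(frame) in NURSERY_COLORS
        if in_reserved == for_nursery:
            return free_frames.pop(i)
    raise MemoryError("no frame with a suitable color")

nursery = [alloc_frame(True) for _ in range(32)]
other   = [alloc_frame(False) for _ in range(32)]
# No ordinary page shares a cache slice with a nursery page:
assert not {color(f) for f in nursery} & {color(f) for f in other}
```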
- Hardware Architecture of a Behavior Modeling Coprocessor for Network Intrusion Detection System (2009-03-26) Yadav, Meeta; Paul D. Franzon, Committee Chair; Michael A Rappa, Committee Member; Yannis Viniotis, Committee Member; Gregory T. Byrd, Committee Member
  YADAV, MEETA. Hardware Architecture of a Behavior Modeling Coprocessor for Network Intrusion Detection. (Under the direction of Professor Paul D. Franzon). Intrusion detection systems protect a network against exploitation and manipulation by monitoring incoming and outgoing traffic and classifying it as normal or malicious. Classifying network traffic is difficult, and it is made more complex by the growing performance pressures of increasing traffic rates, the need to detect stealthy attacks through sophisticated analysis, the requirement of in-line processing, and the inability of software-based systems to keep up with line speeds. Most current intrusion detection systems trade off one or more of these requirements. For instance, software-based systems are scalable and can perform more complex algorithmic analysis on the traffic but cannot keep up with line speeds, while hardware-based systems can process packets in real time but are not scalable or configurable and are limited to rule-based packet filtering. These pressures have redefined the issues to be addressed in the design of a security system, underlining the need for a scalable, configurable hardware system that can effectively detect intrusions by performing sophisticated analysis at line speed while keeping up with increasing traffic rates and attack sophistication. The focus of this dissertation is to design a hardware-based intrusion detection system that is scalable, configurable, and capable of analyzing traffic to detect various categories of attacks at line speed. Specifically, we address four important issues in the design of hardware-based systems:
  - A behavior-based technique was implemented in hardware to detect attacks embedded in the different protocol layers, across layers, and in packet payloads. The technique monitors traffic deeply, recovers higher-layer semantics, understands the flow of commands, requests, and responses, and detects attacks embedded across packets and across connections. It checks network traffic for behavioral compliance using configurable, parametric data structures called theories that can model simple as well as complex behavior. Theories translate themselves into hardware using configurable functional units called assertion blocks.
  - Theories and assertion blocks are parametric and configurable, and can be configured to translate any behavior description to hardware. The configurability of individual theories and assertion blocks lends configurability to the entire system. To let the system scale with the number of behavior modules, a configurable fabric of assertion blocks has been developed; the fabric contains pre-synthesized assertion modules that are triggered by theories to perform the operations the theories specify.
  - A Multi-Level Fractional Hash Algorithm was developed to manage the gathered traffic information effectively, inserting and querying a connection record with an average cost of O(1). The technique uses associative memory arranged in levels, an on-chip bit-vector array to insert records, and the tag-based technique of caches to query a record.
  - To block pre-defined and user-defined malicious content, a high-speed, trie-based pattern matching algorithm was designed. The algorithm splits the pattern set into tries that are stored in on-chip memory and pruned patterns that are stored in off-chip SRAM. The streaming data is split into sub-streams that can lead to a possible match; the sub-streams are searched in parallel for malicious content by traversing the on-chip tries and comparing the pruned patterns stored off-chip using dedicated comparators. The throughput of the pattern matching algorithm is 14 Gbps and is independent of pattern length, of the location of the malicious content in the streaming data, and of the number of patterns in the pattern set.
  The architectural and algorithmic enhancements that address these performance issues were integrated into the Behavioral Intrusion Prevention and Detection System (BIPDS). BIPDS carries out threat detection with dedicated hardware accelerators, monitoring all communication layers, extracting relevant data, and enabling highly efficient operation. The system supports a large number of protocols and applications and is extensible to new applications and services. Different aspects of security are handled with behavioral modeling, which enables the system to detect attack and pre-attack behavior. A key accomplishment of BIPDS is its scalable architecture and its flexibility to be updated, which allow it to adapt to various network configurations and to scale with increases in network traffic and behavior models. The main contribution of this dissertation is the identification of an efficient hardware architecture that can process one million simultaneous data connections in parallel at 11 Gbps, has a die area of 17.3 mm² (TSMC 0.25 μm library), and has a morphable data path to accommodate changes in network size and configuration.
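
The trie-plus-pruned-pattern split can be sketched in software. In the toy version below, the first few bytes of each pattern form the trie (standing in for on-chip memory) and the remainder is the "pruned" tail (standing in for off-chip SRAM), compared only when a prefix matches; the prefix length and pattern set are illustrative assumptions, not the dissertation's parameters.

```python
# Toy sketch of trie-based matching with pruned patterns: the first PREFIX_LEN
# bytes of each pattern live in a trie; the remaining bytes are "pruned" tails
# compared only when a trie walk reaches a matching prefix.
PREFIX_LEN = 4
PATTERNS = [b"cmd.exe", b"/etc/passwd", b"SELECT *", b"exec("]   # illustrative

trie = {}          # nested dicts keyed by byte value
pruned = {}        # prefix -> list of (tail, full pattern)
for p in PATTERNS:
    prefix, tail = p[:PREFIX_LEN], p[PREFIX_LEN:]
    node = trie
    for b in prefix:
        node = node.setdefault(b, {})
    node["$"] = prefix                       # mark end of an on-chip prefix
    pruned.setdefault(prefix, []).append((tail, p))

def scan(stream: bytes):
    """Check every sub-stream (offset) of the input against the pattern set."""
    hits = []
    for i in range(len(stream)):             # each offset starts a sub-stream
        node, j = trie, i
        while j < len(stream) and stream[j] in node:
            node = node[stream[j]]
            j += 1
            if "$" in node:                  # prefix matched: compare pruned tail
                for tail, full in pruned[node["$"]]:
                    if stream[j:j + len(tail)] == tail:
                        hits.append((i, full))
    return hits

print(scan(b"GET /etc/passwd HTTP/1.1"))    # -> [(4, b'/etc/passwd')]
```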
- Low-Power Repeater Insertion for Global Interconnects (2006-05-04) Peng, Yuantao; Xun Liu, Committee Chair; W. Rhett Davis, Committee Member; Gregory T. Byrd, Committee Member; Paul D Franzon, Committee Member
  Repeater insertion is one of the most widely used techniques for reducing signal propagation delay on global interconnects. The number of repeaters inserted into interconnects is expected to become enormous as chip dimensions keep growing, and these repeaters can occupy significant silicon area and consume substantial power. Consequently, minimizing the power consumption of repeaters under timing-closure constraints is an important problem in future low-power VLSI design. In this dissertation, we investigate efficient schemes for low-power repeater insertion on global interconnects. We first analyze key issues in repeater library design by introducing an analytical low-power repeater insertion algorithm for uniform two-pin interconnects; this study shows how to design a compact repeater library for low power in early design stages. We then discuss several low-power repeater insertion schemes under given timing constraints that achieve a better trade-off between solution quality and runtime than previously proposed approaches. To handle signal integrity while performing repeater insertion, we next present a novel low-power repeater insertion scheme under both timing and signal slew-rate constraints; the proposed scheme captures both delay and slew-rate information, resulting in high-quality interconnect designs. Beyond repeater insertion algorithms for given interconnects, we also describe a simple yet effective power macromodel for global interconnects that accounts for low-power repeater insertion. By incorporating the macromodel into a macrocell placement tool, we achieve simultaneous minimization of timing violations and power dissipation.
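
For context, the classical delay-optimal repeater insertion result (Bakoglu) gives the repeater count and size that minimize delay on a uniform line; low-power schemes like those above accept a delay budget beyond that minimum in exchange for fewer or smaller repeaters. The sketch below computes that classical baseline; the electrical parameter values are illustrative assumptions, not numbers from the dissertation.

```python
# Classical delay-optimal repeater insertion (Bakoglu) for a uniform line,
# shown only as the baseline that low-power schemes trade against.
# Electrical parameters below are illustrative, not values from the dissertation.
from math import sqrt

R_int, C_int = 2000.0, 2e-12   # total line resistance (ohm) and capacitance (F), assumed
R_0,   C_0   = 5000.0, 1e-15   # unit repeater output resistance and input capacitance, assumed

k_opt = sqrt((0.4 * R_int * C_int) / (0.7 * R_0 * C_0))  # delay-optimal repeater count
h_opt = sqrt((R_0 * C_int) / (R_int * C_0))              # delay-optimal size (in unit repeaters)

# A low-power scheme accepts a delay above this minimum and searches for
# fewer/smaller repeaters that still meet the timing and slew constraints.
print(f"delay-optimal: {k_opt:.1f} repeaters, each {h_opt:.0f}x unit size")
```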
- Securing Communication in Dynamic Network Environments (2007-06-11) Wang, Pan; Peng Ning, Committee Co-Chair; Douglas S. Reeves, Committee Chair; Wenye Wang, Committee Member; Gregory T. Byrd, Committee Member
  In dynamic network environments, users may come from different domains, and the number of users and the network topology may change unpredictably over time. Protecting users' communication in such dynamic environments is therefore extremely challenging. This dissertation investigates multiple research problems related to securing users' communication in dynamic network environments, focusing on two kinds of dynamic networks: mobile ad hoc networks and overlay networks. It first introduces a secure address auto-configuration scheme for mobile ad hoc networks, since a precondition of network communication is that each user is configured with a unique network identifier (address). The proposed auto-configuration scheme binds each address to a public key and allows a user to authenticate itself, thereby thwarting address spoofing attacks in the absence of centralized authentication services. Next, the dissertation presents two storage-efficient stateless group key distribution schemes to protect the group communication of a dynamic set of users. These schemes combine one-way key chains with a logical tree; they allow an authorized user to obtain updated group keys even after going off-line for a while, and significantly reduce the storage required at each user compared with previous stateless key distribution schemes. Third, it investigates using cryptographic methods to enforce network access control in mobile ad hoc networks, whose dynamic nature makes it difficult to apply traditional access control techniques such as firewalls directly; a working prototype demonstrates that the proposed access control system is practical and effective. Finally, the dissertation introduces a k-anonymity communication protocol for overlay networks to protect the privacy of users' communication. Unlike existing anonymous communication protocols, which either cannot provide provable anonymity or suffer from transmission collisions, the proposed protocol is transmission-collision free and provides provable k-anonymity for both the sender and the recipient. The analysis shows the protocol is secure even under a strong adversary model, in which the adversary controls a fraction of the nodes, can eavesdrop on all network traffic, and can maliciously modify or replay transmitted messages. A proof-of-concept implementation demonstrates that the proposed protocol is practical.
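
A minimal sketch of the one-way key chain idea behind the group key distribution schemes: the chain is generated backwards with a hash function, so a user who receives the current chain element can derive every earlier group key (for example, keys missed while off-line) but cannot compute any future one. The chain length, hash choice, and catch-up scenario below are illustrative assumptions; the dissertation's schemes additionally arrange such chains over a logical tree.

```python
# One-way key chain sketch: generate the chain backwards with SHA-256, so
# knowing element K_i lets you derive K_{i-1}, K_{i-2}, ... but not K_{i+1}.
# Parameters (chain length, epoch numbers) are illustrative.
import hashlib, os

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

N = 16                                  # number of key epochs (assumed)
seed = os.urandom(32)                   # secret end of the chain, held by the key server
chain = [None] * N
chain[N - 1] = seed
for i in range(N - 2, -1, -1):          # K_i = H(K_{i+1})
    chain[i] = h(chain[i + 1])

# The group key for epoch i is derived from chain[i], which the server reveals
# at epoch i. A user last online at epoch 3 who now receives chain[9] can
# recover every key from epoch 9 back to epoch 3 by repeated hashing:
current = chain[9]
recovered = {9: current}
for i in range(8, 2, -1):
    current = h(current)
    recovered[i] = current

assert all(recovered[i] == chain[i] for i in range(3, 10))
```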
- Slipstream Processors (2003-07-17) Purser, Zachary Robert; Eric Rotenberg, Committee Chair; Gregory T. Byrd, Committee Member; Thomas M. Conte, Committee Member; S. Purushothaman Iyer, Committee Member
  Processors execute a program's full dynamic instruction stream to arrive at its final output, yet there exist shorter instruction streams that produce the same overall effect. This thesis proposes creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly predictable control flow. The shortened program runs concurrently with, and slightly ahead of, a full copy of the program on a chip multiprocessor (CMP) or simultaneous multithreading (SMT) processor, and the leading program passes all of its control-flow and data-flow outcomes to the trailing program for checking. This redundant program arrangement provides two key benefits:
  1. Improved single-program performance. The leading program is sped up because it retires fewer instructions. Although the number of retired instructions is not reduced in the trailing program, it fetches and executes instructions more efficiently by virtue of having near-oracle branch and value predictions from the leading program. Thus, the trailing program is also sped up in the wake, or "slipstream", of the leading program, while at the same time validating the speculative leading program and redirecting it as needed. Slipstream execution using two processors of a CMP substrate outperforms conventional non-redundant execution using only one of the processors; likewise, given a sufficiently reduced leading program, slipstream execution using two contexts of an SMT substrate outperforms conventional non-redundant execution using only one of the contexts.
  2. Fault tolerance. The shorter program is a subset of the full program, and this partial redundancy is exploited to detect and recover from transient hardware faults. No additional hardware support is required, since the same mechanisms used to detect and recover from misspeculation in the leading program apply equally well to transient fault detection and recovery; in fact, there is no way to distinguish between misspeculation and faults.
  The broader rationale for slipstream is to extend, not replace, the capabilities of CMP/SMT processors by providing additional modes of execution. This thesis demonstrates the feasibility and benefits of the slipstream execution model.
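
A toy software analogy of the slipstream arrangement (purely illustrative; the real mechanism operates on hardware instruction streams, and the workload and buffer here are invented): the leading A-stream skips a highly predictable branch and forwards its outcomes through a delay buffer, while the trailing R-stream executes everything, consumes the outcomes as predictions, and flags any mismatch, which could equally be A-stream misspeculation or a transient fault.

```python
# Toy analogy of slipstream execution: the leading "A-stream" skips work whose
# outcome is highly predictable and forwards outcomes through a delay buffer;
# the trailing "R-stream" redundantly computes everything, uses the forwarded
# outcomes as predictions, and flags mismatches (misspeculation or a fault).
# The workload and buffer are illustrative inventions, not the thesis mechanism.
from collections import deque

data = list(range(20))
delay_buffer = deque()

def a_stream():
    """Leading stream: assumes the 'x >= 0' branch is always taken (check elided)."""
    total = 0
    for x in data:
        total += x                    # branch predicted taken, check skipped
        delay_buffer.append(("taken", total))
    return total

def r_stream():
    """Trailing stream: executes the full program and checks forwarded outcomes."""
    total, mismatches = 0, 0
    for x in data:
        taken = x >= 0                # full (non-elided) branch computation
        if taken:
            total += x
        predicted_branch, predicted_total = delay_buffer.popleft()
        if (predicted_branch == "taken") != taken or predicted_total != total:
            mismatches += 1           # would trigger recovery: misspeculation or transient fault
    return total, mismatches

a_total = a_stream()
r_total, mismatches = r_stream()
print(a_total, r_total, mismatches)   # identical totals, zero mismatches here
```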
- Transaction-level Modeling for a Network-on-chip Router in Multiprocessor System (2009-08-10) Hu, Jianchen; William Rhett Davis, Committee Chair; Gregory T. Byrd, Committee Member; Xun Liu, Committee Member
  As the complexity of SoC designs grows, the traditional register transfer level (RTL) centric design flow can no longer meet time-to-market requirements, so a higher level of modeling abstraction is needed for designers to explore the design space at the system level. Transaction-level modeling (TLM) is such an approach, since a transaction-level model runs much faster than an RTL model while retaining sufficient accuracy, and different TLM coding styles suit different applications. In this thesis, we develop a hybrid TLM of a network-on-chip (NoC) based on the OSCI TLM-2.0 standard, using a simplified version of the AMBA AXI protocol for the bus. The model contains a cycle-accurate AXI router and peripheral modules written in the approximately-timed coding style, which achieves fast simulation speed and accurate results. Because it is based entirely on the TLM-2.0 standard, the model retains good interoperability, and designers can build complex NoCs from it.
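
For readers unfamiliar with the transaction-level style, the sketch below illustrates the abstraction in plain Python (not SystemC/TLM-2.0): modules exchange whole transaction objects annotated with approximate timing instead of toggling cycle-by-cycle signals, which is what makes TLM simulations so much faster than RTL. The router topology, field names, and latencies are invented for illustration.

```python
# Plain-Python illustration of the transaction-level abstraction (not SystemC /
# OSCI TLM-2.0): an initiator passes a whole transaction object to a router,
# which forwards it to a target and accumulates an approximate delay, instead
# of simulating cycle-by-cycle wire activity as RTL would. All names, the
# topology, and the latencies below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Transaction:
    address: int
    data: int
    is_write: bool
    delay_ns: float = 0.0     # approximate timing annotation carried with the payload

class Memory:
    def __init__(self):
        self.store = {}
    def transport(self, txn: Transaction):
        txn.delay_ns += 10.0                 # assumed memory access latency
        if txn.is_write:
            self.store[txn.address] = txn.data
        else:
            txn.data = self.store.get(txn.address, 0)

class Router:
    """Routes a transaction to one of its targets by address range."""
    def __init__(self, targets):
        self.targets = targets               # list of (base, size, target)
    def transport(self, txn: Transaction):
        txn.delay_ns += 2.0                  # assumed per-hop router latency
        for base, size, target in self.targets:
            if base <= txn.address < base + size:
                target.transport(txn)
                return
        raise ValueError("address not mapped")

mem = Memory()
router = Router([(0x0000, 0x1000, mem)])

wr = Transaction(address=0x10, data=42, is_write=True)
router.transport(wr)
rd = Transaction(address=0x10, data=0, is_write=False)
router.transport(rd)
print(rd.data, rd.delay_ns)                  # -> 42 12.0
```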
- Using Performance Bounds to Guide Code Compilation and Processor Design (2003-07-10) Zhou, Huiyang; Thomas M. Conte, Committee Chair; Gregory T. Byrd, Committee Member; Eric Rotenberg, Committee Member; S. Purushothaman Iyer, Committee Member
  Performance bounds represent the best achievable performance that target microarchitectures can deliver on specified workloads. Accurate performance bounds provide an efficient way to evaluate the performance potential of either code optimizations or architectural innovations, and we advocate using them to guide code compilation. In this dissertation, we introduce a novel bound-guided approach to systematically regulate code-size-related instruction-level parallelism (ILP) optimizations, including tail duplication, loop unrolling, and if-conversion. Our approach is based on the notion of code size efficiency, defined as the ratio of ILP improvement to static code size increase. With this notion, we (1) develop a general approach that selectively performs optimizations to maximize ILP improvement while minimizing the cost in code size, (2) define the optimal tradeoff between ILP improvement and code size overhead, and (3) develop a heuristic that achieves this optimal tradeoff. We extend both the performance bounds and code size efficiency to perform code-size-aware compilation for real-time applications: profile-independent performance bounds reveal the criticality of each path in a task, so code optimizations can focus on the critical paths (even at the cost of non-critical ones) to reduce the worst-case execution time, thereby improving the overall schedulability of the real-time system. For memory-intensive applications featuring heavy pointer chasing, we develop an analytical model based on performance bounds to evaluate memory-latency-hiding techniques. We model the performance potential of these techniques and use the analytical results to motivate an architectural innovation, called recovery-free value prediction, that enhances memory-level parallelism (MLP). Experimental results show that the proposed technique improves MLP significantly and achieves impressive speedups.
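
The code-size-efficiency notion lends itself to a simple greedy selection sketch: rank candidate ILP optimizations by estimated ILP gain divided by code-size increase and apply them in that order while a code-size budget holds. The candidate list, the estimates, and the budget below are invented for illustration; the dissertation's actual heuristic operates on compiler-internal performance bounds, not on numbers like these.

```python
# Greedy sketch of selecting ILP optimizations by code size efficiency, i.e.
# estimated ILP improvement divided by static code size increase. The candidate
# transformations, their estimated gains/costs, and the budget are invented
# purely to illustrate the selection idea.
candidates = [
    # (name, estimated ILP gain in %, static code size increase in bytes)
    ("unroll loop A x4",     18.0, 480),
    ("tail-duplicate B",      6.0,  96),
    ("if-convert C",          4.0,  32),
    ("unroll loop D x8",     10.0, 900),
]
SIZE_BUDGET = 700   # bytes of code growth we are willing to pay (assumed)

def efficiency(c):
    _, gain, size = c
    return gain / size

selected, growth, total_gain = [], 0, 0.0
for name, gain, size in sorted(candidates, key=efficiency, reverse=True):
    if growth + size <= SIZE_BUDGET:         # stop paying for low-efficiency growth
        selected.append(name)
        growth += size
        total_gain += gain

print(selected, growth, total_gain)
# efficiencies: if-convert 0.125, tail-dup 0.0625, unroll A 0.0375, unroll D 0.011
# -> picks C, B, A (608 bytes, ~28% estimated gain); D is rejected by the budget
```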