
Browsing by Author "Dr. Gregory T. Byrd, Committee Member"

Now showing 1–15 of 15
  • A Compilation Tool for Automated Mapping of Algorithms onto FPGA-Based Custom Computing Machines
    (2002-08-27) Sahin, Ibrahim; Dr. Paul D. Franzon, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Winser E. Alexander, Committee Co-Chair; Dr. Clay S. Gloster, Committee Co-Chair
    Adaptive computing, also known as Reconfigurable Computing (RC), is a field that combines hardware and software data processing platforms. RC systems combine the flexibility of General Purpose Processors (GPP) with the speed of application-specific processors [1,2]. In a typical reconfigurable computer, computationally intensive portions of algorithms are executed on Field Programmable Gate Arrays (FPGA) for enhanced performance. Although RC systems offer significant performance advantages over GPPs, they have a few disadvantages. RC systems require more application development time than GPPs. Also, RC system designers need to be knowledgeable in the areas of hardware and software system design. Since each application is different in terms of data inputs, outputs, and the method of processing data, designers are required to design a specific RC implementation for each specific problem. Our major contribution in this research is the development of a design automation tool called the Reconfigurable Computing Compilation Tool (RCCT) to address the problems mentioned above. In addition, this tool was designed to automate the process of mapping applications onto RC systems and to provide the potential performance benefits of RC systems to typical software programmers. The final version of the tool contains four components: the RC Compiler, the Module Library, the Loader, and the Simulator. Our contributions also include a novel assembly language instruction set for the modules and a session file format (a new assembly language program format for RC systems). The tool was tested on several applications to demonstrate its effectiveness. Among the selected applications were matrix multiplication and image processing algorithms such as 3-D image correlation. We compared the execution times of the applications running on different GPPs with those on different RC configurations to demonstrate the tool's effectiveness. Our results showed that the tool is able to enhance the performance of the applications by mapping portions of them to the RC systems. Simulations with the tool showed that when user applications are mapped to the RC systems, significant speedups (around 10 to 100 times) can be attained for the mapped sections of the applications. We also noticed that the design and implementation time of the RC versions of the applications was reduced significantly. With the tool, the RC versions of the applications were developed in a matter of hours. No special skills are needed to map applications to the RC systems using RCCT if the required hardware modules are readily available.
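    The reported 10x-100x gains apply only to the mapped sections; the overall application speedup is bounded by Amdahl's law. A minimal sketch of that arithmetic (the workload fractions below are illustrative assumptions, not RCCT measurements):

```python
# Back-of-the-envelope estimate (Amdahl's law) of overall application
# speedup when a fraction of the runtime is mapped onto an RC system.
# The 10x/100x section speedups come from the abstract; the mapped
# fractions are invented for illustration.

def overall_speedup(mapped_fraction: float, section_speedup: float) -> float:
    """Amdahl's law: time = (1 - f) + f / s, speedup = 1 / time."""
    return 1.0 / ((1.0 - mapped_fraction) + mapped_fraction / section_speedup)

for f in (0.5, 0.8, 0.95):
    for s in (10, 100):
        print(f"mapped {f:.0%} at {s}x -> overall {overall_speedup(f, s):.1f}x")
```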
  • Finite-Difference Time-Domain Methods for Electromagnetic Problems Involving Biological Bodies
    (2006-03-13) Schmidt, Stefan; Dr. Zhilin Li, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Gianluca Lazzi, Committee Chair; Dr. Brian L. Hughes, Committee Member
    As more applications of wireless devices in the personal space are emerging, the analysis of interactions between electromagnetic energy and the human body will become increasingly important. Due to the risk of adverse health effects caused by the use of wireless devices adjacent to or implanted into the human body, it is important to minimize their electromagnetic interaction with biological objects. Efficient numerical methods may play an integral role in the design and analysis of wireless telemetry in implanted biomedical devices, as well as computation and minimization of the specific absorption rate (SAR) associated with wireless devices, as an alternative to repetitive design prototyping and measurements. Research presented in this dissertation addresses the need to develop efficient numerical methods for the solution of such bio-electromagnetic problems. Bio-electromagnetic problems involving inhomogeneous dispersive media are traditionally solved using the Finite-Difference Time-Domain (FDTD) method. In this class of problems, the spatial discretization is often dominated by very fine geometric details rather than the smallest wavelength of interest. For an explicit FDTD scheme, these fine details dictate a small time-step due to the Courant-Friedrichs-Lewy (CFL) stability bound, which in turn leads to a large number of computational steps. In this dissertation, numerical methods are considered that overcome the CFL stability bound for particular bio-electromagnetic problems. One such method is to incorporate a thin wire sub-cell model into the explicit FDTD method for the computation of inductive coupling. The sub-cell model allows the use of larger FDTD cells, hence relaxing the CFL stability bound. A novel stability bound for the method is derived. Furthermore, an extension to the Thin-Strut FDTD method is proposed for the modeling of thin wire elements in lossy dielectric materials. Numerical results obtained by the Thin-Strut FDTD method were compared with measurements. Furthermore, the Partial Inductance Method (PIM) was implemented using arbitrarily oriented cylindrical wire elements to obtain an analytical approximation of inductive coupling and to verify the Thin-Strut FDTD method. PIM was also shown to be a very efficient tool for the approximation of free-space or low-frequency inductive coupling problems for biomedical applications. The Alternating-Direction-Implicit (ADI) FDTD method is another method considered in this dissertation. Due to its unconditional stability, the ADI FDTD method alleviates the CFL stability bound. The objective is to apply the ADI method to the simulation of bio-electromagnetic problems and the computation of the SAR. For large time-steps, the ADI method has larger dispersion and phase errors than the explicit FDTD method, but it is still useful for the computation of SAR where those errors are tolerable. An improved anisotropic-material Perfectly-Matched-Layer (PML) Absorbing-Boundary-Condition (ABC) is presented for the ADI FDTD method. The material-independent D-H-field formulation of the PML ABC leads to an efficient and simple implementation and allows the truncation of dispersive material models. Furthermore, this formulation is easily extended to nth-order dispersive materials. Numerical results for reflection errors associated with the PML and their dependence on parameters like PML conductivity and time-step size are investigated.
Furthermore, uniform and expanding grid implementations of the ADI FDTD method are used to compute the SAR distribution inside spherical objects representative of bio-electromagnetic problems. Different grid implementation sizes are considered, and errors associated with the ADI FDTD method are investigated by comparing numerical results to those obtained using the explicit FDTD method.
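    To make the CFL argument above concrete, here is a minimal sketch of the stability bound for the explicit 3-D Yee scheme, dt <= 1 / (c * sqrt(1/dx² + 1/dy² + 1/dz²)); the cell sizes are illustrative, not taken from the dissertation:

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s

def cfl_dt(dx: float, dy: float, dz: float, c: float = C0) -> float:
    """Largest stable time-step for the explicit 3-D Yee FDTD scheme:
    dt <= 1 / (c * sqrt(1/dx^2 + 1/dy^2 + 1/dz^2))."""
    return 1.0 / (c * math.sqrt(1/dx**2 + 1/dy**2 + 1/dz**2))

# Halving the cell size to resolve fine anatomical detail halves the
# stable time-step, doubling the step count for a fixed simulated time.
for dx in (1e-3, 0.5e-3, 0.25e-3):  # uniform cubic cells, in meters
    print(f"dx = {dx*1e3:.2f} mm -> dt <= {cfl_dt(dx, dx, dx)*1e12:.3f} ps")
```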
  • Forwarding Engine for IPv6
    (2003-06-03) Sawhney, Ishdeep Singh; Dr. Gregory T. Byrd, Committee Member; Dr. Yannis Viniotis, Committee Member; Dr. Paul D. Franzon, Committee Chair
    We focus on a forwarding engine for million-entry IPv6 (Internet Protocol version 6) routing tables. The memory requirements are analyzed for a trie-based scheme and a binary-search scheme for IP address lookup. We also develop an architecture to bound the worst-case update performance of lookup schemes. The scalability of the two lookup schemes was analyzed with respect to increasing routing table size and increasing address size. Currently available DRAM memories were analyzed for memory access requirements, and memory mapping schemes were developed to improve the lookup performance. The trie-based scheme was analyzed with respect to variations in different parameters such as depth and pipeline stages. The update performance of IP lookup schemes was identified as a potential problem, and an architecture was developed to bound the worst-case performance. The update mechanism is independent of the lookup scheme and is implemented in hardware. The implementation is done in a 0.25 μm CMOS cell library.
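    As an illustration of the lookup operation both schemes accelerate, here is a minimal 1-bit-per-level trie for longest-prefix matching; real IPv6 engines use multi-bit strides, pipelining, and DRAM-aware layouts, none of which are modeled here:

```python
# A minimal bitwise trie for longest-prefix-match lookup, shown on
# short 8-bit toy prefixes instead of 128-bit IPv6 addresses.

class TrieNode:
    __slots__ = ("children", "next_hop")
    def __init__(self):
        self.children = [None, None]
        self.next_hop = None  # set if a prefix ends at this node

def insert(root, prefix_bits: str, next_hop: str):
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root, addr_bits: str):
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop  # remember the longest match so far
    return best

root = TrieNode()
insert(root, "0001", "port1")    # 0001/4
insert(root, "000110", "port2")  # 000110/6, more specific
print(lookup(root, "00011010"))  # -> port2 (longest match wins)
```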
  • Improving Power and Performance Efficiency in Parallel and Distributed Computing Systems
    (2009-11-13) Lim, Min Yeol; Dr. George N. Rouskas, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Xiaosong Ma, Committee Member; Dr. Robert J. Fowler, Committee Member; Dr. Vincent W. Freeh, Committee Chair
    For decades, high-performance computing systems have focused on increasing maximum performance at any cost. A consequence of this devotion to boosting performance is significantly increased power consumption. The most powerful supercomputers require up to 10 megawatts of peak power – enough to sustain a city of 40,000. However, some of that power may be wasted with little or no performance gain, because applications do not require peak performance all the time. Therefore, improving power and performance efficiency becomes one of the primary concerns in parallel and distributed computing. Our goal is to build a runtime system that can understand power-performance tradeoffs and balance power consumption and performance penalty adaptively. In this thesis, we make the following contributions. First, we develop an MPI runtime system that can dynamically balance power and performance tradeoffs in MPI applications. Our system dynamically identifies power-saving opportunities without prior knowledge about system behaviors and then determines the best p-state to improve the power and performance efficiency. The system is entirely transparent to MPI applications, with no user intervention. Second, we develop a method for determining minimum energy consumption in voltage and frequency scaling systems for a given time delay. Our approach helps to better analyze the performance of a specific DVFS algorithm in terms of balancing power and performance. Third, we develop a power prediction model that can correlate power and performance data on a chip-multiprocessor machine. Our model shows that power consumption can be estimated from hardware performance counters with reasonable accuracy in various execution environments. Given the prediction model, one can make a runtime decision balancing power and performance tradeoffs on a chip-multiprocessor machine without waiting for actual power measurements. Last, we develop an algorithm to save power by dynamically migrating virtual machines and placing them onto fewer physical machines depending on workloads. Our scheme uses a two-level adaptive buffering scheme that reserves processing capacity. It is designed to adapt the buffer sizes to workloads in order to balance performance violations and energy savings by reducing the amount of energy wasted on the buffers. Using a simulation framework, we study the energy benefits and performance effects of the algorithm, along with its sensitivity to various parameters.
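    A hedged sketch of the third contribution's idea: fit a linear model from performance-counter rates to measured power, then estimate power at runtime without a power meter. The counter choice, coefficients, and data below are invented for illustration:

```python
import numpy as np

# Fit power ~ w0 + w1*IPC + w2*cache_miss_rate on synthetic training
# samples, then predict power from counters alone at runtime.

rng = np.random.default_rng(0)
n = 200
ipc = rng.uniform(0.2, 2.0, n)
miss = rng.uniform(0.0, 0.1, n)
true_power = 40 + 25 * ipc + 150 * miss        # synthetic "measured" watts
measured = true_power + rng.normal(0, 1.0, n)  # measurement noise

X = np.column_stack([np.ones(n), ipc, miss])
w, *_ = np.linalg.lstsq(X, measured, rcond=None)

def predict_power(ipc_now: float, miss_now: float) -> float:
    """Estimate power (W) from current counter-derived rates."""
    return float(w @ np.array([1.0, ipc_now, miss_now]))

print(f"estimated power at IPC=1.5, miss=5%: {predict_power(1.5, 0.05):.1f} W")
```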
  • Just-in-Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs
    (2005-08-08) Kappiah, Nandini; Dr. Frank Mueller, Committee Member; Dr. Vincent W. Freeh, Committee Chair; Dr. Gregory T. Byrd, Committee Member
    Although users of high-performance computing are most interested in raw performance, both energy and power consumption have become critical concerns. As a result, improving the energy efficiency of nodes on HPC machines has become important, and the importance of power-scalable clusters, where frequency and voltage can be dynamically modified, has increased. This thesis investigates the energy consumption and execution time of applications on a power-scalable cluster. It studies intra-node and inter-node effects of memory and communication bottlenecks. Results show that a power-scalable cluster has the potential to save energy by scaling the processor down to lower energy levels. This thesis presents a model that predicts the energy-time trade-off for larger clusters. On power-scalable clusters, one opportunity for saving energy with little or no loss of performance exists when the computational load is not perfectly balanced. This situation occurs frequently, as keeping the load balanced between nodes is one of the long-standing fundamental problems in parallel and distributed computing. However, despite the large body of research aimed at balancing load both statically and dynamically, this problem remains quite difficult to solve. This thesis presents a system called Jitter that reduces the frequency on nodes that are assigned less computation and therefore have idle time, or slack time. This saves energy on these nodes, and the goal of Jitter is to ensure that they arrive 'just in time', so that they avoid increasing overall execution time. Specifically, we dynamically determine which nodes have enough slack time that they can be slowed down, which greatly reduces the energy consumed on those nodes. Thus a superior energy-time trade-off can be achieved. This thesis studies a suite of MPI benchmarks, which are profiled to gather information about the computation and communication occurring in each application. This information is used to analyze various energy-time trade-offs of the benchmark suite. This thesis also proposes an algorithm that exploits load imbalance to reduce energy consumption and minimize delays for parallel applications. This algorithm is validated on a large variety of benchmarks.
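    A minimal sketch of the Jitter idea, assuming compute time scales roughly as 1/f: pick the slowest available frequency whose predicted compute time still fits within the node's compute time plus its observed slack. The p-states and timings are illustrative, not from the thesis:

```python
FREQS_GHZ = [2.0, 1.8, 1.6, 1.4, 1.2]  # assumed available p-states

def pick_frequency(compute_s: float, slack_s: float, f_now: float) -> float:
    """Slowest frequency whose predicted compute time (scaled ~1/f)
    still fits into the node's compute time plus its measured slack."""
    budget = compute_s + slack_s
    candidates = [f for f in FREQS_GHZ
                  if compute_s * (f_now / f) <= budget]
    return min(candidates, default=max(FREQS_GHZ))  # fall back to fastest

# A node that computes 0.8 s then idles 0.3 s per iteration at 2.0 GHz
# can drop to 1.6 GHz: 0.8 * 2.0/1.6 = 1.0 s <= 1.1 s budget.
print(pick_frequency(0.8, 0.3, 2.0))  # -> 1.6
```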
  • Maximizing Service Coverage of Adaptive Services in Wireless Mobile Ad-Hoc Networks using Non-Clustering Approach
    (2003-05-11) Thangavelu, Krithiga; Dr. Douglas S. Reeves, Committee Chair; Dr. Munindar P. Singh, Committee Member; Dr. Gregory T. Byrd, Committee Member
    Wireless Mobile Ad-hoc Networks are characterized by dynamic network topology and lack of network infrastructure. The network fragments into smaller networks and merges over a period of time due to mobility. This makes providing solutions to common network problems, such as routing and QoS provisioning, a challenging task. Services in ad-hoc networks face two-fold problems. Making nodes aware of the availability and the location of services in a dynamically changing network is difficult, especially when such services are not tightly coupled with a fixed infrastructure. Servers may join and leave the network. Nodes may shut down services to conserve energy. The problem is further exacerbated by the limitations posed by the wireless network on the bandwidth and by the limited computational capability of the wireless devices. This thesis addresses the problem of providing continuous and guaranteed access to such centralized services in a mobile wireless ad-hoc network. A distributed algorithm based on the exchange of service provider information is proposed to solve the problem. Previous work addressing the same problem assumes that the nodes move in long-term groups. Our solution does not make this assumption and targets arbitrary motion, so no attempt is made to correlate the movement of the nodes in order to solve the problem. In this thesis, we illustrate that our approach achieves higher service availability than the previous methods, at the cost of a higher number of service instances. The proposed algorithm converges after a time period equivalent to the average propagation delay of the service instance information from a service provider to its reachable nodes. The computational and communication complexity of the algorithm are theoretically proven to be O(s log n) and O(n_g²), respectively, where s is the number of service instances, n is the number of nodes in the ad-hoc network, and n_g is the average number of nodes in a connected component of the graph formed by the nodes in the ad-hoc network. The service cost incurred in providing the necessary service coverage is shown through simulation to be on the order of the number of connected components in the graph formed by the nodes in the ad-hoc network. Simulation results also show that the algorithm provides maximum service coverage independent of the mobility pattern of the nodes in the ad-hoc network.
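    The coverage invariant (at least one service instance per connected component, matching the service-cost result above) can be illustrated as follows; the election rule below, where the lowest-id node in an uncovered component starts an instance, is an invented centralized stand-in for the thesis's distributed exchange of provider information:

```python
from collections import deque

def components(nodes, edges):
    """Connected components of an undirected graph via BFS."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, q = set(), deque([n])
        while q:
            u = q.popleft()
            if u in seen:
                continue
            seen.add(u); comp.add(u); q.extend(adj[u] - seen)
        comps.append(comp)
    return comps

nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]  # two components plus isolated node 6
providers = {3}
for comp in components(nodes, edges):
    if not comp & providers:      # component has no reachable instance
        providers.add(min(comp))  # elect a new instance (toy rule)
print(sorted(providers))          # -> [3, 4, 6]: one per component
```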
  • Network-on-Chip Optimization: as shown through a novel LDPC Decoder Design
    (2010-04-02) Mineo, Christopher Alexander; Dr. Gregory T. Byrd, Committee Member; Dr. Paul D. Franzon, Committee Member; Dr. Donald L. Bitzer, Committee Member; Dr. William Rhett Davis, Committee Chair
    In this work we describe a network-on-chip (NoC) simulator that fills the gap between architectural-level and circuit-level NoC simulation. The core is a fast, high-level, transaction-based NoC simulator, which accesses carefully compiled power, timing, and area models for basic NoC components built from detailed circuit simulation. It makes use of an architectural evaluator, which performs a detailed global interconnect analysis within the framework of industry-standard design tools. Using low-density parity-check (LDPC) decoding as a test vehicle, the NoC simulator is used in an NoC design study and demonstrates a method by which on-chip networks can be optimized. The foundation for architectural and transaction-based modeling is set by a demonstration of the functional 3D NoC Test Chip, a 3-ary 3-cube on-chip interconnection network implemented in a 3-tier three-dimensional integrated circuit (3DIC) technology. The chip, being among the first functional synthesized academic 3DICs, not only demonstrates the feasibility of inter-tier signaling in a 3DIC, but has enabled power measurements that bring credibility to our power modeling methodology. We discuss a characterization methodology for parameterized NoC router components, so that we can quickly and easily estimate the power, performance, and area overhead for a wide range of NoC systems. The completed models are provided so that they may be used for architectural evaluation independently of the remainder of our simulation framework, and we describe the architecture of the NoC simulator itself. The simulator is used to study various LDPC and NoC parameters to help with high-level design decision making. The results make a compelling case for 2D and 3D torus networks and very shallow network memory buffers. We also introduce the concept of processing-element throttling and show its importance. Using the simulator, a Pareto-optimal set of NoC configurations for our application is produced.
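    Transaction-level NoC models typically reduce a transfer to a hop count times a per-hop cost drawn from the circuit-level component models. A minimal sketch of the hop-count piece for a k-ary n-cube such as the 3-ary 3-cube test chip (the per-hop power/latency costs themselves are not shown and would come from characterization):

```python
def torus_hops(src, dst, k: int) -> int:
    """Minimal hop count between coordinates in a k-ary n-cube (torus):
    in each dimension, take the shorter of the two ring directions."""
    return sum(min((s - d) % k, (d - s) % k) for s, d in zip(src, dst))

# In a 3-ary 3-cube, wraparound links keep worst-case distances short:
print(torus_hops((0, 0, 0), (2, 2, 2), k=3))  # -> 3 (one wrap hop per dim)
print(torus_hops((0, 0, 0), (1, 1, 1), k=3))  # -> 3
```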
  • Page Pinning Improves Performance of Generational Garbage Collection
    (2006-05-04) Sawyer, Richard Kevin; Dr. Edward F. Gehringer, Committee Chair; Dr. Gregory T. Byrd, Committee Member; Dr. Suleyman Sair, Committee Member
    Garbage collection became widely used with the growing popularity of the Java programming language. For garbage-collected programs, memory latency is an important performance factor. Thus, a reduction in the cache miss rate will boost performance. In most programs, the majority of references are to newly allocated objects (the nursery). This work evaluates a page-mapping strategy that pins the nursery in a portion of the L2 cache. Pinning maps nursery pages in a way that prevents conflict misses for them, but increases the number of conflict misses for other objects. Cache performance is measured by the miss-rate improvement and speedup obtained by pinning on the SPECjvm98 and the DaCapo benchmarks. Pinning is shown to produce a lower global miss rate than competing virtual-memory mapping strategies, such as page coloring and bin hopping. This improvement in miss rate shortens overall execution time for practically every benchmark and every configuration. Pinning greatly reduces average pause time and variability of pause times for nursery collections.
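    A minimal sketch of the page-coloring arithmetic behind pinning, under an assumed cache geometry (512 KB, 8-way, 4 KB pages, giving 16 page colors); reserving a few colors for the nursery keeps its pages from ever conflicting with other data:

```python
# A physical page's "color" is the group of L2 cache sets it maps to.
# Pinning reserves some colors exclusively for nursery pages; all
# geometry constants below are assumed examples, not from the thesis.

PAGE_SIZE = 4096
L2_BYTES, L2_ASSOC = 512 * 1024, 8
NUM_COLORS = L2_BYTES // (L2_ASSOC * PAGE_SIZE)  # 16 colors
NURSERY_COLORS = set(range(4))                   # reserved for the nursery

def color_of(frame_number: int) -> int:
    return frame_number % NUM_COLORS

def pick_frame(free_frames, for_nursery: bool) -> int:
    """Give nursery pages only nursery colors, and keep all other
    pages out of them (coloring/bin hopping would allow any color)."""
    for f in free_frames:
        if (color_of(f) in NURSERY_COLORS) == for_nursery:
            free_frames.remove(f)
            return f
    raise MemoryError("no free frame of a suitable color")

frames = list(range(64))
print(pick_frame(frames, for_nursery=True))   # -> 0 (color 0, pinned)
print(pick_frame(frames, for_nursery=False))  # -> 4 (first non-nursery color)
```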
  • PMPT - Performance Monitoring PEBS Tool
    (2006-08-11) Beu, Jesse Garrett; Dr. Suleyman Sair, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Thomas M. Conte, Committee Chair
    For many applications, a common source of performance degradation is excessive processor stalling from high memory latencies or poor data placement. Performance degradations from program and memory hierarchy interactions are often difficult for programmers and compilers to correct due to a lack of run-time information or limited knowledge about the underlying problem. By leveraging the Pentium 4 processor's performance monitoring hardware, specific run-time information can be provided, allowing code modifications to reduce or even eliminate problematic code, resulting in reduced execution times. Furthermore, many tools currently available to aid programmers are program-counter-centric. These tools point out which areas of the code produce slowdowns, but they do not directly show where the problem data structures are. This is a common problem in programs that dynamically allocate memory. By creating a "malloc-centric" tool, we can develop an interesting perspective on the memory behavior of the system, providing better insight into the sources of performance problems.
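    A hedged sketch of the "malloc-centric" attribution idea: record each allocation's address range and call site, then map sampled data addresses (PEBS reports the data address of a sampled load) back to the owning data structure. Addresses and site names below are made up for illustration:

```python
import bisect
from collections import Counter

allocs = []  # sorted list of (start, end, allocation_site)

def record_malloc(start: int, size: int, site: str):
    """Called (conceptually) from a malloc wrapper."""
    bisect.insort(allocs, (start, start + size, site))

def owner_of(addr: int):
    """Map a sampled data address to the allocation containing it."""
    i = bisect.bisect_right(allocs, (addr, float("inf"), "")) - 1
    if i >= 0:
        start, end, site = allocs[i]
        if start <= addr < end:
            return site
    return None

record_malloc(0x1000, 4096, "hash_table@foo.c:42")
record_malloc(0x9000, 256, "node_pool@bar.c:17")

misses = [0x1010, 0x1FF0, 0x9040, 0x5000]  # sampled miss addresses
print(Counter(owner_of(a) for a in misses))
# hash_table gets 2 samples, node_pool 1, and one miss is unattributed
```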
  • Predicting Compiler Optimization Performance for High-Performance Computing Applications
    (2005-08-30) Venkatagiri, Radha; Dr. Yan Solihin, Committee Chair; Dr. Gregory T. Byrd, Committee Member; Dr. Rada Y. Chirkova, Committee Member
    High-performance computing application developers often spend a large amount of time tuning their applications. Despite advances in compilers and compiler optimization techniques, tuning efforts are still largely manual and require many trials and errors. One reason for this is that many compiler optimizations do not provide a performance gain in all cases. Complicating the problem further is the fact that many compiler optimizations help performance in some cases but hurt performance in other cases within the same application. Worse still, an optimization may help performance when an application runs with a specific input set but hurt the performance of the same application with a different input set. The central question this work addresses is whether machine learning techniques can be used to automate compiler optimization selection. Artificial Neural Networks (ANNs) and Decision Trees (DTs) are modeled, trained, and used to predict whether loop unrolling should be applied to loops of serial programs. Simple loop characteristics, such as iteration count, nesting level, and body size, are collected and used as input to the ANN or DT. A very simple microbenchmark is used to train the ANN, which is then used to predict the benefit of loop unrolling across different NAS (serial version) benchmarks. We find that an ANN trained using the microbenchmark accurately predicts whether loop unrolling is beneficial in 62% of the cases; for the BT benchmark, the prediction is correct in 82% of the cases. Furthermore, we find that benchmarks such as FT, which perform poorly when tested with an ANN trained with the microbenchmark, yield accurate results in 69% of the cases when tested using an ANN trained with loops from other NAS benchmarks. Decision trees used to classify loops from the NAS benchmarks (as benefiting from loop unrolling or not) were found to have an accuracy of 79.54%. A DT built using the microbenchmark correctly classified NAS loops 53% of the time. Although the results show promise, we believe that to accurately automate compiler optimization selection, more complex loops may need to be modeled in the microbenchmark, and many other factors may need to be taken into account in characterizing each loop nest.
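    A minimal sketch of the decision-tree half of the approach, using scikit-learn as a stand-in for the thesis's own modeling and a purely illustrative training set built from the stated features (iteration count, nesting level, body size):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training set: each row is (iteration_count, nesting_level,
# body_size_statements); the label says whether unrolling helped.
# Both the data and the intuition it encodes (high trip count + small
# body -> unroll) are invented for illustration.
X = [
    [1000, 1,  3], [500, 1,  5], [8, 3, 40], [16, 2, 25],
    [2000, 1,  2], [4,  4, 60], [64, 2, 10], [10, 3, 30],
]
y = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = unrolling was beneficial

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Classify an unseen loop: high trip count, shallow nesting, small body.
print(clf.predict([[1200, 1, 4]]))  # -> [1] on this toy data: unroll
```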
  • Reducing Frequency in Real-Time Systems via Speculation and Fall-Back Recovery
    (2003-04-22) Anantaraman, Aravindh Venkataseshadri; Dr. Eric Rotenberg, Committee Chair; Dr. Gregory T. Byrd, Committee Member; Dr. Frank Mueller, Committee Member; Dr. Alexander G. Dean, Committee Member
    In real-time systems, safe operation requires that tasks complete before their deadlines. Static worst-case timing analysis is used to derive an upper bound on the number of cycles for a task, and this is the basis for a safe frequency that ensures timely completion in any scenario. Unfortunately, it is difficult to tightly bound the number of cycles for a complex task executing on a complex pipeline, and so the safe frequency tends to be over-inflated. Power efficiency is sacrificed for safety. The situation only worsens as advanced microarchitectural techniques are deployed in embedded systems. High-performance microarchitectural techniques such as caching, branch prediction, and pipelining decrease typical execution times. At the same time, it is difficult to tightly bound the worst-case execution time of complex tasks on highly dynamic substrates. As a result, the gap between worst-case execution time and typical execution time is expected to increase. This thesis explores frequency speculation, a technique for reconciling the power/safety trade-off. Tight but unsafe bounds (derived from past task executions) are the basis for a low speculative frequency. The task is divided into multiple smaller sub-tasks and each sub-task is assigned an interim soft deadline, called a checkpoint. Sub-tasks are attempted at the speculative frequency. Continued safe progress of the task as a whole is confirmed for as long as speculative sub-tasks complete before their checkpoints. If a sub-task exceeds its checkpoint (misprediction), the system falls back to a higher recovery frequency that ensures the overall deadline is met in spite of the interim misprediction. The primary contribution of this thesis is the development of two new frequency speculation algorithms. A drawback of the original frequency speculation algorithm is that a sub-task misprediction is detected only after completing the sub-task. The misprediction can be detected earlier through the use of a watchdog timer that expires at the checkpoint unless the sub-task completes in time to advance it to the next checkpoint. Early detection is superior because recovery can be initiated earlier, in the middle of the mispredicted sub-task. This introduces extra slack that can be used to lower the speculative frequency even further. A new issue that arises with early detection is bounding the amount of work that remains in the mispredicted sub-task after the misprediction is detected. The two new algorithms differ in how the unfinished work is bounded. The first algorithm conservatively bounds the execution time of the unfinished portion using the worst-case execution time of the entire sub-task. The second uses more sophisticated analysis to derive a much tighter bound. Both early-detection algorithms outperform the late-detection algorithm. For tight deadlines, the sophisticated analysis of the second early-detection algorithm truly pays off. It yields 60-70% power savings for six real-time applications from the C-lab suite.
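    A sketch of the speculation arithmetic under assumed numbers: run sub-tasks at a low speculative frequency derived from observed (unsafe) cycle counts, and at each checkpoint verify that switching to the recovery frequency now would still cover the remaining worst-case cycles by the deadline. All cycle counts and the deadline below are illustrative:

```python
def recovery_frequency(remaining_wcet_cycles: float,
                       time_left_s: float) -> float:
    """Minimum frequency (Hz) that safely finishes the remaining
    worst-case work by the deadline."""
    return remaining_wcet_cycles / time_left_s

F_MAX = 1.0e9              # 1 GHz processor (assumed)
wcet = 2.0e6               # worst-case cycles per sub-task (static analysis)
observed = 1.0e6           # typical cycles per sub-task (past runs)
n, deadline_s = 10, 0.025

f_spec = n * observed / deadline_s   # 0.4 GHz speculative frequency
elapsed, ok = 0.0, True
for done in range(1, n):             # check after each interim checkpoint
    elapsed += observed / f_spec     # sub-task met its checkpoint
    f_rec = recovery_frequency((n - done) * wcet, deadline_s - elapsed)
    ok &= f_rec <= F_MAX             # recovery must remain feasible
    print(f"after sub-task {done}: recovery needs {f_rec/1e9:.2f} GHz")
print("deadline guaranteed" if ok else "speculative frequency too low")
```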
  • A Simple Mechanism for Detecting Ineffectual Instructions in Slipstream Processors
    (2002-05-15) Koppanalil, Jinson Joseph; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair; Dr. Thomas M. Conte, Committee Member
    The slipstream paradigm harnesses multiple processing elements in a chip multiprocessor (CMP) to speed up a single, sequential program. It does this by running two redundant copies of the program, one slightly ahead of the other. The leading program is the Advanced Stream (A-stream) and the trailing program is the Redundant Stream (R-stream). Predicted non-essential computation is speculatively removed from the A-stream. The A-stream is sped up because it fetches and executes fewer instructions than the original program. The trailing R-stream checks the control flow and data flow outcomes of the A-stream, and redirects it when it fails to make correct forward progress. The R-stream also exploits the A-stream outcomes as accurate branch and value predictions. Therefore, although the R-stream retires the same number of instructions as the original program, it fetches and executes much more efficiently. As a result, both program copies finish sooner than the original program. A slipstream component called the instruction-removal detector (IR-detector) detects past-ineffectual instructions in the R-stream and selects them for possible removal from the A-stream in the future. The IR-detector uses a two-step selection process. First, it selects key trigger instructions -- unreferenced writes, non-modifying writes, and correctly-predicted branches. A table similar to a conventional register rename table can easily detect unreferenced and non-modifying writes. The second step, called back-propagation, selects computation chains feeding the trigger instructions. In an explicit implementation of back-propagation, retired R-stream instructions are buffered and consumer instructions are connected to their producer instructions using a configurable interconnection network. Consumers that are selected because they are ineffectual use these connections to propagate their ineffectual status to their producers, so that they get selected, too. Explicit back-propagation is complex because it requires a configurable interconnection network. This thesis proposes a simpler implementation of back-propagation, called implicit back-propagation. The key idea is to logically monitor the A-stream instead of the R-stream. Now, the IR-detector only performs the first step, i.e., it selects unreferenced writes, non-modifying writes, and correctly-predicted branches. After building up confidence, these trigger instructions are removed from the A-stream. Once removed, their producers become unreferenced writes in the A-stream (because they no longer have consumers). After building up confidence, the freshly exposed unreferenced writes are also removed, exposing additional unreferenced writes. This process continues iteratively, until eventually entire non-essential dependence chains are removed. By logically monitoring the A-stream, back-propagation is reduced to detecting unreferenced writes. Implicit back-propagation eliminates complex hardware and performs within 0.5% of explicit back-propagation.
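    A minimal sketch of the first IR-detector step on a toy retired-instruction trace: a table keyed by architectural register remembers the last write, and a write that is overwritten before any read is flagged as an unreferenced-write trigger (non-modifying writes and correctly-predicted branches are omitted for brevity):

```python
def find_unreferenced_writes(trace):
    """trace: list of (index, dest_reg_or_None, source_regs).
    Returns indices of writes that were overwritten before any read."""
    last_write = {}   # reg -> index of a pending (not-yet-read) write
    triggers = []
    for idx, dest, srcs in trace:
        for r in srcs:
            last_write.pop(r, None)           # pending write was read
        if dest is not None:
            if dest in last_write:
                triggers.append(last_write[dest])  # overwritten unread
            last_write[dest] = idx
    return triggers

trace = [
    (0, "r1", []),       # write r1
    (1, "r2", ["r1"]),   # reads r1, writes r2
    (2, "r2", []),       # overwrites r2 before anyone read it
    (3, "r3", ["r2"]),
]
print(find_unreferenced_writes(trace))  # -> [1]: instruction 1 triggers
```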
  • Slipstream-Based Steering for Clustered Microarchitectures
    (2003-06-20) Gupta, Nikhil; Dr. Thomas M. Conte, Committee Member; Dr. Gregory T. Byrd, Committee Member; Dr. Eric Rotenberg, Committee Chair
    To harvest increasing levels of ILP while maintaining a fast clock, clustered microarchitectures have been proposed. However, the fast clock enabled by clustering comes at the cost of multiple cycles to communicate values among clusters. A chief performance limiter of a clustered microarchitecture is inter-cluster communication between instructions. Specifically, inter-cluster communication between critical-path instructions is the most harmful. The slipstream paradigm identifies critical-path instructions in the form of effectual instructions. We propose eliminating virtually all inter-cluster communication among effectual instructions, simply by ensuring that the entire effectual component of the program executes within a cluster. This thesis proposes two execution models: the replication model and the dedicated-cluster model. In the replication model, a copy of the effectual component is executed on each of the clusters and the ineffectual instructions are shared among the clusters. In the dedicated-cluster model, the effectual component is executed on a single cluster (the effectual cluster), while all ineffectual instructions are steered to the remaining clusters. Outcomes of ineffectual instructions are not needed (in hindsight), hence their execution can be exposed to inter-cluster communication latency without significantly impacting overall performance. IPC of the replication model on dual clusters and quad clusters is virtually independent of inter-cluster communication latency. IPC decreases by 1.3% and 0.8%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, when inter-cluster communication latency increases from 2 cycles to 16 cycles. In contrast, IPC of the best-performing dependence-based steering decreases by 35% and 55%, on average, for a dual-cluster and quad-cluster microarchitecture, respectively, over the same latency range. For dual clusters and quad clusters with low latencies (fewer than 8 cycles), slipstream-based steering underperforms conventional steering because improved latency tolerance is outweighed by higher contention for execution bandwidth within clusters. However, the balance shifts at higher latencies. For a dual-cluster microarchitecture, dedicated-cluster-based steering outperforms the best conventional steering on average by 10% and 24% at 8 and 16 cycles, respectively. For a quad-cluster microarchitecture, replication-based steering outperforms the best conventional steering on average by 10% and 32% at 8 and 16 cycles, respectively. Slipstream-based steering desensitizes the IPC performance of a clustered microarchitecture to tens of cycles of inter-cluster communication latency. As feature sizes shrink, it will take multiple cycles to propagate signals across the processor chip. For a clustered microarchitecture, this implies that with further scaling of feature size, the inter-cluster communication latency will increase to the point where microarchitects must manage a distributed system on a chip. Thus, if individual clusters are clocked faster, at the expense of increasing inter-cluster communication latency, performance of a clustered microarchitecture using slipstream-based steering will improve considerably as compared to a clustered microarchitecture using the best conventional steering approach.
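    A toy sketch of the two steering policies on a stream of (id, is-effectual) instructions; the round-robin spreading of ineffectual instructions and the cluster numbering are illustrative choices, not the thesis's exact heuristics:

```python
from itertools import cycle

def steer(instrs, n_clusters: int, model: str):
    """Return {cluster: [instruction ids]} under the 'replication' or
    'dedicated' model for (id, is_effectual) pairs."""
    clusters = {c: [] for c in range(n_clusters)}
    spread = cycle(range(1, n_clusters))  # non-effectual clusters (dedicated)
    rr = cycle(range(n_clusters))
    for iid, effectual in instrs:
        if model == "replication":
            if effectual:
                for c in clusters:        # copy effectual work everywhere
                    clusters[c].append(iid)
            else:
                clusters[next(rr)].append(iid)
        else:                             # dedicated-cluster model
            clusters[0 if effectual else next(spread)].append(iid)
    return clusters

instrs = [(0, True), (1, False), (2, True), (3, False), (4, False)]
print(steer(instrs, 2, "replication"))  # effectual ids 0,2 on both clusters
print(steer(instrs, 4, "dedicated"))    # effectual ids confined to cluster 0
```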
  • A Toolkit for Intrusion Alerts Correlation based on Prerequisites and Consequences of Attacks
    (2002-12-19) Cui, Yun; Dr. Peng Ning, Committee Chair; Dr. Douglas S. Reeves, Committee Member; Dr. Gregory T. Byrd, Committee Member
    Intrusion detection has been studied for about twenty years. Intrusion Detection Systems (IDSs) are usually considered the second line of defense against malicious activities, alongside prevention-based security mechanisms such as authentication and access control. However, traditional IDSs have two major weaknesses. First, they usually focus on low-level attacks or anomalies and raise alerts independently, though there may be logical connections between them. Second, traditional IDSs report many false alerts, which are mixed with true alerts. Thus, intrusion analysts or system administrators are often overwhelmed by the volume of alerts. Motivated by this observation, we propose a technique to construct high-level attack scenarios by correlating low-level intrusion alerts using their prerequisites and consequences. The prerequisite of an alert specifies what must be true in order for the corresponding attack to be successful, and the consequence describes what is possibly true if the attack indeed succeeds. We conjecture that alerts that are correlated together have a higher probability of being true alerts than uncorrelated ones. If this is true, then through this correlation we can not only construct high-level attack scenarios but also differentiate between true alerts and false alerts. In this thesis work, I implement an alert correlation tool based on this framework. It consists of the following components: a knowledge base, an alert preprocessor, an alert correlation engine, and a graph output component. To further facilitate analysis of large amounts of intrusion alerts, I develop three utilities, namely adjustable graph reduction, focused analysis, and graph decomposition. I also perform a sequence of experiments to evaluate the aforementioned techniques using the DARPA 2000 evaluation datasets and the DEFCON 8 CTF dataset. The experimental results show that the proposed techniques are effective. First, we successfully construct the attack scenarios behind the low-level alerts; second, the false alert rates are significantly reduced once attention is focused on alerts that are correlated with others; third, the three utilities greatly reduce the complexity of the correlated alerts while maintaining their structure.
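    A minimal sketch of the correlation rule on a made-up alert stream: connect alert a to a later alert b whenever some consequence of a's attack type matches a prerequisite of b's. Predicates are simplified to plain strings here; the real model instantiates them with alert attributes such as the target IP address:

```python
# Toy knowledge base: attack type -> (prerequisites, consequences).
# The attack names echo the DARPA 2000 scenario style but are
# simplified for illustration.
KNOWLEDGE = {
    "SadmindPing":        (set(),         {"HostAlive"}),
    "SadmindBufOverflow": ({"HostAlive"}, {"RootAccess"}),
    "InstallDDoSAgent":   ({"RootAccess"}, {"DDoSAgentReady"}),
}

def correlate(alerts):
    """alerts: list of (time, attack_type). Returns attack-scenario
    edges (a, b) where a consequence of a satisfies a prerequisite
    of a strictly later alert b."""
    edges = []
    for i, (t1, a) in enumerate(alerts):
        for t2, b in alerts[i + 1:]:
            if t1 < t2 and KNOWLEDGE[a][1] & KNOWLEDGE[b][0]:
                edges.append((a, b))
    return edges

alerts = [(1, "SadmindPing"), (2, "SadmindBufOverflow"),
          (3, "InstallDDoSAgent")]
print(correlate(alerts))
# -> [('SadmindPing', 'SadmindBufOverflow'),
#     ('SadmindBufOverflow', 'InstallDDoSAgent')]
```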
  • Tracing Intruders behind Stepping Stones
    (2005-08-06) Wang, Xinyuan; Dr. Douglas S. Reeves, Committee Chair; Dr. Peng Ning, Committee Member; Dr. George N. Rouskas, Committee Member; Dr. Gregory T. Byrd, Committee Member
    Network-based intruders seldom attack directly from their own hosts but rather stage their attacks through intermediate 'stepping stones' to conceal their identity and origin. To track down and apprehend the perpetrators behind stepping stones, it is critically important to be able to correlate connections through stepping stones. Tracing intruders behind stepping stones and correlating intrusion connections through them are challenging due to various readily available evasive countermeasures:
      • Installing and using backdoor relays (e.g., netcat) at intermediate stepping stones to evade logging of normal logins.
      • Using different types of connections (e.g., TCP, UDP) at different portions of the connection chain through stepping stones to complicate connection matching.
      • Using encrypted connections (with different keys) across stepping stones to defeat any content-based comparison.
      • Introducing timing perturbation at intermediate stepping stones to counteract timing-based correlation of encrypted connections.
    In this dissertation, we address these challenges in detail and design solutions to them. For unencrypted intrusion connections through stepping stones, we design and implement a novel intrusion tracing framework called Sleepy Watermark Tracing (SWT), which applies principles of steganography and active networking. SWT is "sleepy" in that it introduces no overhead when no intrusion is detected. Yet it is "active" in that when an intrusion is detected, the host under attack injects a watermark into the backward connection of the intrusion, and wakes up and collaborates with intermediate routers along the intrusion path. Our prototype shows that SWT can trace back to the trustworthy security gateway closest to the origin of the intrusion with only a single packet from the intruder. With its unique active tracing, SWT can even trace when intrusion connections are idle. Encryption of connections through stepping stones defeats any content-based correlation and makes correlation of intrusion connections more difficult. Based on inter-packet timing characteristics, we develop a novel correlation scheme for both encrypted and unencrypted connections. We show that (after some filtering) inter-packet delays (IPDs) of both encrypted and unencrypted interactive connections are preserved across many router hops and stepping stones. The effectiveness of IPD-based correlation requires that timing characteristics be distinctive enough to identify connections. We have found that normal interactive connections such as telnet, SSH, and rlogin are almost always distinctive enough to provide correct correlation across stepping stones. Timing perturbation of packet flows at intermediate stepping stones poses an additional challenge in correlating encrypted connections through stepping stones. Timing perturbation could either make unrelated flows exhibit similar timing characteristics or make related flows exhibit different timing characteristics, which would either increase the false positive rate or decrease the true positive rate of timing-based correlation. To address this new challenge, we develop a novel watermark-based correlation scheme that is designed to be specifically robust against such timing perturbation. The idea is to actively embed a unique watermark into the flow by slightly adjusting the timing of selected packets of the flow.
If the embedded watermark is unique enough and robust enough against the timing perturbation by the adversary, the watermarked flow could be uniquely identified and thus effectively correlated. By utilizing redundancy techniques, we develop a robust watermark correlation framework that reveals a rather surprising result on the inherent limits of independent and identically distributed (iid) random timing perturbations over sufficiently long flows. We also identify the tradeoffs between the defining characteristics of the timing perturbation and the achievable correlation effectiveness. Our experiments show that our watermark based correlation performs significantly better than existing passive timing based correlation in the face of random timing perturbation. In this research, we learn some general lessons about tracing and correlating intrusion connections through stepping stones. Specifically, we demonstrate the significant advantages of active correlation approach over passive correlation approaches in the presence of active countermeasures. We also demonstrate that information hiding and redundancy techniques can be used to build highly effective intrusion tracing and correlation frameworks.
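    A minimal sketch of watermarking a flow's timing: encode one bit in the parity of each selected inter-packet delay (IPD) after quantization, adding delay only. The step size and the one-bit-per-IPD coding below are illustrative; the actual scheme adds redundancy so the watermark survives iid timing perturbation:

```python
import math

Q = 0.010  # quantization step: 10 ms (assumed)

def embed(ipds, bits):
    """Delay packets so each selected IPD encodes one watermark bit in
    the parity of its quantized value; delays are only ever added."""
    out = []
    for ipd, bit in zip(ipds, bits):
        k = math.ceil(ipd / Q)   # round up: never shrink an IPD
        if k % 2 != bit:
            k += 1               # bump to the next multiple for parity
        out.append(k * Q)
    return out

def decode(ipds):
    """Recover the bit from each IPD's quantized parity."""
    return [round(ipd / Q) % 2 for ipd in ipds]

ipds = [0.043, 0.127, 0.071, 0.258]    # original IPDs in seconds
marked = embed(ipds, [1, 0, 0, 1])
print(decode(marked))                  # -> [1, 0, 0, 1]
```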
