NC State University Libraries

Browsing by Author "Dr. Frank Mueller, Committee Chair"

Now showing 1 - 11 of 11
  • Analyzing Memory Performance Bottlenecks in OpenMP Programs on SMP Architectures using ccSIM
    (2003-08-14) Nagarajan, Anita; Dr. Frank Mueller, Committee Chair; Dr. Gregory Byrd, Committee Member; Dr. Purushothaman Iyer, Committee Member
    As computing demands increase, performance analysis of application behavior has become a widely researched topic. In order to obtain optimal application performance, an understanding of the interaction between hardware and software is essential. Program performance is quantified in terms of various metrics, and it is important to obtain detailed information in order to determine potential bottlenecks during execution. Once the exact causes of performance problems are isolated, optimizations to overcome them can be proposed. In SMP systems, sharing of data can result in increased program latency due to the requirement of maintaining memory coherence. The main contribution of this thesis is ccSIM, a cache-coherent multilevel memory hierarchy simulator for shared-memory multiprocessor systems, fed by traces obtained through on-the-fly dynamic binary rewriting of OpenMP programs. Interleaved parallel trace execution is simulated for the different processors, and results are studied for several OpenMP benchmarks. The coherence-related metrics obtained from ccSIM are validated against hardware performance counters to verify simulation accuracy. Cumulative as well as per-reference statistics are provided, which help in a detailed analysis of performance and in isolating bottlenecks in the memory hierarchy. Results obtained for coherence events from the simulations indicate a good match with hardware counters for a Power3 SMP node. The exact locations of invalidations in source code and the coherence misses caused by these invalidations are derived. This information, together with the classification of invalidations, helps in proposing optimization techniques or code transformations that could potentially yield better performance for a particular application on the architecture of interest.
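The invalidation bookkeeping behind such coherence metrics can be sketched in a few lines. The tracker below is a hypothetical, radically simplified stand-in for ccSIM (which simulates a full multilevel hierarchy driven by binary-rewriting traces), but it shows how invalidations, and the coherence misses they later cause, are counted from an interleaved trace:

```python
# Minimal sketch of invalidation-based coherence bookkeeping. All names and
# the trace format are illustrative, not ccSIM's actual interfaces.
from collections import defaultdict

class CoherenceTracker:
    """Tracks which processors hold each cache line and counts
    invalidations and the coherence misses they later cause."""
    def __init__(self):
        self.sharers = defaultdict(set)      # line -> procs holding a copy
        self.invalidated = defaultdict(set)  # line -> procs whose copy was invalidated
        self.invalidations = 0
        self.coherence_misses = 0

    def access(self, proc, line, is_write):
        if proc in self.invalidated[line]:
            # miss caused by an earlier invalidation from another processor
            self.coherence_misses += 1
            self.invalidated[line].discard(proc)
        if is_write:
            # the writer invalidates every other sharer's copy
            others = self.sharers[line] - {proc}
            self.invalidations += len(others)
            self.invalidated[line] |= others
            self.sharers[line] = {proc}
        else:
            self.sharers[line].add(proc)

# interleaved trace: (processor, cache line, is_write)
trace = [(0, 0x40, False), (1, 0x40, False), (0, 0x40, True), (1, 0x40, False)]
t = CoherenceTracker()
for proc, line, w in trace:
    t.access(proc, line, w)
```

Per-reference statistics follow by keying the same counters on the source-code location attached to each trace record.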
  • Buddy Threading in Distributed Applications on Simultaneous Multi-Threading Processors
    (2005-04-19) Vouk, Nikola; Dr. Michael Rappa, Committee Member; Dr. Frank Mueller, Committee Chair; Dr. Vincent Freeh, Committee Member
    Modern processors provide a multitude of opportunities for instruction-level parallelism that most current applications cannot fully utilize. To increase processor core execution efficiency, modern processors can execute instructions from two or more tasks simultaneously in the functional units in order to increase the execution rate of instructions per cycle (IPC). These processors implement simultaneous multi-threading (SMT), which increases processor efficiency through thread-level parallelism, but problems can arise due to cache conflicts and CPU resource starvation. Consider high-end applications typically running on clusters of commodity computers. Each compute node is sending, receiving and calculating data for some application. Non-SMT processors must compute data, context switch, communicate that data, context switch, compute more data, and so on. The computation phases often utilize floating-point functional units, while communication relies on integer functional units. Until recently, modern communication libraries were not able to take complete advantage of this parallelism due to the lack of SMT hardware. This thesis explores the feasibility of exploiting this natural compute/communicate parallelism in distributed applications, especially for applications that are not optimized for the constraints imposed by SMT hardware. This research explores hardware and software thread synchronization primitives to reduce inter-thread communication latency and operating system context switch time in order to maximize a program's ability to compute and communicate simultaneously. This work investigates the reduction of inter-thread communication through hardware synchronization primitives. These primitives allow threads to 'instantly' notify each other of changes in program state.
We also describe a thread-promoting buddy scheduler that allows threads to always be co-scheduled together, thereby providing an application the exclusive use of all processor resources, reducing context switch overhead, inter-thread communication latency and scheduling overhead. Finally, we describe the design and implementation of a modified MPI over Channel (MPICH) MPI library that allows legacy applications to take advantage of SMT processor parallelism. We conclude with an evaluation of these techniques using several ASCI benchmarks. Overall, we show that compute-communicate application performance can be further improved by taking advantage of the native parallelism provided by SMT processors. To fully exploit this advantage, these applications must be written to overlap communication with computation as much as possible.
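The compute/communicate overlap this work targets can be illustrated with ordinary threads: a compute thread hands finished blocks to a communicate thread through a queue, so both proceed simultaneously. This is a sketch only; the thesis uses SMT-aware synchronization primitives and MPI, while the "network" here is just a list:

```python
# Illustrative compute/communicate overlap with two threads and a queue.
# A None sentinel tells the communication thread that computation is done.
import threading, queue

work = queue.Queue()
network = []    # stands in for the interconnect

def communicate():
    while True:
        item = work.get()
        if item is None:        # sentinel: computation finished
            break
        network.append(item)    # "send" the finished block

comm = threading.Thread(target=communicate)
comm.start()
for block in range(5):
    result = sum(x * x for x in range(block * 10))   # "compute" a block
    work.put(result)            # next block computes while this one sends
work.put(None)
comm.join()
```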
  • Compositional Static Cache Analysis Using Module-level Abstraction
    (2003-12-10) Patil, Kaustubh Sambhaji; Dr. Frank Mueller, Committee Chair; Dr. Alexander Dean, Committee Member; Dr. Eric Rotenberg, Committee Member
    Static cache analysis is utilized for timing analysis to derive worst-case execution time of a program. Such analysis is constrained by the requirement of an inter-procedural analysis for the entire program. But the complexity of cycle-level simulations for entire programs currently restricts the feasibility of static cache analysis to small programs. Computationally complex inter-procedural analysis is needed to determine caching effects, which depend on knowledge of data and instruction references. Static cache simulation traditionally relies on absolute address information of instruction and data elements. This thesis presents a framework to perform worst-case static cache analysis for direct-mapped instruction caches using a module-level and compositional approach, thus addressing the issue of complexity of inter-procedural analysis for an entire program. The module-level analysis parameterizes the data-flow information in terms of the starting offset of a module. The compositional analysis stage uses this parameterized data-flow information for each module. Thus, the emphasis here is on handling most of the complexity in the module-level analysis and performing as little analysis as possible at the compositional level. The experimental results show that the compositional analysis framework provides equally accurate predictions when compared with the simulation approach that uses complete inter-procedural analysis.
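The key parameterization, cache mapping as a function of a module's starting offset, can be made concrete for a direct-mapped cache. The geometry and instruction layouts below are assumed for illustration; the thesis framework additionally propagates parameterized data-flow information, not just line mappings:

```python
# Sketch: direct-mapped instruction-cache mapping parameterized by a
# module's base address. Constants and module layouts are hypothetical.
LINE_SIZE = 16    # bytes per cache line
NUM_LINES = 64    # direct-mapped: line index = (address // LINE_SIZE) % NUM_LINES

def line_of(instr_offset, module_base):
    """Cache line an instruction maps to, parameterized by the module base."""
    return ((module_base + instr_offset) // LINE_SIZE) % NUM_LINES

def conflicts(offsets_a, offsets_b, base_a, base_b):
    """Lines where two modules' instructions collide, resolvable only at the
    compositional stage once both base addresses are known."""
    lines_a = {line_of(o, base_a) for o in offsets_a}
    lines_b = {line_of(o, base_b) for o in offsets_b}
    return lines_a & lines_b

# two hypothetical modules, each a list of instruction byte offsets
mod_a = [0, 16, 32]
mod_b = [0, 16]
# placed exactly one cache apart, mod_b aliases onto mod_a's first two lines
shared = conflicts(mod_a, mod_b, base_a=0, base_b=NUM_LINES * LINE_SIZE)
```

The module-level stage does the expensive work once per module in terms of the symbolic base; the compositional stage only substitutes concrete bases and intersects, which is what keeps whole-program analysis cheap.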
  • Frequency-aware Static Timing Analysis for Power-aware Embedded Architectures
    (2004-03-14) Seth, Kiran Ravi; Dr. Frank Mueller, Committee Chair; Dr. Alexander Dean, Committee Member; Dr. Eric Rotenberg, Committee Member
    Power is a valuable resource in embedded systems, as the lifetime of many such systems is constrained by their battery capacity. Recent advances in processor design have added support for dynamic frequency/voltage scaling (DVS) for saving power. Recent work on real-time scheduling focuses on saving power in static as well as dynamic scheduling environments by exploiting idle time and slack due to early task completion for DVS of subsequent tasks. These scheduling algorithms rely on a priori knowledge of worst-case execution times (WCET) for each task. They assume that DVS has no effect on the worst-case execution cycles (WCEC) of a task and scale the WCET according to the processor frequency. However, for systems with memory hierarchies, the WCEC typically does change under DVS due to frequency modulation. Hence, current assumptions used by DVS schemes result in a highly exaggerated WCET. The research presented contributes novel techniques for tight and flexible static timing analysis particularly well-suited for dynamic scheduling schemes. The technical contributions are as follows: (1) The problem of changing execution cycles due to scaling techniques is assessed. (2) A parametric approach to bounding the WCET statically with respect to frequency is proposed. Using a parametric model, the effect of changes in frequency on the WCEC can be captured and, thus, the WCET over any frequency range can be accurately modeled. (3) The design and implementation of the frequency-aware static timing analysis (FAST) tool, based on prior experience with static timing analysis, is discussed. (4) Experiments demonstrate that the FAST tool provides safe upper bounds on the WCET that are also tight: the tool captures the WCET of six benchmarks using equations that overestimate the WCET by less than 1%.
FAST equations can also be used to improve existing DVS scheduling schemes to ensure that the effect of frequency scaling on the WCET is considered and that the WCET used is not exaggerated. (5) Three DVS scheduling schemes are enhanced by incorporating FAST into them, and power consumption is shown to decrease further.
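Why the WCEC changes with frequency can be shown with a toy parametric model: core cycles are frequency-invariant, but each memory access costs a fixed amount of wall-clock time and therefore more cycles at a higher clock. All constants below are invented for illustration and are not FAST's actual equations:

```python
# Toy parametric WCET model: memory latency is constant in *time*, so its
# cycle cost grows with frequency. All numbers are made up.
from math import ceil

CORE_CYCLES = 1_000_000   # worst-case cycles executing instructions
MEM_ACCESSES = 10_000     # worst-case memory accesses on the WCET path
MEM_LATENCY_NS = 60       # latency per access, constant in wall-clock time

def wcec(freq_mhz):
    """Worst-case execution cycles as a function of clock frequency (MHz)."""
    return CORE_CYCLES + MEM_ACCESSES * ceil(MEM_LATENCY_NS * freq_mhz / 1000)

def wcet_us(freq_mhz):
    """Worst-case execution time in microseconds at the given frequency."""
    return wcec(freq_mhz) / freq_mhz

# naive DVS scaling assumes the WCEC never changes, inflating the bound
naive_us = wcec(1000) / 500        # scale the 1000 MHz cycle count to 500 MHz
parametric_us = wcet_us(500)
```

Here the naive bound (3200 µs) exaggerates the parametric one (2600 µs), which is exactly the slack a frequency-aware DVS scheduler can reclaim.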
  • Hybrid online/offline optimization of Application Binaries
    (2004-07-08) Dhoot, Anubhav Vijay; Dr. Frank Mueller, Committee Chair; Dr. Xiaosong Ma, Committee Member; Dr. Peng Ning, Committee Member
    Long-running parallel applications suffer from performance limitations, particularly due to inefficiencies in accessing memory. Dynamic optimizations, i.e., optimizations performed at execution time, provide opportunities not available at compile or link time to improve performance and remove bottlenecks for the current execution. In particular, they enable one to apply transformations that tune performance for a particular execution instance. This can potentially capture effects of the environment and input values, and it also allows optimization of code from other sources, such as pre-compiled libraries and mixed-language sources. This thesis presents the design and implementation of components of a dynamic optimizing system for long-running parallel applications that uses dynamic binary rewriting. The system uses a hybrid online/offline model to collect a memory profile that guides the choice of functions to be optimized. We describe the design and implementation of a module that enables optimization of a desired function from the executable, i.e., without relying on the source code. We also present the module that enables hot swapping of code in an executing application. Dynamic binary rewriting is used to hot-swap the bottleneck function with an optimized function while the application is still executing. Binary manipulation is used in two ways: first to collect a memory profile through instrumentation to identify bottleneck functions, and then to control hot-swapping of code through program transformation. We present experiments as a proof of concept for implementations of the remaining components of the framework and for validation of the existing modules.
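The hot-swap idea can be illustrated with an indirection table whose entry is retargeted at run time. The real system patches binary code in place via dynamic rewriting rather than using a table, so everything below, including the function names, is a purely illustrative analogue:

```python
# Sketch of hot-swapping through an indirection table: call sites route
# through a dispatch slot whose target can be replaced while running.
dispatch = {}

def call(name, *args):
    """All call sites go through the table, so retargeting one entry
    redirects every future call without touching the callers."""
    return dispatch[name](*args)

def slow_sum(xs):           # the "bottleneck" version found by profiling
    total = 0
    for x in xs:
        total += x
    return total

def fast_sum(xs):           # the optimized replacement
    return sum(xs)

dispatch["hot_function"] = slow_sum
before = call("hot_function", [1, 2, 3])
dispatch["hot_function"] = fast_sum     # hot swap while "executing"
after = call("hot_function", [1, 2, 3])
```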
  • Remote Data Collection and Analysis using Mobile Agents and Service-Oriented Architectures.
    (2008-08-19) Girish Chandra, Harsha; Dr. Helen Gu, Committee Member; Dr. Frank Mueller, Committee Chair; Dr. Nagiza Samatova, Committee Member
    The ubiquity of wireless systems has ushered us into a new era of mobile computing. With the emergence of superior input/output and communication hardware and cheap data services, mobile phones have become a platform for offering new and exotic services. Superior GUIs and remote connectivity make mobile phones and PDAs good candidates for data collection, but their limited battery life and computational prowess make them poor computational devices. We introduce a novel architecture that builds on agents on mobile phones as the front end and a service-oriented architecture composed of high-performance devices as the back end. Agent-based computing, which has proved to be advantageous for desktops/servers, can also encompass hand-held devices to provide us with new service management capabilities. In this thesis, we discuss a new service deployment strategy on mobile phones based on mobile agents. A mobile agent is an agent that can migrate from one node to another in the network while preserving its state. This solves the problem of introducing new services manually and provides the advantage of on-the-fly code updates for existing services. We also discuss the challenges of mobile agent development in Java, mainly introducing code migration in Java (J2ME), which is the critical component of a mobile agent, and interoperability among different J2ME profiles and with Java Standard Edition (J2SE). As the computational backbone for the architecture, we utilize inexpensive but powerful nodes based on the IBM Cell Broadband Engine architecture, namely PlayStation 3 (PS3) devices running Linux. Powered by a RISC-based main processing unit (PPU) and eight synergistic processing units (SPUs), a PS3 can analyze large data sets with great speed. In this work, we also analyze the programming paradigm used on the PS3 machines. We discuss the design and implementation of several high-performance kernels on the PS3 and measure the speedup obtained relative to an x86 machine.
Lastly, we discuss how high-performance computing can be introduced as a service by using platform-neutral protocols such as XML-RPC, to integrate the heterogeneous platform of mobile agents and service-oriented architectures (SOA).
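The platform-neutral glue can be sketched with Python's standard XML-RPC marshalling standing in for both the J2ME agent and the PS3 back end; the method name and payload are hypothetical:

```python
# Sketch of XML-RPC as the integration protocol: the agent encodes a call
# into XML, the back end decodes and dispatches it, and the reply returns
# the same way. "kernel.multiply" and the operands are made up.
import xmlrpc.client

# the mobile-agent side encodes a request for the compute back end
request = xmlrpc.client.dumps((3, 4), methodname="kernel.multiply")

# the service-oriented back end decodes it, dispatches, and encodes a reply
params, method = xmlrpc.client.loads(request)
result = params[0] * params[1]
response = xmlrpc.client.dumps((result,), methodresponse=True)

# back on the agent, decode the reply
(value,), _ = xmlrpc.client.loads(response)
```

Because both sides only ever exchange XML text, any J2ME profile able to emit and parse this format can talk to any back-end node, which is the interoperability point the thesis makes.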
  • Scalable Compression and Replay of Communication Traces in Massively Parallel Environments
    (2006-10-02) Noeth, Michael James; Dr. Tao Xie, Committee Member; Dr. Xiaosong Ma, Committee Member; Dr. Frank Mueller, Committee Chair
    Characterizing the communication behavior of large-scale applications is a difficult and costly task due to code and system complexity as well as the time to execute such codes. An alternative to running actual codes is to gather their communication traces and then replay them, which facilitates application tuning and future procurements. While past approaches lacked lossless scalable trace collection, we contribute an approach that provides near constant-size communication traces regardless of the number of nodes while preserving structural information. We introduce intra- and inter-node compression techniques of MPI events and present results of our implementation. Given this novel capability, we discuss its impact on communication tuning and beyond.
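Intra-node compression exploits the fact that iterative codes repeat the same MPI event sequence many times. A run-length sketch over per-iteration event tuples, a simplification of the thesis's actual intra-/inter-node scheme, shows the near constant-size effect:

```python
# Sketch of intra-node trace compression: consecutive repeats of the same
# per-iteration event tuple collapse to a single (events, count) pair.
# Real event records carry arguments; bare names stand in here.
def rle(events):
    """Collapse consecutive repeats into (event, count) pairs."""
    out = []
    for e in events:
        if out and out[-1][0] == e:
            out[-1] = (e, out[-1][1] + 1)
        else:
            out.append((e, 1))
    return out

# 1000 timesteps of the same send/wait pattern compress to one entry
trace = [("MPI_Isend", "MPI_Wait")] * 1000
compressed = rle(trace)
```

Inter-node compression then merges the per-rank compressed traces, which are often structurally identical across ranks, giving traces whose size is nearly independent of node count.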
  • Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems
    (2006-04-23) Varma, Jyothish S; Dr. Tao Xie, Committee Member; Dr. Vincent Freeh, Committee Member; Dr. Frank Mueller, Committee Chair
    Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance to reconfigure working nodes to ensure that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This thesis presents a scalable approach to reconfiguring the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows response times on the order of hundreds of microseconds for reconfiguration using MPI over Blue Gene/L and single-digit milliseconds using TCP over Gigabit Ethernet. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems.
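The consistent-view requirement can be sketched as a deterministic update that every surviving peer applies on a failure notice, after which communication is re-routed around the new view. The real protocol runs decentralized over MPI or TCP; this single-process toy model with invented ranks is illustrative only:

```python
# Sketch of membership-view maintenance: each peer applies the same
# deterministic update on a failure notice, so all views converge.
class Member:
    def __init__(self, rank, view):
        self.rank = rank
        self.view = sorted(view)

    def on_failure_notice(self, failed_rank):
        # identical update at every peer keeps the views consistent
        self.view = [r for r in self.view if r != failed_rank]

    def successor(self):
        # communication reconfigures around the ring of surviving ranks
        idx = self.view.index(self.rank)
        return self.view[(idx + 1) % len(self.view)]

peers = [Member(r, range(4)) for r in [0, 1, 3]]  # rank 2 has just failed
for p in peers:
    p.on_failure_notice(2)
views = {tuple(p.view) for p in peers}
```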
  • System Virtualization for Proactive Fault-Tolerant Computing
    (2008-05-02) Nagarajan, Arun Babu; Dr. Xiaosong Ma, Committee Member; Dr. Xiaohui (Helen) Gu, Committee Member; Dr. Frank Mueller, Committee Chair
    Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes that recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we promote a proactive one in which processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This thesis contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realizing FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance in which live migration is actually triggered by health monitoring.
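The daemon's decision step, health monitoring plus load-based target selection, can be sketched as follows. The threshold, sensor readings, and node names are invented; the real daemon then drives Xen live migration rather than merely returning a plan:

```python
# Sketch of the proactive FT decision loop: flag nodes whose health sensor
# crosses a threshold and pick the least-loaded healthy node as the target.
HEALTH_THRESHOLD = 70.0   # e.g. degrees Celsius on a temperature sensor

def choose_migration(readings, load):
    """Return (source, target) pairs: unhealthy node -> least-loaded healthy node."""
    unhealthy = [n for n, t in readings.items() if t >= HEALTH_THRESHOLD]
    healthy = [n for n in readings if n not in unhealthy]
    plan = []
    for src in unhealthy:
        target = min(healthy, key=lambda n: load[n])   # load-based selection
        plan.append((src, target))
    return plan

readings = {"node1": 55.0, "node2": 82.5, "node3": 60.0}
load = {"node1": 0.9, "node2": 0.4, "node3": 0.2}
plan = choose_migration(readings, load)
```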
  • Traced Based Dependence Analysis for Speculative Loop Optimizations
    (2007-06-19) Ramaseshan, Ravi; Dr. Xiaosong Ma, Committee Member; Dr. Frank Mueller, Committee Chair; Dr. Thomas Conte, Committee Member
    Thread-level speculation (TLS) is a powerful technique that can harness, in part, the large computing potential of multi-core/chip multiprocessors. The performance of a TLS system is limited by the number of rollbacks performed, and thus by the number of dependence violations detected at run time. Hence, the decomposition of a serial program into threads that have a low probability of causing dependence violations is imperative. In this thesis, we develop a framework that calculates a dynamic dependence graph of a program originating from an execution under a training input. We investigate the hypothesis that, by generating such a dependence graph, we can parallelize the program beyond the capability of a static compiler while limiting the number of required rollbacks. In our approach, we evaluated two techniques for calculating dependence graphs to perform our dependence analysis: power regular segment descriptors and shadow maps. After calculating dependence graphs that aid loop nest optimizations and after determining program performance after parallelization, we assess results obtained with our framework and then discuss future directions of this research. We observed the greatest performance improvement for two benchmarks, while the others showed either no improvement, degraded performance, or, in one case, a slow-down with our analysis.
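Of the two techniques, shadow maps lend themselves to a direct sketch: a map from address to last-writing iteration flags cross-iteration flow dependences, exactly the kind of violation that would force a TLS rollback. The addresses and access trace below are illustrative:

```python
# Sketch of shadow-map dependence detection over a recorded access trace.
def find_flow_deps(accesses):
    """accesses: list of (iteration, 'R'|'W', address) in execution order.
    Returns cross-iteration flow dependences as (writer, reader, address)."""
    last_writer = {}          # shadow map: address -> iteration of last write
    deps = set()
    for it, kind, addr in accesses:
        if kind == "W":
            last_writer[addr] = it
        elif addr in last_writer and last_writer[addr] != it:
            deps.add((last_writer[addr], it, addr))   # loop-carried flow dep
    return deps

# a[i] = a[i-1] + 1 under a training input: iteration i reads i-1's write
trace = [(0, "W", 100), (1, "R", 100), (1, "W", 104), (2, "R", 104), (2, "W", 108)]
deps = find_flow_deps(trace)
```

Loops whose traces produce an empty (or sparse) dependence set are the profitable speculation candidates; dense sets predict frequent rollbacks.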
  • Transparent Fault Tolerance for Job Healing in HPC Environments
    (2009-07-07) Wang, Chao; Dr. Frank Mueller, Committee Chair; Dr. Xiaosong Ma, Committee Member; Dr. Yan Solihin, Committee Member; Dr. Nagiza Samatova, Committee Member
    As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures leading to data unavailability and job resubmissions. This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in the fault resilience of HPC jobs. The efforts encompass two broad areas. First, at the job level, novel, scalable mechanisms are built in support of proactive FT and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints. Second, for job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems.
In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run time after failures by recovering staged data from its original source; and (3) "just-in-time" replication of job input data so as to maximize the use of supercomputer cycles. Experimental results demonstrate the value of these advanced fault tolerance techniques in increasing fault resilience in HPC environments.
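The interplay of full and incremental checkpoints from contribution (4) of the job-level work can be sketched at page granularity: a full checkpoint saves every page, an incremental one saves only pages dirtied since the last checkpoint, and recovery replays increments onto the full copy. Page contents here are toy strings rather than memory pages:

```python
# Sketch of incremental checkpointing at page granularity.
def full_checkpoint(pages):
    return dict(pages)

def incremental_checkpoint(pages, since):
    # save only pages whose contents changed since the reference checkpoint
    return {p: v for p, v in pages.items() if since.get(p) != v}

def restore(full, increments):
    state = dict(full)
    for inc in increments:      # replay increments, newest last
        state.update(inc)
    return state

pages = {0: "a", 1: "b", 2: "c"}
ckpt0 = full_checkpoint(pages)
pages[1] = "B"                              # only page 1 is dirtied
inc1 = incremental_checkpoint(pages, ckpt0)
recovered = restore(ckpt0, [inc1])
```

Because `inc1` holds a single page rather than three, the per-checkpoint cost drops in proportion to the write set, which is what lets fewer full checkpoints be interspersed with many cheap incremental ones.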
