Browsing by Author "Nagiza Samatova, Committee Member"
Now showing 1 - 3 of 3
- Automating the annotation and discovery of MicroRNA in multi-species high-throughput 454 Sequencing (2008-08-04) Wheeler, Benjamin Matthew; Steffen Heber, Committee Chair; Brian Wiegmann, Committee Member; Nagiza Samatova, Committee Member
- High Performance Parallel and Distributed Genomic Sequence Search (2009-03-26) Lin, Heshan; Xiaosong Ma, Committee Chair; Steffen Heber, Committee Member; Frank Mueller, Committee Member; Nagiza Samatova, Committee Member; Douglas Reeves, Committee Member
  Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Because sequence search serves such diverse task settings, these tools are run on several types of parallel systems, from batch jobs on one or more supercomputers to interactive queries through web-based services. Despite the successful parallelization of popular sequence-search tools such as BLAST, over the past two decades the growth of sequence databases has outpaced that of computing hardware, making scalable and efficient parallel sequence search crucial in helping life scientists deal with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing the unique problems and challenges of the aforementioned execution settings.
  Specifically, this thesis research 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantics-based distributed I/O framework that leverages application-specific metadata to drastically reduce the amount of data transferred, enabling distributed sequence-search collaboration at a global scale; 3) proposes a novel request-scheduling technique for clustered sequence-search web servers that comprehensively accounts for both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including petascale supercomputers, the NSF TeraGrid system, and small- and medium-sized clusters. In addition, our optimizations for massively parallel sequence search have been incorporated into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO achieves 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer.
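  The request-scheduling idea in point 3 can be sketched as a dispatcher that weighs cache locality against per-node load when routing queries to a clustered search service. This is only a minimal illustration: the `Node` and `dispatch` names and the simple additive scoring are assumptions for exposition, not the thesis's actual algorithm or any mpiBLAST API.

```python
# Hypothetical sketch of locality-aware query dispatch for a clustered
# sequence-search service. A node already caching the needed database
# fragment avoids a disk/network load, so it is favored unless it is
# much more heavily loaded than its peers.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cached_fragments: set  # database fragments resident in memory
    load: int = 0          # queries currently queued on this node

def dispatch(query_fragment, nodes, locality_bonus=2):
    """Pick the node with the lowest score (load plus a penalty for
    nodes that would have to load the fragment first)."""
    def score(n):
        penalty = 0 if query_fragment in n.cached_fragments else locality_bonus
        return n.load + penalty
    best = min(nodes, key=score)
    best.load += 1
    best.cached_fragments.add(query_fragment)  # fragment now cached there
    return best.name
```

  In this toy model, `locality_bonus` trades off locality against load balance: a large value keeps queries pinned to cached fragments, a small one spreads load evenly.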
- Towards Transparent Parallel Processing on Multi-core Computers (2009-12-22) Li, Jiangtian; Xiaosong Ma, Committee Chair; Xiaohui Gu, Committee Member; Frank Mueller, Committee Member; Nagiza Samatova, Committee Member
  With the trend toward an exponentially increasing number of cores per chip, parallelization of all application types is critical, and it must happen at multiple levels to address the unique challenges and exploit the new opportunities brought by architectural advances. In this dissertation, we focus on enhancing the utilization of future-generation, many-core personal computers for high-performance and energy-efficient computing. On one hand, computation- and/or data-intensive tasks such as scientific data processing and visualization, which are typically performed sequentially on personal workstations, need to be parallelized to take advantage of the increasing hardware parallelism. Explicit parallel programming, however, is labor-intensive and requires sophisticated performance tuning for individual platforms and operating systems. In this PhD study, we took a first step toward transparent parallelization of data-processing codes by developing automatic parallelization tools for scripting languages. More specifically, we present pR, a framework that transparently parallelizes the R language for high-performance statistical computing. We apply parallelizing-compiler technology to runtime, whole-program dependence analysis and employ incremental code analysis assisted with evaluation results. Experimental results demonstrate that pR can exploit both task and data parallelism transparently and achieves good overall performance and scalability. Further, we attack the performance-tuning problem for transparent parallel execution by proposing and designing a novel online task decomposition and scheduling approach for transparent parallel computing.
  This approach collects runtime task cost information transparently and performs online static scheduling, utilizing cost estimates generated by ANN-based runtime performance prediction as well as by loop-iteration test runs. We implement the above techniques in the pR framework, and our proposed approach is demonstrated to significantly improve task partitioning and scheduling over a variety of benchmarks. On the other hand, multi-core personal computers will inevitably be under-utilized when their owners perform light-weight tasks such as editing and web browsing, making volunteer computing more appealing than ever. In this study, we took a first step toward a novel computation model, energy-aware volunteer computing on multi-core processors, by evaluating the potential energy/performance trade-off of a more aggressive execution model that selects active nodes over idle nodes for scheduling foreign application tasks, in order to better utilize idle cores and achieve energy savings. Our experimental results suggest that aggressive volunteer computing can bring significant energy savings compared to common existing execution modes, and provides an attractive computation model.
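  The cost-driven scheduling step can be illustrated with a classic heuristic: given predicted per-task costs (which the dissertation obtains from ANN-based prediction and loop-iteration test runs), assign tasks longest-first to the least-loaded worker. The function name and the use of plain longest-processing-time (LPT) scheduling here are illustrative assumptions; pR's actual decomposition and scheduling are more sophisticated.

```python
# Hypothetical sketch: static scheduling driven by predicted task costs,
# using the LPT heuristic as a stand-in for pR's scheduler. The cost
# predictor itself is assumed external (e.g., an ANN-based estimator).

import heapq

def lpt_schedule(task_costs, n_workers):
    """Greedily assign tasks to workers, longest predicted cost first,
    always onto the currently least-loaded worker. Returns a list of
    per-worker task lists."""
    heap = [(0.0, w) for w in range(n_workers)]  # (accumulated cost, worker)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        total, w = heapq.heappop(heap)   # least-loaded worker so far
        assignment[w].append(task)
        heapq.heappush(heap, (total + cost, w))
    return assignment
```

  The quality of such a schedule depends directly on the accuracy of the cost estimates, which is why the dissertation invests in runtime performance prediction rather than assuming uniform task costs.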
