High Performance Parallel and Distributed Genomic Sequence Search

Show full item record

Title: High Performance Parallel and Distributed Genomic Sequence Search
Author: Lin, Heshan
Advisors: Xiaosong Ma, Committee Chair
Steffen Heber, Committee Member
Frank Mueller, Committee Member
Nagiza Samatova, Committee Member
Douglas Reeves, Committee Member
Abstract: Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Due to their wide application in diverse task settings, sequence search tools today are run on several types of parallel systems, including batch jobs on one or more supercomputers and interactive queries through web-based services. Despite successful parallelization of popular sequence search tools such as BLAST, in the past two decades the growth of sequence databases has outpaced that of computing hardware elements, making scalable and efficient parallel sequence search processing crucial in helping life scientists' dealing with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing unique problems and challenges in the aforementioned execution settings. Specifically, this thesis research 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantic based distributed I/O framework that leverages the application specific meta information to drastically reduce the amount of data transfer and thus enables distributed sequence searching collaboration in the global scale; 3) proposes a novel request scheduling technique for clustered sequence-search web servers that comprehensively takes into account both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including Peta-scale supercomputers, the NSF TeraGrid system, and small- or medium-sized clusters. In addition, our optimizations of massively parallel sequence search have been transformed into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO is able to achieve 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer.
Date: 2009-03-26
Degree: PhD
Discipline: Computer Science
URI: http://www.lib.ncsu.edu/resolver/1840.16/3481


Files in this item

Files Size Format View
etd.pdf 953.3Kb PDF View/Open

This item appears in the following Collection(s)

Show full item record