High Performance Parallel and Distributed Genomic Sequence Search

Lin, Heshan

Files

etd.pdf (953.38 KB)

Date

2009-03-26

Authors

Lin, Heshan

Advisors

Xiaosong Ma, Committee Chair

Steffen Heber, Committee Member

Frank Mueller, Committee Member

Nagiza Samatova, Committee Member

Douglas Reeves, Committee Member

Abstract

Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Because of their wide application in diverse task settings, sequence-search tools today are run on several types of parallel systems, including batch jobs on one or more supercomputers and interactive queries through web-based services. Despite the successful parallelization of popular sequence-search tools such as BLAST, over the past two decades the growth of sequence databases has outpaced that of computing hardware, making scalable and efficient parallel sequence search crucial in helping life scientists deal with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing the unique problems and challenges of the aforementioned execution settings. Specifically, this thesis 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantics-based distributed I/O framework that leverages application-specific meta-information to drastically reduce the amount of data transferred, thus enabling distributed sequence-search collaboration on a global scale; and 3) proposes a novel request-scheduling technique for clustered sequence-search web servers that takes into account both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including petascale supercomputers, the NSF TeraGrid system, and small- or medium-sized clusters.
In addition, our optimizations of massively parallel sequence search have been incorporated into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO achieves 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer.

Keywords

parallel I/O, scheduling, distributed computing, parallel bioinformatics, sequence database search

URI

http://www.lib.ncsu.edu/resolver/1840.16/3481

Degree

PhD

Discipline

Computer Science

Collections

Dissertations
