High Performance Parallel and Distributed Genomic Sequence Search

Abstract

Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Due to their wide application in diverse task settings, sequence search tools today are run on several types of parallel systems, including batch jobs on one or more supercomputers and interactive queries through web-based services. Despite successful parallelization of popular sequence search tools such as BLAST, in the past two decades the growth of sequence databases has outpaced that of computing hardware elements, making scalable and efficient parallel sequence search processing crucial in helping life scientists' dealing with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing unique problems and challenges in the aforementioned execution settings. Specifically, this thesis research 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantic based distributed I/O framework that leverages the application specific meta information to drastically reduce the amount of data transfer and thus enables distributed sequence searching collaboration in the global scale; 3) proposes a novel request scheduling technique for clustered sequence-search web servers that comprehensively takes into account both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including Peta-scale supercomputers, the NSF TeraGrid system, and small- or medium-sized clusters. In addition, our optimizations of massively parallel sequence search have been transformed into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO is able to achieve 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer.

Description

Keywords

parallel I/O, scheduling, distributed computing, parallel bioinformatics, sequence database search

Citation

Degree

PhD

Discipline

Computer Science

Collections