High Performance Parallel and Distributed Genomic Sequence Search
dc.contributor.advisor | Xiaosong Ma, Committee Chair | en_US |
dc.contributor.advisor | Steffen Heber, Committee Member | en_US |
dc.contributor.advisor | Frank Mueller, Committee Member | en_US |
dc.contributor.advisor | Nagiza Samatova, Committee Member | en_US |
dc.contributor.advisor | Douglas Reeves, Committee Member | en_US |
dc.contributor.author | Lin, Heshan | en_US |
dc.date.accessioned | 2010-04-02T18:30:32Z | |
dc.date.available | 2010-04-02T18:30:32Z | |
dc.date.issued | 2009-03-26 | en_US |
dc.degree.discipline | Computer Science | en_US |
dc.degree.level | dissertation | en_US |
dc.degree.name | PhD | en_US |
dc.description.abstract | Genomic sequence database search identifies similarities between given query sequences and known sequences in a database. It forms a critical class of applications used widely and routinely in computational biology. Due to their wide application in diverse task settings, sequence search tools today are run on several types of parallel systems, including batch jobs on one or more supercomputers and interactive queries through web-based services. Despite successful parallelization of popular sequence search tools such as BLAST, in the past two decades the growth of sequence databases has outpaced that of computing hardware elements, making scalable and efficient parallel sequence search processing crucial in helping life scientists deal with the ever-increasing amount of genomic information. In this thesis, we investigate efficient and scalable parallel and distributed sequence-search solutions by addressing unique problems and challenges in the aforementioned execution settings. Specifically, this thesis research 1) introduces parallel I/O techniques into sequence-search tools and proposes novel computation and I/O co-scheduling algorithms that enable genomic sequence search to scale efficiently on massively parallel computers; 2) presents a semantic-based distributed I/O framework that leverages application-specific meta-information to drastically reduce the amount of data transfer and thus enables distributed sequence-search collaboration on a global scale; 3) proposes a novel request scheduling technique for clustered sequence-search web servers that comprehensively takes into account both data locality and parallel search efficiency to optimize query response time under various server load levels and access scenarios. The efficacy of our proposed solutions has been verified on a broad range of parallel and distributed systems, including Peta-scale supercomputers, the NSF TeraGrid system, and small- or medium-sized clusters.
In addition, our optimizations of massively parallel sequence search have been incorporated into the official release of mpiBLAST-PIO, currently the only supported branch of mpiBLAST, a popular open-source sequence-search tool. mpiBLAST-PIO is able to achieve 93% parallel efficiency across 32,768 cores on the IBM Blue Gene/P supercomputer. | en_US |
dc.identifier.other | etd-03132009-172048 | en_US |
dc.identifier.uri | http://www.lib.ncsu.edu/resolver/1840.16/3481 | |
dc.rights | I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. | en_US |
dc.subject | parallel I/O | en_US |
dc.subject | scheduling | en_US |
dc.subject | distributed computing | en_US |
dc.subject | parallel bioinformatics | en_US |
dc.subject | sequence database search | en_US |
dc.title | High Performance Parallel and Distributed Genomic Sequence Search | en_US |