An application scheduler for bioinformatics applications

Date

2003-06-05

Abstract

Bioinformatics analyses are distinctive in that they usually comprise a large number of small, computationally intensive but fairly independent processes. This gives them a high degree of parallelism and a very small cost of synchronization among the different tasks. However, the time needed to schedule each task is often far greater than the execution time of the task itself. To reduce the scheduling overhead, tasks are grouped together into jobs, which increases the average life of a job and reduces the number of jobs that need to be scheduled. Grouping also reduces the variation in the lifetime of a job, allowing better predictions of how long a resource needs to be allocated. While increasing the number of tasks in a group improves the predictability of a job's lifetime, tasks within a group are executed sequentially, which reduces the parallelism available in the pool of tasks. This can increase the total execution time of an analysis and degrade the performance of the application. An analysis pipeline for the A. fumigatus genome, consisting of BLASTn analyses against 11 genomes, two gene prediction algorithms and a RepeatMasker analysis, was developed in the DeCIFR tool for comparative genomics. Performance tests were carried out on a homogeneous, single-resource grid consisting of 14 Dell servers running dual 1 GHz Intel Xeon processors with 2 GB of RAM. Grouping yielded performance improvements of up to 52%. Grouping just 20 tasks together reduced the variation in CPU time per job from around 75% to about 20%. The best-case improvement in CPU time variation was from 87% to 13%. Over-grouping of tasks led to poor utilization of resources, degrading performance by about 40%. The appropriate group size depends on the average life of a task and on the ratio of the number of tasks to be completed to the number of resources available.
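
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the scheduler developed in the thesis; it assumes a fixed per-job scheduling overhead, identical resources, and a simple greedy assignment in which each job goes to the resource that becomes free first. The function names `group_tasks` and `estimate_makespan` and all parameter names are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def group_tasks(tasks: Sequence[float], group_size: int) -> List[List[float]]:
    """Split task run times into consecutive groups of at most group_size."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [list(tasks[i:i + group_size])
            for i in range(0, len(tasks), group_size)]


def estimate_makespan(tasks: Sequence[float], group_size: int,
                      num_resources: int, overhead: float) -> float:
    """Estimate total completion time for a given group size.

    Assumptions (not taken from the thesis): every job pays the same
    scheduling overhead, tasks inside a job run sequentially, and each
    job is placed on the resource that becomes free first.
    """
    if num_resources < 1:
        raise ValueError("num_resources must be at least 1")
    free_at = [0.0] * num_resources  # time at which each resource is idle
    for job in group_tasks(tasks, group_size):
        idx = min(range(num_resources), key=free_at.__getitem__)
        free_at[idx] += overhead + sum(job)
    return max(free_at)


if __name__ == "__main__":
    run_times = [1.0] * 8  # eight 1-second tasks (made-up numbers)
    for size in (1, 2, 8):
        print(size, estimate_makespan(run_times, size,
                                      num_resources=4, overhead=5.0))
```

With these made-up numbers (eight 1-second tasks, four resources, 5 seconds of overhead per job), the estimate is 12.0 for ungrouped tasks, 7.0 for groups of two, and 13.0 when all eight tasks are placed in a single group. Moderate grouping amortizes the overhead, while over-grouping leaves resources idle.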

Keywords

Bioinformatics, Application scheduler, Resource scheduling, Grid Computing

Degree

MS

Discipline

Computer Engineering
