Methods for Accurate Analysis of High-Throughput Transcriptome Data

Abstract

A detailed understanding of the transcriptome is a prerequisite for deciphering the flow of information from genotype to phenotype. Fortunately, modern high-throughput technologies now provide an unprecedented ability to observe the full complement of transcriptional events, which extend far beyond the classical "one gene, one protein" hypothesis to include alternatively spliced genes, microRNAs, RNA interference, anti-sense transcription, and a variety of other phenomena that were unknown until recently. However, in order to accurately interpret the results of these assays, new statistical and bioinformatic methods must be developed in parallel with biotechnological advances. In this thesis, we present several methods for improving the accuracy of inferences obtained from the high-throughput transcriptome data generated by these new technologies. First, we present a novel method for microarray quality assessment. Since accurate inference is dependent on the quality of the underlying data, quality assessment is a critical component in any microarray data analysis. Our method, which uses an unsupervised classifier to discriminate between high and low quality microarray datasets, exhibits performance comparable to supervised learners constructed using the same training data. However, because our approach requires only unannotated data, it is easy to customize and to keep up-to-date as technology evolves. Next, we present an alternative method for microarray quality assessment, which identifies low quality microarrays by simulating a set of differentially expressed genes. This method directly measures the ability of a planned statistical analysis to identify differential gene expression when suspected low quality arrays are included in the dataset. A key advantage of this approach is that, unlike other methods, it provides a specific recommendation about whether to retain or discard low quality chips in the context of a particular experimental setting.
Finally, we introduce a procedure for accurately quantifying alternative splicing using RNA-Seq data. Our method uses a familiar linear models approach, but improves upon similar methods that assume uniform coverage of RNA-Seq reads along the targeted transcripts. We first show, through simulation, that using an incorrect read sampling distribution can lead to incorrect conclusions about the expression of isoforms in a mixture. Applying our method to an example dataset, we identify 438 differentially spliced genes, exhibiting a range of expression patterns including genes with switch-like differential splicing between two tissues, as well as genes with more subtle variations in isoform expression. Taken together, we expect these methods to increase the accuracy of inferences drawn from high-throughput transcriptome data and, in doing so, to advance our understanding of the biology of genome expression.
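
To illustrate why the assumed read sampling distribution matters, the following is a minimal sketch and not the procedure developed in the thesis. It assumes a hypothetical gene with three exons and two isoforms (isoform A contains all three exons, isoform B skips exon 2), models per-exon read counts as a linear combination of isoform abundances, and fits the abundances by ordinary least squares. The design-matrix entries, the true abundances, and the function names are all invented for illustration, and the simulated counts are noise-free.

```python
def fit_two_isoforms(design, counts):
    """Ordinary least squares for counts ~ design * theta with two isoforms.

    design: list of [weight_A, weight_B] rows, one row per exon.
    counts: list of observed read counts, one per exon.
    Solves the 2x2 normal equations in closed form.
    """
    a11 = sum(row[0] * row[0] for row in design)
    a12 = sum(row[0] * row[1] for row in design)
    a22 = sum(row[1] * row[1] for row in design)
    b1 = sum(row[0] * y for row, y in zip(design, counts))
    b2 = sum(row[1] * y for row, y in zip(design, counts))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("isoforms are not identifiable from this design")
    return ((a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det)


def fraction_a(theta):
    """Share of isoform A in the estimated mixture."""
    return theta[0] / (theta[0] + theta[1])


# Hypothetical expected reads per unit abundance, per exon and isoform.
# Uniform coverage: every included 100-bp exon contributes equally.
uniform_design = [[100, 100], [100, 0], [100, 100]]
# Non-uniform coverage (assumed, illustrative values): later exons are
# sampled more heavily than earlier ones.
biased_design = [[60, 60], [80, 0], [160, 140]]

# Simulate noise-free counts from the non-uniform model.
true_theta = (2.0, 3.0)
counts = [row[0] * true_theta[0] + row[1] * true_theta[1]
          for row in biased_design]

theta_correct = fit_two_isoforms(biased_design, counts)
theta_uniform = fit_two_isoforms(uniform_design, counts)

print("true fraction of A:      ", fraction_a(true_theta))
print("correct sampling model:  ", fraction_a(theta_correct))
print("uniform sampling assumed:", fraction_a(theta_uniform))
```

In this toy setting the fit that uses the same sampling weights as the simulation recovers the true mixture, whereas the fit that assumes uniform coverage misestimates the share of isoform A. Real data would add count noise and typically a non-negativity constraint on the abundances, both of which this sketch omits.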

Keywords

RNA-Seq, quality assessment, microarray, alternative splicing, transcriptome

Degree

PhD

Discipline

Bioinformatics