Dissertations
Permanent URI for this collection: https://www.lib.ncsu.edu/resolver/1840.20/24
Browsing Dissertations by Discipline "Bioinformatics"
- Advanced Applications of RNA-Sequencing Data in Plant Bioinformatics.(2022-10-20) Knight, Montana; Colleen Doherty, Co-Chair; Dahlia Nielsen, Co-Chair; Nicolas Buchler, Member; Jeffrey Thorne, Member; Imara Perera, Graduate School Representative
- Advances in Causal Inference and the Study of Interlocus Gene Conversion.(2023-06-27) Xu, Tanchumin; Jeffrey Thorne, Chair; Brian Wiegmann, Member; Gavin Conant, Member; Shu Yang, Member; Seth Sullivant, Graduate School Representative
- Analysis and Computational Methods for Understanding Disease Resistance at the Genomic Level in Complex Plant Systems.(2023-11-03) Schoonmaker, Ashley Nicole; Amanda Hulse, Co-Chair; Susana Milla-Lewis, Co-Chair; Christopher Healey, Graduate School Representative; Gavin Conant, Member; Ross Whetten, Member; Jean Ristaino, Member
- Analysis of Cis-acting Regulatory Motifs Involved in Alternative Splicing(2009-04-15) Zhao, Sihui; Steffen Heber, Committee Chair; Zhao-Bang Zeng, Committee Co-Chair; David Bird, Committee Member; Hao Zhang, Committee Member
Alternative splicing (AS) is an important posttranscriptional process in eukaryotes. It dramatically expands the proteome and contributes essentially to the regulation of gene expression. Cis-acting regulatory motifs play a pivotal role in the regulation of alternative splicing. Many human diseases involving aberrant (alternative) splicing are caused by mutations of splicing regulatory motifs. However, because these motifs are short, degenerate, and context-dependent, predicting cis-acting splicing motifs is a very challenging task. In this dissertation, we focus on the discovery of splicing signals from sequences. This may help to reveal the integrated splicing code and to understand the regulation of gene expression at exon-level resolution. In chapter one, we review recent research developments in alternative splicing and its regulation, as well as the experimental and computational approaches in genome-wide alternative splicing analysis. We describe a large-scale data analysis experiment to discover AS motifs in chapter two. We applied a computational framework to re-analyze, for regulatory motifs, a dataset containing about 3,000 cassette exons and their skipping rates. The alternatively spliced events were clustered by their expression profiles to find co-regulated genes. Rather than using a fixed cutoff as the cluster boundary, we used systematic sampling to sample sequence clusters and eliminated redundant motifs predicted from overlapping clusters. Based on comparison to known motifs and on positional bias, we conclude that these predicted motifs may be promising candidates responsible for AS regulation. In chapter three, we describe a new approach to discover short and degenerate AS motifs.
We implemented a two-step approach incorporating skipping rates in motif discovery. In the simulation study, we show that this approach is especially suitable for discovering short and highly degenerate motifs. Analysis of cassette exons in Central Nervous System tissues produced 15 motifs that are associated with the variation of skipping rates. We find that Nova and hnRNP A1 binding sites are involved in AS regulation, as well as about ten novel motifs. Moreover, co-operation between predicted motifs is also revealed. In chapter four, we give the present status of SPRED, a database of cis-acting regulatory splicing elements. The motifs in SPRED are compiled from the literature, and all are experimentally validated. The web interface is publicly accessible and accompanied by query and similarity search tools. The goal of SPRED is to provide a comprehensive motif dictionary to facilitate research in AS and its regulation. Finally, we give the conclusions in chapter five, together with perspectives for future study and a brief review of potential challenges.
- Analysis of Gene Expression Profiles with Linear Mixed Models(2005-04-25) Hsieh, Wen-Ping; Greg Gibson, Committee Chair; Russ Wolfinger, Committee Co-Chair; Dennis Boos, Committee Member; Spencer Muse, Committee Member
With the emergence of high throughput technology, proper interpretation of data has become critical for many aspects of biomedical research. My dissertation explores two major issues in gene expression profile microarray data analysis. One is quantification of variation across and among species and its effect on biological interpretation. The second part of my work is to develop better statistical estimates that can account for different sources of variation for significant gene detection. A previously published dataset of oligonucleotide array data for three primate species was analyzed with linear mixed models. By decomposing the variation of expression into different explanatory factors, the differences among species as well as between tissues were revealed at the expression level. Issues of cross-species hybridization and expression divergence compared to mutation-drift equilibrium were addressed. The power and flexibility of the linear mixed model framework for detection of differentially expressed genes was then explored with a dataset that includes spiked-in controls. The impact of probe-level sequence variation on cross-hybridization was detected through a Gibbs sampling method that highlights potential problems for short oligonucleotide microarray data analysis. A motif as short as fifteen bases can possibly cause significant cross-hybridization. Finally, a bivariate model using information from both perfect match probes and mismatch probes was proposed as a means to increase the statistical power for detection of significant differences in gene expression. The improved performance of the method was demonstrated through Monte Carlo simulation. 
The detection power can increase by as much as 20% at a 5% false positive rate under certain circumstances.
- Analysis of Multilocus Linkage Disequilibrium Structure in the Human Genome(2008-03-30) Kim, Yunjung; Zhao-Bang Zeng, Committee Chair; Jung-Ying Tzeng, Committee Member; Gregory Gibson, Committee Member; Philip Awadalla, Committee Member
The International HapMap Project and high-throughput genotyping technology have generated genome-wide data on millions of markers that can be used in genetic studies. Each marker can be analyzed separately, but analyzing multiple markers simultaneously through haplotypes has generated great interest recently. Understanding the haplotype structure in the human genome may provide important information on human evolutionary history and on the identification of genetic variants responsible for human complex diseases. Since the alleles at closely linked markers on a single chromosome are often statistically dependent (i.e., in linkage disequilibrium (LD)), one crucial aspect of haplotype analysis is to characterize LD patterns in different regions and different populations. To assess the extent of correlation of genetic variation at multiple markers in a given region and population, pairwise LD measures have been commonly used. However, pairwise LD measures alone may be suboptimal for capturing the variability of background levels of disequilibrium, since multilocus LD measures can provide information about simultaneous allele associations among multiple loci that pairwise LD measures miss. In addition, in order to fully characterize the haplotype structure and LD pattern at multiple markers, it is necessary to consider high-order disequilibria and estimate their values.
- Analysis on Microarray Data and DNA Regulatory Elements Prediction(2002-10-22) Lu, Jun; Spencer Muse, Committee Chair
Transcription profiling with microarray technology has significantly accelerated our understanding of complex biological processes by allowing the genome-wide measurement of messenger RNA levels. Microarrays are commonly used for identifying genes with expression differing between two or more samples (e.g. treatments vs. controls), searching for gene expression patterns among a set of samples or genes, and studying gene regulation networks. Here, we first address the variation intrinsic to microarray experiments. The analysis of variance technique was applied to partition and quantify several sources of variation likely to be present in a typical cDNA microarray experiment. Based on a pilot experiment with intensive replication at several levels, we showed that significant amounts of variation can be attributed to slide, plate and pin differences. The origin of these sources of variation was discussed and suggestions were made on how to minimize or avoid them when a future microarray experiment is designed. Next, we demonstrated that molecular cancer classification could be approached by discriminant analysis. We analyzed a public Affymetrix chip dataset and selected the predictor genes based on the t-values and stepwise discriminant analysis, and evaluated the resulting model's performance in predicting 34 test samples by discriminant analysis. Only two samples were not correctly predicted with the 25 predictor genes we chose. We also evaluated the parsimony of our model by evaluating, through a stepwise method, the minimum number of genes required to maintain a high level of accuracy in predicting cancer types. The accumulation of microarray data can help elucidate the gene regulation mechanisms in cells. Here, we attempted to find an improved matrix description for transcription factor binding sites. 
We applied a genetic algorithm (GA) to derive matrices that were trained from a set of true binding sequences and random sequences. Preliminary results indicate that the derived matrix shows a higher specificity in binding site prediction than the regular position weight matrix (PWM) within a range of cutoff scores. The binding site of the cell-cycle related transcription factors, E2Fs, was taken as an example to illustrate our method. When both the GA-derived and regular matrices were applied to scan the human gene upstream sequences, the matrix we derived gave significantly fewer predictions than the regular matrix, given the same false negative rate observed in the training dataset.
- Analytical Tools for Characterizing Developmental Toxicity of Environmental Chemicals Using High-throughput Screening in Zebrafish (Danio rerio).(2016-07-19) Zhang, Guozhu; David Reif, Chair; Denis Fourches, Member; John Godwin, Member; Carolyn Mattingly, Member; Jeffrey Yoder, Graduate School Representative
- Analytical Tools for Population-based Association Studies(2008-08-21) Liu, Youfang; Daowen Zhang, Committee Member; Trudy F. C. Mackay, Committee Member; Zhao-Bang Zeng, Committee Co-Chair; Jung-Ying Tzeng, Committee Co-Chair
Disease gene fine mapping is an important task in human genetic research. Association analysis is becoming a primary approach for localizing disease loci, especially now that abundant SNPs are available thanks to improvements in genotyping technology over the last decades. Despite the rapid improvement in detection ability, the association strategy has many limitations. In this dissertation, we focus on three topics: a haplotype similarity based test, an association test incorporating genotyping error, and a simulation tool for large data sets. 1) Previous haplotype similarity based tests cannot incorporate covariates. In chapter 2, we propose a new association method based on haplotype similarity that incorporates covariates and uses as much of the information in the data as possible. We found that our method improves power when neither LD nor allele frequency is too low and is comparable under other scenarios. 2) In chapter 3, we propose a new strategy that incorporates genotyping uncertainty to assess the association between traits and SNPs. Extensive simulation studies for case-control designs demonstrated that an association test based on intensity information can reduce the impact of genotyping error. 3) In chapter 4, we describe simulation software, SimuGeno, which is used to simulate large scale genomic data for case-control association studies.
- Application of Next Generation Sequencing Technologies to Pharmacogenomics.(2012-07-19) Hariani, Gunjan Dhanraj; Alison Motsinger, Chair; Dahlia Nielsen, Member; Eric Stone, Member; Jorge Piedrahita, Member; Jaime Collazo, Graduate School Representative
- Association Analysis of Prenatal Exposures, Umbilical Cord Blood DNA Methylation and Childhood Health.(2023-02-13) Wang, Yaxu; Jung-Ying Tzeng, Co-Chair; Cathrine Hoyo, Co-Chair; Arnab Maity, Member; David Reif, Member; Terrence K Allen, External
- Bayesian Approach for Nonlinear Dynamic System and Genome-Wide Association Study(2010-04-28) Ouyang, Haojun; Sujit K. Ghosh, Committee Chair; Jung-Ying Tzeng, Committee Co-Chair
Genome-wide association studies (GWAS) have been widely used to identify single-nucleotide polymorphisms (SNPs) that are responsible for diseases. A challenging aspect of these studies is resolving the various issues related to multiple tests. We propose a new Bayesian method to measure statistical significance in these genome-wide studies based on the concept of false discovery rate (FDR). Our proposed method provides a convenient way to integrate prior knowledge obtained from external resources into the current study. By controlling the Bayesian positive FDR at a given level, the realized FDR is controlled. Our simulations show that the power can be substantially improved with correct prior information while the FDR is controlled at the desired level. When prior information is imprecise, our method can still improve the power of detecting signals, while keeping the FDR under control. The modified Bayesian method is applied to a GWAS for schizophrenia. Meta-analysis is another approach to utilizing information from multiple sources, by combining results from multiple independent studies. A major concern in carrying out meta-analysis involves the proper characterization of heterogeneity among populations. To account for heterogeneity, the most commonly used approach is to implement a random-effects model, where the random effects are assumed to be normally distributed with an unknown population mean and an unknown variance. We relax the normality assumption and show that a broad class of distributions can be approximated by a class of mixture distributions. The population mean and variance estimates based on the mixture density are then obtained by the EM algorithm. 
Our results show that the proposed method greatly improves the accuracy in estimating the overall mean effect and heterogeneity variance in various realistic cases. We illustrate our method with multiple association studies of the DRD2 gene and schizophrenia. Dynamic systems defined by ordinary differential equations are an important tool for modeling complicated biological systems. To estimate parameters in dynamic systems that have no analytic, closed-form solution and that involve missing or censored data, we extend the Bayesian Euler's approximation method using a data augmentation algorithm. Our simulation study shows that the method is robust in both cases. The proposed method is applied to an HIV viral load dataset, which enables us to retrieve information from the censored data.
- Bioinformatics and Machine Learning in Human Microbiome Analysis.(2022-05-04) Song, Kuncheng; Yihui Zhou, Chair; Fred Wright, Member; Oliver Baars, Graduate School Representative; Benjamin Callahan, Member; Xinxia Peng, Member
- Carbohydrate Utilization Pathway Analysis in the Hyperthermophile Thermotoga maritima(2006-03-01) Conners, Shannon Burns; Todd Klaenhammer, Committee Member; Robert Kelly, Committee Chair; Greg Gibson, Committee Member; Bruce Weir, Committee Member; Jason Osborne, Committee Member
Carbohydrate utilization and production pathways identified in Thermotoga species likely contribute to their ubiquity in hydrothermal environments. Many carbohydrate-active enzymes from Thermotoga maritima have been characterized biochemically; however, sugar uptake systems and regulatory mechanisms that control them have not been well defined. Transcriptional data from cDNA microarrays were examined using mixed effects statistical models to predict candidate sugar substrates for ABC (ATP-binding cassette) transporters in T. maritima. Genes encoding proteins previously annotated as oligopeptide/dipeptide ABC transporters responded transcriptionally to various carbohydrates. This finding was consistent with protein sequence comparisons that revealed closer relationships to archaeal sugar transporters than to bacterial peptide transporters. In many cases, glycosyl hydrolases, co-localized with these transporters, also responded to the same sugars. Putative transcriptional repressors of the LacI, XylR, and DeoR families were likely involved in regulating genomic units for beta-1,4-glucan, beta-1,3-glucan, beta-1,4-mannan, ribose, and rhamnose metabolism and transport. Carbohydrate utilization pathways in T. maritima may be related to ecological interactions within cell communities. Exopolysaccharide-based biofilms composed primarily of β-linked glucose, with small amounts of mannose and ribose, formed under certain conditions in both pure T. maritima cultures and mixed cultures of T. maritima and M. jannaschii. 
Further examination of transcriptional differences between biofilm-bound sessile cells and planktonic cells revealed differential expression of beta-glucan-specific degradation enzymes, even though maltose, an alpha-1,4 linked glucose disaccharide, was used as a growth substrate. Higher transcripts of genes encoding iron and sulfur compound transport, iron-sulfur cluster chaperones, and iron-sulfur cluster proteins suggest altered redox environments in biofilm cells. Further direct comparisons between cellobiose and maltose-grown cells suggested that transcription of cellobiose utilization genes is highly sensitive to the presence of cellobiose, or a cellobiose-maltose mixture. Increased transcripts of genes related to polysulfide reductases in cellobiose-grown cells and biofilm cells suggested that T. maritima cells in pure culture biofilms escaped hydrogen inhibition by preferentially reducing sulfur compounds, while cells in mixed culture biofilms form close associations with hydrogen-utilizing methanogens. In addition to probing issues related to the microbial physiology and ecology of T. maritima, this work illustrates the strategic use of DNA microarray-based transcriptional analysis for functional genomics studies.
- Challenges and Solutions in Association Analysis.(2022-01-09) Huang, Yueyang; Jung-Ying Tzeng, Chair; Wenbin Lu, Member; Cathrine Hoyo, Member; David Reif, Member
- Clustering of Mixed Data Types with Application to Toxicogenomics(2006-04-25) Bushel, Pierre Robert; Greg C. Gibson, Committee Chair; Russell D. Wolfinger, Committee Member; Spencer V. Muse, Committee Member; Robert C. Smart, Committee Member
DNA microarray analysis provides unprecedented capabilities for simultaneous measurement of genome-wide alterations in transcription levels. Toxicogenomics bridges gene and protein expression analyses with conventional toxicology to elucidate a global view of the toxic outcomes and mechanistic changes elicited by toxicant exposure and environmental stressors to biological systems. Inherent in toxicogenomics data are systematic error, stochastic variation and disparate measurement domains and types, which complicate the acquisition of significant, meaningful and broad biological interpretations from analysis of the data. In this dissertation, a classification regimen comprising analysis of replicate data, outlier diagnostics and gene selection procedures was employed to utilize microarray data for categorization of sub-classes of biological samples exposed to pharmacologic agents. To assess contrasts of centrilobular congestion severity of the rat liver subsequent to exposure to acetaminophen (APAP), microarray data, clinical chemistry evaluations and histopathology observations were integrated in a database and analyzed using mixed linear model approaches. Finally, the k-prototypes algorithm was modified (Modk-prototypes) to the specifications of k-means clustering. Its mixed objective function combines the sum of squared Euclidean distances, which measures the dissimilarity of samples on numeric microarray and clinical chemistry features, with simple matching, which measures their dissimilarity on categorical histopathology features. 
In addition, the objective function included weighting terms for the microarray, clinical chemistry and histopathology domain data in order to computationally integrate the data as well as constrain the clustering of the APAP-treated samples according to similarity of gene expression and toxicological profiles. Simulated annealing optimization of the Modk-prototypes algorithm (SA-Modk-prototypes) was used to validate the clustering of the APAP-treated samples. The clusters were vetted for gene expression and toxicological (VETed) k-prototypes features that discerned clusters from one another. The VETed k-prototypes are shown to be ideal for distinguishing between zero, minimal, and moderate levels of necrosis of the hepatocytes and centrilobular region of the rat liver that are end-point representations of the clusters of APAP-treated samples. In this dissertation, chapter 1 is an introduction to general toxicology, microarray gene expression array platforms, experimental designs, preprocessing of the data and gene selection approaches, toxicogenomics as it applies to compound classification and phenotypic anchoring of gene expression, databases and informatics resources for toxicogenomics and clustering of mixed data types. Chapter 2 is dedicated to statistical validation and significance of differentially expressed genes as well as sub-categorization of samples exposed to phenobarbital and the peroxisome proliferators clofibrate, gemfibrozil and Wyeth 14,643. Chapter 3 presents integration of microarray data with clinical chemistry and histopathology data to contrast levels of centrilobular congestion of the rat liver by mixed linear modeling of gene expression ratio values acquired from rats exposed to APAP. 
Chapter 4 describes the use of a modified k-prototypes (Modk-prototypes) objective function and algorithm, and of a simulated annealing optimization version of it (SA-Modk-prototypes), for computational integration of microarray, clinical chemistry and histopathology mixed numeric and categorical data. It also includes partitioning of APAP-treated biological samples into clusters which contain vetted expression and toxicological (VETed) k-prototypes features that distinguish between levels of necrosis of the hepatocytes and centrilobular region of the rat liver. Chapter 5 presents the conclusions of the research, development and analyses described in this dissertation.
- Comparative Genomic and Transcriptional Analyses of Magnaporthe oryzae and other Eukaryotes.(2011-09-12) Sailsbery, Joshua; Ignazio Carbone, Co-Chair; Ralph Dean, Co-Chair; Eric Stone, Member; Jeffrey Thorne, Member; Gary Payne, Member; Scott McCulloch, Graduate School Representative
- Computational Approaches for Aggregating, Analyzing, and Visualizing Multi-Modal Feature Data.(2024-06-07) Fleming, Jonathon F; David Reif, Chair; Gavin Conant, Member; Alison Motsinger-Reif, External; Stacy Supak, Member; Scott Belcher, Member
- Computational Approaches for Analyzing Complex Nontargeted Mass Spectrometry Datasets with Variable Degrees of Feature Annotation.(2024-04-15) Chappel, Jessie Rene; Fred Wright, Chair; Jacqueline Hughes-Oliver, Minor; Jung-Ying Tzeng, Member; Erin Baker, Inter-Institutional; David Reif, External
- Computational Biology of Ras Proteins(2008-04-07) Dellinger, Andrew Everette; William R. Atchley, Committee Chair; Carla Mattos, Committee Member; Jeffrey Thorne, Committee Member; Jon Doyle, Committee Member
In this research, computational biology is used to elucidate how evolutionary history has changed roles of structure and function among Ras proteins, with a focus on the Ras family. This dissertation begins with phylogenetic analyses of the Ras superfamily and Ras family. Phylogenetic trees of the Ras family were estimated using Neighbor-Joining, Weighted Neighbor-joining, Parsimony, Quartet Puzzling, Maximum Likelihood and Bayesian methods. In nearly all cases, each clade represented a subfamily. Clade members and clade divisions were consistent among all the trees, increasing the probability of a correct estimation of the evolutionary history. Further investigation into the evolution of sequence involved decomposing sequence covariation into its respective components. The roles of the functional and structural components of covariation were the focus of several multivariate analyses. Decision tree analysis, a data mining method, found that sequence divergence in critical sites of the hydrophobic core, dimerization regions and ligand binding regions were sufficient to divide Ras subfamilies. Alignments of GDP-bound and GTP-bound crystal structures revealed that only Ral and M-Ras proteins have structural variation in the effector binding switch I regions, while all Ras structures vary in the protein binding switch II region. Di-Ras2-GDP was shown to have a unique C-terminal loop which binds to the interswitch region. Last, a common factor analysis was computed. The factors contain the set of sites that both discriminate among the subfamilies and have a unique functional or structural role, such as Ral tree-determinant sites. Finally, sequence signatures were developed for each of the families of the Ras superfamily using Boltzmann-Shannon entropy. 
This method was compared to the PROSITE signature, profile hidden Markov model and MEME position-specific scoring matrix methods. The Entropy method identified approximately 8% fewer proteins than the best of the other methods, MEME. Comparative analyses of these sequence signatures determined which sites and amino acids played important roles in the changes in protein function and structure among Ras families.
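As a rough illustration of the entropy idea mentioned in the abstract above, the following Python sketch computes Shannon entropy for each column of a sequence alignment and reports low-entropy (highly conserved) columns as candidate signature sites. It is only a sketch under stated assumptions: the dissertation's actual signature construction is not described in this listing, and the function names `column_entropy` and `conserved_sites`, the 0.5-bit threshold, and the toy alignment are all hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # sum of p * log2(1/p) over the observed residues
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def conserved_sites(alignment, max_entropy=0.5):
    """Return (index, entropy) for columns with entropy <= max_entropy.

    `alignment` is a list of equal-length sequences; the threshold is an
    illustrative assumption, not a value taken from the dissertation.
    """
    sites = []
    # zip(*alignment) transposes sequences (rows) into sites (columns)
    for index, column in enumerate(zip(*alignment)):
        entropy = column_entropy(column)
        if entropy <= max_entropy:
            sites.append((index, entropy))
    return sites


# Toy, made-up alignment of four short sequences (not real Ras data).
toy_alignment = ["GAGK", "GAGR", "GSGK", "GAGK"]
sites = conserved_sites(toy_alignment)
```

Here the first and third columns are fully conserved, so they have zero entropy and are the only ones returned, while the two variable columns exceed the threshold.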
