Clustering of Mixed Data Types with Application to Toxicogenomics

Abstract

DNA microarray analysis provides unprecedented capabilities for simultaneous measurement of genome-wide alterations in transcription levels. Toxicogenomics bridges gene and protein expression analyses with conventional toxicology to elucidate a global view of the toxic outcomes and mechanistic changes that toxicant exposure and environmental stressors elicit in biological systems. Inherent in toxicogenomics data are systematic error, stochastic variation, and disparate measurement domains and types, which complicate the acquisition of significant, meaningful and broad biological interpretations from analysis of the data. In this dissertation, a classification regimen composed of analysis of replicate data, outlier diagnostics and gene selection procedures was employed to utilize microarray data for categorization of sub-classes of biological samples exposed to pharmacologic agents. To assess contrasts of centrilobular congestion severity of the rat liver subsequent to exposure to acetaminophen (APAP), microarray data, clinical chemistry evaluations and histopathology observations were integrated in a database and analyzed using mixed linear model approaches. Finally, the k-prototypes algorithm was modified (Modk-prototypes) to the specifications of k-means clustering. Its mixed objective function is the sum of two terms: the squared Euclidean distance, which measures the dissimilarity of samples on the numeric microarray and clinical chemistry features, and simple matching, which measures the dissimilarity of samples on the categorical histopathology features. In addition, the objective function included weighting terms for the microarray, clinical chemistry and histopathology domain data in order to computationally integrate the data as well as constrain the clustering of the APAP-treated samples according to similarity of gene expression and toxicological profiles.
Simulated annealing optimization of the Modk-prototypes algorithm (SA-Modk-prototypes) was used to validate the clustering of the APAP-treated samples. The clusters were vetted for gene expression and toxicological (VETed) k-prototypes features that discerned clusters from one another. The VETed k-prototypes are shown to be ideal for distinguishing between zero, minimal, and moderate levels of necrosis of the hepatocytes and centrilobular region of the rat liver that are end-point representations of the clusters of APAP-treated samples. In this dissertation, chapter 1 is an introduction to general toxicology; microarray gene expression platforms, experimental designs, preprocessing of the data and gene selection approaches; toxicogenomics as it applies to compound classification and phenotypic anchoring of gene expression; databases and informatics resources for toxicogenomics; and clustering of mixed data types. Chapter 2 is dedicated to statistical validation and significance of differentially expressed genes as well as sub-categorization of samples exposed to phenobarbital and the peroxisome proliferators clofibrate, gemfibrozil and Wyeth 14,643. Chapter 3 presents integration of microarray data with clinical chemistry and histopathology data to contrast levels of centrilobular congestion of the rat liver by mixed linear modeling of gene expression ratio values acquired from rats exposed to APAP. Chapter 4 describes the use of a modified k-prototypes (Modk-prototypes) objective function and algorithm, and of a simulated annealing optimization version of that objective function (SA-Modk-prototypes), for computational integration of mixed numeric and categorical microarray, clinical chemistry and histopathology data.
Chapter 4 also includes partitioning of APAP-treated biological samples into clusters which contain vetted expression and toxicological (VETed) k-prototypes features that distinguish between levels of necrosis of the hepatocytes and centrilobular region of the rat liver. Chapter 5 concludes the research, development and analyses presented in this dissertation.
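
The mixed objective function described in the abstract can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the function name, the argument names, and the use of one scalar weight per data domain combined linearly are assumptions made for illustration, since the abstract does not give the exact form of the weighting terms. The sketch computes the dissimilarity between one sample and one cluster prototype as a weighted sum of squared Euclidean distances over the numeric microarray and clinical chemistry features and a simple-matching count of mismatches over the categorical histopathology features.

```python
import numpy as np


def mixed_dissimilarity(x_array, x_chem, x_histo,
                        p_array, p_chem, p_histo,
                        w_array=1.0, w_chem=1.0, w_histo=1.0):
    """Weighted dissimilarity between a sample (x_*) and a prototype (p_*).

    All names and the per-domain scalar weights are hypothetical; the
    abstract only states that the objective includes weighting terms for
    the three data domains.
    """
    # Numeric domains: squared Euclidean distance.
    d_array = float(np.sum((np.asarray(x_array, dtype=float)
                            - np.asarray(p_array, dtype=float)) ** 2))
    d_chem = float(np.sum((np.asarray(x_chem, dtype=float)
                           - np.asarray(p_chem, dtype=float)) ** 2))
    # Categorical domain: simple matching, i.e. the number of features
    # whose category differs between sample and prototype.
    d_histo = sum(a != b for a, b in zip(x_histo, p_histo))
    return w_array * d_array + w_chem * d_chem + w_histo * d_histo


# Toy values, invented for illustration only.
score = mixed_dissimilarity(
    x_array=[1.0, 2.0], x_chem=[3.0], x_histo=["minimal", "none"],
    p_array=[0.0, 0.0], p_chem=[1.0], p_histo=["minimal", "moderate"],
    w_array=1.0, w_chem=0.5, w_histo=2.0,
)
print(score)
```

In a k-prototypes style procedure, each sample would be assigned to the prototype with the smallest such dissimilarity, and the clustering objective would be the sum of these values over all samples. Raising a domain's weight makes agreement in that domain count for more when samples are grouped.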

Keywords

toxicology, database, gene expression, toxicogenomics, genomic sciences, microarray, clustering, statistics, bioinformatics

Degree

PhD

Discipline

Bioinformatics
