Browsing by Author "Hao Helen Zhang, Committee Member"
- Boosting methods for variable selection in high dimensional sparse models (2009-08-27) Hwang, Wook Yeon; Hao Helen Zhang, Committee Member; Howard Bondell, Committee Member; Wenbin Lu, Committee Member; Subhashis Ghosal, Committee Chair
  First, we propose new variable selection techniques for regression in high dimensional linear models based on forward selection versions of the LASSO, adaptive LASSO, and elastic net, called the forward iterative regression and shrinkage technique (FIRST), adaptive FIRST, and elastic FIRST, respectively. These methods seem to work better for extremely sparse high dimensional linear regression models. We exploit the fact that the LASSO, adaptive LASSO, and elastic net have closed form solutions when the predictor is one-dimensional. The explicit formula is then applied repeatedly in an iterative fashion until convergence occurs. By carefully considering the relationship between estimators at successive stages, we develop fast algorithms to compute our estimators. The performance of our new estimators is compared with that of commonly used estimators in terms of predictive accuracy and errors in variable selection. We observe that our approach has better prediction performance for highly sparse high dimensional linear regression models. Second, we propose a new variable selection technique for binary classification in high dimensional models based on a forward selection version of the Squared Support Vector Machines or one-norm Support Vector Machines, called the forward iterative selection and classification algorithm (FISCAL). This method seems to work better for highly sparse high dimensional binary classification models. We suggest squared support vector machines that use the 1-norm and 2-norm simultaneously. The squared support vector machines are convex and differentiable except at zero when the predictor is one-dimensional. An iterative forward selection approach is then applied along with the squared support vector machines until a stopping rule is satisfied. We also develop a recursive algorithm for the FISCAL to reduce the computational burden, and we apply the same procedure to the original one-norm Support Vector Machines. We compare the FISCAL with other widely used binary classification approaches with regard to prediction performance and selection accuracy. The FISCAL shows competitive prediction performance for highly sparse high dimensional binary classification models.
- Controlling Variable Selection By the Addition of Pseudo-Variables (2004-08-09) Wu, Yujun; Marc G. Genton, Committee Member; Leonard A. Stefanski, Committee Co-Chair; Dennis D. Boos, Committee Co-Chair; Hao Helen Zhang, Committee Member
  Many variable selection procedures have been developed in the literature for linear regression models. We propose a new and general approach, the False Selection Rate (FSR) method, to control variable selection with the advantage of being applicable to a broader class of regression models; for example, binary regression, Poisson regression, etc. By adding a number of pseudo-variables to the real set of data and monitoring the proportion of pseudo-variables falsely selected in the model, we are able to control the model false selection rate, selecting as many important variables as possible while selecting a relatively low proportion of false important ones. We focus on forward selection because it is applicable in the case where there are more variables than observations. Due to the difficulty of obtaining analytical results, we study our approach by Monte Carlo and compare it with a variety of commonly used procedures. We first focus on linear regression models, and then extend the approach to logistic regression models. The new method is illustrated on four real data sets.
- Methods for Accurate Analysis of High-Throughput Transcriptome Data (2009-11-30) Howard, Brian E; Steffen Heber, Committee Chair; David Bird, Committee Member; Dahlia Nielsen, Committee Member; Heike Winter-Sederoff, Committee Member; Hao Helen Zhang, Committee Member
  A detailed understanding of the transcriptome is a prerequisite for deciphering the flow of information from genotype to phenotype. Fortunately, modern high-throughput technologies now provide an unprecedented ability to observe the full complement of transcriptional events, which extend far beyond the classical "one gene, one protein" hypothesis to include alternatively spliced genes, microRNAs, RNA interference, anti-sense transcription, and a variety of other, until recently, unknown phenomena. However, in order to accurately interpret the results of these assays, new statistical and bioinformatic methods must be developed in parallel with biotechnological advances. In this thesis, we present several methods for improving the accuracy of inferences obtained from the high-throughput transcriptome data generated by these new technologies. First, we present a novel method for microarray quality assessment. Since accurate inference is dependent on the quality of the underlying data, quality assessment is a critical component in any microarray data analysis. Our method, which uses an unsupervised classifier to discriminate between high and low quality microarray datasets, exhibits performance comparable to supervised learners constructed using the same training data. However, because our approach requires only unannotated data, it is easy to customize and to keep up-to-date as technology evolves. Next, we present an alternative method for microarray quality assessment, which identifies low quality microarrays by simulating a set of differentially expressed genes. This method directly measures the ability of a planned statistical analysis to identify differential gene expression when suspected low quality arrays are included in the dataset. A key advantage of this approach is that, unlike other methods, it provides a specific recommendation about whether to retain or discard low quality chips in the context of a particular experimental setting. Finally, we introduce a procedure for accurately quantifying alternative splicing using RNA-Seq data. Our method uses a familiar linear models approach, but improves upon similar methods that assume uniform coverage of RNA-Seq reads along the targeted transcripts. We first show, through simulation, that using an incorrect read sampling distribution can lead to incorrect conclusions about the expression of isoforms in a mixture. Applying our method to an example dataset, we identify 438 differentially spliced genes, exhibiting a range of expression patterns including genes with switch-like differential splicing between two tissues, as well as genes with more subtle variations in isoform expression. Taken together, we expect that these methods can serve to increase the accuracy of inferences drawn from high-throughput transcriptome data, and in doing so, lead to an advancement of our understanding of the biology of genome expression.
- Robustness in Latent Variable Models (2006-07-13) Huang, Xianzheng; Marie Davidian, Committee Chair; Leonard A. Stefanski, Committee Co-Chair; Anastasios A. Tsiatis, Committee Member; Hao Helen Zhang, Committee Member
  Statistical models involving latent variables are widely used in many areas of application, such as biomedical science and social science. When likelihood-based parametric inferential methods are used to make statistical inference, certain distributional assumptions on the latent variables are often invoked. As latent variables are not observable, parametric assumptions on the latent variables cannot be verified directly using observed data. Even though semiparametric and nonparametric approaches have been developed to avoid making strong assumptions on the latent variables, parametric inferential approaches are still more appealing in many situations in terms of consistency and efficiency in estimation and computational burden. The goals of our study are to gain insight into the sensitivity of statistical inference to model assumptions on latent variables, and to develop methods for diagnosing latent-model misspecification that reveal whether the parametric inference is robust under certain latent-model assumptions. We refer to such robustness as latent-model robustness. We start with a simple class of latent variable models, the structural measurement error models. We define theoretical conditions under which a certain degree of latent-model robustness is achieved and study some special structural measurement error models analytically to gain insight into the sensitivity of inference to latent-model assumptions in these specific contexts. Then we borrow the idea of simulation-extrapolation (SIMEX), or remeasurement method, introduced by Cook and Stefanski (1994) to develop an empirical diagnostic tool that is able to reveal graphically whether or not robustness is attained under the imposed latent-variable assumptions. Testing procedures are proposed as a numerical supplement to the graphical diagnostic tool. These methods are then generalized and refined to adapt to a more complex class of latent variable models called joint models. For this generalization we focus on joint models that link a primary response, which can be a simple response or a censored time-to-event, to an error-prone longitudinal process. The performance of the proposed methods is demonstrated through application to simulated data and data from medical studies.
- Statistical Studies of Genomics Data (2004-12-28) Feng, Sheng; Zhao-Bang Zeng, Committee Chair; Bruce Weir, Committee Co-Chair; Leonard Stefanski, Committee Member; Hao Helen Zhang, Committee Member; Russell Wolfinger, Committee Member
  In recent years, genetics and genomics have become some of the most active fields in science. Genetic and genomic data have several significant and unique characteristics that pose great challenges for data analysis. Three statistical studies are presented in this dissertation. In chapter 1, an empirical Bayesian approach is developed in a linear mixed model for microarray data analysis. In chapter 2, a multiple order Markov chain model is applied to summarize the local correlation patterns among multiple genetic markers in linkage disequilibrium mapping. In chapter 3, a shrinkage method is developed to integrate biological prior knowledge presented in moment statistics. This new method may be useful in some genetic network studies.
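
Illustrative note on the first record above: its abstract relies on the fact that the LASSO has a closed form solution when the predictor is one-dimensional. The sketch below shows that closed form (soft-thresholding) and one simple way to apply it repeatedly, one predictor at a time. It is a generic greedy coordinate update written for illustration only and is not the FIRST algorithm from the thesis; the function names, the objective scaling `0.5*||y - Xb||^2 + lam*||b||_1`, and the stopping tolerance are all assumptions.

```python
import numpy as np


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)


def lasso_1d(x, y, lam):
    """Closed-form minimizer over scalar b of 0.5*||y - x*b||^2 + lam*|b|."""
    return soft_threshold(float(x @ y), lam) / float(x @ x)


def forward_componentwise_lasso(X, y, lam, max_iter=1000, tol=1e-10):
    """Greedy one-predictor-at-a-time updates (illustrative, not FIRST).

    At each step, every predictor is refit with the one-dimensional
    closed form against its partial residual, and only the update that
    lowers the penalized objective the most is applied.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        best_gain, best_j, best_b = 0.0, None, 0.0
        for j in range(p):
            xj = X[:, j]
            # Residual with predictor j's current contribution added back.
            partial = resid + xj * beta[j]
            b_new = lasso_1d(xj, partial, lam)
            new_resid = partial - xj * b_new
            old_obj = 0.5 * resid @ resid + lam * abs(beta[j])
            new_obj = 0.5 * new_resid @ new_resid + lam * abs(b_new)
            gain = old_obj - new_obj
            if gain > best_gain:
                best_gain, best_j, best_b = gain, j, b_new
        # Stop when no single-coordinate update improves the objective.
        if best_j is None or best_gain < tol:
            break
        resid = resid + X[:, best_j] * (beta[best_j] - best_b)
        beta[best_j] = best_b
    return beta
```

With an orthonormal design the greedy updates reduce to soft-thresholding each least-squares coefficient, which gives a quick sanity check: `forward_componentwise_lasso(np.eye(3), np.array([3.0, 0.5, -2.0]), lam=1.0)` leaves the small middle coefficient at zero and shrinks the other two by 1.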
