Variable Selection in Multiclass Support Vector Machine and Applications in Genomic Data Analysis

Title: Variable Selection in Multiclass Support Vector Machine and Applications in Genomic Data Analysis
Author: Huang, Lingkang
Advisors: Dr. Zhao-Bang Zeng, Committee Chair
Dr. Hao Helen Zhang, Committee Co-Chair
Abstract: Microarray techniques provide new insights into cancer diagnosis using gene expression profiles. Molecular diagnosis based on high-throughput genomic data sets presents a major challenge due to the overwhelming number of variables and the complex multi-class nature of tumor samples. In this thesis, the author first tackled a multi-class problem related to liver toxicity severity prediction using Random Forest and GEMS-SVM (Gene Expression Model Selector using Support Vector Machine). However, the original SVM regularization formulation does not accommodate variable selection. Most existing approaches, including GEMS-SVM, handle this issue by selecting genes prior to classification, which ignores the correlation among genes because the genes are selected by univariate ranking. In this thesis, the author developed new multi-class SVM (MSVM) approaches that perform multi-class classification and variable selection simultaneously and learn optimal classifiers by considering all classes and all genes at the same time. The original multi-class SVM proposed by Crammer and Singer (2001) does not perform variable selection. Using the MSVM loss function proposed by Crammer and Singer (2001), the author developed new variable selection approaches for both linear and non-linear classification problems. For linear classification problems, four different sparse regularization terms were each included in the objective function in turn. For non-linear classification problems, two approaches were developed: the first handles non-linear MSVMs via basis function transformation, and the second handles them via kernel functions. The newly developed methods were applied to both simulated and real data sets. The results demonstrated that the new methods could select a much smaller number of genes than other existing methods while predicting tumor subtypes with high classification accuracy.
The combination of high accuracy and a small number of selected genes makes the new methods powerful tools for disease diagnostics based on expression data and for identifying targets for therapeutic intervention.
Date: 2009-03-04
Degree: PhD
Discipline: Bioinformatics

Files in this item

Files Size Format View
etd.pdf 1.258Mb PDF View/Open
