Variable Selection in Multiclass Support Vector Machine and Applications in Genomic Data Analysis
Date
2009-03-04
Authors
Huang, Lingkang
Abstract
Microarray techniques provide new insights into cancer diagnosis using gene expression profiles. Molecular diagnosis based on high-throughput genomic data sets presents a major challenge due to
the overwhelming number of variables and the complex multi-class nature of tumor samples. In this thesis, the author first tackled a multi-class problem related to liver toxicity severity
prediction using Random Forest and GEMS-SVM (Gene Expression Model Selector using Support Vector Machine). However, the original SVM regularization formulation does not accommodate variable selection. Most existing approaches, including GEMS-SVM, handle this issue by selecting genes prior to classification,
which ignores the correlation among genes because the genes are selected by univariate ranking. In this thesis, the author
developed new multi-class SVM (MSVM) approaches which can perform multi-class classification and variable selection simultaneously
and learn optimal classifiers by considering all classes and all genes at the same time. The original multi-class SVM proposed by
Crammer and Singer (2001) does not perform variable selection. Using the MSVM loss function proposed by Crammer and Singer
(2001), the author developed new variable selection approaches for both linear and non-linear classification problems. For linear
classification problems, four different sparse regularization terms were each included in the objective function in turn. For
non-linear classification problems, two approaches were developed: the first handles non-linear MSVMs via basis function
transformation, and the second handles non-linear MSVMs via kernel functions. The newly developed methods were applied to both simulated and real
data sets. The results demonstrated that our methods could select a much smaller number of genes than other existing
methods while maintaining high classification accuracy in predicting tumor subtypes. The combination of high accuracy and a small number of
genes makes our new methods powerful tools for disease diagnostics based on expression data and for identifying targets for
therapeutic intervention.
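
As a rough illustration of combining the Crammer and Singer (2001) multi-class hinge loss with a sparse penalty, the following Python sketch evaluates such an objective for a linear classifier. The L1 penalty, the function names, and the tolerance are assumptions made for illustration only; the abstract does not say which four regularization terms the thesis uses, and this is not the author's implementation.

```python
import numpy as np


def crammer_singer_loss(W, X, y):
    """Mean Crammer-Singer multi-class hinge loss.

    W: (n_classes, n_genes) weight matrix
    X: (n_samples, n_genes) expression matrix
    y: (n_samples,) integer class labels in [0, n_classes)
    """
    scores = X @ W.T                      # (n_samples, n_classes)
    rows = np.arange(X.shape[0])
    true_scores = scores[rows, y]
    # Add a unit margin to every class except the true one.
    margins = scores + 1.0
    margins[rows, y] -= 1.0
    # Per-sample loss: max(0, 1 + max_{k != y} score_k - score_y).
    return np.mean(margins.max(axis=1) - true_scores)


def l1_penalized_objective(W, X, y, lam):
    """Loss plus an L1 penalty (an assumed example of a sparse term)."""
    return crammer_singer_loss(W, X, y) + lam * np.abs(W).sum()


def selected_genes(W, tol=1e-8):
    """Indices of genes whose weight is non-zero for at least one class."""
    return np.flatnonzero(np.abs(W).max(axis=0) > tol)


# Toy data: two samples, two genes, two classes (hypothetical values).
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([0, 1])
W = np.array([[1.0, 0.0], [-1.0, 0.0]])

objective = l1_penalized_objective(W, X, y, lam=0.5)
genes = selected_genes(W)
```

Because the penalty acts on the weights of all classes jointly, a gene whose column of `W` is driven to zero is dropped from every class at once, which is the sense in which classification and variable selection happen simultaneously.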
Keywords
multi-class classification, support vector machine, microarray, variable selection
Degree
PhD
Discipline
Bioinformatics
Statistics