Variable Selection in Multiclass Support Vector Machine and Applications in Genomic Data Analysis

Date

2009-03-04

Abstract

Microarray techniques provide new insights into cancer diagnosis using gene expression profiles. Molecular diagnosis based on high-throughput genomic data sets presents a major challenge because of the overwhelming number of variables and the complex multi-class nature of tumor samples. In this thesis, the author first tackled a multi-class problem, the prediction of liver toxicity severity, using Random Forest and GEMS-SVM (Gene Expression Model Selector using Support Vector Machine). However, the original SVM regularization formulation does not accommodate variable selection. Most existing approaches, including GEMS-SVM, handle this issue by selecting genes prior to classification; because the genes are selected by univariate ranking, the correlation among genes is not considered. The author therefore developed new multi-class SVM (MSVM) approaches that perform multi-class classification and variable selection simultaneously and learn optimal classifiers by considering all classes and all genes at the same time. The original multi-class SVM proposed by Crammer and Singer (2001) does not perform variable selection. Using the MSVM loss function proposed by Crammer and Singer (2001), the author developed new variable selection approaches for both linear and non-linear classification problems. For linear classification problems, four different sparse regularization terms were each included in the objective function in turn. For non-linear classification problems, two approaches were developed: the first handles non-linear MSVMs via basis function transformation, and the second handles non-linear MSVMs via kernel functions. The newly developed methods were applied to both simulated and real data sets. The results demonstrated that our methods could select a much smaller number of genes than other existing methods while retaining high classification accuracy in predicting tumor subtypes. The combination of high accuracy and a small number of genes makes our new methods powerful tools for disease diagnostics based on expression data and for identifying targets for therapeutic intervention.
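
The idea of combining the Crammer and Singer loss with a sparse penalty in the linear case can be illustrated with a minimal sketch. The abstract does not name the four regularization terms or the optimization method, so everything specific here is an assumption made for illustration: an L1 (lasso) penalty stands in for the sparse term, a proximal subgradient loop stands in for the solver, and the function names (`crammer_singer_loss`, `soft_threshold`, `fit_sparse_msvm`) and default parameter values are hypothetical.

```python
import numpy as np


def crammer_singer_loss(W, X, y):
    """Mean Crammer-Singer multiclass hinge loss.

    W: (n_classes, n_features) weights, X: (n_samples, n_features),
    y: integer class labels in [0, n_classes).
    """
    n = X.shape[0]
    scores = X @ W.T                                   # (n_samples, n_classes)
    margins = scores - scores[np.arange(n), y][:, None] + 1.0
    margins[np.arange(n), y] = 0.0                     # true class adds no violation
    return margins.max(axis=1).mean()                  # worst violation per sample


def soft_threshold(W, t):
    """Proximal operator of t * ||W||_1: shrink entries toward zero."""
    return np.sign(W) * np.maximum(np.abs(W) - t, 0.0)


def fit_sparse_msvm(X, y, n_classes, lam=0.05, lr=0.1, n_iter=500):
    """Assumed illustration: L1-penalized Crammer-Singer MSVM fitted by
    proximal subgradient steps (not necessarily the thesis's solver)."""
    n, p = X.shape
    W = np.zeros((n_classes, p))
    rows = np.arange(n)
    for _ in range(n_iter):
        scores = X @ W.T
        margins = scores - scores[rows, y][:, None] + 1.0
        margins[rows, y] = 0.0
        worst = margins.argmax(axis=1)                 # most-violating class
        active = margins[rows, worst] > 0              # samples with nonzero loss
        G = np.zeros_like(W)
        np.add.at(G, worst[active], X[active])         # push violating class down
        np.add.at(G, y[active], -X[active])            # pull true class up
        W = soft_threshold(W - lr * G / n, lr * lam)   # gradient step, then shrink
    return W


def selected_features(W, tol=1e-8):
    """Indices of features with a nonzero weight in at least one class."""
    return np.flatnonzero(np.abs(W).max(axis=0) > tol)
```

Because the penalty acts on all classes and all features in a single objective, a feature is dropped only when its weight is shrunk to zero for every class, which is the sense in which classification and selection happen simultaneously rather than through a separate univariate ranking step.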

Keywords

multi-class classification, support vector machine, microarray, variable selection

Degree

PhD

Discipline

Bioinformatics
Statistics