Estimating the Number of Clusters in Cluster Analysis

Date

2007-03-08

Abstract

In many applied fields of study, such as medicine, psychology, ecology, taxonomy and finance, one has to deal with massive amounts of noisy but structured data. A question that often arises in this context is whether the observations in these data fall into "natural" groups and, if so, how many. This dissertation proposes a new quantity, called the maximal jump function, for assessing the number of groups in a data set. The estimated maximal jump function measures the excess transformed distortion attainable by fitting an extra cluster to a data set. By distortion, we mean the average distance between each observation and its nearest cluster center. Distortion $d_g$, in this sense, is a measure of the error incurred by fitting $g$ clusters to a data set. Three stopping rules based on the maximal jump function are proposed for determining the number of groups in a data set. A new procedure for clustering data sets with a common covariance structure is also introduced. The proposed methods are tested on a wide variety of real data, including DNA microarray data sets, as well as on high-dimensional simulated data possessing numerous "noisy" features/dimensions. To show the effectiveness of the proposed methods, comparisons are also made to some well-known clustering methods.
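As a rough illustration of the distortion defined in the abstract, the sketch below computes the average distance from each observation to its nearest cluster center for a toy data set. It assumes Euclidean distance, and the function name `distortion`, the data points and the centers are hypothetical choices made for this example. The abstract does not specify the transformation or the exact form of the maximal jump function, so neither is reproduced here.

```python
import math

def distortion(points, centers):
    """Average Euclidean distance from each point to its nearest center.

    Euclidean distance is an assumption; the abstract only says "distance".
    """
    nearest = (min(math.dist(p, c) for c in centers) for p in points)
    return sum(nearest) / len(points)

# Hypothetical toy data: two tight groups on a line.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]

# g = 1: a single center at the overall mean.
d1 = distortion(points, [(6.0,)])
# g = 2: one center at the mean of each group.
d2 = distortion(points, [(1.0,), (11.0,)])

print(d1, d2)
```

With these values, the distortion falls sharply when the second center is added, which is the kind of drop that a rule based on successive distortions would look for.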

Keywords

High-dimensional Data, Noise Features, Jump Function, Distortion, Cluster Analysis

Degree

PhD

Discipline

Statistics
