Regression via Clustering using Dirichlet Mixtures

dc.contributor.advisor: Hao H. Zhang, Committee Member [en_US]
dc.contributor.advisor: Subhashis Ghosal, Committee Chair [en_US]
dc.contributor.advisor: John F. Monahan, Committee Member [en_US]
dc.contributor.advisor: Sujit K. Ghosh, Committee Member [en_US]
dc.contributor.author: Kang, Changku [en_US]
dc.date.accessioned: 2010-04-02T18:37:15Z
dc.date.available: 2010-04-02T18:37:15Z
dc.date.issued: 2005-12-06 [en_US]
dc.degree.discipline: Statistics [en_US]
dc.degree.level: dissertation [en_US]
dc.degree.name: PhD [en_US]
dc.description.abstract: Regression analysis is a fundamental problem of statistics. When the regression function has an unknown form, parametric analysis is sometimes inappropriate, and the regression function should instead be estimated by nonparametric methods. Often, the regressor variable is sampled from several different subpopulations, and the regression function has a different form depending on the source. The labels of these source subpopulations are not observable. Although a nonparametrically specified regression function can capture the overall regression function, nonparametric regression estimates usually depend on the assumption of homoscedasticity of additive errors. If the underlying distribution of X has unknown clusters, the usual homoscedasticity assumption does not hold. To estimate the regression function, we propose first finding clusters in the regressor variables with a Dirichlet mixture in order to impute the lost subpopulation labels. A standard regression method such as linear or polynomial regression may then be used within each cluster. A Markov chain Monte Carlo (MCMC) sampling method is used to find the clusters, and for each MCMC sample the estimated regression functions can be obtained. We also apply our method to the large p, small n problem, where the number of variables p is much greater than the number of samples n. In several simulation experiments, our method is compared with other methods: kernel regression and smoothing splines in the univariate case, and GAM (generalized additive model) and MARS (Multivariate Adaptive Regression Splines) in the multivariate case. The consistency issue is discussed without explicit proof. [en_US]
dc.identifier.other: etd-11022005-230329 [en_US]
dc.identifier.uri: http://www.lib.ncsu.edu/resolver/1840.16/3822
dc.rights: I hereby certify that, if appropriate, I have obtained and attached hereto a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to NC State University or its agents the non-exclusive license to archive and make accessible, under the conditions specified below, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report. [en_US]
dc.subject: Bayesian [en_US]
dc.subject: clustering [en_US]
dc.subject: Dirichlet mixtures [en_US]
dc.title: Regression via Clustering using Dirichlet Mixtures [en_US]
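
The two-stage idea in the abstract (cluster the regressors with a Dirichlet mixture, then fit a standard regression inside each cluster) can be illustrated with a minimal sketch. This is not the dissertation's implementation. It assumes a univariate regressor, a Gaussian mixture with known component variance, a conjugate normal prior on the component means, and a collapsed Gibbs sampler over cluster labels. It also uses only the labels from the final sweep, rather than combining estimates across MCMC samples. All names (`crp_gibbs`, `fit_line`, `alpha`, `sigma2`, `mu0`, `tau2`) and the simulated data are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def crp_gibbs(x, alpha=1.0, sigma2=0.25, mu0=0.0, tau2=25.0, sweeps=50, seed=0):
    """Collapsed Gibbs sampler for a Dirichlet process mixture of 1-D Gaussians.

    Assumed model: x_i | cluster mean m ~ N(m, sigma2), m ~ N(mu0, tau2),
    with concentration parameter alpha. Returns the labels of the last sweep.
    """
    rng = random.Random(seed)
    z = [0] * len(x)                  # start with every point in one cluster
    counts, sums = {0: len(x)}, {0: sum(x)}
    next_label = 1
    for _ in range(sweeps):
        for i, xi in enumerate(x):
            k = z[i]                  # remove point i from its cluster
            counts[k] -= 1
            sums[k] -= xi
            if counts[k] == 0:
                del counts[k], sums[k]
            labels, weights = [], []
            for k in counts:          # existing clusters: size * predictive density
                post_var = 1.0 / (1.0 / tau2 + counts[k] / sigma2)
                post_mean = post_var * (mu0 / tau2 + sums[k] / sigma2)
                labels.append(k)
                weights.append(counts[k] * normal_pdf(xi, post_mean, post_var + sigma2))
            labels.append(next_label)  # new cluster: alpha * prior predictive density
            weights.append(alpha * normal_pdf(xi, mu0, tau2 + sigma2))
            k = rng.choices(labels, weights=weights)[0]
            if k == next_label:
                counts[k], sums[k] = 0, 0.0
                next_label += 1
            z[i] = k
            counts[k] += 1
            sums[k] += xi
    return z


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    if sxx == 0.0:                    # single point or constant x: flat line
        return my, 0.0
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return my - slope * mx, slope


# Simulated data (hypothetical): two hidden subpopulations with different lines.
rng = random.Random(1)
xs = [rng.gauss(-3, 0.5) for _ in range(40)] + [rng.gauss(3, 0.5) for _ in range(40)]
ys = [(1 + 2 * x if i < 40 else -1 - x) + rng.gauss(0, 0.1) for i, x in enumerate(xs)]

z = crp_gibbs(xs)                     # stage 1: impute subpopulation labels
fits = {}                             # stage 2: one regression per cluster
for k in set(z):
    idx = [i for i in range(len(xs)) if z[i] == k]
    fits[k] = (len(idx),) + fit_line([xs[i] for i in idx], [ys[i] for i in idx])

for k, (size, intercept, slope) in sorted(fits.items()):
    print(f"cluster {k}: n={size}, intercept={intercept:.2f}, slope={slope:.2f}")
```

A single global fit to these data would blur the two lines together, whereas the per-cluster fits can recover each one. The sampler may also leave a few very small clusters, so the number of clusters is not fixed in advance.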

Files

Original bundle

Name: etd.pdf
Size: 524.77 KB
Format: Adobe Portable Document Format