A framework for feature selection in clustering

DM Witten, R Tibshirani - Journal of the American Statistical …, 2010 - Taylor & Francis
Journal of the American Statistical Association, 2010 - Taylor & Francis
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated and genomic data.
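The abstract does not state the criterion itself, so the following is only a minimal sketch of how a lasso-type penalty can drive feature selection inside K-means. It assumes one common formulation: nonnegative feature weights w are chosen to maximize the weighted between-cluster sum of squares subject to an L2 bound of 1 and an L1 bound s, alternating with K-means on the reweighted features. The function names (`soft_threshold`, `update_weights`, `kmeans`, `sparse_kmeans`), the bound `s`, and all defaults are illustrative assumptions, not the authors' code.

```python
import numpy as np

def soft_threshold(a, delta):
    # Shrink each entry toward zero by delta; entries below delta become 0.
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def _unit(v, fallback):
    # Scale v to unit L2 norm, or return fallback if v is all zeros.
    n = np.linalg.norm(v)
    return v / n if n > 0 else fallback

def update_weights(bcss, s, n_bisect=50):
    # Assumed weight step: maximize sum_j w_j * bcss_j subject to
    # ||w||_2 <= 1, ||w||_1 <= s, w >= 0 (s >= 1 is assumed).
    a = np.maximum(bcss, 0.0)
    if a.max() == 0:
        return np.full(a.size, 1.0 / np.sqrt(a.size))
    one_hot = np.zeros(a.size)
    one_hot[a.argmax()] = 1.0
    w = _unit(a, one_hot)
    if w.sum() <= s:
        return w
    # Bisection on the threshold until the L1 bound is met.
    lo, hi = 0.0, a.max()
    for _ in range(n_bisect):
        delta = 0.5 * (lo + hi)
        w = _unit(soft_threshold(a, delta), one_hot)
        if w.sum() > s:
            lo = delta
        else:
            hi = delta
    return _unit(soft_threshold(a, hi), one_hot)

def kmeans(X, k, rng, n_iter=50, n_init=5):
    # Plain Lloyd iterations with random restarts; returns the best labels.
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels, cost = d.argmin(axis=1), d.min(axis=1).sum()
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels

def sparse_kmeans(X, k, s, n_outer=10, seed=0):
    # Alternate between clustering on reweighted features and reweighting.
    rng = np.random.default_rng(seed)
    X = X - X.mean(axis=0)
    p = X.shape[1]
    w = np.full(p, 1.0 / np.sqrt(p))
    tss = (X ** 2).sum(axis=0)  # per-feature total sum of squares
    for _ in range(n_outer):
        labels = kmeans(X * np.sqrt(w), k, rng)
        wcss = np.zeros(p)
        for c in range(k):
            Xc = X[labels == c]
            if len(Xc):
                wcss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        w = update_weights(tss - wcss, s)  # between-cluster SS per feature
    return labels, w

# Toy data (synthetic, for illustration only): 2 informative features of 20.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 20))
X[:30, :2] += 3.0
X[30:, :2] -= 3.0
labels, w = sparse_kmeans(X, k=2, s=1.5)
```

In this sketch a single objective ties the two steps together: the weights decide which features the clustering sees, and the clustering decides which features earn weight. Smaller values of `s` push more weights to exactly zero; here `s` is set by hand.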