Estimating the number of clusters in a data set via the gap statistic

R Tibshirani, G Walther, T Hastie - Journal of the Royal …, 2001 - Wiley Online Library
Journal of the Royal Statistical Society: Series B (Statistical …, 2001 - Wiley Online Library
We propose a method (the 'gap statistic') for estimating the number of clusters (groups) in a
set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or
hierarchical), comparing the change in within‐cluster dispersion with that expected under an
appropriate reference null distribution. Some theory is developed for the proposal and a
simulation study shows that the gap statistic usually outperforms other methods that have
been proposed in the literature.
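The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. The abstract does not give the formulas, so the following are assumptions made here: dispersion is measured as the log of the pooled within-cluster sum of squares, the reference null distribution is uniform over the bounding box of the data, and the chosen k is the smallest one whose gap is at least the next gap minus its simulation error. All function names (init_centers, kmeans_dispersion, gap_statistic, choose_k, demo_points) are hypothetical.

```python
import math
import random


def init_centers(points, k, rng):
    """Farthest-first seeding: a random point, then repeatedly the point farthest from all centers."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans_dispersion(points, k, rng, iters=50):
    """Run plain Lloyd iterations and return the within-cluster sum of squares."""
    centers = init_centers(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (gaps, errs) for k = 1..k_max under the assumptions stated in the lead-in."""
    rng = random.Random(seed)
    points = [tuple(p) for p in points]
    lows = [min(x) for x in zip(*points)]
    highs = [max(x) for x in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref_logs = []
        for _ in range(n_refs):
            # Assumed reference null: uniform draws over the data's bounding box.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        # Gap: expected log dispersion under the null minus the observed log dispersion.
        gaps.append(mean_ref - math.log(kmeans_dispersion(points, k, rng)))
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    return gaps, errs


def choose_k(gaps, errs):
    """Assumed rule: smallest k with gap(k) >= gap(k+1) - err(k+1); falls back to the largest k."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return len(gaps)


def demo_points(seed=1, per_blob=30):
    """Synthetic data for illustration only: three well-separated 2-D Gaussian blobs."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [
        (rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in blob_centers
        for _ in range(per_blob)
    ]
```

A typical call would be gaps, errs = gap_statistic(demo_points(), k_max=5) followed by choose_k(gaps, errs). Because the technique only needs a dispersion value from the clustering step, kmeans_dispersion could be swapped for a hierarchical method without changing the rest of the sketch.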