
Estimating the number of clusters in a data set via the gap statistic

R Tibshirani, G Walther, T Hastie - Journal of the Royal …, 2001 - Wiley Online Library

Journal of the Royal Statistical Society: Series B (Statistical …, 2001 • Wiley Online Library

We propose a method (the ‘gap statistic’) for estimating the number of clusters (groups) in a set of data. The technique uses the output of any clustering algorithm (e.g. K‐means or hierarchical), comparing the change in within‐cluster dispersion with that expected under an appropriate reference null distribution. Some theory is developed for the proposal and a simulation study shows that the gap statistic usually outperforms other methods that have been proposed in the literature.
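The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a plain Lloyd-style K-means as the clustering algorithm, a uniform reference distribution over the bounding box of the data, and the commonly cited selection rule (the smallest k whose gap is at least the next gap minus its simulation error). The names `kmeans_dispersion`, `gap_statistic`, and `choose_k` are hypothetical, and the code assumes k is smaller than the number of distinct points so that the dispersion stays positive.

```python
import math
import random


def kmeans_dispersion(points, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares found over a few Lloyd restarts."""
    best = math.inf
    for _ in range(n_init):
        centers = rng.sample(points, k)  # random distinct data points as seeds
        for _ in range(n_iter):
            # Assign each point to its nearest center.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[j].append(p)
            # Move each center to its cluster mean (keep it if the cluster is empty).
            new_centers = [
                tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        w = sum(math.dist(p, centers[j]) ** 2
                for j, c in enumerate(clusters) for p in c)
        best = min(best, w)
    return best


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (gaps, errs) for k = 1..k_max; points is a list of equal-length tuples."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lo = [min(d) for d in dims]
    hi = [max(d) for d in dims]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_dispersion(points, k, rng))
        # Assumed null model: uniform draws over the data's bounding box.
        ref_logs = []
        for _ in range(n_refs):
            ref = [tuple(rng.uniform(a, b) for a, b in zip(lo, hi)) for _ in points]
            ref_logs.append(math.log(kmeans_dispersion(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)  # expected log-dispersion minus observed
        errs.append(sd * math.sqrt(1 + 1 / n_refs))  # simulation error term
    return gaps, errs


def choose_k(gaps, errs):
    """Smallest k with gap(k) >= gap(k+1) - err(k+1); falls back to k_max."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - errs[i + 1]:
            return i + 1
    return len(gaps)


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Two synthetic, well-separated blobs (made-up data for illustration only).
    data = [(data_rng.gauss(cx, 0.5), data_rng.gauss(0.0, 0.5))
            for cx in (0.0, 10.0) for _ in range(30)]
    gaps, errs = gap_statistic(data, k_max=4)
    print("estimated number of clusters:", choose_k(gaps, errs))
```

The sketch uses only the standard library so that it stays self-contained; in practice any clustering routine that reports within-cluster dispersion could be substituted for `kmeans_dispersion`, which matches the abstract's point that the technique works on the output of any clustering algorithm.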

