Unsupervised Cluster Analysis - K-means Clustering
This vignette demonstrates unsupervised cluster analysis. Wikipedia describes cluster analysis as the process of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).
Cluster analysis is widely used in business, traditionally in marketing for market (customer) segmentation. In market segmentation, cluster analysis identifies the similarities and differences across customer segments in product/service preferences and customer behaviour. It is also widely used in other fields and disciplines to observe similarities and differences in data.
This vignette conducts an unsupervised cluster analysis on a data set of 616 individual respondents. The analysis aims to assign respondents to clusters based on their emotional attributes towards organisational change.
First, the data frame was pre-processed and the variables were reviewed for statistical integrity. For cluster analysis, it is generally necessary to normalise or standardise the variables, particularly when they use different measurement scales. Standardisation was unnecessary in this instance, as all data were gathered using the same seven-point scale. Nevertheless, normalisation was carried out.
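The difference between normalising and standardising can be illustrated with a minimal Python sketch. It is not the code behind this vignette: the function names `min_max` and `z_score` and the four example responses are hypothetical, and the only detail taken from the text is the seven-point scale.

```python
from statistics import mean, pstdev

def min_max(values, lo=1, hi=7):
    """Normalise responses from the [lo, hi] scale to the range [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardise responses to mean 0 and (population) standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]

# Hypothetical responses to one item on a seven-point scale.
responses = [1, 4, 7, 4]

normalised = min_max(responses)    # [0.0, 0.5, 1.0, 0.5]
standardised = z_score(responses)  # centred on 0 with unit spread

print(normalised)
print([round(z, 3) for z in standardised])
```

Normalisation keeps every variable inside the same fixed range, whereas standardisation rescales by the observed spread. When all items already share one scale, either choice mainly affects how the axes are read rather than which respondents are grouped together.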
The next important step was choosing the segmentation variables to build the clusters. For this vignette, the segmentation variable was employee emotional attributes towards organisational change.
Two techniques were used to compute and verify the optimal number of clusters for this data set. Finally, respondents were visualised in the optimal number of clusters based on their emotional attributes towards organisational change.
Chart 1 is an example of exploratory clustering. Individual respondents are plotted on each sub-chart and coloured according to the predicted cluster. The number of clusters ranges from 1 to 10 across the sub-charts, with the centre marked for each cluster. From these sub-charts alone, it is difficult to assess the optimal number of clusters with any degree of confidence.
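Exploratory clustering of this kind can be sketched in a few lines of Python. The block below is an illustration under stated assumptions, not the analysis behind Chart 1: it uses a plain Lloyd's-algorithm k-means written from scratch, six made-up two-dimensional points instead of the 616 respondents, and hypothetical helper names (`sq_dist`, `kmeans`, `total_wss`).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return labels, centres

def total_wss(points, labels, centres):
    """Total within-cluster sum of squares."""
    return sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))

# Made-up data: two visibly separate groups of three points.
points = [(1.0, 1.5), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.5), (6.5, 6.0), (7.0, 7.0)]

wss_by_k = {}
for k in range(1, 6):
    labels, centres = kmeans(points, k)
    wss_by_k[k] = total_wss(points, labels, centres)
    print(k, round(wss_by_k[k], 2))

labels_2, centres_2 = kmeans(points, 2)
```

The within-cluster sum of squares generally keeps shrinking as clusters are added, which is one reason a table or grid of solutions on its own rarely settles the question of how many clusters to keep.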
Rather than rely on a visual assessment, Chart 2 uses the silhouette method to determine the optimal number of clusters.
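As a rough illustration of how the silhouette method scores a clustering, the Python sketch below computes the average silhouette width for two candidate solutions and keeps the one with the higher value. It is a sketch under assumptions rather than the code behind Chart 2: the points and the candidate label vectors are made up, the labels are taken as given rather than produced by a clustering step, and `silhouette_width` is a hypothetical helper name.

```python
from math import dist
from statistics import mean

def silhouette_width(points, labels):
    """Average silhouette width; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(mean(dist(p, q) for q in other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)

# Made-up data and hand-written candidate labelings for k = 2 and k = 3.
points = [(1.0, 1.5), (1.5, 2.0), (2.0, 1.0),
          (6.0, 6.5), (6.5, 6.0), (7.0, 7.0)]
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}

widths = {k: silhouette_width(points, labels)
          for k, labels in candidates.items()}
best_k = max(widths, key=widths.get)

for k, w in widths.items():
    print(k, round(w, 3))
print("best k:", best_k)
```

Values close to 1 indicate points that sit well inside their own cluster and far from the next one, so the number of clusters with the highest average width is taken as the preferred solution.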
Because each method has its own strengths and limitations, different methods may suggest slightly different optimal numbers of clusters. When this occurs, another option is to take the number of clusters supported by the largest number of methods (maximum consensus). Chart 3 shows the results of the consensus method for determining the optimal number of clusters. The optimal number of clusters for this data set was two.
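The idea of a maximum consensus can be sketched as a simple majority vote. The Python block below is illustrative only: the method names and the number of clusters each one recommends are hypothetical values chosen for the example, not the results shown in Chart 3, and breaking ties in favour of the smaller number of clusters is an assumption of this sketch.

```python
from collections import Counter

def consensus_k(recommendations):
    """Return the most frequently recommended k and the vote counts."""
    votes = Counter(recommendations.values())
    top = max(votes.values())
    # Assumption: ties are broken in favour of the smaller number of clusters.
    best = min(k for k, n in votes.items() if n == top)
    return best, votes

# Hypothetical recommendations from five methods (illustrative values only).
recommendations = {
    "silhouette": 2,
    "elbow": 3,
    "gap_statistic": 2,
    "calinski_harabasz": 2,
    "davies_bouldin": 4,
}

best_k, votes = consensus_k(recommendations)
print(best_k, dict(votes))
```

A vote like this treats every method as equally trustworthy, so it is most useful as a cross-check on a single method rather than as a replacement for judgement about which solution is interpretable.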
Based on Charts 2 and 3, the following charts illustrate how individual respondents are assigned to the optimal number of clusters. Respondent id has been removed from each data point to declutter the charts. Chart 4 is plotted with jitter, whereas Chart 5 has no jitter.
In this example of unsupervised cluster analysis, respondents in the green cluster of Charts 4 and 5 generally reported positive emotional attributes towards organisational change. They were generally supportive of change, as shown by the cluster centre. In contrast, the respondents in the red cluster of Charts 4 and 5 mainly reported negative emotional attributes towards organisational change. They were more likely to oppose the change, as indicated by the cluster centre.
In addition to this example of unsupervised cluster analysis, see the vignette on supervised cluster analysis where the target variable is known.
Reference:
Emotion was measured using ‘A semantic differential mood scale’ by Lorr and Wunderlich, published in the Journal of Clinical Psychology.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-29
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.4.0)
## backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.0)
## bayestestR 0.14.0 2024-07-24 [1] CRAN (R 4.4.1)
## broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## car 3.1-2 2023-03-30 [1] CRAN (R 4.4.0)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.4.0)
## class 7.3-22 2023-05-03 [2] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## cluster 2.1.6 2023-12-01 [2] CRAN (R 4.4.0)
## coda 0.19-4.1 2024-01-31 [1] CRAN (R 4.4.0)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.0)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## datawizard 0.12.2 2024-07-21 [1] CRAN (R 4.4.1)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## dials * 1.2.1 2024-02-22 [1] CRAN (R 4.4.0)
## DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## emmeans 1.10.3 2024-07-01 [1] CRAN (R 4.4.1)
## estimability 1.5.1 2024-05-12 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## factoextra * 1.0.7 2020-04-01 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## furrr 0.3.1 2022-08-15 [1] CRAN (R 4.4.0)
## future 1.33.2 2024-03-26 [1] CRAN (R 4.4.0)
## future.apply 1.11.2 2024-03-28 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## ggpubr 0.6.0 2023-02-10 [1] CRAN (R 4.4.0)
## ggrepel 0.9.5 2024-01-10 [1] CRAN (R 4.4.0)
## ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.4.0)
## globals 0.16.3 2024-03-08 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gower 1.0.1 2022-12-22 [1] CRAN (R 4.4.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.4.0)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## hardhat 1.4.0 2024-06-02 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)
## insight 0.20.2 2024-07-13 [1] CRAN (R 4.4.0)
## ipred 0.9-15 2024-07-18 [1] CRAN (R 4.4.1)
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.0)
## lava 1.8.0 2024-03-05 [1] CRAN (R 4.4.0)
## lhs 1.2.0 2024-06-30 [1] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## listenv 0.9.1 2024-01-29 [1] CRAN (R 4.4.0)
## lubridate 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## MASS 7.3-60.2 2024-04-24 [2] local
## Matrix 1.7-0 2024-03-22 [2] CRAN (R 4.4.0)
## mclust 6.1.1 2024-04-29 [1] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## modeldata * 1.4.0 2024-06-19 [1] CRAN (R 4.4.1)
## multcomp 1.4-26 2024-07-18 [1] CRAN (R 4.4.1)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## mvtnorm 1.2-5 2024-05-21 [1] CRAN (R 4.4.0)
## NbClust 3.0.1 2022-05-02 [1] CRAN (R 4.4.0)
## nnet 7.3-19 2023-05-03 [2] CRAN (R 4.4.0)
## parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.4.0)
## parameters * 0.22.1 2024-07-21 [1] CRAN (R 4.4.1)
## parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## prodlim 2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## recipes * 1.1.0 2024-07-04 [1] CRAN (R 4.4.1)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## rpart 4.1.23 2023-12-05 [2] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.4.0)
## rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sandwich 3.1-0 2023-12-11 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales * 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## see 0.8.5 2024-07-17 [1] CRAN (R 4.4.1)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## survival 3.5-8 2024-02-14 [2] CRAN (R 4.4.0)
## TH.data 1.1-2 2023-04-17 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## timeDate 4032.109 2023-12-14 [1] CRAN (R 4.4.0)
## tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.4.0)
## workflowsets * 1.1.0 2024-03-21 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
## yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)
## zoo 1.8-12 2023-04-13 [1] CRAN (R 4.4.0)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────