Topic Modelling
As an unsupervised machine learning method, topic modelling extracts valuable insights from unstructured text data for making business decisions. According to Wikipedia, topic modelling is a form of statistical modelling used to discover abstract topics or hidden semantic structures in a collection of documents (body of text). Topic modelling is used to analyse customer feedback, staff surveys, social media and blogs.
This vignette demonstrates a simplified example of topic modelling. The dataset consists of feedback from 62 bank staff as they experienced a bank merger. The merger was nationwide, although the staff surveyed were predominantly based in Queensland.
This vignette demonstrates two methods of topic modelling. These methods are Latent Dirichlet Allocation (LDA) and Structural Topic Modelling (STM). Both methods will analyse bank staff feedback to discover and compare the underlying themes or topics.
To commence, the raw text was tidied, stop words were removed and similar words were lemmatised. A word frequency bar chart visualises the keywords used by bank staff to describe the merger.
With pre-processing complete, as a matter of interest, the tf-idf statistic was calculated for each document. According to Wikipedia, tf-idf (term frequency-inverse document frequency) is a statistic reflecting how important a word is to a document within a collection of documents or corpus.
Before implementing topic modelling, exploring the optimal number of topics in a corpus is helpful. In this example, three different statistical approaches, combined with informed judgement, lead to the optimal number of topics for this corpus.
Finally, the LDA and STM algorithms were implemented and their results compared. For brevity, the results for each topic modelling method show one visualisation for gamma (the probability of topics per document) and one for beta (the probability of words per topic). To conclude, a sample of comments by bank staff is listed to support the different themes derived from each topic modelling method.
Chart 1 lists the most common words bank staff use to describe the merger.
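The pre-processing and word count behind a chart like Chart 1 can be sketched as follows. The vignette itself was produced in R and does not show its code, so this is only an illustrative plain-Python sketch: the comments, the stop-word list and the lemma map are all hypothetical stand-ins, and a real analysis would use fuller stop-word and lemmatisation resources.

```python
import re
from collections import Counter

# Hypothetical comments standing in for the staff feedback (not the real data).
comments = [
    "The changes were supported by management.",
    "Change is good, but support from management is lacking.",
    "Customers are frustrated by the system changes.",
]

# Tiny illustrative stop-word list and lemma map (assumptions for this sketch).
STOP_WORDS = {"the", "were", "by", "is", "but", "from", "are", "a", "of"}
LEMMAS = {"changes": "change", "supported": "support", "customers": "customer"}

def tokens(text):
    """Lowercase, split into words, drop stop words, map words to lemmas."""
    words = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(w, w) for w in words if w not in STOP_WORDS]

# Count every remaining word across all comments.
word_counts = Counter(w for c in comments for w in tokens(c))

# Print the most frequent words, the data a word frequency bar chart would plot.
for word, n in word_counts.most_common(5):
    print(f"{word:12s} {n}")
```

Lemmatising before counting is what lets "changes" and "change" contribute to the same bar rather than splitting the count.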
The tf-idf statistic calculates the importance of a word to one document compared to the collection of documents. Rather than showing the tf-idf statistic for all respondents, Chart 2 shows the highest tf-idf statistic for a sample of respondents numbered by respondent id.
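As a rough illustration of the calculation, the sketch below computes tf-idf with term frequency as a word's share of its document and inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word. This is one common definition and is assumed here; the vignette does not show its own code. The documents are invented and the sketch is plain Python rather than the R used for the vignette.

```python
import math
from collections import Counter

# Hypothetical tokenised documents keyed by respondent id (not the real data).
docs = {
    1: ["change", "support", "management"],
    2: ["change", "system", "system", "promotion"],
    3: ["change", "customer", "support"],
}

def tf_idf(docs):
    """Return {(doc_id, term): score} with tf = count / doc length, idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for words in docs.values() for term in set(words))
    scores = {}
    for doc_id, words in docs.items():
        for term, n in Counter(words).items():
            tf = n / len(words)
            idf = math.log(n_docs / doc_freq[term])
            scores[(doc_id, term)] = tf * idf
    return scores

scores = tf_idf(docs)

# Keep the highest-scoring term for each document.
top = {}
for (doc_id, term), score in scores.items():
    if doc_id not in top or score > scores[(doc_id, top[doc_id])]:
        top[doc_id] = term
print(top)
```

A word that appears in every document, such as "change" in this toy corpus, gets an idf of zero, which is why tf-idf highlights words that distinguish one respondent from the rest.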
The following three charts show the results of investigating the optimal number of topics for this corpus. The diagnostics in Chart 3 suggest that six topics would produce a reasonable outcome. The harmonic mean in Chart 4 also points to six topics as a reasonable choice. The extracted image in Chart 5 was calculated on six topics and shows that the topics are separate and distinct from each other, with no overlap.
Chart 3 Metrics estimating the preferred number of topics for the LDA model
Chart 4 Harmonic mean approach to determining the number of topics
Chart 5 LDAvis Intertopic Distance Map image with six topics
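The harmonic mean approach named in Chart 4 is often implemented by fitting a model for each candidate number of topics, collecting the log-likelihood of each retained sample, and comparing the log of the harmonic mean of the sampled likelihoods across candidates. The vignette does not show its implementation, so the plain-Python sketch below is an assumption about the general technique, and the log-likelihood values are invented.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logs.

    log HM = log(n) - log(sum(exp(-ll))), evaluated with the log-sum-exp
    trick so that very negative log-likelihoods do not overflow.
    """
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - log_sum

# Hypothetical per-sample log-likelihoods for each candidate number of topics.
samples = {
    4: [-5210.0, -5205.5, -5208.2],
    6: [-5150.3, -5148.9, -5152.1],
    8: [-5171.4, -5169.0, -5173.8],
}

# The candidate with the highest estimate is the preferred number of topics.
estimates = {k: log_harmonic_mean(ll) for k, ll in samples.items()}
best_k = max(estimates, key=estimates.get)
print(best_k)
```

Working in log space matters here: exponentiating values of this size directly would overflow ordinary floating point, which is why arbitrary-precision arithmetic or a log-sum-exp rearrangement is needed.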
The following two visualisations show document-topic probabilities (gamma) and word-topic probabilities (beta) with LDA.
The following two visualisations show document-topic probabilities (gamma) and word-topic probabilities (beta) with STM.
Both topic modelling methods implemented in this vignette produced very similar results, as shown in Charts 7 and 9. Five of the six topics from each method have a close counterpart in the other. LDA topic 1 produced similar results to STM topic 2. LDA topic 2 is similar to STM topic 5. LDA topic 4 is consistent with STM topic 4. LDA topic 5 is similar to STM topic 3. LDA topic 6 is similar to STM topic 1. LDA topic 3 and STM topic 6 have no similarities, suggesting two different topics.
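The vignette does not say how the topics were paired, and the pairing may simply have been read from the charts. One way to make such a comparison more systematic is to measure the overlap between the top terms of each topic, for example with the Jaccard index. The plain-Python sketch below illustrates that swapped-in technique; the term sets and the 0.3 threshold are hypothetical and are not the vignette's output.

```python
# Hypothetical top terms per topic from each model (not the vignette's output).
lda_top = {
    1: {"support", "head", "office", "training"},
    2: {"change", "good", "benefit", "customer"},
    3: {"time", "decision", "effect", "learn"},
}
stm_top = {
    1: {"change", "benefit", "good", "shareholder"},
    2: {"support", "office", "management", "training"},
    3: {"job", "competitive", "future", "survive"},
}

def jaccard(a, b):
    """Share of terms common to both sets, out of all terms in either set."""
    return len(a & b) / len(a | b)

def best_matches(left, right, threshold=0.3):
    """For each left topic, the most similar right topic, or None if below threshold."""
    matches = {}
    for i, terms in left.items():
        j, score = max(
            ((j, jaccard(terms, other)) for j, other in right.items()),
            key=lambda pair: pair[1],
        )
        matches[i] = j if score >= threshold else None
    return matches

print(best_matches(lda_top, stm_top))
```

A topic that maps to `None` has no counterpart above the threshold, which corresponds to the situation described for LDA topic 3 and STM topic 6.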
Further analysis, based on gamma, was done to verify these results. A list of documents in common between the LDA and STM methods was compiled for each derived topic or theme. Extracts follow.
This theme was derived from the following sample of comments in common with LDA topic 1 and STM topic 2.
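A gamma-based cross-check of this kind can be sketched by assigning each document to the topic with its highest gamma under each model, then keeping the documents that land in both members of a matched topic pair. The vignette does not show how its list was compiled, so the plain-Python sketch below is an assumption, and the document names and probabilities are invented.

```python
# Hypothetical document-topic probabilities (gamma); each row sums to 1.
lda_gamma = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.15, 0.75, 0.10],
    "doc_c": [0.60, 0.30, 0.10],
    "doc_d": [0.20, 0.20, 0.60],
}
stm_gamma = {
    "doc_a": [0.10, 0.80, 0.10],
    "doc_b": [0.65, 0.25, 0.10],
    "doc_c": [0.15, 0.25, 0.60],
    "doc_d": [0.30, 0.30, 0.40],
}

def dominant_topic(gamma):
    """Map each document to the 1-based topic with its highest gamma."""
    return {doc: probs.index(max(probs)) + 1 for doc, probs in gamma.items()}

def docs_in_common(lda_gamma, stm_gamma, lda_topic, stm_topic):
    """Documents whose dominant topic is lda_topic under LDA and stm_topic under STM."""
    lda_dom = dominant_topic(lda_gamma)
    stm_dom = dominant_topic(stm_gamma)
    return sorted(
        d for d in lda_dom
        if lda_dom[d] == lda_topic and stm_dom.get(d) == stm_topic
    )

print(docs_in_common(lda_gamma, stm_gamma, lda_topic=1, stm_topic=2))
```

Documents that both models place in the same matched pair are the natural candidates for sample comments, since neither model's assignment is in doubt for them.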
| id | comments by respondents |
|---|---|
| 322 | Lack of training in the updates of procedures. Lack of head office support and being fobbed off as “not their problem”. Having to go through many departments all over Australia to fix any problems. No one admitting or understanding changes have been caused to customers and their frustrations and our frustrations to help them fix them. No one wants to commit to a solution. |
| 318 | Probably the way I feel is the indecision by higher levels of support. Not knowing what is happening and the lack of help in our Head Office. |
| 521 | Little, no support or understanding from management of Queensland. |
This theme was derived from the following sample of comments in common with LDA topic 2 and STM topic 5.
| id | comments by respondents |
|---|---|
| 34 | Change is a good thing for the right situation and this is one. Who can oppose this change when it will eventually benefit all involved? |
| 36 | I have no objections either way. All that the change needs is more input from employees and customers. Some changes do not make sense. |
| 343 | I agree that this merger is a good thing for the organisation and for the shareholders but as usual the people at ground level are left to deal with the anger and objections of the customers without a lot of help from above. |
This theme was derived from the following sample of comments in common with LDA topic 4 and STM topic 4.
| id | comments by respondents |
|---|---|
| 519 | Better computer system, larger organisation, more chances for promotion. |
| 31 | The main change to a fully merged computer system is an essential part of the bank’s development. After the pain the gain will be worthwhile. So we are led to believe. |
| 505 | Improves the system, highlights the weak links. |
This theme was derived from the following sample of comments in common with LDA topic 5 and STM topic 3.
| id | comments by respondents |
|---|---|
| 434 | In my working life I have been through many changes (some self initiated others not). I am a fairly resilient person who likes to believe that change usually results in growth both personal and business wise. |
| 508 | I have support in the change as I enjoy personal growth in all areas of my life (work and home) and enjoy the challenge of new horizons. |
| 485 | Change is progress we all need to change to grow in our personal and professional lives. |
This theme was derived from the following sample of comments in common with LDA topic 6 and STM topic 1.
| id | comments by respondents |
|---|---|
| 18 | It needs to happen to remain competitive and keep a job and a future. |
| 319 | These days mergers happen with all industries. It’s a case of go with it or find another job. That easy, if you still have one. |
| 473 | It’s a fact of life, it’s happening to everybody, everywhere. It needs to happen for business to survive. It may seem difficult at times now but will eventually get easier. |
This theme was derived from the following sample of comments in LDA topic 3.
| id | comments by respondents |
|---|---|
| 45 | I believe in time the change will be beneficial for the whole organisation but currently the length of time affecting the change is having adverse effects on the company. Some decisions made appear to be logical and practical and others appear illogical and caused difficulties. |
| 66 | Change is progress to a better working environment. Given time change becomes normal. |
| 344 | The feeling of uncertainty and being taken out of my comfort zone of familiarity has caused me to feel unsure about the changes. But at the same time the changes have built up my confidence in a way. The change has been a big learning curve so therefore a feeling of accomplishment. |
Upon review, the comments in STM topic 6 did not provide further insights beyond the themes already derived above; they were more a combination of existing themes.
Overall, both LDA and STM methods performed well on this dataset, reliably extracting the same main abstract topics or underlying themes from the corpus.
For more examples of text analysis, look at the vignettes on word frequency, n-grams, word pairs and correlation, and sentiment analysis.
Reference:
Silge, J. & Robinson, D. (2017). Text Mining with R, O’Reilly Media Inc.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-30
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## crosstalk 1.2.1 2023-11-23 [1] CRAN (R 4.4.0)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## ggthemes * 5.1.0 2024-02-10 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gmp * 0.7-4 2024-01-15 [1] CRAN (R 4.4.0)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## httr 1.4.7 2023-08-15 [1] CRAN (R 4.4.0)
## janeaustenr 1.0.0 2022-08-26 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.0)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.4.0)
## ldatuning * 1.0.2 2020-04-21 [1] CRAN (R 4.4.0)
## LDAvis * 0.3.2 2015-10-24 [1] CRAN (R 4.4.0)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## Matrix 1.7-0 2024-03-22 [2] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## modeltools 0.2-23 2020-03-05 [1] CRAN (R 4.4.0)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## NLP * 0.2-1 2020-10-14 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## plotly * 4.10.4 2024-01-13 [1] CRAN (R 4.4.0)
## plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.0)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## RColorBrewer * 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## Rmpfr * 0.9-5 2024-01-21 [1] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## servr * 0.30 2024-03-23 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## slam 0.1-51 2024-07-17 [1] CRAN (R 4.4.1)
## SnowballC 0.7.1 2023-04-25 [1] CRAN (R 4.4.0)
## stm * 1.3.7 2023-12-01 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## svglite 2.1.3 2023-12-08 [1] CRAN (R 4.4.0)
## systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tidytext * 0.4.2 2024-04-10 [1] CRAN (R 4.4.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## tm * 0.7-13 2024-04-20 [1] CRAN (R 4.4.0)
## tokenizers 0.3.0 2022-12-22 [1] CRAN (R 4.4.0)
## topicmodels * 0.2-16 2024-01-09 [1] CRAN (R 4.4.0)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## wordcloud * 2.6 2018-08-24 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────