Topic Modelling

Objective

As an unsupervised machine learning method, topic modelling extracts valuable insights from unstructured text data for making business decisions. According to Wikipedia, topic modelling is a form of statistical modelling used to discover abstract topics or hidden semantic structures in a collection of documents (body of text). Topic modelling is used to analyse customer feedback, staff surveys, social media and blogs.

This vignette demonstrates a simplified example of topic modelling. The dataset consists of feedback from 62 bank staff as they experienced a bank merger. The bank merger was nationwide, with staff surveyed predominantly based in Queensland.

This vignette demonstrates two methods of topic modelling. These methods are Latent Dirichlet Allocation (LDA) and Structural Topic Modelling (STM). Both methods will analyse bank staff feedback to discover and compare the underlying themes or topics.

Workflow

To commence, the raw text was tidied, stop words were removed and similar words were lemmatised. A word frequency bar chart visualises the keywords used by bank staff to describe the merger.
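The pre-processing steps above can be sketched in a few lines. This is only an illustration in Python, not the code used for this vignette (which was run in R, per the session information). The stop-word list, the lemma map and the two comments are all made up for the example and are far smaller than any real resource.

```python
import re
from collections import Counter

# Hypothetical stop-word list and lemma map (assumptions for illustration).
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "for", "are"}
LEMMAS = {"changes": "change", "changed": "change", "mergers": "merger"}

def preprocess(text):
    """Lower-case, tokenise, drop stop words and map words to a base form."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

# Made-up comments, not taken from the survey.
comments = [
    "The changes are good for the merger.",
    "Change is a part of mergers.",
]

# Word frequency across all comments, as would feed a bar chart.
word_counts = Counter(token for c in comments for token in preprocess(c))
print(word_counts.most_common(3))
```

A real workflow would typically rely on a published stop-word list and a proper lemmatiser rather than hand-written lookups.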

With pre-processing complete, as a matter of interest, the tf-idf statistic was calculated for each document. According to Wikipedia, tf-idf (term frequency-inverse document frequency) is a statistic reflecting how important a word is to a document within a collection of documents or corpus.
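As a rough illustration of the statistic, the sketch below computes tf-idf from scratch in Python. It assumes one common definition: term frequency is a word's count divided by the document's length, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. The document ids and tokens are invented, and this is not the code used for the vignette.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: tf-idf}} for a dict of tokenised documents."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (n / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return scores

# Hypothetical tokenised documents.
docs = {
    "d1": ["change", "merger", "support"],
    "d2": ["change", "system", "computer"],
    "d3": ["change", "growth", "growth"],
}
scores = tf_idf(docs)
```

Under this definition a word that appears in every document, such as "change" here, scores zero, while a word concentrated in one document, such as "growth" in `d3`, scores highest.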

Before implementing topic modelling, exploring the optimal number of topics in a corpus is helpful. In this example, three different statistical approaches, combined with informed judgement, lead to the optimal number of topics for this corpus.

Finally, the LDA and STM algorithms were implemented and the results compared. For brevity, the results for each topic modelling method show one visualisation each for gamma (the probability of topics per document) and beta (the probability of words per topic). To conclude, a sample of comments by bank staff is listed to support the different themes derived from each topic modelling method.

Results

1. Word Frequency

Chart 1 lists the most common words bank staff use to describe the merger.

2. Visualise tf-idf

The tf-idf statistic calculates the importance of a word to one document compared to the collection of documents. Rather than showing the tf-idf statistic for all respondents, Chart 2 shows the highest tf-idf statistic for a sample of respondents numbered by respondent id.

3. Optimal number of topics

The following three charts show the results of investigating the optimal number of topics for this corpus. The diagnostics in Chart 3 suggest that six topics would produce a reasonable outcome. Based on the harmonic mean, six topics would also be reasonable, as shown in Chart 4. The extracted image in Chart 5 was calculated on six topics. It shows that the topics are separate and distinct from each other, with no overlap.

Chart 3 Metrics estimating the preferred number of topics for the LDA model

Chart 4 Harmonic mean approach to determining the number of topics

Chart 5 LDAvis Intertopic Distance Map image with six topics
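The vignette does not spell out how the harmonic mean was computed, so the following is only a sketch of one common formulation: for each candidate number of topics, take the log-likelihoods sampled from a fitted model, compute the log of the harmonic mean of the corresponding likelihoods, and prefer the candidate with the highest value. The log-likelihood samples below are made-up values chosen only so the example runs; they are not results from this corpus.

```python
import math

def log_harmonic_mean(log_likelihoods):
    """Log of the harmonic mean of likelihoods, given their logs."""
    neg = [-ll for ll in log_likelihoods]
    m = max(neg)  # shift by the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in neg))
    return math.log(len(neg)) - log_sum

# Hypothetical log-likelihood samples per candidate number of topics.
samples_by_k = {
    4: [-5210.0, -5205.0, -5208.0],
    6: [-5120.0, -5118.0, -5122.0],
    8: [-5150.0, -5149.0, -5153.0],
}
scores = {k: log_harmonic_mean(ll) for k, ll in samples_by_k.items()}
best_k = max(scores, key=scores.get)
print(best_k)
```

Working on the log scale avoids the underflow that would occur if likelihoods this small were exponentiated directly.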

4. LDA method

The following two visualisations show document-topic probabilities (gamma) and word-topic probabilities (beta) with LDA.
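To make the two quantities concrete, the sketch below shows how gamma and beta might be read once a model has been fitted. It is a Python illustration with invented probabilities, not output from the models in this vignette: each row of `beta` is a topic's distribution over words, and each row of `gamma` is a document's distribution over topics.

```python
# Hypothetical beta: P(word | topic), one row per topic.
beta = {
    1: {"support": 0.5, "management": 0.3, "change": 0.2},
    2: {"change": 0.6, "growth": 0.3, "support": 0.1},
}
# Hypothetical gamma: P(topic | document), one row per document.
gamma = {
    "d1": {1: 0.9, 2: 0.1},
    "d2": {1: 0.2, 2: 0.8},
}

def top_terms(beta, n):
    """The n most probable words for each topic."""
    return {
        topic: sorted(words, key=words.get, reverse=True)[:n]
        for topic, words in beta.items()
    }

def dominant_topic(gamma):
    """The most probable topic for each document."""
    return {doc: max(topics, key=topics.get) for doc, topics in gamma.items()}

print(top_terms(beta, 2))
print(dominant_topic(gamma))
```

The top terms per topic are what a beta chart typically displays, while the dominant topic per document summarises a gamma chart.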

4.1 LDA document-topic probabilities

4.2 LDA word-topic probabilities

5. STM method

The following two visualisations show document-topic probabilities (gamma) and word-topic probabilities (beta) with STM.

5.1 STM document-topic probabilities

5.2 STM word-topic probabilities

6. Compare model results

Both topic modelling methods implemented in this vignette produced very similar results, as shown in Charts 7 and 9. There are close similarities between five of the six topics for each method. For example, LDA topic 1 produced similar results to STM topic 2. LDA topic 2 is similar to STM topic 5. LDA topic 4 is consistent with STM topic 4. LDA topic 5 is similar to STM topic 3. LDA topic 6 is similar to STM topic 1. LDA topic 3 and STM topic 6 have no similarities, suggesting two different topics.

Further analysis, based on gamma, was done to verify these results. A list of documents in common for the LDA and STM methods was compiled for each derived topic or theme. Extracts follow.
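One way such a list could be compiled is sketched below in Python. It assumes each document is assigned to the topic with its highest gamma, which the vignette does not confirm (a gamma threshold would be another option). The gamma values, document ids and the topic pairing are all hypothetical.

```python
def assign_documents(gamma):
    """Group document ids by their highest-gamma topic."""
    groups = {}
    for doc, topics in gamma.items():
        best = max(topics, key=topics.get)
        groups.setdefault(best, set()).add(doc)
    return groups

# Hypothetical gamma values for the same three documents under each model.
lda_gamma = {"d1": {1: 0.7, 2: 0.3}, "d2": {1: 0.6, 2: 0.4}, "d3": {1: 0.1, 2: 0.9}}
stm_gamma = {"d1": {1: 0.2, 2: 0.8}, "d2": {1: 0.55, 2: 0.45}, "d3": {1: 0.7, 2: 0.3}}

lda_groups = assign_documents(lda_gamma)
stm_groups = assign_documents(stm_gamma)

# Pair LDA topic 1 with STM topic 2 (an assumed pairing) and keep shared documents.
common = lda_groups[1] & stm_groups[2]
print(common)
```

Documents that land in both paired topics are the candidates from which supporting comments can then be sampled.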

6.1 Topic 1: Dissatisfaction with management

This theme was derived from the following sample of comments in common with LDA topic 1 and STM topic 2.

**Topic 1: Dissatisfaction with management support**
id	comments by respondents
322	Lack of training in the updates of procedures. Lack of head office support and being fobbed off as “not their problem”. Having to go through many departments all over Australia to fix any problems. No one admitting or understanding changes have been caused to customers and their frustrations and our frustrations to help them fix them. No one wants to commit to a solution.
318	Probably the way I feel is the indecision by higher levels of support. Not knowing what is happening and the lack of help in our Head Office.
521	Little, no support or understanding from management of Queensland.

6.2 Topic 2: Support for the merger

This theme was derived from the following sample of comments in common with LDA topic 2 and STM topic 5.

**Topic 2: Support for the merger**
id	comments by respondents
34	Change is a good thing for the right situation and this is one. Who can oppose this change when it will eventually benefit all involved?
36	I have no objections either way. All that the change needs is more input from employees and customers. Some changes do not make sense.
343	I agree that this merger is a good thing for the organisation and for the shareholders but as usual the people at ground level are left to deal with the anger and objections of the customers without a lot of help from above.

6.3 Topic 3: Computer system

This theme was derived from the following sample of comments in common with LDA topic 4 and STM topic 4.

**Topic 3: Computer system**
id	comments by respondents
519	Better computer system, larger organisation, more chances for promotion.
31	The main change to a fully merged computer system is an essential part of the bank’s development. After the pain the gain will be worthwhile. So we are led to believe.
505	Improves the system, highlights the weak links.

6.4 Topic 4: Professional and personal growth

This theme was derived from the following sample of comments in common with LDA topic 5 and STM topic 3.

**Topic 4: Personal and professional growth**
id	comments by respondents
434	In my working life I have been through many changes (some self initiated others not). I am a fairly resilient person who likes to believe that change usually results in growth both personal and business wise.
508	I have support in the change as I enjoy personal growth in all areas of my life (work and home) and enjoy the challenge of new horizons.
485	Change is progress we all need to change to grow in our personal and professional lives.

6.5 Topic 5: Need for and acceptance of change

This theme was derived from the following sample of comments in common with LDA topic 6 and STM topic 1.

**Topic 5: Need for and acceptance of change**
id	comments by respondents
18	It needs to happen to remain competitive and keep a job and a future.
319	These days mergers happen with all industries. It’s a case of go with it or find another job. That easy, if you still have one.
473	It’s a fact of life, it’s happening to everybody, everywhere. It needs to happen for business to survive. It may seem difficult at times now but will eventually get easier.

6.6 Topic 6: Benefits of time

This theme was derived from the following sample of comments in LDA topic 3.

**Topic 6: Benefits of time**
id	comments by respondents
45	I believe in time the change will be beneficial for the whole organisation but currently the length of time affecting the change is having adverse effects on the company. Some decisions made appear to be logical and practical and others appear illogical and caused difficulties.
66	Change is progress to a better working environment. Given time change becomes normal.
344	The feeling of uncertainty and being taken out of my comfort zone of familiarity has caused me to feel unsure about the changes. But at the same time the changes have built up my confidence in a way. The change has been a big learning curve so therefore a feeling of accomplishment.

Upon review, the comments in STM topic 6 did not provide insights beyond the themes already derived above; they read more as a combination of existing themes.

Overall, both LDA and STM methods performed well on this dataset, reliably extracting the same main abstract topics or underlying themes from the corpus.

For more examples of text analysis, look at the vignettes on word frequency, n-grams word pairs and correlation, and sentiment analysis.

Reference:

Silge, J. & Robinson, D. (2017). Text Mining with R, O’Reilly Media Inc.

Session information and package update

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.0 (2024-04-24 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Australia.utf8
##  ctype    English_Australia.utf8
##  tz       Australia/Brisbane
##  date     2024-07-30
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  bslib          0.7.0   2024-03-29 [1] CRAN (R 4.4.0)
##  cachem         1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  cli            3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
##  colorspace     2.1-0   2023-01-23 [1] CRAN (R 4.4.1)
##  crosstalk      1.2.1   2023-11-23 [1] CRAN (R 4.4.0)
##  data.table   * 1.15.4  2024-03-30 [1] CRAN (R 4.4.0)
##  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.4.0)
##  digest         0.6.36  2024-06-23 [1] CRAN (R 4.4.1)
##  dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate       0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
##  fansi          1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
##  farver         2.1.2   2024-05-13 [1] CRAN (R 4.4.0)
##  fastmap        1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  forcats      * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
##  fs             1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
##  generics       0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
##  ggplot2      * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
##  ggthemes     * 5.1.0   2024-02-10 [1] CRAN (R 4.4.0)
##  glue           1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
##  gmp          * 0.7-4   2024-01-15 [1] CRAN (R 4.4.0)
##  gtable         0.3.5   2024-04-22 [1] CRAN (R 4.4.0)
##  here         * 1.0.1   2020-12-13 [1] CRAN (R 4.4.0)
##  highr          0.11    2024-05-26 [1] CRAN (R 4.4.0)
##  hms            1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools      0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets    1.6.4   2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv         1.6.15  2024-03-26 [1] CRAN (R 4.4.0)
##  httr           1.4.7   2023-08-15 [1] CRAN (R 4.4.0)
##  janeaustenr    1.0.0   2022-08-26 [1] CRAN (R 4.4.0)
##  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite       1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
##  kableExtra   * 1.4.0   2024-01-24 [1] CRAN (R 4.4.0)
##  knitr          1.48    2024-07-07 [1] CRAN (R 4.4.1)
##  labeling       0.4.3   2023-08-29 [1] CRAN (R 4.4.0)
##  later          1.3.2   2023-12-06 [1] CRAN (R 4.4.0)
##  lattice        0.22-6  2024-03-20 [2] CRAN (R 4.4.0)
##  lazyeval       0.2.2   2019-03-15 [1] CRAN (R 4.4.0)
##  ldatuning    * 1.0.2   2020-04-21 [1] CRAN (R 4.4.0)
##  LDAvis       * 0.3.2   2015-10-24 [1] CRAN (R 4.4.0)
##  lifecycle      1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##  lubridate    * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
##  Matrix         1.7-0   2024-03-22 [2] CRAN (R 4.4.0)
##  memoise        2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  mime           0.12    2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
##  modeltools     0.2-23  2020-03-05 [1] CRAN (R 4.4.0)
##  munsell        0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
##  NLP          * 0.2-1   2020-10-14 [1] CRAN (R 4.4.0)
##  pillar         1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild       1.4.4   2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload        1.4.0   2024-06-28 [1] CRAN (R 4.4.1)
##  plotly       * 4.10.4  2024-01-13 [1] CRAN (R 4.4.0)
##  plyr           1.8.9   2023-10-02 [1] CRAN (R 4.4.0)
##  profvis        0.3.8   2023-05-02 [1] CRAN (R 4.4.0)
##  promises       1.3.0   2024-04-05 [1] CRAN (R 4.4.0)
##  purrr        * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
##  RColorBrewer * 1.1-3   2022-04-03 [1] CRAN (R 4.4.0)
##  Rcpp           1.0.13  2024-07-17 [1] CRAN (R 4.4.1)
##  readr        * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
##  remotes        2.5.0   2024-03-17 [1] CRAN (R 4.4.0)
##  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.4.0)
##  rlang          1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown      2.27    2024-05-17 [1] CRAN (R 4.4.0)
##  Rmpfr        * 0.9-5   2024-01-21 [1] CRAN (R 4.4.0)
##  rprojroot      2.0.4   2023-11-05 [1] CRAN (R 4.4.0)
##  rstudioapi     0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
##  sass           0.4.9   2024-03-15 [1] CRAN (R 4.4.0)
##  scales         1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
##  servr        * 0.30    2024-03-23 [1] CRAN (R 4.4.0)
##  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
##  shiny          1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
##  slam           0.1-51  2024-07-17 [1] CRAN (R 4.4.1)
##  SnowballC      0.7.1   2023-04-25 [1] CRAN (R 4.4.0)
##  stm          * 1.3.7   2023-12-01 [1] CRAN (R 4.4.0)
##  stringi        1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
##  stringr      * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
##  svglite        2.1.3   2023-12-08 [1] CRAN (R 4.4.0)
##  systemfonts    1.1.0   2024-05-15 [1] CRAN (R 4.4.0)
##  tibble       * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
##  tidyr        * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect     1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
##  tidytext     * 0.4.2   2024-04-10 [1] CRAN (R 4.4.0)
##  tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)
##  timechange     0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
##  tm           * 0.7-13  2024-04-20 [1] CRAN (R 4.4.0)
##  tokenizers     0.3.0   2022-12-22 [1] CRAN (R 4.4.0)
##  topicmodels  * 0.2-16  2024-01-09 [1] CRAN (R 4.4.0)
##  tzdb           0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.4.0)
##  usethis        2.2.3   2024-02-19 [1] CRAN (R 4.4.0)
##  utf8           1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs          0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  viridisLite    0.4.2   2023-05-02 [1] CRAN (R 4.4.0)
##  withr          3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
##  wordcloud    * 2.6     2018-08-24 [1] CRAN (R 4.4.0)
##  xfun           0.46    2024-07-18 [1] CRAN (R 4.4.1)
##  xml2           1.3.6   2023-12-04 [1] CRAN (R 4.4.0)
##  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.4.0)
##  yaml           2.3.9   2024-07-05 [1] CRAN (R 4.4.1)
## 
##  [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.0/library
## 
## ──────────────────────────────────────────────────────────────────────────────