Objective

Text analysis extracts valuable insights from unstructured text data for making business decisions.

The vignette on word frequency focussed on calculating and visualising single-word tokens. However, the relationships between words within documents provide greater insight and understanding. These relationships include sequences of consecutive words, known as n-grams, as well as word pairs and correlations between words that co-occur within text documents.

This vignette focuses on calculating and visualising n-grams (in particular bigrams), word pairs and correlation. The dataset consists of documented feedback from 181 staff at a Federal Government Agency about their experience with a major restructure occurring within their agency.

Workflow

The following steps were implemented to produce the results.

  1. The raw dataset was tidied. This included removing stop words and word lemmatisation.

  2. Bigrams are sequences of two adjacent words from a string of tokens. A visualisation of the top bigrams was prepared. A network of bigrams is more informative when examining the relationships among bigrams. Trigrams (sequences of three consecutive words) were not pursued in this instance.

  3. The next step in this analysis was to show a network of word pairs, or co-occurring words. These are words that commonly appear together within each respondent's document. Unlike the words in a bigram, co-occurring words need not directly follow one another. The relationship between co-occurring words is symmetrical, whereas the relationship within a bigram is directional.

  4. The final section of this vignette visualises the pairwise correlation of words within documents. The correlation coefficient indicates how often words appear together relative to how often they appear separately.
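
The tidying in step 1 can be sketched as follows. This is a minimal illustration rather than the code used for this vignette: the `feedback` data frame, its column names and the small `lemma_lookup` table are hypothetical stand-ins, and a real analysis would likely use a full lemma dictionary instead of a hand-written lookup.

```r
library(dplyr)
library(tidytext)

# Hypothetical feedback data: one row per respondent (not the real dataset)
feedback <- tibble(
  respondent = 1:3,
  text = c(
    "The restructure reduced customer service quality",
    "Customer services were affected by the restructure",
    "Staff morale dropped and customer service suffered"
  )
)

# Hypothetical lemma lookup (inflected form -> base form), for illustration only
lemma_lookup <- c(
  services = "service", affected = "affect",
  reduced = "reduce", dropped = "drop", suffered = "suffer"
)

tidy_words <- feedback %>%
  unnest_tokens(word, text) %>%            # one lower-case token per row
  anti_join(stop_words, by = "word") %>%   # drop stop words listed in tidytext::stop_words
  mutate(word = if_else(word %in% names(lemma_lookup),
                        unname(lemma_lookup[word]),  # replace with the base form
                        word))                       # otherwise keep the token

# Word frequencies after tidying
count(tidy_words, word, sort = TRUE)
```

The result keeps one row per respondent and word, which is the shape assumed by the later steps.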

Results

1. Visualise bigrams

The bigram frequency plot captures the underlying concerns expressed by staff about the restructure. Chart 1 reveals that staff were most concerned about the adverse impact the restructure was having on customer service.

Chart 2 shows the directional relationship among the most common bigrams in the corpus.
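
A bigram count and directed network of this kind can be sketched with tidytext, igraph and ggraph, in the style of Silge and Robinson (2017). The `feedback` data below is hypothetical, and the layout and styling choices are assumptions rather than the settings behind Charts 1 and 2.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(igraph)
library(ggraph)

# Hypothetical feedback data (not the real dataset)
feedback <- tibble(
  respondent = 1:3,
  text = c(
    "The restructure reduced customer service quality",
    "Customer service declined after the restructure",
    "Staff morale dropped and customer service suffered"
  )
)

bigram_counts <- feedback %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # pairs of consecutive words
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                       # drop bigrams containing stop words
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)

# Directed graph: each edge runs from the first word to the second
bigram_graph <- graph_from_data_frame(bigram_counts, directed = TRUE)

bigram_plot <- ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = grid::arrow(length = grid::unit(2, "mm")),
                 end_cap = circle(2, "mm")) +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
```

Printing `bigram_plot` draws the network. The arrows carry the direction of each bigram, which is what distinguishes this network from the co-occurrence network in the next section.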

2. Co-occurring keywords

Chart 3 illustrates a network of co-occurring keywords used by respondents to describe the organisational restructure.
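
Co-occurrence counts like those behind such a network can be sketched with `pairwise_count()` from the widyr package. The `feedback` data is hypothetical, and treating each respondent's feedback as one document is an assumption made for this illustration.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Hypothetical feedback data (not the real dataset)
feedback <- tibble(
  respondent = 1:3,
  text = c(
    "The restructure reduced customer service quality",
    "Customer service declined after the restructure",
    "Staff morale dropped and customer service suffered"
  )
)

word_pairs <- feedback %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  # Count the number of respondents whose feedback contains both words
  pairwise_count(word, respondent, sort = TRUE)

# The counts are symmetrical: both orderings of a pair appear with the same n
filter(word_pairs, item1 %in% c("customer", "service"),
                   item2 %in% c("customer", "service"))
```

Because the relationship is symmetrical, a network built from `word_pairs` would be drawn without arrows.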

3. Pairwise correlation

The final chart visualises a network of correlated words from respondents about the restructure occurring in their organisation.
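
Pairwise correlation can be sketched with `pairwise_cor()` from widyr, which by default computes a Pearson correlation on the presence or absence of each word per document (the phi coefficient described by Silge and Robinson, 2017). The already tidied `tidy_words` table below is hypothetical and deliberately tiny.

```r
library(dplyr)
library(widyr)

# Hypothetical tidied tokens: one row per respondent and word (not the real dataset)
tidy_words <- tibble(
  respondent = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  word = c("budget", "cut", "training",
           "budget", "cut", "workload",
           "training", "morale",
           "morale", "workload")
)

word_cors <- tidy_words %>%
  # Correlation between words, based on which respondents used them
  pairwise_cor(word, respondent, sort = TRUE)

# "budget" and "cut" always appear together, so they sit at the top
head(word_cors)
```

On a real corpus it is usually sensible to keep only words that occur several times before calling `pairwise_cor()`, since correlations for rare words are unstable. A word used by every respondent has no variance, so its correlation is undefined.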

This vignette shows how n-grams, word pairs and correlation analysis provide more context, and support a greater understanding and interpretation of text documents, than single-word frequency. Further analysis, not demonstrated here, could include identifying words that follow, precede, co-occur with or are correlated with specific words of interest. For example, which words are most associated with “support” or “oppose” in relation to change?
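
As a rough sketch of that kind of follow-up, bigrams can be filtered on a word of interest to list the words that follow or precede it. The `feedback` data and the `words_of_interest` vector are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical feedback data (not the real dataset)
feedback <- tibble(
  respondent = 1:3,
  text = c(
    "I support flexible rosters",
    "Many staff oppose office closures",
    "Managers support flexible rosters"
  )
)

words_of_interest <- c("support", "oppose")

bigrams <- feedback %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Words that directly follow a word of interest
following <- bigrams %>%
  filter(word1 %in% words_of_interest) %>%
  count(word1, word2, sort = TRUE)

# Words that directly precede a word of interest
preceding <- bigrams %>%
  filter(word2 %in% words_of_interest) %>%
  count(word1, word2, sort = TRUE)
```

The same idea extends to co-occurrence and correlation by filtering the output of `pairwise_count()` or `pairwise_cor()` on `item1`.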

More examples of text analysis appear in the vignettes on sentiment analysis and topic modelling.


Reference:

Silge, J. & Robinson, D. (2017). Text Mining with R, O’Reilly Media Inc.


Session information and package update

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.0 (2024-04-24 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Australia.utf8
##  ctype    English_Australia.utf8
##  tz       Australia/Brisbane
##  date     2024-07-30
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version  date (UTC) lib source
##  backports      1.5.0    2024-05-23 [1] CRAN (R 4.4.0)
##  broom          1.0.6    2024-05-17 [1] CRAN (R 4.4.0)
##  bslib          0.7.0    2024-03-29 [1] CRAN (R 4.4.0)
##  cachem         1.1.0    2024-05-16 [1] CRAN (R 4.4.0)
##  cli            3.6.3    2024-06-21 [1] CRAN (R 4.4.1)
##  colorspace     2.1-0    2023-01-23 [1] CRAN (R 4.4.1)
##  crayon         1.5.3    2024-06-20 [1] CRAN (R 4.4.1)
##  data.table   * 1.15.4   2024-03-30 [1] CRAN (R 4.4.0)
##  devtools       2.4.5    2022-10-11 [1] CRAN (R 4.4.0)
##  digest         0.6.36   2024-06-23 [1] CRAN (R 4.4.1)
##  dplyr        * 1.1.4    2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis       0.3.2    2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate       0.24.0   2024-06-10 [1] CRAN (R 4.4.0)
##  fansi          1.0.6    2023-12-08 [1] CRAN (R 4.4.0)
##  farver         2.1.2    2024-05-13 [1] CRAN (R 4.4.0)
##  fastmap        1.2.0    2024-05-15 [1] CRAN (R 4.4.0)
##  forcats      * 1.0.0    2023-01-29 [1] CRAN (R 4.4.0)
##  fs             1.6.4    2024-04-25 [1] CRAN (R 4.4.0)
##  generics       0.1.3    2022-07-05 [1] CRAN (R 4.4.0)
##  ggforce        0.4.2    2024-02-19 [1] CRAN (R 4.4.0)
##  ggplot2      * 3.5.1    2024-04-23 [1] CRAN (R 4.4.0)
##  ggraph       * 2.2.1    2024-03-07 [1] CRAN (R 4.4.0)
##  ggrepel        0.9.5    2024-01-10 [1] CRAN (R 4.4.0)
##  glue           1.7.0    2024-01-09 [1] CRAN (R 4.4.0)
##  graphlayouts   1.1.1    2024-03-09 [1] CRAN (R 4.4.0)
##  gridExtra      2.3      2017-09-09 [1] CRAN (R 4.4.0)
##  gtable         0.3.5    2024-04-22 [1] CRAN (R 4.4.0)
##  here         * 1.0.1    2020-12-13 [1] CRAN (R 4.4.0)
##  highr          0.11     2024-05-26 [1] CRAN (R 4.4.0)
##  hms            1.1.3    2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools      0.5.8.1  2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets    1.6.4    2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv         1.6.15   2024-03-26 [1] CRAN (R 4.4.0)
##  igraph       * 2.0.3    2024-03-13 [1] CRAN (R 4.4.0)
##  janeaustenr    1.0.0    2022-08-26 [1] CRAN (R 4.4.0)
##  jquerylib      0.1.4    2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite       1.8.8    2023-12-04 [1] CRAN (R 4.4.0)
##  knitr          1.48     2024-07-07 [1] CRAN (R 4.4.1)
##  labeling       0.4.3    2023-08-29 [1] CRAN (R 4.4.0)
##  later          1.3.2    2023-12-06 [1] CRAN (R 4.4.0)
##  lattice        0.22-6   2024-03-20 [2] CRAN (R 4.4.0)
##  lifecycle      1.0.4    2023-11-07 [1] CRAN (R 4.4.0)
##  lubridate    * 1.9.3    2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr       2.0.3    2022-03-30 [1] CRAN (R 4.4.0)
##  MASS           7.3-60.2 2024-04-24 [2] local
##  Matrix         1.7-0    2024-03-22 [2] CRAN (R 4.4.0)
##  memoise        2.0.1    2021-11-26 [1] CRAN (R 4.4.0)
##  mime           0.12     2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI         0.1.1.1  2018-05-18 [1] CRAN (R 4.4.0)
##  munsell        0.5.1    2024-04-01 [1] CRAN (R 4.4.0)
##  pillar         1.9.0    2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild       1.4.4    2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig      2.0.3    2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload        1.4.0    2024-06-28 [1] CRAN (R 4.4.1)
##  plyr           1.8.9    2023-10-02 [1] CRAN (R 4.4.0)
##  polyclip       1.10-7   2024-07-23 [1] CRAN (R 4.4.1)
##  profvis        0.3.8    2023-05-02 [1] CRAN (R 4.4.0)
##  promises       1.3.0    2024-04-05 [1] CRAN (R 4.4.0)
##  purrr        * 1.0.2    2023-08-10 [1] CRAN (R 4.4.0)
##  R6             2.5.1    2021-08-19 [1] CRAN (R 4.4.0)
##  RColorBrewer * 1.1-3    2022-04-03 [1] CRAN (R 4.4.0)
##  Rcpp           1.0.13   2024-07-17 [1] CRAN (R 4.4.1)
##  readr        * 2.1.5    2024-01-10 [1] CRAN (R 4.4.0)
##  remotes        2.5.0    2024-03-17 [1] CRAN (R 4.4.0)
##  reshape2       1.4.4    2020-04-09 [1] CRAN (R 4.4.0)
##  rlang          1.1.4    2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown      2.27     2024-05-17 [1] CRAN (R 4.4.0)
##  rprojroot      2.0.4    2023-11-05 [1] CRAN (R 4.4.0)
##  rstudioapi     0.16.0   2024-03-24 [1] CRAN (R 4.4.0)
##  sass           0.4.9    2024-03-15 [1] CRAN (R 4.4.0)
##  scales         1.3.0    2023-11-28 [1] CRAN (R 4.4.0)
##  sessioninfo    1.2.2    2021-12-06 [1] CRAN (R 4.4.0)
##  shiny          1.8.1.1  2024-04-02 [1] CRAN (R 4.4.0)
##  SnowballC      0.7.1    2023-04-25 [1] CRAN (R 4.4.0)
##  stringi        1.8.4    2024-05-06 [1] CRAN (R 4.4.0)
##  stringr      * 1.5.1    2023-11-14 [1] CRAN (R 4.4.0)
##  tibble       * 3.2.1    2023-03-20 [1] CRAN (R 4.4.0)
##  tidygraph      1.3.1    2024-01-30 [1] CRAN (R 4.4.0)
##  tidyr        * 1.3.1    2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect     1.2.1    2024-03-11 [1] CRAN (R 4.4.0)
##  tidytext     * 0.4.2    2024-04-10 [1] CRAN (R 4.4.0)
##  tidyverse    * 2.0.0    2023-02-22 [1] CRAN (R 4.4.0)
##  timechange     0.3.0    2024-01-18 [1] CRAN (R 4.4.0)
##  tokenizers     0.3.0    2022-12-22 [1] CRAN (R 4.4.0)
##  tweenr         2.0.3    2024-02-26 [1] CRAN (R 4.4.0)
##  tzdb           0.4.0    2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker     1.0.1    2021-11-30 [1] CRAN (R 4.4.0)
##  usethis        2.2.3    2024-02-19 [1] CRAN (R 4.4.0)
##  utf8           1.2.4    2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs          0.6.5    2023-12-01 [1] CRAN (R 4.4.0)
##  viridis        0.6.5    2024-01-29 [1] CRAN (R 4.4.0)
##  viridisLite    0.4.2    2023-05-02 [1] CRAN (R 4.4.0)
##  widyr        * 0.1.5    2022-09-13 [1] CRAN (R 4.4.0)
##  withr          3.0.0    2024-01-16 [1] CRAN (R 4.4.0)
##  wordcloud    * 2.6      2018-08-24 [1] CRAN (R 4.4.0)
##  xfun           0.46     2024-07-18 [1] CRAN (R 4.4.1)
##  xtable         1.8-4    2019-04-21 [1] CRAN (R 4.4.0)
##  yaml           2.3.9    2024-07-05 [1] CRAN (R 4.4.1)
## 
##  [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.0/library
## 
## ──────────────────────────────────────────────────────────────────────────────