Objective

This vignette demonstrates how to scrape text data from the web. Organisations scrape text from the Internet to gain insights that support evidence-based decision-making, for example by analysing customer feedback and social media posts.

Manually gathering ever-changing external data from the Internet is tedious, repetitive, and prone to error. Web scraping automates the collection, making the process more efficient and practical.

This vignette scrapes and briefly analyses text from the Internet: the campaign launch speeches delivered by the political leaders contesting to be Prime Minister of Australia at the 2022 Australian Federal Election.

Workflow

The workflow to scrape and analyse the text data is as follows:

1. Source the election speeches from the following links. (Note: These links, hosted by the Museum of Australian Democracy at Old Parliament House, Canberra, are at times unavailable due to website maintenance.)

Election campaign launch speech delivered by Anthony Albanese, Australian Labor Party, at Perth, Western Australia, on 1st May 2022

Election campaign launch speech delivered by Scott Morrison, Liberal Party, at Brisbane, Queensland, on 15th May 2022

2. Scrape the websites using responsible practices.

3. Unnest the tokens and tidy the text: remove stop words and lemmatise the words. Finally, prepare the text for visualisation.
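Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code used to produce the results: the HTML snippet, the `div.speech p` selector, the placeholder sentences and the small `lemma_lookup` table are all invented for the example, and the vignette's actual lemmatisation method is not shown in this text. The commented lines indicate how a `polite` session could replace the inline HTML when scraping a live page; the URL and user agent there are placeholders.

```r
library(rvest)
library(dplyr)
library(tibble)
library(tidytext)

# Stand-in for a fetched page (invented placeholder text, not a real speech).
# With a live site, a polite session could replace minimal_html(), e.g.:
#   session <- polite::bow("https://example.org/speech",   # placeholder URL
#                          user_agent = "my-contact-details")
#   html <- polite::scrape(session)
html <- minimal_html('
  <div class="speech">
    <p>The nurses visited two hospitals.</p>
    <p>Hospitals and nurses support the economy.</p>
  </div>')

# Extract the paragraph text (the CSS selector is an assumption)
paragraphs <- html |>
  html_elements("div.speech p") |>
  html_text2()

speech <- tibble(paragraph = seq_along(paragraphs), text = paragraphs)

# Hypothetical lemma lookup table used only for this illustration
lemma_lookup <- tibble(
  word  = c("nurses", "hospitals"),
  lemma = c("nurse", "hospital")
)

tidy_words <- speech |>
  unnest_tokens(word, text) |>                # one lower-case word per row
  anti_join(stop_words, by = "word") |>       # drop tidytext stop words
  left_join(lemma_lookup, by = "word") |>     # attach a lemma where known
  mutate(word = coalesce(lemma, word)) |>     # fall back to the original word
  select(paragraph, word)

print(tidy_words)
```

A lookup table keeps the sketch dependency-free; a dedicated lemmatisation package or a larger lemma dictionary could be substituted at the `left_join()` step.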

Results

There are several methods to analyse text data, including word frequency, bigram and correlation analysis, sentiment analysis and topic modelling. This vignette focuses on word frequency for each speech, visualised with a bar chart and a word cloud. The bar chart for each speech shows the top 20 words, including ties, while the word cloud includes all word tokens with a minimum frequency of five.
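The word-frequency step described above might look like the sketch below. It is illustrative only: the `tidy_words` tokens are invented toy data, and `top_k` is set to 2 rather than 20 so that the effect of keeping ties is visible on a small example. The `min.freq = 5` argument mirrors the minimum frequency stated for the word cloud.

```r
library(dplyr)
library(tibble)
library(ggplot2)
library(wordcloud)

# Hypothetical tidy tokens (one word per row), invented for illustration
tidy_words <- tibble(
  word = rep(c("nurse", "hospital", "economy", "job", "wage"),
             times = c(9, 7, 7, 5, 2))
)

# Word frequency, most frequent first
word_counts <- tidy_words |>
  count(word, sort = TRUE)

# Top words including ties; the vignette reports the top 20
top_k <- 2
top_words <- word_counts |>
  slice_max(n, n = top_k, with_ties = TRUE)

# Horizontal bar chart ordered by frequency
bar_chart <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Frequency", y = NULL)
print(bar_chart)

# Word cloud of every token that appears at least five times
set.seed(2022)  # makes the word placement reproducible
wordcloud(
  words = word_counts$word,
  freq = word_counts$n,
  min.freq = 5,
  random.order = FALSE
)
```

Because "hospital" and "economy" are tied for second place, `slice_max()` with `with_ties = TRUE` returns both, so a "top 20" chart can show more than 20 bars.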

1. Albanese Speech

2. Morrison Speech

Although commentary is outside the scope of this vignette, the visualisations show that the Liberal or Coalition brand was mentioned less frequently in Morrison’s speech than the Labor brand (not to be confused with labour) was in Albanese’s speech.

For another example of web scraping, see the vignette on scraping numerical data. For more advanced text analysis, see the vignettes on n-grams and correlations, sentiment analysis and topic modelling.


Session information and package update

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.0 (2024-04-24 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Australia.utf8
##  ctype    English_Australia.utf8
##  tz       Australia/Brisbane
##  date     2024-09-28
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.4.0)
##  bslib          0.7.0   2024-03-29 [1] CRAN (R 4.4.0)
##  cachem         1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  cli            3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
##  colorspace     2.1-0   2023-01-23 [1] CRAN (R 4.4.1)
##  crayon         1.5.3   2024-06-20 [1] CRAN (R 4.4.1)
##  data.table   * 1.15.4  2024-03-30 [1] CRAN (R 4.4.0)
##  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.4.0)
##  digest         0.6.36  2024-06-23 [1] CRAN (R 4.4.1)
##  dplyr        * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate       0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
##  fansi          1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
##  farver         2.1.2   2024-05-13 [1] CRAN (R 4.4.0)
##  fastmap        1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  forcats      * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
##  fs             1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
##  generics       0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
##  ggplot2      * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
##  glue           1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
##  gtable         0.3.5   2024-04-22 [1] CRAN (R 4.4.0)
##  here         * 1.0.1   2020-12-13 [1] CRAN (R 4.4.0)
##  highr          0.11    2024-05-26 [1] CRAN (R 4.4.0)
##  hms            1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools      0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets    1.6.4   2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv         1.6.15  2024-03-26 [1] CRAN (R 4.4.0)
##  httr           1.4.7   2023-08-15 [1] CRAN (R 4.4.0)
##  janeaustenr    1.0.0   2022-08-26 [1] CRAN (R 4.4.0)
##  jquerylib      0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite       1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
##  knitr          1.48    2024-07-07 [1] CRAN (R 4.4.1)
##  later          1.3.2   2023-12-06 [1] CRAN (R 4.4.0)
##  lattice        0.22-6  2024-03-20 [2] CRAN (R 4.4.0)
##  lifecycle      1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##  lubridate    * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
##  Matrix         1.7-0   2024-03-22 [2] CRAN (R 4.4.0)
##  memoise        2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  mime           0.12    2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
##  munsell        0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
##  pillar         1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild       1.4.4   2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload        1.4.0   2024-06-28 [1] CRAN (R 4.4.1)
##  polite       * 0.1.3   2023-06-30 [1] CRAN (R 4.4.0)
##  profvis        0.3.8   2023-05-02 [1] CRAN (R 4.4.0)
##  promises       1.3.0   2024-04-05 [1] CRAN (R 4.4.0)
##  purrr        * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
##  R6             2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
##  ratelimitr     0.4.1   2018-10-07 [1] CRAN (R 4.4.0)
##  RColorBrewer * 1.1-3   2022-04-03 [1] CRAN (R 4.4.0)
##  Rcpp           1.0.13  2024-07-17 [1] CRAN (R 4.4.1)
##  readr        * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
##  remotes        2.5.0   2024-03-17 [1] CRAN (R 4.4.0)
##  rlang          1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown      2.27    2024-05-17 [1] CRAN (R 4.4.0)
##  robotstxt      0.7.13  2020-09-03 [1] CRAN (R 4.4.0)
##  rprojroot      2.0.4   2023-11-05 [1] CRAN (R 4.4.0)
##  rstudioapi     0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
##  rvest        * 1.0.4   2024-02-12 [1] CRAN (R 4.4.0)
##  sass           0.4.9   2024-03-15 [1] CRAN (R 4.4.0)
##  scales         1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
##  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
##  shiny          1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
##  SnowballC      0.7.1   2023-04-25 [1] CRAN (R 4.4.0)
##  stringi        1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
##  stringr      * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
##  tibble       * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
##  tidyr        * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect     1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
##  tidytext     * 0.4.2   2024-04-10 [1] CRAN (R 4.4.0)
##  tidyverse    * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)
##  timechange     0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
##  tokenizers     0.3.0   2022-12-22 [1] CRAN (R 4.4.0)
##  tzdb           0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.4.0)
##  usethis        2.2.3   2024-02-19 [1] CRAN (R 4.4.0)
##  utf8           1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs          0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  withr          3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
##  wordcloud    * 2.6     2018-08-24 [1] CRAN (R 4.4.0)
##  xfun           0.46    2024-07-18 [1] CRAN (R 4.4.1)
##  xml2           1.3.6   2023-12-04 [1] CRAN (R 4.4.0)
##  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.4.0)
##  yaml           2.3.9   2024-07-05 [1] CRAN (R 4.4.1)
## 
##  [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.0/library
## 
## ──────────────────────────────────────────────────────────────────────────────