Objective

This vignette demonstrates web scraping of numerical data. In addition to analysing internal data for decision-making, organisations analyse external numerical data that is increasingly available on the Internet, including freely available government data, industry statistics, and competitors’ products and pricing.

Manually gathering ever-changing external data from the Internet can be a tedious, laborious, and repetitive task that is inefficient and prone to error. Web scraping offers a solution to these challenges, making the process more efficient and practical.

This vignette aims to demonstrate how specific numerical data can be scraped from the Internet, providing a more efficient way to analyse the performance of the top men’s and women’s tennis champions across all four Grand Slam tournaments at a particular point in time.

Workflow

Once a source of numerical data has been identified, there are several methods to scrape the web data for analysis, and some are more effective than others. Preferred methods include APIs (Application Programming Interfaces) and XPath (XML Path Language) queries used in conjunction with web scraping packages.
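As an illustration of the XPath approach, the following minimal sketch selects one table from a page and converts it to a data frame with the rvest package. The HTML fragment, the class name "wikitable", and the player names and counts are placeholders made up for this example, not data from the scraped pages.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a downloaded web page
page <- minimal_html('
  <table class="wikitable">
    <tr><th>Player</th><th>Titles</th></tr>
    <tr><td>Player A</td><td>12</td></tr>
    <tr><td>Player B</td><td>7</td></tr>
  </table>
  <table class="other">
    <tr><th>Note</th></tr>
    <tr><td>ignored</td></tr>
  </table>
')

# The XPath expression matches only the first table whose class
# attribute contains "wikitable"; html_table() turns it into a tibble
titles <- page |>
  html_element(xpath = '//table[contains(@class, "wikitable")]') |>
  html_table()

print(titles)
```

Targeting a table by an attribute rather than by its position on the page tends to be more robust when the page layout changes.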

It is responsible practice to request permission and follow the site’s conditions for scraping. In this Wikipedia example, web pages were scraped with a crawl delay of five seconds. Specific data was scraped from the following Wikipedia pages and prepared for visualisation:

List of men’s singles Grand Slam tennis champions

List of women’s singles Grand Slam tennis champions

Results

1. Men’s Tennis Champions

Performance table

The scraped dataset was filtered to list only those players, from the amateur and Open eras, with a total of 10 or more championship titles. This amounted to eight players at the end of the 2022 season, shown in Table 1.
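The filtering step can be sketched with dplyr as shown here. The column names and the three placeholder players with their title counts are invented for illustration and do not come from the scraped dataset.

```r
library(dplyr)

# Hypothetical titles per tournament for three placeholder players
champions <- tibble(
  player          = c("Player A", "Player B", "Player C"),
  australian_open = c(2, 6, 3),
  french_open     = c(1, 3, 5),
  wimbledon       = c(3, 4, 1),
  us_open         = c(2, 1, 1)
)

# Sum the four tournaments, keep players with 10 or more titles,
# and list them from most to fewest titles
top_players <- champions |>
  mutate(total = australian_open + french_open + wimbledon + us_open) |>
  filter(total >= 10) |>
  arrange(desc(total))

print(top_players)
```

Player A, with eight titles in this made-up data, falls below the threshold and is dropped, while Player C sits exactly on it and is kept.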

Visualisation

Chart 1 shows the number of championship titles held by the top men’s players at the end of the 2022 season, following the retirement of Roger Federer.

The results above show that Rafael Nadal holds 22 championship titles at Grand Slam events. Notably, since 2005, Nadal has dominated on clay at the French Open with 14 titles.
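A chart of titles by player and tournament could be drawn along the following lines with ggplot2. This is a sketch only: the data frame holds placeholder players and counts, and a stacked bar chart is an assumed design rather than a description of the chart in this vignette.

```r
library(ggplot2)

# Hypothetical long-format data: one row per player and tournament
titles_long <- data.frame(
  player     = rep(c("Player A", "Player B"), each = 4),
  tournament = rep(c("Australian Open", "French Open",
                     "Wimbledon", "US Open"), times = 2),
  titles     = c(2, 1, 3, 2, 6, 3, 4, 1)
)

# Stacked horizontal bars, ordered by each player's total titles
chart <- ggplot(
  titles_long,
  aes(x = reorder(player, titles, FUN = sum), y = titles, fill = tournament)
) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Championship titles", fill = "Tournament")

print(chart)
```

Stacking by tournament lets the same chart show both a player's total and where those titles were won.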

2. Women’s Tennis Champions

Performance table

The scraped dataset was filtered to list only those players, from the amateur and Open eras, with a total of 10 or more championship titles. This consisted of seven players at the end of the 2022 season, listed in Table 2.

Visualisation

Chart 2 shows the number of championship titles held by the top women’s players at the end of the 2022 season, following the retirement of Serena Williams.

Australian Margaret Court holds the most titles for women, followed by Serena Williams and Steffi Graf. All three top players hold multiple titles at each of the four Grand Slam events.

For another example of web scraping, see the vignette on scraping text data.


Session information and package update

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.0 (2024-04-24 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Australia.utf8
##  ctype    English_Australia.utf8
##  tz       Australia/Brisbane
##  date     2024-09-28
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.4.0)
##  bslib         0.7.0   2024-03-29 [1] CRAN (R 4.4.0)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
##  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.4.1)
##  crosstalk     1.2.1   2023-11-23 [1] CRAN (R 4.4.0)
##  data.table  * 1.15.4  2024-03-30 [1] CRAN (R 4.4.0)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.4.0)
##  digest        0.6.36  2024-06-23 [1] CRAN (R 4.4.1)
##  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
##  DT          * 0.33    2024-04-04 [1] CRAN (R 4.4.0)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate      0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.4.0)
##  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
##  ggplot2     * 3.5.1   2024-04-23 [1] CRAN (R 4.4.0)
##  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
##  gtable        0.3.5   2024-04-22 [1] CRAN (R 4.4.0)
##  here        * 1.0.1   2020-12-13 [1] CRAN (R 4.4.0)
##  highr         0.11    2024-05-26 [1] CRAN (R 4.4.0)
##  hms           1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv        1.6.15  2024-03-26 [1] CRAN (R 4.4.0)
##  httr          1.4.7   2023-08-15 [1] CRAN (R 4.4.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
##  knitr         1.48    2024-07-07 [1] CRAN (R 4.4.1)
##  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.4.0)
##  later         1.3.2   2023-12-06 [1] CRAN (R 4.4.0)
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 4.4.0)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  mime          0.12    2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
##  munsell       0.5.1   2024-04-01 [1] CRAN (R 4.4.0)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild      1.4.4   2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload       1.4.0   2024-06-28 [1] CRAN (R 4.4.1)
##  plotly      * 4.10.4  2024-01-13 [1] CRAN (R 4.4.0)
##  polite      * 0.1.3   2023-06-30 [1] CRAN (R 4.4.0)
##  profvis       0.3.8   2023-05-02 [1] CRAN (R 4.4.0)
##  promises      1.3.0   2024-04-05 [1] CRAN (R 4.4.0)
##  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
##  ratelimitr    0.4.1   2018-10-07 [1] CRAN (R 4.4.0)
##  Rcpp          1.0.13  2024-07-17 [1] CRAN (R 4.4.1)
##  readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.4.0)
##  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown     2.27    2024-05-17 [1] CRAN (R 4.4.0)
##  robotstxt     0.7.13  2020-09-03 [1] CRAN (R 4.4.0)
##  rprojroot     2.0.4   2023-11-05 [1] CRAN (R 4.4.0)
##  rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
##  rvest         1.0.4   2024-02-12 [1] CRAN (R 4.4.0)
##  sass          0.4.9   2024-03-15 [1] CRAN (R 4.4.0)
##  scales        1.3.0   2023-11-28 [1] CRAN (R 4.4.0)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
##  shiny         1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
##  stringi       1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
##  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
##  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
##  tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
##  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.4.0)
##  timechange    0.3.0   2024-01-18 [1] CRAN (R 4.4.0)
##  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.4.0)
##  usethis       2.2.3   2024-02-19 [1] CRAN (R 4.4.0)
##  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  viridisLite   0.4.2   2023-05-02 [1] CRAN (R 4.4.0)
##  WikipediR   * 1.7.1   2024-04-05 [1] CRAN (R 4.4.0)
##  withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
##  xfun          0.46    2024-07-18 [1] CRAN (R 4.4.1)
##  xml2          1.3.6   2023-12-04 [1] CRAN (R 4.4.0)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.4.0)
##  yaml          2.3.9   2024-07-05 [1] CRAN (R 4.4.1)
## 
##  [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.0/library
## 
## ──────────────────────────────────────────────────────────────────────────────