Exploring Categorical Variables
Wikipedia describes exploratory data analysis (EDA) as analysing datasets to summarise main characteristics, often using statistical graphics and other data visualisation methods. According to Wickham and Grolemund in their publication R for Data Science, two questions help make discoveries within data. These questions are: what type of variation occurs within variables and what type of covariation occurs between variables?
This vignette explores two categorical variables: respondents’ age group and job role. The dataset comprised 616 respondents from 10 public and private sector organisations experiencing organisational change.
The raw dataset was tidied prior to exploration. This included renaming variables, updating data types and checking for anomalies. Cases with missing values were removed. Apart from collapsing the original age group levels, no other wrangling was required.
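The tidying steps described above could look roughly like the following base R sketch. The raw age bands, the column names and the six example rows are all hypothetical, since the vignette does not show the original levels or data; only the general pattern (drop incomplete cases, then merge finer levels into broader ones) is taken from the text.

```r
# Hypothetical raw responses: the original age bands are NOT stated in the
# vignette, so these values are invented purely for illustration.
raw <- data.frame(
  age_group = c("18-24 yrs", "25-29 yrs", "30-39 yrs", NA, "50-59 yrs", "60+ yrs"),
  job_role  = c("Employee", "Employee", "Supervisor", "Employee", NA, "Middle management")
)

# Remove cases with a missing value in any column
tidy <- raw[complete.cases(raw), ]

# Convert to a factor, then collapse the finer bands into broader levels.
# Assigning a named list to levels() merges the old levels listed on the right.
tidy$age_group <- factor(tidy$age_group)
levels(tidy$age_group) <- list(
  "<30 yrs"   = c("18-24 yrs", "25-29 yrs"),
  "30-39 yrs" = "30-39 yrs",
  ">49 yrs"   = c("50-59 yrs", "60+ yrs")
)

table(tidy$age_group)
```

A tidyverse workflow might use `forcats::fct_collapse()` for the same purpose; base R is used here only to keep the sketch free of dependencies.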
Both age group and job role are summarised with frequency tables and then visualised with an appropriate chart. To conclude, the covariation between these two categorical variables is explored.
Table 1 summarises the distribution of age group by count and proportion. Chart 1 visualises the distribution of age group.
Table 1 Age group distribution summary (sorted by frequency)

| age group | n | percent |
|---|---|---|
| 40-49 yrs | 203 | 33% |
| 30-39 yrs | 180 | 30% |
| <30 yrs | 126 | 21% |
| >49 yrs | 100 | 16% |

N = 609
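A frequency table like Table 1 can be produced in a few lines of base R. The sketch below is an illustration, not the author's code: because the respondent-level data are not available, it rebuilds the age group variable from the published counts, and the object names (`age_group`, `age_summary`) are assumptions.

```r
# Rebuild the age group variable from the counts published in Table 1.
# This is a stand-in for the real respondent-level data.
age_levels <- c("<30 yrs", "30-39 yrs", "40-49 yrs", ">49 yrs")
age_group <- factor(
  rep(age_levels, times = c(126, 180, 203, 100)),
  levels = age_levels
)

# Count each level and sort from most to least frequent
counts <- sort(table(age_group), decreasing = TRUE)

# Combine counts and rounded percentages into one summary table
age_summary <- data.frame(
  age_group = names(counts),
  n         = as.integer(counts),
  percent   = paste0(round(100 * as.integer(counts) / sum(counts)), "%")
)

print(age_summary)
cat("N =", sum(counts), "\n")
```

The same steps would apply to job role by swapping in that variable.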
Table 2 summarises the distribution of job role by count and proportion. Chart 2 visualises the distribution of job role.
Table 2 Job role distribution summary (sorted by frequency)

| job role | n | percent |
|---|---|---|
| Employee | 358 | 59% |
| Middle management | 121 | 20% |
| Supervisor | 92 | 15% |
| Executive/Senior management | 40 | 7% |

N = 611
Table 3 shows the frequency of job role by age group.
Table 3 Contingency table of job role by age group

| Age group | Employee | Supervisor | Middle management | Executive/Senior management |
|---|---|---|---|---|
| <30 yrs | 107 | 9 | 9 | 0 |
| 30-39 yrs | 107 | 29 | 34 | 10 |
| 40-49 yrs | 98 | 36 | 51 | 16 |
| >49 yrs | 43 | 17 | 26 | 14 |

N = 606
Table 4 calculates the proportion of job role within each age group.
Table 4 Contingency table of job role by age group (row percentages)

| Age group | Employee | Supervisor | Middle management | Executive/Senior management |
|---|---|---|---|---|
| <30 yrs | 86% | 7% | 7% | 0% |
| 30-39 yrs | 59% | 16% | 19% | 6% |
| 40-49 yrs | 49% | 18% | 25% | 8% |
| >49 yrs | 43% | 17% | 26% | 14% |
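The row percentages in Table 4 follow directly from the counts in Table 3: each cell is divided by its age group total. A minimal base R sketch of that calculation is shown below, with the counts typed in from Table 3. With respondent-level data, `table(age_group, job_role)` would give the same count matrix; the object names here are assumptions.

```r
# Counts copied from Table 3 (rows: age group, columns: job role)
job_by_age <- matrix(
  c(107,  9,  9,  0,
    107, 29, 34, 10,
     98, 36, 51, 16,
     43, 17, 26, 14),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    age_group = c("<30 yrs", "30-39 yrs", "40-49 yrs", ">49 yrs"),
    job_role  = c("Employee", "Supervisor", "Middle management",
                  "Executive/Senior management")
  )
)

# Row totals give the number of respondents in each age group
print(rowSums(job_by_age))

# margin = 1 divides each cell by its row total, so each row sums to about 100
# (rounding can shift a row total by a point)
row_percent <- round(100 * prop.table(job_by_age, margin = 1))
print(row_percent)
```

Using `margin = 2` instead would give column percentages, that is, the age profile within each job role.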
Charts 3 and 4 visualise the frequency of job role across age groups using stacked and dodge plots.
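Stacked and dodged bar charts of this kind can be sketched with ggplot2, which appears in the session information. The author's plotting code is not shown, so the following is only one plausible way to draw them, using the counts from Table 3; the object names and axis labels are assumptions.

```r
library(ggplot2)

# Long-format counts copied from Table 3 (one row per age group / job role pair)
age_levels  <- c("<30 yrs", "30-39 yrs", "40-49 yrs", ">49 yrs")
role_levels <- c("Employee", "Supervisor", "Middle management",
                 "Executive/Senior management")
counts <- data.frame(
  age_group = factor(rep(age_levels, each = 4), levels = age_levels),
  job_role  = factor(rep(role_levels, times = 4), levels = role_levels),
  n = c(107,  9,  9,  0,
        107, 29, 34, 10,
         98, 36, 51, 16,
         43, 17, 26, 14)
)

# Shared mapping and labels for both charts
base_plot <- ggplot(counts, aes(x = age_group, y = n, fill = job_role)) +
  labs(x = "Age group", y = "Respondents", fill = "Job role")

# Stacked: one bar per age group, split by job role
stacked_chart <- base_plot + geom_col(position = "stack")

# Dodged: job roles drawn side by side within each age group
dodged_chart <- base_plot + geom_col(position = "dodge")
```

Calling `print(stacked_chart)` or `print(dodged_chart)` draws the chart. The stacked version makes age group totals easy to compare, while the dodged version makes individual job role counts easier to read.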
Charts 5 and 6 illustrate alternative ways to visualise frequency across two categorical variables.
Finally, Chart 7 is a stacked bar chart showing the proportion of job role within each age group. This chart highlights the covariation between these two categorical variables.
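A proportional stacked bar chart can be drawn in ggplot2 with `position = "fill"`, which rescales every bar to the same height so that only the within-group shares differ. The sketch below is an assumption about how such a chart might be built from the Table 3 counts, not the author's code; object names and labels are illustrative.

```r
library(ggplot2)

# Counts copied from Table 3 (rows: age group, columns: job role)
job_by_age <- matrix(
  c(107,  9,  9,  0,
    107, 29, 34, 10,
     98, 36, 51, 16,
     43, 17, 26, 14),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    age_group = c("<30 yrs", "30-39 yrs", "40-49 yrs", ">49 yrs"),
    job_role  = c("Employee", "Supervisor", "Middle management",
                  "Executive/Senior management")
  )
)

# Reshape the matrix to long format: one row per age group / job role pair
counts <- as.data.frame(as.table(job_by_age), responseName = "n")

# position = "fill" converts the counts to proportions within each age group
proportion_chart <- ggplot(counts, aes(x = age_group, y = n, fill = job_role)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Age group", y = "Proportion of respondents", fill = "Job role")
```

Calling `print(proportion_chart)` draws the chart. In the resulting chart, the shrinking Employee segment and the growing management segments across older age groups correspond to the row percentages in Table 4.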
This vignette explored the distribution within and across two categorical variables. See the vignette on categorical hypothesis testing to determine if there is a statistically significant relationship between age group and job role.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-29
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## askpass 1.2.0 2023-09-03 [1] CRAN (R 4.4.0)
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.1)
## crul 1.5.0 2024-07-19 [1] CRAN (R 4.4.1)
## curl 5.2.1 2024-03-01 [1] CRAN (R 4.4.0)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## flextable * 0.9.6 2024-05-05 [1] CRAN (R 4.4.0)
## fontBitstreamVera 0.1.1 2017-02-01 [1] CRAN (R 4.4.0)
## fontLiberation 0.1.0 2016-10-15 [1] CRAN (R 4.4.0)
## fontquiver 0.2.1 2017-02-01 [1] CRAN (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## gdtools 0.3.7 2024-03-05 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## gfonts 0.2.0 2023-01-08 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpcode 0.3.0 2020-04-10 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## officer 0.6.6 2024-05-05 [1] CRAN (R 4.4.0)
## openssl 2.2.0 2024-05-16 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## ragg 1.3.2 2024-05-15 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.4.0)
## textshaping 0.4.0 2024-05-24 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## uuid 1.2-0 2024-01-14 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
## zip 2.3.1 2024-01-27 [1] CRAN (R 4.4.0)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────