Simple Classification and Regression Decision Tree
According to Wikipedia, decision trees are a supervised learning approach for creating predictive models that draw conclusions about a set of observations. Decision tree models whose target variable is discrete are called classification trees, and those whose target variable is continuous (typically real numbers) are called regression trees. This vignette provides a simple example of classification and regression trees using recursive partitioning.
The dataset contained 251 observations from Queensland State Government employees experiencing organisational change. Respondents were sourced from two state government departments, a state government agency, and a state government authority.
As part of an organisational change research project, categorical information was gathered to understand the characteristics of individual respondents. This vignette explores the relationship between categorical data and the outcome variable by conducting and comparing classification and regression decision trees.
The raw data set was wrangled and tidied before processing. The classification and regression trees modelled six categorical independent variables. These were: time employed at the organisation, time employed in the current job role, job role, level of education, age group and gender. The target or outcome variable for both decision trees was reaction to organisational change.
These variables were first explored briefly through informative visualisations. The initial classification and regression trees were then grown and visualised. Each tree was pruned, based on its cross-validated error, to reduce over-fitting and improve accuracy. Finally, the pruned trees were visualised and the most important explanatory variables were identified.
The first step in the analysis was defining the target or outcome variable for the regression and classification decision trees. The regression tree used the original seven-point numeric scale as the outcome variable, with reaction to organisational change ranging from totally oppose to totally support. For the classification tree, the seven-point numeric response variable was collapsed into a three-level outcome variable (oppose, neutral, support). Tables 1 and 2 show the outcome variable for each decision tree.
Table 1 Response variable for the regression tree

Numeric scale for reaction to change | Freq |
---|---|
1 | 23 |
2 | 71 |
3 | 90 |
4 | 73 |
5 | 124 |
6 | 189 |
7 | 46 |

Note: 1=Totally oppose, 2=Oppose, 3=Partially oppose, 4=Neutral, 5=Partially support, 6=Support, 7=Totally support.
Table 2 Response variable for the classification tree

Categorical scale for reaction to change | Freq |
---|---|
Oppose | 184 |
Neutral | 73 |
Support | 359 |
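
The collapse from the seven-point scale to three levels can be checked against the two tables. The sketch below rebuilds the numeric responses from the Table 1 frequencies and recodes them with base R's `cut()`. The cut points (1 to 3 as oppose, 4 as neutral, 5 to 7 as support) are inferred from the scale labels rather than taken from the original code, and the variable names `reaction` and `reaction_cat` are borrowed from Outputs 2 and 4.

```r
# Rebuild the seven-point responses from the Table 1 frequencies.
freq <- c(23, 71, 90, 73, 124, 189, 46)
reaction <- rep(1:7, times = freq)

# Collapse the scale: 1-3 -> Oppose, 4 -> Neutral, 5-7 -> Support.
# cut() uses right-closed intervals by default: (0,3], (3,4], (4,7].
reaction_cat <- cut(reaction,
                    breaks = c(0, 3, 4, 7),
                    labels = c("Oppose", "Neutral", "Support"))

# Frequency table of the three-level outcome.
table(reaction_cat)
```

Under these assumed cut points, the counts agree with Table 2 (184 oppose, 73 neutral, 359 support).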
Charts 1 to 6 illustrate the distribution of each categorical independent variable with missing values removed. Chart 7 is a bar chart of the categorical outcome variable for the classification tree. Chart 8 is a histogram of the numeric outcome variable for the regression tree.
The classification tree was grown and then pruned to determine the most important explanatory variables associated with the outcome variable, reaction to organisational change.
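
As a rough illustration of the growing step, the sketch below fits a classification tree with `rpart()` on simulated data. The survey data are not reproduced here, so the data frame `sim`, its simulated signal, and the choice of `cp = 0` to grow a deliberately large tree are all assumptions made for illustration. Only the column names are taken from Output 2.

```r
library(rpart)

set.seed(42)  # reproducible simulated data
n <- 600

# Hypothetical stand-in for the survey data (not the vignette's data).
sim <- data.frame(
  job_role = factor(sample(c("Employee", "Supervisor", "Middle management",
                             "Senior management"), n, replace = TRUE)),
  time_org = factor(sample(c("<2 yrs", "2 to <5 yrs", "5 to <10 yrs"),
                           n, replace = TRUE)),
  gender   = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Assumed signal: managers mostly support, employees mostly oppose.
p_support <- ifelse(sim$job_role == "Employee", 0.3, 0.8)
sim$reaction_cat <- factor(ifelse(runif(n) < p_support, "Support", "Oppose"))

# method = "class" requests a classification tree; cp = 0 lets the tree
# grow fully so that it can be pruned afterwards.
fit_class <- rpart(reaction_cat ~ job_role + time_org + gender,
                   data = sim, method = "class",
                   control = rpart.control(cp = 0))

printcp(fit_class)  # complexity parameter table, in the layout of Output 1
```

A tree grown this way can be drawn with `rpart.plot::rpart.plot(fit_class)`.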
Chart 9 visualises the grown classification tree.
Note: Please magnify the screen to see Chart 9 in greater detail.
The classification tree was pruned to reduce over-fitting. Output 1 lists the candidate prunings and their complexity parameter (cp) values. Fit 3 has the lowest cross-validated error (xerror = 0.953), and its cp (0.0187) was used to prune the tree. Chart 10 plots the cross-validated error against cp, with each point placed at the geometric mean of adjacent cp values. The pruned tree has four splits and five leaves.
Output 1 Complexity parameter table
CP nsplit rel error xerror xstd
1 0.042056075 0 1.0000000 1.0000000 0.07322384
2 0.032710280 2 0.9158879 1.0280374 0.07346581
3 0.018691589 4 0.8504673 0.9532710 0.07272313
4 0.009345794 7 0.7850467 0.9813084 0.07303826
5 0.004672897 9 0.7663551 0.9813084 0.07303826
6 0.000000000 13 0.7476636 1.0000000 0.07322384
Chart 10 Complexity parameter cross-validation results
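
The selection of the pruning cp can be reproduced from Output 1 alone. The sketch below re-enters the relevant columns by hand and picks the row with the lowest cross-validated error. The data frame name `cp_table` is a hypothetical label used only for this illustration.

```r
# Selected columns of Output 1, re-entered by hand.
cp_table <- data.frame(
  CP     = c(0.042056075, 0.032710280, 0.018691589,
             0.009345794, 0.004672897, 0.000000000),
  nsplit = c(0, 2, 4, 7, 9, 13),
  xerror = c(1.0000000, 1.0280374, 0.9532710,
             0.9813084, 0.9813084, 1.0000000)
)

# Row with the lowest cross-validated error.
best <- cp_table[which.min(cp_table$xerror), ]

best$CP          # cp value used for pruning
best$nsplit + 1  # a binary tree with k splits has k + 1 leaves
```

With a fitted `rpart` object, the same table is stored in `fit$cptable`, and `prune(fit, cp = best$CP)` returns the pruned tree.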
Chart 11 shows the pruned classification tree for this data set.
Note: Please magnify the screen to see Chart 11 in greater detail.
Output 2 translates the above classification tree into a set of rules. Each rule gives the predicted class, the fitted probabilities of oppose, neutral and support in that leaf, and the share of respondents the rule covers. Job role defined the largest rule: supervisors, middle managers and senior managers made up 47 per cent of respondents, and 76 per cent of them supported the change. Employees who had been at their organisation for two years or more made up 37 per cent of respondents and were classified as opposing the change, with 47 per cent of that group opposed. Older employees (40 years and over) who had joined their organisation less than two years earlier were supportive of change. In contrast, the reaction of younger employees (under 40 years) with less than two years at the organisation split by gender, with males classified as supportive and females as neutral.
Output 2 Set of rules for the pruned classification tree
Oppo Neut Supp
reaction_cat is Oppose [ .47 .16 .37] with cover 37% when
job_role is Employee
time_org is 2 to <5 yrs or 5 to <10 yrs or 10 to <20 yrs or 20+ yrs
reaction_cat is Neutral [ .16 .56 .28] with cover 10% when
job_role is Employee
time_org is <2 yrs
age_group is <30 yrs or 30-39 yrs
gender is Female
reaction_cat is Support [ .22 .11 .67] with cover 4% when
job_role is Employee
time_org is <2 yrs
age_group is <30 yrs or 30-39 yrs
gender is Male
reaction_cat is Support [ .14 .09 .76] with cover 47% when
job_role is Supervisor or Middle management or Senior management
reaction_cat is Support [ .00 .00 1.00] with cover 3% when
job_role is Employee
time_org is <2 yrs
age_group is 40-49 yrs or >49 yrs
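
To make the rule set concrete, the five rules in Output 2 can be written as nested conditions. The function below is a hand-written sketch, not output from the fitted model. The name `predict_reaction` is hypothetical, and the category labels are copied from Output 2.

```r
# The pruned classification tree from Output 2 as nested conditions.
predict_reaction <- function(job_role, time_org, age_group, gender) {
  # Supervisors, middle managers and senior managers: Support.
  if (job_role != "Employee") return("Support")
  # Employees with two or more years at the organisation: Oppose.
  if (time_org != "<2 yrs") return("Oppose")
  # Newer employees aged 40 and over: Support.
  if (age_group %in% c("40-49 yrs", ">49 yrs")) return("Support")
  # Newer employees under 40: the rule depends on gender.
  if (gender == "Female") "Neutral" else "Support"
}

predict_reaction("Employee", "<2 yrs", "30-39 yrs", "Female")
predict_reaction("Senior management", "10 to <20 yrs", ">49 yrs", "Male")
```

The first call falls under the Neutral rule and the second under the job-role Support rule.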
Chart 12 shows that, for the classification tree, job role, followed by the length of time at the organisation and age group, were the most important variables associated with an individual’s reaction to organisational change.
Chart 13 visualises the relationship between job role, length of time with the organisation and the outcome variable, reaction to organisational change.
Chart 14 visualises the relationship between job role, age group and the outcome variable, reaction to organisational change.
The regression tree was grown and then pruned to determine the most important explanatory variables associated with the outcome variable, reaction to organisational change.
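
The grow, prune and rank sequence for a regression tree might look like the sketch below. It uses simulated data, because the survey data are not reproduced here. The data frame `sim`, the simulated scores and the `cp = 0` setting are assumptions for illustration. `method = "anova"` is rpart's setting for a numeric outcome.

```r
library(rpart)

set.seed(1)  # reproducible simulated data
n <- 600

# Hypothetical stand-in for the survey data (not the vignette's data).
sim <- data.frame(
  job_role = factor(sample(c("Employee", "Supervisor", "Senior management"),
                           n, replace = TRUE)),
  gender   = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# Assumed signal: senior management scores higher on the 1-7 scale.
base <- ifelse(sim$job_role == "Senior management", 6, 4)
sim$reaction <- pmin(7, pmax(1, round(base + rnorm(n))))

# Grow a large regression tree.
fit_reg <- rpart(reaction ~ job_role + gender, data = sim,
                 method = "anova", control = rpart.control(cp = 0))

# Prune at the cp with the lowest cross-validated error.
cp_tab  <- fit_reg$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(fit_reg, cp = best_cp)

# Importance scores stored on the pruned tree.
pruned$variable.importance
```

The rules of a pruned tree can be listed with `rpart.plot::rpart.rules(pruned, cover = TRUE)`, and `vip::vip(pruned)` draws an importance chart.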
Chart 15 visualises the grown regression tree. To assist interpretation, the seven-point numeric scale for this regression tree was: 1=Totally oppose, 2=Oppose, 3=Partially oppose, 4=Neutral, 5=Partially support, 6=Support, 7=Totally support.
Note: Please magnify the screen to see Chart 15 in greater detail.
The regression tree was pruned to reduce over-fitting. Output 3 lists the candidate prunings and their complexity parameter (cp) values. Fit 9 has the lowest cross-validated error (xerror = 0.893), and its cp (0.0098) was used to prune the tree. Chart 16 plots the cross-validated error against cp, with each point placed at the geometric mean of adjacent cp values. The pruned tree has 10 splits and 11 leaves.
Output 3 Complexity parameter table
CP nsplit rel error xerror xstd
1 0.1122875993 0 1.0000000 1.0062459 0.07003313
2 0.0456215880 1 0.8877124 0.9761758 0.07628802
3 0.0451262138 2 0.8420908 0.9810970 0.07784462
4 0.0311902747 3 0.7969646 0.9887460 0.07919452
5 0.0292846222 4 0.7657743 0.9505621 0.07516157
6 0.0186000754 5 0.7364897 0.9346427 0.07403404
7 0.0154021094 6 0.7178896 0.9255104 0.07484184
8 0.0127631885 7 0.7024875 0.9135059 0.07382542
9 0.0097623799 10 0.6641980 0.8928810 0.07328894
10 0.0082922737 11 0.6544356 0.9007314 0.07465176
11 0.0074957703 12 0.6461433 0.9119382 0.07564988
12 0.0072881312 13 0.6386475 0.9223913 0.07620666
13 0.0059862343 14 0.6313594 0.9227458 0.07626552
14 0.0049927757 15 0.6253732 0.9129758 0.07589752
15 0.0047143236 16 0.6203804 0.9228279 0.07611790
16 0.0021162574 17 0.6156661 0.9169090 0.07453774
17 0.0002843738 18 0.6135498 0.9124043 0.07620968
18 0.0000000000 19 0.6132654 0.9119606 0.07611458
Chart 16 Complexity parameter cross-validation results
Chart 17 shows the pruned regression tree for this data set.
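Output 4 translates the above regression tree into a set of rules. Each rule gives the mean reaction score for a leaf and the share of respondents it covers. Because the outcome variable is continuous, the rule set for the regression tree is more granular than that of the classification tree (11 rules compared with five).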
Output 4 Set of rules for the pruned regression tree
reaction is 2.8 with cover 9% when
job_role is Employee
time_role is 2 to <5 yrs or 5 to <10 yrs or 10+ yrs
gender is Male
reaction is 3.5 with cover 4% when
job_role is Employee or Supervisor or Middle management
age_group is 40-49 yrs or >49 yrs
education is High school or Certificate/Diploma or Undergraduate degree
reaction is 4.0 with cover 20% when
job_role is Employee
time_role is <2 yrs
age_group is <30 yrs or 30-39 yrs
gender is Female
reaction is 4.2 with cover 7% when
job_role is Employee
time_role is 2 to <5 yrs or 5 to <10 yrs or 10+ yrs
gender is Female
reaction is 4.8 with cover 6% when
job_role is Employee
time_role is <2 yrs
age_group is 40-49 yrs or >49 yrs
time_org is 2 to <5 yrs or 5 to <10 yrs or 10 to <20 yrs or 20+ yrs
reaction is 4.8 with cover 8% when
job_role is Employee
time_role is <2 yrs
age_group is <30 yrs or 30-39 yrs
gender is Male
reaction is 4.9 with cover 10% when
job_role is Employee or Supervisor or Middle management
education is Postgraduate degree
reaction is 5.1 with cover 4% when
job_role is Employee or Supervisor or Middle management
age_group is <30 yrs or 30-39 yrs
education is High school or Certificate/Diploma or Undergraduate degree
reaction is 5.2 with cover 16% when
job_role is Employee or Supervisor or Middle management
education is High school or Certificate/Diploma
reaction is 6.0 with cover 13% when
job_role is Senior management
reaction is 6.3 with cover 3% when
job_role is Employee
time_role is <2 yrs
age_group is 40-49 yrs or >49 yrs
time_org is <2 yrs
Chart 18 shows that, for the regression tree, job role, followed by age group and length of time at the organisation, were the most important variables associated with an individual’s reaction to organisational change.
Chart 19 visualises the relationship between job role and the outcome variable, reaction to organisational change.
The above classification and regression trees produced similar results. Both models identified job role as the most important categorical variable associated with reaction to organisational change. Senior management, middle management and supervisors generally supported the organisational change, whereas employees were more likely to remain neutral or oppose the change. The length of time at the organisation and age group were also important categorical variables associated with reaction to organisational change.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-30
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] CRAN (R 4.4.0)
## backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.0)
## broom 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## car 3.1-2 2023-03-30 [1] CRAN (R 4.4.0)
## carData 3.0-5 2022-01-06 [1] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.0)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## cowplot 1.1.3 2024-01-22 [1] CRAN (R 4.4.0)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## fontawesome 0.5.2 2023-08-19 [1] CRAN (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## ggalluvial * 0.12.5 2023-02-22 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## ggpubr * 0.6.0 2023-02-10 [1] CRAN (R 4.4.0)
## ggsignif 0.6.4 2022-10-13 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gt * 0.11.0 2024-07-09 [1] CRAN (R 4.4.1)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## gtExtras * 0.5.0 2023-09-15 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## paletteer 1.6.0 2024-01-21 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## rematch2 2.1.2 2020-05-01 [1] CRAN (R 4.4.0)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## rpart * 4.1.23 2023-12-05 [2] CRAN (R 4.4.0)
## rpart.plot * 3.1.2 2024-02-26 [1] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rstatix 0.7.2 2023-02-01 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## vip * 0.4.1 2023-08-21 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────