Extreme Gradient Boosting Regression
Gradient boosting is a powerful machine-learning technique for building classification and regression predictive models. It is a decision tree ensemble method in which trees are added sequentially, each one fitted to the errors left by the trees before it, so that the combined prediction improves step by step.
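The sequential idea can be illustrated with a minimal sketch in base R and rpart. It is not the xgboost algorithm, which adds regularisation and other refinements; it only shows trees being fitted to the residuals of the ensemble so far. The simulated data and the values of `learn_rate` and `n_trees` are assumptions chosen for illustration.

```r
library(rpart)

set.seed(42)
# Simulated data (hypothetical): a noisy non-linear signal.
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

learn_rate <- 0.1  # shrinkage applied to each tree's contribution (assumed value)
n_trees    <- 50   # number of boosting iterations (assumed value)

# Start from a constant prediction: the mean of the outcome.
pred      <- rep(mean(toy$y), nrow(toy))
rmse_path <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  # Each new tree is fitted to the residuals left by the trees before it.
  toy$resid <- toy$y - pred
  tree <- rpart(resid ~ x, data = toy,
                control = rpart.control(maxdepth = 2, cp = 0))
  # Add a shrunken version of the new tree's predictions to the ensemble.
  pred <- pred + learn_rate * predict(tree, newdata = toy)
  rmse_path[m] <- sqrt(mean((toy$y - pred)^2))
}

# Training error after 1, 10 and 50 trees.
round(rmse_path[c(1, 10, 50)], 3)
```

The training error falls as trees are added, which is why the number of trees and the learning rate are tuned together against resampled data rather than training error.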
This vignette demonstrates extreme gradient boosting regression with two objectives. First, develop a model identifying the most significant explanatory variables associated with the target variable, reaction to organisational change. Second, predict an individual’s reaction to change and evaluate the accuracy of these predictions.
The original data set comprised 616 respondents from 10 public and private sector organisations experiencing organisational change. Respondents reported self-efficacy, irrational ideas, maladaptive defence mechanisms, emotion, behavioural intentions and reaction towards change in their organisation.
The raw data set was wrangled and tidied before processing. A brief exploratory analysis, comprising a statistical summary, variable distributions and a correlation analysis, was conducted to understand the variables. The gradient boosting regression model used explanatory variables fitted by confirmatory factor analysis (CFA), exploratory factor analysis (EFA) and principal component analysis (PCA). See the vignettes on CFA, EFA and PCA for more information about dimensionality reduction of these explanatory variables.
The gradient boosting regression model was developed and fitted on the training data using a workflow that considered resampling methods, feature engineering, model specifications and optimised hyperparameters. The results of the trained model were then reviewed to identify important predictor variables associated with reaction to organisational change.
The trained model was then applied to the unseen test data to predict the target or outcome variable. The model’s performance was evaluated with regression metrics and a scatterplot comparing actual outcomes with model-predicted outcomes for the test set.
The data set was filtered to analyse only those respondents who reported experiencing significant organisational change, leaving 218 of the original 616 respondents. Table 1 provides a statistical summary of the explanatory variables and the outcome variable, reaction.
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| self_efficacy | 1 | 218 | 5.56 | 0.80 | 5.65 | 5.64 | 0.76 | 1.71 | 6.94 | 5.24 | -1.21 | 2.80 | 0.05 |
| irrational_ideas | 2 | 218 | 3.69 | 0.81 | 3.68 | 3.69 | 0.86 | 1.21 | 5.79 | 4.58 | -0.05 | -0.26 | 0.06 |
| defence_mechanisms | 3 | 218 | 2.98 | 0.76 | 2.83 | 2.95 | 0.86 | 1.25 | 5.25 | 4.00 | 0.30 | -0.46 | 0.05 |
| emotion | 4 | 218 | 3.80 | 1.13 | 3.80 | 3.79 | 1.11 | 1.00 | 6.90 | 5.90 | 0.09 | -0.22 | 0.08 |
| behavioural_intentions | 5 | 218 | 5.04 | 1.09 | 5.20 | 5.09 | 1.04 | 1.60 | 7.00 | 5.40 | -0.57 | 0.13 | 0.07 |
| reaction | 6 | 218 | 4.23 | 1.91 | 5.00 | 4.31 | 1.48 | 1.00 | 7.00 | 6.00 | -0.27 | -1.37 | 0.13 |
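Several columns of Table 1 are less familiar than the mean and standard deviation. The base R sketch below shows how they could be computed for a single variable, assuming a 10% trimmed mean and the default scaling of `mad()`. The column layout of Table 1 resembles the output of `psych::describe()`, but that attribution is an assumption, and the scores below are simulated rather than taken from the survey.

```r
# Hypothetical scores on a 1-7 scale, standing in for one survey variable.
set.seed(1)
x <- round(runif(218, min = 1, max = 7), 2)

summary_row <- c(
  n       = length(x),
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),       # mean after dropping 10% from each tail
  mad     = mad(x),                    # median absolute deviation, scaled by 1.4826
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(length(x))    # standard error of the mean
)

round(summary_row, 2)
```

As a check against Table 1, self_efficacy has a standard deviation of 0.80 and n of 218, so its standard error is roughly 0.80 / sqrt(218), or about 0.05, which agrees with the reported value.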
Chart 1 uses combined violin and box plots to visualise the distributions of the explanatory variables and the outcome variable, reaction.
Chart 2 is a correlation heatmap summarising the correlation coefficients between each pair of variables.
Model building began by randomly splitting the data into training and testing sets at a 3:1 ratio using stratified sampling. Stratifying on the outcome variable keeps the distribution of reaction scores approximately the same in the training and testing sets. The training set was then resampled using five repeats of 10-fold cross-validation. The recipe for gradient boosting was a standard formula with no additional feature engineering. The model was specified, and a workflow was created for implementation.
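The steps in this paragraph could be sketched with tidymodels as follows. This is an illustration under stated assumptions, not the vignette's original code: the data are simulated, and object names such as `change_data`, `xgb_recipe` and `xgb_workflow` are hypothetical. Only the column names, the 3:1 split, the five repeats of 10-fold cross-validation and the tuned parameters listed in Table 2 come from the text.

```r
library(tidymodels)

set.seed(2024)
# Simulated stand-in for the survey data (hypothetical values, real column names).
n_obs <- 218
change_data <- tibble(
  self_efficacy          = runif(n_obs, 1, 7),
  irrational_ideas       = runif(n_obs, 1, 7),
  defence_mechanisms     = runif(n_obs, 1, 7),
  emotion                = runif(n_obs, 1, 7),
  behavioural_intentions = runif(n_obs, 1, 7)
) %>%
  mutate(reaction = pmin(7, pmax(1, round(
    0.6 * behavioural_intentions + 0.4 * emotion + rnorm(n_obs, sd = 0.8)
  ))))

# 3:1 split, stratified on the outcome (numeric strata are binned internally).
change_split <- initial_split(change_data, prop = 0.75, strata = reaction)
change_train <- training(change_split)
change_test  <- testing(change_split)

# Five repeats of 10-fold cross-validation on the training set.
change_folds <- vfold_cv(change_train, v = 10, repeats = 5, strata = reaction)

# Formula-only recipe: no additional feature engineering steps.
xgb_recipe <- recipe(reaction ~ ., data = change_train)

# Boosted-tree specification with the Table 2 parameters marked for tuning.
xgb_spec <- boost_tree(
  trees = tune(), min_n = tune(), tree_depth = tune(),
  learn_rate = tune(), loss_reduction = tune(), sample_size = tune()
) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

# Bundle preprocessing and model into one workflow.
xgb_workflow <- workflow() %>%
  add_recipe(xgb_recipe) %>%
  add_model(xgb_spec)

nrow(change_folds)  # 5 repeats x 10 folds = 50 resamples
```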
Because the optimal hyperparameters for the model are not known in advance, an efficient grid search was conducted using racing methods. Racing evaluates all candidate hyperparameter combinations on an initial subset of resamples, then drops combinations that are unlikely to produce the best results as further resamples are evaluated. Tuning was run with parallel processing because it can be computationally intensive. Chart 3 shows the model tuning results.
Table 2 summarises the optimal parameter tuning combination for the boosted model.
| trees | min_n | tree_depth | learn_rate | loss_reduction | sample_size | .config |
|---|---|---|---|---|---|---|
| 887 | 7 | 3 | 0.0058 | 0.0159 | 0.5982 | Preprocessor1_Model03 |
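A racing search of this kind could be sketched with `tune_race_anova()` from the finetune package, which appears in the session info. The example below is a hedged illustration on simulated data with hypothetical object names. The grid size of 8 and the single repeat of cross-validation are assumptions made to keep the sketch quick, not the settings used in the vignette, and the racing model relies on the lme4 package being installed.

```r
library(tidymodels)
library(finetune)

set.seed(2024)
# Simulated stand-in for the survey data (hypothetical values, real column names).
n_obs <- 218
change_data <- tibble(
  self_efficacy          = runif(n_obs, 1, 7),
  irrational_ideas       = runif(n_obs, 1, 7),
  defence_mechanisms     = runif(n_obs, 1, 7),
  emotion                = runif(n_obs, 1, 7),
  behavioural_intentions = runif(n_obs, 1, 7)
) %>%
  mutate(reaction = pmin(7, pmax(1, round(
    0.6 * behavioural_intentions + 0.4 * emotion + rnorm(n_obs, sd = 0.8)
  ))))

change_train <- training(initial_split(change_data, prop = 0.75, strata = reaction))
# One repeat only, to keep the sketch fast (the text uses five repeats).
change_folds <- vfold_cv(change_train, v = 10, repeats = 1, strata = reaction)

xgb_workflow <- workflow() %>%
  add_formula(reaction ~ .) %>%
  add_model(
    boost_tree(
      trees = tune(), min_n = tune(), tree_depth = tune(),
      learn_rate = tune(), loss_reduction = tune(), sample_size = tune()
    ) %>%
      set_mode("regression") %>%
      set_engine("xgboost")
  )

# A parallel backend could be registered before this call; it is omitted here
# so that the sketch runs sequentially on any machine.
race_results <- tune_race_anova(
  xgb_workflow,
  resamples = change_folds,
  grid      = 8,                   # number of candidate combinations (assumed)
  metrics   = metric_set(rmse),
  control   = control_race(burn_in = 3, verbose_elim = FALSE)
)

# Best surviving combination, in the same column layout as Table 2.
best_params <- select_best(race_results, metric = "rmse")
best_params
```

All candidates are evaluated on the first `burn_in` resamples; after that, combinations that are statistically unlikely to beat the current best are eliminated, so later resamples are spent only on promising candidates.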
With parameter tuning complete, the workflow and model were finalised for fitting.
The tuned model was fitted on the training set. Chart 4 illustrates the importance of each explanatory variable in predicting the target variable, reaction to organisational change. Behavioural intentions and emotion are the most important predictors of reaction to change. The importance and order of the three remaining explanatory variables vary with the random data split and resampling.
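Finalising, fitting and extracting variable importance could look like the sketch below. The hyperparameter values are those reported in Table 2; everything else is an assumption for illustration, including the simulated data, the object names, and the use of `vip::vi()` to obtain importance scores (vip is attached in the session info, but the exact call used for Chart 4 is not shown in the text).

```r
library(tidymodels)
library(vip)

set.seed(2024)
# Simulated stand-in for the survey data (hypothetical values, real column names).
n_obs <- 218
change_data <- tibble(
  self_efficacy          = runif(n_obs, 1, 7),
  irrational_ideas       = runif(n_obs, 1, 7),
  defence_mechanisms     = runif(n_obs, 1, 7),
  emotion                = runif(n_obs, 1, 7),
  behavioural_intentions = runif(n_obs, 1, 7)
) %>%
  mutate(reaction = pmin(7, pmax(1, round(
    0.6 * behavioural_intentions + 0.4 * emotion + rnorm(n_obs, sd = 0.8)
  ))))

change_train <- training(initial_split(change_data, prop = 0.75, strata = reaction))

xgb_workflow <- workflow() %>%
  add_formula(reaction ~ .) %>%
  add_model(
    boost_tree(
      trees = tune(), min_n = tune(), tree_depth = tune(),
      learn_rate = tune(), loss_reduction = tune(), sample_size = tune()
    ) %>%
      set_mode("regression") %>%
      set_engine("xgboost")
  )

# Hyperparameter values taken from Table 2.
best_params <- tibble(
  trees = 887, min_n = 7, tree_depth = 3,
  learn_rate = 0.0058, loss_reduction = 0.0159, sample_size = 0.5982
)

# Replace the tune() placeholders with the chosen values, then fit.
final_workflow <- finalize_workflow(xgb_workflow, best_params)
final_fit      <- fit(final_workflow, data = change_train)

# Importance scores from the fitted xgboost model, sorted from most to least important.
importance <- vi(extract_fit_parsnip(final_fit))
importance
```

Because the simulated outcome is built mainly from behavioural intentions and emotion, those two variables should rank highest here; with the real data the ranking depends on the fitted model.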
The model fitted on the training data was then applied to the unseen testing data to predict the outcome variable. Table 3 shows actual reaction and predicted reaction (.pred) in a small sample of observations extracted from the testing data.
| self_efficacy | irrational_ideas | defence_mechanisms | emotion | behavioural_intentions | reaction | .pred |
|---|---|---|---|---|---|---|
| 6.59 | 5.37 | 3.25 | 2.45 | 4.4 | 2 | 1.7428 |
| 6.00 | 3.95 | 4.08 | 2.85 | 2.7 | 2 | 1.6289 |
| 5.47 | 4.16 | 4.08 | 1.90 | 1.8 | 1 | 1.9479 |
| 5.00 | 3.11 | 2.33 | 2.70 | 4.6 | 2 | 2.8512 |
| 6.35 | 4.74 | 3.17 | 3.60 | 5.2 | 5 | 3.3025 |
| 4.65 | 4.37 | 2.75 | 2.70 | 3.2 | 1 | 1.4711 |
| 6.53 | 3.95 | 2.42 | 5.50 | 6.9 | 6 | 6.4034 |
| 5.59 | 3.37 | 3.58 | 3.95 | 5.4 | 6 | 5.6920 |
| 5.47 | 3.05 | 3.33 | 5.65 | 6.1 | 6 | 6.0225 |
| 6.71 | 3.16 | 2.42 | 5.55 | 6.3 | 7 | 6.4579 |
| 6.06 | 3.05 | 3.42 | 2.65 | 3.4 | 2 | 1.1696 |
| 6.35 | 3.32 | 4.25 | 2.80 | 3.9 | 3 | 1.9728 |
| 5.00 | 3.79 | 3.42 | 3.90 | 5.3 | 5 | 5.1483 |
| 5.65 | 3.16 | 3.00 | 4.10 | 6.3 | 6 | 6.2475 |
| 5.12 | 4.47 | 2.33 | 3.30 | 3.8 | 3 | 2.6512 |
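One way to obtain test-set predictions like those in Table 3 is `last_fit()`, which trains the finalised workflow on the training portion of a split and predicts the held-out portion. Whether the vignette used `last_fit()` or a separate `predict()` call is not stated, so this is an assumption. The data are simulated and the object names are hypothetical; the hyperparameters are the Table 2 values.

```r
library(tidymodels)

set.seed(2024)
# Simulated stand-in for the survey data (hypothetical values, real column names).
n_obs <- 218
change_data <- tibble(
  self_efficacy          = runif(n_obs, 1, 7),
  irrational_ideas       = runif(n_obs, 1, 7),
  defence_mechanisms     = runif(n_obs, 1, 7),
  emotion                = runif(n_obs, 1, 7),
  behavioural_intentions = runif(n_obs, 1, 7)
) %>%
  mutate(reaction = pmin(7, pmax(1, round(
    0.6 * behavioural_intentions + 0.4 * emotion + rnorm(n_obs, sd = 0.8)
  ))))

change_split <- initial_split(change_data, prop = 0.75, strata = reaction)

# Workflow with hyperparameters fixed at the Table 2 values.
final_workflow <- workflow() %>%
  add_formula(reaction ~ .) %>%
  add_model(
    boost_tree(
      trees = 887, min_n = 7, tree_depth = 3,
      learn_rate = 0.0058, loss_reduction = 0.0159, sample_size = 0.5982
    ) %>%
      set_mode("regression") %>%
      set_engine("xgboost")
  )

# Fit on the training portion and predict the held-out test portion.
test_fit <- last_fit(final_workflow, change_split)

test_predictions <- collect_predictions(test_fit)  # observed reaction and .pred
test_metrics     <- collect_metrics(test_fit)      # rmse and rsq by default

head(test_predictions)
test_metrics
```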
Table 4 summarises key regression metrics for the test set.
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 1.0296 | Preprocessor1_Model1 |
| rsq | standard | 0.7235 | Preprocessor1_Model1 |
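To make the two metrics concrete, the base R sketch below computes them for the 15 sampled observations in Table 3. It assumes R-squared is defined as the squared correlation between observed and predicted values, which is how yardstick's `rsq` metric is defined. Because this is only a small sample of the test set, the results are not expected to reproduce Table 4.

```r
# The 15 sampled test observations from Table 3.
reaction <- c(2, 2, 1, 2, 5, 1, 6, 6, 6, 7, 2, 3, 5, 6, 3)
pred <- c(1.7428, 1.6289, 1.9479, 2.8512, 3.3025, 1.4711, 6.4034, 5.6920,
          6.0225, 6.4579, 1.1696, 1.9728, 5.1483, 6.2475, 2.6512)

# Root mean squared error: typical size of a prediction error, in outcome units.
rmse <- sqrt(mean((reaction - pred)^2))

# R-squared as the squared correlation between observed and predicted values.
rsq <- cor(reaction, pred)^2

round(c(rmse = rmse, rsq = rsq), 3)
```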
Chart 5 is a scatterplot comparing the actual reaction to change with the predicted reaction for the test set. The dotted line (y = x) represents a perfect model, in which every predicted value equals the actual value. Overall, the model performed well on the test set, with a coefficient of determination (R²) of 0.72 and a root mean squared error (RMSE) of 1.03 on the 1 to 7 reaction scale (Table 4).
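A plot of this kind could be drawn with ggplot2 as sketched below, here using only the 15 sampled observations from Table 3. The axis assignment, labels and styling are assumptions and may differ from the original chart.

```r
library(ggplot2)

# The 15 sampled test observations from Table 3.
sample_preds <- data.frame(
  reaction = c(2, 2, 1, 2, 5, 1, 6, 6, 6, 7, 2, 3, 5, 6, 3),
  .pred = c(1.7428, 1.6289, 1.9479, 2.8512, 3.3025, 1.4711, 6.4034, 5.6920,
            6.0225, 6.4579, 1.1696, 1.9728, 5.1483, 6.2475, 2.6512)
)

obs_pred_plot <- ggplot(sample_preds, aes(x = reaction, y = .pred)) +
  # Line of perfect prediction: predicted value equals actual value.
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  geom_point(alpha = 0.6) +
  # Equal scaling on both axes so the reference line sits at 45 degrees.
  coord_fixed(xlim = c(1, 7), ylim = c(1, 7)) +
  labs(x = "Actual reaction", y = "Predicted reaction")

obs_pred_plot
```

Points above the dotted line are over-predictions and points below it are under-predictions, so systematic bias shows up as points clustering on one side.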
References:
Self-efficacy was measured using the ‘Self-efficacy scale: Construction and validation’ by Sherer, Maddux, Mercandante, Prentice-Dunn and Rogers, published in Psychological Reports.
Irrational ideas were measured using the ‘Irrational belief scale’ developed by Malouff and Schutte, published in the Sourcebook of Adult Assessment Strategies, based on Ellis and Harper’s work, published in A New Guide to Rational Living.
Maladaptive defence mechanisms were measured using selected items from ‘The Defense Style Questionnaire’ by Andrews, Singh and Bond, published in The Journal of Nervous and Mental Disease.
Emotion was measured using ‘A semantic differential mood scale’ by Lorr and Wunderlich, published in the Journal of Clinical Psychology.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-30
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.0)
## base64enc 0.1-3 2015-07-28 [1] CRAN (R 4.4.0)
## boot 1.3-30 2024-02-26 [2] CRAN (R 4.4.0)
## broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## class 7.3-22 2023-05-03 [2] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.0)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## cvms * 1.6.1 2024-02-27 [1] CRAN (R 4.4.0)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## dials * 1.2.1 2024-02-22 [1] CRAN (R 4.4.0)
## DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## doParallel * 1.0.17 2022-02-07 [1] CRAN (R 4.4.0)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## finetune * 1.2.0 2024-03-21 [1] CRAN (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
## foreach * 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## furrr 0.3.1 2022-08-15 [1] CRAN (R 4.4.0)
## future 1.33.2 2024-03-26 [1] CRAN (R 4.4.0)
## future.apply 1.11.2 2024-03-28 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## GGally * 2.2.1 2024-02-14 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## ggstats 0.6.0 2024-04-05 [1] CRAN (R 4.4.0)
## globals 0.16.3 2024-03-08 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gower 1.0.1 2022-12-22 [1] CRAN (R 4.4.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.4.0)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## hardhat 1.4.0 2024-06-02 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)
## ipred 0.9-15 2024-07-18 [1] CRAN (R 4.4.1)
## iterators * 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.0)
## lava 1.8.0 2024-03-05 [1] CRAN (R 4.4.0)
## lhs 1.2.0 2024-06-30 [1] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## listenv 0.9.1 2024-01-29 [1] CRAN (R 4.4.0)
## lme4 1.1-35.5 2024-07-03 [1] CRAN (R 4.4.1)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## MASS 7.3-60.2 2024-04-24 [2] local
## Matrix 1.7-0 2024-03-22 [2] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## minqa 1.2.7 2024-05-20 [1] CRAN (R 4.4.0)
## mnormt 2.1.1 2022-09-26 [1] CRAN (R 4.4.0)
## modeldata * 1.4.0 2024-06-19 [1] CRAN (R 4.4.1)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## nlme 3.1-164 2023-11-27 [2] CRAN (R 4.4.0)
## nloptr 2.1.1 2024-06-25 [1] CRAN (R 4.4.1)
## nnet 7.3-19 2023-05-03 [2] CRAN (R 4.4.0)
## parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.4.0)
## parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.0)
## prodlim 2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## psych * 2.4.6.26 2024-06-27 [1] CRAN (R 4.4.1)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## recipes * 1.1.0 2024-07-04 [1] CRAN (R 4.4.1)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## repr 1.1.7 2024-03-22 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## rpart 4.1.23 2023-12-05 [2] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales * 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## skimr * 2.1.5 2022-12-23 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## survival 3.5-8 2024-02-14 [2] CRAN (R 4.4.0)
## svglite 2.1.3 2023-12-08 [1] CRAN (R 4.4.0)
## systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## timeDate 4032.109 2023-12-14 [1] CRAN (R 4.4.0)
## tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usemodels * 0.2.0 2022-02-18 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## vip * 0.4.1 2023-08-21 [1] CRAN (R 4.4.0)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.4.0)
## workflowsets * 1.1.0 2024-03-21 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xgboost * 1.7.8.1 2024-07-24 [1] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
## yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────