Machine Learning Regression Decision Tree
According to Wikipedia, decision trees are a supervised machine learning approach commonly used in predictive modelling to draw conclusions about a set of observations. Decision tree models where the target variable is discrete are called classification trees, and those where the target variable is continuous (typically a real number) are called regression trees. Decision trees represent decisions visually and explicitly, which makes them useful for decision-making.
This vignette demonstrates a regression decision tree using machine learning to model and predict client satisfaction with service quality. Service quality is critical for business success. It is essential for attracting new customers and retaining existing customers. The data set consists of observations from 424 clients in a business-to-business (B2B) relationship. Data was gathered using a custom-designed survey instrument based on the SERVQUAL theoretical framework. SERVQUAL consists of five dimensions: tangibles, assurance, reliability, responsiveness and empathy.
In addition to these five dimensions, the survey measured each client’s overall satisfaction with service quality.
The raw data set was wrangled and tidied before processing. A brief exploratory analysis, comprising a statistical summary, distribution visualisation and correlation analysis, was conducted to understand the variables.
The regression decision tree model was developed and fitted on the training data using a workflow that considered resampling methods, feature engineering and hyperparameter optimisation. The results of the trained model were reviewed, and the important predictor variables associated with a client’s overall satisfaction with service quality were identified.
The trained model was then applied to the unseen test data to predict the target or outcome variable. Model performance on the test data was evaluated using regression metrics and a scatterplot comparing actual overall satisfaction with predicted overall satisfaction.
Table 1 is a statistical summary for each of the five explanatory variables and the target variable, overall satisfaction.
| variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| tangibles | 1 | 424 | 5.92 | 0.90 | 6.0 | 6.05 | 0.74 | 2 | 7 | 5 | -1.61 | 3.51 | 0.04 |
| assurance | 2 | 424 | 5.52 | 1.07 | 6.0 | 5.61 | 0.74 | 2 | 7 | 5 | -0.85 | 0.09 | 0.05 |
| reliability | 3 | 424 | 5.43 | 1.32 | 6.0 | 5.60 | 0.74 | 1 | 7 | 6 | -1.13 | 0.60 | 0.06 |
| responsiveness | 4 | 424 | 5.29 | 1.37 | 6.0 | 5.44 | 1.48 | 1 | 7 | 6 | -0.90 | 0.23 | 0.07 |
| empathy | 5 | 424 | 5.39 | 1.17 | 5.5 | 5.47 | 1.48 | 1 | 7 | 6 | -0.78 | 0.71 | 0.06 |
| overall_satisfaction | 6 | 424 | 5.66 | 1.23 | 6.0 | 5.86 | 0.00 | 1 | 7 | 6 | -1.63 | 2.52 | 0.06 |
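The columns in Table 1 can be reproduced with a few lines of base R. The sketch below is illustrative only: the `ratings` vector and the `summarise_scores()` helper are hypothetical, and the 10% trim and the 1.4826 scaling of the median absolute deviation are assumptions chosen because they match the values in the table (for example, a mad of 0.74 is 0.5 × 1.4826).

```r
# Hypothetical 1-7 ratings for illustration; NOT the survey data.
ratings <- c(6, 6, 7, 5, 6, 6, 4, 7, 6, 2, 6, 5)

# Hypothetical helper returning the main columns of Table 1 for one variable.
summarise_scores <- function(x) {
  n <- length(x)
  c(
    n       = n,
    mean    = mean(x),
    sd      = sd(x),
    median  = median(x),
    trimmed = mean(x, trim = 0.1),  # drop 10% from each tail before averaging
    mad     = mad(x),               # median absolute deviation, scaled by 1.4826
    min     = min(x),
    max     = max(x),
    range   = max(x) - min(x),
    se      = sd(x) / sqrt(n)       # standard error of the mean
  )
}

round(summarise_scores(ratings), 2)

# Consistency check against Table 1: se = sd / sqrt(n) for responsiveness
se_responsiveness <- round(1.37 / sqrt(424), 2)
se_responsiveness
```

A mad of 0.00 for overall satisfaction means that more than half of the responses sit exactly on the median of 6.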
Chart 1 combines violin and box plots to show the distributions of the five explanatory variables and the outcome variable, overall satisfaction. The plots show favourable levels of service quality for every variable: each median is 5.5 or higher on a scale with a maximum of 7 (Table 1).
Chart 2, a correlation heatmap, shows that all variables are positively correlated, with coefficients ranging from 0.55 to 0.85. The intangible measures are more closely correlated with one another than with the tangible measure.
Model building commenced by randomly splitting the data into training and testing sets at a 3:1 ratio using stratified sampling. Stratified sampling keeps the distribution of the outcome variable approximately the same in both sets, so that the training and testing sets are balanced. The training set was then resampled using 10-fold cross-validation, repeated five times, giving 50 resamples (the n column in Table 2). The recipe for the regression tree was a standard formula, with no additional feature engineering. Hyperparameters were initially left at their default settings, then optimised using a regular grid search. These components were added to a workflow for execution.
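The split and resampling scheme described above can be sketched in base R. This is a stand-in, not the author's code: the `survey` data frame is simulated, the seed is arbitrary, and stratifying on each distinct rating is a simplification of whatever binning the original workflow used.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the survey: 424 ratings on a 1-7 scale.
survey <- data.frame(
  id = seq_len(424),
  overall_satisfaction = sample(1:7, 424, replace = TRUE,
                                prob = c(1, 1, 2, 3, 6, 12, 9))
)

# Stratified 3:1 split: sample 75% of the rows within each outcome level.
train_ids <- unlist(lapply(
  split(survey$id, survey$overall_satisfaction),
  function(ids) ids[sample.int(length(ids), size = round(0.75 * length(ids)))]
))
train <- survey[survey$id %in% train_ids, ]
test  <- survey[!survey$id %in% train_ids, ]

# 10-fold cross-validation repeated 5 times: each repeat reshuffles the
# training rows and assigns every row to one of 10 folds.
make_folds <- function(n, v = 10) sample(rep(seq_len(v), length.out = n))
fold_assignments <- replicate(5, make_folds(nrow(train)), simplify = FALSE)

# Each (repeat, fold) pair is one resample: fit on 9 folds, assess on the 10th.
n_resamples <- sum(sapply(fold_assignments, function(f) length(unique(f))))
n_resamples
```

Every candidate model is therefore scored 50 times, and the reported metric is the mean over those 50 assessments.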
The workflow was executed using parallel processing, during which the hyperparameters were tuned. Table 2 shows a short list of parameter tuning combinations.
| row | cost_complexity | tree_depth | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 1.0000e-10 | 8 | 40 | rmse | standard | 0.7691907 | 50 | 0.02095406 | Preprocessor1_Model22 |
| 9 | 3.1623e-06 | 15 | 21 | rmse | standard | 0.7200883 | 50 | 0.02211900 | Preprocessor1_Model17 |
| 5 | 1.0000e-10 | 15 | 21 | rmse | standard | 0.7200868 | 50 | 0.02211868 | Preprocessor1_Model16 |
| 7 | 3.1623e-06 | 8 | 21 | rmse | standard | 0.7200857 | 50 | 0.02211885 | Preprocessor1_Model14 |
| 1 | 1.0000e-10 | 8 | 21 | rmse | standard | 0.7200842 | 50 | 0.02211852 | Preprocessor1_Model13 |
| 2 | 1.0000e-10 | 8 | 21 | rsq | standard | 0.6548680 | 50 | 0.01869506 | Preprocessor1_Model13 |
| 6 | 1.0000e-10 | 15 | 21 | rsq | standard | 0.6548675 | 50 | 0.01869508 | Preprocessor1_Model16 |
| 8 | 3.1623e-06 | 8 | 21 | rsq | standard | 0.6548674 | 50 | 0.01869517 | Preprocessor1_Model14 |
| 10 | 3.1623e-06 | 15 | 21 | rsq | standard | 0.6548669 | 50 | 0.01869519 | Preprocessor1_Model17 |
| 4 | 1.0000e-10 | 8 | 40 | rsq | standard | 0.6095659 | 50 | 0.01803846 | Preprocessor1_Model22 |
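The parameter values in Table 2 appear consistent with a regular grid of three evenly spaced levels per hyperparameter. The sketch below rebuilds such a grid in base R. The ranges (cost-complexity from 1e-10 to 1e-1 on a log10 scale, tree depth from 1 to 15, minimal node size from 2 to 40) are inferred from the table and are an assumption, not a setting stated in the source.

```r
# Three evenly spaced levels per hyperparameter. The ranges are assumptions
# inferred from the values visible in Table 2.
cost_complexity <- 10^seq(-10, -1, length.out = 3)  # spaced on the log10 scale
tree_depth      <- seq(1, 15, length.out = 3)
min_n           <- seq(2, 40, length.out = 3)

# expand.grid varies the first column fastest
grid <- expand.grid(
  cost_complexity = cost_complexity,
  tree_depth      = tree_depth,
  min_n           = min_n
)

nrow(grid)   # 27 candidate combinations
grid[13, ]   # cost_complexity 1e-10, tree_depth 8, min_n 21
```

Under these assumptions, rows 13, 14, 16, 17 and 22 of the grid carry the same parameter values as Models 13, 14, 16, 17 and 22 in Table 2.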
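Model 13 in Table 2 has the best parameter tuning combination, with the lowest root mean square error (rmse, 0.7200842) and the highest R2 (rsq, 0.6548680). Models 14, 16 and 17 are effectively tied with it, differing only from the sixth decimal place. Among these four, Model 13 also has the lowest cost-complexity (1e-10) and the shallower tree depth (8), together with a minimal node size (min_n, the minimum number of observations a node must contain before it can be split) of 21. Model 13 parameters have been extracted in Table 3.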
| cost_complexity | tree_depth | min_n | .config |
|---|---|---|---|
| 1e-10 | 8 | 21 | Preprocessor1_Model13 |
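The selection of Model 13 can be checked directly from the rmse rows of Table 2. The sketch below transcribes those rows into a hypothetical `tuning` data frame and picks the row with the lowest mean rmse. It also compares the gap between the four near-tied models with the standard error of the estimate.

```r
# rmse rows transcribed from Table 2
tuning <- data.frame(
  config          = c("Model22", "Model17", "Model16", "Model14", "Model13"),
  cost_complexity = c(1e-10, 3.1623e-06, 1e-10, 3.1623e-06, 1e-10),
  tree_depth      = c(8, 15, 15, 8, 8),
  min_n           = c(40, 21, 21, 21, 21),
  rmse            = c(0.7691907, 0.7200883, 0.7200868, 0.7200857, 0.7200842),
  std_err         = c(0.02095406, 0.02211900, 0.02211868, 0.02211885, 0.02211852)
)

# Best candidate = lowest mean rmse across the 50 resamples
best <- tuning[which.min(tuning$rmse), ]
best$config

# Gap between the four min_n = 21 models, relative to the resampling noise
spread <- diff(range(tuning$rmse[tuning$min_n == 21]))
spread / best$std_err
```

The spread across the four models is a tiny fraction of one standard error, so the choice between them rests on preferring the simpler settings rather than on a measurable difference in accuracy.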
Chart 3 visualises Tables 2 and 3 by comparing parameter tuning combinations. With a minimal node size of 21, a tree depth of 8 (as in Model 13) and a tree depth of 15 give practically identical rmse and R2 values, as shown in Table 2. Consequently, the green line for tree depth 8 is plotted directly beneath the line for tree depth 15 and is not visible in Chart 3.
With parameter tuning complete, the workflow and model were finalised for fitting.
The tuned model was fitted on the training set resulting in the regression decision tree shown in Chart 4.
Note: Please magnify the screen to see Chart 4 in greater detail.
Chart 5 illustrates the importance of the predictor variables for the target variable, overall satisfaction. Responsiveness and reliability were the most important variables in explaining overall client satisfaction.
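As a rough illustration of how a tree of this kind is fitted and how variable importance is read from it, the sketch below uses rpart (listed in the session information) on simulated data. Everything about the data is hypothetical: the `toy` data frame, its coefficients and the seed are invented for the example. The mapping of the Table 3 values onto `cp`, `maxdepth` and `minsplit` is an assumption about how the tuned parameters translate to rpart's controls.

```r
library(rpart)

set.seed(42)  # arbitrary seed

# Synthetic stand-in data, NOT the survey: two informative predictors and one
# pure-noise predictor, all on a rough 1-7 scale.
n <- 300
toy <- data.frame(
  responsiveness = runif(n, 1, 7),
  reliability    = runif(n, 1, 7),
  tangibles      = runif(n, 1, 7)
)
toy$overall_satisfaction <- 0.6 * toy$responsiveness +
  0.3 * toy$reliability + rnorm(n, sd = 0.5)

# Table 3 values mapped onto rpart's control arguments (assumed mapping:
# cp = cost-complexity, maxdepth = tree depth, minsplit = minimal node size).
fit <- rpart(
  overall_satisfaction ~ .,
  data    = toy,
  method  = "anova",
  control = rpart.control(cp = 1e-10, maxdepth = 8, minsplit = 21)
)

# Importance sums the improvement in fit credited to each variable across
# splits, including its contribution as a surrogate splitter.
importance <- fit$variable.importance
round(100 * importance / sum(importance), 1)
```

In this simulation the predictor with the largest coefficient ranks first, which is the pattern an importance chart is meant to surface.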
The model fitted on the training data was then applied to the unseen testing data to predict the outcome variable. Table 4 shows actual overall satisfaction and predicted overall satisfaction (.pred) in a small sample of observations extracted from the testing data.
| tangibles | assurance | reliability | responsiveness | empathy | overall_satisfaction | .pred |
|---|---|---|---|---|---|---|
| 7.0 | 6.5 | 6.5 | 6.5 | 7.0 | 7 | 6.6111 |
| 7.0 | 7.0 | 7.0 | 7.0 | 7.0 | 7 | 6.9688 |
| 6.0 | 5.0 | 3.0 | 2.0 | 4.0 | 2 | 5.4118 |
| 2.0 | 3.5 | 3.0 | 2.5 | 3.5 | 3 | 3.2000 |
| 6.0 | 6.0 | 6.0 | 6.0 | 6.5 | 6 | 6.0000 |
| 6.5 | 6.5 | 6.5 | 6.0 | 6.5 | 7 | 6.6111 |
| 6.0 | 6.0 | 6.0 | 6.0 | 5.5 | 6 | 6.0000 |
| 6.0 | 3.5 | 2.0 | 2.5 | 4.5 | 3 | 2.4615 |
| 6.0 | 5.0 | 6.0 | 6.0 | 6.0 | 6 | 5.9091 |
| 6.0 | 6.0 | 5.0 | 4.5 | 5.5 | 6 | 5.7778 |
| 6.5 | 4.0 | 4.0 | 3.5 | 4.0 | 4 | 4.4375 |
| 4.5 | 5.0 | 6.0 | 6.0 | 6.0 | 6 | 5.9091 |
| 2.0 | 3.5 | 3.0 | 2.5 | 3.5 | 3 | 3.2000 |
| 6.0 | 5.0 | 5.0 | 5.5 | 5.0 | 6 | 6.0000 |
| 6.0 | 6.0 | 5.0 | 5.0 | 6.0 | 6 | 5.7778 |
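To make the metrics concrete, the sketch below computes rmse and R2 by hand for the 15 rows in Table 4. These rows are only a small sample of the test set, so the results are not expected to match the test-set metrics. Treating R2 as the squared correlation between actual and predicted values is an assumption about how the rsq metric is defined.

```r
# Actual and predicted values transcribed from Table 4
actual <- c(7, 7, 2, 3, 6, 7, 6, 3, 6, 6, 4, 6, 3, 6, 6)
pred   <- c(6.6111, 6.9688, 5.4118, 3.2000, 6.0000, 6.6111, 6.0000, 2.4615,
            5.9091, 5.7778, 4.4375, 5.9091, 3.2000, 6.0000, 5.7778)

rmse <- function(truth, estimate) sqrt(mean((truth - estimate)^2))
rsq  <- function(truth, estimate) cor(truth, estimate)^2  # squared correlation

rmse(actual, pred)
rsq(actual, pred)

# The third row (actual 2, predicted 5.4118) is the largest miss in the sample.
worst <- which.max(abs(actual - pred))
rmse(actual[-worst], pred[-worst])
```

On this sample the rmse is roughly 0.92, falling to roughly 0.26 once the single worst row is excluded. Because errors are squared before averaging, one badly predicted low-satisfaction client dominates the metric.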
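Model performance was evaluated on the test data. Table 5 summarises key regression metrics for the test set. The R2 on the test set was 0.6174, marginally lower than the cross-validated R2 of 0.6549 on the training set. Likewise, the test rmse of 0.8042 was somewhat higher than the cross-validated rmse of 0.7201.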
| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| rmse | standard | 0.8042 | Preprocessor1_Model1 |
| rsq | standard | 0.6174 | Preprocessor1_Model1 |
Chart 6 is a scatterplot comparing actual overall satisfaction with predicted overall satisfaction for the test set. The dotted line through the origin (x = y) represents a perfect model, where every predicted value would equal the true value. The chart shows that the model predicts higher levels of satisfaction more accurately than lower levels. This is most likely because the training data contains relatively few observations with low satisfaction compared with the large number of observations with high satisfaction, which leaves the model with little information about dissatisfied clients.
Reference:
Data was gathered using a custom-designed survey instrument based on the SERVQUAL theoretical framework. The SERVQUAL methodology is documented in Delivering Quality Service: Balancing Customer Perceptions and Expectations by Zeithaml, Parasuraman and Berry.
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.0 (2024-04-24 ucrt)
## os Windows 11 x64 (build 22631)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_Australia.utf8
## ctype English_Australia.utf8
## tz Australia/Brisbane
## date 2024-07-30
## pandoc 3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.0)
## broom * 1.0.6 2024-05-17 [1] CRAN (R 4.4.0)
## bslib 0.7.0 2024-03-29 [1] CRAN (R 4.4.0)
## cachem 1.1.0 2024-05-16 [1] CRAN (R 4.4.0)
## class 7.3-22 2023-05-03 [2] CRAN (R 4.4.0)
## cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
## codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.0)
## colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.4.1)
## data.table * 1.15.4 2024-03-30 [1] CRAN (R 4.4.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.4.0)
## dials * 1.2.1 2024-02-22 [1] CRAN (R 4.4.0)
## DiceDesign 1.10 2023-12-07 [1] CRAN (R 4.4.0)
## digest 0.6.36 2024-06-23 [1] CRAN (R 4.4.1)
## doParallel * 1.0.17 2022-02-07 [1] CRAN (R 4.4.0)
## dplyr * 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.4.0)
## evaluate 0.24.0 2024-06-10 [1] CRAN (R 4.4.0)
## fansi 1.0.6 2023-12-08 [1] CRAN (R 4.4.0)
## farver 2.1.2 2024-05-13 [1] CRAN (R 4.4.0)
## fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
## forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.4.0)
## foreach * 1.5.2 2022-02-02 [1] CRAN (R 4.4.0)
## fs 1.6.4 2024-04-25 [1] CRAN (R 4.4.0)
## furrr 0.3.1 2022-08-15 [1] CRAN (R 4.4.0)
## future 1.33.2 2024-03-26 [1] CRAN (R 4.4.0)
## future.apply 1.11.2 2024-03-28 [1] CRAN (R 4.4.0)
## generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
## GGally * 2.2.1 2024-02-14 [1] CRAN (R 4.4.0)
## ggplot2 * 3.5.1 2024-04-23 [1] CRAN (R 4.4.0)
## ggstats 0.6.0 2024-04-05 [1] CRAN (R 4.4.0)
## globals 0.16.3 2024-03-08 [1] CRAN (R 4.4.0)
## glue 1.7.0 2024-01-09 [1] CRAN (R 4.4.0)
## gower 1.0.1 2022-12-22 [1] CRAN (R 4.4.0)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.4.0)
## gtable 0.3.5 2024-04-22 [1] CRAN (R 4.4.0)
## hardhat 1.4.0 2024-06-02 [1] CRAN (R 4.4.0)
## here * 1.0.1 2020-12-13 [1] CRAN (R 4.4.0)
## highr 0.11 2024-05-26 [1] CRAN (R 4.4.0)
## hms 1.1.3 2023-03-21 [1] CRAN (R 4.4.0)
## htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
## htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.0)
## httpuv 1.6.15 2024-03-26 [1] CRAN (R 4.4.0)
## infer * 1.0.7 2024-03-25 [1] CRAN (R 4.4.0)
## ipred 0.9-15 2024-07-18 [1] CRAN (R 4.4.1)
## iterators * 1.0.14 2022-02-05 [1] CRAN (R 4.4.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.4.0)
## jsonlite 1.8.8 2023-12-04 [1] CRAN (R 4.4.0)
## kableExtra * 1.4.0 2024-01-24 [1] CRAN (R 4.4.0)
## knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
## labeling 0.4.3 2023-08-29 [1] CRAN (R 4.4.0)
## later 1.3.2 2023-12-06 [1] CRAN (R 4.4.0)
## lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.0)
## lava 1.8.0 2024-03-05 [1] CRAN (R 4.4.0)
## lhs 1.2.0 2024-06-30 [1] CRAN (R 4.4.1)
## lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
## listenv 0.9.1 2024-01-29 [1] CRAN (R 4.4.0)
## lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.4.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
## MASS 7.3-60.2 2024-04-24 [2] local
## Matrix 1.7-0 2024-03-22 [2] CRAN (R 4.4.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.4.0)
## mime 0.12 2021-09-28 [1] CRAN (R 4.4.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
## mnormt 2.1.1 2022-09-26 [1] CRAN (R 4.4.0)
## modeldata * 1.4.0 2024-06-19 [1] CRAN (R 4.4.1)
## munsell 0.5.1 2024-04-01 [1] CRAN (R 4.4.0)
## nlme 3.1-164 2023-11-27 [2] CRAN (R 4.4.0)
## nnet 7.3-19 2023-05-03 [2] CRAN (R 4.4.0)
## parallelly 1.37.1 2024-02-29 [1] CRAN (R 4.4.0)
## parsnip * 1.2.1 2024-03-22 [1] CRAN (R 4.4.0)
## pillar 1.9.0 2023-03-22 [1] CRAN (R 4.4.0)
## pkgbuild 1.4.4 2024-03-17 [1] CRAN (R 4.4.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
## pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.4.1)
## plyr 1.8.9 2023-10-02 [1] CRAN (R 4.4.0)
## prodlim 2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.4.0)
## promises 1.3.0 2024-04-05 [1] CRAN (R 4.4.0)
## psych * 2.4.6.26 2024-06-27 [1] CRAN (R 4.4.1)
## purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.4.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.0)
## RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.4.0)
## Rcpp 1.0.13 2024-07-17 [1] CRAN (R 4.4.1)
## readr * 2.1.5 2024-01-10 [1] CRAN (R 4.4.0)
## recipes * 1.1.0 2024-07-04 [1] CRAN (R 4.4.1)
## remotes 2.5.0 2024-03-17 [1] CRAN (R 4.4.0)
## rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.0)
## rmarkdown 2.27 2024-05-17 [1] CRAN (R 4.4.0)
## rpart * 4.1.23 2023-12-05 [2] CRAN (R 4.4.0)
## rpart.plot * 3.1.2 2024-02-26 [1] CRAN (R 4.4.0)
## rprojroot 2.0.4 2023-11-05 [1] CRAN (R 4.4.0)
## rsample * 1.2.1 2024-03-25 [1] CRAN (R 4.4.0)
## rstudioapi 0.16.0 2024-03-24 [1] CRAN (R 4.4.0)
## sass 0.4.9 2024-03-15 [1] CRAN (R 4.4.0)
## scales * 1.3.0 2023-11-28 [1] CRAN (R 4.4.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
## shiny 1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
## stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.0)
## stringr * 1.5.1 2023-11-14 [1] CRAN (R 4.4.0)
## survival 3.5-8 2024-02-14 [2] CRAN (R 4.4.0)
## svglite 2.1.3 2023-12-08 [1] CRAN (R 4.4.0)
## systemfonts 1.1.0 2024-05-15 [1] CRAN (R 4.4.0)
## tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
## tidymodels * 1.2.0 2024-03-25 [1] CRAN (R 4.4.0)
## tidyr * 1.3.1 2024-01-24 [1] CRAN (R 4.4.0)
## tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
## tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.4.0)
## timechange 0.3.0 2024-01-18 [1] CRAN (R 4.4.0)
## timeDate 4032.109 2023-12-14 [1] CRAN (R 4.4.0)
## tune * 1.2.1 2024-04-18 [1] CRAN (R 4.4.0)
## tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.4.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.4.0)
## usethis 2.2.3 2024-02-19 [1] CRAN (R 4.4.0)
## utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
## vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
## vip * 0.4.1 2023-08-21 [1] CRAN (R 4.4.0)
## viridisLite 0.4.2 2023-05-02 [1] CRAN (R 4.4.0)
## withr 3.0.0 2024-01-16 [1] CRAN (R 4.4.0)
## workflows * 1.1.4 2024-02-19 [1] CRAN (R 4.4.0)
## workflowsets * 1.1.0 2024-03-21 [1] CRAN (R 4.4.0)
## xfun 0.46 2024-07-18 [1] CRAN (R 4.4.1)
## xml2 1.3.6 2023-12-04 [1] CRAN (R 4.4.0)
## xtable 1.8-4 2019-04-21 [1] CRAN (R 4.4.0)
## yaml 2.3.9 2024-07-05 [1] CRAN (R 4.4.1)
## yardstick * 1.3.1 2024-03-21 [1] CRAN (R 4.4.0)
##
## [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
## [2] C:/Program Files/R/R-4.4.0/library
##
## ──────────────────────────────────────────────────────────────────────────────