Machine Learning Regression Decision Tree

Objective

According to Wikipedia, decision trees are a supervised machine learning approach commonly used in predictive modelling to draw conclusions about a set of observations. Decision tree models where the target variable is discrete are called classification trees, and those where the target variable is continuous (typically real numbers) are called regression trees. Decision trees represent decisions visually and explicitly, which supports decision-making.

This vignette demonstrates a regression decision tree using machine learning to model and predict client satisfaction with service quality. Service quality is critical for business success. It is essential for attracting new customers and retaining existing customers. The data set consists of observations from 424 clients in a business-to-business (B2B) relationship. Data was gathered using a custom-designed survey instrument based on the SERVQUAL theoretical framework. SERVQUAL consists of five dimensions described as follows:

Tangibles – the appearance of physical facilities, equipment, personnel and communication materials
Assurance – knowledge and ability to inspire trust and confidence
Reliability – ability to perform service dependably and accurately
Responsiveness – willingness to help and provide prompt service
Empathy – providing caring and individualised attention.

In addition to the above specific dimensions, the survey measured each client’s overall satisfaction with service quality.

Workflow

The raw data set was wrangled and tidied before processing. A brief exploratory analysis, comprising a statistical summary, distribution visualisation and correlation analysis, was conducted to understand the variables.

The regression decision tree model was developed and fitted on the training data using a workflow that considered resampling methods, feature engineering and hyperparameter optimisation. The results of the training model were then reviewed to identify the important predictor variables associated with a client’s overall satisfaction with service quality.

The trained model was then applied to the unseen test data to predict the target or outcome variable. Model performance on the test data was evaluated with regression metrics and a scatterplot comparing actual with predicted overall satisfaction.

Results

1. Explore data

Table 1 is a statistical summary for each of the five explanatory variables and the target variable, overall satisfaction.

Table 1 Statistical summary of service quality measures
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
tangibles	1	424	5.92	0.90	6.0	6.05	0.74	2	7	5	-1.61	3.51	0.04
assurance	2	424	5.52	1.07	6.0	5.61	0.74	2	7	5	-0.85	0.09	0.05
reliability	3	424	5.43	1.32	6.0	5.60	0.74	1	7	6	-1.13	0.60	0.06
responsiveness	4	424	5.29	1.37	6.0	5.44	1.48	1	7	6	-0.90	0.23	0.07
empathy	5	424	5.39	1.17	5.5	5.47	1.48	1	7	6	-0.78	0.71	0.06
overall_satisfaction	6	424	5.66	1.23	6.0	5.86	0.00	1	7	6	-1.63	2.52	0.06

Chart 1 combines violin and box plots to show the distribution of the five explanatory variables and the outcome variable, overall satisfaction. The box plots show favourable levels of service quality for each of the explanatory variables and the outcome variable, with medians of 5.5 to 6.0 against a maximum score of 7 (Table 1).

The Chart 2 correlation heatmap shows that all variables are positively correlated, with coefficients ranging from 0.55 to 0.85. The intangible measures are more closely correlated with one another than with the tangibles measure.

2. Regression Tree

2.1 Train model

2.1.1 Build model

Model building commenced by randomly splitting the data into training and testing sets at a 3:1 ratio using stratified sampling. Stratified sampling draws observations in approximately equal proportions across the range of the outcome variable, so that the training and testing sets have similar outcome distributions. The training set was resampled using 10-fold cross-validation, repeated five times. The recipe for the regression tree was a standard formula, with no additional feature engineering. The hyperparameters were first left at their default tuning ranges, then optimised using a regular grid search. These components were added to a workflow for execution.
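The components described above can be sketched with tidymodels as follows. This is a minimal sketch, not the original code: the survey data are not included in this vignette, so `clients` is a synthetic stand-in, and all object names are hypothetical. The use of three grid levels is an assumption, chosen because a three-level regular grid over the dials default ranges yields the values seen in Table 2 (for example tree depths of 8 and 15, and minimal node sizes of 21 and 40).

```r
library(tidymodels)

# Synthetic stand-in for the 424 survey responses (hypothetical data,
# generated only so that the sketch runs end to end).
set.seed(2024)
n <- 424
latent <- rnorm(n)
to_scale <- function(x) pmin(7, pmax(1, round(5.5 + x)))
clients <- data.frame(
  tangibles            = to_scale(0.6 * latent + rnorm(n, sd = 0.8)),
  assurance            = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  reliability          = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  responsiveness       = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  empathy              = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  overall_satisfaction = to_scale(latent + rnorm(n, sd = 0.5))
)

# 3:1 split, stratified on the outcome variable
client_split <- initial_split(clients, prop = 3/4, strata = overall_satisfaction)
client_train <- training(client_split)
client_test  <- testing(client_split)

# 10-fold cross-validation on the training set, repeated five times
client_folds <- vfold_cv(client_train, v = 10, repeats = 5,
                         strata = overall_satisfaction)

# Plain formula recipe with no extra feature engineering steps
tree_recipe <- recipe(overall_satisfaction ~ ., data = client_train)

# Regression tree with all three hyperparameters marked for tuning
tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth      = tune(),
  min_n           = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Regular grid over the default ranges (3 levels each, 27 combinations)
tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 3)

# Bundle the recipe and model into a workflow
tree_wf <- workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(tree_spec)
```

With 10 folds repeated five times, each grid combination is assessed on 50 resamples, which matches the `n` column of Table 2.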

The workflow was implemented using parallel processing, and the hyperparameters were tuned during this process. Table 2 shows a short list of parameter tuning combinations.
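A grid search of this kind might look like the sketch below. It is illustrative only: the data are a synthetic stand-in, the object names are hypothetical, and the resampling is reduced to a single 5-fold cross-validation so the example runs quickly (the analysis above used 10 folds repeated five times). The sketch runs sequentially; a parallel backend such as doParallel, which appears in the session information, could be registered beforehand.

```r
library(tidymodels)

# Synthetic stand-in for the survey data (hypothetical)
set.seed(2024)
n <- 424
latent <- rnorm(n)
to_scale <- function(x) pmin(7, pmax(1, round(5.5 + x)))
clients <- data.frame(
  tangibles            = to_scale(0.6 * latent + rnorm(n, sd = 0.8)),
  assurance            = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  reliability          = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  responsiveness       = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  empathy              = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  overall_satisfaction = to_scale(latent + rnorm(n, sd = 0.5))
)

client_split <- initial_split(clients, prop = 3/4, strata = overall_satisfaction)
client_train <- training(client_split)

# Reduced resampling for speed (assumption; not the 10 x 5 scheme in the text)
client_folds <- vfold_cv(client_train, v = 5, strata = overall_satisfaction)

tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth      = tune(),
  min_n           = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

tree_wf <- workflow() %>%
  add_formula(overall_satisfaction ~ .) %>%
  add_model(tree_spec)

tree_grid <- grid_regular(cost_complexity(), tree_depth(), min_n(), levels = 3)

# Fit every grid combination on every resample and record rmse and rsq
tree_tuned <- tune_grid(
  tree_wf,
  resamples = client_folds,
  grid      = tree_grid,
  metrics   = metric_set(rmse, rsq)
)

# Short list of the best combinations, and the single best by rmse
show_best(tree_tuned, metric = "rmse", n = 5)
best_params <- select_best(tree_tuned, metric = "rmse")
```

Very shallow or heavily pruned trees can produce constant predictions, in which case R² is undefined for that resample and a warning is raised; the rmse ranking is unaffected.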

Table 2 Short-list of parameter tuning combinations
	cost_complexity	tree_depth	min_n	.metric	.estimator	mean	n	std_err	.config
3	1.0000e-10	8	40	rmse	standard	0.7691907	50	0.02095406	Preprocessor1_Model22
9	3.1623e-06	15	21	rmse	standard	0.7200883	50	0.02211900	Preprocessor1_Model17
5	1.0000e-10	15	21	rmse	standard	0.7200868	50	0.02211868	Preprocessor1_Model16
7	3.1623e-06	8	21	rmse	standard	0.7200857	50	0.02211885	Preprocessor1_Model14
1	1.0000e-10	8	21	rmse	standard	0.7200842	50	0.02211852	Preprocessor1_Model13
2	1.0000e-10	8	21	rsq	standard	0.6548680	50	0.01869506	Preprocessor1_Model13
6	1.0000e-10	15	21	rsq	standard	0.6548675	50	0.01869508	Preprocessor1_Model16
8	3.1623e-06	8	21	rsq	standard	0.6548674	50	0.01869517	Preprocessor1_Model14
10	3.1623e-06	15	21	rsq	standard	0.6548669	50	0.01869519	Preprocessor1_Model17
4	1.0000e-10	8	40	rsq	standard	0.6095659	50	0.01803846	Preprocessor1_Model22

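Model 13 in Table 2 has the best parameter tuning combination: the lowest root mean square error (rmse) and the highest R² (rsq), with the lowest cost-complexity, the shallower tree depth and the smaller minimal node size among the short-listed combinations. Its margin over Models 14, 16 and 17 appears only in the sixth decimal place, so these four combinations are effectively tied. The Model 13 parameters have been extracted in Table 3.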

Table 3 Optimal model parameter tuning combination
cost_complexity	tree_depth	min_n	.config
1e-10	8	21	Preprocessor1_Model13
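The selection can be checked directly from the figures in Table 2. The base R sketch below re-enters the five short-listed combinations (the data frame and its column names are introduced here for illustration) and picks the row with the lowest mean rmse. It also compares the spread among the four near-tied models with the standard error reported in Table 2.

```r
# Short-listed combinations, transcribed from Table 2
shortlist <- data.frame(
  config = c("Preprocessor1_Model22", "Preprocessor1_Model17",
             "Preprocessor1_Model16", "Preprocessor1_Model14",
             "Preprocessor1_Model13"),
  cost_complexity = c(1e-10, 3.1623e-06, 1e-10, 3.1623e-06, 1e-10),
  tree_depth      = c(8, 15, 15, 8, 8),
  min_n           = c(40, 21, 21, 21, 21),
  rmse            = c(0.7691907, 0.7200883, 0.7200868, 0.7200857, 0.7200842),
  rsq             = c(0.6095659, 0.6548669, 0.6548675, 0.6548674, 0.6548680)
)

# Combination with the lowest cross-validated rmse
best <- shortlist[which.min(shortlist$rmse), ]
print(best)

# Spread in rmse among the four models with min_n = 21
tied <- shortlist[shortlist$min_n == 21, ]
rmse_spread <- diff(range(tied$rmse))
print(rmse_spread)

# Standard error of the mean rmse for those models (Table 2) is about 0.022,
# several orders of magnitude larger than the spread.
print(rmse_spread / 0.02211852)
```

Because the spread is far smaller than the standard error, the data do not distinguish between these four models, and preferring the shallower tree is a parsimony choice rather than a performance gain.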

Chart 3 visualises Tables 2 and 3 by comparing parameter tuning combinations. Model 13, with a tree depth of 8 and a minimal node size of 21, has effectively the same rmse and R² as the corresponding model with a tree depth of 15, as shown in Table 2. Consequently, the green line for tree depth 8 is plotted beneath the line for tree depth 15 and is not visible in Chart 3.

With parameter tuning complete, the workflow and model were finalised for fitting.

2.1.2 Fit and review model

The tuned model was fitted on the training set resulting in the regression decision tree shown in Chart 4.

Note: Please magnify the screen to see Chart 4 in greater detail.

Chart 5 illustrates the importance of the predictor variables for the target variable, overall satisfaction. Responsiveness and reliability ranked as the most important variables for explaining overall client satisfaction.
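Finalising the workflow with the Table 3 parameters, fitting it and extracting variable importance could be sketched as follows. The data are again a synthetic stand-in and the object names are hypothetical, so the importance ranking printed here will not reproduce Chart 5.

```r
library(tidymodels)
library(vip)

# Synthetic stand-in for the survey data (hypothetical)
set.seed(2024)
n <- 424
latent <- rnorm(n)
to_scale <- function(x) pmin(7, pmax(1, round(5.5 + x)))
clients <- data.frame(
  tangibles            = to_scale(0.6 * latent + rnorm(n, sd = 0.8)),
  assurance            = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  reliability          = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  responsiveness       = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  empathy              = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  overall_satisfaction = to_scale(latent + rnorm(n, sd = 0.5))
)
client_train <- training(
  initial_split(clients, prop = 3/4, strata = overall_satisfaction)
)

# Workflow with hyperparameters still marked for tuning
tree_wf <- workflow() %>%
  add_formula(overall_satisfaction ~ .) %>%
  add_model(
    decision_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>%
      set_engine("rpart") %>%
      set_mode("regression")
  )

# Optimal combination from Table 3
best_params <- tibble(cost_complexity = 1e-10, tree_depth = 8L, min_n = 21L)

# Replace the tune() placeholders and fit on the training set
final_wf  <- finalize_workflow(tree_wf, best_params)
final_fit <- fit(final_wf, data = client_train)

# Underlying rpart object, then importance scores from vip
tree_engine <- extract_fit_engine(final_fit)
importance  <- vip::vi(tree_engine)
print(importance)
```

The extracted `tree_engine` is an ordinary rpart object, so a tree diagram similar to Chart 4 could be drawn from it, for example with `rpart.plot::rpart.plot(tree_engine, roundint = FALSE)`.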

2.2 Test model

2.2.1 Predict on test data

The model fitted on the training data was then applied to the unseen testing data to predict the outcome variable. Table 4 shows actual overall satisfaction and predicted overall satisfaction (.pred) in a small sample of observations extracted from the testing data.

Table 4 Sample of outcome variable predictions in test data
tangibles	assurance	reliability	responsiveness	empathy	overall_satisfaction	.pred
7.0	6.5	6.5	6.5	7.0	7	6.6111
7.0	7.0	7.0	7.0	7.0	7	6.9688
6.0	5.0	3.0	2.0	4.0	2	5.4118
2.0	3.5	3.0	2.5	3.5	3	3.2000
6.0	6.0	6.0	6.0	6.5	6	6.0000
6.5	6.5	6.5	6.0	6.5	7	6.6111
6.0	6.0	6.0	6.0	5.5	6	6.0000
6.0	3.5	2.0	2.5	4.5	3	2.4615
6.0	5.0	6.0	6.0	6.0	6	5.9091
6.0	6.0	5.0	4.5	5.5	6	5.7778
6.5	4.0	4.0	3.5	4.0	4	4.4375
4.5	5.0	6.0	6.0	6.0	6	5.9091
2.0	3.5	3.0	2.5	3.5	3	3.2000
6.0	5.0	5.0	5.5	5.0	6	6.0000
6.0	6.0	5.0	5.0	6.0	6	5.7778
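A table shaped like Table 4 can be produced by binding predictions onto the test rows. The sketch below assumes a synthetic stand-in for the survey data and hypothetical object names; the hyperparameter values are those in Table 3.

```r
library(tidymodels)

# Synthetic stand-in for the survey data (hypothetical)
set.seed(2024)
n <- 424
latent <- rnorm(n)
to_scale <- function(x) pmin(7, pmax(1, round(5.5 + x)))
clients <- data.frame(
  tangibles            = to_scale(0.6 * latent + rnorm(n, sd = 0.8)),
  assurance            = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  reliability          = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  responsiveness       = to_scale(0.9 * latent + rnorm(n, sd = 0.5)),
  empathy              = to_scale(0.8 * latent + rnorm(n, sd = 0.6)),
  overall_satisfaction = to_scale(latent + rnorm(n, sd = 0.5))
)
client_split <- initial_split(clients, prop = 3/4, strata = overall_satisfaction)
client_train <- training(client_split)
client_test  <- testing(client_split)

# Tree with the Table 3 hyperparameters, fitted on the training set only
tree_fit <- workflow() %>%
  add_formula(overall_satisfaction ~ .) %>%
  add_model(
    decision_tree(cost_complexity = 1e-10, tree_depth = 8, min_n = 21) %>%
      set_engine("rpart") %>%
      set_mode("regression")
  ) %>%
  fit(data = client_train)

# Predict on the unseen test rows and append the .pred column
test_predictions <- bind_cols(
  client_test,
  predict(tree_fit, new_data = client_test)
)
head(test_predictions, 15)
```

A regression tree predicts the mean outcome of the terminal node an observation falls into, which is why identical values such as 6.0000 and 6.6111 recur in the `.pred` column of Table 4.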

2.2.2 Review model on test data

Model performance was evaluated on the test data. Table 5 summarises key regression metrics for the test set. The R² on the test set was 0.6174, marginally lower than the cross-validated R² of 0.6549 obtained on the training resamples (Table 2).

Table 5 Regression tree metrics for test set
.metric	.estimator	.estimate	.config
rmse	standard	0.8042	Preprocessor1_Model1
rsq	standard	0.6174	Preprocessor1_Model1
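For readers unfamiliar with the two metrics, the base R sketch below applies their definitions to the 15 sample rows of Table 4: rmse is the square root of the mean squared prediction error, and rsq (as computed by yardstick) is the squared correlation between actual and predicted values. Because this uses only the small sample, the results will differ from the full test-set figures in Table 5.

```r
# Actual and predicted overall satisfaction, transcribed from Table 4
actual <- c(7, 7, 2, 3, 6, 7, 6, 3, 6, 6, 4, 6, 3, 6, 6)
predicted <- c(6.6111, 6.9688, 5.4118, 3.2000, 6.0000, 6.6111, 6.0000, 2.4615,
               5.9091, 5.7778, 4.4375, 5.9091, 3.2000, 6.0000, 5.7778)

# Root mean square error: typical size of a prediction error,
# in the same units as the satisfaction score
rmse_sample <- sqrt(mean((actual - predicted)^2))

# R squared: squared Pearson correlation between actual and predicted
rsq_sample <- cor(actual, predicted)^2

print(c(rmse = rmse_sample, rsq = rsq_sample))
```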

Chart 6 is a scatterplot comparing actual overall satisfaction in the test set with the model predictions. The dotted line through the origin (y = x) represents a perfect model, where every predicted value would equal the true value. The chart shows that the model predicts higher levels of satisfaction more accurately than lower levels. This is likely because the training data contain relatively few observations with low satisfaction compared with the large number of observations with high satisfaction.
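The same pattern is visible in the small sample of Table 4. The sketch below uses only those 15 rows, not the full test set; it compares the mean absolute error for low and high satisfaction scores and draws an actual-versus-predicted plot with an identity line in the style described for Chart 6. The cut-off at a score of 4 and the object names are assumptions made for illustration.

```r
library(ggplot2)

# Actual and predicted overall satisfaction, transcribed from Table 4
sample_preds <- data.frame(
  actual = c(7, 7, 2, 3, 6, 7, 6, 3, 6, 6, 4, 6, 3, 6, 6),
  pred   = c(6.6111, 6.9688, 5.4118, 3.2000, 6.0000, 6.6111, 6.0000, 2.4615,
             5.9091, 5.7778, 4.4375, 5.9091, 3.2000, 6.0000, 5.7778)
)

# Absolute prediction error, grouped into low and high satisfaction
sample_preds$abs_error <- abs(sample_preds$pred - sample_preds$actual)
sample_preds$band <- ifelse(sample_preds$actual <= 4, "low", "high")
mae_by_band <- tapply(sample_preds$abs_error, sample_preds$band, mean)
print(mae_by_band)

# Actual versus predicted, with the y = x line of a perfect model
pred_plot <- ggplot(sample_preds, aes(x = actual, y = pred)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dotted") +
  geom_point(alpha = 0.6) +
  coord_equal(xlim = c(1, 7), ylim = c(1, 7)) +
  labs(x = "Actual overall satisfaction",
       y = "Predicted overall satisfaction")
print(pred_plot)
```

In this sample the largest miss is the third row of Table 4, where an actual score of 2 was predicted as 5.4118.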

Reference

Data was gathered using a custom-designed survey instrument based on the SERVQUAL theoretical framework. The SERVQUAL methodology is documented in Delivering Quality Service: Balancing Customer Perceptions and Expectations by Zeithaml, Parasuraman and Berry.

Session information and package update

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.0 (2024-04-24 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Australia.utf8
##  ctype    English_Australia.utf8
##  tz       Australia/Brisbane
##  date     2024-07-30
##  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date (UTC) lib source
##  backports      1.5.0      2024-05-23 [1] CRAN (R 4.4.0)
##  broom        * 1.0.6      2024-05-17 [1] CRAN (R 4.4.0)
##  bslib          0.7.0      2024-03-29 [1] CRAN (R 4.4.0)
##  cachem         1.1.0      2024-05-16 [1] CRAN (R 4.4.0)
##  class          7.3-22     2023-05-03 [2] CRAN (R 4.4.0)
##  cli            3.6.3      2024-06-21 [1] CRAN (R 4.4.1)
##  codetools      0.2-20     2024-03-31 [2] CRAN (R 4.4.0)
##  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.4.1)
##  data.table   * 1.15.4     2024-03-30 [1] CRAN (R 4.4.0)
##  devtools       2.4.5      2022-10-11 [1] CRAN (R 4.4.0)
##  dials        * 1.2.1      2024-02-22 [1] CRAN (R 4.4.0)
##  DiceDesign     1.10       2023-12-07 [1] CRAN (R 4.4.0)
##  digest         0.6.36     2024-06-23 [1] CRAN (R 4.4.1)
##  doParallel   * 1.0.17     2022-02-07 [1] CRAN (R 4.4.0)
##  dplyr        * 1.1.4      2023-11-17 [1] CRAN (R 4.4.0)
##  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate       0.24.0     2024-06-10 [1] CRAN (R 4.4.0)
##  fansi          1.0.6      2023-12-08 [1] CRAN (R 4.4.0)
##  farver         2.1.2      2024-05-13 [1] CRAN (R 4.4.0)
##  fastmap        1.2.0      2024-05-15 [1] CRAN (R 4.4.0)
##  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.4.0)
##  foreach      * 1.5.2      2022-02-02 [1] CRAN (R 4.4.0)
##  fs             1.6.4      2024-04-25 [1] CRAN (R 4.4.0)
##  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.4.0)
##  future         1.33.2     2024-03-26 [1] CRAN (R 4.4.0)
##  future.apply   1.11.2     2024-03-28 [1] CRAN (R 4.4.0)
##  generics       0.1.3      2022-07-05 [1] CRAN (R 4.4.0)
##  GGally       * 2.2.1      2024-02-14 [1] CRAN (R 4.4.0)
##  ggplot2      * 3.5.1      2024-04-23 [1] CRAN (R 4.4.0)
##  ggstats        0.6.0      2024-04-05 [1] CRAN (R 4.4.0)
##  globals        0.16.3     2024-03-08 [1] CRAN (R 4.4.0)
##  glue           1.7.0      2024-01-09 [1] CRAN (R 4.4.0)
##  gower          1.0.1      2022-12-22 [1] CRAN (R 4.4.0)
##  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.4.0)
##  gtable         0.3.5      2024-04-22 [1] CRAN (R 4.4.0)
##  hardhat        1.4.0      2024-06-02 [1] CRAN (R 4.4.0)
##  here         * 1.0.1      2020-12-13 [1] CRAN (R 4.4.0)
##  highr          0.11       2024-05-26 [1] CRAN (R 4.4.0)
##  hms            1.1.3      2023-03-21 [1] CRAN (R 4.4.0)
##  htmltools      0.5.8.1    2024-04-04 [1] CRAN (R 4.4.0)
##  htmlwidgets    1.6.4      2023-12-06 [1] CRAN (R 4.4.0)
##  httpuv         1.6.15     2024-03-26 [1] CRAN (R 4.4.0)
##  infer        * 1.0.7      2024-03-25 [1] CRAN (R 4.4.0)
##  ipred          0.9-15     2024-07-18 [1] CRAN (R 4.4.1)
##  iterators    * 1.0.14     2022-02-05 [1] CRAN (R 4.4.0)
##  jquerylib      0.1.4      2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite       1.8.8      2023-12-04 [1] CRAN (R 4.4.0)
##  kableExtra   * 1.4.0      2024-01-24 [1] CRAN (R 4.4.0)
##  knitr          1.48       2024-07-07 [1] CRAN (R 4.4.1)
##  labeling       0.4.3      2023-08-29 [1] CRAN (R 4.4.0)
##  later          1.3.2      2023-12-06 [1] CRAN (R 4.4.0)
##  lattice        0.22-6     2024-03-20 [2] CRAN (R 4.4.0)
##  lava           1.8.0      2024-03-05 [1] CRAN (R 4.4.0)
##  lhs            1.2.0      2024-06-30 [1] CRAN (R 4.4.1)
##  lifecycle      1.0.4      2023-11-07 [1] CRAN (R 4.4.0)
##  listenv        0.9.1      2024-01-29 [1] CRAN (R 4.4.0)
##  lubridate    * 1.9.3      2023-09-27 [1] CRAN (R 4.4.0)
##  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.4.0)
##  MASS           7.3-60.2   2024-04-24 [2] local
##  Matrix         1.7-0      2024-03-22 [2] CRAN (R 4.4.0)
##  memoise        2.0.1      2021-11-26 [1] CRAN (R 4.4.0)
##  mime           0.12       2021-09-28 [1] CRAN (R 4.4.0)
##  miniUI         0.1.1.1    2018-05-18 [1] CRAN (R 4.4.0)
##  mnormt         2.1.1      2022-09-26 [1] CRAN (R 4.4.0)
##  modeldata    * 1.4.0      2024-06-19 [1] CRAN (R 4.4.1)
##  munsell        0.5.1      2024-04-01 [1] CRAN (R 4.4.0)
##  nlme           3.1-164    2023-11-27 [2] CRAN (R 4.4.0)
##  nnet           7.3-19     2023-05-03 [2] CRAN (R 4.4.0)
##  parallelly     1.37.1     2024-02-29 [1] CRAN (R 4.4.0)
##  parsnip      * 1.2.1      2024-03-22 [1] CRAN (R 4.4.0)
##  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.4.0)
##  pkgbuild       1.4.4      2024-03-17 [1] CRAN (R 4.4.0)
##  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.4.0)
##  pkgload        1.4.0      2024-06-28 [1] CRAN (R 4.4.1)
##  plyr           1.8.9      2023-10-02 [1] CRAN (R 4.4.0)
##  prodlim        2024.06.25 2024-06-24 [1] CRAN (R 4.4.1)
##  profvis        0.3.8      2023-05-02 [1] CRAN (R 4.4.0)
##  promises       1.3.0      2024-04-05 [1] CRAN (R 4.4.0)
##  psych        * 2.4.6.26   2024-06-27 [1] CRAN (R 4.4.1)
##  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.4.0)
##  R6             2.5.1      2021-08-19 [1] CRAN (R 4.4.0)
##  RColorBrewer   1.1-3      2022-04-03 [1] CRAN (R 4.4.0)
##  Rcpp           1.0.13     2024-07-17 [1] CRAN (R 4.4.1)
##  readr        * 2.1.5      2024-01-10 [1] CRAN (R 4.4.0)
##  recipes      * 1.1.0      2024-07-04 [1] CRAN (R 4.4.1)
##  remotes        2.5.0      2024-03-17 [1] CRAN (R 4.4.0)
##  rlang          1.1.4      2024-06-04 [1] CRAN (R 4.4.0)
##  rmarkdown      2.27       2024-05-17 [1] CRAN (R 4.4.0)
##  rpart        * 4.1.23     2023-12-05 [2] CRAN (R 4.4.0)
##  rpart.plot   * 3.1.2      2024-02-26 [1] CRAN (R 4.4.0)
##  rprojroot      2.0.4      2023-11-05 [1] CRAN (R 4.4.0)
##  rsample      * 1.2.1      2024-03-25 [1] CRAN (R 4.4.0)
##  rstudioapi     0.16.0     2024-03-24 [1] CRAN (R 4.4.0)
##  sass           0.4.9      2024-03-15 [1] CRAN (R 4.4.0)
##  scales       * 1.3.0      2023-11-28 [1] CRAN (R 4.4.0)
##  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.4.0)
##  shiny          1.8.1.1    2024-04-02 [1] CRAN (R 4.4.0)
##  stringi        1.8.4      2024-05-06 [1] CRAN (R 4.4.0)
##  stringr      * 1.5.1      2023-11-14 [1] CRAN (R 4.4.0)
##  survival       3.5-8      2024-02-14 [2] CRAN (R 4.4.0)
##  svglite        2.1.3      2023-12-08 [1] CRAN (R 4.4.0)
##  systemfonts    1.1.0      2024-05-15 [1] CRAN (R 4.4.0)
##  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.4.0)
##  tidymodels   * 1.2.0      2024-03-25 [1] CRAN (R 4.4.0)
##  tidyr        * 1.3.1      2024-01-24 [1] CRAN (R 4.4.0)
##  tidyselect     1.2.1      2024-03-11 [1] CRAN (R 4.4.0)
##  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.4.0)
##  timechange     0.3.0      2024-01-18 [1] CRAN (R 4.4.0)
##  timeDate       4032.109   2023-12-14 [1] CRAN (R 4.4.0)
##  tune         * 1.2.1      2024-04-18 [1] CRAN (R 4.4.0)
##  tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.4.0)
##  urlchecker     1.0.1      2021-11-30 [1] CRAN (R 4.4.0)
##  usethis        2.2.3      2024-02-19 [1] CRAN (R 4.4.0)
##  utf8           1.2.4      2023-10-22 [1] CRAN (R 4.4.0)
##  vctrs          0.6.5      2023-12-01 [1] CRAN (R 4.4.0)
##  vip          * 0.4.1      2023-08-21 [1] CRAN (R 4.4.0)
##  viridisLite    0.4.2      2023-05-02 [1] CRAN (R 4.4.0)
##  withr          3.0.0      2024-01-16 [1] CRAN (R 4.4.0)
##  workflows    * 1.1.4      2024-02-19 [1] CRAN (R 4.4.0)
##  workflowsets * 1.1.0      2024-03-21 [1] CRAN (R 4.4.0)
##  xfun           0.46       2024-07-18 [1] CRAN (R 4.4.1)
##  xml2           1.3.6      2023-12-04 [1] CRAN (R 4.4.0)
##  xtable         1.8-4      2019-04-21 [1] CRAN (R 4.4.0)
##  yaml           2.3.9      2024-07-05 [1] CRAN (R 4.4.1)
##  yardstick    * 1.3.1      2024-03-21 [1] CRAN (R 4.4.0)
## 
##  [1] C:/Users/wayne/AppData/Local/R/win-library/4.4
##  [2] C:/Program Files/R/R-4.4.0/library
## 
## ──────────────────────────────────────────────────────────────────────────────