XGBoost, which stands for eXtreme Gradient Boosting, is an efficient implementation of gradient boosting. Gradient boosting is an ensemble technique in machine learning: rather than fitting a single model to the data, boosting combines the predictions of multiple weak learners to create a single, more accurate strong learner.
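In generic notation (an illustration, not notation specific to this example), the ensemble's prediction for an observation $x_i$ is the sum of the outputs of its $K$ weak learners:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

where, in gradient boosting, each tree $f_k$ is fit in sequence to correct the residual errors of the trees before it.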
An XGBoost model is based on trees, so we don’t need to do much preprocessing for our data; we don’t need to worry about encoding factors, or about centering or scaling our predictors.
Available R packages
There are multiple packages that can be used to implement xgboost in R.
{tidymodels} and {caret} both provide easy ways to access xgboost. This example uses {tidymodels} because of the functionality it includes and because it is heavily supported by Posit.
Data used
The data used for this example is birthwt, which is part of the {MASS} package. This dataset covers a number of risk factors associated with low infant birth weight.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
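A quick look at the dataset is an optional but useful first step (a minimal sketch; note from the messages above that {MASS} masks dplyr::select(), so conflicting dplyr verbs are safest called with an explicit namespace):

# Quick look at the risk factors; `low` flags birth weight below 2.5 kg
glimpse(birthwt)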
Our modeling goal using the birthwt dataset is to predict whether the birth weight is low or not low based on factors such as mother’s age, smoking status, and history of hypertension.
Example Code
Use the {tidymodels} meta-package to split the data into training and testing data. For classification, we need to change the low outcome variable into a factor, since it is currently coded as an integer (0/1).
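A minimal sketch of this step, using initial_split() from {rsample} (loaded with {tidymodels}); the seed and the column name low_f are assumptions, though the factor labels match the prediction columns shown later:

library(tidymodels)

# Code the 0/1 outcome as a factor for classification, keeping the
# original integer column for the regression example later
brthwt <- MASS::birthwt %>%
  mutate(low_f = factor(low, levels = c(0, 1), labels = c("Not Low", "Low")))

set.seed(1234)  # assumed seed, for reproducibility
brthwt_split <- initial_split(brthwt)  # default: 3/4 training, 1/4 testing
brthwt_train <- training(brthwt_split)
brthwt_test  <- testing(brthwt_split)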
After creating the data split, we set up the parameters of the model.
xgboost_spec <- boost_tree(trees = 15) %>%
  # This model can be used for classification or regression, so set mode
  set_mode("classification") %>%
  set_engine("xgboost")

xgboost_spec
Boosted Tree Model Specification (classification)
Main Arguments:
trees = 15
Computational engine: xgboost
xgboost_cls_fit <- xgboost_spec %>%
  fit(low_f ~ ., data = brthwt_train)

xgboost_cls_fit
bind_cols(
  predict(xgboost_cls_fit, brthwt_test),
  predict(xgboost_cls_fit, brthwt_test, type = "prob")
)
# A tibble: 48 × 3
.pred_class `.pred_Not Low` .pred_Low
<fct> <dbl> <dbl>
1 Not Low 0.985 0.0151
2 Not Low 0.985 0.0151
3 Not Low 0.988 0.0116
4 Not Low 0.988 0.0116
5 Not Low 0.988 0.0116
6 Not Low 0.988 0.0116
7 Not Low 0.988 0.0116
8 Not Low 0.988 0.0116
9 Not Low 0.988 0.0116
10 Not Low 0.988 0.0116
# ℹ 38 more rows
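To summarize how well the classifier does on the test set, the {yardstick} package (also part of {tidymodels}) can score these predictions. A sketch, assuming the split above and the column names shown in the output:

cls_preds <- bind_cols(
  predict(xgboost_cls_fit, brthwt_test),
  predict(xgboost_cls_fit, brthwt_test, type = "prob"),
  dplyr::select(brthwt_test, low_f)  # explicit namespace: {MASS} masks select()
)

cls_preds %>% accuracy(truth = low_f, estimate = .pred_class)
cls_preds %>% roc_auc(truth = low_f, `.pred_Not Low`)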
Regression
To perform regression with xgboost, set the mode of the model to regression when setting up its parameters. After that switch, and after changing the variable of interest back to the integer version, the rest of the code is the same.
xgboost_reg_spec <- boost_tree(trees = 15) %>%
  # This model can be used for classification or regression, so set mode
  set_mode("regression") %>%
  set_engine("xgboost")

xgboost_reg_spec
Boosted Tree Model Specification (regression)
Main Arguments:
trees = 15
Computational engine: xgboost
# For a regression model, the outcome should be `numeric`, not a `factor`.
xgboost_reg_fit <- xgboost_reg_spec %>%
  fit(low ~ ., data = brthwt_train)

xgboost_reg_fit
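As with classification, test-set predictions come from predict(), and {yardstick} can score them; a sketch under the same assumed split:

bind_cols(
  predict(xgboost_reg_fit, brthwt_test),  # numeric predictions in `.pred`
  dplyr::select(brthwt_test, low)
) %>%
  rmse(truth = low, estimate = .pred)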