The sgb_fit function is a wrapper for xgboost designed to implement survival analyses.
Usage
sgb_fit(sgb_df, nrounds = NULL, eval_time_quants = c(0.1, 0.9),
  missing = NA, weight = NULL, params = sgb_params(), verbose = 1,
  print_every_n = max(c(1, round(nrounds/5))),
  early_stopping_rounds = NULL, maximize = NULL, save_period = NULL,
  save_name = "sgboost.model", xgb_model = NULL, callbacks = list())
Arguments
sgb_df: An object of class 'sgb_data' (see sgb_data).
nrounds: Maximum number of boosting iterations. If NULL, cross-validation is applied to determine a suitable value (see Examples).
eval_time_quants: To evaluate risk prediction models, a set of evaluation times is created using the observed event times in sgb_df. These unique event times are truncated by keeping only the times that fall between the lower and upper quantiles of time specified in eval_time_quants. For example, to include all times, use eval_time_quants = c(0, 1). To include only the times between the first and second quartiles, use eval_time_quants = c(0.25, 0.50). (A sketch illustrating how these evaluation times might be derived appears after this argument list.)
missing: By default set to NA, which means that NA values should be considered 'missing' by the algorithm. Sometimes, 0 or another extreme value might be used to represent missing values. This parameter is only used when the input is a dense matrix.
weight: A vector indicating the weight for each row of the input.
params: The list of parameters. The complete list of parameters is available at http://xgboost.readthedocs.io/en/latest/parameter.html. Below is a shorter summary:
1. General Parameters
booster: Which booster to use; can be gbtree or gblinear. Default: gbtree
2. Booster Parameters
2.1. Parameters for Tree Booster
eta: Controls the learning rate: scales the contribution of each tree by a factor of 0 < eta < 1 when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for eta implies a larger value for nrounds: a low eta makes the model more robust to overfitting but slower to compute. Default: 0.3
gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. Default: 0
max_depth: Maximum depth of a tree. Default: 6
min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1
subsample: Subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which helps prevent overfitting. It also makes computation shorter (less data to analyse). It is advised to use this parameter with a lower eta and a larger nrounds. Default: 1
colsample_bytree: Subsample ratio of columns when constructing each tree. Default: 1
num_parallel_tree: Experimental parameter. Number of trees to grow per round. Useful to emulate a random forest through xgboost (set colsample_bytree < 1, subsample < 1 and nrounds = 1 accordingly). Default: 1
monotone_constraints: A numeric vector consisting of 1, 0 and -1, with length equal to the number of features in the training data. 1 means increasing, -1 means decreasing and 0 means no constraint.
interaction_constraints: A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where the specified features are allowed to interact with each other. Feature index values should start from 0 (0 references the first column). Leave the argument unspecified for no interaction constraints.
2.2. Parameters for Linear Booster
lambda: L2 regularization term on weights. Default: 0
lambda_bias: L2 regularization term on bias. Default: 0
alpha: L1 regularization term on weights. (There is no L1 regularization on bias because it is not important.) Default: 0
3. Task Parameters
objective: Specifies the learning task and the corresponding learning objective; users can pass a self-defined function to it. The default objective options are below:
reg:squarederror: Regression with squared loss (default).
reg:logistic: Logistic regression.
binary:logistic: Logistic regression for binary classification. Outputs probability.
binary:logitraw: Logistic regression for binary classification. Outputs the score before the logistic transformation.
num_class: Set the number of classes. Use only with multiclass objectives.
multi:softmax: Set xgboost to do multiclass classification using the softmax objective. Classes are represented by numbers from 0 to num_class - 1.
multi:softprob: Same as softmax, but the prediction outputs a vector of ndata * nclass elements, which can be further reshaped to an ndata x nclass matrix. The result contains the predicted probability of each data point belonging to each class.
rank:pairwise: Set xgboost to do a ranking task by minimizing the pairwise loss.
base_score: The initial prediction score of all instances; global bias. Default: 0.5
eval_metric: Evaluation metrics for validation data. Users can pass a self-defined function to it. Default: the metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). A detailed list is provided in the xgboost documentation linked above.
A sketch of assembling a params list appears after this argument list.
verbose: If 0, xgboost will stay silent. If 1, it will print information about performance. If 2, some additional information will be printed out. Note that setting verbose > 0 automatically engages the cb.print.evaluation(period=1) callback function.
print_every_n: Print evaluation messages at every n-th iteration when verbose > 0. For sgb_fit, the default is max(c(1, round(nrounds/5))) (see Usage); a value of 1 means all messages are printed. This parameter is passed to the cb.print.evaluation callback.
early_stopping_rounds: If NULL, early stopping is not triggered. If set to an integer k, training with a validation set will stop if the performance doesn't improve for k rounds. Setting this parameter engages the cb.early.stop callback.
maximize: If feval and early_stopping_rounds are set, then this parameter must be set as well. When it is TRUE, it means the larger the evaluation score the better. This parameter is passed to the cb.early.stop callback.
save_period: When non-NULL, the model is saved to disk after every save_period rounds; 0 means save at the end. The saving is handled by the cb.save.model callback.
save_name: The name or path for the periodically saved model file.
xgb_model: A previously built model to continue the training from. Can be an object of class xgb.Booster, its raw data, or the name of a file with a previously saved model.
callbacks: A list of callback functions to perform various tasks during boosting. See callbacks. Some of the callbacks are automatically created depending on the parameters' values. Users can provide either existing or their own callback methods in order to customize the training process.
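As referenced under eval_time_quants, the evaluation times could plausibly be derived along the following lines. This is a minimal sketch of the idea, not the actual internals of sgb_fit; the time and status vectors here are hypothetical.

# Sketch only: truncate unique event times by quantile.
# 'time' and 'status' are hypothetical vectors from the training data.
event_times <- sort(unique(time[status == 1]))
bounds <- quantile(event_times, probs = c(0.1, 0.9))  # the default eval_time_quants
eval_times <- event_times[event_times >= bounds[1] & event_times <= bounds[2]]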
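Similarly, a params list for the tree booster might be assembled as follows. This assumes sgb_params() accepts the xgboost parameters summarized above as name-value pairs; check sgb_params for its actual signature.

# Sketch only: assumes sgb_params() forwards name-value pairs to xgboost.
my_params <- sgb_params(
  eta = 0.1,              # lower learning rate; pair with a larger nrounds
  max_depth = 3,          # shallower trees are more conservative
  subsample = 0.5,        # grow each tree on half of the training rows
  colsample_bytree = 0.8  # use 80% of columns per tree
)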
Value
An sgb_booster object containing:
fit: An xgb.Booster object (see xgboost).
label: A numeric vector of time-to-event values, where censored observations have negative times and uncensored observations have positive times (see sgb_label).
predictions: Predicted values from fit for the training data. These predictions are saved because they are required to estimate the baseline hazard function of fit.
Examples
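The code that produced the output below is not shown on this page. A minimal sketch of a typical call might look like the following; the data-generating steps and the sgb_data() usage are assumptions (see sgb_data).

library(survival)

# Hypothetical training data: two continuous predictors and a
# right-censored survival outcome.
x <- matrix(rnorm(200), ncol = 2)
y <- Surv(time = runif(100, 1, 10), event = rbinom(100, 1, 0.7))
df <- sgb_data(x, y)  # assumed usage; see sgb_data

# With nrounds = NULL (the default), cross-validation is applied to
# determine nrounds; stopping after 50 stagnant rounds matches the
# early-stopping messages in the output below.
fit <- sgb_fit(sgb_df = df, early_stopping_rounds = 50)

# Components documented under Value (assuming standard list access):
head(fit$predictions)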
#> Applying cross-validation to determine nrounds
#> [1] train-cox-nloglik:3.347596+0.049284 test-cox-nloglik:1.413523+0.322500
#> Multiple eval metrics are present. Will use test_cox_nloglik for early stopping.
#> Will train until test_cox_nloglik hasn't improved in 50 rounds.
#>
#> [11] train-cox-nloglik:3.100688+0.052263 test-cox-nloglik:1.318558+0.336452
#> [21] train-cox-nloglik:3.046859+0.052314 test-cox-nloglik:1.342073+0.341736
#> [31] train-cox-nloglik:3.010286+0.052207 test-cox-nloglik:1.355201+0.318792
#> [41] train-cox-nloglik:2.979802+0.052953 test-cox-nloglik:1.364158+0.316515
#> [51] train-cox-nloglik:2.952572+0.054067 test-cox-nloglik:1.373592+0.316030
#> [61] train-cox-nloglik:2.927996+0.055627 test-cox-nloglik:1.375714+0.324239
#> Stopping. Best iteration:
#> [12] train-cox-nloglik:3.092801+0.052727 test-cox-nloglik:1.306783+0.328540
#>
#> [1] train-cox-nloglik:3.450225
#> [11] train-cox-nloglik:3.211365
#> [12] train-cox-nloglik:3.203185