pbc_scalib is a list with three datasets: train, test, and predrisk.
train is the data used to train a proportional hazards model, a gradient boosting tree ensemble, and two random forests, one with axis-based splits and one with oblique splits.
test is the data for which the trained models computed predictions.
predrisk contains the predicted risk values from the models listed above.
The prediction horizon for these predictions is 2,500 days after baseline assessment. Put another way, each predicted value is the predicted probability that a person will have an event within 2,500 days of baseline.
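As a rough illustration of what "predicted risk at a horizon" means, the sketch below computes 1 minus the predicted survival probability at 2,500 days from a small Cox model, using only the survival package. The split, the seed, and the three predictors are assumptions made for brevity, not the setup used to build predrisk, and the values may differ slightly from those returned by riskRegression::predictRisk.

```r
library(survival)

# Same preprocessing idea as the train/test data: complete cases,
# status recoded to 0 (censored or transplant) and 1 (dead).
dat <- pbc[complete.cases(pbc), ]
dat$status <- as.integer(dat$status == 2)

# Hypothetical split, used only for this illustration.
set.seed(1)
idx <- sample(nrow(dat), size = nrow(dat) / 2)
train <- dat[idx, ]
test <- dat[-idx, ]

# A deliberately small Cox model (three predictors chosen for brevity).
fit <- coxph(Surv(time, status) ~ age + bili + albumin, data = train)

horizon <- 2500

# Predicted survival probability at the horizon for every test row.
sf <- survfit(fit, newdata = test)
surv_at_horizon <- as.numeric(
  summary(sf, times = horizon, extend = TRUE)$surv
)

# Predicted risk = 1 - S(horizon | covariates).
risk <- 1 - surv_at_horizon
summary(risk)
```

Each element of risk is read the same way as a value in predrisk: the model's estimated probability that the person has the event within 2,500 days of baseline.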
The train and test data are a light modification of the survival::pbc data. The modifications are:
removed rows with missing data
converted status into 0 for censored or transplant, 1 for dead
removed the id column
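The modifications listed above can be checked with a few lines of base R and the survival package. This is a minimal sketch: the name status_raw is introduced here only to compare the old and new coding.

```r
library(survival)

# survival::pbc codes status as 0 = censored, 1 = transplant, 2 = dead.
dat <- pbc[complete.cases(pbc), ]  # remove rows with missing data
status_raw <- dat$status           # keep a copy of the original coding

# Collapse censored and transplant into 0; dead becomes 1.
dat$status[dat$status > 0] <- dat$status[dat$status > 0] - 1

# Remove the id column.
dat$id <- NULL

# Cross-tabulate the original against the recoded status to confirm the mapping.
table(raw = status_raw, recoded = dat$status)
dim(dat)
```

The cross-tabulation should show raw values 0 and 1 both landing in recoded 0, and raw value 2 landing in recoded 1.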
pbc_scalib
train and test are random subsets of roughly equal size from a dataset with 276 rows and 19 variables (the 20 variables of survival::pbc with id removed). Those variables are:
time: number of days between registration and the earlier of death, transplantation, or study analysis in July, 1986
status: status at endpoint, 0 for censored or transplant, 1 for dead
trt: 1/2/NA for D-penicillamine, placebo, not randomised
age: in years
sex: m/f
ascites: presence of ascites
hepato: presence of hepatomegaly or enlarged liver
spiders: blood vessel malformations in the skin
edema: 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy
bili: serum bilirubin (mg/dl)
chol: serum cholesterol (mg/dl)
albumin: serum albumin (g/dl)
copper: urine copper (ug/day)
alk.phos: alkaline phosphatase (U/liter)
ast: aspartate aminotransferase, once called SGOT (U/ml)
trig: triglycerides (mg/dl)
platelet: platelet count
protime: standardised blood clotting time
stage: histologic stage of disease (needs biopsy)
predrisk is a dataset with 138 rows and 4 variables:
prop_hazard: predictions from a proportional hazards model
gradient_booster: predictions from a gradient boosting tree ensemble
rsf_axis: predictions from a random forest with axis-based splits
rsf_oblique: predictions from a random forest with oblique splits
T Therneau and P Grambsch (2000), Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York. ISBN: 0-387-98784-3.
See the example for code to generate the data and fit the models.
if (FALSE) {

  library(riskRegression)
  library(survival)
  library(randomForestSRC)
  library(gbm)
  library(orsf2)

  # Prepare the data: complete cases only, status recoded so that
  # 0 = censored or transplant and 1 = dead, id column dropped.
  dataset <- pbc[complete.cases(pbc), ]
  dataset$status[dataset$status > 0] <- dataset$status[dataset$status > 0] - 1
  dataset$id <- NULL
  dataset$stage <- as.integer(dataset$stage)

  # Split the rows at random into two halves.
  n_total <- nrow(dataset)
  n_train <- round(n_total * 1/2)

  set.seed(32987)

  train_index <- sample(nrow(dataset), size = n_train)

  dataset_train <- dataset[train_index, ]
  dataset_test <- dataset[-train_index, ]

  # Proportional hazards model; x = TRUE keeps the design matrix,
  # which predictRisk() needs.
  cph <- coxph(Surv(time, status) ~ .,
               data = dataset_train,
               x = TRUE)

  # Random forest with axis-based splits.
  rf <- rfsrc(Surv(time, status) ~ .,
              data = dataset_train,
              nodesize = 15,
              ntree = 1000)

  # Cross-validated boosting fit. It is not used again below; presumably
  # it informed the choice of n.trees = 150 (e.g., via gbm.perf(bst_cv)).
  bst_cv <- gbm(Surv(time, status) ~ .,
                data = dataset_train,
                interaction.depth = 1,
                shrinkage = 0.025,
                n.trees = 500,
                cv.folds = 10)

  # Final gradient boosting tree ensemble.
  bst_final <- gbm(Surv(time, status) ~ .,
                   data = dataset_train,
                   interaction.depth = 1,
                   shrinkage = 0.025,
                   n.trees = 150)

  # Random forest with oblique splits.
  aorsf <- orsf(data = dataset_train,
                formula = Surv(time, status) ~ .,
                n_tree = 1000)

  # S3 method so that predictRisk() can dispatch on the oblique forest.
  predictRisk.aorsf <- function(object, newdata, times, ...){
    predict(object, new_data = newdata, times = times, risk = TRUE)
  }

  models <- list(prop_hazard = cph,
                 rsf_axis = rf,
                 gradient_booster = bst_final,
                 rsf_oblique = aorsf)

  pred_horizon <- 2500

  # One column of predicted risk at the horizon per model.
  data_predrisk <- as.data.frame(
    lapply(models,
           predictRisk,
           newdata = dataset_test,
           times = pred_horizon)
  )

  # Keep predicted risks strictly below 1.
  data_predrisk$prop_hazard[data_predrisk$prop_hazard==1] <- 0.999

}
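The example stops after building data_predrisk and does not show the three pieces being combined. A minimal sketch of that last step might look like the following; the helper make_scalib_list and the toy data frames are hypothetical and stand in for dataset_train, dataset_test, and data_predrisk.

```r
# Hypothetical helper: bundles the three data frames into one list and
# checks that there is one row of predictions per row of test data.
make_scalib_list <- function(train, test, predrisk) {
  stopifnot(is.data.frame(train),
            is.data.frame(test),
            is.data.frame(predrisk),
            nrow(predrisk) == nrow(test))
  list(train = train, test = test, predrisk = predrisk)
}

# Toy inputs (made-up values, for illustration only).
toy_train <- data.frame(time = c(400, 1200, 3000), status = c(1, 0, 1))
toy_test <- data.frame(time = c(800, 2600), status = c(0, 1))
toy_predrisk <- data.frame(prop_hazard = c(0.35, 0.80))

toy_scalib <- make_scalib_list(toy_train, toy_test, toy_predrisk)
names(toy_scalib)
```

The row-count check mirrors the structure described above, where predrisk has the same number of rows as test.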