pbc_scalib is a list with three datasets: train, test, and predrisk.

  • train is the data used to train a proportional hazards model, a gradient boosting tree ensemble, a random forest with axis based splits, and a random forest with oblique splits.

  • test is the data that the trained models computed predictions for.

  • predrisk is the predicted risk values from the models listed above. The prediction horizon is 2,500 days after baseline assessment. Put another way, each predicted value is the predicted probability that a person will have an event within 2,500 days of baseline.
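To make the meaning of a predicted risk at a horizon concrete, here is a minimal sketch that reads a fitted survival curve at 2,500 days and converts it to a risk as one minus the survival probability. It assumes only the survival package; the three-covariate Cox model and the choice of the first five rows are illustrative assumptions, not the models stored in pbc_scalib.

```r
library(survival)

# Illustrative data: complete cases of survival::pbc with status recoded
# so that 1 = dead and 0 = censored or transplant
pbc_cc <- pbc[complete.cases(pbc), ]
pbc_cc$status <- as.integer(pbc_cc$status == 2)

# A small Cox model chosen for illustration only (hypothetical covariate set)
fit <- coxph(Surv(time, status) ~ age + bili + albumin, data = pbc_cc)

horizon <- 2500  # days after baseline

# survfit() with newdata gives one predicted survival curve per row
curves <- survfit(fit, newdata = pbc_cc[1:5, ])

# read each curve at the horizon; extend = TRUE keeps the time point
# even if it lies beyond the last observed time
surv_at_horizon <- summary(curves, times = horizon, extend = TRUE)$surv

# predicted risk = probability of an event within `horizon` days
risk_at_horizon <- 1 - as.numeric(surv_at_horizon)
risk_at_horizon
```

Each element of risk_at_horizon is on the same scale as a column of predrisk: a probability between 0 and 1 of dying within 2,500 days of baseline.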

The train and test data are a light modification of the survival::pbc data. The modifications are:

  1. removed rows with missing data

  2. converted status into 0 for censored or transplant, 1 for dead

  3. removed the id column.
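The three modifications can be sketched in a few lines, assuming the survival package is installed. The recode below uses a comparison against 2 (the code survival::pbc uses for death), which is an assumption about an equivalent way to write the step; the Examples code on this page does the same recode by subtraction.

```r
library(survival)

# 1. keep only rows with no missing values
dataset <- pbc[complete.cases(pbc), ]

# 2. survival::pbc codes status as 0 = censored, 1 = transplant, 2 = dead;
#    recode to 1 = dead, 0 = censored or transplant
dataset$status <- as.integer(dataset$status == 2)

# 3. drop the subject identifier
dataset$id <- NULL

dim(dataset)           # rows and columns left after the three steps
table(dataset$status)  # counts of 0 (censored or transplant) and 1 (dead)
```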

pbc_scalib

Format

train and test are random subsets of roughly equal size from a dataset with 276 rows and 19 variables (the 20 variables of survival::pbc minus the id column). Those variables are:

time

number of days between registration and the earlier of death, transplantation, or study analysis in July, 1986

status

status at endpoint, 0 for censored or transplant, 1 for dead

trt

1/2/NA for D-penicillamine, placebo, not randomised

age

in years

sex

m/f

ascites

presence of ascites

hepato

presence of hepatomegaly or enlarged liver

spiders

blood vessel malformations in the skin

edema

0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy

bili

serum bilirubin (mg/dl)

chol

serum cholesterol (mg/dl)

albumin

serum albumin (g/dl)

copper

urine copper (ug/day)

alk.phos

alkaline phosphatase (U/liter)

ast

aspartate aminotransferase, once called SGOT (U/ml)

trig

triglycerides (mg/dl)

platelet

platelet count

protime

standardised blood clotting time

stage

histologic stage of disease (needs biopsy)

predrisk is a dataset with 138 rows and 4 variables:

prop_hazard

predictions from a proportional hazards model

rsf_axis

predictions from a random forest with axis based splits

gradient_booster

predictions from a gradient boosting tree ensemble

rsf_oblique

predictions from a random forest with oblique splits

Source

T Therneau and P Grambsch (2000), Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York. ISBN: 0-387-98784-3.

Details

See the Examples section for the code used to generate the data and fit the models.

Examples

if (FALSE) {

library(riskRegression)
library(survival)
library(randomForestSRC)
library(gbm)
library(orsf2)

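# keep complete cases only
dataset <- pbc[complete.cases(pbc), ]
# recode status: 1 (transplant) -> 0, 2 (dead) -> 1; 0 (censored) is unchanged
dataset$status[dataset$status > 0] <- dataset$status[dataset$status > 0] - 1
# drop the subject identifier
dataset$id <- NULL
dataset$stage <- as.integer(dataset$stage)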

n_total <- nrow(dataset)
n_train <- round(n_total * 1/2)

set.seed(32987)

train_index <- sample(nrow(dataset), size = n_train)

dataset_train <- dataset[train_index, ]
dataset_test <- dataset[-train_index, ]

cph <- coxph(Surv(time, status) ~ .,
             data = dataset_train,
             x = TRUE)

rf <- rfsrc(Surv(time, status) ~ .,
            data = dataset_train,
            nodesize = 15,
            ntree = 1000)

bst_cv <- gbm(Surv(time, status) ~ .,
              data = dataset_train,
              interaction.depth = 1,
              shrinkage = 0.025,
              n.trees = 500,
              cv.folds = 10)

bst_final <- gbm(Surv(time, status) ~ .,
                 data = dataset_train,
                 interaction.depth = 1,
                 shrinkage = 0.025,
                 n.trees = 150)

aorsf <- orsf(data = dataset_train,
              formula = Surv(time, status) ~ .,
              n_tree = 1000)

predictRisk.aorsf <- function(object, newdata, times, ...){
 predict(object, new_data = newdata, times = times, risk = TRUE)
}

models <- list(prop_hazard = cph,
               rsf_axis = rf,
               gradient_booster = bst_final,
               rsf_oblique = aorsf)

pred_horizon <- 2500

data_predrisk <- as.data.frame(
lapply(models,
       predictRisk,
       newdata = dataset_test,
       times = pred_horizon)
)

# cap predicted risks of exactly 1 from the proportional hazards model at 0.999
data_predrisk$prop_hazard[data_predrisk$prop_hazard == 1] <- 0.999
}
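One way the test and predrisk components might be used together is a rough calibration check: compare each model's mean predicted risk with the observed risk at the same horizon, estimated as one minus the Kaplan-Meier survival probability. The sketch below assumes only the survival package; observed_vs_predicted is a hypothetical helper written for this illustration, and the four-row data frames are made-up toy inputs that mimic the documented column layout.

```r
library(survival)

# Hypothetical helper: mean predicted risk per model versus the
# Kaplan-Meier estimate of observed risk at the prediction horizon.
# `test` needs columns time and status; `predrisk` has one column per model.
observed_vs_predicted <- function(test, predrisk, horizon = 2500) {
  km <- survfit(Surv(time, status) ~ 1, data = test)
  observed <- 1 - summary(km, times = horizon, extend = TRUE)$surv
  data.frame(
    model = names(predrisk),
    mean_predicted = unname(vapply(predrisk, mean, numeric(1))),
    observed = observed
  )
}

# Toy inputs for illustration only (not real predictions)
toy_test <- data.frame(time = c(100, 3000, 2000, 4000),
                       status = c(1, 0, 1, 0))
toy_predrisk <- data.frame(model_a = c(0.1, 0.2, 0.3, 0.4))

out <- observed_vs_predicted(toy_test, toy_predrisk)
out
```

With the dataset loaded, the same call would presumably be observed_vs_predicted(pbc_scalib$test, pbc_scalib$predrisk). A mean predicted risk far from the observed risk suggests the model over- or under-predicts on average, although agreement of the means alone does not show that the model is well calibrated across the range of predicted risks.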