Mayo Clinic Primary Biliary Cholangitis Data — pbc

pbc_scalib is a list with three datasets: train, test, and predrisk.

train is the data used to train a proportional hazards model, a gradient boosting tree ensemble, and two random forests, one with axis-based splits and one with oblique splits.
test is the data that the trained models computed predictions for.
predrisk is the predicted risk values from the models listed above, computed for the test data. The prediction horizon is 2,500 days after baseline assessment. Put another way, each predicted value is the predicted probability that a person will have an event within 2,500 days of baseline.
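As a rough illustration of what "risk within 2,500 days" means, the sketch below computes the same quantity for the whole cohort with a Kaplan-Meier estimate. This is not how predrisk was produced (the models give person-specific predictions); it only assumes the survival package is installed and that status is recoded as described on this page. The names km_fit and km_risk are made up for the example.

```r
library(survival)

# Assumption: complete cases only, with status recoded so that
# 0 = censored or transplant and 1 = dead.
dataset <- pbc[complete.cases(pbc), ]
dataset$status <- as.integer(dataset$status == 2)

pred_horizon <- 2500

# Kaplan-Meier estimate of survival for the whole cohort
km_fit <- survfit(Surv(time, status) ~ 1, data = dataset)

# Risk at the horizon = 1 - P(surviving beyond the horizon)
km_risk <- 1 - summary(km_fit, times = pred_horizon)$surv
km_risk
```

A model's predicted risk plays the same role as km_risk, except that it varies from person to person with the predictors.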

The train and test data are a light modification of the survival::pbc data. The modifications are:

removed rows with missing data
converted status into 0 for censored or transplant, 1 for dead
removed the id column
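The three modifications above can be sketched in a few lines. This is a minimal version that assumes only that the survival package is installed; it mirrors the preprocessing in the Examples section but recodes status with a direct comparison rather than by subtraction.

```r
library(survival)

# 1. keep only rows with no missing values
dataset <- pbc[complete.cases(pbc), ]

# 2. survival::pbc codes status as 0 = censored, 1 = transplant, 2 = dead;
#    collapse censored and transplant into 0 and recode dead as 1
dataset$status <- as.integer(dataset$status == 2)

# 3. drop the identifier column
dataset$id <- NULL

# inspect the result
dim(dataset)
table(dataset$status)
```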

pbc_scalib

Format

train and test are random subsets of roughly equal size from a dataset with 276 rows and 19 variables (the 20 variables of survival::pbc minus the id column). Those variables are:

time: number of days between registration and the earlier of death, transplantation, or study analysis in July, 1986
status: status at endpoint, 0 for censored or transplant, 1 for dead
trt: 1/2/NA for D-penicillamine, placebo, not randomised
age: in years
sex: m/f
ascites: presence of ascites
hepato: presence of hepatomegaly or enlarged liver
spiders: blood vessel malformations in the skin
edema: 0 no edema, 0.5 untreated or successfully treated, 1 edema despite diuretic therapy
bili: serum bilirubin (mg/dl)
chol: serum cholesterol (mg/dl)
albumin: serum albumin (g/dl)
copper: urine copper (ug/day)
alk.phos: alkaline phosphatase (U/liter)
ast: aspartate aminotransferase, once called SGOT (U/ml)
trig: triglycerides (mg/dl)
platelet: platelet count
protime: standardised blood clotting time
stage: histologic stage of disease (needs biopsy)

predrisk is a dataset with 138 rows and 4 variables:

prop_hazard: predictions from a proportional hazards model
rsf_axis: predictions from a random forest with axis-based splits
gradient_booster: predictions from a gradient boosting tree ensemble
rsf_oblique: predictions from a random forest with oblique splits

Source

T Therneau and P Grambsch (2000), Modeling Survival Data: Extending the Cox Model, Springer-Verlag, New York. ISBN: 0-387-98784-3.

Details

See the Examples section for code that generates the data and fits the models.

Examples

if (FALSE) {

library(riskRegression)
library(survival)
library(randomForestSRC)
library(gbm)
library(orsf2)

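# keep complete cases only
dataset <- pbc[complete.cases(pbc), ]
# recode status from 0/1/2 (censored/transplant/dead) to 0/0/1
dataset$status[dataset$status > 0] <- dataset$status[dataset$status > 0] - 1
# drop the identifier so it is not used as a predictor
dataset$id <- NULL
dataset$stage <- as.integer(dataset$stage)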

n_total <- nrow(dataset)
n_train <- round(n_total * 1/2)

set.seed(32987)

train_index <- sample(nrow(dataset), size = n_train)

dataset_train <- dataset[train_index, ]
dataset_test <- dataset[-train_index, ]

cph <- coxph(Surv(time, status) ~ .,
             data = dataset_train,
             x = TRUE)

rf <- rfsrc(Surv(time, status) ~ .,
            data = dataset_train,
            nodesize = 15,
            ntree = 1000)

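# cross-validated fit; it is not used for prediction below. Presumably it
# guided the choice of n.trees for the final model, for example through
# gbm.perf(bst_cv, method = "cv").
bst_cv <- gbm(Surv(time, status) ~ .,
              data = dataset_train,
              interaction.depth = 1,
              shrinkage = 0.025,
              n.trees = 500,
              cv.folds = 10)

# final boosting model used for prediction
bst_final <- gbm(Surv(time, status) ~ .,
                 data = dataset_train,
                 interaction.depth = 1,
                 shrinkage = 0.025,
                 n.trees = 150)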

aorsf <- orsf(data = dataset_train,
              formula = Surv(time, status) ~ .,
              n_tree = 1000)

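# predictRisk() is an S3 generic from riskRegression; this method lets the
# oblique random forest be handled by the lapply() call below
predictRisk.aorsf <- function(object, newdata, times, ...){
 predict(object, new_data = newdata, times = times, risk = TRUE)
}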

models <- list(prop_hazard = cph,
               rsf_axis = rf,
               gradient_booster = bst_final,
               rsf_oblique = aorsf)

pred_horizon <- 2500

data_predrisk <- as.data.frame(
lapply(models,
       predictRisk,
       newdata = dataset_test,
       times = pred_horizon)
)

# replace predicted risks of exactly 1 with 0.999
data_predrisk$prop_hazard[data_predrisk$prop_hazard == 1] <- 0.999
}