Machine Learning for Risk Prediction using Oblique Random Survival Forests

.title[
# Machine Learning for Risk Prediction using Oblique Random Survival Forests
]
.subtitle[
## A demo using the aorsf package
]
.author[
### Byron C. Jaeger
]
.date[
### April 20, 2023
]

---

# Hello!

## Slides are here: https://www.byronjaeger.com/talk/

---

# Outline

**Background and Jargon**

- Machine Learning & Supervised Learning

- Decision Trees & Random Forests: Axis-based & Oblique

- Censoring & Random Survival Forests

- Oblique Random Survival Forests

**Demo with `aorsf`**

- Fit

- Interpret

- Benchmark

- Extend

---
background-image: url(img/data_mining.jpg)
background-size: 15%
background-position: 95% 5%

# Machine Learning

### Supervised learning

- Labeled data

- Predict an outcome

- Learners

- Risk prediction

]

### Unsupervised learning

- Unlabeled data

- Find structure

- Clusters

- Organize medical records

]

---
class: center, middle

# Supervised learning

---

---

---

---

---

---

---

---

---

---
layout: false
class: center, middle, inverse

# Decision Trees and Random Forests

---
background-image: url(img/penguins.png)
background-size: 45%
background-position: 85% 72.5%

## Decision trees

- Partition the space of predictor variables.

- Used in classification, regression, and survival analysis.

- May be **axis-based** or **oblique** (more on this soon)

.pull-left[
We'll demonstrate the mechanics of decision trees by developing a prediction rule to classify penguin1 species (chinstrap, gentoo, or adelie) based on bill depth and bill length.
]

.footnote[
1Data were collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).
]

---

Dimensions for Adelie, Chinstrap and Gentoo Penguins at Palmer Station

---

Partition all the penguins into flipper length < 207 or ≥ 207

---

Partition penguins on the left into bill length < 43 or ≥ 43

---

The same partitions visualized as a binary tree

![](index_files/figure-html/unnamed-chunk-14-1.png)

---

## Random Forest

Each tree is grown through a randomized process:

- tree-specific bootstrapped replicate of the training data (out-of-bag data means not in the bootstrapped replicate)

- node-specific random subsets of predictors

Randomness makes the trees more independent

- Each randomized tree is *individually weaker* than a standard decision tree.

- The *average prediction* from many randomized trees is more accurate than the prediction from a single tree.

Why? Consider 100 classification trees, each with 55% chance of choosing the right answer, 45% chance of choosing wrong.

- if the trees are all the same, the ensemble has 45% chance of being wrong
    
- if they are completely independent and we use the majority vote:
    
  
  ```r
  # chance of 49 or fewer trees being right
  pbinom(q = 49, size = 100, prob = 0.55)
  ```
  
  ```
  ## [1] 0.1345762
  ```

---

Predictions from a non-random, **individual tree**

---

Predictions from **randomized individual tree**

---

**Ensemble** predictions from a random forest

---

## Oblique trees

Instead of using one variable to split the data, use a weighted combination of variables.

*I.e.,* instead of `$x_1 < \text{cutpoint}$`, use `$c_1 * x_1 + c_2 * x_2 < \text{cutpoint}$`

---

Predictions from **randomized individual oblique tree**

---

class: center, middle
background-image: url("img/picasso_dalle2.png")
background-size: contain

---

**Ensemble** predictions from an oblique random forest

---
class: middle, center

# Censoring

---

---
class: center, middle

# Random Survival Forests (RSFs)

---

---

## Oblique RSFs

Around 2018, I wrote some code to fit RSFs with oblique splits.

- Published a paper to share the idea + code.1

- TLDR: Cox regression in each non-leaf node, use the regression coefficients to make oblique splits.

Around 2020, that code got picked up and used for heart failure risk prediction.2

- oblique RSF did better than boosting, penalized, and stepwise regression! 😄

- oblique RSF was slow and hard to interpret. 😞

1. Jaeger BC, Long DL, Long DM, Sims M, Szychowski JM, Min YI, Mcclure LA, Howard G, Simon N. OBLIQUE RANDOM SURVIVAL FORESTS. Ann Appl Stat. 2019 Sep;13(3):1847-1883. doi: 10.1214/19-aoas1261. Epub 2019 Oct 17. PMID: 36704751; PMCID: PMC9875945.

2. Segar MW, Jaeger BC, Patel KV, Nambi V, Ndumele CE, Correa A, Butler J, Chandra A, Ayers C, Rao S, Lewis AA, Raffield LM, Rodriguez CJ, Michos ED, Ballantyne CM, Hall ME, Mentz RJ, de Lemos JA, Pandey A. Development and Validation of Machine Learning-Based Race-Specific Models to Predict 10-Year Risk of Heart Failure: A Multicohort Analysis. Circulation. 2021 Jun 15;143(24):2370-2383. doi: 10.1161/CIRCULATIONAHA.120.053134. Epub 2021 Apr 13. PMID: 33845593; PMCID: PMC9976274.

]

---
class: center, middle
background-image: url("img/meme_slow_R.jpg")
background-size: contain

---

## Oblique RSFs

Around 2021, I committed to `aorsf`, a re-write of my oblique RSF code

- Open review by rOpenSci and published in JOSS.1

- Large benchmark and more details in pre-print.2

- Named after the software my Dad wrote while I was growing up: AORSA3

1. Jaeger BC, Welden S, Lenoir K, Pajewski NM. aorsf: An R package for supervised learning using the oblique random survival forest. Journal of Open Source Software. 2022 Sep 28;7(77):4705.

2. Jaeger BC, Welden S, Lenoir K, Speiser JL, Segar MW, Pandey A, Pajewski NM. Accelerated and interpretable oblique random survival forests. arXiv preprint arXiv:2208.01129. 2022 Aug 1.

3. [Article from Oak Ridge National Laboratory Cray User group featuring AORSA](https://www.laboratorynetwork.com/doc/oak-ridge-national-laboratory-speeds-tests-of-0001)

]

---
class: center, middle, inverse

# Demo with aorsf

---
layout: true

## Fit an oblique RSF

---

**Note**: Copy/paste my code and run it locally1

Step 1: install/load the `aorsf` R package

```r
# un-comment line below to install aorsf if needed
# install.packages('aorsf')

library(aorsf)
library(tidyverse) # will be used throughout
```

.footnote[
1Thanks to [Garrick Aden-Buie's xaringanExtra](https://github.com/gadenbuie/xaringanExtra) for this awesome feature.
]

---

Step 2: Inspect the `pbc_orsf` data

```r
as_tibble(pbc_orsf)
```

```
## # A tibble: 276 x 20
## id time status trt age sex ascites hepato spiders edema bili chol
## <int> <int> <dbl> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <dbl> <int>
## 1 1 400 1 d_pe~ 58.8 f 1 1 1 1 14.5 261
## 2 2 4500 0 d_pe~ 56.4 f 0 1 1 0 1.1 302
## 3 3 1012 1 d_pe~ 70.1 m 0 0 0 0.5 1.4 176
## 4 4 1925 1 d_pe~ 54.7 f 0 1 1 0.5 1.8 244
## 5 5 1504 0 plac~ 38.1 f 0 1 1 0 3.4 279
## 6 7 1832 0 plac~ 55.5 f 0 1 0 0 1 322
## 7 8 2466 1 plac~ 53.1 f 0 0 0 0 0.3 280
## 8 9 2400 1 d_pe~ 42.5 f 0 0 1 0 3.2 562
## 9 10 51 1 plac~ 70.6 f 1 0 1 1 12.6 200
## 10 11 3762 1 plac~ 53.7 f 0 1 1 0 1.4 259
## # i 266 more rows
## # i 8 more variables: albumin <dbl>, copper <int>, alk.phos <dbl>, ast <dbl>,
## # trig <int>, platelet <int>, protime <dbl>, stage <ord>
```

---

Step 3: Fit and inspect your oblique RSF

```r
fit_orsf <- orsf(data = pbc_orsf, 
 formula = time + status ~ . - id)

fit_orsf
```

```
## ---------- Oblique random survival forest
## 
##      Linear combinations: Accelerated
##           N observations: 276
##                 N events: 111
##                  N trees: 500
##       N predictors total: 17
##    N predictors per node: 5
##  Average leaves per tree: 25
## Min observations in leaf: 5
##       Min events in leaf: 1
##           OOB stat value: 0.84
##            OOB stat type: Harrell's C-statistic
##      Variable importance: anova
## 
## -----------------------------------------
```

---

Do you prefer the classic syntax?

```r
library(survival)

fit_cph <- coxph(formula = Surv(time, status) ~ . - id,
 data = pbc_orsf)
```

`aorsf` supports that:

```r
fit_orsf <- orsf(formula = Surv(time, status) ~ . - id,
 data = pbc_orsf)
```

Prefer pipes? `aorsf` supports that too.

```r
set.seed(329) # use this seed to make your results match slides

fit_orsf <- pbc_orsf |>
 orsf(formula = time + status ~ . - id)
```

---

Fan of `tidymodels`? (me too!) `aorsf` is a valid engine for censored regression.

```r
library(parsnip)
library(censored) # must be version 0.2.0 or higher

rf_spec <- 
 rand_forest(trees = 200) %>%
 set_engine("aorsf") %>% 
 set_mode("censored regression")

fit_tidy <- rf_spec %>% 
 parsnip::fit(data = pbc_orsf, 
 formula = Surv(time, status) ~ . - id)

# fit printed on next slide for space
```

---

Fan of `tidymodels`? (me too!) `aorsf` is a valid engine for censored regression.

```r
fit_tidy
```

```
## parsnip model object
## 
## ---------- Oblique random survival forest
## 
##      Linear combinations: Accelerated
##           N observations: 276
##                 N events: 111
##                  N trees: 200
##       N predictors total: 17
##    N predictors per node: 5
##  Average leaves per tree: 25
## Min observations in leaf: 5
##       Min events in leaf: 1
##           OOB stat value: 0.84
##            OOB stat type: Harrell's C-statistic
##      Variable importance: anova
## 
## -----------------------------------------
```

---

## Interpret an oblique RSF

---

Get expected risk (i.e., partial dependence [`pd`]) in descending order of importance

```r
orsf_summarize_uni(fit_orsf, n_variables = 1)
```

```
## 
## -- bili (VI Rank: 1) ----------------------------
## 
## |---------------- risk ----------------|
## Value Mean Median 25th % 75th %
## <char> <num> <num> <num> <num>
## 0.80 0.2343668 0.1116206 0.04509389 0.3729834
## 1.4 0.2547884 0.1363122 0.05985486 0.4103148
## 3.5 0.3698634 0.2862611 0.16196924 0.5533383
## 
## Predicted risk at time t = 1788 for top 1 predictors
```

---

`orsf_pd()` functions give full control over expected risk

- `orsf_pd_inb()`: computes expected risk using all training data
- `orsf_pd_oob()`: computes expected risk using out-of-bag only
- `orsf_pd_new()`: computes expected risk using new data

```r
pd_oob <- orsf_pd_oob(fit_orsf, pred_spec = list(bili = 1:5))

pd_oob
```

```
## pred_horizon bili mean lwr medn upr
## <num> <num> <num> <num> <num> <num>
## 1: 1788 1 0.2390257 0.01072737 0.1137410 0.8746058
## 2: 1788 2 0.2898135 0.04198550 0.1849255 0.8930730
## 3: 1788 3 0.3466884 0.07529034 0.2593297 0.9112996
## 4: 1788 4 0.3906128 0.10570410 0.3148943 0.9213005
## 5: 1788 5 0.4346375 0.15449708 0.3647937 0.9305869
```

.footnote[Why is `pred_horizon` set at 1788? If you don't set a `pred_horizon`, `orsf` functions will set it for you as the median follow-up time]

---

Use `pred_horizon` to make expected risk more interpretable.

- expected risk for men and women at 1, 2, 3, 4, and 5 years:

```r
 pd_by_gender <- orsf_pd_oob(fit_orsf, 
 pred_spec = list(sex = c("m", "f")),
 pred_horizon = 365 * 1:5)
 
 pd_by_gender %>% 
 dplyr::select(pred_horizon, sex, mean) %>% 
 tidyr::pivot_wider(names_from = sex, values_from = mean) %>% 
 dplyr::mutate(ratio = m / f)
 ```
 
 ```
 ## # A tibble: 5 x 4
 ## pred_horizon m f ratio
 ## <dbl> <dbl> <dbl> <dbl>
 ## 1 365 0.0768 0.0728 1.06
 ## 2 730 0.125 0.111 1.13
 ## 3 1095 0.230 0.195 1.18
 ## 4 1460 0.298 0.251 1.19
 ## 5 1825 0.355 0.296 1.20
 ```

Does expected risk increase faster over time for men?

---

Use `pred_horizon` to investigate expected risk profiles over time

```r
fit_orsf %>% 
  orsf_pd_oob(pred_spec = list(sex = c("m", "f")),
              pred_horizon = seq(365, 365*5, by = 25)) %>% 
  ggplot(aes(x = pred_horizon, y = mean, color = sex)) +
  geom_line() +
  labs(x = 'Time since baseline', y = 'Expected risk')
```

![](index_files/figure-html/unnamed-chunk-35-1.png)

---

Or, fix `pred_horizon` and look at risk profiles over other variables

```r
pred_spec = list(bili = seq(1, 5, length.out = 20),
                 edema = levels(pbc_orsf$edema),
                 trt = levels(pbc_orsf$trt))

pd_bili_edema <- orsf_pd_oob(fit_orsf, pred_spec)

fig <- ggplot(pd_bili_edema) + 
 aes(x = bili, y = medn, col = trt, linetype = edema) +
 geom_line() + 
 labs(y = 'Expected predicted risk')

# fig on next slide for space
```

---

Or, fix `pred_horizon` and look at risk profiles over other variables

![](index_files/figure-html/unnamed-chunk-37-1.png)

---
layout: false
class: center, middle

# Variable importance

---

## ANOVA importance

A fast method to compute variable importance with (some) oblique random forests:

For each predictor:

1. Compute p-values for each coefficient

1. Compute the proportion of p-values that are low (<0.01)

Importance of a predictor = the proportion of times its p-value is low.

---

## Permutation importance

For each predictor:

1. Permute predictor values

1. Measure prediction error with permuted values

Importance of a predictor = increase in prediction error after permutation

---

---

---

## Negation importance

For each predictor:

1. Multiply coefficient in linear combination by -1

1. Measure prediction error with negated coefficients

Importance of a predictor = increase in prediction error after negation

---

---

---

## Variable importance

```r
orsf_vi_anova(fit_orsf)[1:5]
```

```
##   ascites      bili     edema    copper   albumin 
## 0.3832659 0.2720345 0.2383323 0.2016109 0.1750125
```

```r
orsf_vi_permute(fit_orsf)[1:5]
```

```
##        bili         age      copper     albumin     protime 
## 0.016670140 0.008908106 0.005521984 0.005365701 0.003594499
```

```r
orsf_vi_negate(fit_orsf)[1:5]
```

```
##       bili        age     copper    protime    albumin 
## 0.07637008 0.02740154 0.02505730 0.01302355 0.01000208
```

---
layout: false
class: center, middle, inverse

# But how accurate is `aorsf`'s predicted risk?

---
layout:true

## Benchmarking `aorsf`

---

Let's run a benchmark using `mlr3`1 comparing `aorsf`'s oblique RSF to:

- axis-based RSFs in `randomForestSRC`
- axis-based RSFs in `ranger`

We'll run an experiment where

1. each approach is used to make a prediction model with the same training data
2. each model is validated in the same external data.

Here are the packages we'll use:

```r
library(mlr3verse)
library(mlr3proba)
library(mlr3extralearners)
library(mlr3viz)
library(mlr3benchmark)
library(OpenML) # for some of the prediction tasks
```

.footnote[1`aorsf` is compatible with `mlr3` too. Find more about `mlr3` here: https://mlr3.mlr-org.com/]

---