MELODEM data workshop



Byron C. Jaeger, PhD

Wake Forest University School of Medicine


Hello! My name is Byron

I am an R enthusiast.

I love dogs.

I study risk prediction + machine learning


This workshop relies heavily on R, but you may prefer not to use it.

  • You can partner up with an R user

  • You can use the simulated data (makes exercises much easier)

  • You can relax and get ☕ during exercises

Learning under stress may not be ideal. Do what works best for you.

I will do my part to make the concepts taught here R-agnostic so that you can learn plenty of valuable things no matter what language you use.


Session 1: Friday, 4:00pm - 5:30pm

  • Introduction, data management

Session 2a: Saturday 9:30am - 11:00am

  • Decision trees and random forests

  • Break from 10:45am - 11:00am


Session 2b: Saturday 11:00am - 12:30pm

  • Oblique random forests

  • Lunch from 12:30pm - 2:00pm

Session 3: Saturday 2:00pm - 5:30pm

  • Causal random forests

  • Break from 3:45pm - 4:00pm

  • Finish slides or get a head start on collaboration


Session 4: Sunday 9:30am - 12:30pm

  • Discuss manuscript aims, GitHub issues (30m)

  • Work in small groups (45m)

  • Break from 10:45am - 11:00am

Session 5: Sunday 2:00pm - 4:00pm

  • Progress updates and discussion (30m)

  • Work in small groups (90m)

Part 1: Introduction and data management


Get our data organized.

  • Build familiarity with R/RStudio and git/GitHub

  • Learn how to use a stellar R package: targets

  • Plug your data into the workshop pipeline

  • Help a friend

Whole game

First, you pull code down from the GitHub repo.

Next, you commit code and summary results. No data!

Last, you push your code and summary results to the GitHub repo.

Why GitHub?

So we can work together, separately!

  • Store and coordinate code from multiple authors

  • Public facing team science

  • Free website for our work (i.e., this workshop).

Set-up R packages

Make sure we all have up-to-date versions of these R packages:

# Install required packages for the workshop
pkgs <- 
  c("tidyverse", "tidymodels", "data.table", "haven", "magrittr",
    "glue", "grf", "aorsf", "glmnet", "xgboost", "randomForestSRC",
    "party", "riskRegression", "survival", "officer", "flextable", 
    "table.glue", "gtsummary", "usethis", "cli", "ggforce",
    "rpart", "rpart.plot", "ranger", "withr", "gt", "recipes", 
    "butcher", "sandwich", "lmtest", "gbm", "officedown", "Matrix",
    "ggsurvfit", "tidycmprsk", "here", "tarchetypes", "targets")

# installs the current CRAN version of each package
install.packages(pkgs)



Make sure you have a GitHub account with a personal access token (PAT) stored in RStudio

  1. Open RStudio
  2. Copy/paste the code on this slide into an R script
  3. Important: adjust destdir
  4. Run

usethis::create_from_github(
  "<owner>/melodem-apoe4-het", # replace <owner> with the repository owner
  destdir = "path/of/choice", 
  fork = TRUE
)
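If you have not stored a PAT yet, the sketch below shows one common way to do it with usethis and gitcreds. The wrapper name setup_github_pat is hypothetical and only groups the three calls; it assumes both packages are installed and that you run it in an interactive session, since the calls open a browser and prompt for input.

```r
# A sketch of one-time GitHub credential setup (hypothetical wrapper name).
setup_github_pat <- function() {
  # Opens a browser page for creating a PAT with recommended scopes
  usethis::create_github_token()
  # Prompts for the token and saves it in the git credential store
  gitcreds::gitcreds_set()
  # Prints a situation report, including whether a PAT was found
  usethis::git_sitrep()
}

# Only run when a person is at the keyboard
if (interactive()) setup_github_pat()
```

Once the situation report shows that a token was discovered, forking and cloning the repo should no longer ask for credentials.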

Introducing targets

Your turn

  1. Open _targets.R in the melodem-apoe4-het project.
  2. Run library(targets) to load the targets package.
  3. Run tar_load_globals() to load relevant functions and packages.
  4. Run tar_glimpse() to inspect the pipeline.
  5. Run tar_make() to make the pipeline.

Start with data management

In the _targets.R file:

file_sim_tar <- tar_target(
  file_sim,
  command = "data/sim-raw.csv",
  format = 'file'
)

data_melodem_tar <- tar_target(
  data_melodem,
  data_prepare(file_name = "data/sim-raw.csv")
)

This will be done with your data, too!
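To see what a file target does outside the workshop repo, here is a minimal sketch of a throwaway pipeline. The file toy-raw.csv and the target names file_toy, data_toy, and n_rows are made up for illustration; the sketch assumes the targets and withr packages are installed and runs everything in a temporary directory so nothing touches your project.

```r
library(targets)

result <- withr::with_tempdir({
  # A tiny stand-in for a raw data file (hypothetical toy data)
  write.csv(data.frame(age = c(61, 70, 58), status = c(0, 1, 0)),
            "toy-raw.csv", row.names = FALSE)

  # Write a throwaway _targets.R into the temporary directory
  tar_script({
    library(targets)
    list(
      # format = "file": the command returns a path and targets
      # watches the file's contents for changes
      tar_target(file_toy, "toy-raw.csv", format = "file"),
      # uses file_toy, so it is rebuilt when the file changes
      tar_target(data_toy, read.csv(file_toy)),
      tar_target(n_rows, nrow(data_toy))
    )
  }, ask = FALSE)

  tar_make()        # build every target that is out of date
  tar_read(n_rows)  # read a stored result back into the session
})
```

Because data_toy refers to file_toy by name, targets knows the order in which to build them without being told.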

Your turn

We are going to add your data to the pipeline, carefully.

  1. Think of a name for your data.

    • Example name: regards
  2. Save a copy of your data in data/sensitive. The name of your file should be name-raw.csv or name-raw.sas7bdat, where name is your data’s name. E.g., regards-raw.csv



Your turn

We are going to add your data to the pipeline, carefully.

  1. Switch from the R console to the terminal.

  2. Verify you have no uncommitted changes:

git status

This should return “nothing to commit, working tree clean”

  3. Create a new branch with git and switch to it:

git checkout -b regards

Your turn

We are going to add your data to the pipeline, carefully.

  1. Copy/paste code shown here to _targets.R, just beneath the line that starts with # real data cohorts.

  2. Replace zzzz with the name of your data.

  3. Save the _targets.R file

  4. Run tar_make() in the R console.

file_zzzz_tar <- tar_target(
  file_zzzz,
  command = "data/sensitive/zzzz-raw.csv",
  format = "file"
)

data_zzzz_tar <- tar_target(
  data_zzzz,
  data_prepare(
    file_name = "data/sensitive/zzzz-raw.csv"
  )
)

# don't forget to add these targets
# to the targets list at the bottom!

Your turn

  1. Run tar_read(data_zzzz), where zzzz is your data name. Example output for a dataset named sprint:
------------------------------------- sprint ------------------------------------- 
# A tibble: 7,159 × 23
  status  time frailty_catg statin aspirin  egfr sub_ckd   age sub_cvd race4
   <dbl> <dbl> <fct>         <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl> <fct>
1      0 11.8  Pre-frail         0       0  88.2       0    57       0 BLACK
2      0 12.4  Pre-frail         1       1  82.1       0    61       0 WHITE
3      0 10.4  Frail             1       0  97.5       0    56       0 WHITE
4      0  7.34 Pre-frail         0       0  87.3       0    63       0 WHITE
5      0  7.36 Pre-frail         0       0  68.0       0    71       1 WHITE
# ℹ 7,154 more rows
# ℹ 13 more variables: CHR <dbl>, GLUR <dbl>, HDL <dbl>, TRR <dbl>,
#   UMALCR <dbl>, BMI <dbl>, sbp <dbl>, dbp <dbl>, fr_risk10yrs <dbl>,
#   orth_hypo <dbl>, education <dbl>, treatment <fct>, sex <fct>

 ----------------------------------  exclusions  ---------------------------------- 
# A tibble: 2 × 2
  label               n_obs
  <glue>              <int>
1 sprint participants  8541
2 Aged 55-80 years     7159
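The exclusions table above gains one row per exclusion step. As a rough sketch of how such a step could keep that record (this is not the workshop's data_exclude; the helper exclude_by_age and the toy data are hypothetical, and base R data frames stand in for tibbles):

```r
# Hypothetical helper: keep rows with age in [lower, upper] and append
# a row to the exclusions table with the number of rows remaining.
exclude_by_age <- function(data, lower, upper) {
  age <- data$values$age
  keep <- !is.na(age) & age >= lower & age <= upper
  data$values <- data$values[keep, , drop = FALSE]
  data$exclusions <- rbind(
    data$exclusions,
    data.frame(label = paste0("Aged ", lower, "-", upper, " years"),
               n_obs = nrow(data$values))
  )
  data
}

# Toy object with the same two elements as the workshop data objects
toy <- list(
  values = data.frame(age = c(50, 57, 63, 71, 85)),
  exclusions = data.frame(label = "toy participants", n_obs = 5L)
)

toy <- exclude_by_age(toy, lower = 55, upper = 80)
toy$exclusions
```

Keeping the counts next to the data means the exclusion cascade can be reported without re-deriving it later.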

Data management

We used data_prepare() to make this object.

Let’s check out what data_prepare does. First, it loads the data.

data_prepare <- function(file_name, ...){

  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_recode(labels = labels) %>%
    data_exclude()

  # checks not shown; they involve the variables
  # c("age", "sex", "apoe4", "time", "status")

  output

}



Data management

Let’s check out data_load(). The object returned from this function includes data and a preliminary exclusion table. It also has a customized class based on the dataset, and it belongs to a broader class called melodem_data.

data_load <- function(file_path){

  # ... file management code not shown ...

  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"), 
              'melodem_data'),
    label = cohort_label
  )

}
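To make the returned object concrete, here is a small sketch that builds one by hand with structure(). The values of cohort_name, cohort_label, and data_input are made up for illustration, and a base R data frame stands in for the tibble used in data_load.

```r
# Hypothetical inputs; in data_load these come from the file name
cohort_name  <- "regards"
cohort_label <- "REGARDS"
data_input   <- data.frame(age = c(60, 70), status = c(0, 1))

out <- structure(
  .Data = list(
    values = data_input,
    exclusions = data.frame(
      label = paste(cohort_name, "participants"),
      n_obs = nrow(data_input)
    )
  ),
  # specific class first, broad class second
  class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data"),
  label = cohort_label
)

class(out)                     # both classes, specific one first
attr(out, "label")             # the label is stored as an attribute
inherits(out, "melodem_data")  # TRUE for every cohort
```

The order of the class vector matters: R tries methods for the first class before the second.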


Each dataset is unique, and some may require customized preparation:

  • Different elements need to be cleaned.

  • Different variables need to be derived.

  • Different variables may be selected.

  • Different exclusions may be applied.

data_load makes its output have a customized class based on the name of the dataset so that you, the owner of the data, are in control of these steps that may be uniquely defined for your data.


Customized preparation relies on R’s generic function system. Generic functions (e.g., plot()) dispatch different methods depending on the class of the input object.

Here’s a look at the data_clean method for an object of class melodem_sim:

data_clean.melodem_sim <- function(data){

  dt <- data_clean_minimal(data$values)

  dt[, age := age * 5 + 65]
  dt[, sex := fifelse(sex > 0, 1, 0)]
  dt[, sex := factor(sex, levels = c(0, 1),
                     labels = c("male", "female"))]
  data$values <- dt

  data

}
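If data.table syntax is new to you, the sketch below runs the same three := steps on a toy table so you can see what each one does. The input values are made up, and the reading that age starts on a centered and scaled axis is an assumption based on the age * 5 + 65 transform; the sketch assumes the data.table package is installed.

```r
library(data.table)

# Toy input (hypothetical values)
dt <- data.table(age = c(-1, 0, 2), sex = c(-0.3, 0.8, 0))

# := modifies columns in place, without copying the table
dt[, age := age * 5 + 65]            # rescale to years
dt[, sex := fifelse(sex > 0, 1, 0)]  # dichotomize at zero
dt[, sex := factor(sex, levels = c(0, 1),
                   labels = c("male", "female"))]

dt
```

Because := updates by reference, there is no need to reassign dt after each step.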


Here’s the data_clean method for an object of class melodem_data, which R uses when no dataset-specific method exists:

data_clean.melodem_data <- function(data){

  # no cleaning by default; the data are returned unchanged
  data

}



Your data

  • Your data’s first class is melodem plus the name you picked, e.g., melodem_regards, and second class is melodem_data. Verify by running class() on your data object.

  • When you run data_clean() with, e.g., the regards data,

    • R will look for a function called data_clean.melodem_regards
    • If it doesn’t exist, R runs data_clean.melodem_data

TLDR: If you don’t write a specific function for data_clean, data_derive, etc. for your data, then these functions will not do anything to your data.
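The fallback behavior can be tried in isolation with a toy version of the generic. This is a sketch, not the workshop's code: the rounding done by the regards method, the helper make_toy, and the cohort name "other" are made up for illustration.

```r
# Generic: dispatches on the class vector of `data`
data_clean <- function(data) UseMethod("data_clean")

# Fallback method for the broad class: returns the input unchanged
data_clean.melodem_data <- function(data) data

# Hypothetical method for a dataset named "regards"
data_clean.melodem_regards <- function(data) {
  data$values$age <- round(data$values$age)
  data
}

# Hypothetical helper that builds a toy object with both classes
make_toy <- function(cohort_name) {
  structure(
    list(values = data.frame(age = c(64.6, 71.2))),
    class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data")
  )
}

regards <- data_clean(make_toy("regards"))  # specific method runs
other   <- data_clean(make_toy("other"))    # falls back to melodem_data
```

Here regards$values$age is rounded, while other$values$age is untouched because no data_clean.melodem_other method exists.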

Your turn

For the rest of this session,

  • Implement specific data_clean, data_derive, data_select, and data_exclude for your data.

  • If you already did these operations before the workshop, move the code you used into the corresponding function.

  • If you finish early, help someone else!