MELODEM data workshop

Introduction

Byron C. Jaeger, PhD

Wake Forest University School of Medicine

Welcome!

Wi-Fi network name

CF-Bellinglise

Wi-Fi password

BELLINGLISE2005

Hello! My name is Byron

I am an R enthusiast.

I love dogs.

I study risk
prediction + machine learning

Options

This workshop heavily leverages R, but you may prefer not to.

  • You can partner up with an R user

  • You can use the simulated data (makes exercises much easier)

  • You can relax and get ☕ during exercises

Learning under stress may not be ideal. Do what works best for you.

I will do my part to make the concepts taught here R-agnostic so that you can learn plenty of valuable things no matter what language you use.

Schedule

Session 1: Friday, 4:00pm - 5:30pm

  • Introduction, data management

Session 2a: Saturday 9:30am - 11:00am

  • Decision trees and random forests

  • Break from 10:45am - 11:00am

Schedule

Session 2b: Saturday 11:00am - 12:30am

  • Oblique random forests

  • Lunch from 12:30pm - 2:00pm

Session 3: Saturday 2:00pm - 5:30pm

  • Causal random forests

  • Break from 3:45pm - 4:00pm

  • Finish slides or get a head start on collaboration

Schedule

Session 4: Sunday 9:30am - 12:30 pm

  • Discuss manuscript aims, GitHub issues (30m)

  • Work in small groups (45m)

  • Break from 10:45am - 11:00am

Session 5: Sunday 2:00pm - 4:00 pm

  • Progress updates and discussion (30m)

  • Work in small groups (90m)

Sticky notes

While you’re working on exercises,

  • Place pink sticky note on the back of your laptop if you want help.

  • Place blue sticky note on the back of your laptop when you are done

Part 1: Introduction and data management

Goals

Get our data organized.

  • Build familiarity with R/Rstudio and git/GitHub

  • Learn how to use a stellar R package: targets

  • Plug your data into the workshop pipeline

  • Help a friend

Whole game

First, you pull code down from the GitHub repo.

Whole game

Next, you commit code and summary results. No data!

Whole game

Last, you push your code and summary results to the GitHub repo.

Why GitHub?

So we can work together, separately!

  • Store and coordinate code from multiple authors

  • Public facing team science

  • Free website for our work (i.e., this workshop).

Set-up R packages

Make sure we all have up-to-date versions of these R packages:

# Install required packages for the workshop
pkgs <- 
  c("tidyverse", "tidymodels", "data.table", "haven", "magrittr",
    "glue", "grf", "aorsf", "glmnet", "xgboost", "randomForestSRC",
    "party", "riskRegression", "survival", "officer", "flextable", 
    "table.glue", "gtsummary", "usethis", "cli", "ggforce",
    "rpart", "rpart.plot", "ranger", "withr", "gt", "recipes", 
    "butcher", "sandwich", "lmtest", "gbm", "officedown", "Matrix",
    "ggsurvfit", "tidycmprsk", "here", "tarchetypes", "targets")

install.packages(pkgs)

Pull!

Make sure you have a GitHub account with personal access token (PAT) stored in Rstudio

  1. Open Rstudio
  2. Copy/paste the code on this slide into an R script
  3. Important: adjust destdir
  4. Run
library(usethis)

create_from_github(
  "bcjaeger/melodem-apoe4-het",
  destdir = "path/of/choice", 
  fork = TRUE
)

Introducing targets

Your turn

  1. Open _targets.R in the melodem-apoe4-het project.
  2. Run library(targets) to load the targets package.
  3. Run tar_load_globals() to load relevant functions and packages.
  4. Run tar_glimpse() to inspect the pipeline.
  5. Run tar_make() to make the pipeline.
05:00

Start with data management

In the _targets.R file:

file_sim_tar <- tar_target(
  file_sim,
  command = "data/sim-raw.csv",
  format = 'file'
)

data_sim_tar <- tar_target(
  data_sim,
  data_prepare(file_sim)
)

This will be done with your data, too!

Your turn

We are going to add your data to the pipeline, carefully.

  1. Think of a name for your data.

    • Example name: regards
  2. Save a copy of your data in data/sensitive. The name of your file should be name-raw.csv or name-raw.sas7bdat, where name is your data’s name. E.g., regards-raw.csv

03:00

Your turn

We are going to add your data to the pipeline, carefully.

  1. Switch from the R console to the terminal.

  2. Verify you have no uncommitted changes:

git status

Should return “nothing to commit, working tree clean”

  1. Create a new branch with git:
git branch -b regards
03:00

Your turn

We are going to add your data to the pipeline, carefully.

  1. Copy/paste code shown here to _targets.R, just beneath the line that starts with # real data cohorts.

  2. Replace zzzz with the name of your data.

  3. Save the _targets.R file

  4. Run tar_make() in the R console.

file_zzzz_tar <- tar_target(               
  file_zzzz,
  command = "data/sensitive/zzzz-raw.csv",
  format = "file"
)

data_zzzz_tar <- tar_target(
  data_zzzz,
  data_prepare(
    file_name = "data/sensitive/zzzz-raw.csv"
  )
)

# don't forget to add these targets
# to the targets list at the bottom!
05:00

Your turn

  1. run tar_read(data_zzzz), where zzzz is your data name.
tar_read(data_sim)
-------------------------------------- sim -------------------------------------- 
# A tibble: 1,000 × 8
   time status apoe4       sex      age biomarker_1 biomarker_2 biomarker_3
  <dbl>  <int> <fct>       <fct>  <dbl>       <dbl>       <dbl>       <dbl>
1 0.530      1 carrier     female  66.3      0.747      -0.742       -1.71 
2 0.855      0 non_carrier female  65.5     -0.562      -0.0513       0.546
3 0.120      1 carrier     male    71.4      0.0133     -0.184        0.826
4 1.16       1 carrier     female  59.5      0.795      -1.32        -1.28 
5 5.84       0 non_carrier female  58.5     -0.293       0.191        2.54 
# ℹ 995 more rows

 ----------------------------------  exclusions  ---------------------------------- 
# A tibble: 1 × 2
  label            n_obs
  <glue>           <int>
1 sim participants  1000

Data management

We used data_prepare() to make this object.

Let’s check out what data_prepare does.

data_prepare <- function(file_name, ...){

  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_recode(labels = labels) %>%
    data_exclude(...)
  
  # checks not shown

  output

}

Data management

We used data_prepare() to make this object.

Let’s check out what data_prepare does. First, it loads the data

data_prepare <- function(file_name, ...){

  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_exclude(...)

  check_names(output$values,
              c("age", "sex", "apoe4", "time", "status"))

  output

}

Data management

Let’s check out data_load().

data_load <- function(file_path){

    # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"), 
              'melodem_data'),
    label = cohort_label
  )
  
}

Data management

Let’s check out data_load(). The object returned from this function includes data and a preliminary exclusion table.

data_load <- function(file_path){

    # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(cohort_name, 'melodem_data'),
    label = cohort_label
  )
  
}

Data management

The object returned also has customized class based on the dataset. The output also belongs to a broader class called melodem_data

data_load <- function(file_path){

  # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(cohort_name, 'melodem_data'),
    label = cohort_label
  )
  
}

Why?

Each dataset is unique, and some may require customized preparation:

  • Different elements need to be cleaned.

  • Different variables need to be derived.

  • Different variables may be selected.

  • Different exclusions may be applied.

data_load makes its output have a customized class based on the name of the dataset so that you, the owner of the data, are in control of these steps that may be uniquely defined for your data.

How?

R’s generic function system. Generic functions (e.g., plot()) dispatch different methods depending on the type of input object.


Here’s a look at the generic function for cleaning an object of class sim:

data_clean.melodem_sim <- function(data){

  dt <- data_clean_minimal(data$values)

  dt[, age := age * 5 + 65]
  dt[, sex := fifelse(sex > 0, 1, 0)]
  dt[, sex := factor(sex, levels = c(0, 1),
                     labels = c("male", "female"))]
  data$values <- dt
  data
}

How?

R’s generic function system. Generic functions (e.g., plot()) dispatch different methods depending on the type of input object.


Here’s the generic function for cleaning an object of class melodem_data:

data_clean.melodem_data <- function(data){

  data_clean_minimal(data)

}

Your data

  • Your data’s first class is melodem plus the name you picked, e.g., melodem_regards, and second class is melodem_data. Verify by running class()

  • When you run data_clean() with, e.g., the regards data,

    • R will look for a function called data_clean.melodem_regards
    • If it doesn’t exist, R runs data_clean.melodem_data

TLDR: If you don’t write a specific function for data_clean, data_derive, etc. for your data, then these functions will not do anything to your data.

Your turn

For the rest of this session,

  • Implement specific data_clean, data_derive, data_select, and data_exclude for your data.

  • If you already did these operations before the workshop, move the code you used into the corresponding function.

  • If you finish early, help someone else!