MELODEM data workshop

Introduction

Author

Affiliation

Byron C. Jaeger, PhD

Wake Forest University School of Medicine

Welcome!

Wi-Fi network name

CF-Bellinglise

Wi-Fi password

BELLINGLISE2005

Hello! My name is Byron

I am an R enthusiast.

I love dogs.

I study risk prediction + machine learning

Options

This workshop leans heavily on R, but you may prefer not to use it.

You can partner up with an R user
You can use the simulated data (makes exercises much easier)
You can relax and get ☕ during exercises

Learning under stress may not be ideal. Do what works best for you.

I will do my part to make the concepts taught here R-agnostic so that you can learn plenty of valuable things no matter what language you use.

Schedule

Session 1: Friday, 4:00pm - 5:30pm

Introduction, data management

Session 2a: Saturday 9:30am - 11:00am

Decision trees and random forests
Break from 10:45am - 11:00am

Schedule

Session 2b: Saturday 11:00am - 12:30pm

Oblique random forests
Lunch from 12:30pm - 2:00pm

Session 3: Saturday 2:00pm - 5:30pm

Causal random forests
Break from 3:45pm - 4:00pm
Finish slides or get a head start on collaboration

Schedule

Session 4: Sunday 9:30am - 12:30pm

Discuss manuscript aims, GitHub issues (30m)
Work in small groups (45m)
Break from 10:45am - 11:00am

Session 5: Sunday 2:00pm - 4:00pm

Progress updates and discussion (30m)
Work in small groups (90m)

Sticky notes

While you’re working on exercises,

Place a pink sticky note on the back of your laptop if you want help.
Place a blue sticky note on the back of your laptop when you are done.

Part 1: Introduction and data management

Goals

Get our data organized.

Build familiarity with R/RStudio and git/GitHub
Learn how to use a stellar R package: targets
Plug your data into the workshop pipeline
Help a friend

Whole game

First, you pull code down from the GitHub repo.

Whole game

Next, you commit code and summary results. No data!

Whole game

Last, you push your code and summary results to the GitHub repo.

Why GitHub?

So we can work together, separately!

Store and coordinate code from multiple authors
Public-facing team science
Free website for our work (i.e., this workshop).

Set-up R packages

Make sure we all have up-to-date versions of these R packages:

# Install required packages for the workshop
pkgs <- 
  c("tidyverse", "tidymodels", "data.table", "haven", "magrittr",
    "glue", "grf", "aorsf", "glmnet", "xgboost", "randomForestSRC",
    "party", "riskRegression", "survival", "officer", "flextable", 
    "table.glue", "gtsummary", "usethis", "cli", "ggforce",
    "rpart", "rpart.plot", "ranger", "withr", "gt", "recipes", 
    "butcher", "sandwich", "lmtest", "gbm", "officedown", "Matrix",
    "ggsurvfit", "tidycmprsk", "here", "tarchetypes", "targets")

install.packages(pkgs)
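
If you would rather not reinstall everything, one option is to install only what is missing. The sketch below is a minimal illustration, not part of the workshop code: `find_missing_pkgs()` is a hypothetical helper, and it only checks whether a package is present, not whether it is up to date.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed.
# `installed` defaults to the packages found in the current library paths.
find_missing_pkgs <- function(pkgs,
                              installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Abbreviated list for illustration; use the full `pkgs` vector in practice.
pkgs_to_check <- c("stats", "utils")

missing_pkgs <- find_missing_pkgs(pkgs_to_check)

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install only the missing packages:
  # install.packages(missing_pkgs)
} else {
  message("All listed packages are installed.")
}
```

To refresh packages that are already installed but outdated, base R's `update.packages()` is one route.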

Pull!

Make sure you have a GitHub account with a personal access token (PAT) stored in RStudio

Open RStudio
Copy/paste the code on this slide into an R script
Important: adjust destdir
Run

library(usethis)

create_from_github(
  "bcjaeger/melodem-apoe4-het",
  destdir = "path/of/choice", 
  fork = TRUE
)

Introducing `targets`

Your turn

Open _targets.R in the melodem-apoe4-het project.
Run library(targets) to load the targets package.
Run tar_load_globals() to load relevant functions and packages.
Run tar_glimpse() to inspect the pipeline.
Run tar_make() to make the pipeline.

05:00

Start with data management

In the _targets.R file:

file_sim_tar <- tar_target(
  file_sim,
  command = "data/sim-raw.csv",
  format = 'file'
)

data_melodem_tar <- tar_target(
  data_melodem,
  data_prepare(file_sim)
)

This will be done with your data, too!

Your turn

We are going to add your data to the pipeline, carefully.

Think of a name for your data.
- Example name: regards
Save a copy of your data in data/sensitive. The name of your file should be name-raw.csv or name-raw.sas7bdat, where name is your data’s name. E.g., regards-raw.csv
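
The naming convention above can be sketched as a small helper. `raw_data_path()` is a hypothetical function written for illustration only; it is not part of the workshop repository.

```r
# Hypothetical helper: build the expected path for a cohort's raw data file.
# Only the two extensions named in the instructions are allowed.
raw_data_path <- function(name, ext = c("csv", "sas7bdat")) {
  ext <- match.arg(ext)
  file.path("data", "sensitive", paste0(name, "-raw.", ext))
}

raw_data_path("regards")
raw_data_path("regards", ext = "sas7bdat")
```

Running `file.exists(raw_data_path("regards"))` from the project root is a quick way to confirm the copy landed in the right place.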

03:00

Your turn

We are going to add your data to the pipeline, carefully.

Switch from the R console to the terminal.
Verify you have no uncommitted changes:

git status

Should return “nothing to commit, working tree clean”

Create a new branch with git:

git checkout -b regards

03:00

Your turn

We are going to add your data to the pipeline, carefully.

Copy/paste code shown here to _targets.R, just beneath the line that starts with # real data cohorts.
Replace zzzz with the name of your data.
Save the _targets.R file
Run tar_make() in the R console.

file_zzzz_tar <- tar_target(               
  file_zzzz,
  command = "data/sensitive/zzzz-raw.csv",
  format = "file"
)

data_zzzz_tar <- tar_target(
  data_zzzz,
  # refer to the upstream file target (as the simulated data does)
  # so that changes to the file trigger a rebuild
  data_prepare(file_name = file_zzzz)
)

# don't forget to add these targets
# to the targets list at the bottom!

05:00

Your turn

Run tar_read(data_zzzz), where zzzz is your data name.

tar_read(data_melodem)

------------------------------------- sprint ------------------------------------- 
# A tibble: 7,159 × 23
  status  time frailty_catg statin aspirin  egfr sub_ckd   age sub_cvd race4
   <dbl> <dbl> <fct>         <dbl>   <dbl> <dbl>   <dbl> <dbl>   <dbl> <fct>
1      0 11.8  Pre-frail         0       0  88.2       0    57       0 BLACK
2      0 12.4  Pre-frail         1       1  82.1       0    61       0 WHITE
3      0 10.4  Frail             1       0  97.5       0    56       0 WHITE
4      0  7.34 Pre-frail         0       0  87.3       0    63       0 WHITE
5      0  7.36 Pre-frail         0       0  68.0       0    71       1 WHITE
# ℹ 7,154 more rows
# ℹ 13 more variables: CHR <dbl>, GLUR <dbl>, HDL <dbl>, TRR <dbl>,
#   UMALCR <dbl>, BMI <dbl>, sbp <dbl>, dbp <dbl>, fr_risk10yrs <dbl>,
#   orth_hypo <dbl>, education <dbl>, treatment <fct>, sex <fct>

 ----------------------------------  exclusions  ---------------------------------- 
# A tibble: 2 × 2
  label               n_obs
  <glue>              <int>
1 sprint participants  8541
2 Aged 55-80 years     7159

Data management

We used data_prepare() to make this object.

Let’s check out what data_prepare does.

data_prepare <- function(file_name, ...){

  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_recode(labels = labels) %>%
    data_exclude(...)
  
  # checks not shown

  output

}

Data management

We used data_prepare() to make this object.

Let’s check out what data_prepare does. First, it loads the data.

data_prepare <- function(file_name, ...){

  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_exclude(...)

  check_names(output$values,
              c("age", "sex", "apoe4", "time", "status"))

  output

}
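
The source of check_names() is not shown on these slides. As a rough idea of what such a check might do, here is a sketch under that assumption: `check_names_sketch()` is a hypothetical stand-in that stops with an informative message when a required column is missing.

```r
# Hypothetical stand-in for check_names(): error if a required column is absent.
check_names_sketch <- function(data, required) {
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ",
         paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  invisible(data)
}

# Toy data that lacks the `status` column
toy <- data.frame(age = 70, sex = "female", apoe4 = 1, time = 2.5)

# Capture the error message instead of halting the script
result <- tryCatch(
  check_names_sketch(toy, c("age", "sex", "apoe4", "time", "status")),
  error = function(e) conditionMessage(e)
)

print(result)
```

Failing early like this keeps a cohort with a missing variable from silently flowing into later steps of the pipeline.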

Data management

Let’s check out data_load().

data_load <- function(file_path){

    # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"), 
              'melodem_data'),
    label = cohort_label
  )
  
}

Data management

Let’s check out data_load(). The object returned from this function includes data and a preliminary exclusion table.

data_load <- function(file_path){

    # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"), 
              'melodem_data'),
    label = cohort_label
  )
  
}

Data management

The returned object also has a customized class based on the dataset, and it belongs to a broader class called melodem_data.

data_load <- function(file_path){

  # ... file management code not shown ...
  
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"), 
              'melodem_data'),
    label = cohort_label
  )
  
}
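
To see what structure() produces, here is a toy version you can run in the console. The cohort name, label, and data are made up for illustration, a base data.frame stands in for the tibble, and the class name assumes the melodem_ prefix.

```r
cohort_name <- "regards"                                      # hypothetical cohort name
data_input  <- data.frame(age = c(60, 72), status = c(0, 1))  # toy data

obj <- structure(
  .Data = list(
    values = data_input,
    exclusions = data.frame(
      label = paste(cohort_name, "participants"),
      n_obs = nrow(data_input)
    )
  ),
  class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data"),
  label = "REGARDS"                                           # hypothetical label
)

class(obj)                     # cohort-specific class first, then melodem_data
inherits(obj, "melodem_data")  # TRUE
attr(obj, "label")             # the label is stored as an attribute
obj$exclusions                 # one-row exclusion table
```

The object is still an ordinary list underneath, so `obj$values` and `obj$exclusions` work as usual; the class vector only affects which methods get dispatched.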

Why?

Each dataset is unique, and some may require customized preparation:

Different elements need to be cleaned.
Different variables need to be derived.
Different variables may be selected.
Different exclusions may be applied.

data_load gives its output a customized class based on the name of the dataset so that you, the owner of the data, control the steps that may be uniquely defined for your data.

How?

R’s generic function system. Generic functions (e.g., plot()) dispatch different methods depending on the type of input object.

Here’s a look at the method for cleaning an object of class melodem_sim:

data_clean.melodem_sim <- function(data){

  dt <- data_clean_minimal(data$values)

  dt[, age := age * 5 + 65]
  dt[, sex := fifelse(sex > 0, 1, 0)]
  dt[, sex := factor(sex, levels = c(0, 1),
                     labels = c("male", "female"))]
  data$values <- dt
  data
}

How?

R’s generic function system. Generic functions (e.g., plot()) dispatch different methods depending on the type of input object.

Here’s the fallback method for cleaning any object of class melodem_data:

data_clean.melodem_data <- function(data){

  data_clean_minimal(data)

}

Your data

Your data’s first class is melodem plus the name you picked, e.g., melodem_regards, and its second class is melodem_data. Verify by running class() on your data object.
When you run data_clean() with, e.g., the regards data,
- R will look for a function called data_clean.melodem_regards
- If it doesn’t exist, R runs data_clean.melodem_data

TLDR: If you don’t write a specific function for data_clean, data_derive, etc. for your data, then these functions will not do anything to your data.
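
The fallback behavior can be tried out with a self-contained toy. This is a sketch, not the workshop's code: the generic, both methods, and `make_toy()` are simplified stand-ins defined here using base R only.

```r
# Generic: dispatches on the class vector of `data`
data_clean <- function(data) UseMethod("data_clean")

# Fallback method for anything of class melodem_data
data_clean.melodem_data <- function(data) {
  message("melodem_data method: no cohort-specific cleaning")
  data
}

# Cohort-specific method, used only for melodem_sim objects
data_clean.melodem_sim <- function(data) {
  message("melodem_sim method: cohort-specific cleaning")
  data$values$age <- data$values$age * 5 + 65
  data
}

# Hypothetical constructor for toy objects
make_toy <- function(cohort_name) {
  structure(
    list(values = data.frame(age = c(-1, 0, 1))),
    class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data")
  )
}

sim_clean     <- data_clean(make_toy("sim"))      # uses data_clean.melodem_sim
regards_clean <- data_clean(make_toy("regards"))  # falls back to melodem_data

sim_clean$values$age      # 60 65 70
regards_clean$values$age  # -1 0 1 (unchanged)
```

Once you define `data_clean.melodem_regards`, the second call would pick it up automatically with no change to the pipeline code.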

Your turn

For the rest of this session,

Implement specific data_clean, data_derive, data_select, and data_exclude for your data.
If you already did these operations before the workshop, move the code you used into the corresponding function.
If you finish early, help someone else!
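
As a starting point, a cohort-specific exclusion method might look like the sketch below. Everything here is an assumption for illustration: the real generic's signature may differ, the pipeline stores exclusions as a tibble rather than a data.frame, and the age range is only an example.

```r
# Simplified stand-in for the pipeline's generic
data_exclude <- function(data, ...) UseMethod("data_exclude")

# Hypothetical method for a cohort named "regards"
data_exclude.melodem_regards <- function(data, age_range = c(55, 80), ...) {
  # keep participants inside the age range
  keep <- data$values$age >= age_range[1] & data$values$age <= age_range[2]
  data$values <- data$values[keep, , drop = FALSE]

  # record how many participants remain after this exclusion
  data$exclusions <- rbind(
    data$exclusions,
    data.frame(
      label = paste0("Aged ", age_range[1], "-", age_range[2], " years"),
      n_obs = nrow(data$values)
    )
  )

  data
}

# Toy object with the same list layout that data_load() returns
toy <- structure(
  list(
    values = data.frame(age = c(50, 60, 70, 85)),
    exclusions = data.frame(label = "regards participants", n_obs = 4L)
  ),
  class = c("melodem_regards", "melodem_data")
)

out <- data_exclude(toy)
out$exclusions
```

Appending one row per exclusion step gives you a running tally that can later be turned into a participant flow table.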
