# Install required packages for the workshop
pkgs <- c("tidyverse", "tidymodels", "data.table", "haven", "magrittr",
"glue", "grf", "aorsf", "glmnet", "xgboost", "randomForestSRC",
"party", "riskRegression", "survival", "officer", "flextable",
"table.glue", "gtsummary", "usethis", "cli", "ggforce",
"rpart", "rpart.plot", "ranger", "withr", "gt", "recipes",
"butcher", "sandwich", "lmtest", "gbm", "officedown", "Matrix",
"ggsurvfit", "tidycmprsk", "here", "tarchetypes", "targets")
install.packages(pkgs)
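If you would like to see what is absent before installing, a minimal base R sketch follows. missing_pkgs() is a hypothetical helper written for illustration, not part of the workshop code, and it only detects packages that are not installed; it does not check whether installed versions are up to date.

```r
# missing_pkgs() is a hypothetical helper for illustration only.
# It returns the requested package names that are not installed.
missing_pkgs <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

# Packages that ship with R are always installed, so nothing is reported here.
missing_pkgs(c("stats", "utils"))
```

Passing the full package vector to this helper would list only the packages that still need to be installed.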
MELODEM data workshop
Introduction
Welcome!
Wi-Fi network name
CF-Bellinglise
Wi-Fi password
BELLINGLISE2005
Hello! My name is Byron
I am an R enthusiast.
I love dogs.
I study risk prediction + machine learning
Options
This workshop relies heavily on R, but you may prefer not to use it.
You can partner up with an R user
You can use the simulated data (makes exercises much easier)
You can relax and get ☕ during exercises
Learning under stress may not be ideal. Do what works best for you.
I will do my part to make the concepts taught here R-agnostic so that you can learn plenty of valuable things no matter what language you use.
Schedule
Session 1: Friday, 4:00pm - 5:30pm
- Introduction, data management
Session 2a: Saturday 9:30am - 11:00am
Decision trees and random forests
Break from 10:45am - 11:00am
Schedule
Session 2b: Saturday 11:00am - 12:30pm
Oblique random forests
Lunch from 12:30pm - 2:00pm
Session 3: Saturday 2:00pm - 5:30pm
Causal random forests
Break from 3:45pm - 4:00pm
Finish slides or get a head start on collaboration
Schedule
Session 4: Sunday 9:30am - 12:30 pm
Discuss manuscript aims, GitHub issues (30m)
Work in small groups (45m)
Break from 10:45am - 11:00am
Session 5: Sunday 2:00pm - 4:00 pm
Progress updates and discussion (30m)
Work in small groups (90m)
Sticky notes
While you’re working on exercises,
Place a pink sticky note on the back of your laptop if you want help.
Place a blue sticky note on the back of your laptop when you are done.
Part 1: Introduction and data management
Goals
Get our data organized.
Build familiarity with R/RStudio and git/GitHub
Learn how to use a stellar R package: targets
Plug your data into the workshop pipeline
Help a friend
Whole game
First, you pull code down from the GitHub repo.
Whole game
Next, you commit code and summary results. No data!
Whole game
Last, you push your code and summary results to the GitHub repo.
Why GitHub?
So we can work together, separately!
Store and coordinate code from multiple authors
Public facing team science
Free website for our work (i.e., this workshop).
Set-up R packages
Make sure we all have up-to-date versions of these R packages:
Pull!
Make sure you have a GitHub account with a personal access token (PAT) stored in RStudio
- Open RStudio
- Copy/paste the code on this slide into an R script
- Important: adjust destdir
- Run the code
library(usethis)
create_from_github(
  "bcjaeger/melodem-apoe4-het",
  destdir = "path/of/choice",
  fork = TRUE
)
Introducing targets
Your turn
- Open _targets.R in the melodem-apoe4-het project.
- Run library(targets) to load the targets package.
- Run tar_load_globals() to load relevant functions and packages.
- Run tar_glimpse() to inspect the pipeline.
- Run tar_make() to make the pipeline.
Start with data management
In the _targets.R file:
file_sim_tar <- tar_target(
  file_sim,
  command = "data/sim-raw.csv",
  format = 'file'
)
data_melodem_tar <- tar_target(
  data_melodem,
  data_prepare(file_sim)
)
This will be done with your data, too!
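To see why a file target is paired with a data target, here is a minimal sketch of a toy pipeline built in a temporary folder, so nothing in your project changes. The names toy.csv, file_toy, and data_toy are hypothetical, and the pipeline only runs if the targets package is installed.

```r
# A toy _targets.R script; toy.csv, file_toy, and data_toy are hypothetical names.
pipeline_script <- c(
  "library(targets)",
  "list(",
  "  tar_target(file_toy, 'toy.csv', format = 'file'),",
  "  tar_target(data_toy, read.csv(file_toy))",
  ")"
)

# Write the script and a small data file into a temporary folder.
toy_dir <- tempfile("toy-pipeline-")
dir.create(toy_dir)
writeLines(pipeline_script, file.path(toy_dir, "_targets.R"))
write.csv(data.frame(age = c(60, 70)),
          file.path(toy_dir, "toy.csv"), row.names = FALSE)

# Run the pipeline only when the targets package is available.
if (requireNamespace("targets", quietly = TRUE)) {
  old_wd <- setwd(toy_dir)
  targets::tar_make()  # first run: builds both targets
  targets::tar_make()  # second run: nothing changed, so both are skipped
  write.csv(data.frame(age = c(60, 70, 80)), "toy.csv", row.names = FALSE)
  targets::tar_make()  # the file changed, so both targets are rebuilt
  print(targets::tar_read(data_toy))
  setwd(old_wd)
}
```

Because file_toy is declared with format = 'file', targets tracks the contents of the file itself, and anything downstream of it is rebuilt when the file changes.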
Your turn
We are going to add your data to the pipeline, carefully.
Think of a name for your data.
- Example name: regards
Save a copy of your data in data/sensitive. The name of your file should be name-raw.csv or name-raw.sas7bdat, where name is your data’s name. E.g., regards-raw.csv.
Your turn
We are going to add your data to the pipeline, carefully.
Switch from the R console to the terminal.
Verify you have no uncommitted changes:
git status
Should return “nothing to commit, working tree clean”
- Create a new branch with git and switch to it:
git checkout -b regards
Your turn
We are going to add your data to the pipeline, carefully.
Copy/paste the code shown here into _targets.R, just beneath the line that starts with # real data cohorts.
Replace zzzz with the name of your data.
Save the _targets.R file.
Run tar_make() in the R console.
file_zzzz_tar <- tar_target(
  file_zzzz,
  command = "data/sensitive/zzzz-raw.csv",
  format = "file"
)
data_zzzz_tar <- tar_target(
  data_zzzz,
  data_prepare(
    file_name = "data/sensitive/zzzz-raw.csv"
  )
)
# don't forget to add these targets
# to the targets list at the bottom!
Your turn
- Run tar_read(data_zzzz), where zzzz is your data name.
tar_read(data_melodem)
------------------------------------- sprint -------------------------------------
# A tibble: 7,159 × 23
status time frailty_catg statin aspirin egfr sub_ckd age sub_cvd race4
<dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 0 11.8 Pre-frail 0 0 88.2 0 57 0 BLACK
2 0 12.4 Pre-frail 1 1 82.1 0 61 0 WHITE
3 0 10.4 Frail 1 0 97.5 0 56 0 WHITE
4 0 7.34 Pre-frail 0 0 87.3 0 63 0 WHITE
5 0 7.36 Pre-frail 0 0 68.0 0 71 1 WHITE
# ℹ 7,154 more rows
# ℹ 13 more variables: CHR <dbl>, GLUR <dbl>, HDL <dbl>, TRR <dbl>,
# UMALCR <dbl>, BMI <dbl>, sbp <dbl>, dbp <dbl>, fr_risk10yrs <dbl>,
# orth_hypo <dbl>, education <dbl>, treatment <fct>, sex <fct>
---------------------------------- exclusions ----------------------------------
# A tibble: 2 × 2
label n_obs
<glue> <int>
1 sprint participants 8541
2 Aged 55-80 years 7159
Data management
We used data_prepare() to make this object.
Let’s check out what data_prepare does.
data_prepare <- function(file_name, ...){
  output <- data_load(file_name) %>%
    data_clean() %>%
    data_derive() %>%
    data_select() %>%
    data_recode(labels = labels) %>%
    data_exclude(...)
  # checks not shown
  output
}
First, data_prepare() loads the data.
Data management
Let’s check out data_load().
data_load <- function(file_path){
  # ... file management code not shown ...
  structure(
    .Data = list(
      values = data_input,
      exclusions = tibble(label = glue("{cohort_name} participants"),
                          n_obs = nrow(data_input))
    ),
    class = c(paste("melodem", cohort_name, sep = "_"),
              'melodem_data'),
    label = cohort_label
  )
}
The object returned from data_load() includes the data and a preliminary exclusion table.
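An exclusion table of this kind records how many participants remain after each step. The sketch below shows one way such a table could be maintained in base R. exclude_step() and the toy object are hypothetical and written only for illustration; they are not the workshop's data_exclude().

```r
# exclude_step() is a hypothetical helper, not the workshop's data_exclude().
# It keeps the rows flagged by `keep` and logs how many rows remain.
exclude_step <- function(obj, label, keep) {
  obj$values <- obj$values[keep, , drop = FALSE]
  obj$exclusions <- rbind(
    obj$exclusions,
    data.frame(label = label, n_obs = nrow(obj$values))
  )
  obj
}

# A toy object with the same two parts: values and a starting exclusion table.
toy <- list(
  values = data.frame(age = c(50, 60, 70, 85)),
  exclusions = data.frame(label = "toy participants", n_obs = 4L)
)

toy <- exclude_step(
  toy,
  label = "Aged 55-80 years",
  keep = toy$values$age >= 55 & toy$values$age <= 80
)

toy$exclusions
```

Each call appends one row, so the table reads top to bottom as the sequence of exclusions applied.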
The object returned also has a customized class based on the dataset, and it belongs to a broader class called melodem_data.
Why?
Each dataset is unique, and some may require customized preparation:
Different elements need to be cleaned.
Different variables need to be derived.
Different variables may be selected.
Different exclusions may be applied.
data_load makes its output have a customized class based on the name of the dataset so that you, the owner of the data, are in control of these steps that may be uniquely defined for your data.
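A minimal sketch of how an object can carry both a dataset-specific class and the broader melodem_data class, mirroring the structure() call in data_load(). The toy object below is an assumption made for illustration; regards is the example name used earlier in the slides.

```r
# Build a toy object with two classes, mirroring the class argument in data_load().
cohort_name <- "regards"  # example data name from the slides

toy <- structure(
  list(values = data.frame(age = c(60, 70))),
  class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data")
)

class(toy)                     # the specific class comes first
inherits(toy, "melodem_data")  # the object also belongs to the broad class
```

The order of the class vector matters: R checks the first class before the second when it looks for a method.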
How?
R’s generic function system. Generic functions (e.g., plot()) dispatch to different methods depending on the class of the input object.
Here’s a look at the data_clean() method for an object of class melodem_sim:
data_clean.melodem_sim <- function(data){
  dt <- data_clean_minimal(data$values)
  dt[, age := age * 5 + 65]
  dt[, sex := fifelse(sex > 0, 1, 0)]
  dt[, sex := factor(sex, levels = c(0, 1),
                     labels = c("male", "female"))]
  data$values <- dt
  data
}
Here’s the data_clean() method for an object of class melodem_data:
data_clean.melodem_data <- function(data){
  data_clean_minimal(data)
}
Your data
Your data’s first class is melodem plus the name you picked, e.g., melodem_regards, and its second class is melodem_data. Verify by running class().
When you run data_clean() with, e.g., the regards data,
- R will look for a function called data_clean.melodem_regards
- If it doesn’t exist, R runs data_clean.melodem_data
TLDR: If you don’t write a specific function for data_clean, data_derive, etc. for your data, then these functions will not do anything to your data.
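The lookup order described here can be sketched with a toy generic in base R. toy_clean() and make_toy() are hypothetical names standing in for the workshop functions, and the fallback method below simply returns its input unchanged, which is an assumption made to keep the example short.

```r
# toy_clean() is a hypothetical generic standing in for data_clean().
toy_clean <- function(data) UseMethod("toy_clean")

# Fallback method for the broad class: returns the object unchanged.
toy_clean.melodem_data <- function(data) data

# Cohort-specific method: used only when the first class is melodem_sim.
toy_clean.melodem_sim <- function(data) {
  data$values$age <- data$values$age * 5 + 65
  data
}

# make_toy() is a hypothetical constructor that sets both classes.
make_toy <- function(cohort_name) {
  structure(
    list(values = data.frame(age = c(0, 1))),
    class = c(paste("melodem", cohort_name, sep = "_"), "melodem_data")
  )
}

toy_clean(make_toy("sim"))$values$age      # specific method rescales age
toy_clean(make_toy("regards"))$values$age  # no specific method, so the fallback runs
```

Adding a function named toy_clean.melodem_regards would be enough for the second call to pick up cohort-specific behavior, with no change to the generic.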
Your turn
For the rest of this session,
Implement specific data_clean, data_derive, data_select, and data_exclude for your data.
If you already did these operations before the workshop, move the code you used into the corresponding function.
If you finish early, help someone else!