class: center, middle, inverse, title-slide

# Iteration
## Introduction to purrr
### Byron C. Jaeger
### Last updated: 2020-07-23

---
class: inverse, center, middle

# A tedious task

---

## Multiple means

Suppose I ask you to find the mean value of every numeric variable in the synthetic CVD dataset.

```r
library(tidyverse) # provides read_rds(), %>%, select(), map(), and ggplot()

cvd <- read_rds('data/cvd.rds')

mean_status <- mean(cvd$cvd_status, na.rm = TRUE)
mean_time   <- mean(cvd$cvd_time, na.rm = TRUE)
mean_sbp    <- mean(cvd$sbp, na.rm = TRUE)
mean_dbp    <- mean(cvd$dbp, na.rm = TRUE)
mean_age    <- mean(cvd$age_number, na.rm = TRUE)
mean_hba1c  <- mean(cvd$hba1c, na.rm = TRUE)
```

What are the __good__ things about this approach?

- Simple and clear

- Easy to do

- Easy to learn

---

## Multiple means

Suppose I ask you to find the mean value of every numeric variable in the synthetic CVD dataset.

```r
library(tidyverse) # provides read_rds(), %>%, select(), map(), and ggplot()

cvd <- read_rds('data/cvd.rds')

mean_status <- mean(cvd$cvd_status, na.rm = TRUE)
mean_time   <- mean(cvd$cvd_time, na.rm = TRUE)
mean_sbp    <- mean(cvd$sbp, na.rm = TRUE)
mean_dbp    <- mean(cvd$dbp, na.rm = TRUE)
mean_age    <- mean(cvd$age_number, na.rm = TRUE)
mean_hba1c  <- mean(cvd$hba1c, na.rm = TRUE)
```

What are the __bad__ things about this approach?

- Repetition increases the probability of making a mistake

- Hard to update and make changes

- Does not scale well

---

## Classic iteration

In the beginning, there was the `for` loop.

- Based on some index, usually denoted as `i`.

- Requires a set of pre-defined index values (e.g., `i` = 1, 2, and 3)

```r
for (i in 1:5){
  print(i)
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---

## Classic iteration

- `for` loop index options are flexible.
    - e.g., loop over a set of character values instead of a set of numbers

```r
for (i in names(cvd)){
  print(i)
}
```

```
## [1] "ID"
## [1] "cvd_status"
## [1] "cvd_time"
## [1] "sbp"
## [1] "dbp"
## [1] "bp_meds"
## [1] "age_number"
## [1] "drink"
## [1] "smoke"
## [1] "hba1c"
## [1] "diabetes"
## [1] "albuminuria"
## [1] "bp_midrange"
## [1] "rec_bpmeds_acc_aha"
## [1] "rec_bpmeds_jnc7"
```

---

## Classic iteration

- `for` loops can make the `mean` task much less tedious.

- Loop over all the names and use `if` to act only on numeric variables.

```r
for (i in names(cvd)) {
  # only numeric columns have a meaningful mean
  if (is.numeric(cvd[[i]])) {
    print(
      # tbl_string() (tblStrings package) fills in and rounds the values in braces
      tbl_string('mean of {i}: {mean(cvd[[i]], na.rm = TRUE)}')
    )
  }
}
```

```
## [1] "mean of ID: 5,001"
## [1] "mean of cvd_status: 0.11"
## [1] "mean of cvd_time: 11"
## [1] "mean of sbp: 127"
## [1] "mean of dbp: 76"
## [1] "mean of age_number: 55"
## [1] "mean of hba1c: 6.0"
```

---

## Classic iteration

The hard thing about `for` loops: storing values

```r
# initialize empty vectors for results
mean_values <- c()
mean_names <- c()

for (i in names(cvd)) {
  if (is.numeric(cvd[[i]])) {
    # append the new mean (and its name) to the result vectors
    mean_values <- c(mean_values, mean(cvd[[i]], na.rm = TRUE))
    mean_names <- c(mean_names, i)
  }
}

names(mean_values) <- mean_names

mean_values
```

```
##           ID   cvd_status     cvd_time          sbp          dbp   age_number
## 5000.5000000    0.1106163   10.6186771  127.2892713   75.7400810   54.7624000
##        hba1c
##    6.0093278
```

---

## Tidy iteration

`purrr` is an R package in the tidyverse.

- designed to abstract away some of the extraneous syntax in `for` loops.
- main function is `map`, which works like a `for` loop

- works extremely well with lists and `dplyr` functions

---

## Tidy iteration

```r
*cvd %>%
  select(where(is.numeric)) %>%
  map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 10,000 x 15
##       ID cvd_status cvd_time   sbp   dbp bp_meds age_number
##    <int>      <dbl>    <dbl> <int> <int> <fct>        <dbl>
##  1     1          0    11.5    133    78 No              50
##  2     2          0    12.1    110    63 No              43
##  3     3          0    11.7    125    67 No              69
##  4     4          0    11.6    124    68 No              33
##  5     5          0    11.2    113    81 No              40
##  6     6          1     2.14   144    68 No              67
##  7     7          0    10.8    145    78 Yes             61
##  8     8          0     8.28   126    69 No              71
##  9     9         NA    NA      138    81 No              42
## 10    10          0    11.9    141    81 Yes             54
## # ... with 9,990 more rows, and 8 more variables:
## #   drink <fct>, smoke <fct>, hba1c <dbl>, diabetes <fct>,
## #   albuminuria <fct>, bp_midrange <fct>,
## #   rec_bpmeds_acc_aha <fct>, rec_bpmeds_jnc7 <fct>
```

---

## Tidy iteration

```r
cvd %>%
* select(where(is.numeric)) %>%
  map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 10,000 x 7
##       ID cvd_status cvd_time   sbp   dbp age_number hba1c
##    <int>      <dbl>    <dbl> <int> <int>      <dbl> <dbl>
##  1     1          0    11.5    133    78         50   6.4
##  2     2          0    12.1    110    63         43   5
##  3     3          0    11.7    125    67         69   4.9
##  4     4          0    11.6    124    68         33   5.1
##  5     5          0    11.2    113    81         40   6
##  6     6          1     2.14   144    68         67   6.3
##  7     7          0    10.8    145    78         61   6.1
##  8     8          0     8.28   126    69         71   6
##  9     9         NA    NA      138    81         42   5.3
## 10    10          0    11.9    141    81         54   4.9
## # ... with 9,990 more rows
```

---

## Tidy iteration

```r
cvd %>%
  select(where(is.numeric)) %>%
* map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## $ID
## [1] 5000.5
##
## $cvd_status
## [1] 0.1106163
##
## $cvd_time
## [1] 10.61868
##
## $sbp
## [1] 127.2893
##
## $dbp
## [1] 75.74008
##
## $age_number
## [1] 54.7624
##
## $hba1c
## [1] 6.009328
```

---

## Tidy iteration

`map` has variations that can return a specific type of vector:

- `map_dbl` returns a double vector

```r
cvd %>%
  select(where(is.numeric)) %>%
* map_dbl(.f = ~ mean(.x, na.rm = TRUE))
```

```
##           ID   cvd_status     cvd_time          sbp          dbp   age_number
## 5000.5000000    0.1106163   10.6186771  127.2892713   75.7400810   54.7624000
##        hba1c
##    6.0093278
```

---

## Tidy iteration

`map` has variations that can return a specific type of vector:

- `map_chr` returns a character vector

```r
cvd %>%
  select(where(is.numeric)) %>%
* map_chr(.f = ~ mean(.x, na.rm = TRUE))
```

```
##            ID    cvd_status      cvd_time           sbp           dbp
## "5000.500000"    "0.110616"   "10.618677"  "127.289271"   "75.740081"
##    age_number         hba1c
##   "54.762400"    "6.009328"
```

---

## Tidy iteration

`map` has variations that can return a specific type of vector:

- `map_df` binds results into a tibble.

```r
cvd %>%
  select(where(is.numeric)) %>%
* map_df(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 1 x 7
##      ID cvd_status cvd_time   sbp   dbp age_number hba1c
##   <dbl>      <dbl>    <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 5000.      0.111     10.6  127.  75.7       54.8  6.01
```

---

## Mapping example

Let's say we are trying to guess the regression slope for the following problem.

```r
df <- tibble(x = rnorm(250), y = 3 * x + 1 + rnorm(250, sd = 5))

ggplot(df, aes(x = x, y = y)) +
  geom_point()
```

<!-- -->

---

## Mapping example

We want to select a regression slope value that minimizes the sum of squared differences between predicted values (red line) and observed values (gray points).
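
Written out, the quantity being minimized looks like this. The notation is an assumption made for this sketch and does not appear elsewhere in the slides: $a$ is the intercept (held fixed), $b$ is the candidate slope, and $n$ is the number of observations.

$$
\mathrm{SSE}(b) = \sum_{i=1}^{n} \bigl(y_i - (a + b\,x_i)\bigr)^2,
\qquad
\mathrm{MSE}(b) = \frac{1}{n}\,\mathrm{SSE}(b)
$$

Because $n$ does not depend on $b$, the slope that minimizes the mean squared error is also the slope that minimizes the sum of squared differences.
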
```r
ggplot(df, aes(x = x, y = y)) +
  geom_point(shape = 21, col = 'black', fill = 'grey') +
  geom_smooth(method = 'lm', col = 'red', se = FALSE)
```

<!-- -->

---

## Mapping example

So how about we try a whole bunch of different slope values for our line and see which one minimizes the sum of squared differences?

<!-- -->

---

## Mapping example

First, we'll work through a single case.

```r
estimated_intercept <- 3
estimated_slope <- 1

predictions <- estimated_intercept + estimated_slope * df$x

# mean squared error
sq_error <- (predictions - df$y)^2
mse <- mean(sq_error)

mse
```

```
## [1] 35.45463
```

---

## Mapping example

Then we'll try a second case with a different slope.

```r
estimated_intercept <- 3
estimated_slope <- 1.5

predictions <- estimated_intercept + estimated_slope * df$x

# mean squared error
sq_error <- (predictions - df$y)^2
mse <- mean(sq_error)

mse
```

```
## [1] 34.00117
```

---

## Mapping example

Now we `map`.

```r
# candidate slopes to evaluate
slopes <- seq(1, 5, length.out = 1000)

results <- map_dbl(
  .x = slopes,
  .f = ~ {

    estimated_intercept <- 3
    estimated_slope <- .x  # the current candidate slope

    predictions <- estimated_intercept + estimated_slope * df$x

    # mean squared error
    sq_error <- (predictions - df$y)^2
    mse <- mean(sq_error)

    mse

  }
)
```

---

## Mapping example

And then we `ggplot`.

<!-- -->

---

## Learning more

To learn more, see

- The excellent purrr [tutorials](https://jennybc.github.io/purrr-tutorial/)

- The purrr [website](https://purrr.tidyverse.org/index.html)
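
---

## Mapping example

Returning to the mapping example, here is a minimal sketch of how the best slope might be read off the mapped results. It is an illustration, not part of the original analysis: the seed, the helper `mse_for_slope()`, and the closed-form check are assumptions introduced here. The intercept is held at 3, as in the earlier slides.

```r
library(purrr)

set.seed(329) # arbitrary seed, chosen only to make this sketch reproducible

# simulate data the same way as the mapping example
x <- rnorm(250)
y <- 3 * x + 1 + rnorm(250, sd = 5)

# hypothetical helper: mean squared error for one candidate slope,
# with the intercept held fixed
mse_for_slope <- function(slope, intercept = 3) {
  predictions <- intercept + slope * x
  mean((predictions - y)^2)
}

# evaluate every candidate slope on the grid
slopes <- seq(1, 5, length.out = 1000)
results <- map_dbl(.x = slopes, .f = mse_for_slope)

# slope on the grid with the smallest mean squared error
best_slope <- slopes[which.min(results)]

# least-squares slope when the intercept is fixed at 3, for comparison
closed_form <- sum(x * (y - 3)) / sum(x^2)

c(grid = best_slope, closed_form = closed_form)
```

Passing a named function to `map_dbl` instead of a `~ { ... }` block keeps the loop body testable on its own. The grid answer should agree with the closed-form value to within the grid spacing.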