class: center, middle, inverse, title-slide

# Iteration
## Introduction to purrr
### Byron C. Jaeger
### Last updated: 2020-07-23

---

class: inverse, center, middle

# A tedious task

---

## Multiple means

Suppose I ask you to find the mean value of every numeric variable in the synthetic CVD dataset.

```r
library(tidyverse)  # assumed setup: provides read_rds(), %>%, map(), and ggplot()
cvd <- read_rds('data/cvd.rds')

mean_status <- mean(cvd$cvd_status, na.rm = TRUE)
mean_time   <- mean(cvd$cvd_time, na.rm = TRUE)
mean_sbp    <- mean(cvd$sbp, na.rm = TRUE)
mean_dbp    <- mean(cvd$dbp, na.rm = TRUE)
mean_age    <- mean(cvd$age_number, na.rm = TRUE)
mean_hba1c  <- mean(cvd$hba1c, na.rm = TRUE)
```

What are the __good__ things about this approach?

- Simple and clear

- Easy to do

- Easy to learn

---

## Multiple means

Suppose I ask you to find the mean value of every numeric variable in the synthetic CVD dataset.

```r
cvd <- read_rds('data/cvd.rds')
# ... followed by the same six mean() calls as on the previous slide
```

What are the __bad__ things about this approach?

- Repetition increases probability of making a mistake

- Hard to update and make changes

- Does not scale well

---

## Classic iteration

In the beginning, there was the `for` loop.

- Based on some index, usually denoted as `i`.

- Requires a set of pre-defined index values (e.g., `i` = 1, 2, ..., 5)

```r
for (i in 1:5){
  
  print(i)
  
}
```

```
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
```

---

## Classic iteration

- `for` loop index options are flexible.

- e.g., loop over a set of character values instead of a set of numbers

```r
for (i in names(cvd)){
  
  print(i)
  
}
```

```
## [1] "ID"
## [1] "cvd_status"
## [1] "cvd_time"
## [1] "sbp"
## [1] "dbp"
## [1] "bp_meds"
## [1] "age_number"
## [1] "drink"
## [1] "smoke"
## [1] "hba1c"
## [1] "diabetes"
## [1] "albuminuria"
## [1] "bp_midrange"
## [1] "rec_bpmeds_acc_aha"
## [1] "rec_bpmeds_jnc7"
```

---

## Classic iteration

- `for` loops can make the `mean` task much less tedious.

- Loop over all the column names and use `if` to act only on numeric variables.

```r
for (i in names(cvd)) {

  # only numeric columns have a meaningful mean
  if (is.numeric(cvd[[i]])) {
    print(
      tbl_string('mean of {i}: {mean(cvd[[i]], na.rm = TRUE)}')
    )
  }
  
}
```

```
## [1] "mean of ID: 5,001"
## [1] "mean of cvd_status: 0.11"
## [1] "mean of cvd_time: 11"
## [1] "mean of sbp: 127"
## [1] "mean of dbp: 76"
## [1] "mean of age_number: 55"
## [1] "mean of hba1c: 6.0"
```
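
`tbl_string` is not a base R or core tidyverse function, so it may not be available in your session. As a rough sketch, base R's `sprintf` can build a similar message. The `toy` data frame below is a made-up stand-in for `cvd`, not the real data.

```r
# Hypothetical stand-in for the cvd data (invented values)
toy <- data.frame(
  sbp   = c(120, 135, NA, 142),
  dbp   = c(80, 85, 78, NA),
  smoke = c("no", "yes", "no", "no")
)

for (i in names(toy)) {
  # skip non-numeric columns such as smoke
  if (is.numeric(toy[[i]])) {
    # %s inserts the column name, %.1f the mean rounded to one decimal
    print(sprintf("mean of %s: %.1f", i, mean(toy[[i]], na.rm = TRUE)))
  }
}
```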

---

## Classic iteration

The hard thing about `for` loops: storing values

```r
# initialize empty vectors for results
mean_values <- c()
mean_names <- c()
for (i in names(cvd)) {

  if (is.numeric(cvd[[i]])) {
    # append the new mean and its name to the result vectors
    mean_values <- c(mean_values, mean(cvd[[i]], na.rm = TRUE))
    mean_names <- c(mean_names, i)
  }
  
}

names(mean_values) <- mean_names

mean_values
```

```
##           ID   cvd_status     cvd_time          sbp          dbp   age_number 
## 5000.5000000    0.1106163   10.6186771  127.2892713   75.7400810   54.7624000 
##        hba1c 
##    6.0093278
```
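
Growing a vector with `c()` works, but one common alternative is to decide on the columns first and pre-allocate the result. Here is a minimal sketch of that idea, using a made-up `toy` data frame in place of `cvd`:

```r
# Hypothetical stand-in for the cvd data (invented values)
toy <- data.frame(
  sbp   = c(120, 135, NA, 142),
  dbp   = c(80, 85, 78, NA),
  smoke = c("no", "yes", "no", "no")
)

# Decide which columns to loop over before the loop starts
numeric_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

# Pre-allocate a named double vector with one slot per numeric column
mean_values <- vector("double", length(numeric_cols))
names(mean_values) <- numeric_cols

for (i in numeric_cols) {
  # fill the slot that belongs to column i
  mean_values[[i]] <- mean(toy[[i]], na.rm = TRUE)
}

mean_values
```

The loop body no longer needs an `if`, and names and values cannot drift out of sync.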

---

## Tidy iteration

`purrr` is an R package in the tidyverse.

- designed to abstract away some of the extraneous syntax in `for` loops.

- main function is `map`, which works like a `for` loop

- works extremely well with `lists` and `dplyr` functions
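
To make the bullets above concrete, here is a small sketch of `map` on a plain list. The `readings` list is made up for illustration.

```r
library(purrr)

# A small named list of numeric vectors (invented values)
readings <- list(sbp = c(120, 135, 142), dbp = c(80, 85, 78))

# map() calls the function once per element and always returns a list
means_plain <- map(readings, mean)

# The same idea with the formula shorthand, where .x is the current element
means_formula <- map(readings, ~ mean(.x, na.rm = TRUE))

means_plain
```

No index variable and no empty result vector are needed, and the element names carry over to the output.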

---

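## Tidy iteration

We build the pipeline one step at a time. On each slide, the line marked with `*` is the step whose output is shown.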

```r
*cvd %>%
  select(where(is.numeric)) %>% 
  map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 10,000 x 15
##       ID cvd_status cvd_time   sbp   dbp bp_meds age_number
##    <int>      <dbl>    <dbl> <int> <int> <fct>        <dbl>
##  1     1          0    11.5    133    78 No              50
##  2     2          0    12.1    110    63 No              43
##  3     3          0    11.7    125    67 No              69
##  4     4          0    11.6    124    68 No              33
##  5     5          0    11.2    113    81 No              40
##  6     6          1     2.14   144    68 No              67
##  7     7          0    10.8    145    78 Yes             61
##  8     8          0     8.28   126    69 No              71
##  9     9         NA    NA      138    81 No              42
## 10    10          0    11.9    141    81 Yes             54
## # ... with 9,990 more rows, and 8 more variables:
## #   drink <fct>, smoke <fct>, hba1c <dbl>, diabetes <fct>,
## #   albuminuria <fct>, bp_midrange <fct>,
## #   rec_bpmeds_acc_aha <fct>, rec_bpmeds_jnc7 <fct>
```

---

## Tidy iteration

```r
cvd %>% 
* select(where(is.numeric)) %>%
  map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 10,000 x 7
##       ID cvd_status cvd_time   sbp   dbp age_number hba1c
##    <int>      <dbl>    <dbl> <int> <int>      <dbl> <dbl>
##  1     1          0    11.5    133    78         50   6.4
##  2     2          0    12.1    110    63         43   5  
##  3     3          0    11.7    125    67         69   4.9
##  4     4          0    11.6    124    68         33   5.1
##  5     5          0    11.2    113    81         40   6  
##  6     6          1     2.14   144    68         67   6.3
##  7     7          0    10.8    145    78         61   6.1
##  8     8          0     8.28   126    69         71   6  
##  9     9         NA    NA      138    81         42   5.3
## 10    10          0    11.9    141    81         54   4.9
## # ... with 9,990 more rows
```

---

## Tidy iteration

```r
cvd %>% 
  select(where(is.numeric)) %>% 
* map(.f = ~ mean(.x, na.rm = TRUE))
```

```
## $ID
## [1] 5000.5
## 
## $cvd_status
## [1] 0.1106163
## 
## $cvd_time
## [1] 10.61868
## 
## $sbp
## [1] 127.2893
## 
## $dbp
## [1] 75.74008
## 
## $age_number
## [1] 54.7624
## 
## $hba1c
## [1] 6.009328
```

---

## Tidy iteration

`map` has variations that can return a specific type of vector:

- `map_dbl` returns a double vector

```r
cvd %>% 
  select(where(is.numeric)) %>% 
* map_dbl(.f = ~ mean(.x, na.rm = TRUE))
```

---

## Tidy iteration

`map` has variations that can return a specific type of vector:

- `map_chr` returns a character vector

```r
cvd %>% 
  select(where(is.numeric)) %>% 
* map_chr(.f = ~ mean(.x, na.rm = TRUE))
```

```
##            ID    cvd_status      cvd_time           sbp           dbp 
## "5000.500000"    "0.110616"   "10.618677"  "127.289271"   "75.740081" 
##    age_number         hba1c 
##   "54.762400"    "6.009328"
```
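
Newer purrr releases (1.0.0 and later, as far as I can tell) warn when `map_chr` has to turn numbers into strings on its own. A hedged sketch of one way around that is to do the conversion inside the function. The `toy` list is made up, and the one-decimal format is just a choice for this example.

```r
library(purrr)

# Hypothetical stand-in for the numeric columns of cvd (invented values)
toy <- list(sbp = c(120, 135, 142), dbp = c(80, 85, 78))

# Convert explicitly so the function itself returns a string
mean_labels <- map_chr(toy, ~ sprintf("%.1f", mean(.x, na.rm = TRUE)))

mean_labels
```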

---

## Tidy iteration

`map` has variations that can return a specific type of output:

- `map_df` binds results into a tibble.

```r
cvd %>% 
  select(where(is.numeric)) %>% 
* map_df(.f = ~ mean(.x, na.rm = TRUE))
```

```
## # A tibble: 1 x 7
##      ID cvd_status cvd_time   sbp   dbp age_number hba1c
##   <dbl>      <dbl>    <dbl> <dbl> <dbl>      <dbl> <dbl>
## 1 5000.      0.111     10.6  127.  75.7       54.8  6.01
```
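
For this particular task, a one-row tibble of means can also be built with `dplyr` alone. This is only a sketch of an alternative, not a replacement for `map_df`, and the `toy` tibble is a made-up stand-in for `cvd`.

```r
library(dplyr)

# Hypothetical stand-in for the cvd data (invented values)
toy <- tibble(
  sbp   = c(120, 135, NA, 142),
  dbp   = c(80, 85, 78, NA),
  smoke = c("no", "yes", "no", "no")
)

# across() applies the function to every column picked by where(is.numeric)
mean_tbl <- toy %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

mean_tbl
```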

---

## Mapping example

Let's say we are trying to guess the regression slope for the following problem.

```r
df <- tibble(x = rnorm(250), y = 3 * x + 1 + rnorm(250, sd = 5))
ggplot(df, aes(x = x, y = y)) + geom_point()
```

![](index_files/figure-html/unnamed-chunk-16-1.png)

---

## Mapping example

We want to select a regression slope value that minimizes the sum of squared differences between predicted values (red line) and observed values (gray points).

```r
ggplot(df, aes(x = x, y = y)) +
  geom_point(shape = 21, col = 'black', fill = 'grey') +
  geom_smooth(method = 'lm', col = 'red', se = FALSE)
```

![](index_files/figure-html/unnamed-chunk-17-1.png)

---

## Mapping example

So how about we try a whole bunch of different slope values for our line and see which one minimizes the sum of squared differences?

![](index_files/figure-html/unnamed-chunk-18-1.png)

---

## Mapping example

First, we'll work through a single case by hand:

```r
estimated_intercept <- 3
estimated_slope <- 1

predictions <- estimated_intercept + estimated_slope * df$x

# mean squared error
sq_error <- (predictions - df$y)^2
mse <- mean(sq_error)
mse
```

```
## [1] 35.45463
```

---

## Mapping example

Then we'll try a second case with a different slope:

```r
estimated_intercept <- 3
estimated_slope <- 1.5

predictions <- estimated_intercept + estimated_slope * df$x

# mean squared error
sq_error <- (predictions - df$y)^2
mse <- mean(sq_error)
mse
```

```
## [1] 34.00117
```

---

## Mapping example

Now we `map`.

```r
slopes <- seq(1, 5, length.out = 1000)

results <- map_dbl(
  .x = slopes,
  .f = ~ {
    estimated_intercept <- 3
    estimated_slope <- .x
    
    predictions <- estimated_intercept + estimated_slope * df$x
    
    # mean squared error
    sq_error <- (predictions - df$y)^2
    mse <- mean(sq_error)
    mse
  }
)
```
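
Once the errors are computed, picking the winner is a one-liner. The sketch below repeats the setup so it runs on its own. The seed and the helper name `mse_for_slope` are my additions. Minimizing the mean squared error picks the same slope as minimizing the sum of squared differences, because the two differ only by the constant factor `n`.

```r
library(purrr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(250)
y <- 3 * x + 1 + rnorm(250, sd = 5)

slopes <- seq(1, 5, length.out = 1000)

# Same calculation as in the map_dbl() call, wrapped in a helper (hypothetical name)
mse_for_slope <- function(slope, intercept = 3) {
  mean((intercept + slope * x - y)^2)
}

results <- map_dbl(slopes, mse_for_slope)

# The candidate slope with the smallest mean squared error
best_slope <- slopes[which.min(results)]

# Closed-form least-squares slope when the intercept is held fixed at 3
exact_slope <- sum(x * (y - 3)) / sum(x^2)

c(grid = best_slope, exact = exact_slope)
```

The grid answer should land within one grid step of the closed-form value, which is a handy sanity check on the mapping code.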

---

## Mapping example

And then we plot the results with `ggplot`.

![](index_files/figure-html/unnamed-chunk-22-1.png)
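
The code behind this figure is not shown on the slide, so here is a rough sketch of how a plot like it could be drawn. The seed, the `plot_data` name, and the axis labels are my assumptions, and the setup is repeated with base R so the block runs on its own.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(250)
y <- 3 * x + 1 + rnorm(250, sd = 5)

slopes <- seq(1, 5, length.out = 1000)

# Mean squared error for each candidate slope, intercept held at 3
results <- vapply(
  slopes,
  function(slope) mean((3 + slope * x - y)^2),
  numeric(1)
)

# One row per candidate slope
plot_data <- data.frame(slope = slopes, mse = results)

p <- ggplot(plot_data, aes(x = slope, y = mse)) +
  geom_line() +
  labs(x = "Candidate slope", y = "Mean squared error")

p
```

The curve is a parabola in the slope, and its lowest point marks the slope with the smallest error.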

---

## Learning more

To learn more, see

- The excellent purrr [tutorials](https://jennybc.github.io/purrr-tutorial/)

- The purrr [website](https://purrr.tidyverse.org/index.html)