Iteration

You can download this .qmd file from here. Just hit the Download Raw File button.

This leans on parts of R4DS Chapter 26: Iteration, in addition to parts of the first edition of R4DS.

# Initial packages required
library(tidyverse)

Iteration

Reducing duplication of code will reduce errors and make debugging much easier. We’ve already seen how functions (Ch 25) can help reduce duplication by extracting repeated patterns of code. Another tool is iteration, when you find you’re doing the same thing to multiple inputs – repeating the same operation on different columns or datasets.

Here we’ll see two important iteration paradigms: imperative programming and functional programming.

Imperation programming for iteration

Examples: for loops and while loops

Pros: relatively easy to learn, make iteration very explicit so it’s obvious what’s happening, not as inefficient as some people believe

Cons: require lots of bookkeeping code that’s duplicated for every loop

Every for loop has three components:

  1. output - plan ahead and allocate enough space for output
  2. sequence - determines what to loop over; cycles through different values of \(i\)
  3. body - code that does the work; run repeatedly with different values of \(i\)
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
df
# A tibble: 10 × 4
         a        b      c       d
     <dbl>    <dbl>  <dbl>   <dbl>
 1 -1.15    1.41    -1.34   0.0917
 2  0.0835 -0.357   -1.47   1.32  
 3  0.0599  0.719   -0.444  0.103 
 4  1.63   -0.364   -1.20   0.501 
 5 -2.06    1.53     0.497 -1.02  
 6  1.06    2.08     0.663  0.881 
 7 -0.253  -1.07     0.509 -0.0608
 8 -0.230  -1.03     0.414  1.26  
 9  0.210  -1.10    -2.23   0.867 
10 -1.05    0.00930  0.557 -0.138 
# want median of each column (w/o cutting and pasting)
#   Be careful using square brackets vs double square brackets when
#   selecting elements
median(df[[1]])
[1] -0.08498661
median(df[1])
Error in median.default(df[1]): need numeric data
df[1]
# A tibble: 10 × 1
         a
     <dbl>
 1 -1.15  
 2  0.0835
 3  0.0599
 4  1.63  
 5 -2.06  
 6  1.06  
 7 -0.253 
 8 -0.230 
 9  0.210 
10 -1.05  
df[[1]]
 [1] -1.15291482  0.08350447  0.05987359  1.62738676 -2.05597192  1.06219806
 [7] -0.25327413 -0.22984681  0.21007370 -1.05352815
class(df[1])
[1] "tbl_df"     "tbl"        "data.frame"
class(df[[1]])
[1] "numeric"
# basic for loop to take median of each column
output <- vector("double", ncol(df))  # 1. output
for (i in 1:4) {                      # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
[1] -0.08498661 -0.17386353 -0.01533625  0.30223811
# ?seq_along - a safer option if had zero length vectors
output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
[1] -0.08498661 -0.17386353 -0.01533625  0.30223811
# use [[.]] even if don't have to signal working with single elements

# alternative solution - don't hardcode in "4"
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[[i]])        # 3. body
}
output
[1] -0.08498661 -0.17386353 -0.01533625  0.30223811
# another approach - no double square brackets since df not a tibble
df <- as.data.frame(df)
output <- vector("double", ncol(df))  # 1. output
for(i in 1:ncol(df)) {                # 2. sequence
  output[i] <- median(df[,i])         # 3. body
}
output
[1] -0.08498661 -0.17386353 -0.01533625  0.30223811

One advantage of seq_along(): works with unknown output length. However, the second approach below is much more efficient, since each iteration doesn’t copy all data from previous iterations.

[Pause to Ponder:] What does the code below do? Be prepared to explain both chunks line-by-line!

# for loop: unknown output length

means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)        ## inefficient
 num [1:115] -0.277 -1.08 -2.203 0.686 -1.584 ...
out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)           ## more efficient
List of 3
 $ : num [1:91] 1.946 -1.146 1.125 -0.718 1.042 ...
 $ : num [1:95] -0.584 0.12 -1.468 1.471 0.412 ...
 $ : num [1:72] 1.75 2.4 1.56 1.66 2.65 ...
str(unlist(out))   ## flatten list of vectors into single vector
 num [1:258] 1.946 -1.146 1.125 -0.718 1.042 ...

Finally, the while() loop can be used with unknown sequence length. This is used more in simulation than in data analysis.

[Pause to Ponder:] What does the following code do?

flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
[1] 17

Using iteration for simulation

This applet contains data from a 2005 study on the use of dolphin-facilitated therapy on the treatment of depression. In that study, 10 of the 15 subjects (67%) assigned to dolphin therapy showed improvement, compared to only 3 of the 15 subjects (20%) assigned to the control group. But with such small sample sizes, is this significant evidence that the dolphin group had greater improvement of their depressive symptoms? To answer that question, we can use simulation to conduct a randomization test.

We will simulate behavior in the “null world” where there is no real effect of treatment. In that case, the 13 total improvers would have improved no matter the treatment assigned, and the 17 total non-improvers would have not improved no matter the treatment assigned. So in the “null world”, treatment is a meaningless label that can be just as easily shuffled among subjects without any effect. In that world, the fact we observed a 47 percentage point difference in success rates (67 - 20) was just random luck. But we should ask: how often would we expect a difference as large as 47% by chance, assuming we’re living in the null world where there is no effect of treatment?

You could think about simulating this situation with the following steps:

  1. write code to calculate the difference in success rates in the observed data

  2. write a loop to calculate the differences in success rates from 1000 simulated data sets from the null world. Store those 1000 simulated differences

  3. calculate how often we found a difference in the null world as large as that found in the observed data. In statistics, when this probability is below .05, we typically reject the null world, and conclude that there is likely a real difference between the two groups (i.e. a “statistically significant” difference)

[Pause to Ponder:] Fill in Step 2 in the second R chunk below to carry out the three steps above. (The first R chunk provides some preliminary code.) Then describe what you can conclude from this study based on your plot and “p_value” from Step 3.

### Preliminary code ###

# generate a tibble with our observed data
dolphin_data <- tibble(treatment = rep(c("Dolphin", "Control"), each = 15),
                       improve = c(rep("Yes", 10), rep("No", 5), 
                                   rep("Yes", 3), rep("No", 12)))
print(dolphin_data, n = Inf)
# A tibble: 30 × 2
   treatment improve
   <chr>     <chr>  
 1 Dolphin   Yes    
 2 Dolphin   Yes    
 3 Dolphin   Yes    
 4 Dolphin   Yes    
 5 Dolphin   Yes    
 6 Dolphin   Yes    
 7 Dolphin   Yes    
 8 Dolphin   Yes    
 9 Dolphin   Yes    
10 Dolphin   Yes    
11 Dolphin   No     
12 Dolphin   No     
13 Dolphin   No     
14 Dolphin   No     
15 Dolphin   No     
16 Control   Yes    
17 Control   Yes    
18 Control   Yes    
19 Control   No     
20 Control   No     
21 Control   No     
22 Control   No     
23 Control   No     
24 Control   No     
25 Control   No     
26 Control   No     
27 Control   No     
28 Control   No     
29 Control   No     
30 Control   No     
# `sample()` can be used to shuffle the treatments among the 30 subjects
sample(dolphin_data$treatment)
 [1] "Control" "Dolphin" "Dolphin" "Control" "Control" "Control" "Control"
 [8] "Dolphin" "Control" "Dolphin" "Dolphin" "Dolphin" "Dolphin" "Dolphin"
[15] "Dolphin" "Control" "Control" "Control" "Dolphin" "Control" "Dolphin"
[22] "Control" "Control" "Dolphin" "Dolphin" "Dolphin" "Control" "Control"
[29] "Dolphin" "Control"
### Fill in Step 2 and remove "eval: FALSE" ###

# Step 1
dolphin_summary <- dolphin_data |>
  group_by(treatment) |>
  summarize(prop_yes = mean(improve == "Yes"))
dolphin_summary
observed_diff <- dolphin_summary[[2]][2] - dolphin_summary[[2]][1]

# Step 2

### Write a loop to create 1000 simulated differences from the null world

# Step 3
null_world <- tibble(simulated_diffs = simulated_diffs)
ggplot(null_world, aes(x = simulated_diffs)) +
  geom_histogram() +
  geom_vline(xintercept = observed_diff, color = "red")

p_value <- sum(abs(simulated_diffs) >= abs(observed_diff)) / 1000
p_value

You have written code to conduct a randomization test for the difference in two proportions, a powerful test of statistical significance that is demonstrated in the original applet!

Functional programming for iteration

Examples: map functions and across()

Pros: less code, fewer errors, code that’s easier to read; takes advantage of fact that R is a functional programming language

Cons: little more complicated to master vocabulary and use – a step up in abstraction

R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly. Passing one function to another is a very powerful coding approach!!

# Below you can avoid writing separate functions for mean, median, 
#   SD, etc. by column
df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
[1]  0.1816327 -0.1419651 -0.3460145 -0.4993033
col_summary(df, mean)
[1]  0.02882953 -0.09418887 -0.19108437 -0.30853424
col_summary(df, IQR)
[1] 2.0741200 0.7074219 1.0894442 0.6626714

The purrr package provides map functions to eliminate need for for loops, plus it makes code easier to read!

# using map functions for summary stats by column as above
map_dbl(df, mean)
          a           b           c           d 
 0.02882953 -0.09418887 -0.19108437 -0.30853424 
map_dbl(df, median)
         a          b          c          d 
 0.1816327 -0.1419651 -0.3460145 -0.4993033 
map_dbl(df, sd)
        a         b         c         d 
1.3271205 0.5119352 0.8464530 0.8991931 
map_dbl(df, mean, trim = 0.5)
         a          b          c          d 
 0.1816327 -0.1419651 -0.3460145 -0.4993033 
# map_dbl means make a double vector
# can also do map() for list, map_lgl(), map_int(), and map_chr()

# even more clear
df |> map_dbl(mean)
          a           b           c           d 
 0.02882953 -0.09418887 -0.19108437 -0.30853424 
df |> map_dbl(median)
         a          b          c          d 
 0.1816327 -0.1419651 -0.3460145 -0.4993033 
df |> map_dbl(sd)
        a         b         c         d 
1.3271205 0.5119352 0.8464530 0.8991931 

The across() function from dplyr also works well:

df |> summarize(
  n = n(),
  across(.cols = a:d, .fns = median, .names = "median_{.col}")
)
# A tibble: 1 × 5
      n median_a median_b median_c median_d
  <int>    <dbl>    <dbl>    <dbl>    <dbl>
1    10    0.182   -0.142   -0.346   -0.499
# other ways to repeat across the numeric columns of df:
df |> summarize(
  n = n(),
  across(everything(), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10    0.182   -0.142   -0.346   -0.499       10
df |> summarize(
  n = n(),
  across(where(is.numeric), median, .names = "median_{.col}")
)
# A tibble: 1 × 6
      n median_a median_b median_c median_d median_n
  <int>    <dbl>    <dbl>    <dbl>    <dbl>    <int>
1    10    0.182   -0.142   -0.346   -0.499       10
# Here "across" effectively expands to the following code.  Note that 
#   across() will write over old columns unless you change the name!
df |> 
  summarize(
    median_a = median(a),
    median_b = median(b),
    median_c = median(c),
    median_d = median(d),
    n = n()
  )
# A tibble: 1 × 5
  median_a median_b median_c median_d     n
     <dbl>    <dbl>    <dbl>    <dbl> <int>
1    0.182   -0.142   -0.346   -0.499    10
# And if we're worried about NAs, we can't call median directly, we
#   must create a new function that we can pass options into
df_miss <- df
df_miss[2, 1] <- NA
df_miss[4:5, 2] <- NA
df_miss
# A tibble: 10 × 4
        a        b       c       d
    <dbl>    <dbl>   <dbl>   <dbl>
 1  0.511 -0.670   -1.52   -0.348 
 2 NA     -0.300   -1.02   -0.0179
 3  0.836  0.217   -0.635  -1.00  
 4  2.31  NA        0.709   1.78  
 5 -0.148 NA       -0.586  -0.676 
 6  0.913 -0.653   -0.369   0.429 
 7 -0.525  0.771    1.25   -0.792 
 8 -1.40  -0.00696  0.644  -0.436 
 9  1.05  -0.0164  -0.0663 -0.563 
10 -1.80  -0.268   -0.323  -1.46  
df_miss |> 
  summarize(
    across(
      a:d,
      list(
        median = \(x) median(x, na.rm = TRUE),
        n_miss = \(x) sum(is.na(x))
      ),
      .names = "{.fn}_{.col}"
    ),
    n = n(),
  )
# A tibble: 1 × 9
  median_a n_miss_a median_b n_miss_b median_c n_miss_c median_d n_miss_d     n
     <dbl>    <int>    <dbl>    <int>    <dbl>    <int>    <dbl>    <int> <int>
1    0.511        1   -0.142        2   -0.346        0   -0.499        0    10
# where \ is shorthand for an anonymous function - i.e. you could
#   replace "\" with "function" if you like typing more letters :)

# across-like functions can also be used with filter():

# same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d))
df_miss |> filter(if_any(a:d, is.na))
# A tibble: 3 × 4
       a      b      c       d
   <dbl>  <dbl>  <dbl>   <dbl>
1 NA     -0.300 -1.02  -0.0179
2  2.31  NA      0.709  1.78  
3 -0.148 NA     -0.586 -0.676 
# same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d))
df_miss |> filter(if_all(a:d, is.na))
# A tibble: 0 × 4
# ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl>

When you input a list of functions (like the lubridate functions below), across() assigns default names as columnname_functionname:

library(lubridate)
expand_dates <- function(df) {
  df |> 
    mutate(
      across(where(is.Date), list(year = year, month = month, day = mday))
    )
}

df_date <- tibble(
  name = c("Amy", "Bob"),
  date = ymd(c("2009-08-03", "2010-01-16"))
)

df_date |> 
  expand_dates()
# A tibble: 2 × 5
  name  date       date_year date_month date_day
  <chr> <date>         <dbl>      <dbl>    <int>
1 Amy   2009-08-03      2009          8        3
2 Bob   2010-01-16      2010          1       16

Here the default is to summarize all numeric columns, but as with all functions, we can override the default if we choose:

summarize_means <- function(df, summary_vars = where(is.numeric)) {
  df |> 
    summarize(
      across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)),
      n = n(),
      .groups = "drop"
    )
}
diamonds |> 
  group_by(cut) |> 
  summarize_means()
# A tibble: 5 × 9
  cut       carat depth table price     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   64.0  59.1 4359.  6.25  6.18  3.98  1610
2 Good      0.849  62.4  58.7 3929.  5.84  5.85  3.64  4906
3 Very Good 0.806  61.8  58.0 3982.  5.74  5.77  3.56 12082
4 Premium   0.892  61.3  58.7 4584.  5.97  5.94  3.65 13791
5 Ideal     0.703  61.7  56.0 3458.  5.51  5.52  3.40 21551
diamonds |> 
  group_by(cut) |> 
  summarize_means(c(carat, x:z))
# A tibble: 5 × 6
  cut       carat     x     y     z     n
  <ord>     <dbl> <dbl> <dbl> <dbl> <int>
1 Fair      1.05   6.25  6.18  3.98  1610
2 Good      0.849  5.84  5.85  3.64  4906
3 Very Good 0.806  5.74  5.77  3.56 12082
4 Premium   0.892  5.97  5.94  3.65 13791
5 Ideal     0.703  5.51  5.52  3.40 21551

pivot_longer() with group_by() and summarize() also provides a nice solution:

long <- df |> 
  pivot_longer(a:d) |> 
  group_by(name) |> 
  summarize(
    median = median(value),
    mean = mean(value)
  )
long
# A tibble: 4 × 3
  name  median    mean
  <chr>  <dbl>   <dbl>
1 a      0.182  0.0288
2 b     -0.142 -0.0942
3 c     -0.346 -0.191 
4 d     -0.499 -0.309 

Here are a couple of other nice features of map functions: - perform analyses (like fitting a line) by subgroup - extracting components from a model or elements by position

# fit linear model to each group based on cylinder
#   - split designed to split into new dfs (unlike group_by)
#   - map returns a vector or list, which can be limiting
map = purrr::map
models <- split(mtcars, mtcars$cyl) |>
  map(function(df) lm(mpg ~ wt, data = df))
models
$`4`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  
models[[1]]

Call:
lm(formula = mpg ~ wt, data = df)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  
# shortcut using purrr - 1-sided formulas
models <- split(mtcars, mtcars$cyl) |> 
  map(~lm(mpg ~ wt, data = .))
models
$`4`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     39.571       -5.647  


$`6`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
      28.41        -2.78  


$`8`

Call:
lm(formula = mpg ~ wt, data = .)

Coefficients:
(Intercept)           wt  
     23.868       -2.192  
# extract named components from each model
str(models)
List of 3
 $ 4:List of 12
  ..$ coefficients : Named num [1:2] 39.57 -5.65
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  .. ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  .. ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.3 1.5
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 9
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e473ef60> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   11 obs. of  2 variables:
  .. ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  .. ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e473ef60> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 6:List of 12
  ..$ coefficients : Named num [1:2] 28.41 -2.78
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:7] -0.125 0.584 1.929 -0.69 0.355 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ effects      : Named num [1:7] -52.235 -2.427 2.111 -0.353 0.679 ...
  .. ..- attr(*, "names")= chr [1:7] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:7] 21.1 20.4 19.5 18.8 18.8 ...
  .. ..- attr(*, "names")= chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:7, 1:2] -2.646 0.378 0.378 0.378 0.378 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:7] "Mazda RX4" "Mazda RX4 Wag" "Hornet 4 Drive" "Valiant" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.38 1.12
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 5
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e4846078> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   7 obs. of  2 variables:
  .. ..$ mpg: num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
  .. ..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e4846078> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
 $ 8:List of 12
  ..$ coefficients : Named num [1:2] 23.87 -2.19
  .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
  ..$ residuals    : Named num [1:14] 2.374 -1.741 1.455 1.61 -0.381 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ effects      : Named num [1:14] -56.499 -6.003 0.816 1.22 -0.807 ...
  .. ..- attr(*, "names")= chr [1:14] "(Intercept)" "wt" "" "" ...
  ..$ rank         : int 2
  ..$ fitted.values: Named num [1:14] 16.3 16 14.9 15.7 15.6 ...
  .. ..- attr(*, "names")= chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  ..$ assign       : int [1:2] 0 1
  ..$ qr           :List of 5
  .. ..$ qr   : num [1:14, 1:2] -3.742 0.267 0.267 0.267 0.267 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:14] "Hornet Sportabout" "Duster 360" "Merc 450SE" "Merc 450SL" ...
  .. .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. .. ..- attr(*, "assign")= int [1:2] 0 1
  .. ..$ qraux: num [1:2] 1.27 1.11
  .. ..$ pivot: int [1:2] 1 2
  .. ..$ tol  : num 1e-07
  .. ..$ rank : int 2
  .. ..- attr(*, "class")= chr "qr"
  ..$ df.residual  : int 12
  ..$ xlevels      : Named list()
  ..$ call         : language lm(formula = mpg ~ wt, data = .)
  ..$ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e4a15e78> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..$ model        :'data.frame':   14 obs. of  2 variables:
  .. ..$ mpg: num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
  .. ..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
  .. ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. .. ..$ : chr "wt"
  .. .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. .. ..- attr(*, "order")= int 1
  .. .. .. ..- attr(*, "intercept")= int 1
  .. .. .. ..- attr(*, "response")= int 1
  .. .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e4a15e78> 
  .. .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
  ..- attr(*, "class")= chr "lm"
str(models[[1]])
List of 12
 $ coefficients : Named num [1:2] 39.57 -5.65
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ effects      : Named num [1:11] -88.433 10.171 0.695 6.231 1.728 ...
  ..- attr(*, "names")= chr [1:11] "(Intercept)" "wt" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:11] 26.5 21.6 21.8 27.1 30.5 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:11, 1:2] -3.317 0.302 0.302 0.302 0.302 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.3 1.5
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 9
 $ xlevels      : Named list()
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x00000210e473ef60> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ model        :'data.frame':  11 obs. of  2 variables:
  ..$ mpg: num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
  ..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language mpg ~ wt
  .. .. ..- attr(*, "variables")= language list(mpg, wt)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. .. ..$ : chr "wt"
  .. .. ..- attr(*, "term.labels")= chr "wt"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: 0x00000210e473ef60> 
  .. .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 - attr(*, "class")= chr "lm"
str(summary(models[[1]]))
List of 11
 $ call         : language lm(formula = mpg ~ wt, data = .)
 $ terms        :Classes 'terms', 'formula'  language mpg ~ wt
  .. ..- attr(*, "variables")= language list(mpg, wt)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "mpg" "wt"
  .. .. .. ..$ : chr "wt"
  .. ..- attr(*, "term.labels")= chr "wt"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: 0x00000210e473ef60> 
  .. ..- attr(*, "predvars")= language list(mpg, wt)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "mpg" "wt"
 $ residuals    : Named num [1:11] -3.6701 2.8428 1.0169 5.2523 -0.0513 ...
  ..- attr(*, "names")= chr [1:11] "Datsun 710" "Merc 240D" "Merc 230" "Fiat 128" ...
 $ coefficients : num [1:2, 1:4] 39.57 -5.65 4.35 1.85 9.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
 $ aliased      : Named logi [1:2] FALSE FALSE
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "wt"
 $ sigma        : num 3.33
 $ df           : int [1:3] 2 9 2
 $ r.squared    : num 0.509
 $ adj.r.squared: num 0.454
 $ fstatistic   : Named num [1:3] 9.32 1 9
  ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
 $ cov.unscaled : num [1:2, 1:2] 1.701 -0.705 -0.705 0.308
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "(Intercept)" "wt"
  .. ..$ : chr [1:2] "(Intercept)" "wt"
 - attr(*, "class")= chr "summary.lm"
models |>
  map(summary) |> 
  map_dbl("r.squared")
        4         6         8 
0.5086326 0.4645102 0.4229655 
# can use integer to select elements by position
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x |> map_dbl(2)
[1] 2 5 8

Iterative techniques for reading multiple files

library(readxl)

# our usual path doesn't work with excel files
# read_excel("https://joeroith.github.io/264_spring_2025/Data/gapminder/1952.xlsx")

# this will work if .xlsx files live in a Data folder that's at the same
#   level as your .qmd file
gap1952 <- read_excel("Data/gapminder/1952.xlsx")
gap1957 <- read_excel("Data/gapminder/1957.xlsx")

Since the 1952 and 1957 data have the same 5 columns, if we want to combine this data into a single data set showing time trends, we could simply bind_rows() (after adding a 6th column for year)

gap1952 <- gap1952 |>
  mutate(year = 1952)
gap1957 <- gap1957 |>
  mutate(year = 1957)
gap_data <- bind_rows(gap1952, gap1957)

Of course, with 10 more years worth of data still left to read in and merge, this process could get pretty onerous. Section 26.3 shows how to automate this process in 3 steps:

  1. use list.files() to list all the files in a directory
paths <- list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE)
paths
 [1] "Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx"
 [3] "Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx"
 [5] "Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx"
 [7] "Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx"
 [9] "Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx"
[11] "Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx"
  1. use purrr::map() to read each of them into a list (we will discuss lists more in 06_data_types.qmd)
gap_files <- map(paths, readxl::read_excel)
length(gap_files)
[1] 12
str(gap_files)
List of 12
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 28.8 55.2 43.1 30 62.5 ...
  ..$ pop      : num [1:142] 8425333 1282697 9279525 4232095 17876956 ...
  ..$ gdpPercap: num [1:142] 779 1601 2449 3521 5911 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 30.3 59.3 45.7 32 64.4 ...
  ..$ pop      : num [1:142] 9240934 1476505 10270856 4561361 19610538 ...
  ..$ gdpPercap: num [1:142] 821 1942 3014 3828 6857 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 32 64.8 48.3 34 65.1 ...
  ..$ pop      : num [1:142] 10267083 1728137 11000948 4826015 21283783 ...
  ..$ gdpPercap: num [1:142] 853 2313 2551 4269 7133 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 34 66.2 51.4 36 65.6 ...
  ..$ pop      : num [1:142] 11537966 1984060 12760499 5247469 22934225 ...
  ..$ gdpPercap: num [1:142] 836 2760 3247 5523 8053 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 36.1 67.7 54.5 37.9 67.1 ...
  ..$ pop      : num [1:142] 13079460 2263554 14760787 5894858 24779799 ...
  ..$ gdpPercap: num [1:142] 740 3313 4183 5473 9443 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 38.4 68.9 58 39.5 68.5 ...
  ..$ pop      : num [1:142] 14880372 2509048 17152804 6162675 26983828 ...
  ..$ gdpPercap: num [1:142] 786 3533 4910 3009 10079 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 39.9 70.4 61.4 39.9 69.9 ...
  ..$ pop      : num [1:142] 12881816 2780097 20033753 7016384 29341374 ...
  ..$ gdpPercap: num [1:142] 978 3631 5745 2757 8998 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 40.8 72 65.8 39.9 70.8 ...
  ..$ pop      : num [1:142] 13867957 3075321 23254956 7874230 31620918 ...
  ..$ gdpPercap: num [1:142] 852 3739 5681 2430 9140 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.7 71.6 67.7 40.6 71.9 ...
  ..$ pop      : num [1:142] 16317921 3326498 26298373 8735988 33958947 ...
  ..$ gdpPercap: num [1:142] 649 2497 5023 2628 9308 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 41.8 73 69.2 41 73.3 ...
  ..$ pop      : num [1:142] 22227415 3428038 29072015 9875024 36203463 ...
  ..$ gdpPercap: num [1:142] 635 3193 4797 2277 10967 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 42.1 75.7 71 41 74.3 ...
  ..$ pop      : num [1:142] 25268405 3508512 31287142 10866106 38331121 ...
  ..$ gdpPercap: num [1:142] 727 4604 5288 2773 8798 ...
 $ : tibble [142 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ country  : chr [1:142] "Afghanistan" "Albania" "Algeria" "Angola" ...
  ..$ continent: chr [1:142] "Asia" "Europe" "Africa" "Africa" ...
  ..$ lifeExp  : num [1:142] 43.8 76.4 72.3 42.7 75.3 ...
  ..$ pop      : num [1:142] 31889923 3600523 33333216 12420476 40301927 ...
  ..$ gdpPercap: num [1:142] 975 5937 6223 4797 12779 ...
gap_files[[1]]   # pull off the first object in the list (i.e. 1952 data)
# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 132 more rows
  1. use purrr::list_rbind() to combine them into a single data frame
gap_tidy <- list_rbind(gap_files)
class(gap_tidy)
[1] "tbl_df"     "tbl"        "data.frame"
gap_tidy
# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

We could even do all steps in a single pipeline:

list.files("Data/gapminder", pattern = "[.]xlsx$", full.names = TRUE) |>
  map(readxl::read_excel) |>
  list_rbind()
# A tibble: 1,704 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         28.8  8425333      779.
 2 Albania     Europe       55.2  1282697     1601.
 3 Algeria     Africa       43.1  9279525     2449.
 4 Angola      Africa       30.0  4232095     3521.
 5 Argentina   Americas     62.5 17876956     5911.
 6 Australia   Oceania      69.1  8691212    10040.
 7 Austria     Europe       66.8  6927772     6137.
 8 Bahrain     Asia         50.9   120447     9867.
 9 Bangladesh  Asia         37.5 46886859      684.
10 Belgium     Europe       68    8730405     8343.
# ℹ 1,694 more rows

Note that we are lacking a 6th column with the year represented by each row of data. Here is one way to solve that issue:

# This extracts file names, which are carried along as data frame names by
#   map functions
paths |> set_names(basename)
                 1952.xlsx                  1957.xlsx 
"Data/gapminder/1952.xlsx" "Data/gapminder/1957.xlsx" 
                 1962.xlsx                  1967.xlsx 
"Data/gapminder/1962.xlsx" "Data/gapminder/1967.xlsx" 
                 1972.xlsx                  1977.xlsx 
"Data/gapminder/1972.xlsx" "Data/gapminder/1977.xlsx" 
                 1982.xlsx                  1987.xlsx 
"Data/gapminder/1982.xlsx" "Data/gapminder/1987.xlsx" 
                 1992.xlsx                  1997.xlsx 
"Data/gapminder/1992.xlsx" "Data/gapminder/1997.xlsx" 
                 2002.xlsx                  2007.xlsx 
"Data/gapminder/2002.xlsx" "Data/gapminder/2007.xlsx" 
# The middle line ensures that each of the 12 data frames in the list for
#   gap_files has a name determined by its filepath, unlike the gap_files
#   we created in step 2 above, which had no names (we could only identify
#   data frames by their position)
gap_files <- paths |> 
  set_names(basename) |>  
  map(readxl::read_excel)

# Now we can extract a particular year by its name:
gap_files[["1962.xlsx"]]
# A tibble: 142 × 5
   country     continent lifeExp      pop gdpPercap
   <chr>       <chr>       <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia         32.0 10267083      853.
 2 Albania     Europe       64.8  1728137     2313.
 3 Algeria     Africa       48.3 11000948     2551.
 4 Angola      Africa       34    4826015     4269.
 5 Argentina   Americas     65.1 21283783     7133.
 6 Australia   Oceania      70.9 10794968    12217.
 7 Austria     Europe       69.5  7129864    10751.
 8 Bahrain     Asia         56.9   171863    12753.
 9 Bangladesh  Asia         41.2 56839289      686.
10 Belgium     Europe       70.2  9218400    10991.
# ℹ 132 more rows
# Finally, take advantage of the `names_to` argument in list_rbind to 
#   create that 6th column with `year`
gap_tidy <- paths |> 
  set_names(basename) |> 
  map(readxl::read_excel) |> 
  list_rbind(names_to = "year") |> 
  mutate(year = parse_number(year))

You could then save your result using write_csv so you don’t have to run the reading and wrangling code every time!

On Your Own

  1. Compute the mean of every column of the mtcars data set using (a) a for loop, (b) a map function, (c) the across() function, and (d) pivot_longer().

  2. Write a function that prints the mean of each numeric column in a data frame. Try it on the iris data set. (Hint: keep(is.numeric))

  3. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}
out
[1] "abcdefghijklmnopqrstuvwxyz"
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}
out
  [1]  0.6893619  1.4042819  2.0077847  2.2822652  2.3227460  2.3416952
  [7]  2.6538528  3.1927217  3.4925728  4.3044673  4.5061334  4.9250880
 [13]  5.6434140  5.6979402  6.3135131  6.7924267  7.2320004  8.2108159
 [19]  8.5673972  9.5237295  9.6441774 10.5774514 11.4454355 11.6679920
 [25] 11.7176301 12.2649093 12.3905001 12.6230936 12.6790302 13.4304206
 [31] 13.7662465 14.4321998 15.0075245 15.0256092 15.9279683 16.8659135
 [37] 17.1683920 18.0914741 18.2799029 19.0841170 19.7329325 20.4384807
 [43] 21.4120651 21.9686174 22.8576337 23.1560227 23.3756385 23.5120127
 [49] 23.9192568 24.1464942 25.1305495 25.2865869 26.1741305 26.7135711
 [55] 27.6342829 28.0916828 28.1811252 28.5048283 28.6600228 28.7286863
 [61] 29.3243154 29.4266304 29.9259725 30.6010346 31.2714237 31.8655282
 [67] 32.7006807 33.2990713 34.2684129 34.5955244 35.2485176 36.2201558
 [73] 36.6511104 37.4870354 37.5929115 38.3038871 38.9788436 39.4336698
 [79] 40.3673601 40.4606749 41.2806644 42.0925873 42.6247889 43.0647370
 [85] 43.9434074 44.8335387 45.4346060 46.1020342 46.1740449 46.4073601
 [91] 46.9903895 47.1505518 47.1569493 48.1279247 49.0823653 49.6461595
 [97] 49.7773567 49.9864203 50.8327248 51.4922226
  1. Compute the number of unique values in each column of the iris data set using at least 2 of your favorite iteration methods. Bonus points if you can use pivot_longer()!

  2. Carefully explain each step in the pipeline below:

show_missing <- function(df, group_vars, summary_vars = everything()) {
  df |> 
    group_by(pick({{ group_vars }})) |> 
    summarize(
      across({{ summary_vars }}, \(x) sum(is.na(x))),
      .groups = "drop"
    ) |>
    select(where(\(x) any(x > 0)))
}
nycflights13::flights |> show_missing(c(year, month, day))
# A tibble: 365 × 9
    year month   day dep_time dep_delay arr_time arr_delay tailnum air_time
   <int> <int> <int>    <int>     <int>    <int>     <int>   <int>    <int>
 1  2013     1     1        4         4        5        11       0       11
 2  2013     1     2        8         8       10        15       2       15
 3  2013     1     3       10        10       10        14       2       14
 4  2013     1     4        6         6        6         7       2        7
 5  2013     1     5        3         3        3         3       1        3
 6  2013     1     6        1         1        1         3       0        3
 7  2013     1     7        3         3        3         3       1        3
 8  2013     1     8        4         4        4         7       1        7
 9  2013     1     9        5         5        7         9       2        9
10  2013     1    10        3         3        3         3       2        3
# ℹ 355 more rows
  1. Write a function called summary_stats() that allows a user to input a tibble, numeric variables in that tibble, and summary statistics that they would like to see for each variable. Using across(), the function’s output should look like the example below.
summary_stats(mtcars, 
              vars = c(mpg, hp, wt), 
              stat_fcts = list(mean = mean, 
                               median = median, 
                               sd = sd, 
                               IQR = IQR))

#  mpg_mean mpg_median   mpg_sd mpg_IQR  hp_mean hp_median    hp_sd hp_IQR
#1 20.09062       19.2 6.026948   7.375 146.6875       123 68.56287   83.5
#  wt_mean wt_median     wt_sd  wt_IQR  n
#1 3.21725     3.325 0.9784574 1.02875 32
  1. The power of a statistical test is the probability that it rejects the null hypothesis when the null hypothesis is false. In other words, it’s the probability that a statistical test can detect when a true difference exists. The power depends on a number of factors, including:
  • sample size
  • type I error level (probability of declaring there is a statistically significant difference when there really isn’t)
  • variability in the data
  • size of the true difference

The following steps can be followed to simulate a power calculation using iteration techniques:

  1. generate simulated data where there is a true difference or effect

  2. run your desired test on the simulated data and record if the null hypothesis was rejected or not (i.e. if the p-value was below .05)

  3. repeat (a)-(b) a large number of times and record the total proportion of times that the null hypothesis was rejected; that proportion is the power of your test under those conditions

Create a power curve for a two-sample t-test by filling in Step C below and then removing eval: FALSE:

# Step A

# set parameters for two-sample t-test
mean1 <- 100   # mean response in Group 1
truediff <- 5    # true mean difference between Groups 1 and 2
mean2 <- mean1 + truediff   # mean response in Group 2
sd1 <- 10   # standard deviation in Group 1
sd2 <- 10   # standard deviation in Group 2
n1 <- 20    # sample size in Group 1
n2 <- 20    # sample size in Group 2
numsims <- 1000   # number of simulations (iterations) to run

# generate sample data for Groups 1 and 2 based on normal distributions
#   with the parameters above (note that there is truly a difference in means!)
samp1 <- rnorm(n1, mean1, sd1)
samp2 <- rnorm(n2, mean2, sd2)

# organize the simulated data into a tibble
sim_data <- tibble(response = c(samp1, samp2), 
       group = c(rep("Group 1", n1), rep("Group 2", n2)))
sim_data

# Step B

# exploratory analysis of the simulated data
mosaic::favstats(response ~ group, data = sim_data)
ggplot(sim_data, aes(x = response, y = group)) +
  geom_boxplot()

# run a two-sample t-test to see if there is a significant difference
#   in means between Groups 1 and 2 (i.e. is the p-value < .05?)
p_value <- t.test(x = samp1, y = samp2)$p.value
p_value
p_value < .05   # if TRUE, then we reject the null hypothesis and conclude
                #   there is a statistically significant difference

# Step C

# find the power = proportion of time null is rejected when
#   true difference is not 0 (i.e. number of simulated data sets that
#   result in p-values below .05)