Untidy Tables & In-Place Mutation

Hadley has written extensively about tidy data and why it’s unsound to implement in-place data mutations. Some notes below on breaking both of those rules = ).

In-place changes to data using tidyverse.

library(tidyverse)
df <- tibble(
  "team" = c("A", "A", "B", "B", "C", "C"),
  "individual" = c(1, 2, 3, 4, 5, 6),
  "performance" = c(NA, 4, 5, 6, 2, 3),
  "affect" = c(NA, 6, 7, 8, 4, 2),
  "fav_color" = c(NA, "blue", "green", "orange", "yellow", "purple")
)
df
# A tibble: 6 × 5
  team  individual performance affect fav_color
  <chr>      <dbl>       <dbl>  <dbl> <chr>    
1 A              1          NA     NA <NA>     
2 A              2           4      6 blue     
3 B              3           5      7 green    
4 B              4           6      8 orange   
5 C              5           2      4 yellow   
6 C              6           3      2 purple   

Insert a performance, affect, and favorite color value for individual 1 within team A.

df %>% 
  filter(team == "A" & individual == 1) %>% 
  mutate(performance = 8,
         affect = 2,
         fav_color = "silver") %>% 
  rbind(df %>% filter(team != "A" & individual != 1))
# A tibble: 5 × 5
  team  individual performance affect fav_color
  <chr>      <dbl>       <dbl>  <dbl> <chr>    
1 A              1           8      2 silver   
2 B              3           5      7 green    
3 B              4           6      8 orange   
4 C              5           2      4 yellow   
5 C              6           3      2 purple   

Now for the note on untidy tables. Here’s some tidy data displayed using kable.

library(kableExtra)
dt <- tibble(
  
  'team' = c('A', 'A', 'A',
             'B', 'B', 'B',
             'C', 'C', 'C'),
  'person' = c(1,2,3,
               4,5,6,
               7,8,9),
  'score' = c(rnorm(9, 23, 3))
  
)

dt %>% 
  group_by(team) %>% 
  summarize(
    "Mean" = mean(score),
    "SD" = sd(score)
  ) %>% 
  kable() %>% 
  kable_styling()
team Mean SD
A 23.00442 2.824705
B 23.98117 2.700796
C 25.38279 1.456739

Looks great to me. The issue is that sometimes people expect to see data displayed in “untidy” formats. Let’s change the output so that each team is listed across the first row and the table displays the mean score alongside the standard deviation within parentheses.

To do so, I’m going to put string parentheses around the SD values, unite the mean and SD columns, then transform the data from long to wide format. Don’t forget to ungroup as well.

dt %>% 
  group_by(team) %>% 
  summarize(
    "Mean" = round(mean(score), digits = 2),
    "SD" = round(sd(score), digits = 2)
  ) %>% 
  ungroup() %>% 
  # insert parentheses
  mutate(SD = paste0("(", SD, ")")) %>% 
  # combine mean and SD columns
  unite(meansd, Mean, SD, sep = " ", remove = T) %>% 
  # make wide
  pivot_wider(names_from = team, values_from = meansd) %>% 
  rename("Team A" = "A",
         "Team B" = "B",
         "Team C" = "C") %>% 
  kable(caption = "Team Scores") %>% 
  kable_styling() %>% 
  footnote("Mean (SD)")
Table 1: Team Scores
Team A Team B Team C
23 (2.82) 23.98 (2.7) 25.38 (1.46)
Note:
Mean (SD)

Bo\(^2\)m =)