Christopher R. Dishop: Reveal Hidden NAs in Longitudinal Data

Longitudinal data sets often have hidden NAs when they are in long-form. For example, in the data set below Zoe is missing on days 2 and 4, but it isn’t obvious because there are no specific “NA’s” within the data.

   time   id q1 q2
1     1  Jac  4  3
2     1 Jess  5  2
3     1  Zoe  3  4
4     2  Jac  6  1
5     2 Jess  7  2
6     3  Jac  5  3
7     3 Jess  4  4
8     3  Zoe  3  2
9     4  Jac  4  3
10    4 Jess  5  4

Usually I recommend cleaning within the tidyverse package, but in this case I prefer reshape. Change the data frame to wide

library(reshape2)
wide_cd <- reshape(cd, timevar = 'time', idvar = 'id', direction = 'wide')

and then back to long to reveal the hidden NA’s.

cd_reveal <- reshape(wide_cd, timevar = 'time', idvar = 'id', direction = 'long')
cd_reveal

         id time q1.1 q2.1
Jac.1   Jac    1    4    3
Jess.1 Jess    1    5    2
Zoe.1   Zoe    1    3    4
Jac.2   Jac    2    6    1
Jess.2 Jess    2    7    2
Zoe.2   Zoe    2   NA   NA
Jac.3   Jac    3    5    3
Jess.3 Jess    3    4    4
Zoe.3   Zoe    3    3    2
Jac.4   Jac    4    4    3
Jess.4 Jess    4    5    4
Zoe.4   Zoe    4   NA   NA

It is possible to do all of this within tidyverse, but it’s tricky because the spread command only applies to one column (the value parameter only takes one entry), so anytime your data frame contains multiple columns to spread over (almost always the case) then spread does not work well.

library(tidyverse)
cd %>%
 spread(key = time, value = q1)

    id q2  1  2  3  4
1  Jac  1 NA  6 NA NA
2  Jac  3  4 NA  5  4
3 Jess  2  5  7 NA NA
4 Jess  4 NA NA  4  5
5  Zoe  2 NA NA  3 NA
6  Zoe  4  3 NA NA NA

Notice how it only used q1. The proper way to go from long to wide and then back to long to reveal the NA’s using tidyverse is either of the following:

cd %>%
  select(time, id, q1) %>%
  spread(key = time, value = q1) %>%
  gather(key = time, value = 'q1', '1','2','3','4') # string code needed

     id time q1
1   Jac    1  4
2  Jess    1  5
3   Zoe    1  3
4   Jac    2  6
5  Jess    2  7
6   Zoe    2 NA
7   Jac    3  5
8  Jess    3  4
9   Zoe    3  3
10  Jac    4  4
11 Jess    4  5
12  Zoe    4 NA

time_string <- as.character(unique(cd$time))

cd %>%
  select(time, id, q1) %>%
  spread(key = time, value = q1) %>%
  gather(key = time, value = 'q1', time_string) # string code not needed due to pre-allocation

     id time q1
1   Jac    1  4
2  Jess    1  5
3   Zoe    1  3
4   Jac    2  6
5  Jess    2  7
6   Zoe    2 NA
7   Jac    3  5
8  Jess    3  4
9   Zoe    3  3
10  Jac    4  4
11 Jess    4  5
12  Zoe    4 NA

Again, I prefer reshape because the spread commands in tidyverse are not easy to read.

Bo\(^2\)m =)