The Premature Covariate

A replication of Patricia Cohen’s wonderful, “problem of the premature covariate” (chapter 2 in Collins & Horn, 1991). Here is a simple version of the problem. Imagine that we want to know the influence of a life event, like meeting a friend, on happiness. We conduct a study where we measure people’s happiness at time one, wait two weeks, and then measure their happiness again along with whether or not they met a friend since we last observed them. To assess the relationship between meeting a friend and happiness, we regress post happiness on both pre happiness and whether or not they met a friend. That is, our regression takes the form:

Happy\(_{post}\) ~ Happy\(_{pre}\) + Met_Friend.

What Dr. Cohen draws attention to is that the coefficient relating “met friend” to happiness will be biased if (a) there is some non-zero probability of reverse causality and (b) we do not measure “met friend” exactly when it occurs. Remember, in our mock study we assessed both happiness and whether or not someone met a friend during our post-measure. Is that exactly when different people across the sample actually met a new friend? Perhaps, but most of our sample either did or did not meet a friend at some unknown time within the past two weeks; Dr. Cohen’s point is that this unknown is an issue to linger on.

Simulation Explanation

If Dr. Cohen’s issue is worthwhile, then, under a system where “met friend” truly does not influence happiness, we should be able to make “met friend” appear to influence happiness based on the points she raises. That is, when “met friend” does not influence happiness we should be able to make it appear so if we create a system where (1) happiness instead influences the probability of meeting a friend (reverse causality) and (2) we do not measure “met friend” exactly when it occurs. Below, we generate data where “met friend” does not influence happiness but the coefficient relating “met friend” to happiness will still be significant (and large) because of the issues raised.

Here are the steps to the simulation:

  1. Start with a random value for happiness (distributed normally across 600 people) at time one.

  2. Happiness at \(t+1\) is its previous value plus one of the following, all with equal probability: +0.25, -0.25, or 0.

  3. At each time point, concurrent happiness influences the probability of meeting a friend. When happiness is low people are unlikely to meet a friend, whereas when happiness is high people are more likely to meet a friend. Meeting a friend is coded as 0 or 1 for each time point (i.e., no or yes).

  4. Continue for 25 time points.

  5. Assess the relationships between post happiness, pre happiness, and “met friend.” Pre happiness is always time one, whereas we will explore different post assesssments (e.g., post happiness is time 25 vs. post happiness is time 20). “Met friend” will always be whether the individual met a friend within 5 time points of the post happiness assessment. So, if we analyze post happiness as time 20, then “met friend” is whether the individual met a friend during times 15 through 20.

Notice that the simulation captures the notions raised above: “met friend” does not influence happiness, instead the reverse happens. And after making a decision about the timing of our pre and post assessment we lose information about when “met friend” actually happened. We know whether it happened but not when; we also don’t retain information on the differences in timing across our sample.

Meeting a Friend or Not

The most difficult aspect of the simulation is specifying step 3: “met friend” is some function of concurrent happiness. Dr. Cohen’s original explanation is, “the probability of \(X\) for each unit of time was determined by a Markov process, with probability increasing as a function of the level of contemporaneous \(Y\). Probabilities used increased from 0 for those with current \(Y\) less than -1.00 to 0.25 for those with current scores of 1.5 or greater” (she uses different variables for x and y in her discussion). What does that mean? How do we specify a Markov process where the probability of “met friend” is between 0 and 0.25 with respect to happiness cutoffs like -1.00 and 1.5? I don’t know either. But we can make it easier by recognizing that, at its core, the idea is simply, “meeting a friend is more likely when people are happier,” which we can represent with a simple linear equation like \(y = mx + b\). All we need to do is to find the slope and y-intercept, then we’ll have an equation where we can plug in “happiness” and get “probability of meeting a friend.” Here is how.

Remember that we can find the slope and y-intercept of a line if we know the location of two of its points. Here, we know that the probability of “met friend” needs to be between 0 and 0.25, and the happiness cutoffs need to be -1.00 and 1.5. If I want to relate happiness to “met friend,” then, I can put happiness on the x-axis and “met friend” on the y-axis and recognize that by combining these cutoffs I get the end-points of a line: (1.5, 0.25) is one point and (-1, 1.5) is the other. Computing rise-over-run and then solving for the intercept gives me the following:

Probability of meeting a friend = 0.1*Happy + 0.01

Now we have a way to compute the probability of meeting a friend based on happiness. It is not as precise as the Markov process but it will work just fine. (Note: I actuallly use the points (1.4999, 2.4999) and (-0.999, 1.4999) to calculate the slope and intercept in the simulation because I will also use if-statements for the cutoffs)

Simulate One Person

It’s always helpful to make sure we can get a simulation to work on one person. In the simulation below, \(y\) is happiness and \(x\) is “met friend.”

time <- 25
y <- numeric(time)
x <- numeric(time)

count <- 0

for(i in 1:time){
  count <- count + 1
  
  if(i == 1){
    
    y[count] <- rnorm(1, mean = 0.5, sd = 0.5)
    x[count] <- 0
    
  }else{
    
    
    # y up or down with autoregression
    
    updownsame <- sample(c('up', 'down', 'same'), 1)
    
    if(updownsame == 'up'){
      
      y[count] <- y[count - 1] + 0.25
      
    }else if(updownsame == 'down'){
      
      y[count] <- y[count - 1] - 0.25
      
    }else{
      
      y[count] <- y[count - 1]
      
    }
    
    # x is a function of y
    
    if(y[count] <= -1.00){
      
      x_prob <- 0
      
    }else if(y[count] >= 1.5){
      
      x_prob <- 0.25
      
    }else{
      
      x_prob <- 0.10004*y[count] + 0.09994
      
      
    }
    
    x[count] <- rbinom(1, 1, x_prob)
    
  }
  
}

Full Sample

That script worked, so now let’s update the code slightly and run it across 600 people.

people <- 600
time <- 25
df <- matrix(, nrow = people*time, ncol = 4)

count <- 0

for(j in 1:people){
  
  

for(i in 1:time){
  count <- count + 1
  
  if(i == 1){
    
    df[count, 1] <- j
    df[count, 2] <- i
    df[count, 3] <- rnorm(1, mean = 0.5, sd = 0.5)
    df[count, 4] <- 0
    
  }else{
    
    df[count, 1] <- j
    df[count, 2] <- i
    
    # y up or down with autoregression
    
    updownsame <- sample(c('up', 'down', 'same'), 1)
    
    if(updownsame == 'up'){
      
      df[count, 3] <- df[count - 1, 3] + 0.25
      
    }else if(updownsame == 'down'){
      
      df[count, 3] <- df[count - 1, 3] - 0.25
      
    }else{
      
      df[count, 3] <- df[count - 1, 3]
      
    }
    
    # x is a function of y
    
    if(df[count, 3] <= -1.00){
      
      x_prob <- 0
      
    }else if(df[count, 3] >= 1.5){
      
      x_prob <- 0.25
      
    }else{
      
      x_prob <- 0.10004*df[count, 3] + 0.09994
      
    }
    
    df[count, 4] <- rbinom(1, 1, x_prob)
    
  }
  
}

  
  
}


df <- data.frame(df)
names(df) <- c('id', 'time', 'happy', 'met_friend')
library(tidyverse)

Results

Remember, we generated data where “met friend” did not influence happiness. Now we are going to assess the coefficient relating “met friend” to happiness to see if it differs from zero. First, let’s say our post-assessment happened at time 10.

Trim down our data set to just that time frame.

happy10_sample <- df %>%
  filter(time < 11)

How many friends did each person meet between times 5 and 10?

friend_count <- happy10_sample %>%
  filter(time > 4) %>%
  group_by(id) %>%
  summarise(
    friend_count = sum(met_friend)
  )

Now change that to:

friend_count <- friend_count %>%
  mutate(friend_event = case_when(
    friend_count == 0 ~ 0,
    friend_count != 0 ~ 1
  ))

Merge that count back into the happy10 data set and prepare the data for regression.

# Merge back into y10 df

happy10_sample <- left_join(happy10_sample, friend_count)

# Filter down to what's needed for regression

happy10_filter <- happy10_sample %>%
  select(id, time, happy, friend_event) %>%
  filter(time == 1 | time == 10)

library(reshape2)

happy10_wide <- reshape(happy10_filter, idvar = 'id', timevar = 'time', direction = 'wide')

# The x columns are synonymous, so I can remove one 

happy10_wide <- happy10_wide[, c('id', 'happy.10', 'happy.1', 'friend_event.1')]
names(happy10_wide) <- c('id', 'happy_post', 'happy_pre', 'met_friend')

Now regress post happy on pre happy and whether or not they met a friend between times 5 and 10.

summary(lm(happy_post ~ happy_pre + met_friend,
           data = happy10_wide))$coefficients
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -0.08305241 0.04376560 -1.897664 5.822247e-02
happy_pre    0.86224682 0.05237323 16.463503 1.730324e-50
met_friend   0.20006802 0.05412227  3.696593 2.386306e-04

The coefficient relating “met friend” to happiness is about 0.3 and it is significant (remember there was no influence from “met friend” to happiness).

What about when we change the post assessment to time point 15?

First create a function out of all the “tidying” steps above:

df_create <- function(time1){
  library(reshape2)
  library(tidyverse)
  time2 <- time1 - 5
  
  y_sample <- df %>%
    filter(time <= time1)
  
  friend_count <- y_sample %>%
    filter(time >= time2) %>%
    group_by(id) %>%
    summarise(
      friend_count = sum(met_friend)
    )
  
  friend_count <- friend_count %>%
    mutate(friend_event = case_when(
      friend_count == 0 ~ 0,
      friend_count != 0 ~ 1
    ))
  
  y_sample <- left_join(y_sample, friend_count)
  y_filter <- y_sample %>%
    select(id, time, happy, friend_event) %>%
    filter(time == 1 | time == time1)
  
  y_wide <- reshape(y_filter, idvar = 'id', timevar = 'time', direction = 'wide')
  
  yname <- paste('happy.', time1, sep = '')
  
  y_wide <- y_wide[, c('id', yname, 'happy.1', 'friend_event.1')]
  names(y_wide) <- c('id', 'happy_post', 'happy_pre', 'met_friend')
  
  return(y_wide)

}

Here are the results:

happy15_wide <- df_create(15)

summary(lm(happy_post ~ happy_pre + met_friend,
           data = happy15_wide))$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -0.1990938 0.05192122 -3.834536 1.391674e-04
happy_pre    0.7839255 0.06106283 12.838014 1.737539e-33
met_friend   0.4890845 0.06212003  7.873218 1.636154e-14

What about when we select time 25 as our post assessment?

happy25_wide <- df_create(25)

summary(lm(happy_post ~ happy_pre + met_friend,
           data = happy25_wide))$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) -0.3211208 0.06592975 -4.870651 1.426504e-06
happy_pre    0.6821083 0.07913204  8.619876 6.016853e-17
met_friend   0.8063170 0.07986960 10.095418 3.125506e-22

Notice how large the coefficient relating “met friend” to happiness is here: close to 0.9 – remember, there truly is no effect relating “met friend” to happiness.

Conclusion

If there is some probability of reverse causality and we don’t measure the event exactly when it occurs then the estimate relating that event to our outcome will be biased. If many “event opportunities” occur between our pre and post measure then our estimate will be extremely biased.

Bo\(^2\)m = )