Christopher R. Dishop: Bias From Measurement Error

The Problem

Consider the notion of incremental validity:

“If measurements of construct X correlate significantly with outcome Y even when controlling for existing measure(s) Z, then X is a useful predictor of Y, over and above Z.”

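The issue is that measurement error in Z leads to spurious inferences of incremental validity. An unreliable measure only partially controls for the construct it stands in for, so X can absorb the variance that Z fails to capture. To be confident that an incremental validity argument is sound, one needs to either ensure perfect measurement reliability or formally account for unreliability in one’s model.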
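A short derivation makes the mechanism concrete. It is a sketch under assumptions of my own, not a result quoted from the source: X and Y are both linear functions of a true construct Z with independent errors, Z is observed only through a noisy measure Z* with reliability ρ, and Cov(X, Y | Z*) denotes the partial (linear) covariance of X and Y after removing Z*.

```latex
\begin{aligned}
X &= aZ + u, \qquad Y = bZ + v, \qquad Z^{*} = Z + e, \\
\rho &= \frac{\sigma_Z^2}{\sigma_Z^2 + \sigma_e^2}
  \quad\Longrightarrow\quad \operatorname{Var}(Z^{*}) = \frac{\sigma_Z^2}{\rho}, \\
\operatorname{Cov}(X, Y \mid Z^{*})
  &= \operatorname{Cov}(X, Y)
   - \frac{\operatorname{Cov}(X, Z^{*})\,\operatorname{Cov}(Y, Z^{*})}{\operatorname{Var}(Z^{*})} \\
  &= ab\,\sigma_Z^2 - \frac{(a\,\sigma_Z^2)(b\,\sigma_Z^2)}{\sigma_Z^2/\rho}
   = ab\,\sigma_Z^2\,(1 - \rho).
\end{aligned}
```

Under these assumptions X has no direct effect on Y, yet the partial covariance is zero only when ρ = 1. Any unreliability in Z* leaves part of the shared dependence on Z uncontrolled, and that remainder shows up as apparent incremental validity for X.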

Example

Suppose heat is a common cause of swimming pool deaths and ice cream sales.

heat <- 1:100
# Both outcomes depend on heat only; neither one affects the other
swimmingdeaths <- 0.6*heat + rnorm(100, 0, 3)
creamsales <- 0.5*heat + rnorm(100, 0, 4)

df <- data.frame(
  heat = heat,
  swimmingdeaths = swimmingdeaths,
  creamsales = creamsales
)

If I regress ice cream sales on swimming pool deaths, I (spuriously) conclude that swimming pool deaths predict ice cream sales.

round(summary(lm(
  creamsales ~ swimmingdeaths,
  data = df
))$coefficients, 2)

               Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.34       0.89   -0.39      0.7
swimmingdeaths     0.86       0.03   33.54      0.0

If instead I control for heat as a common cause, then the relationship between swimming pool deaths and ice cream sales goes away.

round(summary(lm(
  creamsales ~ swimmingdeaths + heat,
  data = df
))$coefficients, 2)

               Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.34       0.77   -0.44     0.66
swimmingdeaths     0.13       0.13    0.97     0.33
heat               0.44       0.08    5.76     0.00

What if heat is measured subjectively, and therefore with error?

# Perceived heat: true heat plus random measurement error
heatperceptions <- heat + rnorm(100, 0, 5)

library(tidyverse)

df <- df %>% 
  mutate(heatperceptions = heatperceptions)
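It can help to quantify how noisy this measure actually is. The snippet below is a minimal sketch that computes the reliability of the perception measure from the simulation settings, using the classical definition of reliability as true-score variance over total variance. The names `error_sd`, `true_var`, and `reliability` are my own and not part of the original code.

```r
heat <- 1:100   # true heat, as in the simulation above
error_sd <- 5   # SD of the noise added to form heatperceptions

# Reliability = true-score variance / (true-score variance + error variance)
true_var <- var(heat)
reliability <- true_var / (true_var + error_sd^2)
round(reliability, 3)
```

The result is roughly 0.97, so the perception measure is highly reliable by conventional standards.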

Now, even when I control for heat perceptions, the spurious relationship between swimming pool deaths and ice cream sales returns.

round(summary(lm(
  creamsales ~ swimmingdeaths + heatperceptions,
  data = df
))$coefficients, 2)

                Estimate Std. Error t value Pr(>|t|)
(Intercept)        -0.92       0.82   -1.12     0.27
swimmingdeaths      0.45       0.09    4.90     0.00
heatperceptions     0.25       0.06    4.48     0.00
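A single simulated data set could be a fluke, so the following sketch repeats the example many times at several noise levels. It is an illustration under my own assumptions: the helper `spurious_coef()`, the seed, the noise levels, and the number of replications are all hypothetical choices, not part of the original post.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: simulate the example once and return the coefficient
# on swimmingdeaths when controlling for a heat measure with noise SD error_sd
spurious_coef <- function(error_sd, n = 100) {
  heat <- seq_len(n)
  swimmingdeaths <- 0.6 * heat + rnorm(n, 0, 3)
  creamsales <- 0.5 * heat + rnorm(n, 0, 4)
  heatmeasure <- heat + rnorm(n, 0, error_sd)
  fit <- lm(creamsales ~ swimmingdeaths + heatmeasure)
  unname(coef(fit)["swimmingdeaths"])
}

error_sds <- c(0, 2, 5, 10)

# Average over replications to smooth out sampling noise
avg_coef <- sapply(error_sds, function(s) mean(replicate(200, spurious_coef(s))))

round(data.frame(error_sd = error_sds, avg_coef = avg_coef), 2)
```

With no measurement error the average coefficient on swimming pool deaths should sit near zero, and it should grow as the heat measure gets noisier, even though swimming pool deaths never enter the equation that generates ice cream sales.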

What is the solution? Use structural equation modeling (SEM) to account for measurement error.

Let’s assume that heat is measured with three subjective indicators.

perception1 <- 0.6*heat + rnorm(100, 0, 3)
perception2 <- 0.3*heat + rnorm(100, 0, 3)
perception3 <- 1*heat + rnorm(100, 0, 3)

df <- df %>% 
  mutate(hp1 = perception1,
         hp2 = perception2,
         hp3 = perception3)

Now use SEM. The perceptions are caused by a latent heat factor, and then we regress ice cream sales on swimming pool deaths and the latent heat factor.

library(lavaan)

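modelstring <- '

# measurement model: the three perception items load on latent heat
latentheat =~ fl1*hp1 + fl2*hp2 + fl3*hp3

# structural model: ice cream sales on swimming pool deaths and latent heat
creamsales ~ b1*swimmingdeaths + b2*latentheat

'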

model <- sem(modelstring, data = df)
summary(model)

lavaan 0.6-9 ended normally after 119 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         9
                                                      
  Number of observations                           100
                                                      
Model Test User Model:
                                                      
  Test statistic                               333.675
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  latentheat =~                                       
    hp1      (fl1)    1.000                           
    hp2      (fl2)    0.474    0.019   25.072    0.000
    hp3      (fl3)    1.685    0.033   51.029    0.000

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  creamsales ~                                        
    swmmngdth (b1)    0.280    0.023   12.330    0.000
    latenthet (b2)    0.581    0.024   24.262    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .hp1               7.979    1.917    4.163    0.000
   .hp2               8.834    1.322    6.681    0.000
   .hp3               8.023    4.497    1.784    0.074
   .creamsales       14.383    2.142    6.714    0.000
    latentheat      300.071   43.578    6.886    0.000

Why is the relationship between swimming pool deaths and ice cream sales still significant?
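One possible answer, offered as my reading of the output rather than the author's own explanation: the summary reports 9 free parameters, and none of them links the latent heat factor to swimming pool deaths. The model as written therefore treats the two as uncorrelated, which contradicts how the data were generated and is consistent with the very large chi-square (333.675 on 5 degrees of freedom). The sketch below re-creates the data and adds a path from latent heat to swimming pool deaths. The seed and the names `df2`, `modelstring2`, and `model2` are assumptions of mine.

```r
library(lavaan)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Re-create the simulated data from above
heat <- 1:100
df2 <- data.frame(
  swimmingdeaths = 0.6 * heat + rnorm(100, 0, 3),
  creamsales     = 0.5 * heat + rnorm(100, 0, 4),
  hp1 = 0.6 * heat + rnorm(100, 0, 3),
  hp2 = 0.3 * heat + rnorm(100, 0, 3),
  hp3 = 1.0 * heat + rnorm(100, 0, 3)
)

# Same measurement model, but latent heat is now also allowed to
# predict swimming pool deaths (an added path, not in the original model)
modelstring2 <- '
latentheat =~ hp1 + hp2 + hp3
swimmingdeaths ~ latentheat
creamsales ~ b1*swimmingdeaths + b2*latentheat
'

model2 <- sem(modelstring2, data = df2)

# Show only the regression paths
subset(parameterEstimates(model2), op == "~")
```

If this reading is right, the estimate for `b1` should move toward zero once the dependence of swimming pool deaths on heat is part of the model. Specifying a covariance (`latentheat ~~ swimmingdeaths`) instead of a directed path would be another way to relax the same constraint.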

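See Jacob Westfall and Tal Yarkoni’s original paper, “Statistically Controlling for Confounding Constructs Is Harder than You Think” (PLOS ONE, 2016), for more on this issue.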

Bo\(^2\)m =)