Counting Degrees of Freedom

This post contains a bunch of examples where I practice counting dfs. In each example, I generate the data, estimate the parameters using SEM, count the dfs, and then compare my count to what the model spits back. To count dfs, I need to know the number of knowns and unknowns in my system:

\[\begin{equation} \textrm{DFs} = \textrm{knowns - unknowns} \end{equation}\]

To count the number of knowns, I need to know the number of observed variables, p:

\[\begin{equation} \textrm{knowns} = p*(p+1) / 2 \end{equation}\]

To count the number of unknowns, I count the number of parameters that my model estimates. Now for the examples.

Example 1 - Trust and availability cause helping

DGP

people <- 400
trust <- rnorm(people, 40, 2)
availability <- rnorm(people, 20, 5)
error <- rnorm(people, 0, 2)

helping <- 3 + 0.2*trust + 0.7*availability + error

SEM

library(tidyverse)
library(lavaan)

df <- data.frame(
  'id' = c(1:people),
  'trust' = c(trust),
  'availability' = c(availability),
  'helping' = c(helping)
)

ex1_string <- '

helping ~ b1*trust + b2*availability

'

ex1_model <- sem(ex1_string, data = df)

Count dfs

Knowns (count the observed variables)

# p*(p + 1) / 2

3*(3+1) / 2
[1] 6

Unknowns (count the estimated parameters)

show(ex1_model)
lavaan 0.6-9 ended normally after 14 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         3
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Now if I restrict the covariance of trust and availability to be zero I should have 1 df

ex1_string_restrict <- '

helping ~ b1*trust + b2*availability
trust ~~ 0*availability

'

ex1_model_restrict <- sem(ex1_string_restrict, data = df)
show(ex1_model_restrict) # yup
lavaan 0.6-9 ended normally after 15 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                 0.281
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.596

Example 2 - Common factor underlying 6 observed items

DGP

common_factor <- rnorm(people, 30, 2)
error_cf <- rnorm(people, 0, 2)
item1 <- 0.35*common_factor + error_cf
item2 <- 0.22*common_factor + error_cf
item3 <- 0.18*common_factor + error_cf
item4 <- 0.24*common_factor + error_cf
item5 <- 0.31*common_factor + error_cf
item6 <- 0.44*common_factor + error_cf

# nope, that approach is wrong. If I do above then my errors are not independent
# prediction errors (in this case measurement) should be independent

item1 <- 0.35*common_factor + rnorm(people, 0, 2)
item2 <- 0.22*common_factor + rnorm(people, 0, 2)
item3 <- 0.18*common_factor + rnorm(people, 0, 2)
item4 <- 0.24*common_factor + rnorm(people, 0, 2)
item5 <- 0.31*common_factor + rnorm(people, 0, 2)
item6 <- 0.44*common_factor + rnorm(people, 0, 2)

df_cf <- data.frame(
  'id' = c(1:people),
  'item1' = c(item1),
  'item2' = c(item2),
  'item3' = c(item3),
  'item4' = c(item4),
  'item5' = c(item5),
  'item6' = c(item6)
)

SEM

ex2_string <- '

com_factor =~ 1*item1 + fl2*item2 + fl3*item3 + fl4*item4 + fl5*item5 + fl6*item6
'

ex2_model <- sem(ex2_string, data = df_cf)

Count dfs

knowns (count the observed variables)

# p*(p + 1) / 2

6*(6 + 1) / 2
[1] 21

unknowns (count the estimated parameters)

show(ex2_model)
lavaan 0.6-9 ended normally after 69 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        12
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                15.528
  Degrees of freedom                                 9
  P-value (Chi-square)                           0.077

Example 3 - Two latent variables predict one observed outcome

Cognitive ability (latent variable 1) and assertiveness (latent variable 2) predict productivity. Cognitive ability and assertiveness are both captured with 2 manifest items/variables.

DGP

# cog ability (latent exogenous variable 1)
cog_ability <- rnorm(people, 100, 15)
ca_item1 <- 0.78*cog_ability + rnorm(people, 0, 1)
ca_item2 <- 0.11*cog_ability + rnorm(people, 0, 1)

# assertiveness (latent exogenous variable 2)
assertive <- rnorm(people, 30, 8)
ass_item1 <- 0.81*assertive + rnorm(people, 0, 1)
ass_item2 <- 0.34*assertive + rnorm(people, 0, 1)

# productivity (observed outcome)

productivity <- 0.55*cog_ability + 0.82*assertive + rnorm(people, 0, 5)

# data

df_3 <- data.frame(
  'id' = c(1:people),
  'ca_item1' = c(ca_item1),
  'ca_item2' = c(ca_item2),
  'ass_item1' = c(ass_item1),
  'ass_item2' = c(ass_item2),
  'productivity' = c(productivity)
  
  )

SEM

ex3_string <- '

cog_ability =~ 1*ca_item1 + fl2*ca_item2
assertiveness =~ 1*ass_item1 + fla*ass_item2

cog_ability ~~ cog_ability
assertiveness ~~ assertiveness
cog_ability ~~ assertiveness

productivity ~ b1*cog_ability + b2*assertiveness

'

ex3_model <- sem(ex3_string, data = df_3)

Count dfs

knowns (count the observed variables)

# p*(p+1) / 2

5*(5+1) / 2
[1] 15

unknowns (count the estimated parameters)

show(ex3_model)
lavaan 0.6-9 ended normally after 170 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        12
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                 2.776
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.428

Nope. I’m one off, where did I go wrong?

Ah, there is only 1 prediction error because productivity is being predicted. I counted 2 prediction errors because I gave one to both b1 and b2. So, the unknowns should be…

show(ex3_model)
lavaan 0.6-9 ended normally after 170 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        12
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                 2.776
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.428

Example 4 - a causes b, which causes c, which causes d

DGP

a <- rnorm(people, 300, 3)
b <- 0.67*a + rnorm(people, 0, 1)
c <- 0.99*b + rnorm(people, 0, 10)
d <- 4 + 4*c + rnorm(people, 0, 4)

df_chain <- data.frame(
  'id' = c(1:people),
  'a' = c(a),
  'b' = c(b),
  'c' = c(c),
  'd' = c(d)
)

SEM

ex4_string <- '

b ~ b1*a
c ~ b2*b
d ~ b3*c

a ~~ a

'

ex4_model <- sem(ex4_string, data = df_chain)

Count dfs

knowns (count the observed variables)

# p*(p+1) / 2
4*(4+1) / 2
[1] 10

unknowns (count the estimated parameters)

show(ex4_model)
lavaan 0.6-9 ended normally after 41 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         7
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                 3.046
  Degrees of freedom                                 3
  P-value (Chi-square)                           0.385

Example 5 - Observed affect over 7 time points

DGP

time <- 7
affect_store <- matrix(, ncol = 3, nrow = time*people)
count <- 0
for(i in 1:people){
  
  unob_het <- rnorm(1, 0, 3)
  
  for(j in 1:time){
    count <- count + 1
    
    if(j == 1){
      affect_store[count, 1] <- i
      affect_store[count, 2] <- j
      affect_store[count, 3] <- unob_het + 50 + rnorm(1, 0, 1)
    }else{
      affect_store[count, 1] <- i
      affect_store[count, 2] <- j
      affect_store[count, 3] <- 0.8*affect_store[count - 1, 3] + unob_het + rnorm(1, 0, 1)
      
    }
  }
  
}
df5 <- data.frame(affect_store)
names(df5) <- c('id', 'time', 'affect')
library(reshape2)
df5_wide <- reshape(df5, idvar = 'id', timevar = 'time', direction = 'wide')

SEM

ex5_string <- '

unob_het =~ 1*affect.2 + 1*affect.3 + 1*affect.4 + 1*affect.5 + 1*affect.6 + 1*affect.7

affect.2 ~ ar*affect.1
affect.3 ~ ar*affect.2
affect.4 ~ ar*affect.3
affect.5 ~ ar*affect.4
affect.6 ~ ar*affect.5
affect.7 ~ ar*affect.6

affect.1 ~~ affect.1
unob_het ~~ unob_het
affect.1 ~~ unob_het

'

ex5_model <- sem(ex5_string, data = df5_wide)

Count dfs

knowns (count the observed variables)

# p*(p+1) / 2
7*(7+1) / 2
[1] 28

unknowns (count the estimated parameters)

show(ex5_model)
lavaan 0.6-9 ended normally after 120 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        15
  Number of equality constraints                     5
                                                      
  Number of observations                           400
                                                      
Model Test User Model:
                                                      
  Test statistic                                29.227
  Degrees of freedom                                18
  P-value (Chi-square)                           0.046

Why didn’t I estimate a mean for unobserved heterogeneity here? In all of the other examples I estimated the variance (1 parameter) and the mean (1 parameter) of the latent exogenous variable. In this case, unobserved heterogeneity is the latent exogenous variable but I only estimated its variance. That’s because in this model we don’t really care about the mean of unobserved heterogeneity, it’s just a latent variable that we incorporate to account for stable individual differences. In other words, when I estimate latent cog ability and assertiveness as IVs to predict an outcome, I care about their means. Here, unobserved heterogeneity is just an additional factor to account for, not a variable whose mean I really care to know. That said, if I wanted to estimate the mean of unobserved heterogeneity (which would result in one additional estimated parameter and one fewer df) then I would incorporate the following into the model string.

'

unob_het ~ 1 # lavaan code for estimating the mean of a latent variable

'

Bo\(^2\)m =)