This post contains a bunch of examples where I practice counting dfs. In each example, I generate the data, estimate the parameters using SEM, count the dfs, and then compare my count to what the model spits back. To count dfs, I need to know the number of knowns and unknowns in my system:
\[\begin{equation} \textrm{DFs} = \textrm{knowns} - \textrm{unknowns} \end{equation}\]
To count the number of knowns, I need to know the number of observed variables, p:
\[\begin{equation} \textrm{knowns} = p(p+1) / 2 \end{equation}\]
To count the number of unknowns, I count the number of parameters that my model estimates. Now for the examples.
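First example: helping regressed on trust and availability, all observed variables. A minimal sketch of the setup, assuming 400 people (the seed, means, SDs, and effect sizes below are placeholder assumptions, not the original values):
library(lavaan)
set.seed(1234)
people <- 400
trust <- rnorm(people, 50, 10)
availability <- rnorm(people, 20, 5)
helping <- 0.4*trust + 0.6*availability + rnorm(people, 0, 5)
df <- data.frame(
'id' = c(1:people),
'trust' = c(trust),
'availability' = c(availability),
'helping' = c(helping)
)
ex1_string <- '
helping ~ b1*trust + b2*availability
'
ex1_model <- sem(ex1_string, data = df)
First, the knowns: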
# p*(p + 1) / 2
3*(3+1) / 2
[1] 6
1 for b1
1 for b2
1 for the variance of trust
1 for the variance of availability
1 for the covariance of trust and availability
1 for the prediction error on helping
total = 6
6 - 6 = 0
show(ex1_model)
lavaan 0.6-9 ended normally after 14 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 3
Number of observations 400
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
A note on the parameter count: lavaan reports only 3 free parameters (b1, b2, and the prediction error variance) rather than my 6, because by default sem() fixes the variances and covariance of observed exogenous predictors at their sample values (fixed.x = TRUE). That removes 3 unknowns and the 3 corresponding knowns, so the df is 0 either way. Now if I restrict the covariance of trust and availability to be zero, I should have 1 df:
ex1_string_restrict <- '
helping ~ b1*trust + b2*availability
trust ~~ 0*availability
'
ex1_model_restrict <- sem(ex1_string_restrict, data = df)
show(ex1_model_restrict) # yup
lavaan 0.6-9 ended normally after 15 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 400
Model Test User Model:
Test statistic 0.281
Degrees of freedom 1
P-value (Chi-square) 0.596
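Second example: a single common factor measured by six items, each generated as \(\textrm{item}_j = \lambda_j \cdot \textrm{common factor} + e_j\). My first pass at generating the items: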
common_factor <- rnorm(people, 30, 2)
error_cf <- rnorm(people, 0, 2)
item1 <- 0.35*common_factor + error_cf
item2 <- 0.22*common_factor + error_cf
item3 <- 0.18*common_factor + error_cf
item4 <- 0.24*common_factor + error_cf
item5 <- 0.31*common_factor + error_cf
item6 <- 0.44*common_factor + error_cf
# nope, that approach is wrong: if I generate the items this way, they all share one error term,
# and prediction errors (measurement errors, in this case) should be independent across items
item1 <- 0.35*common_factor + rnorm(people, 0, 2)
item2 <- 0.22*common_factor + rnorm(people, 0, 2)
item3 <- 0.18*common_factor + rnorm(people, 0, 2)
item4 <- 0.24*common_factor + rnorm(people, 0, 2)
item5 <- 0.31*common_factor + rnorm(people, 0, 2)
item6 <- 0.44*common_factor + rnorm(people, 0, 2)
df_cf <- data.frame(
'id' = c(1:people),
'item1' = c(item1),
'item2' = c(item2),
'item3' = c(item3),
'item4' = c(item4),
'item5' = c(item5),
'item6' = c(item6)
)
ex2_string <- '
com_factor =~ 1*item1 + fl2*item2 + fl3*item3 + fl4*item4 + fl5*item5 + fl6*item6
'
ex2_model <- sem(ex2_string, data = df_cf)
# p*(p + 1) / 2
6*(6 + 1) / 2
[1] 21
6 factor loadings, but I fixed the first one to 1 (I have to, to give the latent variable a scale), so 5 free loadings
6 measurement error variances, one for each item (item1 still gets an error variance even though its loading is fixed)
1 variance for the latent exogenous variable
total = 12
21 - 12 = 9
show(ex2_model)
lavaan 0.6-9 ended normally after 69 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 400
Model Test User Model:
Test statistic 15.528
Degrees of freedom 9
P-value (Chi-square) 0.077
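As an aside, if I just want the df without scanning the full output, lavaan's fitMeasures() pulls it directly, which returns 9 here:
fitMeasures(ex2_model, 'df')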
Cognitive ability (latent variable 1) and assertiveness (latent variable 2) predict productivity. Each latent variable is captured with 2 manifest items.
# cog ability (latent exogenous variable 1)
cog_ability <- rnorm(people, 100, 15)
ca_item1 <- 0.78*cog_ability + rnorm(people, 0, 1)
ca_item2 <- 0.11*cog_ability + rnorm(people, 0, 1)
# assertiveness (latent exogenous variable 2)
assertive <- rnorm(people, 30, 8)
ass_item1 <- 0.81*assertive + rnorm(people, 0, 1)
ass_item2 <- 0.34*assertive + rnorm(people, 0, 1)
# productivity (observed outcome)
productivity <- 0.55*cog_ability + 0.82*assertive + rnorm(people, 0, 5)
# data
df_3 <- data.frame(
'id' = c(1:people),
'ca_item1' = c(ca_item1),
'ca_item2' = c(ca_item2),
'ass_item1' = c(ass_item1),
'ass_item2' = c(ass_item2),
'productivity' = c(productivity)
)
ex3_string <- '
cog_ability =~ 1*ca_item1 + fl2*ca_item2
assertiveness =~ 1*ass_item1 + fla*ass_item2
cog_ability ~~ cog_ability
assertiveness ~~ assertiveness
cog_ability ~~ assertiveness
productivity ~ b1*cog_ability + b2*assertiveness
'
ex3_model <- sem(ex3_string, data = df_3)
# p*(p+1) / 2
5*(5+1) / 2
[1] 15
4 factor loadings, but I fixed one per latent variable to 1, so 2 free loadings
4 measurement error variances, one for each item
1 variance for cog ability
1 variance for assertiveness
1 covariance between cog ability and assertiveness
b1
b2
2 prediction errors
total = 13
15 - 13 = 2
show(ex3_model)
lavaan 0.6-9 ended normally after 170 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 400
Model Test User Model:
Test statistic 2.776
Degrees of freedom 3
P-value (Chi-square) 0.428
Nope, I'm one off. Where did I go wrong?
Ah, there is only 1 prediction error, because only one variable (productivity) is being predicted. I counted 2 prediction errors because I gave one to each of b1 and b2. So the unknowns should be…
4 factor loadings, but I fixed one per latent variable to 1, so 2 free loadings
4 measurement error variances, one for each item
1 variance for cog ability
1 variance for assertiveness
1 covariance between cog ability and assertiveness
b1
b2
1 prediction error
total = 12
15 - 12 = 3
show(ex3_model)
lavaan 0.6-9 ended normally after 170 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 12
Number of observations 400
Model Test User Model:
Test statistic 2.776
Degrees of freedom 3
P-value (Chi-square) 0.428
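Fourth example: a causal chain a → b → c → d among four observed variables. A minimal sketch of how df_chain could have been generated (the effect sizes, means, and SDs are placeholder assumptions; people is 400 as before):
a <- rnorm(people, 10, 2)
b <- 0.5*a + rnorm(people, 0, 1)
c <- 0.7*b + rnorm(people, 0, 1)
d <- 0.6*c + rnorm(people, 0, 1)
df_chain <- data.frame('a' = a, 'b' = b, 'c' = c, 'd' = d)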
ex4_string <- '
b ~ b1*a
c ~ b2*b
d ~ b3*c
a ~~ a
'
ex4_model <- sem(ex4_string, data = df_chain)
# p*(p+1) / 2
4*(4+1) / 2
[1] 10
b1
b2
b3
3 prediction errors
1 variance for the lone exogenous variable (a)
total = 7
10 - 7 = 3
show(ex4_model)
lavaan 0.6-9 ended normally after 41 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 7
Number of observations 400
Model Test User Model:
Test statistic 3.046
Degrees of freedom 3
P-value (Chi-square) 0.385
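Fifth example: affect measured at 7 time points per person, generated as an autoregressive process plus a person-specific latent intercept, the unobserved heterogeneity:
\[\begin{equation} \textrm{affect}_{it} = 0.8 \cdot \textrm{affect}_{i,t-1} + u_i + e_{it} \end{equation}\]
where \(u_i \sim N(0, 3^2)\) is person i's stable individual difference and the first occasion starts at \(50 + u_i + e_{i1}\). The simulation: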
time <- 7
affect_store <- matrix(NA, ncol = 3, nrow = time*people)
count <- 0
for(i in 1:people){
unob_het <- rnorm(1, 0, 3) # person i's stable individual difference (u_i)
for(j in 1:time){
count <- count + 1
if(j == 1){
affect_store[count, 1] <- i
affect_store[count, 2] <- j
affect_store[count, 3] <- unob_het + 50 + rnorm(1, 0, 1)
}else{
affect_store[count, 1] <- i
affect_store[count, 2] <- j
affect_store[count, 3] <- 0.8*affect_store[count - 1, 3] + unob_het + rnorm(1, 0, 1)
}
}
}
df5 <- data.frame(affect_store)
names(df5) <- c('id', 'time', 'affect')
# reshape() from base R (the stats package) converts the long data to wide format
df5_wide <- reshape(df5, idvar = 'id', timevar = 'time', direction = 'wide')
ex5_string <- '
unob_het =~ 1*affect.2 + 1*affect.3 + 1*affect.4 + 1*affect.5 + 1*affect.6 + 1*affect.7
affect.2 ~ ar*affect.1
affect.3 ~ ar*affect.2
affect.4 ~ ar*affect.3
affect.5 ~ ar*affect.4
affect.6 ~ ar*affect.5
affect.7 ~ ar*affect.6
affect.1 ~~ affect.1
unob_het ~~ unob_het
affect.1 ~~ unob_het
'
ex5_model <- sem(ex5_string, data = df5_wide)
# p*(p+1) / 2
7*(7+1) / 2
[1] 28
ar is 1 estimated parameter (the six autoregressive paths share the label ar, so lavaan constrains them to be equal; that's why the output below reports 15 model parameters with 5 equality constraints, i.e., 10 free parameters)
1 variance of unobserved heterogeneity
1 variance of affect.1
1 covariance among affect.1 and unobserved heterogeneity
6 prediction errors
total = 10
28 - 10 = 18
show(ex5_model)
lavaan 0.6-9 ended normally after 120 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 15
Number of equality constraints 5
Number of observations 400
Model Test User Model:
Test statistic 29.227
Degrees of freedom 18
P-value (Chi-square) 0.046
Why didn't I estimate a mean for unobserved heterogeneity here? Looking back, none of these models estimated means at all: sem() fits only the covariance structure by default, which is why the knowns were always p*(p+1)/2 variances and covariances and why no means or intercepts showed up among the unknowns. Substantively that's also fine, because unobserved heterogeneity is just a latent variable I incorporate to account for stable individual differences, not a variable whose mean I really care to know. If I did want to estimate its mean, I would incorporate the following into the model string.
'
unob_het ~ 1 # lavaan code for estimating the mean of a latent variable
'
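One thing to keep in mind: mentioning an intercept like this anywhere in the syntax switches lavaan over to a mean structure, so the knowns become p*(p+1)/2 + p (the p observed means now count as knowns) and the observed intercepts join the unknowns, which changes the df bookkeeping above.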
Bo\(^2\)m =)