Imagine that you are interested in the relationship between stress and performance. To assess it, you observe 600 people at work and measure their stress via a self-report (e.g., “I feel stressed”) and their performance via objective performance scores for the day (e.g., number of sales). You regress performance on stress and find that the estimated coefficient relating to two is 0.45. You then build a 95% confidence interval using the standard error that the analysis spit out and find that the CI is 0.45 +- 0.1.
What does that confidence interval actually mean? The purpose of this exercise is to build intuition behind frequentist CIs.
Generate the population
Sample the population. On that sample…
2a) Regress performance on stress
2b) Calculate a CI
2c) Does the CI contain the population parameter?
Re-sample and repeat
Our population will contain 100,000 people
pop_number <- 100000
with stress scores distributed about zero. The scale here doesn’t matter – we care about the relationship between stress and performance and less about (in this example) the distributions of stress and performance themselves.
population_stress <- rnorm(pop_number, 0, 5)
The true relationship between stress and performance will be 0.45. Let’s set that parameter
stress_performance_coefficient <- 0.45
and then generate performance.
population_performance <- stress_performance_coefficient*population_stress + rnorm(pop_number, 0, 1)
Now plug everything into a data set. Remember, this is the population.
df <- data.frame(
'person' = c(1:pop_number),
'stress' = c(population_stress),
'performance' = c(population_performance)
)
What is the paramter relating stress to performance? 0.45, keep that in mind. Time to sample the population as if we conducted a study and run our regression.
Randomly select 600 people from our population. That is, pretend we ran a study on 600 subjects.
Use the lm
command for regression in R
.
Call:
lm(formula = performance ~ stress, data = sample_df)
Residuals:
Min 1Q Median 3Q Max
-3.2166 -0.7325 -0.0102 0.7308 3.1016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0003186 0.0427267 -0.007 0.994
stress 0.4482278 0.0087063 51.483 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.044 on 598 degrees of freedom
Multiple R-squared: 0.8159, Adjusted R-squared: 0.8156
F-statistic: 2651 on 1 and 598 DF, p-value: < 2.2e-16
Save the output of the regression in an object so we can pull out the specific coefficients that we are interested in.
Pull out the coefficient relating stress to performance
slope_coefficient <- output$coefficients[2,1]
and use it along with the SEs to calculate the confidence interval.
se_upper <- slope_coefficient + 1.96*output$coefficients[2,2]
se_lower <- slope_coefficient - 1.96*output$coefficients[2,2]
Remember that the parameter is 0.45.
contain_parameter <- NULL
if(se_lower <= stress_performance_coefficient && se_upper >= stress_performance_coefficient){
contain_parameter <- 'yes'
}else{
contain_parameter <- 'no'
}
contain_parameter
[1] "yes"
What did we do? We sampled the population, ran a regression to relate stress to performance, and then calculated a CI on the slope term. The interpretation of a CI, however, is across infinite samples. Now we need to run through the sample, regress, and calculate CI procedure again and again and again – Monte Carlo.
I created a function that samples the population, runs a regression, calculates the CI, and then saves whether or not the interval contained 0.45 (‘yes’ or ‘no’). You can view that code in the raw rmarkdown file. For now, just know that the function is called sample_regress_calc_ci
.
We are going to re-run step 2 from above 900 times
sims <- 900
and store the ‘yes’ or ‘no’ result in a vector.
all_ci_contains <- numeric(sims)
Here is the full Monte Carlo code.
sims <- 900
all_ci_contains <- numeric(sims)
for(i in 1:sims){
result <- sample_regress_calc_ci()
all_ci_contains[i] <- result
}
How many computed intervals contain the population parameter?
sum(all_ci_contains == 'yes') / sims
[1] 0.9511111
Bo\(^2\)m =)