Regression Creates Weighted Linear Composites

One way to think about regression is as a tool that takes a set of predictors and creates a weighted linear composite that maximally correlates with the response variable. It combines multiple predictors into a single variable, using the regression weights, and those weights are chosen so that, once the composite is formed, it correlates as strongly as possible with the outcome.
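In symbols: with two predictors, regression estimates weights \(b_1\) and \(b_2\) (plus an intercept) in

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2, \]

and those weights solve

\[ \max_{w_1,\, w_2} \ \mathrm{cor}(y,\ w_1 x_1 + w_2 x_2), \]

with the maximized correlation equal to the multiple correlation \(R = \sqrt{R^2}\). (Any positive rescaling of the weights leaves the correlation unchanged, so the maximizer is really a direction, not a unique pair of numbers.)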

Here’s a simulation to punch that point home.

500 people.

N <- 500

The correlation matrix for three variables, x1, x2, and the outcome, y. The correlation between x1 and x2 is 0.1, the correlation between x1 and y is 0.4, and the correlation between x2 and y is 0.4.

sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)

The mean for each variable is 0.

mu <- c(0,0,0)

Use the correlation matrix and mean specifications to generate data.

library(MASS)

df <- mvrnorm(N, mu, sigma)

Turn it into a data frame and label it.

df <- data.frame(df)
names(df) <- c('x1', 'x2', 'y')
df$id <- 1:N

Run regression and print the output.

summary(lm(y ~ x1 + x2,
           data = df))

Call:
lm(formula = y ~ x1 + x2, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.81245 -0.57685 -0.03564  0.64622  2.46418 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01390    0.03931  -0.354    0.724    
x1           0.28256    0.03784   7.468 3.69e-13 ***
x2           0.36403    0.03889   9.361  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8758 on 497 degrees of freedom
Multiple R-squared:  0.2489,    Adjusted R-squared:  0.2459 
F-statistic: 82.36 on 2 and 497 DF,  p-value: < 2.2e-16
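Incidentally, you don’t have to copy those weights off the printout; they live in the fitted model object and can be pulled out with coef(). A minimal standalone sketch (it re-simulates its own data with an arbitrary seed, since the run above sets none):

```r
library(MASS)

set.seed(1)  # arbitrary seed; the original run used none
sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
d <- data.frame(mvrnorm(500, c(0, 0, 0), sigma))
names(d) <- c("x1", "x2", "y")

# The estimated weights, by name, straight from the model object
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)[c("x1", "x2")]
```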

Here’s the kicker: you can think of those weights as the optimal recipe for building the composite.

Create a composite using (approximately) the regression weights.

library(tidyverse)
df <- df %>%
  mutate(composite_x = 0.33*x1 + 0.4*x2)

Weights proportional to the estimated regression coefficients give the maximum possible correlation between composite and outcome, and even these approximate weights land essentially on that maximum.

cor(df$y, df$composite_x)
[1] 0.4987514

In other words, no other set of weights could make the correlation above meaningfully higher. Regression found the weights that make that correlation as large as it can be; the maximum itself is the multiple correlation from the full model, \(R = \sqrt{R^2} = \sqrt{0.2489} \approx 0.499\), which is essentially what we got.
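That claim is easy to check by brute force. A standalone sketch (again re-simulating with an arbitrary seed, since the original run has none): build the composite from the exact estimated coefficients, then jitter those weights a couple of thousand times and confirm that no jittered composite correlates better with y.

```r
library(MASS)

set.seed(2)  # arbitrary
sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
d <- data.frame(mvrnorm(500, c(0, 0, 0), sigma))
names(d) <- c("x1", "x2", "y")

fit <- lm(y ~ x1 + x2, data = d)
b <- coef(fit)[c("x1", "x2")]
X <- as.matrix(d[, c("x1", "x2")])

best <- cor(d$y, drop(X %*% b))  # composite built from the OLS weights
jittered <- replicate(2000, cor(d$y, drop(X %*% (b + rnorm(2, sd = 0.25)))))
max(jittered) <= best  # the OLS weights can't be beaten
```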

summary(lm(y ~ composite_x,
           data = df))

Call:
lm(formula = y ~ composite_x, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.80292 -0.58917 -0.03499  0.62971  2.48528 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01470    0.03920  -0.375    0.708    
composite_x  0.88706    0.06908  12.841   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.875 on 498 degrees of freedom
Multiple R-squared:  0.2488,    Adjusted R-squared:  0.2472 
F-statistic: 164.9 on 1 and 498 DF,  p-value: < 2.2e-16
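(The slope here is 0.887 rather than 1 only because the composite used approximate weights. A standalone sketch, re-simulating with an arbitrary seed: build the composite from the exact coefficients and the slope comes out as exactly 1, since the composite is then just the fitted values minus the intercept.)

```r
library(MASS)

set.seed(3)  # arbitrary
sigma <- matrix(c(1.0, 0.1, 0.4,
                  0.1, 1.0, 0.4,
                  0.4, 0.4, 1.0), 3, 3, byrow = TRUE)
d <- data.frame(mvrnorm(500, c(0, 0, 0), sigma))
names(d) <- c("x1", "x2", "y")

# Composite built from the exact estimated coefficients
fit <- lm(y ~ x1 + x2, data = d)
d$composite <- coef(fit)["x1"] * d$x1 + coef(fit)["x2"] * d$x2

coef(lm(y ~ composite, data = d))["composite"]  # 1, up to floating-point error
```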

Bo\(^2\)m =)