Multicollinearity

Perfect and Imperfect Collinearity, VIF, and Regularization

Part I: What is Multicollinearity?

A data problem that doesn’t violate Gauss-Markov, but one that can still be problematic.

Roadmap

We’ve covered the properties of OLS — unbiasedness, efficiency — and what happens when assumptions like homoskedasticity are violated.

Now we turn to a different kind of problem: multicollinearity.

  • It doesn’t technically violate any Gauss-Markov assumption
  • OLS is still BLUE
  • But your estimates can become very imprecise

Today’s plan:

  1. Perfect vs. imperfect multicollinearity
  2. Why multicollinearity inflates standard errors
  3. A fictitious problem as a cautionary tale
  4. Detection: VIF and other diagnostics
  5. Common mistakes
  6. MSE and the bias-variance tradeoff
  7. Ridge and Lasso regression as alternatives

Two Categories

We can split multicollinearity into two categories:

                     Perfect                  Imperfect
Type                 Data problem             Estimator problem
What happens         OLS cannot be computed   OLS is imprecise
\(r_{X_1 X_2}\)      \(= 1\)                  \(0 < r < 1\)
\(\mathbf{X^TX}\)    Singular                 Near-singular

Perfect multicollinearity: the data do not contain enough information to estimate the model.

Imperfect multicollinearity: the data contain enough information, but OLS is not doing a good job extracting it.

Part II: Perfect Collinearity

When OLS simply cannot be computed.

The Setup

Consider the model:

\[ y_i = \beta_1 + \beta_2 X_{1i} + \beta_3 X_{2i} + \epsilon_i \]

When \(r_{X_1 X_2} = 1\), the denominator for \(\hat{\beta}_2\) and \(\hat{\beta}_3\) is zero, and \(\mathbf{X^TX}\) is singular.

You literally cannot invert the matrix — the estimator \((\mathbf{X^TX})^{-1}\mathbf{X^Ty}\) does not exist.

Why? If \(X_1\) and \(X_2\) are perfectly correlated, there is no way to tell them apart. The regression cannot easily partial out how much of the variation in \(y\) to attribute to \(X_1\) versus \(X_2\). The variables carry identical information.

More Generally: Linear Combinations

More precisely, we should use the term “multicollinearity.” If one variable is a perfect linear combination of the other \(k\) variables in our model, the same problem exists:

\[X_j = C_1 X_1 + \cdots + C_k X_k\]

If any variable \(X_j\) is exactly determined by some combination of the other predictors (i.e., at least one \(C_m \neq 0\)), perfect multicollinearity exists.

This isn’t just about pairs of variables being correlated — it’s about any linear dependency among the columns of \(\mathbf{X}\).

The Dummy Variable Problem

Recall from our discussion of categorical predictors: with \(k\) categories, we include \(k-1\) dummies. The omitted group serves as the reference category.

Why omit one? Precisely because of multicollinearity.

Consider party identification coded as Republican, Independent, and Democrat:

  • \(D_{\text{Rep}} = 1\) if Republican, 0 otherwise
  • \(D_{\text{Dem}} = 1\) if Democrat, 0 otherwise
  • \(D_{\text{Ind}} = 1\) if Independent, 0 otherwise

Notice: \(D_{\text{Rep}} + D_{\text{Dem}} + D_{\text{Ind}} = 1\) for every observation.

Why This Creates Perfect Collinearity

The intercept column in \(\mathbf{X}\) is also a column of 1s.

So the three dummy columns sum to produce the intercept column:

\[D_{\text{Rep}} + D_{\text{Dem}} + D_{\text{Ind}} = \mathbf{1}\]

That’s a perfect linear dependency among the columns of \(\mathbf{X}\), which means \(\mathbf{X^TX}\) is singular and we cannot estimate the model.

The Overspecified Model

The model with all \(k\) dummies and an intercept:

\[ Y_i = \alpha + \gamma_{\text{Rep}} D_{\text{Rep}} + \gamma_{\text{Dem}} D_{\text{Dem}} + \gamma_{\text{Ind}} D_{\text{Ind}} + \beta X + e_i \]

Since \(D_{\text{Ind}} = 1 - D_{\text{Rep}} - D_{\text{Dem}}\), the Independent dummy is a perfect linear function of the other two dummies and the intercept.

There is no unique solution.

The Fix: Drop One Category

When we drop one category (say, Independents), the intercept now represents the baseline for that group, and the remaining coefficients represent differences from that baseline.

An Independent is simply a case when \(D_{\text{Rep}} = 0\) and \(D_{\text{Dem}} = 0\).

The choice of reference category is arbitrary in terms of \(\hat{Y}_i\) — you’ll get the same predicted values — but it changes the interpretation of the dummy coefficients.
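This invariance is easy to verify. A minimal base-R sketch (the data here are simulated purely for illustration): switching the reference category changes the dummy coefficients, but the fitted values are identical.

```r
set.seed(1)
n <- 50
party <- factor(sample(c("Republican", "Independent", "Democrat"), n, replace = TRUE))
auth <- runif(n)
y <- 1 + 0.5 * (party == "Republican") + 0.4 * auth + rnorm(n, 0, 0.3)

# Same model, two different reference categories
m_ind <- lm(y ~ relevel(party, ref = "Independent") + auth)
m_dem <- lm(y ~ relevel(party, ref = "Democrat") + auth)

# Coefficients differ, predictions do not
stopifnot(isTRUE(all.equal(fitted(m_ind), fitted(m_dem))))
```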

Demonstration in R

set.seed(42)
n <- 100
party <- sample(c("Republican", "Independent", "Democrat"), n, replace = TRUE)
auth <- runif(n, 0, 1)
y_trust <- 0.5 + 0.3 * (party == "Republican") - 0.2 * (party == "Democrat") +
    0.4 * auth + rnorm(n, 0, 0.3)

d_rep <- as.numeric(party == "Republican")
d_dem <- as.numeric(party == "Democrat")
d_ind <- as.numeric(party == "Independent")

# Include all 3 dummies -- R drops one
cat("All 3 dummies (R drops one):\n")
All 3 dummies (R drops one):
coef(lm(y_trust ~ d_rep + d_dem + d_ind + auth))
(Intercept)       d_rep       d_dem       d_ind        auth 
  0.3916643   0.3492268  -0.2866816          NA   0.5729838 

R sets the coefficient on d_ind to NA — it detects the perfect linear dependency and removes a redundant variable.

Another Example: Continuous Variables

set.seed(42)
n <- 100
x1 <- rnorm(n, 10, 2)
x2 <- 2 * x1 + 5   # x2 is a PERFECT linear function of x1
y <- 3 + 1.5 * x1 + rnorm(n)

model_perfect <- lm(y ~ x1 + x2)
summary(model_perfect)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.88842 -0.50664  0.01225  0.54106  2.86240 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.77584    0.45043   6.163 1.59e-08 ***
x1           1.51358    0.04383  34.531  < 2e-16 ***
x2                NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9083 on 98 degrees of freedom
Multiple R-squared:  0.9241,    Adjusted R-squared:  0.9233 
F-statistic:  1192 on 1 and 98 DF,  p-value: < 2.2e-16

Again, R drops the redundant variable. The coefficient on x2 is NA.

The Mathematical Problem

Let’s connect this back to the matrix formula:

\[\mathbf{b} = (\mathbf{X^TX})^{-1}\mathbf{X^Ty}\]

This requires \(\mathbf{X^TX}\) to be invertible.

A matrix is invertible only if its determinant is nonzero.

When one column of \(\mathbf{X}\) is a linear combination of the others:

\[\det(\mathbf{X^TX}) = 0\]

The matrix is singular, and no inverse exists.
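We can see the singularity numerically. In this base-R sketch (simulated, illustrative values), building x2 as an exact linear function of x1 drives \(\det(\mathbf{X^TX})\) to zero, up to floating-point error, and drops the rank of \(\mathbf{X}\):

```r
set.seed(1)
n <- 30
x1 <- rnorm(n)
x2 <- 2 * x1 + 5                    # exact linear function of x1
X <- cbind(intercept = 1, x1, x2)   # design matrix with intercept column

XtX <- crossprod(X)                 # computes X'X
det(XtX)                            # numerically zero
qr(X)$rank                          # 2, not 3: one column is redundant
```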

Summary: Perfect Collinearity

Key Takeaway

Perfect collinearity = OLS does not exist. The data do not contain enough information to estimate the model. The solution: ensure your variables are not linearly dependent. Watch out for the dummy variable trap (\(\sum D_j = 1\)) and variables that are exact linear functions of each other.

Part III: Imperfect Multicollinearity

The far more common — and more serious — problem.

The Setup

More frequently, we are in the situation where \(0 < r_{X_1 X_2} < 1\).

This is incredibly common:

  • Education and income are correlated
  • Age and experience are correlated
  • Partisanship and ideology are correlated

The critical point: Imperfect collinearity is not a violation of the Gauss-Markov assumptions.

OLS is still BLUE. The estimator is still unbiased. The variance is still correctly estimated.

So why should we care?

The Problem: Inflated Variance

The variance of the OLS estimator can become very large, which means our estimates are very imprecise.

Recall the variance formula:

\[ \text{var}(b_j) = \frac{1}{1 - R^2_j} \times \frac{\sigma^2}{\sum x^2_{ij}} \]

where lowercase \(x_{ij}\) denotes deviations of \(X_j\) from its mean, and \(R^2_j\) is the \(R^2\) from regressing \(X_j\) on all the other independent variables in the model.

This is not the \(R^2\) from the main regression — it’s the \(R^2\) from the auxiliary regression of one predictor on all the others.

The Variance Inflation Factor (VIF)

The term:

\[\frac{1}{1 - R^2_j} = \text{VIF}_j\]

is called the Variance Inflation Factor.

  • \(\sqrt{\text{VIF}}\) tells us how much the standard error of \(b_j\) is inflated relative to the case with no collinearity
  • As VIF increases, \(\text{var}(b_j)\) increases
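The auxiliary-regression definition can be applied by hand. A base-R sketch on simulated data; the last line should agree with what `car::vif()` reports for the full model (assuming the `car` package is installed):

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x1 and x2 are correlated
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# Auxiliary regression: X_j on the other predictors (y plays no role here)
r2_aux <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_aux)
vif_x1   # the value car::vif(lm(y ~ x1 + x2 + x3))["x1"] would report
```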

Precision, Not Bias

OLS is still BLUE. The problem is not bias — we can still obtain an unbiased estimate. The problem is that the variance itself is large. Don’t confuse “large variance” with “biased estimates” — these are fundamentally different things.

Simulation: What Happens as Correlation Increases?

Let’s simulate data where the correlation between predictors varies, and watch what happens to the standard errors, VIF, and slope estimates.

library(MASS)   # for mvrnorm()
library(car)    # for vif()

set.seed(123)
n <- 200
beta_1_true <- 1.5
beta_2_true <- 1.0

correlations <- seq(0, 0.99, by = 0.01)
sim_results <- data.frame(
    correlation = correlations,
    se_x1 = NA, se_x2 = NA,
    b1 = NA, b2 = NA,
    vif_x1 = NA
)

for (i in seq_along(correlations)) {
    r <- correlations[i]
    sigma <- matrix(c(1, r, r, 1), nrow = 2)
    X <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)
    y <- 2 + beta_1_true * X[, 1] + beta_2_true * X[, 2] + rnorm(n, 0, 2)
    dat_sim <- data.frame(y = y, x1 = X[, 1], x2 = X[, 2])
    mod <- lm(y ~ x1 + x2, data = dat_sim)
    sim_results$se_x1[i] <- summary(mod)$coefficients[2, 2]
    sim_results$se_x2[i] <- summary(mod)$coefficients[3, 2]
    sim_results$b1[i] <- coef(mod)[2]
    sim_results$b2[i] <- coef(mod)[3]
    sim_results$vif_x1[i] <- vif(mod)[1]
}

Standard Errors Explode

Figure 1

At \(r = 0.9\), the standard errors are already more than double what they are at \(r = 0\). At \(r = 0.99\), they’re enormous.

VIF Tracks This Exactly

Figure 2

VIF crosses 4 around \(r \approx 0.87\), and 10 around \(r \approx 0.95\).
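These crossover points follow directly from the formula: with exactly two predictors, the auxiliary \(R^2_j\) is just \(r^2\), so \(\text{VIF} = 1/(1 - r^2)\). A one-line base-R check:

```r
vif_two <- function(r) 1 / (1 - r^2)   # VIF with exactly two predictors

vif_two(0.87)   # ~4.1, crossing the VIF > 4 rule of thumb
vif_two(0.95)   # ~10.3, crossing VIF > 10
```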

But Slopes Are Still Unbiased!

Figure 3

The slopes bounce around their true values at every level of correlation. They are still unbiased. The estimates get noisier, but they are centered on the true value.

Three Key Takeaways from the Simulation

  1. Standard errors increase drastically as correlation approaches 1
  2. VIF tracks this exactly — it’s a direct measure of variance inflation
  3. Slope estimates remain unbiased — the problem is precision, not bias

The Core Message

OLS is still BLUE in the presence of imperfect multicollinearity. The estimator is unbiased. The variance is correctly estimated. The problem is that the variance itself is large. Don’t confuse “large variance” with “biased estimates.”

Part IV: A Made-Up Problem

A cautionary tale about correlated treatments and non-randomized data.

The Setup

A startup has launched “No Sniffles” — a homeopathic allergy drug. To build their case for efficacy, they analyze data from a large non-randomized survey of Arizona residents during peak allergy season (March–June).

Respondents report:

  • How many doses of “No Sniffles” they took
  • How many pain relievers (Ibuprofen, Advil, etc.) they took
  • “How much was your headache reduced?” (0–10 scale)

The problem: During allergy season, pollen triggers sinus headaches. People who suffer reach for both an allergy remedy and a pain reliever at the same time.

The Causal Structure

Figure 4: Allergy season drives usage of both drugs — creating collinearity

Pain relievers genuinely reduce headaches (\(\beta_2 = 1.8\)). “No Sniffles” — being homeopathic — has only a modest placebo effect (\(\beta_1 = 0.15\)). Can the regression tell these apart?

The True Model

\[\text{Headache Reduction}_i = 1.0 + 0.15 \times \text{NoSniffles}_i + 1.8 \times \text{PainRelievers}_i + \epsilon_i\]

Let’s simulate 800 made-up residents during March–June:

set.seed(2024)
n <- 800
beta_0 <- 1.0; beta_nosniffles <- 0.15; beta_painrelief <- 1.8

doses <- MASS::mvrnorm(n,
    mu = c(2, 2.5),
    Sigma = matrix(c(1.0, 0.55, 0.55, 1.2), nrow = 2))
no_sniffles <- pmax(round(doses[, 1]), 0)
pain_relievers <- pmax(round(doses[, 2]), 0)

headache_reduction <- beta_0 + beta_nosniffles * no_sniffles +
    beta_painrelief * pain_relievers + rnorm(n, 0, 1.0)
headache_reduction <- pmin(pmax(headache_reduction, 0), 10)

survey_data <- data.frame(no_sniffles, pain_relievers, headache_reduction)

Full Sample: Moderate Collinearity

cat("Correlation:", round(cor(survey_data$no_sniffles,
    survey_data$pain_relievers), 3), "\n\n")
Correlation: 0.478 
model_full <- lm(headache_reduction ~ no_sniffles + pain_relievers,
    data = survey_data)
summary(model_full)

Call:
lm(formula = headache_reduction ~ no_sniffles + pain_relievers, 
    data = survey_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9335 -0.6296  0.0084  0.6675  2.8020 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1.10174    0.09131  12.065  < 2e-16 ***
no_sniffles     0.18864    0.03693   5.109 4.06e-07 ***
pain_relievers  1.72688    0.03552  48.621  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9564 on 797 degrees of freedom
Multiple R-squared:  0.8103,    Adjusted R-squared:  0.8098 
F-statistic:  1702 on 2 and 797 DF,  p-value: < 2.2e-16
cat("\nVIF:\n")

VIF:
vif(model_full)
   no_sniffles pain_relievers 
      1.295352       1.295352 

With moderate collinearity, we can probably tell the drugs apart — pain relievers show a strong effect; “No Sniffles” shows a small one.

What If Collinearity Increases?

Imagine different sampling strategies. What if we restrict to respondents whose “No Sniffles” and pain reliever doses are more similar?

max_diffs <- c(Inf, 3, 2, 1)
subset_labels <- c("Full Sample", "|Diff| <= 3", "|Diff| <= 2", "|Diff| <= 1")
results <- data.frame()

for (i in seq_along(max_diffs)) {
    d <- max_diffs[i]
    sub <- survey_data[abs(survey_data$no_sniffles -
        survey_data$pain_relievers) <= d, ]
    mod <- lm(headache_reduction ~ no_sniffles + pain_relievers, data = sub)
    v <- vif(mod)
    s <- summary(mod)
    results <- rbind(results, data.frame(
        Subset = subset_labels[i], N = nrow(sub),
        r_xy = round(cor(sub$no_sniffles, sub$pain_relievers), 3),
        b_nosniffles = round(coef(mod)["no_sniffles"], 3),
        se_nosniffles = round(s$coefficients["no_sniffles", 2], 3),
        b_painrelief = round(coef(mod)["pain_relievers"], 3),
        se_painrelief = round(s$coefficients["pain_relievers", 2], 3),
        VIF = round(v[1], 2)))
}

The Results

Table 1
Subset        N    r(NS, PR)  b_NoSniffles  SE_NoSniffles  b_PainRelief  SE_PainRelief  VIF
Full Sample   800  0.478      0.189         0.037          1.727         0.036          1.30
|Diff| <= 3   799  0.482      0.187         0.037          1.728         0.036          1.30
|Diff| <= 2   769  0.561      0.192         0.041          1.727         0.039          1.46
|Diff| <= 1   636  0.747      0.173         0.059          1.758         0.056          2.26

As the sample is restricted to similar dosing patterns:

  • Correlation increases
  • VIF increases
  • Standard errors explode
  • We can no longer distinguish \(\beta_1 = 0.15\) from \(\beta_2 = 1.8\)

Visualizing the Data Cloud

Figure 5

As we restrict to similar dosing patterns, the data cloud aligns along the diagonal — collinearity makes it impossible to separate the effects.

The Lesson: Randomization Breaks Collinearity

The startup’s problem:

  • With moderate collinearity (full sample): the regression can distinguish the drugs
  • With high collinearity (restricted sample): we can no longer tell a drug that works (\(\beta_2 = 1.8\)) from one that barely works (\(\beta_1 = 0.15\))

A randomized trial would independently assign “No Sniffles” and pain reliever usage — this would break the correlation between the right-hand-side variables.

The observational survey during allergy season — where everyone reaches for both drugs at once — cannot do this.

Part V: Detecting Multicollinearity

How do you know when you have a problem?

The Classic Indicators

Symptom 1: Large \(R^2\), significant \(F\)-test, but non-significant \(t\)-tests

  • The model as a whole explains variation in \(y\)
  • But you can’t pin down any particular effect
  • Divergence between the \(F\)-test (model fit) and the \(t\)-tests (individual coefficients)

Symptom 2: Examine \(r_{xx}\) — the correlation matrix of independent variables

  • If you observe an entry of 0.80 or greater, collinearity may be present
  • But: a variable might not be highly correlated with any single other variable, yet strongly predicted by a combination of variables

Symptom 3: The VIF

  • VIF \(> 4\) (or \(\sqrt{\text{VIF}} > 2\)): collinearity may be an issue
  • Some textbooks use VIF \(> 10\) as the threshold
  • It’s an indicator, a useful heuristic

Using VIF in R

library(car)

model <- lm(y ~ x1 + x2 + x3 + x4, data = mydata)

# Variance Inflation Factors
vif(model)

# If VIF > 4, investigate
# If VIF > 10, you likely have a serious problem

Remember: \(\text{VIF}_j\) is built from the \(R^2\) of regressing \(X_j\) on all the other predictors, via the transformation \(\frac{1}{1 - R^2_j}\).

A VIF of 4 means \(R^2_j = 0.75\) — three-quarters of the variation in \(X_j\) is explained by the other predictors.

A VIF of 10 means \(R^2_j = 0.90\) — ninety percent.
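The conversion between VIF and the auxiliary \(R^2\) is a one-liner in either direction; a quick base-R check of the numbers above:

```r
r2_from_vif <- function(v) 1 - 1 / v    # invert VIF = 1/(1 - R^2)
vif_from_r2 <- function(r2) 1 / (1 - r2)

r2_from_vif(4)     # 0.75
r2_from_vif(10)    # 0.9
vif_from_r2(0.75)  # 4
```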

Part VI: Common Mistakes

Where students — and researchers — tend to go wrong.

Mistake 1: Dropping Variables to “Fix” Collinearity

The most common response to multicollinearity is to drop one of the correlated variables.

Example: You’re modeling Trump’s vote margin in Arizona precincts as a function of median household income and median age. They’re correlated, so standard errors are large. You drop median age.

Now the standard errors on income are small and the coefficient is significant. Fixed it, right?

Wrong. If median age actually affects voting patterns, dropping it introduces omitted variable bias.

You’ve traded a precision problem for a bias problem.

Conflating Effects

The coefficient on income now captures both the direct effect of income and the indirect effect through its correlation with age.

The Rule

Do not drop theoretically important variables just because they are correlated with other predictors. An unbiased estimate with a large standard error is preferable to a biased estimate with a small standard error. If you can’t reject the null, that’s informative — it means the data don’t contain enough information to distinguish between the effects.

Mistake 2: “The Variables Don’t Matter”

When multicollinearity inflates standard errors and you fail to reject \(H_0: \beta_j = 0\), it’s tempting to conclude that the variables don’t affect \(y\).

But this is a statistical power issue:

  • You don’t have enough information to distinguish between the effects
  • But that doesn’t mean those effects aren’t there
  • Failing to reject a null does not mean the null is true

With highly collinear predictors, the individual \(t\)-tests have low power — our ability to correctly reject a false null is diminished. You should expect non-significant results, even if the variables truly matter.

Mistake 3: Stepwise Regression

You’ll occasionally hear about stepwise regression procedures — iterate through different specifications and settle on a model.

This is atheoretical and should be avoided.

A “correct” regression model must be correctly specified — and that requires theory, not data-driven selection. Model specification should be guided by your causal theory, not by which variables happen to reduce collinearity or produce significant \(p\)-values.

Mistake 4: Ignoring the Problem Entirely

On the other end, some researchers simply ignore multicollinearity and report their results as if nothing is wrong.

This can lead to incorrect conclusions from “non-significant” findings.

The right approach: Report VIFs. Discuss the implications. Acknowledge limitations in precision. Don’t pretend the problem doesn’t exist, and don’t let it drive you to drop important variables either.

Part VII: The Mean Squared Error

A framework for thinking about the bias-variance tradeoff.

MSE Decomposition

The mean squared error measures the expected squared difference between the estimator and the true parameter:

\[ MSE(\hat{b}) = E[(\hat{b} - b)^2] \]

Add and subtract \(E[\hat{b}]\) inside the squared term:

\[ MSE(\hat{b}) = E[(\hat{b} - E[\hat{b}] + E[\hat{b}] - b)^2] \]

Expanding the MSE

Expanding and taking expectations (the cross term vanishes since \(E[\hat{b} - E[\hat{b}]] = 0\)):

\[MSE(\hat{b}) = \underbrace{E[(\hat{b} - E[\hat{b}])^2]}_{\text{Variance}} + \underbrace{(E[\hat{b}] - b)^2}_{\text{Bias}^2}\]

\[\boxed{MSE(\hat{b}) = \text{var}(\hat{b}) + \text{bias}(\hat{b})^2}\]

This is fundamental. We can always evaluate and compare estimators using the MSE. It captures both the variance and the bias.

Why MSE Matters for Multicollinearity

For OLS with multicollinearity:

  • \(\text{bias}(\hat{b}) = 0\) (OLS is unbiased)
  • \(\text{var}(\hat{b})\) is large (multicollinearity inflates variance)

So \(MSE_{OLS} = 0 + \text{var}(\hat{b})\) — the entire MSE is variance.

The key question: Could we do better by accepting some bias in exchange for a large reduction in variance?

If \(\text{var}_{\text{new}} + \text{bias}^2_{\text{new}} < \text{var}_{OLS}\), then the biased estimator has lower MSE than OLS.

This is the bias-variance tradeoff.
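The decomposition and the tradeoff can both be checked numerically. A base-R sketch with a deliberately simple setup: estimate a mean, then shrink the estimate toward zero by a factor of 0.9 (all values here are illustrative).

```r
set.seed(1)
b_true <- 2
reps <- 5000

# Sampling distribution of a noisy but unbiased estimator (a sample mean)
b_hat <- replicate(reps, mean(rnorm(20, mean = b_true, sd = 3)))
b_shrunk <- 0.9 * b_hat              # biased, but less variable

mse   <- mean((b_shrunk - b_true)^2)
vari  <- mean((b_shrunk - mean(b_shrunk))^2)
bias2 <- (mean(b_shrunk) - b_true)^2

# The decomposition holds exactly for these empirical moments
stopifnot(isTRUE(all.equal(mse, vari + bias2)))

# Theoretical MSEs: 9/20 = 0.45 for the unbiased estimator,
# 0.81 * 0.45 + 0.2^2 = 0.4045 for the shrunk one, so shrinkage wins here
mean((b_hat - b_true)^2)   # close to 0.45
mse                        # close to 0.4045
```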

Part VIII: Ridge and Lasso Regression

Regularization as an alternative to OLS when multicollinearity is severe.

The Idea Behind Regularization

In statistics, we sometimes prefer a somewhat biased model if it is estimated with more precision.

One consequence of multicollinearity: it can lead to large coefficients. Why? Because the regression is trying to allocate variation in \(y\) between highly correlated predictors, and small perturbations in the data can cause dramatic swings in the coefficients.

Regularization (or shrinkage): penalize regression parameters that are large.

The Ridge Estimator

The Ridge estimator minimizes:

\[ b_{\text{Ridge}} = \arg\min_{b}\left[\sum_{i=1}^N\left(y_i - b_0 - \sum_j b_j x_{ji}\right)^2 + \lambda \sum_j b_j^2\right] \]

The left part is the familiar SSR. But now we also penalize by \(\lambda \sum_j b_j^2\).

  • If \(\lambda = 0\): reduces to OLS
  • If \(\lambda > 0\): some degree of “shrinkage”
  • Larger \(\lambda\) = more shrinkage = more bias, less variance

Ridge in Matrix Form

The closed-form solution:

\[ b_{\text{Ridge}} = (\mathbf{X^TX} + \lambda \mathbf{I})^{-1}\mathbf{X^Ty} \]

We’re adding a constant \(\lambda\) to the diagonals of \(\mathbf{X^TX}\).

This does two things:

  1. Guarantees the matrix is invertible (even with near-perfect collinearity)
  2. Shrinks the coefficients toward zero

The degree of bias is determined by \(\lambda\). As a consequence, we also decrease the variance of the estimate.
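The closed form is short enough to implement directly. A base-R sketch under two simplifying assumptions: the predictors are centered and scaled and the outcome is centered, so no intercept is needed (glmnet handles these details internally).

```r
# Ridge via the closed form (X'X + lambda*I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

set.seed(1)
n <- 100
X <- scale(matrix(rnorm(n * 2), n, 2))
y <- drop(X %*% c(1.5, 1.0)) + rnorm(n)
y <- y - mean(y)

# lambda = 0 reproduces OLS
b_ols <- coef(lm(y ~ X - 1))
stopifnot(isTRUE(all.equal(unname(b_ols), drop(ridge_coef(X, y, 0)))))

# lambda > 0 shrinks the coefficients toward zero
drop(ridge_coef(X, y, 50))
```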

Choosing \(\lambda\): Cross-Validation

How do we choose \(\lambda\)? 10-fold cross-validation:

  1. Split the data into 10 folds
  2. Hold out one fold as the test set
  3. Fit the model on the remaining 9 folds
  4. Calculate the RMSE on the test fold
  5. Average across all 10 folds
  6. Repeat for a grid of \(\lambda\) values
  7. Choose the \(\lambda\) with the smallest average RMSE
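The recipe above can be sketched in base R without glmnet, using the closed-form ridge estimator on centered data (simulated and illustrative only):

```r
set.seed(1)
n <- 200
X <- scale(matrix(rnorm(n * 3), n, 3))
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(n)
y <- y - mean(y)

ridge_coef <- function(X, y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

lambdas <- 10^seq(-2, 2, length.out = 25)   # step 6: grid of lambda values
folds <- sample(rep(1:10, length.out = n))  # step 1: split into 10 folds

cv_rmse <- sapply(lambdas, function(l) {
  fold_rmse <- sapply(1:10, function(k) {   # steps 2-4: hold out each fold
    train <- folds != k
    b <- ridge_coef(X[train, ], y[train], l)
    sqrt(mean((y[!train] - X[!train, ] %*% b)^2))
  })
  mean(fold_rmse)                           # step 5: average across folds
})

best_lambda <- lambdas[which.min(cv_rmse)]  # step 7: smallest average RMSE
best_lambda
```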

Ridge Applied to Arizona Precinct Data

library(glmnet)

load(file.path(dirname(getwd()), "..", "linear_regression",
    "quarto-book", "precinct_voter_summary.rda"))

precinct_voter_summary <- precinct_voter_summary |>
    dplyr::mutate(
        pct_latino = (tract_acs_latino / tract_acs_total_population) * 100,
        pct_white = (tract_acs_non_latino_white / tract_acs_total_population) * 100)

az_complete <- precinct_voter_summary |>
    dplyr::select(trump_harris_margin, tract_acs_median_household_income,
        pct_latino, tract_acs_median_age, tract_acs_gini_index,
        pct_democrat, pct_independent) |>
    na.omit()

X_mat <- model.matrix(
    ~ tract_acs_median_household_income + pct_latino +
        tract_acs_median_age + tract_acs_gini_index +
        pct_democrat + pct_independent,
    data = az_complete)[, -1]

y_ridge <- az_complete$trump_harris_margin

cv_ridge <- cv.glmnet(X_mat, y_ridge, alpha = 0, nfolds = 10)
plot(cv_ridge)
title("Ridge Regression: 10-Fold CV", line = 3)

Figure 6: 10-fold cross validation for Ridge regression

Lambda Min vs. Lambda 1SE

The cross-validation plot shows two important values:

lambda.min: The \(\lambda\) that minimizes cross-validation error (lowest RMSE). Best predictive performance.

lambda.1se: A larger \(\lambda\) chosen using the 1 standard error rule — the most regularized model whose error is within 1 SE of the minimum.

Why use lambda.1se? The 1-SE rule reflects a preference for parsimony:

  • Models with more shrinkage are more stable
  • Greater bias but lower variance
  • If two models have statistically indistinguishable performance, prefer the simpler one

Ridge vs. OLS Coefficients

Variable          Ridge (lambda.min)   OLS       Difference
(Intercept)        0.9663               1.0710   -0.1046
Median HH Income   0.0000               0.0000    0.0000
% Latino           0.0009               0.0014   -0.0005
Median Age         0.0009               0.0001    0.0009
Gini Index        -0.2734              -0.1883   -0.0851
% Democrat        -0.0208              -0.0233    0.0025
% Independent     -0.0097              -0.0109    0.0011

The Ridge coefficients are shrunk toward zero. The degree of shrinkage depends on \(\lambda\).

The Lasso Estimator

Ridge does not shrink any parameter to zero. If we want variable selection — knowing which variables have no effect — we need the Lasso (Tibshirani, 1996):

\[ b_{\text{Lasso}} = \arg\min_{b}\left[\sum_{i=1}^N\left(y_i - b_0 - \sum_j b_j x_{ji}\right)^2 + \lambda \sum_j |b_j|\right] \]

The key difference: \(|b_j|\) instead of \(b_j^2\).

The L1 penalty (\(|b_j|\)) can shrink coefficients exactly to zero, effectively performing variable selection.

Ridge vs. Lasso: The Penalty

                       Ridge (\(L_2\))                    Lasso (\(L_1\))
Penalty                \(\lambda \sum b_j^2\)             \(\lambda \sum |b_j|\)
Shrinkage              Toward zero, never exactly zero    Can shrink exactly to zero
Variable selection     No                                 Yes
When \(p > n\)         Keeps all variables                Selects at most \(n\)
Correlated predictors  Shares effect across group         Picks one, drops others
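The difference in shrinkage behavior shows up even in one dimension. Under an orthonormal design, ridge rescales the OLS coefficient while the lasso soft-thresholds it (the exact mapping to \(\lambda\) depends on normalization conventions, so these functions are illustrative):

```r
ridge_shrink   <- function(b_ols, lambda) b_ols / (1 + lambda)
soft_threshold <- function(b_ols, lambda) sign(b_ols) * pmax(abs(b_ols) - lambda, 0)

ridge_shrink(0.3, 0.5)     # 0.2: smaller, but never exactly zero
soft_threshold(0.3, 0.5)   # 0:   small coefficients are zeroed out entirely
soft_threshold(1.2, 0.5)   # 0.7: large coefficients survive, shifted toward zero
```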

Lasso Applied to Arizona Data

cv_lasso <- cv.glmnet(X_mat, y_ridge, alpha = 1, nfolds = 10)
plot(cv_lasso)
title("Lasso Regression: 10-Fold CV", line = 3)

Figure 7: 10-fold cross validation for Lasso regression

Lasso vs. OLS Coefficients

Variable          Lasso (lambda.min)  Lasso (lambda.1se)   OLS
(Intercept)        1.0440              0.7820              1.0710
Median HH Income   0.0000              0.0000              0.0000
% Latino           0.0012              0.0000              0.0014
Median Age         0.0000              0.0000              0.0001
Gini Index        -0.1554              0.0000             -0.1883
% Democrat        -0.0231             -0.0205             -0.0233
% Independent     -0.0103             -0.0052             -0.0109

Notice: at lambda.1se, the Lasso shrinks the coefficients on % Latino, Median Age, and the Gini Index exactly to zero — performing automatic variable selection.

The Bias-Variance Tradeoff in Regularization

As \(\lambda\) increases:

  • Bias increases: coefficients are shrunk further from their true values
  • Variance decreases: estimates become more stable across samples

At some value of \(\lambda\), the MSE of Ridge/Lasso can be less than the MSE of OLS — even though they are biased.

Cross-validation helps us find the \(\lambda\) that optimizes this tradeoff.

When to Use Regularization

Use Ridge/Lasso when:

  • Your goal is prediction rather than unbiased inference
  • You face severe multicollinearity (VIF \(> 10\))
  • You have many predictors relative to sample size
  • You want to regularize model complexity

But remember: These are biased estimators. If your goal is causal inference with unbiased coefficient estimates, OLS may still be preferred — even with large standard errors.

Part IX: Omitted Variable Bias — Formally

Why dropping variables is usually worse than living with multicollinearity.

The Setup

Assume the true model is:

\[Y_i = \beta_1 + \beta_2 X_{1i} + \beta_3 X_{2i} + \epsilon_i\]

But due to collinearity, you estimate:

\[Y_i = \alpha_1 + \alpha_2 X_{1i} + \omega_i\]

where \(\omega_i = \beta_3 X_{2i} + \epsilon_i\) — the omitted variable is absorbed into the error term.

Deriving the Bias

The bivariate slope estimate \(a_2\):

\[a_2 = \frac{\sum x_{1i} y_i}{\sum x_{1i}^2}\]

where lowercase letters denote deviations from means. Substituting the true model:

\[a_2 = \beta_2 + \beta_3 \frac{\sum x_{1i} x_{2i}}{\sum x_{1i}^2}\]

Since \(b_2 = \frac{\sum x_{1i} x_{2i}}{\sum x_{1i}^2}\) is the slope from regressing \(X_2\) on \(X_1\):

\[\boxed{a_2 = \beta_2 + \beta_3 \cdot b_2}\]

\(a_2\) = direct effect of \(X_1\) on \(Y\) plus the indirect effect of \(X_1\) on \(Y\) through \(X_2\).
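This identity is exact in sample, so we can confirm it numerically. A base-R sketch with simulated data (the coefficient values are arbitrary):

```r
set.seed(1)
n <- 500
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n)            # x2 correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

long  <- coef(lm(y ~ x1 + x2))       # full specification
short <- coef(lm(y ~ x1))            # omits x2
aux   <- coef(lm(x2 ~ x1))           # auxiliary regression of x2 on x1

# a2 = beta2 + beta3 * b2, exactly
stopifnot(isTRUE(all.equal(
  unname(short["x1"]),
  unname(long["x1"] + long["x2"] * aux["x1"])
)))
```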

The Reduced Form

Think of it this way. The true equation:

\[Y = b_0 + b_1 X_1 + \gamma_1 X_2 + e\]

And the auxiliary equation:

\[X_2 = a_0 + a_1 X_1 + u\]

Substitute:

\[Y = b_0 + b_1 X_1 + \gamma_1(a_0 + a_1 X_1 + u) + e\]

\[Y = \underbrace{(b_0 + \gamma_1 a_0)}_{\text{new intercept}} + \underbrace{(b_1 + \gamma_1 a_1)}_{\text{biased slope}} X_1 + \underbrace{(e + \gamma_1 u)}_{\text{new error}}\]

In Matrix Form

\[\mathbf{b} = (\mathbf{X^TX})^{-1}\mathbf{X^Ty}\]

Substituting \(\mathbf{y} = \mathbf{X\beta + Z\gamma + u}\) (where \(\mathbf{Z}\) contains the omitted variables):

\[E(\mathbf{b}) = \boldsymbol{\beta} + \underbrace{(\mathbf{X^TX})^{-1}\mathbf{X^TZ\gamma}}_{\text{Omitted Variable Bias}}\]

The bias is zero only if:

  • \(\gamma = 0\) (the omitted variable doesn’t affect \(Y\)), or
  • \(\mathbf{X^TZ} = 0\) (the included and omitted variables are uncorrelated)

The Takeaway

If you exclude a relevant variable that is correlated with your included variables, your estimator is biased. You’ve traded a precision problem (large variance) for a bias problem (wrong on average). Don’t drop relevant variables because of collinearity.

Part X: Summary

Putting it all together.

Summary Table

Feature                  With Multicollinearity
Bias of OLS              None — still unbiased
Variance of OLS          Inflated (larger SEs)
BLUE property            Still BLUE
Individual \(t\)-tests   Low power, often non-significant
Overall \(F\)-test       May still be significant
\(R^2\)                  Unaffected
Coefficient stability    Sensitive to small data changes
Primary diagnostic       VIF \(> 4\) or \(\sqrt{\text{VIF}} > 2\)

What Should You Do?

  1. Diagnose: Check VIFs. Examine the correlation matrix. Look for the classic pattern (large \(R^2\), significant \(F\), non-significant \(t\)’s).
  2. Report honestly: Report VIFs alongside your results. Acknowledge that multicollinearity limits your ability to distinguish effects.
  3. Don’t drop variables: An unbiased estimate with a large SE beats a biased estimate with a small SE. Theory should drive specification, not collinearity.
  4. Consider regularization: If prediction is your goal and collinearity is severe, Ridge or Lasso may reduce MSE at the cost of some bias.
  5. Consider your research design: Randomization breaks collinearity. If your observational data suffer from severe collinearity, think about what kind of data would allow you to separate the effects.

The Bottom Line

One More Time

The worst mistake you can make is to drop theoretically important variables to “fix” a problem that is really just the data telling you it doesn’t contain enough unique information to separate the effects. Report VIFs, discuss the implications, and resist the urge to let statistical convenience override substantive, theoretically informed reasoning.

Discussion Questions

  1. You run a model and find VIF = 8 for two predictors. What do you do? What do you not do?
  1. Your \(R^2 = 0.85\) and the \(F\)-test is significant at \(p < 0.001\), but no individual predictor is significant. What’s happening?
  1. A reviewer tells you to “fix the multicollinearity” in your model. How do you respond?
  1. When is it appropriate to drop a correlated variable? (Hint: think about theory vs. data.)
  1. A colleague argues that Ridge regression “solves” multicollinearity. Is this correct? What nuance is missing?

References

  • Hastie, T., Tibshirani, R., & Friedman, J. (2013). The Elements of Statistical Learning. Springer.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. JRSS Series B, 58(1), 267–288.
  • Greene, W. H. (2018). Econometric Analysis (8th ed.). Pearson.
  • Wooldridge, J. M. (2020). Introductory Econometrics: A Modern Approach (7th ed.). Cengage.