Conquering Overfitting: The Power of Penalized Regression

Financial Econometrics: Part 06

Welcome to the next article in our advanced regression series!

So far, our model of the DXY index has used only six predictors. But in the real world of finance, we’re not so lucky. We often face a situation with too much information. We might have 50, 100, or even 1,000 potential explanatory variables. This is known as high-dimensional data.

When faced with this “curse of dimensionality,” our trusty OLS model fails spectacularly. It develops an ego, “memorizing” every quirk and noise in our data instead of learning the true, generalizable pattern. This dangerous problem is called overfitting.

In this article, we’ll explore this “overfitting trap” and the powerful solution: Penalized Regression (also known as Regularization). We’ll demystify the two most famous methods, Ridge Regression (L2) and Lasso Regression (L1), and learn how they “shrink” our models to make them more stable, more predictive, and (in the case of Lasso) even perform automatic feature selection.

The Overfitting Trap and the Bias-Variance Tradeoff

To understand the problem, you need to know about the Bias-Variance Tradeoff, a central concept in all of statistics and machine learning.

Every model’s prediction error can be decomposed into three parts:

Bias: The error from “wrong assumptions.” A simple model (like a straight line) forced onto a complex, curvy dataset will have high bias. It’s underfitting the data.
Variance: The error from “sensitivity to noise.” A highly complex, flexible model (like a high-order polynomial) will have high variance. It will “memorize” the noise in the training dataset. This is overfitting.
Irreducible Error: The natural, random noise in the data that no model can ever capture.
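
For readers who like to see it in symbols, the decomposition for squared-error loss can be sketched as follows. The notation is an assumption introduced here, not something from earlier articles in the series: $Y = f(x) + \varepsilon$ with $E[\varepsilon] = 0$ and $Var(\varepsilon) = \sigma^2$, and $\hat{f}(x)$ is the model’s prediction at a point $x$, with expectations taken over different training samples.

$$E\left[(Y - \hat{f}(x))^2\right] = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$

Note that bias enters squared, and that only the first two terms depend on the model we choose.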

OLS is a low-bias, high-variance estimator. It makes few assumptions, but it loves to fit itself to the data, noise and all. When you have few predictors (like our 6-variable model), this is usually fine.

But when you have many predictors (e.g., 50 variables for 100 observations), OLS loses its mind. It will use all 50 variables to create an incredibly complex model that “explains” the training data remarkably well (low bias), but much of that fit is just random noise. The moment you show this model new data (a testing dataset), its predictions fall apart (high variance).

This is overfitting. The model is a “memorizer,” not a “learner.” Think of it as a student who memorizes the exact answers to the practice test but fails the real exam because they never learned the concepts.
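
A quick way to see the memorizer in action is a small simulation. This is a minimal sketch on synthetic data, not the DXY dataset: the sample sizes, the seed, and the single “real” predictor are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Assumed setup: 50 predictors for 100 training observations.
n_train, n_test, n_predictors = 100, 1000, 50

# Only the first predictor carries signal; the other 49 are pure noise.
X_train = rng.standard_normal((n_train, n_predictors))
X_test = rng.standard_normal((n_test, n_predictors))
y_train = X_train[:, 0] + rng.standard_normal(n_train)
y_test = X_test[:, 0] + rng.standard_normal(n_test)

ols = LinearRegression().fit(X_train, y_train)

# R^2 on the data the model has seen vs. data it has never seen.
r2_train = ols.score(X_train, y_train)
r2_test = ols.score(X_test, y_test)

print(f"Train R^2: {r2_train:.3f}")
print(f"Test  R^2: {r2_test:.3f}")
```

The gap between the two R² values is the overfitting: the in-sample fit is flattered by the 49 noise columns, and that flattery does not carry over to new data.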

The Solution: A “Penalty” on Complexity

If OLS is too “flexible” and “obsessed” with minimizing errors, how do we rein it in?

We change its objective function.

OLS Objective: $\text{Minimize} \sum (Y_i - \hat{Y}_i)^2$

(Goal: Minimize the sum of squared residuals (SSR). That’s it.)

Penalized Objective: $\text{Minimize} \left[ \sum (Y_i - \hat{Y}_i)^2 \right] + \text{Penalty Term}$

(Goal: Minimize SSR… BUT also keep the penalty term small.)

This is a brilliant tug-of-war. The Penalty Term is a “budget” on model complexity. The model now has two jobs: fit the data (minimize SSR) and stay simple (minimize the penalty). The model is now forced to make a tradeoff: it will intentionally accept a slightly worse fit to the data (a tiny bit more bias) in exchange for being much simpler and more generalizable (a massive reduction in variance).

This process is called regularization or shrinkage, because the penalty term “shrinks” the coefficients. The strength of this penalty is controlled by a “tuning knob” called alpha ($\alpha$) (or lambda, $\lambda$).

There are two primary ways to define this penalty term: Ridge and Lasso.

1. Ridge Regression (L2 Regularization)

Ridge regression adds a penalty on the sum of the squares of the coefficients. This is the squared L2 norm of the coefficient vector.

Ridge Objective: $\text{Minimize} \left[ \sum (Y_i - \hat{Y}_i)^2 \right] + \alpha \sum_{j=1}^{p} \beta_j^2$

How it works: Ridge hates large coefficients. If OLS wants to make $\beta_{METALS} = 0.5$ and $\beta_{OIL} = 0.8$, Ridge says, “Whoa, that 0.8 is too big. Its squared penalty is 0.64, which is much higher than 0.5’s penalty of 0.25. I’m going to shrink that 0.8 down.”

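Key Property: Ridge shrinks all coefficients towards zero, but it does not make them exactly zero. It’s like a manager who cuts everyone’s budget by 10% (the cut is exactly proportional only when the predictors are uncorrelated; with correlated predictors, some coefficients shrink more than others). It reduces the overall “noise” and stabilizes the estimates under multicollinearity, but doesn’t eliminate any single variable.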

Best for: Situations where you believe many of your predictors all have a small but non-zero effect on Y.
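
To watch the shrinkage happen, here is a minimal sketch using scikit-learn’s `Ridge` on synthetic data. The true coefficients (0.8 and 0.5, echoing the example above), the noise level, and the alpha values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Assumed toy data: two real effects and three irrelevant predictors.
X = rng.standard_normal((200, 5))
true_beta = np.array([0.8, 0.5, 0.0, 0.0, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(200)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []               # overall size of the coefficient vector per alpha
any_exact_zero = False   # does Ridge ever produce an exact zero?

for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))
    any_exact_zero = any_exact_zero or bool(np.any(coef == 0.0))
    print(f"alpha={alpha:>8}: coef={np.round(coef, 4)}")

print("Coefficient norms:", np.round(norms, 4))
print("Any coefficient exactly zero?", any_exact_zero)
```

As alpha grows, the whole coefficient vector gets smaller, yet every variable stays in the model with a small non-zero weight.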

2. Lasso Regression (L1 Regularization)

Lasso regression adds a penalty on the sum of the absolute values of the coefficients. This is the L1-Norm.

Lasso Objective: $\text{Minimize} \left[ \sum (Y_i - \hat{Y}_i)^2 \right] + \alpha \sum_{j=1}^{p} |\beta_j|$

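How it works: This seems like a small change, but the effect is profound. The squared penalty flattens out near zero, so Ridge has less and less reason to keep shrinking a coefficient that is already small. The absolute-value penalty has a “sharp corner” at zero: its slope stays the same no matter how small the coefficient gets, so a weak predictor can be pushed all the way to zero.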

Key Property: Lasso is capable of shrinking useless coefficients all the way to exactly zero. It performs automatic feature selection. It’s like a manager who fires the 20% most useless variables (sets their $\beta$ to 0) and keeps the rest.

Best for: Situations where you believe only a few of your predictors are truly important (a “sparse” model), and you want the model to automatically find and discard the “noise” variables.
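
The difference between the two penalties is easy to check numerically. The sketch below is an illustration on synthetic data with assumed values (two real predictors, eight noise predictors, hand-picked alphas). One caveat: scikit-learn’s `Lasso` scales the squared-error term by 1/(2n), so its `alpha` is not numerically the same as the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)

# Assumed toy data: 2 signal predictors and 8 pure-noise predictors.
n = 300
X = rng.standard_normal((n, 10))
true_beta = np.zeros(10)
true_beta[:2] = [1.0, -0.7]
y = X @ true_beta + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Count coefficients that are exactly 0.0, not merely small.
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))

print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print(f"Exact zeros -> Lasso: {lasso_zeros}, Ridge: {ridge_zeros}")
```

Lasso keeps the two signal predictors (shrunk somewhat) and drops the noise columns entirely, while Ridge leaves every column with a small non-zero coefficient.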

The “How”: Finding the Best $\alpha$ with Cross-Validation

This is all great, but how do we pick the penalty strength, $\alpha$?

If $\alpha=0$, we just get OLS (high variance).
As $\alpha \to \infty$, all coefficients are shrunk to zero (high bias).

We need to find the “Goldilocks” $\alpha$. We do this with K-Fold Cross-Validation (CV).

Instead of just one training/testing split, we create K (e.g., 10) of them:

Split Data: Take your training data and split it into 10 equal “folds”.
Iteration 1: Hold out Fold 1 as a validation set. Train your model (e.g., Lasso with $\alpha=0.1$) on Folds 2-10. Then, test its error (Mean Squared Error) on the held-out Fold 1.
Iteration 2: Hold out Fold 2. Train on Folds 1, 3-10. Test on Fold 2.
…repeat 10 times, with each fold getting one turn as the validation set.
Average: You now have 10 error scores for $\alpha=0.1$. Average them to get a robust estimate of that $\alpha$’s performance.
Find the Best: Repeat this entire 10-fold process for a whole grid of $\alpha$ values (e.g., 0.001, 0.01, 0.1, 1, 10).
Final Choice: Select the $\alpha$ that gave you the lowest average error across the K-folds.

This process prevents us from overfitting to our test set and gives us the $\alpha$ that is “most likely to perform well on new, unseen data.”
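
The steps above can be written out by hand in a few lines. This is a sketch of the procedure, not what `LassoCV` does internally: the helper name `cv_mse_for_alpha`, the synthetic data, and the choice to shuffle the folds are all assumptions. For time-ordered financial data, an order-preserving splitter such as scikit-learn’s `TimeSeriesSplit` may be a better fit.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cv_mse_for_alpha(X, y, alpha, n_splits=10, seed=0):
    """Average validation MSE of Lasso(alpha) across K folds (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_errors = []
    for train_idx, val_idx in kf.split(X):
        # Train on K-1 folds, then score on the single held-out fold.
        model = Lasso(alpha=alpha, max_iter=10_000)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])
        fold_errors.append(mean_squared_error(y[val_idx], preds))
    return float(np.mean(fold_errors))


# Assumed toy data: 2 signal predictors hidden among 20 columns.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 20))
y = X[:, 0] - 0.5 * X[:, 1] + 0.5 * rng.standard_normal(200)

# Repeat the whole K-fold process for every alpha in the grid.
alpha_grid = [0.001, 0.01, 0.1, 1, 10]
cv_scores = {alpha: cv_mse_for_alpha(X, y, alpha) for alpha in alpha_grid}

# Final choice: the alpha with the lowest average validation error.
best_alpha = min(cv_scores, key=cv_scores.get)

for alpha, score in cv_scores.items():
    print(f"alpha={alpha:<6} mean CV MSE={score:.4f}")
print("Best alpha:", best_alpha)
```

In practice you rarely need to write this loop yourself, but seeing it once makes the “CV” in `RidgeCV` and `LassoCV` much less mysterious.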

Practical Application: The DXY Model vs. “Noise”

Let’s put this to the test. Our 6-variable DXY model isn’t “high-dimensional,” so OLS is probably fine. To simulate the overfitting problem, we’re going to add 14 new columns of pure, random noise (N1 through N14).

Our new dataset has 20 predictors: 6 “signal” and 14 “noise.”

An OLS model would be a disaster. It would try its best to fit all 20 variables, “finding” relationships with the noise variables by pure chance. It would be horribly overfit.
A Ridge model would be better. It would shrink all 20 coefficients, especially the 14 useless ones, but it would still keep all 14 noise variables in the model.
A Lasso model is the hero we need. Its superpower is feature selection. We hope it’s smart enough to figure out that N1 through N14 are useless and set their coefficients to zero.

The Python Code Walkthrough

We will use the scikit-learn library, the gold standard for machine learning in Python.

Crucial First Step: Scaling!

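Penalized regression “punishes” coefficients based on their size, and a coefficient’s size depends on the units of its variable. If EURUSD is in percent (values around 0.01) and OIL is in dollars (values around 50.0), EURUSD needs a far larger coefficient than OIL to have the same effect on Y. The model will therefore unfairly punish the EURUSD coefficient simply because of its units.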

You MUST scale your data first. We will use StandardScaler to give every variable a mean of 0 and a standard deviation of 1.
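
To see how units interact with the penalty, consider this minimal sketch. It uses two synthetic predictors that matter equally for Y but are recorded in very different units; the scale factors 0.01 and 50.0, the alpha, and the seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 500

# Two underlying drivers with the same true effect on y.
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
y = z1 + z2 + 0.1 * rng.standard_normal(n)

# Same information, very different units.
X_raw = np.column_stack([0.01 * z1, 50.0 * z2])

# StandardScaler gives each column mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X_raw)

ridge_raw = Ridge(alpha=1.0).fit(X_raw, y)
ridge_scaled = Ridge(alpha=1.0).fit(X_scaled, y)

# Effect on y of a one-standard-deviation move in each predictor.
effect_raw = ridge_raw.coef_ * X_raw.std(axis=0)
effect_scaled = ridge_scaled.coef_ * X_scaled.std(axis=0)

print("Per-std effect, raw units:   ", np.round(effect_raw, 3))
print("Per-std effect, standardized:", np.round(effect_scaled, 3))
```

On the raw data, the predictor recorded in tiny units needs a huge coefficient, so the penalty suppresses it almost entirely. After standardizing, both predictors are treated even-handedly.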


import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, LassoCV, LinearRegression

# 1. LOAD AND PREPARE THE DATA (Same as last article)

print("--- 1. Loading Data ---")
data = pd.read_csv('M2_data.csv')
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')

dependent_var = 'DXY'
independent_vars = ['METALS', 'OIL', 'US_STK', 'INTL_STK', 'X10Y_TBY', 'EURUSD']
data_clean = data[[dependent_var] + independent_vars].dropna()

Y = data_clean[dependent_var]
X_signal = data_clean[independent_vars]

print(f"Data loaded. {len(X_signal)} observations, {X_signal.shape[1]} 'signal' predictors.")
print("-" * 30 + "\n")


# 2. CREATE THE "HIGH-DIMENSIONAL" DATASET WITH NOISE

print("--- 2. Injecting 14 'Noise' Variables ---")
np.random.seed(42) # For reproducible results
noise_data = pd.DataFrame(
    np.random.randn(len(X_signal), 14),
    index=X_signal.index,
    columns=[f'N{i+1}' for i in range(14)]
)

# Our new 20-predictor dataset
X_full = pd.concat([X_signal, noise_data], axis=1)

print(f"New dataset has {X_full.shape[1]} total predictors (6 signal + 14 noise).")
print("-" * 30 + "\n")


# 3. SCALE THE DATA (CRITICAL STEP!)

print("--- 3. Scaling Data (StandardScaler) ---")
# Note: In a real project, you would 'fit' the scaler
# ONLY on the training data. For this demo, we'll scale it all.
scaler = StandardScaler()

# StandardScaler returns a numpy array, so we'll put it back in a DataFrame
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_full),
    columns=X_full.columns,
    index=X_full.index
)
# We don't need to scale Y, but it's good practice for it to be 0-mean
Y_scaled = Y - Y.mean()

print("X and Y variables are scaled and centered.")
print("-" * 30 + "\n")


# 4. RUN AND COMPARE OLS, RIDGECV, AND LASSOCV

print("--- 4. Running Models ---")

# Define our list of alphas for the models to search over
# (100 values from 10^-3 to 10^1)
alphas_to_try = np.logspace(-3, 1, 100)

# --- Model 1: OLS (using scikit-learn for comparison) ---
# Note: OLS on the scaled data
ols_model = LinearRegression()
ols_model.fit(X_scaled, Y_scaled)


# --- Model 2: RidgeCV ---
# With the default cv=None, RidgeCV uses an efficient form of
# leave-one-out cross-validation to find the best alpha from our list.
# (Pass cv=10 for 10-fold CV, as LassoCV does below.)
ridge_cv_model = RidgeCV(alphas=alphas_to_try)
ridge_cv_model.fit(X_scaled, Y_scaled)
best_ridge_alpha = ridge_cv_model.alpha_


# --- Model 3: LassoCV ---
# LassoCV also finds the best alpha using CV.
# We add 'max_iter=10000' to ensure it finds a solution.
lasso_cv_model = LassoCV(alphas=alphas_to_try, cv=10, max_iter=10000)
lasso_cv_model.fit(X_scaled, Y_scaled)
best_lasso_alpha = lasso_cv_model.alpha_

print(f"Models fit. Best Ridge Alpha: {best_ridge_alpha:.4f}")
print(f"Models fit. Best Lasso Alpha: {best_lasso_alpha:.4f}")
print("-" * 30 + "\n")


# 5. COMPARE THE FINAL COEFFICIENTS

print("--- 5. Final Coefficient Comparison ---")
# Create a DataFrame to hold all coefficients
coef_df = pd.DataFrame(index=X_full.columns)
coef_df['OLS_Coef'] = ols_model.coef_
coef_df['Ridge_Coef'] = ridge_cv_model.coef_
coef_df['Lasso_Coef'] = lasso_cv_model.coef_

# Let's see how many coefficients Lasso set to zero
lasso_zero_count = (coef_df['Lasso_Coef'].abs() < 1e-6).sum()

# Note: DataFrame.to_markdown() requires the optional 'tabulate' package
print(coef_df.to_markdown(floatfmt=".6f"))
print("\n--- ANALYSIS ---")
print(f"Lasso set {lasso_zero_count} out of {len(coef_df)} coefficients to zero.")
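
The walkthrough above compares coefficients on the full sample. If you also want an out-of-sample check, one possible approach is sketched below. It is not part of the original walkthrough: the helper name `holdout_mse`, the 25% hold-out size, and the synthetic demo data are assumptions. Wrapping each model in a pipeline means the scaler is fit on the training rows only, which is the “real project” practice mentioned in the scaling step.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def holdout_mse(X, y, test_size=0.25):
    """Fit OLS, RidgeCV and LassoCV on the first part of the sample and
    return each model's mean squared error on the held-out last part."""
    # shuffle=False keeps the original row order, so the test set is the
    # final slice of the data (an assumption suited to time series).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, shuffle=False
    )
    alphas = np.logspace(-3, 1, 100)
    models = {
        "OLS": LinearRegression(),
        "Ridge": RidgeCV(alphas=alphas),
        "Lasso": LassoCV(alphas=alphas, cv=10, max_iter=10_000),
    }
    results = {}
    for name, model in models.items():
        # The pipeline fits the scaler on the training rows only.
        pipeline = make_pipeline(StandardScaler(), model)
        pipeline.fit(X_train, y_train)
        results[name] = mean_squared_error(y_test, pipeline.predict(X_test))
    return results


# Synthetic stand-in for the article's data: 6 signal + 14 noise columns.
rng = np.random.default_rng(5)
X_demo = rng.standard_normal((200, 20))
signal_beta = np.array([1.0, -0.8, 0.6, 0.5, -0.4, 0.3])
y_demo = X_demo[:, :6] @ signal_beta + rng.standard_normal(200)

holdout_results = holdout_mse(X_demo, y_demo)
for name, mse in holdout_results.items():
    print(f"{name:<6} test MSE: {mse:.4f}")
```

With the article’s data already loaded, the same helper could be called as `holdout_mse(X_full, Y)`; which model comes out ahead will depend on the sample.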

Interpreting the Results: The “Aha!” Moment

When you run this code, the final output table is the entire story. Here’s a summary of what you’ll see:

[Table: coefficients from the OLS, Ridge, and Lasso models shown side by side for METALS, OIL, US_STK, INTL_STK, X10Y_TBY, EURUSD, and the 14 noise variables N1 to N14.]

Look at this! It’s a stunning success!

OLS: Is a mess. It has assigned coefficients to all 20 variables. It’s hopelessly overfit, “finding” patterns in the N1 to N14 noise.
Ridge: Is better. It has shrunk all 20 coefficients. Notice how Ridge_Coef is always a little smaller in magnitude than OLS_Coef. This is shrinkage in action. But it still keeps all 14 noise variables.
Lasso: Is the clear winner. It correctly identified most of the noise variables (including N1, N3, N4, N6, N7, N11, N13, N14) as useless and set their coefficients to exactly zero. It also dramatically shrank the remaining noise variables to be almost zero. Meanwhile, it preserved the 6 “signal” variables, identifying them as the most important predictors.

This is the power of Lasso: it’s a regression model and a feature-selection tool in one.

Conclusion

We’ve now built an incredibly powerful and resilient modeling toolkit.

We’ve defeated heteroscedasticity (unequal noise) with WLS.
We’ve defended against outliers (tyrannical data points) with RLM.
And now, we’ve tamed the curse of dimensionality (overfitting) with Penalized Regression, using Ridge to stabilize our model and Lasso to perform automatic feature selection.

We have handled problems with our data (noise, outliers) and our model (overfitting). But we’ve overlooked one final, fundamental assumption.

All three of these advanced methods—WLS, RLM, and Penalized Regression—are still linear. They are all trying to fit a straight line (or a flat plane) to the data.

What happens if the true relationship isn’t a line at all? What if it’s a curve?

For that, we must leave the world of linear models behind and enter the flexible, powerful domain of Non-Parametric Regression. That will be the subject of our next article.