Taming the Noise with Regularization and Hyperparameter Tuning
Keywords: Ridge, Lasso, Elastic-net
Introduction
In the last part, we built a linear regression model to predict stock returns. We discovered a fundamental truth: simple models often underfit (fail to capture the signal), while complex models (like high-degree polynomials) overfit (memorize noise).
In Finance, the “signal-to-noise” ratio is incredibly low. To build robust trading strategies, we need models that are complex enough to learn patterns but constrained enough to ignore random market fluctuations.
This tutorial introduces Regularization: the mathematical art of applying “brakes” to your model’s learning process to prevent overfitting. We will upgrade our Momentum strategy using Ridge, Lasso, and ElasticNet regressions.
Learning Objectives
By the end of this tutorial, you will be able to:
- Explain the mathematical difference between Ridge (L2), Lasso (L1), and ElasticNet regularization.
- Apply these techniques using scikit-learn to improve out-of-sample performance.
- Perform Hyperparameter Tuning using Grid Search with time-series aware validation (TimeSeriesSplit).
- Visualize how Lasso regression performs automatic feature selection by zeroing out useless signals.
Prerequisites
- Completion of Part 1: You need the dataset and feature engineering logic from the previous article.
- Libraries: scikit-learn, pandas, numpy, matplotlib, yfinance.
Core Concepts
1. The Cost Function & The Penalty
In standard Linear Regression (Ordinary Least Squares – OLS), the model minimizes the Mean Squared Error (MSE):
$$\text{Minimize} \sum (\text{Actual} - \text{Predicted})^2$$
Regularization adds a Penalty Term to this equation. The model must now minimize both the error and the size of its coefficients (weights).
2. Ridge Regression (L2 Norm)
Ridge adds a penalty equal to the square of the coefficients:
$$\text{Cost} = \text{MSE} + \alpha \sum \theta_i^2$$
- Effect: It shrinks all coefficients towards zero but rarely reaches exactly zero.
- Use Case: Good when all features have small effects (e.g., many correlated technical indicators).
- Hyperparameter (\(\alpha\)): Controls strength. If \(\alpha=0\), it’s just OLS. If \(\alpha\) is huge, the model becomes a flat line.
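To see the effect of \(\alpha\) directly, here is a minimal sketch on synthetic data (the data and alpha values are illustrative, not recommendations): as \(\alpha\) grows, the overall size of the coefficients shrinks toward zero.

```python
# Sketch: how Ridge's alpha controls coefficient shrinkage (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)

for alpha in [1e-6, 1.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # Larger alpha -> smaller coefficient norm
    print(f"alpha={alpha:>8}: coefficient norm = {np.linalg.norm(coef):.4f}")
```

With a tiny alpha the fit is essentially OLS; with a huge alpha the coefficients collapse toward zero and predictions flatten toward the intercept.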
3. Lasso Regression (L1 Norm)
Lasso adds a penalty equal to the absolute value of the coefficients:
$$\text{Cost} = \text{MSE} + \alpha \sum |\theta_i|$$
- Effect: It can force coefficients to be exactly zero.
- Use Case: Feature Selection. If you have 100 indicators but only 3 matter, Lasso will automatically delete the other 97 by setting their weights to 0.
4. ElasticNet
A hybrid that combines both L1 and L2 penalties. It is generally the safest bet for financial time series.
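In the same simplified notation as the Ridge and Lasso costs above, the blended objective can be written as (note that scikit-learn's actual implementation also scales the MSE term by \(1/(2n)\)):

$$\text{Cost} = \text{MSE} + \alpha \cdot \text{l1\_ratio} \sum |\theta_i| + \frac{\alpha \,(1 - \text{l1\_ratio})}{2} \sum \theta_i^2$$

Here `l1_ratio = 1.0` recovers Lasso and `l1_ratio = 0.0` recovers Ridge.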
The Hands-On Practice
Step 1: Data Preparation (Recap)
We will quickly recreate the dataset from Part 1, but this time we will generate more features (lags 1 to 10) to increase the risk of overfitting, making regularization more necessary.
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
# 1. Fetch Data
df = yf.download('SPY', start='2010-01-01', end='2026-01-01')
df['Return'] = df['Close'].pct_change()
df.dropna(inplace=True)
# 2. Engineer More Features (Lags 1 through 10)
# More features = higher risk of overfitting = better test for Regularization
lags = 10
feature_cols = []
for i in range(1, lags + 1):
    col_name = f'Lag_{i}'
    df[col_name] = df['Return'].shift(i)
    feature_cols.append(col_name)
df.dropna(inplace=True)
X = df[feature_cols]
y = df['Return']
# 3. Split Data (Chronological Split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(f"Training Features: {X_train.shape[1]} inputs (Lags 1-10)")
Step 2: Ridge Regression (Manual Tuning)
Let’s try Ridge regression with an arbitrary alpha value to see it in action.
# Initialize Ridge with a specific alpha (regularization strength)
# Alpha = 1.0 is a standard starting point
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
# Predict
y_pred_ridge = ridge_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Ridge MSE (alpha=1.0): {mse_ridge:.8f}")
print("Ridge Coefficients:", ridge_model.coef_)
Step 3: Lasso Regression (Feature Selection)
Now observe Lasso. Pay attention to how many coefficients become exactly 0.0.
# Initialize Lasso
# We assume a small alpha because returns are small numbers
lasso_model = Lasso(alpha=0.0001)
lasso_model.fit(X_train, y_train)
# Check Coefficients
print("\nLasso Coefficients:")
for feature, coef in zip(feature_cols, lasso_model.coef_):
    print(f"{feature}: {coef:.6f}")
# Count non-zero features
n_selected = np.sum(lasso_model.coef_ != 0)
print(f"\nLasso selected {n_selected} out of {lags} features.")
Note: If Lasso sets all coefficients to zero, your alpha is too high. If it keeps all of them, alpha is too low.
Step 4: Hyperparameter Tuning with Grid Search
Guessing alpha is inefficient. We use GridSearchCV to test many values automatically.
Crucially, for finance, we use TimeSeriesSplit instead of random K-Fold validation to avoid look-ahead bias.
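To see why TimeSeriesSplit is safe, you can inspect its fold boundaries on a toy index (a minimal sketch; the 20-observation array is just a stand-in for chronological data): every validation fold lies strictly after its training window.

```python
# Sketch: TimeSeriesSplit always validates on data AFTER the training window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(20)  # stand-in for 20 chronological observations
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(toy)):
    print(f"Fold {fold}: train 0-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```

Each successive fold expands the training window forward in time, mimicking how a live strategy only ever trains on the past.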
# Define the TimeSeriesSplit (Cross-Validation compatible with time)
# This splits training data into expanding windows
tscv = TimeSeriesSplit(n_splits=5)
# Define the model
elastic = ElasticNet()
# Define the grid of parameters to test
# alpha: regularization strength
# l1_ratio: mix between Lasso (1.0) and Ridge (0.0)
param_grid = {
    'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.5, 0.9]
}
# Setup Grid Search
grid_search = GridSearchCV(
    estimator=elastic,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    verbose=1
)
# Fit on Training Data (GridSearch handles the internal validation splits)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Internal Score (Negative MSE): {grid_search.best_score_}")
Step 5: Final Evaluation
Take the “best” model found by Grid Search and evaluate it on the hold-out Test set.
# Get the best model
best_model = grid_search.best_estimator_
# Predict on Test Set
y_pred_best = best_model.predict(X_test)
final_mse = mean_squared_error(y_test, y_pred_best)
print(f"Final Optimized ElasticNet MSE: {final_mse:.8f}")
# Visualization check
plt.figure(figsize=(10, 5))
plt.plot(y_test.values[:50], label='Actual', alpha=0.7)
plt.plot(y_pred_best[:50], label='Predicted (ElasticNet)', alpha=0.7)
plt.legend()
plt.title("Optimized ElasticNet Predictions vs Actual")
plt.show()

Check Your Work
- Lasso Zeros: In Step 3, ensure Lasso produced at least one coefficient that is exactly 0.0.
- Grid Search Output: Ensure GridSearchCV printed “Best Parameters”.
- Coefficient Magnitude: Compare the coefficients of the OLS model from Part 1 with your new Ridge/ElasticNet coefficients. The regularized ones should be smaller (closer to zero).
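The coefficient comparison can be sketched as follows. The synthetic data below stands in for the lag features; swap in your own X_train / y_train from Step 1 and the OLS model from Part 1.

```python
# Sketch: regularization shrinks coefficients relative to plain OLS.
# Synthetic lag-like features here; substitute your real X_train / y_train.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X_train = rng.normal(scale=0.01, size=(500, 10))  # 10 lag-like features
y_train = 0.1 * X_train[:, 0] + rng.normal(scale=0.01, size=500)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

Note how aggressive the shrinkage is when features are tiny (daily returns are on the order of 0.01): the penalty's strength is relative to the scale of your data, which is why alpha must be tuned rather than guessed.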
Challenge: The Alpha Curve
Write a loop that tests Lasso with alpha values ranging from 1e-6 to 1e-2. Store the number of non-zero coefficients for each alpha. Plot Alpha (x-axis) vs Number of Selected Features (y-axis). This visualizes how increasing regularization aggressively removes features.
Conclusion & Next Steps
We have successfully “tamed” our model. By using Regularization, we reduced the risk of overfitting to noise. We also implemented GridSearch with TimeSeriesSplit, a professional-grade workflow for tuning financial models.
However, we are still predicting a continuous number (Return). In trading, we often care more about the direction (Up or Down) than the exact magnitude.
Next Steps: In Part 3, we will shift gears from Regression to Classification. We will use Logistic Regression to predict the probability of the market going up, and we will learn how to evaluate success using accuracy, precision, and the Confusion Matrix.
Troubleshooting / FAQ
Q: My Grid Search says the best alpha is the smallest one I provided.
A: This suggests your model is underfitting and wants less regularization. Try adding smaller values to your grid (e.g., 1e-6, 1e-7) or adding more complex features (like polynomial features) to justify the need for regularization.
Q: Why neg_mean_squared_error?
A: scikit-learn optimization always tries to maximize a score. Since MSE is an error (which we want to minimize), sklearn flips the sign to negative so that “maximizing the negative error” is mathematically the same as “minimizing the error.”
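You can verify the sign flip yourself with scikit-learn's `get_scorer` (a minimal sketch; the toy data is arbitrary):

```python
# Sketch: the 'neg_mean_squared_error' scorer is literally -MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + np.array([0.1, -0.1] * 5)  # line plus alternating noise

model = LinearRegression().fit(X, y)
scorer = get_scorer("neg_mean_squared_error")
print(scorer(model, X, y))                        # a negative number
print(-mean_squared_error(y, model.predict(X)))   # the same value
```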
Q: Can I use KFold instead of TimeSeriesSplit?
A: In generic ML, yes. In Finance, NO. Standard K-Fold mixes time periods across folds, meaning you could validate on 2020 data with a model trained partly on 2021 data. This destroys the causal ordering of time and leads to unrealistically optimistic performance estimates.

