Taming the Noise with Regularization and Hyperparameter Tuning
Keywords: Ridge, Lasso, Elastic-net
Introduction
In the last part, we built a linear regression model to predict stock returns. We discovered a fundamental truth: simple models often underfit (fail to capture the signal), while complex models (like high-degree polynomials) overfit (memorize noise).
In Finance, the “signal-to-noise” ratio is incredibly low. To build robust trading strategies, we need models that are complex enough to learn patterns but constrained enough to ignore random market fluctuations.
This tutorial introduces Regularization: the mathematical art of applying “brakes” to your model’s learning process to prevent overfitting. We will upgrade our Momentum strategy using Ridge, Lasso, and ElasticNet regressions.
Learning Objectives
By the end of this tutorial, you will be able to:
- Explain the mathematical difference between Ridge (L2), Lasso (L1), and ElasticNet regularization.
- Apply these techniques using scikit-learn to improve out-of-sample performance.
- Perform Hyperparameter Tuning using Grid Search with time-series aware validation (TimeSeriesSplit).
- Visualize how Lasso regression performs automatic feature selection by zeroing out useless signals.
Prerequisites
- Completion of Part 1: You need the dataset and feature engineering logic from the previous article.
- Libraries: scikit-learn, pandas, numpy, matplotlib, yfinance.
Core Concepts
1. The Cost Function & The Penalty
In standard Linear Regression (Ordinary Least Squares – OLS), the model minimizes the Mean Squared Error (MSE):
$$\text{Minimize} \sum (\text{Actual} - \text{Predicted})^2$$
Regularization adds a Penalty Term to this equation. The model must now minimize both the error and the size of its coefficients (weights).
2. Ridge Regression (L2 Norm)
Ridge adds a penalty equal to the square of the coefficients:
$$\text{Cost} = \text{MSE} + \alpha \sum \theta_i^2$$
- Effect: It shrinks all coefficients towards zero but rarely reaches exactly zero.
- Use Case: Good when all features have small effects (e.g., many correlated technical indicators).
- Hyperparameter (\(\alpha\)): Controls strength. If \(\alpha=0\), it’s just OLS. If \(\alpha\) is huge, the model becomes a flat line.
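To see the effect of \(\alpha\) directly, here is a minimal sketch on synthetic data (the data and alpha values are illustrative, not recommendations): as \(\alpha\) grows, the overall size of the coefficients shrinks toward zero.

```python
# Sketch: how Ridge's alpha controls coefficient shrinkage (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.1, size=200)

for alpha in [1e-6, 1.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    # Larger alpha -> smaller coefficient norm
    print(f"alpha={alpha:>8}: coefficient norm = {np.linalg.norm(coef):.4f}")
```

With a tiny alpha the fit is essentially OLS; with a huge alpha the coefficients collapse toward zero and predictions flatten toward the intercept.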
3. Lasso Regression (L1 Norm)
Lasso adds a penalty equal to the absolute value of the coefficients:
$$\text{Cost} = \text{MSE} + \alpha \sum |\theta_i|$$
- Effect: It can force coefficients to be exactly zero.
- Use Case: Feature Selection. If you have 100 indicators but only 3 matter, Lasso will automatically delete the other 97 by setting their weights to 0.
4. ElasticNet
A hybrid that combines both L1 and L2 penalties. It is generally the safest bet for financial time series.
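In the same simplified notation as the Ridge and Lasso costs above, the blended objective can be written as (note that scikit-learn's actual implementation also scales the MSE term by \(1/(2n)\)):

$$\text{Cost} = \text{MSE} + \alpha \cdot \text{l1\_ratio} \sum |\theta_i| + \frac{\alpha \,(1 - \text{l1\_ratio})}{2} \sum \theta_i^2$$

Here `l1_ratio = 1.0` recovers Lasso and `l1_ratio = 0.0` recovers Ridge.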
The Hands-On Practice
Step 1: Data Preparation (Recap)
We will quickly recreate the dataset from Part 1, but this time we will generate more features (lags 1 to 10) to increase the risk of overfitting, making regularization more necessary.
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
# 1. Fetch Data
df = yf.download('SPY', start='2010-01-01', end='2026-01-01')
df['Return'] = df['Close'].pct_change()
df.dropna(inplace=True)
# 2. Engineer More Features (Lags 1 through 10)
# More features = higher risk of overfitting = better test for Regularization
lags = 10
feature_cols = []
for i in range(1, lags + 1):
    col_name = f'Lag_{i}'
    df[col_name] = df['Return'].shift(i)
    feature_cols.append(col_name)
df.dropna(inplace=True)
X = df[feature_cols]
y = df['Return']
# 3. Split Data (Chronological Split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
print(f"Training Features: {X_train.shape[1]} inputs (Lags 1-10)")
Step 2: Ridge Regression (Manual Tuning)
Let’s try Ridge regression with an arbitrary alpha value to see it in action.
# Initialize Ridge with a specific alpha (regularization strength)
# Alpha = 1.0 is a standard starting point
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train, y_train)
# Predict
y_pred_ridge = ridge_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
print(f"Ridge MSE (alpha=1.0): {mse_ridge:.8f}")
print("Ridge Coefficients:", ridge_model.coef_)
Step 3: Lasso Regression (Feature Selection)
Now observe Lasso. Pay attention to how many coefficients become exactly 0.0.
# Initialize Lasso
# We assume a small alpha because returns are small numbers
lasso_model = Lasso(alpha=0.0001)
lasso_model.fit(X_train, y_train)
# Check Coefficients
print("\nLasso Coefficients:")
for feature, coef in zip(feature_cols, lasso_model.coef_):
    print(f"{feature}: {coef:.6f}")
# Count non-zero features
n_selected = np.sum(lasso_model.coef_ != 0)
print(f"\nLasso selected {n_selected} out of {lags} features.")
Note: If Lasso sets all coefficients to zero, your alpha is too high. If it keeps all of them, alpha is too low.
Step 4: Hyperparameter Tuning with Grid Search
Guessing alpha is inefficient. We use GridSearchCV to test many values automatically.
Crucially, for finance, we use TimeSeriesSplit instead of random K-Fold validation to avoid look-ahead bias.
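To see why TimeSeriesSplit is safe, you can inspect its fold boundaries on a toy index (a minimal sketch; the 20-observation array is just a stand-in for chronological data): every validation fold lies strictly after its training window.

```python
# Sketch: TimeSeriesSplit always validates on data AFTER the training window.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

toy = np.arange(20)  # stand-in for 20 chronological observations
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(toy)):
    print(f"Fold {fold}: train 0-{train_idx[-1]}, validate {val_idx[0]}-{val_idx[-1]}")
```

Each successive fold expands the training window forward in time, mimicking how a live strategy only ever trains on the past.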
# Define the TimeSeriesSplit (Cross-Validation compatible with time)
# This splits training data into expanding windows
tscv = TimeSeriesSplit(n_splits=5)
# Define the model
elastic = ElasticNet()
# Define the grid of parameters to test
# alpha: regularization strength
# l1_ratio: mix between Lasso (1.0) and Ridge (0.0)
param_grid = {
    'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0],
    'l1_ratio': [0.1, 0.5, 0.9]
}
# Setup Grid Search
grid_search = GridSearchCV(
    estimator=elastic,
    param_grid=param_grid,
    cv=tscv,
    scoring='neg_mean_squared_error',
    verbose=1
)
# Fit on Training Data (GridSearch handles the internal validation splits)
grid_search.fit(X_train, y_train)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Internal Score (Negative MSE): {grid_search.best_score_}")
Step 5: Final Evaluation
Take the “best” model found by Grid Search and evaluate it on the hold-out Test set.
# Get the best model
best_model = grid_search.best_estimator_
# Predict on Test Set
y_pred_best = best_model.predict(X_test)
final_mse = mean_squared_error(y_test, y_pred_best)
print(f"Final Optimized ElasticNet MSE: {final_mse:.8f}")
# Visualization check
plt.figure(figsize=(10, 5))
plt.plot(y_test.values[:50], label='Actual', alpha=0.7)
plt.plot(y_pred_best[:50], label='Predicted (ElasticNet)', alpha=0.7)
plt.legend()
plt.title("Optimized ElasticNet Predictions vs Actual")
plt.show()

Check Your Work
- Lasso Zeros: In Step 3, ensure Lasso produced at least one coefficient that is exactly 0.0.
- Grid Search Output: Ensure GridSearchCV printed “Best Parameters”.
- Coefficient Magnitude: Compare the coefficients of the OLS model from Part 1 with your new Ridge/ElasticNet coefficients. The regularized ones should be smaller (closer to zero).
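The coefficient comparison can be sketched as follows. The synthetic data below stands in for the lag features; swap in your own X_train / y_train from Step 1 and the OLS model from Part 1.

```python
# Sketch: regularization shrinks coefficients relative to plain OLS.
# Synthetic lag-like features here; substitute your real X_train / y_train.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X_train = rng.normal(scale=0.01, size=(500, 10))  # 10 lag-like features
y_train = 0.1 * X_train[:, 0] + rng.normal(scale=0.01, size=500)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

print("OLS   coefficient norm:", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
```

Note how aggressive the shrinkage is when features are tiny (daily returns are on the order of 0.01): the penalty's strength is relative to the scale of your data, which is why alpha must be tuned rather than guessed.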
Challenge: The Alpha Curve
Write a loop that tests Lasso with alpha values ranging from 1e-6 to 1e-2. Store the number of non-zero coefficients for each alpha. Plot Alpha (x-axis) vs Number of Selected Features (y-axis). This visualizes how increasing regularization aggressively removes features.
Conclusion & Next Steps
We have successfully “tamed” our model. By using Regularization, we reduced the risk of overfitting to noise. We also implemented GridSearch with TimeSeriesSplit, a professional-grade workflow for tuning financial models.
However, we are still predicting a continuous number (Return). In trading, we often care more about the direction (Up or Down) than the exact magnitude.
Next Steps: In Part 3, we will shift gears from Regression to Classification. We will use Logistic Regression to predict the probability of the market going up, and we will learn how to evaluate success using accuracy, precision, and the Confusion Matrix.
Troubleshooting / FAQ
Q: My Grid Search says the best alpha is the smallest one I provided.
A: This suggests your model is underfitting and wants less regularization. Try adding smaller values to your grid (e.g., 1e-6, 1e-7) or adding more complex features (like polynomial features) to justify the need for regularization.
Q: Why neg_mean_squared_error?
A: scikit-learn optimization always tries to maximize a score. Since MSE is an error (which we want to minimize), sklearn flips the sign to negative so that “maximizing the negative error” is mathematically the same as “minimizing the error.”
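You can verify the sign flip yourself with scikit-learn's `get_scorer` (a minimal sketch; the toy data is arbitrary):

```python
# Sketch: the 'neg_mean_squared_error' scorer is literally -MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + np.array([0.1, -0.1] * 5)  # line plus alternating noise

model = LinearRegression().fit(X, y)
scorer = get_scorer("neg_mean_squared_error")
print(scorer(model, X, y))                        # a negative number
print(-mean_squared_error(y, model.predict(X)))   # the same value
```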
Q: Can I use KFold instead of TimeSeriesSplit?
A: In generic ML, yes. In Finance, NO. Standard K-Fold mixes time periods across folds, meaning you could validate on 2020 data with a model trained partly on 2021 data. This destroys the causal ordering of time and leads to unrealistically optimistic performance estimates.

