Understanding Robust Regression in Financial Econometrics

Financial Econometrics: Part 06

Welcome to our next article on advanced regression! In our last article, we took a close look at Ordinary Least Squares (OLS) and confronted one of its key weaknesses: heteroscedasticity. We learned that when the variance of the errors isn’t constant, the OLS standard errors become unreliable. Our solution was Weighted Least Squares (WLS), a fantastic tool for “re-weighting” the data to trust reliable observations more and “noisy” observations less.

But what happens when an observation isn’t just “noisy”? What if it’s… extreme?

Financial data is a wild beast. It’s filled with “fat tails,” “black swan” events, and data points that look like typos but are, in fact, real. Consider the 2008 financial crisis, the 2020 COVID-19 market crash, or a single stock’s absurd rally. These are outliers: data points that are “far away” from the rest of the data.

Why is this a problem? Because OLS hates outliers. Or rather, it loves them too much.

In this article, we’ll expose the fundamental flaw that makes OLS so vulnerable to outliers. Then, we’ll introduce a powerful alternative: Robust Regression. We’ll learn how this family of models, specifically M-estimators like the Huber and Bisquare methods, can stand up to outliers and give us a more “robust” and truthful picture of our data.

The “Achilles’ Heel” of OLS: The Objective Function

To understand the problem, we must go back to the very heart of OLS. What is OLS trying to do? It finds the “best-fit” line by minimizing a specific cost function, known as the Objective Function.

For OLS, that objective function is the Sum of Squared Residuals (SSR):

$\text{Minimize} \sum_{i=1}^{n} e_i^2 = \text{Minimize} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

This “least squares” approach seems democratic, but it has a tyrannical flaw. By squaring the residual ($e_i$), OLS gives disproportionately massive influence to points that are far from the regression line (i.e., outliers).

Let’s use a simple example. Imagine our model has two data points:

Point A: A “normal” point with a residual (error) of 2.
Point B: An “outlier” with a residual (error) of 10.

How much “influence” (or “cost”) does each point have in the OLS objective function?

Point A’s Cost: $2^2 = 4$
Point B’s Cost: $10^2 = 100$

The outlier (Point B) is 5 times farther from the line than Point A (10 vs. 2), but it contributes 25 times (100 / 4) more to the total cost. The OLS algorithm, in its obsessive quest to minimize the total cost, will be dominated by the outlier. It will “pull” the entire regression line away from the bulk of the data just to reduce that one, massive squared residual.

This is the tyranny of the outlier. A single, extreme data point can hold your entire model hostage, corrupting your coefficients and giving you a line that doesn’t represent the true underlying relationship for the other 99% of your data.
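
To make the “pull” concrete, here is a minimal sketch using made-up data (ten points on an exact line, with one observation then shifted by an arbitrary shock of 50). It uses NumPy’s `np.polyfit`, which performs an ordinary least-squares line fit, and also reproduces the Point A / Point B cost arithmetic from above.

```python
import numpy as np

# Ten points lying exactly on the line y = 1 + 2x (illustrative data only)
x = np.arange(10, dtype=float)
y_clean = 1.0 + 2.0 * x

# Corrupt a single observation with a large, arbitrary shock
y_outlier = y_clean.copy()
y_outlier[-1] += 50.0

# np.polyfit with deg=1 is an ordinary least-squares line fit
slope_clean, intercept_clean = np.polyfit(x, y_clean, deg=1)
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(f"Clean data:   slope = {slope_clean:.3f}, intercept = {intercept_clean:.3f}")
print(f"With outlier: slope = {slope_out:.3f}, intercept = {intercept_out:.3f}")

# Cost comparison from the Point A / Point B example
cost_a, cost_b = 2**2, 10**2
print(f"Point B is {10 / 2:.0f}x farther out but costs {cost_b / cost_a:.0f}x more")
```

One corrupted point out of ten is enough to more than double the fitted slope in this toy setup, even though the other nine points sit exactly on the original line.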

The Solution: Change the Rules with Robust Regression

If the problem is the objective function, the solution is to change the objective function.

This is the central idea of Robust Regression and of its most popular form, M-Estimation (short for “maximum likelihood-type” estimation).

M-Estimation proposes that instead of minimizing the sum of squares, we should minimize a different function of the residuals, which we’ll call $\rho$ (rho).

$\text{Minimize} \sum_{i=1}^{n} \rho(e_i)$

For OLS, this function is the simple quadratic: $\rho(e_i) = e_i^2$ (or $0.5e_i^2$; the 0.5 is for mathematical convenience). As we saw, this function grows quadratically and without bound, so the penalty on a large error quickly swamps everything else.
For Robust Regression, we choose a $\rho$ function that dampens the influence of large errors.

We’re going to look at the two most famous robust objective functions: Huber and Bisquare.

1. The Huber Method: The “Compromise”

The Huber function is a clever hybrid. It operates on a simple philosophy: “I’ll trust OLS for small, well-behaved errors, but I’ll switch to something more forgiving for large, ‘outlier-ish’ errors.”

The Huber function is defined in two parts, with a “tuning constant” k that sets the boundary:

For small errors ($|e_i| \le k$):

$\rho(e_i) = \frac{1}{2} e_i^2$

It is the OLS quadratic function. For points close to the line, it behaves just like OLS.

For large errors ($|e_i| > k$):

$\rho(e_i) = k|e_i| - \frac{1}{2}k^2$

This is a linear function. The cost still grows as the error gets bigger, but it grows linearly, not quadratically.

The result: An outlier’s influence is “capped.” By switching to a linear cost, the Huber method prevents a single extreme point from having the 25x (or 100x, or 1000x) influence it would have under OLS. It’s a “soft” down-weighting.
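
The two-part definition above can be sketched directly in NumPy. The helper names `ols_rho` and `huber_rho` are hypothetical, and the value $k = 1.345$ is an assumption borrowed from the default of statsmodels’ `HuberT` norm; the residuals are treated as already standardized.

```python
import numpy as np

def ols_rho(e):
    """Quadratic OLS cost, 0.5 * e^2."""
    e = np.asarray(e, dtype=float)
    return 0.5 * e**2

def huber_rho(e, k=1.345):
    """Huber cost: quadratic for |e| <= k, linear beyond k."""
    e = np.asarray(e, dtype=float)
    abs_e = np.abs(e)
    return np.where(abs_e <= k, 0.5 * e**2, k * abs_e - 0.5 * k**2)

residuals = np.array([0.5, 1.0, 2.0, 10.0])
for e, q, h in zip(residuals, ols_rho(residuals), huber_rho(residuals)):
    print(f"residual = {e:5.1f}   OLS cost = {q:6.2f}   Huber cost = {h:6.2f}")
```

For residuals inside the boundary the two costs are identical; for the residual of 10 the Huber cost is roughly a quarter of the OLS cost, and the gap keeps widening as the residual grows.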

2. The Bisquare (Tukey’s Biweight) Method: The “Dismissive”

The Bisquare method (also known as Tukey’s Biweight) is more ruthless. Its philosophy is: “If a data point is really far out, it’s probably a mistake, and I’m going to completely ignore it.”

The Bisquare function is also defined with a tuning constant k:

For small errors ($|e_i| \le k$):

$\rho(e_i) = \frac{k^2}{6} \left[ 1 - \left(1 - \left(\frac{e_i}{k}\right)^2\right)^3 \right]$

This looks complicated, but it’s just a smooth curve that closely tracks the OLS parabola for residuals near zero and flattens out as $|e_i|$ approaches $k$.

For large errors ($|e_i| > k$):

$\rho(e_i) = \frac{k^2}{6}$

The function becomes constant.

The result: This is profound. If the function is constant, the penalty for a residual of $k+1$ is exactly the same as the penalty for a residual of 1,000,000. The model has zero incentive to move its line to accommodate the extreme outlier; it effectively gives the outlier a weight of zero. This is a “hard” down-weighting.
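
Here is a small sketch of the Bisquare cost and the weight it implies. The function names are hypothetical, $k = 4.685$ is an assumption taken from the default of statsmodels’ `TukeyBiweight` norm, and the weight formula $(1 - (e/k)^2)^2$ comes from dividing the derivative of $\rho$ by the residual, which is the usual way an M-estimator’s cost is turned into a weight.

```python
import numpy as np

def bisquare_rho(e, k=4.685):
    """Tukey bisquare cost: smooth near zero, constant beyond k."""
    e = np.asarray(e, dtype=float)
    inside = (k**2 / 6.0) * (1.0 - (1.0 - (e / k) ** 2) ** 3)
    return np.where(np.abs(e) <= k, inside, k**2 / 6.0)

def bisquare_weight(e, k=4.685):
    """Implied weight rho'(e) / e: falls smoothly from 1 to exactly 0 at |e| = k."""
    e = np.asarray(e, dtype=float)
    return np.where(np.abs(e) <= k, (1.0 - (e / k) ** 2) ** 2, 0.0)

for e in [0.0, 1.0, 3.0, 5.685, 1_000_000.0]:
    print(f"residual = {e:>11.3f}   cost = {float(bisquare_rho(e)):.4f}   "
          f"weight = {float(bisquare_weight(e)):.4f}")
```

The last two rows have the same cost and a weight of zero, which is the “hard” down-weighting described above.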

The “How”: Iteratively Reweighted Least Squares (IRLS)

This is great, but how do we actually solve these new minimization problems? OLS has a clean, one-step “closed-form” solution. These M-estimators do not.

The answer is a beautiful, iterative process called Iteratively Reweighted Least Squares (IRLS). Here is the step-by-step logic, which builds directly on the previous article:

The IRLS algorithm “learns” which points are outliers over several rounds:

Start (Iteration 0): Run an initial OLS (or WLS) regression to get a starting set of coefficients ($\beta_0$) and residuals ($e_0$).
Calculate Weights: Use these residuals ($e_0$) and your chosen robust method (e.g., Huber) to calculate a “robustness weight” ($w_i$) for every single data point.
- If a point has a small residual, it gets a weight close to 1.
- If a point has a large residual (an outlier), it gets a small weight (e.g., 0.2). If using Bisquare, it might even get a weight of 0.
Run WLS (Iteration 1): Perform a Weighted Least Squares (WLS) regression (our tool from Article 1!) using the robustness weights you just calculated. This produces a new, updated set of coefficients ($\beta_1$).
Get New Residuals: Use these new coefficients ($\beta_1$) to calculate new, updated residuals ($e_1$).
Calculate New Weights: Use these new residuals ($e_1$) to calculate a new set of weights. Some points that looked like outliers before might look better now, and vice-versa.
Run WLS (Iteration 2): Run another WLS regression with these new weights to get $\beta_2$.
Iterate: Repeat this “WLS -> get residuals -> get weights” loop over and over.
Converge: Stop when the coefficient estimates ($\beta$) stop changing in any meaningful way from one iteration to the next.

This process is fantastic. The model “discovers” the outliers, down-weights them, re-fits the line to the “good” data, then re-evaluates.
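
The loop described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the statsmodels implementation: the function names are hypothetical, the data are simulated, residuals are standardized by a MAD-based scale before weighting (a common convention), and the Huber weights are $w_i = 1$ for $|u_i| \le k$ and $w_i = k/|u_i|$ otherwise, where $u_i$ is the standardized residual.

```python
import numpy as np

def huber_weights(u, k=1.345):
    """IRLS weights for the Huber function: 1 inside [-k, k], k/|u| outside."""
    abs_u = np.abs(u)
    return np.where(abs_u <= k, 1.0, k / np.maximum(abs_u, 1e-12))

def mad_scale(resid):
    """Median absolute deviation, rescaled to match the std. dev. under normality."""
    return np.median(np.abs(resid - np.median(resid))) / 0.6745

def irls_huber(X, y, k=1.345, tol=1e-8, max_iter=50):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]       # Iteration 0: plain OLS
    weights = np.ones(len(y))
    for _ in range(max_iter):
        resid = y - X @ beta                          # residuals from the current fit
        scale = mad_scale(resid)
        if scale == 0.0:                              # perfect fit: nothing to down-weight
            break
        weights = huber_weights(resid / scale, k)     # robustness weights
        sw = np.sqrt(weights)                         # WLS = OLS on sqrt(w)-scaled rows
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:                                 # coefficients stopped changing
            break
    return beta, weights

# Simulated data: true line y = 1 + 2x, with the last five points shifted up by 30
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
y[-5:] += 30.0
X = np.column_stack([np.ones_like(x), x])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_robust, weights = irls_huber(X, y)
print("OLS    [intercept, slope]:", np.round(beta_ols, 3))
print("Robust [intercept, slope]:", np.round(beta_robust, 3))
print("Weights of the five planted outliers:", np.round(weights[-5:], 3))
```

In this toy setup the planted outliers end up with small weights and the robust slope sits much closer to the true value of 2 than the OLS slope does.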

Practical Application: The DXY Model “Stress Test”

Let’s go back to our DXY model from Article 1. We’ve already run OLS and WLS. Now, let’s “stress test” those results by running two Robust Linear Models (RLM): one with Huber and one with Bisquare.

If our data has no significant, influential outliers, the results from all four models (OLS, WLS, Huber, Bisquare) should be very similar.

If, however, our data is contaminated by outliers, the OLS/WLS results will “disagree” with the RLM results. The RLM results would then be considered a more truthful fit to the bulk of the data.

Let’s see what happens.

The Python Code Walkthrough

Here is the complete Python code to run all four models side-by-side. We use statsmodels just as before, but this time we’ll add sm.RLM (“Robust Linear Model”) and specify the “norm” (the objective function) we want to use.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.robust import norms # To get Huber and Bisquare
from IPython.display import display # Import display for better table formatting

# 1. LOAD AND PREPARE THE DATA (Same as Article 1)

print("--- 1. Loading Data ---")
data = pd.read_csv('M2_data.csv')
data['Date'] = pd.to_datetime(data['Date'])
data = data.set_index('Date')

dependent_var = 'DXY'
independent_vars = ['METALS', 'OIL', 'US_STK', 'INTL_STK', 'X10Y_TBY', 'EURUSD']

data_clean = data[[dependent_var] + independent_vars].dropna()

Y = data_clean[dependent_var]
X = data_clean[independent_vars]
X = sm.add_constant(X)

print(f"Data loaded. Modeling {dependent_var} with {len(independent_vars)} predictors.")
print(f"Total observations: {len(Y)}")
print("-" * 30 + "\n")


# ---
# 2. RUN ALL FOUR MODELS (OLS, WLS, RLM-Huber, RLM-Bisquare)
# ---

# --- MODEL 1: OLS ---
print("--- 2a. Running OLS (Model 1) ---")
ols_model = sm.OLS(Y, X)
ols_results = ols_model.fit()
# print(ols_results.summary()) # We'll print a custom summary later


# --- MODEL 2: WLS ---
# (Using the same method as Article 1)
print("--- 2b. Running WLS (Model 2) ---")
ols_residuals = ols_results.resid
ols_resid_sq = ols_residuals**2

aux_Y = np.log(ols_resid_sq + 1e-8)
aux_model = sm.OLS(aux_Y, X)
aux_results = aux_model.fit()

log_variance_hat = aux_results.fittedvalues
weights = 1.0 / np.exp(log_variance_hat)

wls_model = sm.WLS(Y, X, weights=weights)
wls_results = wls_model.fit()
# print(wls_results.summary())


# --- MODEL 3: Robust Regression (Huber) ---
print("--- 2c. Running RLM with Huber Norm (Model 3) ---")
# sm.RLM uses the M-estimator and IRLS process by default
# We pass the 'HuberT' norm
rlm_huber_model = sm.RLM(Y, X, M=norms.HuberT())
rlm_huber_results = rlm_huber_model.fit()
# print(rlm_huber_results.summary())


# --- MODEL 4: Robust Regression (Bisquare) ---
print("--- 2d. Running RLM with Bisquare Norm (Model 4) ---")
# Here we pass the 'TukeyBiweight' norm
rlm_bisquare_model = sm.RLM(Y, X, M=norms.TukeyBiweight())
rlm_bisquare_results = rlm_bisquare_model.fit()
# print(rlm_bisquare_results.summary())

print("\nAll models have been fit.")
print("-" * 30 + "\n")


# 3. CREATE A COMPARISON TABLE OF RESULTS

print("--- 3. Side-by-Side Model Comparison ---")

# Create a list of our results objects
all_results = [ols_results, wls_results, rlm_huber_results, rlm_bisquare_results]
model_names = ['OLS', 'WLS', 'RLM-Huber', 'RLM-Bisquare']

# Create an empty DataFrame to store coefficients
# We use the X variable names as the index
comparison_df = pd.DataFrame(index=['const'] + independent_vars)

# Populate the DataFrame
for res, name in zip(all_results, model_names):
    comparison_df[f'{name}_Coef'] = res.params
    comparison_df[f'{name}_StdErr'] = res.bse

# A nice trick to re-format the output for readability
# We'll create a multi-index column header
comparison_df.columns = pd.MultiIndex.from_tuples(
    [col.split('_') for col in comparison_df.columns],
    names=['Model', 'Statistic']
)

# Transpose for easier reading in the console
# (Models as rows, variables as columns)
final_comparison = comparison_df.transpose()

# Print the final comparison tables
print("\n*** COEFFICIENT COMPARISON ***")
# Select the rows where Statistic is 'Coef' and drop the now-redundant 'Statistic' level
display(final_comparison.loc[(slice(None), 'Coef'), :].droplevel('Statistic'))

print("\n\n*** STANDARD ERROR COMPARISON ***")
# Select the rows where Statistic is 'StdErr' and drop the now-redundant 'Statistic' level
display(final_comparison.loc[(slice(None), 'StdErr'), :].droplevel('Statistic'))

print("\n\n--- End of Analysis ---")

Interpreting the Results: The Grand Comparison

When you run the code, it will output two clean tables: one comparing the coefficients and one comparing the standard errors from all four models.

Here is the data from our run, formatted for clarity (your numbers should match or differ only marginally):

[Table: coefficients and standard errors for the OLS, WLS, RLM-Huber, and RLM-Bisquare models.]

What does this tell us? This is a fantastic result. Look at the Coefficient table. The coefficients for WLS, RLM-Huber, and RLM-Bisquare are all very close to each other, while the OLS coefficients are noticeably different.

For example, look at US_STK: OLS has a coefficient of 0.306, but the other three models all agree the “true” coefficient is lower, clustering between 0.260 and 0.274.
Similarly, for INTL_STK: OLS finds -0.334, while the other three (more reliable) models cluster between -0.285 and -0.301.

Now look at the Standard Error table. The same story emerges. The standard errors for WLS, Huber, and Bisquare are all in the same ballpark, validating the WLS standard errors we trusted in Article 1.

This is a very important finding.

The fact that the robust models (Huber, Bisquare) give us the same answers as our WLS model is a good thing. It tells us that our dataset, while it did suffer from heteroscedasticity (which WLS fixed), does not suffer from a significant, influential outlier problem.

If a major outlier were still distorting the fit, we would have seen the RLM coefficients diverge significantly from the WLS ones. Since they didn’t, we’ve just used robust regression as a powerful “check” on our WLS model from Article 1, and our WLS model passed the check. This should give us more confidence in our WLS results.
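
One way to make “diverge significantly” less of a judgment call is to express each model’s coefficient gap in units of the baseline model’s standard error. The sketch below is only an illustration: the helper `coefficient_gaps` is hypothetical, and the numbers are made up purely to exercise it (they are not the DXY results).

```python
import pandas as pd

def coefficient_gaps(coefs: pd.DataFrame, std_errs: pd.DataFrame, baseline: str) -> pd.DataFrame:
    """Gap between each model's coefficients and the baseline's,
    measured in units of the baseline model's standard errors."""
    diffs = coefs.sub(coefs[baseline], axis=0)
    return diffs.div(std_errs[baseline], axis=0).drop(columns=baseline)

# Made-up numbers, rows = variables, columns = models (NOT the article's results)
coefs = pd.DataFrame(
    {"OLS": [0.50, -0.20], "WLS": [0.40, -0.25], "RLM-Huber": [0.41, -0.24]},
    index=["x1", "x2"],
)
std_errs = pd.DataFrame(
    {"OLS": [0.05, 0.04], "WLS": [0.04, 0.05], "RLM-Huber": [0.04, 0.05]},
    index=["x1", "x2"],
)

gaps = coefficient_gaps(coefs, std_errs, baseline="WLS")
print(gaps.round(2))
```

With the walkthrough’s `comparison_df`, the two inputs could plausibly be obtained with `comparison_df.xs('Coef', axis=1, level='Statistic')` and `comparison_df.xs('StdErr', axis=1, level='Statistic')`. Where to draw the line (one standard error, two, or more) remains a modelling choice.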

Pros and Cons of M-Estimation

Robust regression is a powerful addition to our toolkit, but it’s not a magic wand.

Pros:

Resistant to Outliers: This is its main purpose. It provides a more “honest” fit to the bulk of the data.
Identifies Outliers: The final weights from the IRLS process are a useful diagnostic tool. Any observation with a very low weight is being flagged as an outlier.

Cons:

Computationally Expensive: OLS is a simple, one-step calculation. RLM requires an iterative process (IRLS) that can be much slower on large datasets.
The Tuning Constant ($k$): This is the biggest “con.” How do you choose the value of $k$ that defines the boundary for an outlier? statsmodels picks a reasonable default (1.345 for Huber and 4.685 for Bisquare, applied to residuals standardized by a robust scale estimate), but changing this “knob” will change your results. This adds a layer of subjectivity that OLS doesn’t have.
Lower Efficiency: If your data is, in fact, perfectly normal (no outliers), OLS is the most efficient estimator. Using a robust method on clean data will result in slightly higher standard errors (less efficiency).

Conclusion

We’ve now faced two major “monsters” that threaten OLS.

In Article 1, we defeated heteroscedasticity (unequal noise) with WLS.
In this article, we learned how to defend against outliers (tyrannical data points) with Robust Regression (RLM).

We used RLM as a “stress test” for our DXY model and found that our WLS results were, in fact, “robust,” which strengthens our conclusions.

We’re building a truly resilient model. But there’s a third, and perhaps most common, monster lurking in modern data: the “Curse of Dimensionality,” or, more simply, the problem of having too many variables.

What happens when we have 50, 100, or 1,000 potential predictors? This leads to overfitting and models that are great at “memorizing” the past but terrible at predicting the future.

For this, we’ll need a completely different set of tools: Penalized Regression (Ridge and Lasso). And that will be the subject of our next article. Stay tuned!