Unveiling Relationships in Data: A Deep Dive into Linear Regression
Financial Econometrics: 01
Welcome to the fascinating world of financial econometrics! In this first article, we embark on a journey to understand one of the most fundamental and powerful tools in a quantitative analyst’s toolkit: Linear Regression. Whether you’re trying to understand what drives stock returns, predict company revenues, or simply find a relationship between two variables, linear regression is often the first and best place to start.
Imagine you’re an analyst for a beverage company. A key part of your job is to understand the performance of your company’s stock. You might hypothesize that your stock’s performance is related to the overall performance of the stock market. But how can you quantify this relationship? How can you be sure it’s statistically meaningful? This is where linear regression comes in. It provides a framework to model the relationship between a dependent variable (your stock’s return) and one or more independent variables (the market’s return).
This article will guide you through the theory, application, and evaluation of linear regression models. We’ll start with the basics of a simple linear regression, understand its components, and then expand to multiple regression. Along the way, we’ll use a real-world dataset to make these concepts tangible and provide Python code to put theory into practice.
The Core Idea: Modeling Linear Relationships
At its heart, linear regression is about drawing a straight line through a scatter plot of data points that best represents their relationship.
- Endogenous Variable (Dependent Variable): This is the variable we are trying to predict or explain. It’s our target. In our example, it would be the excess return of a company’s stock, like Coca-Cola. We often denote this as Y.
- Exogenous Variable (Independent Variable): This is the variable we believe influences the dependent variable. It’s the predictor. In our case, this could be the excess return of the entire market, like the Dow Jones Industrial Average. We denote this as X.
A simple linear regression model can be expressed with the following equation:
Yi = β0 + β1Xi + ϵi
Let’s break this down:
- Yi is the value of the dependent variable for the ith observation.
- Xi is the value of the independent variable for the ith observation.
- β0 is the intercept. It’s the predicted value of Y when X is 0.
- β1 is the slope. It represents the change in Y for a one-unit change in X. This is often the coefficient we are most interested in.
- ϵi is the error term. It captures all the other factors that influence Y but are not included in the model. It represents the random, unexplained variation.
The goal of linear regression is to find the values of β0 and β1 that create the “best-fitting” line. The most common method to achieve this is Ordinary Least Squares (OLS). OLS works by minimizing the sum of the squared differences between the actual observed values (Yi) and the values predicted by our model (Ŷi).
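To make the OLS idea concrete, here is a minimal sketch (on synthetic data, not our stock dataset) that computes the two coefficients directly from the closed-form solution for simple regression: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the sample means.

```python
import numpy as np

# Synthetic data for illustration: y = 2 + 3x + noise
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

# Closed-form OLS estimates for simple regression:
# beta1 = Cov(x, y) / Var(x), beta0 = mean(y) - beta1 * mean(x)
beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()

print(f"Intercept (beta0): {beta0:.3f}")  # should land near 2
print(f"Slope (beta1):     {beta1:.3f}")  # should land near 3
```

Libraries like statsmodels solve exactly this minimization for us (and generalize it to many regressors), which is why we can focus on interpreting the output rather than deriving it.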
Practical Application: Setting Up Our Data
Before we dive into building models, let’s get our hands dirty with some data. We’ll use a dataset containing quarterly returns for several stocks and the Dow Jones index from 2016 to 2020. Our first task is to load this data and prepare it for analysis.
Here’s how we can load the data using Python’s pandas library and take a first look.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Load the dataset
df = pd.read_csv('data.csv')
# Display the first few rows of the dataframe
print("First 5 rows of the dataset:")
print(df.head())
# Display summary statistics
print("\nSummary statistics:")
print(df.describe())
This simple script loads our data.csv file into a pandas DataFrame, which is a powerful, table-like data structure. df.head() shows us the first few entries, giving us a feel for the columns and data types, while df.describe() provides key statistical information like mean, standard deviation, and quartiles for each numerical column.
Visualizing Relationships with Scatter Plots
The first step in any regression analysis should be to visualize the data. A scatter plot is the perfect tool for this, as it helps us see if a linear relationship between our variables is plausible. Let’s plot the excess quarterly return of Coca-Cola (Coke_Q_EX_R) against the excess quarterly return of the Dow Jones Index (Dow_Q_EX_R).

# Create a scatter plot to visualize the relationship
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Dow_Q_EX_R', y='Coke_Q_EX_R', data=df)
plt.title('Coca-Cola vs. Dow Jones Excess Quarterly Returns', fontsize=16)
plt.xlabel('Dow Jones Excess Quarterly Returns', fontsize=12)
plt.ylabel('Coca-Cola Excess Quarterly Returns', fontsize=12)
plt.grid(True)
plt.show()

The scatter plot suggests a positive relationship: as the Dow Jones returns go up, Coca-Cola’s returns tend to go up as well. The points cluster in a way that suggests a straight line could be a reasonable approximation of this trend.
Building Our First Simple Linear Regression Model
Now, let’s quantify the relationship we observed. We will use the statsmodels library in Python, a go-to for rigorous statistical modeling. We’ll define our dependent variable (Y) as Coca-Cola’s excess returns and our independent variable (X) as the Dow’s excess returns.
It’s crucial to add a constant (the intercept, β0) to our independent variable, because statsmodels’ OLS does not include an intercept by default.
# Define the dependent and independent variables
Y = df['Coke_Q_EX_R']
X = df['Dow_Q_EX_R']
# Add a constant (intercept) to the independent variable
X = sm.add_constant(X)
# Fit the OLS model
model_simple = sm.OLS(Y, X).fit()
# Print the model summary
print(model_simple.summary())

The model_simple.summary() function provides a wealth of information. Let’s dissect the most important parts of the output table.
Interpreting the Regression Results
The summary table is split into three main parts.
- Model Summary (Top Section): This gives an overview of the model’s performance.
- R-squared: This value, ranging from 0 to 1, tells us the proportion of the variance in the dependent variable (Y) that is predictable from the independent variable (X). In our case, an R-squared of 0.47 would mean that 47% of the variation in Coca-Cola’s excess returns can be explained by the Dow’s excess returns.
- Adj. R-squared: This is a modified version of R-squared that adjusts for the number of predictors in the model. It’s particularly useful when comparing models with different numbers of independent variables.
- F-statistic & Prob (F-statistic): These test the overall significance of the model. A low p-value (typically < 0.05) for the F-statistic indicates that our model is statistically significant and our independent variable(s) collectively have explanatory power. In our model, an F-statistic of 18.71, with its correspondingly low p-value, indicates that the model is statistically significant.
- Coefficients Table (Middle Section): This is where we find our estimated values.
- const (coef): This is our estimated intercept (β0). It’s the predicted excess return for Coca-Cola when the Dow’s excess return is zero.
- Dow_Q_EX_R (coef): This is our estimated slope (β1). It tells us that for a 1% increase in the Dow’s excess return, we expect Coca-Cola’s excess return to increase by β1 percent.
- std err: The standard error measures the precision of each coefficient estimate — roughly, how much the estimate would vary if we re-ran the regression on repeated samples. Smaller standard errors mean more precise estimates.
- t: The t-statistic is the coefficient divided by its standard error. It measures how many standard deviations our coefficient estimate is from zero.
- P>|t|: This is the p-value. It tells us the probability of observing our result (or a more extreme one) if the null hypothesis were true. The null hypothesis is that the coefficient is zero (i.e., the independent variable has no effect on the dependent variable). A small p-value (< 0.05) leads us to reject the null hypothesis, concluding that the coefficient is statistically significant.
- Residual Diagnostics (Bottom Section): This section provides tests to check if our model’s assumptions are met. The errors (ϵi) should ideally be normally distributed and independent.
- Omnibus/Prob(Omnibus): Tests for the normality of residuals. A high p-value suggests normality.
- Durbin-Watson: Tests for autocorrelation in the residuals. A value around 2 suggests no autocorrelation.
- Jarque-Bera (JB)/Prob(JB): Another test for the normality of residuals.
Expanding to Multiple Linear Regression
The real world is complex. It’s rare that a single variable can fully explain another. To build more realistic models, we use Multiple Linear Regression, which incorporates several independent variables.
The equation extends naturally:
Yi = β0 + β1X1i + β2X2i + … + βkXki + ϵi
Here, we have k independent variables, each with its own slope coefficient (β1,β2,…,βk). Each βj now represents the expected change in Y for a one-unit change in Xj, holding all other independent variables constant.
Let’s build a model to predict Coca-Cola’s excess returns using the excess returns of the Dow Jones, Google, Bank of America, and Pfizer.
# Define the dependent and independent variables for the multiple regression
Y_multi = df['Coke_Q_EX_R']
X_multi = df[['Dow_Q_EX_R', 'GOOG_Q_EX_R', 'BAC_Q_EX_R', 'PFE_Q_EX_R']]
# Add a constant
X_multi = sm.add_constant(X_multi)
# Fit the OLS model
model_multi = sm.OLS(Y_multi, X_multi).fit()
# Print the model summary
print(model_multi.summary())

When interpreting the results, we look at the same statistics as before. The Adjusted R-squared becomes more important now, as it helps us determine if adding the new variables actually improved the model. We examine the p-value for each coefficient to see which variables are statistically significant predictors.
Model Diagnostics: Are We on the Right Track?
A regression model is only reliable if it satisfies certain assumptions about the error term (ϵ). This process is known as diagnostics.
- Linearity: The relationship between X and Y is linear. (We checked this with the scatter plot).
- Normality of Errors: The errors are normally distributed. (We can check this with a Q-Q plot or the Jarque-Bera test).
- Homoscedasticity: The variance of the errors is constant across all levels of X. (We can check this with a residuals vs. fitted plot).
- Independence of Errors: The errors are independent of each other. (The Durbin-Watson test helps here).
Identifying Influential Points
Sometimes, a few data points can have a disproportionately large impact on our regression line. These are called influential points. They can be a combination of:
- Outliers: Observations with a large residual (the model predicts them poorly).
- Leverage Points: Observations with an extreme value for an independent variable (X).
Cook’s distance is a useful metric to identify influential points. A common rule of thumb is that a Cook’s distance greater than 4/n (where n is the number of observations) warrants investigation.
Let’s create an influence plot, which visualizes leverage and residuals simultaneously, to spot these points.
# Create an influence plot
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.influence_plot(model_multi, ax=ax, criterion="cooks")
plt.title("Influence Plot", fontsize=16)
plt.show()

The influence plot helps us identify points that might be skewing our results. If we find highly influential points, we might consider investigating them. Are they data entry errors? Or do they represent a unique event that we might want to exclude or model differently?
Conclusion: The First Step on a Longer Journey
In this article, we’ve laid the groundwork for linear regression analysis. We’ve learned how to specify, estimate, and interpret both simple and multiple regression models using Python. We’ve also touched upon the critical importance of visualizing our data and performing diagnostics to ensure our model is reliable.
We saw how to assess a model’s overall fit with R-squared and the F-test, and how to determine the significance of individual predictors using t-tests and p-values. Finally, we learned to be vigilant for influential points that can unduly affect our conclusions.
The journey doesn’t end here. Issues like multicollinearity (high correlation between independent variables), variable transformation, and measurement errors are crucial topics that we will explore in subsequent articles. But with the solid foundation built here, you are now well-equipped to start building your own models and uncovering the stories hidden within your data.

