
Machine Learning for Quants (Part 3)

Introduction

In Parts 1 and 2, we treated trading as a Regression problem: trying to predict the exact numerical return of an asset (e.g., “+1.2%” or “-0.5%”). However, predicting the exact magnitude of price movement is notoriously difficult due to market noise.

Often, a Quant doesn’t need to know exactly how much a stock will move; they just need to know the direction. Will it go UP or DOWN?

This part pivots to Classification. We will transform our Momentum strategy into a binary classifier using Logistic Regression. We will also introduce Feature Scaling, a mandatory preprocessing step for many ML algorithms, and learn why “Accuracy” is a dangerous metric for traders.

Learning Objectives

By the end of this tutorial, you will be able to:

  • Transform a continuous time-series problem into a binary classification problem.
  • Apply Feature Scaling (StandardScaler) to prevent model bias.
  • Train a Logistic Regression model to predict the probability of a positive return.
  • Evaluate performance using the Confusion Matrix, Precision, Recall, and ROC Curves, rather than just simple Accuracy.

Prerequisites

  1. Completion of Part 2: Familiarity with the SPY dataset and lag generation.
  2. Libraries: scikit-learn, pandas, numpy, matplotlib, yfinance, seaborn (for heatmap visualization).

Core Concepts

1. Regression vs. Classification

  • Regression: Predicts a continuous number ($y \in \mathbb{R}$). Example: “SPY will return 0.04% tomorrow.”
  • Classification: Predicts a discrete class ($y \in \{0, 1\}$). Example: “SPY will be UP (1) or DOWN (0) tomorrow.”

In finance, classification is often more robust. It simplifies the noise into a binary signal.

2. Logistic Regression

Despite its name, Logistic Regression is a classification algorithm. Instead of fitting a straight line, it fits an “S-shaped” curve (the Sigmoid function) to the data.

It outputs a probability between 0 and 1.

  • If Probability > 0.5 $\rightarrow$ Predict Class 1 (UP)
  • If Probability < 0.5 $\rightarrow$ Predict Class 0 (DOWN)
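The sigmoid itself is a one-liner. As a minimal sketch (the scores here are hypothetical, not outputs from our model):

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))         # 0.5
print(sigmoid(2.0) > 0.5)   # True -> predict Class 1 (UP)
print(sigmoid(-2.0) < 0.5)  # True -> predict Class 0 (DOWN)
```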

3. Feature Scaling (The “Apples to Oranges” Problem)

Machine Learning algorithms (especially those using Gradient Descent like Logistic Regression) struggle when features have vastly different scales.

  • Example: A model using “Price” (value ~400) and “Return” (value ~0.01). The model will mathematically obsess over “Price” because the numbers are bigger, ignoring the “Return” signal.
  • Solution: We use Standard Scaling (Z-score normalization) to force all features to have a Mean of 0 and a Standard Deviation of 1.
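The z-score transform is simple enough to write by hand. A quick sketch on made-up numbers (not the SPY data) shows how it puts a “Price”-scale feature and a “Return”-scale feature on equal footing:

```python
import numpy as np

prices = np.array([395.0, 400.0, 405.0])    # "Price"-scale feature (~400)
returns = np.array([0.010, -0.005, 0.012])  # "Return"-scale feature (~0.01)

def zscore(x):
    # Subtract the mean, divide by the standard deviation
    return (x - x.mean()) / x.std()

# After scaling, both features have mean 0 and standard deviation 1
for name, col in [("Price", prices), ("Return", returns)]:
    scaled = zscore(col)
    print(name, round(scaled.mean(), 10), round(scaled.std(), 10))
```

This is exactly what `StandardScaler` does under the hood, column by column.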

4. The Accuracy Paradox

If the market goes UP 55% of the time, a “dumb” model that always predicts UP will have 55% Accuracy. It looks good on paper but has zero intelligence.

We need better metrics:

  • Precision: When we predict UP, how often are we right? (Crucial for minimizing bad trades).
  • Recall: Of all the actual UP days, how many did we catch? (Crucial for not missing opportunities).
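The paradox is easy to demonstrate on made-up numbers. Here is a hypothetical 100-day test set (not our SPY data) scored against the “always UP” model:

```python
import numpy as np

# Hypothetical test set: the market was UP on 55 of 100 days
y_true = np.array([1] * 55 + [0] * 45)

# "Dumb" model: always predict UP
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()
# Precision: of all UP predictions, the fraction that were correct
precision = y_true[y_pred == 1].mean()
# Recall: of all actual UP days, the fraction we caught
recall = (y_pred[y_true == 1] == 1).mean()

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
# Accuracy and precision are both 0.55 and recall is a perfect 1.00 --
# yet the model contains zero intelligence
```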

Step-by-Step Walkthrough (The Hands-On Practice)

Step 1: Data Setup and Binary Target

We start by loading the data and creating a new target variable: Direction.

```python
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Fetch Data
df = yf.download('SPY', start='2010-01-01', end='2023-01-01')
df['Return'] = df['Close'].pct_change()
df.dropna(inplace=True)

# 2. Create Binary Target
# If Return > 0, Class is 1 (Up). Otherwise 0 (Down).
df['Direction'] = np.where(df['Return'] > 0, 1, 0)

# 3. Create Lagged Features (Lags 1-5)
lags = 5
feature_cols = []
for i in range(1, lags + 1):
    col_name = f'Lag_{i}'
    df[col_name] = df['Return'].shift(i)
    feature_cols.append(col_name)

df.dropna(inplace=True)

# Define X (Features) and y (Target)
X = df[feature_cols]
y = df['Direction']  # Note: Target is now 'Direction', not 'Return'

print("Class Distribution:")
print(y.value_counts(normalize=True))

# You will see ~55% '1's (Market has a slight upward drift)
```

Step 2: Splitting and Scaling

Critical: You must fit the Scaler only on the Training data, then transform the Test data. If you scale the whole dataset at once, you leak future information into the past (“Data Leakage”).

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Split Data (Chronological -- shuffle=False preserves time order)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# 2. Initialize Scaler
scaler = StandardScaler()

# 3. Fit on Train, Transform Train
X_train_scaled = scaler.fit_transform(X_train)

# 4. Transform Test (Do NOT fit on test!)
X_test_scaled = scaler.transform(X_test)

print("First 5 Scaled Train Rows:\n", X_train_scaled[:5])
```

 

Step 3: Training Logistic Regression

Now we train the classifier.

```python
from sklearn.linear_model import LogisticRegression

# Initialize and Train
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

# Predict Classes (0 or 1)
y_pred = log_model.predict(X_test_scaled)

# Predict Probabilities (0.0 to 1.0)
y_prob = log_model.predict_proba(X_test_scaled)[:, 1]  # Probability of Class 1

print("Predictions generated.")
```

 

Step 4: The Confusion Matrix

The best way to visualize classification errors.

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Generate Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label (0=Down, 1=Up)')
plt.ylabel('Actual Label (0=Down, 1=Up)')
plt.title('Confusion Matrix')
plt.show()

# Simple Accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {acc:.4f}")
```

 

Interpretation:

  1. Top-Left: True Negatives (Predicted Down, Actually Down).
  2. Bottom-Right: True Positives (Predicted Up, Actually Up).
  3. Top-Right: False Positives (Predicted Up, Actually Down) -> Painful Losses.
  4. Bottom-Left: False Negatives (Predicted Down, Actually Up) -> Missed Opportunities.
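If you prefer the raw counts to the heatmap, scikit-learn flattens the matrix in TN, FP, FN, TP order, so the four cells can be unpacked with `ravel()`. A small made-up example (not the SPY predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 1])

# sklearn orders the flattened 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```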

Step 5: Advanced Metrics (Precision & Recall)

Let’s get the detailed “scorecard.”

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))
```

 

Step 6: The ROC Curve

The Receiver Operating Characteristic (ROC) curve visualizes the trade-off between capturing positives and flagging false positives. A perfect model hugs the top-left corner. A random guess follows the diagonal line.

```python
from sklearn.metrics import roc_curve, auc

# Calculate rates
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # Random Guess Line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()
```

 

Check Your Work

  • Scaling Stats: Check np.mean(X_train_scaled) and np.std(X_train_scaled). They should be essentially 0 and 1, respectively.
  • Baseline Comparison: Calculate the percentage of “1”s in y_test (e.g., 0.54). If your Model Accuracy is 0.54 or lower, your model is no better than a coin flip or an “Always Buy” strategy.
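The baseline check takes two lines. A sketch using stand-in arrays (in practice you would use the real `y_test` and `y_pred` from the steps above):

```python
import numpy as np

# Stand-in for the real test labels/predictions from earlier steps
y_test = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])

baseline = y_test.mean()               # accuracy of an "Always Buy" strategy
model_acc = (y_pred == y_test).mean()  # your model's accuracy

print(f"Always-Buy baseline: {baseline:.2f}")
print(f"Model accuracy:      {model_acc:.2f}")
print("Beats baseline" if model_acc > baseline else "No edge over Always Buy")
```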

Challenge: Tuning the Threshold

By default, Logistic Regression predicts “1” if Probability > 50%.

In trading, we might only want to bet if we are very sure.

  • Create a new prediction array: Predict “1” only if y_prob > 0.55.
  • Check the Precision of this new high-confidence model. Does it improve? (Usually, precision goes up, but recall goes down—you trade less often, but potentially more safely).
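One way to sketch the challenge, using made-up probabilities (in practice you would use the `y_prob` and `y_test` arrays from Steps 2-3):

```python
import numpy as np

# Stand-in predicted probabilities and actual outcomes
y_prob = np.array([0.52, 0.61, 0.49, 0.58, 0.70, 0.53, 0.45, 0.56])
y_test = np.array([0,    1,    0,    1,    1,    0,    0,    1   ])

for threshold in (0.50, 0.55):
    # Predict "1" only when the model is more confident than the threshold
    y_pred = (y_prob > threshold).astype(int)
    hits = y_pred == 1
    trades = hits.sum()
    precision = y_test[hits].mean() if hits.any() else np.nan
    print(f"threshold={threshold:.2f}: trades={trades}, precision={precision:.2f}")
```

On these toy numbers, raising the threshold cuts the number of trades but lifts precision: exactly the precision-for-recall trade the challenge asks you to measure.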

Conclusion & Next Steps

We have moved from predicting “How much?” to “Which way?”. We learned that Scaling is non-negotiable for Logistic Regression and that Accuracy can hide a lot of sins in financial data.

However, predicting pure market movements is difficult because markets are highly efficient. Classification shines brightest when predicting distinct events with richer data profiles.

Next Steps: In Part 4, we will apply these classification skills to a robust, real-world banking case study: Credit Default Prediction. We will work with a rich dataset containing age, income, and loan history to predict who will default on a loan.

Troubleshooting / FAQ

Q: My ROC AUC is 0.51 or 0.49. Is my model broken?

A: No, it just means the market is hard to predict! Simple lag-based strategies on efficient markets (like SPY) often hover near random (0.50). To improve this, Quants add alternative data (volume, volatility, sentiment) rather than just past prices.

Q: Why did we drop NaNs?

A: Logistic Regression cannot handle missing values. Since we created lags, the first few rows became empty. We must remove them or the code will crash.

Q: Can I use MinMaxScaler instead of StandardScaler?

A: Yes, MinMaxScaler squashes data between 0 and 1. However, StandardScaler is generally preferred for Logistic Regression because it handles outliers better and centers the data around zero, which helps the optimization algorithm convergence.
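A quick side-by-side on toy numbers (both scalers are in scikit-learn; the outlier is deliberate):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[-0.02], [0.00], [0.01], [0.15]])  # note the 0.15 outlier

# MinMaxScaler squashes everything into [0, 1]; the outlier pins the top
print(MinMaxScaler().fit_transform(X).ravel())

# StandardScaler centers at 0 with unit variance
print(StandardScaler().fit_transform(X).ravel())
```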
