Unsupervised Machine Learning for Quants: A Hands-On Tutorial

Introduction

In the last article, we discussed supervised learning. This tutorial provides a practical guide to unsupervised machine learning, with a focus on clustering techniques. As a quant, you’ll often encounter large datasets without clear labels. Unsupervised learning allows you to discover hidden patterns and structures within this data, which is invaluable for tasks like asset allocation, portfolio management, and identifying trading opportunities. By the end of this tutorial, you will have a solid understanding of clustering and will have implemented these techniques on real-world financial data.

Learning Objectives

By the end of this tutorial, you will be able to:

Differentiate between supervised and unsupervised machine learning.
Explain the principles of K-Means and hierarchical clustering.
Implement K-Means and hierarchical clustering in Python using scikit-learn and SciPy.
Determine the optimal number of clusters using the Elbow Method and dendrograms.
Apply clustering techniques to analyze foreign exchange rate data.

Core Concepts

Unsupervised vs. Supervised Learning

In supervised learning, we have labeled data and our goal is to predict a known outcome. For example, predicting stock price movements based on historical data with known price changes. In unsupervised learning, we work with unlabeled data and aim to find inherent patterns or structures. An example would be grouping stocks based on their volatility without any predefined categories.
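
To make the distinction concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of LogisticRegression as the supervised model are illustrative assumptions, not part of this tutorial's dataset. The point is only that the supervised model is fitted on features and labels, while the clustering model sees features alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two synthetic groups of points in 2-D (illustrative data only)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 2)),
    rng.normal(5.0, 0.5, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)  # labels exist only in the supervised setting

# Supervised: the model is fitted on features AND known labels
clf = LogisticRegression().fit(X, y)
supervised_accuracy = clf.score(X, y)

# Unsupervised: the model is fitted on features only and assigns its own group ids
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

Note that the cluster ids are arbitrary: K-Means may call the first group 0 or 1, because it never sees `y`.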

Clustering

Clustering is the process of grouping a set of data points. Points in the same group (or cluster) are more similar to each other than to those in other clusters. The goal is to maximize intra-cluster similarity and minimize inter-cluster similarity.

K-Means Clustering

K-Means is an algorithm that partitions data into a pre-specified number of clusters (k). It works by:

1. Randomly selecting ‘k’ initial centroids (the center of a cluster).
2. Assigning each data point to the nearest centroid, forming clusters.
3. Recalculating the centroids as the mean of all data points in the new clusters.
4. Repeating steps 2 and 3 until the centroids no longer change significantly.
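
The four steps above can be sketched in a few lines of NumPy. This is a teaching sketch under simple assumptions (plain random initialization, Euclidean distance, a hypothetical function name `kmeans_sketch`), not a replacement for scikit-learn's KMeans, which adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
```

On data this cleanly separated, the sketch recovers the two blobs regardless of the starting centroids. On harder data the result can depend on the initialization, which is why library implementations run several restarts.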

The Elbow Method

The Elbow Method is a heuristic for choosing the number of clusters for K-Means. It involves plotting the within-cluster sum of squares (WCSS), the sum of squared distances from each point to the centroid of its cluster, for different values of ‘k’. WCSS always falls as ‘k’ grows, so we look for the point where the rate of decrease sharply flattens (forming an “elbow”). That value of ‘k’ is taken as the optimal number of clusters.
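
As a quick check on what WCSS measures, the sketch below computes it by hand and compares it with the `inertia_` attribute that scikit-learn's KMeans exposes. The synthetic data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Small synthetic dataset (illustrative only)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# WCSS: squared distance from every point to the centroid of its own cluster, summed
assigned_centroids = km.cluster_centers_[km.labels_]
wcss_manual = float(((X_demo - assigned_centroids) ** 2).sum())

print(wcss_manual, km.inertia_)  # the two values should agree up to rounding
```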

Hierarchical Clustering

Hierarchical clustering builds a hierarchy of clusters. There are two main types:

Agglomerative (Bottom-up): Each data point starts in its own cluster. Pairs of clusters are merged as one moves up the hierarchy.
Divisive (Top-down): All data points start in one cluster, and splits are performed recursively as one moves down the hierarchy.
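
To see the bottom-up process in action, the sketch below runs SciPy's `linkage` function, which implements the agglomerative approach, on five made-up points and prints each merge. The points and the choice of single linkage are assumptions picked to keep the arithmetic easy to follow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five points on a line (illustrative values): two tight pairs and one outlier
points = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])

# Each row of Z records one merge:
# [cluster_a, cluster_b, merge distance, number of points in the new cluster]
Z_demo = linkage(points, method="single")
for a, b, dist, size in Z_demo:
    print(f"merge {int(a)} + {int(b)} at distance {dist:.1f} -> {int(size)} points")
```

Original points are numbered 0 to 4, and each newly formed cluster receives the next free id (5, 6, and so on). With n points there are always n - 1 merges, so the last row describes the single cluster that contains everything.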

Dendrograms

A dendrogram is a tree-like diagram that visualizes the arrangement of the clusters produced by hierarchical clustering. It shows the hierarchy of clusters and the distance at which each merge occurred. By cutting the dendrogram at a certain height, you can determine the number of clusters.
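
Cutting at a height can also be done programmatically. The sketch below uses SciPy's `fcluster` with `criterion="distance"` on a tiny made-up dataset. The points and the two cut heights are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Five points on a line (illustrative values): two tight pairs and one outlier
points = np.array([[0.0], [0.1], [5.0], [5.2], [20.0]])
Z_demo = linkage(points, method="single")

# Cutting at height 1.0 keeps only the merges that happened below that distance
labels_at_1 = fcluster(Z_demo, t=1.0, criterion="distance")

# Cutting higher, at 10.0, also allows the merge of the two pairs
labels_at_10 = fcluster(Z_demo, t=10.0, criterion="distance")

print(labels_at_1, labels_at_10)
```

The lower cut leaves three groups (the two pairs and the outlier), while the higher cut leaves two. The same tree therefore yields different cluster counts depending on where it is cut.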

Step-by-Step Walkthrough (The Hands-On Practice)

Step 1: Setting up the Environment

First, let’s import the necessary Python packages.

# Import packages
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import pandas as pd
import seaborn as sns
import yfinance as yf
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Set plotting style
# You can see available styles by running: print(plt.style.available)
plt.style.use("seaborn-v0_8-darkgrid")
%matplotlib inline
sns.set_theme()

# Set pandas display options
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)

Step 2: K-Means Clustering with a Sample Dataset

Let’s start with a simple example to understand K-Means.

Generate a sample dataset: We’ll use make_blobs to create a dataset with 4 distinct clusters.

X, y = make_blobs(centers=4, n_samples=2500, random_state=0)

Visualize the dataset:

fig = plt.figure(figsize=(8, 6))
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Dataset with 4 clusters")
plt.xlabel("First feature")
plt.ylabel("Second feature")
plt.show()

Use the Elbow Method to find the optimal ‘k’:

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init="k-means++", max_iter=300, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

plt.figure(figsize=(7, 5))
plt.plot(range(1, 11), wcss)
plt.title("K-Means Clustering (The Elbow Method)")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()

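You should observe an “elbow” at k=4: WCSS drops steeply up to that point and only marginally afterwards. This matches the four centers we passed to make_blobs, so 4 is the number of clusters to use.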
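
With ‘k’ chosen, a natural follow-up is to fit the final model and inspect the result. The sketch below reuses the same synthetic data and KMeans settings as above. The variable name `cluster_labels` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic dataset as in the steps above
X, y = make_blobs(centers=4, n_samples=2500, random_state=0)

# Fit the final model with the k suggested by the elbow plot
kmeans = KMeans(n_clusters=4, init="k-means++", max_iter=300, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)      # one row of coordinates per centroid
print(np.bincount(cluster_labels))  # number of points assigned to each cluster
```

Passing `c=cluster_labels` to the earlier `plt.scatter` call colors the points by assigned cluster, which makes it easy to compare against the plot colored by the true labels `y`.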

Step 3: Hierarchical Clustering on Foreign Exchange Rates

Data Download: We will download daily foreign exchange rates, quoted against the US dollar, using the yfinance library.

# Download the USD/EUR rate (ticker 'USDEUR=X')
forex_data = yf.download("USDEUR=X", start="2019-01-02", end="2022-06-30")
forex_data = forex_data.reset_index()
euro_df = forex_data[["Date", "Close"]]
euro_df.rename(columns={"Close": "euro"}, inplace=True)
# Repeat for other currencies...

forex_data1 = yf.download("USDRUB=X", start="2019-01-02", end="2022-06-30")
forex_data1 = forex_data1.reset_index()
rub_df = forex_data1[["Date", "Close"]]
rub_df.rename(columns={"Close": "rub"}, inplace=True)

forex_data2 = yf.download("USDGBP=X", start="2019-01-02", end="2022-06-30")
forex_data2 = forex_data2.reset_index()
gbp_df = forex_data2[["Date", "Close"]]
gbp_df.rename(columns={"Close": "gbp"}, inplace=True)

forex_data3 = yf.download("USDJPY=X", start="2019-01-02", end="2022-06-30")
forex_data3 = forex_data3.reset_index()
jpy_df = forex_data3[["Date", "Close"]]
jpy_df.rename(columns={"Close": "jpy"}, inplace=True)

forex_data4 = yf.download("USDKES=X", start="2019-01-02", end="2022-06-30")
forex_data4 = forex_data4.reset_index()
kes_df = forex_data4[["Date", "Close"]]
kes_df.rename(columns={"Close": "kes"}, inplace=True)

forex_data5 = yf.download("USDCNY=X", start="2019-01-02", end="2022-06-30")
forex_data5 = forex_data5.reset_index()
cny_df = forex_data5[["Date", "Close"]]
cny_df.rename(columns={"Close": "cny"}, inplace=True)

forex_data6 = yf.download("USDKRW=X", start="2019-01-02", end="2022-06-30")
forex_data6 = forex_data6.reset_index()
krw_df = forex_data6[["Date", "Close"]]
krw_df.rename(columns={"Close": "krw"}, inplace=True)

forex_data7 = yf.download("USDSGD=X", start="2019-01-02", end="2022-06-30")
forex_data7 = forex_data7.reset_index()
sgd_df = forex_data7[["Date", "Close"]]
sgd_df.rename(columns={"Close": "sgd"}, inplace=True)

forex_data8 = yf.download("USDTWD=X", start="2019-01-02", end="2022-06-30")
forex_data8 = forex_data8.reset_index()
twd_df = forex_data8[["Date", "Close"]]
twd_df.rename(columns={"Close": "twd"}, inplace=True)

forex_data9 = yf.download("USDNGN=X", start="2019-01-02", end="2022-06-30")
forex_data9 = forex_data9.reset_index()
ngn_df = forex_data9[["Date", "Close"]]
ngn_df.rename(columns={"Close": "ngn"}, inplace=True)

forex_data10 = yf.download("USDZAR=X", start="2019-01-02", end="2022-06-30")
forex_data10 = forex_data10.reset_index()
zar_df = forex_data10[["Date", "Close"]]
zar_df.rename(columns={"Close": "zar"}, inplace=True)

forex_data11 = yf.download("USDMYR=X", start="2019-01-02", end="2022-06-30")
forex_data11 = forex_data11.reset_index()
myr_df = forex_data11[["Date", "Close"]]
myr_df.rename(columns={"Close": "myr"}, inplace=True)

forex_data12 = yf.download("USDIDR=X", start="2019-01-02", end="2022-06-30")
forex_data12 = forex_data12.reset_index()
idr_df = forex_data12[["Date", "Close"]]
idr_df.rename(columns={"Close": "idr"}, inplace=True)

forex_data13 = yf.download("USDTHB=X", start="2019-01-02", end="2022-06-30")
forex_data13 = forex_data13.reset_index()
thb_df = forex_data13[["Date", "Close"]]
thb_df.rename(columns={"Close": "thb"}, inplace=True)

forex_data14 = yf.download("USDAUD=X", start="2019-01-02", end="2022-06-30")
forex_data14 = forex_data14.reset_index()
aud_df = forex_data14[["Date", "Close"]]
aud_df.rename(columns={"Close": "aud"}, inplace=True)

forex_data15 = yf.download("USDNZD=X", start="2019-01-02", end="2022-06-30")
forex_data15 = forex_data15.reset_index()
nzd_df = forex_data15[["Date", "Close"]]
nzd_df.rename(columns={"Close": "nzd"}, inplace=True)

forex_data16 = yf.download("USDCAD=X", start="2019-01-02", end="2022-06-30")
forex_data16 = forex_data16.reset_index()
cad_df = forex_data16[["Date", "Close"]]
cad_df.rename(columns={"Close": "cad"}, inplace=True)

forex_data17 = yf.download("USDCHF=X", start="2019-01-02", end="2022-06-30")
forex_data17 = forex_data17.reset_index()
chf_df = forex_data17[["Date", "Close"]]
chf_df.rename(columns={"Close": "chf"}, inplace=True)

forex_data18 = yf.download("USDNOK=X", start="2019-01-02", end="2022-06-30")
forex_data18 = forex_data18.reset_index()
nok_df = forex_data18[["Date", "Close"]]
nok_df.rename(columns={"Close": "nok"}, inplace=True)

forex_data19 = yf.download("USDSEK=X", start="2019-01-02", end="2022-06-30")
forex_data19 = forex_data19.reset_index()
sek_df = forex_data19[["Date", "Close"]]
sek_df.rename(columns={"Close": "sek"}, inplace=True)

forex_data20 = yf.download("USDARS=X", start="2019-01-02", end="2022-06-30")
forex_data20 = forex_data20.reset_index()
ars_df = forex_data20[["Date", "Close"]]
ars_df.rename(columns={"Close": "ars"}, inplace=True)

forex_data21 = yf.download("USDPLN=X", start="2019-01-02", end="2022-06-30")
forex_data21 = forex_data21.reset_index()
pln_df = forex_data21[["Date", "Close"]]
pln_df.rename(columns={"Close": "pln"}, inplace=True)

forex_data22 = yf.download("USDPHP=X", start="2019-01-02", end="2022-06-30")
forex_data22 = forex_data22.reset_index()
php_df = forex_data22[["Date", "Close"]]
php_df.rename(columns={"Close": "php"}, inplace=True)

forex_data23 = yf.download("USDRON=X", start="2019-01-02", end="2022-06-30")
forex_data23 = forex_data23.reset_index()
ron_df = forex_data23[["Date", "Close"]]
ron_df.rename(columns={"Close": "ron"}, inplace=True)

forex_data24 = yf.download("USDHUF=X", start="2019-01-02", end="2022-06-30")
forex_data24 = forex_data24.reset_index()
huf_df = forex_data24[["Date", "Close"]]
huf_df.rename(columns={"Close": "huf"}, inplace=True)

forex_data25 = yf.download("USDBRL=X", start="2019-01-02", end="2022-06-30")
forex_data25 = forex_data25.reset_index()
brl_df = forex_data25[["Date", "Close"]]
brl_df.rename(columns={"Close": "brl"}, inplace=True)

forex_data26 = yf.download("USDCLP=X", start="2019-01-02", end="2022-06-30")
forex_data26 = forex_data26.reset_index()
clp_df = forex_data26[["Date", "Close"]]
clp_df.rename(columns={"Close": "clp"}, inplace=True)

forex_data27 = yf.download("USDMXN=X", start="2019-01-02", end="2022-06-30")
forex_data27 = forex_data27.reset_index()
mxn_df = forex_data27[["Date", "Close"]]
mxn_df.rename(columns={"Close": "mxn"}, inplace=True)

forex_data28 = yf.download("USDCOP=X", start="2019-01-02", end="2022-06-30")
forex_data28 = forex_data28.reset_index()
cop_df = forex_data28[["Date", "Close"]]
cop_df.rename(columns={"Close": "cop"}, inplace=True)

forex_data29 = yf.download("USDILS=X", start="2019-01-02", end="2022-06-30")
forex_data29 = forex_data29.reset_index()
ils_df = forex_data29[["Date", "Close"]]
ils_df.rename(columns={"Close": "ils"}, inplace=True)

forex_data30 = yf.download("USDTRY=X", start="2019-01-02", end="2022-06-30")
forex_data30 = forex_data30.reset_index()
try_df = forex_data30[["Date", "Close"]]
try_df.rename(columns={"Close": "try"}, inplace=True)

forex_data31 = yf.download("USDINR=X", start="2019-01-02", end="2022-06-30")
forex_data31 = forex_data31.reset_index()
inr_df = forex_data31[["Date", "Close"]]
inr_df.rename(columns={"Close": "inr"}, inplace=True)
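
The repeated download blocks above could be condensed into a loop. The sketch below is one possible shape, not the tutorial's original code: `collect_closes` and its `fetch` argument are hypothetical names, and `fake_fetch` with its made-up prices stands in for the network call so the example runs offline.

```python
import pandas as pd

def collect_closes(tickers, fetch):
    """Build one DataFrame of closing prices, one column per currency.

    tickers: dict mapping a column name (e.g. "euro") to a ticker (e.g. "USDEUR=X")
    fetch:   callable taking a ticker and returning a date-indexed DataFrame
             with a "Close" column (hypothetical hook, e.g. a yf.download wrapper)
    """
    closes = {}
    for name, ticker in tickers.items():
        close = fetch(ticker)["Close"]
        # Depending on the yfinance version, this may be a one-column DataFrame
        if isinstance(close, pd.DataFrame):
            close = close.iloc[:, 0]
        closes[name] = close
    # concat aligns the series on their date index, like an outer merge on "Date"
    return pd.concat(closes, axis=1).sort_index()

# Stand-in for the network call (made-up prices, for illustration only)
def fake_fetch(ticker):
    dates = pd.date_range("2019-01-02", periods=3, freq="D", name="Date")
    base = {"USDEUR=X": 0.88, "USDJPY=X": 109.0}[ticker]
    return pd.DataFrame({"Close": [base, base * 1.01, base * 1.02]}, index=dates)

demo = collect_closes({"euro": "USDEUR=X", "jpy": "USDJPY=X"}, fake_fetch)
print(demo)
```

With real data, `fetch` could be `lambda t: yf.download(t, start="2019-01-02", end="2022-06-30")`. The result is already indexed by date, so it would not need the merge and `set_index` steps.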

Combine into a single DataFrame:

from functools import reduce
# Assuming all currency dataframes (euro_df, rub_df, etc.) are created
df_currencies = reduce(
    lambda x, y: pd.merge(x, y, on="Date", how="outer"),
    [kes_df, ars_df, php_df, myr_df, ils_df, cop_df, euro_df, ngn_df, huf_df, ron_df, cny_df, rub_df, clp_df, sgd_df, twd_df, krw_df, idr_df, thb_df, inr_df, pln_df, try_df, brl_df, mxn_df, zar_df, gbp_df, jpy_df, aud_df, nzd_df, cad_df, chf_df, nok_df, sek_df]
)
df_currencies.set_index("Date", inplace=True)

Handle missing values and scale the data:

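# fillna(method="ffill") is deprecated in recent pandas; ffill() is the direct equivalent
df_currencies.ffill(inplace=True)
# Back-fill any gaps left at the very start of the sample so no NaNs reach linkage()
df_currencies.bfill(inplace=True)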
sc = StandardScaler()
subset_scaled_df = pd.DataFrame(
    sc.fit_transform(df_currencies),
    columns=df_currencies.columns,
)
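
Scaling matters here because exchange rates live on very different numeric scales (a USD/JPY rate is around a hundred times larger than a USD/EUR rate), and distance-based clustering would otherwise be dominated by the largest numbers. The sketch below, using made-up prices, shows that StandardScaler is a per-column z-score.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up price levels on very different scales (illustrative only)
prices = pd.DataFrame({
    "jpy": [109.0, 110.0, 111.0, 108.0],
    "euro": [0.88, 0.89, 0.87, 0.90],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(prices), columns=prices.columns)

# Equivalent manual z-score: subtract the column mean,
# then divide by the population standard deviation (ddof=0)
manual = (prices - prices.mean()) / prices.std(ddof=0)

print(scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, so each currency contributes on an equal footing to the distances used by the clustering.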

Generate and visualize the dendrogram:

# Using 'ward' linkage as an example
Z = linkage(subset_scaled_df.T, method="ward")
fig = plt.figure(figsize=(25, 10))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Currency")
plt.ylabel("Distance")
dendrogram(
    Z,
    leaf_rotation=90.,
    leaf_font_size=8.,
    labels=subset_scaled_df.columns
)
plt.show()
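
The `cophenet` and `pdist` functions imported in Step 1 can be used to compare linkage methods through the cophenetic correlation, which measures how faithfully a dendrogram preserves the original pairwise distances (values closer to 1 are better). The sketch below runs on random stand-in data. With the real data you would pass `subset_scaled_df.T` instead of `data`.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Stand-in for subset_scaled_df.T: 6 "currencies" (rows) x 50 observations (columns)
data = rng.normal(size=(6, 50))

pairwise = pdist(data, metric="euclidean")  # condensed pairwise distance vector
scores = {}
for method in ["single", "complete", "average", "ward"]:
    Z_method = linkage(data, method=method)
    # Given the original distances, cophenet returns (correlation, cophenetic distances)
    corr, _ = cophenet(Z_method, pairwise)
    scores[method] = corr

best_method = max(scores, key=scores.get)
print(scores, best_method)
```

A higher cophenetic correlation is one reasonable criterion for picking a linkage method, though it is not the only one. Ward linkage is often preferred for producing compact, similarly sized clusters even when its score is not the highest.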

How to Read the Dendrogram

Think of a dendrogram as a family tree for your data points (in this case, currencies).

X-Axis (Bottom): This axis lists all the individual data points or leaves of the tree. In our tutorial, these are the different currency tickers (e.g., ‘euro’, ‘jpy’, ‘gbp’).
Y-Axis (Left): This axis represents the distance or dissimilarity between clusters. The scale is determined by the linkage method you chose (e.g., ‘ward’). A smaller distance means the data points are more similar.
Horizontal Lines: Each horizontal line is a merge. It represents two or more clusters (or individual data points) joining to form a larger cluster. The height of this line on the y-axis tells you the distance at which this merge happened.
Vertical Lines: These lines simply connect the clusters. The length of a vertical line represents the distance between the merge points of the clusters it connects. Long vertical lines are significant because they indicate a large distance gap. This suggests that the clusters being merged are not very similar.

What Does It Signify?

The dendrogram provides a rich, visual summary of the relationships within your data. It signifies:

Similarity: Currencies that are joined together at the bottom of the dendrogram are very similar to each other based on their historical price movements. For example, you might find that currencies of geographically close or economically linked countries cluster together early.
Hierarchy: It shows the nested grouping structure of your data. You can see how individual currencies form small clusters, and how those small clusters then merge to form larger ones, all the way up to a single cluster containing all currencies at the top.
Natural Groupings: Unlike K-Means, you don’t have to specify the number of clusters beforehand. The dendrogram allows you to see the potential for different numbers of clusters and make an informed decision based on the data’s structure.

Deciding the Number of Clusters

Your primary next step is to use the dendrogram to decide on the optimal number of clusters. You do this by “cutting” the tree.

Find the Longest Vertical Line: Look for the longest vertical line that doesn’t have a horizontal line (a merge) crossing it. A long vertical line signifies a large jump in distance, which is a good indicator of a natural division in the data.
Draw a Horizontal Cut: Draw a horizontal line through that longest vertical line.
Count the Clusters: The number of vertical lines your horizontal cut intersects is the optimal number of clusters.

For example, if you draw a horizontal line at a distance of ~55 and it crosses 4 vertical lines, then 4 is a good choice for the number of clusters.
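
The “longest vertical line” rule can be approximated in code by looking for the largest gap between consecutive merge distances in the linkage matrix. The sketch below is a heuristic under stated assumptions (synthetic stand-in data with three obvious groups, Ward linkage), not a definitive rule. With the real data you would use the `Z` computed earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Stand-in data with 3 well-separated groups (illustrative only)
X_demo, _ = make_blobs(
    n_samples=60,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
Z_demo = linkage(X_demo, method="ward")

heights = Z_demo[:, 2]   # merge distances, in increasing order
gaps = np.diff(heights)  # vertical distance between consecutive merges
i = int(gaps.argmax())   # the largest gap lies between merge i and merge i + 1

cut_height = (heights[i] + heights[i + 1]) / 2  # cut in the middle of that gap
n_clusters = len(heights) - i                   # clusters left after merges 0..i

labels = fcluster(Z_demo, t=cut_height, criterion="distance")
print(n_clusters, cut_height)
```

This automates the visual rule, but it is worth checking the suggestion against the plot: the largest gap is often at the very top of the tree, which can suggest just two clusters even when a finer split is more useful.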

Once you’ve decided on the number of clusters (k=4), you can then use AgglomerativeClustering from scikit-learn. This tool helps formally create these clusters and assign each currency to a group.

# After deciding on k=4 from the dendrogram
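# 'ward' linkage always uses Euclidean distance; the old 'affinity' argument
# has been removed from recent scikit-learn releases, so it is omitted here
cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')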
cluster_labels = cluster.fit_predict(subset_scaled_df.T)

# See which currency belongs to which cluster
clustered_currencies = pd.DataFrame(data={"currency": subset_scaled_df.columns, "cluster": cluster_labels})
print(clustered_currencies)

from IPython.display import display, HTML

html_output = "<table><tr>"
for cluster_id in sorted(clustered_currencies['cluster'].unique()):
    cluster_df = clustered_currencies[clustered_currencies['cluster'] == cluster_id].reset_index(drop=True)
    html_output += f"<td style='vertical-align: top;'><b>Cluster {cluster_id}:</b><br>{cluster_df.to_html(index=False)}</td>"
html_output += "</tr></table>"

display(HTML(html_output))

The final step is to analyze the resulting groups using your domain knowledge as a quant. For instance, you might find one cluster of stable “safe-haven” currencies, another of volatile emerging market currencies, and a third of commodity-linked currencies. This insight can then be used for portfolio diversification, pairs trading strategies, or risk management.

Conclusion & Next Steps

In this tutorial, you’ve learned the fundamentals of unsupervised machine learning, specifically focusing on K-Means and hierarchical clustering. You’ve implemented these techniques in Python and applied them to a real-world financial dataset of foreign exchange rates.

For next steps, we will explore:

Divisive Hierarchical Clustering: Implement the top-down approach to clustering.
Other Clustering Algorithms: Investigate algorithms like DBSCAN and Gaussian Mixture Models.
Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) can be used to reduce the number of features before clustering.
