Mastering HDBSCAN: Clustering Variable Density Data Made Easy

LDS Team
Let's Data Science

Imagine you are an urban planner analyzing population data. You have a dataset containing the GPS coordinates of every house in a region. You want to identify distinct "neighborhoods."

Some neighborhoods are tightly packed city centers (high density). Others are sprawling suburbs (medium density). And then there are rural villages (low density).

If you use K-Means, it forces every neighborhood to be a circle of roughly the same size—completely failing to capture the irregular shapes of a city.

If you use DBSCAN, you hit a wall. You have to pick a single density threshold (epsilon). If you pick a high density to find the city center, the suburbs look like noise. If you pick a low density to find the suburbs, the city center merges into one giant, useless blob.

Enter HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise).

HDBSCAN is the "smarter cousin" of DBSCAN. It doesn't just find clusters of arbitrary shapes; it finds clusters of varying densities. It automatically determines the optimal number of clusters and handles noise gracefully, all while requiring fewer hyperparameters than its predecessor.

In this guide, we will dismantle the algorithm step-by-step, explain the math behind its "stability" metric, and implement it in Python.

Why does DBSCAN struggle with variable density?

To understand HDBSCAN, we must first understand the limitation of standard DBSCAN.

DBSCAN defines clusters using a fixed distance parameter, epsilon ($\epsilon$). Two points are connected if they are within $\epsilon$ of each other.

  • Small $\epsilon$: Great for separating dense clusters, but sparse clusters (like suburbs) are fragmented into noise points.
  • Large $\epsilon$: Great for finding sparse clusters, but dense clusters (like distinct city districts) merge together into a single "mega-cluster."

DBSCAN assumes that all clusters in your dataset have roughly the same density. In the real world—whether it's geospatial data, customer transactions, or image pixels—this assumption rarely holds.

What is HDBSCAN?

HDBSCAN effectively runs DBSCAN over all possible values of $\epsilon$ simultaneously and then uses a stability measure to pick the best clusters across density levels.

It combines the best features of two major clustering families:

  1. Density-based clustering: It can find weird shapes and ignore noise.
  2. Hierarchical clustering: It builds a tree of clusters to understand the structure of data at different scales.

The result is an algorithm that says: "I found a very dense cluster here, and a less dense but still valid cluster over there. I'll keep both."

How does HDBSCAN actually work?

The algorithm follows five distinct steps. While the underlying graph theory is complex, the intuition is beautifully simple.

Step 1: Transform the Space (Mutual Reachability Distance)

Real-world data is messy. You often have a dense cluster of points with a few "noise" points floating nearby. If we just use standard Euclidean distance, a noise point might accidentally bridge the gap between two distinct clusters.

HDBSCAN solves this by transforming the space to push sparse points further away. It uses a metric called Mutual Reachability Distance.

First, we define the Core Distance of a point $x$, denoted as $\text{core}_k(x)$. This is simply the distance from point $x$ to its $k$-th nearest neighbor.

  • If $x$ is in a dense area, $\text{core}_k(x)$ is small.
  • If $x$ is in a sparse area (noise), $\text{core}_k(x)$ is large.

Now, the Mutual Reachability Distance between two points $a$ and $b$ is:

$$d_{\text{mreach}}(a, b) = \max\{\text{core}_k(a),\ \text{core}_k(b),\ d(a, b)\}$$

Where $d(a, b)$ is the original Euclidean distance.

In Plain English: This formula says "The effective distance between two points is at least the core distance of the sparser point."

If two points are both in a dense cluster, their core distances are tiny, so the distance is just the normal Euclidean distance.

However, if one point is an outlier (noise), it has a huge core distance. This "pushes" it away from everything else mathematically. It makes it much harder for noise points to act as bridges between clusters.
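
To make this concrete, here is a minimal sketch (the toy points and the value of k are invented for illustration) that computes core distances with scikit-learn's NearestNeighbors and then builds the pairwise mutual reachability matrix with NumPy:

python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: three close points plus one distant outlier (values chosen for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0]])
k = 2  # plays the role of min_samples

# Core distance: distance from each point to its k-th nearest neighbor
# (+1 because kneighbors counts the point itself as its own nearest neighbor)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_toy)
knn_dist, _ = nn.kneighbors(X_toy)
core_dist = knn_dist[:, -1]

# Mutual reachability: max of the two core distances and the Euclidean distance
euclid = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=-1)
mreach = np.maximum(euclid, np.maximum(core_dist[:, None], core_dist[None, :]))

print("Core distances:     ", np.round(core_dist, 3))
print("Euclidean d(2, 3):  ", round(euclid[2, 3], 3))
print("Mutual reach (2, 3):", round(mreach[2, 3], 3))  # pushed out to the outlier's core distance

Notice that the distance between the outlier and its nearest cluster point is no longer the raw Euclidean distance but the outlier's (large) core distance.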

Step 2: Build a Minimum Spanning Tree (MST)

Now that we have this new distance metric, we view the data as a graph. Every data point is a node, and the edges are weighted by the mutual reachability distance.

HDBSCAN builds a Minimum Spanning Tree (MST). This connects all points in the dataset such that the total weight of the edges is minimized. Effectively, it draws the "skeleton" of your data structure.
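
Continuing the toy sketch above, SciPy's minimum_spanning_tree can build this skeleton directly from the mutual reachability matrix (an illustrative approximation; HDBSCAN uses more efficient internal algorithms for the same idea):

python
from scipy.sparse.csgraph import minimum_spanning_tree

# Use the mutual reachability matrix from the previous sketch as an edge-weighted graph
graph = mreach.copy()
np.fill_diagonal(graph, 0)          # no self-loops; zero means "no edge" here

mst = minimum_spanning_tree(graph)  # sparse matrix holding only the kept edges
rows, cols = mst.nonzero()
for a, b in zip(rows, cols):
    print(f"edge {a} -- {b}, weight = {mst[a, b]:.3f}")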

Step 3: Build the Cluster Hierarchy

If we were doing standard Hierarchical Clustering, we would cut this tree at a specific horizontal line. But HDBSCAN does something smarter.

It sorts the edges of the MST by distance (from smallest to largest) and iterates through them. This creates a dendrogram (a tree diagram).

  • At the bottom (distance = 0), every point is its own cluster.
  • As we move up (increasing distance), points merge into clusters.
  • Eventually, everything merges into one giant cluster.
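
This hierarchy is equivalent to single-linkage clustering performed on the mutual reachability distances, so you can peek at it with SciPy (again continuing the toy example; a real dataset would produce a much richer dendrogram):

python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# Single-linkage over the mutual reachability distances reproduces this hierarchy
condensed = squareform(graph, checks=False)   # flatten the symmetric distance matrix
Z = linkage(condensed, method="single")

dendrogram(Z)
plt.title("Cluster hierarchy over mutual reachability distance")
plt.xlabel("point index")
plt.ylabel("merge distance")
plt.show()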

Step 4: Condense the Cluster Tree

A raw dendrogram for a large dataset is a mess—it's too complex. HDBSCAN simplifies it into a Condensed Tree.

It walks through the hierarchy. As the "density threshold" drops (distance increases), a cluster might split into two.

  1. If a cluster splits and one side is very small (fewer points than min_cluster_size), we treat those points as "falling out" of the cluster. They are effectively noise dying off.
  2. If a cluster splits into two large groups (both > min_cluster_size), we treat this as a genuine split into two new child clusters.

This process trims away the noise and leaves us with a simplified tree of "true" clusters evolving over time.
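
The decision at each split boils down to a simple size check. The helper below is a hypothetical simplification of that rule, not HDBSCAN's actual internal code:

python
# Hypothetical illustration of the condensing rule (not HDBSCAN's real internals)
def split_outcome(left_size: int, right_size: int, min_cluster_size: int) -> str:
    """Describe what happens when a cluster splits into two children."""
    if left_size >= min_cluster_size and right_size >= min_cluster_size:
        return "genuine split into two child clusters"
    return "smaller side falls out as noise; the parent cluster carries on"

print(split_outcome(150, 90, min_cluster_size=20))  # genuine split
print(split_outcome(150, 4, min_cluster_size=20))   # a few points fall out as noise

If you use the standalone hdbscan package, its condensed_tree_ attribute lets you plot exactly this simplified hierarchy.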

Step 5: Extract Stable Clusters

This is the magic step. How do we decide which clusters to keep?

HDBSCAN defines a measure called Cluster Stability. We want clusters that persist over a long range of density values. A cluster that appears and immediately splits is unstable. A cluster that stays together while we vary the density threshold is stable.

Mathematically, instead of using the distance $d$, we use $\lambda = \frac{1}{d}$. For a cluster $C$, we calculate its stability $S(C)$ by summing up the persistence of all points $p$ inside it:

$$S(C) = \sum_{p \in C} (\lambda_p - \lambda_{\text{birth}})$$

  • $\lambda_{\text{birth}}$: The density value where the cluster formed.
  • $\lambda_p$: The density value where point $p$ "fell out" of the cluster (became noise or split).

In Plain English: Think of clusters as islands emerging from the sea as the water level drops. The "height" of the island represents density.

A "stable" cluster is a mountain with steep sides and a wide plateau—it stays distinct as an island for a long time as the water level goes down. A "spiky" unstable cluster appears and disappears quickly. HDBSCAN selects the combination of non-overlapping islands that maximizes the total volume (stability) of the system.

Key Hyperparameters

One of the biggest selling points of HDBSCAN is how little parameter tuning it needs. You really only need to worry about one (or sometimes two) parameters.

1. min_cluster_size

This is the most critical parameter. It defines the smallest grouping of points you are willing to consider a cluster.

  • Too small: You get many micro-clusters (overfitting).
  • Too large: You might merge distinct groups (underfitting).
  • Rule of thumb: Start with the smallest size that makes business sense. For a dataset of 1,000 points, maybe 10 or 20.

2. min_samples (Optional)

This controls how conservative the clustering is. It sets the kk value for the Core Distance calculation in Step 1.

  • Default: usually equals min_cluster_size.
  • Higher value: The algorithm requires more evidence to form a cluster. Points in lower-density areas are more likely to be declared noise. It creates "smoother" clusters.
  • Lower value: The algorithm is more sensitive to local details but may pick up more noise.
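
A quick way to build intuition is to sweep min_cluster_size and watch how the number of clusters and the noise fraction respond. The sketch below assumes you already have a 2-D feature array X, such as the synthetic dataset built in the next section:

python
import numpy as np
from sklearn.cluster import HDBSCAN

# Sweep the main parameter and report cluster count and noise fraction
for mcs in (5, 20, 50):
    labels = HDBSCAN(min_cluster_size=mcs).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_frac = float(np.mean(labels == -1))
    print(f"min_cluster_size={mcs:>2}: {n_clusters} clusters, {noise_frac:.0%} noise")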

Python Implementation

We will use the scikit-learn implementation (available in version 1.3+).

💡 Pro Tip: If you are on an older environment, you can install the standalone hdbscan library (pip install hdbscan), which offers additional plotting capabilities.

We'll generate a dataset with variable densities: one dense blob, one sparse blob, and some noise.

python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# 1. Generate synthetic data with variable densities
# Dense cluster
X1, _ = make_blobs(n_samples=400, centers=[[0, 0]], cluster_std=0.6, random_state=0)
# Sparse cluster
X2, _ = make_blobs(n_samples=200, centers=[[6, 6]], cluster_std=2.0, random_state=0)
# Random noise
noise = np.random.uniform(low=-6, high=10, size=(50, 2))

X = np.vstack([X1, X2, noise])

# 2. Run HDBSCAN
# min_cluster_size=20: We ignore groups smaller than 20 points
clusterer = HDBSCAN(min_cluster_size=20, min_samples=10)
labels = clusterer.fit_predict(X)

# 3. Visualization
plt.figure(figsize=(10, 6))

# Points classified as noise are labeled -1
unique_labels = sorted(set(labels))  # sort so colors map to labels deterministically
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
        label_name = "Noise"
        marker = 'x'
    else:
        label_name = f"Cluster {k}"
        marker = 'o'

    class_member_mask = (labels == k)
    
    xy = X[class_member_mask]
    plt.plot(xy[:, 0], xy[:, 1], marker, markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=8, linestyle='None', label=label_name)

plt.title('HDBSCAN Clustering: Handling Variable Densities')
plt.legend()
plt.show()

Understanding the Output

When you run this code, you will likely see:

  • Cluster 0 (The Dense Blob): Correctly identified, despite being very tight.
  • Cluster 1 (The Sparse Blob): Correctly identified as a single cluster, even though points are far apart.
  • Noise (Black x's): The random points scattered around are labeled as -1.

If you had used standard DBSCAN with a small eps, the sparse blob would have turned entirely into noise. If you used a large eps, the noise points would have merged everything together. HDBSCAN finds the balance automatically.
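
You can check this claim on the same data. The short sketch below reruns scikit-learn's DBSCAN at a small and a large eps; the exact counts will vary with the random data, but the pattern should match the description above:

python
from sklearn.cluster import DBSCAN

# Compare DBSCAN at two epsilon values on the same X
for eps in (0.3, 2.0):
    db_labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
    n_noise = int((db_labels == -1).sum())
    print(f"DBSCAN eps={eps}: {n_clusters} clusters, {n_noise} noise points")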

Soft Clustering with HDBSCAN

Unlike K-Means (which is "hard" clustering—you are either in or out), HDBSCAN can provide probability scores. This tells you how strongly a point belongs to its cluster.

Points near the dense "heart" of a cluster will have a probability close to 1.0. Points on the fuzzy edges might have 0.4 or 0.5.

python
# Accessing probabilities
probabilities = clusterer.probabilities_

# Check the probability of the first 5 points
print("Probabilities of first 5 points:")
print(probabilities[:5])

# Find points that are "uncertain" (e.g., probability between 0.1 and 0.5)
uncertain_points = X[(probabilities > 0.1) & (probabilities < 0.5)]
print(f"Number of edge-case points: {len(uncertain_points)}")

🔑 Key Insight: This probabilistic output is similar to Gaussian Mixture Models, but without assuming your data is shaped like a bell curve (Gaussian). It gives you the flexibility of soft clustering with the shape-agnostic power of density clustering.

Comparison: HDBSCAN vs. Other Algorithms

| Feature | K-Means | DBSCAN | HDBSCAN |
| --- | --- | --- | --- |
| Shape | Spherical/Convex | Arbitrary | Arbitrary |
| Density | Assumes uniform | Assumes uniform | Handles variable |
| Noise handling | No (sensitive to outliers) | Yes | Yes |
| Parameters | $K$ (hard to guess) | $\epsilon$, min_samples (sensitive) | min_cluster_size (robust) |
| Best use case | General segmentation | Geospatial, dense blobs | Noisy, complex, real-world data |

When should you NOT use HDBSCAN?

While HDBSCAN is powerful, it is not a silver bullet.

  1. Known Number of Clusters: If you know you need exactly 5 customer segments for a marketing campaign, K-Means is easier to control. HDBSCAN determines the number of clusters for you, and you can't force it to give you exactly 5.
  2. Globular, Uniform Data: If your data is clean and clusters are roughly spherical, K-Means is faster and simpler.
  3. Very High Dimensions: Like all distance-based algorithms, HDBSCAN suffers from the "Curse of Dimensionality." If you have hundreds of features, distances become meaningless. You should apply dimensionality reduction (like PCA or UMAP) before clustering, as in the sketch below.
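
Here is a minimal sketch of that pre-processing step, assuming a generic high-dimensional feature matrix (the random data, component count, and min_cluster_size are placeholders; swap in your own features):

python
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for a wide dataset: 500 samples, 100 features of pure noise
rng = np.random.default_rng(0)
X_high = rng.normal(size=(500, 100))

# Scale, project down to a handful of components, then cluster in the reduced space
reducer = make_pipeline(StandardScaler(), PCA(n_components=10))
X_low = reducer.fit_transform(X_high)

labels = HDBSCAN(min_cluster_size=25).fit_predict(X_low)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)  # pure noise input, so expect few or none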

Conclusion

HDBSCAN represents the modern standard for general-purpose clustering. By combining the strengths of density-based and hierarchical approaches, it solves the "epsilon parameter" headache that plagued DBSCAN users for years.

It builds a hierarchy of your data, analyzes the stability of clusters at different densities, and extracts the most robust structures, whether they are dense city centers or sparse rural villages.

If you are dealing with real-world, noisy data where clusters have different densities and shapes, HDBSCAN should be your first choice.

Hands-On Practice

In this tutorial, we will explore HDBSCAN, a powerful clustering algorithm that overcomes the limitations of K-Means and standard DBSCAN by identifying clusters of varying densities. While standard DBSCAN requires you to choose a rigid density threshold, HDBSCAN automatically determines the optimal number of clusters and handles noise gracefully. We will apply this algorithm to a customer segmentation dataset to identify distinct groups of shoppers based on their income and spending habits, demonstrating how density-based clustering can uncover natural patterns that distance-based methods often miss.

Dataset: Customer Segments (Clustering). Customer segmentation data with 5 natural clusters based on income, spending, and age; silhouette score ≈ 0.75 with k=5.

Try It Yourself


Clustering: 200 customer records for segmentation

Experiment by adjusting min_cluster_size in HDBSCAN to see how the hierarchy changes; increasing it will force small clusters to merge or be labeled as noise. You can also revisit the standard DBSCAN implementation and try changing eps (e.g., to 0.5 or 0.2) to witness how sensitive it is compared to HDBSCAN's robust adaptive approach. Observe how HDBSCAN maintains the main customer segments even without fine-tuning a distance threshold.