If you have ever tried to visualize a dataset with 100,000 rows using t-SNE, you probably had time to brew coffee, drink it, and perhaps write a novel while waiting for the algorithm to finish.
For years, t-SNE was the undisputed king of visualizing high-dimensional data. It produced beautiful clusters that made complex data look simple. But it came with heavy baggage: it was agonizingly slow, it struggled with massive datasets, and perhaps most critically, it often destroyed the global structure of the data. You could see clusters, but you couldn't trust the distance between those clusters.
Enter UMAP (Uniform Manifold Approximation and Projection).
Published in 2018 by Leland McInnes, John Healy, and James Melville, UMAP promised to be the "t-SNE killer." It runs faster, scales better, and preserves the global relationships that t-SNE often discards. But how does it actually work? Is it just a faster t-SNE, or is the math fundamentally different?
In this guide, we will dismantle UMAP to understand the topological machinery under the hood, compare it directly with t-SNE, and show you how to implement it for production-ready data science.
What is UMAP and why do we need it?
UMAP is a non-linear dimensionality reduction algorithm that constructs a high-dimensional graph representation of data and optimizes a low-dimensional layout to be as structurally similar to that graph as possible. Unlike t-SNE, UMAP is grounded in algebraic topology, allowing it to preserve both local cluster details and broad global structure.
Most dimensionality reduction techniques fall into two camps:
- Matrix Factorization (e.g., PCA): Fast and preserves global structure, but fails to capture complex, non-linear patterns.
- Neighbor Graphs (e.g., t-SNE): Excellent at finding local clusters (non-linear), but slow and loses the "big picture."
UMAP bridges this gap. It assumes your data is uniformly distributed on a "locally connected Riemannian manifold." That sounds terrifyingly complex, but the intuition is simple: UMAP learns the "shape" of your data surface in high dimensions and flattens it out onto a 2D or 3D map, much like peeling an orange to make a flat map of the world—but with less distortion than t-SNE.
Before diving deep, if you are new to the concept of manifold learning or t-SNE, you might want to read our guide on Visualizing the Invisible: How t-SNE Unlocks High-Dimensional Data.
How does UMAP handle varying data density?
UMAP solves the problem of varying density by creating a custom "distance ruler" for every data point. In regions where data is dense, the unit of distance is short; in sparse regions, the unit of distance stretches out. This normalization ensures that every point is connected to its nearest neighbors with roughly equal weight, regardless of how isolated it is.
To understand this, we need to look at how UMAP measures distance. In a standard Euclidean world, points in a dense cluster sit very close together, while outliers sit far from everything else. If you just used raw distance, the outliers would look completely disconnected.
UMAP introduces two key mathematical quantities for each point $x_i$:
- $\rho_i$ (rho): the distance from $x_i$ to its nearest neighbor.
- $\sigma_i$ (sigma): a scaling factor (normalization) that adapts to the local density around $x_i$.

The similarity (or connection strength) from point $x_i$ to point $x_j$ is calculated as:

$$p_{j|i} = \exp\left(-\frac{\max\big(0,\; d(x_i, x_j) - \rho_i\big)}{\sigma_i}\right)$$
In Plain English: This formula says "The connection strength depends on how far apart points are, minus the distance to the closest neighbor." By subtracting $\rho_i$, UMAP ensures that every point is connected to at least one neighbor with a weight of 1.0. It basically says, "I don't care how lonely you are; your closest neighbor is considered 'close' by definition."
This is a brilliant trick. It makes the algorithm "blind" to absolute density. Whether a cluster is tight or spread out, UMAP sees them as topologically similar.
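To make the scaling trick concrete, here is a minimal sketch (not the library's internals) that computes $\rho_i$ and a UMAP-style connection weight from k-nearest-neighbor distances using scikit-learn. The mean-distance stand-in for $\sigma_i$ is a simplification assumed for illustration; the real algorithm tunes each $\sigma_i$ by binary search so that each point's total neighbor weight is roughly $\log_2(k)$.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

X = load_digits().data
k = 15

# Distances to each point's k nearest neighbors (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                       # drop the zero self-distance

rho = dists[:, 0]                          # rho_i: distance to the nearest neighbor
sigma = dists.mean(axis=1)                 # crude stand-in for sigma_i (UMAP uses binary search)

# Directed weight p_{j|i} = exp(-(d(x_i, x_j) - rho_i) / sigma_i)
weights = np.exp(-(dists - rho[:, None]) / sigma[:, None])

print(weights[:, 0].min())                 # 1.0: every point's closest neighbor is "close" by definition
print(weights[:, -1].mean())               # farther neighbors decay toward 0
```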
How does UMAP build the high-dimensional graph?
UMAP constructs a "fuzzy simplicial complex" by calculating nearest neighbors for every point and combining their connection probabilities. Unlike standard graphs where edges are binary (connected or not), UMAP edges represent the probability that a topological connection exists.
Once UMAP calculates the directional weights $p_{j|i}$ (from $x_i$ to $x_j$) and $p_{i|j}$ (from $x_j$ to $x_i$), it must combine them into a single symmetric bond. Since the "local metric" at point $x_i$ is different from the one at point $x_j$, these weights are rarely equal.
UMAP combines them using a probabilistic sum (also known as a fuzzy union):

$$p_{ij} = p_{j|i} + p_{i|j} - p_{j|i} \, p_{i|j}$$
In Plain English: This formula is the probability of "A or B" happening: $P(A \cup B) = P(A) + P(B) - P(A)P(B)$. It essentially says: "If point A thinks point B is a neighbor, OR if point B thinks point A is a neighbor, then they are connected." This symmetrization ensures that if an outlier connects to a cluster, the cluster "accepts" the connection, keeping the global structure intact.
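As a quick sketch of the symmetrization step, assume `P` is a small dense matrix where `P[i, j]` holds the directed weight $p_{j|i}$ (zero where $x_j$ is not among $x_i$'s neighbors); the real library performs the same operation on a sparse k-NN graph.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy directed weights: most entries zero, mimicking a sparse k-NN graph
P = rng.random((5, 5)) * (rng.random((5, 5)) < 0.4)
np.fill_diagonal(P, 0.0)

# Fuzzy union (probabilistic OR): p_ij = p_{j|i} + p_{i|j} - p_{j|i} * p_{i|j}
P_sym = P + P.T - P * P.T

print(np.allclose(P_sym, P_sym.T))   # True: the graph is now symmetric
print(P_sym.max() <= 1.0)            # True: weights remain valid probabilities
```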
How does UMAP optimize the low-dimensional layout?
UMAP projects data by minimizing the Cross-Entropy between the high-dimensional graph (the weights $p_{ij}$) and the low-dimensional graph (the weights $q_{ij}$). While t-SNE uses KL Divergence to preserve local neighbors, Cross-Entropy forces UMAP to preserve both neighbors (attraction) and gaps between clusters (repulsion).
This is the mathematical secret sauce that gives UMAP its speed and global structure preservation.
The cost function UMAP minimizes is:

$$CE(P, Q) = \sum_{i \neq j} \left[\, p_{ij} \log\frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log\frac{1 - p_{ij}}{1 - q_{ij}} \,\right]$$

Let's break this monster into two fighting forces:

- Attraction (Left Term): $p_{ij} \log\frac{p_{ij}}{q_{ij}}$
  - This looks like t-SNE's KL Divergence.
  - It is active when $p_{ij}$ is high (points are neighbors).
  - It pulls those points together in the low-dimensional map.
- Repulsion (Right Term): $(1 - p_{ij}) \log\frac{1 - p_{ij}}{1 - q_{ij}}$
  - It is active when $p_{ij}$ is small (points are NOT neighbors).
  - It penalizes the model if non-neighbors are placed too close together in the output.
  - It pushes disparate clusters apart.
In Plain English: Think of t-SNE as a party planner who focuses entirely on ensuring friends sit at the same table. It doesn't care if two rival tables are placed right next to each other.
UMAP is a wedding planner who ensures friends sit together (Attraction) AND that rival families are seated at opposite ends of the hall (Repulsion). Because UMAP explicitly calculates this repulsion cost, it preserves the empty space (global structure) between clusters much better than t-SNE.
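The sketch below evaluates the two terms separately, under the simplifying assumption that the graphs are tiny dense matrices (the library never computes the full sum; it minimizes it with stochastic gradient descent over sampled edges), so you can see attraction and repulsion as distinct numbers.

```python
import numpy as np

def umap_cross_entropy(P, Q, eps=1e-12):
    """Return the attraction and repulsion terms of the UMAP cross-entropy."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    attraction = np.sum(P * np.log(P / Q))                   # punished when neighbors land far apart
    repulsion = np.sum((1 - P) * np.log((1 - P) / (1 - Q)))  # punished when non-neighbors land close together
    return attraction, repulsion

# Scenario 1: strong neighbors in high-D placed far apart in low-D
a1, r1 = umap_cross_entropy(np.array([[0.9]]), np.array([[0.1]]))
# Scenario 2: non-neighbors in high-D squashed together in low-D
a2, r2 = umap_cross_entropy(np.array([[0.1]]), np.array([[0.9]]))

print(f"neighbors far apart:   attraction={a1:.2f}, repulsion={r1:.2f}")
print(f"non-neighbors crowded: attraction={a2:.2f}, repulsion={r2:.2f}")
# Each per-edge total is >= 0 even though one term can be slightly negative.
```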
UMAP vs. t-SNE: Which one should you use?
The short answer: For most modern applications, start with UMAP. It is generally superior in speed and utility, though t-SNE still creates slightly sharper visualizations for purely local clustering.
| Feature | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (especially on large datasets) | Fast (scales to millions) |
| Global Structure | Poor (clusters arbitrarily placed) | Good (preserves relationships) |
| Interpretation | Cluster distances are meaningless | Cluster distances are meaningful |
| New Data | Cannot transform new points | Can transform new points |
| Initialization | Random (usually) | Spectral (deterministic) |
| Density | Preserves density variations | Normalizes density |
⚠️ Common Pitfall: Just because UMAP is better at global structure doesn't mean the distances are perfectly accurate like in PCA. You still cannot treat a UMAP plot like a ruler-perfect map, but it is far more reliable than t-SNE.
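If you want to verify the speed claim on your own machine, a rough side-by-side timing on the digits data looks like the sketch below; exact numbers depend heavily on hardware and library versions, and UMAP's first call also pays a one-time numba JIT compilation cost.

```python
import time
import umap
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data   # 1797 samples, 64 features

start = time.perf_counter()
TSNE(n_components=2, random_state=42).fit_transform(X)
print(f"t-SNE took {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
umap.UMAP(random_state=42).fit_transform(X)   # first call includes numba compilation overhead
print(f"UMAP  took {time.perf_counter() - start:.1f}s")
```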
Practical Implementation: UMAP in Python
Let's see UMAP in action. We will use the umap-learn library. If you don't have it, install it via pip:
pip install umap-learn
We will run UMAP on the famous digits dataset (handwritten numbers).
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
# 1. Load Data
digits = load_digits()
data = digits.data
labels = digits.target
print(f"Data Shape: {data.shape}")
# Expected Output: (1797, 64) - 1797 images, 64 pixels each
# 2. Initialize and Fit UMAP
# random_state ensures reproducibility
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(data)
print(f"Embedding Shape: {embedding.shape}")
# Expected Output: (1797, 2) - Reduced to 2 dimensions
# 3. Visualize
plt.figure(figsize=(10, 8))
plt.scatter(
embedding[:, 0],
embedding[:, 1],
c=labels,
cmap='Spectral',
s=5
)
plt.colorbar(label='Digit Label', boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP Projection of the Digits Dataset', fontsize=16)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
When you run this, you will see distinct clusters for each digit (0-9). Notice how similar digits (like 1 and 7, or 3 and 8) might appear closer to each other than to completely distinct digits like 0. This is the global structure preservation in action.
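The ability to project new points (the "New Data" row in the comparison table) falls straight out of this fitted model. Here is a minimal sketch, reusing `data`, `labels`, and the `umap` import from the block above together with an ordinary train/test split:

```python
from sklearn.model_selection import train_test_split

# Fit on most of the data, then project points the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=42
)

reducer_split = umap.UMAP(random_state=42).fit(X_train)
test_embedding = reducer_split.transform(X_test)   # no refitting needed

print(test_embedding.shape)   # (450, 2): unseen digits mapped into the same 2D space
```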
What are the key hyperparameters for UMAP?
The two knobs you must know how to turn are n_neighbors and min_dist. Adjusting these changes the balance between local detail and big-picture structure.
1. n_neighbors
This controls the size of the local neighborhood UMAP looks at when learning the manifold structure.
- Low values (e.g., 5-15): Focuses on very local structure. You get sharper, fragmented clusters. Good for finding small sub-groups.
- High values (e.g., 50-200): Focuses on global structure. Clusters may merge, but you get a better view of the overall data shape.
2. min_dist
This controls how tightly UMAP is allowed to pack points together in the low-dimensional output.
- Low values (e.g., 0.1): Points clump tightly. Good for clustering tasks.
- High values (e.g., 0.9): Points are spread out. Good if you want to avoid overlapping points for visualization.
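A quick way to build intuition for both knobs is to sweep them on the same data and compare the plots side by side. The parameter values below are just illustrative choices on the digits data:

```python
import itertools
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
data, labels = digits.data, digits.target

# One projection per (n_neighbors, min_dist) combination
fig, axes = plt.subplots(3, 2, figsize=(10, 13))
for ax, (n_nb, m_dist) in zip(axes.ravel(),
                              itertools.product([5, 30, 100], [0.1, 0.9])):
    emb = umap.UMAP(n_neighbors=n_nb, min_dist=m_dist,
                    random_state=42).fit_transform(data)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='Spectral', s=3)
    ax.set_title(f'n_neighbors={n_nb}, min_dist={m_dist}')
plt.tight_layout()
plt.show()
```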
Can we use UMAP for supervised learning?
Yes! One of UMAP's superpowers is that it can accept labels (y) during training. This is called Supervised UMAP.
Standard unsupervised dimensionality reduction doesn't know your class labels—it only looks at pixel similarity. Sometimes, data points from different classes look similar (e.g., a messy '4' looking like a '9'). Unsupervised UMAP might mix them. Supervised UMAP uses the labels to "pull" same-class points together and "push" different-class points apart.
# Supervised UMAP Example
supervised_reducer = umap.UMAP(n_neighbors=15, random_state=42)
# Notice we pass 'y=labels' here
embedding_supervised = supervised_reducer.fit_transform(data, y=labels)
plt.figure(figsize=(10, 8))
plt.scatter(
embedding_supervised[:, 0],
embedding_supervised[:, 1],
c=labels,
cmap='Spectral',
s=5
)
plt.title('Supervised UMAP Projection', fontsize=16)
plt.show()
Expected Result: You will see incredibly clean, widely separated clusters. This transformation is extremely useful if you want to feed the reduced data into a simple classifier (like Logistic Regression) that struggles with complex high-dimensional boundaries.
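One possible downstream workflow, sketched here with a plain train/test split and default hyperparameters rather than any officially recommended recipe (and reusing `data`, `labels`, and the `umap` import from earlier), is to fit supervised UMAP on the training data only and then classify in the reduced space:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=42
)

# Fit supervised UMAP on the training split only; test labels never touch the reducer
sup_reducer = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train, y=y_train)
Z_train = sup_reducer.transform(X_train)
Z_test = sup_reducer.transform(X_test)

# A simple linear classifier on just 2 dimensions
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print(f"Test accuracy in 2D: {clf.score(Z_test, y_test):.3f}")
```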
Conclusion
UMAP has rapidly replaced t-SNE as the default choice for manifold learning and visualization. It offers a mathematically rigorous foundation that balances the preservation of local density with global structure, all while running efficiently on large datasets.
By understanding the mechanics of local metric scaling (the $\rho$ and $\sigma$ parameters) and the Cross-Entropy cost function, you can tune UMAP to reveal hidden patterns in your data that linear methods like PCA miss and local methods like t-SNE obscure.
Whether you are exploring genomic data, analyzing customer segments, or preprocessing features for a neural network, UMAP is an essential tool in your modern data science arsenal.
Where to go next?
- If you need to cluster the beautiful blobs UMAP created, check out DBSCAN: Finding Clusters of Any Shape.
- If you want to understand the linear baseline for dimensionality reduction, read PCA: Reducing Dimensions While Keeping What Matters.
- To see where UMAP fits in the broader hierarchy of algorithms, explore Hierarchical Clustering.
Hands-On Practice
Visualizing high-dimensional data often requires balancing global structure with local detail. In this tutorial, we will explore the fundamental concepts behind UMAP (Uniform Manifold Approximation and Projection) by manually dismantling the mathematical machinery it uses—specifically, how it handles varying data density using nearest neighbor graphs. Using the Wine Analysis dataset, we will compare PCA and t-SNE to understand the trade-offs UMAP solves, and then write code to inspect the 'local connectivity' that makes UMAP unique.
Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. With PCA, 2 components capture 45% of the variance, 5 capture 64%, and 10 capture 83%. The noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.
Try It Yourself
High Dimensional: 180 wine samples with 27 features (13 original chemical measurements plus derived and noise columns)
By dissecting the nearest neighbor graph, we've seen how UMAP adapts to data density—something standard distance metrics miss. Try experimenting with the k parameter (n_neighbors) in the code; increasing it forces the algorithm to consider a broader 'local' view, often preserving more global structure at the cost of local detail. In a production environment with the full umap-learn library, this n_neighbors parameter is your primary knob for balancing the focus between the forest (global) and the trees (local).
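The interactive Wine Analysis dataset isn't bundled with this article, so the sketch below uses scikit-learn's built-in load_wine (178 samples, 13 chemical features) as a stand-in to show the kind of nearest-neighbor inspection described above; swap in your own feature matrix to reproduce it on the full 27-feature version.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Stand-in for the tutorial's Wine Analysis dataset
X = StandardScaler().fit_transform(load_wine().data)

k = 15   # the n_neighbors knob: try 5, 15, 50 and watch the spread change
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                  # drop the zero self-distance

rho = dists[:, 0]                     # distance to each point's nearest neighbor
k_dist = dists[:, -1]                 # distance to the k-th neighbor (a local density proxy)

print(f"nearest-neighbor distance: min={rho.min():.2f}, max={rho.max():.2f}")
print(f"{k}-th neighbor distance:  min={k_dist.min():.2f}, max={k_dist.max():.2f}")
# A wide spread in these ranges signals strongly varying density, which is
# exactly the variation UMAP's per-point rho/sigma scaling normalizes away.
```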