If you have ever tried to visualize a dataset with 100,000 rows using t-SNE, you probably had time to brew coffee, drink it, and perhaps write a novel while waiting for the algorithm to finish.
For years, t-SNE was the undisputed king of visualizing high-dimensional data. It produced beautiful clusters that made complex data look simple. But it came with heavy baggage: it was agonizingly slow, it struggled with massive datasets, and perhaps most critically, it often destroyed the global structure of the data. You could see clusters, but you couldn't trust the distance between those clusters.
Enter UMAP (Uniform Manifold Approximation and Projection).
Published in 2018 by Leland McInnes, John Healy, and James Melville, UMAP promised to be the "t-SNE killer." It runs faster, scales better, and preserves the global relationships that t-SNE often discards. But how does it actually work? Is it just a faster t-SNE, or is the math fundamentally different?
In this guide, we will dismantle UMAP to understand the topological machinery under the hood, compare it directly with t-SNE, and show you how to implement it for production-ready data science.
What is UMAP and why do we need it?
UMAP is a non-linear dimensionality reduction algorithm that constructs a high-dimensional graph representation of data and optimizes a low-dimensional layout to be as structurally similar to that graph as possible. Unlike t-SNE, UMAP is grounded in algebraic topology, allowing it to preserve both local cluster details and broad global structure.
Most dimensionality reduction techniques fall into two camps:
- Matrix Factorization (e.g., PCA): Fast and preserves global structure, but fails to capture complex, non-linear patterns.
- Neighbor Graphs (e.g., t-SNE): Excellent at finding local clusters (non-linear), but slow and loses the "big picture."
UMAP bridges this gap. It assumes your data is uniformly distributed on a "locally connected Riemannian manifold." That sounds terrifyingly complex, but the intuition is simple: UMAP learns the "shape" of your data surface in high dimensions and flattens it out onto a 2D or 3D map, much like peeling an orange to make a flat map of the world—but with less distortion than t-SNE.
Before diving deep, if you are new to the concept of manifold learning or t-SNE, you might want to read our guide on Visualizing the Invisible: How t-SNE Unlocks High-Dimensional Data.
How does UMAP handle varying data density?
UMAP solves the problem of varying density by creating a custom "distance ruler" for every data point. In regions where data is dense, the unit of distance is short; in sparse regions, the unit of distance stretches out. This normalization ensures that every point is connected to its nearest neighbors with roughly equal weight, regardless of how isolated it is.
To understand this, we need to look at how UMAP measures distance. In a standard Euclidean world, points in a dense cluster sit very close together, while outliers sit far from everything else. If you just used raw distance, the outliers would look completely disconnected.
UMAP introduces two key mathematical quantities for each point $x_i$:
- $\rho_i$ (rho): the distance from $x_i$ to its nearest neighbor.
- $\sigma_i$ (sigma): a scaling factor (normalization) that adapts to the local density around $x_i$.

The similarity (or connection strength) from point $x_i$ to point $x_j$ is calculated as:

$$p_{j|i} = \exp\left(-\frac{\max\big(0,\; d(x_i, x_j) - \rho_i\big)}{\sigma_i}\right)$$
In Plain English: This formula says "The connection strength depends on how far apart points are, minus the distance to the closest neighbor." By subtracting $\rho_i$, UMAP ensures that every point is connected to at least one neighbor with a weight of 1.0. It basically says, "I don't care how lonely you are; your closest neighbor is considered 'close' by definition."
This is a brilliant trick. It makes the algorithm "blind" to absolute density. Whether a cluster is tight or spread out, UMAP sees them as topologically similar.
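To make the scaling trick concrete, here is a minimal sketch (not the library's internals) that computes $\rho_i$ and a UMAP-style connection weight from k-nearest-neighbor distances using scikit-learn. The mean-distance stand-in for $\sigma_i$ is a simplification assumed for illustration; the real algorithm tunes each $\sigma_i$ by binary search so that each point's total neighbor weight is roughly $\log_2(k)$.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import NearestNeighbors

X = load_digits().data
k = 15

# Distances to each point's k nearest neighbors (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                       # drop the zero self-distance

rho = dists[:, 0]                          # rho_i: distance to the nearest neighbor
sigma = dists.mean(axis=1)                 # crude stand-in for sigma_i (UMAP uses binary search)

# Directed weight p_{j|i} = exp(-(d(x_i, x_j) - rho_i) / sigma_i)
weights = np.exp(-(dists - rho[:, None]) / sigma[:, None])

print(weights[:, 0].min())                 # 1.0: every point's closest neighbor is "close" by definition
print(weights[:, -1].mean())               # farther neighbors decay toward 0
```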
How does UMAP build the high-dimensional graph?
UMAP constructs a "fuzzy simplicial complex" by calculating nearest neighbors for every point and combining their connection probabilities. Unlike standard graphs where edges are binary (connected or not), UMAP edges represent the probability that a topological connection exists.
Once UMAP calculates the directional weights $p_{j|i}$ (from $x_i$ to $x_j$) and $p_{i|j}$ (from $x_j$ to $x_i$), it must combine them into a single symmetric bond. Since the "local metric" at point $x_i$ is different from the one at point $x_j$, these weights are rarely equal.
UMAP combines them using a probabilistic sum (also known as a fuzzy union):

$$p_{ij} = p_{j|i} + p_{i|j} - p_{j|i} \, p_{i|j}$$
In Plain English: This formula is the probability of "A or B" happening: $P(A \cup B) = P(A) + P(B) - P(A)P(B)$. It essentially says: "If point A thinks point B is a neighbor, OR if point B thinks point A is a neighbor, then they are connected." This symmetrization ensures that if an outlier connects to a cluster, the cluster "accepts" the connection, keeping the global structure intact.
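As a quick sketch of the symmetrization step, assume `P` is a small dense matrix where `P[i, j]` holds the directed weight $p_{j|i}$ (zero where $x_j$ is not among $x_i$'s neighbors); the real library performs the same operation on a sparse k-NN graph.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy directed weights: most entries zero, mimicking a sparse k-NN graph
P = rng.random((5, 5)) * (rng.random((5, 5)) < 0.4)
np.fill_diagonal(P, 0.0)

# Fuzzy union (probabilistic OR): p_ij = p_{j|i} + p_{i|j} - p_{j|i} * p_{i|j}
P_sym = P + P.T - P * P.T

print(np.allclose(P_sym, P_sym.T))   # True: the graph is now symmetric
print(P_sym.max() <= 1.0)            # True: weights remain valid probabilities
```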
How does UMAP optimize the low-dimensional layout?
UMAP projects data by minimizing the Cross-Entropy between the high-dimensional graph (the weights $p_{ij}$) and the low-dimensional graph (the weights $q_{ij}$). While t-SNE uses KL Divergence to preserve local neighbors, Cross-Entropy forces UMAP to preserve both neighbors (attraction) and gaps between clusters (repulsion).
This is the mathematical secret sauce that gives UMAP its speed and global structure preservation.
The cost function UMAP minimizes is:

$$CE(P, Q) = \sum_{i \neq j} \left[\, p_{ij} \log\frac{p_{ij}}{q_{ij}} + (1 - p_{ij}) \log\frac{1 - p_{ij}}{1 - q_{ij}} \,\right]$$

Let's break this monster into two fighting forces:

- Attraction (Left Term): $p_{ij} \log\frac{p_{ij}}{q_{ij}}$
  - This looks like t-SNE's KL Divergence.
  - It is active when $p_{ij}$ is high (points are neighbors).
  - It pulls those points together in the low-dimensional map.
- Repulsion (Right Term): $(1 - p_{ij}) \log\frac{1 - p_{ij}}{1 - q_{ij}}$
  - It is active when $p_{ij}$ is small (points are NOT neighbors).
  - It penalizes the model if non-neighbors are placed too close together in the output.
  - It pushes disparate clusters apart.
In Plain English: Think of t-SNE as a party planner who focuses entirely on ensuring friends sit at the same table. It doesn't care if two rival tables are placed right next to each other.
UMAP is a wedding planner who ensures friends sit together (Attraction) AND that rival families are seated at opposite ends of the hall (Repulsion). Because UMAP explicitly calculates this repulsion cost, it preserves the empty space (global structure) between clusters much better than t-SNE.
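The sketch below evaluates the two terms separately, under the simplifying assumption that the graphs are tiny dense matrices (the library never computes the full sum; it minimizes it with stochastic gradient descent over sampled edges), so you can see attraction and repulsion as distinct numbers.

```python
import numpy as np

def umap_cross_entropy(P, Q, eps=1e-12):
    """Return the attraction and repulsion terms of the UMAP cross-entropy."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    attraction = np.sum(P * np.log(P / Q))                   # punished when neighbors land far apart
    repulsion = np.sum((1 - P) * np.log((1 - P) / (1 - Q)))  # punished when non-neighbors land close together
    return attraction, repulsion

# Scenario 1: strong neighbors in high-D placed far apart in low-D
a1, r1 = umap_cross_entropy(np.array([[0.9]]), np.array([[0.1]]))
# Scenario 2: non-neighbors in high-D squashed together in low-D
a2, r2 = umap_cross_entropy(np.array([[0.1]]), np.array([[0.9]]))

print(f"neighbors far apart:   attraction={a1:.2f}, repulsion={r1:.2f}")
print(f"non-neighbors crowded: attraction={a2:.2f}, repulsion={r2:.2f}")
# Each per-edge total is >= 0 even though one term can be slightly negative.
```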
UMAP vs. t-SNE: Which one should you use?
The short answer: For most modern applications, start with UMAP. It is generally superior in speed and utility, though t-SNE still creates slightly sharper visualizations for purely local clustering.
| Feature | t-SNE | UMAP |
|---|---|---|
| Speed | Slow (especially on large datasets) | Fast (scales to millions) |
| Global Structure | Poor (clusters arbitrarily placed) | Good (preserves relationships) |
| Interpretation | Cluster distances are meaningless | Cluster distances are meaningful |
| New Data | Cannot transform new points | Can transform new points |
| Initialization | Random (usually) | Spectral (deterministic) |
| Density | Preserves density variations | Normalizes density |
⚠️ Common Pitfall: Just because UMAP is better at global structure doesn't mean the distances are perfectly accurate like in PCA. You still cannot treat a UMAP plot like a ruler-perfect map, but it is far more reliable than t-SNE.
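If you want to verify the speed claim on your own machine, a rough side-by-side timing on the digits data looks like the sketch below; exact numbers depend heavily on hardware and library versions, and UMAP's first call also pays a one-time numba JIT compilation cost.

```python
import time
import umap
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data   # 1797 samples, 64 features

start = time.perf_counter()
TSNE(n_components=2, random_state=42).fit_transform(X)
print(f"t-SNE took {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
umap.UMAP(random_state=42).fit_transform(X)   # first call includes numba compilation overhead
print(f"UMAP  took {time.perf_counter() - start:.1f}s")
```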
Practical Implementation: UMAP in Python
Let's see UMAP in action. We will use the umap-learn library. If you don't have it, install it via pip:
pip install umap-learn
We will run UMAP on the famous digits dataset (handwritten numbers).
import umap
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
# 1. Load Data
digits = load_digits()
data = digits.data
labels = digits.target
print(f"Data Shape: {data.shape}")
# Expected Output: (1797, 64) - 1797 images, 64 pixels each
# 2. Initialize and Fit UMAP
# random_state ensures reproducibility
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(data)
print(f"Embedding Shape: {embedding.shape}")
# Expected Output: (1797, 2) - Reduced to 2 dimensions
# 3. Visualize
plt.figure(figsize=(10, 8))
plt.scatter(
embedding[:, 0],
embedding[:, 1],
c=labels,
cmap='Spectral',
s=5
)
plt.colorbar(label='Digit Label', boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.title('UMAP Projection of the Digits Dataset', fontsize=16)
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.show()
When you run this, you will see distinct clusters for each digit (0-9). Notice how similar digits (like 1 and 7, or 3 and 8) might appear closer to each other than to completely distinct digits like 0. This is the global structure preservation in action.
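The ability to project new points (the "New Data" row in the comparison table) falls straight out of this fitted model. Here is a minimal sketch, reusing `data`, `labels`, and the `umap` import from the block above together with an ordinary train/test split:

```python
from sklearn.model_selection import train_test_split

# Fit on most of the data, then project points the model has never seen
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=42
)

reducer_split = umap.UMAP(random_state=42).fit(X_train)
test_embedding = reducer_split.transform(X_test)   # no refitting needed

print(test_embedding.shape)   # (450, 2): unseen digits mapped into the same 2D space
```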
What are the key hyperparameters for UMAP?
The two knobs you must know how to turn are n_neighbors and min_dist. Adjusting these changes the balance between local detail and big-picture structure.
1. n_neighbors
This controls the size of the local neighborhood UMAP looks at when learning the manifold structure.
- Low values (e.g., 5-15): Focuses on very local structure. You get sharper, fragmented clusters. Good for finding small sub-groups.
- High values (e.g., 50-200): Focuses on global structure. Clusters may merge, but you get a better view of the overall data shape.
2. min_dist
This controls how tightly UMAP is allowed to pack points together in the low-dimensional output.
- Low values (e.g., 0.1): Points clump tightly. Good for clustering tasks.
- High values (e.g., 0.9): Points are spread out. Good if you want to avoid overlapping points for visualization.
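A quick way to build intuition for both knobs is to sweep them on the same data and compare the plots side by side. The parameter values below are just illustrative choices on the digits data:

```python
import itertools
import umap
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

digits = load_digits()
data, labels = digits.data, digits.target

# One projection per (n_neighbors, min_dist) combination
fig, axes = plt.subplots(3, 2, figsize=(10, 13))
for ax, (n_nb, m_dist) in zip(axes.ravel(),
                              itertools.product([5, 30, 100], [0.1, 0.9])):
    emb = umap.UMAP(n_neighbors=n_nb, min_dist=m_dist,
                    random_state=42).fit_transform(data)
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap='Spectral', s=3)
    ax.set_title(f'n_neighbors={n_nb}, min_dist={m_dist}')
plt.tight_layout()
plt.show()
```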
Can we use UMAP for supervised learning?
Yes! One of UMAP's superpowers is that it can accept labels (y) during training. This is called Supervised UMAP.
Standard unsupervised dimensionality reduction doesn't know your class labels—it only looks at pixel similarity. Sometimes, data points from different classes look similar (e.g., a messy '4' looking like a '9'). Unsupervised UMAP might mix them. Supervised UMAP uses the labels to "pull" same-class points together and "push" different-class points apart.
# Supervised UMAP Example
supervised_reducer = umap.UMAP(n_neighbors=15, random_state=42)
# Notice we pass 'y=labels' here
embedding_supervised = supervised_reducer.fit_transform(data, y=labels)
plt.figure(figsize=(10, 8))
plt.scatter(
embedding_supervised[:, 0],
embedding_supervised[:, 1],
c=labels,
cmap='Spectral',
s=5
)
plt.title('Supervised UMAP Projection', fontsize=16)
plt.show()
Expected Result: You will see incredibly clean, widely separated clusters. This transformation is extremely useful if you want to feed the reduced data into a simple classifier (like Logistic Regression) that struggles with complex high-dimensional boundaries.
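One possible downstream workflow, sketched here with a plain train/test split and default hyperparameters rather than any officially recommended recipe (and reusing `data`, `labels`, and the `umap` import from earlier), is to fit supervised UMAP on the training data only and then classify in the reduced space:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=0.25, random_state=42
)

# Fit supervised UMAP on the training split only; test labels never touch the reducer
sup_reducer = umap.UMAP(n_neighbors=15, random_state=42).fit(X_train, y=y_train)
Z_train = sup_reducer.transform(X_train)
Z_test = sup_reducer.transform(X_test)

# A simple linear classifier on just 2 dimensions
clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print(f"Test accuracy in 2D: {clf.score(Z_test, y_test):.3f}")
```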
Conclusion
UMAP has rapidly replaced t-SNE as the default choice for manifold learning and visualization. It offers a mathematically rigorous foundation that balances the preservation of local density with global structure, all while running efficiently on large datasets.
By understanding the mechanics of local metric scaling (the $\rho$ and $\sigma$ parameters) and the Cross-Entropy cost function, you can tune UMAP to reveal hidden patterns in your data that linear methods like PCA miss and local methods like t-SNE obscure.
Whether you are exploring genomic data, analyzing customer segments, or preprocessing features for a neural network, UMAP is an essential tool in your modern data science arsenal.
Where to go next?
- If you need to cluster the beautiful blobs UMAP created, check out DBSCAN: Finding Clusters of Any Shape.
- If you want to understand the linear baseline for dimensionality reduction, read PCA: Reducing Dimensions While Keeping What Matters.
- To see where UMAP fits in the broader hierarchy of algorithms, explore Hierarchical Clustering.
Hands-On Practice
Visualizing high-dimensional data often requires balancing global structure with local detail. In this tutorial, we will explore the fundamental concepts behind UMAP (Uniform Manifold Approximation and Projection) by manually dismantling the mathematical machinery it uses—specifically, how it handles varying data density using nearest neighbor graphs. Using the Wine Analysis dataset, we will compare PCA and t-SNE to understand the trade-offs UMAP solves, and then write code to inspect the 'local connectivity' that makes UMAP unique.
Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. With PCA, 2 components capture 45% of the variance, 5 capture 64%, and 10 capture 83%. The noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.
Try It Yourself
High Dimensional: 180 wine samples with 27 features (13 original chemical measurements plus derived and noise columns)
By dissecting the nearest neighbor graph, we've seen how UMAP adapts to data density—something standard distance metrics miss. Try experimenting with the k parameter (n_neighbors) in the code; increasing it forces the algorithm to consider a broader 'local' view, often preserving more global structure at the cost of local detail. In a production environment with the full umap-learn library, this n_neighbors parameter is your primary knob for balancing the focus between the forest (global) and the trees (local).
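The interactive Wine Analysis dataset isn't bundled with this article, so the sketch below uses scikit-learn's built-in load_wine (178 samples, 13 chemical features) as a stand-in to show the kind of nearest-neighbor inspection described above; swap in your own feature matrix to reproduce it on the full 27-feature version.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Stand-in for the tutorial's Wine Analysis dataset
X = StandardScaler().fit_transform(load_wine().data)

k = 15   # the n_neighbors knob: try 5, 15, 50 and watch the spread change
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
dists = dists[:, 1:]                  # drop the zero self-distance

rho = dists[:, 0]                     # distance to each point's nearest neighbor
k_dist = dists[:, -1]                 # distance to the k-th neighbor (a local density proxy)

print(f"nearest-neighbor distance: min={rho.min():.2f}, max={rho.max():.2f}")
print(f"{k}-th neighbor distance:  min={k_dist.min():.2f}, max={k_dist.max():.2f}")
# A wide spread in these ranges signals strongly varying density, which is
# exactly the variation UMAP's per-point rho/sigma scaling normalizes away.
```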