Imagine trying to draw a map of the world, but instead of three dimensions (latitude, longitude, altitude), the world has 784 dimensions. This is the reality of working with data. Whether you are analyzing genetic sequences, processing images, or clustering customer behaviors, you often grapple with datasets so complex that human intuition fails.
We cannot visualize 50 dimensions, let alone 784. This is where dimensionality reduction comes in, and few algorithms are as artistic or popular as t-SNE (t-Distributed Stochastic Neighbor Embedding). Unlike traditional methods that flatten data like a pancake, t-SNE acts like a skilled translator, converting complex high-dimensional relationships into a stunning 2D or 3D map that reveals the hidden structure of your data.
This guide moves beyond the basics. We will deconstruct the mathematics of t-SNE, implement it in Python, and crucially, learn how to interpret the results without falling into common traps.
What is t-SNE?
t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique designed to visualize high-dimensional data in two or three dimensions. Unlike linear methods such as PCA, t-SNE prioritizes preserving local structure, ensuring that points similar in high-dimensional space remain close in the low-dimensional visualization.
How does t-SNE differ from PCA?
Principal Component Analysis (PCA) focuses on preserving large pairwise distances to maximize variance, often crushing detailed local structures. t-SNE focuses on preserving local neighborhoods, making t-SNE superior for separating distinct clusters, while PCA is better for understanding global geometry and feature importance.
💡 Pro Tip: Use PCA first to reduce dimensions (e.g., to 50) before running t-SNE. This reduces noise and speeds up computation significantly.
How does t-SNE actually work?
To understand t-SNE, we need to abandon the idea of rigid rulers and fixed distances. Instead, t-SNE thinks in terms of probability and social circles.
The algorithm follows three major steps:
- High-Dimensional Space: It calculates the probability that two points are "neighbors" in the original high-dimensional space. Points that are close have a high probability of being picked as neighbors; distant points have a near-zero probability.
- Low-Dimensional Space: It creates a random starting map in 2D. It calculates similar neighbor probabilities for these points, but uses a slightly different probability distribution (the Student's t-distribution).
- Optimization: It minimizes the difference between these two probability distributions using Gradient Descent. It physically pushes and pulls the 2D points until their "social circle" matches the original high-dimensional data.
Step 1: Similarity in High Dimensions (The Gaussian)
In the original space, t-SNE measures similarity using a Gaussian (Normal) distribution. For every point $x_i$, we define the probability $p_{j|i}$ that point $x_i$ would pick point $x_j$ as its neighbor:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}$$
In Plain English: This formula acts like a spotlight centered on point $x_i$. Points standing close to $x_i$ (small distance $\lVert x_i - x_j \rVert$) are brightly lit (high probability). Points far away fade into the darkness (probability approaches zero). The term $\sigma_i$ (sigma) adjusts the width of the spotlight based on the density of the data around $x_i$.
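To make the formula concrete, here is a minimal NumPy sketch (not scikit-learn's internal implementation) that computes the conditional probabilities for a tiny toy dataset. For simplicity it uses one fixed sigma for every point; real t-SNE tunes each $\sigma_i$ by binary search so that the resulting distribution matches the chosen perplexity.

```python
import numpy as np

# Toy data: five points in 3 dimensions
X = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0],   # very close to point 0
    [0.0, 0.2, 0.0],   # close to point 0
    [5.0, 5.0, 5.0],   # far away
    [5.1, 5.0, 5.0],   # close to point 3
])

def conditional_probabilities(X, sigma=1.0):
    """p_{j|i}: the probability that point i would pick point j as its neighbor."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    affinities = np.exp(-sq_dists / (2.0 * sigma ** 2))  # Gaussian "spotlight"
    np.fill_diagonal(affinities, 0.0)                    # a point never picks itself
    return affinities / affinities.sum(axis=1, keepdims=True)

P = conditional_probabilities(X, sigma=1.0)
print(np.round(P, 3))
# Row 0 puts almost all of its probability mass on points 1 and 2,
# and essentially zero on the far-away points 3 and 4.
```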
Step 2: The Crowding Problem and the t-Distribution
Why can't we just use a Gaussian distribution in the low-dimensional (2D) map as well? This leads to the Crowding Problem.
High-dimensional space is incredibly vast. In 100 dimensions there is room for many points to sit far apart from one another (up to 101 points can be mutually equidistant, versus only 3 in 2D). In 2D space, you simply don't have enough room to accommodate all these distant neighbors. If you try to preserve distances exactly, all the points get crushed together in the center of the map.
t-SNE solves this by using a Student's t-distribution (specifically with one degree of freedom, also known as a Cauchy distribution) for the low-dimensional map. For map points $y_i$ and $y_j$, the low-dimensional similarity is:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$
In Plain English: The t-distribution is like a Gaussian but with "heavier tails." It is more forgiving. In 2D, this distribution says, "It's okay if distant points are placed REALLY far apart." This prevents data points from piling up on top of each other, creating the beautiful white space (gaps) you see between clusters in t-SNE plots.
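A quick numeric sketch of the "heavier tails" idea: it compares how fast the Gaussian kernel decays with distance against the Student-t kernel $(1 + d^2)^{-1}$ used in the low-dimensional map. The specific distances are chosen only for illustration.

```python
import numpy as np

distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])

gaussian_kernel = np.exp(-distances ** 2 / 2.0)   # shape of the high-D similarity
student_t_kernel = 1.0 / (1.0 + distances ** 2)   # shape of the low-D similarity

for d, g, t in zip(distances, gaussian_kernel, student_t_kernel):
    print(f"distance={d:4.1f}   gaussian={g:.5f}   student-t={t:.5f}")

# At distance 4, the Gaussian has decayed to ~0.0003 while the t-kernel is
# still ~0.06: moderately distant points keep some influence instead of
# vanishing, which is what relieves the crowding problem.
```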
Step 3: Comparing the Maps (KL Divergence)
Now we have $P$ (the high-D probabilities) and $Q$ (the low-D probabilities). We want $Q$ to look as much like $P$ as possible. We measure the difference using Kullback-Leibler (KL) Divergence:

$$KL(P \parallel Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
In Plain English: Think of this as the "Penalty Score." If two points are close in high-D ($p_{ij}$ is high) but far in low-D ($q_{ij}$ is low), the ratio $p_{ij}/q_{ij}$ becomes huge, creating a massive penalty. The algorithm frantically moves those points closer to reduce the score. If points are far in high-D but close in low-D, the penalty is smaller. This is why t-SNE is amazing at preserving local clusters but sometimes messes up global distances.
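To see that asymmetry in numbers, here is a tiny sketch that evaluates the per-pair term $p \log(p/q)$ of the KL sum for two illustrative cases; the probability values are made up for the example.

```python
import numpy as np

def kl_term(p, q):
    """Contribution of a single pair of points to KL(P || Q)."""
    return p * np.log(p / q)

# Case 1: close in high-D (p high) but placed far apart in 2D (q low) -> big penalty
print(f"{kl_term(p=0.20, q=0.001):.4f}")   # about +1.06

# Case 2: far in high-D (p low) but placed close together in 2D (q high) -> tiny term
print(f"{kl_term(p=0.001, q=0.20):.4f}")   # about -0.005, negligible in magnitude

# Individual terms can be slightly negative, but the full sum over all pairs is
# always >= 0; the optimizer mostly "feels" the large Case-1 penalties.
```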
What is Perplexity?
Perplexity is the most critical hyperparameter in t-SNE. It controls the effective number of nearest neighbors t-SNE considers when defining the local structure. Roughly speaking, it balances the attention the algorithm pays to local variations versus global structure.
Technically, perplexity is related to the Shannon entropy of the conditional probability distribution. But intuitively:
- Low Perplexity (roughly 5-15): The algorithm focuses on very local relationships. It may break meaningful clusters into small, isolated clumps.
- High Perplexity (roughly 50 and above): The algorithm takes a wider view of the data. It preserves more global structure but might merge distinct small clusters.
⚠️ Common Pitfall: There is no single "correct" perplexity. A value of 30 is a safe default, but you should always test a range (e.g., 5, 30, 50, 100). If the data looks like "soup" (no structure), your perplexity might be too high or too low.
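One way to run that sweep, sketched below on the digits dataset introduced in the next section; each fit can take a little while, and the exact layouts will vary between runs and library versions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

perplexities = [5, 30, 50, 100]
fig, axes = plt.subplots(1, len(perplexities), figsize=(18, 4))

for ax, perp in zip(axes, perplexities):
    embedding = TSNE(n_components=2, perplexity=perp,
                     random_state=42).fit_transform(X)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="tab10", s=5)
    ax.set_title(f"perplexity = {perp}")
    ax.set_xticks([]); ax.set_yticks([])

plt.tight_layout()
plt.show()
```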
Implementing t-SNE in Python
Let's apply t-SNE to a real-world scenario. We will use the digits dataset (handwritten numbers), which is a classic benchmark for dimensionality reduction.
We will use sklearn.manifold.TSNE.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# 1. Load Data (Handwritten digits)
# shape: (1797 samples, 64 features/pixels)
digits = load_digits()
X, y = digits.data, digits.target
# 2. Pre-process with PCA (Recommended for high-dimensional data)
# This reduces noise and speeds up t-SNE
pca = PCA(n_components=30)
X_pca = pca.fit_transform(X)
# 3. Initialize and fit t-SNE
# random_state ensures reproducibility
# (n_iter has been renamed to max_iter in recent scikit-learn releases)
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000, random_state=42)
X_tsne = tsne.fit_transform(X_pca)
# 4. Visualization
plt.figure(figsize=(10, 8))
sns.scatterplot(
x=X_tsne[:, 0],
y=X_tsne[:, 1],
hue=y,
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.7
)
plt.title('t-SNE Visualization of Handwritten Digits', fontsize=16)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()
Expected Output: You will see a scatter plot where distinct clusters of colors emerge. The "0"s will form a tight group, the "1"s another. Unlike PCA, which might show overlapping blobs, t-SNE usually separates these digits into distinct islands, demonstrating how well it disentangles the 64-dimensional pixel data.
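To see that contrast for yourself, here is a short follow-on sketch (continuing from the script above, so X, y, plt, sns, and PCA are already in scope) that plots the first two principal components of the same data; the digit classes typically overlap far more than in the t-SNE map.

```python
# Plain 2-component PCA projection of the same 64-dimensional digits
X_pca2 = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(10, 8))
sns.scatterplot(
    x=X_pca2[:, 0],
    y=X_pca2[:, 1],
    hue=y,
    palette=sns.color_palette("hls", 10),
    legend="full",
    alpha=0.7
)
plt.title('PCA (2 Components) of Handwritten Digits', fontsize=16)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
```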
How do we interpret t-SNE plots correctly?
Interpreting t-SNE requires caution. It is a stochastic (randomized) algorithm, and the visualization can mislead you if you treat it like a literal map.
1. Cluster Size Does Not Equal Variance
In a t-SNE plot, you might see one cluster that is tiny and dense, and another that is huge and sparse. Reality: The dense cluster might not actually be denser in the original data. t-SNE expands dense clusters and contracts sparse ones to equalize densities visually. Do not infer density from the size of the blobs.
2. Distances Between Clusters May Be Meaningless
You might see the "Dog" cluster very close to the "Cat" cluster, but the "Truck" cluster far away. Reality: While t-SNE preserves local structure well (neighbors stay neighbors), it struggles with global structure (distances between far-away groups). The distance between the islands on your map is often arbitrary.
3. Random Noise Can Look Like Clusters
If you run t-SNE on pure random noise with low perplexity, it will often "hallucinate" patterns, creating clumps that look like clusters. Reality: Always verify your findings. If you see clusters in t-SNE, check them with a clustering algorithm like K-Means or DBSCAN on the original data.
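One hedged way to run that sanity check on the digits example: cluster the original 64-dimensional features with K-Means and measure agreement with the true labels. A clearly positive Adjusted Rand Index suggests the islands in the plot reflect real structure rather than noise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score

X, y = load_digits(return_X_y=True)

# Cluster the ORIGINAL high-dimensional data, not the 2D t-SNE coordinates
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Adjusted Rand Index: ~0 for random groupings, 1.0 for a perfect match
ari = adjusted_rand_score(y, cluster_labels)
print(f"ARI between K-Means clusters and true digit labels: {ari:.2f}")
```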
When should you use t-SNE vs UMAP?
While t-SNE sparked the revolution in manifold learning, a newer algorithm called UMAP (Uniform Manifold Approximation and Projection) has gained massive popularity.
| Feature | t-SNE | UMAP |
|---|---|---|
| Speed | Slow on large datasets (O(n²) exact; roughly O(n log n) with Barnes-Hut) | Very Fast |
| Global Structure | Poor preservation | Good preservation |
| Initialization | Random (usually) | Laplacian Eigenmaps |
| Use Case | Visualization of small/medium data | Visualization & Feature Engineering |
t-SNE is still the gold standard for high-quality, glossy visualizations where you want distinct separation between clusters. However, for larger datasets or when you need to preserve global relationships, UMAP is often the better choice.
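If you want to try UMAP on the same digits data, its interface mirrors scikit-learn's. This is a minimal sketch that assumes the separate umap-learn package is installed (pip install umap-learn); the parameter values are illustrative rather than tuned.

```python
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# n_neighbors plays a role similar to perplexity; min_dist controls how tightly
# points are packed within clusters
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X)

print(X_umap.shape)  # (1797, 2)
```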
To dive deeper: Check out our detailed comparison in UMAP: The Faster, Better Alternative to t-SNE.
Conclusion
t-SNE remains one of the most powerful tools in a data scientist's arsenal for Exploratory Data Analysis (EDA). By cleverly converting high-dimensional Euclidean distances into conditional probabilities, it allows us to see structure in data that was previously invisible.
However, it is a tool that demands respect. Remember that perplexity matters, cluster sizes can be deceiving, and global distances are not always reliable. Use t-SNE to generate hypotheses, inspect labels, and find quality issues in your dataβbut verify those hypotheses with statistical analysis on the raw data.
Now that you have mastered the visualization aspect, you might want to explore how to group these visualizations mathematically. For that, understanding Hierarchical Clustering or Gaussian Mixture Models is the natural next step.
Hands-On Practice
High-dimensional data is notoriously difficult to interpret because our brains can't visualize more than three dimensions. In this tutorial, you'll learn how to use t-SNE (t-Distributed Stochastic Neighbor Embedding) to unlock hidden structures in complex datasets that simpler methods like PCA might miss. We will use a high-dimensional Wine Analysis dataset, applying t-SNE to reveal distinct clusters of wine cultivars based on their chemical properties.
Dataset: Wine Analysis (High-Dimensional). Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. PCA explained variance: 2 components ≈ 45%, 5 components ≈ 64%, 10 components ≈ 83%. Noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.
Try It Yourself
High Dimensional: 180 wine samples with 27 features (13 original chemical measurements plus derived and noise columns)
Experiment with the perplexity parameter (try 2, 50, and 100) to observe how the cluster tightness changes. You can also try changing the init parameter to 'random' to see how sensitive t-SNE is to initialization compared to 'pca'. Finally, observe the effect of the noisy features in this dataset by trying to run t-SNE only on the first 13 'original' columns versus the full noisy set.
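The exact 27-feature wine table used by the interactive exercise isn't reproduced here, so the sketch below uses scikit-learn's built-in wine dataset (178 samples, the 13 original chemical features) as a stand-in for trying the suggested perplexity and init comparisons; swap in the tutorial's full noisy dataset if you have it loaded.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data: 178 wine samples, 13 chemical features, 3 cultivars
X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)   # t-SNE is distance-based, so scale features

settings = [(2, "pca"), (50, "pca"), (50, "random"), (100, "pca")]
fig, axes = plt.subplots(1, len(settings), figsize=(18, 4))

for ax, (perp, init) in zip(axes, settings):
    embedding = TSNE(n_components=2, perplexity=perp, init=init,
                     random_state=42).fit_transform(X)
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap="viridis", s=15)
    ax.set_title(f"perplexity={perp}, init='{init}'")
    ax.set_xticks([]); ax.set_yticks([])

plt.tight_layout()
plt.show()
```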