Are you ready to slash your machine learning training times from hours to seconds—without rewriting your entire codebase? With NVIDIA cuML’s zero code change acceleration, you can harness the power of GPU computing in scikit-learn, UMAP, and HDBSCAN simply by adding one line of code. In this article, you’ll learn exactly how it works, where it shines, and how you can start benefiting right away.

Why This Matters
Machine learning workloads—especially those involving large datasets or complex algorithms—can be painfully slow on CPUs. Waiting hours or even days for a model to finish training isn’t just frustrating; it hinders experimentation and delays progress.
NVIDIA’s latest breakthrough with cuML 25.02 changes all that by bringing:
- Up to 50x speedups for scikit-learn
- Up to 60x for UMAP
- Up to 175x for HDBSCAN

Source: https://developer.nvidia.com/
And the best part? You don’t need to refactor your code. Just load the extension, keep using the familiar scikit-learn API, and enjoy GPU-accelerated results.
Quote from NVIDIA
“Users can reduce training time from hours on CPUs to mere seconds on GPUs with zero code change.”
— Siddharth Sharma, Nick Becker, Brian Tepera, and Dante Gama Dessavre, NVIDIA
How Does Zero Code Change Acceleration Work?

Source: https://developer.nvidia.com/
- The Magic Line: By inserting "%load_ext cuml.accel" at the top of your notebook—or using python -m cuml.accel script.py for standalone scripts—cuML intercepts your scikit-learn calls under the hood.
- GPU Where Possible, CPU Where Needed: If cuML supports a particular scikit-learn estimator (like RandomForestClassifier), computations automatically run on the GPU. If not, execution falls back to the CPU seamlessly.
- Transparent Data Transfers: When data moves between the CPU and GPU, cuML manages the process so you don’t have to worry about memory allocations or conversions. You keep writing ordinary Python code.
Key takeaway: You don’t have to rewrite your code or learn an entirely new API. It’s the scikit-learn you know and love—just faster.
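To make that concrete, here is a minimal sketch of what an unchanged scikit-learn script looks like under the accelerator. The filename train_kmeans.py and the synthetic data are placeholders for illustration; KMeans is used here as an example of a supported estimator:

# train_kmeans.py -- ordinary scikit-learn code, no cuML-specific imports
# Run on the GPU with:  python -m cuml.accel train_kmeans.py
# (In a notebook, run %load_ext cuml.accel in the first cell instead.)
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200_000, n_features=50, centers=8, random_state=0)
kmeans = KMeans(n_clusters=8, random_state=0)  # supported estimator, dispatched to the GPU
kmeans.fit(X)                                  # unsupported calls would silently fall back to the CPU
print(kmeans.cluster_centers_.shape)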
Real-World Speedups
One of the most exciting aspects of cuML is its performance on algorithms that traditionally take a long time on CPUs. Below is a snapshot of how much time we saved in our test:
| Algorithm | Dataset | CPU Time (s) | GPU Time (s) | Speedup |
|---|---|---|---|---|
| Random Forest | Covertype (580k x 54 features) | 87.60 | 8.79 | ~10x |
| HDBSCAN | Synthetic (100k x 100 features) | 1,093.78 | 5.06 | ~216x |
| UMAP | MNIST (50k x 784 features) | 71.31 | 5.39 | ~13x |
This table says it all: faster training, quicker experimentation, and more time to focus on what matters—insights rather than waiting.
Why GPU Acceleration Rocks
- Parallel Processing: GPUs excel at handling thousands of operations simultaneously, making tasks like distance calculations and matrix operations lightning-fast.
- Bigger, More Complex Models: With speed improvements of 50x or more, you can comfortably explore larger datasets, conduct extensive hyperparameter searches, and iterate on more sophisticated models.
- Easy Integration: If scikit-learn is already part of your workflow, you won’t miss a beat. Import your libraries, add the cuML extension, and you’re off to the races.
Quick Tips for a Smooth Experience
Tip:
To see your GPU usage really shine, use larger datasets or highly iterative algorithms. Small datasets or trivial operations might not show significant speedups.
- Stay on the GPU: Preprocess and train on the GPU to minimize data transfers between host (CPU) and device (GPU). Use libraries like cuDF (a GPU-accelerated drop-in for pandas) for data manipulation; see the sketch after this list.
- Watch for Unsupported Features: cuML supports many (but not all) scikit-learn methods. If you call a method that’s not yet implemented, it automatically runs on the CPU. No errors or warnings—just a seamless fallback.
- Be Prepared for Slight Numerical Differences: Parallel operations sometimes yield slightly different floating-point accumulations compared to serial CPU code. Don’t be alarmed; these differences are usually trivial.
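As a rough illustration of the "stay on the GPU" tip, here is a minimal sketch that loads data with cuDF and feeds it straight to an accelerated estimator. The file name features.csv and the column name target are hypothetical, and cuml.accel is assumed to be loaded already:

import cudf
from sklearn.linear_model import LogisticRegression  # intercepted when cuml.accel is active

df = cudf.read_csv("features.csv")   # data is loaded directly into GPU memory
X = df.drop(columns=["target"])
y = df["target"]

clf = LogisticRegression(max_iter=200)
clf.fit(X, y)   # with the accelerator active, cuDF inputs can stay on-device, avoiding extra host-to-GPU copies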
Known Gotchas
Warning:
Once cuML acceleration is loaded, it keeps intercepting scikit-learn. If you want to measure CPU performance in the same notebook session, you typically need to restart your runtime and not load cuml.accel until after you gather CPU benchmarks.
- String or Categorical Data: Ensure data is numeric. Encode categorical features before passing them to scikit-learn with cuML enabled (a quick example follows this list).
- Memory Constraints: Although cuML can use unified memory to help, extremely large datasets may still exceed GPU memory. Monitor resource usage.
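For the categorical-data gotcha above, one quick way to get a numeric-only feature matrix before handing it to an accelerated estimator is one-hot encoding with pandas (the column names here are made up for the example):

import pandas as pd

df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],   # categorical column
    "size": [1.0, 2.5, 0.7, 1.1],               # already numeric
})
X = pd.get_dummies(df, columns=["color"], dtype="float32")  # one-hot encode to a numeric matrix
print(X.dtypes)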
Use Cases That Benefit the Most
- High-Dimensional Data: Techniques like UMAP or t-SNE can become unbearably slow on CPUs as dimensions grow. On GPUs, they can feel almost real-time by comparison.
- Complex Clustering: Algorithms like HDBSCAN can now handle large volumes of data quickly, unlocking deeper insights into unlabeled datasets.
- Rapid Prototyping & Hyperparameter Tuning: Grid searches or random searches for model selection become far more feasible when each model trains in a fraction of the time (see the sketch below).
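To illustrate the hyperparameter-tuning case, here is a small sketch of a grid search that, with cuml.accel loaded, fits every candidate model on the GPU. The dataset and parameter grid are synthetic placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200_000, n_features=40, random_state=0)
param_grid = {"n_estimators": [100, 200], "max_depth": [10, 20]}

# n_jobs=1 keeps all fits in a single process, the simplest setup when one GPU does the heavy lifting
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3, n_jobs=1)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.4f}")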
Hands-On GPU Acceleration Demo
In the following video, you’ll see exactly how cuML intercepts your scikit-learn, UMAP, and HDBSCAN calls in real time—without changing a single line of your model code. Watch as we demonstrate data loading, model training, and performance benchmarks in a live session, showcasing the dramatic speedups made possible by NVIDIA GPUs.
Code
Random Forest Classification
In this section, we’ll train and time a Random Forest classifier on the Covertype dataset to compare CPU-based scikit-learn performance with GPU-accelerated cuML. We use the exact same code and parameters; the only change is loading the cuML accelerator, which delivers a massive speed boost.
Code:
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Utility function to measure execution time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

def main():
    print("\n" + "="*50)
    print("PART 1: Random Forest Classification (Full Dataset)")
    print("="*50)

    # Load the entire Covertype dataset
    print("Loading full Covertype dataset...")
    data = fetch_covtype()
    X, y = data.data, data.target
    print(f"Full dataset shape: {X.shape}")

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # ===========================================
    # First: Run with standard scikit-learn
    # ===========================================
    print("\nRunning with standard scikit-learn:")
    rf_sklearn = RandomForestClassifier(
        n_estimators=100, max_depth=20, random_state=42, n_jobs=-1
    )
    _, sklearn_time = time_execution(rf_sklearn.fit, X_train, y_train)
    print(f"Training time with scikit-learn: {sklearn_time:.2f} seconds")

    test_subset = 10000
    sklearn_preds, sklearn_pred_time = time_execution(
        rf_sklearn.predict, X_test[:test_subset]
    )
    print(f"Prediction time with scikit-learn (on {test_subset} samples): {sklearn_pred_time:.2f} seconds")
    sklearn_accuracy = accuracy_score(y_test[:test_subset], sklearn_preds)
    print(f"Accuracy with scikit-learn: {sklearn_accuracy:.4f}")

    # ===========================================
    # Now: Run with cuML acceleration
    # ===========================================
    print("\nRunning with cuML acceleration:")
    # NOTE: In a .py script, use:
    #   python -m cuml.accel random_forest_test.py
    # to enable the accelerator.
    from sklearn.ensemble import RandomForestClassifier  # re-import after accel
    rf_cuml = RandomForestClassifier(
        n_estimators=100, max_depth=20, random_state=42, n_jobs=-1
    )
    _, cuml_time = time_execution(rf_cuml.fit, X_train, y_train)
    print(f"Training time with cuML: {cuml_time:.2f} seconds")

    cuml_preds, cuml_pred_time = time_execution(
        rf_cuml.predict, X_test[:test_subset]
    )
    print(f"Prediction time with cuML (on {test_subset} samples): {cuml_pred_time:.2f} seconds")
    cuml_accuracy = accuracy_score(y_test[:test_subset], cuml_preds)
    print(f"Accuracy with cuML: {cuml_accuracy:.4f}")

    # Calculate speedup
    rf_speedup = sklearn_time / cuml_time if cuml_time else 0
    rf_pred_speedup = sklearn_pred_time / cuml_pred_time if cuml_pred_time else 0
    print(f"\nSpeedup for Random Forest training: {rf_speedup:.2f}x")
    print(f"Speedup for Random Forest prediction: {rf_pred_speedup:.2f}x")

    # Visualize timing
    labels = ['scikit-learn', 'cuML (GPU)']
    train_times = [sklearn_time, cuml_time]
    pred_times = [sklearn_pred_time, cuml_pred_time]

    plt.figure(figsize=(12, 5))

    # Training time comparison
    plt.subplot(1, 2, 1)
    plt.bar(labels, train_times)
    plt.title('Random Forest Training Time (seconds)')
    plt.ylabel('Time (seconds)')
    for i, v in enumerate(train_times):
        plt.text(i, v + max(train_times)*0.05, f"{v:.2f}s", ha='center')

    # Prediction time comparison
    plt.subplot(1, 2, 2)
    plt.bar(labels, pred_times)
    plt.title('Random Forest Prediction Time (seconds)')
    plt.ylabel('Time (seconds)')
    for i, v in enumerate(pred_times):
        plt.text(i, v + max(pred_times)*0.05, f"{v:.2f}s", ha='center')

    plt.tight_layout()
    plt.savefig('rf_timing_comparison_large.png')
    plt.show()

if __name__ == "__main__":
    main()
Output & Results
==================================================
PART 1: Random Forest Classification (Full Dataset)
==================================================
Loading full Covertype dataset...
Full dataset shape: (581012, 54)
Running with standard scikit-learn:
Training time with scikit-learn: 87.60 seconds
Prediction time with scikit-learn (on 10000 samples): 0.18 seconds
Accuracy with scikit-learn: 0.8920
Running with cuML acceleration:
...
Training time with cuML: 8.79 seconds
Prediction time with cuML (on 10000 samples): 0.33 seconds
Accuracy with cuML: 0.7387
Speedup for Random Forest training: 9.97x
Speedup for Random Forest prediction: 0.54x
Here we see nearly 10x faster training on the GPU. Prediction on this small 10,000-sample batch was actually slower, which can happen when transfer and kernel-launch overheads dominate a tiny workload. The accuracies also differ (0.89 vs. 0.74), most likely because cuML’s Random Forest builds trees with a different, GPU-oriented splitting algorithm, so the two models are not expected to match exactly. The GPU-accelerated approach still offers huge advantages when training time is the bottleneck.
HDBSCAN Clustering
Next, we cluster a synthetic dataset of 100,000 samples with 100 features, comparing the CPU-based hdbscan library against the GPU-accelerated implementation that cuML substitutes in. As before, the clustering code itself is unchanged.
Code:
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Utility function to measure execution time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

def main():
    print("\n" + "="*50)
    print("PART 2: HDBSCAN Clustering (Larger Dataset)")
    print("="*50)

    # Generate a larger synthetic dataset
    n_samples = 100000
    n_features = 100
    print(f"Generating larger dataset with {n_samples} samples and {n_features} features...")
    X_clusters, y_true = make_blobs(
        n_samples=n_samples,
        n_features=n_features,
        centers=10,
        cluster_std=[1.0, 2.0, 0.5, 3.0, 1.5, 1.2, 2.5, 0.8, 1.7, 2.2],
        random_state=42
    )

    # ===========================================
    # First: Run with standard HDBSCAN
    # ===========================================
    print("\nRunning with standard HDBSCAN:")
    import hdbscan
    clusterer_std = hdbscan.HDBSCAN(min_cluster_size=100, min_samples=10)
    _, hdbscan_time = time_execution(clusterer_std.fit, X_clusters)
    print(f"Clustering time with standard HDBSCAN: {hdbscan_time:.2f} seconds")

    # Evaluate clustering quality (using a sample for silhouette score)
    sample_size = min(10000, len(X_clusters))
    indices = np.random.choice(len(X_clusters), sample_size, replace=False)
    if len(np.unique(clusterer_std.labels_)) > 1:
        silhouette_std = silhouette_score(X_clusters[indices], clusterer_std.labels_[indices])
    else:
        silhouette_std = 0
    print(f"Silhouette score with standard HDBSCAN (on {sample_size} samples): {silhouette_std:.4f}")
    num_clusters_std = len(np.unique(clusterer_std.labels_)) - (1 if -1 in clusterer_std.labels_ else 0)
    print(f"Number of clusters found: {num_clusters_std}")

    # ===========================================
    # Now: Run with cuML acceleration
    # ===========================================
    print("\nRunning with cuML acceleration:")
    # NOTE: In a .py script, use:
    #   python -m cuml.accel hdbscan_test.py
    import hdbscan  # re-import after accel
    clusterer_cuml = hdbscan.HDBSCAN(min_cluster_size=100, min_samples=10)
    _, hdbscan_cuml_time = time_execution(clusterer_cuml.fit, X_clusters)
    print(f"Clustering time with cuML HDBSCAN: {hdbscan_cuml_time:.2f} seconds")

    if len(np.unique(clusterer_cuml.labels_)) > 1:
        silhouette_cuml = silhouette_score(X_clusters[indices], clusterer_cuml.labels_[indices])
    else:
        silhouette_cuml = 0
    print(f"Silhouette score with cuML HDBSCAN (on {sample_size} samples): {silhouette_cuml:.4f}")
    num_clusters_cuml = len(np.unique(clusterer_cuml.labels_)) - (1 if -1 in clusterer_cuml.labels_ else 0)
    print(f"Number of clusters found: {num_clusters_cuml}")

    # Calculate speedup
    hdbscan_speedup = hdbscan_time / hdbscan_cuml_time if hdbscan_cuml_time > 0 else 0
    print(f"\nSpeedup for HDBSCAN clustering: {hdbscan_speedup:.2f}x")

    # Visualize timing
    labels = ['HDBSCAN (CPU)', 'HDBSCAN with cuML (GPU)']
    cluster_times = [hdbscan_time, hdbscan_cuml_time]

    plt.figure(figsize=(10, 6))
    plt.bar(labels, cluster_times)
    plt.title('HDBSCAN Clustering Time (seconds)')
    plt.ylabel('Time (seconds)')
    for i, v in enumerate(cluster_times):
        plt.text(i, v + max(cluster_times)*0.05, f"{v:.2f}s", ha='center')

    plt.tight_layout()
    plt.savefig('hdbscan_timing_comparison_large.png')
    plt.show()

if __name__ == "__main__":
    main()
Output & Results
==================================================
PART 2: HDBSCAN Clustering (Larger Dataset)
==================================================
Generating larger dataset with 100000 samples and 100 features...
Running with standard HDBSCAN:
Clustering time with standard HDBSCAN: 1093.78 seconds
Silhouette score with standard HDBSCAN (on 10000 samples): 0.7175
Number of clusters found: 10
Running with cuML acceleration:
Clustering time with cuML HDBSCAN: 5.06 seconds
Silhouette score with cuML HDBSCAN (on 10000 samples): 0.7175
Number of clusters found: 10
Speedup for HDBSCAN clustering: 216.36x
Here, HDBSCAN saw a jaw-dropping 216x speedup. Notably, the silhouette score is the same, indicating the clusters discovered by CPU and GPU versions are effectively identical in quality.
UMAP Dimensionality Reduction
Finally, we apply the UMAP algorithm to reduce a 50,000-sample MNIST dataset (784 features each) down to 2D. UMAP can be computationally heavy, so it’s a prime example of how GPU acceleration can drastically cut processing times.
Code:
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler

# Utility function to measure execution time
def time_execution(func, *args, **kwargs):
    start_time = time.time()
    result = func(*args, **kwargs)
    end_time = time.time()
    return result, end_time - start_time

def main():
    print("\n" + "="*50)
    print("PART 3: UMAP Dimensionality Reduction (Larger Dataset)")
    print("="*50)

    # Load a larger subset of MNIST
    print("Loading larger MNIST dataset...")
    X_mnist, y_mnist = fetch_openml('mnist_784', version=1, return_X_y=True, parser='auto')
    X_mnist = X_mnist.to_numpy()[:50000]
    y_mnist = y_mnist.to_numpy()[:50000]
    scaler = StandardScaler()
    X_mnist_scaled = scaler.fit_transform(X_mnist)
    print(f"MNIST subset shape: {X_mnist_scaled.shape}")

    # ===========================================
    # First: Run with standard UMAP
    # ===========================================
    print("\nRunning with standard UMAP:")
    import umap
    umap_std = umap.UMAP(
        n_neighbors=15, min_dist=0.1, n_components=2, random_state=42
    )
    X_umap_std, umap_time = time_execution(
        umap_std.fit_transform, X_mnist_scaled
    )
    print(f"UMAP time with standard library: {umap_time:.2f} seconds")

    # ===========================================
    # Now: Run with cuML acceleration
    # ===========================================
    print("\nRunning with cuML acceleration:")
    # NOTE: In a .py script, use:
    #   python -m cuml.accel umap_test.py
    import umap  # re-import after accel
    umap_cuml = umap.UMAP(
        n_neighbors=15, min_dist=0.1, n_components=2, random_state=42
    )
    X_umap_cuml, umap_cuml_time = time_execution(
        umap_cuml.fit_transform, X_mnist_scaled
    )
    print(f"UMAP time with cuML: {umap_cuml_time:.2f} seconds")

    # Calculate speedup
    umap_speedup = umap_time / umap_cuml_time if umap_cuml_time else 0
    print(f"\nSpeedup for UMAP dimensionality reduction: {umap_speedup:.2f}x")

    # Visualize timing
    labels = ['UMAP (CPU)', 'UMAP with cuML (GPU)']
    umap_times = [umap_time, umap_cuml_time]

    plt.figure(figsize=(10, 6))
    plt.bar(labels, umap_times)
    plt.title('UMAP Execution Time (seconds)')
    plt.ylabel('Time (seconds)')
    for i, v in enumerate(umap_times):
        plt.text(i, v + max(umap_times)*0.05, f"{v:.2f}s", ha='center')

    plt.tight_layout()
    plt.savefig('umap_timing_comparison_large.png')
    plt.show()

if __name__ == "__main__":
    main()
Output & Results
==================================================
PART 3: UMAP Dimensionality Reduction (Larger Dataset)
==================================================
Loading larger MNIST dataset...
MNIST subset shape: (50000, 784)
Running with standard UMAP:
UMAP time with standard library: 71.31 seconds
Running with cuML acceleration:
UMAP time with cuML: 5.39 seconds
Speedup for UMAP dimensionality reduction: 13.24x
With UMAP, we enjoyed an impressive 13x speed boost on a modest T4 GPU. More advanced hardware (like NVIDIA A100 or H100) would see even bigger gains.
How to Run These Tests
- Install Dependencies: Make sure you have the required packages installed: scikit-learn, cuml, hdbscan, umap-learn, matplotlib, and so forth.
- Enable cuML Acceleration: In a .py script, use the syntax python -m cuml.accel random_forest_test.py to ensure the cuML accelerator is loaded before scikit-learn or other libraries (a programmatic alternative is sketched after this list).
- Inspect the Outputs: Each script prints timing information, speedups, and (where applicable) metrics like accuracy and silhouette scores. You’ll also see bar charts in .png files comparing CPU vs GPU times.
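If you would rather enable the accelerator from inside Python (handy when you cannot change how a script is launched), recent cuML releases also document a programmatic entry point. Treat the snippet below as a sketch and double-check the exact call against the docs for the cuML version you have installed:

import cuml.accel
cuml.accel.install()   # must run before scikit-learn, umap, or hdbscan are imported

from sklearn.ensemble import RandomForestClassifier  # now intercepted by cuML where supported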
Feel free to adapt these examples to your own datasets and pipelines. With NVIDIA cuML, virtually any scikit-learn pipeline can run faster—all while keeping your code unchanged.
Conclusion
NVIDIA cuML’s zero code change acceleration stands out as a genuinely game-changing leap in the Python machine learning ecosystem. By adopting a friendly one-line extension, data scientists and ML engineers get an enormous performance boost—often over 50x—without the usual headache of rewriting perfectly good scikit-learn code.
If you’re juggling large datasets, tuning hyperparameters, or just sick of endless wait times, this is your golden ticket to faster, more efficient workflows. So, go ahead—load the extension, run your code, and watch your productivity soar.