One-Class SVM: Detecting Anomalies by Learning the Boundary of Normal

LDS Team
Let's Data Science

A credit card processor handles 10,000 transactions per second. Roughly 99.97% are legitimate. The remaining 0.03% are fraud, but no two fraudulent transactions look the same: one is a stolen card buying electronics overseas, another is a slow-drip of micro-charges from a compromised account, a third is an identity thief mimicking normal spending patterns. Building a classifier that learns "what fraud looks like" fails because fraud reinvents itself constantly. The smarter approach is to learn what normal looks like, and flag everything else.

One-Class SVM does exactly this. Rather than drawing a boundary between two classes, it wraps a tight decision surface around a single class (the normal data) using kernel-based optimization. Anything outside that surface is flagged as anomalous. The algorithm never sees a single example of fraud, intrusion, or defect during training. It was introduced by Schölkopf et al. (2001) and remains one of the most mathematically principled approaches to anomaly detection in scikit-learn 1.8.

Throughout this article, we'll use a single running example: server health monitoring, where normal servers produce predictable CPU and memory usage patterns, and anomalous servers (crypto-miners, memory-leaking processes, DDoS victims) deviate from those patterns.

The core idea behind one-class learning

One-class classification flips traditional supervised learning on its head. A standard binary classifier sees examples of both classes and draws a separating boundary between them. One-Class SVM sees examples of only one class and draws a boundary around it.

Think of it like a customs officer who has studied thousands of legitimate passports. She doesn't need to memorize every possible forgery technique. She knows what a real passport looks like so thoroughly that anything that deviates from that template triggers suspicion. That's one-class learning: model normality, then test new observations against that model.

In our server monitoring scenario, the training set contains only metrics from healthy servers: CPU usage hovering between 30-70%, memory consumption following predictable allocation patterns, network I/O within expected bounds. At prediction time, a server suddenly pinning its CPU at 98% while memory stays flat (a classic crypto-mining signature) falls outside the learned boundary and gets flagged.

This approach is formally called novelty detection rather than outlier detection, and the distinction matters:

| Approach | Training data | Goal | Example |
|---|---|---|---|
| Novelty detection | Clean (only normal) | Flag new observations that don't conform | One-Class SVM on verified healthy servers |
| Outlier detection | Contaminated (contains unknown anomalies) | Find anomalies within the dataset itself | Isolation Forest on unsorted server logs |

Key Insight: One-Class SVM produces a hard boundary: +1 for inliers, -1 for outliers. But the decision_function also returns continuous scores, where larger positive values mean "deeply normal" and large negative values mean "clearly anomalous." In production, you'll often set a custom threshold on these scores rather than relying on the default zero cutoff.

One-Class SVM wraps a decision boundary around normal server data points

How the kernel trick shapes the decision boundary

The RBF kernel is the engine that makes One-Class SVM flexible enough to handle real data. In raw feature space, normal data rarely forms a neat circle or ellipse. Server metrics might cluster in two modes (daytime high-CPU web traffic vs. nighttime batch jobs), with irregular density and overlapping tails. A simple geometric shape can't capture this.

The kernel trick sidesteps this limitation. Instead of computing boundaries in the original feature space, the algorithm implicitly maps data into a much higher-dimensional space (infinite-dimensional for the RBF kernel) using a kernel function. In that expanded space, a flat hyperplane can separate the normal data from everything else. When projected back to the original space, that flat hyperplane becomes a curved, flexible boundary.

The Radial Basis Function (RBF) kernel computes similarity between two data points $x$ and $x'$:

$$K(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$

Where:

  • $K(x, x')$ is the kernel value (similarity score) between points $x$ and $x'$
  • $\gamma$ is the kernel bandwidth parameter controlling how quickly similarity decays with distance
  • $\|x - x'\|^2$ is the squared Euclidean distance between the two points
  • $\exp$ is the exponential function

In Plain English: The RBF kernel answers "how similar are these two servers?" If two servers have nearly identical CPU and memory readings, the kernel returns a value close to 1. As their metrics diverge, the kernel value drops exponentially toward 0. The parameter $\gamma$ controls the drop-off speed: high $\gamma$ means even slight differences in server metrics produce low similarity (a narrow "neighborhood"), while low $\gamma$ extends each server's sphere of influence further.
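As a quick numeric check (the server readings here are illustrative, not from the article), the kernel can be evaluated directly:

```python
import numpy as np

def rbf(x, xp, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))

a = [55, 45]   # server A: (CPU %, memory %)
b = [56, 44]   # server B: nearly identical readings
c = [98, 10]   # server C: very different profile

print(rbf(a, b, gamma=0.1))  # exp(-0.1 * 2) ~ 0.82: high similarity
print(rbf(a, c, gamma=0.1))  # effectively 0: dissimilar servers
```

Raising `gamma` shrinks the "neighborhood": with `gamma=1.0`, even servers A and B drop to similarity `exp(-2) ~ 0.14`.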

The relationship between $\gamma$ and the Gaussian width $\sigma$ is:

$$\gamma = \frac{1}{2\sigma^2}$$

Where:

  • $\sigma$ is the standard deviation of the Gaussian (controls its width)
  • Larger $\sigma$ means a wider, smoother kernel
  • Smaller $\sigma$ means a tighter, more localized kernel

In scikit-learn 1.8, the default gamma='scale' computes $\gamma = \frac{1}{n_{\text{features}} \cdot \text{Var}(X)}$, which adapts automatically to both the number of features and the data's variance. This default works surprisingly well in practice and should be your starting point.
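The 'scale' heuristic can be verified directly; note that `_gamma` is a private scikit-learn attribute, used here only for illustration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))  # 500 samples, 4 features

model = OneClassSVM(kernel='rbf', gamma='scale').fit(X)

# gamma='scale' resolves to 1 / (n_features * Var(X)), where Var(X)
# is the variance over all entries of the training matrix
manual_gamma = 1.0 / (X.shape[1] * X.var())
print(model._gamma, manual_gamma)  # the two values agree
```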

Kernel options at a glance

| Kernel | Formula | When to use it | Server monitoring example |
|---|---|---|---|
| RBF | $K(x, x') = \exp(-\gamma \Vert x - x'\Vert^2)$ | Default choice; complex, non-linear boundaries | Multi-modal server clusters with irregular shapes |
| Linear | $K(x, x') = x \cdot x'$ | High-dimensional sparse data; fast training | 500+ monitoring features with sparse telemetry |
| Polynomial | $K(x, x') = (\gamma \, x \cdot x' + r)^d$ | Mild non-linearity; feature interactions | Quadratic relationships between CPU and memory |
| Sigmoid | $K(x, x') = \tanh(\gamma \, x \cdot x' + r)$ | Rarely used; neural-net-like behavior | Niche cases; not recommended as a first choice |

For most anomaly detection tasks, start with RBF. Only switch to linear when your feature space is already high-dimensional (hundreds of features) relative to your sample count, or when training speed is the primary constraint.

The Schölkopf formulation explained

The mathematical backbone of One-Class SVM comes from the 2001 paper "Estimating the Support of a High-Dimensional Distribution" by Schölkopf, Platt, Shawe-Taylor, Smola, and Williamson, published in Neural Computation. The core geometric trick: treat the origin in kernel feature space as the stand-in for "everything anomalous," then find a hyperplane that pushes the mapped normal data as far from the origin as possible.

Here's how to think about it, step by step:

  1. The kernel function maps your original server metrics onto a surface in a higher-dimensional feature space. For the RBF kernel, this space is infinite-dimensional.
  2. The origin of this feature space represents the "anomaly class" — a convenient mathematical anchor point.
  3. The algorithm finds the hyperplane that separates the mapped training data from the origin, maximizing the margin (distance) between them.
  4. When projected back to the original input space, this hyperplane becomes a closed, non-linear boundary wrapping around the normal data.

The Schölkopf formulation maps data to feature space and separates it from the origin

The optimization problem

Given training samples $x_1, x_2, \ldots, x_n$ representing healthy server observations, One-Class SVM solves the following quadratic program. Find the weight vector $w$, slack variables $\xi_i$, and offset $\rho$ that minimize:

$$\min_{w, \xi, \rho} \; \frac{1}{2} \|w\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho$$

Subject to:

$$\langle w, \Phi(x_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad \forall \, i$$

Where:

  • $w$ is the weight vector defining the hyperplane orientation in feature space
  • $\|w\|^2$ is the regularization term that penalizes boundary complexity
  • $\xi_i$ is the slack variable for training point $i$ (allows controlled violations)
  • $\rho$ is the offset from the origin (controls boundary tightness)
  • $\nu$ is the nu parameter controlling the fraction of allowed outliers
  • $n$ is the number of training samples
  • $\Phi(x_i)$ is the kernel mapping of point $x_i$ into higher-dimensional space
  • $\langle \cdot, \cdot \rangle$ denotes the inner product in feature space

In Plain English: The optimization balances three competing forces. First, $\|w\|^2$ keeps the boundary smooth; without it, the model would contort into bizarre shapes to include every single training point. Second, maximizing $\rho$ pushes the hyperplane away from the origin, pulling the boundary tighter around the server cluster. Third, the slack variables $\xi_i$ let a controlled fraction of training points fall outside the boundary, which is critical because even "clean" training data contains noise. The parameter $\nu$ is the dial that controls how much slack to tolerate.

The decision function for a new server observation $x$ is:

$$f(x) = \operatorname{sign}\left(\langle w, \Phi(x) \rangle - \rho\right)$$

Where:

  • $f(x) = +1$ means the server falls inside the boundary (healthy)
  • $f(x) = -1$ means the server falls outside the boundary (anomalous)
  • $\langle w, \Phi(x) \rangle$ measures the signed distance from the hyperplane in feature space
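In scikit-learn terms, this is the relationship between `decision_function` and `predict`. A small sketch on synthetic data (parameters and data are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
X_train = rng.normal(loc=[50, 50], scale=5, size=(300, 2))
X_test = np.vstack([rng.normal(loc=[50, 50], scale=5, size=(20, 2)),
                    [[95.0, 5.0], [5.0, 95.0]]])  # two obvious anomalies

model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(X_train)

scores = model.decision_function(X_test)  # signed distance from the boundary
preds = model.predict(X_test)             # +1 inside, -1 outside

# Away from the boundary, the prediction is just the sign of the score
print(np.all(preds[scores > 0] == 1), np.all(preds[scores < 0] == -1))
```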

Pro Tip: There's an alternative formulation called Support Vector Data Description (SVDD), introduced by Tax and Duin (2004). SVDD finds the smallest hypersphere enclosing the training data, rather than separating data from the origin. With the RBF kernel, both formulations produce mathematically equivalent decision boundaries. Scikit-learn implements the Schölkopf formulation.

The nu parameter controls boundary tightness

The parameter $\nu$ (nu) is the single most important hyperparameter in One-Class SVM. It replaces the $C$ parameter from standard SVMs and takes values in the interval $(0, 1]$, with a default of 0.5 in scikit-learn.

What makes $\nu$ elegant is its dual interpretation:

  1. Upper bound on training errors. Setting $\nu = 0.05$ means at most 5% of your healthy server training points will fall outside the learned boundary.
  2. Lower bound on support vectors. That same $\nu = 0.05$ guarantees at least 5% of training points become support vectors (the points that define the boundary shape).

This dual meaning gives $\nu$ a direct, interpretable relationship to model behavior:

| $\nu$ value | Boundary | Training errors | Support vectors | Server monitoring use case |
|---|---|---|---|---|
| 0.01 | Very tight | At most 1% | At least 1% | High-security environments; zero tolerance for missed anomalies |
| 0.05 | Moderately tight | At most 5% | At least 5% | General production monitoring with clean training data |
| 0.10 | Somewhat loose | At most 10% | At least 10% | Training data with suspected contamination |
| 0.50 | Very permissive | At most 50% | At least 50% | Rarely useful; most anomalies will pass through |

For server health monitoring with a verified clean training set, start with $\nu$ between 0.01 and 0.05. If you suspect some contamination (maybe a few servers had intermittent issues during the training window), bump $\nu$ to 0.05-0.10 to give the boundary room to exclude those hidden anomalies.
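The dual bound can be checked empirically. On synthetic data (illustrative, not from the article), the training-error fraction stays at or below $\nu$ while the support-vector fraction stays at or above it:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))  # 1,000 "healthy" observations

for nu in [0.01, 0.05, 0.10, 0.50]:
    model = OneClassSVM(kernel='rbf', gamma='scale', nu=nu).fit(X)
    err_frac = np.mean(model.predict(X) == -1)  # training points flagged
    sv_frac = len(model.support_) / len(X)      # support-vector fraction
    print(f"nu={nu:.2f}  train-error={err_frac:.3f}  sv-fraction={sv_frac:.3f}")
```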

Common Pitfall: Setting $\nu$ too low (like 0.001) forces the boundary to wrap around nearly every training point, including noise. You get an overly complex, jagged boundary that overfits. Setting $\nu$ too high (like 0.5) makes the boundary so loose that real anomalies slip through. The best practice is to tune $\nu$ using a validation set that contains at least a handful of known anomalies.

Gamma shapes the kernel's reach

For the RBF kernel, $\gamma$ determines how far each training point's influence extends. This has a dramatic effect on boundary shape.

High gamma gives each training point a tiny bubble of influence. The resulting boundary hugs every individual data point, creating a jagged surface with isolated pockets. In server monitoring, this would mean the model memorizes the exact CPU-memory combinations it saw during training and rejects anything even slightly different. A healthy server running 1% more CPU than any training example would get flagged. That's overfitting.

Low gamma gives each training point a wide sphere of influence. The boundary becomes a broad, smooth blob. In server monitoring, the model would generalize too aggressively and accept servers with clearly abnormal metrics. A crypto-miner running at 99% CPU might still fall inside the bloated boundary. That's underfitting.

The default gamma='scale' in scikit-learn provides a solid starting point by adapting to both the number of features and their variance. For fine-tuning, search over a logarithmic range:

γ\gamma rangeEffect on boundaryBias-variance tradeoff
$10^{-4} to \10^{-2}$Smooth, broad boundaryHigher bias, lower variance
$10^{-2} to \10^{0}$Moderate complexityBalanced
$10^{0} to \10^{1}$Jagged, tight boundaryLower bias, higher variance

Pro Tip: If you're using grid search to tune $\gamma$, always pair it with $\nu$ in a 2D grid. These two parameters interact: a tight $\nu$ with a high $\gamma$ produces an absurdly complex boundary, while a loose $\nu$ with a low $\gamma$ produces one that's nearly useless. Searching them independently will miss the optimal combination.

Effect of gamma and nu parameters on the One-Class SVM decision boundary

Full Python implementation with server metrics

The following implementation uses scikit-learn 1.8's OneClassSVM to detect anomalous server behavior. We'll generate synthetic server metrics, train on normal data only, and evaluate on a mixed test set.

Generating and training
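The code for this step isn't shown on the page; the following is a plausible reconstruction of the described setup (two healthy operating modes, 300 training samples, a mixed 90-sample test set). Exact numbers such as the support-vector count will vary with the random seed:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

rng = np.random.default_rng(42)

# Two healthy operating modes: daytime web traffic and nighttime batch jobs
day = rng.normal([55, 45], [6, 5], size=(150, 2))    # (CPU %, memory %)
night = rng.normal([35, 70], [5, 6], size=(150, 2))
X_train = np.vstack([day, night])                    # 300 normal-only samples

# Mixed test set: 50 normal + 40 anomalous servers
X_test = np.vstack([
    rng.normal([55, 45], [6, 5], size=(25, 2)),
    rng.normal([35, 70], [5, 6], size=(25, 2)),
    np.column_stack([rng.uniform(85, 100, 40),       # pinned CPU,
                     rng.uniform(5, 30, 40)]),       # flat memory
])
y_test = np.array([1] * 50 + [-1] * 40)

scaler = StandardScaler().fit(X_train)
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
model.fit(scaler.transform(X_train))

y_pred = model.predict(scaler.transform(X_test))
print(f"Training samples (normal only): {len(X_train)}")
print(f"Test samples: {len(X_test)} (50 normal + 40 anomalous)")
print(f"Support vectors: {len(model.support_)}")
n_norm = int(np.sum(y_pred == 1))
print(f"\nPredicted normal: {n_norm}, Predicted anomaly: {len(y_pred) - n_norm}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Anomaly', 'Normal']))
```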

Expected Output:

```text
Training samples (normal only): 300
Test samples: 90 (50 normal + 40 anomalous)
Support vectors: 13

Predicted normal: 53, Predicted anomaly: 37

Classification Report:
              precision    recall  f1-score   support

     Anomaly       1.00      0.93      0.96        40
      Normal       0.94      1.00      0.97        50

    accuracy                           0.97        90
   macro avg       0.97      0.96      0.97        90
weighted avg       0.97      0.97      0.97        90
```

Visualizing the decision boundary
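The plotting code is likewise missing from the page; a sketch along these lines (using matplotlib, writing to a file rather than opening a window, with illustrative synthetic data) reproduces the described figure:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal([55, 45], [6, 5], size=(150, 2)),
                     rng.normal([35, 70], [5, 6], size=(150, 2))])
X_anom = np.column_stack([rng.uniform(85, 100, 20), rng.uniform(5, 30, 20)])

scaler = StandardScaler().fit(X_train)
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
model.fit(scaler.transform(X_train))

# Evaluate the decision function on a grid of (CPU, memory) values
xx, yy = np.meshgrid(np.linspace(0, 110, 300), np.linspace(0, 110, 300))
Z = model.decision_function(
    scaler.transform(np.c_[xx.ravel(), yy.ravel()])).reshape(xx.shape)

plt.contourf(xx, yy, Z, levels=20, cmap="RdBu")       # score surface
plt.contour(xx, yy, Z, levels=[0], colors="darkred")  # the learned boundary
plt.scatter(*X_train.T, c="white", s=8, edgecolors="k", label="healthy")
plt.scatter(*X_anom.T, c="red", marker="x", label="anomalous")
plt.xlabel("CPU usage (%)"); plt.ylabel("Memory usage (%)"); plt.legend()
plt.savefig("ocsvm_boundary.png", dpi=100)
```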

Expected Output:

The plot displays two dense clusters of white dots (healthy servers in daytime and nighttime operation modes) enclosed by a dark red contour line — the learned decision boundary. Blue shading inside the boundary represents the "normal region," with darker blue near the edges indicating proximity to the boundary. Pale green fills the interior. Red X markers scattered outside the boundary represent detected anomalies: servers with unusual CPU/memory combinations.

Working with decision function scores

The decision_function method returns continuous scores that enable finer-grained anomaly ranking than binary predictions. Points deep inside the boundary score high positive values, borderline points score near zero, and clear outliers score negative.
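A sketch of this scoring step on the same kind of synthetic server data (the exact numbers in the output below will differ), including a custom threshold in the spirit of the Pro Tip that follows:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal([55, 45], [6, 5], size=(150, 2)),
                     rng.normal([35, 70], [5, 6], size=(150, 2))])
X_norm = rng.normal([55, 45], [6, 5], size=(50, 2))   # healthy holdout
X_anom = np.column_stack([rng.uniform(85, 100, 40), rng.uniform(5, 30, 40)])

scaler = StandardScaler().fit(X_train)
model = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
model.fit(scaler.transform(X_train))

s_norm = model.decision_function(scaler.transform(X_norm))
s_anom = model.decision_function(scaler.transform(X_anom))
print(f"Mean score (normal servers):    {s_norm.mean():.3f}")
print(f"Mean score (anomalous servers): {s_anom.mean():.3f}")

# Custom threshold: flag anything below the 5th percentile of normal scores
threshold = np.percentile(s_norm, 5)
print(f"Custom threshold: {threshold:.3f}")
```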

Expected Output:

```text
Score range: [-2.148, 0.328]
Mean score (50 normal servers): 0.210
Mean score (40 anomalous servers): -1.298
```

The histogram shows two overlapping but clearly separated distributions. Normal servers cluster on the positive side (right), anomalous servers cluster on the negative side (left), with the default threshold at zero providing reasonable separation. In production, you can shift this threshold based on your cost of false positives vs. false negatives.

Pro Tip: In production systems, don't just use the binary prediction. Export the decision function scores and set your own threshold. If missing an anomalous server is catastrophic (say, a financial system), lower the threshold to catch more outliers even at the cost of more false alarms. If false alarms are expensive (noisy pager alerts at 3 AM), raise the threshold.

Scaling to large datasets with SGDOneClassSVM

One-Class SVM's main weakness is computational cost. The underlying quadratic programming solver (libsvm) scales between $O(n^2)$ and $O(n^3)$ in training time, which becomes impractical above roughly 50,000 samples.

Scikit-learn provides SGDOneClassSVM (added in version 1.0) as a scalable alternative. It uses stochastic gradient descent with kernel approximation to achieve linear time complexity in the number of training samples.

```python
import numpy as np
from sklearn.linear_model import SGDOneClassSVM
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs
import time

# Generate a larger dataset: 50,000 normal server observations.
# centers=2 lets make_blobs place two clusters in the full 10-D metric space
# (passing explicit 2-D center coordinates would override n_features).
X_large, _ = make_blobs(
    n_samples=50_000,
    centers=2,
    cluster_std=5.0,
    n_features=10,  # 10 server metrics
    random_state=42
)

# Standard OneClassSVM (for comparison on a smaller subset)
subset = X_large[:5000]
scaler = StandardScaler()
subset_scaled = scaler.fit_transform(subset)

start = time.perf_counter()
oc_svm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)
oc_svm.fit(subset_scaled)
classic_time = time.perf_counter() - start

# SGDOneClassSVM with Nystroem kernel approximation (full dataset)
scaler_full = StandardScaler()
X_large_scaled = scaler_full.fit_transform(X_large)

start = time.perf_counter()
sgd_pipeline = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.1, n_components=300, random_state=42),
    SGDOneClassSVM(nu=0.05, random_state=42)
)
sgd_pipeline.fit(X_large_scaled)
sgd_time = time.perf_counter() - start

print(f"Classic OneClassSVM on 5,000 samples: {classic_time:.2f}s")
print(f"SGDOneClassSVM on 50,000 samples:     {sgd_time:.2f}s")
print("\nSGD trained on 10x more data in comparable time")
```

Expected Output:

```text
Classic OneClassSVM on 5,000 samples: 0.02s
SGDOneClassSVM on 50,000 samples:     0.25s

SGD trained on 10x more data in comparable time
```

Warning: SGDOneClassSVM uses a linear model internally, so you must pair it with a kernel approximation like Nystroem or RBFSampler to get RBF-like decision boundaries. Without the approximation, it fits a linear boundary in the original feature space, which is far less expressive.

When to use One-Class SVM (and when not to)

Choosing the right anomaly detection algorithm depends on your data characteristics and operational constraints. Here's a decision framework.

Choose One-Class SVM when

  • Your training data is clean. One-Class SVM assumes the training set contains only normal examples. It's a novelty detector, not an outlier detector.
  • Your dataset is small to medium (under 50,000 samples). The quadratic-to-cubic training cost is manageable at this scale, and the kernel-based boundary will be more precise than tree-based alternatives.
  • Feature interactions matter. The kernel trick captures complex, non-linear relationships between features that distance-based methods (like LOF) struggle with.
  • You need a hard decision boundary. Some compliance and regulatory settings require a clear "inside/outside" classification, not just a score.

Choose something else when

  • Your dataset exceeds 100,000 samples. Use SGDOneClassSVM with kernel approximation, or switch to Isolation Forest, which scales to millions of rows with $O(n \log n)$ training complexity.
  • Your training data is contaminated. If you can't guarantee clean training data, Isolation Forest handles contamination gracefully because its random partitioning naturally isolates outliers.
  • Anomalies are local density deviations. If "anomalous" means "sparse compared to immediate neighbors" rather than "far from the global cluster," Local Outlier Factor is purpose-built for this.
  • You need real-time streaming detection. One-Class SVM doesn't support incremental learning. For streaming data, consider online variants or Isolation Forest which can be updated incrementally.

Head-to-head comparison

| Criterion | One-Class SVM | Isolation Forest | Local Outlier Factor |
|---|---|---|---|
| Core approach | Boundary around normal data | Random partitions isolate anomalies | Local density comparison |
| Training time | $O(n^2)$ to $O(n^3)$ | $O(n \log n)$ | $O(n^2)$ worst case |
| Prediction time | $O(n_{sv} \cdot d)$ per point | $O(t \cdot \log n)$ per point | $O(n \cdot d)$ per point |
| Handles contamination | Poorly (distorts boundary) | Well (natural isolation) | Moderately well |
| Multi-modal data | Good with RBF kernel | Good | Excellent |
| High dimensions | Strong (kernel handles interactions) | Degrades with irrelevant features | Degrades (distance concentration) |
| Practical scale limit | ~50K samples | Millions of samples | ~100K samples |
| scikit-learn class | sklearn.svm.OneClassSVM | sklearn.ensemble.IsolationForest | sklearn.neighbors.LocalOutlierFactor |

For a complete overview of all anomaly detection methods and how they compare, see Finding the Needle: A Comprehensive Guide to Anomaly Detection Algorithms.

Decision framework for choosing between anomaly detection algorithms

Common pitfalls and production guidance

Feature scaling is mandatory

One-Class SVM with the RBF kernel computes Euclidean distances between points. If CPU usage ranges from 0-100 while network packet counts range from 0-10,000,000, the packet count will completely dominate the distance calculation and CPU anomalies will go undetected.

Always apply feature scaling before fitting. StandardScaler is the default choice. MinMaxScaler works too but is more sensitive to outliers in the training set (which ideally shouldn't exist in a novelty detection scenario).

Dimensionality reduction for high-dimensional data

When monitoring hundreds of server metrics simultaneously, the curse of dimensionality kicks in: distances between nearest and farthest neighbors converge, and the RBF kernel loses its ability to distinguish normal from anomalous.

The standard fix is to apply PCA before One-Class SVM. A pipeline of StandardScaler then PCA then OneClassSVM is a production-tested pattern. Keep enough components to explain 90-95% of variance — this filters noise while preserving signal.
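A hedged sketch of that pipeline on synthetic high-dimensional data (sizes and names are illustrative); passing a float to `PCA(n_components=...)` keeps as many components as needed to explain that fraction of variance:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
# 200 hypothetical servers x 50 metrics; a few latent factors drive them all
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 50)) + 0.1 * rng.normal(size=(200, 50))

# Scale -> keep components explaining 95% of variance -> One-Class SVM
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),
                     OneClassSVM(kernel='rbf', gamma='scale', nu=0.05))
pipe.fit(X)
print("PCA components kept:", pipe.named_steps['pca'].n_components_)
print("Prediction labels:", np.unique(pipe.predict(X)))
```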

Tuning without labeled anomalies

The trickiest practical challenge. Without labeled anomalies in a validation set, you can't compute precision, recall, or F1. Three strategies that work:

  1. Synthetic contamination. Inject artificial anomalies into a held-out set (random uniform noise in the feature space works well). Tune $\nu$ and $\gamma$ to maximize detection of these synthetic anomalies while keeping the false positive rate low.
  2. Stability analysis. Sweep $\nu$ and $\gamma$ across a grid and check how much the boundary changes. If the boundary shifts dramatically with small parameter changes, the model is unstable. Stable boundaries across a range of hyperparameters indicate a well-specified model.
  3. Domain expert validation. If operations engineers can label a handful of known incidents ("this server was crypto-mining on Tuesday"), use those as your validation set. Even 10-20 labeled anomalies dramatically improve tuning.
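Strategy 1 can be sketched as a small 2D search over $\nu$ and $\gamma$ (the data, grid values, and scoring rule here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(11)
X_train = rng.normal(loc=[50, 50], scale=5, size=(300, 2))
X_val_normal = rng.normal(loc=[50, 50], scale=5, size=(100, 2))
# Synthetic anomalies: uniform noise over the plausible feature range
X_val_anom = rng.uniform(0, 100, size=(100, 2))

best = None
for nu in [0.01, 0.05, 0.10]:
    for gamma in [0.01, 0.1, 1.0]:
        model = OneClassSVM(kernel='rbf', nu=nu, gamma=gamma).fit(X_train)
        tpr = np.mean(model.predict(X_val_anom) == -1)    # anomalies caught
        fpr = np.mean(model.predict(X_val_normal) == -1)  # normals flagged
        score = tpr - fpr
        if best is None or score > best[0]:
            best = (score, nu, gamma)

print(f"Best (tpr - fpr) = {best[0]:.3f} at nu={best[1]}, gamma={best[2]}")
```

Some synthetic anomalies will land inside the normal cluster by chance, so a perfect score is not expected; the point is to rank hyperparameter combinations, not to measure absolute accuracy.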

Common Pitfall: Don't skip the validation step and just ship with default hyperparameters. The defaults ($\nu = 0.5$, gamma='scale') are conservative starting points, not production-ready values. A $\nu$ of 0.5 means the model allows up to half the training data to fall outside the boundary, which defeats the purpose of anomaly detection.

Computational complexity cheat sheet

| Operation | Time complexity | Memory | Notes |
|---|---|---|---|
| Training | $O(n^2)$ to $O(n^3)$ | $O(n^2)$ (kernel matrix) | libsvm solver; 50K samples is the practical limit |
| Prediction | $O(n_{sv} \cdot d)$ per point | $O(n_{sv} \cdot d)$ | Fast if support vectors are few |
| SGDOneClassSVM training | $O(n \cdot d)$ | $O(n \cdot m)$ with Nystroem ($m$ = components) | Linear scaling; millions of samples feasible |

Conclusion

One-Class SVM converts anomaly detection from an impossible enumeration problem into a boundary-learning problem. Instead of teaching a model every way a server can misbehave, you teach it what healthy looks like and flag deviations. The Schölkopf formulation, with the interpretable $\nu$ parameter controlling boundary tightness and the RBF kernel handling non-linear patterns, provides a mathematically principled approach that's hard to beat on small-to-medium datasets with clean training data.

The algorithm's main constraint is cubic training complexity, which caps practical dataset sizes around 50,000 samples. For larger workloads, scikit-learn's SGDOneClassSVM with Nystroem kernel approximation offers a linear-time alternative. And for contaminated training data or local density anomalies, Isolation Forest and Local Outlier Factor are better fits. In practice, combining multiple detectors — One-Class SVM for boundary precision on critical subsystems, Isolation Forest for broad-scale screening — often outperforms any single method. For a comprehensive comparison of all anomaly detection approaches, see Finding the Needle: A Comprehensive Guide to Anomaly Detection Algorithms.

If your features need preprocessing before any of these models can work well, start with Standardization vs Normalization to get scaling right, then explore PCA for dimensionality reduction. The investment in clean, well-scaled features pays off more than any amount of hyperparameter tuning.

Frequently Asked Interview Questions

Q: What is One-Class SVM and how does it differ from a standard SVM?

A standard SVM draws a decision boundary between two labeled classes. One-Class SVM sees only one class (normal data) during training and learns a boundary that encloses that class. Anything outside the boundary is classified as anomalous. It's a novelty detection method, not a classifier in the traditional sense, and it works by separating the training data from the origin in a kernel-induced feature space.

Q: Explain the role of the nu parameter in One-Class SVM.

The $\nu$ parameter has a dual interpretation: it's an upper bound on the fraction of training points allowed to fall outside the boundary, and a lower bound on the fraction of support vectors. Setting $\nu = 0.05$ means at most 5% of training points will be classified as outliers, and at least 5% will serve as support vectors. It's the primary knob for controlling boundary tightness.

Q: When would you choose One-Class SVM over Isolation Forest?

One-Class SVM is better when you have clean training data (no contamination), a small-to-medium dataset (under 50K samples), and need a precise, kernel-based decision boundary that captures complex feature interactions. Isolation Forest is preferred for large datasets, contaminated training data, or when you need fast training. The two methods also complement each other: One-Class SVM excels at boundary precision while Isolation Forest excels at scalability.

Q: Why is feature scaling mandatory for One-Class SVM?

The RBF kernel computes Euclidean distances between data points. If features have vastly different scales, the larger-magnitude features dominate the distance calculation, effectively making other features invisible to the model. Standardizing all features to zero mean and unit variance ensures each feature contributes proportionally to the distance computation and the resulting decision boundary.

Q: What's the difference between novelty detection and outlier detection?

Novelty detection assumes a clean training set and asks "is this new observation consistent with what I've seen before?" Outlier detection works on a potentially contaminated dataset and tries to identify anomalies within it. One-Class SVM is a novelty detector. Isolation Forest can operate as either. The practical consequence: if your training data contains anomalies, One-Class SVM will incorporate them into its definition of "normal," which distorts the boundary.

Q: How would you tune One-Class SVM hyperparameters without any labeled anomalies?

Three approaches: (1) inject synthetic anomalies (random uniform noise in the feature space) into a validation set and tune to maximize detection, (2) run stability analysis by sweeping $\nu$ and $\gamma$ across a grid and checking whether the boundary changes dramatically, (3) work with domain experts to label even a handful (10-20) of known incidents and use those for validation. The synthetic contamination approach is most automated and widely used.

Q: What happens if the training data for a One-Class SVM is contaminated with anomalies?

Contamination distorts the learned boundary because the algorithm treats every training point as "normal." The boundary stretches to include the anomalous points, which then go undetected at prediction time and may also cause legitimate borderline cases to be misclassified. If contamination is unavoidable, increase $\nu$ to allow the boundary to exclude suspicious training points, or switch to Isolation Forest which handles contamination inherently.

Q: How does SGDOneClassSVM differ from the standard OneClassSVM in scikit-learn?

Standard OneClassSVM uses libsvm's quadratic programming solver with $O(n^2)$ to $O(n^3)$ time complexity. SGDOneClassSVM uses stochastic gradient descent with $O(n)$ time complexity, making it practical for datasets with hundreds of thousands of samples. However, SGDOneClassSVM fits a linear model, so you need to pair it with a kernel approximation like Nystroem to approximate the RBF kernel's non-linear behavior. The trade-off is speed for some boundary precision.

Hands-On Practice

Anomaly detection is a critical skill in fields ranging from manufacturing quality control to financial fraud detection, where 'normal' data is abundant but failures are rare. In this hands-on tutorial, we will implement One-Class SVM (OC-SVM) to learn the boundary of what constitutes a 'normal' wine profile and flag unusual samples as anomalies. Using a high-dimensional wine analysis dataset, we will walk through preprocessing, training an unsupervised OC-SVM model, and visualizing the decision boundary to understand how the algorithm separates outliers from the core distribution.

Dataset: Wine Analysis (High-Dimensional) Wine chemical analysis with 27 features (13 original + 9 derived + 5 noise) and 3 cultivar classes. PCA: 2 components=45%, 5=64%, 10=83% variance. Noise features have near-zero importance. Perfect for dimensionality reduction, feature selection, and regularization.

Experiment by adjusting the nu parameter in the OneClassSVM constructor. Increasing nu (e.g., to 0.2) tells the model to expect more outliers in the training data, which will shrink the decision boundary and potentially increase false positives. Conversely, decreasing gamma will make the boundary smoother and less fitted to individual data points, affecting the model's sensitivity to subtle anomalies.