
Data Augmentation: How to Multiply Your Dataset and Fix Imbalance

LDS Team
Let's Data Science

You have 5,000 credit card transactions. Only 250 are fraudulent. A model trained on this data achieves 95% accuracy, but it catches just 80% of the actual fraud. The other 20% slip through, costing the business real money. You cannot wait around for more fraud to happen, and labeling new data is expensive. Data augmentation offers a way to manufacture plausible synthetic fraud examples from what you already have, teaching the model patterns it would otherwise miss.

This guide walks through a single fraud detection dataset from raw imbalance to balanced training. Every formula, every code block, and every diagram references the same scenario so you can follow the logic end to end. By the conclusion, you will know exactly when augmentation helps, when it hurts, and how to implement it without introducing data leakage.

The Class Imbalance Problem

Class imbalance occurs when one class in a classification dataset vastly outnumbers another. In fraud detection, legitimate transactions typically make up over 99% of the data. According to research on the widely-used Kaggle credit card fraud dataset, fraudulent transactions represent just 0.17% of total records.

The core issue is straightforward: most classifiers optimize for overall accuracy. When 95% of samples belong to one class, a model that predicts "legitimate" for everything scores 95% accuracy while catching zero fraud. This makes accuracy a misleading metric for imbalanced problems, a topic covered in depth in our guide to ML metrics.
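The trap is easy to reproduce. A minimal sketch (with made-up labels mirroring a 95/5 split, not the article's dataset) shows a "classifier" that always predicts the majority class scoring high accuracy while catching nothing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 950 legitimate (0), 50 fraud (1) -- a 95/5 imbalance
y_true = np.array([0] * 950 + [1] * 50)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero fraud
```

This is why the metrics in the table below matter more than raw accuracy for rare-event problems.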

| Metric | What It Measures | Imbalanced Data Trap |
| --- | --- | --- |
| Accuracy | Overall correct predictions | Inflated by the majority class |
| Precision | Of predicted fraud, how many are real? | Can look great if the model rarely predicts fraud |
| Recall | Of actual fraud, how many did we catch? | The metric that actually matters for rare events |
| F1 Score | Harmonic mean of precision and recall | Balances the precision-recall tension |

Data augmentation attacks this problem by generating synthetic minority-class samples, giving the model more examples to learn from. But it is one of several approaches, and not always the best one.

The Fraud Detection Running Example

Every technique in this article operates on the same synthetic dataset: 5,000 transactions with four features that partially overlap between classes. This overlap is intentional. If fraud and legitimate transactions were perfectly separable, augmentation would be pointless.

<!-- EXEC -->

python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

np.random.seed(42)

# 5000 transactions: 4750 legitimate, 250 fraud (5% ratio)
n_legit, n_fraud = 4750, 250

# Four features with realistic partial overlap
legit_amount = np.random.exponential(80, n_legit)
fraud_amount = np.random.exponential(180, n_fraud)

legit_hour = np.random.normal(14, 4, n_legit).clip(0, 23)
fraud_hour = np.random.normal(3, 5, n_fraud).clip(0, 23)

legit_velocity = np.random.exponential(1.5, n_legit)
fraud_velocity = np.random.exponential(5.0, n_fraud)

legit_distance = np.random.exponential(20, n_legit)
fraud_distance = np.random.exponential(200, n_fraud)

X = np.column_stack([
    np.concatenate([legit_amount, fraud_amount]),
    np.concatenate([legit_hour, fraud_hour]),
    np.concatenate([legit_velocity, fraud_velocity]),
    np.concatenate([legit_distance, fraud_distance])
])
y = np.array([0] * n_legit + [1] * n_fraud)

print(f"Total transactions: {len(y)}")
print(f"Class distribution: {Counter(y)}")
print(f"Fraud ratio: {sum(y) / len(y) * 100:.1f}%")

# Split FIRST — augmentation only touches training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"\nTraining set: {len(X_train)} ({int(sum(y_train))} fraud)")
print(f"Test set:     {len(X_test)} ({int(sum(y_test))} fraud)")

# Baseline: train on imbalanced data
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("\nBaseline Random Forest (no augmentation):")
print(classification_report(y_test, y_pred, target_names=["Legit", "Fraud"], digits=2))

Expected Output:

text
Total transactions: 5000
Class distribution: Counter({np.int64(0): 4750, np.int64(1): 250})
Fraud ratio: 5.0%

Training set: 3500 (175 fraud)
Test set:     1500 (75 fraud)

Baseline Random Forest (no augmentation):
              precision    recall  f1-score   support

       Legit       0.99      1.00      0.99      1425
       Fraud       0.95      0.80      0.87        75

    accuracy                           0.99      1500
   macro avg       0.97      0.90      0.93      1500
weighted avg       0.99      0.99      0.99      1500

The baseline catches 80% of fraud. That means 15 out of 75 fraudulent transactions in the test set go undetected. For a payment processor handling millions of transactions, that gap translates directly into financial loss.

SMOTE: Synthetic Minority Oversampling

SMOTE (Synthetic Minority Over-sampling Technique), introduced by Chawla et al. in their 2002 JAIR paper, generates new minority samples by interpolating between existing ones and their nearest neighbors. Unlike random oversampling, which simply duplicates rows, SMOTE creates genuinely new data points. The model sees variations it has never encountered before, which reduces overfitting to the specific fraud examples in the training set.

The SMOTE Formula

x_{new} = x_i + \lambda \cdot (x_{neighbor} - x_i)

Where:

  • x_i is a randomly selected minority class sample
  • x_{neighbor} is one of the k nearest neighbors of x_i (typically k = 5)
  • λ is a random number drawn uniformly from [0, 1]

In Plain English: Picture two fraud transactions plotted as points in feature space. SMOTE draws a straight line between them and places a new point somewhere along that line. If both endpoints are real fraud, the algorithm assumes the space between them is also "fraud territory." The random λ controls where along the line the new point lands.
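The interpolation itself is a one-liner. A quick numeric sketch with two hypothetical fraud points (features: amount, hour) and a fixed λ:

```python
import numpy as np

x_i = np.array([200.0, 3.0])         # one fraud sample: amount, hour
x_neighbor = np.array([100.0, 1.0])  # a nearby fraud sample
lam = 0.4                            # in practice, a random draw from [0, 1]

# New point 40% of the way along the line from x_i to x_neighbor
x_new = x_i + lam * (x_neighbor - x_i)
print(x_new)  # approximately [160.0, 2.2]
```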

Figure: SMOTE algorithm process from selecting a minority sample through neighbor search and interpolation to a balanced dataset.

Implementing SMOTE from Scratch

The imbalanced-learn library provides a production SMOTE implementation, but it is not available in browser-based Python environments. Building it manually with NumPy and scikit-learn's NearestNeighbors clarifies exactly what happens at each step.

<!-- EXEC -->

python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Recreate the fraud dataset
n_legit, n_fraud = 4750, 250
legit_amount = np.random.exponential(80, n_legit)
fraud_amount = np.random.exponential(180, n_fraud)
legit_hour = np.random.normal(14, 4, n_legit).clip(0, 23)
fraud_hour = np.random.normal(3, 5, n_fraud).clip(0, 23)
legit_velocity = np.random.exponential(1.5, n_legit)
fraud_velocity = np.random.exponential(5.0, n_fraud)
legit_distance = np.random.exponential(20, n_legit)
fraud_distance = np.random.exponential(200, n_fraud)

X = np.column_stack([
    np.concatenate([legit_amount, fraud_amount]),
    np.concatenate([legit_hour, fraud_hour]),
    np.concatenate([legit_velocity, fraud_velocity]),
    np.concatenate([legit_distance, fraud_distance])
])
y = np.array([0] * n_legit + [1] * n_fraud)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Extract minority class from TRAINING data only
X_minority = X_train[y_train == 1]
n_majority = int(sum(y_train == 0))
n_minority = int(sum(y_train == 1))

print(f"Training majority (legit): {n_majority}")
print(f"Training minority (fraud): {n_minority}")
print(f"Samples to generate: {n_majority - n_minority}")

def smote(X_minority, n_synthetic, k=5, seed=42):
    """Generate synthetic samples via nearest-neighbor interpolation."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1)
    nn.fit(X_minority)

    synthetic_samples = []
    for _ in range(n_synthetic):
        # Step 1: Pick a random minority sample
        idx = rng.randint(0, len(X_minority))
        x_i = X_minority[idx]

        # Step 2: Find k nearest neighbors
        _, indices = nn.kneighbors([x_i])

        # Step 3: Select one neighbor at random (skip self at index 0)
        neighbor_idx = indices[0][rng.randint(1, k + 1)]
        x_neighbor = X_minority[neighbor_idx]

        # Step 4: Interpolate
        lam = rng.random()
        x_new = x_i + lam * (x_neighbor - x_i)
        synthetic_samples.append(x_new)

    return np.array(synthetic_samples)

n_to_generate = n_majority - n_minority
X_synthetic = smote(X_minority, n_to_generate, k=5)

print(f"\nExample: original fraud sample")
print(f"  amount={X_minority[0, 0]:.1f}  hour={X_minority[0, 1]:.1f}  "
      f"velocity={X_minority[0, 2]:.1f}  distance={X_minority[0, 3]:.1f}")
print(f"First synthetic sample")
print(f"  amount={X_synthetic[0, 0]:.1f}  hour={X_synthetic[0, 1]:.1f}  "
      f"velocity={X_synthetic[0, 2]:.1f}  distance={X_synthetic[0, 3]:.1f}")

X_train_aug = np.vstack([X_train, X_synthetic])
y_train_aug = np.concatenate([y_train, np.ones(n_to_generate)])

print(f"\nAugmented training set: {len(X_train_aug)} samples")
print(f"  Legit: {int(sum(y_train_aug == 0))}")
print(f"  Fraud: {int(sum(y_train_aug == 1))}")

Expected Output:

text
Training majority (legit): 3325
Training minority (fraud): 175
Samples to generate: 3150

Example: original fraud sample
  amount=142.6  hour=1.3  velocity=1.8  distance=165.3
First synthetic sample
  amount=40.9  hour=0.4  velocity=1.2  distance=16.6

Augmented training set: 6650 samples
  Legit: 3325
  Fraud: 3325

The synthetic fraud sample sits between two real fraud points in feature space. Notice how its values differ from any original sample. That is the key advantage over simple duplication: the model encounters novel combinations rather than memorizing existing ones.

Key Insight: SMOTE assumes the feature space between two minority samples is valid territory for that class. This works well when minority samples form a single cluster. It breaks down when the minority class has multiple distinct subgroups separated by majority-class regions.

Gaussian Noise Injection

Gaussian noise injection creates new samples by adding small random perturbations to existing data points. It is simpler than SMOTE and does not require a nearest-neighbor search.

x_{new} = x_{original} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)

Where:

  • x_{original} is an existing minority sample
  • ε is random noise drawn from a normal distribution
  • σ controls the spread of the noise (typically 5-15% of the feature's standard deviation)

In Plain English: Take a real fraud transaction where the amount was $245. Add a tiny random jitter to get $251 or $239. The transaction is still "fraud-like" but the exact numbers differ. Do this across all features simultaneously to create a new training point that is close to the original but not identical.

The critical design choice is how much noise to add. Too little and the augmented data is practically a duplicate. Too much and you push synthetic samples into regions of feature space where they do not belong.

Pro Tip: Scale σ relative to each feature's standard deviation, not its absolute value. A $10 perturbation is significant for a $20 lunch charge but invisible for a $5,000 international transfer.
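That per-feature scaling can be sketched in a few lines. This uses a synthetic stand-in for the minority class (the feature scales are illustrative, not the article's exact dataset), with σ set to 10% of each column's standard deviation:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical minority-class matrix: columns = amount, hour, velocity, distance
X_minority = rng.exponential([180, 3, 5, 200], size=(175, 4))

# Sigma scaled per feature: 10% of each column's standard deviation
sigma = X_minority.std(axis=0) * 0.10

# Jitter every sample once; clip so amounts and counts stay non-negative
noise = rng.normal(0, sigma, X_minority.shape)
X_synthetic = np.clip(X_minority + noise, 0, None)

print(X_synthetic.shape)  # (175, 4)
```

Because `sigma` is a vector, each column gets noise proportional to its own spread, so a high-variance feature like distance is perturbed more in absolute terms than a low-variance one like hour.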

Beyond SMOTE: Advanced Tabular Augmentation

SMOTE and noise injection handle basic tabular augmentation, but recent research has expanded the toolkit considerably.

SMOTE Variants

| Variant | How It Differs from SMOTE | Best For |
| --- | --- | --- |
| Borderline-SMOTE | Only generates samples near the decision boundary | Datasets where most minority samples are easy to classify |
| ADASYN | Generates more samples in harder-to-learn regions | Adaptive focus on difficult patterns |
| SMOTE-ENN | Combines SMOTE with Edited Nearest Neighbors cleanup | Removing noisy synthetic samples after generation |
| SVM-SMOTE | Uses SVM support vectors to guide synthesis | Smaller datasets with clear margin separation |

Mixup for Tabular Data

Mixup blends two random samples and their labels, forcing the model to learn smooth transitions between classes rather than hard decision boundaries:

x_{new} = \lambda \cdot x_i + (1 - \lambda) \cdot x_j
y_{new} = \lambda \cdot y_i + (1 - \lambda) \cdot y_j

Where:

  • x_i and x_j are two randomly selected training samples (any class)
  • y_i and y_j are their labels
  • λ is drawn from a Beta distribution, typically Beta(0.2, 0.2)

In Plain English: If you blend a fraud transaction (y = 1) with a legitimate one (y = 0) at λ = 0.7, you get a synthetic sample with label 0.7. The model learns that this combination is "70% fraud-like." This penalizes overconfident predictions and typically improves calibration.
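A minimal mixup sketch, using two hypothetical transactions (the feature values are illustrative) and a Beta-distributed λ:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical samples: amount, hour, velocity, distance
x_fraud, y_fraud = np.array([300.0, 2.0, 6.0, 250.0]), 1.0
x_legit, y_legit = np.array([40.0, 14.0, 1.0, 10.0]), 0.0

# Beta(0.2, 0.2) concentrates mass near 0 and 1, so most blends stay
# close to one of the originals, with occasional strong mixes
lam = rng.beta(0.2, 0.2)

x_new = lam * x_fraud + (1 - lam) * x_legit
y_new = lam * y_fraud + (1 - lam) * y_legit

print(y_new)  # a soft label between 0 and 1, equal to lam here
```

Note that the blended label is continuous, so the downstream model must support soft targets (e.g. a regressor or a classifier trained with a cross-entropy loss on probabilities) rather than hard 0/1 labels.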

Deep Generative Models (CTGAN)

For complex tabular data with mixed column types, CTGAN (Conditional Tabular GAN) learns the joint distribution of all features and generates entirely new rows that respect categorical constraints and feature correlations. As of March 2026, the Synthetic Data Vault (SDV) library provides a production-ready pipeline around CTGAN with built-in quality evaluation.

CTGAN excels with mixed numeric and categorical columns. The tradeoff is training time; for pure oversampling with a handful of numeric features, SMOTE is faster and easier to debug.

Data Leakage: The Augmentation Trap

Data leakage occurs when information from outside the training set contaminates the model's learning process. With augmentation, the most common leak is augmenting the full dataset before splitting, which lets synthetic training samples share structure with test samples.

Figure: Correct augmentation pipeline — split first, then augment only the training data, and evaluate on the untouched test set.

Common Pitfall: Never augment your test or validation sets. Synthetic data belongs exclusively in the training pipeline. If you generate fake fraud in your validation set, you are measuring how well the model recognizes your augmentation method, not how well it catches real fraud.

The correct pipeline is:

  1. Split the data into train, validation, and test sets (stratified)
  2. Augment the minority class in the training set only
  3. Train the model on the augmented training data
  4. Evaluate on the original, unaugmented test set

For cross-validation with augmented data, augmentation must happen inside each fold, after the fold split. This is covered in more detail in our guide to cross-validation.
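The fold-internal pattern can be sketched as follows. This uses a small synthetic dataset and plain random oversampling as a stand-in for SMOTE (the structure is what matters: augmentation touches only the fold's training rows):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

rng = np.random.RandomState(42)

# Hypothetical imbalanced dataset: 2 features, roughly 5% positives
X = rng.normal(size=(1000, 2))
y = (rng.random(1000) < 0.05).astype(int)
X[y == 1] += 2.0  # shift positives so there is signal to learn

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
recalls = []
for train_idx, val_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Augment INSIDE the fold, on training rows only
    # (random oversampling here as a stand-in for SMOTE)
    minority = np.where(y_tr == 1)[0]
    extra = rng.choice(minority, size=(y_tr == 0).sum() - len(minority))
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, np.ones(len(extra), dtype=int)])

    clf = RandomForestClassifier(n_estimators=50, random_state=42)
    clf.fit(X_tr, y_tr)

    # Evaluate on the fold's UNAUGMENTED validation rows
    recalls.append(recall_score(y[val_idx], clf.predict(X[val_idx])))

print(f"Mean CV recall: {np.mean(recalls):.2f}")
```

Swapping the oversampling step for the `smote` function defined earlier in this article follows the same shape; the only invariant is that synthetic rows never reach the validation indices.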

Comparing Every Approach Head to Head

The real question is not whether augmentation works in isolation. It is how the different techniques compare when evaluated on the same untouched test set.

<!-- EXEC -->

python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, precision_score, f1_score

np.random.seed(42)

n_legit, n_fraud = 4750, 250
legit_amount = np.random.exponential(80, n_legit)
fraud_amount = np.random.exponential(180, n_fraud)
legit_hour = np.random.normal(14, 4, n_legit).clip(0, 23)
fraud_hour = np.random.normal(3, 5, n_fraud).clip(0, 23)
legit_velocity = np.random.exponential(1.5, n_legit)
fraud_velocity = np.random.exponential(5.0, n_fraud)
legit_distance = np.random.exponential(20, n_legit)
fraud_distance = np.random.exponential(200, n_fraud)

X = np.column_stack([
    np.concatenate([legit_amount, fraud_amount]),
    np.concatenate([legit_hour, fraud_hour]),
    np.concatenate([legit_velocity, fraud_velocity]),
    np.concatenate([legit_distance, fraud_distance])
])
y = np.array([0] * n_legit + [1] * n_fraud)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_min = X_train[y_train == 1]
n_maj = int(sum(y_train == 0))
n_min = int(sum(y_train == 1))
n_gen = n_maj - n_min

# SMOTE
def smote(X_minority, n_synthetic, k=5, seed=42):
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1)
    nn.fit(X_minority)
    synthetic = []
    for _ in range(n_synthetic):
        idx = rng.randint(0, len(X_minority))
        x_i = X_minority[idx]
        _, indices = nn.kneighbors([x_i])
        neighbor_idx = indices[0][rng.randint(1, k + 1)]
        x_neighbor = X_minority[neighbor_idx]
        lam = rng.random()
        synthetic.append(x_i + lam * (x_neighbor - x_i))
    return np.array(synthetic)

X_smote = smote(X_min, n_gen, k=5)
X_tr_smote = np.vstack([X_train, X_smote])
y_tr_smote = np.concatenate([y_train, np.ones(n_gen)])

# Noise injection
rng2 = np.random.RandomState(42)
noise_std = X_min.std(axis=0) * 0.15
repeats = n_gen // n_min + 1
X_rep = np.tile(X_min, (repeats, 1))[:n_gen]
X_noisy = np.clip(X_rep + rng2.normal(0, noise_std, X_rep.shape), 0, None)
X_tr_noise = np.vstack([X_train, X_noisy])
y_tr_noise = np.concatenate([y_train, np.ones(n_gen)])

# Random oversampling (duplicate existing)
rng3 = np.random.RandomState(42)
dup_indices = rng3.choice(len(X_min), size=n_gen, replace=True)
X_dup = X_min[dup_indices]
X_tr_dup = np.vstack([X_train, X_dup])
y_tr_dup = np.concatenate([y_train, np.ones(n_gen)])

models = {
    "No augmentation": (X_train, y_train, {}),
    "Random oversampling": (X_tr_dup, y_tr_dup, {}),
    "Noise injection": (X_tr_noise, y_tr_noise, {}),
    "SMOTE": (X_tr_smote, y_tr_smote, {}),
    "Class weights": (X_train, y_train, {"class_weight": "balanced"}),
}

print(f"{'Method':<22} {'Recall':>7} {'Precision':>10} {'F1':>6}")
print("-" * 49)
for name, (Xtr, ytr, kwargs) in models.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=42, **kwargs)
    clf.fit(Xtr, ytr)
    preds = clf.predict(X_test)
    r = recall_score(y_test, preds)
    p = precision_score(y_test, preds, zero_division=0)
    f = f1_score(y_test, preds)
    print(f"{name:<22} {r:>7.2f} {p:>10.2f} {f:>6.2f}")

Expected Output:

text
Method                  Recall  Precision     F1
-------------------------------------------------
No augmentation           0.80       0.95   0.87
Random oversampling       0.79       0.86   0.82
Noise injection           0.95       0.65   0.77
SMOTE                     0.89       0.74   0.81
Class weights             0.75       0.97   0.84

Figure: Comparison of tabular augmentation techniques — random oversampling, noise injection, SMOTE, and class weights.

Several patterns stand out from this comparison.

Noise injection delivers the highest recall (0.95) but the lowest precision (0.65). It catches almost every fraud case but also flags many legitimate transactions. SMOTE strikes a better balance: recall jumps from 0.80 to 0.89 with a moderate precision drop to 0.74, often the sweet spot for production fraud systems.

Random oversampling barely helps. Duplicating rows does not give the model new information; the Random Forest memorizes those specific fraud patterns instead of learning generalizable boundaries.

Class weights actually lower recall here. They adjust the loss function, but for tree-based models the effect is more subtle than with gradient-based learners.

Key Insight: There is always a precision-recall tradeoff. Augmentation pushes the model to predict fraud more aggressively, catching more real fraud (higher recall) but also mislabeling some legitimate transactions (lower precision). The right balance depends on whether missed fraud or false alerts cost more.

When to Augment and When NOT To

Data augmentation is not a universal solution. It helps in specific situations and actively hurts in others.

Figure: Decision guide for when to apply data augmentation versus alternative approaches.

Augment when:

  • Class ratio exceeds 10:1. Below this threshold, class_weight="balanced" or threshold tuning often suffices.
  • You cannot collect more real data. If labeled data is expensive or rare (fraud, disease, equipment failure), augmentation is the practical choice.
  • Your model memorizes instead of generalizing. If training recall is high but validation recall is low, SMOTE can help the model learn broader patterns. This connects directly to the bias-variance tradeoff.

Do NOT augment when:

  • Features have hard logical constraints. If "heart rate" must be between 40 and 200, SMOTE might interpolate a value of 25 between a resting patient and an exercising one. Always validate that synthetic samples respect domain constraints.
  • Minority subgroups exist. If fraud has two distinct clusters (online scams and in-person card theft), interpolating between clusters creates synthetic points in legitimate territory. Visualize your data with PCA or t-SNE before augmenting.
  • You have enough minority data. With 5,000+ minority samples, the model already has sufficient signal. Adding synthetic data at that point adds noise without improving generalization.
  • The problem is actually outlier detection. If fraud truly has no consistent pattern and each case is unique, augmenting from neighbors makes little sense. Consider anomaly detection (Isolation Forest, autoencoders) instead.
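For the first pitfall above, a cheap guardrail is an explicit constraint check before synthetic rows reach the model. A minimal sketch, assuming hypothetical per-feature bounds for the fraud features (the bounds themselves are illustrative, not from the article's dataset):

```python
import numpy as np

# Hypothetical domain bounds for: amount, hour, velocity, distance
bounds = {
    "amount":   (0.0, 20_000.0),
    "hour":     (0.0, 23.0),
    "velocity": (0.0, 100.0),
    "distance": (0.0, 15_000.0),
}

def validate_synthetic(X_synthetic, bounds):
    """Return a boolean mask of rows that respect every feature's bounds."""
    lo = np.array([b[0] for b in bounds.values()])
    hi = np.array([b[1] for b in bounds.values()])
    return ((X_synthetic >= lo) & (X_synthetic <= hi)).all(axis=1)

X_fake = np.array([[120.0, 3.0, 4.5, 180.0],   # plausible
                   [-5.0, 30.0, 2.0, 90.0]])   # violates amount and hour
mask = validate_synthetic(X_fake, bounds)
print(X_fake[mask])  # keep only the rows inside the bounds
```

Dropping (or clipping) the rows that fail the mask keeps interpolation artifacts like a negative amount or a 30th hour out of the training set.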

Production Considerations

| Factor | SMOTE | Noise Injection | Class Weights |
| --- | --- | --- | --- |
| Memory usage | High (stores all neighbors) | Low (in-place operation) | None (no extra data) |
| Training time increase | 2-10x (larger dataset) | 2-10x (larger dataset) | Negligible |
| Computational complexity | O(n · k · d) for neighbor search | O(n · d) for noise generation | None |
| Works with cross-validation | Yes, but must augment inside each fold | Yes, same requirement | Yes, natively supported |
| Risk of overfitting | Moderate (novel points help) | Low-moderate (depends on noise scale) | Low |

For datasets above 100,000 rows, the neighbor search in SMOTE becomes expensive. Consider using approximate nearest neighbor libraries (FAISS, Annoy) or switching to noise injection, which scales linearly. For feature engineering pipelines that run daily in production, class weights are often the simplest first step since they require no data manipulation at all.

Conclusion

Data augmentation transforms data scarcity into data abundance, but only when applied correctly. The cardinal rule is to split first, augment second, and never let synthetic data contaminate your evaluation sets.

SMOTE remains the most popular tabular augmentation technique for good reason: it generates genuinely novel points rather than duplicates, and it typically improves recall without cratering precision. For our fraud detection dataset, it pushed recall from 0.80 to 0.89, catching seven more fraudulent transactions out of 75.

Before reaching for augmentation, make sure you understand why your model fails in production through proper data splitting. Once your data pipeline is clean, experiment with SMOTE, noise injection, and class weights on your specific problem. The best approach depends on your imbalance ratio, feature types, and whether missed detections or false alarms carry the higher cost.

The simplest advice: start with class_weight="balanced". If recall still falls short, add SMOTE. If your features have complex correlations and mixed types, consider CTGAN. And always validate that your synthetic samples look plausible before feeding them to a model.

Frequently Asked Interview Questions

Q: What is the difference between random oversampling and SMOTE?

Random oversampling duplicates existing minority samples, giving the model more weight on those points but no new information. SMOTE creates novel samples by interpolating between a minority point and its k nearest neighbors, reducing overfitting because the model encounters variations it has never memorized.

Q: Why should you never augment the test set?

Synthetic samples share structural similarities with the training data they were derived from. If you augment the test set, the model gets an unfair advantage and your performance estimates become artificially inflated. The test set must contain only real, unmodified data.

Q: A fraud detection model has 99.5% accuracy but 30% recall on fraud. What is happening?

The model predicts "legitimate" for almost everything because the majority class dominates. Apply SMOTE to balance training classes, switch to F1 or AUC-PR for model selection, and lower the classification threshold to prioritize catching fraud.

Q: When does SMOTE fail?

SMOTE fails when the minority class has multiple distinct clusters separated by majority-class regions. Interpolating between clusters creates synthetic points in the wrong territory. It also fails when features have hard constraints (body temperature must be 35-42 degrees Celsius) because interpolation can produce impossible values.

Q: How would you implement data augmentation inside a cross-validation loop?

Augmentation must happen after each fold split. For each fold: (1) apply SMOTE to the training portion only, (2) train on the augmented training data, and (3) evaluate on the unaugmented validation portion. This prevents information leakage through shared nearest neighbors.

Q: How would you choose between SMOTE, noise injection, and class weights?

Start with class weights since they require no pipeline changes. If recall is still insufficient, try SMOTE for moderate imbalance (10:1 to 100:1) or noise injection when you can tolerate more false positives. For very large datasets (millions of rows), class weights or majority-class undersampling are more practical.

<!-- PLAYGROUND_START data-dataset="lds_classification_binary" -->

Hands-On Practice

Note: The imbalanced-learn library isn't available in the browser environment. This hands-on section demonstrates the same concepts using manual Gaussian noise injection and sklearn's class_weight parameter. The core algorithm and approach remain identical to what the library does internally.

Data augmentation is a powerful technique to handle scarcity and class imbalance. In this exercise, we will tackle a real survival prediction scenario where survivors are the minority class. We will use two strategies from the article: Gaussian Noise Injection (manually implemented with NumPy) to generate synthetic data, and Cost-Sensitive Learning (using sklearn's class weights) to force the model to pay attention to the minority class.

Dataset: Passenger Survival (Binary Classification) Titanic-style survival prediction with 800 passengers. Contains natural class imbalance: ~63% didn't survive (Class 0), ~37% survived (Class 1). Features include passenger class, age, fare, and family information.

python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# 1. Load the Dataset
# This is a Titanic-style survival dataset with NATURAL class imbalance
df = pd.read_csv("/datasets/playground/lds_classification_binary.csv")

print("Dataset Shape:", df.shape)
print("\nColumns:", df.columns.tolist())

# The target is 'survived' - this is NATURALLY imbalanced
# Class 0 (didn't survive) is the majority, Class 1 (survived) is the minority
print("\nOriginal Class Distribution:")
print(df['survived'].value_counts())
print(f"\nImbalance Ratio: {df['survived'].value_counts()[0] / df['survived'].value_counts()[1]:.2f}:1")

# Prepare features - use numeric columns for augmentation
# We'll one-hot encode 'embarked' for the model
df_encoded = pd.get_dummies(df, columns=['embarked'], drop_first=True)

feature_cols = ['passenger_class', 'sex', 'age', 'siblings_spouses',
                'parents_children', 'fare', 'embarked_Q', 'embarked_S']
X = df_encoded[feature_cols]
y = df_encoded['survived']

# Split into Train and Test
# Stratify ensures proportional class representation in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

# --- STRATEGY 1: MANUAL DATA AUGMENTATION (Noise Injection) ---
# Since we cannot use SMOTE in the browser, we implement Gaussian Noise Injection manually.
# This adds slight random variations to the minority class (survivors) to create new samples.

print("\n" + "="*50)
print("STRATEGY 1: Gaussian Noise Injection")
print("="*50)

# 1. Isolate the minority class (Survivors) in the training set
minority_mask = y_train == 1
X_minority = X_train[minority_mask].copy()
print(f"\nMinority class samples in training: {len(X_minority)}")

# 2. Define noise scales for each feature (relative to feature ranges)
# Numeric features get small noise; binary features stay unchanged
numeric_cols = ['age', 'fare', 'siblings_spouses', 'parents_children']
noise_scales = {
    'passenger_class': 0,      # Keep discrete
    'sex': 0,                  # Keep binary
    'age': 2.0,                # +/- 2 years std dev
    'siblings_spouses': 0.3,   # Small variation
    'parents_children': 0.3,   # Small variation
    'fare': 5.0,               # +/- $5 std dev
    'embarked_Q': 0,           # Keep binary
    'embarked_S': 0            # Keep binary
}

# 3. Generate Synthetic Samples with feature-specific noise
np.random.seed(42)
X_synthetic = X_minority.copy()
for col in X_synthetic.columns:
    if noise_scales.get(col, 0) > 0:
        noise = np.random.normal(0, noise_scales[col], len(X_synthetic))
        X_synthetic[col] = X_synthetic[col] + noise
        # Ensure non-negative values for counts and fare
        if col in ['age', 'fare', 'siblings_spouses', 'parents_children']:
            X_synthetic[col] = X_synthetic[col].clip(lower=0)

y_synthetic = pd.Series([1] * len(X_synthetic))

# 4. Concatenate with original training data
X_train_aug = pd.concat([X_train, X_synthetic], ignore_index=True)
y_train_aug = pd.concat([y_train, y_synthetic], ignore_index=True)

print(f"Training set size BEFORE augmentation: {len(X_train)}")
print(f"Training set size AFTER augmentation:  {len(X_train_aug)}")
print(f"\nNew class distribution:")
print(y_train_aug.value_counts())

# Visualize augmentation effect on Age vs Fare
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.scatter(X_train[y_train==0]['age'], X_train[y_train==0]['fare'],
            alpha=0.5, c='blue', label='Did Not Survive (Original)')
plt.scatter(X_train[y_train==1]['age'], X_train[y_train==1]['fare'],
            alpha=0.7, c='red', s=60, label='Survived (Original)')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('Before Augmentation')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(X_train[y_train==0]['age'], X_train[y_train==0]['fare'],
            alpha=0.5, c='blue', label='Did Not Survive')
plt.scatter(X_train[y_train==1]['age'], X_train[y_train==1]['fare'],
            alpha=0.7, c='red', s=60, label='Survived (Original)')
plt.scatter(X_synthetic['age'], X_synthetic['fare'],
            alpha=0.7, c='orange', marker='x', s=60, label='Survived (Synthetic)')
plt.xlabel('Age')
plt.ylabel('Fare')
plt.title('After Augmentation')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# --- STRATEGY 2: CLASS WEIGHTS (Algorithmic Fix) ---
print("\n" + "="*50)
print("STRATEGY 2: Cost-Sensitive Learning (Class Weights)")
print("="*50)

# Instead of changing the data, we change how the model learns.
# We penalize missing a Survivor (Class 1) more than missing Non-Survivor (Class 0).

# Model A: Standard (May under-predict the minority class)
model_std = RandomForestClassifier(n_estimators=100, random_state=42)
model_std.fit(X_train, y_train)

# Model B: Balanced Weights (Pay attention to minority!)
# class_weight='balanced' automatically calculates weights inversely proportional to class frequencies
model_bal = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model_bal.fit(X_train, y_train)

# Model C: Trained on Augmented Data
model_aug = RandomForestClassifier(n_estimators=100, random_state=42)
model_aug.fit(X_train_aug, y_train_aug)

# --- Evaluation ---
print("\n--- Model Comparison on Test Set ---\n")

models = {
    'Standard (No Fix)': model_std,
    'Class Weights': model_bal,
    'Augmented Data': model_aug
}

for name, model in models.items():
    preds = model.predict(X_test)
    cm = confusion_matrix(y_test, preds)

    # Extract metrics for minority class (Survived = 1)
    tn, fp, fn, tp = cm.ravel()
    recall_minority = tp / (tp + fn) if (tp + fn) > 0 else 0
    precision_minority = tp / (tp + fp) if (tp + fp) > 0 else 0

    print(f"{name}:")
    print(f"  Confusion Matrix: TN={tn}, FP={fp}, FN={fn}, TP={tp}")
    print(f"  Recall (Survivors): {recall_minority:.2%} <- How many survivors we caught")
    print(f"  Precision (Survivors): {precision_minority:.2%}")
    print()

# Visualize recall comparison
plt.figure(figsize=(10, 5))
recalls = []
precisions = []
names = []

for name, model in models.items():
    preds = model.predict(X_test)
    cm = confusion_matrix(y_test, preds)
    tn, fp, fn, tp = cm.ravel()
    recalls.append(tp / (tp + fn) if (tp + fn) > 0 else 0)
    precisions.append(tp / (tp + fp) if (tp + fp) > 0 else 0)
    names.append(name)

x = np.arange(len(names))
width = 0.35

plt.bar(x - width/2, recalls, width, label='Recall (Sensitivity)', color='steelblue')
plt.bar(x + width/2, precisions, width, label='Precision', color='coral')
plt.xlabel('Model')
plt.ylabel('Score')
plt.title('Impact of Augmentation Strategies on Minority Class Detection')
plt.xticks(x, names, rotation=15)
plt.legend()
plt.ylim(0, 1)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nData augmentation improved minority class recall!")
print("  Try adjusting noise scales or using class weights to see different tradeoffs.")

The results demonstrate the power of data augmentation for handling imbalanced datasets. The Gaussian noise injection created plausible variations of survivor records, boosting recall on the minority class. Notice the precision-recall tradeoff: augmentation typically improves recall (catching more survivors) at the cost of some precision (more false positives). In practice, you would choose based on your domain needs: in fraud detection or medical diagnosis, higher recall is often worth the tradeoff.

<!-- PLAYGROUND_END -->
