Mastering Frequency Encoding: The Simple Fix for High-Cardinality Data

LDS Team
Let's Data Science

Imagine you are building a model to predict house prices, and your dataset contains a "Zip Code" column. In the United States alone, there are over 40,000 unique zip codes.

If you use One-Hot Encoding, you instantly add 40,000 sparse columns to your dataset, creating a massive matrix that consumes memory and slows down training. If you use Label Encoding, you force an arbitrary order (10001 < 90210) that confuses your linear models.

This is the "Curse of Cardinality," and it kills model performance daily.

The solution is often surprisingly simple: Frequency Encoding. Instead of creating thousands of new features, Frequency Encoding condenses all that categorical complexity into a single numerical column that captures the "popularity" of each category. It is a favorite trick of Kaggle Grandmasters for squeezing performance out of tree-based models like XGBoost and LightGBM.

What is Frequency Encoding?

Frequency Encoding (also known as Count Encoding) is a feature engineering technique that replaces each category in a variable with the count or percentage of its occurrence in the dataset. Rather than asking "Which category is this?", the model asks "How common is this category?".

In Plain English: Think of Frequency Encoding like a popularity contest. Instead of memorizing the name of every person at a party (Alice, Bob, Charlie), you simply label them by how many friends surround them (Popularity: 50, 10, 2). The model stops caring who the category is and starts caring how influential or common that category is.

The Intuition: Why It Works

Machine learning models, particularly decision trees, excel at finding patterns in numerical magnitude. By converting a category like "Product ID" into its frequency, you provide the model with valuable metadata: rarity.

In fraud detection, for example, a specific IP address appearing 10,000 times is suspicious. An IP appearing once is normal. The identity of the IP matters less than the frequency of the IP. Frequency encoding captures this signal explicitly.
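
As a minimal sketch of that intuition (the 'ip' column, the toy addresses, and the 0.5 cutoff below are purely illustrative), replacing each address with its relative frequency makes the unusually common one stand out as a single large number:

python
import pandas as pd

# Toy event log with one hypothetical high-cardinality 'ip' column
events = pd.DataFrame({
    'ip': ['10.0.0.1'] * 6 + ['10.0.0.2'] * 3 + ['10.0.0.99']
})

# Replace each address with its relative frequency across all events
ip_freq = events['ip'].value_counts(normalize=True)
events['ip_freq'] = events['ip'].map(ip_freq)

# An unusually common address now shows up as a single large number
events['suspiciously_common'] = events['ip_freq'] > 0.5

print(events.drop_duplicates('ip'))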

How does the algorithm work mathematically?

The mathematics behind Frequency Encoding are straightforward, yet rigorous implementation requires handling the total sample size correctly.

For a categorical variable $C$ with $K$ unique categories $c_1, c_2, \ldots, c_K$, the frequency encoding for a specific category $c_i$ is calculated as:

$$X_{\text{freq}}(c_i) = \frac{\text{Count}(c_i)}{N}$$

Where:

  • $\text{Count}(c_i)$ is the number of times category $c_i$ appears in the dataset.
  • $N$ is the total number of observations (rows) in the dataset.

Alternatively, you can use raw Count Encoding, which omits the division by $N$:

$$X_{\text{count}}(c_i) = \text{Count}(c_i)$$

In Plain English: This formula calculates the probability of picking a specific category at random. If "Toyota" appears 30 times in a dataset of 100 cars, the Frequency Encoding for "Toyota" is 0.30 (or 30%). The model now sees "0.30" instead of the string "Toyota".
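
As a quick sanity check of the formula, here is a scaled-down version of the Toyota example (3 out of 10 cars instead of 30 out of 100), showing both the raw count and the normalized frequency side by side:

python
import pandas as pd

# 10 cars; 'Toyota' appears 3 times, so Count = 3 and Frequency = 3/10 = 0.30
cars = pd.Series(['Toyota', 'Toyota', 'Toyota', 'Ford', 'Ford',
                  'BMW', 'BMW', 'Kia', 'Fiat', 'Mini'], name='Brand')

count_map = cars.value_counts()               # Count Encoding: Toyota -> 3
freq_map = cars.value_counts(normalize=True)  # Frequency Encoding: Toyota -> 0.3

print(pd.DataFrame({'count': count_map, 'frequency': freq_map}))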

Why is Frequency Encoding better for high cardinality?

Frequency Encoding solves the "curse of dimensionality" by keeping the feature space constant regardless of how many unique categories exist. While One-Hot Encoding adds one column per category, Frequency Encoding transforms the variable in place, maintaining a single column.

If you have a "User ID" column with 1 million unique users:

  • One-Hot Encoding: Creates 1 million columns. Your RAM crashes.
  • Frequency Encoding: Keeps 1 column. Your model runs instantly.

This efficiency makes Frequency Encoding the standard choice for "high-cardinality" features (columns with many unique values) like Zip Codes, Product IDs, or User Agents.
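
To see the difference concretely, here is a minimal sketch on a synthetic "user_id" column (the row count, cardinality, and column names are made up for illustration):

python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic high-cardinality column: 10,000 rows, ~2,000 distinct user IDs
df = pd.DataFrame({'user_id': rng.integers(0, 2_000, size=10_000).astype(str)})

# One-Hot Encoding: one new column per unique ID
one_hot = pd.get_dummies(df['user_id'], prefix='user')
print("One-Hot columns:", one_hot.shape[1])   # roughly 2,000 columns

# Frequency Encoding: the variable stays a single numerical column
freq_map = df['user_id'].value_counts(normalize=True)
df['user_id_freq'] = df['user_id'].map(freq_map)
print("Frequency-encoded columns:", 1)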

For a deeper dive on why dimensionality matters, check out our Feature Selection vs Feature Extraction guide.

When should you use Frequency Encoding?

You should use Frequency Encoding when working with high-cardinality categorical variables in tree-based models (Random Forests, XGBoost, LightGBM) where the frequency of the category is correlated with the target variable.

Ideal Scenarios:

  1. High Cardinality: Variables with 50+ unique categories (e.g., City, Zip Code).
  2. Tree-Based Models: Decision trees handle non-monotonic relationships well. They can split data into "Rare things" (Frequency < 0.01) and "Common things" (Frequency > 0.1).
  3. Anomaly Detection: Rare categories often signal outliers or fraud. Frequency encoding makes these outliers numerically obvious (values close to 0).

💡 Pro Tip: Frequency Encoding is rarely the best choice for Linear Regression. Linear models assume a linear relationship between the feature and the target. There is rarely a linear link between "how common a city is" and "house price" (e.g., highly populated cities might be expensive, but so are exclusive, low-population luxury islands).

What are the risks and limitations?

The primary risk of Frequency Encoding is collision, where two different categories appear the same number of times and thus get mapped to the exact same value, causing the model to lose the ability to distinguish between them.

The Collision Problem: Imagine you are predicting customer churn.

  • "Category A" (High Churn Risk) appears 50 times.
  • "Category B" (Low Churn Risk) appears 50 times.

After encoding, both map to the same value (a raw count of 50, or an identical frequency). The model can no longer tell them apart. If the target behavior for A and B is different, you have lost predictive signal.

The Solution:

  1. Check for collisions: Before encoding, check if important categories share counts.
  2. Add Ranking: Add a small amount of random noise to the counts or use a secondary metric to break ties (see the sketch after this list).
  3. Combine Methods: As discussed in our Categorical Encoding Guide, you can use Frequency Encoding for high-cardinality columns and One-Hot for low-cardinality ones.
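
A minimal sketch of the tie-breaking idea from step 2 (the noise scale of 0.01 is arbitrary and purely illustrative; a fixed random seed keeps the encoding reproducible):

python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({'category': ['A'] * 50 + ['B'] * 50 + ['C'] * 10})

# Plain count encoding: 'A' and 'B' collide at 50
counts = df['category'].value_counts()
df['cat_count'] = df['category'].map(counts)

# Add a tiny amount of noise per *category* (not per row), so identical
# counts become distinguishable while the overall scale is preserved
jitter = pd.Series(rng.normal(0, 0.01, size=len(counts)), index=counts.index)
df['cat_count_jittered'] = df['category'].map(counts + jitter)

print(df.drop_duplicates('category')[['category', 'cat_count', 'cat_count_jittered']])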

How do you implement Frequency Encoding in Python?

Implementation is simple using the Pandas library. You generate a mapping dictionary from the value counts and map it back to the original column.

Here is a straightforward implementation on a single DataFrame; the pitfall below covers how to apply the same mapping to a held-out test set without data leakage.

python
import pandas as pd
import numpy as np

# Create a dummy dataset
data = {
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 
             'New York', 'Chicago', 'Houston', 'Los Angeles', 'Phoenix'],
    'Price': [100, 150, 110, 90, 105, 95, 80, 145, 85]
}

df = pd.DataFrame(data)

# 1. Calculate Frequency (Probability)
# value_counts(normalize=True) returns the relative frequency
frequency_map = df['City'].value_counts(normalize=True)

# 2. Map the frequency back to the dataframe
df['City_Freq'] = df['City'].map(frequency_map)

print("Mapping Dictionary:\n", frequency_map)
print("\nEncoded DataFrame:")
print(df[['City', 'City_Freq']])

Expected Output:

text
Mapping Dictionary:
New York       0.333333
Los Angeles    0.222222
Chicago        0.222222
Houston        0.111111
Phoenix        0.111111
Name: City, dtype: float64

Encoded DataFrame:
          City  City_Freq
0     New York   0.333333
1  Los Angeles   0.222222
2     New York   0.333333
3      Chicago   0.222222
4     New York   0.333333
5      Chicago   0.222222
6      Houston   0.111111
7  Los Angeles   0.222222
8      Phoenix   0.111111

⚠️ Common Pitfall: Do not recalculate frequencies on your test set based on the test data alone. This creates a distribution shift. You must learn the frequencies from the training set and apply that same mapping to the test set. If a category appears in the test set that was never seen in training, fill it with 0 or the frequency of a generic "Unknown" token.
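
A minimal sketch of that pattern, assuming your data is already split into train_df and test_df (the city names here are illustrative):

python
import pandas as pd

train_df = pd.DataFrame({'City': ['New York', 'New York', 'Chicago', 'Houston']})
test_df = pd.DataFrame({'City': ['New York', 'Chicago', 'Seattle']})  # 'Seattle' is unseen

# 1. Learn the frequency mapping from the TRAINING data only
freq_map = train_df['City'].value_counts(normalize=True)

# 2. Apply that same mapping to both sets
train_df['City_Freq'] = train_df['City'].map(freq_map)
test_df['City_Freq'] = test_df['City'].map(freq_map)

# 3. Unseen categories (e.g. 'Seattle') become NaN; fill them with 0
test_df['City_Freq'] = test_df['City_Freq'].fillna(0)

print(test_df)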

For more on why keeping training and test logic separate is vital, read Why Your Model Fails in Production: The Science of Data Splitting.

Frequency Encoding vs. One-Hot vs. Label Encoding

Choosing the right encoder is often about balancing dimensionality with information retention.

| Feature | Frequency Encoding | One-Hot Encoding | Label Encoding |
| --- | --- | --- | --- |
| Dimensionality | No change (1 column) | Increases (N columns) | No change (1 column) |
| Information | Captures "popularity" | Captures identity | Captures arbitrary ID |
| Best For | Trees, High Cardinality | Linear Models, Low Cardinality | Ordinal Data, Trees |
| Main Risk | Collisions (loss of identity) | Curse of Dimensionality | Implies false order |

If your data implies a rank (like "Small, Medium, Large"), stick to Label/Ordinal encoding. If the category identity is crucial and cardinality is low (like "Color: Red, Blue, Green"), use One-Hot. For everything else—especially massive identifiers—Frequency Encoding is the superior choice.

To master the preprocessing steps that precede encoding, check out our guide on Feature Engineering.

Conclusion

Frequency Encoding is a powerful, lightweight technique for handling high-cardinality categorical variables. By transforming categories into their occurrence counts, you allow models to learn from the "rarity" or "commonness" of an event without inflating your dataset dimensions.

While it risks colliding categories that share the same frequency, its efficiency in tree-based algorithms like XGBoost makes it a staple in modern machine learning pipelines.

To truly master categorical data, your next steps should be:

  1. Compare this with Target Encoding, which uses the label itself for encoding (read about it in our Categorical Encoding Guide).
  2. Learn how to handle the missing values that might arise when mapping frequencies by reading Missing Data Strategies.
  3. Combine these features using Ensemble Methods to capture both popularity and identity signals.

Hands-On Practice

See how Frequency Encoding tames high-cardinality features! We'll compare it against One-Hot Encoding and show why it's the go-to for tree-based models.

Dataset: ML Fundamentals (Loan Approval). We'll create a high-cardinality feature to demonstrate the technique.

Try It Yourself


ML Fundamentals: Loan approval data with features for classification and regression tasks

Try this: Change bins=50 to bins=100 when creating income_bracket to see how One-Hot encoding explodes while Frequency Encoding stays efficient!