Your e-commerce dataset has a product_category column with 25 unique values. Not a problem. But the seller_id column? 180,000 unique sellers. Apply one-hot encoding, and your feature matrix explodes from a handful of columns to 180,000 sparse binary vectors. Your laptop freezes. Your cloud bill spikes. Your model trains slower and often performs worse because the signal drowns in a sea of zeroes.
Frequency encoding sidesteps this entirely. Instead of asking "which category is this?", it asks "how common is this category?" and replaces each label with a single number: its occurrence rate in the training data. One column in, one column out, regardless of whether you have 10 categories or 10 million. It's the technique Kaggle grandmasters reach for first when they see a high-cardinality feature paired with a tree-based model, and it takes three lines of pandas to implement.
Throughout this article, we'll use an e-commerce product catalog as our running example, encoding product categories to predict whether items get returned.
The Core Idea Behind Frequency Encoding
Frequency encoding is a feature engineering technique that replaces each categorical value with the proportion (or raw count) of times that value appears in the training dataset. The model stops caring about the identity of a category and starts caring about its prevalence.
Consider an online store with 1,000 orders. "Electronics" appears in 204 of them, so every "Electronics" row becomes 0.204. "Handmade" appears in 10 orders, so it becomes 0.010. The rare category gets a small number; the common one gets a large number. That single float encodes something genuinely useful: popularity, market share, rarity. Kaggle's feature engineering guide lists count/frequency encoding among the top techniques for tabular competitions.
Key Insight: Tree-based models like XGBoost and LightGBM split on numerical thresholds. After frequency encoding, a split on category_freq < 0.05 cleanly separates rare categories from common ones. That's often exactly the signal the model needs, because rare categories behave differently from popular ones in fraud detection, recommendation systems, and demand forecasting.
Four encoding methods compared by output dimensions and tradeoffs
Count Encoding vs. Normalized Frequency Encoding
There are two flavors:
| Variant | Output | Formula | When to prefer |
|---|---|---|---|
| Count encoding | Raw integer | Count(c) | Absolute volume matters (e.g., total sales) |
| Frequency encoding | Float in [0, 1] | Count(c) / N | Relative prevalence matters; dataset size varies between train/test |
Most practitioners default to normalized frequency because the values stay interpretable across different dataset sizes. If your training set has 100K rows and your test set has 20K rows, raw counts from training would be on a completely different scale than counts computed (incorrectly) on test data. Normalized frequencies avoid that confusion.
The Mathematics of Frequency Encoding
For a categorical variable with unique categories c_1, c_2, …, c_k, the frequency encoding for category c_i is:

f(c_i) = count(c_i) / N

Where:
- f(c_i) is the encoded value assigned to every row containing category c_i
- count(c_i) is the number of rows in the training set where the category equals c_i
- N is the total number of rows in the training set
- k is the number of distinct categories (irrelevant to the formula but important for understanding collision risk)
In Plain English: If "Electronics" shows up in 204 out of 1,000 product orders, its frequency-encoded value is 204/1000 = 0.204. Every single row that originally said "Electronics" now says 0.204. The model sees a continuous number instead of a string, and that number tells it "this is a popular category."
Step-by-step frequency encoding pipeline from raw category to numeric feature
The raw count variant simply drops the denominator: f_count(c_i) = count(c_i).
Both formulas are monotonically related (dividing by a constant preserves rank order), so tree-based models produce identical splits with either variant. Linear models, however, behave differently because the coefficient magnitudes change. The scikit-learn documentation on preprocessing covers the full range of encoding and scaling transformations.
Building a Frequency Encoding Pipeline in Python
The implementation is straightforward with pandas. The critical rule: compute frequencies from the training set only, then apply that mapping to both training and test data. Computing frequencies on the test set introduces distribution shift; computing on the full dataset before splitting leaks test information into training.
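The original notebook isn't reproduced here, so the following is a minimal sketch. It assumes a synthetic catalog generated with numpy whose category probabilities roughly match the skew shown in the output below; exact counts depend on the random seed and will differ slightly from the printed values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical synthetic catalog; probabilities approximate the skew
# in the expected output (exact counts depend on the seed).
categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports",
              "Toys", "Automotive", "Pet Supplies", "Jewelry", "Grocery"]
probs = np.array([0.20, 0.14, 0.12, 0.10, 0.09, 0.05, 0.05, 0.04, 0.04, 0.03])
probs = probs / probs.sum()  # normalize so probabilities sum to 1

df = pd.DataFrame({
    "product_category": rng.choice(categories, size=1000, p=probs),
    "price": rng.uniform(5, 500, size=1000).round(2),
})
print("Dataset shape:", df.shape)
print("Top 10 categories by count:")
print(df["product_category"].value_counts().head(10))

# The actual encoding: one map, one lookup per column.
freq_map = df["product_category"].value_counts(normalize=True)
df["category_freq"] = df["product_category"].map(freq_map)
df["category_count"] = df["product_category"].map(df["product_category"].value_counts())

print(df[["product_category", "category_freq", "category_count"]].head())
```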
Expected Output:
Dataset shape: (1000, 2)
Top 10 categories by count:
product_category
Electronics 204
Clothing 141
Books 116
Home & Garden 98
Sports 87
Toys 54
Automotive 47
Pet Supplies 42
Jewelry 37
Grocery 29
Name: count, dtype: int64
Frequency encoding sample:
product_category category_freq category_count
Electronics 0.204 204
Clothing 0.141 141
Books 0.116 116
Home & Garden 0.098 98
Sports 0.087 87
Toys 0.054 54
Automotive 0.047 47
Pet Supplies 0.042 42
Notice the skewed distribution: "Electronics" is 20x more common than "Handmade." That skew is exactly what frequency encoding captures and what makes it informative for tree splits.
Avoiding Data Leakage with Train/Test Splits
The most common mistake with frequency encoding is computing frequencies on the entire dataset before splitting. This leaks test-set distribution information into training, inflating validation scores and producing models that underperform in production. The correct approach mirrors what you'd do with target encoding or any fitted transformer: fit on training, transform both.
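As a sketch of the fit-on-train, transform-both pattern, here is a deliberately tiny example (the article's run used an 800/200 split; the sizes and the hypothetical "Vintage" category here are illustrative only):

```python
import pandas as pd

# Hypothetical data: "Vintage" never appears in the training rows.
train = pd.DataFrame({"product_category":
    ["Electronics"] * 16 + ["Clothing"] * 12 + ["Books"] * 8 + ["Sports"] * 4})
test = pd.DataFrame({"product_category":
    ["Electronics", "Books", "Vintage", "Clothing"]})

# Fit on train only: the frequency map is the single fitted "parameter".
freq_map = train["product_category"].value_counts(normalize=True).to_dict()
train["cat_freq"] = train["product_category"].map(freq_map)

# Transform test with the SAME map; unseen categories become NaN -> fill with 0.
test["cat_freq"] = test["product_category"].map(freq_map).fillna(0)

unseen = set(test["product_category"]) - set(freq_map)
print("Unseen categories in test set:", len(unseen))
print(test[test["cat_freq"] == 0])
```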
Expected Output:
Unseen categories in test set: 1
Training set size: 800
Test set size: 200
Training frequency map (top 5):
Electronics: 0.2025
Clothing: 0.1400
Books: 0.1138
Home & Garden: 0.0975
Sports: 0.0825
Test set unseen category filled with 0:
product_category cat_freq
Vintage 0.0
Common Pitfall: Never recompute frequencies on the test set independently. If "Electronics" makes up 20% of training but 30% of a particular test batch, the training-derived 0.20 is the correct value to use for test rows. Recomputing on test data creates a distribution mismatch that your model was never trained to handle.
For a deeper look at why this separation matters, see The Science of Data Splitting.
Handling Unseen Categories
In production, new categories appear constantly: a new product line launches, a new city enters your delivery network. Categories that never appeared during training have no entry in the frequency map, so map() returns NaN.
Three strategies for handling this:
- Fill with 0 — treats unseen categories as maximally rare. Good default for tree-based models where rarity is the signal.
- Fill with the global mean frequency — assumes the unseen category is "average." Better for linear models where 0 might be extreme.
- Fill with a small epsilon (e.g., 1/N) — avoids exact zero while still flagging rarity.
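All three strategies are one-liners with pandas. The map, training size, and the unseen "Vintage" category below are hypothetical:

```python
import numpy as np
import pandas as pd

freq_map = {"Electronics": 0.20, "Clothing": 0.15, "Books": 0.10}  # fitted on train
n_train = 1000                                                     # training rows
new = pd.Series(["Electronics", "Vintage"])   # "Vintage" was never seen in training

encoded = new.map(freq_map)                   # unseen category -> NaN
fill_zero = encoded.fillna(0)                                 # maximally rare
fill_mean = encoded.fillna(np.mean(list(freq_map.values())))  # "average" category
fill_eps = encoded.fillna(1 / n_train)                        # smallest observable freq
```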
The Collision Problem and How to Fix It
The biggest limitation of frequency encoding is collisions: when two or more categories share the same count, they map to the same encoded value. The model can no longer distinguish between them.
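Detecting collisions is a matter of grouping categories by their encoded value. This sketch uses a hypothetical frequency map with deliberate ties matching the output below, then breaks them with tiny seeded noise:

```python
from collections import defaultdict

import numpy as np

# Hypothetical frequency map with deliberate ties.
freq_map = {"Electronics": 0.20, "Clothing": 0.15, "Books": 0.15,
            "Home & Garden": 0.10, "Sports": 0.10, "Toys": 0.08,
            "Automotive": 0.08, "Jewelry": 0.05, "Pet Supplies": 0.05,
            "Office": 0.04}

# Group categories by (rounded) frequency; groups of size > 1 are collisions.
groups = defaultdict(list)
for cat, freq in freq_map.items():
    groups[round(freq, 6)].append(cat)
collisions = {f: cats for f, cats in groups.items() if len(cats) > 1}
print("Collision groups:", collisions)

# One fix: add tiny seeded uniform noise so tied categories separate
# while the overall ordering is essentially preserved.
rng = np.random.default_rng(42)
noisy_map = {cat: freq + rng.uniform(0, 1e-3) for cat, freq in freq_map.items()}
```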
Expected Output:
Frequency map:
Electronics: 0.2000
Clothing: 0.1500
Books: 0.1500
Home & Garden: 0.1000
Sports: 0.1000
Toys: 0.0800
Automotive: 0.0800
Jewelry: 0.0500
Pet Supplies: 0.0500
Office: 0.0400
Collisions detected: 8 categories share frequencies
Collision groups:
Frequency 0.1500: ['Clothing', 'Books']
Frequency 0.1000: ['Home & Garden', 'Sports']
Frequency 0.0800: ['Toys', 'Automotive']
Frequency 0.0500: ['Jewelry', 'Pet Supplies']
The model cannot distinguish between categories in the same group.
Solution: add a secondary feature or small noise to break ties.
After adding noise (Clothing vs Books, both originally 0.1500):
Clothing samples: [0.150642, 0.150084, 0.150162]
Books samples: [0.150503, 0.150856, 0.150659]
Collision problem diagram showing two categories mapping to the same value with three fix strategies
Pro Tip: In practice, collisions matter most when colliding categories have very different relationships with the target. If "Clothing" and "Books" have similar return rates, the collision is harmless. If one has a 40% return rate and the other 5%, you've lost critical signal. Always check collision pairs against your target before deciding whether to fix them.
Collision-Breaking Strategies
| Strategy | Pros | Cons |
|---|---|---|
| Add uniform noise | Simple, breaks all ties | Non-deterministic unless seeded |
| Combine with target encoding | Maximum information | Requires cross-validation to avoid leakage |
| Add rank as secondary feature | Deterministic, no leakage | Adds a column |
| Log-transform counts | Compresses large counts | Doesn't break exact ties |
Frequency Encoding vs. One-Hot, Label, and Target Encoding
Choosing the right encoding method depends on cardinality, model type, and the relationship between category and target.
| Criterion | Frequency Encoding | One-Hot Encoding | Label Encoding | Target Encoding |
|---|---|---|---|---|
| Output dimensions | 1 column | k columns | 1 column | 1 column |
| Information captured | Category prevalence | Category identity | Arbitrary integer | Target relationship |
| Leakage risk | None | None | None | High (needs CV) |
| Best model type | Trees | Linear models | Ordinal data, trees | Any (with regularization) |
| Cardinality limit | Unlimited | ~50 categories max | Unlimited | Unlimited |
| Main weakness | Collisions | Memory explosion | Implies false order | Target leakage |
For low-cardinality nominal features (under 15 unique values), one-hot encoding preserves full category identity and works well with linear models. For ordinal features (Small/Medium/Large), label encoding is the right call. For high-cardinality features where category prevalence correlates with the target, frequency encoding is the clear winner. When you need maximum predictive power and can handle the complexity of cross-validated fitting, target encoding extracts the most signal.
Dimensionality and Model Performance Comparison
Let's measure the actual impact on feature count and accuracy. We'll build a Random Forest on our e-commerce dataset using three approaches: no category information, frequency encoding, and one-hot encoding.
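The benchmark can be sketched as follows. The dataset here is hypothetical (skewed categories whose return rate depends on the category), so the accuracies you get will differ from the numbers printed below, though the feature counts match:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n, k = 2000, 25

# Skewed category distribution, as in the article's dataset.
p = 1.0 / np.arange(1, k + 1)
p = p / p.sum()
cat = rng.choice(k, size=n, p=p)

# Return probability depends on the category, so the category carries signal.
rate = rng.uniform(0.2, 0.8, size=k)
y = (rng.uniform(size=n) < rate[cat]).astype(int)

X_base = pd.DataFrame({"price": rng.uniform(5, 500, size=n),
                       "quantity": rng.integers(1, 5, size=n)})

# Frequency encoding adds a single dense column.
cat_s = pd.Series(cat)
X_freq = X_base.assign(cat_freq=cat_s.map(cat_s.value_counts(normalize=True)).to_numpy())

# One-hot adds one column per observed category.
X_ohe = pd.concat([X_base, pd.get_dummies(cat_s, prefix="cat")], axis=1)

model = RandomForestClassifier(n_estimators=100, random_state=0)
for name, X in [("baseline", X_base), ("frequency", X_freq), ("one-hot", X_ohe)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {X.shape[1]} features, acc {scores.mean():.4f} +/- {scores.std():.4f}")
```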
Expected Output:
Unique categories: 25
Dataset size: 2000 rows
Feature dimensions:
No encoding (baseline): 2 features
Frequency encoding: 3 features
One-hot encoding: 27 features
5-Fold Cross-Validation Accuracy:
No encoding (baseline): 0.5995 +/- 0.0087
Frequency encoding: 0.6310 +/- 0.0077
One-hot encoding: 0.6200 +/- 0.0089
Frequency encoding matches one-hot with 24 fewer features.
The result is striking: frequency encoding actually outperforms one-hot here, with 24 fewer features. That's not always the case, but with 25 categories and only 2,000 rows, one-hot encoding creates sparse columns where each binary feature has limited data to learn from. Frequency encoding concentrates the category signal into a single dense column that the tree splits on efficiently.
Key Insight: Frequency encoding tends to outperform one-hot when the ratio of samples to categories is low. With 2,000 rows and 25 categories, each one-hot column averages just 80 positive examples. The frequency column gives every row a meaningful value from the full training set.
When to Use Frequency Encoding
Frequency encoding excels in specific scenarios. Knowing when to reach for it (and when not to) separates competent feature engineering from guesswork.
Decision tree for choosing frequency encoding based on cardinality and distribution
Use frequency encoding when:
- High cardinality (50+ unique categories). Zip codes, product IDs, seller IDs, IP addresses, user agents. One-hot encoding is impractical at this scale.
- Tree-based models are your primary algorithm. XGBoost, LightGBM, CatBoost, and Random Forest all handle frequency-encoded features well because they split on thresholds.
- Category prevalence correlates with the target. In fraud detection, a credit card used 10,000 times behaves differently from one used once. In e-commerce, high-volume sellers have different return rates than boutique shops.
- You need a leakage-free encoding. Unlike target encoding, frequency encoding never looks at the label. It's safe to compute without cross-validation tricks.
- Memory and speed matter. At scale (millions of rows, thousands of categories), frequency encoding is O(1) in output dimensionality while one-hot is O(k).
Do NOT use frequency encoding when:
- The distribution is uniform. If all categories appear equally often, every row gets the same encoded value. You've turned a potentially informative feature into a constant. This happens with evenly distributed IDs or experimental conditions.
- Category identity matters more than prevalence. "Red" vs "Blue" as a product color might matter for purchase decisions, but both colors could appear equally often. One-hot encoding preserves identity; frequency encoding destroys it.
- Linear models are your target. Linear regression and logistic regression assume a monotonic relationship between features and target. "More common = higher price" is rarely true in practice. Trees handle non-monotonic patterns naturally, but linear models cannot.
- Collisions would destroy critical signal. If many categories share the same count and their target distributions differ dramatically, frequency encoding loses too much information.
Pro Tip: A quick diagnostic before encoding: compute the Spearman rank correlation between category frequency and the mean target value per category. If the correlation is clearly nonzero, frequency encoding will likely add signal. If it is close to zero, consider target encoding or embeddings instead.
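The diagnostic fits in a few lines with pandas' built-in Spearman correlation. The data here is hypothetical, constructed so that return rate rises with category popularity:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical frame: skewed categories whose return rate rises with popularity.
cats = rng.choice(list("ABCDE"), size=1000, p=[0.40, 0.25, 0.18, 0.12, 0.05])
df = pd.DataFrame({"category": cats})
rate = {"A": 0.50, "B": 0.40, "C": 0.30, "D": 0.20, "E": 0.10}
df["returned"] = (rng.uniform(size=1000) < df["category"].map(rate)).astype(int)

freq = df["category"].value_counts(normalize=True)        # per-category frequency
mean_target = df.groupby("category")["returned"].mean()   # per-category target mean
rho = freq.reindex(mean_target.index).corr(mean_target, method="spearman")
print(f"Spearman rho between frequency and mean target: {rho:.2f}")
```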
Production Considerations
Computational Complexity
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Building the frequency map | O(N), where k = unique categories | O(k) |
| Encoding a single row | O(1) hash lookup | O(1) |
| Encoding the full dataset | O(N) | O(N) for the output column |
Compare this to one-hot encoding, which requires O(N × k) space for the output matrix. For a dataset with 10M rows and 100K categories, that's the difference between a single column of floats (~80 MB) and a sparse matrix requiring specialized storage.
Scikit-Learn Integration
For production pipelines, wrap frequency encoding in a custom transformer compatible with sklearn.pipeline.Pipeline. This ensures the frequency map is fitted once on training data and applied consistently to new data.
```python
from sklearn.base import BaseEstimator, TransformerMixin


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, fill_value=0):
        self.columns = columns
        self.fill_value = fill_value

    def fit(self, X, y=None):
        # Fitted state lives in freq_maps_, set here per sklearn convention
        # (attributes ending in "_" should be created in fit, not __init__).
        cols = self.columns
        if cols is None:
            cols = X.select_dtypes(include=["object", "category"]).columns
        self.freq_maps_ = {
            col: X[col].value_counts(normalize=True).to_dict() for col in cols
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, freq_map in self.freq_maps_.items():
            # Unseen categories map to NaN and are filled with fill_value.
            X[col] = X[col].map(freq_map).fillna(self.fill_value)
        return X
```
This pattern works with Pipeline, GridSearchCV, and cross-validation without leaking information, because fit() is only called on the training fold. The category_encoders library on PyPI provides a production-ready CountEncoder and OrdinalEncoder that follow this same API. For hyperparameter tuning pipelines, this distinction matters enormously.
Handling Distribution Drift
In production, category distributions shift over time. A product category that was rare during training might become popular after a marketing campaign. Two strategies:
- Periodic retraining with fresh frequency maps. Standard for batch prediction systems.
- Exponential moving average on streaming data. Weight recent observations more heavily than old ones. This keeps the encoding responsive to trends without requiring full retraining.
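The moving-average strategy can be sketched as a simple blend of the current map with each incoming batch's frequencies. The function name and alpha value here are illustrative, not from a specific library:

```python
def ema_update(freq_map, batch_counts, alpha=0.1):
    """Blend the current frequency map with a new batch's frequencies.

    alpha controls responsiveness: higher alpha forgets the past faster.
    """
    total = sum(batch_counts.values())
    batch_freq = {c: n / total for c, n in batch_counts.items()}
    cats = set(freq_map) | set(batch_freq)
    return {c: (1 - alpha) * freq_map.get(c, 0.0) + alpha * batch_freq.get(c, 0.0)
            for c in cats}


# A category that was rare in training drifts upward as new batches arrive:
# 0.9 * 0.40 + 0.1 * 0.80 = 0.44, moving toward the new rate of 0.80.
freq_map = {"Electronics": 0.60, "Handmade": 0.40}
freq_map = ema_update(freq_map, {"Electronics": 20, "Handmade": 80})
print(freq_map)
```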
Conclusion
Frequency encoding converts high-cardinality categorical variables into a single numeric column that captures how common each category is. It's fast (O(N) to build, O(1) per lookup), memory-efficient (O(1) output dimensionality), and leakage-free (no target information involved). For tree-based models facing features with hundreds or thousands of unique values, it's often the first encoding to try.
The technique isn't a silver bullet. Collisions reduce its effectiveness when many categories share identical counts, and it fails entirely on uniformly distributed features where every category appears the same number of times. In those cases, target encoding with cross-validation or learned embeddings will extract more signal. And for features where the category's identity matters more than its prevalence, one-hot encoding remains the right choice, as long as cardinality stays manageable.
The best feature engineering pipelines rarely use a single encoding method. Combine frequency encoding for your high-cardinality columns with one-hot for low-cardinality ones, and consider stacking both frequency and target-encoded columns as separate features for maximum signal extraction. For a systematic approach to deciding which features to keep, see Feature Selection.
Frequently Asked Interview Questions
Q: What is frequency encoding and how does it differ from one-hot encoding?
Frequency encoding replaces each category with the fraction of rows it occupies in the training set, producing a single numeric column regardless of cardinality. One-hot encoding creates one binary column per unique category, which preserves identity but scales linearly with the number of categories. Frequency encoding is preferred for high-cardinality features (thousands of unique values) because it avoids the memory and sparsity problems of one-hot.
Q: What is the collision problem in frequency encoding, and how would you solve it?
Collisions occur when two or more categories have identical counts, causing them to map to the same encoded value. The model can no longer distinguish between them. Common fixes include adding small random noise to break ties, combining frequency encoding with a secondary feature like rank or target encoding, or using interaction features. The severity depends on whether the colliding categories have different relationships with the target.
Q: Why must frequency maps be computed from the training set only?
Computing frequencies on the full dataset before splitting leaks information about the test distribution into training features. If a category's proportion differs between train and test, the model would have seen the test-influenced proportion during training. This inflates validation metrics and degrades real-world performance. The encoding should follow the same fit-on-train, transform-both pattern as any scikit-learn transformer.
Q: When would frequency encoding fail as a feature engineering strategy?
Frequency encoding fails when category distributions are uniform (all categories equally common), because every row gets the same value and the feature becomes constant. It also fails when category identity matters more than prevalence, for example when product colors affect purchase behavior independently of how often each color appears. Linear models also struggle with frequency-encoded features because the relationship between "how common something is" and the target is rarely linear.
Q: Your training data has 500 unique cities, but production data includes 50 new cities never seen in training. How would you handle this?
Map unseen cities to a fill value. The safest default for tree-based models is 0 (treating unseen categories as maximally rare). For linear models, use the mean training frequency to avoid extreme values. A more sophisticated approach is to maintain an "unknown" bucket during training by grouping rare categories (those below a threshold like 0.1% frequency) into a single "Other" category, so the model has already learned how to handle rare items.
Q: How does frequency encoding compare to target encoding for high-cardinality features?
Frequency encoding captures category prevalence without any target information, making it leakage-free and safe to compute without cross-validation. Target encoding captures the direct relationship between category and target, which is more informative but requires careful regularization (smoothing, cross-validated fitting) to avoid overfitting. In practice, target encoding typically outperforms frequency encoding when implemented correctly, but frequency encoding is simpler, faster, and a strong baseline. Many competition winners use both as separate features.
Q: How would you integrate frequency encoding into a scikit-learn pipeline for production?
Build a custom transformer inheriting from BaseEstimator and TransformerMixin that stores frequency maps during fit() and applies them during transform(), filling unseen categories with a configurable default. Place it inside a Pipeline so that cross-validation and grid search automatically fit the encoder only on training folds. Serialize the full pipeline (including fitted frequency maps) with joblib for deployment.
Hands-On Practice
See how Frequency Encoding tames high-cardinality features! We'll compare it against One-Hot Encoding and show why it's the go-to for tree-based models.
Dataset: ML Fundamentals (Loan Approval) We'll create a high-cardinality feature to demonstrate the technique.
Try this: Change bins=50 to bins=100 when creating income_bracket to see how One-Hot encoding explodes while Frequency Encoding stays efficient!