Mastering Frequency Encoding: The Simple Fix for High-Cardinality Data

LDS Team
Let's Data Science

Your e-commerce dataset has a product_category column with 25 unique values. Not a problem. But the seller_id column? 180,000 unique sellers. Apply one-hot encoding, and your feature matrix explodes from a handful of columns to 180,000 sparse binary vectors. Your laptop freezes. Your cloud bill spikes. Your model trains slower and often performs worse because the signal drowns in a sea of zeroes.

Frequency encoding sidesteps this entirely. Instead of asking "which category is this?", it asks "how common is this category?" and replaces each label with a single number: its occurrence rate in the training data. One column in, one column out, regardless of whether you have 10 categories or 10 million. It's the technique Kaggle grandmasters reach for first when they see a high-cardinality feature paired with a tree-based model, and it takes three lines of pandas to implement.

Throughout this article, we'll use an e-commerce product catalog as our running example, encoding product categories to predict whether items get returned.

The Core Idea Behind Frequency Encoding

Frequency encoding is a feature engineering technique that replaces each categorical value with the proportion (or raw count) of times that value appears in the training dataset. The model stops caring about the identity of a category and starts caring about its prevalence.

Consider an online store with 1,000 orders. "Electronics" appears in 204 of them, so every "Electronics" row becomes 0.204. "Handmade" appears in 10 orders, so it becomes 0.010. The rare category gets a small number; the common one gets a large number. That single float encodes something genuinely useful: popularity, market share, rarity. Kaggle's feature engineering guide lists count/frequency encoding among the top techniques for tabular competitions.

Key Insight: Tree-based models like XGBoost and LightGBM split on numerical thresholds. After frequency encoding, a split on category_freq < 0.05 cleanly separates rare categories from common ones. That's often exactly the signal the model needs, because rare categories behave differently from popular ones in fraud detection, recommendation systems, and demand forecasting.

[Figure: Four encoding methods compared by output dimensions and tradeoffs]

Count Encoding vs. Normalized Frequency Encoding

There are two flavors:

| Variant | Output | Formula | When to prefer |
| --- | --- | --- | --- |
| Count encoding | Raw integer | Count(c) | Absolute volume matters (e.g., total sales) |
| Frequency encoding | Float in [0, 1] | Count(c) / N | Relative prevalence matters; dataset size varies between train/test |

Most practitioners default to normalized frequency because the values stay interpretable across different dataset sizes. If your training set has 100K rows and your test set has 20K rows, raw counts from training would be on a completely different scale than counts computed (incorrectly) on test data. Normalized frequencies avoid that confusion.

The Mathematics of Frequency Encoding

For a categorical variable $C$ with $K$ unique categories $c_1, c_2, \ldots, c_K$, the frequency encoding for category $c_i$ is:

$$f(c_i) = \frac{\text{Count}(c_i)}{N}$$

Where:

  • $f(c_i)$ is the encoded value assigned to every row containing category $c_i$
  • $\text{Count}(c_i)$ is the number of rows in the training set where the category equals $c_i$
  • $N$ is the total number of rows in the training set
  • $K$ is the number of distinct categories (irrelevant to the formula but important for understanding collision risk)

In Plain English: If "Electronics" shows up in 204 out of 1,000 product orders, its frequency-encoded value is 204/1000 = 0.204. Every single row that originally said "Electronics" now says 0.204. The model sees a continuous number instead of a string, and that number tells it "this is a popular category."

[Figure: Step-by-step frequency encoding pipeline from raw category to numeric feature]

The raw count variant simply drops the denominator:

$$f_{\text{count}}(c_i) = \text{Count}(c_i)$$

Both formulas are monotonically related (dividing by a constant preserves rank order), so tree-based models produce identical splits with either variant. Linear models, however, behave differently because the coefficient magnitudes change. The scikit-learn documentation on preprocessing covers the full range of encoding and scaling transformations.
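The rank-preservation claim is easy to verify. A tiny sketch with hypothetical data (the series below is illustrative):

```python
import pandas as pd

s = pd.Series(["A", "A", "A", "B", "B", "C"])
counts = s.map(s.value_counts())               # raw counts: 3, 3, 3, 2, 2, 1
freqs = s.map(s.value_counts(normalize=True))  # same values divided by N=6

# Dividing by a constant preserves rank order, so tree splits are identical
assert (counts.rank() == freqs.rank()).all()
print(counts.tolist(), freqs.round(3).tolist())
```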

Building a Frequency Encoding Pipeline in Python

The implementation is straightforward with pandas. The critical rule: compute frequencies from the training set only, then apply that mapping to both training and test data. Computing frequencies on the test set introduces distribution shift; computing on the full dataset before splitting leaks test information into training.
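A minimal sketch of that implementation on synthetic data. The category names follow the running example, but the sampling weights and the `returned` column are assumptions, so exact counts will differ from the output shown below:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article's catalog; popularity weights are
# assumptions chosen to produce a skewed distribution
rng = np.random.default_rng(42)
categories = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports",
              "Toys", "Automotive", "Pet Supplies", "Jewelry", "Grocery",
              "Handmade"]
weights = np.array([20, 14, 12, 10, 9, 5, 5, 4, 4, 3, 1], dtype=float)
df = pd.DataFrame({
    "product_category": rng.choice(categories, size=1000, p=weights / weights.sum()),
    "returned": rng.integers(0, 2, size=1000),
})
print("Dataset shape:", df.shape)

# The core three lines: count, normalize, map back onto the column
freq_map = df["product_category"].value_counts(normalize=True)
df["category_freq"] = df["product_category"].map(freq_map)
df["category_count"] = df["product_category"].map(df["product_category"].value_counts())

print(df["product_category"].value_counts().head(10))
print(df[["product_category", "category_freq", "category_count"]]
      .drop_duplicates()
      .sort_values("category_freq", ascending=False)
      .head(8)
      .to_string(index=False))
```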

Expected Output:

```text
Dataset shape: (1000, 2)

Top 10 categories by count:
product_category
Electronics      204
Clothing         141
Books            116
Home & Garden     98
Sports            87
Toys              54
Automotive        47
Pet Supplies      42
Jewelry           37
Grocery           29
Name: count, dtype: int64

Frequency encoding sample:
product_category  category_freq  category_count
     Electronics          0.204             204
        Clothing          0.141             141
           Books          0.116             116
   Home & Garden          0.098              98
          Sports          0.087              87
            Toys          0.054              54
      Automotive          0.047              47
    Pet Supplies          0.042              42
```

Notice the skewed distribution: "Electronics" is 20x more common than "Handmade." That skew is exactly what frequency encoding captures and what makes it informative for tree splits.

Avoiding Data Leakage with Train/Test Splits

The most common mistake with frequency encoding is computing frequencies on the entire dataset before splitting. This leaks test-set distribution information into training, inflating validation scores and producing models that underperform in production. The correct approach mirrors what you'd do with target encoding or any fitted transformer: fit on training, transform both.
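A sketch of the fit-on-train, transform-both pattern on illustrative data. The category names follow the running example; the single "Vintage" row and the unshuffled split are assumptions made to keep the demo deterministic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
cats = ["Electronics", "Clothing", "Books", "Home & Garden", "Sports"]
df = pd.DataFrame({"product_category": rng.choice(cats, size=1000)})
df.iloc[-1, 0] = "Vintage"  # appears once, in the final rows

# shuffle=False keeps "Vintage" in the test split for this demo;
# use a shuffled split in practice
train, test = train_test_split(df, test_size=0.2, shuffle=False)

# Fit on train only, then transform both splits with the same map
freq_map = train["product_category"].value_counts(normalize=True).to_dict()
train = train.assign(cat_freq=train["product_category"].map(freq_map))
test = test.assign(
    cat_freq=test["product_category"].map(freq_map).fillna(0.0)  # unseen -> 0
)

unseen = set(test["product_category"]) - set(freq_map)
print("Unseen categories in test set:", len(unseen))
print("Training set size:", len(train))
print("Test set size:", len(test))
```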

Expected Output:

```text
Unseen categories in test set: 1
Training set size: 800
Test set size: 200

Training frequency map (top 5):
  Electronics: 0.2025
  Clothing: 0.1400
  Books: 0.1138
  Home & Garden: 0.0975
  Sports: 0.0825

Test set unseen category filled with 0:
product_category  cat_freq
         Vintage       0.0
```

Common Pitfall: Never recompute frequencies on the test set independently. If "Electronics" makes up 20% of training but 30% of a particular test batch, the training-derived 0.20 is the correct value to use for test rows. Recomputing on test data creates a distribution mismatch that your model was never trained to handle.

For a deeper look at why this separation matters, see The Science of Data Splitting.

Handling Unseen Categories

In production, new categories appear constantly: a new product line launches, a new city enters your delivery network. Categories that never appeared during training have no entry in the frequency map, so map() returns NaN.

Three strategies for handling this:

  1. Fill with 0 — treats unseen categories as maximally rare. Good default for tree-based models where rarity is the signal.
  2. Fill with the global mean frequency — assumes the unseen category is "average." Better for linear models where 0 might be extreme.
  3. Fill with a small epsilon (e.g., 1/N) — avoids exact zero while still flagging rarity.
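The three strategies side by side, using a hypothetical training frequency map in which "Vintage" never appeared:

```python
import pandas as pd

# Hypothetical training-set frequencies and training size
freq_map = {"Electronics": 0.20, "Clothing": 0.15, "Books": 0.10}
n_train = 1000
new_data = pd.Series(["Electronics", "Vintage", "Books"])

encoded = new_data.map(freq_map)  # unseen category -> NaN

fill_zero = encoded.fillna(0.0)                                     # strategy 1
fill_mean = encoded.fillna(sum(freq_map.values()) / len(freq_map))  # strategy 2
fill_eps = encoded.fillna(1.0 / n_train)                            # strategy 3

print(pd.DataFrame({"category": new_data, "zero": fill_zero,
                    "mean": fill_mean, "epsilon": fill_eps}))
```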

The Collision Problem and How to Fix It

The biggest limitation of frequency encoding is collisions: when two or more categories share the same count, they map to the same encoded value. The model can no longer distinguish between them.
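A sketch that detects collisions in a hypothetical frequency map engineered so several categories tie exactly, mirroring the report shown below:

```python
import numpy as np
from collections import defaultdict

# Hypothetical frequency map with deliberate ties
freq_map = {"Electronics": 0.20, "Clothing": 0.15, "Books": 0.15,
            "Home & Garden": 0.10, "Sports": 0.10, "Toys": 0.08,
            "Automotive": 0.08, "Jewelry": 0.05, "Pet Supplies": 0.05,
            "Office": 0.04}

# Group categories by encoded value; any group larger than 1 is a collision
groups = defaultdict(list)
for cat, f in freq_map.items():
    groups[f].append(cat)
collisions = {f: cats for f, cats in groups.items() if len(cats) > 1}

print("Collision groups:")
for f, cats in sorted(collisions.items(), reverse=True):
    print(f"  Frequency {f:.4f}: {cats}")

# One tie-breaker: tiny seeded uniform noise added per category (not per row)
rng = np.random.default_rng(42)
noisy_map = {cat: f + rng.uniform(0, 1e-3) for cat, f in freq_map.items()}
print("Ties broken:", len(set(noisy_map.values())) == len(noisy_map))
```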

Expected Output:

```text
Frequency map:
  Electronics: 0.2000
  Clothing: 0.1500
  Books: 0.1500
  Home & Garden: 0.1000
  Sports: 0.1000
  Toys: 0.0800
  Automotive: 0.0800
  Jewelry: 0.0500
  Pet Supplies: 0.0500
  Office: 0.0400

Collisions detected: 8 categories share frequencies

Collision groups:
  Frequency 0.1500: ['Clothing', 'Books']
  Frequency 0.1000: ['Home & Garden', 'Sports']
  Frequency 0.0800: ['Toys', 'Automotive']
  Frequency 0.0500: ['Jewelry', 'Pet Supplies']

The model cannot distinguish between categories in the same group.
Solution: add a secondary feature or small noise to break ties.

After adding noise (Clothing vs Books, both originally 0.1500):
  Clothing samples: [0.150642, 0.150084, 0.150162]
  Books samples:    [0.150503, 0.150856, 0.150659]
```

[Figure: Collision problem diagram showing two categories mapping to the same value with three fix strategies]

Pro Tip: In practice, collisions matter most when colliding categories have very different relationships with the target. If "Clothing" and "Books" have similar return rates, the collision is harmless. If one has a 40% return rate and the other 5%, you've lost critical signal. Always check collision pairs against your target before deciding whether to fix them.

Collision-Breaking Strategies

| Strategy | Pros | Cons |
| --- | --- | --- |
| Add uniform noise | Simple, breaks all ties | Non-deterministic unless seeded |
| Combine with target encoding | Maximum information | Requires cross-validation to avoid leakage |
| Add rank as secondary feature | Deterministic, no leakage | Adds a column |
| Log-transform counts | Compresses large counts | Doesn't break exact ties |

Frequency Encoding vs. One-Hot, Label, and Target Encoding

Choosing the right encoding method depends on cardinality, model type, and the relationship between category and target.

| Criterion | Frequency Encoding | One-Hot Encoding | Label Encoding | Target Encoding |
| --- | --- | --- | --- | --- |
| Output dimensions | 1 column | $k$ columns | 1 column | 1 column |
| Information captured | Category prevalence | Category identity | Arbitrary integer | Target relationship |
| Leakage risk | None | None | None | High (needs CV) |
| Best model type | Trees | Linear models | Ordinal data, trees | Any (with regularization) |
| Cardinality limit | Unlimited | ~50 categories max | Unlimited | Unlimited |
| Main weakness | Collisions | Memory explosion | Implies false order | Target leakage |

For low-cardinality nominal features (under 15 unique values), one-hot encoding preserves full category identity and works well with linear models. For ordinal features (Small/Medium/Large), label encoding is the right call. For high-cardinality features where category prevalence correlates with the target, frequency encoding is the clear winner. When you need maximum predictive power and can handle the complexity of cross-validated fitting, target encoding extracts the most signal.

Dimensionality and Model Performance Comparison

Let's measure the actual impact on feature count and accuracy. We'll build a Random Forest on our e-commerce dataset using three approaches: no category information, frequency encoding, and one-hot encoding.
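A sketch of that experiment on a synthetic stand-in: 2,000 rows, 25 categories, and a return probability partly tied to category prevalence. Column names and effect sizes are assumptions, so the scores will not match the output below exactly:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n, k = 2000, 25
p = np.linspace(3, 1, k)  # skewed category popularity (assumption)
cats = rng.choice([f"cat_{i:02d}" for i in range(k)], size=n, p=p / p.sum())
freq = pd.Series(cats).map(pd.Series(cats).value_counts(normalize=True)).values
# Prevalence-linked target: common categories return more often (assumption)
y = (rng.random(n) < 0.3 + 4.0 * freq).astype(int)

X_base = pd.DataFrame({"price": rng.normal(50, 10, n),
                       "rating": rng.uniform(1, 5, n)})
X_freq = X_base.assign(category_freq=freq)
X_ohe = pd.concat([X_base, pd.get_dummies(pd.Series(cats))], axis=1)

model = RandomForestClassifier(n_estimators=100, random_state=42)
for name, X in [("baseline", X_base), ("frequency", X_freq), ("one-hot", X_ohe)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:>9}: {X.shape[1]:>2} features, "
          f"accuracy {scores.mean():.4f} +/- {scores.std():.4f}")
```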

Expected Output:

```text
Unique categories: 25
Dataset size: 2000 rows

Feature dimensions:
  No encoding (baseline):  2 features
  Frequency encoding:      3 features
  One-hot encoding:        27 features

5-Fold Cross-Validation Accuracy:
  No encoding (baseline):  0.5995 +/- 0.0087
  Frequency encoding:      0.6310 +/- 0.0077
  One-hot encoding:        0.6200 +/- 0.0089

Frequency encoding matches one-hot with 24 fewer features.
```

The result is striking: frequency encoding actually outperforms one-hot here, with 24 fewer features. That's not always the case, but with 25 categories and only 2,000 rows, one-hot encoding creates sparse columns where each binary feature has limited data to learn from. Frequency encoding concentrates the category signal into a single dense column that the tree splits on efficiently.

Key Insight: Frequency encoding tends to outperform one-hot when the ratio of samples to categories is low. With 2,000 rows and 25 categories, each one-hot column averages just 80 positive examples. The frequency column gives every row a meaningful value from the full training set.

When to Use Frequency Encoding

Frequency encoding excels in specific scenarios. Knowing when to reach for it (and when not to) separates competent feature engineering from guesswork.

[Figure: Decision tree for choosing frequency encoding based on cardinality and distribution]

Use frequency encoding when:

  1. High cardinality (50+ unique categories). Zip codes, product IDs, seller IDs, IP addresses, user agents. One-hot encoding is impractical at this scale.
  2. Tree-based models are your primary algorithm. XGBoost, LightGBM, CatBoost, and Random Forest all handle frequency-encoded features well because they split on thresholds.
  3. Category prevalence correlates with the target. In fraud detection, a credit card used 10,000 times behaves differently from one used once. In e-commerce, high-volume sellers have different return rates than boutique shops.
  4. You need a leakage-free encoding. Unlike target encoding, frequency encoding never looks at the label. It's safe to compute without cross-validation tricks.
  5. Memory and speed matter. At scale (millions of rows, thousands of categories), frequency encoding is $O(1)$ in output dimensionality while one-hot is $O(k)$.

Do NOT use frequency encoding when:

  1. The distribution is uniform. If all categories appear equally often, every row gets the same encoded value. You've turned a potentially informative feature into a constant. This happens with evenly distributed IDs or experimental conditions.
  2. Category identity matters more than prevalence. "Red" vs "Blue" as a product color might matter for purchase decisions, but both colors could appear equally often. One-hot encoding preserves identity; frequency encoding destroys it.
  3. Linear models are your target. Linear regression and logistic regression assume a monotonic relationship between features and target. "More common = higher price" is rarely true in practice. Trees handle non-monotonic patterns naturally, but linear models cannot.
  4. Collisions would destroy critical signal. If many categories share the same count and their target distributions differ dramatically, frequency encoding loses too much information.

Pro Tip: A quick diagnostic before encoding: compute the Spearman rank correlation between category frequency and the mean target value per category. If $|\rho_s| > 0.3$, frequency encoding will likely add signal. If $|\rho_s| < 0.1$, consider target encoding or embeddings instead.
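The diagnostic is a few lines. This sketch uses hypothetical data in which return rate is tied to prevalence; the category names, weights, and effect size are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
w = np.linspace(3, 1, 10)  # skewed popularity weights (assumption)
df = pd.DataFrame({"category": rng.choice(list("ABCDEFGHIJ"), size=5000,
                                          p=w / w.sum())})
df["freq"] = df["category"].map(df["category"].value_counts(normalize=True))
# Return probability rises with prevalence (assumption)
df["returned"] = (rng.random(len(df)) < 0.05 + 2.0 * df["freq"]).astype(int)

# The diagnostic: rank correlation between per-category frequency
# and per-category mean target
per_cat = df.groupby("category").agg(freq=("freq", "first"),
                                     target_mean=("returned", "mean"))
rho, _ = spearmanr(per_cat["freq"], per_cat["target_mean"])
print(f"Spearman rho: {rho:.3f}")
```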

Production Considerations

Computational Complexity

| Operation | Time Complexity | Space Complexity |
| --- | --- | --- |
| Building the frequency map | $O(N)$ | $O(K)$ where $K$ = unique categories |
| Encoding a single row | $O(1)$ hash lookup | $O(1)$ |
| Encoding the full dataset | $O(N)$ | $O(N)$ for the output column |

Compare this to one-hot encoding, which requires $O(N \times K)$ space for the output matrix. For a dataset with 10M rows and 100K categories, that's the difference between a single column of floats (~80 MB) and a sparse matrix requiring specialized storage.

Scikit-Learn Integration

For production pipelines, wrap frequency encoding in a custom transformer compatible with sklearn.pipeline.Pipeline. This ensures the frequency map is fitted once on training data and applied consistently to new data.

```python
from sklearn.base import BaseEstimator, TransformerMixin

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Replace categorical columns with their training-set frequencies."""

    def __init__(self, columns=None, fill_value=0):
        self.columns = columns        # columns to encode; None = all columns
        self.fill_value = fill_value  # value assigned to unseen categories

    def fit(self, X, y=None):
        cols = self.columns if self.columns is not None else list(X.columns)
        # Fitted state gets the trailing underscore, per sklearn convention
        self.freq_maps_ = {
            col: X[col].value_counts(normalize=True).to_dict() for col in cols
        }
        return self

    def transform(self, X):
        X = X.copy()  # never mutate the caller's DataFrame
        for col, freq_map in self.freq_maps_.items():
            X[col] = X[col].map(freq_map).fillna(self.fill_value)
        return X
```

This pattern works with Pipeline, GridSearchCV, and cross-validation without leaking information, because fit() is called only on the training fold. That fit-on-fold behavior is exactly what hyperparameter tuning pipelines depend on. The category_encoders library on PyPI provides a production-ready CountEncoder that follows this same API.
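A sketch of the transformer inside a full pipeline. The class is restated in minimal form so the example runs standalone, and the seller data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

class FrequencyEncoder(BaseEstimator, TransformerMixin):
    # Minimal restatement of the transformer above, so this sketch runs alone
    def __init__(self, columns=None, fill_value=0):
        self.columns = columns
        self.fill_value = fill_value

    def fit(self, X, y=None):
        self.freq_maps_ = {c: X[c].value_counts(normalize=True).to_dict()
                           for c in self.columns}
        return self

    def transform(self, X):
        X = X.copy()
        for c in self.columns:
            X[c] = X[c].map(self.freq_maps_[c]).fillna(self.fill_value)
        return X

# Synthetic high-cardinality feature (100 seller IDs) and a random target
rng = np.random.default_rng(0)
X = pd.DataFrame({"seller_id": rng.choice([f"s{i}" for i in range(100)], 2000),
                  "price": rng.normal(50, 10, 2000)})
y = rng.integers(0, 2, 2000)

pipe = Pipeline([
    ("freq", FrequencyEncoder(columns=["seller_id"])),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])
# cross_val_score clones the pipeline and refits the encoder on each
# training fold, so no frequency information leaks across folds
scores = cross_val_score(pipe, X, y, cv=3)
print("CV accuracy per fold:", scores.round(3))
```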

Handling Distribution Drift

In production, category distributions shift over time. A product category that was rare during training might become popular after a marketing campaign. Two strategies:

  1. Periodic retraining with fresh frequency maps. Standard for batch prediction systems.
  2. Exponential moving average on streaming data. Weight recent observations more heavily than old ones. This keeps the encoding responsive to trends without requiring full retraining.
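The moving-average idea can be sketched in a few lines; the decay rate alpha and the partial starting map below are illustrative assumptions, not a production recipe:

```python
# Exponential-moving-average update for a streaming frequency map
alpha = 0.01  # decay rate: higher = faster adaptation (assumption)
freq_map = {"Electronics": 0.20, "Clothing": 0.15}  # fitted on history

def update_freq_map(freq_map, category, alpha):
    """Blend one observed category into every tracked frequency."""
    for cat in list(freq_map):
        observed = 1.0 if cat == category else 0.0
        freq_map[cat] = (1 - alpha) * freq_map[cat] + alpha * observed
    # A brand-new category starts at alpha: (1 - alpha) * 0 + alpha * 1
    freq_map.setdefault(category, alpha)
    return freq_map

for obs in ["Electronics", "Electronics", "Vintage"]:
    update_freq_map(freq_map, obs, alpha)

print({k: round(v, 4) for k, v in freq_map.items()})
```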

Conclusion

Frequency encoding converts high-cardinality categorical variables into a single numeric column that captures how common each category is. It's fast ($O(N)$ to build, $O(1)$ per lookup), memory-efficient ($O(1)$ output dimensionality), and leakage-free (no target information involved). For tree-based models facing features with hundreds or thousands of unique values, it's often the first encoding to try.

The technique isn't a silver bullet. Collisions reduce its effectiveness when many categories share identical counts, and it fails entirely on uniformly distributed features where every category appears the same number of times. In those cases, target encoding with cross-validation or learned embeddings will extract more signal. And for features where the category's identity matters more than its prevalence, one-hot encoding remains the right choice, as long as cardinality stays manageable.

The best feature engineering pipelines rarely use a single encoding method. Combine frequency encoding for your high-cardinality columns with one-hot for low-cardinality ones, and consider stacking both frequency and target-encoded columns as separate features for maximum signal extraction. For a systematic approach to deciding which features to keep, see Feature Selection.

Frequently Asked Interview Questions

Q: What is frequency encoding and how does it differ from one-hot encoding?

Frequency encoding replaces each category with the fraction of rows it occupies in the training set, producing a single numeric column regardless of cardinality. One-hot encoding creates one binary column per unique category, which preserves identity but scales linearly with the number of categories. Frequency encoding is preferred for high-cardinality features (thousands of unique values) because it avoids the memory and sparsity problems of one-hot.

Q: What is the collision problem in frequency encoding, and how would you solve it?

Collisions occur when two or more categories have identical counts, causing them to map to the same encoded value. The model can no longer distinguish between them. Common fixes include adding small random noise to break ties, combining frequency encoding with a secondary feature like rank or target encoding, or using interaction features. The severity depends on whether the colliding categories have different relationships with the target.

Q: Why must frequency maps be computed from the training set only?

Computing frequencies on the full dataset before splitting leaks information about the test distribution into training features. If a category's proportion differs between train and test, the model would have seen the test-influenced proportion during training. This inflates validation metrics and degrades real-world performance. The encoding should follow the same fit-on-train, transform-both pattern as any scikit-learn transformer.

Q: When would frequency encoding fail as a feature engineering strategy?

Frequency encoding fails when category distributions are uniform (all categories equally common), because every row gets the same value and the feature becomes constant. It also fails when category identity matters more than prevalence, for example when product colors affect purchase behavior independently of how often each color appears. Linear models also struggle with frequency-encoded features because the relationship between "how common something is" and the target is rarely linear.

Q: Your training data has 500 unique cities, but production data includes 50 new cities never seen in training. How would you handle this?

Map unseen cities to a fill value. The safest default for tree-based models is 0 (treating unseen categories as maximally rare). For linear models, use the mean training frequency to avoid extreme values. A more sophisticated approach is to maintain an "unknown" bucket during training by grouping rare categories (those below a threshold like 0.1% frequency) into a single "Other" category, so the model has already learned how to handle rare items.

Q: How does frequency encoding compare to target encoding for high-cardinality features?

Frequency encoding captures category prevalence without any target information, making it leakage-free and safe to compute without cross-validation. Target encoding captures the direct relationship between category and target, which is more informative but requires careful regularization (smoothing, cross-validated fitting) to avoid overfitting. In practice, target encoding typically outperforms frequency encoding when implemented correctly, but frequency encoding is simpler, faster, and a strong baseline. Many competition winners use both as separate features.

Q: How would you integrate frequency encoding into a scikit-learn pipeline for production?

Build a custom transformer inheriting from BaseEstimator and TransformerMixin that stores frequency maps during fit() and applies them during transform(), filling unseen categories with a configurable default. Place it inside a Pipeline so that cross-validation and grid search automatically fit the encoder only on training folds. Serialize the full pipeline (including fitted frequency maps) with joblib for deployment.

Hands-On Practice

See how Frequency Encoding tames high-cardinality features! We'll compare it against One-Hot Encoding and show why it's the go-to for tree-based models.

Dataset: ML Fundamentals (Loan Approval) We'll create a high-cardinality feature to demonstrate the technique.

Try this: Change bins=50 to bins=100 when creating income_bracket to see how One-Hot encoding explodes while Frequency Encoding stays efficient!