Categorical Encoding: A Practical Guide to One-Hot, Label, and Target Methods

LDS Team
Let's Data Science

Machine learning models are mathematical functions. They multiply inputs by weights, compute gradients, and minimize loss functions. None of that works when an input column contains the string "BMW" instead of a number. Run model.fit() on raw categorical data and scikit-learn raises a ValueError before training begins.

The fix is categorical encoding: converting text categories into numerical representations that preserve the information models need. But the conversion is not neutral. Assign BMW=1, Toyota=2, Honda=3 and a linear model treats Honda as three times BMW. That false arithmetic relationship degrades predictions in ways that never surface as an explicit error.

This guide covers every major encoding strategy, from simple ordinal mapping to CatBoost's ordered target statistics, using a single car sales dataset so you can compare each transformation side by side.

The car sales dataset

Every encoding in this guide transforms the same five-row dataset. Keeping one consistent example makes it easy to see exactly what each method does to the data.

python
import pandas as pd

df = pd.DataFrame({
    "brand":     ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "color":     ["red", "blue", "black", "blue", "red"],
    "fuel_type": ["gasoline", "diesel", "electric", "diesel", "gasoline"],
    "condition": ["new", "used", "certified", "new", "certified"],
    "price":     [45000, 28000, 32000, 48000, 26000]
})
print(df)
text
    brand  color  fuel_type  condition  price
0     BMW    red   gasoline        new  45000
1  Toyota   blue     diesel       used  28000
2   Honda  black   electric  certified  32000
3     BMW   blue     diesel        new  48000
4  Toyota    red   gasoline  certified  26000

The condition column has a natural order (used < certified < new). The brand, color, and fuel_type columns do not. This distinction drives every encoding decision that follows.

Why models need numbers, not strings

A linear regression predicts price as a weighted sum:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \ldots + w_n x_n + b$$

Each $x_i$ must be a number so the model can multiply it by $w_i$ and compute partial derivatives during gradient descent. Strings have no multiplication operator, no gradient, and no distance metric. Neural networks, SVMs, logistic regression, and k-nearest neighbors all share this constraint: the input matrix must be numeric.

Tree-based models (decision trees, random forests, XGBoost) can in principle split on arbitrary category labels, but scikit-learn's implementations still require numeric input. CatBoost accepts raw string categories directly, and LightGBM handles pandas category-dtype columns natively.

Label and ordinal encoding

Label encoding maps each unique category to an integer. Scikit-learn's LabelEncoder assigns integers alphabetically, while OrdinalEncoder lets you specify the exact order.

When order matters

The condition column has a meaningful rank: used < certified < new. Encoding this as 0, 1, 2 preserves that relationship. A model can learn that higher values correlate with higher prices.

python
from sklearn.preprocessing import OrdinalEncoder

# Define the order explicitly
oe = OrdinalEncoder(categories=[["used", "certified", "new"]])
df["condition_encoded"] = oe.fit_transform(df[["condition"]]).astype(int)
print(df[["condition", "condition_encoded"]])
text
   condition  condition_encoded
0        new                  2
1       used                  0
2  certified                  1
3        new                  2
4  certified                  1

The mapping respects the real-world ordering: used=0, certified=1, new=2.

When order destroys signal

Apply the same technique to brand and the model sees BMW=0 < Honda=1 < Toyota=2. It may compute (BMW + Toyota) / 2 = Honda, a meaningless arithmetic relationship. For linear models, this false ordinality biases coefficients. For distance-based models like KNN, it warps the distance metric so that BMW and Honda appear closer than BMW and Toyota.
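
The false ordering is easy to reproduce. Here is a minimal sketch with scikit-learn's LabelEncoder, which sorts string categories alphabetically before assigning integers:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

brands = pd.Series(["BMW", "Toyota", "Honda", "BMW", "Toyota"])

le = LabelEncoder()
codes = le.fit_transform(brands)

# Categories are sorted alphabetically: BMW=0, Honda=1, Toyota=2
print(dict(zip(le.classes_.tolist(), range(len(le.classes_)))))
# {'BMW': 0, 'Honda': 1, 'Toyota': 2}
print(codes)
# [0 2 1 0 2]
```

Nothing in the output warns that the ordering is artificial; the damage only shows up later as degraded model quality.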

Pro Tip: Use OrdinalEncoder over LabelEncoder for pipeline-compatible workflows. LabelEncoder is designed for target columns (single 1D arrays), while OrdinalEncoder handles multiple feature columns and integrates with scikit-learn's ColumnTransformer.

The mathematics

For a categorical variable $X$ with $k$ distinct values $\{c_1, c_2, \ldots, c_k\}$, ordinal encoding defines a mapping:

$$M(x) = i \quad \text{where } x = c_i \text{ and } i \in \{0, 1, \ldots, k-1\}$$

This creates an implicit metric: $|M(c_i) - M(c_j)|$ becomes a "distance" between categories. That distance is only meaningful when the categories have a genuine ordinal relationship.

One-hot encoding

One-hot encoding creates a binary column for each unique category. A row gets a 1 in the column matching its category and 0 everywhere else. No column is numerically "greater" than another, so the model cannot infer a false ordering.

Transforming the brand column

python
df_onehot = pd.get_dummies(df[["brand"]], columns=["brand"], dtype=int)
print(pd.concat([df[["brand"]], df_onehot], axis=1))
text
    brand  brand_BMW  brand_Honda  brand_Toyota
0     BMW          1            0             0
1  Toyota          0            0             1
2   Honda          0            1             0
3     BMW          1            0             0
4  Toyota          0            0             1

Each brand is now equidistant from every other brand: the Euclidean distance between any two one-hot vectors is always $\sqrt{2}$.
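
A quick numeric check of that claim (a sketch with NumPy):

```python
import numpy as np

bmw = np.array([1, 0, 0])
honda = np.array([0, 1, 0])
toyota = np.array([0, 0, 1])

# Any two distinct one-hot vectors differ in exactly two positions,
# so their Euclidean distance is always sqrt(1 + 1) = sqrt(2)
print(np.linalg.norm(bmw - honda))    # 1.4142... == sqrt(2)
print(np.linalg.norm(bmw - toyota))   # 1.4142... == sqrt(2)
```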

The dummy variable trap

If you know brand_BMW=0 and brand_Honda=0, then brand_Toyota must be 1. The third column is a perfect linear combination of the first two. In linear regression, this perfect multicollinearity makes the design matrix $X^TX$ singular, so the normal equation $(X^TX)^{-1}X^Ty$ has no unique solution.
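
The singularity can be verified numerically. A sketch with NumPy, using the three brand columns from the one-hot table above plus an intercept column:

```python
import numpy as np

# One-hot rows for BMW, Toyota, Honda, BMW, Toyota
X = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)

# Prepend the intercept column of ones used by linear regression
X_design = np.hstack([np.ones((5, 1)), X])

# The three brand columns sum to the intercept column, so one column
# is redundant: the matrix has 4 columns but rank only 3
print(np.linalg.matrix_rank(X_design))               # 3
print(np.linalg.matrix_rank(X_design.T @ X_design))  # 3, so X^T X is singular
```

Dropping any one brand column restores full rank.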

The fix: drop one column. The dropped category becomes the "reference" that the model implicitly represents when all remaining columns are zero.

python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop="first", sparse_output=False)
encoded = ohe.fit_transform(df[["brand"]])
columns = ohe.get_feature_names_out(["brand"])
print(pd.DataFrame(encoded.astype(int), columns=columns))
text
   brand_Honda  brand_Toyota
0            0             0
1            0             1
2            1             0
3            0             0
4            0             1

BMW is the implicit reference category (all zeros).

Pro Tip: Tree-based models (random forests, gradient boosting) are not affected by multicollinearity. Drop a column only when using linear regression, logistic regression, or neural networks. Keeping all columns gives tree models more clean split points.

Sparse matrices and high cardinality

One-hot encoding a column with $k$ categories adds $k$ (or $k-1$) columns to the feature matrix. For a zip_code column with 40,000 unique values, that means 40,000 new columns where each row has exactly one non-zero entry. Scikit-learn's OneHotEncoder returns a scipy.sparse.csr_matrix by default (sparse_output=True), which stores only the non-zero values. This reduces memory from $O(n \times k)$ to $O(n)$.

python
ohe_sparse = OneHotEncoder(sparse_output=True)
sparse_matrix = ohe_sparse.fit_transform(df[["brand"]])
print(f"Shape: {sparse_matrix.shape}")
print(f"Stored values: {sparse_matrix.nnz} (instead of {sparse_matrix.shape[0] * sparse_matrix.shape[1]})")
text
Shape: (5, 3)
Stored values: 5 (instead of 15)

Even with sparse storage, one-hot encoding columns with thousands of categories produces extremely wide feature matrices that slow down training and invite overfitting. When cardinality exceeds roughly 15-20 categories, consider frequency encoding, target encoding, or binary encoding instead.

Frequency encoding

Frequency encoding replaces each category with its relative frequency (proportion of rows) in the training set. Categories that appear often get higher values, and the encoding naturally captures the distribution of the data.

python
freq_map = df["brand"].value_counts(normalize=True)
df["brand_freq"] = df["brand"].map(freq_map)
print(df[["brand", "brand_freq"]])
text
    brand  brand_freq
0     BMW         0.4
1  Toyota         0.4
2   Honda         0.2
3     BMW         0.4
4  Toyota         0.4

Strengths and limitations

Frequency encoding produces a single numeric column regardless of cardinality, so it handles 40,000 zip codes as easily as 3 brands. It requires no target variable, which means no leakage risk.

The main limitation is collision: categories with the same frequency get the same encoded value. In this dataset, BMW and Toyota both appear twice, so they both encode to 0.4. The model cannot distinguish between them. For datasets where frequency collisions are common, combine frequency encoding with another method (such as hashing) or use target encoding instead.
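
One illustrative way to break such ties is to pair the frequency column with a second, deterministic feature. The sketch below uses an md5-based hash (Python's built-in hash() is salted per process, so it is not reproducible across runs); the bucket count of 1000 is an arbitrary illustrative choice:

```python
import hashlib

import pandas as pd

df = pd.DataFrame({"brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"]})

# Frequency alone collides: BMW and Toyota both map to 0.4
df["brand_freq"] = df["brand"].map(df["brand"].value_counts(normalize=True))

def stable_hash(category: str, buckets: int = 1000) -> int:
    """Deterministic hash bucket for a category string."""
    return int(hashlib.md5(category.encode()).hexdigest(), 16) % buckets

df["brand_hash"] = df["brand"].map(stable_hash)
print(df.drop_duplicates("brand"))
```

The hash column carries no predictive meaning on its own; it only gives the model a way to tell tied categories apart.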

Target encoding (mean encoding)

Target encoding replaces each category with the mean of the target variable for rows in that category. Unlike one-hot or frequency encoding, it produces a single column that directly captures the predictive relationship between the feature and the target.

Basic target encoding on the car dataset

python
brand_means = df.groupby("brand")["price"].mean()
df["brand_target"] = df["brand"].map(brand_means)
print(df[["brand", "price", "brand_target"]])
text
    brand  price  brand_target
0     BMW  45000       46500.0
1  Toyota  28000       27000.0
2   Honda  32000       32000.0
3     BMW  48000       46500.0
4  Toyota  26000       27000.0

BMW's two sales averaged $46,500, Toyota's averaged $27,000, and Honda's single sale gives exactly $32,000.

The target leakage problem

The calculation above uses all rows to compute category means, then applies those means back to the same rows. This is data leakage: the encoding for row 0 was influenced by row 0's own target value. During training, the model sees information it should not have, inflating apparent accuracy. At inference time on genuinely unseen data, that advantage vanishes and performance drops.

The leakage is most severe for rare categories. Honda appears only once, so its target encoding equals its exact price. The model memorizes rather than generalizes.

Smoothing (regularization)

Smoothing blends the category-specific mean with the global mean, pulling rare categories toward the population average:

$$S_i = \lambda(n_i) \cdot \bar{y}_i + (1 - \lambda(n_i)) \cdot \bar{y}_{\text{global}}$$

where $\lambda(n_i)$ is a weight that increases with the number of samples $n_i$ for category $i$. A common sigmoid form is:

$$\lambda(n) = \frac{1}{1 + e^{-(n - k) / f}}$$

When $n$ is large, $\lambda \approx 1$ and the category mean dominates. When $n$ is small, $\lambda \approx 0$ and the global mean dominates. The parameters $k$ (midpoint) and $f$ (steepness) control how quickly the transition happens.
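
The formula is short enough to implement by hand. A sketch with pandas; the midpoint k=1 and steepness f=1 are illustrative choices, not recommended defaults:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "price": [45000, 28000, 32000, 48000, 26000],
})

global_mean = df["price"].mean()                      # 35800.0
stats = df.groupby("brand")["price"].agg(["mean", "count"])

k, f = 1.0, 1.0                                       # illustrative midpoint and steepness
lam = 1 / (1 + np.exp(-(stats["count"] - k) / f))     # sigmoid weight per category
smoothed = lam * stats["mean"] + (1 - lam) * global_mean

df["brand_smoothed"] = df["brand"].map(smoothed)
print(df[["brand", "brand_smoothed"]].drop_duplicates())
```

Honda (a single sale, so $\lambda = 0.5$) lands exactly halfway between its own price and the global mean at 33900, while BMW and Toyota, with two sales each, stay closer to their category means.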

Proper implementation with category_encoders

The category_encoders library applies smoothing automatically and integrates with scikit-learn pipelines:

python
# pip install category_encoders
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split

# Split first to prevent leakage
X_train, X_test, y_train, y_test = train_test_split(
    df[["brand"]], df["price"], test_size=0.4, random_state=42
)

encoder = TargetEncoder(cols=["brand"], smoothing=10.0)
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)

print("Training set:")
print(pd.concat([X_train.reset_index(drop=True), X_train_encoded.add_suffix("_encoded").reset_index(drop=True)], axis=1))

Pro Tip: Scikit-learn 1.3+ includes a native sklearn.preprocessing.TargetEncoder that uses internal cross-fitting (cv=5 by default) to prevent leakage during fit_transform. Use it when you want everything in a single scikit-learn pipeline without the category_encoders dependency.

Scikit-learn's native TargetEncoder

python
from sklearn.preprocessing import TargetEncoder as SklearnTargetEncoder

sk_encoder = SklearnTargetEncoder(smooth=10.0, cv=5, random_state=42)
# fit_transform uses cross-fitting internally to prevent leakage
df["brand_target_sk"] = sk_encoder.fit_transform(
    df[["brand"]], df["price"]
)

The key difference: calling fit_transform applies cross-validation internally so that each fold's encoding is computed without seeing that fold's targets. Calling fit followed by transform separately uses the full training set means (appropriate for encoding a held-out test set).

Binary encoding

Binary encoding converts each category to its integer index, then represents that integer as a binary number across multiple columns. A column with 8 categories needs only 3 binary columns ($\lceil \log_2 8 \rceil = 3$) instead of 8 one-hot columns.

python
from category_encoders import BinaryEncoder

be = BinaryEncoder(cols=["brand"])
df_binary = be.fit_transform(df[["brand"]])
print(pd.concat([df[["brand"]], df_binary], axis=1))
text
    brand  brand_0  brand_1  brand_2
0     BMW        0        0        1
1  Toyota        0        1        0
2   Honda        0        1        1
3     BMW        0        0        1
4  Toyota        0        1        0

BMW encodes as binary 001, Toyota as 010, Honda as 011. The dimensionality grows as $O(\log_2 k)$ instead of $O(k)$, making binary encoding practical for columns with hundreds of categories where one-hot would be prohibitively wide. The trade-off is that the binary digits create artificial proximity between categories whose binary representations differ by a single bit (Honda 011 appears "close" to BMW 001 and Toyota 010), which may or may not reflect reality.
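
The artificial proximity is visible in the Hamming distances between the 3-bit codes (a quick sketch using the encodings above):

```python
# 3-bit codes assigned by the binary encoder above
codes = {"BMW": "001", "Toyota": "010", "Honda": "011"}

def hamming(a: str, b: str) -> int:
    """Number of bit positions where the two codes differ."""
    return sum(x != y for x, y in zip(a, b))

print(hamming(codes["Honda"], codes["BMW"]))     # 1 bit apart
print(hamming(codes["Honda"], codes["Toyota"]))  # 1 bit apart
print(hamming(codes["BMW"], codes["Toyota"]))    # 2 bits apart
```

Nothing about the cars justifies Honda sitting "between" BMW and Toyota; the geometry is an artifact of the index order.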

Leave-one-out encoding

Leave-one-out (LOO) encoding is a variant of target encoding that excludes the current row's target value when computing the category mean. This reduces the self-influence that causes overfitting in naive target encoding.

For row $i$ belonging to category $c$:

$$\text{LOO}_i = \frac{\sum_{j \in c,\, j \neq i} y_j}{n_c - 1}$$
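
The formula can be computed directly from group statistics (a sketch with pandas; the singleton Honda row falls back to the global mean):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "price": [45000, 28000, 32000, 48000, 26000],
})

grp = df.groupby("brand")["price"]
total = grp.transform("sum")     # per row: sum of the target in that row's category
count = grp.transform("count")   # per row: number of rows in that category

# Subtract the row's own target before averaging; singleton categories
# would divide by zero, so fall back to the global mean for them
loo = (total - df["price"]) / (count - 1)
loo = loo.where(count > 1, df["price"].mean())

print(loo.tolist())   # [48000.0, 26000.0, 35800.0, 45000.0, 28000.0]
```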

python
from category_encoders import LeaveOneOutEncoder

loo = LeaveOneOutEncoder(cols=["brand"], random_state=42)
df["brand_loo"] = loo.fit_transform(df[["brand"]], df["price"])["brand"]
print(df[["brand", "price", "brand_loo"]])
text
    brand  price  brand_loo
0     BMW  45000    48000.0
1  Toyota  28000    26000.0
2   Honda  32000    35800.0
3     BMW  48000    45000.0
4  Toyota  26000    28000.0

Row 0 (BMW, price=45000) gets 48000: the mean of all other BMW prices (just row 3). Honda has only one row, so its LOO value falls back to the global mean (35800).

The sigma parameter adds Gaussian noise during training to further reduce overfitting. At transform time (inference), no noise is added.

Pro Tip: LOO encoding still uses target information and carries leakage risk. Always fit on training data only and call transform (not fit_transform) on test data.

CatBoost encoding (ordered target statistics)

CatBoost encoding solves the target leakage problem by processing rows sequentially. For row $k$, the encoding uses only the target values of rows $1$ through $k-1$:

$$\text{CatBoost}_k = \frac{\sum_{j=1}^{k-1} [x_j = x_k] \cdot y_j + \text{prior}}{\sum_{j=1}^{k-1} [x_j = x_k] + 1}$$

The prior (typically the global mean) ensures the formula produces a valid output even for the first occurrence of a category.

python
from category_encoders import CatBoostEncoder

cbe = CatBoostEncoder(cols=["brand"], random_state=42)
df["brand_catboost"] = cbe.fit_transform(df[["brand"]], df["price"])["brand"]
print(df[["brand", "price", "brand_catboost"]])
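
To make the sequential logic concrete, here is a hand-rolled single-pass version of the formula (a sketch: one permutation in the original row order, with the prior set to the global mean at weight 1; category_encoders' exact defaults may differ):

```python
brands = ["BMW", "Toyota", "Honda", "BMW", "Toyota"]
prices = [45000, 28000, 32000, 48000, 26000]
prior = sum(prices) / len(prices)   # global mean, 35800.0

sums, counts, encoded = {}, {}, []
for brand, price in zip(brands, prices):
    # Encode using only the rows seen so far, plus the prior
    s, n = sums.get(brand, 0), counts.get(brand, 0)
    encoded.append((s + prior) / (n + 1))
    # Only afterwards fold the current row into the running statistics
    sums[brand] = s + price
    counts[brand] = n + 1

print(encoded)   # [35800.0, 35800.0, 35800.0, 40400.0, 31900.0]
```

First occurrences always receive the prior, and later occurrences blend in the accumulated history, so no row's encoding ever depends on its own target.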

Because each row only sees preceding data, there is no leakage by construction. Multiple random permutations of the training data can be averaged to reduce variance, which is exactly what the CatBoost library does internally during training.

CatBoost encoding is particularly effective when paired with gradient-boosted tree models, and it handles both low and high cardinality features without manual tuning.

Handling unseen categories at inference time

Production models inevitably encounter categories that did not appear in the training data. A model trained on brand values BMW, Toyota, and Honda will fail if a test row contains "Ford".

Each encoder handles this differently:

| Encoder | handle_unknown options | Default behavior |
|---|---|---|
| OrdinalEncoder | "error" or "use_encoded_value" | Raises an error; set unknown_value=-1 to assign a sentinel |
| OneHotEncoder | "error", "ignore", or "infrequent_if_exist" | Raises an error; "ignore" produces an all-zeros row |
| TargetEncoder (sklearn) | Built-in | Maps unseen categories to the global target mean |
| TargetEncoder (category_encoders) | Built-in | Maps unseen categories to the prior (global mean) |

python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(df[["brand"]])

# New data with unseen category "Ford"
new_data = pd.DataFrame({"brand": ["Ford", "BMW"]})
encoded_new = ohe.transform(new_data)
print(pd.DataFrame(encoded_new.astype(int), columns=ohe.get_feature_names_out()))
text
   brand_BMW  brand_Honda  brand_Toyota
0          0            0             0
1          1            0             0

Ford produces an all-zeros row: the model treats it as "none of the known brands." This is often a reasonable default but can degrade predictions when unseen categories are frequent. For high-cardinality features in production, consider:

  1. A fallback "OTHER" category trained on rare categories from the training set
  2. Target encoding, which naturally maps unseen categories to the global mean
  3. Periodic retraining to incorporate new categories
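
The first option can be sketched in a few lines of pandas (the rare-category threshold of 2 is an arbitrary illustrative choice):

```python
import pandas as pd

train = pd.Series(["BMW", "Toyota", "Honda", "BMW", "Toyota"])

# Keep only categories seen at least `min_count` times during training
min_count = 2
counts = train.value_counts()
keep = set(counts[counts >= min_count].index)

def collapse(s: pd.Series) -> pd.Series:
    """Map rare or unseen categories to a shared OTHER bucket."""
    return s.where(s.isin(keep), "OTHER")

print(collapse(train).tolist())                       # rare Honda becomes OTHER
print(collapse(pd.Series(["Ford", "BMW"])).tolist())  # unseen Ford becomes OTHER
```

Because OTHER exists in the training data, downstream encoders learn a sensible value for it, and any unseen production category inherits that value.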

Choosing the right encoding

The encoding decision depends on three factors: whether the feature is ordinal, how many unique values it has (cardinality), and which model family you are using.

Decision flowchart:

  1. Is the feature ordinal? (has a meaningful rank like low/medium/high)

    • Yes: Use ordinal encoding with explicitly specified category order.
    • No: Continue to step 2.
  2. How many unique categories?

    • Low cardinality (under 15): Use one-hot encoding. Drop one column for linear models; keep all columns for tree-based models.
    • Medium cardinality (15-100): Use binary encoding or frequency encoding. Binary keeps dimensionality at $\lceil \log_2 k \rceil$; frequency produces a single column.
    • High cardinality (100+): Use target encoding (with smoothing and proper cross-validation) or CatBoost encoding. These produce a single column regardless of cardinality.
  3. Which model are you using?

    • Linear models: Avoid label encoding on nominal features (false ordinality). Use one-hot with drop="first", or target encoding.
    • Tree-based models: Can use ordinal encoding even for nominal features because trees split on thresholds and do not assume linear relationships. However, one-hot encoding with high cardinality creates sparse splits that reduce tree efficiency.
    • Neural networks: One-hot or target encoding. For very high cardinality, consider entity embeddings (learned during training).

| Scenario | Recommended encoding | Reason |
|---|---|---|
| condition (3 ordered values) | Ordinal | Preserves natural rank |
| color (5 nominal values) | One-hot | Low cardinality, no false ordering |
| brand (20 nominal values) | Binary or frequency | Moderate cardinality, compact representation |
| zip_code (40,000 values) | Target or CatBoost | Single column, captures predictive signal |
| user_id (millions of values) | Target with heavy smoothing, or hashing | Extreme cardinality, most categories are rare |

Conclusion

Categorical encoding is not a one-size-fits-all preprocessing step. Ordinal encoding preserves rank for ordered features but invents false hierarchies for nominal ones. One-hot encoding eliminates ordinality bias at the cost of dimensionality. Target encoding compresses high-cardinality features into a single predictive column but demands careful regularization to avoid leaking target information into the training set. Binary encoding and CatBoost encoding offer middle-ground solutions that balance dimensionality, leakage risk, and predictive power.

The choice always comes back to three questions: does order matter, how many categories exist, and what model consumes the features. Answer those, and the encoding method follows directly.

To see how encoding fits into a broader preprocessing pipeline, read the Feature Engineering Guide. For numeric feature preparation after encoding, the Standardization vs Normalization guide covers when to scale and when to normalize. And for a deeper treatment of frequency-based approaches with high-cardinality data, see Frequency Encoding.

Hands-On Practice

See why Label Encoding nominal data is dangerous. We'll encode the same categorical feature two ways and watch how it affects model performance.

Dataset: ML Fundamentals (Loan Approval) We'll compare Label vs One-Hot encoding on a nominal categorical feature.

ML Fundamentals: Loan approval data with features for classification and regression tasks

Try this: Change cat_col to 'education' - notice education IS ordinal (has natural order), so Label Encoding makes more sense there!