Machine learning models are mathematical functions. They multiply inputs by weights, compute gradients, and minimize loss functions. None of that works when an input column contains the string "BMW" instead of a number. Run model.fit() on raw categorical data and scikit-learn raises a ValueError before training begins.
The fix is categorical encoding: converting text categories into numerical representations that preserve the information models need. But the conversion is not neutral. Assign BMW=1, Toyota=2, Honda=3 and a linear model treats Honda as three times BMW. That false arithmetic relationship degrades predictions in ways that never surface as an explicit error.
This guide covers every major encoding strategy, from simple ordinal mapping to CatBoost's ordered target statistics, using a single car sales dataset so you can compare each transformation side by side.
The car sales dataset
Every encoding in this guide transforms the same five-row dataset. Keeping one consistent example makes it easy to see exactly what each method does to the data.
import pandas as pd
df = pd.DataFrame({
"brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
"color": ["red", "blue", "black", "blue", "red"],
"fuel_type": ["gasoline", "diesel", "electric", "diesel", "gasoline"],
"condition": ["new", "used", "certified", "new", "certified"],
"price": [45000, 28000, 32000, 48000, 26000]
})
print(df)
brand color fuel_type condition price
0 BMW red gasoline new 45000
1 Toyota blue diesel used 28000
2 Honda black electric certified 32000
3 BMW blue diesel new 48000
4 Toyota red gasoline certified 26000
The condition column has a natural order (used < certified < new). The brand, color, and fuel_type columns do not. This distinction drives every encoding decision that follows.
Why models need numbers, not strings
A linear regression predicts price as a weighted sum:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
Each $x_i$ must be a number so the model can multiply it by $w_i$ and compute partial derivatives during gradient descent. Strings have no multiplication operator, no gradient, and no distance metric. Neural networks, SVMs, logistic regression, and k-nearest neighbors all share this constraint: the input matrix must be numeric.
Tree-based models (decision trees, random forests, XGBoost) can technically split on arbitrary category labels, but scikit-learn's implementations still require numeric input. Only CatBoost and LightGBM handle raw string categories natively.
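This constraint is easy to confirm directly. The sketch below (a minimal, self-contained example) fits scikit-learn's LinearRegression on a raw string column and catches the ValueError that results:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda"],
    "price": [45000, 28000, 32000],
})

# scikit-learn validates the input matrix before training and
# rejects the string column with a ValueError
try:
    LinearRegression().fit(df[["brand"]], df["price"])
except ValueError as e:
    print(f"ValueError: {e}")
```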
Label and ordinal encoding
Label encoding maps each unique category to an integer. Scikit-learn's LabelEncoder assigns integers alphabetically, while OrdinalEncoder lets you specify the exact order.
When order matters
The condition column has a meaningful rank: used < certified < new. Encoding this as 0, 1, 2 preserves that relationship. A model can learn that higher values correlate with higher prices.
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
oe = OrdinalEncoder(categories=[["used", "certified", "new"]])
df["condition_encoded"] = oe.fit_transform(df[["condition"]]).astype(int)
print(df[["condition", "condition_encoded"]])
condition condition_encoded
0 new 2
1 used 0
2 certified 1
3 new 2
4 certified 1
The mapping respects the real-world ordering: used=0, certified=1, new=2.
When order destroys signal
Apply the same technique to brand and the model sees BMW=0 < Honda=1 < Toyota=2. It may compute (BMW + Toyota) / 2 = Honda, a meaningless arithmetic relationship. For linear models, this false ordinality biases coefficients. For distance-based models like KNN, it warps the distance metric so that BMW and Honda appear closer than BMW and Toyota.
Pro Tip: Use OrdinalEncoder over LabelEncoder for pipeline-compatible workflows. LabelEncoder is designed for target columns (single 1D arrays), while OrdinalEncoder handles multiple feature columns and integrates with scikit-learn's ColumnTransformer.
The mathematics
For a categorical variable $X$ with $k$ distinct values $\{c_1, c_2, \ldots, c_k\}$, ordinal encoding defines a mapping:

$$M(c_i) = i - 1, \quad i = 1, \ldots, k$$
This creates an implicit metric: $|M(c_i) - M(c_j)|$ becomes a "distance" between categories. That distance is only meaningful when the categories have a genuine ordinal relationship.
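The mapping can be written out by hand with a plain dict, which makes the implicit distances visible. A minimal sketch mirroring the OrdinalEncoder result above:

```python
import pandas as pd

condition = pd.Series(["new", "used", "certified", "new", "certified"])

# Explicit mapping M: category -> rank
M = {"used": 0, "certified": 1, "new": 2}
encoded = condition.map(M)
print(encoded.tolist())  # [2, 0, 1, 2, 1]

# The induced "distance" |M(c_i) - M(c_j)| between used and new:
print(abs(M["used"] - M["new"]))  # 2
```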
One-hot encoding
One-hot encoding creates a binary column for each unique category. A row gets a 1 in the column matching its category and 0 everywhere else. No column is numerically "greater" than another, so the model cannot infer a false ordering.
Transforming the brand column
df_onehot = pd.get_dummies(df[["brand"]], columns=["brand"], dtype=int)
print(pd.concat([df[["brand"]], df_onehot], axis=1))
brand brand_BMW brand_Honda brand_Toyota
0 BMW 1 0 0
1 Toyota 0 0 1
2 Honda 0 1 0
3 BMW 1 0 0
4 Toyota 0 0 1
Each brand is now equidistant from every other brand: the Euclidean distance between any two one-hot vectors is always $\sqrt{2}$.
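The equidistance claim is easy to verify numerically (a quick sketch using NumPy):

```python
import numpy as np

bmw    = np.array([1, 0, 0])
honda  = np.array([0, 1, 0])
toyota = np.array([0, 0, 1])

# Any two distinct one-hot vectors differ in exactly two positions,
# so each pairwise Euclidean distance is sqrt(1^2 + 1^2) = sqrt(2)
for a, b in [(bmw, honda), (bmw, toyota), (honda, toyota)]:
    print(np.linalg.norm(a - b))  # 1.4142135623730951
```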
The dummy variable trap
If you know brand_BMW=0 and brand_Honda=0, then brand_Toyota must be 1. The third column is a perfect linear combination of the first two. In linear regression, this perfect multicollinearity makes the design matrix $X^TX$ singular, so the normal equation $(X^TX)^{-1}X^Ty$ has no unique solution.
The fix: drop one column. The dropped category becomes the "reference" that the model implicitly represents when all remaining columns are zero.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop="first", sparse_output=False)
encoded = ohe.fit_transform(df[["brand"]])
columns = ohe.get_feature_names_out(["brand"])
print(pd.DataFrame(encoded.astype(int), columns=columns))
brand_Honda brand_Toyota
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
BMW is the implicit reference category (all zeros).
Pro Tip: Tree-based models (random forests, gradient boosting) are not affected by multicollinearity. Drop a column only when using linear regression, logistic regression, or neural networks. Keeping all columns gives tree models more clean split points.
Sparse matrices and high cardinality
One-hot encoding a column with $k$ categories adds $k$ (or $k-1$) columns to the feature matrix. For a zip_code column with 40,000 unique values, that means 40,000 new columns where each row has exactly one non-zero entry. Scikit-learn's OneHotEncoder returns a scipy.sparse.csr_matrix by default (sparse_output=True), which stores only the non-zero values. This reduces memory from $O(n \times k)$ to $O(n)$.
ohe_sparse = OneHotEncoder(sparse_output=True)
sparse_matrix = ohe_sparse.fit_transform(df[["brand"]])
print(f"Shape: {sparse_matrix.shape}")
print(f"Stored values: {sparse_matrix.nnz} (instead of {sparse_matrix.shape[0] * sparse_matrix.shape[1]})")
Shape: (5, 3)
Stored values: 5 (instead of 15)
Even with sparse storage, one-hot encoding columns with thousands of categories produces extremely wide feature matrices that slow down training and invite overfitting. When cardinality exceeds roughly 15-20 categories, consider frequency encoding, target encoding, or binary encoding instead.
Frequency encoding
Frequency encoding replaces each category with its relative frequency (proportion of rows) in the training set. Categories that appear often get higher values, and the encoding naturally captures the distribution of the data.
freq_map = df["brand"].value_counts(normalize=True)
df["brand_freq"] = df["brand"].map(freq_map)
print(df[["brand", "brand_freq"]])
brand brand_freq
0 BMW 0.4
1 Toyota 0.4
2 Honda 0.2
3 BMW 0.4
4 Toyota 0.4
Strengths and limitations
Frequency encoding produces a single numeric column regardless of cardinality, so it handles 40,000 zip codes as easily as 3 brands. It requires no target variable, which means no leakage risk.
The main limitation is collision: categories with the same frequency get the same encoded value. In this dataset, BMW and Toyota both appear twice, so they both encode to 0.4. The model cannot distinguish between them. For datasets where frequency collisions are common, combine frequency encoding with another method (such as hashing) or use target encoding instead.
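One way to break such ties (a sketch, not a standard library feature) is to pair the frequency column with a second deterministic column. Here, a hypothetical stable_hash helper buckets an MD5 digest into [0, 1); hashlib is used instead of Python's built-in hash() so values stay stable across interpreter runs:

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"]})

# Frequency column: BMW and Toyota collide at 0.4
df["brand_freq"] = df["brand"].map(df["brand"].value_counts(normalize=True))

def stable_hash(category: str, buckets: int = 1000) -> float:
    """Deterministic hash of a category, scaled into [0, 1)."""
    digest = hashlib.md5(category.encode()).hexdigest()
    return int(digest, 16) % buckets / buckets

# Companion column that distinguishes categories with equal frequencies
df["brand_hash"] = df["brand"].map(stable_hash)
print(df.drop_duplicates("brand"))
```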
Target encoding (mean encoding)
Target encoding replaces each category with the mean of the target variable for rows in that category. Unlike one-hot or frequency encoding, it produces a single column that directly captures the predictive relationship between the feature and the target.
Basic target encoding on the car dataset
brand_means = df.groupby("brand")["price"].mean()
df["brand_target"] = df["brand"].map(brand_means)
print(df[["brand", "price", "brand_target"]])
brand price brand_target
0 BMW 45000 46500.0
1 Toyota 28000 27000.0
2 Honda 32000 32000.0
3 BMW 48000 46500.0
4 Toyota 26000 27000.0
BMW's two sales averaged $46,500, Toyota's averaged $27,000, and Honda's single sale gives exactly $32,000.
The target leakage problem
The calculation above uses all rows to compute category means, then applies those means back to the same rows. This is data leakage: the encoding for row 0 was influenced by row 0's own target value. During training, the model sees information it should not have, inflating apparent accuracy. At inference time on genuinely unseen data, that advantage vanishes and performance drops.
The leakage is most severe for rare categories. Honda appears only once, so its target encoding equals its exact price. The model memorizes rather than generalizes.
Smoothing (regularization)
Smoothing blends the category-specific mean with the global mean, pulling rare categories toward the population average:

$$\hat{x}_i = \lambda(n_i)\,\bar{y}_i + \bigl(1 - \lambda(n_i)\bigr)\,\bar{y}$$

where $\bar{y}_i$ is the mean target for category $i$, $\bar{y}$ is the global target mean, and $\lambda(n_i)$ is a weight that increases with the number of samples $n_i$ for category $i$. A common sigmoid form is:

$$\lambda(n) = \frac{1}{1 + e^{-(n - k)/f}}$$
When $n$ is large, $\lambda \approx 1$ and the category mean dominates. When $n$ is small, $\lambda \approx 0$ and the global mean dominates. The parameters $k$ (midpoint) and $f$ (steepness) control how quickly the transition happens.
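The formula translates directly into a few lines of pandas. A sketch with $k=1$ and $f=1$ (arbitrary illustrative values), applied to the car dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "price": [45000, 28000, 32000, 48000, 26000],
})

k, f = 1.0, 1.0                   # sigmoid midpoint and steepness
global_mean = df["price"].mean()  # 35800.0

# Per-category mean and count, then the sigmoid blend weight
stats = df.groupby("brand")["price"].agg(["mean", "count"])
lam = 1 / (1 + np.exp(-(stats["count"] - k) / f))
smoothed = lam * stats["mean"] + (1 - lam) * global_mean

df["brand_smooth"] = df["brand"].map(smoothed)
print(df[["brand", "brand_smooth"]])
```

Honda (one sample, $\lambda = 0.5$) lands exactly halfway between its own mean and the global mean, while BMW and Toyota (two samples each) stay closer to their category means.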
Proper implementation with category_encoders
The category_encoders library applies smoothing automatically and integrates with scikit-learn pipelines:
# pip install category_encoders
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
# Split first to prevent leakage
X_train, X_test, y_train, y_test = train_test_split(
df[["brand"]], df["price"], test_size=0.4, random_state=42
)
encoder = TargetEncoder(cols=["brand"], smoothing=10.0)
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_test_encoded = encoder.transform(X_test)
print("Training set:")
print(pd.concat([X_train.reset_index(drop=True), X_train_encoded.add_suffix("_encoded").reset_index(drop=True)], axis=1))
Pro Tip: Scikit-learn 1.3+ includes a native sklearn.preprocessing.TargetEncoder that uses internal cross-fitting (cv=5 by default) to prevent leakage during fit_transform. Use it when you want everything in a single scikit-learn pipeline without the category_encoders dependency.
Scikit-learn's native TargetEncoder
from sklearn.preprocessing import TargetEncoder as SklearnTargetEncoder
sk_encoder = SklearnTargetEncoder(smooth=10.0, cv=5, random_state=42)
# fit_transform uses cross-fitting internally to prevent leakage
df["brand_target_sk"] = sk_encoder.fit_transform(
df[["brand"]], df["price"]
)
The key difference: calling fit_transform applies cross-validation internally so that each fold's encoding is computed without seeing that fold's targets. Calling fit followed by transform separately uses the full training set means (appropriate for encoding a held-out test set).
Binary encoding
Binary encoding converts each category to its integer index, then represents that integer as a binary number across multiple columns. A column with 8 categories needs only 3 binary columns ($\lceil \log_2 8 \rceil = 3$) instead of 8 one-hot columns.
from category_encoders import BinaryEncoder
be = BinaryEncoder(cols=["brand"])
df_binary = be.fit_transform(df[["brand"]])
print(pd.concat([df[["brand"]], df_binary], axis=1))
brand brand_0 brand_1 brand_2
0 BMW 0 0 1
1 Toyota 0 1 0
2 Honda 0 1 1
3 BMW 0 0 1
4 Toyota 0 1 0
BMW encodes as binary 001, Toyota as 010, Honda as 011. The dimensionality grows as $O(\log_2 k)$ instead of $O(k)$, making binary encoding practical for columns with hundreds of categories where one-hot would be prohibitively wide. The trade-off is that the binary digits create artificial proximity between categories whose binary representations differ by a single bit (Honda 011 appears "close" to BMW 001 and Toyota 010), which may or may not reflect reality.
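Under the hood the transformation is just integer indexing plus bit extraction. A dependency-free sketch (note: this version starts indices at 0, while category_encoders starts at 1, which is why its output above uses three columns):

```python
import math
import pandas as pd

brands = pd.Series(["BMW", "Toyota", "Honda", "BMW", "Toyota"])

# Step 1: map each category to an integer index (order of first appearance)
codes, categories = pd.factorize(brands)  # BMW=0, Toyota=1, Honda=2

# Step 2: spread each index across ceil(log2(k)) binary digit columns
n_bits = max(1, math.ceil(math.log2(len(categories))))
binary = pd.DataFrame({
    f"brand_bit{b}": (codes >> b) & 1
    for b in reversed(range(n_bits))
})
print(pd.concat([brands.rename("brand"), binary], axis=1))
```

Three categories need only $\lceil \log_2 3 \rceil = 2$ bit columns here: BMW becomes 00, Toyota 01, Honda 10.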
Leave-one-out encoding
Leave-one-out (LOO) encoding is a variant of target encoding that excludes the current row's target value when computing the category mean. This reduces the self-influence that causes overfitting in naive target encoding.
For row $i$ belonging to category $c$ with $n_c$ rows:

$$\text{LOO}(x_i) = \frac{\sum_{j \in c,\; j \neq i} y_j}{n_c - 1}$$
from category_encoders import LeaveOneOutEncoder
loo = LeaveOneOutEncoder(cols=["brand"], random_state=42)
df["brand_loo"] = loo.fit_transform(df[["brand"]], df["price"])["brand"]
print(df[["brand", "price", "brand_loo"]])
brand price brand_loo
0 BMW 45000 48000.0
1 Toyota 28000 26000.0
2 Honda 32000 35800.0
3 BMW 48000 45000.0
4 Toyota 26000 28000.0
Row 0 (BMW, price=45000) gets 48000: the mean of all other BMW prices (just row 3). Honda has only one row, so its LOO value falls back to the global mean (35800).
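The LOO arithmetic can be reproduced directly with group sums (a sketch matching the output above; singleton categories divide by zero and fall back to the global mean):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "price": [45000, 28000, 32000, 48000, 26000],
})

# For each row: (category sum minus own target) / (category count minus 1)
g = df.groupby("brand")["price"]
loo = (g.transform("sum") - df["price"]) / (g.transform("count") - 1)

# Honda's 0/0 produces NaN; replace it with the global mean (35800)
df["brand_loo"] = loo.replace([np.inf, -np.inf], np.nan).fillna(df["price"].mean())
print(df[["brand", "price", "brand_loo"]])
```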
The sigma parameter adds Gaussian noise during training to further reduce overfitting. At transform time (inference), no noise is added.
Pro Tip: LOO encoding still uses target information and carries leakage risk. Always fit on training data only and call transform (not fit_transform) on test data.
CatBoost encoding (ordered target statistics)
CatBoost encoding solves the target leakage problem by processing rows sequentially. For row $k$, the encoding uses only the target values of rows $1$ through $k-1$:

$$\text{enc}(x_k) = \frac{\sum_{j=1}^{k-1} \mathbb{1}[x_j = x_k]\, y_j + a \cdot P}{\sum_{j=1}^{k-1} \mathbb{1}[x_j = x_k] + a}$$

where $P$ is the prior and $a > 0$ is the prior weight.
The prior (typically the global mean) ensures the formula produces a valid output even for the first occurrence of a category.
from category_encoders import CatBoostEncoder
cbe = CatBoostEncoder(cols=["brand"], random_state=42)
df["brand_catboost"] = cbe.fit_transform(df[["brand"]], df["price"])["brand"]
print(df[["brand", "price", "brand_catboost"]])
Because each row only sees preceding data, there is no leakage by construction. Multiple random permutations of the training data can be averaged to reduce variance, which is exactly what the CatBoost library does internally during training.
CatBoost encoding is particularly effective when paired with gradient-boosted tree models, and it handles both low and high cardinality features without manual tuning.
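A stripped-down sketch of ordered target statistics with prior weight $a = 1$, using cumulative group sums (the real CatBoostEncoder also averages over random permutations and differs in other details):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["BMW", "Toyota", "Honda", "BMW", "Toyota"],
    "price": [45000, 28000, 32000, 48000, 26000],
})

prior = df["price"].mean()  # 35800.0, the global mean
a = 1.0                     # prior weight

# For each row: sum and count of *earlier* targets in the same category
g = df.groupby("brand")["price"]
prev_sum = g.cumsum() - df["price"]
prev_count = g.cumcount()

df["brand_ordered"] = (prev_sum + a * prior) / (prev_count + a)
print(df[["brand", "price", "brand_ordered"]])
```

First occurrences of each category (rows 0-2) get exactly the prior, 35800. Row 3's second BMW blends the prior with row 0's price: (45000 + 35800) / 2 = 40400.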
Handling unseen categories at inference time
Production models inevitably encounter categories that did not appear in the training data. A model trained on brand values BMW, Toyota, and Honda will fail if a test row contains "Ford".
Each encoder handles this differently:
| Encoder | handle_unknown options | Default behavior |
|---|---|---|
| OrdinalEncoder | "error" or "use_encoded_value" | Raises error; set unknown_value=-1 to assign a sentinel |
| OneHotEncoder | "error", "ignore", or "infrequent_if_exist" | Raises error; "ignore" produces an all-zeros row |
| TargetEncoder (sklearn) | Built-in | Maps unseen categories to the global target mean |
| TargetEncoder (category_encoders) | Built-in | Maps unseen categories to the prior (global mean) |
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(df[["brand"]])
# New data with unseen category "Ford"
new_data = pd.DataFrame({"brand": ["Ford", "BMW"]})
encoded_new = ohe.transform(new_data)
print(pd.DataFrame(encoded_new.astype(int), columns=ohe.get_feature_names_out()))
brand_BMW brand_Honda brand_Toyota
0 0 0 0
1 1 0 0
Ford produces an all-zeros row: the model treats it as "none of the known brands." This is often a reasonable default but can degrade predictions when unseen categories are frequent. For high-cardinality features in production, consider:
- A fallback "OTHER" category trained on rare categories from the training set
- Target encoding, which naturally maps unseen categories to the global mean
- Periodic retraining to incorporate new categories
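The "OTHER" fallback can be implemented as a small preprocessing step before any encoder is fitted. A sketch (the min_count threshold of 2 is arbitrary, and train/test here are hypothetical series):

```python
import pandas as pd

train = pd.Series(["BMW", "Toyota", "Honda", "BMW", "Toyota"], name="brand")
test = pd.Series(["Ford", "BMW"], name="brand")

# Collapse categories seen fewer than min_count times into "OTHER",
# then route unseen test categories to the same bucket
min_count = 2
counts = train.value_counts()
keep = set(counts[counts >= min_count].index)

train_clean = train.where(train.isin(keep), "OTHER")  # Honda -> OTHER
test_clean = test.where(test.isin(keep), "OTHER")     # Ford  -> OTHER
print(test_clean.tolist())  # ['OTHER', 'BMW']
```

Because rare training categories were mapped to "OTHER" before fitting, the encoder has seen that bucket and can produce a sensible value for unseen categories at inference time.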
Choosing the right encoding
The encoding decision depends on three factors: whether the feature is ordinal, how many unique values it has (cardinality), and which model family you are using.
Decision flowchart:

1. Is the feature ordinal? (has a meaningful rank like low/medium/high)
   - Yes: Use ordinal encoding with explicitly specified category order.
   - No: Continue to step 2.
2. How many unique categories?
   - Low cardinality (under 15): Use one-hot encoding. Drop one column for linear models; keep all columns for tree-based models.
   - Medium cardinality (15-100): Use binary encoding or frequency encoding. Binary keeps dimensionality at $\lceil \log_2 k \rceil$; frequency produces a single column.
   - High cardinality (100+): Use target encoding (with smoothing and proper cross-validation) or CatBoost encoding. These produce a single column regardless of cardinality.
3. Which model are you using?
   - Linear models: Avoid label encoding on nominal features (false ordinality). Use one-hot with drop="first", or target encoding.
   - Tree-based models: Can use ordinal encoding even for nominal features because trees split on thresholds and do not assume linear relationships. However, one-hot encoding with high cardinality creates sparse splits that reduce tree efficiency.
   - Neural networks: One-hot or target encoding. For very high cardinality, consider entity embeddings (learned during training).
| Scenario | Recommended encoding | Reason |
|---|---|---|
| condition (3 ordered values) | Ordinal | Preserves natural rank |
| color (5 nominal values) | One-hot | Low cardinality, no false ordering |
| brand (20 nominal values) | Binary or frequency | Moderate cardinality, compact representation |
| zip_code (40,000 values) | Target or CatBoost | Single column, captures predictive signal |
| user_id (millions of values) | Target with heavy smoothing, or hashing | Extreme cardinality, most categories are rare |
Conclusion
Categorical encoding is not a one-size-fits-all preprocessing step. Ordinal encoding preserves rank for ordered features but invents false hierarchies for nominal ones. One-hot encoding eliminates ordinality bias at the cost of dimensionality. Target encoding compresses high-cardinality features into a single predictive column but demands careful regularization to avoid leaking target information into the training set. Binary encoding and CatBoost encoding offer middle-ground solutions that balance dimensionality, leakage risk, and predictive power.
The choice always comes back to three questions: does order matter, how many categories exist, and what model consumes the features. Answer those, and the encoding method follows directly.
To see how encoding fits into a broader preprocessing pipeline, read the Feature Engineering Guide. For numeric feature preparation after encoding, the Standardization vs Normalization guide covers when to scale and when to normalize. And for a deeper treatment of frequency-based approaches with high-cardinality data, see Frequency Encoding.
Hands-On Practice
See why Label Encoding nominal data is dangerous. We'll encode the same categorical feature two ways and watch how it affects model performance.
Dataset: ML Fundamentals (Loan Approval). We'll compare Label vs One-Hot encoding on a nominal categorical feature.
Try It Yourself
ML Fundamentals: Loan approval data with features for classification and regression tasks
Try this: Change cat_col to 'education' - notice education IS ordinal (has natural order), so Label Encoding makes more sense there!