Ridge Regression: Making Your Regression More Robust


I. Introduction

Ridge Regression, though it sounds like something you’d encounter on a mountainous trek, is actually a powerful tool used in machine learning and data science. In our previous article, we explored Linear Regression, where we predicted the price of a house based on the average income in the area. But what happens when our prediction isn’t as accurate as we’d like it to be? This is where Ridge Regression comes to the rescue!

To understand Ridge Regression, imagine you’re playing a game of darts. Your aim is to hit the bullseye (the very center of the dartboard). But let’s say a strong wind is blowing and pushing your darts a bit off course. In this case, you wouldn’t aim exactly for the bullseye, because the wind would just push your dart away. Instead, you’d adjust your aim, accounting for the wind, to land your dart as close to the bullseye as possible. Ridge Regression works in a similar way: it adjusts its aim (its predictions) to get as close as possible to the truth!

By the end of this article, you’ll have a clear understanding of Ridge Regression, why we use it, and how it helps us make better predictions. So, let’s dive in!


II. Background Information

Before we embark on our Ridge Regression adventure, let’s revisit some concepts we learned from Linear Regression and introduce a few new ones that will help us understand Ridge Regression better.

In Linear Regression, we tried to find the line of best fit that could predict the price of a house based on the average income in the area. However, sometimes this line of best fit may rely too heavily on our existing data, leading to what we call ‘overfitting’. Overfitting is like memorizing answers for a test – it may work well for that specific test, but when new questions come up, it fails to perform.

To avoid overfitting, we use Ridge Regression, which introduces a small twist to the Linear Regression we know. Ridge Regression adds a penalty to the equation. This penalty is our safeguard, a protective measure that prevents any single variable from disproportionately influencing the predicted outcomes. Just like in our darts game, it adjusts our aim to account for any wind (or in our case, overfitting).

It’s important to note that Ridge Regression doesn’t replace Linear Regression but complements it. When we have lots of variables and data points, and we suspect that our line of best fit might be relying too much on our existing data (becoming overfit), Ridge Regression is a useful tool to have in our machine learning toolkit.

In the coming sections, we’ll go deeper into how Ridge Regression works, what the ‘ridge penalty’ is all about, and how it helps us get more robust predictions. So, grab your explorer hat as we set off on this data science adventure!


III. What is Ridge Regression?

Imagine you’re a teacher who is giving different types of quizzes to the students. Some quizzes are easy, some are hard, and some are in between. You start to notice that some students do consistently well on all types of quizzes, but others only do well on certain types. As a teacher, you want to predict which students will do well in the future based on their performance so far. You decide to use regression, a technique that helps us to predict one thing (future performance) based on other things (past quiz results).

But here’s where it gets tricky. What if some types of quizzes are closely related? For instance, a student who does well on multiple-choice quizzes might also do well on true/false quizzes. This can make your predictions a bit off and could favor students who are good at one specific type of quiz.

This is where Ridge Regression comes in. It’s like a super-smart teacher who can balance the importance of all different quiz types and not favor one too much. It does this by adding a special thing called a “penalty” to the regression. This penalty discourages the model from putting too much importance on any one quiz type. That way, all types of quizzes get considered fairly.

In simpler words, Ridge Regression is a version of regression that stops it from focusing too much on any single type of quiz. It’s like a teacher who makes sure that every student gets a fair chance, no matter what type of quiz they’re good at!


IV. Understanding the Ridge Penalty

Now let’s talk about this “penalty” thing in Ridge Regression. Imagine you’re playing a game where you get points for hitting targets. But here’s the twist: the game reduces your points if you hit the same target too many times. So to get a high score, you have to hit as many different targets as you can. This reduction in points for hitting the same target too often is like the penalty in Ridge Regression.

When we’re predicting something like student performance, we use a bunch of different factors (like quiz scores). Each factor is like a target in the game. Ridge Regression reduces the importance of any factor that gets too much emphasis (like hitting the same target too many times). This penalty makes sure that all factors get a fair chance to play a part in the prediction.

In mathematical terms, this penalty is calculated as the sum of the squared “coefficients” (the weights the model attaches to each factor), multiplied by a certain number (usually called ‘lambda’). The higher the lambda, the bigger the penalty, and the more our model shrinks every coefficient towards zero, so that no single factor can dominate and closely related factors end up sharing the importance between them.
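
For readers who like to see it written out, one common way to express this is sketched below. The notation is an assumption made for this sketch and does not appear elsewhere in the article: $w_j$ are the coefficients, $b$ is the intercept, $x_i$ and $y_i$ are the factors and the outcome for the $i$-th observation, and $\lambda$ is the penalty strength.

```latex
\min_{w,\,b} \;\; \sum_{i=1}^{n} \bigl( y_i - b - x_i^{\top} w \bigr)^2 \;+\; \lambda \sum_{j=1}^{p} w_j^{2}
```

The first sum is the usual Linear Regression error and the second is the ridge penalty. With $\lambda = 0$ the penalty disappears and we are back to plain Linear Regression; as $\lambda$ grows, large coefficients become more and more costly.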

This penalty thing may sound a bit complicated, but don’t worry! The beauty of Ridge Regression is that it does all these calculations behind the scenes. All you need to understand is that it’s a way to make your predictions fairer and more balanced by not letting any single factor become too dominant.


V. Key Concepts in Ridge Regression

  1. Ridge Regression: Ridge Regression is like a super wise and fair captain of a ship. Now, imagine you are on a ship in the middle of the ocean. To reach your destination, you have to steer the ship in the right direction. In the world of data, your ship is the Ridge Regression model, and the direction is the prediction you want to make. Sometimes, strong winds (like an over-reliance on certain data) can push your ship off course. To prevent this, our wise captain uses a special compass (the Ridge Penalty) that ensures we don’t steer too much in any one direction, keeping our ship on the right path.
  2. Penalty Term (λ): Remember our wise captain’s special compass? This compass is the penalty term, also known as lambda (λ). The stronger the wind (the more we’re relying on a specific piece of data), the more our captain will use the compass to adjust the ship’s course. So, the penalty term helps us control how much we rely on any particular piece of data when making predictions. Just like how a captain wouldn’t rely only on the wind direction to steer the ship, our model shouldn’t rely too heavily on one type of data to make predictions.
  3. Coefficient: Coefficients are like the crew members of our ship. Each crew member has a job that influences the ship’s journey. In the same way, each coefficient represents the relationship between a particular feature (a column of our data) and the prediction we want to make. If the relationship is strong, the coefficient will be large in size, whether positive or negative (it’s like a crew member who is working very hard!). If the relationship is weak, the coefficient will be close to zero (like a crew member who’s a bit lazy). But remember, our wise captain (Ridge Regression) ensures that no single crew member (or feature) does all the work while the others slack off.
  4. Overfitting: Overfitting is like a ship that’s built to sail perfectly in one type of weather but performs poorly when the conditions change. This happens when our model learns too much from our existing data and doesn’t perform well when introduced to new data. Ridge Regression helps us avoid this by using the penalty term to ensure our ship (model) can sail smoothly, no matter the weather (data)!
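
To make the link between the penalty term and the coefficients a little more concrete, here is a minimal sketch. It uses made-up data with two nearly identical features (an assumption chosen purely for illustration) and scikit-learn’s `Ridge`, where the penalty strength is passed as `alpha`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=200)

# Fit the same model with increasingly strong penalties.
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>8}: coefficients={np.round(model.coef_, 3)}")
```

As the penalty grows, the overall size of the coefficients should fall, which is the “captain” keeping any one crew member from doing all the work.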


VI. Real-World Example of Ridge Regression

To illustrate how Ridge Regression works in real life, let’s imagine we’re trying to predict the score of a soccer game. We have lots of data to consider, like the team’s previous scores, the players’ performance, the weather conditions, and even the number of fans cheering in the stadium!

  1. Identifying the problem: We want to predict the score of the next game. In Ridge Regression terms, this score is our ‘target’ (like the destination of our ship).
  2. Collecting the data: To predict the score, we need to collect data. This data is like our crew members – each piece of data has a role to play in guiding our ship to the right destination. Some data might be the team’s previous scores, the players’ performance, the weather on game day, and so on.
  3. Building the model: Next, we feed this data into our Ridge Regression model (our wise captain). Our model will look at the data and assign each piece a ‘coefficient’ based on how important it is in predicting the game’s score.
  4. Adding the penalty: But remember, we don’t want any piece of data to become too important (like a crew member doing all the work!). So, our model uses the penalty term to ensure that no single piece of data is influencing the prediction too much.
  5. Making the prediction: With the coefficients assigned and the penalty in place, our model is ready to predict the score of the soccer game. The model takes all the data into account and calculates a score that’s as close as possible to the actual score.
  6. Evaluating the model: Once the game is over, we can compare the predicted score with the actual score to see how well our model performed. And if it didn’t do so well, we can adjust the model (like a captain adjusting the ship’s course) to make better predictions next time.

And that’s Ridge Regression in action! It might seem a bit complex, but it’s really just about making sure all data gets a fair chance to influence the prediction, ensuring our model is as accurate as possible. It’s a bit like a super-wise captain guiding a ship on a long journey, using their expert knowledge and a handy compass to stay on course.
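
The six steps above can be sketched in a few lines of code. Everything in this sketch is an assumption for illustration: the data is randomly generated rather than taken from real matches, and the feature names are hypothetical placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Steps 1-2: the target and the data. All values are synthetic and the
# feature names are hypothetical placeholders, not real match statistics.
rng = np.random.default_rng(42)
feature_names = ["previous_goals", "player_rating", "temperature", "attendance"]
X = rng.normal(size=(300, len(feature_names)))
true_weights = np.array([1.5, 2.0, -0.3, 0.2])  # assumed "true" relationship
y = X @ true_weights + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Steps 3-4: build the model; alpha sets the strength of the penalty.
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
for name, coef in zip(feature_names, model.coef_):
    print(f"{name}: {coef:.3f}")

# Step 5: predict for games the model has not seen.
y_pred = model.predict(X_test)

# Step 6: compare the predictions with the actual outcomes.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
print("RMSE:", round(rmse, 3))
```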

VII. Introduction to Dataset

In this article, we will be working with the Diabetes dataset, a common dataset provided by the Scikit-learn library. This dataset consists of ten baseline variables: age, sex, body mass index (BMI), average blood pressure, and six blood serum measurements taken from a population of 442 diabetes patients, along with the response of interest, a quantitative measure of disease progression one year after baseline (the ‘target’).

These features are all numeric and already normalized: each feature has been mean-centered (its mean is zero) and scaled by its standard deviation times the square root of the number of samples, so that the squared values in each column sum to one. The target variable is a continuous variable indicating the progression of the disease one year after the baseline. It’s important to note that this dataset does not include data on all relevant aspects that can affect diabetes, such as diet or physical activity.
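
If you’d like to check the shape and scaling of the data yourself, a short sketch along these lines should do it. It only inspects the copy of the dataset bundled with Scikit-learn; the variable names are our own.

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X, y = data.data, data.target

print(X.shape)             # (number of patients, number of features)
print(data.feature_names)  # names of the ten baseline variables

# Per-feature mean and sum of squared values, to see how the columns are scaled.
col_means = X.mean(axis=0)
col_sum_sq = (X ** 2).sum(axis=0)
print(np.round(col_means, 6))
print(np.round(col_sum_sq, 6))
```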

VIII. Applying Ridge Regression

Now that we have a better understanding of our dataset, let’s apply Ridge Regression to predict diabetes progression.

Ridge Regression is a variant of linear regression in which the loss function is modified to penalize model complexity. This is achieved by adding a penalty proportional to the sum of the squared coefficients (an L2 penalty). In Scikit-learn, the strength of this penalty, the lambda we met earlier, is set through the alpha parameter.
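
As a small aside, the modified loss has a well-known closed-form solution when no intercept is fitted. The sketch below compares that formula with Scikit-learn’s `Ridge` on made-up data (the data and variable names are assumptions for illustration, not part of the diabetes example).

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up data for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Closed-form ridge solution without an intercept:
#   w = (X^T X + alpha * I)^(-1) X^T y
n_features = X.shape[1]
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# The same problem solved by scikit-learn (intercept disabled to match).
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print(np.round(w_closed, 4))
print(np.round(w_sklearn, 4))
```

The two sets of coefficients should agree to within numerical precision, which shows that the penalty simply adds `alpha` to the diagonal of the matrix being inverted.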

We start off by importing the required libraries and loading the dataset:

#Import Required Packages
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
import matplotlib.pyplot as plt
import numpy as np

#Load the Diabetes Data
dataset = load_diabetes()
df = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
X = df
y = dataset.target

#Scale the Dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#Split the Data into test and train.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

#Train the Model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

#Make Prediction using the trained Ridge Model
y_pred = ridge.predict(X_test)

#Evaluate the Model
rmse = sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: ", rmse)
r2 = r2_score(y_test, y_pred)
print("R2 score: ", r2)

#Visualize the predicted values and the actual values (First 25 Values Only)
num_values = 25
index = np.arange(num_values)
bar_width = 0.35

plt.bar(index, y_test[:num_values], bar_width, color='b', label='Actual')
plt.bar(index + bar_width, y_pred[:num_values], bar_width, color='r', label='Predicted')

plt.xlabel('Test Instance')
plt.ylabel('Diabetes Progression')
plt.title('Comparing Actual vs Predicted Values (First 25 Instances)')
plt.xticks(index + bar_width/2, index)
plt.legend()  #Show which bars are actual and which are predicted
plt.show()


IX. Understanding the Results

Upon applying Ridge Regression to our dataset, we calculated the RMSE and R2 scores, which serve as performance metrics for the model.

The Root Mean Squared Error (RMSE) of our model is approximately 53.78. RMSE measures how spread out the residuals (the differences between the actual and the predicted values) are. In other words, it tells us how closely the data cluster around the model’s predictions, expressed in the same units as the target. Lower values of RMSE indicate a better fit, which makes RMSE a useful criterion when the main purpose of the model is prediction. Since the target in this dataset ranges from 25 to 346, an error of about 54 is sizeable, so there is clear room for improvement.

The R-squared (R2) score of our model is approximately 0.45 (or 45% when expressed as a percentage). The R2 score, also known as the coefficient of determination, is a statistical measure of the proportion of the variance in the dependent variable that is explained by the independent variables in a regression model. An R2 of 0.45 therefore means that 45% of the variance in the dependent variable is predictable from the independent variables, while an R2 of 100% would mean that all changes in the dependent variable are completely explained by them. So in our case, while 45% may be acceptable in some scenarios, it does suggest our model could be improved further.
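
To see where these two numbers come from, here is a minimal sketch that computes RMSE and R2 by hand and checks them against Scikit-learn. The four values in each array are made up purely to show the arithmetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up arrays, used only to show the arithmetic.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 190.0, 270.0])

# Residuals are the differences between actual and predicted values.
residuals = y_true - y_hat

# RMSE: square root of the average squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: one minus the ratio of unexplained variation to total variation.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
print(r2_manual, r2_score(y_true, y_hat))
```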

X. Comparing Ridge Regression with Linear Regression

Linear Regression and Ridge Regression are both linear models that predict a target variable based on a set of input features. However, there are crucial differences between these two models that become apparent in their application.

Linear Regression chooses its coefficients purely to minimize the error on the training data, with no limit on how large those coefficients can become. In cases of high dimensionality (i.e., when we have a lot of features), or when some features are correlated, linear regression can create overfitted models that perform poorly on unseen data. This means the model learns the noise in the training data, making it less able to predict the response accurately for new data.

Ridge Regression is a technique used when the data suffers from multicollinearity (high correlations between predictor variables). By adding a degree of bias to the regression estimates, Ridge Regression reduces the standard errors. This bias addition is a form of regularization, a technique used to prevent overfitting. Ridge Regression decreases the complexity of a model but does not reduce the number of variables; it rather just shrinks their effect.

In contrast to basic Linear Regression, Ridge Regression has a regularization parameter (alpha). The larger the value of alpha, the higher the amount of regularization, leading to a smaller variance but increased bias. This can help to prevent overfitting by discouraging complex models that incorporate all the noise along with the underlying signal in the training data.

When comparing the two, Ridge Regression often results in better predictive performance, due to its ability to handle multicollinearity and reduce model complexity. This is particularly true when working with data where many features are correlated or in high-dimensional datasets.

To better understand this, you can run a basic linear regression model on the same dataset and compare the RMSE and R2 scores with our Ridge Regression model. If Ridge Regression is performing better, it will have a lower RMSE and a higher R2 score.
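
One way to run that comparison is sketched below. It follows the same loading, scaling and splitting steps as the earlier example; the `results` dictionary and the loop are our own scaffolding, and no particular winner is assumed.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation as in the Ridge example above.
X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit both models on the same split and collect their test metrics.
results = {}
for name, model in [("Linear", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2 = float(r2_score(y_test, y_pred))
    results[name] = (rmse, r2)
    print(f"{name}: RMSE={rmse:.2f}, R2={r2:.3f}")
```

On a small dataset with only ten features, the gap between the two models may well be tiny; the difference tends to matter more when there are many features or strong correlations between them.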

Understanding when to use Linear Regression versus Ridge Regression depends on your dataset and the problem at hand. Regularization, as in Ridge Regression, helps to prevent overfitting, but it may not be necessary if the dataset is relatively simple or if the features are not correlated.

XI. Limitations of Ridge Regression

While Ridge Regression offers valuable advantages in certain scenarios, it is essential to understand its limitations to make informed decisions about its application. Here are some key limitations to consider:

1. Linear Assumption: Ridge Regression assumes a linear relationship between the independent variables and the target variable. It may not perform optimally when the relationship is highly non-linear or exhibits complex patterns. In such cases, other regression techniques, such as polynomial regression or non-linear regression, may be more suitable.

2. Feature Selection: Ridge Regression does not inherently perform feature selection. It shrinks the coefficients towards zero without completely eliminating them. Consequently, it retains all the features but reduces their impact. If the dataset contains irrelevant or redundant features, Ridge Regression may not effectively address them. Feature selection techniques or alternative regression methods may be necessary to improve model performance in such cases.

3. Parameter Tuning: Ridge Regression involves selecting an appropriate regularization parameter, called alpha in Scikit-learn and often written as lambda (λ) elsewhere. Choosing the optimal value of alpha is critical to strike a balance between overfitting and underfitting. Selecting an extremely high alpha can lead to excessive bias, while an excessively low alpha may not effectively shrink the coefficients. Tuning the regularization parameter requires careful consideration and often involves cross-validation or other optimization techniques.

4. Interpretability: Ridge Regression’s primary goal is to improve predictive performance rather than providing explicit interpretability of the coefficients. As Ridge Regression reduces the coefficients’ impact rather than eliminating them, interpreting the specific influence of each feature can be challenging. If interpretability is a crucial requirement, alternative regression techniques that prioritize feature interpretability, such as ordinary least squares regression, may be more appropriate.

5. Outliers: Like other linear regression techniques, Ridge Regression can be sensitive to outliers, which are extreme or atypical observations in the dataset. Outliers can have a disproportionate influence on the model’s performance, leading to biased coefficient estimates. Robust regression techniques or data preprocessing methods to handle outliers may be necessary in such cases.

6. Computational Complexity: Ridge Regression involves solving a system of linear equations with additional regularization terms. As the number of features increases, the computational complexity of Ridge Regression also grows. For large-scale datasets with a high number of features, the computational resources required to train and apply Ridge Regression models can be significant.

7. Limited Scope of Application: Ridge Regression is most suitable for datasets where the target variable is continuous and the relationship between the features and the target variable is linear or approximately linear. If the target variable is categorical or the relationships are highly non-linear, alternative regression techniques, such as logistic regression or decision tree-based methods, may be more appropriate.
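
For the parameter tuning limitation in particular, Scikit-learn offers `RidgeCV`, which tries several values of alpha using cross-validation. The sketch below is one possible setup, not a recommendation: the grid of candidate alphas and the choice of 5 folds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate penalty strengths; this grid is an arbitrary choice for illustration.
alphas = np.logspace(-3, 3, 13)

# The pipeline fits the scaler on the training data only, then RidgeCV
# picks an alpha from the grid using 5-fold cross-validation.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5))
model.fit(X_train, y_train)

best_alpha = model.named_steps["ridgecv"].alpha_
test_r2 = model.score(X_test, y_test)
print("Selected alpha:", best_alpha)
print("Test R2:", round(test_r2, 3))
```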

Understanding these limitations is crucial in determining whether Ridge Regression is the appropriate technique for a given problem. It is essential to assess the dataset, consider the specific requirements of the problem, and evaluate alternative regression methods to ensure the most effective and accurate modeling approach.


© Let’s Data Science

