Polynomial Regression: A Journey from Linearity to Curves

I. INTRODUCTION

Ever wondered how we predict more complex patterns in data, beyond just straight lines? Welcome to the fascinating world of Polynomial Regression! Don’t be put off by the seemingly complex name – by the end of this article, you’ll understand what Polynomial Regression is, why it’s important, and how it works.

In our previous discussion on Linear Regression, we explored how we can predict outcomes based on a straight-line relationship between two variables. For instance, predicting house prices based on income. But what if the relationship isn’t a straight line? What if it’s a curve or a squiggly line? That’s where Polynomial Regression comes in. It allows us to capture more complex, non-linear relationships in our data, such as predicting the speed of a roller coaster based on time, or the growth of a tree based on its age. So, buckle up and get ready for an exciting journey into the world of Polynomial Regression!

II. BACKGROUND INFORMATION

Before we dive head-first into the world of Polynomial Regression, let’s make sure we’re all on the same page about some basic concepts: variables, lines, curves, and the difference between linear and polynomial regression.

As we learned in our last adventure with Linear Regression, a variable is something that can change or vary. We usually work with two types of variables: dependent (the outcome we want to predict) and independent (what we use to make the prediction).

Remember our hiking trip analogy from the Linear Regression guide? There we talked about paths and their steepness, or ‘slope’. Now imagine the hiking path isn’t just a straight uphill or downhill. It has curves, turns, and varying steepness. Sometimes it goes up, sometimes down. This winding path is more like the ‘line’ we deal with in Polynomial Regression – it’s a curve!

In real life, many relationships are more like this winding path than a straight line. For example, if you’re growing a plant, in the beginning, more water might lead to faster growth. But after a certain point, too much water might harm the plant. If we graphed this, it wouldn’t be a straight line, but a curve that goes up and then down.

That’s why we need Polynomial Regression. It allows us to find the curve of best fit for our data, not just the straight line. But how does it do that? Hold onto your hats, because that’s exactly what we’re going to learn next!

III. WHAT POLYNOMIAL REGRESSION DOES

If you’ve read the previous article about Linear Regression, you might remember how we used it to draw a straight line that best fits our data. But life isn’t always a straight line. Imagine you’re riding a roller coaster. If you draw the path you travel, it’s not going to be a straight line, right? It’s going to be full of ups and downs, just like a polynomial!

Polynomial Regression allows us to capture these ups and downs when our data has a curve. Imagine you’re a lemonade seller, and you want to predict your sales. You notice that when the temperature is very cold or very hot, your sales drop. But when the temperature is comfortable, around room temperature, your sales are the highest. If you were to plot this on a graph, with temperature on the x-axis and sales on the y-axis, you wouldn’t see a straight line. You’d see a curve, or a ‘hill’, peaking at room temperature and then going down as the temperature gets too cold or too hot.

This is where Polynomial Regression comes in. It’s like a super-powered version of Linear Regression that can deal with curves. It finds the curve, or ‘polynomial’, that best fits your data. Just like Linear Regression used a straight line (y = mx + b), Polynomial Regression uses an equation like y = ax² + bx + c, which represents a curve.
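To make this concrete, here is a minimal sketch in Python (using numpy with made-up lemonade-stand numbers, so both the data and the predictions are purely illustrative) that fits a straight line and a quadratic curve to the same points:

import numpy as np

# Made-up example: temperature (deg C) vs. lemonade sales in cups (illustrative only)
temps = np.array([5, 10, 15, 20, 25, 30, 35])
sales = np.array([10, 25, 45, 60, 55, 35, 15])

# Degree 1 fits a straight line (y = mx + b); degree 2 fits a curve (y = ax^2 + bx + c)
line_coeffs = np.polyfit(temps, sales, deg=1)
curve_coeffs = np.polyfit(temps, sales, deg=2)

# Predict sales at 22 degrees with each model
print("Line prediction at 22:", np.polyval(line_coeffs, 22))
print("Curve prediction at 22:", np.polyval(curve_coeffs, 22))

The straight line has to average out the rise and fall, while the quadratic curve is free to peak in the middle, which is exactly the ‘hill’ shape we described above.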

[Figure: Working of Polynomial Regression]

IV. EXPLAINING KEY CONCEPTS

Before we start splashing around in the world of Polynomial Regression, we need to understand some words that we’ll be using a lot. So let’s go!

Degree of a Polynomial: Think of a roller coaster ride. Some are gentle with small humps (like a kiddie coaster), while others have huge, thrilling ups and downs. In Polynomial Regression, the ‘degree’ of a polynomial is like the number of big ups and downs (humps) our roller coaster can have. A degree 1 polynomial is a simple straight line (like our old friend Linear Regression), a degree 2 polynomial can have one hump (or bend), a degree 3 polynomial can have up to two humps, and so on.
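If you’d like to see what ‘degree’ looks like in code, here is a tiny sketch (using scikit-learn’s PolynomialFeatures on a single toy value, x = 3) showing how one feature gets expanded into powers of itself:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[3]])  # one data point with a single feature, x = 3

# Degree 2 expands x into [1, x, x^2]
print(PolynomialFeatures(degree=2).fit_transform(x))  # [[1. 3. 9.]]

# Degree 3 expands x into [1, x, x^2, x^3]
print(PolynomialFeatures(degree=3).fit_transform(x))  # [[ 1.  3.  9. 27.]]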

Coefficient: Remember how in our Linear Regression article, we said a coefficient tells us how much our predicted outcome (like the plant’s height) would change if we changed our independent variable (like the amount of sunlight) by one unit? Well, in Polynomial Regression, coefficients do the same thing, but they work with the polynomial curve.

Residuals: These are like the small mistakes our model makes. Imagine you’re throwing a basketball into a hoop. You’re not going to make the shot every time, right? Sometimes you’ll miss by a little bit. In Polynomial Regression, residuals are the differences between where we predict the ball will land (the curve) and where it actually lands (the data).
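In code, residuals are simply the actual values minus the predicted values. A quick sketch with made-up numbers (illustrative only):

import numpy as np

# Made-up actual values and model predictions
actual = np.array([3.0, 4.5, 5.0, 6.5])
predicted = np.array([2.8, 4.9, 5.1, 6.0])

residuals = actual - predicted  # how far off each prediction is
print("Residuals:", residuals)  # [ 0.2 -0.4 -0.1  0.5]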

Overfitting & Underfitting: Now, these two are very important. Overfitting is like trying to wear a tight outfit that sticks to your body and captures every single curve. Sure, it fits you perfectly, but it’s so specific to you that it won’t fit anyone else. In Polynomial Regression, overfitting is when our curve captures every single data point perfectly, but it’s so specific that it might not predict new data well. On the other hand, underfitting is like wearing an outfit that’s too loose. It doesn’t capture your shape well, and it could fit many people. In our model, underfitting is when our curve is too simple and doesn’t capture the trends in our data well.

Remember, our goal in Polynomial Regression is to find the curve that fits our data ‘just right’, like Goldilocks! Not too tight (overfitting), and not too loose (underfitting), but just right!
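Here is a small, self-contained sketch of that Goldilocks idea (using synthetic wavy data made up for this example, so your exact scores will vary): we fit polynomials of degree 1, 3, and 15 and compare how well each one does on data it has seen (training) versus data it hasn’t (testing). Typically the very high degree looks great on the training data but worse on the test data, which is overfitting in action, while degree 1 does poorly on both, which is underfitting.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data with a little noise (illustrative only)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 3, 15]:  # too simple, about right, very wiggly
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    train_r2 = model.score(poly.transform(X_train), y_train)
    test_r2 = model.score(poly.transform(X_test), y_test)
    print(f"degree {degree:2d}: train R2 = {train_r2:.2f}, test R2 = {test_r2:.2f}")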

V. REAL-WORLD EXAMPLE

Let’s journey into the exciting world of roller coasters! As a theme park enthusiast, you’ve always been fascinated by these thrilling rides. You may notice that when a roller coaster starts its journey, it slowly climbs up to a high point and then zooms down at high speed, going up and down through a series of hills and valleys, each varying in height and depth. The path of the roller coaster isn’t a straight line, is it? It’s more of a curve or a series of curves.

Imagine you’re a roller coaster designer trying to plan the design of a new ride. Based on prior designs, you notice a pattern between the time since the ride started (let’s call this ‘x’) and the height of the roller coaster (let’s call this ‘y’). The relationship isn’t a simple straight line; instead, the height changes in a way that forms a curve. The height increases as the roller coaster ascends, then sharply decreases as it plunges down, then rises and falls again, and so on. This is an example of a non-linear relationship.

To capture this relationship, we use Polynomial Regression. Unlike Linear Regression, which draws a straight line, Polynomial Regression fits a curve to the data points. The equation for this curve might look something like y = ax³ + bx² + cx + d, where ‘a’, ‘b’, ‘c’, and ‘d’ are coefficients, and the highest power (in this case 3) determines the degree of the polynomial.

Using these coefficients, we can predict the height of the roller coaster at any given time. For example, suppose 5 seconds into the ride, we want to predict the height of the roller coaster. If the coefficients you calculated were a = 2, b = -3, c = 1, and d = 0, you would predict the height to be y = 2(5³) - 3(5²) + 1(5) + 0 = 250 - 75 + 5 = 180 meters. Then, you check your roller coaster’s design and see that it fits! That’s the magic of Polynomial Regression applied to roller coaster design!
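If you’d rather not do that arithmetic by hand, a couple of lines of Python (using numpy.polyval with the same illustrative coefficients) will check it for you:

import numpy as np

# Coefficients from the example above: y = 2x^3 - 3x^2 + 1x + 0 (highest power first)
coefficients = [2, -3, 1, 0]

height_at_5_seconds = np.polyval(coefficients, 5)
print(height_at_5_seconds)  # 180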

In the next section, we’ll delve deeper into Polynomial Regression with a specific dataset, so you’ll get an even clearer understanding of how it works in practice.

VI. INTRODUCTION TO DATASET

Leaving the land of fairy tales and stepping into reality, we will now explore the California Housing dataset. This dataset contains details about houses across different blocks in California, including features such as:

  • MedInc: Median income in block.
  • HouseAge: Median house age in block.
  • AveRooms: Average number of rooms.
  • AveBedrms: Average number of bedrooms.
  • Population: Block population.
  • AveOccup: Average house occupancy.
  • Latitude: House block latitude.
  • Longitude: House block longitude.

Our target variable, ‘MedHouseVal’, represents the median house value for each block group in California, expressed in hundreds of thousands of dollars.

For the purpose of our Polynomial Regression, we will use the ‘AveRooms’ feature as our independent variable and ‘MedHouseVal’ as our dependent variable. The ‘AveRooms’ variable represents the average number of rooms per dwelling, while ‘MedHouseVal’ represents the median value of owner-occupied homes. We suspect a non-linear relationship between these variables: as the ‘AveRooms’ value increases, ‘MedHouseVal’ might also increase, but not in a straight-line manner. In the next sections, we’ll use Polynomial Regression to explore this relationship in more detail.
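If you’d like a quick look at these two variables before we fit anything, a short sketch like this (loading the dataset the same way we will in the next section) prints their summary statistics:

import pandas as pd
from sklearn.datasets import fetch_california_housing

california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Summary statistics for our independent and dependent variables
print(data[['AveRooms', 'MedHouseVal']].describe())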

VII. APPLYING POLYNOMIAL REGRESSION

In this section, we will apply Polynomial Regression to predict house values in California. Just as we did with the dragon and stickers in the Linear Regression example, we’re going to use a real-world dataset and Python code. Don’t worry if you’re new to Python, we’ll guide you through everything step by step.

The California Housing dataset contains information about various houses in California, like the median income, average number of rooms, and so forth. To keep things simple, we’ll focus on one variable – ‘Average Number of Rooms per Dwelling’ (which we’ll call ‘AveRooms’ for short). Our goal is to predict the ‘Median House Value’ (which we’ll call ‘MedHouseVal’) based on ‘AveRooms’.

Here’s how to do it:

#Import required packages
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

#Load the Dataset
california = fetch_california_housing()
data = pd.DataFrame(data=california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

#Visualize the Data
plt.scatter(data['AveRooms'], data['MedHouseVal'], color='blue')
plt.xlabel('Average Number of Rooms per Dwelling (AveRooms)')
plt.ylabel('Median House Value (MedHouseVal)')
plt.title('AveRooms vs MedHouseVal')
plt.show()
"""
In Linear Regression, we assumed that the relationship between our independent and dependent variables is a straight line. But what if it's not? That's where Polynomial Regression comes in.

To apply Polynomial Regression, we need to add polynomial features to our model. Basically, we are adding extra powers of the original features as new features.

We'll use the sklearn library's PolynomialFeatures class to accomplish this.
"""

X = data['AveRooms']
y = data['MedHouseVal']

#Train the Model
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(X.values.reshape(-1, 1))
model = LinearRegression()
model.fit(x_poly, y)

#Visualize the Result
plt.scatter(X, y, color='blue')
# Sort by AveRooms so the fitted curve is drawn as one smooth line
sort_order = X.values.argsort()
plt.plot(X.values[sort_order], model.predict(x_poly)[sort_order], color='red')
plt.xlabel('Average Number of Rooms per Dwelling (AveRooms)')
plt.ylabel('Median House Value (MedHouseVal)')
plt.title('Polynomial Regression AveRooms vs MedHouseVal')
plt.show()

# Predicting the target values using our trained model
y_pred = model.predict(x_poly)

# Calculating and printing RMSE
rmse = sqrt(mean_squared_error(y, y_pred))
print("RMSE: ", rmse)

# Calculating and printing R2 score
r2 = r2_score(y, y_pred)
print("R2 score: ", r2)

Feel free to copy the code above into your own environment, experiment with it, and see the results firsthand.


VIII. UNDERSTANDING THE RESULTS

After running our Polynomial Regression on the California Housing dataset, we plotted our predictions and calculated our evaluation metrics. But what does this all mean? Let’s break it down.

Visualizing the Results

Remember the scatter plot we made of AveRooms against MedHouseVal? Each blue dot represents a block group in our dataset, showing its actual median house value, while the red curve shows the value our model predicts for that average number of rooms. If our model were perfect, the red curve would pass exactly through every blue dot, but as you can see, there are plenty of differences. This is where our evaluation metrics come in.

Evaluation Metrics

We used two metrics to evaluate our model: Root Mean Squared Error (RMSE) and R2 score.

  • Root Mean Squared Error (RMSE): This is a measure of the differences between the values predicted by our model and the actual values. It’s like the average “miss” our model makes when predicting house values. A lower RMSE is better, as it means our model’s predictions are closer to the actual values. If our RMSE was 0, it would mean our model’s predictions were perfect!
  • R2 Score: This is a measure of how well our model’s predictions match the variability in the actual values. It usually ranges from 0 to 1, where a higher value is better (it can even dip below 0 if the model does worse than simply predicting the average). If R2 is 1, it means our model explains all the variability in house values. If it’s 0, it means our model explains none of the variability.

So, looking at our RMSE and R2 scores together can give us a good idea of how well our Polynomial Regression model is performing.
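If it helps demystify these metrics, here is how RMSE and the R2 score can be computed by hand (a small sketch with made-up actual and predicted values; the formulas match what scikit-learn’s mean_squared_error and r2_score compute):

import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([2.0, 3.5, 4.0, 5.5, 6.0])
y_pred = np.array([2.2, 3.1, 4.3, 5.0, 6.4])

# RMSE: square the errors, average them, then take the square root
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: 1 minus (unexplained variation / total variation)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.3f}, R2: {r2:.3f}")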

IX. LIMITATIONS OF POLYNOMIAL REGRESSION

Although Polynomial Regression is a powerful tool that can model complex, non-linear relationships, it’s important to understand its limitations.

  1. Overfitting: If we include too many polynomial features (i.e., increase the degree of the polynomial), our model can become too complex. It might fit the training data almost perfectly, but perform poorly on new, unseen data. This is because it’s not only modeling the underlying trend but also the noise in the data. This is called overfitting.
  2. Underfitting: On the other hand, if we don’t include enough polynomial features (i.e., the degree of the polynomial is too low), our model may be too simple to capture the underlying trend. It might perform poorly on both the training data and new data. This is called underfitting.
  3. Choice of Degree: Choosing the right degree for the polynomial can be challenging. It’s usually done by trial and error, testing different degrees and comparing their performance (for example with cross-validation, as sketched after this list). However, there is no one-size-fits-all solution – the optimal degree may differ for different datasets and problem contexts.
  4. Computational Complexity: Polynomial Regression involves more computations than simple Linear Regression, as it needs to calculate powers and interaction terms. This might make it less suitable for very large datasets.
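As promised in point 3, here is one way to compare candidate degrees with cross-validation (a sketch on synthetic data made up for this example; with the California Housing data you would substitute your own X and y):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic curved data (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + rng.normal(0, 1, 100)

for degree in [1, 2, 3, 5, 10]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    print(f"degree {degree:2d}: mean cross-validated R2 = {scores.mean():.3f}")

The degree with the best cross-validated score is usually a sensible choice, because it balances fitting the training data against generalizing to unseen data.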

While Polynomial Regression has these limitations, it’s still an important tool in the data scientist’s toolbox. By understanding these limitations, we can use Polynomial Regression more effectively and in the right contexts.

