A Beginner's Guide to Linear Regression with Scikit-Learn in Python

Introduction: 

In the exciting world of data science and machine learning, Linear Regression acts as a trusty compass, guiding us toward valuable insights hidden within our data. The best part? You don't need to be a math genius to understand it! Thanks to the user-friendly Scikit-Learn library in Python, Linear Regression becomes a breeze, even for beginners.

Understanding Linear Regression: 

At its core, Linear Regression is like connecting the dots on a graph with the straightest line possible. Imagine having many data points and wanting to find that perfect line that fits them all. Once we've found that line, we can use it to predict new data points, like tomorrow's temperature based on today's weather.

The Magic of Scikit-Learn: 

Implementing Linear Regression from scratch can be challenging, involving concepts like Gradient Descent. However, Scikit-Learn simplifies it for us: all we need to do is call a few functions, and voilĂ , we have a powerful Linear Regression model at our disposal.

Real-World Example: Predicting House Prices

Let's start with a simple, commonly used example. We all know that the size of an apartment or a house directly affects its price. Suppose we have collected the following data:

House Size (sq. ft)    House Price ($)
1000                   200,000
1500                   300,000
1200                   250,000
1800                   350,000
1350                   280,000
2000                   400,000
1650                   320,000
2200                   450,000
We can see a strong positive correlation between house size and price: as the size goes up, so does the price. Let us graph it for a better understanding.

[Scatter plot: House Size vs. Price]
It seems like a straight line passes through these points.

Let's say we have another house of size 1600 sq. ft., and we have been tasked with setting a reasonable price for it. Since size and price appear to be linearly related, we can draw a straight line through the existing data points and read off the price for 1600 sq. ft. from that line. Can we really predict the price of the new house this way? The answer is yes, and that is what we call Linear Regression.

Linear Regression finds the line through the data that makes the overall error as small as possible. Once we have that line, we can predict new data by feeding in the value we know (the size) to estimate the value we don't (the price).
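If you're curious what "as small as possible" means here: for a single feature, the least-squares line has a simple closed-form answer. Here is a sketch that computes the slope and intercept by hand with NumPy on the table above (Scikit-Learn does the equivalent for us in the next section):

import numpy as np

# The (size, price) data from the table above
x = np.array([1000, 1500, 1200, 1800, 1350, 2000, 1650, 2200], dtype=float)
y = np.array([200000, 300000, 250000, 350000, 280000, 400000, 320000, 450000], dtype=float)

# Closed-form least-squares fit of the line y = m*x + b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print("slope:", m)        # extra dollars per extra sq. ft
print("intercept:", b)    # where the line crosses the y-axis
print("price at 1600:", m * 1600 + b)

The slope works out to roughly $197.50 per square foot, and plugging in 1600 gives about $321,219, which is exactly the answer Scikit-Learn produces below.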

Building The Model:

We'll use Scikit-Learn to create and train our Linear Regression model. The library simplifies this process, allowing us to focus on the data and predictions. Once we've trained the model, we can use it to predict the price for a new house size of 1600 sq. ft.

# Import the modules (install scikit-learn and NumPy if not already done)
from sklearn.linear_model import LinearRegression
import numpy as np

# Put the data we know into NumPy arrays, which we can feed into the model to train it
house_sizes = np.array([1000, 1500, 1200, 1800, 1350, 2000, 1650, 2200])
house_prices = np.array([200000, 300000, 250000, 350000, 280000, 400000, 320000, 450000])

# Create a Linear Regression object and fit it with our data.
# fit() expects a 2D array of features, so we reshape the sizes into a single column.
model = LinearRegression()
model.fit(house_sizes.reshape(-1, 1), house_prices)

# Use the trained model to guess the value we need
new_house_size = np.array([[1600]])
predicted_price = model.predict(new_house_size)
print(predicted_price)


Result:

Running this code gives us the predicted value as a result. 

[321218.85157096]

The model we built tells us that for a house of 1600 sq. ft., a price of $321,218.85 is most reasonable. This number fits the trend in our table. Cool, right?
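A quick way to sanity-check the model is to look at the line it actually learned. Scikit-Learn exposes the fitted slope and intercept through the coef_ and intercept_ attributes (a small sketch, reusing the same data):

from sklearn.linear_model import LinearRegression
import numpy as np

house_sizes = np.array([1000, 1500, 1200, 1800, 1350, 2000, 1650, 2200])
house_prices = np.array([200000, 300000, 250000, 350000, 280000, 400000, 320000, 450000])

model = LinearRegression()
model.fit(house_sizes.reshape(-1, 1), house_prices)

# coef_ holds the slope for each feature; intercept_ is the line's y-intercept
print("Price per extra sq. ft:", model.coef_[0])
print("Base price (intercept):", model.intercept_)

The slope comes out to roughly $197.50 per square foot, so 1600 × 197.50 plus the intercept of about $5,206 lands on the $321,218.85 we printed above.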

If we graph this line and our result...

[Plot: the data points, the red prediction line, and a green star at 1600 sq. ft.]

We can see the red line, our prediction line, passing through the data; this is the line we use to predict new values. The green star is the house whose price we predicted, found by reading off the corresponding y-coordinate on the red line. This is how Linear Regression works.

Error Analysis:

In Python, you can compute error metrics such as Mean Squared Error (MSE) and Mean Absolute Error (MAE), along with the R-squared (R2) score, to understand your model's performance. Let's walk through an example using scikit-learn to calculate these statistics and discuss what counts as good and bad values.

First, we need to import the required functions from scikit-learn.

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Then, we call the actual functions. Continuing from the model we trained above, we compare the true prices against the model's predictions:

# The original prices, and the model's predictions for the same house sizes
y_true = house_prices
y_pred = model.predict(house_sizes.reshape(-1, 1))

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_true, y_pred)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_true, y_pred)

# Calculate the R-squared (R2) score
r2 = r2_score(y_true, y_pred)

print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("R-squared (R2) Score:", r2)

For each function, feed in your original price values and the predicted price values, and the function returns the corresponding metric. But what counts as good, and what doesn't?

Generally, lower values of MSE and MAE are desirable, as they indicate a better fit of the model to the data. However, the absolute values of MSE and MAE can vary based on the scale of your data.

For the R-squared (R2) score, a value close to 1 is considered good, indicating a strong fit of the model to the data. R2 can range from negative values to 1, where negative values mean the model performs worse than simply predicting the mean (a horizontal line), and R2 = 1 represents a perfect fit.
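Putting this together for our house-price model, here is a self-contained sketch (reusing the data from the table) that computes all three metrics. One practical trick for the scale issue: taking the square root of MSE (the RMSE) expresses the error back in the target's own units, here dollars:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

house_sizes = np.array([1000, 1500, 1200, 1800, 1350, 2000, 1650, 2200])
house_prices = np.array([200000, 300000, 250000, 350000, 280000, 400000, 320000, 450000])

X = house_sizes.reshape(-1, 1)
model = LinearRegression()
model.fit(X, house_prices)

y_pred = model.predict(X)
mse = mean_squared_error(house_prices, y_pred)
mae = mean_absolute_error(house_prices, y_pred)
r2 = r2_score(house_prices, y_pred)

# RMSE (square root of MSE) is in the same units as the prices: dollars
rmse = mse ** 0.5
print("RMSE ($):", rmse)
print("MAE ($):", mae)
print("R2 score:", r2)

For this data the R2 score comes out close to 1, confirming what the scatter plot suggested: a straight line fits these points very well.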


Conclusion:

Linear Regression with Scikit-Learn empowers beginners to tackle real-world problems effortlessly. As you delve further into the exciting field of data science, you'll find Linear Regression a versatile and essential tool in your arsenal. Happy predicting!
