A Beginner's Guide to Logistic Regression with Scikit-Learn in Python

Introduction:

Supervised machine learning has two major branches: regression and classification. Linear Regression is our friend when it comes to predicting continuous values, like the price of a house or a car. However, some problems need a discrete output, like Yes/No. This branch of machine learning is called classification. Just as Linear Regression is a simple starting point for regression, Logistic Regression is a simple way to predict binary outputs. Scikit-Learn offers a LogisticRegression class that lets us understand and implement it in just a few lines of code.

What is Logistic Regression:

Logistic Regression is a method used to predict binary outcomes, meaning it's used when the dependent variable (what you're trying to predict) can only have two possible outcomes. It's called "logistic" because it uses the logistic function to model the probability of a certain class or event occurring. In simpler terms, imagine you want to predict if it will rain tomorrow based on weather data. Logistic Regression would help you decide if the answer is "yes" or "no". It's like fitting a curve to your data that best separates the two possible outcomes (rain or no rain), based on the input variables you provide (like temperature, humidity, etc.).

Real-World Example:

Let us take an example. Suppose we have to predict whether a student passes an exam. Assume that the hours they studied and whether they completed a practice test are the only determining factors (this wouldn't happen in real life). Our data looks like this:

Student    Hours Studied    Completed Practice Test    Pass (1 = Yes, 0 = No)
1          2                0                          0
2          1                1                          0
3          3                1                          1
4          5                1                          1
5          4                1                          1
6          2                0                          0

We can draw a few conclusions on our own, like the importance of the practice test. However, for a huge data set, we can't spot such patterns by hand. So, let us build a Logistic Regression model to predict whether a student passes or not.

How it works:

There may be one question on everyone's mind: if it is a classification algorithm, why is it called Logistic Regression? To answer this, we need to delve deeper into the workings of the algorithm.
Logistic Regression has two steps. The first step is to derive a linear equation that is a function of our independent variables, that is, the values we know. In our problem, we know Hours Studied and Completed Practice Test, so these are our independent variables. We need a linear equation that takes these two values and returns some value Y. For our dataset, the equation comes out to approximately:

Y = 1.094 * X1 + 0.485 * X2 - 3.325    (X1 = Hours Studied, X2 = Completed Practice Test)
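To make step 1 concrete, here is a minimal sketch that plugs a few students into the approximate equation above. The function and constant names are made up for illustration, and the coefficients are simply copied from the equation rather than learned here.

```python
# Coefficients copied from the approximate equation in this post
# (illustrative names, not part of any library).
W_HOURS = 1.094      # weight for Hours Studied (X1)
W_PRACTICE = 0.485   # weight for Completed Practice Test (X2)
INTERCEPT = -3.325   # constant term


def linear_score(hours_studied, completed_practice_test):
    """Return Y from the linear equation for one student."""
    return (W_HOURS * hours_studied
            + W_PRACTICE * completed_practice_test
            + INTERCEPT)


# 3 hours of study, practice test completed
print(round(linear_score(3, 1), 3))   # 0.442
# 10 hours of study, practice test completed
print(round(linear_score(10, 1), 3))  # 8.1
# no study, no practice test
print(round(linear_score(0, 0), 3))   # -3.325
```

Notice that Y can be negative or far larger than 1, so on its own it cannot be read as a Yes/No answer.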

Seems easy enough, but then we realize that the Y values we get need not be discrete, or even lie between 0 and 1. Here is where Step 2 comes into play, the part which makes Logistic Regression a classification algorithm. We pass Y through an activation function, a special type of function that introduces non-linearity into our linear function. There are many commonly used activation functions, but Logistic Regression uses one called the Sigmoid function (also known as the logistic function), defined as:

sigmoid(Y) = 1 / (1 + e^(-Y))

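This function is generally used because it bounds the output between 0 and 1, the range of our target values. Passing Y through it gives a value between 0 and 1 that can be read as the probability of passing, which we then round to 0 or 1 (values above 0.5 become 1, the rest become 0). That is also the answer to our question: under the hood, the model still fits a linear equation, as in regression, and only this final step turns its output into a class.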

Code:

Now, let us try to implement this in Python and find whether a student who has put in 3 hours of study and attempted the practice test would pass the exam. 

from sklearn.linear_model import LogisticRegression

# Assuming X contains your input variables (hours studied and completed practice test)
X = [[2, 0],
     [1, 1],
     [3, 1],
     [5, 1],
     [4, 1],
     [2, 0]]

# Assuming y contains your output variable (pass exam)
y = [0, 0, 1, 1, 1, 0]

# Create a Logistic Regression model
model = LogisticRegression()

# Fit the model
model.fit(X, y)

# Predict the result for a student with 3 hours of study who completed the practice test
result = model.predict([[3, 1]])

print(result)

And now, the result... <insert a drum-roll here>

[1]

Phew, lucky for him, he passed the exam...
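If you are curious how confident the model was, the sketch below refits the same data and looks inside the model. It assumes the default LogisticRegression settings used in this post; the exact numbers may differ slightly between scikit-learn versions, so no output is shown.

```python
from sklearn.linear_model import LogisticRegression

# Same data as in the example: [hours studied, completed practice test]
X = [[2, 0], [1, 1], [3, 1], [5, 1], [4, 1], [2, 0]]
y = [0, 0, 1, 1, 1, 0]

model = LogisticRegression()
model.fit(X, y)

# coef_ holds one learned weight per input column,
# intercept_ holds the constant term of the linear equation.
print("weights:", model.coef_)
print("intercept:", model.intercept_)

# predict_proba returns [P(fail), P(pass)] for each row,
# in the order given by model.classes_.
probabilities = model.predict_proba([[3, 1]])
print("classes:", model.classes_)
print("probabilities:", probabilities)
```

The weights and intercept should land close to the equation from the "How it works" section, and the second probability is the sigmoid output before it gets rounded to 0 or 1.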

And that's all there is to it. You have your very own working Logistic Regression model. Even though there are numerous, highly advanced classification models available now, Logistic Regression remains a very useful algorithm for binary classification, and thanks to Scikit-Learn, we can build one in barely any time.

If you haven't yet checked out the Linear Regression post: Linear Regression

If you want to explore Scikit-Learn further: scikit-learn: machine learning in Python — scikit-learn 1.3.0 documentation









