To properly dive into the least squares regression line, we first need to understand what regression analysis is. Regression analysis is a method of estimating the relationship between a dependent variable and one or more independent variables.
Linear regression is the simplest form of regression. It assumes a linear relationship between the independent and dependent variables. That also makes it one of the simplest ways to build a machine learning model that predicts the value of an unknown variable from its linear relationship to the independent variable(s).
Our focus is to understand the least squares regression and how to draw a least squares regression line. So we have to proceed with linear regression in mind. As a rule of thumb, a line is drawn when there is a linear relationship. You will understand what this means as you progress through this article.
Linear Regression in Machine Learning
Linear regression is used in machine learning to model linear relationships between a dependent variable and one or more independent variables. Hence we have simple linear regression, with a single independent variable, and multiple linear regression, where several independent variables are considered.
What is Least Squares Regression & Line of Best Fit?
Least squares is a widely used technique in regression analysis, and therefore in machine learning regression models as well. For linear regression, it is a mathematical method of finding the best fit line that represents the relationship between the independent variable and the corresponding dependent variable. By “best fit” we mean the line that makes the errors (the differences between real and predicted values) as small as possible.
The most widely used form of least squares regression is the “linear” or “ordinary” method, which is the one used for linear regression analysis. Its goal is to find the straight line that minimizes the total of the squares of the errors. Each error, also called a residual, is the difference between a real value and the value our line predicts for it; we square every residual and add the squares up.
So what is Least Squares Regression Line?
Great! Now we have already started to embrace the core of this blog topic. The line of best fit drawn between two sets of variables by making the total of the squares of the individual errors as small as possible can be simply called the Least Squares Regression Line. This mathematical method to reduce the error is also known as the ordinary least squares method.
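To make “the total of the squares of the individual errors” concrete, here is a minimal Python sketch. The five points and the two candidate lines are hypothetical values chosen only for illustration (they are not from the student data set), and the function name sum_of_squared_errors is my own.

```python
def sum_of_squared_errors(xs, ys, m, c):
    """Add up the squared differences between each real y and the line's prediction m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Two candidate lines: a rough guess, and the least squares line for these points.
sse_guess = sum_of_squared_errors(xs, ys, m=1.0, c=1.0)
sse_best = sum_of_squared_errors(xs, ys, m=0.6, c=2.2)

print("rough guess  :", sse_guess)
print("least squares:", sse_best)
```

The least squares line ends up with the smaller total (about 2.4 against 4.0 for the rough guess), which is exactly what “best fit” means here.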
Are all these mathematical terms starting to bother you? Let’s try to understand the concept with a simple example. Shall we? 🙂
Least Squares Regression Line Mathematical Example
Imagine that we’ve got a data set of a bunch of student grades. So, we’ve got many variables in that data set, including Grade 3 data (G3) which are the students’ final grades, and Grade 2 data (G2) which are their second-period grades. If we have to model the relationship between G2 and G3 in order to predict G3 by using new G2 data, what should be the preferred method of doing so?
Let’s just plot the G2 vs G3 data on a scatter plot to see if they have a linear relationship or not.
As we can see, most of the points follow a linear trend. As the G2 values increase, the G3 values increase roughly linearly too. We can therefore conclude that a linear regression model is a good choice for predicting final grades in this scenario.
Before we get our hands dirty developing such a prediction model in Python, let’s look in depth at the basic math behind the Least Squares Regression Line.
Drawing the best fit line by the eye
Let’s assume that we want to draw a line to best fit these points. We could draw a line just judging by our eyes like this…
Unfortunately, that line can’t be very accurate, because we drew it by judging the best fit with our eyes and minds. We need a user-independent way of creating our line, something based on solid mathematics. We do not want a personal judgment that depends heavily on the eye of the beholder.
High school equation for the line
To find such a solution we don’t have to look far. We can define our best fit line mathematically by using the famous high school line equation and a little bit more… Hmm, maths!
y = mx + c
Let’s adapt this line equation to our problem. Here y is the final grade (dependent variable), x is the second-period grade (independent variable), m is the slope of the line, and c is the y-intercept (where the line cuts the y-axis).
Let’s calculate the least squares regression line one step at a time!
Let’s assume that we have N points (x, y pairs) in our plot and we want to find the best fit line for that plot.
*Note that the data set I will be using to build the model in Python has 395 rows/records, meaning that we will have 395 data points for x and y. It’s not practical to use such a big data set just to explain the concept, so I will pick only 5 records. This will make our plot less smooth, though.*
Calculating the slope (m) of the best fit line using the least squares method
1. Pick 5 corresponding values for G2 (x) and G3 (y).
2. Calculate xy and x² for each of those points.
3. Get values for Σx, Σy, Σx² and Σxy (the sums of the x values, y values, x² values and xy values).
N is equal to 5, as we have 5 data points here.
m = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²)
= (5 * 578.5 − 48 * 51.5) / (5 * 546 − 48 * 48)
= (2,892.5 − 2,472) / (2,730 − 2,304)
= 420.5 / 426
m = 0.987
Calculating the y-intercept (c) for the best fit line
c = (Σy − mΣx) / N
= (51.5 − 0.987 * 48) / 5
= 4.124 / 5
c = 0.825
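The arithmetic above can be checked with a few lines of Python. This is only a sketch that reuses the sums from the worked example (Σx = 48, Σy = 51.5, Σxy = 578.5, Σx² = 546); the variable names are my own.

```python
# Sums taken from the worked example above (5 data points).
n = 5
sum_x, sum_y = 48, 51.5
sum_xy, sum_x2 = 578.5, 546

# Slope and y-intercept of the least squares regression line.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
c = (sum_y - m * sum_x) / n

print(round(m, 3), round(c, 3))  # 0.987 0.824
```

Note that Python carries m at full precision (0.98708…) into the second formula, so it reports c as 0.824. The hand calculation rounds m to 0.987 first, which is why it lands on 0.825. The difference is only a rounding effect.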
Final line equation for our model
Now that we have the missing m and c values, we can calculate the points for the best fit line using the y = mx + c equation and draw the line.
Let’s plug the values of m and c into our equation
y = 0.987x + 0.825
We can call this y value y hat (ŷ). It means the predicted y value, or the predicted dependent variable.
ŷ = 0.987x + 0.825
Calculating the values of the points of our Least Squares Regression Line
Let’s plot the points and the Least Squares Regression Line in a scatterplot now. Red points represent x,y points which are our G2 and G3 values respectively. Blue points represent x,ŷ points which are G2 and predicted G3 values respectively.
There we have it! Finally, we have the Least Squares Regression Line calculated and drawn on our scatterplot.
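The whole procedure, from raw points to the ŷ values of the line, fits into one small function. The sketch below is an illustration, not the code used for the plot above: the function name least_squares_line and the five points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return the slope m and y-intercept c of the least squares regression line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c


# Hypothetical (x, y) points, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = least_squares_line(xs, ys)

# The predicted value (y hat) for every x: these are the points on the line.
y_hat = [m * x + c for x in xs]

for x, y, yh in zip(xs, ys, y_hat):
    print(f"x={x}  y={y}  y_hat={yh:.2f}")
```

For these points the function returns a slope of about 0.6 and an intercept of about 2.2. Plotting the (x, ŷ) pairs would give the straight line, and the (x, y) pairs would scatter around it.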
Why do we call it a least squares regression line?
We call it so because, in creating the best fit line, we keep the total of the squares of all the errors as small as possible. In other words, we square each individual error, add the squares up, and choose the line that makes that total the smallest.
Now, do you see why we need a mathematical method for generating the best fit line for our data rather than just drawing a line by eye? 🙂
Before getting into the nitty-gritty of developing a linear regression machine learning model in Python and calculating the least squares regression line, let’s try to predict a y (G3) value for a new x (G2) value using our best fit line formula.
Predicting a y value for a new x value using our model
Let’s assume that a student called Michael scored 7.5 in his G2. What final grade can we expect from him according to our model?
Let’s plug in the data to our ŷ = 0.987x + 0.825 equation.
x is 7.5, so
ŷ = 0.987x + 0.825
= 0.987 * 7.5 + 0.825
ŷ = 8.227
This tells us that we can expect Michael’s final grade to be 8.2 (rounded to one decimal place).
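The same prediction can be written as a tiny Python function. This is a sketch that hard-codes the m and c values from the 5-point example; the name predict_g3 is my own.

```python
def predict_g3(g2, m=0.987, c=0.825):
    """Predict a final grade (G3) from a second-period grade (G2) with y_hat = m*x + c."""
    return m * g2 + c


michael = predict_g3(7.5)
print(round(michael, 1))  # 8.2
```

Any other G2 value can be passed in the same way, for example predict_g3(12).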
Implementing a Linear Regression Model in Python & drawing the Least Squares Regression Line
Hmm, now it’s time for us to move into the interesting coding stuff. Developing machine learning models in Python is very exciting, especially with the machine learning support packages and libraries.
The best thing about machine learning with Python is that it has so many mathematical libraries and packages that we can use to simplify our code. For example, we can let NumPy, pandas, and scikit-learn do the heavy lifting, the complicated mathematical coding, and the data preparation simply by calling their modules and functions from our code. By doing so, we both simplify our code and reduce development time.
*I use PyCharm for my Python coding. It’s a personal preference. You can use whichever Python IDE you are used to and comfortable with. You can even use Jupyter notebooks in your browser.*
Importing required libraries and packages into python
Let’s import NumPy, pandas, and scikit-learn. We need matplotlib as well for drawing graphs and charts (here, for the scatterplot and for plotting the least squares regression line). One of the easiest ways to install these machine learning libraries and packages at once is to install the Anaconda distribution on your computer.
Do you already have Anaconda on your computer? Or aren’t you sure which version it is? Then read this article: How to Check Anaconda Version in Windows?
import numpy as np
import pandas as pd
from matplotlib import pyplot as pt
import sklearn
import sklearn.model_selection  # makes sklearn.model_selection.train_test_split available
from sklearn import linear_model
Downloading and extracting the CSV dataset into our project folder
I almost forgot to mention that we need to download our dataset called Student Performance Data Set from the UCI Machine Learning Repository.
After downloading the student zip folder, extract the files. Now move the student-mat.csv file to your Python project folder. This makes our job a little easier; otherwise, we would have to define the path to the CSV file within our code.
Reading the CSV using Pandas
Now let’s read the CSV file using pandas and save the data into a variable called data.
data = pd.read_csv("student-mat.csv", sep=";")
Strangely, this data set doesn’t use commas to separate its values; it uses semicolons instead. So we have to tell pandas that “;” is the separator. That’s the purpose of sep=";" in the code.
Now let’s keep only the G2 and G3 columns in that variable.
data = data[["G2", "G3"]]
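To see what sep=";" actually changes, here is a small self-contained sketch. It reads a few made-up rows from an in-memory string instead of the real student-mat.csv, so the values are hypothetical and only the layout mimics the file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated rows (not real student records).
raw = "G1;G2;G3\n5;6;6\n9;10;11\n15;14;15\n"

# Without sep=";" pandas looks for commas and finds a single wide column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=";" the three columns are split correctly.
right = pd.read_csv(io.StringIO(raw), sep=";")

# Keep only the two columns we need, as in the article.
data = right[["G2", "G3"]]

print(wrong.shape)   # (3, 1)
print(right.shape)   # (3, 3)
print(data.columns.tolist())
```

If the first shape printed for your own file has only one column, a wrong separator is a likely cause.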
Determining if we have a linear relationship between our dependent and independent variables
Let’s populate G2 and G3 values on a scatterplot to see if there’s a linear connection between them. If so, developing a linear model and plotting a least squares regression line for these variables makes sense.
pt.scatter(data.G2, data.G3, color='blue')
pt.xlabel("G2-")
pt.ylabel("G3-")
pt.show()
Now that we know that the relationship is linear we can proceed with the rest of the code.
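Judging linearity from a scatterplot is still a judgment by eye, so it can help to back it up with a number. One common option, which is my addition and not part of the original walkthrough, is the Pearson correlation coefficient: values close to 1 or -1 suggest a strong linear relationship. The grades below are hypothetical.

```python
import numpy as np

# Hypothetical G2 and G3 values, for illustration only.
g2 = np.array([6, 8, 10, 12, 14])
g3 = np.array([6, 9, 10, 13, 15])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(g2, g3)[0, 1]

print("Pearson r:", round(float(r), 3))
```

With the real DataFrame, data["G2"].corr(data["G3"]) should give the same kind of figure.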
Putting our x and y data in NumPy Arrays
Let’s store the column name G3 in a variable called “predictable”.
predictable = "G3"
Let’s define x and y variables for the model. Python’s built-in lists are not designed for numerical work, so we use NumPy to create arrays and store the x and y values.
X = np.array(data.drop(columns=[predictable]))  # every column except G3, i.e. G2
Y = np.array(data[predictable])
Setting our training and testing data sets using scikit-learn’s train_test_split function
Let’s create four more arrays out of our data (from X and Y) to train and test the model. By running this code, we get 90% of the data randomly selected for the training sets and 10% for the test sets.
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.1)
Declaring our linear regression model
We can use the following code to create our linear regression model, using the LinearRegression class provided by the linear_model module in scikit-learn. I’m choosing nnl_leastsqureregression as my model name.
nnl_leastsqureregression = linear_model.LinearRegression()
Training our model using the training data
Now we can train the model using the training portion of our data set, as shown below.
nnl_leastsqureregression.fit(x_train, y_train)
Now, at this point, our linear regression model has found the Least Squares Regression Line, which is the best fit line for our training data. By default, it has used the least-squares method we just learned above, to fit the line to our training data. The method has minimized the total of the squares of the individual errors of the data points.
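To check that the model really finds the same line as the formulas from the first half of the article, we can fit both on a tiny data set and compare. This is a sketch on hypothetical points (not the student data), and the variable names are my own.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical points, for illustration only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# scikit-learn expects a 2D array of features, hence the reshape.
model = linear_model.LinearRegression()
model.fit(x.reshape(-1, 1), y)

# The same line from the hand formulas for the slope and y-intercept.
n = len(x)
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n

print("scikit-learn:", model.coef_[0], model.intercept_)
print("by formula  :", m, c)
```

Both lines of output should agree up to floating point noise.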
Getting the slope and intercept of our least squares regression line
Now we can get the slope and the y-intercept of the line by running the following lines of code.
print("slope :", nnl_leastsqureregression.coef_)
print("intercept :", nnl_leastsqureregression.intercept_)
In case you are wondering: yes, the coefficient is the same as the slope of the regression line.
The slope of the line is 1.09879931
The y-intercept is -1.3262551084047196
Well, I know that these numbers are long. That’s just because we didn’t tell the program to round them to our preferred number of decimal places.
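Rounding for display takes one extra step. The sketch below uses the two values printed above as plain numbers. With the fitted model you would pass coef_[0] and intercept_ instead.

```python
# Values copied from the output above.
slope = 1.09879931
intercept = -1.3262551084047196

# Two equivalent ways to limit the output to three decimal places.
print("slope :", round(slope, 3))         # slope : 1.099
print(f"intercept : {intercept:.3f}")     # intercept : -1.326
```

Rounding only changes what is printed; the model keeps the full-precision values.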
Now, you may wonder why these values are different from the example that we used to understand the concept at the beginning. Let me tell you why. This is a real data set of 395 student records. And that example only had 5 records… Simple!
One more thing about the coefficient and intercept values. It’s normal for each of you to get slightly different values, because scikit-learn’s train_test_split function splits the data randomly, so we all get a different split each time we run the code. 🙂
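If you want the same split, and therefore the same slope and intercept, on every run, train_test_split accepts a random_state argument. The sketch below uses hypothetical arrays, and the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows with a single feature.
X = np.arange(20).reshape(-1, 1)
Y = np.arange(20)

# Fixing random_state makes the random split repeatable.
split_a = train_test_split(X, Y, test_size=0.1, random_state=42)
split_b = train_test_split(X, Y, test_size=0.1, random_state=42)

x_train, x_test, y_train, y_test = split_a
print(len(x_train), len(x_test))  # 18 2

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Leaving random_state out, as the article’s code does, gives a fresh random split each time.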
Plotting the least squares regression line on top of our training x and y data
Let’s actually plot the Least Squares Regression Line on scatterplot now, shall we? We can plot the line on top of our training data.
pt.scatter(x_train, y_train, color='blue')
pt.plot(x_train, nnl_leastsqureregression.coef_ * x_train + nnl_leastsqureregression.intercept_, '-r')
pt.xlabel("G2-")
pt.ylabel("G3-")
pt.show()
The least squares method is one of the most popular and widely used mathematical methods for drawing the best fit line in linear regression models. The resulting line is called the Least Squares Regression Line. The math gets a bit complex over huge data sets, but we can rely on scikit-learn to do the calculation for us.
Learn more about scikit-learn by heading over to their official site.