Least Squares Regression Line

To properly dive into the least squares regression line concept, we first need to understand what regression analysis is. Regression analysis is simply a method of estimating the relationships between a dependent variable and one or more independent variables.

Linear regression is the simplest form of regression, where we assume a linear relationship between our independent and dependent variables. Hence, it is one of the simplest ways to develop a machine learning model that predicts the value of an unknown variable from its linear relationship to the independent variable(s).

Our focus is to understand the least squares regression and how to draw a least squares regression line. So we have to proceed with linear regression in mind. As a rule of thumb, a line is drawn when there is a linear relationship. You will understand what this means as you progress through this article.

Linear Regression in Machine Learning

Linear regression is used in machine learning to model linear relationships between a dependent variable and one or more independent variables. Hence we have simple linear regression, with a single independent variable, and multiple linear regression, where multiple independent variables are considered.

What is Least Squares Regression & Line of Best Fit?

Least squares is a widely used technique in regression analysis, and hence in machine learning regression models as well. The least-squares regression technique for linear regression is a mathematical method of finding the best fit line that represents the relationship between the independent variable and the corresponding dependent variable. When we say best fit, we mean minimizing the errors (the differences between real and predicted values) as much as possible.

As I just explained, the most widely used application of least squares regression is the “linear” or “ordinary” method, which is used for linear regression analysis. Its goal is to create a straight line that minimizes the total of the squares of the errors produced by the equation of the line. To calculate and minimize the errors, we use the squared residuals, where a residual is the difference between a real value and the value predicted by our model.
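To make the idea of squared residuals concrete, here is a minimal sketch in plain Python. The four points and the two candidate lines are made up purely for illustration (they are not taken from the student data set), and the helper name sum_squared_errors is my own.

```python
def sum_squared_errors(xs, ys, m, c):
    """Total of the squared residuals for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical points, for illustration only
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.0]

# Two candidate lines: one close to the points, one further away
good = sum_squared_errors(xs, ys, m=2.0, c=0.0)
poor = sum_squared_errors(xs, ys, m=1.0, c=2.0)

print("close line  :", round(good, 2))
print("distant line:", round(poor, 2))
```

The line that hugs the points ends up with a much smaller total, and that total is exactly the quantity the least squares method tries to make as small as possible.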

So what is Least Squares Regression Line?

Great! Now we have already started to embrace the core of this blog topic. The line of best fit drawn between two sets of variables by making the total of the squares of the individual errors as small as possible can be simply called the Least Squares Regression Line. This mathematical method to reduce the error is also known as the ordinary least squares method.

Are all these complicated mathematical terms starting to bother you? Let’s try to understand the concept by using a simple example. Shall we? 🙂

Least Squares Regression Line Mathematical Example

Imagine that we’ve got a data set of a bunch of student grades. So, we’ve got many variables in that data set, including Grade 3 data (G3) which are the students’ final grades, and Grade 2 data (G2) which are their second-period grades. If we have to model the relationship between G2 and G3 in order to predict G3 by using new G2 data, what should be the preferred method of doing so?

Let’s just plot the G2 vs G3 data on a scatter plot to see if they have a linear relationship or not.

Independent variable vs dependent variable on a scatterplot

As we can see most of the plot is describing a linear relationship between these variables. When we increase the G2 values, we can clearly see that G3 values also increase linearly. Therefore we can conclude that one of the best ways to create a final grade prediction model for the given scenario is by developing a linear regression model.

Before going into Python machine learning to develop such a prediction model and get our hands dirty, let’s look at the basic math behind the Least Squares Regression Line in depth.

Drawing the best fit line by the eye

Let’s assume that we want to draw a line to best fit these points. We could draw a line just judging by our eyes like this…

Trying to fit a line by eye

Unfortunately, that line can’t be very accurate, because we drew it just by judging the best fit using our eyes and minds. We need a user-independent way of creating our line, something based on solid mathematics. We do not want a personal judgment that depends heavily on the eye of the beholder.

High school equation for the line

To find such a solution we don’t have to look so far. We can mathematically define our best fit line by using the following very famous high school line equation and a little bit more… Hmm maths!

y = mx + c

Let’s adapt this line equation to our problem. The equation consists of several variables, where y is the final grade (dependent variable), x is the second-period grade (independent variable), m is the slope of the line and, finally, c is the y-intercept (where the line cuts the y-axis).

Let’s calculate the least squares regression line one step at a time!

Let’s assume that we have N points (x, y pairs) in our plot and we want to find the best fit line for that plot.

*Note that the data set I will be using to build the model in Python has 395 rows/records, meaning that we will have 395 data points for each x and y value. It’s not practical to use such a big dataset just to explain the concept. Therefore I will pick only 5 records. This will make our plot not so smooth though.*

Calculating the slope (m) of the best fit line using the least squares method

1. Pick 5 corresponding values for G2 (x) and G3 (y).

2. Calculate xy and x² for each of those points.

3. Get the values for Σx, Σy, Σx² and Σxy (the sums of the x values, y values, x² values and xy values).

Σx = 48, Σy = 51.5, Σx² = 546 and Σxy = 578.5

N is equal to 5, as we have 5 data points here.

m = (N * Σxy − Σx * Σy) / (N * Σx² − (Σx)²)

= (5 * 578.5 − 48 * 51.5) / (5 * 546 − 48 * 48)

= (2,892.5 − 2,472) / (2,730 − 2,304)

= 420.5 / 426

m = 0.987
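If you want to double-check the arithmetic, the same slope calculation can be sketched in a few lines of Python. The sums are the ones used in the calculation above; the variable names are my own.

```python
# Sums from the worked example (5 data points)
n = 5
sum_x, sum_y = 48, 51.5
sum_x2, sum_xy = 546, 578.5

# Slope of the least squares line: (N*Σxy - Σx*Σy) / (N*Σx² - (Σx)²)
numerator = n * sum_xy - sum_x * sum_y
denominator = n * sum_x2 - sum_x ** 2
m = numerator / denominator

print(numerator, denominator)  # 420.5 426
print(round(m, 3))             # 0.987
```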

Calculating the y-intercept (c) for the best fit line

c = (Σy − m * Σx) / N

= (51.5 − 0.987 * 48) / 5

= 4.124 / 5

c = 0.825
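Here is a small sketch that wraps both calculations into one reusable function. The function name least_squares_line and the three test points are hypothetical; only the sums 48, 51.5 and N = 5 come from the example above. Note that I plug in the rounded slope 0.987 to reproduce the hand calculation; with the unrounded slope the intercept comes out closer to 0.824.

```python
def least_squares_line(xs, ys):
    """Return (m, c) of the least squares line through the given points."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c


# Intercept from the worked example, using the rounded slope
c_example = (51.5 - 0.987 * 48) / 5
print(round(c_example, 3))  # 0.825

# Sanity check on made-up points that lie exactly on y = 2x + 1
m, c = least_squares_line([1, 2, 3], [3, 5, 7])
print(m, c)  # 2.0 1.0
```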

Final line equation for our model

Now that we have the previously missing m and c values, we can calculate the points for the best fit line using the y = mx + c equation and draw the line.

Let’s plug the values of m and c into our equation.

y = 0.987x + 0.825

We call this y value ŷ (y hat). It means the predicted y value, or the predicted dependent variable.

ŷ = 0.987x + 0.825

Calculating the values of the points of our Least Squares Regression Line

Let’s plot the points and the Least Squares Regression Line in a scatterplot now. Red points represent x,y points which are our G2 and G3 values respectively. Blue points represent x,ŷ points which are G2 and predicted G3 values respectively.

Least squares regression line calculated on a scatterplot

There we have it! Finally, we have the Least Squares Regression Line calculated and drawn on our scatterplot.

Why do we call it a least squares regression line?

We call it so because, in creating the best fit line, we keep the total of the squares of all the errors as small as possible. In other words, we square all the individual errors, add them up, and choose the line that makes that total as small as possible.

Differences between y and ŷ or the error of the linear regression model
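As a quick sanity check of the “least squares” claim, the sketch below fits a line with the formulas from this article and then nudges the slope and the intercept a little. The five points are hypothetical (not from the student data set) and the helper names fit_line and sse are my own.

```python
def fit_line(xs, ys):
    """Least squares slope and intercept, using the formulas from this article."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    c = (sum_y - m * sum_x) / n
    return m, c


def sse(xs, ys, m, c):
    """Total of the squared errors for the line y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical grades, for illustration only
xs = [6, 8, 10, 12, 14]
ys = [5, 9, 10, 13, 14]

m, c = fit_line(xs, ys)
best = sse(xs, ys, m, c)

# Every nudged line should have a larger total squared error
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.5), (0.0, -0.5)]
worse = [sse(xs, ys, m + dm, c + dc) for dm, dc in nudges]
print(all(total > best for total in worse))  # True
```

Whichever way we move the line away from the fitted one, the total of the squared errors grows, which is what makes the fitted line the best fit in the least squares sense.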

Now, do you see why we need a mathematical method for generating the best fit line for our data rather than just drawing a line by eye? 🙂

Before getting into the nitty-gritty of developing a linear regression machine learning model in Python and calculating the least squares regression line, let’s try to predict a y (G3) value for a new x (G2) value using our best fit line formula.

Predicting a y value for a new x value using our model

Let’s assume that a student called Michael scored 7.5 in his G2. What final grade can we expect for him according to our model?

Let’s plug in the data to our ŷ = 0.987x + 0.825 equation.

x is 7.5, so

ŷ = 0.987x + 0.825

= 0.987 * 7.5 + 0.825

ŷ = 8.227

This tells us that we can expect Michael’s final grade to be 8.2 (if we round that number to one decimal place).
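The same prediction can be written as a tiny Python function. The name predict_g3 is just my choice for this sketch; the slope and intercept are the values we calculated by hand above.

```python
def predict_g3(g2, m=0.987, c=0.825):
    """Predicted final grade (y hat) for a second-period grade g2."""
    return m * g2 + c


michael = predict_g3(7.5)
print(round(michael, 1))  # 8.2
```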

Implementing a Linear Regression Model in Python & drawing the Least Squares Regression Line

Hmm, now it’s time for us to move into the interesting coding stuff. Developing machine learning models in Python is very exciting, especially with the machine learning support packages and libraries.

The best thing about machine learning with Python is that it has so many mathematical libraries and packages that we can use to simplify our code. For example, we can let NumPy, pandas, and scikit-learn do the heavy lifting, the complicated mathematical coding, and the data preparation, simply by calling their modules and functions in our code. By doing so, we both simplify our code and reduce development time.

*I use PyCharm for my Python coding. It’s a personal preference. You can use whichever Python IDE you are used to and comfortable with. It doesn’t matter. You can even use Jupyter notebooks in your browser.*

Importing required libraries and packages into python

Let’s import NumPy, pandas and scikit-learn. We need matplotlib as well for drawing graphs and charts (here, for the scatterplot and for plotting the least squares regression line). One of the easiest ways to install these useful machine learning libraries and packages at once is by installing the Anaconda distribution on your computer.

Do you already have Anaconda on your computer, or aren’t you sure which version it is? Then read this article: How to Check Anaconda Version in Windows?

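import numpy as np
import pandas as pd
from matplotlib import pyplot as pt
import sklearn
# Import the submodules explicitly so that sklearn.linear_model and
# sklearn.model_selection are both available later on
from sklearn import linear_model, model_selection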

Downloading and extracting the CSV dataset into our project folder

I almost forgot to mention that we need to download our dataset called Student Performance Data Set from the UCI Machine Learning Repository.

After downloading the student zip folder, extract the files. Now move the student-mat.csv file to your Python project folder. This makes our job a little easier; otherwise, we would have to define the path to the CSV file within our code.

saving the CSV dataset in the python project folder in pycharm

Reading the CSV using Pandas

Now let’s read the CSV file using pandas and save the data into a variable called data.

data = pd.read_csv("student-mat.csv", sep=";")

Strangely, this data set doesn’t use commas to separate its values. It uses semicolons instead. So we have to tell pandas that “;” is the separator. That’s the purpose of sep=";" in the code.
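To see what the sep argument changes, here is a small sketch that reads a made-up, two-row sample from memory instead of the real file. The values are hypothetical; only the column names G1, G2 and G3 and the semicolon layout mirror the real data set.

```python
import io

import pandas as pd

# Hypothetical sample in the same semicolon-separated layout
text = "G1;G2;G3\n5;6;6\n7;8;9\n"

# With sep=";" pandas splits each row into three columns
data = pd.read_csv(io.StringIO(text), sep=";")
print(list(data.columns))   # ['G1', 'G2', 'G3']

# With the default comma separator everything lands in a single column
wrong = pd.read_csv(io.StringIO(text))
print(list(wrong.columns))  # ['G1;G2;G3']
```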

Now let’s set that variable to define only the G2 and G3 values.

data = data[["G2", "G3"]]

Determining if we have a linear relationship between our dependent and independent variables

Let’s populate G2 and G3 values on a scatterplot to see if there’s a linear connection between them. If so, developing a linear model and plotting a least squares regression line for these variables makes sense.

pt.scatter(data.G2, data.G3, color='blue')
pt.xlabel("G2-")
pt.ylabel("G3-")
pt.show()
A linear relationship between x and y variables

Now that we know that the relationship is linear we can proceed with the rest of the code.
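If you would like a number to back up what the eye sees on the scatterplot, one option (my own addition, not a step in this tutorial’s code) is the correlation coefficient from NumPy’s corrcoef function. A value close to 1 or -1 suggests a strong linear relationship. The grades below are hypothetical stand-ins for data.G2 and data.G3.

```python
import numpy as np

# Hypothetical stand-ins for data.G2 and data.G3
g2 = np.array([5, 8, 10, 12, 15])
g3 = np.array([5, 9, 10, 13, 15])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(g2, g3)[0, 1]
print(round(r, 2))
```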

Putting our x and y data in NumPy Arrays

Let’s store the name of the column we want to predict, G3, in a variable called “predictable”.

predictable = "G3"

Let’s define the x and y variables for the model. Python’s built-in lists are not designed for numerical work, so we use NumPy to create arrays and store the x and y values.

# X holds every column except G3 (here only G2); Y holds the G3 values
X = np.array(data.drop([predictable], axis=1))
Y = np.array(data[predictable])
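It helps to look at the shapes of these two arrays, because scikit-learn expects X to be two-dimensional (rows and columns) and Y to be one-dimensional. The sketch below uses a tiny hypothetical DataFrame in place of the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real G2/G3 data
data = pd.DataFrame({"G2": [6, 8, 10], "G3": [6, 9, 11]})
predictable = "G3"

X = np.array(data.drop(columns=[predictable]))  # same effect as axis=1
Y = np.array(data[predictable])

print(X.shape)  # (3, 1): three rows, one feature column
print(Y.shape)  # (3,): one target value per row
```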

Setting our training and testing data sets using scikit-learn’s train_test_split function

Let’s create four more arrays out of our data (from X and Y) to train and test the model. By running this code we get 90% of the data randomly selected for the training sets and 10% for the test sets.

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.1)
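To get a feel for the sizes involved, here is a sketch that splits 395 placeholder rows (the arrays are just counters, not real grades) with the same test_size.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with as many rows as the student data set
X = np.arange(395).reshape(-1, 1)
Y = np.arange(395)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.1)

# Roughly 90% of the rows go to training and 10% to testing
print(len(x_train), len(x_test))
```

The rows are shuffled before the split, so x_train and y_train stay aligned with each other but come out in a random order.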

Declaring our linear regression model

We can use the following code to create our linear regression model, using the LinearRegression class provided by the linear_model module in scikit-learn. I’m choosing nnl_leastsqureregression as my model name.

nnl_leastsqureregression = linear_model.LinearRegression()

Training our model using the training data

Now we can train the model using the training portion of our dataset, as shown below.

nnl_leastsqureregression.fit(x_train, y_train)

Now, at this point, our linear regression model has found the Least Squares Regression Line, which is the best fit line for our training data. By default, it has used the least-squares method we just learned above, to fit the line to our training data. The method has minimized the total of the squares of the individual errors of the data points.
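If you want to convince yourself that scikit-learn arrives at the same line as the hand formulas, a sketch like the following compares the two. The five points are hypothetical, and the variable names are my own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical grades, for illustration only
x = np.array([6.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([5.0, 9.0, 10.0, 13.0, 14.0])
n = len(x)

# Slope and intercept from the formulas used earlier in this article
m = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
c = (np.sum(y) - m * np.sum(x)) / n

# The same line fitted by scikit-learn (X must be 2-D, hence the reshape)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(np.isclose(model.coef_[0], m))     # True
print(np.isclose(model.intercept_, c))   # True
```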

Getting the slope and intercept of our least squares regression line

Now we can get the slope and the y-intercept of the line by running the following lines of code.

print("slope :", nnl_leastsqureregression.coef_)

print("intercept :", nnl_leastsqureregression.intercept_)

In case you are wondering: yes, the coefficient is the same as the slope of the regression line.

Calculating the coefficient and intercept of the scikit-learn linear regression model

The slope of the line is 1.09879931

The y-intercept is -1.3262551084047196

Well, I know that these numbers are too long. But that’s just because we didn’t tell the program to round them to our preferred number of decimal places.
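Rounding is a one-liner. The sketch below uses the two values printed above as plain numbers; in the real script you would pass the model’s coef_ and intercept_ instead (coef_ is a NumPy array, so np.round works on it too).

```python
# The values reported above, copied as plain floats
slope = 1.09879931
intercept = -1.3262551084047196

# Either round() or an f-string format spec does the job
print("slope :", round(slope, 3))        # slope : 1.099
print(f"intercept : {intercept:.3f}")    # intercept : -1.326
```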

Now, you may wonder why these values are different from the example that we used to understand the concept at the beginning. Let me tell you why. This is a real data set of 395 student records. And that example only had 5 records… Simple!

One more thing about the coefficient and intercept values. It’s normal for each one of you to get slightly different values for them. We all get a different data split each time we run the code, simply because scikit-learn’s train_test_split function selects the training and test rows at random. 🙂
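If you would rather get the same split (and therefore the same slope and intercept) on every run, train_test_split accepts a random_state argument. The code in this article does not set it; the sketch below, on placeholder arrays, only shows the effect. The value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data, not real grades
X = np.arange(20).reshape(-1, 1)
Y = np.arange(20)

# Two separate calls with the same seed produce identical splits
first = train_test_split(X, Y, test_size=0.1, random_state=42)
second = train_test_split(X, Y, test_size=0.1, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```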

Plotting the least squares regression line on top of our training x and y data

Let’s actually plot the Least Squares Regression Line on a scatterplot now, shall we? We can plot the line on top of our training data.

pt.scatter(x_train, y_train,  color='blue')
pt.plot(x_train, nnl_leastsqureregression.coef_*x_train + nnl_leastsqureregression.intercept_, '-r')
pt.xlabel("G2-")
pt.ylabel("G3-")
pt.show()
Least squares regression line is plotted on a scatterplot
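The y values of the red line above are computed by hand as coef_ * x + intercept_. The model’s predict method should give the same numbers, and it is also the convenient way to predict a grade for a new student. Here is a sketch on hypothetical training data (the variable names are mine).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data in the same shapes as x_train and y_train
x_train = np.array([[6.0], [8.0], [10.0], [12.0], [14.0]])
y_train = np.array([5.0, 9.0, 10.0, 13.0, 14.0])

model = LinearRegression().fit(x_train, y_train)

# Line values computed by hand versus the model's own predictions
manual = (model.coef_ * x_train + model.intercept_).ravel()
predicted = model.predict(x_train)
print(np.allclose(manual, predicted))  # True

# Predicting for a new G2 value; the input must be 2-D, one row per student
new_grade = model.predict(np.array([[7.5]]))
print(new_grade.shape)  # (1,)
```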

Conclusion

The least squares method is one of the most popular and widely used mathematical methods for drawing the best fit line in linear regression models. The resulting line is called the Least Squares Regression Line. The calculation gets tedious over huge datasets, but we can rely on scikit-learn to do it for us.

Learn more about scikit-learn by heading over to its official site.
