auc sklearn with practical example

AUC in sklearn is a metric for assessing a binary classifier’s quality. It measures the area under the ROC (receiver operating characteristic) curve, abbreviated “AUC”, to quantify how well a supervised classifier can distinguish between the positive and negative classes. The AUC ranges from 0 to 1: a value of 1 means perfect separation, 0.5 means the classifier ranks examples no better than random guessing, and values below 0.5 mean it ranks them worse than random (its predictions are systematically inverted). AUC is a useful tool for a data scientist, both as a performance measure of a classifier’s quality and as a guide for model improvement.
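One way to make the definition above concrete: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks this against roc_auc_score. The labels and scores are made up for illustration and do not come from the example later in this post.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and scores, invented only for this illustration.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

pos = y_score[y_true == 1]  # scores given to positive examples
neg = y_score[y_true == 0]  # scores given to negative examples

# Compare every positive score with every negative score.
# A positive ranked above a negative counts 1, a tie counts 0.5.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sklearn_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sklearn_auc)
```

Both numbers should agree, which is why AUC is often read as "the probability that a randomly chosen positive is ranked above a randomly chosen negative".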

Having trouble with sklearn? You might want to check out this article. ModuleNotFoundError: No module named ‘sklearn’.

auc sklearn

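There are a few things to note about AUC in sklearn. First, the AUC describes performance only on the data it is computed from: an AUC computed on the training set is usually optimistic, so to estimate how well the classifier generalizes it should be computed on a held-out test set, as in the example below. Second, the AUC is not always reliable, especially if the data set is small, because the estimate can vary considerably from one sample to another. Third, the AUC summarizes performance across all decision thresholds rather than at a single one, so it does not tell you how the classifier behaves at the particular threshold you end up using.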
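To get a feel for how shaky the AUC can be on a small data set, here is a minimal sketch that puts a percentile bootstrap interval around it. The helper bootstrap_auc_interval is a hypothetical name of my own, not an sklearn function, and the data is synthetic noise generated just for this comparison.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_interval(y_true, y_score, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC (illustrative helper, not part of sklearn)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUC is undefined when a resample contains only one class
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Synthetic labels with weakly informative scores, in a small and a large sample.
rng = np.random.default_rng(37)
y_small = rng.integers(0, 2, size=30)
s_small = 0.5 * y_small + rng.normal(size=30)
y_large = rng.integers(0, 2, size=3000)
s_large = 0.5 * y_large + rng.normal(size=3000)

small_interval = bootstrap_auc_interval(y_small, s_small)
large_interval = bootstrap_auc_interval(y_large, s_large)
print("n=30:  ", small_interval)
print("n=3000:", large_interval)
```

The interval for the 30-sample set should come out much wider than the one for the 3000-sample set, which is the practical meaning of "not always reliable" here.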

Despite its limitations, AUC is a valuable tool for assessing a classifier’s quality.  It is a good starting point for improving a classifier and can help data scientists identify areas for improvement.

Example: auc sklearn

Now let’s take a look at a good example to understand the concept behind the auc sklearn.

Importing necessary python machine learning libraries

We need to import make_classification, train_test_split, roc_curve, roc_auc_score, and matplotlib for this example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

Creating arbitrary data for auc sklearn example

We can create some imaginary data using sklearn for this example; we will call it arbitrary data. We create it with sklearn's make_classification function, as shown below.

Generating an arbitrary dataset with two classes

X, y = make_classification(n_samples=2000, n_classes=2, n_features=40, random_state=37)

Splitting the dataset into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=37)

Testing the arbitrary dataset using a machine learning classifier

To calculate the AUC in practice, we can create a machine learning model and train it on our dataset. Then we can use its predicted probabilities to see how the sklearn AUC functions are used.

Importing logistic regression model from sklearn

I’m going to import the logistic regression model from the sklearn.linear_model module.

from sklearn.linear_model import LogisticRegression

Since we are talking about logistic regression, you might want to check out this article about linear regression: Least Squares Regression Line.

Creating the logistic regression model

LogRegModel = LogisticRegression()

Training the logistic regression model with the arbitrary data

LogRegModel.fit(X_train, y_train)

Predicted values

pred_val = LogRegModel.predict_proba(X_test)
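It helps to look at what predict_proba actually returns before slicing it. The sketch below rebuilds the same data and model as above so it runs on its own. The names model, proba, pos_col and pos_scores are mine, chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Same data and split as in the walkthrough above.
X, y = make_classification(n_samples=2000, n_classes=2, n_features=40, random_state=37)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=37)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)

print(proba.shape)     # one row per test sample, one column per class
print(model.classes_)  # the column order of proba follows classes_

# Look up the column that belongs to the positive class (label 1)
# instead of assuming it is always the second one.
pos_col = list(model.classes_).index(1)
pos_scores = proba[:, pos_col]
```

Each row of proba sums to 1, and it is the positive-class column that gets passed to roc_curve and roc_auc_score in the next steps. That is what pred_val[:,1] selects.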

Computing the ROC curve for our logistic regression model and returning the fpr, tpr and threshold values

fpr, tpr, threshold = roc_curve(y_test, pred_val[:,1], pos_label=1)

Calculating the curve where tpr = fpr (true positive rate = false positive rate)

random_pred_val = [0 for i in range(len(y_test))]
p_fpr1, p_tpr1, _ = roc_curve(y_test, random_pred_val, pos_label=1)
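To see what the fpr, tpr and threshold arrays from roc_curve mean, we can recompute the two rates by hand at each returned threshold. This is a minimal sketch on made-up labels and scores. The helper rates_at is a hypothetical name of my own, not an sklearn function.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores, invented only for this illustration.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90])

def rates_at(threshold):
    """Return (fpr, tpr) when every score >= threshold is predicted positive."""
    pred_pos = y_score >= threshold
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return fp / np.sum(y_true == 0), tp / np.sum(y_true == 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)

# Recompute each point of the curve by hand and collect the results.
manual = np.array([rates_at(th) for th in thresholds])
print(np.column_stack([thresholds, fpr, tpr]))
print(manual)
```

Each point on the ROC curve is just the (fpr, tpr) pair you get from one choice of threshold. The first threshold returned by roc_curve sits above every score, so nothing is predicted positive and both rates are 0.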

Calculating the auc scores

auc_score = roc_auc_score(y_test, pred_val[:,1])

Let’s print the AUC score out

print(auc_score)

Here we have an AUC score of about 0.95, which is close to 1.
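sklearn also has a function that is literally called auc: sklearn.metrics.auc takes the x and y coordinates of a curve and integrates it with the trapezoidal rule. Feeding it the fpr and tpr from roc_curve should give the same number as roc_auc_score. The sketch below uses synthetic labels and scores so that it runs on its own.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Synthetic labels and noisy scores, generated only for this illustration.
rng = np.random.default_rng(37)
y_true = rng.integers(0, 2, size=200)
y_score = 0.6 * y_true + rng.normal(scale=0.5, size=200)

# Route 1: build the ROC curve first, then integrate it.
fpr, tpr, _ = roc_curve(y_true, y_score)
area_from_curve = auc(fpr, tpr)

# Route 2: go straight from labels and scores to the area.
area_direct = roc_auc_score(y_true, y_score)

print(area_from_curve, area_direct)
```

If you already have fpr and tpr for plotting, auc(fpr, tpr) saves a second pass over the data. If you only need the number, roc_auc_score is the shorter route.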

The code so far

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Let's generate an arbitrary dataset with two classes
X, y = make_classification(n_samples=2000, n_classes=2, n_features=40, random_state=37)

# Let's split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=37)

# Importing the logistic regression model from the sklearn.linear_model module
from sklearn.linear_model import LogisticRegression

# Creating the logistic regression machine learning model
LogRegModel = LogisticRegression()

# Training the model with the training data
LogRegModel.fit(X_train, y_train)

# Calculating the prediction values and assigning them into a variable
pred_val = LogRegModel.predict_proba(X_test)

# roc curve for the logistic regression model
fpr, tpr, threshold = roc_curve(y_test, pred_val[:,1], pos_label=1)

# Calculating the curve where tpr = fpr
random_pred_val = [0 for i in range(len(y_test))]
p_fpr1, p_tpr1, _ = roc_curve(y_test, random_pred_val, pos_label=1)

# Calculating the auc score
auc_score = roc_auc_score(y_test, pred_val[:,1])

#Printing the auc score
print(auc_score)

Plotting the auc sklearn curve (ROC Curve)

Finally, we can plot the ROC curve. I’m going to use the matplotlib visualization library to plot two ROC curves: the one for the predicted values, and the reference line where tpr is equal to fpr. Let’s write the matplotlib code now.

Using seaborn style

plt.style.use('seaborn-v0_8')  # on Matplotlib older than 3.6, use 'seaborn' instead

Plotting the roc curves

plt.plot(fpr, tpr, linestyle='--',color='red', label='Logistic Regression')
plt.plot(p_fpr1, p_tpr1, linestyle='--', color='black')

Naming the plot title

plt.title('ROC curve plot')

Setting the x label

plt.xlabel('False Positive Rate/FPR')

Setting the y label

plt.ylabel('True Positive Rate/TPR')

Plot legend

plt.legend(loc='best')

Saving plt figure name and resolution

plt.savefig('ROC',dpi=300)

showing the plot

plt.show()

With this ROC curve, we can see the relationship between the false positive rate and the true positive rate as the decision threshold changes. The closer the curve gets to the top-left corner of the graph (an FPR of 0 and a TPR of 1), the better the model is at picking out positive data points without flagging too many negative ones. A curve that stays near the black diagonal indicates a model that does little better than random guessing. As a rough rule of thumb, models with an AUC score greater than 0.8 are often considered very good, a score around 0.5 is no better than chance, and a score below 0.5 suggests something is wrong (for example, inverted labels or scores) and should be investigated before the model is used. Ours is about 0.95, which is considered a great AUC score.

The full code for matplotlib

# plot roc curves
plt.plot(fpr, tpr, linestyle='--',color='red', label='Logistic Regression')
plt.plot(p_fpr1, p_tpr1, linestyle='--', color='black')
# title
plt.title('ROC curve plot')
# x label
plt.xlabel('False Positive Rate / FPR')
# y label
plt.ylabel('True Positive Rate / TPR')
# plt legend
plt.legend(loc='best')
#saving plt figure name and resolution
plt.savefig('ROC',dpi=300)
#showing the plot
plt.show();

Conclusion

The AUC score is a good way to compare classifiers, and the ROC curve it summarizes can be plotted to see how a model trades true positives against false positives. We often use this metric as an indicator of how well a binary classification model will separate the classes in practice. There are several ways to calculate it, and we encourage you to try more than one on your data sets before settling on a method. You may also want to read our upcoming blog post about calculating the area under the curve for another perspective!
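One other method worth trying is cross-validated AUC, which averages the score over several train/test splits instead of relying on a single one. Here is a minimal sketch using cross_val_score with scoring="roc_auc". The 5 folds and max_iter=1000 are my own choices for illustration, not settings from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Same arbitrary dataset as in the walkthrough above.
X, y = make_classification(n_samples=2000, n_classes=2, n_features=40, random_state=37)

# Fit and score the model on 5 different train/test splits; each score is an AUC.
cv_auc = cross_val_score(
    LogisticRegression(max_iter=1000),  # max_iter raised only to avoid convergence warnings
    X,
    y,
    cv=5,
    scoring="roc_auc",
)

print(cv_auc)
print(cv_auc.mean(), cv_auc.std())
```

The spread between folds gives a rough idea of how much the AUC depends on which samples happened to land in the test set.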
