Training a Naive Bayes classifier using sklearn

This is the second post in a series. In this post we will look at how to apply Naive Bayes to train a model and solve a classification problem on the iris dataset. In case you missed it, here’s the link to the first post, Naive Bayes a Primer, in which we break down the mechanics of the algorithm.

Okay, so let's get down to business with an overview of what this post covers. We will get our hands dirty by creating a Naive Bayes model using the scikit-learn Python framework.

Before we dive in, let's look at the software prerequisites needed to execute the code.

  1. Python 2.7 or higher
  2. Install the SciPy package
  3. Install the scikit-learn (sklearn) package

About scikit-learn 

Scikit-learn is an open source machine learning library for Python. The library is simple to use and contains tools for data analysis and data mining, not to mention several machine learning algorithms. The framework is built on the NumPy, SciPy and matplotlib packages.

The Dataset

In the program we will be using the iris dataset that is provided with the sklearn library. The dataset contains a total of 150 observations made up of 3 classes of 50 instances each, where each class refers to a type of iris plant.

Please refer to this link for more reading on the dataset. Each observation is made up of 4 feature attributes and 1 class attribute, which is the attribute to be predicted.

  1. Sepal Length in cm
  2. Sepal Width in cm
  3. Petal Length in cm
  4. Petal Width in cm
  5. Class: The class labels are
  • Setosa
  • Versicolour
  • Virginica
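
To check these numbers for yourself, here is a minimal sketch that loads the copy of the dataset bundled with scikit-learn and prints its shape, feature names and class counts. The variable name class_counts is mine, and note that scikit-learn spells the second class "versicolor".

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
print(iris.feature_names)

# Class labels are stored as integers 0, 1, 2; target_names maps them to species
print(iris.target_names)

# Count how many observations fall in each class
class_counts = np.bincount(iris.target)
print(dict(zip(iris.target_names, class_counts)))
```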

Exploring the dataset

Before we get to the code, it is vital to get a better understanding of the relationships between the features. One way to visualize the data is to generate a scatter matrix plot, as shown below.

iris_data-scatter-plot-1

The attribute in each row is plotted on the y axis and the attribute in each column is plotted on the x axis. So in the first scatter plot of the top row, sepal length is plotted on the y axis and sepal width on the x axis. From the plot we can see a roughly linear relationship between sepal length and petal length, and between sepal length and petal width.
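
If you would like to draw a scatter matrix yourself, the following is one possible sketch using only matplotlib and the bundled dataset. It is not necessarily how the figure in this post was produced; the output filename is a placeholder and the layout choices (figure size, histograms on the diagonal) are my own assumptions.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
n_features = X.shape[1]

fig, axes = plt.subplots(n_features, n_features, figsize=(10, 10))
for row in range(n_features):
    for col in range(n_features):
        ax = axes[row, col]
        if row == col:
            # Diagonal: histogram of a single feature
            ax.hist(X[:, row], bins=15, color="gray")
        else:
            # Off-diagonal: row feature on the y axis, column feature on the x axis
            ax.scatter(X[:, col], X[:, row], c=y, s=8)
        if col == 0:
            ax.set_ylabel(iris.feature_names[row], fontsize=8)
        if row == n_features - 1:
            ax.set_xlabel(iris.feature_names[col], fontsize=8)

# Placeholder filename; use plt.show() instead for an interactive window
fig.savefig("iris_scatter_matrix.png")
```

Coloring the points by class (c=y) also makes it easy to see which features separate the three species.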

Another way to find linear relationships between features is to use a bivariate statistical measure called the correlation coefficient r. Its values range between -1 and 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. Here’s a plot of the correlation coefficients:
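
The coefficients can also be computed directly. Here is a short sketch using NumPy's corrcoef on the four feature columns; the printing format is just one way to lay the matrix out.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# np.corrcoef expects one variable per row, so pass the transposed feature matrix
corr = np.corrcoef(iris.data.T)

# Print the matrix with feature names as row labels
for name, row in zip(iris.feature_names, corr):
    print("{0:<20}".format(name) + " ".join("{0:6.2f}".format(r) for r in row))
```

Each entry corr[i, j] is the correlation between feature i and feature j, so the diagonal is always 1.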


Train a Naive Bayes Classifier

Now that we’ve developed an intuition of the data, let’s write an application to train and use a Naive Bayes classifier and have it predict the class outcomes.

But first let’s break the problem into smaller steps

  • We will first load the features and the Class in two separate variables called X and y respectively, and then we will randomly divide the dataset into a training and test set.
  • The training set will be used to train the classifier; the test set will be used to have the classifier predict the class for each of its observations.
  • We will then measure the accuracy of the predictions by comparing the predicted outcomes to that of the true values of the class for the test set.

Now that we’ve defined the problem let’s code the solution:


import matplotlib.pyplot as plt
import numpy as np

from sklearn import datasets
from sklearn.metrics import accuracy_score, confusion_matrix
# train_test_split lives in sklearn.model_selection in current releases
# (older releases exposed it as sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset: X holds the 4 features, y holds the class labels
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create the Naive Bayes Classifier
clf = GaussianNB()

# Train the classifier using the fit method
clf.fit(X_train, y_train)

# Generate predictions i.e. class labels on the test data set
y_predict = clf.predict(X_test)

# With normalize=False, accuracy_score returns the number of correct predictions
score = accuracy_score(y_test, y_predict, normalize=False)

print("Total number of correctly classified observations: {0} out of {2} observations, Accuracy of the predictions: {1}".format(score, score / float(len(y_test)), len(y_test)))

# The default colormap is an assumption; the original default was lost
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute confusion matrix
cm = confusion_matrix(y_test, y_predict)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)

# Normalize the confusion matrix by row (i.e by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')

plt.show()
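
Once the classifier is trained, it can also be used on a single new observation. The sketch below retrains the same model so that it runs on its own. The measurement in new_flower is a made-up example, not data from this post, and predict_proba is used to show the posterior probability the model assigns to each class.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)

# Hypothetical measurement: sepal length, sepal width, petal length, petal width (cm)
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# predict returns the class index; target_names maps it to the species name
predicted_index = clf.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]

# predict_proba returns one posterior probability per class
probabilities = clf.predict_proba(new_flower)[0]

print("Predicted class: {0}".format(predicted_name))
for name, p in zip(iris.target_names, probabilities):
    print("P({0}) = {1:.4f}".format(name, p))
```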

All the code is available on github @


Naive Bayes a Primer

Full disclosure before you read this: the following are notes that I took after taking the courses on statistics and machine learning taught by Sebastian Thrun, founder of Udacity, who I believe is one of the best teachers of machine learning out there. I highly recommend you take both courses to get a better understanding of the concepts.

This post is for folks who are learning the ropes of Machine Learning Algorithms.

This is the first post in a series where we will explore one of the coolest classification algorithms, called Naive Bayes. We will implement the model first using the scikit-learn Python library, then using KNIME, and finally using Spark’s MLlib API.

We will cover the following topics in this post:

  • Discuss the differences between Supervised and Unsupervised Machine Learning algorithms.
  • Look at Naive Bayes, a classification (supervised) algorithm
  • Break down the Mechanics of Naive Bayes and build an intuition of the algorithm

Supervised learning vs Unsupervised learning

Before we get into Naive Bayes, let’s first get familiar with a couple of terms used to categorize machine learning algorithms. There are two broad categories: supervised learning and unsupervised learning algorithms.

Supervised Classifiers: Supervised classification is the task of inferring a state/value from a set of training data. The training data consists of a set of examples, where each example is a pair consisting of an input and a desired output state or signal, i.e. a class label. A supervised learning algorithm analyzes the training data and builds a statistical model which can be used to predict the output.

Unsupervised Classifiers: In this case the algorithm has no prior knowledge of the output, i.e. the class/value. The data is not labeled, and the goal is to learn patterns or clusters in the data.
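
To make the difference concrete, here is a small sketch in scikit-learn. GaussianNB stands in for the supervised case, and K-means clustering is my own pick for the unsupervised case (it is not covered in this series). The key thing to notice is which arguments each algorithm is given.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Supervised: the classifier sees both the inputs X and the labels y
classifier = GaussianNB()
classifier.fit(X, y)
predicted_labels = classifier.predict(X)

# Unsupervised: the clustering algorithm sees only X and groups similar rows
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(predicted_labels[:5])
print(cluster_ids[:5])
```

The cluster ids are arbitrary group numbers, so they need not line up with the class labels even when the groups themselves are similar.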

Naive Bayes Theorem Explained

History of Bayes Theorem:

Reverend Thomas Bayes was an English statistician and Presbyterian minister who is credited with formulating Bayes Theorem, which he is said to have used to argue for the existence of God through the application of probabilistic inference.

Here are the fundamentals of Bayes Rule:

  1. There is some prior probability of an event.
  2. Then there is a test that can be administered or applied that will give evidence of the event.
  3. Bayes rule incorporates the evidence from the test into the prior probability to arrive at the posterior probability.

Here’s an illustration of the process described above –

NB-Illustration - 1

Before we dive into the mechanics of the algorithm, let's refresh our memory on probability theory. I promise it will make sense in the end:

Flip a fair coin once. What’s the probability that we get heads?

P(H) = 0.5

How about tails? P(T) = 1 – P(H) = 0.5

What if the coin is loaded and the probability of heads on a toss is 0.8? What's the probability of getting tails when you flip it?

P(T) = 1 – P(H) = 1 – 0.8 = 0.2

Let’s take this up a notch: what if we flip the same loaded coin twice? What’s the probability of getting 2 consecutive heads, i.e. P(H,H)? It’s easier to figure that out by using a truth table:

Probability Theory - 2 flips

So P(H,H) = P(H)*P(H) = 0.8*0.8 = 0.64, since the two flips are independent of each other.
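
The full table for two flips can be rebuilt in a few lines of Python. This is only a sketch: it assumes the two flips are independent, and the names p and table are mine.

```python
from itertools import product

# Loaded coin from the example: heads 0.8, tails 0.2
p = {"H": 0.8, "T": 0.2}

# Enumerate every outcome of two independent flips and multiply the probabilities
table = {}
for first, second in product("HT", repeat=2):
    table[(first, second)] = p[first] * p[second]

for outcome, prob in table.items():
    print("P({0},{1}) = {2:.2f}".format(outcome[0], outcome[1], prob))

# The four outcomes cover every possibility, so their probabilities sum to 1
total = sum(table.values())
```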

We are going to use the 1 – P(H) and the P(H)*P(H) equations to compute the posterior probability using Naive Bayes. Now that we've covered some probability theory, let’s get back to Bayes Theorem using an example:

Suppose there’s a certain type of cancer C that occurs in about 1% of the population. There is a test for it, and if the patient has cancer C there’s a 90% chance that the test will come back as positive. This is called the sensitivity of the test.

If the subject does not have cancer there’s a 90% chance it will come back as negative. This is called the specificity of the test.

Let’s break this problem statement down and list the data points that can be used to apply Bayes Rule and make a probabilistic inference about the chances that the patient has cancer if the test comes back as positive.

From the above example we can derive the data points in the following probability table:

NB-Cancer Probabilities

Given the data, what’s the probability that a person has cancer given a positive test, i.e. P(C | P)? Here C is the cancer population, P is the positive test, and ¬C is the non-cancer population.

  1. P(P, C) = P(C) x P(P | C) = 0.01 x 0.9 = 0.009 { Here we are calculating the probability of a positive test for the cancer population }

  2. P(P, ¬C) = P(¬C) x P(P | ¬C) = 0.99 x 0.10 = 0.099 { Here we are calculating the probability of a positive test for the non-cancer population; we get P(P | ¬C) from the specificity of the test, as 1 – 0.9 = 0.10 }

  3. Next we want to calculate the total probability of a positive test across the cancer and non-cancer populations, which is P(P) = 0.009 + 0.099 = 0.108 { Also called the evidence; we add up the probabilities of a positive test for the cancer and non-cancer populations }

  4. P(C | P) = 0.009 / 0.108 = 0.0833 = about 8.3% { This is the answer we are looking for, the posterior probability of cancer given a positive test }

  5. P(¬C | P) = 0.099 / 0.108 = 0.9167 = about 91.7% { And this is the posterior probability of no cancer given a positive test }
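
The same arithmetic can be wrapped in a small helper. This is a sketch under the assumptions of the example (a single yes/no test with a known sensitivity and specificity); the function name posterior_given_positive is mine.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes' rule."""
    # Joint probability of having the condition and testing positive
    joint_condition = prior * sensitivity
    # Joint probability of not having the condition and still testing positive
    joint_no_condition = (1.0 - prior) * (1.0 - specificity)
    # Total probability of a positive test (the evidence)
    evidence = joint_condition + joint_no_condition
    return joint_condition / evidence

p_cancer = posterior_given_positive(prior=0.01, sensitivity=0.9, specificity=0.9)
print("P(C | positive) = {0:.4f}".format(p_cancer))
print("P(not C | positive) = {0:.4f}".format(1.0 - p_cancer))
```

Try raising the prior to 0.1 to see how strongly the answer depends on how common the condition is in the first place.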

Let’s summarize. The whole premise of Naive Bayes is centered around the following quantities:

  1. The prior probability of an event before the evidence, let’s call it Pr[H]
  2. The evidence or test on the event, let’s call it E
  3. The probability of the evidence, Pr[E]
  4. The probability of the event after the evidence or test, called the posterior probability Pr[H | E]

Pr[H | E] = Pr[E | H] x Pr[H] / Pr[E]

where

Pr[E] = Pr[H] x Pr[E | H] + Pr[¬H] x Pr[E | ¬H]

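The “Naive” in “Naive Bayes” refers to the simplifying assumption that the features are independent of each other given the class, which is what lets us multiply their probabilities together. The “Bayes” comes from Bayes Theorem.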
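
In the classifier, “naive” is commonly understood as treating the features as independent of each other once the class is known. The sketch below is a hand-rolled Gaussian version on the iris data, written to show where that assumption enters. It is an illustration rather than scikit-learn's actual implementation, and all the names in it are mine. It compares its predictions with GaussianNB at the end.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# Per-class prior, and per-class mean and variance of each feature
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior_scores(x):
    """Unnormalized log posterior of each class for one observation x."""
    # Gaussian log density of each feature under each class, shape (3, 4)
    log_density = -0.5 * (np.log(2 * np.pi * variances)
                          + (x - means) ** 2 / variances)
    # The "naive" step: add per-feature log densities as if features were independent
    return np.log(priors) + log_density.sum(axis=1)

manual = np.array([classes[np.argmax(log_posterior_scores(x))] for x in X])
agreement = np.mean(manual == GaussianNB().fit(X, y).predict(X))
print("Agreement with GaussianNB: {0:.3f}".format(agreement))
```

Adding log densities is the same as multiplying the probabilities, just as we multiplied P(H)*P(H) for the two coin flips.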

In the next post we will look at how to program a Naive Bayes model on a dataset using the scikit-learn Python library.