Training a Naive Bayes classifier using sklearn
This is the second post in a series. In this post we will look at how to apply Naive Bayes to train and solve a classification problem on the iris dataset. Here's the link to the first post, Naive Bayes: A Primer, in case you missed it; in it we break down the mechanics of the algorithm.
Okay, so let's get down to business and get an overview of what is going to be covered in this post. We will get our hands dirty by creating a Naive Bayes model using the scikit-learn Python framework.
Before we dive in, let's look at the software prerequisites to execute the code.
- Python 2.7 or higher
- Install the SciPy package – http://www.scipy.org/install.html
- Install the scikit-learn package – http://scikit-learn.org/stable/install.html
About scikit-learn
scikit-learn is an open-source machine learning library for Python. The library is simple to use and contains tools for data analysis and data mining, not to mention several machine learning algorithms. The framework is built on the NumPy, SciPy and matplotlib packages.
The Dataset
In the program we will be using the iris dataset that is provided with the scikit-learn library. The dataset contains a total of 150 observations, made up of 3 classes of 50 instances each, where each class refers to a type of iris plant.
Please refer to this link for more reading on the dataset. Each observation is made up of 4 feature attributes and 1 class attribute, which is the predicted attribute.
- Sepal Length in cm
- Sepal Width in cm
- Petal Length in cm
- Petal Width in cm
- Class: the class labels are
  - Setosa
  - Versicolour
  - Virginica
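As a quick sanity check, the dataset described above can be loaded and inspected straight from scikit-learn. (Note that sklearn spells the second class name "versicolor".)

```python
from sklearn import datasets

# Load the iris dataset bundled with scikit-learn
iris = datasets.load_iris()

print(iris.data.shape)      # (150, 4) -> 150 observations, 4 features
print(iris.feature_names)   # sepal/petal length and width, in cm
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
```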
Exploring the dataset
Before we get to the code, it is vital to get a better understanding of the relationships between the features. One way to visualize the data is to generate a scatter matrix plot, as shown below.
The attribute in the row represents the y axis, and the attribute in the column is the variable on the x axis. So in the first plot, sepal length is plotted on the y axis and sepal width on the x axis. From the plot we can see a linear relationship between Sepal Length and Petal Length, and between Sepal Length and Petal Width.
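A scatter matrix like the one described can be produced with pandas' `scatter_matrix` helper; the figure size and the coloring by class are my own choices here, not part of the original plot.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One panel per feature pair; the diagonal shows each feature's histogram.
# c=iris.target colors each point by its class label.
scatter_matrix(df, figsize=(8, 8), c=iris.target)
plt.show()
```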
Another method to find the linear relationships between features is to use a bivariate statistic called the correlation coefficient r, whose values range between -1 and 1: 1 indicates a very strong positive linear relationship and -1 indicates a very strong negative linear relationship. Here's a plot of the correlation coefficients:
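The correlation coefficients can also be computed directly with NumPy's `corrcoef`; the column indices below assume the feature order listed earlier (sepal length, sepal width, petal length, petal width).

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# np.corrcoef expects variables in rows, so transpose the (150, 4) matrix
r = np.corrcoef(iris.data.T)

# Pearson's r for two of the feature pairs
print(round(r[0, 2], 2))  # sepal length vs petal length: ~0.87
print(round(r[2, 3], 2))  # petal length vs petal width: ~0.96
```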
Train a Naive Bayes Classifier
Now that we've developed an intuition for the data, let's write an application to train a Naive Bayes classifier and have it predict the class outcomes.
But first, let's break the problem into smaller steps:
- We will first load the features and the class into two separate variables called X and y respectively, and then randomly divide the dataset into a training set and a test set.
- The training set will be used to train the classifier; the test set will be used to have the classifier predict the class for each observation.
- We will then measure the accuracy of the predictions by comparing the predicted outcomes to the true class values of the test set.
Now that we've defined the problem, let's code the solution:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Create the Naive Bayes classifier
clf = GaussianNB()

# Train the classifier using the fit method
clf.fit(X_train, y_train)

# Generate predictions i.e. class labels on the test data set
y_predict = clf.predict(X_test)

score = accuracy_score(y_test, y_predict, normalize=False)
print("Total number of correctly classified observations: {0} out of {2} observations, "
      "Accuracy of the predictions: {1}".format(score, score / float(len(y_test)), len(y_test)))

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_predict)
np.set_printoptions(precision=2)
print('Confusion matrix, without normalization')
print(cm)
plt.figure()
plot_confusion_matrix(cm)

# Normalize the confusion matrix by row (i.e. by the number of samples
# in each class)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
print('Normalized confusion matrix')
print(cm_normalized)
plt.figure()
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()
```
All the code is available on github @ https://github.com/jkoyan/MLCodeExamples/tree/master/nbmodel
Naive Bayes: A Primer
Before you read this, here is a full disclosure: the following are notes that I took after taking the courses on statistics and machine learning from udacity.com, taught by Sebastian Thrun, founder of Udacity, who I believe is one of the best teachers on Machine Learning out there. I highly recommend you take both courses to get a better understanding of the concepts.
This post is for folks who are learning the ropes of Machine Learning Algorithms.
This is the first post in a series where we will explore one of the coolest classification algorithms, called Naive Bayes. We will implement the model first using the scikit-learn Python library, then build the model using KNIME, and finally build the model using Spark's MLlib API.
We will cover the following topics in this post:
- Discuss the differences between Supervised and Unsupervised Machine Learning algorithms.
- Look at Naive Bayes, a Classification (Supervised) algorithm.
- Break down the mechanics of Naive Bayes and build an intuition of the algorithm.
Supervised learning vs Unsupervised learning
Before we get into Naive Bayes, let's first get familiar with a couple of terms used to categorize Machine Learning algorithms. There are two broad categories: Supervised learning and Unsupervised learning algorithms.
Supervised Classifiers: Supervised classification is the task of inferring a state/value from a set of training data. The training data consists of examples, each a pair consisting of an input and a desired output state or signal, i.e. a class label. A supervised learning algorithm analyzes the training data and builds a statistical model which can be used to predict the output.
Unsupervised Classifiers: In this case the classifier has no prior knowledge of the output, i.e. the class/value; the data is not labeled, and the goal is to learn patterns or clusters in the data.
Bayes' Theorem Explained
History of Bayes' Theorem:
Reverend Thomas Bayes was an English statistician and Presbyterian minister who is credited with formulating Bayes' Theorem, which is said to have been applied to argue for the existence of God through probabilistic inference.
Here are the fundamentals of Bayes' Rule:
- There is some prior probability of an event.
- Then there is a test that can be administered that will give evidence about the event.
- Bayes' rule incorporates the evidence from the test into the prior probability to arrive at the posterior probability.
Here’s an illustration of the process described above –
Before we dive into the mechanics of the algorithm, let's refresh our memory on probability theory. I promise it will make sense in the end:
Flip a fair coin once; what's the probability that we get heads?
P(H) = 0.5
How about tails? P(T) = 1 – P(H) = 0.5
How about if the coin is loaded and the probability of heads on a toss is 0.8? What's the probability of getting tails when you flip the coin? By the formula above:
P(T) = 1 – P(H) = 1 – 0.8 = 0.2
Let's take this up a notch and ask ourselves: what if we flip the same loaded coin twice, what's the probability of getting 2 consecutive heads, i.e. P(H,H)? Since the two flips are independent, it's easy to figure out by multiplying the probabilities, as the table of outcomes below shows –
So P(H,H) = P(H) × P(H) = 0.8 × 0.8 = 0.64
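The complement rule and the independence rule above are easy to verify in a few lines of Python (the variable names are my own):

```python
p_heads = 0.8                    # loaded coin: probability of heads
p_tails = 1 - p_heads            # complement rule: P(T) = 1 - P(H)
p_two_heads = p_heads * p_heads  # independent flips multiply: P(H,H)

print(round(p_tails, 2))      # 0.2
print(round(p_two_heads, 2))  # 0.64
```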
We are going to use the 1 – P(H) and P(H) × P(H) equations to compute the posterior probability using Naive Bayes. Now that we've covered some probability theory, let's get back to Bayes' Theorem using an example:
Suppose there's a certain type of cancer C that occurs in about 1% of the population. There is a test which, when taken, has a 90% chance of coming back positive if the patient has cancer C; this is called the sensitivity of the test.
If the subject does not have cancer, there's a 90% chance the test will come back negative. This is called the specificity of the test.
Let's break down this problem statement and list the different data points available that can be used to apply Bayes' Rule and make a probabilistic inference of the chances that the patient has cancer if the test comes back positive.
From the above example we can derive the data points in the following probability table –
Given the data, what's the probability that a person has cancer given a positive test, i.e. P(C | Pos)? Here C is the cancer population, ¬C is the non-cancer population, and Pos is a positive test.
P(Pos, C) = P(C) × P(Pos | C) = 0.01 × 0.9 = 0.009 { Here we are calculating the probability of a positive test for the cancer population }
P(Pos, ¬C) = P(¬C) × P(Pos | ¬C) = 0.99 × 0.10 = 0.099 { Here we are calculating the probability of a positive test for the non-cancer population; we get this data point from the specificity of the test }
Next we want to calculate the total probability of a positive test across the cancer and non-cancer populations, which is P(Pos) = 0.009 + 0.099 = 0.108 { Also called the evidence: we add up the probabilities of a positive test for the cancer and non-cancer populations }
P(C | Pos) = 0.009 / 0.108 = 0.0833, or about 8.3% – This is the answer we are looking for: the posterior probability of cancer given a positive test
P(¬C | Pos) = 0.099 / 0.108 = 0.9166, or about 91.66% – And this is the posterior probability of no cancer given a positive test
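The arithmetic above can be checked with a short script; the variable names here are my own, not part of the original example.

```python
# Data points from the cancer-test example:
p_c = 0.01             # prior P(C): 1% of the population has cancer
p_pos_given_c = 0.90   # sensitivity: P(Pos | C)
p_neg_given_nc = 0.90  # specificity: P(Neg | not C)

joint_c = p_c * p_pos_given_c                # P(Pos, C)
joint_nc = (1 - p_c) * (1 - p_neg_given_nc)  # P(Pos, not C)
evidence = joint_c + joint_nc                # total probability P(Pos)

posterior = joint_c / evidence               # P(C | Pos)
print(round(posterior, 4))  # 0.0833 -> about 8.3%
```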
Let's summarize – the whole premise of Naive Bayes is centered around the following quantities –
- A prior probability of an event before the evidence; let's call it Pr[H]
- Evidence/test on the event; let's call it E
- The probability of the evidence given the event, Pr[E | H]
- The probability of the event after the evidence/test, called the posterior probability, Pr[H | E]
Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]
where
Pr[E] = Pr[H] × Pr[E | H] + Pr[¬H] × Pr[E | ¬H]
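Putting the rule and its total-probability denominator together, a small helper function might look like the sketch below; the function and parameter names are my own.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Pr[H | E] via Bayes' rule.

    prior                -> Pr[H]
    likelihood           -> Pr[E | H]
    likelihood_given_not -> Pr[E | not H]
    """
    # Denominator: total probability of the evidence, Pr[E]
    evidence = prior * likelihood + (1 - prior) * likelihood_given_not
    return prior * likelihood / evidence

# The cancer-test numbers from the earlier example recover the same ~8.3%
print(round(posterior(0.01, 0.9, 0.1), 3))  # 0.083
```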
The "Naive" in "Naive Bayes" refers to the naive assumption the classifier makes that the features are conditionally independent of each other given the class; the "Bayes" comes from Bayes' Theorem.
In the next post we will look at how to program a Naive Bayes model on a dataset using the scikit-learn Python library.