Naive Bayes: A Primer

Before you read this, a full disclosure: the following are notes I took after completing the statistics and machine learning courses on udacity.com, taught by Sebastian Thrun, founder of Udacity, who I believe is one of the best Machine Learning teachers out there. I highly recommend you take both courses to get a better understanding of the concepts.

This post is for folks who are learning the ropes of Machine Learning Algorithms.

This is the first post in a series where we will explore one of the coolest classification algorithms, Naive Bayes. We will implement the model first using the scikit-learn Python library, then build it using KNIME, and finally build it using Spark's MLlib API.

We will cover the following topics in this post:

  • Discuss the differences between Supervised and Unsupervised Machine Learning algorithms.
  • Look at Naive Bayes, a classification (supervised) algorithm
  • Break down the Mechanics of Naive Bayes and build an intuition of the algorithm

Supervised learning vs Unsupervised learning

Before we get into Naive Bayes, let's first get familiar with a couple of terms used to categorize Machine Learning algorithms. There are two broad categories: Supervised learning and Unsupervised learning algorithms.

Supervised Classifiers: Supervised classification is the task of inferring a state/value from a set of training data. The training data consists of examples, where each example is a pair consisting of an input and a desired output state or signal, i.e. a class label. A supervised learning algorithm analyzes the training data and builds a statistical model which can be used to predict the output for new inputs.

Unsupervised Classifiers: In this case the classifier has no prior knowledge of the output, i.e. the class/value; the data is not labeled, and the goal is to learn patterns or clusters in the data.
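To make the distinction concrete, here is a minimal sketch in Python (with made-up toy numbers, purely illustrative) of what the two setups look like: supervised data arrives as (input, label) pairs, while unsupervised data arrives as inputs only.

```python
# Supervised setup: each training example is a pair (input features, class label).
supervised_training_data = [
    ([5.1, 3.5], "class_a"),   # features paired with a known label
    ([6.7, 3.0], "class_b"),
]

# Unsupervised setup: only the inputs are available, no labels; the algorithm
# has to discover patterns or clusters in the data on its own.
unsupervised_training_data = [
    [5.1, 3.5],
    [6.7, 3.0],
]
```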

Naive Bayes Theorem Explained

History of Bayes Theorem:

Reverend Thomas Bayes was an English statistician and Presbyterian minister who is credited with formulating Bayes' Theorem, which he is said to have used to argue for the existence of God through the application of probabilistic inference.

Here are the fundamentals of Bayes' Rule:

  1. There is some prior probability of an event.
  2. Then there is a test that can be administered which provides evidence about the event.
  3. Bayes rule incorporates the evidence from the test into the prior probability to arrive at the posterior probability.

Here’s an illustration of the process described above –

[Figure: NB Illustration 1 – prior probability + test evidence → posterior probability]

Before we dive into the mechanics of the algorithm, let's refresh our memory on probability theory; I promise it will make sense in the end:

Flip a fair coin once; what's the probability that we get heads?

P(H) = 0.5

How about tails? P(T) = 1 – P(H) = 0.5

Now suppose the coin is loaded and the probability of heads on a toss is 0.8. What's the probability of getting a tail when you flip the coin? By the same complement rule as above:

P(T) = 1 – P(H) = 1 – 0.8 = 0.2

Let's take this up a notch and ask: what if we flip the same loaded coin twice, what's the probability of getting 2 consecutive heads, i.e. P(H,H)? It's easier to figure that out by listing the possible outcomes in a table –

[Table: possible outcomes and probabilities for two flips of the loaded coin]

So the P(H,H) = P(H)*P(H) = 0.8*0.8 = 0.64
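As a quick sanity check, here is a small Python sketch of the two rules we just used, the complement rule (1 – P(H)) and the product rule for independent flips (P(H) * P(H)), using the 0.8 loaded-coin probability from the example above:

```python
# Probability of heads for the loaded coin from the example above.
p_heads = 0.8

# Complement rule: each flip must land either heads or tails.
p_tails = 1 - p_heads              # ≈ 0.2

# Product rule for independent flips: two heads in a row.
p_two_heads = p_heads * p_heads    # ≈ 0.64 (up to floating-point rounding)

print(p_tails, p_two_heads)
```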

We are going to use the 1 – P(H) and the P(H)*P(H) equations to compute the posterior probability using Naive Bayes. Now that we've covered some probability theory, let's get back to Bayes' Theorem using an example:

Suppose there's a certain type of Cancer C that occurs in about 1% of the population. There is a test which, if the patient has Cancer C, has a 90% chance of coming back positive; this is called the sensitivity of the test.

If the subject does not have cancer, there's a 90% chance the test will come back negative. This is called the specificity of the test.

Let's break down this problem statement and list the different data points available that can be used to apply Bayes' Rule and make a probabilistic inference about the chances that the patient has cancer if the test comes back positive.

So from the above example we can derive the data points summarized in the following probability table –

[Probability table: P(C) = 0.01, P(>C) = 0.99, P(Positive | C) = 0.9 (sensitivity), P(Positive | >C) = 0.1 (from the 90% specificity)]

Given the data, what's the probability that a person has cancer given a positive test, i.e. P(C | P)? Here C is the cancer population, >C is the non-cancer population, and P denotes a positive test.

  1. P(P, C) = P(C) x P(P | C) = 0.01 x 0.9 = 0.009  { Here we are calculating the probability of a positive test for the cancer population }

  2. P(P, >C) = P(>C) x P(P | >C) = 0.99 x 0.10 = 0.099  { Here we are calculating the probability of a positive test for the non-cancer population; we get this data point from the specificity of the test }

  3. Next we calculate the total probability of a positive test across the cancer and non-cancer populations: P(P) = 0.009 + 0.099 = 0.108  { Also called the evidence; we add up the probabilities of a positive test for the cancer and non-cancer populations }

  4. P(C | P) = 0.009 / 0.108 = 0.0833, about 8.3 % – This is the answer we are looking for: the posterior probability of cancer given a positive test

  5. P(>C | P) = 0.099 / 0.108 = 0.9166, about 91.66 % – And this is the posterior probability of no cancer given a positive test
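The whole calculation above can be reproduced in a few lines of Python; this is just a sketch with variable names of my own choosing, but the numbers (1% prior, 90% sensitivity, 90% specificity) are exactly the ones from the example:

```python
# Numbers from the cancer-test example above.
prior_cancer = 0.01                      # P(C)
prior_no_cancer = 1 - prior_cancer       # P(>C) = 0.99
sensitivity = 0.90                       # P(Positive | C)
false_positive_rate = 1 - 0.90           # P(Positive | >C), from the 90% specificity

# Joint probabilities of a positive test with and without cancer.
p_pos_and_cancer = prior_cancer * sensitivity                   # 0.009
p_pos_and_no_cancer = prior_no_cancer * false_positive_rate     # ≈ 0.099

# Total probability of a positive test (the "evidence").
p_pos = p_pos_and_cancer + p_pos_and_no_cancer                  # ≈ 0.108

# Posterior probabilities.
p_cancer_given_pos = p_pos_and_cancer / p_pos                   # ≈ 0.0833
p_no_cancer_given_pos = p_pos_and_no_cancer / p_pos             # ≈ 0.9167

print(round(p_cancer_given_pos, 4), round(p_no_cancer_given_pos, 4))
```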

Let’s summarize – The whole premise of Naive Bayes is centered around the following properties –

  1. A prior probability of the event before the evidence; let's call it Pr[H]
  2. An evidence/test on the event; let's call it E
  3. The probability of observing the evidence, Pr[E], together with the likelihood of the evidence given the event, Pr[E | H]
  4. The probability of the event after the evidence/test, called the posterior probability, Pr[H | E]

Pr[H | E] = Pr[E|H]*Pr[H] / Pr[E]

where

Pr[E] = Pr[E | H] x Pr[H] + Pr[E | >H] x Pr[>H]
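To see that this general form agrees with the worked example, substitute the cancer-test numbers: Pr[C | P] = Pr[P | C] x Pr[C] / Pr[P] = (0.9 x 0.01) / (0.9 x 0.01 + 0.1 x 0.99) = 0.009 / 0.108 ≈ 0.083, i.e. about 8.3 %, the same posterior we computed above.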

The "Naive" in "Naive Bayes" refers to the simplifying (naive) assumption that the features used for classification are independent of each other given the class; the "Bayes" comes from Bayes' Theorem.

In the next post we will look at how to program a Naive Bayes model on a dataset using the scikit-learn Python library.
