Overview
This article will help you understand one of the simplest and most popular classification algorithms in machine learning: Naive Bayes.
The main objectives of this article are:
- Introduction to standard Naive Bayes classifier
- Examples of Naive Bayes classification workflow
- Overview of different models of Naive Bayes classifier
- Discussions on Naive Bayes classifier
- Conclusion or ending notes
Introduction
Naive Bayes is a probabilistic machine learning classification method whose strategy is based on the Bayes theorem of probability.
To understand how the Bayes theorem works, look at the following equation:

P(A|B) = P(B|A) * P(A) / P(B)

Here A is the hypothesis and B is the evidence.
where;
- P(A|B) is the posterior probability of the target class A given the predictor attributes B (the terms predictors and feature set are used interchangeably), i.e., the probability of A after the occurrence of B.
- P(A) is the prior probability of the output/hypothesis A.
- P(B|A) is the likelihood of the predictor or feature set B given the hypothesis or class A.
- P(B) is the prior probability (evidence) of the feature set B.
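The four quantities above can be checked with a few lines of Python. The numbers below are made up purely for illustration (a hypothetical spam-filtering setting), not taken from any dataset:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.2          # prior P(A): e.g. 20% of emails are spam (made-up number)
p_b_given_a = 0.6  # likelihood P(B|A): the word "offer" appears in 60% of spam
p_b = 0.15         # evidence P(B): the word "offer" appears in 15% of all emails

p_a_given_b = p_b_given_a * p_a / p_b  # posterior P(A | B)
print(round(p_a_given_b, 2))  # 0.8
```

So under these illustrative numbers, seeing the word "offer" raises the probability of spam from 20% to 80%.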
In the Naive Bayes classifier, it is assumed that the predictors or input features used to determine the probability of an outcome are not correlated, i.e., the presence of a feature in a class is independent of any other feature. For instance, if we want to classify fish into two classes, sea bass or salmon, the feature set may consist of the fish's length, width, or color. Each of these features contributes independently to the probability that the fish is a salmon or a sea bass. Hence the name of the algorithm: 'Naive' comes from the assumption that the presence of one feature does not affect another, i.e., the predictors/attributes are independent.
The Naive Bayes classifier is simple to build and is especially useful for classification tasks with large data sets. Despite this simplicity, Naive Bayes is known to produce surprisingly sophisticated classifiers.
Examples
Let’s have a look at how the algorithm works using a few illustrative examples below:
Example#1
Below we have a data set containing weather as a predictor and play as the target class or variable; the weather predictor suggests the possibility of play.
Here, we need to classify whether the player will play or not, given the condition of the weather. The first column represents the weather feature and the second column represents the target variable or class, while each row represents an individual entry. If we consider the first row of the dataset, we can conclude that the player will not play when the weather is sunny.
We have to make the following assumptions here:
1) the predictors are independent, i.e., if the weather is sunny, it does not necessarily mean that the player will not play; if we have more than one predictor, say humidity, then each feature contributes to the probability of playing independently; and,
2) all the predictors affect the outcome equally, i.e., if the weather is sunny and we have one more attribute deciding the possibility of playing, say humidity, then both affect the probability of playing equally.
According to this example, Bayes theorem can be rewritten as:

P(y|X) = P(X|y) * P(y) / P(X)

where,
- y is the class variable (play), representing whether the player will play or not, given the conditions.
- X represents the feature set or the parameters.
Here X can be given as:

X = (x_1, x_2, ..., x_n)

where x_1, x_2, ..., x_n represent the features or pieces of evidence used to decide the possibility of playing, which in this case map only to weather.
After substituting X and applying the chain rule together with the independence assumption, we get:

P(y|x_1, ..., x_n) = P(x_1|y) * P(x_2|y) * ... * P(x_n|y) * P(y) / (P(x_1) * P(x_2) * ... * P(x_n))

The denominator in the above equation can be removed since it stays constant for every observation. Hence we can introduce the following proportionality:

P(y|x_1, ..., x_n) ∝ P(y) * P(x_1|y) * P(x_2|y) * ... * P(x_n|y)

For a multi-class classification problem, we need to find the class y with the highest probability, so the above proportionality becomes:

y = argmax_y P(y) * P(x_1|y) * P(x_2|y) * ... * P(x_n|y)
Now, we can find out the class based on given predictors using the above formula.
To get a better intuition of the theorem, let’s do an exercise using the above dataset.
Problem
Can we predict whether the player will play or not based on the following weather condition?
| Weather | Play |
|---------|------|
| Sunny   | ?    |
Solution
Let’s solve the problem following the steps below:
Step 1:
Construct the frequency table for every piece of evidence in the dataset.
Step 2:
Construct likelihood table
Step 3:
Calculate the posterior probability for each class using the simplified Bayes formula above. The outcome of the prediction is the class with the maximum posterior probability.
argmax( P(play=yes) * P(weather=sunny|play=yes), P(play=no) * P(weather=sunny|play=no) )
= argmax( 9/14 * 3/9, 5/14 * 2/5 ) = argmax( 0.64 * 0.33, 0.36 * 0.4 ) = argmax( 0.21, 0.14 )
Therefore, the maximum score is 0.21, which corresponds to the outcome play = yes, so we predict that the player will play.
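The calculation above can be reproduced in a few lines of Python. Since the full 14-row dataset is not reproduced here, the snippet starts directly from the counts implied by the likelihood table (9 yes and 5 no outcomes, with 3 sunny days among the yes rows and 2 among the no rows):

```python
# Counts taken from the frequency table described above
total = 14
n_yes, n_no = 9, 5
sunny_yes, sunny_no = 3, 2

# Score(class) = P(class) * P(weather=sunny | class)
score_yes = (n_yes / total) * (sunny_yes / n_yes)  # 9/14 * 3/9
score_no = (n_no / total) * (sunny_no / n_no)      # 5/14 * 2/5
prediction = "yes" if score_yes > score_no else "no"
print(round(score_yes, 2), round(score_no, 2), prediction)  # 0.21 0.14 yes
```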
Example#2
Suppose we have a set of symptoms for diagnosing clinical patients, given all the previous patient records. Table 1 shows the observed symptoms and the respective diagnoses:
| Chills | Runny nose | Headache | Fever | Flu |
|--------|------------|----------|-------|-----|
| Y | N | Mild | Y | N |
| Y | Y | No | N | Y |
| Y | N | Strong | Y | Y |
| N | Y | Mild | Y | Y |
| N | N | No | N | N |
| N | Y | Strong | Y | Y |
| N | Y | Strong | N | N |
| Y | Y | Mild | Y | Y |
Problem
Can we predict that the patient with the following symptoms has the flu?
| Chills | Runny nose | Headache | Fever | Flu? |
|--------|------------|----------|-------|------|
| Y | N | Mild | N | ? |
Solution
In order to answer the above question, we follow the same steps as in the previous example. Merging all the steps, we finally have the following prior and conditional probabilities:
| Probability | Value | Probability | Value |
|-------------|-------|-------------|-------|
| P(flu=Y) | 0.625 | P(flu=N) | 0.375 |
| P(chills=Y \| flu=Y) | 0.6 | P(chills=Y \| flu=N) | 0.333 |
| P(chills=N \| flu=Y) | 0.4 | P(chills=N \| flu=N) | 0.666 |
| P(runny nose=Y \| flu=Y) | 0.8 | P(runny nose=Y \| flu=N) | 0.333 |
| P(runny nose=N \| flu=Y) | 0.2 | P(runny nose=N \| flu=N) | 0.666 |
| P(headache=Mild \| flu=Y) | 0.4 | P(headache=Mild \| flu=N) | 0.333 |
| P(headache=No \| flu=Y) | 0.2 | P(headache=No \| flu=N) | 0.333 |
| P(headache=Strong \| flu=Y) | 0.4 | P(headache=Strong \| flu=N) | 0.333 |
| P(fever=Y \| flu=Y) | 0.8 | P(fever=Y \| flu=N) | 0.333 |
| P(fever=N \| flu=Y) | 0.2 | P(fever=N \| flu=N) | 0.666 |
Note: For a given outcome of the target attribute flu (Y or N), the probabilities of the possible values of a predictor sum to 1. For example, P(chills=Y | flu=Y) is 0.6 and P(chills=N | flu=Y) is 0.4, which sum to 1.
Now we build a hypothesis based on the evidence of the symptoms in the above question and choose the hypothesis with the maximum score. We compute the score for each outcome of the target attribute flu given the above symptoms as pieces of evidence.
P(flu=Y) * P(chills=Y|flu=Y) * P(runny nose=N|flu=Y) * P(headache=Mild|flu=Y) * P(fever=N|flu=Y) = 0.625 * 0.6 * 0.2 * 0.4 * 0.2 = 0.006
vs
P(flu=N) * P(chills=Y|flu=N) * P(runny nose=N|flu=N) * P(headache=Mild|flu=N) * P(fever=N|flu=N) = 0.375 * 0.333 * 0.666 * 0.333 * 0.666 ≈ 0.0185
Hence, flu=N has the maximum score (0.0185 vs 0.006), given the evidence chills=Y, runny nose=N, headache=Mild, and fever=N. So we predict the outcome flu=N.
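As a sanity check, here is a short Python sketch that recomputes both scores directly from the eight records in Table 1 (the in-code data layout is an assumption about how you might store the table, not part of the original example):

```python
# The eight training records from Table 1:
# (chills, runny_nose, headache, fever) -> flu
data = [
    (("Y", "N", "Mild", "Y"), "N"),
    (("Y", "Y", "No", "N"), "Y"),
    (("Y", "N", "Strong", "Y"), "Y"),
    (("N", "Y", "Mild", "Y"), "Y"),
    (("N", "N", "No", "N"), "N"),
    (("N", "Y", "Strong", "Y"), "Y"),
    (("N", "Y", "Strong", "N"), "N"),
    (("Y", "Y", "Mild", "Y"), "Y"),
]

def score(features, label):
    """P(flu=label) * product of P(feature_i = value_i | flu=label)."""
    rows = [f for f, y in data if y == label]
    s = len(rows) / len(data)  # class prior, e.g. P(flu=Y) = 5/8
    for i, value in enumerate(features):
        s *= sum(1 for f in rows if f[i] == value) / len(rows)
    return s

query = ("Y", "N", "Mild", "N")  # chills=Y, runny nose=N, headache=Mild, fever=N
score_y, score_n = score(query, "Y"), score(query, "N")
print(round(score_y, 4), round(score_n, 4))  # 0.006 0.0185
```

The larger score belongs to flu=N, matching the conclusion above.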
Models of Naive Bayes Classifier
Gaussian Naive Bayes
This model is used for classification when the predictors take continuous values that are assumed to be sampled from a normal (Gaussian) distribution. The likelihood of a feature value then becomes:

P(x_i|y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp(-(x_i - mu_y)^2 / (2 * sigma_y^2))

where mu_y and sigma_y are the mean and standard deviation of the feature for class y.
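Here is a minimal Python sketch of that Gaussian likelihood term. In practice the mean and standard deviation would be estimated per class from the training data; here they are simply passed in:

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P(x | y) under the Gaussian assumption, with class mean mu and std sigma."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# The density peaks at x == mu, where it equals 1 / sqrt(2 * pi * sigma^2)
print(round(gaussian_likelihood(0.0, 0.0, 1.0), 4))  # 0.3989
```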
Multinomial Naive Bayes
The multinomial model is used for discrete counts. Consider a text classification problem: taking the Bernoulli trials one step further, instead of only recording whether a word occurs in a document, we count how often it occurs in the document, i.e., how many times a given outcome is observed over n trials. It is mostly used for document classification.
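As an illustration, here is a minimal pure-Python sketch of the multinomial counting step for a made-up two-class corpus (the documents, labels, and add-one smoothing are all illustrative assumptions, not a full classifier):

```python
from collections import Counter

# Tiny made-up corpus: (document text, class label)
docs = [("buy cheap pills now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("cheap pills cheap pills", "spam"),
        ("project meeting notes", "ham")]

def train(docs):
    """Collect per-class word-frequency counts and the vocabulary."""
    counts, vocab = {}, set()
    for text, label in docs:
        bag = counts.setdefault(label, Counter())
        for word in text.split():
            bag[word] += 1  # frequency, not just presence: the multinomial idea
            vocab.add(word)
    return counts, vocab

def word_prob(word, label, counts, vocab):
    """Laplace-smoothed P(word | class) from the frequency counts."""
    bag = counts[label]
    return (bag[word] + 1) / (sum(bag.values()) + len(vocab))

counts, vocab = train(docs)
print(word_prob("cheap", "spam", counts, vocab) > word_prob("cheap", "ham", counts, vocab))  # True
```

A full classifier would multiply (or sum the logs of) these per-word probabilities together with the class prior, exactly as in the argmax formula derived earlier.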
Bernoulli Naive Bayes
The Bernoulli model is similar to the multinomial model in that it looks at the occurrence of words in text, but it differs in the representation of the predictors: here the predictors are Boolean variables. It is useful for classification tasks with binary feature vectors (i.e., 0 or 1). For instance, in document classification, the occurrence of a word is represented by 0 or 1, depending on its absence or presence in the document respectively.
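A minimal sketch of the Bernoulli likelihood term, using a made-up probability for a hypothetical word feature:

```python
def bernoulli_likelihood(present, p):
    """P(feature | class) for a binary feature: p if present, 1 - p if absent."""
    return p if present else 1.0 - p

# Illustrative number: suppose P(word "offer" present | spam) = 0.7
print(bernoulli_likelihood(True, 0.7))             # 0.7
print(round(bernoulli_likelihood(False, 0.7), 1))  # 0.3
```

Note that, unlike the multinomial model, the absence of a word also contributes a factor (1 - p) to the product.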
Discussions
Advantages
- Besides being simple, Naive Bayes is a fast and accurate prediction method with a low computational cost.
- When the independence assumption roughly holds, Naive Bayes can perform better than other machine learning models, for example, logistic regression.
- Naive Bayes performs well with categorical input variables as opposed to numerical ones; likewise, it handles discrete counts better than continuous variables.
- Naive Bayes can also be used for multi-class classification problems and shows good performance in text analytics.
Disadvantages
- One of the limitations of the Naive Bayes algorithm is its assumption of predictor independence. In real-life scenarios this assumption rarely holds, since datasets often involve predictors that strongly depend on each other.
- Naive Bayes is also known to be a bad estimator, so its posterior probability outputs should not be taken too seriously.
- If a particular class and feature value never occur together in the training data, the estimated conditional probability is zero, which zeroes out the entire posterior and prevents a prediction. This is known as the zero-frequency (or zero-probability) problem, and it is commonly handled with smoothing techniques such as Laplace smoothing.
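A small sketch of the zero-frequency problem and one common fix, add-one (Laplace) smoothing; the counts below are illustrative:

```python
def likelihood(count, class_count, n_values, alpha):
    """Estimate P(value | class). alpha=0 is the raw estimate; alpha=1 is Laplace smoothing."""
    return (count + alpha) / (class_count + alpha * n_values)

# A feature value never seen with this class: the raw estimate is 0,
# which zeroes out the whole product of likelihoods
print(likelihood(0, 9, 3, alpha=0))  # 0.0
# With add-one smoothing the estimate stays small but non-zero
print(round(likelihood(0, 9, 3, alpha=1), 4))  # 0.0833
```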
Conclusion
Congratulations, you have made it to the end of this tutorial!
So far, we took an introductory look at the Naive Bayes algorithm, walked through its workflow with different examples, derived its simpler form in terms of argmax, looked at the different models of the Naive Bayes algorithm along with their specifics and purposes, and discussed some of its pros and cons.
Naive Bayes is one of the most convenient and straightforward machine learning algorithms. It retains its worth in spite of emerging advances in machine learning. Most importantly, it is used in real-time predictions and has well-known usage in multi-class predictions.
Some of the real-life applications of Naive Bayes classifier include text classification, filtering of spam emails, sentiment analysis, and recommendation systems.
In the future, we will provide a basic implementation of the Naive Bayes classifier in Python using scikit-learn. We look forward to hearing your responses in the feedback, and we will try our best to answer them. Your suggestions for improvements are always welcome.