We have talked a bit about how to read and process a text file into information usable by the classifier. Now we will take a look at the implementation of the Numerical Naive Bayes algorithm. An example of the text file holding the data is the following:
Class Height Weight
Wrestler 170 61
Wrestler 173 67
Sumo 181 110
Sumo 177 100
Wrestler 175 69
Sumo 180 111
For this tutorial, we will use the data set found here.
Distribution Function
We will also assume the data set follows the Gaussian distribution. You can read about the Gaussian distribution on Wikipedia. What you need to know is that, given the standard deviation and mean of the distribution, it takes as input an item x and returns the probability of x occurring. So we only need to compute σ (standard deviation) and µ (mean), and then we can find the probability of any value.
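For reference, the Gaussian probability density we will be computing is:

\( g(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \)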
To find the conditional probability P(evidence|class), we want to get the output of the above formula. Therefore, we will write a function for it, named Gaussian. This will take as input the mean, the standard deviation and the value x.
def Gaussian(mean,stDev,x):
    pi = math.pi;
    e = math.e;
    m = mean;
    s = stDev;

    #Gaussian probability density of x, given the mean and standard deviation
    g = 1/(math.sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2);
    return g;
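As a quick sanity check, you can call the function with some made-up parameters (the mean of 172 and standard deviation of 2.5 below are arbitrary values, not taken from the data set):

#Density of a Height of 170 under an arbitrary distribution
#with mean 172 and standard deviation 2.5
print Gaussian(172, 2.5, 170);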
Notice we have calls to math.sqrt, math.pi etc., so we will need to import the math library.
import math;
Code
Reading from File
First we will need to build the dictionaries and lists Classes, Features, P and get the count of the items in our data set, n.
We have already built a data reader. We will save that code into a Python script called _DataReader and import it here.
import _DataReader as DataReader;
We will call its Read function and we will pass as parameter the file name of the data set. This will return a list, which we will parse into the appropriate variables:
Classes, Features, P, n = DataReader.Read('data.txt');
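In case you do not have the reader from the previous part at hand, below is a minimal sketch of how Read could be implemented for this file format. It only illustrates the structures we assume from here on (Classes with per-class, per-feature "Mean"/"StDev" entries, the Features list, the class priors P and the item count n); the reader built in the previous tutorial may differ in its details, for example this sketch uses the population standard deviation.

import math;

#Sketch only; the reader from the previous tutorial may differ
def Read(fileName):
    f = open(fileName, 'r');
    lines = f.readlines();
    f.close();

    #First line holds the feature names (the 'Class' column is skipped)
    Features = lines[0].split()[1:];

    #Group the feature values by class
    values = {};
    n = 0;
    for line in lines[1:]:
        items = line.split();
        if len(items) == 0:
            continue;
        c = items[0];
        if c not in values:
            values[c] = {feat: [] for feat in Features};
        for i, feat in enumerate(Features):
            values[c][feat].append(float(items[i+1]));
        n += 1;

    #Compute mean/standard deviation per class and feature,
    #plus the prior probability of each class
    Classes = {};
    P = {};
    for c in values:
        count = len(values[c][Features[0]]);
        P[c] = float(count)/n;
        Classes[c] = {};
        for feat in Features:
            v = values[c][feat];
            mean = sum(v)/count;
            variance = sum((x-mean)**2 for x in v)/count;
            Classes[c][feat] = {"Mean": mean, "StDev": math.sqrt(variance)};

    return [Classes, Features, P, n];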
Classifier
Let our classifier function be called Classifier. It takes as a parameter a tuple, Evidence, which holds tuples of two items: the name of a feature and the value of that feature. This is how we call the function:
#Run classifier with the evidence list
Classifier((('Height', 170), ('Weight', 65)));
That means the item we want to classify has Height of 170 and Weight of 65.
First we want to parse Evidence into a string (called evidence), with features separated by a comma. For example, if we receive (('Height', 170), ('Weight', 65)) we want to parse it to "Height = 170, Weight = 65". That way we can save the classification for future use in an intuitive way. In statistics we write P(Wrestler|Height = 170, Weight = 65). Using a dictionary, we will write it as P["Wrestler|Height = 170, Weight = 65"], where "Height = 170, Weight = 65" will come from the variable evidence and "Wrestler" from the iteration over the classes.
We also want to check that the given features are actually part of the feature list, Features. We will do that check and build the string at the same time. Evidence holds two-element tuples; the first element is the feature name and the second is the feature value. We will save those two values in the variables eF and eV respectively, and check whether eF is in Features.
#The evidence string we will build
evidence = '';

#Check if all evidence is also in Features
for e in Evidence:
    eF = e[0]; #The feature in evidence e
    eV = e[1]; #The value in evidence e

    if eF not in Features:
        #A given evidence does not belong in Features. Abort.
        print "Evidence list is erroneous"
        return;

    #Build the evidence string
    evidence += eF + " = " + str(eV) + ', ';

evidence = evidence[:-2]; #remove last two chars
Now we reach the core of the algorithm. We will run the Bayes equation for all classes and we will pick the largest value.
What we are essentially doing is finding the value P(c|evidence) for each c in Classes. The Bayes equation is P(c|evidence) = P(evidence|c) * P(c) / P(evidence). Because the features are assumed to be independent (hence the "Naive" in the algorithm's name), we can simplify the above into \( P(c | evidence) = P(evidence_1 | c) \cdot P(evidence_2 | c) \cdots P(evidence_n | c) \cdot P(c) \). We do not need to calculate P(evidence), since it is the same for all classes and does not play a role in the algorithm (also, it is hard to compute). We know what P(c) is, but we are missing the probability of a feature given a class, \( P(evidence_i | c) \). To calculate it, we will use the Gaussian formula from before.
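For the running example, the quantity we compute and compare for the Wrestler class is

\( P(Height = 170 \mid Wrestler) \cdot P(Weight = 65 \mid Wrestler) \cdot P(Wrestler) \)

and likewise for Sumo; the class with the larger value is the classification.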
To the Gaussian function we have created, we will input the mean and the standard deviation of the feature given the class (the class name is c, the feature name is eF):
Classes[c][eF]["Mean"]
Classes[c][eF]["StDev"]
We will also input the feature value, eV. We will do this for every class and every feature in Evidence.
Lastly, we need to keep the class with the max value. A simple "Find Max" problem. To implement this we will use the auxiliary variables m (for the max probability) and classification (for the class), initialized before the loop.
In code:
#Auxiliary variables for the "Find Max"
m = -1;
classification = '';

for c in Classes:
    #Start from the prior probability
    P[c + '|' + evidence] = P[c];

    for e in Evidence:
        eF = e[0]; #The feature in evidence e
        eV = e[1]; #The value in evidence e

        #Multiply by the conditional prob
        mean = Classes[c][eF]["Mean"]; #mean
        stDev = Classes[c][eF]["StDev"]; #standard deviation
        P[c + '|' + evidence] *= Gaussian(mean,stDev,eV);

    if(P[c + '|' + evidence] > m):
        #P(c|evidence) is the max so far;
        #update m and classification
        m = P[c + '|' + evidence];
        classification = c;
To print the solution/classification of the item, we write:
#With the evidence, the item belongs to classification
#with a prob of m
print classification, m;
If you run the above code, you should receive as output:
Wrestler 0.00346294406914
Conclusion
And with that, we have developed a Numerical Naive Bayes classifier from the ground up. You can find the full code on my Github.