Naive Bayes

This page contains the Naive Bayes code and visualizations done in R for Record data and in Python for Text data.

What is Naive Bayes?

It is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large datasets. Along with its simplicity, Naive Bayes can outperform even highly sophisticated classification methods.
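For a class C and features x_1, ..., x_n, Bayes' Theorem combined with the independence assumption gives the posterior probability the classifier uses, and the predicted class is the one that maximizes it:

P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)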


Naive Bayes in R

The Naive Bayes model is created using a record dataset in R. It is the same record dataset used for decision trees. The dataset has been downloaded from the UCI Machine Learning Repository; the link for the source can be found here. This dataset consists of features such as age, workclass, education, marital status, relationship, race, sex, and hours per week. The label column indicates whether a person is rich (salary greater than or equal to $50k) or poor (salary less than $50k).

A snapshot of the dataset and the link to the CSV file are attached below.

dt_r_data_small.jpg
Download csv file
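As a minimal sketch of getting started (the file name here is assumed for illustration, not necessarily the exact file used), the downloaded CSV can be loaded and inspected in R as follows:

# Load the income record data (file name assumed for illustration)
income_df <- read.csv("adult_income.csv", stringsAsFactors = TRUE)

# Inspect the columns and the label distribution
str(income_df)
table(income_df$income)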

Cleaning and formatting the dataset

The code to clean and format the dataset can be found here.

There are a lot of unwanted columns in the dataframe. These columns are dropped, retaining only the necessary ones. The dataset is then checked for NA values, and all rows containing NA values are removed. Checking the balance of the dataset's label is very important before training the classifier, as an unbalanced dataset may lead to overfitting or underfitting.
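A minimal sketch of these cleaning and balancing steps, assuming the column and label names shown here (the linked code may differ), is given below:

# Keep only the columns needed for modelling (column names assumed)
keep_cols <- c("age", "workclass", "education", "marital.status",
               "relationship", "race", "sex", "hours.per.week", "income")
income_df <- income_df[, keep_cols]

# Drop rows containing NA values
income_df <- na.omit(income_df)

# Balance the label by down-sampling the majority class
set.seed(42)
rich  <- income_df[income_df$income == ">=50K", ]
poor  <- income_df[income_df$income == "<50K", ]
n_min <- min(nrow(rich), nrow(poor))
balanced_df <- rbind(rich[sample(nrow(rich), n_min), ],
                     poor[sample(nrow(poor), n_min), ])
table(balanced_df$income)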

Snapshots of the label column (income) before and after balancing are attached below.

income_before_bal.jpeg     balance_label.jpeg

Model Building

The code to build the model can be found here.

Before building the model, the dataset is split into training and testing sets, with 75% of the data in the training set and 25% in the testing set. The Naive Bayes model is trained on the training set, and its predictions on the testing set are then compared against the held-out test labels.
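A minimal sketch of the split, training, and evaluation using the e1071 package (variable names carried over from the sketches above, not necessarily those in the linked code) could look like this:

library(e1071)

# 75/25 train-test split
set.seed(42)
train_idx <- sample(nrow(balanced_df), size = round(0.75 * nrow(balanced_df)))
train_df  <- balanced_df[train_idx, ]
test_df   <- balanced_df[-train_idx, ]

# Hold out the true labels and remove them from the test features
test_labels    <- test_df$income
test_df$income <- NULL

# Train the Naive Bayes model and predict on the test set
nb_model <- naiveBayes(income ~ ., data = train_df)
nb_preds <- predict(nb_model, newdata = test_df)

# Confusion matrix and overall accuracy
conf_mat <- table(Predicted = nb_preds, Actual = test_labels)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)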

Snapshots of the Naive Bayes confusion matrix are attached below.

Conf_NB.jpeg     conf_matrix_nb.jpg

A graph showing the key features from the model is attached below.

Cross_val_nb.jpeg
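The graph above comes from the linked code. As one illustration of how such a feature ranking could be produced (not necessarily the method used here), a single-feature Naive Bayes model can be trained per predictor and the resulting test accuracies compared:

# Illustrative single-feature importance estimate (assumption, not the linked code)
features <- setdiff(names(train_df), "income")
acc_per_feature <- sapply(features, function(f) {
  m <- naiveBayes(train_df[, f, drop = FALSE], train_df$income)
  p <- predict(m, newdata = test_df[, f, drop = FALSE])
  mean(p == test_labels)
})
barplot(sort(acc_per_feature, decreasing = TRUE),
        las = 2, main = "Single-feature Naive Bayes accuracy")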

Conclusion

The intention was to use Naive Bayes to help predict the income class of a person given the different features. The accuracy of the model (79%) is reasonably good for predicting the income group. According to the above graph, the key features for predicting the income group are age, hours worked per week, and sex. It makes sense that the variables age and hours worked per week have a significant impact on income.