Naive Bayes

This page contains the Naive Bayes code and visualizations, written in Python, for text data.

What is Naive Bayes

It is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that, given the class, the presence of a particular feature is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large data sets. Despite its simplicity, Naive Bayes often performs competitively with more sophisticated classification methods, particularly on text classification tasks.
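As a short worked statement of the idea above, Bayes' theorem and the independence assumption can be written as follows. The symbols are introduced here for illustration only: c is a class and x_1, ..., x_n are the features (for text, typically word counts).

```latex
% Bayes' theorem for a class c and features x_1, ..., x_n
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}

% "Naive" assumption: features are conditionally independent given the class
P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c)

% Resulting decision rule (the denominator does not depend on c)
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```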

Naive Bayes in Python

The Naive Bayes model is built on a text dataset in Python. The tweets were extracted using the hashtags "social worker" and "covid". The motivation for collecting this text data was to understand people's opinions about social workers and the range of tweets about covid. To get a quick overview of the data, a word cloud was made for each hashtag.

Cleaning and formatting the dataset to the required format

There are a lot of unwanted columns in the dataframe. These columns are dropped, retaining only the necessary ones. The stopwords are removed and the text is tokenized, lemmatized and stemmed. CountVectorizer is then applied to convert the text to a numerical format. Checking the balance of the labels is important before training Naive Bayes, as an imbalanced dataset can bias the classifier toward the majority class and make the accuracy figure misleading.
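The cleaning steps described above can be sketched roughly as follows. This is not the page's actual code: it uses only the Python standard library as a stand-in for the NLP tooling and CountVectorizer, it skips lemmatization and stemming, and the stopword list, the example tweets and the helper names (`tokenize`, `to_count_vector`) are all made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-ins: a tiny stopword list and four made-up tweets.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

tweets = [
    ("Social workers are the backbone of the community", "socialworker"),
    ("Thank a social worker today", "socialworker"),
    ("Covid cases are rising in the city", "covid"),
    ("New covid vaccine is available", "covid"),
]

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

# Tokenize every tweet and build a sorted vocabulary of all remaining words.
docs = [tokenize(text) for text, _ in tweets]
vocabulary = sorted({w for doc in docs for w in doc})

def to_count_vector(tokens):
    """Bag-of-words row: how often each vocabulary word occurs in the tokens."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Document-term count matrix, one row per tweet.
matrix = [to_count_vector(doc) for doc in docs]

# Label balance check: count how many tweets fall in each class.
label_counts = Counter(label for _, label in tweets)

print(vocabulary)
print(matrix[0])
print(label_counts)
```

If the label counts differ strongly, accuracy alone becomes a weak measure, since always predicting the majority class would already score well.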

A snapshot of the dataset and a link to the CSV file are attached below.

View Download csv file

Model Building

The code to build the model can be found here

Before building the model, the dataset is split into training and testing sets, with 75% of the data in the training set and 25% in the testing set. Three different Naive Bayes models are created. The models differ in their hyperparameter settings, mainly the alpha (smoothing) value.
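To make the split and the role of alpha concrete, here is a minimal from-scratch sketch. It is not the code linked above: the page does not show which implementation was used, so this assumes a multinomial Naive Bayes over word counts, and the class `MultinomialNB`, the helper `train_test_split` and the toy token lists are written here purely for illustration using the standard library.

```python
import math
import random
from collections import Counter

def train_test_split(rows, test_size=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) with the given test share."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_size)
    return rows[n_test:], rows[:n_test]

class MultinomialNB:
    """Toy multinomial Naive Bayes with additive smoothing parameter alpha."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {w for doc in docs for w in doc}
        self.priors = {c: labels.count(c) / len(labels) for c in self.classes}
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.totals = {c: sum(self.word_counts[c].values()) for c in self.classes}
        return self

    def log_score(self, doc, c):
        """log P(c) + sum of log P(word | c); words outside the vocabulary are skipped."""
        score = math.log(self.priors[c])
        for w in doc:
            if w not in self.vocab:
                continue
            num = self.word_counts[c][w] + self.alpha
            den = self.totals[c] + self.alpha * len(self.vocab)
            if num == 0:  # only possible when alpha == 0
                return float("-inf")
            score += math.log(num / den)
        return score

    def predict(self, doc):
        return max(self.classes, key=lambda c: self.log_score(doc, c))

# Made-up, already tokenized tweets (hypothetical data).
data = [
    (["social", "worker", "support"], "socialworker"),
    (["social", "worker", "community"], "socialworker"),
    (["social", "work", "care"], "socialworker"),
    (["social", "worker", "help"], "socialworker"),
    (["covid", "vaccine", "dose"], "covid"),
    (["covid", "cases", "rising"], "covid"),
    (["covid", "lockdown", "city"], "covid"),
    (["covid", "test", "positive"], "covid"),
]

train, test = train_test_split(data, test_size=0.25, seed=42)
train_docs = [doc for doc, _ in train]
train_labels = [label for _, label in train]

# Fit one model per alpha value mentioned on this page.
for alpha in (1, 5, 0):
    model = MultinomialNB(alpha=alpha).fit(train_docs, train_labels)
    correct = sum(model.predict(doc) == label for doc, label in test)
    print(f"alpha={alpha}: test accuracy = {correct / len(test):.2f}")
```

The numbers printed for this toy data say nothing about the accuracies reported on this page; the sketch only shows where the split ratio and alpha enter the computation.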

Naive Bayes Model 1

This is the first Naive Bayes model. Its hyperparameter is alpha = 1, and its accuracy is 95%.

The snapshot of the accuracy/heatmap is attached below.
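For readers who want to see how an accuracy figure and the matrix behind such a heatmap are computed, here is a small sketch in plain Python. The label lists are invented for illustration and are unrelated to the results on this page; the helper names `confusion_matrix` and `accuracy` are defined here rather than taken from a library.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical true and predicted labels for five tweets.
labels = ["covid", "socialworker"]
y_true = ["covid", "covid", "socialworker", "socialworker", "covid"]
y_pred = ["covid", "socialworker", "socialworker", "socialworker", "covid"]

cm = confusion_matrix(y_true, y_pred, labels)
for name, row in zip(labels, cm):
    print(f"{name:>12}: {row}")
print("accuracy:", accuracy(y_true, y_pred))
```

A heatmap is this matrix drawn with color-coded cells; the diagonal holds the correctly classified tweets.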

Naive Bayes Model 2

This is the second Naive Bayes model. Its hyperparameter is alpha = 5, and its accuracy is 95%.

The snapshot of the accuracy/heatmap is attached below.

Naive Bayes Model 3

This is the third Naive Bayes model. Its hyperparameter is alpha = 0, and its accuracy is 94%.
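To see why alpha = 0 is a special case, it helps to write out the smoothed word probability. This assumes the models are multinomial Naive Bayes over word counts, which the page does not state explicitly; the symbols N_{wc} (count of word w in class c), N_c (total word count in class c) and |V| (vocabulary size) are introduced here for illustration.

```latex
% Additively smoothed estimate of a word's probability within a class
\hat{P}(w \mid c) = \frac{N_{wc} + \alpha}{N_c + \alpha \, |V|}

% With \alpha = 0 this reduces to the raw relative frequency
\hat{P}(w \mid c) = \frac{N_{wc}}{N_c}
```

With alpha = 0, a word that never occurred in the training tweets of a class gets probability zero for that class, which zeroes out the whole product for any tweet containing it. Larger alpha values pull all word probabilities toward a uniform distribution over the vocabulary.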

The snapshot of the accuracy/heatmap is attached below.

Conclusion

The Naive Bayes models classified tweets collected from Twitter into their hashtag class (socialworker or covid). The models reach accuracies of 95% (alpha = 1 and alpha = 5) and 94% (alpha = 0), so changing alpha had little effect on this dataset. Using only word counts, the model is able to classify most tweets into the correct class.