Naive Bayes
What is Naive Bayes
It is a classification technique based on Bayes Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Naive Bayes in Python
The naive bayes is created using text dataset in Python.The tweets are extracted on the hashtag "social worker" and hashtag "covid" . The motive behind collecting this text data was to understand the opinion of people regarding social workers and different tweets regarding covid. To get a quick overview of the data, a wordcloud of both the hashtags has been made.
Cleaning and formatting the dataset to the required format
There are a lot of unwated columns in the dataframe. These columns are dropped from the dataframe, retaining only the necessary columns. The stopwords are removed and the text is tokenized, lemmatized and stemmed. Countvectorizer is applied on the data to convert it to numerical format. Checking the balance of the label is very important before performing decision trees, as unbalanced dataset may lead to over or underfitting.
Model Building
The code to build the model can be found here
Before building the model, the dataset is split into training and testing sets. The split ratio is 0.75 of the total data in the training set and 0.25 data in the testing set. Three different naive bayes models are created. The Naive Bayes models differ due to hypertuning of different parameters. Mainly the alpha values.
Naive Bayes Model 1
This is the first naive bayes model. In this model, the hyperparameter are alpha = 1. In this model, the accuracy is 95%.
Naive Bayes Model 2
This is the second naive bayes model. In this model, the hyperparameter are alpha = 5. In this model, the accuracy is 95%.
Naive Bayes Model 3
This is the third naive bayes model. In this model, the hyperparameter are alpha = 0. In this model, the accuracy is 94%.
Conclusion
The Naive Bayes model classified tweets generated from Twitter into the hashtag class (socialworker and covid) of different tweets. The accuracy of the models are 95% and 94%. The accuracy is pretty high. With the collection of words, the model is able to predict or classify the tweets into particular classes.