Problem

The task is binary classification: train a model that classifies each email as either spam or ham (not spam).

Flow to solve the problem

[Figure: flowchart of the solution steps]

1. Download input file

The input CSV file can be downloaded from the link.

2. Data Preprocessing

Data preprocessing includes the following steps.

Install the required libraries.
```
!pip install -U scikit-learn
!pip install pandas
```

Import the libraries.
```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
```

Read the input file.
```
df = pd.read_csv("dataset/spam_ham_dataset.csv")
```

Delete the columns which are not required.
```
del df['Unnamed: 0']
del df['label']
```
Check for null values in each column.
```
df.isnull().sum()
```
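
If the check reports missing values, one common option is to drop the affected rows before training. The sketch below uses a small made-up DataFrame; the names toy and cleaned are illustrative, and the column names text and label_num are assumptions rather than something confirmed by the dataset above.

```python
import pandas as pd

# Hypothetical toy data; 'text' and 'label_num' are assumed column names.
toy = pd.DataFrame({
    "text": ["hello team", None, "win cash now"],
    "label_num": [0, 1, 1],
})

# Count missing values per column.
null_counts = toy.isnull().sum()

# Drop rows whose text is missing and renumber the index.
cleaned = toy.dropna(subset=["text"]).reset_index(drop=True)

print(null_counts)
print(len(cleaned))
```

Dropping is only one choice; if many rows were affected, filling the gaps with an empty string might be a better fit.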

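Split the data into training and test sets. X and y are not defined in the earlier steps, so the snippet assumes the email body is in the 'text' column and the numeric label in the 'label_num' column.
```
# Assumed column names: 'text' (email body), 'label_num' (1 = spam, 0 = ham).
X, y = df['text'], df['label_num']
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% for testing
    random_state=42,  # reproducible split
    shuffle=True,
)
```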

3. TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.
It converts each document (here, each email) into a numeric vector in which every word is weighted by how often it appears in that document and how rare it is across all documents.
It is needed because a model cannot process raw text directly, so the text must first be converted into numeric features. TF-IDF reflects how important a word is to a document, not its semantic meaning.
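
To make the weighting concrete, here is a minimal pure-Python sketch. It uses the smoothed IDF ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which mirrors the documented defaults of scikit-learn's TfidfVectorizer. The function name tfidf, the whitespace tokenizer, and the three sample emails are assumptions made for illustration; the real vectorizer tokenizes differently.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    # Simplified tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term count multiplied by the smoothed inverse document frequency.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2-normalise so every non-empty document vector has length 1.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Made-up emails, for illustration only.
emails = ["free prize inside", "meeting notes inside", "project notes inside"]
weights = tfidf(emails)
print(weights[0])
```

In the first email, "inside" appears in every document, so it ends up with a lower weight than "free" or "prize", which appear in only one.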

4. Model training with the Naive Bayes algorithm

Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem and is used in a wide variety of classification tasks.
First create a pipeline: the data passes through TF-IDF, which converts the text into numeric values, and then through MultinomialNB() (the Naive Bayes model).
```
model = make_pipeline(TfidfVectorizer(),MultinomialNB())
```
Train the model
```
model.fit(X_train,y_train)
```
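
To see how a fitted pipeline is used, here is a minimal sketch on a made-up four-email corpus. The texts, the labels, and the name toy_model are illustrative assumptions, with 1 standing for spam and 0 for ham. Raw strings go straight into predict() because the pipeline applies TF-IDF internally.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = spam, 0 = ham.
train_texts = [
    "win free money now",
    "claim your free prize now",
    "meeting schedule for monday",
    "project report attached",
]
train_labels = [1, 1, 0, 0]

# Same pipeline shape as above: TF-IDF features feeding a Naive Bayes model.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(train_texts, train_labels)

# Unseen emails are passed as plain strings.
new_emails = ["free money prize", "monday project meeting"]
preds = toy_model.predict(new_emails)        # predicted class per email
probs = toy_model.predict_proba(new_emails)  # one probability per class
print(preds)
print(probs)
```

A corpus this small only demonstrates the mechanics; it says nothing about how the model behaves on the real dataset.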

5. Testing the result

At this point, we test the trained model by doing the following steps.
- Check the accuracy of the model.
- Check the confusion matrix parameters (True Positive , True Negative, False Positive, False Negative).
- Analyze the confusion matrix to see which of these counts needs to improve.

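```
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)   # signature is (y_true, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(acc)
print(cm)
```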
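
The analysis step listed above usually comes down to a few ratios computed from the four counts. Here is a minimal sketch that uses made-up counts rather than results from this dataset; summarize_confusion is a hypothetical helper, not a scikit-learn function.

```python
def summarize_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision and recall from the four confusion counts."""
    total = tn + fp + fn + tp
    return {
        # Share of all emails that were classified correctly.
        "accuracy": (tp + tn) / total,
        # Of the emails flagged as spam, the share that really were spam.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam emails, the share that were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts, for illustration only.
metrics = summarize_confusion(tn=90, fp=10, fn=5, tp=45)
print(metrics)
```

With scikit-learn, the four counts for a binary problem can be unpacked with tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel(). For spam filtering, low precision means legitimate emails are being flagged as spam, and low recall means spam is slipping through.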

Code

The full code can be accessed from this link.

Muhammad Faizan's Blog