Problem
We need to train a binary classification model that predicts whether an email belongs to the spam or the ham (not spam) category.
Flow to solve the problem
1. Download input file
- The input CSV file can be downloaded from the link.
2. Data Preprocessing
Data preprocessing includes the following steps.
- Install the required libraries.
!pip install -U scikit-learn
!pip install pandas
- Import the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
- Read the Input file
df = pd.read_csv("dataset/spam_ham_dataset.csv")
- Delete the columns that are not required.
del df['Unnamed: 0']
del df['label']
- Check for null values.
df.isnull().sum()
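- Split the data into a train and test set.
# X and y must be defined before the split; the column names 'text' and 'label_num' are assumed here.
X, y = df['text'], df['label_num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)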
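The preprocessing steps above can be sketched end to end on a small in-memory table. This is only an illustration: the toy rows stand in for the real CSV, the column names `text` and `label_num` are assumptions about what remains after the deletions, and `stratify=y` is an optional addition that is not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real CSV. The column names "text" and "label_num"
# are assumptions, not confirmed by the original file.
df = pd.DataFrame({
    "text": [
        "win a free prize now", "meeting at 10am", "cheap pills online",
        "lunch tomorrow?", "claim your reward", "project status update",
        "free money fast", "see you at the gym", "urgent offer inside",
        "notes from today's call", None,
    ],
    "label_num": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

null_counts = df.isnull().sum()      # missing values per column
df = df.dropna(subset=["text"])      # drop rows that have no email text

X = df["text"]        # features: raw email text
y = df["label_num"]   # target: 1 = spam, 0 = ham (assumed encoding)

# stratify=y keeps the spam/ham ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)
```

Stratifying is worth considering when one class is much rarer than the other, because a purely random split can leave the test set with very few spam examples.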
3. TF-IDF
- TF-IDF stands for Term Frequency-Inverse Document Frequency.
- It converts each document (here, each email) into a numeric vector in which every word is weighted by how often it occurs in that document and how rare it is across all documents.
- Models cannot process raw text directly, so a function is needed to convert the text into numeric values that reflect how important each word is to a document. TF-IDF captures word importance, not semantic meaning.
4. Model training with the Naive Bayes algorithm
- Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem, used in a wide variety of classification tasks.
- Create a pipeline in which the data first passes through TF-IDF, which converts the text into numeric values, and then through MultinomialNB() (the Naive Bayes model).
model = make_pipeline(TfidfVectorizer(),MultinomialNB())
- Train the model
model.fit(X_train,y_train)
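As a rough illustration of how the fitted pipeline is used afterwards, the sketch below trains the same kind of pipeline on a handful of made-up emails and classifies a new one. The example texts and the label encoding (1 = spam, 0 = ham) are assumptions for demonstration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training emails; 1 = spam, 0 = ham (assumed encoding).
texts = [
    "win a free prize now", "free money fast",
    "claim your free reward", "cheap pills free offer",
    "meeting at 10am tomorrow", "project status update",
    "lunch with the team", "notes from the call",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline vectorizes raw strings and then fits the classifier,
# so both fit() and predict() accept plain text.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

new_email = ["free prize now"]
pred = model.predict(new_email)          # predicted class label
proba = model.predict_proba(new_email)   # columns follow model.classes_
print(pred, proba)
```

Because the vectorizer lives inside the pipeline, its vocabulary is learned from the training data only, which avoids leaking information from the test set.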
5. Testing the result
- At this point, we test the trained model with the following steps.
- Check the accuracy of the model.
- Check the confusion matrix entries (true positives, true negatives, false positives, false negatives).
- Analyze the confusion matrix to see which of these entries needs to improve.
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
confusion_matrix(y_test, y_pred)
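The confusion matrix analysis described above can be sketched as follows. The label lists are hypothetical values chosen for illustration, not results from the trained model, and 1 is assumed to mean spam. For binary labels, `confusion_matrix(...).ravel()` yields the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = spam, 0 = ham).
y_test = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

accuracy = accuracy_score(y_test, y_pred)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, false positives (ham flagged as spam) are often the costlier error, so precision is usually worth watching alongside accuracy.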
Code
Full code can be accessed from this link