Email Classifier using Machine Learning Algorithm

Problem

The task is binary classification: train a model that classifies each email as either spam or ham (not spam).

Flow to solve the problem

Figure: flowchart of the steps below (Flowcharts(1).png)

1. Download input file

  • The input CSV file can be downloaded from the link.

2. Data Preprocessing

Data preprocessing includes the following steps.

  • Install the required libraries.
    !pip install -U scikit-learn
    !pip install pandas
    
  • Import the libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    
  • Read the input file.
    df = pd.read_csv("dataset/spam_ham_dataset.csv")
    
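  • Delete the columns that are not required.
    # 'Unnamed: 0' is a leftover index column.
    del df['Unnamed: 0']
    # Assumption: a numeric target column (e.g. 'label_num') remains after
    # dropping the text label.
    del df['label']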
    
  • Check for null values in each column.
    df.isnull().sum()
    
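  • Select the features and the target, then split the data into a train and test set.
    # Assumption: the email body is in 'text' and the numeric target in 'label_num'.
    X = df['text']
    y = df['label_num']
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=42,
                                                        shuffle=True)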
    

3. TF-IDF

  • TF-IDF stands for Term Frequency-Inverse Document Frequency.
  • It converts each document into a numeric vector in which every term is weighted by how often it occurs in that document and how rare it is across all documents.
  • A model cannot process raw text directly, so we need a function that converts the text into numeric values. TF-IDF weights reflect how important each term is to a document; they do not capture semantic meaning.

4. Model training with the Naive Bayes algorithm

  • Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem and is used in a wide variety of classification tasks.
  • Create a pipeline: the data first passes through TF-IDF, which converts the text into numeric values, and then through MultinomialNB() (the Naive Bayes model).
    model = make_pipeline(TfidfVectorizer(),MultinomialNB())
    
  • Train the model
    model.fit(X_train,y_train)
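
To see what the pipeline does end to end without the dataset, here is a minimal sketch on a handful of made-up emails. The texts, the labels (1 = spam, 0 = ham), and the name toy_model are assumptions for illustration and are not part of the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration (1 = spam, 0 = ham).
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "meeting agenda for monday",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Same pipeline shape as above: TF-IDF features feed the Naive Bayes model.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(texts, labels)

# The pipeline accepts raw strings; vectorization happens internally.
new_emails = ["free prize waiting", "monday report attached"]
predictions = toy_model.predict(new_emails)
probabilities = toy_model.predict_proba(new_emails)  # columns follow toy_model.classes_

print(predictions)
print(probabilities)
```

Because the vectorizer is part of the pipeline, it is fitted only on the training texts and reused unchanged at prediction time, which avoids leaking test-set vocabulary statistics into training.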
    

5. Testing the result

  • At this point, we test the trained model with the following steps.
    • Check the accuracy of the model.
    • Check the confusion matrix values (true positives, true negatives, false positives, false negatives).
    • Analyze the confusion matrix to see which of these values needs improvement.
    y_pred = model.predict(X_test)
    # scikit-learn metrics take the true labels first, then the predictions.
    acc = accuracy_score(y_test, y_pred)
    confusion_matrix(y_test, y_pred)
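
The four confusion matrix values mentioned above can be read out directly. The following is a small sketch that uses hypothetical labels (1 = spam, 0 = ham) rather than the real predictions; y_true and y_hat are made-up names and values.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1], rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of emails flagged as spam that really are spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For a spam filter, false positives (ham flagged as spam) are often the costlier error, so precision on the spam class is one reasonable value to watch alongside accuracy.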

Code

The full code can be accessed from this link.