Model Statistics & Performance

Detailed analysis of our spam detection model's performance, dataset insights, and feature importance.

  • Precision: 97.6% (of messages predicted as spam, 97.6% were actually spam)
  • Recall: 89.8% (of all spam messages, 89.8% were correctly detected)
  • F1-Score: 93.5% (harmonic mean of precision and recall)
  • Dataset Size: ~5,500 labeled SMS messages analyzed
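The three scores are mutually consistent; plugging the reported precision and recall into the harmonic-mean formula recovers the F1-score:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.976 \cdot 0.898}{0.976 + 0.898}
    \approx 0.935
```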

[Charts: dataset composition and performance metrics]
Processing Pipeline
Data Collection
Collected ~5,500 SMS messages, each labeled ham or spam, from the spam.csv dataset.
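A minimal loading sketch (the v1/v2 column names follow the public SMS Spam Collection CSV and are an assumption about this spam.csv):

```python
import pandas as pd

# Load the labeled SMS dataset; latin-1 avoids decode errors common in this file.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only the label and text columns and give them descriptive names.
df = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(df["label"].value_counts())  # ham vs. spam counts
```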
Preprocessing
Removed null values and unnecessary columns. Converted text to lowercase, removed punctuation and English stopwords using NLTK. Tokenized and cleaned using regex.
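Continuing the sketch above, the cleaning step might look like the following (the exact regex and stopword handling are assumptions):

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, tokenize with a regex, drop stopwords."""
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

df = df.dropna(subset=["text"])                 # remove null values
df["clean_text"] = df["text"].map(clean_text)
```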
Feature Extraction
Used a TF-IDF vectorizer to convert text into numerical features, weighting each word by its frequency within a message relative to its frequency across the dataset.
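In scikit-learn this is a single call (default parameters assumed; the original settings are not documented here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF scales a word's in-message frequency down by how common the word
# is across the whole dataset, so rare, distinctive terms score highest.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean_text"])  # sparse matrix: messages x vocabulary
```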
Label Encoding
Encoded labels using Label Encoder (ham = 0, spam = 1).
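Because LabelEncoder assigns codes alphabetically, ham maps to 0 and spam to 1:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df["label"])  # ham -> 0, spam -> 1
```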
Train-Test Split
Split the data into 80% training and 20% testing so evaluation measures generalization to unseen messages.
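A sketch of the split (stratification and the seed are assumptions, added so both splits keep the ham/spam ratio and the run is reproducible):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```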
Model Training
Trained a Random Forest classifier, chosen for its high accuracy, its ability to handle large feature sets, and its resistance to overfitting.
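Training itself is a two-liner; the hyperparameters shown are illustrative defaults, not the original configuration:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # accepts the sparse TF-IDF matrix directly
```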
Evaluation
Evaluated the model with precision, recall, and F1-score, focusing on keeping both false positives and false negatives low.
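The three reported metrics can be computed from the held-out predictions like so:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print(f"Precision: {precision_score(y_test, y_pred):.3f}")  # how many flagged messages were spam
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")     # how much spam was caught
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
```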
Model Details

Our spam detection system uses a Random Forest Classifier with TF-IDF feature extraction to identify unwanted messages.

Why Random Forest?
  • High accuracy on diverse data
  • Handles large feature sets efficiently
  • Resistant to overfitting
  • Provides feature importance metrics
  • Works well with TF-IDF vectorized text
Confusion Matrix
                Predicted HAM    Predicted SPAM
Actual HAM      TN: 982          FP: 8
Actual SPAM     FN: 16           TP: 144
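A matrix like the one above comes straight from the held-out predictions (exact counts depend on the split and random seed):

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```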
Test Case Example
Input: "Congratulations! You've won a $1000 Walmart gift card. Click here to claim now."
Prediction: SPAM (Confidence: 96.8%)
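Scoring a single message reuses the training-time cleaning and vectorizer, and the confidence is the forest's class probability (a sketch, assuming the clean_text helper above):

```python
message = ("Congratulations! You've won a $1000 Walmart gift card. "
           "Click here to claim now.")

features = vectorizer.transform([clean_text(message)])
proba = model.predict_proba(features)[0]        # [P(ham), P(spam)]
label = "SPAM" if proba[1] >= 0.5 else "HAM"
print(f"Prediction: {label} (Confidence: {proba.max():.1%})")
```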
Feature Importance
Our model identifies specific keywords and patterns that are strongly associated with spam messages.
[Keyword charts: top spam-indicative and ham-indicative terms]
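One way to recover such keyword lists is to map the forest's importances back to the TF-IDF vocabulary (a sketch; importance alone does not say which class a term indicates):

```python
import numpy as np

importances = model.feature_importances_
terms = vectorizer.get_feature_names_out()
for i in np.argsort(importances)[::-1][:10]:    # ten most informative terms
    print(f"{terms[i]}: {importances[i]:.4f}")
```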
Insights & Interpretations

Based on our analysis of the model's performance and the dataset's characteristics:

  • High Precision (97.6%): The model excels at avoiding false positives, meaning legitimate messages are rarely flagged as spam.
  • Good Recall (89.8%): The model catches most spam messages, though some still slip through.
  • Strong F1-Score (93.5%): Indicates well-balanced performance between precision and recall.
  • TF-IDF Effectiveness: Successfully highlights spam-indicative keywords, improving classification quality.
  • Dataset Imbalance: The dataset contains more ham than spam messages, which is expected in real-world scenarios.