Model Statistics & Performance

Detailed analysis of our spam detection model's performance, dataset insights, and feature importance.

  • Precision: 97.6% (of messages predicted as spam, 97.6% were actually spam)
  • Recall: 89.8% (of all spam messages, 89.8% were correctly detected)
  • F1-Score: 93.5% (harmonic mean of precision and recall)
  • Dataset Size: ~5,500 labeled SMS messages analyzed
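The three scores are mutually consistent; plugging the reported precision and recall into the harmonic-mean formula recovers the F1-score:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.976 \cdot 0.898}{0.976 + 0.898}
    \approx 0.935
```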

[Charts: dataset composition and performance metrics]
Processing Pipeline
Data Collection
Collected ~5,500 SMS messages, each labeled ham or spam, from the spam.csv dataset.
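A minimal loading sketch (the v1/v2 column names follow the public SMS Spam Collection CSV and are an assumption about this spam.csv):

```python
import pandas as pd

# Load the labeled SMS dataset; latin-1 avoids decode errors common in this file.
df = pd.read_csv("spam.csv", encoding="latin-1")

# Keep only the label and text columns and give them descriptive names.
df = df[["v1", "v2"]].rename(columns={"v1": "label", "v2": "text"})
print(df["label"].value_counts())  # ham vs. spam counts
```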
Preprocessing
Removed null values and unnecessary columns. Converted text to lowercase, removed punctuation and English stopwords using NLTK. Tokenized and cleaned using regex.
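Continuing the sketch above, the cleaning step might look like the following (the exact regex and stopword handling are assumptions):

```python
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, tokenize with a regex, drop stopwords."""
    text = text.lower()
    text = re.sub(rf"[{re.escape(string.punctuation)}]", " ", text)
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

df = df.dropna(subset=["text"])                 # remove null values
df["clean_text"] = df["text"].map(clean_text)
```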
Feature Extraction
Used a TF-IDF vectorizer to convert text into numerical features, weighting each word by its frequency within a message relative to its frequency across the dataset.
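In scikit-learn this is a single call (default parameters assumed; the original settings are not documented here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF scales a word's in-message frequency down by how common the word
# is across the whole dataset, so rare, distinctive terms score highest.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["clean_text"])  # sparse matrix: messages x vocabulary
```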
Label Encoding
Encoded labels using Label Encoder (ham = 0, spam = 1).
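Because LabelEncoder assigns codes alphabetically, ham maps to 0 and spam to 1:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df["label"])  # ham -> 0, spam -> 1
```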
Train-Test Split
Split the data into 80% training and 20% testing so evaluation measures generalization to unseen messages.
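A sketch of the split (stratification and the seed are assumptions, added so both splits keep the ham/spam ratio and the run is reproducible):

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```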
Model Training
Trained a Random Forest classifier, chosen for its high accuracy, its ability to handle large feature sets, and its resistance to overfitting.
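Training itself is a two-liner; the hyperparameters shown are illustrative defaults, not the original configuration:

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)  # accepts the sparse TF-IDF matrix directly
```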
Evaluation
Evaluated the model with precision, recall, and F1-score, focusing on keeping both false positives and false negatives low.
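The three reported metrics can be computed from the held-out predictions like so:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_pred = model.predict(X_test)
print(f"Precision: {precision_score(y_test, y_pred):.3f}")  # how many flagged messages were spam
print(f"Recall:    {recall_score(y_test, y_pred):.3f}")     # how much spam was caught
print(f"F1-score:  {f1_score(y_test, y_pred):.3f}")
```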
Model Details

Our spam detection system uses a Random Forest Classifier with TF-IDF feature extraction to identify unwanted messages.

Why Random Forest?
  • High accuracy on diverse data
  • Handles large feature sets efficiently
  • Resistant to overfitting
  • Provides feature importance metrics
  • Works well with TF-IDF vectorized text
Confusion Matrix
                Predicted HAM    Predicted SPAM
Actual HAM      TN: 982          FP: 8
Actual SPAM     FN: 16           TP: 144
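A matrix like the one above comes straight from the held-out predictions (exact counts depend on the split and random seed):

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```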
Test Case Example
Input: "Congratulations! You've won a $1000 Walmart gift card. Click here to claim now."
Prediction: SPAM (Confidence: 96.8%)
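Scoring a single message reuses the training-time cleaning and vectorizer, and the confidence is the forest's class probability (a sketch, assuming the clean_text helper above):

```python
message = ("Congratulations! You've won a $1000 Walmart gift card. "
           "Click here to claim now.")

features = vectorizer.transform([clean_text(message)])
proba = model.predict_proba(features)[0]        # [P(ham), P(spam)]
label = "SPAM" if proba[1] >= 0.5 else "HAM"
print(f"Prediction: {label} (Confidence: {proba.max():.1%})")
```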
Feature Importance
Our model identifies specific keywords and patterns that are strongly associated with spam messages.
[Keyword charts: top spam-indicative and ham-indicative terms]
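One way to recover such keyword lists is to map the forest's importances back to the TF-IDF vocabulary (a sketch; importance alone does not say which class a term indicates):

```python
import numpy as np

importances = model.feature_importances_
terms = vectorizer.get_feature_names_out()
for i in np.argsort(importances)[::-1][:10]:    # ten most informative terms
    print(f"{terms[i]}: {importances[i]:.4f}")
```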
Insights & Interpretations

Based on our analysis of the model's performance and the dataset's characteristics:

  • High Precision (97.6%): The model excels at avoiding false positives, meaning legitimate messages are rarely flagged as spam.
  • Good Recall (89.8%): The model catches most spam messages, though some still slip through.
  • Strong F1-Score (93.5%): Indicates well-balanced performance between precision and recall.
  • TF-IDF Effectiveness: Successfully highlights spam-indicative keywords, improving classification quality.
  • Dataset Imbalance: The dataset contains more ham than spam messages, which is expected in real-world scenarios.