Day 14 – Hands-On Project: Spam Detection with Machine Learning


Introduction

Spam emails and messages remain a persistent problem in 2025. Machine learning classifiers can learn from labeled examples to separate spam from legitimate messages.

At curiositytech.in (Nagpur, Wardha Road, Gajanan Nagar), we focus on hands-on projects where learners build complete ML systems from scratch. This blog walks you through a spam detection project, covering every step from raw data to model evaluation, giving ML engineers practical experience.


1. Understanding the Problem

Objective: Classify emails or messages as spam or not spam (ham).

Challenges:

  • Imbalanced data (few spam messages, many ham messages)
  • Text variability and noise
  • Feature extraction from raw text

At CuriosityTech Park, we teach that understanding the problem statement and business impact is crucial before coding.


2. Dataset

Popular Datasets:

  • SMS Spam Collection Dataset (UCI Machine Learning Repository)
  • Enron Email Dataset
  • Kaggle Spam Detection Dataset

Dataset Structure:

Column | Description
Label  | spam or ham
Text   | Email/SMS content

Scenario: Riya uses the SMS Spam Collection Dataset with 5,574 messages for training and testing.
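
As a minimal sketch of the loading step, assume the dataset is a plain-text file with one message per line in the form label, tab, text (this is how the UCI download is laid out). The function name load_sms_spam and the sample rows below are made up for illustration; the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def load_sms_spam(handle):
    """Read (label, text) pairs from a tab-separated file handle."""
    labels, messages = [], []
    # QUOTE_NONE: quote characters inside a message are ordinary text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        label, text = row
        labels.append(label)
        messages.append(text)
    return labels, messages


# Tiny in-memory stand-in for the real file (made-up rows, same layout).
sample = io.StringIO(
    "ham\tAre we still meeting for lunch today?\n"
    "spam\tWINNER! Claim your free prize now, call 09001234567\n"
    "ham\tI'll call you later tonight\n"
    "ham\tCan you send me the notes?\n"
)
labels, messages = load_sms_spam(sample)

# Checking the class balance early makes the imbalance problem visible.
class_counts = Counter(labels)
print(class_counts)
```

With the real dataset, the same function can be given a file opened with open(path, encoding="utf-8", newline=""), where path points to the downloaded file.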


3. Text Preprocessing

Text preprocessing ensures clean, standardized input. Steps include:

  1. Lowercasing: Convert all text to lowercase
  2. Remove punctuation & special characters
  3. Tokenization: Split text into words
  4. Stopword Removal: Exclude common words (the, is, in) from the token list
  5. Stemming / Lemmatization: Reduce words to their root forms
  6. Vectorization: Convert the cleaned tokens to numerical form using TF-IDF or Bag of Words

Practical Tip: CuriosityTech learners often visualize word frequency distributions to understand data patterns.
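
The cleaning steps and the word-frequency check can be sketched with the standard library alone. The stopword list here is a short, hypothetical one chosen for the example, and stemming/lemmatization is left out because it normally comes from a library such as NLTK or spaCy.

```python
import string
from collections import Counter

# Small illustrative stopword list (an assumption); real projects usually
# take a fuller list from an NLP library.
STOPWORDS = {"the", "is", "in", "a", "an", "to", "and", "you", "your", "for", "of"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stopwords."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]


sample_messages = [
    "Free prize! Claim your free prize now",
    "Are you free for lunch?",
]

# Word frequencies across all messages, after cleaning.
word_counts = Counter(
    token for message in sample_messages for token in preprocess(message)
)
print(word_counts.most_common(2))  # [('free', 3), ('prize', 2)]
```

Running the same count separately on spam and ham messages is a quick way to see which words distinguish the two classes.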


4. Feature Extraction

Bag of Words (BoW): Count of each word occurrence in messages
TF-IDF: Word counts weighted down for words that appear in many messages, so rarer, more distinctive words stand out
Word Embeddings: Dense vectors capturing semantic meaning (Word2Vec, GloVe)

Scenario:
Arjun applies TF-IDF to the dataset, converting raw text into a matrix of numerical features for model training.
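
To make the weighting concrete, here is a hand-rolled sketch of the textbook formula, tf × log(N / df), on a made-up three-message corpus. It is for illustration only: scikit-learn's TfidfVectorizer uses a smoothed IDF and normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf(corpus):
    """corpus: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # term frequency (share of the document) times inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    ["free", "prize", "call", "now"],
    ["call", "me", "now"],
    ["lunch", "now"],
]
weights = tfidf(corpus)

# "now" appears in every message, so its weight is zero; "prize" appears in
# only one message, so it outweighs "call", which appears in two.
for term, weight in sorted(weights[0].items()):
    print(f"{term}: {weight:.3f}")
```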


5. Model Selection

Common ML Algorithms for Spam Detection:

Algorithm           | Pros                          | Cons
Logistic Regression | Simple, interpretable         | May underfit complex patterns
Naive Bayes         | Performs well with text, fast | Assumes feature independence
SVM                 | Handles high-dimensional data | Slower on large datasets
Random Forest       | Robust, handles noise         | Slow on sparse, high-dimensional text; deep trees can overfit

At CuriosityTech.in, students often compare Naive Bayes and Logistic Regression as a starting point for spam detection projects.
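
A side-by-side comparison of those two starting points might look like the sketch below. The training and test messages are invented for the example and far too few to say anything about real performance; the point is only that both models can share the same TF-IDF front end in a scikit-learn pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data, for illustration only.
train_texts = [
    "win a free prize now",
    "claim your free cash reward",
    "urgent call now to win",
    "free entry text to claim",
    "are we meeting for lunch",
    "see you at home tonight",
    "can you send me the notes",
    "thanks for the help yesterday",
]
train_labels = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
test_texts = ["free prize call now", "lunch at home tonight"]

# Each candidate gets its own vectorizer so the pipelines stay independent.
candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

predictions = {}
for name, pipeline in candidates.items():
    pipeline.fit(train_texts, train_labels)
    predictions[name] = pipeline.predict(test_texts).tolist()
    print(name, predictions[name])
```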


6. Model Training & Evaluation

Stepwise Approach:

  1. Split dataset into train and test sets (80%-20%)
  2. Train model using training set
  3. Evaluate using test set

Evaluation Metrics:

Metric           | Description
Accuracy         | Overall correctness
Precision        | Correctly predicted spam out of all predicted spam
Recall           | Correctly predicted spam out of all actual spam
F1-Score         | Balance between precision and recall
Confusion Matrix | Visual representation of TP, TN, FP, FN

Practical Insight:
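Riya notices that accuracy is high but recall is low, meaning many spam messages are missed. Because most messages are ham, a model can score high accuracy while catching little spam. Lowering the decision threshold for the spam class raises recall (at some cost in precision), and F1-score is a more informative summary than accuracy on imbalanced data like this.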
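
The gap between accuracy and recall is easy to see with a worked example. The confusion-matrix counts and probabilities below are hypothetical, picked only to mimic an imbalanced test set; the helper names are likewise made up for this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def predict_with_threshold(spam_probabilities, threshold=0.5):
    """Label a message spam when its predicted spam probability reaches the threshold."""
    return ["spam" if p >= threshold else "ham" for p in spam_probabilities]


# Hypothetical counts: 100 spam and 900 ham messages in the test set.
metrics = classification_metrics(tp=30, fp=5, fn=70, tn=895)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy is 0.925 even though only 30% of the spam is caught (recall 0.300)

# Lowering the threshold flags more messages as spam, which raises recall.
probabilities = [0.9, 0.45, 0.35, 0.1]
default_labels = predict_with_threshold(probabilities)
lenient_labels = predict_with_threshold(probabilities, threshold=0.3)
print(default_labels)
print(lenient_labels)
```

With a scikit-learn classifier, the spam probabilities would come from predict_proba, using the column that corresponds to the spam class.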


7. Hands-On Code Example (Python Snippet)

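from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

# Assumes `messages` (list of raw message texts) and `labels` (list of
# "spam"/"ham" strings) have already been loaded from the dataset.

# Split data; stratify keeps the spam/ham ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

# Vectorize text: fit on the training set only, then reuse the same
# vocabulary for the test set to avoid leaking test data into training
vectorizer = TfidfVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Train Naive Bayes
model = MultinomialNB()
model.fit(X_train_vect, y_train)

# Predictions
y_pred = model.predict(X_test_vect)

# Evaluation (rows/columns of the confusion matrix are ordered ham, spam)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1-Score:", f1_score(y_test, y_pred, pos_label="spam"))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred, labels=["ham", "spam"]))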

CuriosityTech students visualize confusion matrices to identify false positives and false negatives, which helps improve the model.


8. Improving Model Performance

  • Hyperparameter Tuning: Adjust alpha in Naive Bayes or regularization in Logistic Regression
  • Feature Engineering: Include n-grams (bigrams, trigrams)
  • Ensemble Methods: Combine multiple models to improve robustness
  • Cross-Validation: Ensure model generalization

Scenario: Arjun applies 5-fold cross-validation and bigram TF-IDF, improving F1-score from 0.85 to 0.91.
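
One way to combine these ideas is a scikit-learn pipeline tuned with GridSearchCV. This is a sketch under assumptions, not Arjun's actual setup: the messages are invented, the alpha grid is arbitrary, and labels are encoded as 1 (spam) and 0 (ham) so the built-in "f1" scorer treats spam as the positive class. Scores on a toy set this small are not meaningful.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: 10 spam and 10 ham messages, enough for 5 stratified folds.
spam_texts = [
    "win a free prize now", "claim your free cash reward",
    "urgent call now to win", "free entry text to claim",
    "you have won a free holiday", "call now for a cash prize",
    "free ringtone reply now", "win cash text win now",
    "claim your prize call now", "urgent free reward waiting",
]
ham_texts = [
    "are we meeting for lunch", "see you at home tonight",
    "can you send me the notes", "thanks for the help yesterday",
    "i will call you later", "what time is the meeting",
    "happy birthday to you", "running late see you soon",
    "did you finish the report", "let us have dinner tonight",
]
texts = spam_texts + ham_texts
labels = [1] * len(spam_texts) + [0] * len(ham_texts)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("nb", MultinomialNB()),
])

# 5-fold cross-validation over a small grid of smoothing values.
search = GridSearchCV(
    pipeline,
    param_grid={"nb__alpha": [0.1, 0.5, 1.0]},
    cv=5,
    scoring="f1",
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline means it is refit on each training fold, so the validation fold never influences the vocabulary.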


9. Deployment Tips

  1. Export model using joblib or pickle
  2. Deploy using Flask or FastAPI for real-time prediction
  3. Monitor model performance in production
  4. Update model with new data to handle evolving spam patterns
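
The export step can be sketched as a joblib round trip. Saving the vectorizer and classifier together as one pipeline avoids a common deployment mistake, where the model is loaded without the vocabulary it was trained on. The training messages, file name, and predict_message helper are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy training data, for illustration only.
texts = [
    "win a free prize now", "claim your free cash reward",
    "are we meeting for lunch", "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Save the whole pipeline (vectorizer + model) to a single file.
model_path = os.path.join(tempfile.mkdtemp(), "spam_pipeline.joblib")
joblib.dump(pipeline, model_path)

# At serving time, load once at startup and reuse for every request.
loaded = joblib.load(model_path)


def predict_message(text):
    """Return "spam" or "ham" for a single raw message."""
    return str(loaded.predict([text])[0])


print(predict_message("free cash prize"))
```

A Flask or FastAPI route would typically do little more than read the message from the request and return the result of a function like predict_message.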

At curiositytech.in, learners build end-to-end spam detection systems, combining preprocessing, model training, and deployment in a single project.


10. Key Takeaways

  • Spam detection is an excellent beginner-friendly NLP project
  • Preprocessing, feature extraction, and evaluation metrics are mandatory steps
  • Hands-on experimentation teaches practical ML skills essential for real-world projects
  • Model deployment ensures learning translates to production-ready systems

Conclusion

Spam detection demonstrates the power of machine learning in solving real-world problems. By mastering preprocessing, feature extraction, modeling, and deployment, ML engineers:

  • Build robust, production-ready models
  • Gain practical insights into NLP pipelines
  • Develop transferable skills for other text-based ML projects

Contact curiositytech.in at +91-9860555369 or contact@curiositytech.in to join hands-on ML projects and workshops.

