https://github.com/soumyapro/sms-spam-classifier
A machine learning project that detects spam SMS messages using natural language processing techniques. The model analyzes text messages and accurately classifies them as spam or legitimate (ham).
multinomial-naive-bayes nltk sklearn tfidf-vectorizer tokenizer
- Host: GitHub
- URL: https://github.com/soumyapro/sms-spam-classifier
- Owner: Soumyapro
- Created: 2025-03-29T22:16:14.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-03-29T22:36:09.000Z (6 months ago)
- Last Synced: 2025-03-29T23:24:23.484Z (6 months ago)
- Topics: multinomial-naive-bayes, nltk, sklearn, tfidf-vectorizer, tokenizer
- Language: Jupyter Notebook
- Homepage:
- Size: 753 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# SMS Spam Classifier

A machine learning project that accurately identifies spam SMS messages using Natural Language Processing (NLP) techniques and a Multinomial Naive Bayes classifier.
## Overview
This SMS Spam Classifier analyzes text messages and classifies them as either spam or legitimate (ham) with high precision. The model is specifically optimized to minimize false positives, ensuring legitimate messages are not incorrectly flagged as spam.
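
To illustrate why precision is the metric emphasized here, the following minimal sketch computes accuracy, precision, and recall from confusion-matrix counts, treating spam as the positive class. The counts are made up for illustration and are not the project's results; the function names are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all messages classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of messages flagged as spam that really are spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual spam messages that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up counts (spam = positive class), chosen only to show that
# zero false positives gives 100% precision even when some spam is missed.
tp, fp, tn, fn = 90, 0, 880, 30

print(accuracy(tp, fp, tn, fn))  # 0.97
print(precision(tp, fp))         # 1.0
print(recall(tp, fn))            # 0.75
```

With no false positives, no legitimate message is flagged as spam, so precision is 100% even though some spam is missed and accuracy stays below 100%.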
## Features
- **Text Preprocessing Pipeline**:
  - Lowercase conversion
  - Tokenization
  - Special character removal
  - Stop words and punctuation filtering
  - Word stemming using Porter Stemmer
- **Feature Engineering**:
  - Text vectorization using TF-IDF
  - Text statistics extraction (character count, word count, sentence count)
  - Feature correlation analysis
- **Model Selection and Evaluation**:
  - Comparative analysis of multiple classification algorithms
  - Emphasis on precision metric to minimize false positives
  - Multinomial Naive Bayes selected for optimal precision (100%)
- **Visualization**:
  - Word clouds for spam vs. legitimate messages
  - Frequency distribution of common words
  - Feature importance analysis
  - Class distribution visualization
## Performance Metrics
| Model | Accuracy | Precision |
|-------|----------|-----------|
| Multinomial Naive Bayes | 97.0% | 100% |
## Requirements
numpy==1.20.3
pandas==1.3.4
scikit-learn==1.0.1
nltk==3.6.5
matplotlib==3.4.3
seaborn==0.11.2
wordcloud==1.8.1
xgboost==1.4.2
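
With these packages installed, the preprocessing, TF-IDF, and Multinomial Naive Bayes steps listed under Features can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `preprocess` helper and the toy messages are hypothetical, and a regex tokenizer plus scikit-learn's built-in stop word list are assumed in place of NLTK's tokenizer and stop word corpus so that no corpus download is needed.

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, tokenize, drop special characters and stop words, then stem."""
    # Assumption: regex tokenization and scikit-learn's stop word list stand in
    # for NLTK's tokenizer and stopwords corpus, which need a separate download.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)


# Made-up toy messages for illustration only; not the project's dataset.
messages = [
    "Win a free prize now, claim your cash",
    "Free entry: text WIN to claim your prize",
    "Are we still meeting for lunch today?",
    "Call me when you get home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

# TF-IDF vectorization followed by a Multinomial Naive Bayes classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit([preprocess(m) for m in messages], labels)

prediction = model.predict([preprocess("Claim your free prize")])[0]
print(prediction)
```

New messages must pass through the same `preprocess` function before prediction, so that the vectorizer sees tokens in the form it was fitted on.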