https://github.com/breezy-codes/machine-learning-for-spam-sms
Real-time SMS spam detection using ML models in simulated cellular networks. Compares 4 algorithms with comprehensive performance analysis.
- Host: GitHub
- URL: https://github.com/breezy-codes/machine-learning-for-spam-sms
- Owner: breezy-codes
- License: gpl-3.0
- Created: 2025-09-26T22:45:40.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-09-26T22:56:27.000Z (8 months ago)
- Last Synced: 2025-10-04T05:49:11.536Z (8 months ago)
- Topics: logistic-regression, machine-learning, naive-bayes, network-simulation, random-forest, research, scikit-learn, spam-sms, spam-sms-detection, svm, telecommunication
- Language: Jupyter Notebook
- Homepage:
- Size: 32.1 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Machine Learning for Real-Time SMS Spam Detection in Cellular Networks
## Overview
This project evaluates four machine learning models for **real-time spam detection in cellular networks**. The models are trained and evaluated on an SMS dataset, then run inside a simulated cellular network environment to determine which approach is most effective for real-time spam filtering. View the academic report [here](report.pdf).
### Key Features
- **4 Machine Learning Models**: Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine
- **Real-time Simulation**: Simulated cellular network environment with baseband units and radio units
- **Comprehensive Evaluation**: Performance metrics including accuracy, precision, recall, and F1-score
- **Automated Spam Detection**: Real-time alerting system for high spam volume detection
- **Dataset Generation**: Custom spam dataset generator for testing on unseen data
### Research Focus
The project addresses the critical need for effective spam detection in cellular networks by:
- Comparing multiple ML algorithms in a realistic network simulation
- Evaluating real-time performance under cellular network constraints
- Analyzing model effectiveness for different spam patterns and volumes
- Providing insights into the most suitable algorithms for mobile network deployment
## Model Performance Summary
| Model | Training Accuracy | Simulation Accuracy | Simulation Time | Best Use Case |
|-------|------------------|-------------------|----------------|---------------|
| **Logistic Regression** | 99% | 88% | 8m 47s | High precision spam detection |
| **Naive Bayes** | 89% | 81% | 9m 39s | Fast processing, good recall |
| **Random Forest** | 89% | 85% | 20m 55s | Balanced performance |
| **Support Vector Machine** | 89% | 82% | 25m 53s | Complex pattern recognition |
## Project Structure
```text
machine-learning-for-spam-sms/
├── models/                     # ML Model Development
│   ├── models.ipynb            # Model training & evaluation
│   ├── spam_data.csv           # Training dataset
│   └── *.pkl                   # Trained model files
├── simulation/                 # Network Simulation
│   ├── simulation.ipynb        # Main simulation notebook
│   ├── data/                   # Generated test datasets
│   ├── logs/                   # Simulation logs by model
│   └── results/                # Performance results & figures
├── spam-generator/             # Dataset Generation
│   ├── generator.py            # Spam dataset generator
│   └── conversations.py        # Conversation templates
├── markdown/                   # Documentation
│   ├── models.md               # Model implementation details
│   ├── simulation.md           # Simulation methodology
│   └── install_instructions.md # Setup instructions
└── requirements.txt            # Python dependencies
```
### Key Components
- **Machine Learning Models**: Four different algorithms trained on SMS spam data
- **Cellular Network Simulation**: Realistic network topology with baseband and radio units
- **Real-time Processing**: Stream processing of SMS messages with spam detection
- **Performance Monitoring**: Comprehensive logging and alerting system
- **Dataset Generation**: Custom spam generator for testing model robustness
## Quick Start
### Prerequisites
- Python 3.8 or higher
- Virtual environment (recommended)
### Installation
1. **Clone the repository**
```bash
git clone https://github.com/breezy-codes/machine-learning-for-spam-sms.git
cd machine-learning-for-spam-sms
```
2. **Set up virtual environment**
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .\.venv\Scripts\activate
```
3. **Install dependencies**
```bash
pip install -r requirements.txt
```
For detailed setup instructions, see: [Setting Up a Python Virtual Environment](./markdown/install_instructions.md)
## Running the Machine Learning Models
Train and evaluate all four ML models using the comprehensive Jupyter notebook:
```bash
jupyter notebook models/models.ipynb
```
### What the Models Do
- **Data Preprocessing**: Text cleaning, tokenization, and vectorization
- **Model Training**: Hyperparameter tuning with cross-validation
- **Performance Evaluation**: Accuracy, precision, recall, F1-score metrics
- **Model Persistence**: Saves trained models as `.pkl` files
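
The four steps above could look roughly like the following with scikit-learn. This is a minimal sketch, not the notebook's actual code: the inline dataset, the choice of Logistic Regression, the parameter grid, and the file name `example_model.pkl` are all illustrative assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Tiny hypothetical dataset; the notebook trains on models/spam_data.csv instead.
texts = [
    "WIN a free prize now, click here", "Claim your cash reward today",
    "Urgent: your account is suspended, verify now", "Free entry to win tickets",
    "Congratulations, you won a voucher", "Cheap loans approved instantly",
    "Final notice: claim your free gift", "You have won cash, reply to claim",
    "Are we still on for lunch?", "Can you call me when you get home",
    "Meeting moved to 3pm", "Thanks, see you tomorrow",
    "Happy birthday, hope you have a great day", "I'll pick up the kids after work",
    "Running late, be there in ten minutes", "Did you finish the report?",
]
labels = [1] * 8 + [0] * 8  # 1 = spam, 0 = legitimate

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Preprocessing and vectorization live inside the pipeline, so the same
# transformations are applied at training time and at prediction time.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameter tuning with 3-fold cross-validation (grid values are assumptions).
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluation: precision, recall and F1 per class on the held-out split.
predictions = search.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))

# Persistence: save the best pipeline as a .pkl file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "example_model.pkl"
    with model_path.open("wb") as f:
        pickle.dump(search.best_estimator_, f)
    with model_path.open("rb") as f:
        reloaded = pickle.load(f)

reloaded_predictions = reloaded.predict(X_test)
```

Pickling the whole pipeline rather than only the classifier keeps the fitted vectorizer together with the model, so a consumer can pass raw message text straight to `predict`.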
**Detailed guide**: [Model Implementation Notes](./markdown/models.md)
## Running the Cellular Network Simulation
Experience real-time spam detection in a simulated cellular environment:
```bash
jupyter notebook simulation/simulation.ipynb
```
### Simulation Features
- **Network Topology**: Multiple baseband units with radio units
- **Real-time Processing**: Stream-based message processing
- **Spam Detection**: Live classification with alerting system
- **Performance Analytics**: Comprehensive logging and metrics collection
- **Load Testing**: Handles high-volume message streams
**Detailed guide**: [Simulation Methodology](./markdown/simulation.md)
## Results & Analysis
### Model Performance Comparison
The simulation reveals interesting trade-offs between different algorithms:
- **Logistic Regression**: Highest spam precision (99%) but lower recall
- **Random Forest**: Best balanced performance with 85% accuracy
- **Naive Bayes**: Fastest processing with good spam recall (90%)
- **SVM**: Robust to outliers but computationally intensive
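
The precision/recall trade-off mentioned above can be made concrete with a small worked example. The confusion-matrix counts below are hypothetical and chosen only to show how two filters with similar accuracy can differ sharply in precision and recall; they are not results from this project.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 1000 messages, 100 of which are spam.
# "cautious" rarely flags a legitimate message but misses a lot of spam.
cautious = classification_metrics(tp=60, fp=1, fn=40, tn=899)
# "eager" catches most spam but flags more legitimate messages.
eager = classification_metrics(tp=90, fp=30, fn=10, tn=870)

for name, metrics in (("cautious", cautious), ("eager", eager)):
    summary = ", ".join(f"{key}={value:.3f}" for key, value in metrics.items())
    print(f"{name}: {summary}")
```

Accuracy alone hides this difference on an imbalanced message stream, which is why precision, recall, and F1 are reported alongside it.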
### Real-time Performance Insights
- **Processing Speed**: Naive Bayes processes messages fastest
- **Memory Usage**: Logistic Regression has smallest memory footprint
- **Accuracy vs Speed**: Random Forest offers best accuracy/speed balance
- **Alert Response**: All models successfully trigger spam volume alerts
## Technical Architecture
### Machine Learning Pipeline
1. **Data Preprocessing**: Text normalization, stop word removal, stemming
2. **Feature Extraction**: TF-IDF vectorization with n-grams
3. **Model Training**: Cross-validation with hyperparameter optimization
4. **Evaluation**: Multi-metric assessment on held-out test data
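
As a rough illustration of steps 1 and 2, the sketch below normalizes a message, removes stop words, and builds unigrams and bigrams using only the standard library. The stop-word list, the placeholder tokens `urltoken` and `numtoken`, and the function names are assumptions made for this example. Stemming is left out; one option would be to apply a stemmer such as nltk's `PorterStemmer` to each token. The TF-IDF weighting itself would normally come from a library vectorizer.

```python
import re

# Small illustrative stop-word list (assumption); a real pipeline would
# typically use a full list from nltk or scikit-learn.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "your", "for", "and", "of", "in", "on"}


def normalize(text):
    """Lowercase, replace URLs and digit runs with placeholders, drop punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " urltoken ", text)
    text = re.sub(r"\d+", " numtoken ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def ngrams(tokens, n_max=2):
    """Return all contiguous n-grams for n = 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


message = "WIN a FREE prize! Call 0800123456 or visit http://example.com now"
tokens = remove_stop_words(normalize(message))
features = ngrams(tokens, n_max=2)
print(tokens)
print(features)
```

Replacing URLs and phone numbers with placeholder tokens keeps the signal that a link or number was present without creating a separate feature for every distinct value.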
### Cellular Network Simulation
```text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Radio Unit    │───▶│  Baseband Unit   │───▶│  Core Network   │
│  (Message RX)   │    │ (ML Processing)  │    │  (Spam Alerts)  │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
- **Radio Units**: Simulate message reception from mobile devices
- **Baseband Units**: Apply ML models for real-time spam classification
- **Core Network**: Aggregate results and trigger spam volume alerts
- **Logging System**: Captures all decisions and performance metrics
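
The message flow between these components might be sketched as follows. This is a simplified, standard-library illustration rather than the notebook's implementation (which lists `simpy` among its dependencies): the class names, the sliding-window alert rule, and the keyword-based stand-in classifier are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def keyword_classifier(text: str) -> bool:
    """Stand-in for a trained model (assumption): flags spammy keywords."""
    return any(word in text.lower() for word in ("free", "win", "prize", "claim"))


@dataclass
class CoreNetwork:
    """Aggregates decisions and records an alert when recent spam volume is high."""
    window_size: int = 10
    alert_ratio: float = 0.5
    alerts: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Sliding window over the most recent classification decisions.
        self.recent = deque(maxlen=self.window_size)

    def report(self, unit_id: str, is_spam: bool) -> None:
        self.recent.append(is_spam)
        ratio = sum(self.recent) / len(self.recent)
        # Only alert once the window is full, to avoid noisy start-up alerts.
        if len(self.recent) == self.window_size and ratio >= self.alert_ratio:
            self.alerts.append(f"high spam volume ({ratio:.0%}), last report from {unit_id}")


@dataclass
class BasebandUnit:
    """Classifies each message, logs the decision, and reports it upstream."""
    unit_id: str
    classify: Callable[[str], bool]
    core: CoreNetwork
    log: List[Tuple[str, bool]] = field(default_factory=list)

    def process(self, message: str) -> bool:
        is_spam = self.classify(message)
        self.log.append((message, is_spam))
        self.core.report(self.unit_id, is_spam)
        return is_spam


@dataclass
class RadioUnit:
    """Receives a message from a device and forwards it to its baseband unit."""
    unit_id: str
    baseband: BasebandUnit

    def receive(self, message: str) -> bool:
        return self.baseband.process(message)


core = CoreNetwork(window_size=4, alert_ratio=0.75)
bbu = BasebandUnit("bbu-1", keyword_classifier, core)
radio = RadioUnit("ru-1", bbu)

for text in ["See you at lunch", "WIN a free prize", "Claim your prize now",
             "Free entry, win cash", "Call me later"]:
    radio.receive(text)

print(core.alerts)
```

Because the classifier is passed in as a plain callable, a trained model can be swapped in by wrapping its `predict` call, without changing the network classes.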
## Customization & Extension
### Adding New Models
1. Train your model in `models/models.ipynb`
2. Save as `.pkl` file in the `models/` directory
3. Add simulation code in `simulation/simulation.ipynb`
4. Update logging and results directories
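
For step 3, one possible way to pick up saved models is to load every `.pkl` file in a directory into a dictionary keyed by file name. The helper name `load_models` and the placeholder object used in the demonstration are hypothetical; the notebook may load its models differently.

```python
import pickle
import tempfile
from pathlib import Path


def load_models(model_dir):
    """Load every .pkl file in model_dir into a dict keyed by file stem.

    Only unpickle files you trust: loading a pickle can execute arbitrary code.
    """
    models = {}
    for path in sorted(Path(model_dir).glob("*.pkl")):
        with path.open("rb") as f:
            models[path.stem] = pickle.load(f)
    return models


# Demonstration with a placeholder object in a temporary directory;
# in the project this would point at the models/ directory instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "my_new_model.pkl").write_bytes(pickle.dumps({"placeholder": True}))
    loaded = load_models(tmp)

print(sorted(loaded))
```

Keying by file stem means a newly saved model becomes available to the simulation under its file name, with no further registration step.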
### Modifying Network Topology
- Adjust baseband unit count in simulation parameters
- Configure radio unit connections per baseband
- Customize message processing rates and volumes
### Custom Dataset Generation
Use the spam generator to create targeted test scenarios:
```python
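import sys
sys.path.insert(0, "spam-generator")  # run from the repo root; a hyphenated folder cannot be imported as a package
from generator import generate_spam_dataset
dataset = generate_spam_dataset(volume=1000, spam_ratio=0.3)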
```
## Dependencies
Key libraries used in this project:
- **scikit-learn**: Machine learning algorithms and evaluation
- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing
- **matplotlib/seaborn**: Data visualization
- **nltk**: Natural language processing
- **simpy**: Discrete event simulation
- **jupyter**: Interactive development environment
## Contributing
Contributions are welcome! Areas for improvement:
- Additional ML algorithms (Deep Learning, XGBoost)
- Enhanced network simulation (5G features, edge computing)
- Real-world dataset integration
- Performance optimization
- Mobile deployment strategies
## License
This project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.
---
Built with ❤️ for telecommunications and machine learning research