{"id":31534495,"url":"https://github.com/breezy-codes/machine-learning-for-spam-sms","last_synced_at":"2026-05-14T12:35:09.738Z","repository":{"id":316826330,"uuid":"1065006050","full_name":"breezy-codes/machine-learning-for-spam-sms","owner":"breezy-codes","description":"Real-time SMS spam detection using ML models in simulated cellular networks. Compares 4 algorithms with comprehensive performance analysis.","archived":false,"fork":false,"pushed_at":"2025-09-26T22:56:27.000Z","size":33682,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-04T05:49:11.536Z","etag":null,"topics":["logistic-regression","machine-learning","naive-bayes","network-simulation","random-forest","research","scikit-learn","spam-sms","spam-sms-detection","svm","telecommunication"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/breezy-codes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-26T22:45:40.000Z","updated_at":"2025-09-26T22:57:55.000Z","dependencies_parsed_at":"2025-09-27T00:33:24.260Z","dependency_job_id":null,"html_url":"https://github.com/breezy-codes/machine-learning-for-spam-sms","commit_stats":null,"previous_names":["breezy-codes/machine-learning-for-spam-sms"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/breezy-codes/machine-learning-for-spam-sms","reposit
ory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/breezy-codes%2Fmachine-learning-for-spam-sms","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/breezy-codes%2Fmachine-learning-for-spam-sms/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/breezy-codes%2Fmachine-learning-for-spam-sms/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/breezy-codes%2Fmachine-learning-for-spam-sms/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/breezy-codes","download_url":"https://codeload.github.com/breezy-codes/machine-learning-for-spam-sms/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/breezy-codes%2Fmachine-learning-for-spam-sms/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33025226,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"online","status_checked_at":"2026-05-14T02:00:06.663Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["logistic-regression","machine-learning","naive-bayes","network-simulation","random-forest","research","scikit-learn","spam-sms","spam-sms-detection","svm","telecommunication"],"created_at":"2025-10-04T05:40:28.620Z","updated_at":"2026-05-14T12:35:09.731Z","avatar_url":"https://github.com/breezy-codes.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Machine Learning for Real-Time SMS Spam Detection in Cellular Networks\n\n[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![scikit-learn](https://img.shields.io/badge/scikit--learn-1.0+-orange.svg)](https://scikit-learn.org/)\n[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](LICENSE)\n\n## Overview\n\nThis project evaluates the effectiveness of multiple machine learning models for **real-time spam detection in cellular networks**. Using a comprehensive SMS dataset, we train, evaluate, and simulate four different ML models in a realistic cellular network environment to determine the most effective approach for real-time spam filtering. View the academic report [here](report.pdf).\n\n### Key Features\n\n- **4 Machine Learning Models**: Logistic Regression, Naive Bayes, Random Forest, and Support Vector Machine\n- **Real-time Simulation**: Simulated cellular network environment with baseband units and radio units\n- **Comprehensive Evaluation**: Performance metrics including accuracy, precision, recall, and F1-score\n- **Automated Spam Detection**: Real-time alerting system for high spam volume detection\n- **Dataset Generation**: Custom spam dataset generator for testing on unseen data\n\n### Research Focus\n\nThe project addresses the critical need for effective spam detection in cellular networks by:\n\n- Comparing multiple ML algorithms in a realistic network simulation\n- Evaluating real-time performance under cellular network constraints\n- Analyzing model effectiveness for different spam patterns and volumes\n- Providing insights into the most suitable algorithms for mobile network deployment\n\n## Model Performance Summary\n\n| Model | Training Accuracy | Simulation Accuracy | Simulation Time | Best Use Case |\n|-------|------------------|-------------------|----------------|---------------|\n| **Logistic 
Regression** | 99% | 88% | 8m 47s | High precision spam detection |\n| **Naive Bayes** | 89% | 81% | 9m 39s | Fast processing, good recall |\n| **Random Forest** | 89% | 85% | 20m 55s | Balanced performance |\n| **Support Vector Machine** | 89% | 82% | 25m 53s | Complex pattern recognition |\n\n## Project Structure\n\n```text\nmachine-learning-for-spam-sms/\n├── 📁 models/                          # ML Model Development\n│   ├── models.ipynb                    # Model training \u0026 evaluation\n│   ├── spam_data.csv                   # Training dataset\n│   └── *.pkl                          # Trained model files\n├── 📁 simulation/                      # Network Simulation\n│   ├── simulation.ipynb                # Main simulation notebook\n│   ├── 📁 data/                       # Generated test datasets\n│   ├── 📁 logs/                       # Simulation logs by model\n│   └── 📁 results/                    # Performance results \u0026 figures\n├── 📁 spam-generator/                  # Dataset Generation\n│   ├── generator.py                    # Spam dataset generator\n│   └── conversations.py               # Conversation templates\n├── 📁 markdown/                        # Documentation\n│   ├── models.md                       # Model implementation details\n│   ├── simulation.md                   # Simulation methodology\n│   └── install_instructions.md         # Setup instructions\n└── requirements.txt                    # Python dependencies\n```\n\n### Key Components\n\n- **Machine Learning Models**: Four different algorithms trained on SMS spam data\n- **Cellular Network Simulation**: Realistic network topology with baseband and radio units  \n- **Real-time Processing**: Stream processing of SMS messages with spam detection\n- **Performance Monitoring**: Comprehensive logging and alerting system\n- **Dataset Generation**: Custom spam generator for testing model robustness\n\n## Quick Start\n\n### Prerequisites\n\n- Python 3.8 or higher\n- Virtual environment 
(recommended)\n\n### Installation\n\n1. **Clone the repository**\n\n   ```bash\n   git clone https://github.com/breezy-codes/machine-learning-for-spam-sms.git\n   cd machine-learning-for-spam-sms\n   ```\n\n2. **Set up virtual environment**\n\n   ```bash\n   python -m venv .venv\n   source .venv/bin/activate  # On Windows: .\\.venv\\Scripts\\activate\n   ```\n\n3. **Install dependencies**\n\n   ```bash\n   pip install -r requirements.txt\n   ```\n\nFor detailed setup instructions, see: [Setting Up a Python Virtual Environment](./markdown/install_instructions.md)\n\n## 🤖 Running the Machine Learning Models\n\nTrain and evaluate all four ML models using the comprehensive Jupyter notebook:\n\n```bash\njupyter notebook models/models.ipynb\n```\n\n### What the Models Do\n\n- **Data Preprocessing**: Text cleaning, tokenization, and vectorization\n- **Model Training**: Hyperparameter tuning with cross-validation\n- **Performance Evaluation**: Accuracy, precision, recall, F1-score metrics\n- **Model Persistence**: Saves trained models as `.pkl` files\n\n**Detailed guide**: [Model Implementation Notes](./markdown/models.md)\n\n## Running the Cellular Network Simulation\n\nExperience real-time spam detection in a simulated cellular environment:\n\n```bash\njupyter notebook simulation/simulation.ipynb\n```\n\n### Simulation Features\n\n- **Network Topology**: Multiple baseband units with radio units\n- **Real-time Processing**: Stream-based message processing\n- **Spam Detection**: Live classification with alerting system\n- **Performance Analytics**: Comprehensive logging and metrics collection\n- **Load Testing**: Handles high-volume message streams\n\n**Detailed guide**: [Simulation Methodology](./markdown/simulation.md)\n\n## Results \u0026 Analysis\n\n### Model Performance Comparison\n\nThe simulation reveals interesting trade-offs between different algorithms:\n\n- **Logistic Regression**: Highest precision (99% spam detection) but lower recall\n- **Random Forest**: 
Best balanced performance with 85% accuracy\n- **Naive Bayes**: Fastest processing with good spam recall (90%)\n- **SVM**: Robust to outliers but computationally intensive\n\n### Real-time Performance Insights\n\n- **Processing Speed**: Naive Bayes processes messages fastest\n- **Memory Usage**: Logistic Regression has smallest memory footprint  \n- **Accuracy vs Speed**: Random Forest offers best accuracy/speed balance\n- **Alert Response**: All models successfully trigger spam volume alerts\n\n## 🛠️ Technical Architecture\n\n### Machine Learning Pipeline\n\n1. **Data Preprocessing**: Text normalization, stop word removal, stemming\n2. **Feature Extraction**: TF-IDF vectorization with n-grams\n3. **Model Training**: Cross-validation with hyperparameter optimization\n4. **Evaluation**: Multi-metric assessment on held-out test data\n\n### Cellular Network Simulation\n\n```text\n┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐\n│   Radio Unit    │───▶│  Baseband Unit   │───▶│  Core Network   │\n│  (Message RX)   │     │  (ML Processing) │     │ (Spam Alerts)   │\n└─────────────────┘     └──────────────────┘     └─────────────────┘\n```\n\n- **Radio Units**: Simulate message reception from mobile devices\n- **Baseband Units**: Apply ML models for real-time spam classification\n- **Core Network**: Aggregate results and trigger spam volume alerts\n- **Logging System**: Captures all decisions and performance metrics\n\n## Customization \u0026 Extension\n\n### Adding New Models\n\n1. Train your model in `models/models.ipynb`\n2. Save as `.pkl` file in the `models/` directory\n3. Add simulation code in `simulation/simulation.ipynb`\n4. 
Update logging and results directories\n\n### Modifying Network Topology\n\n- Adjust baseband unit count in simulation parameters\n- Configure radio unit connections per baseband\n- Customize message processing rates and volumes\n\n### Custom Dataset Generation\n\nUse the spam generator to create targeted test scenarios:\n\n```python\nimport sys\nsys.path.append('spam-generator')  # run from the repo root; the hyphenated folder cannot be imported as a package\nfrom generator import generate_spam_dataset\ndataset = generate_spam_dataset(volume=1000, spam_ratio=0.3)\n```\n\n## Dependencies\n\nKey libraries used in this project:\n\n- **scikit-learn**: Machine learning algorithms and evaluation\n- **pandas**: Data manipulation and analysis  \n- **numpy**: Numerical computing\n- **matplotlib/seaborn**: Data visualization\n- **nltk**: Natural language processing\n- **simpy**: Discrete event simulation\n- **jupyter**: Interactive development environment\n\n## Contributing\n\nContributions are welcome! Areas for improvement:\n\n- Additional ML algorithms (Deep Learning, XGBoost)\n- Enhanced network simulation (5G features, edge computing)\n- Real-world dataset integration\n- Performance optimization\n- Mobile deployment strategies\n\n## License\n\nThis project is licensed under the GNU General Public License v3.0 - see the [LICENSE](LICENSE) file for details.\n\n## References\n\n- SMS Spam Collection Dataset\n- Cellular Network Architecture Standards\n- Machine Learning for Telecommunications\n- Real-time Stream Processing Techniques\n\n---\n\nBuilt with ❤️ for telecommunications and machine learning research\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbreezy-codes%2Fmachine-learning-for-spam-sms","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbreezy-codes%2Fmachine-learning-for-spam-sms","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbreezy-codes%2Fmachine-learning-for-spam-sms/lists"}