{"id":23496378,"url":"https://github.com/arkodeepsen/spam-filter-mbo","last_synced_at":"2025-04-22T12:14:09.033Z","repository":{"id":268909196,"uuid":"904968173","full_name":"arkodeepsen/spam-filter-mbo","owner":"arkodeepsen","description":null,"archived":false,"fork":false,"pushed_at":"2024-12-20T09:17:04.000Z","size":10796,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-16T14:29:41.518Z","etag":null,"topics":["model-training","monarch-butterflies","monarch-butterfly-optimization","spam-classification","spam-detection","spam-filtering"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arkodeepsen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-17T22:35:20.000Z","updated_at":"2024-12-20T09:17:08.000Z","dependencies_parsed_at":"2024-12-19T16:54:36.851Z","dependency_job_id":null,"html_url":"https://github.com/arkodeepsen/spam-filter-mbo","commit_stats":null,"previous_names":["arkodeepsen/spam-filter-mbo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arkodeepsen%2Fspam-filter-mbo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arkodeepsen%2Fspam-filter-mbo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arkodeepsen%2Fspam-filter-mbo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arkodeepsen%2Fspam-filter-mbo/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arkodeepsen","download_url":"https://codeload.github.com/arkodeepsen/spam-filter-mbo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250237836,"owners_count":21397401,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["model-training","monarch-butterflies","monarch-butterfly-optimization","spam-classification","spam-detection","spam-filtering"],"created_at":"2024-12-25T04:12:45.989Z","updated_at":"2025-04-22T12:14:09.009Z","avatar_url":"https://github.com/arkodeepsen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spam Detection System With Enhanced MBO 🦋\n\n## Introduction\n\nThe Spam Detection System is a comprehensive tool designed to train and deploy models for classifying emails and SMS messages as spam or not spam. The system provides a user-friendly interface built with Tkinter for both training models and classifying messages. 
It supports multiple models, including a lite model, a legacy model, and an optimized model using the Monarch Butterfly Optimization algorithm 🦋.\n\n## Project Structure\n\nThe project is organized as follows:\n\n```\nspam_detection/\n├── app.py\n├── train.py\n├── training/\n│   ├── train_model_lite.py\n│   ├── train_model_legacy.py\n│   └── train_model_mbo.py\n├── data/\n│   └── spam.csv\n├── models/\n│   ├── model.pkl\n│   ├── vectorizer.pkl\n│   ├── model_optimized.pkl\n│   ├── vectorizer_optimized.pkl\n│   └── metrics.txt\n├── graphs/\n│   ├── dataset_insights.png\n│   ├── wordclouds.png\n│   └── performance_metrics.png\n└── README.md\n```\n\n- **`app.py`**: The main application script for the Spam Classifier UI.\n- **`train.py`**: The main application script for the Model Trainer UI.\n- **training/**: Directory containing scripts for training different models.\n    - **`train_model_lite.py`**: Script for training the lite model.\n    - **`train_model_legacy.py`**: Script for training the legacy model.\n    - **`train_model_mbo.py`**: Script for training the model with the Monarch Butterfly Optimization algorithm 🦋.\n- **data/**: Directory containing the dataset (`spam.csv`).\n- **models/**: Directory where trained models and vectorizers are saved.\n- **graphs/**: Directory where generated visualizations are saved.\n- **README.md**: Project documentation.\n\n## Getting Started\n\n### Prerequisites\n\n- **Python 3.6 or higher**\n- **pip** package installer\n\n### Required Python Packages\n\nInstall the required packages using the following command:\n\n```bash\npip install -r requirements.txt\n```\n\n*Note: The `requirements.txt` file should contain all the necessary pip-installable packages, including but not limited to:*\n\n- nltk\n- scikit-learn\n- pandas\n- numpy\n- matplotlib\n- seaborn\n- wordcloud\n- pillow (PIL)\n\n*tkinter is also required, but it ships with most Python installations and cannot be installed with pip, so it should not be listed in `requirements.txt`.*\n\n### NLTK Data\n\nThe application uses the Natural Language Toolkit (NLTK) library, which requires certain datasets. 
These datasets are downloaded automatically when you run the scripts, but you can also download them manually:\n\n```python\nimport nltk\nnltk.download('punkt')\nnltk.download('wordnet')\nnltk.download('stopwords')\n```\n\n## Dataset\n\nThe dataset used for training is [`spam.csv`](data/spam.csv), which should be placed in the `data/` directory. This dataset contains messages labeled as spam or ham (not spam).\n\n## Application Overview\n\n### Spam Classifier UI (app.py)\n\nThe `app.py` script launches a GUI application that allows users to input a message and classify it as spam or not spam using the selected model.\n\n#### Features\n\n- **Model Selection**: Users can choose between the optimized model and the legacy model.\n- **Message Input**: Provides a text area for users to input the message to be classified.\n- **Prediction**: Classifies the input message and displays whether it is spam or not spam.\n- **Visual Feedback**: Displays the prediction result with color-coded text (red for spam, green for not spam).\n\n#### How to Run\n\n```bash\npython app.py\n```\n\n#### Code Structure\n\n- **SpamClassifierUI Class**: Contains the GUI setup and the logic for running the classification.\n    - **`__init__`**: Initializes the GUI and loads the models.\n    - **`setup_ui`**: Sets up the UI components.\n    - **`transform_text_legacy`**: Preprocesses text for the legacy model (stemming).\n    - **`transform_text_optimized`**: Preprocesses text for the optimized model (lemmatization).\n    - **`predict`**: Handles the prediction logic.\n\n### Model Trainer UI (train.py)\n\nThe `train.py` script launches a GUI application that allows users to train models using different algorithms and visualize the training results.\n\n#### Features\n\n- **Model Selection**: Users can choose to train the lite model, the legacy model, or the Monarch Butterfly Optimization (MBO) model 🦋.\n- **Training Controls**: Provides a button to start training the selected model.\n- 
**Progress Indication**: Shows a progress bar during training.\n- **Visualizations**: Displays dataset insights, word clouds, and performance metrics after training.\n- **Logs**: Provides a log tab to display training logs and error messages.\n\n#### How to Run\n\n```bash\npython train.py\n```\n\n#### Code Structure\n\n- **SpamDetectionUI Class**: Contains the GUI setup and logic for training models.\n    - **`__init__`**: Initializes the GUI and sets up tabs.\n    - **`setup_training_tab`**: Sets up the training controls and model selection.\n    - **`setup_visualization_tab`**: Sets up the visualizations display area.\n    - **`setup_log_tab`**: Sets up the log display area.\n    - **`start_training`**: Begins the training process in a separate thread.\n    - **`train_model`**: Handles model training logic based on the selected model.\n    - **`update_displays`**: Updates the GUI with new visualizations and metrics after training.\n    - **`display_image`**: Helper function to display images in the GUI.\n    - **`update_log`**: Updates the log text area with training logs.\n\n## Training Scripts\n\n### `train_model_lite.py`\n\nThe lite model training script focuses on a streamlined approach with optimized parameters and enhanced feature extraction.\n\n#### Features\n\n- **Text Preprocessing**: Uses lemmatization and removes stop words and punctuation.\n- **Feature Extraction**: Adds additional features such as text length, word count, unique word count, uppercase count, and special character count.\n- **Model Creation**: Builds an ensemble model using SVC, MultinomialNB, and ExtraTreesClassifier with predefined optimized hyperparameters.\n- **Visualization**: Generates graphs for dataset insights, word clouds, and performance metrics.\n- **Metrics Saving**: Saves accuracy, precision, and F1 score to a metrics file.\n\n#### How to Run Individually\n\n```bash\npython training/train_model_lite.py\n```\n\n### `train_model_legacy.py`\n\nThe legacy model training script retains 
the original model logic without optimization but updates the structure and adds visualizations.\n\n#### Features\n\n- **Text Preprocessing**: Uses stemming and removes stop words and punctuation.\n- **Model Creation**: Builds an ensemble model using SVC, MultinomialNB, and ExtraTreesClassifier with original parameters.\n- **Visualization**: Generates graphs for dataset insights, word clouds, and performance metrics.\n- **Metrics Saving**: Saves accuracy and precision to a metrics file.\n\n#### How to Run Individually\n\n```bash\npython training/train_model_legacy.py\n```\n\n### `train_model_mbo.py`\n\n\n\nThis script trains a model using the Monarch Butterfly Optimization algorithm 🦋 to optimize hyperparameters.\n\n#### Features\n\n- **Monarch Butterfly Optimization (MBO)**: Implements the MBO algorithm to find optimal hyperparameters for the model.\n- **Parameter Optimization**: Optimizes parameters for SVC, MultinomialNB, and ExtraTreesClassifier as well as ensemble weights.\n- **Model Creation**: Builds an ensemble model using the optimized parameters.\n- **Visualization**: Generates graphs for dataset insights, word clouds, and performance metrics.\n- **Metrics Saving**: Saves accuracy, precision, and F1 score to a metrics file.\n\n#### How to Run Individually\n\n```bash\npython training/train_model_mbo.py\n```\n\n#### The Monarch Butterfly Optimization Algorithm\n\nThe MBO algorithm is a nature-inspired optimization technique that mimics the migration behavior of monarch butterflies 🦋. 
It is used in this project to find the optimal hyperparameters for the ensemble model.\n\n**Key Components:**\n\n- **Population Initialization**: Randomly initializes a population of possible solutions (butterflies).\n- **Migration Operator**: Adjusts solutions based on the migration behavior, encouraging exploration and exploitation of the search space.\n- **Fitness Function**: Evaluates solutions based on cross-validation scores.\n\n## Dependencies and Libraries\n\n- **tkinter**: For building the GUI applications.\n- **nltk**: For natural language processing tasks.\n- **scikit-learn**: For machine learning algorithms and tools.\n- **pandas**: For data manipulation.\n- **numpy**: For numerical computations.\n- **matplotlib and seaborn**: For data visualization.\n- **wordcloud**: For generating word cloud images.\n- **Pillow (PIL)**: For image handling in the GUI.\n- **pickle**: For saving and loading model objects.\n\n## Visualizations\n\nThe system generates several visualizations to help understand the dataset and the model performance.\n\n- **Dataset Insights** (`dataset_insights.png`): Histograms and box plots showing message length distribution, class distribution, and word count by class.\n\n![Dataset Insights](graphs/dataset_insights.png)\n\n- **Word Clouds** (`wordclouds.png`): Visual representations of the most frequent words in spam and ham messages.\n\n![Word Clouds](graphs/wordclouds.png)\n\n- **Performance Metrics** (`performance_metrics.png`): Confusion matrix, classification report, and top important features.\n\n![Performance Metrics](graphs/performance_metrics.png)\n\n## Logs and Metrics\n\n- **Logs**: The training scripts and applications output logs to help track the training process and any issues.\n- **Metrics**: After training, key performance metrics are saved to metrics.txt in the models directory.\n    \nExample content of metrics.txt:\n```\nAccuracy: 0.9825\nPrecision: 0.9700\nF1: 0.9750\n```\n\n## Error Handling\n\n- The applications 
include error handling to manage exceptions such as missing models or datasets.\n- Informative messages are displayed to the user via message boxes or log outputs.\n\n## How to Train and Test Models\n\n1. **Prepare the Dataset**: Ensure `spam.csv` is placed in the `data/` directory.\n2. **Run the Trainer GUI**: Execute `python train.py` to launch the training application.\n3. **Select the Model**: Choose the model you wish to train (Lite, Legacy, MBO) from the GUI.\n4. **Train the Model**: Click the \"Train Model\" button to start training. Progress and logs will be displayed.\n5. **View Visualizations**: After training, navigate to the \"Visualizations\" tab to see the generated graphs.\n6. **Check Metrics**: Metrics will be displayed in the \"Model Metrics\" section and saved to `metrics.txt`.\n7. **Test the Classifier**: Run `python app.py` to launch the spam classifier application and test message classification.\n\n## Additional Notes\n\n- **Directory Creation**: The applications will create required directories such as `models/` and `graphs/` if they do not exist.\n- **Model Artifacts**: Trained models and vectorizers are saved in the `models/` directory for reuse.\n- **Data Encodings**: The dataset is read with `encoding='latin-1'` because it contains special characters.\n- **Stratification**: Data splitting uses stratification to maintain class distribution across training and test sets.\n\n## Troubleshooting\n\n- **Missing NLTK Data**: If you encounter errors related to NLTK data, ensure that the necessary data packages (`punkt`, `wordnet`, `stopwords`) are downloaded.\n- **ModuleNotFoundError**: Ensure all required Python packages are installed.\n- **GUI Not Displaying Correctly**: Check your Python version and ensure that tkinter is properly installed.\n\n## Contact and Support\n\nFor questions or support, please contact the project maintainers.\n\n---\n\nThis documentation provides a detailed overview of the Spam Detection System, including instructions on how to set up, train, and 
use the models. It covers the structure of the project, the functionalities of each component, and guidance on how to resolve common issues.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farkodeepsen%2Fspam-filter-mbo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farkodeepsen%2Fspam-filter-mbo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farkodeepsen%2Fspam-filter-mbo/lists"}