{"id":18482421,"url":"https://github.com/professorlearncode/email-classifier-model","last_synced_at":"2025-05-13T20:24:01.510Z","repository":{"id":253592690,"uuid":"843961707","full_name":"ProfessorlearnCode/Email-Classifier-Model","owner":"ProfessorlearnCode","description":"This project implements a spam email classifier using Natural Language Processing (NLP) and a Naive Bayes model. The pipeline preprocesses email text and uses the model to classify messages as spam or not spam, with an evaluation on test data.","archived":false,"fork":false,"pushed_at":"2024-08-18T00:22:52.000Z","size":244,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-16T21:30:06.169Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ProfessorlearnCode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-18T00:20:54.000Z","updated_at":"2024-08-18T00:22:55.000Z","dependencies_parsed_at":"2024-08-18T01:27:15.567Z","dependency_job_id":"48b01202-8613-4375-b5f3-ffb81be06f97","html_url":"https://github.com/ProfessorlearnCode/Email-Classifier-Model","commit_stats":null,"previous_names":["professorlearncode/email-classifier-model"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProfessorlearnCode%2FEmail-Classifier-Model","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProfessorlearnCode%2FEmail-C
lassifier-Model/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProfessorlearnCode%2FEmail-Classifier-Model/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ProfessorlearnCode%2FEmail-Classifier-Model/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ProfessorlearnCode","download_url":"https://codeload.github.com/ProfessorlearnCode/Email-Classifier-Model/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254020741,"owners_count":22000776,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T12:28:26.560Z","updated_at":"2025-05-13T20:24:01.493Z","avatar_url":"https://github.com/ProfessorlearnCode.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"### Documentation for Spam Email Classifier Using Naive Bayes\n\n#### Overview\nThis repository contains a Python implementation of a spam email classifier using Natural Language Processing (NLP) techniques and a Naive Bayes model. The model is trained on a labeled dataset of emails, and it predicts whether new emails are spam or not. The project includes text preprocessing, model training, prediction, and evaluation steps.\n\n#### Prerequisites\nEnsure the following Python libraries are installed before running the code:\n- `nltk`\n- `pandas`\n- `scikit-learn`\n\nYou can install the necessary libraries using:\n```bash\npip install nltk pandas scikit-learn\n```\n\n#### Code Breakdown\n\n1. 
**Importing Libraries**\n   ```python\n   import nltk\n   import pandas as pd\n   from nltk.corpus import stopwords\n   from nltk.tokenize import word_tokenize\n   from sklearn.naive_bayes import MultinomialNB\n   from sklearn.feature_extraction.text import CountVectorizer\n   from sklearn.model_selection import train_test_split\n   from sklearn.pipeline import Pipeline\n   ```\n   The required libraries for text preprocessing, model building, and evaluation are imported.\n\n2. **Downloading NLTK Data**\n   ```python\n   nltk.download('punkt')\n   nltk.download('stopwords')\n   ```\n   The necessary NLTK data for tokenization and stopword filtering is downloaded.\n\n3. **Text Preprocessing Function**\n   ```python\n   # Build the stopword set once; calling stopwords.words() per token is slow\n   STOP_WORDS = set(stopwords.words('english'))\n\n   def preprocess_text(text):\n       tokens = word_tokenize(text.lower())\n       filtered_words = [word for word in tokens if word.isalnum() and word not in STOP_WORDS]\n       return ' '.join(filtered_words)\n   ```\n   This function lowercases the text, tokenizes it, removes stopwords, and keeps only alphanumeric tokens, which also drops punctuation. The remaining tokens are joined back into a single string.\n\n4. **Sample Text Preprocessing**\n   ```python\n   sample_text = \"This is an example email!\"\n   processed_text = preprocess_text(sample_text)\n   print(\"Processed Sample Text:\", processed_text)\n   ```\n   A sample email is processed using the `preprocess_text` function to demonstrate text preprocessing.\n\n5. **Loading and Preparing the Dataset**\n   ```python\n   data = pd.read_csv('spam.csv', encoding='latin-1')\n   data = data[['v1', 'v2']]  # Assuming the columns are named 'v1' for Category and 'v2' for Message\n   data.columns = ['Category', 'Message']\n   data['Spam'] = data['Category'].apply(lambda x: 1 if x == 'spam' else 0)\n   ```\n   The dataset is loaded from `spam.csv` and prepared for modeling. It is assumed to have columns for email category (`v1`) and message (`v2`). 
The columns are renamed to `Category` and `Message`, and a new binary column `Spam` is created where spam is labeled `1` and non-spam `0`.\n\n6. **Text Preprocessing for the Dataset**\n   ```python\n   data['Message'] = data['Message'].apply(preprocess_text)\n   ```\n   The `Message` column is preprocessed using the `preprocess_text` function.\n\n7. **Splitting the Dataset**\n   ```python\n   X_train, X_test, y_train, y_test = train_test_split(data['Message'], data['Spam'], test_size=0.25, random_state=42)\n   ```\n   The dataset is split into training and testing sets with a 75-25 ratio; `random_state=42` makes the split reproducible.\n\n8. **Building and Training the Model Pipeline**\n   ```python\n   clf = Pipeline([\n       ('vectorizer', CountVectorizer()),\n       ('nb', MultinomialNB())\n   ])\n   clf.fit(X_train, y_train)\n   ```\n   A pipeline is created that first converts text into a matrix of token counts using `CountVectorizer` and then applies the `MultinomialNB` classifier. The model is trained on the training data.\n\n9. **Making Predictions**\n   ```python\n   emails = [\n       'Sounds great! Are you home now?',\n       'Will u meet ur dream partner soon? Is ur career off 2 a flyng start? 2 find out free, txt HORO followed by ur star sign, e. g. HORO ARIES'\n   ]\n   predictions = clf.predict(emails)\n   print(\"Predictions:\", predictions)\n   ```\n   The model predicts, for each sample email, whether it is spam (`1`) or not spam (`0`).\n\n10. **Evaluating the Model**\n    ```python\n    accuracy = clf.score(X_test, y_test)\n    print(\"Accuracy on test data:\", accuracy)\n    ```\n    The model's accuracy is calculated on the held-out test data, which measures how well the model generalizes to unseen messages.\n\n#### Conclusion\nThis project provides a complete pipeline for building a spam email classifier using a Naive Bayes model. 
The process includes text preprocessing, model training, and evaluation, offering a simple baseline for identifying spam emails.\n\n#### Future Improvements\n- **Feature Engineering**: Explore different feature extraction techniques, such as TF-IDF, to potentially improve model accuracy.\n- **Model Optimization**: Tune the hyperparameters of the Naive Bayes model (for example, the smoothing parameter `alpha` of `MultinomialNB`) or explore other classifiers such as SVM or Random Forest.\n\n#### License\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\nThis documentation should help guide software developers in understanding and using the code effectively.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprofessorlearncode%2Femail-classifier-model","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprofessorlearncode%2Femail-classifier-model","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprofessorlearncode%2Femail-classifier-model/lists"}