{"id":15157725,"url":"https://github.com/chandkund/spam-email-detection","last_synced_at":"2026-01-20T20:33:37.691Z","repository":{"id":254657285,"uuid":"847181365","full_name":"chandkund/Spam-Email-Detection","owner":"chandkund","description":"This project focuses on detecting spam emails using a fine-tuned DistilBERT model, a lighter version of the BERT model. The model is trained to classify email text into two categories: spam (1) and not spam (0). The dataset consists of email texts labeled as either spam or non-spam.","archived":false,"fork":false,"pushed_at":"2024-08-25T05:04:23.000Z","size":3473,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T16:53:57.442Z","etag":null,"topics":["data-visualization","datapreprocessing","matplotlib","pandas","python","pytorch","sklearn","transformer"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chandkund.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-25T04:50:38.000Z","updated_at":"2024-11-23T11:07:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"8b9f3387-5043-45e6-9fa6-af704cd71c64","html_url":"https://github.com/chandkund/Spam-Email-Detection","commit_stats":{"total_commits":3,"total_committers":1,"mean_commits":3.0,"dds":0.0,"last_synced_commit":"eb93b31d5883657d5f01ad9f3cf23a7ec5269596"},"previous_names":["chandkund/spam-email-detection"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chandkund%2FSpam-Email-Detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chandkund%2FSpam-Email-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chandkund%2FSpam-Email-Detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chandkund%2FSpam-Email-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chandkund","download_url":"https://codeload.github.com/chandkund/Spam-Email-Detection/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247668737,"owners_count":20976260,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-visualization","datapreprocessing","matplotlib","pandas","python","pytorch","sklearn","transformer"],"created_at":"2024-09-26T20:02:03.976Z","updated_at":"2026-01-20T20:33:37.685Z","avatar_url":"https://github.com/chandkund.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spam Email Detection\n\nThis project aims to classify emails as either spam or not spam using a fine-tuned DistilBERT model. The dataset consists of email text with labels indicating whether each email is spam (1) or not spam (0). 
The model is trained using the Hugging Face Transformers library and leverages the DistilBERT architecture for efficient and accurate text classification.\n\n## Table of Contents\n\n- [Project Overview](#project-overview)\n- [Installation](#installation)\n- [Usage](#usage)\n- [Code Explanation](#code-explanation)\n- [Model Evaluation](#model-evaluation)\n- [License](#license)\n\n## Project Overview\n\nThis project uses DistilBERT, a distilled version of BERT, to detect spam emails. The dataset includes email texts and a binary label indicating whether each email is spam. The project involves data preprocessing, model training, and evaluation to achieve accurate spam detection.\n\n## Installation\n\nTo run this project, you’ll need Python and the following libraries:\n\n- `torch`\n- `transformers`\n- `pandas`\n- `scikit-learn`\n\nInstall the required packages using `pip`:\n\n```bash\npip install torch transformers pandas scikit-learn\n```\n\n## Usage\n\n1. **Prepare the Dataset:**\n   - Ensure your dataset is in a CSV file with columns labeled `text` and `spam`.\n\n2. **Run the Code:**\n   - Use the following code to preprocess data, train the model, and evaluate performance.\n\n3. **Train the Model:**\n   - Run the script to train the DistilBERT model on your dataset.\n\n4. **Evaluate the Model:**\n   - After training, evaluate the model using accuracy, precision, recall, and F1-score metrics.\n\n## Code Explanation\n\n### 1. Importing Libraries\n\n```python\nimport torch\nfrom transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments\nfrom sklearn.model_selection import train_test_split\nimport pandas as pd\nfrom sklearn.metrics import accuracy_score, precision_recall_fscore_support\n```\n\n### 2. 
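Load and Preprocess Data\n\n```python\n# Load dataset\ndf = pd.read_csv('spam_dataset.csv')\n\n# Split into training and testing sets\ntrain_texts, test_texts, train_labels, test_labels = train_test_split(df['text'], df['spam'], test_size=0.2, random_state=42)\n\n# Tokenize the text data\ntokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')\ntrain_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=512)\ntest_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=512)\n```\n\n### 3. Define the Model\n\n```python\nmodel = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)\n```\n\n### 4. Training the Model\n\n```python\nfrom torch.utils.data import DataLoader, TensorDataset\n# Assumed setup (not shown in the original): batch the encodings and use AdamW\ntrain_dataset = TensorDataset(\n    torch.tensor(train_encodings['input_ids']),\n    torch.tensor(train_encodings['attention_mask']),\n    torch.tensor(train_labels.values),\n)\ntrain_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\noptimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)\n\nmodel.train()\nfor epoch in range(3):  # Set the number of epochs\n    for batch in train_loader:\n        input_ids, attention_mask, labels = batch\n        optimizer.zero_grad()\n        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)\n        loss = outputs.loss\n        loss.backward()\n        optimizer.step()\n    # Reports the loss of the last batch in the epoch\n    print(f\"Epoch {epoch+1} complete! Loss: {loss.item()}\")\n```\n\n### 5. Evaluate the Model\n\n```python\n# Assumed setup (not shown in the original): build the evaluation loader from the test split\nval_dataset = TensorDataset(\n    torch.tensor(test_encodings['input_ids']),\n    torch.tensor(test_encodings['attention_mask']),\n    torch.tensor(test_labels.values),\n)\nval_loader = DataLoader(val_dataset, batch_size=32)\n\nmodel.eval()\nall_preds = []\nall_labels = []\nwith torch.no_grad():\n    for batch in val_loader:\n        input_ids, attention_mask, labels = batch\n        outputs = model(input_ids, attention_mask=attention_mask)\n        logits = outputs.logits\n        preds = torch.argmax(logits, dim=1).flatten()\n        all_preds.extend(preds.cpu().numpy())\n        all_labels.extend(labels.cpu().numpy())\n```\n\n## Model Evaluation\n\nThe trained DistilBERT model is evaluated on the test dataset using scikit-learn's `accuracy_score`. 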
This metric reports the fraction of spam and non-spam emails that the model classifies correctly.\n\n```python\naccuracy = accuracy_score(all_labels, all_preds)\nprint(f\"Validation Accuracy: {accuracy:.4f}\")\n```\n\nValidation Accuracy: 0.9939\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchandkund%2Fspam-email-detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchandkund%2Fspam-email-detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchandkund%2Fspam-email-detection/lists"}