{"id":20177368,"url":"https://github.com/saba-gul/spam_detection_using_text_classification","last_synced_at":"2026-06-08T00:31:57.596Z","repository":{"id":248286052,"uuid":"828281324","full_name":"Saba-Gul/Spam_detection_using_text_classification","owner":"Saba-Gul","description":"This project aims to build a machine learning model that can classify text messages as either spam or not spam (ham)","archived":false,"fork":false,"pushed_at":"2024-07-13T18:11:31.000Z","size":587,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-01T02:12:58.467Z","etag":null,"topics":["fraud-detection","ngram-language-model","nlp-machine-learning","nltk","nltk-python","sms-messages","spam-detection","text-classification"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Saba-Gul.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-13T16:51:43.000Z","updated_at":"2024-07-13T18:13:02.000Z","dependencies_parsed_at":"2024-07-13T18:23:01.466Z","dependency_job_id":null,"html_url":"https://github.com/Saba-Gul/Spam_detection_using_text_classification","commit_stats":null,"previous_names":["saba-gul/spam_detection_using_text_classification"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Saba-Gul/Spam_detection_using_text_classification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Saba-Gul%2FSpam_detection_using_text_classification","tags_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Saba-Gul%2FSpam_detection_using_text_classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Saba-Gul%2FSpam_detection_using_text_classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Saba-Gul%2FSpam_detection_using_text_classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Saba-Gul","download_url":"https://codeload.github.com/Saba-Gul/Spam_detection_using_text_classification/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Saba-Gul%2FSpam_detection_using_text_classification/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34043822,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-07T02:00:07.652Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fraud-detection","ngram-language-model","nlp-machine-learning","nltk","nltk-python","sms-messages","spam-detection","text-classification"],"created_at":"2024-11-14T02:15:26.093Z","updated_at":"2026-06-08T00:31:57.579Z","avatar_url":"https://github.com/Saba-Gul.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Spam Detection Using NLP Techniques\n\nThis project implements text 
classification techniques to detect spam messages using Natural Language Processing (NLP) methods. It includes preprocessing steps, model training, evaluation, and performance analysis.\n\n![Sample Image](images/wordcloud.png)\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Dataset](#dataset)\n- [Preprocessing](#preprocessing)\n- [Models Used](#models-used)\n- [Evaluation Metrics](#evaluation-metrics)\n- [Results](#results)\n- [Usage](#usage)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Overview\n\nThis project aims to build a machine learning model that can classify text messages as either spam or not spam (ham). It uses NLP techniques such as tokenization, stopword removal, stemming, and n-gram vectorization to preprocess the text data. Model performance is evaluated using accuracy, precision, recall, and F1-score.\n\n## Dataset\n\nThe SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam. [Link to Dataset on Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)\n\n\n## Preprocessing\n\nText preprocessing steps include:\n\n- **Lowercasing**\n- **Punctuation removal**\n- **Stopword removal:** Common stop words are removed to reduce noise in the data.\n- **Stemming:** Reduces words to their base or root form by removing suffixes (e.g., \"running\" becomes \"run\").\n- **Lemmatization:** Reduces words to their base or dictionary form, considering the context (e.g., \"better\" becomes \"good\").\n- **Tokenization:** Splits text into individual words or tokens (e.g., \"The cat sat on the mat\" becomes [\"The\", \"cat\", \"sat\", \"on\", \"the\", \"mat\"]).\n- **N-grams vectorization (unigrams, bigrams, trigrams):** N-grams refer to contiguous sequences of N items from a given text. 
In the context of text vectorization:\n\n  - Unigrams: These are single words. For example, the sentence \"I love machine learning\" would yield unigrams: \"I\", \"love\", \"machine\", \"learning\".\n\n  - Bigrams: These consist of pairs of adjacent words. From the same sentence, bigrams would be: \"I love\", \"love machine\", \"machine learning\".\n\n  - Trigrams: These are sequences of three adjacent words. From the same sentence, the trigrams would be: \"I love machine\", \"love machine learning\".\n    \n    N-grams capture sequential word information directly from text and are often used in tasks where word order matters, such as language modeling, sentiment analysis, and machine translation.\n\n## Models Used\n\nTwo classification models are implemented:\n\n1. Logistic Regression\n2. Naive Bayes (Multinomial)\n\n## Evaluation Metrics\n\nThe following metrics are used to evaluate model performance:\n\n- Accuracy\n- Precision\n- Recall\n- F1-score\n- Confusion Matrix\n\n## Results\n\n| Algorithm            | Accuracy | Precision | Recall | F1-Score |\n|----------------------|----------|-----------|--------|----------|\n| Logistic Regression  | 97.31%   | 100%      | 79.45% | 88.55%   |\n| Naive Bayes (Multinomial) | 94.98%   | 74.73%    | 93.15% | 82.93%   |\n\n## Usage\n\n- Modify `Spam_detection_using_text_classification.ipynb` to experiment with different preprocessing techniques or models.\n- Use the provided functions and classes to integrate with other applications or pipelines.\n\n## Contributing\n\nContributions are welcome! 
Fork the repository and create a pull request with your proposed changes.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaba-gul%2Fspam_detection_using_text_classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaba-gul%2Fspam_detection_using_text_classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaba-gul%2Fspam_detection_using_text_classification/lists"}