{"id":22482306,"url":"https://github.com/rahul-404/bbc-news-sorting","last_synced_at":"2026-05-05T11:34:57.891Z","repository":{"id":243323003,"uuid":"812114110","full_name":"Rahul-404/bbc-news-sorting","owner":"Rahul-404","description":"📰 BBC News Article Classifier: A project that categorizes BBC News articles into business, entertainment, politics, sport, and tech 🏙️. Utilizes NLP techniques to build a precise classification model for text data, delivering accurate categorization 🤖.","archived":false,"fork":false,"pushed_at":"2025-07-12T08:59:32.000Z","size":25191,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-12T10:26:41.888Z","etag":null,"topics":["ai","bbc","data-science","dataanalysis","deep-learning","kaggle","machine-learning","mediaanalytics","natural-language-processing","newsclassification","nlp","python","sentiment-analysis","text-classification","topic-modeling"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Rahul-404.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-06-08T02:06:45.000Z","updated_at":"2025-07-12T08:59:35.000Z","dependencies_parsed_at":"2024-11-19T02:16:02.628Z","dependency_job_id":"26f8568e-1f58-4a68-b43b-6df2f4b2e10a","html_url":"https://github.com/Rahul-404/bbc-news-sorting","commit_stats":null,"previous_names":["rahul-404/bbc-news-sorting"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Rahul-404/bbc-news-sorting","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rahul-404%2Fbbc-news-sorting","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rahul-404%2Fbbc-news-sorting/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rahul-404%2Fbbc-news-sorting/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rahul-404%2Fbbc-news-sorting/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Rahul-404","download_url":"https://codeload.github.com/Rahul-404/bbc-news-sorting/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Rahul-404%2Fbbc-news-sorting/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266195309,"owners_count":23891167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","bbc","data-science","dataanalysis","deep-learning","kaggle","machine-learning","mediaanalytics","natural-language-processing","newsclassification","nlp","python","sentiment-analysis","text-classification","topic-modeling"],"created_at":"2024-12-06T16:24:14.085Z","updated_at":"2026-05-05T11:34:57.861Z","avatar_url":"https://github.com/Rahul-404.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# News Sorting with NLP: BBC News Dataset 📰🔍\n\n## Overview\nWelcome to the News Sorting project using Natural Language Processing (NLP) techniques applied to the BBC News Dataset. This project aims to classify news articles into predefined categories such as business, entertainment, politics, sport, and tech. 
By leveraging NLP, we'll extract features from the text data to build a machine learning model capable of accurately categorizing news articles.\n\n## Table of Contents\n- [Installation](#installation)\n- [Requirements](#requirements)\n- [Dataset](#dataset)\n- [Approach](#approach)\n- [Results](#results)\n- [Usage](#usage)\n- [Project Demo](#project-demo)\n- [License](#license)\n\n## Installation\n\nClone the repository:\n```bash\ngit clone https://github.com/Rahul-404/bbc-news-sorting.git\ncd bbc-news-sorting\n```\n\nCreate and activate a virtual environment:\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n```\n\n### Requirements\nTo run the project, you'll need Python 3.x and the following libraries:\n- numpy\n- pandas\n- scikit-learn\n- matplotlib\n- seaborn\n- nltk\n- wordcloud\n- tensorflow\n- mlflow\n\nInstall the required dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\n### 🔧 To make any updates in code: Follow the workflow\n\n1. Update config.yaml\n2. Update secrets.yaml [Optional]\n3. Update params.yaml\n4. Update the entity -\u003e config_entity.py\n5. Update the configuration manager in src config\n6. Update the components\n7. Update the pipeline\n8. Update the main.py\n9. Update the dvc.yaml\n\n\n## Dataset\n\nThe BBC News Dataset consists of news articles published by the BBC, categorized into five predefined classes: business, entertainment, politics, sport, and tech. Each article contains textual content along with its corresponding category label. 
The dataset is available on [Kaggle](https://www.kaggle.com/shivamkushwaha/bbc-full-text-document-classification).\n\n### Train, Test and Validation Split:\n\n![Train-Test_Validation-Split](/notebooks/plots/data_split_output.png)\n\nThe data is split by `Category` and by the distribution of tokens per example within each category, to avoid bias from token length and word distribution.\n\n![Dictionary-words-vs-OOV-words](/notebooks/plots/dictionary_words_vs_oov_words_output.png)\n\nThe training data contains around 10K words, of which only ~3.2K are out-of-vocabulary. This might cause some test error, but it should help produce a well-generalized model.\n\n![Words-Distribution](/notebooks/plots/english_non_english_words_distribution_output.png)\n\n- Train : ~ 10K\n    - ~5.5K english + ~4.5K non-english\n- Validation : ~ 3.2K\n    - ~1.1K english + ~2.1K non-english\n\n## Approach\n\n1. **Data Preprocessing:**\n   - Text cleaning\n      - Normalizing\n      - remove currencies\n      - remove distance\n      - remove conturies\n      - remove numbers\n      - remove special characters\n      - remove punctuation\n      - remove multiple spaces\n      - remove stopwords\n      - lemmatization\n\n   - Tokenization\n   - Vectorization\n      1. One-Hot Encoding\n      2. TF-IDF Encoding\n      3. Word2Vec Embeddings\n      4. Glove Embeddings\n      5. Fasttext Embeddings\n\n2. **Machine Learning Models:**\n   - Logistic Regression\n   - Support Vector Machine (SVM)\n   - Naive Bayes\n   - Random Forest\n   - Gradient Boost\n\n3. **Deep Learning Models:**\n   - Multi Layer Perceptron\n   - LSTM with 2 Dense layers\n   - LSTM\n   - Bidirectional LSTM\n\n4. **Model Evaluation:**\n   - Precision, Recall, F1-Score\n   - Confusion matrix and ROC curve for performance analysis.\n\n\n## Results\n\nThe models were evaluated on precision, recall, F1-score and ROC curve. 
Below are the results:\n\n```\n+---------------------------+----------+-------+----------+\n| Baseline Model (F1 Score) | Word2Vec | Glove | Fasttext |\n+---------------------------+----------+-------+----------+\n|    Logistic Regression    |   0.96   |  0.96 |   0.96   |\n|        Naive Bayes        |   0.84   |  0.92 |   0.8    |\n|            SVC            |   0.96   |  0.96 |   0.96   |\n|       Random Forest       |   0.96   |  0.95 |   0.95   |\n|       Gradient Boost      |   0.97   |  0.96 |   0.93   |\n+---------------------------+----------+-------+----------+\n```\n```\n+------------------------------+-------------------+---------------------+\n|            Model             |     Embedding     | Validation F1 Score |\n+------------------------------+-------------------+---------------------+\n| Baseline Logistic Regression |       GloVe       |         0.97        |\n|             MLP              |   No Embeddings   |         0.92        |\n|             MLP              |       Glove       |         0.14        |\n|             MLP              |   Glove(Trained)  |         0.95        |\n|             MLP              |      Fasttext     |         0.92        |\n|             MLP              | Fasttext(Trained) |         0.96        |\n|     LSTM 2 Dense layers      |      Fasttext     |         0.18        |\n|             LSTM             |      Fasttext     |         0.13        |\n|      Bidirectional LSTM      |       Glove       |         0.95        |\n+------------------------------+-------------------+---------------------+\n```\n\nConfusion matrices and other relevant graphs:\n\n```\n             precision    recall  f1-score   support\n\n           0       0.98      0.96      0.97       101\n           1       0.97      0.97      0.97        78\n           2       0.96      0.96      0.96        82\n           3       1.00      1.00      1.00       104\n           4       0.98      1.00      0.99        82\n\n    accuracy                
           0.98       447\n   macro avg       0.98      0.98      0.98       447\nweighted avg       0.98      0.98      0.98       447\n```\n\n\u003c!-- ![Confusion Matrix](confusion_matrix.png) --\u003e\n\n\n## Usage\n\nTo make predictions on new news articles, you can use the following snippet:\n\n```python\nimport numpy as np\n\nfrom src.news_sorting_project.components.predictor import PredictionMaker\nfrom src.news_sorting_project.config.configuration import ConfigurationManager\n\n\n# Display labels and their emoji\ncategories = {\n    'Business': '💼',\n    'Entertainment': '🎬',\n    'Politics': '🗳️',\n    'Sports': '🏅',\n    'Technology': '💻',\n}\n\n# Load the prediction, cleaning, and transformation configs\nconfig = ConfigurationManager()\nmodel_predict_config = config.get_model_predict_config()\nmodel_clean_config = config.get_data_cleaning_config()\nmodel_transform_config = config.get_data_transform_config()\nmake_prediction = PredictionMaker(model_predict_config,\n                                    model_clean_config,\n                                    model_transform_config,\n                                    )\n\narticle_text = \"Your news article text here.\"\nprobabilities = make_prediction.predict(article_text)\nclasses = list(categories.keys())\nprobabilities = [prob / sum(probabilities) for prob in probabilities]  # Normalize\n# Pick the label with the highest normalized probability\npredicted_class = classes[np.argmax(probabilities)]\n\n\nprint(predicted_class)\n```\nYou can also run the training script to retrain the models:\n```bash\npython train.py\n```\n\n## Project Demo\n\n\n## License\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahul-404%2Fbbc-news-sorting","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frahul-404%2Fbbc-news-sorting","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frahul-404%2Fbbc-news-sorting/lists"}