https://github.com/dpb24/fake-news-detector

📰 NLP: Fake News Detection using Classical Machine Learning

bag-of-words decision-tree decision-tree-classifier fake-news feature-engineering feature-extraction machine-learning matplotlib natural-language-processing nlp nlp-machine-learning predictive-analytics predictive-modeling scikit-learn text vectorization visual-studio-code xgboost xgboost-classifier xgboost-model

Host: GitHub
URL: https://github.com/dpb24/fake-news-detector
Owner: dpb24
Created: 2025-06-02T15:18:51.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-06-03T09:31:39.000Z (about 1 year ago)
Last Synced: 2025-06-14T01:09:20.075Z (about 1 year ago)
Topics: bag-of-words, decision-tree, decision-tree-classifier, fake-news, feature-engineering, feature-extraction, machine-learning, matplotlib, natural-language-processing, nlp, nlp-machine-learning, predictive-analytics, predictive-modeling, scikit-learn, text, vectorization, visual-studio-code, xgboost, xgboost-classifier, xgboost-model
Language: Jupyter Notebook
Homepage:
Size: 2 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# 📰 Fake News Detector: Binary Classification Model

**Libraries:** `scikit-learn`, `XGBoost`, `matplotlib`, `pandas`, `numpy`

**Dataset:** [ISOT Fake News Detection Dataset](https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/)

In this project we use the 🐍 Python libraries [scikit-learn](https://scikit-learn.org/stable/) and [XGBoost](https://xgboost.readthedocs.io/en/stable/) to build a machine learning model that classifies news articles as fake or real. We combine classical machine learning techniques with engineered textual features to improve model generalisability and performance.

## 🧠 Approach
- **Text vectorisation:** Bag of Words (BoW)
- **Feature engineering:** percentage of special characters and percentage of capitalised characters
- **Baseline model:** `DecisionTreeClassifier` with `GridSearchCV`
- **Ensemble model:** `XGBClassifier` with `RandomizedSearchCV`
- **Robustness:** Removed dataset-specific artefacts (e.g. *reuters*) from BoW to improve generalisability

## ✅ Results
- 🤖 **XGBoost ensemble** achieved **~99.8%** accuracy, precision, recall, and F1 score
- **Top feature:** `headline_capitalised` (engineered)
- **Fun insight:** the second most important vectorised word for classification was "Trump" 🇺🇸
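
For reference, the four reported metrics and a feature ranking like the one above can be computed as in the sketch below. It uses plain Python so the formulas are visible; the helper names `binary_metrics` and `rank_features` are hypothetical, and the labels and importance values are made-up placeholders, not results from this project.

```python
from typing import Dict, List, Sequence, Tuple


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1, treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def rank_features(
    names: Sequence[str], importances: Sequence[float], top_n: int = 5
) -> List[Tuple[str, float]]:
    """Return the top_n (name, importance) pairs, highest importance first."""
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Placeholder labels and predictions (not project data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))

# Placeholder importances; with a fitted tree-based model these would come
# from its feature_importances_ attribute.
print(rank_features(["headline_capitalised", "word_a", "word_b"], [0.5, 0.2, 0.3], top_n=2))
```

On a balanced dataset all four metrics tend to move together, which is consistent with them being reported as a single figure here.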

## 🔭 Future Work
- Test on more diverse, real-world datasets
- Experiment with advanced text vectorisation (e.g. word embeddings, transformer models)
- Compare with alternative classifiers (e.g. Support Vector Machines)

📖 Jupyter Notebook: [GitHub](https://github.com/dpb24/fake-news-detector/blob/main/notebooks/Fake_News_Detector.ipynb) | [Colab](https://colab.research.google.com/drive/1WacZBouhz3WlujSIORFhSaVje6W5upGZ?usp=sharing) | [Kaggle](https://www.kaggle.com/code/davidpbriggs/fake-news-detector)
