https://github.com/dpb24/fake-news-detector

📰 NLP: Fake News Detection using Classical Machine Learning

# 📰 Fake News Detector: Binary Classification Model

**Libraries:** `scikit-learn`, `XGBoost`, `matplotlib`, `pandas`, `numpy`

**Dataset:** [ISOT Fake News Detection Dataset](https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/)

In this project we use the 🐍 Python libraries [scikit-learn](https://scikit-learn.org/stable/) and [XGBoost](https://xgboost.readthedocs.io/en/stable/) to build a machine learning model that classifies news articles as fake or real. We combine classical machine learning techniques with engineered textual features to improve model generalisability and performance.

## 🧠 Approach
- **Text vectorisation:** Bag of Words (BoW)
- **Feature engineering:** % of special characters & % of capitalised characters
- **Baseline model:** `DecisionTreeClassifier` with `GridSearchCV`
- **Ensemble model:** `XGBClassifier` with `RandomizedSearchCV`
- **Robustness:** Removed dataset-specific artefacts (e.g. *reuters*) from the BoW vocabulary to improve generalisability
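
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy headlines, the helper names `pct_capitalised` and `pct_special`, the definition of "special character" as ASCII punctuation, and the parameter grid are all assumptions made for the example.

```python
import string

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def pct_capitalised(text: str) -> float:
    """Percentage of characters in `text` that are uppercase (hypothetical helper)."""
    return 100.0 * sum(ch.isupper() for ch in text) / len(text) if text else 0.0


def pct_special(text: str) -> float:
    """Percentage of characters that are ASCII punctuation (assumed definition)."""
    return 100.0 * sum(ch in string.punctuation for ch in text) / len(text) if text else 0.0


# Toy corpus invented for illustration only: 1 = fake, 0 = real.
headlines = [
    "SHOCKING!!! You WON'T believe what happened NEXT",
    "BREAKING: Celebrity EXPOSES secret plot!!!",
    "WOW!!! This ONE trick DESTROYS the establishment",
    "UNBELIEVABLE: They are HIDING the truth from YOU!",
    "Senate passes budget bill after lengthy debate (reuters)",
    "Central bank holds interest rates steady (reuters)",
    "Minister announces new trade agreement (reuters)",
    "Court upholds ruling on election funding (reuters)",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Bag of Words; the dataset-specific token is dropped so the model cannot lean on it.
vectoriser = CountVectorizer(stop_words=["reuters"])
bow = vectoriser.fit_transform(headlines)

# Engineered features, one row per headline.
engineered = np.array([[pct_capitalised(h), pct_special(h)] for h in headlines])

# Append the dense engineered columns to the sparse word counts.
X = hstack([bow, csr_matrix(engineered)]).tocsr()

# Baseline: decision tree tuned with an exhaustive grid search (grid is an assumption).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3], "min_samples_leaf": [1, 2]},
    cv=2,
    scoring="accuracy",
)
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Because `XGBClassifier` follows the scikit-learn estimator interface, swapping it in with `RandomizedSearchCV` (which takes `param_distributions` and `n_iter` instead of `param_grid`) should follow the same pattern.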

## ✅ Results
- 🤖 **XGBoost ensemble** achieved **~99.8%** accuracy, precision, recall, and F1 score
- **Top feature:** `headline_capitalised` (engineered)
- **Fun insight:** the second most important vectorised word for classification was "Trump" 🇺🇸
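
For readers less familiar with the four metrics quoted above, here is a small sketch of how each is derived from prediction counts. The function name `classification_metrics` and the toy labels are hypothetical; the numbers it produces are not the project's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels invented for illustration: 1 = fake, 0 = real.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice the equivalent scikit-learn helpers (`accuracy_score`, `precision_score`, `recall_score`, `f1_score` in `sklearn.metrics`) would normally be used instead; the explicit version is shown only to make the definitions visible.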

## 🔭 Future Work
- Test on more diverse, real-world datasets
- Experiment with advanced text vectorisation (e.g. word embeddings, transformer models)
- Compare with alternative classifiers (e.g. Support Vector Machines)

📖 Jupyter Notebook: [GitHub](https://github.com/dpb24/fake-news-detector/blob/main/notebooks/Fake_News_Detector.ipynb) | [Colab](https://colab.research.google.com/drive/1WacZBouhz3WlujSIORFhSaVje6W5upGZ?usp=sharing) | [Kaggle](https://www.kaggle.com/code/davidpbriggs/fake-news-detector)