https://github.com/dpb24/fake-news-detector
NLP: Fake News Detection using Classical Machine Learning
- Host: GitHub
- URL: https://github.com/dpb24/fake-news-detector
- Owner: dpb24
- Created: 2025-06-02T15:18:51.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-03T09:31:39.000Z (about 1 year ago)
- Last Synced: 2025-06-14T01:09:20.075Z (about 1 year ago)
- Topics: bag-of-words, decision-tree, decision-tree-classifier, fake-news, feature-engineering, feature-extraction, machine-learning, matplotlib, natural-language-processing, nlp, nlp-machine-learning, predictive-analytics, predictive-modeling, scikit-learn, text, vectorization, visual-studio-code, xgboost, xgboost-classifier, xgboost-model
- Language: Jupyter Notebook
- Homepage:
- Size: 2 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Fake News Detector: Binary Classification Model
**Libraries:** `scikit-learn`, `XGBoost`, `matplotlib`, `pandas`, `numpy`
**Dataset:** [ISOT Fake News Detection Dataset](https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/)
In this project we use the Python libraries [scikit-learn](https://scikit-learn.org/stable/) and [XGBoost](https://xgboost.readthedocs.io/en/stable/) to build a machine learning model that classifies news articles as fake or real. We combine classical machine learning techniques with engineered textual features to improve model generalisability and performance.
## Approach
- **Text vectorisation:** Bag of Words (BoW)
- **Feature engineering:** percentage of special characters and percentage of capitalised characters
- **Baseline model:** `DecisionTreeClassifier` with `GridSearchCV`
- **Ensemble model:** `XGBClassifier` with `RandomizedSearchCV`
- **Robustness:** Removed dataset-specific artefacts (e.g. the token *reuters*) from the BoW vocabulary to improve generalisability
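
The first three steps and the robustness fix can be sketched roughly as follows. This is a minimal illustration on invented toy headlines, not the notebook's actual code: the helper names `pct_capitalised` and `pct_special`, the exact definition of a "special" character, and the grid values are all assumptions made for the example.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def pct_capitalised(text: str) -> float:
    """Percentage of characters that are uppercase letters (assumed definition)."""
    return 100.0 * sum(ch.isupper() for ch in text) / len(text) if text else 0.0


def pct_special(text: str) -> float:
    """Percentage of characters that are neither alphanumeric nor whitespace (assumed definition)."""
    if not text:
        return 0.0
    return 100.0 * sum(not (ch.isalnum() or ch.isspace()) for ch in text) / len(text)


# Toy data invented for illustration (1 = fake, 0 = real).
headlines = [
    "SHOCKING!!! You WON'T believe this",
    "Senate passes budget bill (reuters)",
    "BREAKING: Aliens LAND in Ohio!!!",
    "Central bank holds rates steady (reuters)",
    "WOW!!! Miracle CURE found???",
    "Minister visits flood-hit region (reuters)",
    "EXPOSED!!! The TRUTH they HIDE",
    "Court rules on trade dispute (reuters)",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Bag of Words, dropping a dataset-specific token so the model cannot lean on it.
vectoriser = CountVectorizer(stop_words=["reuters"])
bow = vectoriser.fit_transform(headlines)

# Engineered features, appended as two extra columns after the word counts.
engineered = np.array([[pct_capitalised(h), pct_special(h)] for h in headlines])
X = hstack([bow, csr_matrix(engineered)]).tocsr()

# Small grid search over a decision tree; the grid values are arbitrary.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3], "min_samples_leaf": [1, 2]},
    cv=2,
    scoring="accuracy",
)
search.fit(X, labels)
print(search.best_params_)
```

Passing the artefact tokens through `stop_words` keeps them out of the vocabulary entirely, so they cannot appear among the learned features.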
## Results
- **XGBoost ensemble** achieved **~99.8%** accuracy, precision, recall, and F1 score
- **Top feature:** `headline_capitalised` (engineered)
- **Fun insight:** the second most important vectorised word for classification was "Trump"
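
Metrics and a feature ranking of this kind could be produced along the following lines. The sketch is an assumption-laden stand-in rather than the project's code: it uses scikit-learn's `GradientBoostingClassifier` in place of `XGBClassifier` so that it runs without XGBoost installed, the data is synthetic, the second feature name `headline_special` is hypothetical, and the search space is arbitrary. Its output says nothing about the figures reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(0)
n = 100  # rows per class

# Synthetic features: the first column separates the classes, the second is noise.
feature_names = ["headline_capitalised", "headline_special"]
fake = np.column_stack([rng.uniform(20, 40, n), rng.uniform(0, 10, n)])
real = np.column_stack([rng.uniform(0, 10, n), rng.uniform(0, 10, n)])
X = np.vstack([fake, real])
y = np.array([1] * n + [0] * n)  # 1 = fake, 0 = real

# Hold out a stratified test set so metrics are not computed on training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Randomised hyperparameter search over a boosted-tree ensemble (stand-in model).
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": [10, 20, 50],
        "max_depth": [1, 2, 3],
        "learning_rate": [0.05, 0.1, 0.3],
    },
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
model = search.best_estimator_

# The four metrics named in the results list, computed on the held-out set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Rank features by the importance the fitted ensemble assigns to them.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

With XGBoost installed, `XGBClassifier` should be usable as the estimator in the same search, since it follows the scikit-learn estimator interface; the parameter names in the search space would need checking against its documentation.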
## Future Work
- Test on more diverse, real-world datasets
- Experiment with advanced text vectorisation (e.g. word embeddings, transformer models)
- Compare with alternative classifiers (e.g. Support Vector Machines)
Jupyter Notebook: [GitHub](https://github.com/dpb24/fake-news-detector/blob/main/notebooks/Fake_News_Detector.ipynb) | [Colab](https://colab.research.google.com/drive/1WacZBouhz3WlujSIORFhSaVje6W5upGZ?usp=sharing) | [Kaggle](https://www.kaggle.com/code/davidpbriggs/fake-news-detector)