
# 📰 Fake News Detection using NLP & Machine Learning

This project builds a machine learning pipeline to detect fake news articles based on their content. Using natural language processing (NLP) techniques, we classify news as either **Fake** or **Real**. The goal is to help mitigate misinformation by demonstrating the power of data science in content verification.

We trained multiple models and achieved **up to 99.65% accuracy** using XGBoost. We also emphasized interpretability with logistic regression and feature importance analysis.

---

## 📦 Project Structure

- `FakeNews_Detection.ipynb` → Full notebook (EDA → Preprocessing → Modeling → Evaluation)
- `data/` → Contains the original datasets (`True.csv`, `Fake.csv`)
- `visuals/` → Word clouds, ROC curves, feature importance charts
- `requirements.txt` → Python packages used
- `README.md` → Project overview, findings, and usage instructions

---

## 🧠 Key Features

- 🔍 Exploratory Data Analysis (EDA)
  - Word count distributions
  - Word clouds for Fake vs Real news
- ✂️ Text cleaning & preprocessing
- 🧾 Feature extraction with TF-IDF
- 🤖 ML models:
  - Logistic Regression
  - Random Forest
  - XGBoost (top performer)
- 📈 Evaluation metrics: Accuracy, ROC AUC, Confusion Matrix
- 📊 Interpretability:
  - Top predictive words (Logistic Regression)
  - Top features (XGBoost)
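The TF-IDF → classifier flow above can be sketched as a single scikit-learn pipeline. This is an illustrative sketch, not the notebook's exact code: the toy texts and labels below are invented, and the Logistic Regression step could be swapped for `RandomForestClassifier` or `XGBClassifier`.

```python
# Minimal sketch of the TF-IDF + classifier pipeline (toy data, illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tiny invented corpus standing in for the real articles:
texts = [
    "SHOCKING: celebrity caught lying about everything",
    "Reuters reports the committee released an official statement",
    "You won't BELIEVE what the GOP is hiding from you",
    "The central bank announced a rate decision on Tuesday",
]
labels = [1, 0, 1, 0]  # 1 = Fake, 0 = Real

pipeline.fit(texts, labels)
print(pipeline.predict(["Officials issued a statement to reporters"]))
```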

---

## 📊 Model Performance

| Model | Accuracy | ROC AUC |
|---------------------|----------|---------|
| Logistic Regression | 98.39% | 0.9838 |
| Random Forest | 95.78% | 0.9571 |
| XGBoost | **99.65%** | **0.9966** |

> ✅ **XGBoost** achieved the best performance, while **Logistic Regression** offered excellent interpretability.

![ROC Curve](visuals/roc_curve.png)

![Feature Importance](visuals/Top_20_features_XGBoost.png)

---

## ๐ŸŒ Dataset

- Source: [Kaggle Fake News Dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)
- [`Fake.csv`] โ€” articles marked as fake
- [`True.csv`] โ€” articles from verified sources
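Since the two CSVs carry the label implicitly (one file per class), they need to be tagged and combined before modeling. A minimal sketch, assuming each CSV has a `text` column (the inline frames below are tiny stand-ins for the real files):

```python
# Sketch: tag each source file with a label, concatenate, and shuffle.
import pandas as pd

def load_and_label(fake_df: pd.DataFrame, true_df: pd.DataFrame) -> pd.DataFrame:
    """Label rows (1 = Fake, 0 = Real), combine both frames, and shuffle."""
    df = pd.concat(
        [fake_df.assign(label=1), true_df.assign(label=0)],
        ignore_index=True,
    )
    return df.sample(frac=1, random_state=42).reset_index(drop=True)

# Tiny inline stand-ins for data/Fake.csv and data/True.csv:
fake = pd.DataFrame({"text": ["SHOCKING claim goes viral"]})
true = pd.DataFrame({"text": ["Reuters: officials release statement"]})
df = load_and_label(fake, true)
print(df[["text", "label"]])
```

With the real files this would be called as `load_and_label(pd.read_csv("data/Fake.csv"), pd.read_csv("data/True.csv"))`.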

---

## 🧰 Tech Stack

- Python
- Pandas, NumPy
- Scikit-learn
- XGBoost
- Matplotlib, Seaborn, WordCloud

---

## 📌 Insights

- Fake news often uses emotionally charged or politically biased words (e.g., "lying", "watch", "gop")
- Real news is more likely to mention institutional sources and structured reporting (e.g., "reuters", "statement", "reporters")
- TF-IDF features with classical models such as Logistic Regression and XGBoost can still outperform deep learning on smaller datasets

---

## 🚀 How to Run

1. Clone this repo
2. Install required libraries:

```bash
pip install -r requirements.txt
```

3. Launch the notebook:

```bash
jupyter notebook FakeNews_Detection.ipynb
```

---

## 📬 Author

**Victor Kioko Mutua**
📧 kiokovictor78@gmail.com
🌍 [GitHub Profile](https://github.com/Victorkiosh)
🔗 [LinkedIn](https://www.linkedin.com/in/mutuavictor)