https://github.com/victorkiosh/fake-news-detection
Detecting fake news using NLP and machine learning (Logistic Regression, Random Forest, XGBoost)
- Host: GitHub
- URL: https://github.com/victorkiosh/fake-news-detection
- Owner: Victorkiosh
- Created: 2025-06-27T09:05:20.000Z (3 months ago)
- Default Branch: master
- Last Pushed: 2025-06-27T09:27:53.000Z (3 months ago)
- Last Synced: 2025-06-27T10:29:19.788Z (3 months ago)
- Topics: data-science, fake-news-detection, machine-learning, nlp, scikit-learn, xgboost
- Language: Jupyter Notebook
- Homepage:
- Size: 41.1 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Fake News Detection using NLP & Machine Learning
This project builds a machine learning pipeline to detect fake news articles based on their content. Using natural language processing (NLP) techniques, we classify news as either **Fake** or **Real**. The goal is to help mitigate misinformation by demonstrating the power of data science in content verification.
We trained multiple models and achieved **up to 99.65% accuracy** using XGBoost. We also emphasized interpretability with logistic regression and feature importance analysis.
---
## Project Structure
- `FakeNews_Detection.ipynb` – Full notebook (EDA → Preprocessing → Modeling → Evaluation)
- `data/` – Contains the original datasets (`True.csv`, `Fake.csv`)
- `visuals/` – Word clouds, ROC curves, feature importance charts
- `requirements.txt` – Python packages used
- `README.md` – Project overview, findings, and usage instructions

---
## Key Features
- Exploratory Data Analysis (EDA)
  - Word count distributions
  - Word clouds for Fake vs Real news
- Text cleaning & preprocessing
- Feature extraction with TF-IDF
- ML models (see the sketch after this list):
  - Logistic Regression
  - Random Forest
  - XGBoost (top performer)
- Evaluation metrics: Accuracy, ROC AUC, Confusion Matrix
- Interpretability:
  - Top predictive words (Logistic Regression)
  - Top features (XGBoost)
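The modeling step pairs TF-IDF features with the three classifiers above. The snippet below is a minimal sketch of that step, not the notebook's exact code; it assumes a DataFrame `df` with a cleaned `text` column and a binary `label` column, like the one built in the Dataset section further down.

```python
# Minimal sketch of the TF-IDF + modeling step (illustrative, not the notebook's exact code).
# Assumes `df` has a cleaned `text` column and a binary `label` column.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Convert raw article text into sparse TF-IDF features.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    print(f"{name}: test accuracy = {model.score(X_test_tfidf, y_test):.4f}")
```

---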
## Model Performance
| Model | Accuracy | ROC AUC |
|---------------------|----------|---------|
| Logistic Regression | 98.39% | 0.9838 |
| Random Forest | 95.78% | 0.9571 |
| XGBoost             | **99.65%** | **0.9966** |

> **XGBoost** achieved the best performance, while **Logistic Regression** offered excellent interpretability.
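The metrics in the table can be reproduced along these lines (a sketch assuming the fitted `models`, `X_test_tfidf`, and `y_test` from the modeling sketch above; exact numbers depend on the split and hyperparameters):

```python
# Evaluation sketch: accuracy, ROC AUC, and confusion matrix for one fitted model.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

model = models["XGBoost"]                          # any fitted classifier from the sketch works here
y_pred = model.predict(X_test_tfidf)
y_proba = model.predict_proba(X_test_tfidf)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```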


---
## Dataset
- Source: [Kaggle Fake News Dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)
- `Fake.csv` – articles marked as fake
- `True.csv` – articles from verified sources
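A minimal loading-and-labeling sketch for these two files (the 0 = fake / 1 = real encoding is an assumption, not taken from the notebook):

```python
# Load the two CSVs from data/ and attach binary labels (label encoding is an assumption).
import pandas as pd

fake = pd.read_csv("data/Fake.csv")
real = pd.read_csv("data/True.csv")
fake["label"] = 0
real["label"] = 1

# Combine and shuffle so fake and real articles are interleaved.
df = pd.concat([fake, real], ignore_index=True)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df["label"].value_counts())
```

---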
## Tech Stack
- Python
- Pandas, NumPy
- Scikit-learn
- XGBoost
- Matplotlib, Seaborn, WordCloud

---
## Insights
- Fake news often uses emotionally charged or politically biased words (e.g., "lying", "watch", "gop")
- Real news is more likely to mention institutional sources and structured reporting (e.g., "reuters", "statement", "reporters")
- TF-IDF with traditional models like XGBoost and Logistic Regression can still outperform deep learning on smaller datasets
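These word-level insights come from the interpretability step. A sketch of how the top predictive words can be read off a logistic regression trained on TF-IDF features, assuming the `vectorizer` and fitted `models` from the modeling sketch above:

```python
# Interpretability sketch: rank TF-IDF vocabulary terms by logistic regression coefficient.
import numpy as np

log_reg = models["Logistic Regression"]            # fitted model from the earlier sketch
feature_names = np.array(vectorizer.get_feature_names_out())
coefs = log_reg.coef_[0]

top_real = feature_names[np.argsort(coefs)[-10:]]  # strongest push toward the positive (real) class
top_fake = feature_names[np.argsort(coefs)[:10]]   # strongest push toward the fake class
print("Most 'real'-leaning words:", list(top_real))
print("Most 'fake'-leaning words:", list(top_fake))
```

---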
## How to Run
1. Clone this repo
2. Install required libraries:

```bash
pip install -r requirements.txt
```
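3. Open `FakeNews_Detection.ipynb` in Jupyter and run the cells in order.

---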
## Author
**Victor Kioko Mutua**
Email: kiokovictor78@gmail.com
[GitHub Profile](https://github.com/Victorkiosh)
[LinkedIn](https://www.linkedin.com/in/mutuavictor)