https://github.com/fyt3rp4til/fake-news-detection-v3-word2vec-spacy
- Host: GitHub
- URL: https://github.com/fyt3rp4til/fake-news-detection-v3-word2vec-spacy
- Owner: FYT3RP4TIL
- Created: 2024-09-07T06:45:11.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-09-07T07:17:22.000Z (8 months ago)
- Last Synced: 2025-01-30T15:51:26.873Z (3 months ago)
- Topics: glove-embeddings, knn-classifier, multinomial-naive-bayes, spacy, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 9.06 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 📰 Fake News Classification Project
## 📌 Table of Contents
1. [Problem Statement](#-problem-statement)
2. [Dataset](#-dataset)
3. [Setup and Installation](#-setup-and-installation)
4. [Running the Notebook](#-running-the-notebook)
5. [Methodology](#️-methodology)
6. [Results](#-results)
7. [Key Takeaways](#-key-takeaways)
8. [Future Work](#-future-work)

## 🎯 Problem Statement
This project addresses the challenge of distinguishing between real and fake news articles using Natural Language Processing (NLP) techniques and machine learning algorithms. Our goal is to develop a classifier that can accurately identify fake news, contributing to the ongoing efforts to combat misinformation.
## 📊 Dataset
- **Source**: [Kaggle - Fake and Real News Dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)
- **File**: "Fake_Real_Data.csv"
- **Columns**: Text (news content), label (Fake or Real)
- **Size**: 9,900 entries (5,000 Fake, 4,900 Real)

## 🛠 Setup and Installation
To run this project, you'll need Python and Jupyter Notebook installed. Follow these steps:
1. Ensure you have Jupyter Notebook installed. If not, you can install it using:
```
pip install jupyter
```
2. Install the required packages. You can do this directly in a code cell within the notebook:
```python
!pip install pandas numpy scikit-learn spacy
```
3. Download the spaCy model. Run this in a code cell:
```python
!python -m spacy download en_core_web_lg
```
4. Ensure you have the "Fake_Real_Data.csv" file in the same directory as the notebook.
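Before moving on, a quick sanity check can confirm the environment is ready (a minimal sketch, assuming the packages above installed cleanly):
```python
# Sanity check: the key packages import and the spaCy model loads
import pandas as pd
import numpy as np
import sklearn
import spacy

nlp = spacy.load("en_core_web_lg")  # raises OSError if the model is missing
print("Embedding width:", nlp("test").vector.shape[0])  # 300 for en_core_web_lg
```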
## 🚀 Running the Notebook
1. Start Jupyter Notebook:
```
jupyter notebook
```
2. Open the "Fake_News_Classification.ipynb" file in the Jupyter interface.
3. Run the cells in order, following the instructions within the notebook.
## 🛠️ Methodology
The notebook guides you through the following steps:
### 1. Data Loading and Exploration
```python
import pandas as pd

# Read the dataset
df = pd.read_csv("Fake_Real_Data.csv")

# Print the shape of the dataframe
print(df.shape)

# Print the top 5 rows
df.head(5)

# Check the distribution of labels
df['label'].value_counts()
```

### 2. Text Vectorization
We use spaCy's `en_core_web_lg` model to create word embeddings:
```python
import spacy
nlp = spacy.load("en_core_web_lg")

# This will take some time (nearly 15 minutes)
df['vector'] = df['Text'].apply(lambda text: nlp(text).vector)
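
# Optional speed-up (an assumption, not in the original notebook): nlp.pipe
# streams the texts through the pipeline in batches and lets us disable
# components that aren't needed for the static vectors:
# docs = nlp.pipe(df['Text'], batch_size=64, disable=["parser", "ner"])
# df['vector'] = [doc.vector for doc in docs]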
```

### 3. Data Splitting
```python
import numpy as np
from sklearn.model_selection import train_test_split

# Map the text labels to numbers before splitting
# (assumed mapping, consistent with the class counts below: Fake -> 0, Real -> 1)
df['label_num'] = df['label'].map({'Fake': 0, 'Real': 1})

X_train, X_test, y_train, y_test = train_test_split(
    df.vector.values,
    df.label_num,
    test_size=0.2,
    random_state=2022
)

X_train_2d = np.stack(X_train)  # converting to a 2D numpy array
X_test_2d = np.stack(X_test)
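
# A possible refinement (not in the original notebook): passing
# stratify=df.label_num to train_test_split would preserve the
# 5,000/4,900 Fake/Real ratio in both the train and test sets.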
```

### 4. Model Training and Evaluation
We implement and compare two models:
#### Multinomial Naive Bayes
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import classification_report

# MultinomialNB requires non-negative features, so scale the embeddings to [0, 1]
scaler = MinMaxScaler()
scaled_train_embed = scaler.fit_transform(X_train_2d)
scaled_test_embed = scaler.transform(X_test_2d)

clf = MultinomialNB()
clf.fit(scaled_train_embed, y_train)

y_pred = clf.predict(scaled_test_embed)
print(classification_report(y_test, y_pred))
```
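As a side note (a variant not used in the notebook), `GaussianNB` models continuous features directly, so it could consume the embeddings without the min-max scaling step:
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# GaussianNB accepts real-valued (including negative) features,
# so the unscaled embeddings work as-is
gnb = GaussianNB()
gnb.fit(X_train_2d, y_train)
print(classification_report(y_test, gnb.predict(X_test_2d)))
```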
#### K-Nearest Neighbors (KNN)
```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
clf.fit(X_train_2d, y_train)

y_pred = clf.predict(X_test_2d)
print(classification_report(y_test, y_pred))
```
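The notebook fixes `n_neighbors=5`; as a possible extension (not part of the original notebook), k could be tuned with cross-validation:
```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical tuning step: try a few odd values of k with 5-fold cross-validation
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(metric='euclidean'), param_grid, cv=5)
search.fit(X_train_2d, y_train)
print("Best k:", search.best_params_['n_neighbors'])
```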
## 📊 Results
The notebook presents classification reports for both models:
### Multinomial Naive Bayes
```
              precision    recall  f1-score   support

           0       0.95      0.94      0.95      1024
           1       0.94      0.95      0.94       956

    accuracy                           0.94      1980
   macro avg       0.94      0.94      0.94      1980
weighted avg       0.94      0.94      0.94      1980
```

### K-Nearest Neighbors (KNN)
```
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      1024
           1       0.99      0.99      0.99       956

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980
```

## 🔑 Key Takeaways
1. **Effective Vectorization**: GloVe embeddings from spaCy provided rich 300-dimensional vectors, capturing semantic relationships effectively (see the sketch after this list).
2. **Model Performance**:
- KNN achieved exceptional accuracy (99%), benefiting from the compact, semantic-rich GloVe vectors.
- Multinomial Naive Bayes performed well (94% accuracy) after scaling to handle negative values.
3. **Preprocessing Impact**: Pre-trained GloVe embeddings significantly enhanced both models' performance, especially KNN.
4. **Time Consideration**: While computing the embeddings is time-consuming (about 15 minutes for this dataset), it yields high-quality feature representations.
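
To see the kind of semantic relationship these vectors encode, here is a tiny illustration (the phrases are arbitrary examples, not taken from the notebook):
```python
import spacy

nlp = spacy.load("en_core_web_lg")

# Related phrases score noticeably higher cosine similarity than unrelated ones
print(nlp("election fraud").similarity(nlp("voting scandal")))   # relatively high
print(nlp("election fraud").similarity(nlp("banana smoothie")))  # relatively low
```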
## 🚀 Future Work
1. Experiment with deep learning models like LSTM or BERT.
2. Incorporate additional features (e.g., article source, publication date).
3. Develop a real-time classification system.
4. Explore explainable AI techniques for model interpretability.

---
📌 **Note**: This project is for educational purposes. Always critically evaluate news sources and cross-reference information, regardless of model predictions.