
# 📰 Fake-News-Detection-V4-Gensim-GoogleNews300-Word2Vec

## 📚 Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Data](#data)
4. [Methodology](#methodology)
   - [Gensim and Word2Vec](#gensim-and-word2vec)
   - [Preprocessing](#preprocessing)
   - [Mean Vectors](#mean-vectors)
5. [Model](#model)
6. [Results](#results)
7. [Testing with Internet News](#testing-with-internet-news)
8. [Conclusion](#conclusion)
9. [Future Work](#future-work)

## 🎯 Introduction

In the era of digital information, the spread of fake news has become a significant concern. This project aims to address this issue by leveraging Natural Language Processing (NLP) techniques and machine learning to classify news articles as either real or fake.

We use Word2Vec embeddings to represent text data and a Gradient Boosting Classifier to make predictions. This README provides a comprehensive guide to understanding and using our fake news detection system.

## 🛠️ Installation

To set up the project, you'll need to install the following dependencies:

```bash
pip install gensim pandas numpy scikit-learn spacy matplotlib seaborn
python -m spacy download en_core_web_lg
```
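
As an optional sanity check (not part of the original setup steps), you can verify that the core libraries import and that the spaCy model downloaded correctly:

```python
# Optional sanity check for the installed dependencies
import gensim
import sklearn
import spacy

print("gensim:", gensim.__version__)
print("scikit-learn:", sklearn.__version__)

# Raises OSError if en_core_web_lg was not downloaded
nlp = spacy.load("en_core_web_lg")
print("spaCy pipeline loaded:", nlp.meta["lang"], nlp.meta["name"])
```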

## 📊 Data

The dataset used in this project contains both fake and real news articles. Two columns drive the classification task:

1. `text`: the content of the news article
2. `label`: a binary indicator (0 for fake, 1 for real)

(The raw source files also carry `title`, `subject`, and `date` metadata columns, which is what the preview below shows.)

Here's a glimpse of the data:

```python
import pandas as pd

# Load the combined fake/real news dataset
df = pd.read_csv("fake_and_real_news.csv")
print(df.shape)
df.head(5)
```

Output:
```
(23481, 4)
                                               title                                               text subject          date
0  Donald Trump Sends Out Embarrassing New Year'...  Donald Trump just couldn't wish all Americans ...    News  Dec 31, 2017
1  Drunk Bragging Trump Staffer Started Russian ...  House Intelligence Committee Chairman Devin Nu...    News  Dec 31, 2017
2  Sheriff David Clarke Becomes An Internet Joke...  On Friday, it was revealed that former Milwauk...    News  Dec 30, 2017
3  Trump Is So Obsessed He Even Has Obama's Name...  On Christmas day, Donald Trump announced that ...    News  Dec 29, 2017
4  Pope Francis Just Called Out Donald Trump Dur...  Pope Francis used his annual Christmas Day mes...    News  Dec 25, 2017
```
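
Before training, it is worth checking the class balance. A small sketch, assuming the `label` column described above is present in the loaded DataFrame:

```python
# Count fake (0) vs. real (1) articles
print(df['label'].value_counts())

# Rough sense of article length per class (in characters)
print(df.groupby('label')['text'].apply(lambda s: s.str.len().mean()))
```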

## 🧠 Methodology

### Gensim and Word2Vec

We use Gensim to load a pre-trained Word2Vec model trained on Google News articles. This model represents words as 300-dimensional vectors.

```python
import gensim.downloader as api

# Downloads the pre-trained vectors on first use (a large file,
# cached locally for subsequent calls)
wv = api.load('word2vec-google-news-300')
```

Word2Vec works by training a shallow neural network to predict a word given its context (CBOW) or the context given a word (skip-gram). The resulting word embeddings capture semantic relationships between words.

Example of finding word similarity:

```python
similarity = wv.similarity(w1="great", w2="good")
print(f"Similarity between 'great' and 'good': {similarity:.4f}")
```

Output:
```
Similarity between 'great' and 'good': 0.7292
```
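
Beyond pairwise similarity, gensim's `KeyedVectors` also supports nearest-neighbour and analogy queries. A short illustration (exact scores depend on the model version, but the calls below are standard gensim API):

```python
# Nearest neighbours of a word in embedding space
print(wv.most_similar("good", topn=3))

# Classic analogy king - man + woman: the top hit should be close to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```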

### Preprocessing

We use spaCy for text preprocessing, which includes tokenization, stop word and punctuation removal, and lemmatization.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def preprocess_and_vectorize(text):
    # Lemmatize and drop stop words and punctuation
    doc = nlp(text)
    filtered_tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    # Average the Word2Vec embeddings of the remaining tokens
    # (tokens missing from the vocabulary are ignored by default)
    return wv.get_mean_vector(filtered_tokens)
```

### Mean Vectors

To represent an entire article, we calculate the mean of the Word2Vec embeddings for all words in the preprocessed text. This results in a single 300-dimensional vector for each article.
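
Conceptually, `get_mean_vector` boils down to averaging the per-token embeddings (gensim additionally unit-normalizes each vector first by default). A minimal NumPy sketch of the same idea, using the `wv` model loaded above:

```python
import numpy as np

def mean_vector_sketch(tokens):
    # Keep only tokens that exist in the Word2Vec vocabulary
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        # No known tokens: fall back to the zero vector
        return np.zeros(wv.vector_size, dtype=np.float32)
    # One fixed-size article representation: the element-wise mean
    return np.mean(vectors, axis=0)

print(mean_vector_sketch(["economy", "election", "policy"]).shape)  # (300,)
```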

## 🤖 Model

We use a Gradient Boosting Classifier from scikit-learn for the fake news detection task:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Vectorize every article (this pass through spaCy can take a while)
X = df['text'].apply(preprocess_and_vectorize).tolist()
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)
```
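
Vectorizing and training are the slow parts, so persisting the fitted classifier can save time. A minimal sketch using joblib (the file name is illustrative):

```python
import joblib

# Save the fitted classifier so it can be reused without retraining
joblib.dump(clf, "fake_news_gbc.joblib")

# Later (e.g., in a serving script), reload it
clf = joblib.load("fake_news_gbc.joblib")
```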

## 📊 Results

We evaluate our model using various metrics:

```python
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
```

Classification Report:
```
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1000
           1       0.97      0.99      0.98       980

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980
```
![Confusion matrix](https://github.com/user-attachments/assets/fedab9fb-e819-4b86-9afd-df2e402bc414)

The model achieves 98% accuracy on the held-out test set, indicating that mean Word2Vec embeddings carry enough signal to separate the real and fake articles in this dataset.

## 🌐 Testing with Internet News

We tested the model with three news articles from the internet:

```python
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water...",
    "WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia...",
    "Sarah Palin Celebrates After White Man Who Pulled Gun On Black..."
]

# Vectorize each article with the same preprocessing used for training
test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
results = clf.predict(test_news_vectors)
results  # array([1, 0, 0], dtype=int64)
```

Results interpretation:
- Article 1 (Michigan governor): Classified as real (1)
- Article 2 (Fox News Host): Classified as fake (0)
- Article 3 (Sarah Palin): Classified as fake (0)
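
For ad-hoc checks, the whole pipeline can be wrapped in a small convenience function (`classify_article` is a hypothetical name, not part of the original code):

```python
def classify_article(text):
    """Return 'Real' or 'Fake' for a raw news article string (hypothetical wrapper)."""
    vector = preprocess_and_vectorize(text)
    prediction = clf.predict([vector])[0]
    return "Real" if prediction == 1 else "Fake"

print(classify_article("Michigan governor denies misleading U.S. House on Flint water..."))
```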

## 🎭 Conclusion

Our Word2Vec-based fake news detection model reaches 98% accuracy on the held-out test set. By combining pre-trained word embeddings with a gradient boosting classifier, we've built a system that also classified the small set of unseen internet articles above as expected, though broader evaluation across sources would be needed to confirm how well it generalizes.

## 🚀 Future Work

1. Experiment with contextual embedding models (e.g., BERT, RoBERTa)
2. Implement an attention mechanism to focus on key parts of articles
3. Develop a user-friendly web interface for real-time fake news detection
4. Expand the dataset to include more diverse news sources and languages
5. Investigate the model's performance on different types of misinformation (e.g., satire, propaganda)

---

📌 **Note**: While this model shows promising results, it's important to use it as part of a broader fact-checking process. Always verify information from multiple reliable sources.