
# 📰 Fake-News-Detection-V4-Gensim-GoogleNews300-Word2Vec

## 📚 Table of Contents

1. [Introduction](#introduction)
2. [Installation](#installation)
3. [Data](#data)
4. [Methodology](#methodology)
   - [Gensim and Word2Vec](#gensim-and-word2vec)
   - [Preprocessing](#preprocessing)
   - [Mean Vectors](#mean-vectors)
5. [Model](#model)
6. [Results](#results)
7. [Testing with Internet News](#testing-with-internet-news)
8. [Conclusion](#conclusion)
9. [Future Work](#future-work)

## 🎯 Introduction

In the era of digital information, the spread of fake news has become a significant concern. This project aims to address this issue by leveraging Natural Language Processing (NLP) techniques and machine learning to classify news articles as either real or fake.

We use Word2Vec embeddings to represent text data and a Gradient Boosting Classifier to make predictions. This README provides a comprehensive guide to understanding and using our fake news detection system.

## 🛠️ Installation

To set up the project, you'll need to install the following dependencies:

```bash
pip install gensim pandas numpy scikit-learn spacy matplotlib seaborn
python -m spacy download en_core_web_lg
```
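
As an optional sanity check (not part of the original setup steps), you can verify that the core libraries import and that the spaCy model downloaded correctly:

```python
# Optional sanity check for the installed dependencies
import gensim
import sklearn
import spacy

print("gensim:", gensim.__version__)
print("scikit-learn:", sklearn.__version__)

# Raises OSError if en_core_web_lg was not downloaded
nlp = spacy.load("en_core_web_lg")
print("spaCy pipeline loaded:", nlp.meta["lang"], nlp.meta["name"])
```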

## 📊 Data

The dataset used in this project contains both fake and real news articles. Two columns drive the classification task:

1. `text`: the content of the news article
2. `label`: a binary indicator (0 for fake, 1 for real)

(The raw source files also carry `title`, `subject`, and `date` metadata columns, which is what the preview below shows.)

Here's a glimpse of the data:

```python
import pandas as pd

# Load the combined fake/real news dataset
df = pd.read_csv("fake_and_real_news.csv")
print(df.shape)
df.head(5)
```

Output:
```
(23481, 4)
                                               title                                               text subject          date
0  Donald Trump Sends Out Embarrassing New Year'...  Donald Trump just couldn't wish all Americans ...    News  Dec 31, 2017
1  Drunk Bragging Trump Staffer Started Russian ...  House Intelligence Committee Chairman Devin Nu...    News  Dec 31, 2017
2  Sheriff David Clarke Becomes An Internet Joke...  On Friday, it was revealed that former Milwauk...    News  Dec 30, 2017
3  Trump Is So Obsessed He Even Has Obama's Name...  On Christmas day, Donald Trump announced that ...    News  Dec 29, 2017
4  Pope Francis Just Called Out Donald Trump Dur...  Pope Francis used his annual Christmas Day mes...    News  Dec 25, 2017
```
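
Before training, it is worth checking the class balance. A small sketch, assuming the `label` column described above is present in the loaded DataFrame:

```python
# Count fake (0) vs. real (1) articles
print(df['label'].value_counts())

# Rough sense of article length per class (in characters)
print(df.groupby('label')['text'].apply(lambda s: s.str.len().mean()))
```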

## 🧠 Methodology

### Gensim and Word2Vec

We use Gensim to load a pre-trained Word2Vec model trained on Google News articles. This model represents words as 300-dimensional vectors.

```python
import gensim.downloader as api

# Downloads the pre-trained vectors on first use (a large file,
# cached locally for subsequent calls)
wv = api.load('word2vec-google-news-300')
```

Word2Vec works by training a shallow neural network to predict a word given its context (CBOW) or the context given a word (skip-gram). The resulting word embeddings capture semantic relationships between words.

Example of finding word similarity:

```python
similarity = wv.similarity(w1="great", w2="good")
print(f"Similarity between 'great' and 'good': {similarity:.4f}")
```

Output:
```
Similarity between 'great' and 'good': 0.7292
```
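
Beyond pairwise similarity, gensim's `KeyedVectors` also supports nearest-neighbour and analogy queries. A short illustration (exact scores depend on the model version, but the calls below are standard gensim API):

```python
# Nearest neighbours of a word in embedding space
print(wv.most_similar("good", topn=3))

# Classic analogy king - man + woman: the top hit should be close to "queen"
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```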

### Preprocessing

We use spaCy for text preprocessing, which includes tokenization, stop word and punctuation removal, and lemmatization.

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def preprocess_and_vectorize(text):
    # Lemmatize and drop stop words and punctuation
    doc = nlp(text)
    filtered_tokens = [token.lemma_ for token in doc if not (token.is_stop or token.is_punct)]
    # Average the Word2Vec embeddings of the remaining tokens
    # (tokens missing from the vocabulary are ignored by default)
    return wv.get_mean_vector(filtered_tokens)
```

### Mean Vectors

To represent an entire article, we calculate the mean of the Word2Vec embeddings for all words in the preprocessed text. This results in a single 300-dimensional vector for each article.
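
Conceptually, `get_mean_vector` boils down to averaging the per-token embeddings (gensim additionally unit-normalizes each vector first by default). A minimal NumPy sketch of the same idea, using the `wv` model loaded above:

```python
import numpy as np

def mean_vector_sketch(tokens):
    # Keep only tokens that exist in the Word2Vec vocabulary
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        # No known tokens: fall back to the zero vector
        return np.zeros(wv.vector_size, dtype=np.float32)
    # One fixed-size article representation: the element-wise mean
    return np.mean(vectors, axis=0)

print(mean_vector_sketch(["economy", "election", "policy"]).shape)  # (300,)
```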

## 🤖 Model

We use a Gradient Boosting Classifier from scikit-learn for the fake news detection task:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Vectorize every article (this pass through spaCy can take a while)
X = df['text'].apply(preprocess_and_vectorize).tolist()
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
clf.fit(X_train, y_train)
```
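
Vectorizing and training are the slow parts, so persisting the fitted classifier can save time. A minimal sketch using joblib (the file name is illustrative):

```python
import joblib

# Save the fitted classifier so it can be reused without retraining
joblib.dump(clf, "fake_news_gbc.joblib")

# Later (e.g., in a serving script), reload it
clf = joblib.load("fake_news_gbc.joblib")
```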

## 📊 Results

We evaluate our model using various metrics:

```python
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(10,7))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
```

Classification Report:
```
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1000
           1       0.97      0.99      0.98       980

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980
```
![Confusion matrix](https://github.com/user-attachments/assets/fedab9fb-e819-4b86-9afd-df2e402bc414)

The model achieves 98% accuracy on the held-out test set, indicating that mean Word2Vec embeddings carry enough signal to separate the real and fake articles in this dataset.

## 🌐 Testing with Internet News

We tested the model with three news articles from the internet:

```python
test_news = [
    "Michigan governor denies misleading U.S. House on Flint water...",
    "WATCH: Fox News Host Loses Her Sh*t, Says Investigating Russia...",
    "Sarah Palin Celebrates After White Man Who Pulled Gun On Black..."
]

# Vectorize each article with the same preprocessing used for training
test_news_vectors = [preprocess_and_vectorize(n) for n in test_news]
results = clf.predict(test_news_vectors)
results  # array([1, 0, 0], dtype=int64)
```

Results interpretation:
- Article 1 (Michigan governor): Classified as real (1)
- Article 2 (Fox News Host): Classified as fake (0)
- Article 3 (Sarah Palin): Classified as fake (0)
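
For ad-hoc checks, the whole pipeline can be wrapped in a small convenience function (`classify_article` is a hypothetical name, not part of the original code):

```python
def classify_article(text):
    """Return 'Real' or 'Fake' for a raw news article string (hypothetical wrapper)."""
    vector = preprocess_and_vectorize(text)
    prediction = clf.predict([vector])[0]
    return "Real" if prediction == 1 else "Fake"

print(classify_article("Michigan governor denies misleading U.S. House on Flint water..."))
```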

## 🎭 Conclusion

Our Word2Vec-based fake news detection model reaches 98% accuracy on the held-out test set. By combining pre-trained word embeddings with a gradient boosting classifier, we've built a system that also classified the small set of unseen internet articles above as expected, though broader evaluation across sources would be needed to confirm how well it generalizes.

## 🚀 Future Work

1. Experiment with contextual embedding models (e.g., BERT, RoBERTa)
2. Implement an attention mechanism to focus on key parts of articles
3. Develop a user-friendly web interface for real-time fake news detection
4. Expand the dataset to include more diverse news sources and languages
5. Investigate the model's performance on different types of misinformation (e.g., satire, propaganda)

---

📌 **Note**: While this model shows promising results, it's important to use it as part of a broader fact-checking process. Always verify information from multiple reliable sources.