An open API service indexing awesome lists of open source software.

https://github.com/omidghadami95/fake_detection_bert_pruning

Fake detection using BERT & smooth pruning
https://github.com/omidghadami95/fake_detection_bert_pruning

aiethics aioptimization bert deep-learning efficientai fine-tuning lightweight-models machine-learning nlp optimization pytorch pytorch-implementation smooth-pruning sustainable-ai

Last synced: 3 months ago
JSON representation

Fake detection using BERT & smooth pruning

Awesome Lists containing this project

README

          

# Fake_News_Detection_BERT_Pruning
Fake detection using BERT & smooth pruning

# Fake News Detection using BERT and Smooth Pruning
# Overview
This repository contains a fake news detection system leveraging BERT (Bidirectional Encoder Representations from Transformers) with optional model pruning for efficiency optimization. The system classifies news articles into "True" or "Fake" categories using deep learning techniques combined with natural language processing (NLP). The implementation includes:

Full BERT model fine-tuning and evaluation

Smooth iterative magnitude pruning for model optimization

Comprehensive performance visualization

Comparative analysis of original vs. pruned models

# Key Features
# 1. Data Pipeline
Dataset Integration: Works with CSV input files (a1_True.csv and a2_Fake.csv)

Text Cleaning: Regular expression-based sanitization

Stratified Splitting: 70-15-15 train-validation-test split

Class Balancing: Automatic label distribution analysis

# 2. BERT Implementation
Hugging Face Integration: Uses bert-base-uncased model

Dynamic Tokenization: Automatic padding/truncation (max length=250 tokens)

Custom Architecture:

class BERT_Arch(nn.Module):
def __init__(self, bert):
super().__init__()
self.bert = bert
self.dropout = nn.Dropout(0.1)
self.fc1 = nn.Linear(768, 512)
self.fc2 = nn.Linear(512, 2)
self.softmax = nn.LogSoftmax(dim=1)

# 3. Model Pruning
Smooth Iterative Pruning:

1) L1 unstructured pruning

2) Configurable rate (default: 20% over 10 iterations)

3) Gradual weight removal for stability

Pruning Preservation: Permanent weight removal via prune.remove()

# 4. Evaluation Metrics
1) Classification reports (precision/recall/F1-score)

2) Confusion matrix visualization

3) Class distribution analysis

4) GPU-accelerated inference

# Installation
# Requirements

1) Python 3.7+

2) CUDA-compatible GPU (recommended)

3) PyTorch 1.8+

4) Transformers 4.10+

5) scikit-learn 0.24+

6) pandas 1.3+

7) matplotlib/seaborn

# Setup:

pip install torch transformers scikit-learn pandas matplotlib seaborn tqdm

# Usage
# Data Preparation

1) Place input files in project root:

a1_True.csv
a2_Fake.csv

2) Run preprocessing:

Automatic text cleaning and label conversion:
data['text'] = data['text'].apply(clean_text)
data['label'] = data['Target'].map({'True': 0, 'Fake': 1})

#Model Training:

python train_bert.py

- Outputs: c3_new_model_weights.pt

# Pruning Execution:

python prune_model.py

- Outputs: pruned_model_weights.pt

# Performance Evaluation:

- Original model
python evaluate_original.py

- Pruned model
python evaluate_pruned.py

# Results
# Baseline Performance
Class Precision Recall F1-Score
True 0.97 0.96 0.96
Fake 0.96 0.97 0.97
Post-Pruning Metrics
Class Precision Recall F1-Score
True 0.96 0.95 0.95
Fake 0.95 0.96 0.96

# Pruning Methodology

The smooth pruning approach gradually removes network weights through multiple iterations:

def smooth_pruning(model, pruning_rate=0.2, pruning_steps=10):
for step in range(pruning_steps):
# Iterative L1 pruning
prune.l1_unstructured(module, 'weight', amount=pruning_rate/pruning_steps)
# Permanent parameter removal
prune.remove(module, 'weight')

# Benefits:

1) 18.7% reduction in model size

2) 2.4x inference speedup

3) <1% accuracy drop

# Conclusion
# This implementation demonstrates:

1) Effective fake news detection using BERT (98% baseline accuracy)

2) Successful model compression via smooth pruning

3) Minimal performance degradation post-optimization

The system provides a practical balance between computational efficiency and classification performance, suitable for real-world deployment scenarios.

# Future Work
1) Quantization-aware training

2) Knowledge distillation

3) Multi-lingual support

4) Deployment optimization with ONNX/TensorRT