https://github.com/coder5omkar/fake_news_detection
This project focuses on detecting fake news using semantic classification techniques. By leveraging Word2Vec embeddings and classical machine learning models, it captures the deeper meaning of news content. The approach enhances accuracy by filtering key linguistic features like nouns.
- Host: GitHub
- URL: https://github.com/coder5omkar/fake_news_detection
- Owner: coder5omkar
- Created: 2025-06-22T10:16:23.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-22T10:28:42.000Z (12 months ago)
- Last Synced: 2025-06-30T00:08:31.062Z (11 months ago)
- Size: 2.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
Fake News Identification
Goal
The primary objective of this project is to build a semantics-based classification system that distinguishes authentic news articles from misleading ones. Word2Vec embeddings capture the underlying meaning of the text, and supervised learning algorithms then classify the resulting vectors.
Real-World Relevance
With the increasing circulation of fabricated news, there's a growing need for intelligent tools that can automatically detect misinformation. This solution showcases how semantic analysis can assist digital platforms and end users in evaluating the trustworthiness of online content.
Data Overview
The dataset includes two distinct CSV files:
- True.csv: 21,417 samples of verified news articles.
- Fake.csv: 23,502 samples of fabricated news stories.
Each record provides:
- title: headline of the news item
- text: full article content
- date: date when the article was published
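The two files can be combined into one labeled table before any text processing. The following is a minimal sketch, assuming pandas is available; the function name `load_news`, the `label` column, and the shuffling step are illustrative choices rather than the repository's actual code. The label convention (0 for fake, 1 for real) follows the evaluation summary further down.

```python
import io

import pandas as pd


def load_news(true_source, fake_source):
    """Read both CSVs (paths or file-like objects) and return one labeled frame."""
    true_df = pd.read_csv(true_source)
    fake_df = pd.read_csv(fake_source)
    true_df["label"] = 1  # 1 = real
    fake_df["label"] = 0  # 0 = fake
    combined = pd.concat([true_df, fake_df], ignore_index=True)
    # Shuffle so that the two classes are interleaved before splitting.
    return combined.sample(frac=1, random_state=42).reset_index(drop=True)


# Tiny in-memory stand-ins for True.csv and Fake.csv (made-up rows).
true_demo = io.StringIO("title,text,date\nA,Body a,2017-01-01\n")
fake_demo = io.StringIO(
    "title,text,date\nB,Body b,2017-01-02\nC,Body c,2017-01-03\n"
)
news = load_news(true_demo, fake_demo)
```

With the real files, the call would be `load_news("True.csv", "Fake.csv")`, assuming both sit in the working directory.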
Key Observations
- Authentic articles: use formal language, topic-relevant vocabulary, and a coherent structure.
- False articles: tend to feature emotionally loaded words, recurring phrases, and a less organized layout. Certain word patterns and expressions are common in deceptive content.
- Visual analyses (like word clouds) confirmed noticeable differences in vocabulary usage between the two categories.
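Word clouds are driven by per-class word frequencies, so the same comparison can be checked numerically. This is a rough sketch using only the standard library; the helper `top_words`, its `min_len` cutoff, and the sample sentences are made up for illustration and do not come from the dataset.

```python
import re
from collections import Counter


def top_words(texts, n=3, min_len=4):
    """Return the n most frequent lowercase words with at least min_len letters."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if len(w) >= min_len)
    return counts.most_common(n)


# Toy, made-up snippets standing in for each class.
real_like = ["The senate passed the budget bill", "Senate committee reviews budget"]
fake_like = ["Shocking secret exposed", "You will not believe this shocking claim"]

real_top = top_words(real_like, n=2)
fake_top = top_words(fake_like, n=1)
```

Running `top_words` on the `text` column of each class gives the raw counts that a word cloud would render as font sizes.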
Workflow Breakdown
Data Preprocessing
- Removal of noise and irrelevant characters
- Lemmatization to unify word forms
- Noun extraction via part-of-speech tagging to enhance semantic feature quality
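The three preprocessing steps above can be sketched as follows. The README does not say which NLP library was used, so NLTK is an assumption here, and the helper names (`clean_text`, `keep_nouns`, `tag_and_lemmatize`) and cleaning rules are hypothetical. NLTK is imported lazily so that the pure-Python parts run without it.

```python
import re


def clean_text(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_nouns(tagged_lemmas):
    """Keep lemmas whose Penn Treebank tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [lemma for lemma, tag in tagged_lemmas if tag.startswith("NN")]


def tag_and_lemmatize(text):
    """Return (lemma, tag) pairs using NLTK (assumed library).

    Requires NLTK plus its POS tagger and WordNet data, fetched via nltk.download.
    """
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    tagged = nltk.pos_tag(clean_text(text).split())
    return [
        (lemmatizer.lemmatize(word, pos="n") if tag.startswith("NN") else word, tag)
        for word, tag in tagged
    ]


cleaned = clean_text("BREAKING: Senators vote 52-48!! See https://example.com")
nouns = keep_nouns([("senator", "NNS"), ("vote", "VBP"), ("bill", "NN")])
```

A full pass over one article would then be `keep_nouns(tag_and_lemmatize(article_text))`. Lowercasing before tagging keeps the vocabulary small but can make proper nouns harder for the tagger to recognize, so tagging the original casing first is a reasonable alternative.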
Feature Construction
- Employed pre-trained Word2Vec vectors to encode textual data into dense, meaning-rich formats
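A common way to turn per-word Word2Vec vectors into one fixed-length feature vector per article is to average the vectors of the words found in the vocabulary. The README does not state how the project pools word vectors or which pre-trained model it loads, so the averaging strategy, the function name `document_vector`, and the toy two-dimensional vectors below are all assumptions.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known.

    `vectors` is any mapping that supports `in` and `[]`, such as a plain dict
    or a gensim KeyedVectors object loaded with
    KeyedVectors.load_word2vec_format(path, binary=True).
    """
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Toy 2-dimensional vectors standing in for real pre-trained embeddings.
toy_vectors = {
    "senate": np.array([1.0, 0.0]),
    "budget": np.array([0.0, 1.0]),
}
doc_vec = document_vector(["senate", "budget", "unknownword"], toy_vectors, dim=2)
empty_vec = document_vector(["unknownword"], toy_vectors, dim=2)
```

Out-of-vocabulary tokens are skipped rather than mapped to zero, so they do not pull the average toward the origin.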
Classification Models
- Logistic Regression
- Decision Tree
- Random Forest
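The three classifiers can be trained and compared on the same split. This sketch assumes scikit-learn; the hyperparameters, the 80/20 split, and the function name `compare_models` are illustrative and not taken from the repository. Synthetic features stand in for the document vectors.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_models(X, y, seed=42):
    """Fit each model on a stratified split and report validation scores."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_val)
        results[name] = {
            "accuracy": accuracy_score(y_val, predictions),
            "f1": f1_score(y_val, predictions),
        }
    return results


# Synthetic stand-in data; real inputs would be the averaged Word2Vec vectors.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
results = compare_models(X_demo, y_demo)
```

Fixing `random_state` keeps the comparison repeatable, which matters when the validation set is small.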
Performance Metrics
- Accuracy on Validation Set: 86.00%
- Precision: 85.90%
- Recall: 87.01%
- F1-Score: 86.45%
Evaluation Summary
Class      Precision  Recall  F1-Score  Support
0 (Fake)   0.86       0.85    0.86      73
1 (Real)   0.86       0.87    0.86      77
Overall    0.86       0.86    0.86      150
Macro Average: Precision = 0.86, Recall = 0.86, F1 = 0.86
Weighted Average: Precision = 0.86, Recall = 0.86, F1 = 0.86
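The headline metrics and the per-class supports are enough to back out a confusion matrix. The counts below are inferred, not reported in the README: with 77 real articles, a recall of 87.01% implies 67 true positives, and a precision of 85.90% then implies 11 false positives, leaving 62 of the 73 fake articles correctly identified. The sketch recomputes the metrics from those counts as a consistency check, treating class 1 (Real) as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Inferred counts (not reported in the README); positive class = 1 (Real).
metrics = binary_metrics(tp=67, fp=11, fn=10, tn=62)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

The rounded percentages match the reported figures, which suggests the headline precision, recall, and F1 describe the Real class on a 150-sample validation set.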
Insights
- Semantic techniques enhance classification performance and reduce noise.
- Pre-trained Word2Vec simplifies feature engineering while improving depth of analysis.
- Random Forest consistently outperformed other models in accuracy and F1-score.
- This approach provides a practical foundation for scalable misinformation detection tools.
Enhancements Ahead
- Adapt Word2Vec embeddings using domain-specific datasets for improved relevance.
- Explore context-aware models like BERT to further capture linguistic nuances and dependencies.