https://github.com/coder5omkar/fake_news_detection

This project focuses on detecting fake news using semantic classification techniques. By leveraging Word2Vec embeddings and classical machine learning models, it captures the deeper meaning of news content. The approach enhances accuracy by filtering key linguistic features like nouns.

📘 Fake News Identification
🎯 Goal
The primary objective of this project is to build a semantic-based classification system that can effectively differentiate between authentic and misleading news articles. By utilizing Word2Vec embeddings, the project focuses on capturing the underlying meanings of the text, which are then analyzed using supervised learning algorithms for classification.

🏢 Real-World Relevance
With the increasing circulation of fabricated news, there's a growing need for intelligent tools that can automatically detect misinformation. This solution showcases how semantic analysis can assist digital platforms and end users in evaluating the trustworthiness of online content.

🗂️ Data Overview
The dataset includes two distinct CSV files:

🔹 True.csv – Contains 21,417 samples of verified news articles.

🔹 Fake.csv – Contains 23,502 samples of fabricated news stories.

Each record provides:

📌 title: Headline of the news item

📌 text: Full article content

📌 date: Date when the article was published
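
A minimal sketch of how the two files could be merged into one labeled collection, using only the standard library. The helper name `load_labeled`, the label convention (1 = real, 0 = fake, matching the evaluation table), and the in-memory sample rows are assumptions for illustration; the README does not show the project's own loading code.

```python
import csv
import io


def load_labeled(fileobj, label):
    """Read one CSV with title/text/date columns and attach a class label."""
    return [
        {
            "title": row["title"],
            "text": row["text"],
            "date": row["date"],
            "label": label,
        }
        for row in csv.DictReader(fileobj)
    ]


# Hypothetical one-row stand-ins so the sketch runs without the real files.
true_csv = io.StringIO(
    'title,text,date\n"Sample real headline","Sample real body.","2017-01-01"\n'
)
fake_csv = io.StringIO(
    'title,text,date\n"Sample fake headline","Sample fake body.","2017-01-02"\n'
)

# Assumed label convention: 1 = real (True.csv), 0 = fake (Fake.csv).
articles = load_labeled(true_csv, label=1) + load_labeled(fake_csv, label=0)

# With the real files, the same helper would be called like this:
#   with open("True.csv", newline="", encoding="utf-8") as f:
#       real_articles = load_labeled(f, label=1)
```

Keeping the label next to each row from the start avoids having to track which rows came from which file after the two sets are shuffled together.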

📊 Key Observations
✅ Authentic Articles: Use formal language, topic-relevant vocabulary, and a coherent structure.

⚠️ False Articles: Tend to feature emotionally loaded words, recurring phrases, and a less organized layout. Certain word patterns and expressions are common in deceptive content.

📈 Visual analyses (like word clouds) confirmed noticeable differences in vocabulary usage between the two categories.

🔧 Workflow Breakdown
🧹 Data Preprocessing
🔹 Removal of noise and irrelevant characters

🔹 Lemmatization to unify word forms

🔹 Noun extraction via part-of-speech tagging to improve semantic feature quality
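
The three preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `clean_text` and `keep_nouns` are hypothetical helper names, and `toy_tagger` with its five-word lexicon is a stand-in for a real lemmatizer and part-of-speech tagger (for example from NLTK or spaCy), which the README does not name. The noun test assumes Penn Treebank style tags, where noun tags start with `NN`.

```python
import re


def clean_text(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_nouns(tokens, tag_and_lemmatize):
    """Return the lemmas of tokens whose tag marks them as nouns.

    `tag_and_lemmatize` is any callable mapping a token list to a list of
    (lemma, tag) pairs; a real NLP library would be plugged in here.
    """
    return [
        lemma
        for lemma, tag in tag_and_lemmatize(tokens)
        if tag.startswith("NN")
    ]


# Toy stand-in lexicon (assumption): token -> (lemma, Penn Treebank tag).
TOY_LEXICON = {
    "senators": ("senator", "NNS"),
    "passed": ("pass", "VBD"),
    "the": ("the", "DT"),
    "bill": ("bill", "NN"),
    "quickly": ("quickly", "RB"),
}


def toy_tagger(tokens):
    """Look each token up in the toy lexicon; unknown tokens get tag 'UNK'."""
    return [TOY_LEXICON.get(token, (token, "UNK")) for token in tokens]


raw = "Senators PASSED the bill quickly!!! http://example.com/x"
tokens = clean_text(raw).split()
nouns = keep_nouns(tokens, toy_tagger)
print(nouns)  # ['senator', 'bill']
```

Passing the tagger in as a parameter keeps the filtering logic independent of whichever NLP library supplies the tags.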

🧠 Feature Construction
🔹 Pre-trained Word2Vec vectors encode each article's text as a dense vector that captures its meaning
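
One common way to turn per-word Word2Vec vectors into a single fixed-length feature vector per article is to average them. The README does not say how the project pools word vectors, so mean pooling here is an assumption, and the three-dimensional `TOY_EMBEDDINGS` table is a hypothetical stand-in for a real pre-trained model (whose vectors typically have hundreds of dimensions).

```python
def document_vector(tokens, embeddings, dim):
    """Average the vectors of all tokens found in the embedding table.

    Tokens missing from the table are skipped; if none are found,
    a zero vector of length `dim` is returned.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Toy embedding table (assumption): word -> 3-dimensional vector.
TOY_EMBEDDINGS = {
    "senator": [1.0, 0.0, 2.0],
    "bill": [3.0, 2.0, 0.0],
}

doc_vec = document_vector(["senator", "bill", "zzz"], TOY_EMBEDDINGS, dim=3)
print(doc_vec)  # [2.0, 1.0, 1.0]
```

The resulting vectors all have the same length regardless of article size, which is what classical models such as Logistic Regression or Random Forest expect as input.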

🧪 Classification Models
🔹 Logistic Regression

🔹 Decision Tree

🔹 Random Forest

📏 Performance Metrics
📊 Accuracy on Validation Set: 86.00%

📊 Precision (class 1, Real): 85.90%

📊 Recall (class 1, Real): 87.01%

📊 F1-Score (class 1, Real): 86.45%
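
The four figures above can be checked against each other with a short calculation. The confusion-matrix counts below are not reported in the README; they are inferred as the integer counts that reproduce the stated percentages and the class supports in the evaluation summary (73 Fake, 77 Real), treating class 1 (Real) as the positive class.

```python
# Inferred counts (assumption, not stated in the README):
tp = 67  # real articles predicted real
fn = 10  # real articles predicted fake
fp = 11  # fake articles predicted real
tn = 62  # fake articles predicted fake

total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
]:
    print(f"{name}: {value * 100:.2f}%")
```

With these counts the script prints 86.00%, 85.90%, 87.01%, and 86.45%, matching the reported values.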

📄 Evaluation Summary

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 (Fake) | 0.86 | 0.85 | 0.86 | 73 |
| 1 (Real) | 0.86 | 0.87 | 0.86 | 77 |
| Overall | 0.86 | 0.86 | 0.86 | 150 |

Macro Average: Precision = 0.86, Recall = 0.86, F1 = 0.86

Weighted Average: Precision = 0.86, Recall = 0.86, F1 = 0.86
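
The difference between the two averages is easy to see in code: the macro average is the plain mean of the per-class scores, while the weighted average weights each class by its support. The sketch below starts from the rounded per-class values in the summary above, so it only reproduces the averages to two decimal places; the helper names `macro_average` and `weighted_average` are illustrative.

```python
# Per-class scores copied from the evaluation summary (rounded values).
per_class = {
    "fake": {"precision": 0.86, "recall": 0.85, "f1": 0.86, "support": 73},
    "real": {"precision": 0.86, "recall": 0.87, "f1": 0.86, "support": 77},
}


def macro_average(metric):
    """Unweighted mean of a metric across classes."""
    return sum(row[metric] for row in per_class.values()) / len(per_class)


def weighted_average(metric):
    """Mean of a metric across classes, weighted by class support."""
    total_support = sum(row["support"] for row in per_class.values())
    return (
        sum(row[metric] * row["support"] for row in per_class.values())
        / total_support
    )


for metric in ("precision", "recall", "f1"):
    print(
        f"{metric}: macro={macro_average(metric):.2f}, "
        f"weighted={weighted_average(metric):.2f}"
    )
```

Because the two classes are nearly balanced (73 vs. 77 samples), the macro and weighted averages agree to two decimals here; they would diverge on a more skewed split.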

🔍 Insights
🔹 Semantic techniques enhance classification performance and reduce noise.

🔹 Pre-trained Word2Vec simplifies feature engineering while improving depth of analysis.

🔹 Random Forest consistently outperformed the other two models (Logistic Regression and Decision Tree) in accuracy and F1-score.

🔹 This approach provides a practical foundation for scalable misinformation detection tools.

🔮 Enhancements Ahead
🔹 Adapt Word2Vec embeddings using domain-specific datasets for improved relevance.

🔹 Explore context-aware models like BERT to further capture linguistic nuances and dependencies.