https://github.com/prathush-kumar/fake_news_detection
Developed an intelligent system to identify and classify fake news articles using Natural Language Processing (NLP) and Machine Learning techniques.
https://github.com/prathush-kumar/fake_news_detection
classification machine-learning natural-language-processing numpy pandas sklearn-library
Last synced: about 2 months ago
JSON representation
Developed an intelligent system to identify and classify fake news articles using Natural Language Processing (NLP) and Machine Learning techniques.
- Host: GitHub
- URL: https://github.com/prathush-kumar/fake_news_detection
- Owner: Prathush-Kumar
- Created: 2024-12-31T17:22:42.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-12-03T14:22:04.000Z (7 months ago)
- Last Synced: 2025-12-03T17:48:47.351Z (7 months ago)
- Topics: classification, machine-learning, natural-language-processing, numpy, pandas, sklearn-library
- Language: Jupyter Notebook
- Homepage:
- Size: 97.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
#
Fake News Detection Using Machine Learning
## Project Overview
In the digital age, the rapid spread of fake news through various online platforms has become a significant issue, leading to misinformation and societal harm. This project aims to develop a robust machine-learning model capable of classifying news articles as either true or false. By leveraging various machine learning techniques, we provide a valuable tool to combat misinformation.
explore multiple machine learning techniques, including Logistic Regression, Decision Tree Classifier, Gradient Boosting Classifier, and Random Forest Classifier, to determine the most effective approach for fake news detection. The performance of each model is evaluated based on accuracy, precision, recall, and F1 score..
##
Objective
The primary objective of this project is to build a machine-learning model that accurately classifies news articles into two categories:
- **True:** Genuine news articles
- **False:** Fake or fabricated news articles
##
Methodology
###
1. Data Collection and Preprocessing
- **Data Collection:** The dataset consists of labeled news articles categorized as either true or false. It is sourced from reliable repositories to ensure a diverse range of topics and sources.
- **Data Preprocessing:** This includes cleaning the data by removing irrelevant information, handling missing values, and performing text normalization (e.g., lowercasing, punctuation removal). Techniques such as tokenization, stop-word removal, and lemmatization are applied to prepare the text for model training.
###
2. Feature Engineering
- **Text Representation:** Convert text data into numerical features using techniques like Count Vectorization and Term Frequency-Inverse Document Frequency (TF-IDF) to make it suitable for machine learning models.
- **Additional Features:** Enhance the model by incorporating features such as article length, presence of specific keywords, and sentiment analysis.
###
3. Model Selection
- **Logistic Regression:** A baseline model that uses linear relationships between features and labels for classification.
- **Decision Tree Classifier:** A non-linear model that captures complex patterns and provides insights into feature importance.
- **Gradient Boosting Classifier:** An ensemble technique that combines weak learners to create a strong predictive model.
- **Random Forest Classifier:** Another ensemble method that builds multiple decision trees and averages their predictions to improve accuracy and reduce overfitting.
###
4. Model Evaluation
Evaluate each model using the following metrics:
- **Accuracy:** The proportion of correctly classified instances.
- **Precision:** The model's ability to identify positive instances correctly.
- **Recall:** The model's ability to capture all relevant positive instances.
- **F1 Score:** A balanced measure considering both precision and recall.
##
Dataset
The dataset used consists of a balanced set of true and false news articles. It is split into training and testing sets to evaluate the model's performance on unseen data. Ensuring the quality and diversity of the dataset is crucial for developing a robust model.
##
Dependencies
Before running the code, ensure the following libraries and packages are installed:
- **Nltk**
- **Python**
- **Scikit-learn**
- **Pandas**
- **NumPy**
- **Regular Expression**
## Results
##
Classification Report of Logistic regression
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 0.99 | 0.98 | 0.99 | 4714 |
| **1** | 0.97 | 0.99 | 0.98 | 4262 |
| **Accuracy** | | | 0.98 | 8976 |
| **Macro Avg** | 0.98 | 0.98 | 0.98 | 8976 |
| **Weighted Avg** | 0.98 | 0.98 | 0.98 | 8976 |
##
Classification Report of Decision Tree Classifier
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 0.99 | 0.98 | 0.99 | 4714 |
| **1** | 0.97 | 0.99 | 0.98 | 4262 |
| **Accuracy** | | | 0.98 | 8976 |
| **Macro Avg** | 0.98 | 0.98 | 0.98 | 8976 |
| **Weighted Avg** | 0.98 | 0.98 | 0.98 | 8976 |
##
Classification Report of Gradient Boosting Classifier
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 1.00 | 1.00 | 1.00 | 4714 |
| **1** | 0.99 | 1.00 | 1.00 | 4262 |
| **Accuracy** | | | 1.00 | 8976 |
| **Macro Avg** | 1.00 | 1.00 | 1.00 | 8976 |
| **Weighted Avg** | 1.00 | 1.00 | 1.00 | 8976 |
##
Classification Report of Random forest
| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| **0** | 0.99 | 0.97 | 0.98 | 4714 |
| **1** | 0.97 | 1.00 | 0.98 | 4262 |
| **Accuracy** | | | 0.98 | 8976 |
| **Macro Avg** | 0.98 | 0.98 | 0.98 | 8976 |
| **Weighted Avg** | 0.98 | 0.98 | 0.98 | 8976 |
##
Model selection
### Gradient Boosting Classifier
The Gradient Boosting Classifier was selected as the final model due to its impressive performance across various metrics. Gradient Boosting is an ensemble technique that builds multiple weak learners (typically decision trees) in a sequential manner, where each subsequent model corrects the errors of its predecessor. This results in a powerful and accurate predictive model.
#### **Performance Summary**
- **Precision:** The Gradient Boosting Classifier achieved a precision of 1.00 for class 0 (true news) and 0.99 for class 1 (fake news). High precision indicates that the model correctly identifies a high proportion of actual positive instances without misclassifying too many negative instances as positive.
- **Recall:** The model demonstrated perfect recall of 1.00 for class 0, meaning it identified all actual true news articles. For class 1, recall is also 1.00, signifying that it captured all fake news articles accurately.
- **F1-Score:** The F1-score of 1.00 for both classes reflects a balance between precision and recall, showcasing the model's overall effectiveness in identifying both true and fake news articles accurately.
- **Accuracy:** With an overall accuracy of 1.00, the Gradient Boosting Classifier correctly classified all instances in the dataset, indicating exceptional performance.
- **Macro Average:** The macro average values of precision, recall, and F1-score are all 1.00, demonstrating the model's consistent performance across both classes.
- **Weighted Average:** The weighted average scores, also 1.00, confirm that the model performs uniformly well across different class distributions in the dataset.
#### **Conclusion**
#### **Conclusion**
The Gradient Boosting Classifier has proven to be highly effective for the fake news detection task, providing excellent results across all performance metrics. Its ability to handle complex patterns in the data and achieve perfect classification makes it a strong candidate for deployment in real-world applications to combat misinformation.