Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alpayariyak/fake-news-classification
Classifying Fake News with 99.6% Accuracy using ML and Exploring NLP Data Processing Techniques
https://github.com/alpayariyak/fake-news-classification
machine-learning ml nlp nlp-machine-learning
Last synced: 18 days ago
JSON representation
Classifying Fake News with 99.6% Accuracy using ML and Exploring NLP Data Processing Techniques
- Host: GitHub
- URL: https://github.com/alpayariyak/fake-news-classification
- Owner: alpayariyak
- Created: 2022-10-04T04:03:26.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-10-04T07:46:26.000Z (over 2 years ago)
- Last Synced: 2024-10-29T12:37:35.187Z (2 months ago)
- Topics: machine-learning, ml, nlp, nlp-machine-learning
- Language: Python
- Homepage:
- Size: 48.3 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Detecting Fake News with NLP using ML
Predicting whether a News Article is Real or Fake with 99.6% Accuracy using only the Title and Text, while exploring different NLP Processing techniques and ML Models.
>Dataset: [Fake + Real News Dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)
## preprocessing.py
__Real Time Data Loading and Processing:__Use __print_analytics__ to view information on the input dataset and WordClouds in the terminal.
To load and process data in real time, enable __real_time__.
```python
print_analytics = False
real_time = False
```__POS-Tagging__:
To create new datasets where text is filtered with desired Parts-Of-Speech, modify the list below.
```python
pos_combination_list = [['NN', 'VB'], ['NN', 'JJ'], ['NN'], ['VB', 'JJ']]
```
[List of Parts-of-Speech and their abbreviations](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)
## train.py
__Training Models with different NLP Techniques:__Use __generate_reports__ function to create a CSV with Accuracy, Recall and Precision for each combination of the following inputs:
>__Vectorizer:__ the feature of choice for converting text data to numerical input.
>__Model:__ Machine Learning models.
>__Data:__ can be filtered to contain specific POS. Generated after __preprocessing.py__ has been executed.
To accomodate multiple combinations, use dictionaries for input to the __generate_reports__ function, as shown below.
```python
vectorizers = {
'Name of Vectorizer': Vectorizer Object, # Format
'TFIDF': TfidfVectorizer(min_df=10) } # Examplemodels = {
'Model Name': Model Object,
'Logistic_Regression': LogisticRegression() }filtered_data = {
'Filter Name': 'Path to filtered Data', # Created by preprocessing.py
'Noun_Adjective': 'processed_data/pos_filtered/NN_JJ_dataset.pkl' }generate_reports(vectorizers, models, filtered_data)
```## Directories
>__Data:__ [Original Fake + Real News Dataset](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)>__Analytics:__ WordClouds with Word Frequency and CSVs containing top 100 most used words for each dataset.
>__Model_Results:__ Confusion Matrices and a report on Accuracy, Precision and Recall for each combination of Models, Filters and Vectorizers - __result.csv__.
## Results
__No POS Filter:__|ML Model |Feature |Precision |Recall |Accuracy |
|-------------------|---------------|--------------|-----------|-----------|
|Logistic Regression|Frequency Count|0.996 |0.997 |0.996 |
|Random Forest |Frequency Count|0.995 |0.993 |0.994 |
|Random Forest |TFIDF |0.991 |0.990 |0.991 |
|Logistic Regression|TFIDF |0.983 |0.988 |0.986 |
|Naive Bayes |Frequency Count|0.945 |0.952 |0.951 |
|Naive Bayes |TFIDF |0.931 |0.936 |0.936 |__Top 5 Best Filtered Results:__
|ML Model |Feature |Filter |Precision |Recall |Accuracy |
|-------------------|---------------|--------------|-----------|-----------|-----------|
|Logistic Regression|Frequency_Count|Noun Adjective|0.981 |0.974 |0.978 |
|Logistic Regression|TFIDF |Noun Adjective|0.970 |0.971 |0.972 |
|Logistic Regression|Frequency Count|Noun Verb |0.975 |0.964 |0.971 |
|Random Forest |Frequency Count|Noun Adjective|0.973 |0.964 |0.970 |
|Logistic Regression|Frequency Count|Noun |0.970 |0.967 |0.970 |__Top 5 Worst Overall Results:__
|ML Model |Feature |Filter |Precision|Recall|Accuracy|
|-----------|---------------|--------------|---------|------|--------|
|Naive Bayes|TFIDF |Noun Adjective|0.923 |0.907 |0.920 |
|Naive Bayes|Frequency Count|Noun |0.918 |0.908 |0.917 |
|Naive Bayes|TFIDF |Verb Adjective|0.914 |0.906 |0.914 |
|Naive Bayes|TFIDF |Noun Verb |0.917 |0.895 |0.911 |
|Naive Bayes|TFIDF |Noun |0.911 |0.896 |0.908 |