{"id":27918115,"url":"https://github.com/sergio11/fake_news_classifier","last_synced_at":"2026-03-06T09:31:03.852Z","repository":{"id":270357987,"uuid":"799534521","full_name":"sergio11/fake_news_classifier","owner":"sergio11","description":"📰 Fighting Fake News with machine learning! 🤖 Using source-based classification to detect misinformation using TF-IDF + RandomForest vs Embeddings + CNN. 🔍","archived":false,"fork":false,"pushed_at":"2025-04-21T17:38:04.000Z","size":1092,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-06T18:09:23.488Z","etag":null,"topics":["cnn","data-processing","deep-learning","fake-news-detection","machine-learning","misinformation","nlp","random-forest","tensorflow","text-classification","tf-idf"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sergio11.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-12T12:41:06.000Z","updated_at":"2025-04-21T17:38:07.000Z","dependencies_parsed_at":null,"dependency_job_id":"d6a36d78-2887-454c-a55c-e0184bde58ab","html_url":"https://github.com/sergio11/fake_news_classifier","commit_stats":null,"previous_names":["sergio11/fake_news_classifier"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/sergio11/fake_news_classifier","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Ffake_news_classifier","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/sergio11%2Ffake_news_classifier/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Ffake_news_classifier/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Ffake_news_classifier/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sergio11","download_url":"https://codeload.github.com/sergio11/fake_news_classifier/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sergio11%2Ffake_news_classifier/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30168966,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T07:56:45.623Z","status":"ssl_error","status_checked_at":"2026-03-06T07:55:55.621Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cnn","data-processing","deep-learning","fake-news-detection","machine-learning","misinformation","nlp","random-forest","tensorflow","text-classification","tf-idf"],"created_at":"2025-05-06T18:09:17.927Z","updated_at":"2026-03-06T09:31:03.828Z","avatar_url":"https://github.com/sergio11.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📰 Fighting Misinformation — A Personal Learning Project on Fake News 
Classification\n\nThis project was developed as part of my hands-on journey through a **Deep Learning course**, where I focused on understanding how AI can be applied to real-world challenges—particularly the spread of **fake news** across digital platforms.\n\nIn today's fast-paced digital world 🌐, misinformation spreads rapidly, often distorting public perception and influencing critical decisions. From shaping elections 🗳️ to triggering widespread panic during crises, the consequences are real and significant.\n\nTo explore this issue from a practical and technical perspective, I built a machine learning model that classifies news articles as **real or fake**, using a **source-based approach**. Rather than analyzing article content directly, the model looks at structured metadata—such as:\n\n- The **author** of the article ✍️  \n- The **publication date** 📅  \n- And the **reputation of the source** itself 🏅\n\nBy training on this structured data, the goal was to simulate how credibility might be algorithmically assessed based on source characteristics alone. This approach reflects a more **explainable and efficient** path to automated verification systems, while also encouraging deeper awareness of where our information comes from.\n\n\u003e 🤖 Tools \u0026 Techniques: I implemented this using **scikit-learn** and **Pandas**, exploring different classifiers (e.g., Logistic Regression, Random Forests), and working through typical steps like preprocessing, feature selection, and evaluation with metrics like accuracy and F1-score.\n\nMore than just a technical exercise, this project reinforced my understanding of how machine learning models can assist in **digital literacy**, promoting more **informed and critical media consumption**.\n\n🙏 I would like to extend my heartfelt gratitude to [Santiago Hernández, an expert in Cybersecurity and Artificial Intelligence](https://www.udemy.com/user/shramos/). 
His incredible course on Deep Learning, available at Udemy, was instrumental in shaping the development of this project. The insights and techniques learned from his course were crucial in crafting the neural network architecture used in this classifier.\n\n🔍 This project was inspired by the Kaggle notebook [*EDA and Modelling on News Dataset (99% accuracy)*](https://www.kaggle.com/code/bansalvishesh/eda-and-modelling-on-news-dataset-99-accuracy) by **Vishesh Bansal**, which provided valuable insights into text preprocessing and classification workflows.\n\n\u003cp align=\"center\"\u003e\n   \u003cimg src=\"https://img.shields.io/badge/pypi-3775A9?style=for-the-badge\u0026logo=pypi\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Python-FFD43B?style=for-the-badge\u0026logo=python\u0026logoColor=blue\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Keras-FF0000?style=for-the-badge\u0026logo=keras\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/TensorFlow-FF6F00?style=for-the-badge\u0026logo=tensorflow\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Jupyter-F37626.svg?\u0026style=for-the-badge\u0026logo=Jupyter\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Pandas-2C2D72?style=for-the-badge\u0026logo=pandas\u0026logoColor=white\" /\u003e\n   \u003cimg src=\"https://img.shields.io/badge/Numpy-777BB4?style=for-the-badge\u0026logo=numpy\u0026logoColor=white\" /\u003e\n\u003c/p\u003e\n\n## ⚠️ Disclaimer  \n**This project was developed for educational and research purposes only.** It is an academic exploration of **machine learning techniques for source-based fake news classification**.  \n\nThe models and techniques presented in this repository **are not intended for real-world misinformation detection or journalistic verification**. 
They serve as a proof of concept and have not been extensively tested for accuracy, bias, or robustness in diverse media environments.  \n\nWhile this project leverages publicly available datasets and references existing research, **users should not rely on its outputs for making factual or editorial decisions**. Always verify news from multiple trusted sources.  \n\n## 🌟 Explore My Other Cutting-Edge AI Projects! 🌟\n\nIf you found this project intriguing, I invite you to check out my other AI and machine learning initiatives, where I tackle real-world challenges across various domains:\n\n+ [🌍 Advanced Classification of Disaster-Related Tweets Using Deep Learning 🚨](https://github.com/sergio11/disasters_prediction)  \nUncover how social media responds to crises in real time using **deep learning** to classify tweets related to disasters.\n\n+ [📰 Fighting Misinformation: Source-Based Fake News Classification 🕵️‍♂️](https://github.com/sergio11/fake_news_classifier)  \nCombat misinformation by classifying news articles as real or fake based on their source using **machine learning** techniques.\n\n+ [🛡️ IoT Network Malware Classifier with Deep Learning Neural Network Architecture 🚀](https://github.com/sergio11/iot_network_malware_classifier)  \nDetect malware in IoT network traffic using **Deep Learning Neural Networks**, offering proactive cybersecurity solutions.\n\n+ [📧 Spam Email Classification using LSTM 🤖](https://github.com/sergio11/spam_email_classifier_lstm)  \nClassify emails as spam or legitimate using a **Bi-directional LSTM** model, implementing NLP techniques like tokenization and stopword removal.\n\n+ [💳 Fraud Detection Model with Deep Neural Networks (DNN)](https://github.com/sergio11/online_payment_fraud) \nDetect fraudulent transactions in financial data with **Deep Neural Networks**, addressing imbalanced datasets and offering scalable solutions.\n\n+ [🧠🚀 AI-Powered Brain Tumor 
Classification](https://github.com/sergio11/brain_tumor_classification_cnn)  \nClassify brain tumors from MRI scans using **Deep Learning**, CNNs, and Transfer Learning for fast and accurate diagnostics.\n\n+ [📊💉 Predicting Diabetes Diagnosis Using Machine Learning](https://github.com/sergio11/diabetes_prediction_ml)  \nCreate a machine learning model to predict the likelihood of diabetes using medical data, helping with early diagnosis.\n\n+ [🚀🔍 LLM Fine-Tuning and Evaluation](https://github.com/sergio11/llm_finetuning_and_evaluation)  \nFine-tune large language models like **FLAN-T5**, **TinyLLAMA**, and **Aguila7B** for various NLP tasks, including summarization and question answering.\n\n+ [📰 Headline Generation Models: LSTM vs. Transformers](https://github.com/sergio11/headline_generation_lstm_transformers)  \nCompare **LSTM** and **Transformer** models for generating contextually relevant headlines, leveraging their strengths in sequence modeling.\n\n+ [🩺💻 Breast Cancer Diagnosis with MLP](https://github.com/sergio11/breast_cancer_diagnosis_mlp)  \nAutomate breast cancer diagnosis using a **Multi-Layer Perceptron (MLP)** model to classify tumors as benign or malignant based on biopsy data.\n\n+ [Deep Learning for Safer Roads 🚗 Exploring CNN-Based and YOLOv11 Driver Drowsiness Detection 💤](https://github.com/sergio11/safedrive_drowsiness_detection)\nComparing driver drowsiness detection with CNN + MobileNetV2 vs YOLOv11 for real-time accuracy and efficiency 🧠🚗. Exploring both deep learning models to prevent fatigue-related accidents 😴💡.\n\n## 📊 About the Dataset\n### 🔎 Context\nSocial media platforms are a treasure trove of content, with **news** being one of the most consumed categories. However, not all news is authentic. Fake news, whether posted by politicians, news outlets, or civilians, can have far-reaching consequences. 
\n\n**Challenges**:\n- Manual classification of news is **time-consuming** and prone to **bias**.\n- Verifying authenticity remains a critical task in the fight against misinformation.\n\n### 🔒 Source\nPublished paper: [Source-Based Fake News Classification](http://www.ijirset.com/upload/2020/june/115_4_Source.PDF)\n\n### 🔧 Features\n- Preprocessed data from the **Getting Real about Fake News** dataset.\n- Eliminated skew for improved reliability.\n- Comprehensive inclusion of source information, including author names, publication dates, and labels.\n\n## 🚀 Motivation\nIn an age where fake WhatsApp forwards and misleading Tweets influence public opinion, it’s crucial to develop tools to:\n- Mitigate the spread of misinformation.\n- Inform users about the nature of the news they consume.\n\nThis project’s inspiration lies in creating:\n1. **Practical applications** to analyze and classify news articles.\n2. **Plugins** and tools for easy access to fact-checking.\n3. **Awareness** campaigns about the consequences of consuming and spreading fake news.\n\n## 🌟 Highlights\n- **Source-Based Labeling**: Ensures credibility by tracking the origin of news articles.\n- **Automation**: Reduces human bias in classification.\n- **Informed Consumption**: Helps users make smarter decisions about the news they trust.\n\n## ⚖️ Comparison of Approaches\n\nIn this project, two machine learning approaches are evaluated for classifying fake news:\n\n1. **RandomForestClassifier using TF-IDF**\n2. **Embeddings + CNN (Convolutional Neural Networks)**\n\n### 1. **RandomForestClassifier using TF-IDF**\n- **TF-IDF** (Term Frequency-Inverse Document Frequency) is a traditional text preprocessing technique that transforms text data into a high-dimensional sparse vector space. This method measures the importance of a word in a document relative to its frequency across all documents.\n- The **RandomForestClassifier** then uses this vectorized representation for classification. 
Random forests are an ensemble method that builds multiple decision trees and combines their outputs, typically resulting in a strong and reliable classifier.\n  \n   **Pros**:\n   - Efficient and works well for smaller datasets.\n   - Simple to implement and interpret.\n  \n   **Cons**:\n   - The sparse representation of text doesn’t capture the semantic meaning of words or their contextual relationships.\n   - May struggle with large datasets or when the relationships between words are complex.\n\n### 2. **Embeddings + CNN (Convolutional Neural Networks)**\n- **Embeddings** are dense, lower-dimensional vector representations of words that capture their semantic meaning. By mapping words with similar meanings closer together in a vector space, embeddings provide more context and depth compared to traditional vectorization methods like TF-IDF.\n- The **CNN** architecture is well-suited for text classification tasks. In this case, convolutional layers capture local patterns in the text, and pooling layers help reduce dimensionality. 
CNNs can learn more abstract and hierarchical features from text, which is useful in identifying subtle patterns and relationships that might indicate whether news is fake or real.\n  \n   **Pros**:\n   - Better at capturing semantic relationships and context of words.\n   - Suitable for large and complex datasets with nuanced patterns.\n   - Can provide higher performance in text classification tasks.\n  \n   **Cons**:\n   - Requires larger datasets for training.\n   - Needs more computational resources and may take longer to train.\n\n### **Model Evaluation**\n- Both approaches were trained and evaluated on the **Getting Real about Fake News** dataset.\n- The **RandomForestClassifier using TF-IDF** showed decent performance for basic tasks but struggled to capture deeper semantic meaning and context.\n- The **Embeddings + CNN** approach outperformed the traditional method in both training and testing accuracy, as it was able to better capture the relationships between words and classify news more effectively.\n\n### Conclusion\nThe results of this comparison highlight the advantages of using **Embeddings + CNN** for more complex text classification tasks, especially in dealing with large, high-dimensional datasets. However, **RandomForestClassifier using TF-IDF** remains a useful and simpler tool for tasks where computational resources or training data are limited. This project shows that using a source-based approach combined with machine learning techniques can effectively aid in the detection of fake news.\n\n
✨ Let’s Fight Fake News Together! 🕵️‍♂️\n\n## **🙏 Acknowledgments**\n\n- Dataset: **Getting Real about Fake News**\n  - Selected for its detailed inclusion of source information, crucial for verifying authenticity.\n- Special thanks to the creators and contributors of this dataset for enabling research in combating misinformation.\n  \nA huge **thank you** to **ruchi798** for providing the dataset that made this project possible! 🌟 The dataset can be found on [Kaggle](https://www.kaggle.com/datasets/ruchi798/source-based-news-classification/data). Your contribution is greatly appreciated! 🙌\n\nThroughout the development of this project, I drew inspiration from several community contributions that tackled fake news classification from different angles. One particularly valuable resource was the Kaggle notebook by **Vishesh Bansal**, titled [*EDA and Modelling on News Dataset (99% accuracy)*](https://www.kaggle.com/code/bansalvishesh/eda-and-modelling-on-news-dataset-99-accuracy).\n\nThis notebook provides a thorough exploratory data analysis of news content and experiments with text preprocessing techniques such as **TF-IDF** and **word embeddings** using TensorFlow. 
I am grateful for the educational value of such community-driven contributions, which not only accelerate individual learning but also foster collaborative research and shared growth within the field.\n\n## License ⚖️\n\nThis project is licensed under the MIT License, an open-source software license that allows developers to freely use, copy, modify, and distribute the software. 🛠️ This includes use in both personal and commercial projects, with the only requirement being that the original copyright notice and permission notice are retained. 📄\n\nPlease note the following limitations:\n\n- The software is provided \"as is\", without any warranties, express or implied. 🚫🛡️\n- If you distribute the software, whether in original or modified form, you must include the original copyright notice and license. 📑\n- The license allows for commercial use, but you cannot claim ownership over the software itself. 
🏷️\n\nThe goal of this license is to maximize freedom for developers while maintaining recognition for the original creators.\n\n```\nMIT License\n\nCopyright (c) 2024 Dream software - Sergio Sánchez \n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsergio11%2Ffake_news_classifier","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsergio11%2Ffake_news_classifier","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsergio11%2Ffake_news_classifier/lists"}