{"id":16597826,"url":"https://github.com/san089/big_data_project","last_synced_at":"2026-03-13T13:32:08.555Z","repository":{"id":113013041,"uuid":"173197315","full_name":"san089/Big_Data_Project","owner":"san089","description":"Fake News Detection - Feature Extraction using Vectorization such as Count Vectorizer, TFIDF Vectorizer, Hash Vectorizer,. Then used an Ensemble model to classify whether the news is fake or not.","archived":false,"fork":false,"pushed_at":"2020-02-21T21:40:00.000Z","size":12987,"stargazers_count":19,"open_issues_count":0,"forks_count":12,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-10-05T08:37:36.088Z","etag":null,"topics":["classifiers","ensemble-model","fakenewsdetection","machine-learning","news-classification","scikit-learn","text-mining","textclassification","vectorization","vectorizers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/san089.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-28T22:29:15.000Z","updated_at":"2025-03-30T01:06:02.000Z","dependencies_parsed_at":"2023-06-05T11:31:12.631Z","dependency_job_id":null,"html_url":"https://github.com/san089/Big_Data_Project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/san089/Big_Data_Project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/san089%2FBig_Data_Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/san089%2FBig_Data_Pr
oject/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/san089%2FBig_Data_Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/san089%2FBig_Data_Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/san089","download_url":"https://codeload.github.com/san089/Big_Data_Project/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/san089%2FBig_Data_Project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30467802,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T11:00:43.441Z","status":"ssl_error","status_checked_at":"2026-03-13T11:00:23.173Z","response_time":60,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classifiers","ensemble-model","fakenewsdetection","machine-learning","news-classification","scikit-learn","text-mining","textclassification","vectorization","vectorizers"],"created_at":"2024-10-12T00:06:42.327Z","updated_at":"2026-03-13T13:32:08.516Z","avatar_url":"https://github.com/san089.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Big_Data_Project - Fake News Detection\n\nIn this project we demonstrated the use of machine learning algorithms for text classification. 
We worked on classifying whether a given news article is fake or real.\n\n### Data Cleaning and Preprocessing\n- Removed special characters from the text\n- Spell-checked all the documents\n- Removed the stop words\n- Vectorized the documents\n\n### Vectorization\nFor vectorization we used three vectorizers: Count Vectorizer, TFIDF Vectorizer, and Hash Vectorizer.\n\n### Classification\nFor classification we used Multinomial Naive Bayes, Support Vector Machine (LinearSVC), and PassiveAggressiveClassifier.\nWe compared the performance of the vectorizers as well as the classifiers.\nFinally, we combined the classifiers in an ensemble model to get higher accuracy, using the scikit-learn max voting classifier.\n\n### How to run\nClone the project and run Main.py inside the src folder.\n\n`python Main.py`\n\n\n## Results\n\n```\n[nltk_data] Downloading package stopwords to C:\\Users\\Sanchit\n[nltk_data]     Kumar\\AppData\\Roaming\\nltk_data...\n[nltk_data]   Package stopwords is already up-to-date!\nRecords Count:  6335\nColumn Count :  4\nColumns :  ['id' 'title' 'text' 'label']\nCount of FAKE and REAL labels :\n        text\nlabel\nFAKE   3164\nREAL   3171\n\n\nStarting Data Cleaning Process.....\nRunning spell check, stemming and stop word removal.....\nData Cleaning Process Completed.\n\n\nRunning Naive Bayes with Count Vectorizer...\nProcess Completed.\n\n\nRunning Naive Bayes with TFIDF Vectorizer...\nProcess Completed.\n\n\nRunning Naive Bayes with Hash Vectorizer...\nProcess Completed.\n######################## NAIVE BAYES ANALYSIS ########################\n\nModel accuracy with Count Vectorizer :  89.04830224772836\nModel accuracy with TFIDF Vectorizer :  88.47441415590626\nModel accuracy with Hash Vectorizer :  81.34863701578192\n\n######################################################################\n\n\nRunning SVM with Count Vectorizer...\nProcess Completed.\n\n\nRunning SVM with TFIDF Vectorizer...\nProcess Completed.\n\n\nRunning SVM with Hash Vectorizer...\nProcess 
Completed.\n######################## SVM ANALYSIS ########################\n\nModel accuracy with Count Vectorizer :  88.6178861788618\nModel accuracy with TFIDF Vectorizer :  90.14825442372072\nModel accuracy with Hash Vectorizer :  91.96556671449068\n\n######################################################################\n\n\nRunning Passive Agressive with Count Vectorizer...\nProcess Completed.\n\n\nRunning Passive Agressive with TFIDF Vectorizer...\nProcess Completed.\n\n\nRunning Passive Agressive with Hash Vectorizer...\nProcess Completed.\n######################## PASSIVE AGRESSIVE ANALYSIS ########################\n\nModel accuracy with Count Vectorizer :  89.38307030129124\nModel accuracy with TFIDF Vectorizer :  92.58727881396462\nModel accuracy with Hash Vectorizer :  92.01339072214252\n\n######################################################################\nFinal Accuracy is :  0.9340028694404591\n---------------------------------------------------------------------------------------\n\n\n######################## Vectorizer Time Stats ########################\n\nTime Taken by Vectorizers\n\nCount Vectorizer : 6.0200676918029785\nTFIDF Vectorizer : 59.66688680648804\nHash Vectorizer : 2.35701847076416\n\n\n######################## Classifier Time Stats ########################\n\n\nNAIVE BAYES\nTime taken with Count Vectorizer : 0.05902385711669922\nTime taken with TFIDF Vectorizer : 0.40993690490722656\nTime taken with Hash Vectorizer : 0.087799072265625\n\nSVM\nTime taken with Count Vectorizer : 7.1091132164001465\nTime taken with TFIDF Vectorizer : 5.953486919403076\nTime taken with Hash Vectorizer : 2.0117762088775635\n\nPassive Agressive\nTime taken with Count Vectorizer : 0.24056363105773926\nTime taken with TFIDF Vectorizer : 3.281041383743286\nTime taken with Hash Vectorizer : 
0.3645296096801758\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsan089%2Fbig_data_project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsan089%2Fbig_data_project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsan089%2Fbig_data_project/lists"}