{"id":20709601,"url":"https://github.com/ineerav/tfidf-map-reduce","last_synced_at":"2026-04-07T21:31:49.244Z","repository":{"id":218510903,"uuid":"719777162","full_name":"INeerav/tfidf-map-reduce","owner":"INeerav","description":"Running Tf-Idf using spark streaming on hillary clinton's infamous leaked email data set https://www.kaggle.com/datasets/kaggle/hillary-clinton-emails","archived":false,"fork":false,"pushed_at":"2024-04-20T15:12:35.000Z","size":14,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-07T02:42:42.991Z","etag":null,"topics":["aws","emr","maven","pig-latin","shell","spark","spring-boot","tf-idf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/INeerav.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-11-16T22:02:04.000Z","updated_at":"2024-04-20T15:12:38.000Z","dependencies_parsed_at":"2024-01-22T13:26:52.298Z","dependency_job_id":"deff2b9a-d79b-46fa-8395-0fb7c0e14cd5","html_url":"https://github.com/INeerav/tfidf-map-reduce","commit_stats":null,"previous_names":["ineerav/tfidf-map-reduce"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/INeerav/tfidf-map-reduce","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INeerav%2Ftfidf-map-reduce","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INe
erav%2Ftfidf-map-reduce/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INeerav%2Ftfidf-map-reduce/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INeerav%2Ftfidf-map-reduce/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/INeerav","download_url":"https://codeload.github.com/INeerav/tfidf-map-reduce/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/INeerav%2Ftfidf-map-reduce/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31530641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T16:28:08.000Z","status":"ssl_error","status_checked_at":"2026-04-07T16:28:06.951Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","emr","maven","pig-latin","shell","spark","spring-boot","tf-idf"],"created_at":"2024-11-17T02:07:11.753Z","updated_at":"2026-04-07T21:31:49.220Z","avatar_url":"https://github.com/INeerav.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tfidf-map-reduce\n\n- Problem : Find the most common words in the emails\n- Dataset : This email dataset is publicly available on data.world, file\nsize is 25 MB with 22 columns containing all the required\ncolumns such as email 
subject, body, to, from, attachments\nand timestamp with enough complexity to continue with this\nassignment.\nThis famous/infamous dataset was released by the US State\nDepartment at the time of the US election roughly 7 years\nago\n- Clinton emails dataset\nhttps://data.world/briangriffey/clinton-emails/workspace/file?filename=Emails.csv\n\n\n\n## Tech stack\n\nEMR cluster.\n- Filesystem : Hadoop\n- File format : Parquet, Avro\n- AWS CloudFormation IaaS\n- Versions: Hue 4.11, EMR 6.14, Hadoop 3.3.3, Pig 0.17, Hive 3.1.3, Zeppelin 0.10.1\n- Nodes: 1 primary and 1 core node\n- Compute : Spark Streaming, MapReduce\n- Data engineering : AWS Athena, Glue transformation, Pig Latin, Hive\n- Visualization : Apache Hue board (used Apache Hue to visualise the data with a better UI; connecting to Hue from a web browser required SSH tunnelling)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fineerav%2Ftfidf-map-reduce","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fineerav%2Ftfidf-map-reduce","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fineerav%2Ftfidf-map-reduce/lists"}