https://github.com/ineerav/tfidf-map-reduce
Running TF-IDF using Spark Streaming on Hillary Clinton's infamous leaked email data set: https://www.kaggle.com/datasets/kaggle/hillary-clinton-emails
aws emr maven pig-latin shell spark spring-boot tf-idf
- Host: GitHub
- URL: https://github.com/ineerav/tfidf-map-reduce
- Owner: INeerav
- License: apache-2.0
- Created: 2023-11-16T22:02:04.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-20T15:12:35.000Z (7 months ago)
- Last Synced: 2024-04-20T16:59:54.427Z (7 months ago)
- Topics: aws, emr, maven, pig-latin, shell, spark, spring-boot, tf-idf
- Language: Python
- Homepage:
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# tfidf-map-reduce
- Problem: Find the most common words in the emails.
- Dataset: This email dataset is publicly available on data.world. The file
  is 25 MB and has 22 columns, including all the required ones (email
  subject, body, to, from, attachments, and timestamp), which gives it
  enough complexity for this assignment.
  This famous (or infamous) dataset was released by the US State
  Department at the time of the US election, roughly 7 years
  ago.
- Clinton emails dataset:
  https://data.world/briangriffey/clinton-emails/workspace/file?filename=Emails.csv

## Tech stack
EMR cluster:
- Filesystem: Hadoop
- File formats: Parquet, Avro
- Infrastructure: AWS CloudFormation (infrastructure as code)
- Versions: Hue 4.11, EMR 6.14, Hadoop 3.3.3, Pig 0.17, Hive 3.1.3, Zeppelin 0.10.1
- Nodes: 1 primary and 1 core node
- Compute: Spark Streaming, MapReduce
- Data engineering: AWS Athena, Glue transformations, Pig Latin, Hive
- Visualization: Apache Hue (used to visualise the data with a better UI; connecting Hue to a web browser required SSH tunnelling)
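
The TF-IDF computation named in the compute step can be sketched in plain Python as a map phase followed by a reduce phase. This is a minimal illustration, not the repository's code: the function names (`tokenize`, `map_phase`, `reduce_counts`, `tfidf`), the sample `emails` dictionary, and the choice of an unsmoothed natural-log IDF are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def map_phase(docs):
    """Map step: emit ((doc_id, word), 1) for every token in every document."""
    for doc_id, text in docs.items():
        for word in tokenize(text):
            yield (doc_id, word), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (doc_id, word) key."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def tfidf(docs):
    """Return {(doc_id, word): tf * idf} using tf = count / doc length
    and idf = ln(N / document frequency)."""
    term_counts = reduce_counts(map_phase(docs))
    doc_lengths = Counter()  # total tokens per document
    doc_freq = Counter()     # number of documents containing each word
    for (doc_id, word), n in term_counts.items():
        doc_lengths[doc_id] += n
        doc_freq[word] += 1
    n_docs = len(docs)
    scores = {}
    for (doc_id, word), n in term_counts.items():
        tf = n / doc_lengths[doc_id]
        idf = math.log(n_docs / doc_freq[word])
        scores[(doc_id, word)] = tf * idf
    return scores


# Hypothetical sample data standing in for email bodies.
emails = {
    "e1": "Meeting schedule for the budget meeting",
    "e2": "Budget report attached",
    "e3": "Lunch schedule",
}
scores = tfidf(emails)
```

In this sample, `meeting` appears twice in `e1` and in no other email, so it scores higher for `e1` than `budget`, which also appears in `e2`. On a cluster, the same two phases would typically be expressed with the framework's own map and reduce-by-key operations rather than in-memory counters.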