{"id":15519913,"url":"https://github.com/ceteri/ceteri-mapred","last_synced_at":"2025-12-18T02:34:27.450Z","repository":{"id":982458,"uuid":"785120","full_name":"ceteri/ceteri-mapred","owner":"ceteri","description":"MapReduce examples","archived":true,"fork":false,"pushed_at":"2011-11-18T09:16:42.000Z","size":5271,"stargazers_count":20,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-29T22:06:02.558Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://www.slideshare.net/pacoid/getting-started-on-hadoop","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ceteri.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-07-19T19:59:16.000Z","updated_at":"2025-03-07T17:37:17.000Z","dependencies_parsed_at":"2022-08-16T11:40:47.378Z","dependency_job_id":null,"html_url":"https://github.com/ceteri/ceteri-mapred","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fceteri-mapred","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fceteri-mapred/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fceteri-mapred/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fceteri-mapred/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ceteri","download_url":"https://codeload.github.com/ceteri/ceteri-mapred/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","r
epositories_count":249319594,"owners_count":21250578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-02T10:23:35.164Z","updated_at":"2025-12-18T02:34:27.405Z","avatar_url":"https://github.com/ceteri.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Getting Started on Hadoop\n## Paco Nathan \u003cceteri@gmail.com\u003e\n##\n## Silicon Valley Cloud Computing Meetup\n## http://www.meetup.com/cloudcomputing/calendar/13911740/\n## Mountain View, 2010-07-19\n\nGitHub src repo:\n    http://github.com/ceteri/ceteri-mapred\n\nPresentation slides available here in Keynote format or \nonline at SlideShare:\n    doc/enron.key\n    http://www.slideshare.net/pacoid/getting-started-on-hadoop\n\n\nSee the \"WordCount\" example at:\n    bin/run_wc.sh\n\nSee the \"Enron Email Dataset\" demo at:\n    bin/run_enron.sh\n\nR statistics demo:\n    thresh.R, thresh.tsv\n\nGephi graph demo:\n    graph.gephi\n\n\n## to run your own code on Elastic MapReduce\n\n    1. create a bucket in S3\n    2. copy the Python scripts into a \"src\" folder there\n    3. determine some subset of the email message input\n\tcat msgs.tsv | head -1000 \u003e input\n    4. copy \"input\" to your S3 \"src\" folder\n    5. 
follow the examples in the slide deck, using the parameters below\n\n\n## Hadoop job flow 1 on Elastic MapReduce\n\n-input s3n://ceteri-mapred/enron/src/input \n-output s3n://ceteri-mapred/enron/src/output \n-mapper '\"python map_parse.py http://ceteri-mapred.s3.amazonaws.com/ stopwords\"'\n-reducer '\"python red_idf.py 2500\"'\n-cacheFile s3n://ceteri-mapred/enron/src/map_parse.py#map_parse.py\n-cacheFile s3n://ceteri-mapred/enron/src/red_idf.py#red_idf.py\n-cacheFile s3n://ceteri-mapred/enron/src/stopwords#stopwords\n\n## Hadoop job flow 2 on Elastic MapReduce\n\n-input s3n://ceteri-mapred/enron/src/output \n-output s3n://ceteri-mapred/enron/src/filter \n-mapper '\"python map_filter.py\"'\n-reducer '\"python red_filter.py 0.0633\"'\n-cacheFile s3n://ceteri-mapred/enron/src/map_filter.py#map_filter.py\n-cacheFile s3n://ceteri-mapred/enron/src/red_filter.py#red_filter.py\n\n## after downloading the output folder named \"filter\" (with its part-* files)\n## from S3, run the following command to build a lexicon:\n\ncat filter/part-* | sort -k1 -k4 -nr \u003e lexicon\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceteri%2Fceteri-mapred","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceteri%2Fceteri-mapred","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceteri%2Fceteri-mapred/lists"}