{"id":13502416,"url":"https://github.com/Yelp/mrjob","last_synced_at":"2025-03-29T10:33:00.132Z","repository":{"id":1114869,"uuid":"984742","full_name":"Yelp/mrjob","owner":"Yelp","description":"Run MapReduce jobs on Hadoop or Amazon Web Services","archived":false,"fork":false,"pushed_at":"2023-03-24T10:20:24.000Z","size":18073,"stargazers_count":2618,"open_issues_count":213,"forks_count":587,"subscribers_count":105,"default_branch":"master","last_synced_at":"2025-03-12T22:35:45.819Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://packages.python.org/mrjob/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Yelp.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGES.txt","contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2010-10-13T18:35:21.000Z","updated_at":"2025-03-03T20:46:41.000Z","dependencies_parsed_at":"2023-07-05T18:31:33.833Z","dependency_job_id":null,"html_url":"https://github.com/Yelp/mrjob","commit_stats":{"total_commits":7299,"total_committers":134,"mean_commits":54.47014925373134,"dds":0.6666666666666667,"last_synced_commit":"091572e87bc24cc64be40278dd0f5c3617c98d4b"},"previous_names":[],"tags_count":63,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yelp%2Fmrjob","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yelp%2Fmrjob/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yelp%2Fmrjob/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Yelp%2Fmrjob/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Ye
lp","download_url":"https://codeload.github.com/Yelp/mrjob/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246174208,"owners_count":20735406,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T22:02:13.169Z","updated_at":"2025-03-29T10:32:57.899Z","avatar_url":"https://github.com/Yelp.png","language":"Python","funding_links":[],"categories":["Cluster Computing","资源列表","Distributed Computing","Python","MapReduce","数据管道和流处理","分布式计算","Distributed Computing [🔝](#readme)","Open Source Repos","Awesome Python","Data Pipelines \u0026 Streaming"],"sub_categories":["分布式计算","MapReduce","Elastic MapReduce","Cluster Computing"],"readme":"mrjob: the Python MapReduce library\n===================================\n\n.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png\n\nmrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop\nStreaming jobs.\n\n`Stable version (v0.7.4) documentation \u003chttp://mrjob.readthedocs.org/en/stable/\u003e`_\n\n`Development version documentation \u003chttp://mrjob.readthedocs.org/en/latest/\u003e`_\n\n.. image:: https://travis-ci.org/Yelp/mrjob.png\n   :target: https://travis-ci.org/Yelp/mrjob\n\nmrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you\nto buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc)\nwhich allows you to buy time on a Hadoop cluster on a minute-by-minute basis.  
It also works with your own\nHadoop cluster.\n\nSome important features:\n\n* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing).\n* Write multi-step jobs (one map-reduce step feeds into the next)\n* Easily launch Spark jobs on EMR or your own Hadoop cluster\n* Duplicate your production environment inside Hadoop\n\n  * Upload your source tree and put it in your job's ``$PYTHONPATH``\n  * Run ``make`` and other setup scripts\n  * Set environment variables (e.g. ``$TZ``)\n  * Easily install Python packages from tarballs (EMR only)\n  * Setup handled transparently by ``mrjob.conf`` config file\n* Automatically interpret error logs\n* SSH tunnel to Hadoop job tracker (EMR only)\n* Minimal setup\n\n  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``\n  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``\n  * No setup needed to use mrjob on your own Hadoop cluster\n\nInstallation\n------------\n\n``pip install mrjob``\n\nAs of v0.7.0, Amazon Web Services and Google Cloud Services are optional\ndependencies. To use these, install with the ``aws`` and ``google`` targets,\nrespectively. For example:\n\n``pip install mrjob[aws]``\n\nA Simple Map Reduce Job\n-----------------------\n\nThe code for this example, and more, lives in ``mrjob/examples``.\n\n.. 
code-block:: python\n\n   \"\"\"The classic MapReduce job: count the frequency of words.\n   \"\"\"\n   from mrjob.job import MRJob\n   import re\n\n   WORD_RE = re.compile(r\"[\\w']+\")\n\n\n   class MRWordFreqCount(MRJob):\n\n       def mapper(self, _, line):\n           for word in WORD_RE.findall(line):\n               yield (word.lower(), 1)\n\n       def combiner(self, word, counts):\n           yield (word, sum(counts))\n\n       def reducer(self, word, counts):\n           yield (word, sum(counts))\n\n\n   if __name__ == '__main__':\n       MRWordFreqCount.run()\n\nTry It Out!\n-----------\n\n::\n\n    # locally\n    python mrjob/examples/mr_word_freq_count.py README.rst \u003e counts\n    # on EMR\n    python mrjob/examples/mr_word_freq_count.py README.rst -r emr \u003e counts\n    # on Dataproc\n    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc \u003e counts\n    # on your Hadoop cluster\n    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop \u003e counts\n\n\nSetting up EMR on Amazon\n------------------------\n\n* Create an `Amazon Web Services account \u003chttp://aws.amazon.com/\u003e`_\n* Get your access and secret keys (click \"Security Credentials\" on\n  `your account page \u003chttp://aws.amazon.com/account/\u003e`_)\n* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and\n  ``$AWS_SECRET_ACCESS_KEY`` accordingly\n\nSetting up Dataproc on Google\n-----------------------------\n\n* `Create a Google Cloud Platform account \u003chttp://cloud.google.com/\u003e`_ (see top right)\n* `Learn about Google Cloud Platform \"projects\" \u003chttps://cloud.google.com/docs/overview/#projects\u003e`_\n* `Select or create a Cloud Platform Console project \u003chttps://console.cloud.google.com/project\u003e`_\n* `Enable billing for your project \u003chttps://console.cloud.google.com/billing\u003e`_\n* Go to the `API Manager \u003chttps://console.cloud.google.com/apis\u003e`_, then search for and enable the following APIs:\n\n  
* Google Cloud Storage\n  * Google Cloud Storage JSON API\n  * Google Cloud Dataproc API\n\n* Under Credentials, **Create Credentials** and select **Service account key**.  Then, select **New service account**, enter a Name and select **Key type** JSON.\n\n* Install the `Google Cloud SDK \u003chttps://cloud.google.com/sdk/\u003e`_\n\nAdvanced Configuration\n----------------------\n\nTo run in other AWS regions, upload your source tree, run ``make``, and use\nother advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks\nfor its conf file in:\n\n* The contents of ``$MRJOB_CONF``\n* ``~/.mrjob.conf``\n* ``/etc/mrjob.conf``\n\nSee `the mrjob.conf documentation\n\u003chttps://mrjob.readthedocs.io/en/latest/guides/configs-basics.html\u003e`_ for more\ninformation.\n\n\nProject Links\n-------------\n\n* `Source code \u003chttp://github.com/Yelp/mrjob\u003e`__\n* `Documentation \u003chttps://mrjob.readthedocs.io/en/latest/\u003e`_\n* `Discussion group \u003chttp://groups.google.com/group/mrjob\u003e`_\n\nReference\n---------\n\n* `Hadoop Streaming \u003chttp://hadoop.apache.org/docs/stable1/streaming.html\u003e`_\n* `Elastic MapReduce \u003chttp://aws.amazon.com/documentation/elasticmapreduce/\u003e`_\n* `Google Cloud Dataproc \u003chttps://cloud.google.com/dataproc/overview\u003e`_\n\nMore Information\n----------------\n\n* `PyCon 2011 mrjob overview \u003chttp://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/\u003e`_\n* `Introduction to Recommendations and MapReduce with mrjob \u003chttp://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html\u003e`_\n  (`source code \u003chttps://github.com/marcelcaraciolo/recsys-mapreduce-mrjob\u003e`__)\n* `Social Graph Analysis Using Elastic MapReduce and PyPy \u003chttp://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy\u003e`_\n\nThanks to `Greg Killion \u003cmailto:greg@blind-works.net\u003e`_\n(`ROMEO ECHO_DELTA 
\u003chttp://www.romeoechodelta.net/\u003e`_) for the logo.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYelp%2Fmrjob","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FYelp%2Fmrjob","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FYelp%2Fmrjob/lists"}