Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Yelp/mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/Yelp/mrjob
- Owner: Yelp
- License: other
- Created: 2010-10-13T18:35:21.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2023-03-24T10:20:24.000Z (almost 2 years ago)
- Last Synced: 2024-10-29T15:04:54.081Z (3 months ago)
- Language: Python
- Homepage: http://packages.python.org/mrjob/
- Size: 17.2 MB
- Stars: 2,615
- Watchers: 109
- Forks: 587
- Open Issues: 213
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.txt
- Contributing: CONTRIBUTING.rst
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-python-resources - GitHub (15% open · ⏱️ 16.11.2020) - (Distributed Computing)
- awesome-python-machine-learning-resources - GitHub (15% open · ⏱️ 16.11.2020) - (Data Pipelines and Stream Processing)
README
mrjob: the Python MapReduce library
===================================

.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png
mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop
Streaming jobs.

`Stable version (v0.7.4) documentation <https://mrjob.readthedocs.io/en/stable/>`_

`Development version documentation <https://mrjob.readthedocs.io/en/latest/>`_
.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for
Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop
cluster on a minute-by-minute basis. It also works with your own Hadoop
cluster.

Some important features:
* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next; see the sketch after this list)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install python packages from tarballs (EMR only)
  * Setup handled transparently by ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to hadoop job tracker (EMR only)
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``
  * No setup needed to use mrjob on your own Hadoop cluster
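To make the multi-step feature concrete, here is a minimal sketch of a
two-step job, modeled on the ``MRStep`` pattern from mrjob's documentation
(the class name ``MRMostUsedWord`` and its method names are illustrative).
The first step counts words; its output feeds the second step, which picks
the most-used word:

.. code-block:: python

    from mrjob.job import MRJob
    from mrjob.step import MRStep
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRMostUsedWord(MRJob):

        def steps(self):
            # each MRStep is one map-reduce pass; the output of the
            # first step becomes the input of the second
            return [
                MRStep(mapper=self.mapper_get_words,
                       combiner=self.combiner_count_words,
                       reducer=self.reducer_count_words),
                MRStep(reducer=self.reducer_find_max_word),
            ]

        def mapper_get_words(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner_count_words(self, word, counts):
            yield (word, sum(counts))

        def reducer_count_words(self, word, counts):
            # use None as the key so every pair reaches the same reducer
            yield None, (sum(counts), word)

        def reducer_find_max_word(self, _, word_count_pairs):
            # (count, word) tuples compare by count first
            yield max(word_count_pairs)


    if __name__ == '__main__':
        MRMostUsedWord.run()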
Installation
------------

``pip install mrjob``
As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``
A Simple Map Reduce Job
-----------------------

Code for this example and more live in ``mrjob/examples``.
.. code-block:: python

    """The classic MapReduce job: count the frequency of words."""
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()
Try It Out!
-----------

::
    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
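You can also launch a job from Python rather than the shell. This sketch uses
mrjob's documented runner API (``make_runner``, ``cat_output``, and
``parse_output``, available since v0.6.0) to run the word-frequency example
and read back its output:

.. code-block:: python

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # same effect as: python mr_word_freq_count.py README.rst > counts
    mr_job = MRWordFreqCount(args=['README.rst'])
    with mr_job.make_runner() as runner:
        runner.run()
        # parse_output() decodes the runner's raw output stream back
        # into (word, count) pairs
        for word, count in mr_job.parse_output(runner.cat_output()):
            print(word, count)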
Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Get your access and secret keys (click "Security Credentials" on
  your account page)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly
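As a quick sanity check before launching with ``-r emr``, you can confirm
both variables are set (a plain-Python sketch, no mrjob API involved):

.. code-block:: python

    import os

    # fail fast if the EMR credentials described above are missing
    for var in ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'):
        if not os.environ.get(var):
            raise SystemExit('set $%s before running with -r emr' % var)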
Setting up Dataproc on Google
-----------------------------

* `Create a Google Cloud Platform account <https://cloud.google.com/>`_ (see top-right)
* Learn about Google Cloud Platform "projects"
* Select or create a Cloud Platform Console project
* Enable billing for your project
* Go to the API Manager and search for / enable the following APIs...

  * Google Cloud Storage
  * Google Cloud Storage JSON API
  * Google Cloud Dataproc API

* Under Credentials, **Create Credentials** and select **Service account
  key**. Then, select **New service account**, enter a Name and select
  **Key type** JSON.
* Install the `Google Cloud SDK <https://cloud.google.com/sdk/>`_
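Analogous to the EMR check above, a small plain-Python sketch to confirm the
service account key is wired up before running with ``-r dataproc``:

.. code-block:: python

    import os

    # $GOOGLE_APPLICATION_CREDENTIALS should point at the JSON key file
    # created in the Credentials step above
    key_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS', '')
    if not os.path.isfile(key_path):
        raise SystemExit('point $GOOGLE_APPLICATION_CREDENTIALS at your '
                         'service account JSON key')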
Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation
<https://mrjob.readthedocs.io/en/stable/guides/configs-basics.html>`_ for more
information.
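For illustration, here is a minimal sketch that writes a starter
``~/.mrjob.conf``. The layout (``runners`` / ``emr``) follows mrjob's
documented YAML config structure, but treat the ``region`` value as a
placeholder for your own setup:

.. code-block:: python

    import os

    # mrjob.conf is YAML; this is about the smallest useful EMR section
    CONF = """\
    runners:
      emr:
        region: us-west-2
    """

    path = os.path.expanduser('~/.mrjob.conf')
    if os.path.exists(path):
        raise SystemExit('%s already exists; edit it by hand' % path)
    with open(path, 'w') as f:
        f.write(CONF)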
Project Links
-------------

* `Source code <https://github.com/Yelp/mrjob>`__
* `Documentation <https://mrjob.readthedocs.io/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_

Reference
---------

* `Hadoop Streaming <https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html>`_
* `Elastic MapReduce <https://aws.amazon.com/emr/>`_
* `Google Cloud Dataproc <https://cloud.google.com/dataproc/>`_

More Information
----------------

* PyCon 2011 mrjob overview
* Introduction to Recommendations and MapReduce with mrjob (source code)
* Social Graph Analysis Using Elastic MapReduce and PyPy

Thanks to Greg Killion (ROMEO ECHO_DELTA) for the logo.