Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Yelp/mrjob
Run MapReduce jobs on Hadoop or Amazon Web Services
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/Yelp/mrjob
- Owner: Yelp
- License: other
- Created: 2010-10-13T18:35:21.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2023-03-24T10:20:24.000Z (almost 2 years ago)
- Last Synced: 2024-10-29T15:04:54.081Z (3 months ago)
- Language: Python
- Homepage: http://packages.python.org/mrjob/
- Size: 17.2 MB
- Stars: 2,615
- Watchers: 109
- Forks: 587
- Open Issues: 213
Metadata Files:
- Readme: README.rst
- Changelog: CHANGES.txt
- Contributing: CONTRIBUTING.rst
- License: LICENSE.txt
Awesome Lists containing this project
- awesome-python-resources - GitHub (15% open · ⏱️ 16.11.2020) - (Distributed Computing)
- awesome-python-machine-learning-resources - GitHub (15% open · ⏱️ 16.11.2020) - (Data Pipelines and Stream Processing)
README
mrjob: the Python MapReduce library
===================================

.. image:: https://github.com/Yelp/mrjob/raw/master/docs/logos/logo_medium.png
mrjob is a Python 2.7/3.4+ package that helps you write and run Hadoop
Streaming jobs.

`Stable version (v0.7.4) documentation <https://mrjob.readthedocs.io/en/stable/>`_

`Development version documentation <https://mrjob.readthedocs.io/en/latest/>`_
.. image:: https://travis-ci.org/Yelp/mrjob.png
   :target: https://travis-ci.org/Yelp/mrjob

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for
Google Cloud Dataproc (Dataproc), which allows you to buy time on a Hadoop
cluster on a minute-by-minute basis. It also works with your own Hadoop
cluster.

Some important features:
* Run jobs on EMR, Google Cloud Dataproc, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next; see the sketch after this list)
* Easily launch Spark jobs on EMR or your own Hadoop cluster
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install python packages from tarballs (EMR only)
  * Setup handled transparently by ``mrjob.conf`` config file

* Automatically interpret error logs
* SSH tunnel to hadoop job tracker (EMR only)
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on Dataproc, set ``$GOOGLE_APPLICATION_CREDENTIALS``
  * No setup needed to use mrjob on your own Hadoop cluster
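To make the multi-step feature concrete, here is a minimal sketch of a
two-step job, modeled on the ``MRStep`` pattern from mrjob's documentation
(the class name ``MRMostUsedWord`` and its method names are illustrative).
The first step counts words; its output feeds the second step, which picks
the most-used word:

.. code-block:: python

    from mrjob.job import MRJob
    from mrjob.step import MRStep
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRMostUsedWord(MRJob):

        def steps(self):
            # each MRStep is one map-reduce pass; the output of the
            # first step becomes the input of the second
            return [
                MRStep(mapper=self.mapper_get_words,
                       combiner=self.combiner_count_words,
                       reducer=self.reducer_count_words),
                MRStep(reducer=self.reducer_find_max_word),
            ]

        def mapper_get_words(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner_count_words(self, word, counts):
            yield (word, sum(counts))

        def reducer_count_words(self, word, counts):
            # use None as the key so every pair reaches the same reducer
            yield None, (sum(counts), word)

        def reducer_find_max_word(self, _, word_count_pairs):
            # (count, word) tuples compare by count first
            yield max(word_count_pairs)


    if __name__ == '__main__':
        MRMostUsedWord.run()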
Installation
------------

``pip install mrjob``
As of v0.7.0, Amazon Web Services and Google Cloud Services are optional
dependencies. To use these, install with the ``aws`` and ``google`` targets,
respectively. For example:

``pip install mrjob[aws]``
A Simple Map Reduce Job
-----------------------

Code for this example and more live in ``mrjob/examples``.
.. code-block:: python

    """The classic MapReduce job: count the frequency of words."""
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()
Try It Out!
-----------

::
    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on Dataproc
    python mrjob/examples/mr_word_freq_count.py README.rst -r dataproc > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
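You can also launch a job from Python rather than the shell. This sketch uses
mrjob's documented runner API (``make_runner``, ``cat_output``, and
``parse_output``, available since v0.6.0) to run the word-frequency example
and read back its output:

.. code-block:: python

    from mrjob.examples.mr_word_freq_count import MRWordFreqCount

    # same effect as: python mr_word_freq_count.py README.rst > counts
    mr_job = MRWordFreqCount(args=['README.rst'])
    with mr_job.make_runner() as runner:
        runner.run()
        # parse_output() decodes the runner's raw output stream back
        # into (word, count) pairs
        for word, count in mr_job.parse_output(runner.cat_output()):
            print(word, count)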
Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Get your access and secret keys (click "Security Credentials" on
  your account page)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly
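As a quick sanity check before launching with ``-r emr``, you can confirm
both variables are set (a plain-Python sketch, no mrjob API involved):

.. code-block:: python

    import os

    # fail fast if the EMR credentials described above are missing
    for var in ('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'):
        if not os.environ.get(var):
            raise SystemExit('set $%s before running with -r emr' % var)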
Setting up Dataproc on Google
-----------------------------

* `Create a Google Cloud Platform account <https://cloud.google.com/>`_ (see top-right)
* Learn about Google Cloud Platform "projects"
* Select or create a Cloud Platform Console project
* Enable billing for your project
* Go to the API Manager and search for / enable the following APIs...

  * Google Cloud Storage
  * Google Cloud Storage JSON API
  * Google Cloud Dataproc API

* Under Credentials, **Create Credentials** and select **Service account
  key**. Then, select **New service account**, enter a Name and select
  **Key type** JSON.
* Install the `Google Cloud SDK <https://cloud.google.com/sdk/>`_
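Analogous to the EMR check above, a small plain-Python sketch to confirm the
service account key is wired up before running with ``-r dataproc``:

.. code-block:: python

    import os

    # $GOOGLE_APPLICATION_CREDENTIALS should point at the JSON key file
    # created in the Credentials step above
    key_path = os.environ.get('GOOGLE_APPLICATION_CREDENTIALS', '')
    if not os.path.isfile(key_path):
        raise SystemExit('point $GOOGLE_APPLICATION_CREDENTIALS at your '
                         'service account JSON key')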
Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation
<https://mrjob.readthedocs.io/en/stable/guides/configs-basics.html>`_ for more
information.
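For illustration, here is a minimal sketch that writes a starter
``~/.mrjob.conf``. The layout (``runners`` / ``emr``) follows mrjob's
documented YAML config structure, but treat the ``region`` value as a
placeholder for your own setup:

.. code-block:: python

    import os

    # mrjob.conf is YAML; this is about the smallest useful EMR section
    CONF = """\
    runners:
      emr:
        region: us-west-2
    """

    path = os.path.expanduser('~/.mrjob.conf')
    if os.path.exists(path):
        raise SystemExit('%s already exists; edit it by hand' % path)
    with open(path, 'w') as f:
        f.write(CONF)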
Project Links
-------------

* `Source code <https://github.com/Yelp/mrjob>`__
* `Documentation <https://mrjob.readthedocs.io/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_

Reference
---------

* `Hadoop Streaming <https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html>`_
* `Elastic MapReduce <https://aws.amazon.com/emr/>`_
* `Google Cloud Dataproc <https://cloud.google.com/dataproc/>`_

More Information
----------------

* PyCon 2011 mrjob overview
* Introduction to Recommendations and MapReduce with mrjob (source code)
* Social Graph Analysis Using Elastic MapReduce and PyPy

Thanks to Greg Killion (ROMEO ECHO_DELTA) for the logo.