https://github.com/bwhite/hadoopy
Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials.
- Host: GitHub
- URL: https://github.com/bwhite/hadoopy
- Owner: bwhite
- License: gpl-3.0
- Created: 2009-10-18T01:25:29.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2016-01-08T21:07:56.000Z (over 8 years ago)
- Last Synced: 2024-02-23T00:05:58.666Z (4 months ago)
- Language: C
- Homepage:
- Size: 3.83 MB
- Stars: 244
- Watchers: 23
- Forks: 59
- Open Issues: 58
Metadata Files:
- Readme: README
- License: COPYING
Lists
- awesome-hadoop - hadoopy - Python MapReduce library written in Cython. (Hadoop)
- awesome-stars - hadoopy - Python MapReduce library written in Cython. Visit us in #hadoopy on freenode. See the link below for documentation and tutorials. (C)
README
Brandyn White
Andrew Miller
Source https://github.com/bwhite/hadoopy/
Issues https://github.com/bwhite/hadoopy/issues
Docs http://bwhite.github.com/hadoopy/
IRC: #hadoopy @ freenode.net
Requirements
Python development headers (python-dev), build tools (build-essential)
Optional
Cython (>= 0.13); without it, the build falls back to the pregenerated .c files
Features
- Oozie support
- Automated job parallelization 'auto-oozie' available in the hadoopy_flow project (maintained out of branch)
- TypedBytes support (very fast)
- Local execution of unmodified MapReduce job with launch_local
- Read/write sequence files of TypedBytes directly to HDFS from python (readtb, writetb)
- Works on OS X
- Allows printing to stdout and stderr in Hadoop tasks without causing problems (uses the 'pipe hopping' technique, both are available in the task's stderr)
- Critical path is in Cython
- Works on clusters with no extra installation, not even Python or Python libraries (uses the PyInstaller copy included in this source tree)
- Simple HDFS access (readtb and ls) inside Python, even inside running jobs
- Unit test interface
- Reporting using status and counters (and print statements! no need to be scared of them in Hadoopy)
- Supports design patterns in the Lin/Dyer book (http://www.umiacs.umd.edu/~jimmylin/book.html)
Limitations
- Hadoop Local mode is currently unsupported due to a bug in Hadoop's handling of the distributed cache in this mode. Use pseudo-distributed mode instead for now. (https://github.com/bwhite/hadoopy/issues/40)
Used in
- A Case for Query by Image and Text Content: Searching Computer Help using Screenshots and Keywords (to appear in WWW'11)
- Web-Scale Computer Vision using MapReduce for Multimedia Data Mining (at KDD'10)
- Vitrieve: Visual Search engine
- Picarus: Hadoop computer vision toolbox
Ubuntu Install (others are similar)
sudo apt-get install python-dev build-essential
sudo python setup.py install
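Example sketch (not part of the original README)
The Features list mentions running an unmodified MapReduce job locally. As a rough illustration of the kind of job that implies, here is a minimal word-count sketch in plain Python. It deliberately does not import hadoopy: `run_local` is a hypothetical helper written for this example (it is not hadoopy's `launch_local`), and the generator-style `mapper(key, value)` and `reducer(key, values)` signatures are an assumption about how such jobs are usually written. See the Docs link above for the real API.

```python
from itertools import groupby
from operator import itemgetter


def mapper(key, value):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in value.split():
        yield word, 1


def reducer(key, values):
    """Sum the counts collected for one word."""
    yield key, sum(values)


def run_local(kvs, mapper, reducer):
    """Hypothetical local driver: map every input pair, sort by key, reduce.

    This only imitates the map -> shuffle/sort -> reduce flow in memory;
    it is an illustration, not hadoopy's own implementation.
    """
    mapped = [out for k, v in kvs for out in mapper(k, v)]
    mapped.sort(key=itemgetter(0))  # stand-in for the shuffle/sort phase
    for key, group in groupby(mapped, key=itemgetter(0)):
        # Hand the reducer an iterator over the values for this key.
        for out in reducer(key, (v for _, v in group)):
            yield out


# Input pairs are (line offset, line text), chosen arbitrarily for the demo.
lines = [(0, "a b a"), (1, "b c")]
counts = dict(run_local(lines, mapper, reducer))
```

After this runs, `counts` maps each word to how many times it appeared in the input. Keeping the mapper and reducer as plain generator functions is what makes it possible to test them like this, without a cluster.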