Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/douban/dpark
Python clone of Spark, a MapReduce alike framework in Python
https://github.com/douban/dpark
bigdata dpark mapreduce python spark stream-processing
Last synced: 3 months ago
JSON representation
Python clone of Spark, a MapReduce alike framework in Python
- Host: GitHub
- URL: https://github.com/douban/dpark
- Owner: douban
- License: bsd-3-clause
- Created: 2012-04-11T08:35:06.000Z (almost 13 years ago)
- Default Branch: master
- Last Pushed: 2020-12-25T10:36:06.000Z (about 4 years ago)
- Last Synced: 2024-04-14T10:01:23.045Z (9 months ago)
- Topics: bigdata, dpark, mapreduce, python, spark, stream-processing
- Language: Python
- Homepage:
- Size: 2.65 MB
- Stars: 2,693
- Watchers: 267
- Forks: 535
- Open Issues: 1
-
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-mesos - Dpark
- awesome-python-machine-learning-resources - GitHub - 1% open · ⏱️ 25.12.2020): (数据管道和流处理)
README
DPark
=====|pypi status| |ci status| |gitter|
DPark is a Python clone of Spark, MapReduce(R) alike computing framework
supporting iterative computation.Installation
------------.. code:: bash
## Due to the use of C extensions, some libraries need to be installed first.
$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev## Then just pip install dpark (``sudo`` maybe needed if you encounter permission problem).
$ pip install dpark
Example
-------for word counting (``wc.py``):
.. code:: python
from dpark import DparkContext
ctx = DparkContext()
file = ctx.textFile("/tmp/words.txt")
words = file.flatMap(lambda x:x.split()).map(lambda x:(x,1))
wc = words.reduceByKey(lambda x,y:x+y).collectAsMap()
print wcThis script can run locally or on a Mesos cluster without any
modification, just using different command-line arguments:.. code:: bash
$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]See examples/ for more use cases.
Configuration
------------DPark can run with Mesos 0.9 or higher.
If a ``$MESOS_MASTER`` environment variable is set, you can use a
shortcut and run DPark with Mesos just by typing.. code:: bash
$ python wc.py -m mesos
``$MESOS_MASTER`` can be any scheme of Mesos master, such as
.. code:: bash
$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
In order to speed up shuffling, you should deploy Nginx at port 5055 for
accessing data in ``DPARK_WORK_DIR`` (default is ``/tmp/dpark``), such
as:.. code:: bash
server {
listen 5055;
server_name localhost;
root /tmp/dpark/;
}UI
--2 DAGs:
1. stage graph: stage is a running unit, contain a set of task, each run same ops for a split of rdd.
2. use api callsite graphUI when running
~~~~~~~~~~~~~~Just open the url from log like ``start listening on Web UI http://server_01:40812`` .
UI after running
~~~~~~~~~~~~~~~~~~1. before run, config LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf, pre-create LOGHUB_DIR.
2. get log hubdir from log like ``logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754``, which in clude mesos framework id.
3. run ``dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/``, dpark_web.py is in tools/UI examples for features
~~~~~~~show sharing shuffle map output
.. code:: python
rdd = DparkContext().makeRDD([(1,1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect().. image:: images/share_mapoutput.png
combine nodes iff with same lineage, form a logic tree inside stage, then each node contain a PIPELINE of rdds.
.. code:: python
rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect().. image:: images/unions.png
More docs (in Chinese)
-------------------------https://dpark.readthedocs.io/zh_CN/latest/
https://github.com/jackfengji/test\_pro/wiki
Mailing list: [email protected]
(http://groups.google.com/group/dpark-users).. |pypi status| image:: https://img.shields.io/pypi/v/DPark.svg
:target: https://pypi.python.org/pypi/DPark.. |gitter| image:: https://badges.gitter.im/douban/dpark.svg
:alt: Join the chat at https://gitter.im/douban/dpark
:target: https://gitter.im/douban/dpark?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge.. |ci status| image:: https://travis-ci.org/douban/dpark.svg
:target: https://travis-ci.org/douban/dpark