{"id":13498149,"url":"https://github.com/douban/dpark","last_synced_at":"2025-10-29T12:31:30.799Z","repository":{"id":2976151,"uuid":"3991627","full_name":"douban/dpark","owner":"douban","description":"Python clone of Spark,  a MapReduce alike framework in Python","archived":true,"fork":false,"pushed_at":"2020-12-25T10:36:06.000Z","size":2778,"stargazers_count":2687,"open_issues_count":1,"forks_count":534,"subscribers_count":267,"default_branch":"master","last_synced_at":"2024-10-29T15:20:52.439Z","etag":null,"topics":["bigdata","dpark","mapreduce","python","spark","stream-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/douban.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-04-11T08:35:06.000Z","updated_at":"2024-10-25T16:23:45.000Z","dependencies_parsed_at":"2022-07-16T14:30:49.816Z","dependency_job_id":null,"html_url":"https://github.com/douban/dpark","commit_stats":null,"previous_names":[],"tags_count":20,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Fdpark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Fdpark/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Fdpark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/douban%2Fdpark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/douban","download_url":"https://codeload.github.com/douban/dpark/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","ki
nd":"github","repositories_count":238813631,"owners_count":19534973,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","dpark","mapreduce","python","spark","stream-processing"],"created_at":"2024-07-31T20:00:52.130Z","updated_at":"2025-10-29T12:31:30.276Z","avatar_url":"https://github.com/douban.png","language":"Python","readme":"DPark\n=====\n\n|pypi status| |ci status| |gitter|\n\nDPark is a Python clone of Spark, MapReduce(R) alike computing framework\nsupporting iterative computation.\n\nInstallation\n------------\n\n.. code:: bash\n\n    ## Due to the use of C extensions, some libraries need to be installed first.\n\n    $ sudo apt-get install libtool pkg-config build-essential autoconf automake\n    $ sudo apt-get install python-dev\n    $ sudo apt-get install libzmq-dev\n\n    ## Then just pip install dpark (``sudo`` maybe needed if you encounter permission problem).\n\n    $ pip install dpark\n\n\nExample\n-------\n\nfor word counting (``wc.py``):\n\n.. code:: python\n\n     from dpark import DparkContext\n     ctx = DparkContext()\n     file = ctx.textFile(\"/tmp/words.txt\")\n     words = file.flatMap(lambda x:x.split()).map(lambda x:(x,1))\n     wc = words.reduceByKey(lambda x,y:x+y).collectAsMap()\n     print wc\n\nThis script can run locally or on a Mesos cluster without any\nmodification, just using different command-line arguments:\n\n.. 
code:: bash\n\n    $ python wc.py\n    $ python wc.py -m process\n    $ python wc.py -m host[:port]\n\nSee ``examples/`` for more use cases.\n\n\nConfiguration\n-------------\n\nDPark can run on Mesos 0.9 or higher.\n\nIf the ``$MESOS_MASTER`` environment variable is set, you can use a\nshortcut and run DPark on Mesos just by typing\n\n.. code:: bash\n\n    $ python wc.py -m mesos\n\n``$MESOS_MASTER`` can be any Mesos master address scheme, such as\n\n.. code:: bash\n\n    $ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master\n\nTo speed up shuffling, deploy Nginx on port 5055 to serve the data in\n``DPARK_WORK_DIR`` (default ``/tmp/dpark``), for example:\n\n.. code:: bash\n\n    server {\n            listen 5055;\n            server_name localhost;\n            root /tmp/dpark/;\n    }\n\nUI\n--\n\nThe web UI shows two DAGs:\n\n1. the stage graph: a stage is a unit of execution containing a set of tasks, each running the same operations on one split of an RDD.\n2. the user API call-site graph\n\n\nUI when running\n~~~~~~~~~~~~~~~\n\nJust open the URL printed in the log, e.g. ``start listening on Web UI http://server_01:40812``.\n\n\nUI after running\n~~~~~~~~~~~~~~~~\n\n1. Before running, configure LOGHUB and LOGHUB_PATH_FORMAT in dpark.conf, and pre-create LOGHUB_DIR.\n2. Get the loghub directory from a log line like ``logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754``, which includes the Mesos framework id.\n3. Run ``dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/`` (``dpark_web.py`` lives in ``tools/``).\n\n\nUI examples for features\n~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nSharing of shuffle map output:\n\n.. code:: python\n\n\n   rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()\n   rdd.map(m).collect()\n   rdd.map(m).collect()\n\n\n.. image:: images/share_mapoutput.png\n\n\nNodes are combined only if they share the same lineage, forming a logical tree inside a stage; each node then contains a pipeline of RDDs.\n\n\n.. 
code:: python\n\n\n   rdd1 = get_rdd()\n   rdd2 = dc.union([get_rdd() for i in range(2)])\n   rdd3 = get_rdd().groupByKey()\n   dc.union([rdd1, rdd2, rdd3]).collect()\n\n\n.. image:: images/unions.png\n\n\nMore docs (in Chinese)\n----------------------\n\nhttps://dpark.readthedocs.io/zh_CN/latest/\n\nhttps://github.com/jackfengji/test\_pro/wiki\n\nMailing list: dpark-users@googlegroups.com\n(http://groups.google.com/group/dpark-users)\n\n\n.. |pypi status| image:: https://img.shields.io/pypi/v/DPark.svg\n   :target: https://pypi.python.org/pypi/DPark\n\n.. |gitter| image:: https://badges.gitter.im/douban/dpark.svg\n   :alt: Join the chat at https://gitter.im/douban/dpark\n   :target: https://gitter.im/douban/dpark?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge\n\n.. |ci status| image:: https://travis-ci.org/douban/dpark.svg\n   :target: https://travis-ci.org/douban/dpark\n","funding_links":[],"categories":["MapReduce","Resource Lists","Python","Data Pipelines and Stream Processing","Frameworks"],"sub_categories":["Distributed Computing","Data Processing"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdouban%2Fdpark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdouban%2Fdpark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdouban%2Fdpark/lists"}