{"id":15154096,"url":"https://github.com/mongodb/mongo-disco","last_synced_at":"2026-01-11T03:42:31.193Z","repository":{"id":3567928,"uuid":"4629960","full_name":"mongodb/mongo-disco","owner":"mongodb","description":"Integration with the Disco Framework for distributed computation","archived":true,"fork":true,"pushed_at":"2023-04-10T13:01:00.000Z","size":551,"stargazers_count":19,"open_issues_count":3,"forks_count":9,"subscribers_count":27,"default_branch":"master","last_synced_at":"2024-12-20T20:04:31.369Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"10genNYUITP/MongoDisco","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mongodb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-06-11T20:50:40.000Z","updated_at":"2024-11-28T16:30:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mongodb/mongo-disco","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mongodb%2Fmongo-disco","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mongodb%2Fmongo-disco/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mongodb%2Fmongo-disco/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mongodb%2Fmongo-disco/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mongodb","download_url":"https://codeload.github.com/mongodb/mongo-disco/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","re
positories_count":234687195,"owners_count":18871712,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-09-26T17:02:02.544Z","updated_at":"2025-09-30T01:32:24.243Z","avatar_url":"https://github.com/mongodb.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"#MongoDB Disco Adapter\n\nThe MongoDB Disco Adapter is a plugin that connects MongoDB to the Disco MapReduce framework, letting users use MongoDB as a data input source, an output destination, or both.\n\n##Prerequisites\nEach machine in a Disco cluster needs the following:\n\n-Python\n\n-PyMongo\n\n-Disco\n\nFor instructions on setting up a Disco cluster, please refer to the guide (http://discoproject.org/doc/disco/start/install.html) on the Disco project website.\n\n##Installation\n1.  Check out the latest source code from GitHub\n\n```bash\n   $ git clone https://github.com/mongo/mongo-disco.git mongo-disco\n```\n\n2.  Go to the mongo-disco folder and run the setup.py file to install the MongoDisco package\n\n```bash\n    $ python setup.py install\n```\n\n    Note: the script may require administrator privileges to run\n\nIt’s done! Start hacking!\n\n##Example\nWord counting is a classic MapReduce example, and it takes only a few lines with the MongoDB Disco Adapter.\n\nStep 1. 
Specify the configuration for this job.\n\nFor example, you can specify where the input data is stored and where the output should be written by providing MongoDB URIs.\n\n```python\nconfig = {\n    \"input_uri\": \"mongodb://localhost/test.in\",\n    \"output_uri\": \"mongodb://localhost/test.out\",\n    \"create_input_splits\": True,\n    \"split_key\": {\"_id\": 1},\n    \"split_size\": 1,  # MB\n}\n```\n\nYou can find the full list of configuration options in the appendix.\n\nHere, we assume that the input data is in database “test”, collection “in”, and that we want to split the data on the “_id” field with a split_size of 1 megabyte. The result is written to collection “out”.\n\nStep 2. Write the map function\n\nHere we want to read the value of the field “word” and count it, so the map function looks like this:\n\n```python\ndef map(doc, params):\n    # Emit (word, 1); fall back to \"NoWord\" when the field is missing\n    yield doc.get('word', \"NoWord\"), 1\n```\n\nNote: doc is an ordinary document returned by a MongoDB query. You can perform on it any operation that MongoDB allows.\n\nStep 3. Write the reduce function\n\nSince the map phase already yields key-value pairs, we only need to sum the counts for each word.\n\n```python\ndef reduce(iter, params):\n    from disco.util import kvgroup\n    for word, counts in kvgroup(sorted(iter)):\n        yield word, sum(counts)\n```\n\nThe first parameter, iter, is an iterator over the keys and values produced by the map function. We use disco.util.kvgroup() to pull out each word along with its counts, and sum them together.\n\nStep 4. 
Create a MongoJob instance and run it\n\n```python\nfrom mongodisco.job import MongoJob\n\nMongoJob().run(map=map, reduce=reduce, **config)\n```\n\nNow you can run it from a terminal like any other Python script and check the result in MongoDB.\n\n\n##Appendix\n\nConfiguration for MongoJob\n\n\u003ctable\u003e\n\u003ctr\u003e\u003ctd\u003eName\u003c/td\u003e\u003ctd\u003eDefault Value\u003c/td\u003e\u003ctd\u003eNote\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003einput_uri\u003c/td\u003e\u003ctd\u003emongodb://localhost/test.in\u003c/td\u003e\u003ctd\u003eMongoDB URI for input data\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003eoutput_uri\u003c/td\u003e\u003ctd\u003emongodb://localhost/test.out\u003c/td\u003e\u003ctd\u003eMongoDB URI for output results\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003eprint_to_stdout\u003c/td\u003e\u003ctd\u003eFalse\u003c/td\u003e\u003ctd\u003eif True, print the result to stdout\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003ejob_wait\u003c/td\u003e\u003ctd\u003eTrue\u003c/td\u003e\u003ctd\u003eif False, the code won’t wait for the job to finish\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003ecreate_input_splits\u003c/td\u003e\u003ctd\u003eTrue\u003c/td\u003e\u003ctd\u003eif True, the input data will be split\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003esplit_size\u003c/td\u003e\u003ctd\u003e8\u003c/td\u003e\u003ctd\u003esize of one split, in MB\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003esplit_key\u003c/td\u003e\u003ctd\u003e{“_id”:1}\u003c/td\u003e\u003ctd\u003efield to split on\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003euse_shards\u003c/td\u003e\u003ctd\u003eFalse\u003c/td\u003e\u003ctd\u003eif True, connect directly to the shards to retrieve data\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003euse_chunks\u003c/td\u003e\u003ctd\u003eTrue\u003c/td\u003e\u003ctd\u003eif True, directly use the chunks created by MongoDB as 
splits\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003einput_key\u003c/td\u003e\u003ctd\u003eNone\u003c/td\u003e\u003ctd\u003eUnknown!!!\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003eslave_ok\u003c/td\u003e\u003ctd\u003eFalse\u003c/td\u003e\u003ctd\u003esame as slave_okay\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003equery\u003c/td\u003e\u003ctd\u003e{}\u003c/td\u003e\u003ctd\u003esame as spec parameter of find method\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003efields\u003c/td\u003e\u003ctd\u003eNone\u003c/td\u003e\u003ctd\u003esame as fields parameter of find method\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003esort\u003c/td\u003e\u003ctd\u003eNone\u003c/td\u003e\u003ctd\u003esame as sort parameter of find method\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003elimit\u003c/td\u003e\u003ctd\u003e0\u003c/td\u003e\u003ctd\u003esame as limit parameter of find method\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003eskip\u003c/td\u003e\u003ctd\u003e0\u003c/td\u003e\u003ctd\u003esame as skip parameter of find method\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003ejob_output_key\u003c/td\u003e\u003ctd\u003e“_id”\u003c/td\u003e\u003ctd\u003efield name for output key\u003c/td\u003e\u003c/tr\u003e \n\u003ctr\u003e\u003ctd\u003ejob_output_value\u003c/td\u003e\u003ctd\u003e“value”\u003c/td\u003e\u003ctd\u003efield name for output value\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmongodb%2Fmongo-disco","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmongodb%2Fmongo-disco","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmongodb%2Fmongo-disco/lists"}