Mongo Deep MapReduce
====================

This is a library of MongoDB-related Hadoop MapReduce classes, in particular an InputFormat that reads directly
from Mongo's binary on-disk format. Developed by Peter Bakkum at Groupon in Palo Alto.

Problem: If you want to use Hadoop MapReduce with a Mongo collection, you currently have two options:
- You can execute one or more cursors over the entire collection in your MapReduce job.
- You can export the collection as BSON or JSON, which also executes a cursor over the entire collection,
and MapReduce over the exported data.

However, with a large data set that significantly exceeds the available memory on the Mongo host, both options
can be prohibitively time consuming.

Solution: Move the raw Mongo files into HDFS, without exporting, and MapReduce over them using this library.

Mongo uses a proprietary binary format to manage its data, which is essentially a doubly-linked list of BSON records.
By reading this format directly, we obviate the need for expensive data conversion prior to a Hadoop MapReduce,
and we can utilize the full throughput of the Hadoop cluster when reading the data, rather than using single-threaded
cursors.

Data Format
-----------

Data stored on disk by Mongo is generally in groups of files that look like

```
dbname.ns
dbname.0
dbname.1
dbname.2
...
```

`dbname.ns` is a namespace file. This is a hash table of namespace records, which contain a collection name and
the first and last Extent of the collection. We use DiskLocs as pointers to Extents. A DiskLoc is essentially

```C
struct DiskLoc {
    int fileNum;
    int offset;
};
```

written out to disk. The fileNum is the postfix number on the files shown above, and the offset is the byte offset
within that file.

An extent is the major unit of physical organization within a Mongo collection. A collection is a doubly-linked list
of extents, each of which holds a block of records. The extents are spread across the database files, and
each contains a doubly-linked list of Records.

Using MongoInputFormat
----------------------

This has been written using the newer `mapreduce` interface and CDH4.0.1, and tested against the binary data formats
from Mongo 2.0 and 2.2. It should work out of the box with those systems, but may require some tweaking of the
dependencies to work on different versions of Hadoop, or changes for future versions of Mongo with different
on-disk formats.
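As a concrete illustration of the DiskLoc structure from the Data Format section above, here is a hedged sketch of decoding one from raw bytes. It assumes the two fields are stored as consecutive little-endian 32-bit integers (Mongo's on-disk data, like BSON, is little-endian); the class is illustrative and not part of this library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative decoder for the DiskLoc layout described above:
// two consecutive 32-bit integers, little-endian on disk.
public class DiskLocDecoder {

    static final class DiskLoc {
        final int fileNum; // postfix number of the dbname.N file
        final int offset;  // byte offset within that file

        DiskLoc(int fileNum, int offset) {
            this.fileNum = fileNum;
            this.offset = offset;
        }
    }

    // Read a DiskLoc from 8 bytes starting at pos.
    static DiskLoc decode(byte[] raw, int pos) {
        ByteBuffer buf = ByteBuffer.wrap(raw, pos, 8).order(ByteOrder.LITTLE_ENDIAN);
        return new DiskLoc(buf.getInt(), buf.getInt());
    }

    public static void main(String[] args) {
        // 8 bytes encoding fileNum = 1, offset = 0x2000 (8192)
        byte[] raw = {1, 0, 0, 0, 0, 0x20, 0, 0};
        DiskLoc loc = decode(raw, 0);
        System.out.println(loc.fileNum + " " + loc.offset); // prints "1 8192"
    }
}
```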
Once included as a dependency, you can use this library as you would any other Hadoop InputFormat by configuring
it to point to the Mongo data in HDFS and the Mongo database and collection you want to query.

Basic use looks like:

```Java
MongoInputFormat.setMongoDirectory(path);
MongoInputFormat.setDatabase(dbname);
MongoInputFormat.setCollection(collname);

job.setInputFormatClass(MongoInputFormat.class);
```

You can then implement a Mapper like:

```Java
public static class Map extends Mapper<Text, WritableBSONObject, Text, Text> {
    @Override
    public void map(Text key, WritableBSONObject value, Context context)
            throws IOException, InterruptedException {

        context.write(null, new Text(value.getBSONObject().toString()));
    }
}
```

Look at the provided [MongoToJson](src/main/java/com/groupon/mapreduce/mongo/MongoToJson.java) job for a full example.

Running the Tests
-----------------

To run the tests you must first generate a set of test database files.

- Start a local Mongo instance on port 27017.
- Run the following to build the project and insert the specific data used for testing:

```
mvn -Dmaven.test.skip=true clean package
mvn test-compile
java -cp target/test-classes/:target/mongo-deep-mapreduce-jar-with-dependencies.jar com.groupon.mapreduce.mongo.GenerateTestDB
```

- Copy the deepmr_test* files from wherever your Mongo instance keeps its data (often /data) to src/test/db.
- Now you can run `mvn test`.
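As a closing mental model of the traversal described in the Data Format section, here is a toy in-memory sketch of scanning a collection as a linked list of extents, each holding a linked list of records. The classes below are hypothetical illustrations, not this library's API; on disk the links are DiskLocs rather than object references.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of the on-disk layout described above: a collection is
// a linked list of extents, each holding a linked list of records. Only the
// forward links are modeled here; the real format is doubly linked.
public class ExtentWalk {

    static final class Record {
        final String bson; // stands in for a raw BSON document
        Record next;
        Record(String bson) { this.bson = bson; }
    }

    static final class Extent {
        Record firstRecord;
        Extent nextExtent;
    }

    // Collect every record by walking extents, then records within each extent.
    static List<String> scan(Extent first) {
        List<String> out = new ArrayList<>();
        for (Extent e = first; e != null; e = e.nextExtent)
            for (Record r = e.firstRecord; r != null; r = r.next)
                out.add(r.bson);
        return out;
    }

    public static void main(String[] args) {
        Extent e1 = new Extent(), e2 = new Extent();
        e1.nextExtent = e2;
        Record a = new Record("{a:1}");
        a.next = new Record("{b:2}");
        e1.firstRecord = a;
        e2.firstRecord = new Record("{c:3}");
        System.out.println(scan(e1)); // prints [{a:1}, {b:2}, {c:3}]
    }
}
```

A full scan touches every extent exactly once, which is why reading the files directly in HDFS parallelizes so much better than a single server-side cursor.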