{"id":13498106,"url":"https://github.com/dropbox/hydra","last_synced_at":"2025-03-28T22:32:18.731Z","repository":{"id":10524196,"uuid":"12715009","full_name":"dropbox/hydra","owner":"dropbox","description":"A multi-process MongoDB collection copier.","archived":false,"fork":false,"pushed_at":"2015-04-03T12:13:44.000Z","size":158,"stargazers_count":318,"open_issues_count":3,"forks_count":47,"subscribers_count":44,"default_branch":"master","last_synced_at":"2024-12-17T01:03:34.467Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://tech.dropbox.com/2013/09/scaling-mongodb-at-mailbox/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dropbox.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-09-09T22:58:52.000Z","updated_at":"2024-04-12T14:43:32.000Z","dependencies_parsed_at":"2022-08-30T14:10:44.394Z","dependency_job_id":null,"html_url":"https://github.com/dropbox/hydra","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhydra","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhydra/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhydra/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dropbox%2Fhydra/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dropbox","download_url":"https://codeload.github.com/dropbox/hydra/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github",
"repositories_count":246110679,"owners_count":20725104,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T20:00:50.983Z","updated_at":"2025-03-28T22:32:18.454Z","avatar_url":"https://github.com/dropbox.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# hydra - the multi-process MongoDB sharded collection copier\n\n## License\n\nSee the accompanying LICENSE.txt file for licensing terms.\n\n## Purpose\n\nThis is working *reference code* that performs a live copy from one MongoDB collection to another, with minimal or no visible impact to your production MongoDB clusters. Keeps the destination up-to-date with changes from the source with a typically small amount of lag.\n\nThere are two conditions that must remain true while running the tools in this suite:\n\n1. `mongos`'s chunk balancer must be disabled using [sh.setBalancerState()](http://docs.mongodb.org/manual/reference/method/sh.setBalancerState/).\n2. The set of source `mongod` instances (I recommend using secondaries) must remain up. Also, primary sources must remain primaries, and secondary sources must remain secondaries. This prevents dead cursors from interfering with the copy.\n\nThis has only been tested on MongoDB 2.2.3 on Ubuntu 12.04. 
This should work on other Linux platforms but may require work to operate with MongoDB 2.4.x and beyond.\n\n## Required Python Packages\n\nTo use this software, use [pip](http://www.pip-installer.org/en/latest/) to install the following packages into your Python environment:\n\n* [pymongo](https://pypi.python.org/pypi/pymongo/)\n* [gevent version 1.0rc2](https://github.com/surfly/gevent#installing-from-github)\n\t* NOTE: Do not use anything older than 1.0rc2. Earlier versions may have stability issues under load.\n\n## Usage\n\n### copy_collection.py\n\n`copy_collection.py` copies a MongoDB collection from one MongoDB cluster or standalone `mongod` instance to another. It does this in three steps:\n\n1. Creates an initial snapshot of the source collection on the destination cluster/instance.\n2. Copies indexes from source to destination.\n3. Applies oplog entries from source to destination.\n\nSteps #1 and #3 are performed by worker processes, one for each source you define (more on this below). `copy_collection.py` routinely records its progress in its *state database*. After step #1 finishes, steps #2 and #3 can be resumed at any time without issue.\n\nTypical usage for `copy_collection.py` looks like:\n\n~~~\ncopy_collection.py --source source_file.txt --dest mongos_host/database/collection\n~~~\n\nThe file passed to `--source` must have the following format:\n\n~~~\nsource_database_name.source_collection_name\nmongod-instance-1.foo.com\nmongod-instance-2.foo.com:27019\nmongod-instance-3.foo.com:27018\n~~~\n\nAlternatively, `--source` can also accept as a parameter a `mongod` URL of a form similar to `--dest` (host[:port]/database/collection).\n\n**NOTE:** sources need to be `mongod` instances, and preferably secondaries rather than primaries. I had a difficult time getting sufficient reliability and performance when copying from a `mongos` instance. 
However, the destination must be either a `mongod` instance (for a non-sharded MongoDB setup) or a `mongos` instance (for a sharded setup).\n\nUseful options:\n\n* `--percent PCT`: limits your copy to a percentage of the source's documents; meant to be used with the corresponding `--percent` option for `compare_collections.py`\n* `--restart`: re-initialize the state database, to restart from the initial snapshot, rather than continuing where we left off\n* `--state-db`: specify a path in which to store the state database; this defaults to the current directory\n\n\n#### copy_collection.py output\n\n~~~\n06-04 00:59:27 [INFO:MainProcess   ] using state db /home/user/hydra/test.collection.db\n...\n06-04 00:59:29 [INFO:shard1.foo.com] 4% | 5000 / 103993 copied | 2215/sec | 0 dupes | 0 exceptions | 0 retries\n06-04 00:59:29 [INFO:shard2.foo.com] 3% | 3500 / 105326 copied | 1579/sec | 0 dupes | 0 exceptions | 0 retries\n...\n06-04 01:06:23 [INFO:shard1.foo.com] done with initial copy\n06-04 01:06:23 [INFO:shard2.foo.com] done with initial copy\n06-04 01:06:23 [INFO:parent process] building indices\n06-04 01:06:23 [INFO:parent process] ensuring index on [(u'_id', 1)] (options = {'name': u'_id_'})\n06-04 01:06:23 [INFO:parent process] starting oplog apply\n06-04 01:06:23 [INFO:stats         ] OPS APPLIED                                    | WARNINGS\n06-04 01:06:23 [INFO:stats         ] total     lag    inserts   removes   updates   | sleeps    exceptions retries\n06-04 01:06:26 [INFO:shard1.foo.com] 204        2      0         0         204       | 0         0          0\n06-04 01:06:29 [INFO:shard2.foo.com] 214        1      0         0         214       | 0         0          0\n~~~\n\nWatch out for an excessive number of retries and exceptions. Sleeps are generally OK unless there is an excessive number of them. 
Unfortunately, the definition of \"excessive\" depends on your specific situation.\n\nAfter `copy_collection.py` begins applying ops, keep an eye on the `lag` column, which shows how many seconds behind `copy_collection.py`'s replication is.\n\n### compare_collections.py\n\n`compare_collections.py` compares two collections and is meant to be used with `copy_collection.py`. The two scripts can run simultaneously, once `copy_collection.py` is up-to-date with applying ops.\n\nTo compensate for small amounts of `copy_collection.py` lag, `compare_collections.py` tries the comparison of each document multiple times to check whether the documents eventually match. The number of retries and the delay between retries are generous, to compensate for frequently updated documents and lag in `copy_collection.py`.\n\n#### compare_collections.py output\n\n~~~\n06-04 01:23:00 [INFO:shard1.foo.com] 30% | 32000 / 104001 compared | 7659/sec | 1 retries | 0 mismatches\n06-04 01:23:00 [INFO:shard2.foo.com] 21% | 22700 / 105831 compared | 5402/sec | 0 retries | 0 mismatches\n~~~\n\nRetries are OK, but watch out for frequent retries. Those might presage mismatches. The `_id`s of mismatching documents are written to a file named `COLLECTION_mismatches.txt`. For example, if your collection name is albums, you'll find any mismatches in `albums_mismatches.txt`. 
The mismatches file can be used with the `copy_stragglers.py` tool that will be discussed below.\n\n### copy_stragglers.py\n\nGiven the list of `_id`s in the `[collection_name]_mismatches.txt` file generated by `compare_collections.py`, this tool re-copies all documents with the given `_id`s.\n\nFor example, if you had just finished comparing the collection `albums` and `compare_collections.py` reported some mismatches, you'd run `copy_stragglers.py` as follows:\n\n\n~~~\n./copy_stragglers.py --source source-mongos.foo.com --dest destination-mongos.foo.com --mismatches-file albums_mismatches.txt\n~~~\n\n**NOTE**: Unlike `copy_collection.py` and `compare_collections.py`, `copy_stragglers.py` expects the source to be a `mongos` instance. This is mainly to keep the code extremely simple.\n\n### cluster_cop.py\n\n`cluster_cop.py` monitors the source MongoDB cluster for configuration changes that can impact `copy_collection.py` and `compare_collections.py`. These are:\n\n1. Chunk balancing must be off throughout the whole migration\n2. Primary `mongod` instances must remain primaries, secondaries must remain secondaries (this prevents cursors from dying while being used)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdropbox%2Fhydra","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdropbox%2Fhydra","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdropbox%2Fhydra/lists"}