{"id":15517065,"url":"https://github.com/deric/es-dedupe","last_synced_at":"2025-04-23T03:48:43.371Z","repository":{"id":46378800,"uuid":"86470227","full_name":"deric/es-dedupe","owner":"deric","description":"Tool for removing duplicate documents from Elasticsearch","archived":false,"fork":false,"pushed_at":"2023-10-16T09:40:29.000Z","size":133,"stargazers_count":54,"open_issues_count":5,"forks_count":22,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-23T03:48:36.673Z","etag":null,"topics":["duplicates","duplicity","elasticsearch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deric.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-28T14:31:46.000Z","updated_at":"2024-03-26T12:38:06.000Z","dependencies_parsed_at":"2023-10-16T19:29:29.327Z","dependency_job_id":null,"html_url":"https://github.com/deric/es-dedupe","commit_stats":{"total_commits":95,"total_committers":6,"mean_commits":"15.833333333333334","dds":"0.34736842105263155","last_synced_commit":"95483546589627957416539ee2a56dedb1e9cc15"},"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deric%2Fes-dedupe","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deric%2Fes-dedupe/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deric%2Fes-dedupe/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d
eric%2Fes-dedupe/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deric","download_url":"https://codeload.github.com/deric/es-dedupe/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250366682,"owners_count":21418768,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["duplicates","duplicity","elasticsearch"],"created_at":"2024-10-02T10:11:10.576Z","updated_at":"2025-04-23T03:48:43.347Z","avatar_url":"https://github.com/deric.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ES-dedupe\n\n[![](https://images.microbadger.com/badges/version/deric/es-dedupe.svg)](https://microbadger.com/images/deric/es-dedupe)\n[![](https://images.microbadger.com/badges/image/deric/es-dedupe.svg)](https://microbadger.com/images/deric/es-dedupe)\n\nA tool for removing duplicated documents that are grouped by some unique field (e.g. `--field Uuid`).\n\n## Usage\n\nUse `-h/--help` to see supported options:\n```\ndocker run --rm deric/es-dedupe:latest esdedupe --help\n```\nRemove duplicates from index `exact-index-name`, using the `Uuid` field as the unique key:\n\n```\ndocker run --rm deric/es-dedupe:latest esdedupe -H localhost -P 9200 -i exact-index-name -f Uuid \u003e es_dedupe.log 2\u003e\u00261\n```\n\n## Multiple unique fields\n\nBuild a local index using `md5(time,device_id)` as a unique key. 
This might require a significant amount of memory, depending on the size of your index: the mapping is stored as a Python dict with string keys and can easily grow to gigabytes.\n\n\n```bash\nesdedupe --host localhost --field time,device_id -i my_index --noop\n```\n\n\n## Examples\n\nA more advanced example with documents containing timestamps:\n\n```bash\nesdedupe -H localhost -f request_id -i nginx_access_logs-2021.01.29 -b 10000 --timestamp Timestamp --since \"2021-01-29T15:30:00.000Z\" --until \"2021-01-29T16:30:00.000Z\" --flush 1500 --request_timeout 180\n2021-02-01T19:58:25  [139754520647488] INFO  esdedupe elastic: es01, host: localhost, version: 7.6.0\n2021-02-01T19:58:25  [139754520647488] INFO  esdedupe Unique fields: ['request_id']\n2021-02-01T19:58:25  [139754520647488] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.01.29, batch size: 10000\n2021-02-01T19:59:16  [139754520647488] INFO  esdedupe Scanned 987,892 unique documents\n2021-02-01T19:59:16  [139754520647488] INFO  esdedupe Memory usage: 414.0MB\n2021-02-01T20:00:03  [139754520647488] INFO  esdedupe Scanned 1,950,957 unique documents\n2021-02-01T20:00:03  [139754520647488] INFO  esdedupe Memory usage: 695.0MB\n2021-02-01T20:00:46  [139754520647488] INFO  esdedupe Scanned 2,861,671 unique documents\n2021-02-01T20:00:46  [139754520647488] INFO  esdedupe Memory usage: 1007.3MB\n2021-02-01T20:01:37  [139754520647488] INFO  esdedupe Scanned 3,579,286 unique documents\n2021-02-01T20:01:37  [139754520647488] INFO  esdedupe Memory usage: 1.2GB\n2021-02-01T20:02:16  [139754520647488] INFO  esdedupe Found 810,993 duplicates out of 4,833,500 docs, unique documents: 4,022,507 (16.8% duplicates)\n100%█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 810001/810993 [7:39:44\u003c00:26, 37.16docs/s]\n2021-02-02T03:42:01  [139754520647488] INFO  esdedupe Deleted 1,621,986/810,993 
documents\n2021-02-02T03:42:01  [139754520647488] INFO  esdedupe Successfully completed duplicates removal. Took: 7:43:36.313482\n```\n\n\nWARNING: Running huge bulk operations on an Elasticsearch cluster might degrade its performance or even crash some nodes if the heap is not large enough.\n\nA sliding window (`-w / --window`) can be used to avoid running out of memory on larger indexes, provided your documents have a timestamp field:\n\n```bash\n$ esdedupe -H localhost -f request_id -i nginx_access_logs-2021.02.01 -b 10000 --timestamp Timestamp --since 2021-02-01T00:00:00 --until 2021-02-01T10:30:00 --flush 2500 --request_timeout 180 -w 10m --es-level WARN\n2021-02-07T01:27:07  [140045012879168] INFO  esdedupe Found 1,544 duplicates out of 162,805 docs, unique documents: 161,261 (0.9% duplicates)\n  0%|          | 1/1544 [00:17\u003c7:25:23, 17.32s/docs]2021-02-07T01:27:25  [140045012879168] INFO  esdedupe Deleted 3,088 documents (including shard replicas)\n2021-02-07T01:27:25  [140045012879168] INFO  esdedupe Using window 10m, from: 2021-02-01T09:30:00.000Z until: 2021-02-01T09:40:00.000Z\n2021-02-07T01:27:25  [140045012879168] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000\n100%|██████████| 1544/1544 [00:18\u003c00:00, 83.11docs/s]\n2021-02-07T01:27:33  [140045012879168] INFO  esdedupe Found 1,338 duplicates out of 162,882 docs, unique documents: 161,544 (0.8% duplicates)\n  0%|          | 1/1338 [00:19\u003c7:23:17, 19.89s/docs]2021-02-07T01:27:53  [140045012879168] INFO  esdedupe Deleted 2,676 documents (including shard replicas)\n2021-02-07T01:27:53  [140045012879168] INFO  esdedupe Using window 10m, from: 2021-02-01T09:40:00.000Z until: 2021-02-01T09:50:00.000Z\n2021-02-07T01:27:53  [140045012879168] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000\n100%|██████████| 1338/1338 [00:20\u003c00:00, 64.36docs/s]\n2021-02-07T01:28:02  [140045012879168] INFO  esdedupe Found 
1,321 duplicates out of 165,664 docs, unique documents: 164,343 (0.8% duplicates)\n  0%|          | 1/1321 [00:13\u003c4:56:58, 13.50s/docs]2021-02-07T01:28:15  [140045012879168] INFO  esdedupe Deleted 2,642 documents (including shard replicas)\n2021-02-07T01:28:15  [140045012879168] INFO  esdedupe Using window 10m, from: 2021-02-01T09:50:00.000Z until: 2021-02-01T10:00:00.000Z\n2021-02-07T01:28:15  [140045012879168] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000\n100%|██████████| 1321/1321 [00:14\u003c00:00, 88.39docs/s]\n2021-02-07T01:28:25  [140045012879168] INFO  esdedupe Found 1,291 duplicates out of 168,842 docs, unique documents: 167,551 (0.8% duplicates)\n  0%|          | 1/1291 [00:12\u003c4:20:59, 12.14s/docs]2021-02-07T01:28:37  [140045012879168] INFO  esdedupe Deleted 2,582 documents (including shard replicas)\n2021-02-07T01:28:37  [140045012879168] INFO  esdedupe Using window 10m, from: 2021-02-01T10:00:00.000Z until: 2021-02-01T10:10:00.000Z\n2021-02-07T01:28:37  [140045012879168] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000\n100%|██████████| 1291/1291 [00:15\u003c00:00, 82.91docs/s]\n2021-02-07T01:28:48  [140045012879168] INFO  esdedupe Found 1,371 duplicates out of 173,650 docs, unique documents: 172,279 (0.8% duplicates)\n  0%|          | 1/1371 [00:18\u003c7:07:57, 18.74s/docs]2021-02-07T01:29:07  [140045012879168] INFO  esdedupe Deleted 2,742 documents (including shard replicas)\n2021-02-07T01:29:07  [140045012879168] INFO  esdedupe Using window 10m, from: 2021-02-01T10:10:00.000Z until: 2021-02-01T10:20:00.000Z\n2021-02-07T01:29:07  [140045012879168] INFO  esdedupe Building documents mapping on index: nginx_access_logs-2021.02.01, batch size: 10000\n100%|██████████| 1371/1371 [00:19\u003c00:00, 68.59docs/s]\n2021-02-07T01:29:16  [140045012879168] INFO  esdedupe Found 1,340 duplicates out of 183,592 docs, unique documents: 182,252 (0.7% 
duplicates)\n  0%|          | 1/1340 [00:21\u003c8:00:21, 21.52s/docs]2021-02-07T01:29:38  [140045012879168] INFO  esdedupe Deleted 2,680 documents (including shard replicas)\n2021-02-07T01:29:38  [140045012879168] INFO  esdedupe Altogether 14115806 documents were removed (including doc replicas)\n2021-02-07T01:29:38  [140045012879168] INFO  esdedupe Total time: 1 day, 10:15:43.528495\n\n```\n\n## Requirements\nFor installation, use the tools provided by your operating system.\n\nOn Linux this can be one of the following: yum, dnf, apt, yast, emerge, ...\n\n* Install Python (2 or 3, both will work)\n* Install the `python*ujson` and `python*requests` packages matching your Python version\n\n\nOn Windows you are pretty much on your own, but fear not, you can do the following ;-)\n\n* Download and install a Python version from https://www.python.org/.\n* Open a console terminal, change to your copy of the es-dedupe repository, then run:\n`pip install -r requirements.txt`\n\n\n## Testing\n\nTests can be run from a Docker container. You can use the supplied `docker-compose` file:\n```bash\ndocker-compose up\n```\n\nTo run the tests manually:\n```bash\npip3 install -r requirements-dev.txt\npython3 -m pytest -v --capture=no tests/\n```\n\n\n## History\n\nOriginally written in bash, which performed terribly due to slow JSON processing with pipes and `jq`. Python with `ujson` seems to be better suited to this task.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fderic%2Fes-dedupe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fderic%2Fes-dedupe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fderic%2Fes-dedupe/lists"}