{"id":15574929,"url":"https://github.com/hatamiarash7/elasticsearch-dump","last_synced_at":"2025-04-24T02:27:37.151Z","repository":{"id":73239701,"uuid":"255328732","full_name":"hatamiarash7/elasticsearch-dump","owner":"hatamiarash7","description":"Imports raw JSON to Elasticsearch in a multi-thread way","archived":false,"fork":false,"pushed_at":"2024-04-01T20:48:44.000Z","size":130,"stargazers_count":9,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-18T10:23:30.584Z","etag":null,"topics":["big-data","bigdata","bulk-inserts","bulk-loader","bulk-operation","bulkimport","elasticsearch","json","json-data","multi-threading","multithreading","python","threading"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hatamiarash7.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-04-13T13:03:27.000Z","updated_at":"2024-08-07T20:22:55.000Z","dependencies_parsed_at":"2025-04-17T21:55:13.662Z","dependency_job_id":"52a8afe6-60fb-4f35-98aa-b1776a7d9c14","html_url":"https://github.com/hatamiarash7/elasticsearch-dump","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2Felasticsearch-dump","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2Felasticsearch-dump/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/h
atamiarash7%2Felasticsearch-dump/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2Felasticsearch-dump/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hatamiarash7","download_url":"https://codeload.github.com/hatamiarash7/elasticsearch-dump/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250547327,"owners_count":21448469,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","bigdata","bulk-inserts","bulk-loader","bulk-operation","bulkimport","elasticsearch","json","json-data","multi-threading","multithreading","python","threading"],"created_at":"2024-10-02T18:21:26.977Z","updated_at":"2025-04-24T02:27:37.138Z","avatar_url":"https://github.com/hatamiarash7.png","language":"Python","funding_links":["https://ko-fi.com/D1D1WGU9"],"categories":[],"sub_categories":[],"readme":"# ElasticSearch BigData importer\n\n![GitHub last commit](https://img.shields.io/github/last-commit/hatamiarash7/elasticsearch-dump) ![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/hatamiarash7/elasticsearch-dump) [![GitHub license](https://img.shields.io/github/license/hatamiarash7/elasticsearch-dump)](https://github.com/hatamiarash7/elasticsearch-dump/blob/master/LICENSE) [![Open Source Love](https://badges.frapsoft.com/os/v1/open-source.png?v=103)](https://github.com/ellerbrock/open-source-badges/)\n\nImports raw JSON to Elasticsearch in a multi-thread way\n\n![diagram](Diagram.png)\n\nWe have 5 state here\n\n- Only validating 
data\n- Import data to ElasticSearch without validation\n  - Import using single-thread\n  - Import using multi-thread\n- Import data to ElasticSearch after validation\n  - Import using single-thread\n  - Import using multi-thread\n\n## Prerequisites\n\nInstall the `elasticsearch` package with [pip](https://pypi.python.org/pypi/elasticsearch):\n\n```bash\npip install elasticsearch\n```\n\nRead more about version compatibility [here](https://github.com/elastic/elasticsearch-py#compatibility).\n\n## Use\n\n### Options\n\n```\n--data          : The data file\n--check         : Validate data file\n--bulk          : ElasticSearch endpoint ( http://localhost:9200 )\n--index         : Index name\n--type          : Index type\n--import        : Import data to ES\n--thread        : Number of threads, default = 1\n--help          : Display help message\n```\n\n### Validate data\n\nI suggest validating your data before (or during) the import process.\n\n```bash\npython import.py --data test_data.json --check\n```\n\n### Single Thread\n\n##### Import without validation\n\n```bash\npython import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name\n```\n\n##### Import after validation\n\n```bash\npython import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check\n```\n\n### Multi Thread\n\n##### Import without validation\n\n```bash\npython import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --thread 16\n```\n\n##### Import after validation\n\n```bash\npython import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check --thread 16\n```\n\n---\n\nThe multi-threaded import is much faster, though the speedup depends on your computer/server resources.\n
This script uses `linecache` to load the data into RAM, so you also need enough memory.\n\n## My test setup\n\n- AMD Ryzen 3800X ( 8 core / 16 thread )\n- 64 GB RAM ( 3000MHz / CL16 )\n- Windows 10\n- 10 GB JSON file with **~24 million** objects\n- Elasticsearch v7\n\nThe whole process took **~30 minutes**, and resource usage was efficient.\n\n![usage](threads.png)\n\n## Support\n\n[![ko-fi](https://www.ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/D1D1WGU9)\n\n## Contributing\n\n1. Fork it!\n2. Create your feature branch : `git checkout -b my-new-feature`\n3. Commit your changes : `git commit -am 'Add some feature'`\n4. Push to the branch : `git push origin my-new-feature`\n5. Submit a pull request :D\n\n## Issues\n\nEvery project has problems. You can contribute to the development of this project by reporting them.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhatamiarash7%2Felasticsearch-dump","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhatamiarash7%2Felasticsearch-dump","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhatamiarash7%2Felasticsearch-dump/lists"}