https://github.com/hatamiarash7/elasticsearch-dump
Imports raw JSON to Elasticsearch in a multi-thread way
- Host: GitHub
- URL: https://github.com/hatamiarash7/elasticsearch-dump
- Owner: hatamiarash7
- License: gpl-3.0
- Created: 2020-04-13T13:03:27.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2024-04-01T20:48:44.000Z (about 2 years ago)
- Last Synced: 2025-04-18T10:23:30.584Z (about 1 year ago)
- Topics: big-data, bigdata, bulk-inserts, bulk-loader, bulk-operation, bulkimport, elasticsearch, json, json-data, multi-threading, multithreading, python, threading
- Language: Python
- Size: 127 KB
- Stars: 9
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# ElasticSearch BigData importer
Imports raw JSON into Elasticsearch using multiple threads.

The script supports five modes:
- Only validating data
- Import data to ElasticSearch without validation
  - Import using single-thread
  - Import using multi-thread
- Import data to ElasticSearch after validation
  - Import using single-thread
  - Import using multi-thread
## Prerequisites
Install the elasticsearch package with [pip](https://pypi.python.org/pypi/elasticsearch) :
```bash
pip install elasticsearch
```
Read more about client and server version compatibility [here](https://github.com/elastic/elasticsearch-py#compatibility)
## Use
### Options
```
--data : Path to the data file
--check : Validate the data file
--bulk : ElasticSearch endpoint ( e.g. http://localhost:9200 )
--index : Index name
--type : Index type
--import : Import data to ES
--thread : Number of threads, default = 1
--help : Display help message
```
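As an illustration of how these options fit together, here is a minimal `argparse` sketch. It is not the actual code of `import.py`; the function name `build_parser` and the attribute name `do_import` are hypothetical, and no defaults are assumed beyond the documented `--thread` default of 1.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Import raw JSON to Elasticsearch"
    )
    parser.add_argument("--data", help="The data file")
    parser.add_argument("--check", action="store_true", help="Validate data file")
    parser.add_argument("--bulk", help="Elasticsearch endpoint")
    parser.add_argument("--index", help="Index name")
    parser.add_argument("--type", help="Index type")
    # `import` is a Python keyword, so `args.import` would be a syntax error.
    # The flag is stored under a different (hypothetical) attribute name.
    parser.add_argument(
        "--import", dest="do_import", action="store_true", help="Import data to ES"
    )
    parser.add_argument("--thread", type=int, default=1, help="Threads amount")
    # argparse adds --help automatically.
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--data", "test_data.json", "--import", "--thread", "16"]
)
print(args.data, args.do_import, args.thread)
```

Combining `--check` and `--import` in one invocation corresponds to the "import after validation" modes listed earlier.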
### Validate data
I suggest validating your data before ( or during ) the import process
```bash
python import.py --data test_data.json --check
```
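To give an idea of what such a check can look like, here is a small sketch. It assumes the data file holds one JSON object per line, which this README does not state explicitly, and the function name `find_invalid_lines` is hypothetical rather than taken from `import.py`.

```python
import json
from typing import List, Tuple


def find_invalid_lines(path: str) -> List[Tuple[int, str]]:
    """Return (line_number, error_message) for each line that is not valid JSON."""
    errors = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                # Assumption: blank lines are harmless and can be skipped.
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append((number, exc.msg))
    return errors
```

An empty result means every non-blank line parsed as JSON. Reading the file line by line keeps memory usage low even for very large inputs.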
### Single Thread
##### Import without validation
```bash
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name
```
##### Import after validation
```bash
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check
```
### Multi Thread
##### Import without validation
```bash
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --thread 16
```
##### Import after validation
```bash
python import.py --data test_data.json --import --bulk http://localhost:9200 --index index_name --type type_name --check --thread 16
```
---
The multi-threaded mode is much faster, although the speedup depends on your computer/server resources. This script uses `linecache` to load the data into RAM, so you also need enough memory to hold the data file
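The general idea can be sketched as follows. This is not the code of `import.py`: the function names, the `chunk_size` parameter, and the one-object-per-line file layout are assumptions made for illustration. Each worker thread reads its own range of lines through `linecache` and hands the parsed documents to a `send` callable.

```python
import json
import linecache
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def import_range(path: str, start: int, stop: int,
                 send: Callable[[List[dict]], None]) -> int:
    """Parse lines start..stop-1 (1-based) and pass them to `send` as one batch."""
    docs = []
    for number in range(start, stop):
        line = linecache.getline(path, number)
        if line.strip():
            docs.append(json.loads(line))
    if docs:
        send(docs)
    return len(docs)


def import_file(path: str, threads: int,
                send: Callable[[List[dict]], None],
                chunk_size: int = 1000) -> int:
    """Split the file into line ranges and process them on a thread pool."""
    # getlines() loads the whole file into linecache's in-memory cache,
    # so the workers below only read from RAM.
    total = len(linecache.getlines(path))
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(import_range, path, start,
                        min(start + chunk_size, total + 1), send)
            for start in range(1, total + 1, chunk_size)
        ]
        return sum(future.result() for future in futures)


def make_es_sender(endpoint: str, index: str) -> Callable[[List[dict]], None]:
    """Build a `send` callable backed by the elasticsearch-py bulk helper."""
    # Imported lazily so the rest of the sketch runs without the client installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(endpoint)

    def send(docs: List[dict]) -> None:
        helpers.bulk(client, ({"_index": index, "_source": doc} for doc in docs))

    return send
```

A call such as `import_file("test_data.json", 16, make_es_sender("http://localhost:9200", "index_name"))` would mirror the multi-thread command above. The sketch leaves out the `--type` option, since whether a mapping type is accepted depends on the Elasticsearch version. Threads help here mainly because each worker spends most of its time waiting on network I/O.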
## My test setup
- AMD Ryzen 3800X ( 8 core / 16 thread )
- 64 GB RAM ( 3000MHz / CL16 )
- Windows 10
- 10 GB JSON file with **~24 million** objects
- Elasticsearch v7

The whole process took about **30 minutes**, and resource usage was efficient

## Support
[Support me on Ko-fi](https://ko-fi.com/D1D1WGU9)
## Contributing
1. Fork it!
2. Create your feature branch : `git checkout -b my-new-feature`
3. Commit your changes : `git commit -am 'Add some feature'`
4. Push to the branch : `git push origin my-new-feature`
5. Submit a pull request :D
## Issues
Every project has its problems. You can contribute to the development of this project by reporting any you find