{"id":13749909,"url":"https://github.com/oliver006/elasticsearch-test-data","last_synced_at":"2025-04-05T10:09:21.858Z","repository":{"id":26379433,"uuid":"29828771","full_name":"oliver006/elasticsearch-test-data","owner":"oliver006","description":"Generate and upload test data to Elasticsearch for performance and load testing","archived":false,"fork":false,"pushed_at":"2024-06-07T04:07:42.000Z","size":45,"stargazers_count":257,"open_issues_count":6,"forks_count":122,"subscribers_count":13,"default_branch":"master","last_synced_at":"2024-10-12T22:10:59.473Z","etag":null,"topics":["data","elasticsearch","python","test-data","tornado"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oliver006.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-01-25T20:04:04.000Z","updated_at":"2024-09-21T14:50:48.000Z","dependencies_parsed_at":"2024-06-07T05:37:35.497Z","dependency_job_id":null,"html_url":"https://github.com/oliver006/elasticsearch-test-data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-test-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-test-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felasticsearch-test-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oliver006%2Felastic
search-test-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oliver006","download_url":"https://codeload.github.com/oliver006/elasticsearch-test-data/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247318745,"owners_count":20919484,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data","elasticsearch","python","test-data","tornado"],"created_at":"2024-08-03T07:01:18.131Z","updated_at":"2025-04-05T10:09:21.836Z","avatar_url":"https://github.com/oliver006.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Elasticsearch For Beginners: Generate and Upload Randomized Test Data\n\nBecause everybody loves test data.\n\n## Ok, so what is this thing doing?\n\n`es_test_data.py` lets you generate and upload randomized test data to\nyour ES cluster so you can start running queries, see what performance\nis like, and verify your cluster is able to handle the load.\n\nIt allows for easy configuring of what the test documents look like, what\nkind of data types they include and what the field names are called.\n\n## Cool, how do I use this? \n\n### Run Python script\n\nLet's assume you have an Elasticsearch cluster running.\n\nPython and [Tornado](https://github.com/tornadoweb/tornado/) are used. 
Run\n`pip install tornado` to install Tornado if you don't have it already.\n\nIt's as simple as this:\n\n```\n$ python es_test_data.py --es_url=http://localhost:9200\n[I 150604 15:43:19 es_test_data:42] Trying to create index http://localhost:9200/test_data\n[I 150604 15:43:19 es_test_data:47] Guess the index exists already\n[I 150604 15:43:19 es_test_data:184] Generating 10000 docs, upload batch size is 1000\n[I 150604 15:43:19 es_test_data:62] Upload: OK - upload took:    25ms, total docs uploaded:    1000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    25ms, total docs uploaded:    2000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    19ms, total docs uploaded:    3000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    18ms, total docs uploaded:    4000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    27ms, total docs uploaded:    5000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    19ms, total docs uploaded:    6000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    15ms, total docs uploaded:    7000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    24ms, total docs uploaded:    8000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    32ms, total docs uploaded:    9000\n[I 150604 15:43:20 es_test_data:62] Upload: OK - upload took:    31ms, total docs uploaded:   10000\n[I 150604 15:43:20 es_test_data:216] Done - total docs uploaded: 10000, took 1 seconds\n[I 150604 15:43:20 es_test_data:217] Bulk upload average:           23 ms\n[I 150604 15:43:20 es_test_data:218] Bulk upload median:            24 ms\n[I 150604 15:43:20 es_test_data:219] Bulk upload 95th percentile:   31 ms\n```\n \nWithout any command line options, it will generate and upload 1000 documents\nof the format\n\n```\n{\n    \"name\":\u003c\u003cstr\u003e\u003e,\n    \"age\":\u003c\u003cint\u003e\u003e,\n    \"last_updated\":\u003c\u003cts\u003e\u003e\n}\n```\nto an 
Elasticsearch cluster at `http://localhost:9200`, into an index called\n`test_data`.\n\n### Docker and Docker Compose\n\nRequires [Docker](https://docs.docker.com/get-docker/) for running the app and [Docker Compose](https://docs.docker.com/compose/install/) for running a single Elasticsearch domain with two nodes (es1 and es2).\n\n1. Set `vm.max_map_count` on your machine to `262144`, otherwise the Elasticsearch instances will crash, [see the docs](https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html)\n    ```bash\n    $ sudo sysctl -w vm.max_map_count=262144\n    ```\n1. Clone this repository\n    ```bash\n    $ git clone https://github.com/oliver006/elasticsearch-test-data.git\n    $ cd elasticsearch-test-data\n    ```\n1. Run the Elasticsearch stack\n    ```bash\n    $ docker-compose up --detach\n    ```\n1. Run the app and inject random data into the ES stack\n    ```bash\n    $ docker run --rm -it --network host oliver006/es-test-data  \\\n        --es_url=http://localhost:9200  \\\n        --batch_size=10000  \\\n        --username=elastic \\\n        --password=\"esbackup-password\"\n    ```\n1. 
Cleanup\n    ```bash\n    $ docker-compose down --volumes\n    ```\n\n## Not bad but what can I configure?\n\n`python es_test_data.py --help` gives you the full set of command line\noptions, here are the most important ones:\n\n- `--es_url=http://localhost:9200` the base URL of your ES node, don't\n  include the index name\n- `--username=\u003cusername\u003e` the username when basic auth is required\n- `--password=\u003cpassword\u003e` the password when basic auth is required\n- `--count=###` number of documents to generate and upload\n- `--index_name=test_data` the name of the index to upload the data to.\n  If it doesn't exist it'll be created with these options\n  - `--num_of_shards=2` the number of shards for the index\n  - `--num_of_replicas=0` the number of replicas for the index\n- `--batch_size=###` we use bulk upload to send the docs to ES, this option\n  controls how many we send at a time\n- `--force_init_index=False` if `True` it will delete and re-create the index\n- `--dict_file=filename.dic` if provided the `dict` data type will use words\n  from the dictionary file, format is one word per line. The entire file is\n  loaded at start-up so be careful with (very) large files.\n- `--data_file=filename.json|filename.csv` if provided all data in the file will be inserted into ES. The file content has to be an array of JSON objects (the documents). 
If the file ends in `.csv` then the data is automatically converted into JSON and inserted as documents.\n\n## What about the document format?\n\nGlad you're asking, let's get to the doc format.\n\nThe doc format is configured via `--format=\u003c\u003cFORMAT\u003e\u003e` with the default being\n`name:str,age:int,last_updated:ts`.\n\nThe general syntax looks like this:\n\n`\u003c\u003cfield_name\u003e\u003e:\u003c\u003cfield_type\u003e\u003e,\u003c\u003cfield_name\u003e\u003e:\u003c\u003cfield_type\u003e\u003e, ...`\n\nFor every document, `es_test_data.py` will generate random values for each of\nthe fields configured.\n\nCurrently supported field types are:\n\n- `bool` returns a random true or false\n- `ts` a timestamp (in milliseconds), randomly picked between now +/- 30 days\n- `ipv4` returns a random IPv4 address\n- `tstxt` a timestamp in the \"%Y-%m-%dT%H:%M:%S.000-0000\" format, randomly\n  picked between now +/- 30 days\n- `int:min:max` a random integer between `min` and `max`. If `min` and `max`\n  are not provided they default to 0 and 100000\n- `str:min:max` a word (as in, a string), made up of `min` to `max` random\n  upper/lowercase and digit characters. `min` and `max` are optional,\n  defaulting to `3` and `10`\n- `words:min:max` a random number of `str`s, separated by space, `min` and\n  `max` are optional, defaulting to `2` and `10`\n- `dict:min:max` a random number of entries from the dictionary file,\n  separated by space, `min` and `max` are optional, defaulting to `2` and `10`\n- `text:words:min:max` a random number of words separated by space from a\n  given list of `-` separated words, the words are optional defaulting to\n  `text1` `text2` and `text3`, min and max are optional, defaulting to `1`\n  and `1`\n- `arr:[array_length_expression]:[single_element_format]` an array of entries \n  with format specified by `single_element_format`. `array_length_expression` \n  can be either a single number, or a pair of numbers separated by `-` (e.g. 
3-7), \n  defining the range of lengths from which a random length will be picked for each array\n  (Example `int_array:arr:1-5:int:1:250`)\n\n\n## Todo\n\n- document the remaining cmd line options\n- more different format types\n- ...\n\nAll suggestions, comments, ideas, pull requests are welcome!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-test-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foliver006%2Felasticsearch-test-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foliver006%2Felasticsearch-test-data/lists"}