{"id":25095035,"url":"https://github.com/ClickHouse/JSONBench","last_synced_at":"2025-10-22T11:31:05.725Z","repository":{"id":274728700,"uuid":"922274030","full_name":"ClickHouse/JSONBench","owner":"ClickHouse","description":"JSONBench: a Benchmark For Data Analytics On JSON","archived":false,"fork":false,"pushed_at":"2025-02-05T22:04:58.000Z","size":478,"stargazers_count":84,"open_issues_count":2,"forks_count":0,"subscribers_count":38,"default_branch":"main","last_synced_at":"2025-02-05T23:19:50.374Z","etag":null,"topics":["analytics","benchmark","clickhouse","database","duckdb","elasticsearch","json","mongodb","postgresql","sql"],"latest_commit_sha":null,"homepage":"https://jsonbench.com/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ClickHouse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-25T19:14:43.000Z","updated_at":"2025-02-05T22:05:02.000Z","dependencies_parsed_at":"2025-01-29T02:31:51.373Z","dependency_job_id":"d1b13f7a-9a14-4c69-941d-4ccd45c1ef02","html_url":"https://github.com/ClickHouse/JSONBench","commit_stats":null,"previous_names":["clickhouse/jsonbench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2FJSONBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2FJSONBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ClickHouse%2FJSONBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/ClickHouse%2FJSONBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ClickHouse","download_url":"https://codeload.github.com/ClickHouse/JSONBench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237676236,"owners_count":19348572,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","benchmark","clickhouse","database","duckdb","elasticsearch","json","mongodb","postgresql","sql"],"created_at":"2025-02-07T16:01:02.956Z","updated_at":"2025-10-22T11:31:05.720Z","avatar_url":"https://github.com/ClickHouse.png","language":"HTML","readme":"# JSONBench: A Benchmark For Data Analytics on JSON\n\n## Overview\n\nThis benchmark compares the native JSON support of the most popular analytical databases.\n\nThe [dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) is a collection of files containing JSON objects delimited by newline (ndjson).\nThis was obtained using Jetstream to collect Bluesky events.\nThe dataset contains 1 billion Bluesky events and is currently hosted on a public S3 bucket.\n\nWe wrote a [detailed blog post](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql) about JSONBench, explaining how it works and showcasing benchmark results for five databases: ClickHouse, MongoDB, Elasticsearch, DuckDB, and PostgreSQL.\n\n## Principles\n\nThe [main 
principles](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#benchmark-methodology) of this benchmark are:\n\n### Reproducibility\n\nIt is easy to reproduce every test in a semi-automated way (although for some systems it may take from several hours to days).\nThe test setup is documented and uses inexpensive cloud VMs.\nThe test process is available in the form of a shell script, covering the installation of each database, loading of the data, running the workload, and collecting the result numbers.\nThe dataset is published and made available for download in multiple formats.\n\n### Realism\n\n[The dataset](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) represents real-world production data.\nThe realistic data distribution makes it possible to account appropriately for compression, indices, codecs, custom data structures, etc., something that is not possible with most random data generators.\nJSONBench also tests various aspects of the hardware: some queries require high storage throughput, some benefit from a large number of CPU cores, some from single-core speed, and some from high main-memory bandwidth.\n\n### Fairness\n\nDatabases must be benchmarked using their default settings.\nAs an exception, it is okay to specify non-default settings if they are a prerequisite for running the benchmark (example: increasing the maximum JVM heap size).\nNon-mandatory settings, especially settings related to workload tuning, are not allowed.\n\nSome databases provide a native JSON data type that flattens nested JSON documents at insertion time to a single level, typically using `.` as the separator between levels.\nWe consider this a grey zone.\nOn the one hand, flattening removes the ability to restore the original documents.\nOn the other hand, flattening is acceptable in many practical situations.\nThe dashboard 
provides a toggle to show or hide databases that use flattening.\nIn the scope of JSONBench, we generally discourage flattening.\n\nOther forms of flattening, in particular flattening JSON into multiple non-JSON columns at insertion time, are disallowed.\n\nIt is allowed to index the data using clustered indexes (= specifying the table sort order) or non-clustered indexes (= additional data structures, e.g. B-trees).\nWe recognize that there are pros and cons to this approach.\n\nPros:\n- The JSON documents in JSONBench share a common and rather static structure. Many real-world use cases exhibit similar patterns. It is a widely used practice to create indexes based on the anticipated data structure.\n- The original [blog post](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#some-json-paths-can-be-used-for-indexes-and-data-sorting) made use of indexes. Disallowing clustered indexes entirely would invalidate the original measurements.\n\nCons:\n- There may be other real-world use cases where the JSON documents are highly dynamic (they share no common structure). In these cases, indexes are not useful.\n- Many databases use indexes to prune the set of scanned data ranges or retrieve result rows directly (no scan). As a result, the benchmark indirectly also measures the effectiveness of such access path optimization techniques.\n- Likewise, clustered indexes impact how well the data can be compressed. Again, this affects query runtimes indirectly.\n- In some databases, clustered indexes must be built on top of flattened (e.g. concatenated and materialized) JSON documents. 
This technically contradicts the previous statement that flattening is discouraged.\n\nIt is not allowed to cache query results (or generally: intermediate results at the end of the query processing pipeline) between hot runs.\n\n## Goals\n\nThe goal is to advance the possibilities of data analytics on semistructured data.\nThis benchmark is influenced by **[ClickBench](https://github.com/ClickHouse/ClickBench)**, which was published in 2022 and has helped improve the performance, capabilities, and stability of many analytics databases.\nWe would like to see **JSONBench** have a similar impact on the community.\n\n## Limitations\n\nThe benchmark focuses on data analytics queries over JSON documents rather than single-value retrieval or data modification operations.\nThe benchmark does not record data loading times.\nWhile this was one of the initial goals, many systems require a finicky multi-step data preparation process, which makes them difficult to compare.\n\n## Prerequisites\n\nTo run the benchmark with 1 billion rows, it is important to provision a machine with sufficient resources and disk space.\nThe full compressed dataset takes 125 GB of disk space; uncompressed, it takes up to 425 GB.\n\nFor reference, the initial benchmarks were run on the following machine:\n- Hardware: m6i.8xlarge AWS EC2 instance with 10 TB gp3 disks\n- OS: Ubuntu 24.04\n\nIf you're interested in running the full benchmark, be aware that it will take several hours or days, depending on the database.\n\n## Usage\n\nEach folder contains the scripts required to run the benchmark on a database; for example, the [clickhouse](./clickhouse/) folder contains the scripts to run the benchmark on ClickHouse.\n\nThe full dataset contains 1 billion rows, but the benchmark runs for [different dataset sizes](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) (1 million, 10 million, 100 million and 1 billion rows) 
and [compression settings](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#the-json-dataset---a-billion-bluesky-events) in order to compare results at different scales.\n\n### Download the data\n\nStart by downloading the dataset using the script [`download_data.sh`](./download_data.sh).\nWhen running the script, you will be prompted for the dataset size you want to download.\nIf you just want to test it out, we recommend starting with the default 1m dataset.\nIf you are interested in reproducing the results at scale, go with the full dataset (1 billion rows).\n\n```\n./download_data.sh\n\nSelect the dataset size to download:\n1) 1m (default)\n2) 10m\n3) 100m\n4) 1000m\nEnter the number corresponding to your choice:\n```\n\n### Run the benchmark\n\nNavigate to the folder corresponding to the database you want to benchmark.\n\nThe script `main.sh` runs the benchmark.\n\nUsage: `main.sh \u003cDATA_DIRECTORY\u003e \u003cSUCCESS_LOG\u003e \u003cERROR_LOG\u003e \u003cOUTPUT_PREFIX\u003e`\n\n- `\u003cDATA_DIRECTORY\u003e`: The directory where the dataset is stored. The default is `~/data/bluesky`.\n- `\u003cSUCCESS_LOG\u003e`: The file to log successful operations. The default is `success.log`.\n- `\u003cERROR_LOG\u003e`: The file to log errors. The default is `error.log`.\n- `\u003cOUTPUT_PREFIX\u003e`: The prefix for output files. 
The default is `_m6i.8xlarge`.\n\nFor example, for ClickHouse:\n\n```\ncd clickhouse\n./main.sh\n\nSelect the dataset size to benchmark:\n1) 1m (default)\n2) 10m\n3) 100m\n4) 1000m\n5) all\nEnter the number corresponding to your choice:\n```\n\nEnter the dataset size for which you want to run the benchmark, then press Enter.\n\nThe script installs the database system on the current machine and then prepares and runs the benchmark.\n\n### Retrieve results\n\nThe results of the benchmark are stored within each folder in files prefixed with `$OUTPUT_PREFIX` (default: `_m6i.8xlarge`).\n\nBelow is a description of the files that might be generated as a result of the benchmark. Depending on the database, some files might not be generated because they are not relevant.\n\n- `.total_size`: Contains the total size of the dataset.\n- `.data_size`: Contains the data size of the dataset.\n- `.index_size`: Contains the index size of the dataset.\n- `.index_usage`: Contains the index usage statistics.\n- `.physical_query_plans`: Contains the physical query plans.\n- `.results_runtime`: Contains the runtime results of the benchmark.\n- `.results_memory_usage`: Contains the memory usage results of the benchmark.\n\nThe last step of our benchmark is manual (PRs to automate this last step are welcome).\nWe manually copy the information from the output files into the final result JSON documents, which we add to the `results` subdirectory within the benchmark candidate's subdirectory.\n\nFor example, this is the [results](./clickhouse/results) directory for our ClickHouse benchmark results.\n\n## Add a new database\n\nWe highly welcome new entries in the benchmark! 
Please don't hesitate to contribute one.\nYou don't have to be affiliated with the database engine to contribute to the benchmark.\n\nWe welcome all types of databases, including open-source and closed-source, commercial and experimental, distributed or embedded, except one-off customized builds for the benchmark.\n\nWhile the main benchmark uses a specific machine configuration for reproducibility, we will be interested in receiving results for cloud services and data lakes for reference comparisons.\n\n- [x] [ClickHouse](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#clickhouse)\n- [x] [Elasticsearch](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#elasticsearch)\n- [x] [MongoDB](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#mongodb)\n- [x] [DuckDB](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#duckdb)\n- [x] [PostgreSQL](https://clickhouse.com/blog/json-bench-clickhouse-vs-mongodb-elasticsearch-duckdb-postgresql#postgresql)\n- [x] VictoriaLogs\n- [x] SingleStore\n- [x] GreptimeDB\n- [x] FerretDB\n- [x] Apache Doris\n- [ ] Quickwit\n- [ ] Meilisearch\n- [ ] Sneller\n- [ ] Snowflake\n- [ ] Manticore Search\n- [ ] SurrealDB\n- [ ] OpenText Vertica\n- [ ] PartiQL\n- [ ] FishStore\n- [ ] Apache Drill\n- [ ] GlareDB\n\n## Similar projects\n\n[The fastest command-line tools for querying large JSON datasets](https://colab.research.google.com/github/dcmoura/spyql/blob/master/notebooks/json_benchmark.ipynb)\n","funding_links":[],"categories":["Shell","UIs"],"sub_categories":["CLI"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2FJSONBench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FClickHouse%2FJSONBench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FClickHouse%2FJSONBench/lists"}