{"id":29823794,"url":"https://github.com/javier/deduplication-stats-questdb","last_synced_at":"2025-07-29T02:08:47.244Z","repository":{"id":200801569,"uuid":"706273126","full_name":"javier/deduplication-stats-questdb","owner":"javier","description":"Just some scripts to play with event deduplication and QuestDB","archived":false,"fork":false,"pushed_at":"2023-11-22T11:20:41.000Z","size":26,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2023-11-23T11:37:00.434Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/javier.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-10-17T16:12:21.000Z","updated_at":"2023-10-17T16:14:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"db2cd862-f52e-4e9b-8d8c-5e90ddb08ac6","html_url":"https://github.com/javier/deduplication-stats-questdb","commit_stats":null,"previous_names":["javier/deduplication-stats-questdb"],"tags_count":0,"template":null,"template_full_name":null,"purl":"pkg:github/javier/deduplication-stats-questdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javier%2Fdeduplication-stats-questdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javier%2Fdeduplication-stats-questdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javier%2Fdeduplication-stats-questdb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javier%2Fdeduplication-stats-questdb/manifests","owner_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/javier","download_url":"https://codeload.github.com/javier/deduplication-stats-questdb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javier%2Fdeduplication-stats-questdb/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267617643,"owners_count":24116208,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-29T02:08:46.660Z","updated_at":"2025-07-29T02:08:47.210Z","avatar_url":"https://github.com/javier.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# deduplication-stats-questdb\nJust some scripts to play with event deduplication and QuestDB. After releasing the `DEDUP` keyword I wanted to have an idea of how much overhead we would have when replaying a stream of events while removing duplicates. I was also curious to see if our overhead was in line with the performance impact deduplication has on other analytical databases. More information in this blog post https://questdb.io/blog/solving-duplicate-data-performant-deduplication/\n\n## disclosure\n\nI ran this experiment using the default installation for QuestDB, Clickhouse, and Timescale. 
I tried my best to follow documented best practices for ingestion. I am using Python with parallel processing, and for each database I used the recommended Python library with its recommended batching strategy. I am sure that with some tuning all of them would perform even better, but my goal here was not so much to compare the databases against each other as to measure the relative impact of ingesting data with no duplicates and then replaying the same ingestion, where every row is a duplicate.\n\n## some details about the experiment\n\nI am running this test on an AWS EC2 instance (m6a.4xlarge, 16 CPUs, 64 GB of RAM, gp3 EBS volume). I will be ingesting 15 uncompressed CSV files, each containing 12,614,400 rows, for a total of 189,216,000 rows representing 12 years of hourly data. The data represents synthetic ecommerce statistics, with one hourly entry per country (ES, DE, FR, IT, and UK) and category (WOMEN, MEN, KIDS, HOME, KITCHEN).\n\nThe dataset I built for this experiment can be found at https://mega.nz/folder/A1BjnSYQ#NQe5qhYLVBqiRwhWRmcVtg.\n\nData will be ingested into five tables (one per country) with this structure:\n\nStructure for QuestDB\n\n```\nCREATE TABLE IF NOT EXISTS  'ecommerce_sample_test_{country}' (\n                            ts TIMESTAMP,\n                            country SYMBOL capacity 256 CACHE,\n                            category SYMBOL capacity 256 CACHE,\n                            visits LONG,\n                            unique_visitors LONG,\n                            avg_unit_price DOUBLE,\n                            sales DOUBLE\n                            ) timestamp (ts) PARTITION BY DAY WAL DEDUP UPSERT KEYS(ts,country,category);\n```\n\nStructure for ClickHouse\n```\nCREATE TABLE IF NOT EXISTS  ecommerce_sample_test_{country} (\n                        ts datetime,\n                        country enum('UK'=1, 'DE'=2, 'FR'=3, 'IT'=4, 'ES'=5),\n                        category enum('WOMEN'=1, 'MEN'=2, 'KIDS'=3, 'HOME'=4, 
'KITCHEN'=5, 'BATHROOM'=6),\n                        visits UInt32,\n                        unique_visitors UInt32,\n                        avg_unit_price Decimal32(4),\n                        sales  Decimal64(4)\n                        ) ENGINE = ReplacingMergeTree\n                        PRIMARY KEY(ts, country, category);\n```\n\nStructure for Timescale\n```\nCREATE TABLE IF NOT EXISTS  ecommerce_sample_{country} (\n                        ts TIMESTAMPTZ,\n                        country TEXT,\n                        category TEXT,\n                        visits INT,\n                        unique_visitors INT,\n                        avg_unit_price DOUBLE PRECISION NULL,\n                        sales DOUBLE PRECISION NULL,\n                        UNIQUE (ts, country, category)\n                        );\nCREATE UNIQUE INDEX IF NOT EXISTS ecommerce_sample_{country}_unique_idx ON ecommerce_sample_{country}(ts,country, category);\nSELECT create_hypertable('ecommerce_sample_{country}', 'ts', if_not_exists =\u003e TRUE);\nCREATE INDEX IF NOT EXISTS ecommerce_sample_{country}_idx ON ecommerce_sample_{country}(ts,country, category);\n```\n\nThe raw CSVs total about 17 GB, and I am reading them from a RAM disk to minimise the impact of file reads. I am reading, parsing, and ingesting up to 8 files in parallel. 
The scripts are written in Python, so we could very likely speed up ingestion a bit by parsing the CSVs in a different language. This is not a benchmark, though: we just want a ballpark figure for the impact of `DEDUP` on ingestion.\nThe dataset I created for this experiment is available at https://mega.nz/folder/A1BjnSYQ#NQe5qhYLVBqiRwhWRmcVtg, and the scripts can be found at https://github.com/javier/deduplication-stats-questdb.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavier%2Fdeduplication-stats-questdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjavier%2Fdeduplication-stats-questdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavier%2Fdeduplication-stats-questdb/lists"}