{"id":19932052,"url":"https://github.com/amazon-science/redset","last_synced_at":"2025-08-24T19:46:01.411Z","repository":{"id":248569296,"uuid":"827390864","full_name":"amazon-science/redset","owner":"amazon-science","description":"Redset is a dataset containing three months worth of user query metadata that ran on a selected sample of instances in the Amazon Redshift fleet. We provide query metadata for 200 provisioned and serverless instances each.","archived":false,"fork":false,"pushed_at":"2024-09-12T16:33:40.000Z","size":22,"stargazers_count":56,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-01T11:32:48.380Z","etag":null,"topics":["dbms","redshift"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/amazon-science.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-11T15:05:49.000Z","updated_at":"2025-02-18T11:31:10.000Z","dependencies_parsed_at":"2024-09-13T03:50:12.036Z","dependency_job_id":null,"html_url":"https://github.com/amazon-science/redset","commit_stats":null,"previous_names":["amazon-science/redset"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/amazon-science/redset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fredset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fredset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/amazon-science%2Fredset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fredset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/amazon-science","download_url":"https://codeload.github.com/amazon-science/redset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/amazon-science%2Fredset/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271936461,"owners_count":24846739,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-24T02:00:11.135Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dbms","redshift"],"created_at":"2024-11-12T23:08:53.716Z","updated_at":"2025-08-24T19:46:01.328Z","avatar_url":"https://github.com/amazon-science.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## Redset\nRedset is a dataset containing three months worth of user query metadata that\nran on a selected sample of instances in the Amazon Redshift fleet. 
We provide\nquery metadata for 200 provisioned and serverless instances each.\n\n\u003cmark style=\"background-color: lightyellow\"\u003eAs stated in the [paper](https://assets.amazon.science/24/3b/04b31ef64c83acf98fe3fdca9107/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet.pdf), Redset is not intended to be representative of Redshift as a whole. Instead, Redset provides biased sample data to support the development of new benchmarks for these specific workloads. For the fleet analysis and sampling methodology, please see the paper.\u003c/mark\u003e\n\n## Security\nSee [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.\n\n## License\nRedset © 2024 by Amazon is licensed under Creative Commons\nAttribution-NonCommercial 4.0 International.\n\n## FAQ\n**Does the paper study Redset?** No. The paper studies metadata generated by the entire Amazon Redshift fleet. Redset is a non-representative, biased random sample of this metadata. The dataset is released as a standalone contribution from the paper with the express purpose of aiding the development or augmentation of future benchmarks, as well as enabling exploration of ML-based techniques, e.g., for workload forecasting.\n\n**Is Redset a representative sample?** No. As stated in Section 6 of [the paper](https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet), the Redset workloads are a biased sample based on a “busy-ness” score (see the paper for the exact definition). This ensures a diverse set of example workloads that can be studied individually. However, aggregating over the workloads in the dataset will not yield a representative view of the overall Amazon Redshift fleet.\n\n**Does each Redset cluster equal one customer?** No. A customer/organization often has several clusters of various sizes and purposes. Redset does not disclose whom a cluster belongs to. 
It is therefore impossible to draw conclusions about the number of customers in Redset and, for the same reason, impossible to make statements like “X% of customers do/have Y”.\n\n**Can I use scanned bytes to approximate table sizes?** No. There are various factors in the execution engine and Amazon Redshift’s overall design that simply make it impossible to derive or even approximate table sizes from scanned bytes. To give two examples: 1) Redshift uses columnar storage and thus accesses only data that is actually queried; 2) Redshift implements block skipping (and other execution techniques) to tremendously limit the amount of data it needs to process.\n\n\n## Download\nFolder structure:\n* s3://redshift-downloads/redset\n  * README\n  * LICENSE\n  * provisioned/\n    * full.parquet\n    * sample_0.01.parquet (1% uniform random data sample)\n    * sample_0.001.parquet (0.1% uniform random data sample)\n    * parts/\n      * One individual `\u003cid\u003e.parquet` file per cluster\n  * serverless/\n    * full.parquet\n    * sample_0.01.parquet (1% uniform random data sample)\n    * sample_0.001.parquet (0.1% uniform random data sample)\n    * parts/\n      * One individual `\u003cid\u003e.parquet` file per cluster\n\nYou can either download files using their HTTP link, e.g.,\nhttps://s3.amazonaws.com/redshift-downloads/redset/LICENSE\nor interact with the S3 bucket using the [AWS CLI](https://aws.amazon.com/cli/).\nFor example, to download the full serverless dataset you can run:\n```\naws s3 cp --no-sign-request s3://redshift-downloads/redset/serverless/full.parquet .\n```\n\n## Schema\n| Column Name | Description |\n| ----------- | ----------- |\n| instance_id | Uniquely identifies a Redshift cluster |\n| cluster_size | Size of the cluster (only available for provisioned) |\n| user_id | Identifies the user that issued the query |\n| database_id | Identifies the database that was queried |\n| query_id | Unique per instance |\n| arrival_timestamp | Timestamp when 
the query arrived on the system |\n| compile_duration_ms | Time the query spent compiling in milliseconds |\n| queue_duration_ms | Time the query spent queueing in milliseconds |\n| execution_duration_ms | Time the query spent executing in milliseconds |\n| feature_fingerprint | Hash value of the query fingerprint. A proxy for query-likeness, though not based on text. Will overestimate repetition. |\n| was_aborted | Whether the query was aborted during its lifetime |\n| was_cached | Whether the query was answered from the result cache |\n| cache_source_query_id | If the query was answered from the result cache, this is the query ID of the query that populated the cache |\n| query_type | Type of query, e.g., `select`, `copy`, ... |\n| num_permanent_tables_accessed | Number of permanent table accesses by the query (regular database table) |\n| num_external_tables_accessed | Number of [external tables](https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_TABLE.html) accessed by the query |\n| num_system_tables_accessed | Number of [system tables](https://docs.aws.amazon.com/redshift/latest/dg/cm_chap_system-tables.html) accessed by the query |\n| read_table_ids | Comma-separated list of unique permanent table IDs read by the query |\n| write_table_ids | Comma-separated list of unique table IDs written to by the query |\n| mbytes_scanned | Total number of megabytes scanned by the query |\n| mbytes_spilled | Total number of megabytes spilled by the query |\n| num_joins | Number of joins in the query plan |\n| num_scans | Number of scans in the query plan |\n| num_aggregations | Number of aggregations in the query plan |\n\n## Citation\nYou may find the paper [here](https://assets.amazon.science/24/3b/04b31ef64c83acf98fe3fdca9107/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet.pdf).\n```\n@Inproceedings{Renen2024,\n  author = {Alexander van Renen and Dominik Horn and Pascal Pfeil and Kapil Eknath Vaidya and Wenjian Dong and Murali Narayanaswamy and 
Zhengchun Liu and Gaurav Saxena and Andreas Kipf and Tim Kraska},\n  title = {Why TPC is not enough: An analysis of the Amazon Redshift fleet},\n  year = {2024},\n  url = {https://www.amazon.science/publications/why-tpc-is-not-enough-an-analysis-of-the-amazon-redshift-fleet},\n  booktitle = {VLDB 2024},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fredset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famazon-science%2Fredset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famazon-science%2Fredset/lists"}