{"id":22546615,"url":"https://github.com/prx/castlehouse","last_synced_at":"2025-03-28T08:45:56.354Z","repository":{"id":234264169,"uuid":"780633788","full_name":"PRX/castlehouse","owner":"PRX","description":"PRX rollups running on Clickhouse. Rename this later, Ryan.","archived":false,"fork":false,"pushed_at":"2024-10-03T20:55:21.000Z","size":27,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-02-02T09:31:13.533Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PRX.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-01T21:50:30.000Z","updated_at":"2024-10-03T20:55:25.000Z","dependencies_parsed_at":"2024-04-22T18:15:27.844Z","dependency_job_id":"e8523eed-8b56-4aa5-aa9b-b4c4202ba7b7","html_url":"https://github.com/PRX/castlehouse","commit_stats":null,"previous_names":["prx/castlehouse"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fcastlehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fcastlehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fcastlehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PRX%2Fcastlehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PRX","download_url":"https://codeload.github.com/PRX/castlehouse/tar.gz/refs
/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245999320,"owners_count":20707554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-07T15:08:22.564Z","updated_at":"2025-03-28T08:45:56.323Z","avatar_url":"https://github.com/PRX.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CastleHouse\n\nYour castle is your home.\n\n## Data flow\n\n1. The `dt_downloads` table in BigQuery\n2. BQ Scheduled Query to export rollup parquet files\n3. This project copies those parquet files into Clickhouse `ReplacingMergeTree` tables\n4. Query away! Just make sure you use the `FINAL` keyword, or you'll get duplicate upserted rows\n\n## Installation\n\nFor local dev, you can just follow the [Clickhouse quick install instructions](https://clickhouse.com/docs/en/install#quick-install).\nExcept we'll put it in the `clickhouse/` subdirectory.\n\n```sh\n# download clickhouse\nmkdir clickhouse \u0026\u0026 (cd clickhouse \u0026\u0026 curl https://clickhouse.com/ | sh)\n\n# symlink our server config overrides\nmkdir clickhouse/config.d \u0026\u0026 (cd clickhouse/config.d \u0026\u0026 ln -s ../../override-config.xml .)\n```\n\n## Setup\n\nCopy the `env-example` to `.env` and fill it in. You'll need a Google Service Account HMAC key\nin order to pull down the GS files. 
But who doesn't have one of those lying around, right?\n\nRather than directly running `clickhouse/clickhouse`, the `castlehouse` script will load your\ndotenv before calling clickhouse.\n\nIn a separate tab, get the clickhouse server running:\n\n```sh\n./castlehouse server\n```\n\nThen create your database and tables:\n\n```sh\nrealpath schema/tables.sql | xargs ./castlehouse client --queries-file\n```\n\nNow you're ready to load data! For local dev, it's probably best to just insert a chunk of data\nfrom the bucket. Open up a `./castlehouse client` and then use Google Storage globs to grab all\nthe April 2024 files. Notice that the bucket name does not matter - it is overridden from your\n`GOOGLE_STORAGE_BUCKET_ENDPOINT` env in `override-config.xml`.\n\n```sql\nINSERT INTO daily_agents SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_agents_*.parquet');\nINSERT INTO daily_geos SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_geos_*.parquet');\nINSERT INTO daily_uniques SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_uniques_*.parquet');\nINSERT INTO hourly_downloads SELECT * FROM s3('gs://the-bucket/2024/04/**/hourly_downloads_*.parquet');\n```\n\nIn production, we use [S3Queue](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3queue)\ntables (the ones ending in `_queue` or `_incr`) to continuously stream data from Google Storage into Clickhouse\nvia materialized views.\n\nThis workflow looks like:\n\n1. Throughout the day, updates are written to `gs://rollups/_incr/hourly_downloads_20240403_090302.parquet` files\n2. The `hourly_downloads_incr_mv` materialized view sees these being inserted into the `hourly_downloads_incr` S3Queue table\n3. 
The data is then inserted into the `hourly_downloads` table\n\nTo enable these locally:\n\n```sh\nrealpath schema/mv_backfill.sql | xargs ./castlehouse client --queries-file\nrealpath schema/mv_increments.sql | xargs ./castlehouse client --queries-file\n```\n\nAnd then to remove them, so they're not always churning away in the background on your local machine:\n\n```sh\n./castlehouse client -q \"DROP VIEW daily_agents_queue_mv\"\n./castlehouse client -q \"DROP VIEW daily_geos_queue_mv\"\n./castlehouse client -q \"DROP VIEW daily_uniques_queue_mv\"\n./castlehouse client -q \"DROP VIEW hourly_downloads_queue_mv\"\n./castlehouse client -q \"DROP VIEW daily_agents_incr_mv\"\n./castlehouse client -q \"DROP VIEW daily_geos_incr_mv\"\n./castlehouse client -q \"DROP VIEW daily_uniques_incr_mv\"\n./castlehouse client -q \"DROP VIEW hourly_downloads_incr_mv\"\n```\n\n## Querying\n\nWe're using [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree)\ntables, since we expect to \"upsert\" the same days/hours of data multiple times.\n\nThis does mean a plain query can count duplicate rows that have not been merged yet, giving inflated results.\nA couple of strategies to deal with that:\n\n```sql\n# plain query returns 2.99M ... whoa, that's more than expected!\nSELECT SUM(count) FROM hourly_downloads WHERE hour \u003e= '2024-04-01' AND hour \u003c '2024-04-02'\n\n# FINAL query returns 1.49M ... that's correct, but this was slower\nSELECT SUM(count) FROM hourly_downloads FINAL WHERE hour \u003e= '2024-04-01' AND hour \u003c '2024-04-02'\n\n# 3x faster than FINAL\nSELECT SUM(max_count) FROM (\n  SELECT hour, MAX(count) AS max_count FROM hourly_downloads\n  GROUP BY podcast_id, feed_slug, episode_id, hour\n)\nWHERE hour \u003e= '2024-04-01' AND hour \u003c '2024-04-02'\n\n# or ... 
cleanup?\nOPTIMIZE TABLE hourly_downloads FINAL\n\n# or a MV populated from the inserts-table?\n```\n\n## BigQuery Exports\n\nThis repo also includes an `exports/` directory.\n\nThese SQL files are not intended to run from here, but instead should be set up as a\nBigQuery Scheduled Query.\n\nFor instance, the `daily_rollups.sql` should be scheduled to run 15 minutes after midnight UTC\nevery day, to roll up the final copy of the previous day's data.\n\nThe incremental `increments.sql` should be scheduled many times per day.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprx%2Fcastlehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprx%2Fcastlehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprx%2Fcastlehouse/lists"}