Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/prx/castlehouse
PRX rollups running on Clickhouse. Rename this later, Ryan.
- Host: GitHub
- URL: https://github.com/prx/castlehouse
- Owner: PRX
- Created: 2024-04-01T21:50:30.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-08-26T18:51:49.000Z (5 months ago)
- Last Synced: 2024-08-26T22:05:58.956Z (5 months ago)
- Language: Shell
- Size: 26.4 KB
- Stars: 0
- Watchers: 6
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CastleHouse
Your castle is your home.
## Data flow
1. The `dt_downloads` table in BigQuery
2. BQ Scheduled Query to export rollup parquet files
3. This project copies those parquet files into Clickhouse `ReplacingMergeTree` tables
4. Query away! Just make sure you use the `FINAL` keyword, or you'll get duplicate upserted rows (see the table sketch below).
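
The table definitions themselves live in `schema/tables.sql` and aren't reproduced in this README. As a rough, hypothetical sketch of the shape (the column names and `ORDER BY` key here are guesses based on the `GROUP BY` keys used in the Querying section below), one of them might look like:

```sql
-- Illustrative only: the actual DDL is in schema/tables.sql
CREATE TABLE hourly_downloads
(
    podcast_id UInt32,
    feed_slug  String,
    episode_id String,
    hour       DateTime,
    count      UInt64
)
ENGINE = ReplacingMergeTree
ORDER BY (podcast_id, feed_slug, episode_id, hour);
```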
## Installation

For local dev, you can just follow the [Clickhouse quick install instructions](https://clickhouse.com/docs/en/install#quick-install).
Except we'll put it in the `clickhouse/` subdirectory.

```sh
# download clickhouse
mkdir clickhouse && (cd clickhouse && curl https://clickhouse.com/ | sh)

# symlink our server config overrides
mkdir clickhouse/config.d && (cd clickhouse/config.d && ln -s ../../override-config.xml .)
```

## Setup
Copy the `env-example` to `.env` and fill it in. You'll need a Google Service Account HMAC key
in order to pull down the GS files. But who doesn't have one of those lying around, right?
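
Purely as an illustration (the real variable names come from `env-example`, which isn't reproduced here; only `GOOGLE_STORAGE_BUCKET_ENDPOINT` is mentioned elsewhere in this README), a filled-in `.env` might look roughly like:

```sh
# Hypothetical values; only GOOGLE_STORAGE_BUCKET_ENDPOINT appears elsewhere in this README
GOOGLE_STORAGE_BUCKET_ENDPOINT=https://storage.googleapis.com/my-rollups-bucket
# HMAC credentials for the Google Service Account (variable names invented for this sketch)
GCS_HMAC_KEY_ID=GOOG1EXAMPLEKEYID
GCS_HMAC_SECRET=exampleSecretValue
```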
Rather than directly running `clickhouse/clickhouse`, the `castlehouse` script will load your
dotenv before calling clickhouse.
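
The wrapper itself isn't reproduced here; as a sketch of the idea (assuming a `.env` in the repo root and the binary at `clickhouse/clickhouse`), it might do something like:

```sh
#!/usr/bin/env bash
# Hypothetical sketch of a dotenv-loading wrapper; the real castlehouse script lives in this repo.
set -euo pipefail

# Export everything defined in .env, then hand off to the clickhouse binary
if [ -f .env ]; then
  set -a
  . ./.env
  set +a
fi

exec ./clickhouse/clickhouse "$@"
```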
In a separate tab, get the clickhouse server running:
```sh
./castlehouse server
```

Then create your database and tables:
```sh
realpath schema/tables.sql | xargs ./castlehouse client --queries-file
```

Now you're ready to load data! For local dev, it's probably best to just insert a chunk of data
from the bucket. Open up a `./castlehouse client` and then use Google Storage globs to grab all
the April 2024 files. Notice that the bucket name does not matter - it is overridden from your
`GOOGLE_STORAGE_BUCKET_ENDPOINT` env in `override-config.xml`.

```sql
INSERT INTO daily_agents SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_agents_*.parquet');
INSERT INTO daily_geos SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_geos_*.parquet');
INSERT INTO daily_uniques SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_uniques_*.parquet');
INSERT INTO hourly_downloads SELECT * FROM s3('gs://the-bucket/2024/04/**/hourly_downloads_*.parquet');
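
-- Illustrative addition (not in the original README): quick sanity check that the April data landed
SELECT COUNT(*) FROM hourly_downloads WHERE hour >= '2024-04-01' AND hour < '2024-05-01';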
```

In production, we use [S3Queue](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3queue)
tables (the ones ending in `_queue` or `_incr`) to continuously stream data from Google Storage into Clickhouse
via materialized views.

This workflow looks like:
1. Throughout the day, updates are written to `gs://rollups/_incr/hourly_downloads_20240403_090302.parquet` files
2. The `hourly_downloads_incr_mv` materialized view sees these being inserted into the `hourly_downloads_incr` S3Queue table
3. The data is then inserted into the `hourly_downloads` table (see the sketch below)
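
The DDL for these lives in `schema/` and isn't shown in this README. As a rough sketch of the S3Queue + materialized view pattern (the column list, path, and settings here are assumptions), it might look something like:

```sql
-- Illustrative only: a queue table that watches the incremental parquet files in GCS
CREATE TABLE hourly_downloads_incr
(
    podcast_id UInt32,
    feed_slug  String,
    episode_id String,
    hour       DateTime,
    count      UInt64
)
ENGINE = S3Queue('gs://rollups/_incr/hourly_downloads_*.parquet', Parquet)
SETTINGS mode = 'unordered';

-- ...and a materialized view that forwards each consumed batch into the real table
CREATE MATERIALIZED VIEW hourly_downloads_incr_mv TO hourly_downloads
AS SELECT * FROM hourly_downloads_incr;
```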
To enable these locally:

```sh
realpath schema/mv_backfill.sql | xargs ./castlehouse client --queries-file
realpath schema/mv_increments.sql | xargs ./castlehouse client --queries-file
```

And then to remove them, so they're not always churning away in the background on your local machine:
```sh
./castlehouse client -q "DROP VIEW daily_agents_queue_mv"
./castlehouse client -q "DROP VIEW daily_geos_queue_mv"
./castlehouse client -q "DROP VIEW daily_uniques_queue_mv"
./castlehouse client -q "DROP VIEW hourly_downloads_queue_mv"
./castlehouse client -q "DROP VIEW daily_agents_incr_mv"
./castlehouse client -q "DROP VIEW daily_geos_incr_mv"
./castlehouse client -q "DROP VIEW daily_uniques_incr_mv"
./castlehouse client -q "DROP VIEW hourly_downloads_incr_mv"
```

## Querying
We're using [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree)
tables, since we expect to "upsert" the same days/hours of data multiple times.

This does mean you could get inaccurate results. A couple of strategies to deal with that:
```sql
-- plain query returns 2.99M ... whoa, that's more than expected!
SELECT SUM(count) FROM hourly_downloads WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

-- FINAL query returns 1.49M ... that's correct, but this was slower
SELECT SUM(count) FROM hourly_downloads FINAL WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

-- 3x faster than FINAL
SELECT SUM(max_count) FROM (
SELECT hour, MAX(count) AS max_count FROM hourly_downloads
GROUP BY podcast_id, feed_slug, episode_id, hour
)
WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

-- or ... cleanup?
OPTIMIZE TABLE hourly_downloads FINAL

-- or a MV populated from the inserts-table?
```

## BigQuery Exports
This repo also includes an `exports/` directory.
These SQL files are not intended to run from here, but instead should be set up as
BigQuery Scheduled Queries.

For instance, the `daily_rollups.sql` should be scheduled to run 15 minutes after midnight UTC
every day, to roll up the final copy of the previous day's data. The incremental `increments.sql`
should be scheduled many times per day.
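
The export SQL itself lives in `exports/` and isn't reproduced here. As a very rough sketch of the shape of such a scheduled query (the source dataset and column names are assumptions, not taken from the repo), a daily rollup export might look like:

```sql
-- Illustrative only: export one day's hourly rollup to parquet files in GCS
EXPORT DATA OPTIONS (
  uri = 'gs://the-bucket/2024/04/01/hourly_downloads_*.parquet',  -- exactly one '*' wildcard is required
  format = 'PARQUET',
  overwrite = true
) AS
SELECT
  podcast_id,
  feed_slug,
  episode_id,
  TIMESTAMP_TRUNC(download_at, HOUR) AS hour,  -- download_at is a hypothetical column name
  COUNT(*) AS count
FROM `my_project.my_dataset.dt_downloads`      -- dataset qualification is a placeholder
WHERE DATE(download_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY 1, 2, 3, 4;
```

In a real scheduled query the destination path would presumably be parameterized by run date; it's hardcoded here only to keep the sketch short.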