https://github.com/prx/castlehouse

PRX rollups running on Clickhouse. Rename this later, Ryan.
Host: GitHub
URL: https://github.com/prx/castlehouse
Owner: PRX
Created: 2024-04-01T21:50:30.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-08-26T18:51:49.000Z (5 months ago)
Last Synced: 2024-08-26T22:05:58.956Z (5 months ago)
Language: Shell
Size: 26.4 KB
Stars: 0
Watchers: 6
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# CastleHouse

Your castle is your home.

## Data flow

1. Source data lives in the `dt_downloads` table in BigQuery

2. A BigQuery Scheduled Query exports rollup parquet files

3. This project copies those parquet files into Clickhouse `ReplacingMergeTree` tables

4. Query away! Just make sure you use the `FINAL` keyword, or you'll get duplicate upserted rows

## Installation

For local dev, you can just follow the [Clickhouse quick install instructions](https://clickhouse.com/docs/en/install#quick-install), except we'll put it in the `clickhouse/` subdirectory.

```sh

# download clickhouse

mkdir clickhouse && (cd clickhouse && curl https://clickhouse.com/ | sh)

# symlink our server config overrides

mkdir clickhouse/config.d && (cd clickhouse/config.d && ln -s ../../override-config.xml .)

```

## Setup

Copy the `env-example` to `.env` and fill it in. You'll need a Google Service Account HMAC key in order to pull down the GS files. But who doesn't have one of those lying around, right?

Rather than directly running `clickhouse/clickhouse`, the `castlehouse` script will load your dotenv before calling clickhouse.
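
For readers who have not seen the pattern, a dotenv-loading wrapper could look roughly like the sketch below. This is not the actual `castlehouse` script: the function names `load_dotenv` and `run_clickhouse` are hypothetical, and the real script may handle arguments and errors differently.

```sh
# Hypothetical sketch of a dotenv-loading wrapper; the real `castlehouse`
# script may differ.

# Export every variable assigned in the given env file into this shell.
load_dotenv() {
  env_file="${1:-.env}"
  # `.` searches PATH for names without a slash, so force a relative path.
  case "$env_file" in
    */*) ;;
    *) env_file="./$env_file" ;;
  esac
  [ -f "$env_file" ] || return 1
  set -a          # auto-export every variable assigned from here on
  . "$env_file"   # source the KEY=value lines
  set +a          # stop auto-exporting
}

# Load .env, then hand all arguments to the local clickhouse binary.
run_clickhouse() {
  load_dotenv .env || { echo "missing .env" >&2; return 1; }
  exec clickhouse/clickhouse "$@"
}
```

With something like this in place, `run_clickhouse server` would start the server with the variables from `.env` already exported.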

In a separate tab, get the clickhouse server running:

```sh

./castlehouse server

```

Then create your database and tables:

```sh

realpath schema/tables.sql | xargs ./castlehouse client --queries-file

```

Now you're ready to load data! For local dev, it's probably best to just insert a chunk of data from the bucket. Open up a `./castlehouse client` and then use Google Storage globs to grab all the April 2024 files. Notice that the bucket name does not matter: it is overridden from your `GOOGLE_STORAGE_BUCKET_ENDPOINT` env in `override-config.xml`.

```sql

INSERT INTO daily_agents SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_agents_*.parquet');

INSERT INTO daily_geos SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_geos_*.parquet');

INSERT INTO daily_uniques SELECT * FROM s3('gs://the-bucket/2024/04/**/daily_uniques_*.parquet');

INSERT INTO hourly_downloads SELECT * FROM s3('gs://the-bucket/2024/04/**/hourly_downloads_*.parquet');

```
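
If you want a different month, the four statements above differ only in the table name and the date path, so they can be generated. The sketch below is one way to do that; `build_inserts` is a hypothetical helper, not part of this repo, and it keeps the same `gs://the-bucket` placeholder.

```sh
# Print one INSERT per rollup table for the given year and month.
# Table names and the gs://the-bucket placeholder mirror the statements above.
build_inserts() {
  year="$1"
  month="$2"
  for table in daily_agents daily_geos daily_uniques hourly_downloads; do
    printf "INSERT INTO %s SELECT * FROM s3('gs://the-bucket/%s/%s/**/%s_*.parquet');\n" \
      "$table" "$year" "$month" "$table"
  done
}

# Reproduces the four April 2024 statements.
build_inserts 2024 04
```

One way to run the result is to write it to a file and pass that to `./castlehouse client --queries-file`, the same way `schema/tables.sql` is loaded.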

In production, we use [S3Queue](https://clickhouse.com/docs/en/engines/table-engines/integrations/s3queue) tables (the ones ending in `_queue` or `_incr`) to continuously stream data from Google Storage into Clickhouse via materialized views.

This workflow looks like:

1. Throughout the day, updates are written to `gs://rollups/_incr/hourly_downloads_20240403_090302.parquet` files

2. The `hourly_downloads_incr_mv` materialized view sees these being inserted into the `hourly_downloads_incr` S3Queue table

3. The data is then inserted into the `hourly_downloads` table
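
When debugging which increment files have landed, it can help to split a file name into its parts. The sketch below assumes the name follows a `<table>_<YYYYMMDD>_<HHMMSS>.parquet` layout, which is inferred from the single example above and not confirmed elsewhere; `parse_increment` is a hypothetical helper.

```sh
# Split an increment file name into table, date, and time parts.
# Assumes a <table>_<YYYYMMDD>_<HHMMSS>.parquet layout (an assumption).
parse_increment() {
  name="$(basename "$1" .parquet)"
  time_part="${name##*_}"   # text after the last underscore
  rest="${name%_*}"         # drop the trailing "_<time>"
  date_part="${rest##*_}"   # text after the remaining last underscore
  table="${rest%_*}"        # drop the trailing "_<date>"
  printf '%s %s %s\n' "$table" "$date_part" "$time_part"
}

# prints "hourly_downloads 20240403 090302"
parse_increment gs://rollups/_incr/hourly_downloads_20240403_090302.parquet
```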

To enable these locally:

```sh

realpath schema/mv_backfill.sql | xargs ./castlehouse client --queries-file

realpath schema/mv_increments.sql | xargs ./castlehouse client --queries-file

```

And then to remove them, so they're not always churning away in the background on your local machine:

```sh

./castlehouse client -q "DROP VIEW daily_agents_queue_mv"

./castlehouse client -q "DROP VIEW daily_geos_queue_mv"

./castlehouse client -q "DROP VIEW daily_uniques_queue_mv"

./castlehouse client -q "DROP VIEW hourly_downloads_queue_mv"

./castlehouse client -q "DROP VIEW daily_agents_incr_mv"

./castlehouse client -q "DROP VIEW daily_geos_incr_mv"

./castlehouse client -q "DROP VIEW daily_uniques_incr_mv"

./castlehouse client -q "DROP VIEW hourly_downloads_incr_mv"

```
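
The eight view names follow a `<table>_<queue|incr>_mv` pattern, so the statements can also be generated in a loop. This is only a sketch: `drop_view_statements` is a hypothetical helper that prints the statements without running them.

```sh
# Print a DROP VIEW statement for every <table>_<suffix>_mv view listed above.
drop_view_statements() {
  for suffix in queue incr; do
    for table in daily_agents daily_geos daily_uniques hourly_downloads; do
      printf 'DROP VIEW %s_%s_mv\n' "$table" "$suffix"
    done
  done
}

# Dry run: show the eight statements.
drop_view_statements
```

To actually drop the views, each printed line can be passed to the client, for example `drop_view_statements | while read -r stmt; do ./castlehouse client -q "$stmt"; done`.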

## Querying

We're using [ReplacingMergeTree](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree) tables, since we expect to "upsert" the same days/hours of data multiple times.

This does mean a plain query can return inaccurate results, because replaced rows are only removed when background merges run and until then they are counted more than once. A couple of strategies to deal with that:

```sql

# plain query returns 2.99M ... woh, that's more than expected!

SELECT SUM(count) FROM hourly_downloads WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

# FINAL query returns 1.49M ... that's correct, but this was slower

SELECT SUM(MAX(count)) FROM hourly_downloads FINAL WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

# 3x faster that FINAL

SELECT SUM(max_count) FROM (

  SELECT hour, MAX(count) AS max_count FROM hourly_downloads

  GROUP BY podcast_id, feed_slug, episode_id, hour

)

WHERE hour >= '2024-04-01' AND hour < '2024-04-02'

# or ... cleanup?

OPTIMIZE TABLE hourly_downloads FINAL

# or a MV populated from the inserts-table?

```
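
To see why the plain sum comes out at roughly double, here is a toy illustration outside Clickhouse. The rows and keys are made up, and `plain_sum` and `dedup_sum` are hypothetical helpers; the second one mimics the `MAX(count)` per key subquery above.

```sh
# Toy rows: "<key> <count>", where each key was upserted twice (made-up data).
rows='ep1-00h 10
ep1-00h 10
ep2-00h 5
ep2-00h 5'

# Plain SUM: every stored copy of a row is counted.
plain_sum() {
  awk '{ s += $2 } END { print s }'
}

# Keep the largest count per key, then sum, like the GROUP BY subquery.
dedup_sum() {
  awk '{ c = $2 + 0; if (!($1 in m) || c > m[$1]) m[$1] = c }
       END { for (k in m) s += m[k]; print s }'
}

printf '%s\n' "$rows" | plain_sum   # prints 30
printf '%s\n' "$rows" | dedup_sum   # prints 15
```

The same 2:1 ratio shows up in the 2.99M versus 1.49M results above, which would be consistent with most rows having been upserted twice.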

## BigQuery Exports

This repo also includes an `exports/` directory.

These SQL files are not intended to run from here, but instead should be set up as BigQuery Scheduled Queries.

For instance, the `daily_rollups.sql` should be scheduled to run 15 minutes after midnight UTC every day, to roll up the final copy of the previous day's data.

The incremental `increments.sql` should be scheduled many times per day.