https://github.com/go-faster/gh-archive-clickhouse
Save public github event stream to ClickHouse as raw json
https://github.com/go-faster/gh-archive-clickhouse
Last synced: 6 months ago
JSON representation
Save public github event stream to ClickHouse as raw json
- Host: GitHub
- URL: https://github.com/go-faster/gh-archive-clickhouse
- Owner: go-faster
- License: apache-2.0
- Created: 2022-06-27T11:54:46.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-12-05T07:40:49.000Z (6 months ago)
- Last Synced: 2024-12-05T08:31:23.235Z (6 months ago)
- Language: Go
- Size: 466 KB
- Stars: 6
- Watchers: 2
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# gh-archive-clickhouse
Save public GitHub event stream to ClickHouse as json.* [List public events](https://docs.github.com/en/rest/activity/events#list-public-events) GitHub endpoint
* Original [GitHub Archive](https://github.com/igrigorik/gharchive.org) project
* [ClickHouse](https://clickhouse.tech/) OLAP database```sql
CREATE TABLE github_events_raw
(
id Int64,
ts DateTime32,
raw String CODEC (ZSTD(16))
) ENGINE = ReplacingMergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (ts, id);
```Alternative to [gharchive crawler](https://github.com/igrigorik/gharchive.org/tree/master/crawler) with
decreased probability to miss events.* Streaming to ClickHouse via native protocol instead of using files, so storage and fethching are decoupled.
* Automatic pagination if more than one page of new events is available
* Automatic fetch rate adjustment based on rate limit github headers and request duration
* ETag support to skip cached results## gh-load
```console
$ gh-load --help
Usage of gh-load:
-b int
max token bytes (default 104857600)
-db string
db
-from string
start from (default "2022-03-14T21")
-host string
host (default "localhost")
-jobs int
jobs (default 1)
-port int
port (default 9000)
-table string
table (default "github_events_raw")
-to string
end with (default "2022-04-14T21")
```Example usage (consumes ~340MB RAM per job):
```console
gh-load --db faster --from 2020-01-01T00 --to 2020-01-20T00 --jobs 20
```