A dbt project to make GitHub archive events useful
https://github.com/preset-io/dbt_github_archive_bigquery
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/preset-io/dbt_github_archive_bigquery
- Owner: preset-io
- Created: 2023-06-02T22:39:35.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-06-28T01:12:52.000Z (over 1 year ago)
- Last Synced: 2024-01-28T23:09:57.489Z (9 months ago)
- Language: Python
- Size: 560 KB
- Stars: 3
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-dbt - dbt_github_archive_bigquery - A dbt project for GitHub Archive data on BigQuery. (Sample Projects)
README
# dbt reference implementation for GitHub Archive
This is a quick and dirty reference implementation to make sense of the
GitHub public information made available by the
[GH Archive](https://www.gharchive.org/) through BigQuery public datasets.

That data is a bit rough: it is spread across a bunch of yearly/monthly/daily archived tables
that are fairly large (TBs). You probably want to bring only the orgs/repos
you care about into a single table, and hopefully do some decent incremental
loads to make this queryable.

This dbt project does all this:
- brings all the archived tables into one centralized table in your local BigQuery project
- partitions by day, does incremental loads
- allows you to select just the repos you need
- rebuilds some state tables off of the events table
- parses important information out of JSON blobs
- gets rid of redundant or not-so-useful-for-analytics information
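
To make the steps above concrete, here is a minimal Python sketch of the same ideas applied to a handful of in-memory events: filter to the repos you care about, parse the JSON payload, keep only a few analytics-friendly fields, and group rows by day. This is not the project's actual implementation (which is written as dbt models running on BigQuery). The event shape, the `REPOS_OF_INTEREST` allow-list, and the `flatten_event` and `partition_by_day` helpers are all assumptions made for illustration.

```python
import json
from collections import defaultdict
from datetime import datetime

# Hypothetical allow-list; the real project has its own way to select repos.
REPOS_OF_INTEREST = {"example-org/example-repo"}


def flatten_event(event):
    """Keep a few useful fields and pull one value out of the JSON payload."""
    payload = json.loads(event["payload"])  # assumed to arrive as a JSON string
    return {
        "event_id": event["id"],
        "event_type": event["type"],
        "repo_name": event["repo"]["name"],
        "actor_login": event["actor"]["login"],
        "created_at": event["created_at"],
        "action": payload.get("action"),  # None when the payload has no action
    }


def partition_by_day(events, repos=REPOS_OF_INTEREST):
    """Drop events from other repos and group the rest by calendar day."""
    partitions = defaultdict(list)
    for event in events:
        if event["repo"]["name"] not in repos:
            continue
        row = flatten_event(event)
        day = datetime.fromisoformat(row["created_at"]).date().isoformat()
        partitions[day].append(row)
    return dict(partitions)


# Made-up sample events, shaped loosely like GH Archive rows.
sample_events = [
    {"id": "1", "type": "PullRequestEvent",
     "repo": {"name": "example-org/example-repo"}, "actor": {"login": "alice"},
     "created_at": "2023-06-02T22:39:35",
     "payload": '{"action": "opened", "number": 7}'},
    {"id": "2", "type": "WatchEvent",
     "repo": {"name": "other-org/other-repo"}, "actor": {"login": "bob"},
     "created_at": "2023-06-02T23:00:00",
     "payload": '{"action": "started"}'},
    {"id": "3", "type": "PushEvent",
     "repo": {"name": "example-org/example-repo"}, "actor": {"login": "carol"},
     "created_at": "2023-06-03T01:12:52",
     "payload": '{"size": 2}'},
]

partitions = partition_by_day(sample_events)
```

An incremental load, in this picture, would amount to recomputing only the day keys that are newer than what the destination table already holds, rather than rescanning every archived table.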