{"id":28944452,"url":"https://github.com/phenrickson/bgg-data-warehouse","last_synced_at":"2026-04-15T20:02:21.524Z","repository":{"id":298187325,"uuid":"999164181","full_name":"phenrickson/bgg-data-warehouse","owner":"phenrickson","description":"ETL process for BGG cloud data warehouse","archived":false,"fork":false,"pushed_at":"2025-07-21T14:12:06.000Z","size":309,"stargazers_count":2,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-21T16:34:01.473Z","etag":null,"topics":["bigquery","data-engineering","elt-pipeline","python"],"latest_commit_sha":null,"homepage":"https://bgg-dashboard-hyfyvchp4a-uc.a.run.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/phenrickson.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-09T20:50:26.000Z","updated_at":"2025-07-15T19:29:22.000Z","dependencies_parsed_at":"2025-06-09T21:42:37.925Z","dependency_job_id":"1c8a6fa4-9039-4b7e-99e2-1fdea6c7b643","html_url":"https://github.com/phenrickson/bgg-data-warehouse","commit_stats":null,"previous_names":["phenrickson/bgg-data-warehouse"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/phenrickson/bgg-data-warehouse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phenrickson%2Fbgg-data-warehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phenrickson%2Fbgg-data-warehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phenrickson%2Fbgg-data-warehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phenrickson%2Fbgg-data-warehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/phenrickson","download_url":"https://codeload.github.com/phenrickson/bgg-data-warehouse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/phenrickson%2Fbgg-data-warehouse/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267652790,"owners_count":24122099,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-engineering","elt-pipeline","python"],"created_at":"2025-06-23T06:01:59.320Z","updated_at":"2026-04-15T20:02:21.518Z","avatar_url":"https://github.com/phenrickson.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BGG Data Warehouse\n\nA data pipeline for collecting, processing, and analyzing BoardGameGeek game data using BigQuery.\n\n## Overview\n\nThis project collects board game data from BoardGameGeek's API, stores raw responses in BigQuery, and processes them into a normalized data warehouse for analysis.\n\n## Architecture\n\n### Pipeline Components\n\n**Fetch New Games** (`src/pipeline/fetch_new_games.py`)\n- Retrieves new board game IDs from BoardGameGeek\n- Fetches API responses for unfetched games\n- Processes responses into normalized tables\n- Runs daily at 6 AM UTC\n\n**Refresh Old Games** (`src/pipeline/refresh_old_games.py`)\n- Refreshes stale game data based on publication year\n- Recent games (0-2 years): refreshed weekly\n- Established games (2-5 years): refreshed monthly\n- Classic games (5-10 years): refreshed quarterly\n- Vintage games (10+ years): refreshed bi-annually\n- Runs daily at 7 AM UTC\n\n### Data Flow\n\n```mermaid\ngraph TD\n    A[BGG XML API] --\u003e|Fetch Game IDs| B[thing_ids Table]\n    B --\u003e|Unfetched IDs| C[Response Fetcher]\n    C --\u003e|Rate-Limited Requests| A\n    C --\u003e|Store Raw Data| D[raw_responses Table]\n    C --\u003e|Track Fetch| E[fetched_responses Table]\n    E --\u003e|Unprocessed Records| F[Response Processor]\n    D --\u003e|Raw XML| F\n    F --\u003e|Normalized Data| G[BigQuery Warehouse]\n    F --\u003e|Track Processing| H[processed_responses Table]\n```\n\n### BigQuery Datasets\n\n**Raw Dataset** (`raw`)\n- `thing_ids`: Game ID registry\n- `raw_responses`: Raw API responses\n- `fetched_responses`: Fetch tracking\n- `processed_responses`: Processing tracking\n- `request_log`: API request audit log\n- `fetch_in_progress`: Prevents duplicate concurrent fetches\n\n**Core Dataset** (`core`)\n- `games`: Core game data\n- `categories`, `mechanics`, `families`: Dimension tables\n- `designers`, `artists`, `publishers`: Creator tables\n- `rankings`, `player_counts`: Metrics tables\n\n**Analytics Dataset** (`analytics`) - Managed by Dataform\n- `games_active`: View of latest game data (deduped by game_id)\n- `games_features`: Denormalized table with computed columns:\n  - `hurdle`: Binary flag for games with 25+ ratings\n  - `geek_rating`, `complexity`, `rating`: Renamed metrics\n  - `log_users_rated`: Log-transformed user count\n  - Aggregated arrays for categories, mechanics, publishers, designers, artists, families\n\n### Infrastructure\n\nAll infrastructure is managed via Terraform in the `terraform/` directory.\n\n- **Cloud Run Jobs**: Two jobs execute the pipelines daily\n  - `bgg-fetch-new-games`: 1 vCPU, 2GB memory\n  - `bgg-refresh-old-games`: 1 vCPU, 2GB memory\n- **Dataform**: Analytics transformations run via GitHub Actions workflow\n- **GitHub Actions**: Triggers Cloud Run jobs on schedule, deploys on merge to main, runs Dataform\n- **Cloud Build**: Builds and deploys Docker images\n\n## Prerequisites\n\n- Python 3.12+\n- UV package manager\n- Google Cloud project with:\n  - Cloud Run API\n  - Cloud Build API\n  - BigQuery API\n- Service account with BigQuery Data Editor, Cloud Run Invoker roles\n\n## Setup\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/phenrickson/bgg-data-warehouse.git\ncd bgg-data-warehouse\n```\n\n2. Install UV and dependencies:\n```bash\n# Install UV (see https://docs.astral.sh/uv/getting-started/installation/)\ncurl -LsSf https://astral.sh/uv/install.sh | sh\n\n# Create virtual environment and install dependencies\nuv venv\nsource .venv/bin/activate  # Unix/macOS\n.venv\\Scripts\\activate     # Windows\nuv sync\n```\n\n3. Configure environment:\n```bash\ncp .env.example .env\n# Edit .env with your GCP_PROJECT_ID, ENVIRONMENT, BGG_API_TOKEN\n```\n\n4. Configure GitHub repository secrets:\n- `SERVICE_ACCOUNT_KEY`: GCP service account key JSON\n- `GCP_PROJECT_ID`: Google Cloud project ID\n- `BGG_API_TOKEN`: BoardGameGeek API token\n\n## Usage\n\n### Local Development\n\n```bash\n# Fetch new games\nuv run python -m src.pipeline.fetch_new_games\n\n# Refresh old games\nuv run python -m src.pipeline.refresh_old_games\n\n# Run tests\nuv run pytest\n```\n\n### Manual Job Execution\n\n```bash\ngcloud run jobs execute bgg-fetch-new-games-prod --region us-central1 --wait\ngcloud run jobs execute bgg-refresh-old-games-prod --region us-central1 --wait\n```\n\n## Dashboard\n\nA Streamlit dashboard is available for exploring the data:\n\n```bash\nstreamlit run src/visualization/dashboard.py\n```\n\nThe dashboard is also deployed to Cloud Run and accessible via the URL output by the deploy workflow.\n\n## Versioning\n\nThis project uses semantic versioning. When changes are merged to main with a version bump in `pyproject.toml`, a GitHub Action automatically creates a corresponding git tag.\n\nSee [CHANGELOG.md](CHANGELOG.md) for version history.\n\n## License\n\nMIT License\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphenrickson%2Fbgg-data-warehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fphenrickson%2Fbgg-data-warehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fphenrickson%2Fbgg-data-warehouse/lists"}