{"id":49804676,"url":"https://github.com/timfanda35/cost-to-bq","last_synced_at":"2026-05-12T17:01:38.078Z","repository":{"id":353360537,"uuid":"1219090343","full_name":"timfanda35/cost-to-bq","owner":"timfanda35","description":"Transfer cost data to GCP BigQuery","archived":false,"fork":false,"pushed_at":"2026-04-23T14:23:06.000Z","size":51,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-23T16:30:03.404Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/timfanda35.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T14:14:27.000Z","updated_at":"2026-04-23T14:23:10.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/timfanda35/cost-to-bq","commit_stats":null,"previous_names":["timfanda35/cost-to-bq"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/timfanda35/cost-to-bq","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timfanda35%2Fcost-to-bq","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timfanda35%2Fcost-to-bq/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timfanda35%2Fcost-to-bq/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timfanda35%2Fcost-to-bq/manifests","owner_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/timfanda35","download_url":"https://codeload.github.com/timfanda35/cost-to-bq/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/timfanda35%2Fcost-to-bq/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32948571,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-12T09:19:52.626Z","status":"ssl_error","status_checked_at":"2026-05-12T09:17:33.438Z","response_time":102,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-12T17:01:20.655Z","updated_at":"2026-05-12T17:01:37.784Z","avatar_url":"https://github.com/timfanda35.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# billing-loader\n\nA FastAPI service that extracts billing files from AWS S3 (Cost and Usage Reports in Hive-partitioned format), stages them in Google Cloud Storage (GCS), and loads them into BigQuery. Designed to run on Cloud Run, triggered daily by Cloud Scheduler.\n\n## Architecture\n\n```\nS3 (CUR Hive partitions)  →  GCS (staging)  →  BigQuery (partitioned WRITE_TRUNCATE)\n```\n\nBy default each run loads **3 billing periods** (current month + previous two). 
The `/run` endpoint also accepts optional parameters to process a specific export or a single billing period.\n\n## BigQuery Schemas\n\nExplicit schemas for BigQuery loads are stored in `src/bq_schema/`:\n\n| File | Format |\n|---|---|\n| `aws-cur-2.0-parquet.json` | AWS Cost and Usage Report (CUR) 2.0 — Parquet |\n| `aws-focus-1.2-parquet.json` | AWS FOCUS 1.2 — Parquet |\n\n## Prerequisites\n\n- Python 3.11+\n- A GCP project with the following APIs enabled: Cloud Run, Cloud Scheduler, Cloud Storage, BigQuery\n- A GCP service account with these roles:\n  - `roles/storage.objectAdmin` on the GCS staging bucket\n  - `roles/bigquery.dataEditor` and `roles/bigquery.jobUser` on the BQ project\n\n## Configuration\n\nCopy `.env.example` to `.env` and fill in the values.\n\n| Variable | Required | Default | Description |\n|---|---|---|---|\n| `SOURCE_TYPE` | Yes | — | Must be `s3` |\n| `SOURCE_BUCKET` | Yes | — | S3 bucket name |\n| `SOURCE_PREFIX` | No | `\"\"` | Path prefix in the bucket before the export name |\n| `EXPORT_NAME` | Yes | — | CUR export name; forms the Hive path `{SOURCE_PREFIX}/{EXPORT_NAME}/data/BILLING_PERIOD=YYYY-MM/` |\n| `GCS_BUCKET` | Yes | — | GCS staging bucket name |\n| `GCS_DESTINATION_PREFIX` | No | `\"\"` | Path prefix in GCS (e.g. `billing/`) |\n| `BQ_PROJECT_ID` | Yes | — | GCP project for BigQuery |\n| `BQ_DATASET_ID` | Yes | — | BigQuery dataset name |\n| `BQ_TABLE_ID` | Yes | — | BigQuery table name (partition and cluster fields depend on `BILLING_SCHEMA`) |\n| `AWS_REGION` | Yes | — | AWS region (e.g. `us-east-1`) |\n| `AWS_ACCESS_KEY_ID` | No | — | AWS key ID; uses instance role if omitted |\n| `AWS_SECRET_ACCESS_KEY` | No | — | Required if `AWS_ACCESS_KEY_ID` is set |\n| `S3_ENDPOINT_URL` | No | — | Override the S3 endpoint (e.g. 
an AWS VPC/PrivateLink endpoint); omit to use the default public AWS endpoint |\n| `BQ_CMEK_KEY_NAME` | No | — | Full Cloud KMS key resource name (`projects/{project}/locations/{location}/keyRings/{ring}/cryptoKeys/{key}`); when set, all BigQuery load jobs use this CMEK instead of Google-managed encryption |\n| `BILLING_SCHEMA` | No | `cur2` | BigQuery schema to use. `cur2` = AWS CUR 2.0 (partition: `bill_billing_period_start_date`, cluster: `line_item_usage_start_date`, `line_item_usage_account_id`); `focus1.2` = AWS FOCUS 1.2 (partition: `BillingPeriodStart`, cluster: `BillingAccountId`) |\n| `PORT` | No | `8080` | HTTP port for the uvicorn server |\n| `LOG_LEVEL` | No | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |\n\n## Local Development\n\n```bash\npip install -r requirements-dev.txt\n\n# Copy and fill in environment variables\ncp .env.example .env\n\n# Run the server\npython main.py\n```\n\nTest the endpoints:\n\n```bash\ncurl http://localhost:8080/health\n# {\"status\": \"ok\"}\n\n# Default run: current month + previous two\ncurl -X POST http://localhost:8080/run\n# {\"run_id\": \"20240115-1705300800\", \"export_name\": \"my-export\", \"periods\": [...], \"bq_table\": \"project.dataset.table\"}\n\n# Run a single specific partition\ncurl -X POST http://localhost:8080/run \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"partition\": \"2024-01\"}'\n\n# Override the export name and partition\ncurl -X POST http://localhost:8080/run \\\n  -H 'Content-Type: application/json' \\\n  -d '{\"export_name\": \"other-export\", \"partition\": \"2024-01\"}'\n```\n\n## Running Tests\n\n```bash\npip install -r requirements-dev.txt\npytest\n```\n\n## Deployment to Cloud Run\n\n**1. 
Store secrets in Secret Manager** (first deploy only):\n\n```bash\necho -n \"YOUR_AWS_KEY_ID\" | gcloud secrets create billing-loader-aws-key-id --data-file=-\necho -n \"YOUR_AWS_SECRET\" | gcloud secrets create billing-loader-aws-secret-key --data-file=-\n\n# Grant the service account access to each secret\nfor SECRET in billing-loader-aws-key-id billing-loader-aws-secret-key; do\n  gcloud secrets add-iam-policy-binding $SECRET \\\n    --member=\"serviceAccount:${SERVICE_ACCOUNT}\" \\\n    --role=\"roles/secretmanager.secretAccessor\"\ndone\n```\n\n**2. Build and deploy:**\n\n```bash\nIMAGE=\"gcr.io/${GCP_PROJECT_ID}/billing-loader\"\n\ngcloud builds submit --tag \"${IMAGE}\" .\n\ngcloud run deploy billing-loader \\\n  --image \"${IMAGE}\" \\\n  --platform managed \\\n  --region \"${GCP_REGION:-us-central1}\" \\\n  --no-allow-unauthenticated \\\n  --service-account \"${SERVICE_ACCOUNT}\" \\\n  --set-env-vars \"SOURCE_TYPE=s3,SOURCE_BUCKET=${SOURCE_BUCKET},SOURCE_PREFIX=${SOURCE_PREFIX:-},EXPORT_NAME=${EXPORT_NAME},GCS_BUCKET=${GCS_BUCKET},GCS_DESTINATION_PREFIX=${GCS_DESTINATION_PREFIX:-},BQ_PROJECT_ID=${BQ_PROJECT_ID},BQ_DATASET_ID=${BQ_DATASET_ID},BQ_TABLE_ID=${BQ_TABLE_ID},AWS_REGION=${AWS_REGION}\" \\\n  --set-secrets \"AWS_ACCESS_KEY_ID=billing-loader-aws-key-id:latest,AWS_SECRET_ACCESS_KEY=billing-loader-aws-secret-key:latest\"\n```\n\n**3. 
Create the Cloud Scheduler job:**\n\n```bash\nSERVICE_URL=$(gcloud run services describe billing-loader \\\n  --platform managed --region \"${GCP_REGION:-us-central1}\" \\\n  --format \"value(status.url)\")\n\ngcloud scheduler jobs create http billing-loader-daily \\\n  --schedule \"${CRON_SCHEDULE:-0 6 * * *}\" \\\n  --uri \"${SERVICE_URL}/run\" \\\n  --http-method POST \\\n  --oidc-service-account-email \"${SERVICE_ACCOUNT}\" \\\n  --location \"${GCP_REGION:-us-central1}\"\n```\n\nTrigger a manual run:\n\n```bash\ngcloud scheduler jobs run billing-loader-daily --location \"${GCP_REGION:-us-central1}\"\n```\n\n## Observability\n\nThe service emits structured JSON logs to stdout via `python-json-logger`. On Cloud Run these are captured automatically in Google Cloud Logging with queryable `jsonPayload` fields.\n\nEvery log line includes `log_event` (dotted name), `run_id`, and `export_name`.\n\n### Log events\n\n| `log_event` | Level | When |\n|---|---|---|\n| `request.received` | INFO | Start of `POST /run` |\n| `pipeline.started` | INFO | After run_id generated; includes `periods` list |\n| `period.started` / `period.files_listed` / `period.complete` | INFO | Per billing period |\n| `period.skipped` | WARNING | S3 partition has no parquet files; includes `reason: \"no_parquet_files\"` |\n| `gcs.file.uploaded` | INFO | After each file uploaded; includes `s3_key`, `gcs_uri` |\n| `bq.job.submitted` | INFO | Immediately after BQ job created; includes `job_id` |\n| `bq.job.complete` | INFO | After `job.result()` returns; includes `output_rows`, `output_bytes` |\n| `bq.job.failed` | ERROR | Before `RuntimeError` is raised; includes `job_id`, `errors` |\n| `pipeline.complete` | INFO | After all periods; includes `periods_loaded`, `periods_skipped`, `duration_seconds` |\n| `pipeline.failed` | ERROR | Any unhandled exception; re-raises after logging |\n\nThe BigQuery job ID is logged at `bq.job.submitted` so you can look up the job in the BQ console even while the run is 
still in progress.\n\nSet `LOG_LEVEL=DEBUG` to lower the root log level (default `INFO`).\n\n### Useful Cloud Logging filters\n\n```\n# Full timeline for one pipeline run\njsonPayload.run_id=\"20260423-1745400000\"\n\n# Audit log: every BQ partition written (includes output_rows and output_bytes)\njsonPayload.log_event=\"bq.job.complete\"\n\n# Find all periods that were skipped (no parquet files in S3)\njsonPayload.log_event=\"period.skipped\"\n```\n\n## API Endpoints\n\n### `GET /health`\n\nReturns service health status.\n\n```json\n{\"status\": \"ok\"}\n```\n\n### `POST /run`\n\nRuns the ETL pipeline. Accepts an optional JSON body:\n\n| Field | Type | Description |\n|---|---|---|\n| `export_name` | string | Override the `EXPORT_NAME` env var for this run |\n| `partition` | string | Process only this month (`YYYY-MM`, e.g. `\"2024-01\"`). Omit to run the default 3-period window. If the partition has no parquet files in S3, it is skipped without failing the run and a `period.skipped` warning is logged. |\n\nReturns a summary per period:\n\n```json\n{\n  \"run_id\": \"20240115-1705300800\",\n  \"export_name\": \"my-export\",\n  \"periods\": [\n    {\n      \"partition\": \"BILLING_PERIOD=2023-11\",\n      \"files\": 3,\n      \"gcs_uris\": [\"gs://my-bucket/billing/my-export/data/.../file.parquet\"]\n    },\n    {\"partition\": \"BILLING_PERIOD=2023-12\", \"files\": 3, \"gcs_uris\": [\"...\"]},\n    {\"partition\": \"BILLING_PERIOD=2024-01\", \"files\": 3, \"gcs_uris\": [\"...\"]}\n  ],\n  \"bq_table\": \"my-project.billing.daily_costs\"\n}\n```\n\nReturns `500` with `{\"error\": \"...\"}` if the pipeline fails.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimfanda35%2Fcost-to-bq","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftimfanda35%2Fcost-to-bq","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftimfanda35%2Fcost-to-bq/lists"}