https://github.com/timfanda35/cost-to-bq

Transfer cost data to GCP BigQuery
Host: GitHub
URL: https://github.com/timfanda35/cost-to-bq
Owner: timfanda35
License: mit
Created: 2026-04-23T14:14:27.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-23T14:23:06.000Z (2 months ago)
Last Synced: 2026-04-23T16:30:03.404Z (2 months ago)
Language: Python
Size: 49.8 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README

# billing-loader

A FastAPI service that extracts billing files from AWS S3 (Cost and Usage Reports in Hive-partitioned format), stages them in Google Cloud Storage (GCS), and loads them into BigQuery. Designed to run on Cloud Run, triggered daily by Cloud Scheduler.

## Architecture

```

S3 (CUR Hive partitions)  →  GCS (staging)  →  BigQuery (partitioned WRITE_TRUNCATE)

```

By default each run loads **3 billing periods** (current month + previous two). The `/run` endpoint also accepts optional parameters to process a specific export or a single billing period.
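
The default window can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function name `billing_periods` and the oldest-first ordering are assumptions (the ordering matches the sample `/run` response further down).

```python
from datetime import date


def billing_periods(today: date, count: int = 3) -> list[str]:
    """Return `count` billing periods as YYYY-MM strings, oldest first,
    ending with the month that contains `today`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Count months from year 0 so stepping back across January needs no special case.
    month_index = today.year * 12 + (today.month - 1)
    periods = []
    for offset in range(count - 1, -1, -1):
        year, month0 = divmod(month_index - offset, 12)
        periods.append(f"{year:04d}-{month0 + 1:02d}")
    return periods


print(billing_periods(date(2024, 1, 15)))
# ['2023-11', '2023-12', '2024-01']
```

Reloading the two previous months on every run is one way to pick up late corrections to already-billed periods, since each load replaces the whole partition.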

## BigQuery Schemas

Explicit schemas for BigQuery loads are stored in `src/bq_schema/`:

| File | Format |
|---|---|
| `aws-cur-2.0-parquet.json` | AWS Cost and Usage Report (CUR) 2.0 — Parquet |
| `aws-focus-1.2-parquet.json` | AWS FOCUS 1.2 — Parquet |
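
As a rough sketch of how the `BILLING_SCHEMA` setting (see Configuration) might select one of these files, the lookup below pairs each value with a schema file and its documented partition and cluster fields. The pairing of `cur2` and `focus1.2` with the file names is an assumption based on the names alone, and `SchemaProfile` and `resolve_schema` are hypothetical helpers, not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaProfile:
    schema_file: str                 # file under src/bq_schema/ (pairing is assumed)
    partition_field: str             # partition column from the Configuration table
    cluster_fields: tuple[str, ...]  # cluster columns from the Configuration table


SCHEMA_PROFILES = {
    "cur2": SchemaProfile(
        schema_file="aws-cur-2.0-parquet.json",
        partition_field="bill_billing_period_start_date",
        cluster_fields=("line_item_usage_start_date", "line_item_usage_account_id"),
    ),
    "focus1.2": SchemaProfile(
        schema_file="aws-focus-1.2-parquet.json",
        partition_field="BillingPeriodStart",
        cluster_fields=("BillingAccountId",),
    ),
}


def resolve_schema(billing_schema: str = "cur2") -> SchemaProfile:
    """Look up the profile for a BILLING_SCHEMA value, failing loudly on typos."""
    try:
        return SCHEMA_PROFILES[billing_schema]
    except KeyError:
        known = ", ".join(sorted(SCHEMA_PROFILES))
        raise ValueError(f"unknown BILLING_SCHEMA {billing_schema!r}; expected one of: {known}") from None


print(resolve_schema("focus1.2").schema_file)
```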

## Prerequisites

- Python 3.11+

- A GCP project with the following APIs enabled: Cloud Run, Cloud Scheduler, Cloud Storage, BigQuery

- A GCP service account with these roles:

  - `roles/storage.objectAdmin` on the GCS staging bucket

  - `roles/bigquery.dataEditor` and `roles/bigquery.jobUser` on the BQ project

## Configuration

Copy `.env.example` to `.env` and fill in the values.

| Variable | Required | Default | Description |
|---|---|---|---|
| `SOURCE_TYPE` | Yes | — | Must be `s3` |
| `SOURCE_BUCKET` | Yes | — | S3 bucket name |
| `SOURCE_PREFIX` | No | `""` | Path prefix in the bucket before the export name |
| `EXPORT_NAME` | Yes | — | CUR export name; forms the Hive path `{SOURCE_PREFIX}/{EXPORT_NAME}/data/BILLING_PERIOD=YYYY-MM/` |
| `GCS_BUCKET` | Yes | — | GCS staging bucket name |
| `GCS_DESTINATION_PREFIX` | No | `""` | Path prefix in GCS (e.g. `billing/`) |
| `BQ_PROJECT_ID` | Yes | — | GCP project for BigQuery |
| `BQ_DATASET_ID` | Yes | — | BigQuery dataset name |
| `BQ_TABLE_ID` | Yes | — | BigQuery table name (partition and cluster fields depend on `BILLING_SCHEMA`) |
| `AWS_REGION` | Yes | — | AWS region (e.g. `us-east-1`) |
| `AWS_ACCESS_KEY_ID` | No | — | AWS key ID; uses instance role if omitted |
| `AWS_SECRET_ACCESS_KEY` | No | — | Required if `AWS_ACCESS_KEY_ID` is set |
| `S3_ENDPOINT_URL` | No | — | Override the S3 endpoint (e.g. an AWS VPC/PrivateLink endpoint); omit to use the default public AWS endpoint |
| `BQ_CMEK_KEY_NAME` | No | — | Full Cloud KMS key resource name (`projects/{project}/locations/{location}/keyRings/{ring}/cryptoKeys/{key}`); when set, all BigQuery load jobs use this CMEK instead of Google-managed encryption |
| `BILLING_SCHEMA` | No | `cur2` | BigQuery schema to use. `cur2` = AWS CUR 2.0 (partition: `bill_billing_period_start_date`, cluster: `line_item_usage_start_date`, `line_item_usage_account_id`); `focus1.2` = AWS FOCUS 1.2 (partition: `BillingPeriodStart`, cluster: `BillingAccountId`) |
| `PORT` | No | `8080` | HTTP port for the uvicorn server |
| `LOG_LEVEL` | No | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
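
The rules in this table can be checked before a run starts. The sketch below is illustrative only: `validate_env` and `partition_prefix` are hypothetical helpers, and dropping an empty `SOURCE_PREFIX` so the key does not begin with `/` is an assumption about how the path is assembled.

```python
from typing import Mapping

REQUIRED = (
    "SOURCE_TYPE", "SOURCE_BUCKET", "EXPORT_NAME", "GCS_BUCKET",
    "BQ_PROJECT_ID", "BQ_DATASET_ID", "BQ_TABLE_ID", "AWS_REGION",
)


def validate_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = [f"{name} is required" for name in REQUIRED if not env.get(name)]
    if env.get("SOURCE_TYPE") not in (None, "", "s3"):
        problems.append("SOURCE_TYPE must be 's3'")
    if env.get("AWS_ACCESS_KEY_ID") and not env.get("AWS_SECRET_ACCESS_KEY"):
        problems.append("AWS_SECRET_ACCESS_KEY is required when AWS_ACCESS_KEY_ID is set")
    if env.get("BILLING_SCHEMA", "cur2") not in ("cur2", "focus1.2"):
        problems.append("BILLING_SCHEMA must be 'cur2' or 'focus1.2'")
    return problems


def partition_prefix(source_prefix: str, export_name: str, period: str) -> str:
    """Build the Hive-style S3 prefix for one billing period (YYYY-MM)."""
    parts = (source_prefix, export_name, "data", f"BILLING_PERIOD={period}")
    # Strip stray slashes and skip empty segments (assumed handling of SOURCE_PREFIX="").
    cleaned = [part.strip("/") for part in parts]
    return "/".join(part for part in cleaned if part) + "/"


print(partition_prefix("reports/", "my-export", "2024-01"))
# reports/my-export/data/BILLING_PERIOD=2024-01/
```

Passing `os.environ` to `validate_env` at startup would surface every missing variable at once rather than one failure per run.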

## Local Development

```bash

pip install -r requirements-dev.txt

# Copy and fill in environment variables
cp .env.example .env

# Run the server
python main.py

```

Test the endpoints:

```bash

curl http://localhost:8080/health
# {"status": "ok"}

# Default run: current month + previous two
curl -X POST http://localhost:8080/run
# {"run_id": "20240115-1705300800", "export_name": "my-export", "periods": [...], "bq_table": "project.dataset.table"}

# Run a single specific partition
curl -X POST http://localhost:8080/run \
  -H 'Content-Type: application/json' \
  -d '{"partition": "2024-01"}'

# Override the export name and partition
curl -X POST http://localhost:8080/run \
  -H 'Content-Type: application/json' \
  -d '{"export_name": "other-export", "partition": "2024-01"}'

```

## Running Tests

```bash

pip install -r requirements-dev.txt

pytest

```

## Deployment to Cloud Run

**1. Store secrets in Secret Manager** (first deploy only):

```bash

echo -n "YOUR_AWS_KEY_ID" | gcloud secrets create billing-loader-aws-key-id --data-file=-
echo -n "YOUR_AWS_SECRET" | gcloud secrets create billing-loader-aws-secret-key --data-file=-

# Grant the service account access to each secret
for SECRET in billing-loader-aws-key-id billing-loader-aws-secret-key; do
  gcloud secrets add-iam-policy-binding $SECRET \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/secretmanager.secretAccessor"
done

```

**2. Build and deploy:**

```bash

IMAGE="gcr.io/${GCP_PROJECT_ID}/billing-loader"

gcloud builds submit --tag "${IMAGE}" .

gcloud run deploy billing-loader \
  --image "${IMAGE}" \
  --platform managed \
  --region "${GCP_REGION:-us-central1}" \
  --no-allow-unauthenticated \
  --service-account "${SERVICE_ACCOUNT}" \
  --set-env-vars "SOURCE_TYPE=s3,SOURCE_BUCKET=${SOURCE_BUCKET},SOURCE_PREFIX=${SOURCE_PREFIX:-},EXPORT_NAME=${EXPORT_NAME},GCS_BUCKET=${GCS_BUCKET},GCS_DESTINATION_PREFIX=${GCS_DESTINATION_PREFIX:-},BQ_PROJECT_ID=${BQ_PROJECT_ID},BQ_DATASET_ID=${BQ_DATASET_ID},BQ_TABLE_ID=${BQ_TABLE_ID},AWS_REGION=${AWS_REGION}" \
  --set-secrets "AWS_ACCESS_KEY_ID=billing-loader-aws-key-id:latest,AWS_SECRET_ACCESS_KEY=billing-loader-aws-secret-key:latest"

```

**3. Create the Cloud Scheduler job:**

```bash

SERVICE_URL=$(gcloud run services describe billing-loader \
  --platform managed --region "${GCP_REGION:-us-central1}" \
  --format "value(status.url)")

gcloud scheduler jobs create http billing-loader-daily \
  --schedule "${CRON_SCHEDULE:-0 6 * * *}" \
  --uri "${SERVICE_URL}/run" \
  --http-method POST \
  --oidc-service-account-email "${SERVICE_ACCOUNT}" \
  --location "${GCP_REGION:-us-central1}"

```

Trigger a manual run:

```bash

gcloud scheduler jobs run billing-loader-daily --location "${GCP_REGION:-us-central1}"

```

## Observability

The service emits structured JSON logs to stdout via `python-json-logger`. On Cloud Run these are captured automatically in Google Cloud Logging with queryable `jsonPayload` fields.

Every log line includes `log_event` (dotted name), `run_id`, and `export_name`.
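
To make the shape of these log lines concrete, here is a small sketch that attaches the three fields to every record. It deliberately swaps in a hand-written stdlib formatter instead of `python-json-logger` so that it runs without extra dependencies; the logger name, the `JsonFormatter` class, and the `severity` and `message` keys are illustrative choices, not the service's actual configuration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Stdlib stand-in for python-json-logger: one JSON object per record."""

    CONTEXT_FIELDS = ("log_event", "run_id", "export_name")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"severity": record.levelname, "message": record.getMessage()}
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("billing_loader_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = make_logger(buffer)
log.info(
    "pipeline started",
    extra={
        "log_event": "pipeline.started",
        "run_id": "20240115-1705300800",
        "export_name": "my-export",
    },
)
print(buffer.getvalue().strip())
```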

### Log events

| `log_event` | Level | When |
|---|---|---|
| `request.received` | INFO | Start of `POST /run` |
| `pipeline.started` | INFO | After run_id generated; includes `periods` list |
| `period.started` / `period.files_listed` / `period.complete` | INFO | Per billing period |
| `period.skipped` | WARNING | S3 partition has no parquet files; includes `reason: "no_parquet_files"` |
| `gcs.file.uploaded` | INFO | After each file uploaded; includes `s3_key`, `gcs_uri` |
| `bq.job.submitted` | INFO | Immediately after BQ job created; includes `job_id` |
| `bq.job.complete` | INFO | After `job.result()` returns; includes `output_rows`, `output_bytes` |
| `bq.job.failed` | ERROR | Before `RuntimeError` is raised; includes `job_id`, `errors` |
| `pipeline.complete` | INFO | After all periods; includes `periods_loaded`, `periods_skipped`, `duration_seconds` |
| `pipeline.failed` | ERROR | Any unhandled exception; re-raises after logging |
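
The `period.skipped` row implies a per-period decision between loading and skipping. A minimal sketch of that decision follows, assuming the service filters a listing of S3 keys by the `.parquet` suffix; `plan_period` and the `files` count on `period.files_listed` are hypothetical, while the `reason` value comes from the table above.

```python
def parquet_keys(keys: list[str]) -> list[str]:
    """Keep only object keys that look like Parquet files."""
    return [key for key in keys if key.lower().endswith(".parquet")]


def plan_period(partition: str, keys: list[str]) -> dict:
    """Describe the log event a period would emit after its files are listed."""
    files = parquet_keys(keys)
    if not files:
        return {
            "log_event": "period.skipped",
            "level": "WARNING",
            "partition": partition,
            "reason": "no_parquet_files",
        }
    return {
        "log_event": "period.files_listed",
        "level": "INFO",
        "partition": partition,
        "files": len(files),  # assumed field; the table does not list it
    }


print(plan_period("BILLING_PERIOD=2024-02", ["my-export/data/BILLING_PERIOD=2024-02/manifest.json"]))
```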

The BigQuery job ID is logged at `bq.job.submitted` so you can look up the job in the BQ console even while the run is still in progress.

Set `LOG_LEVEL=DEBUG` to lower the root log level (default `INFO`).

### Useful Cloud Logging filters

```

# Full timeline for one pipeline run

jsonPayload.run_id="20260423-1745400000"

# Audit log: every BQ partition written (includes output_rows and output_bytes)

jsonPayload.log_event="bq.job.complete"

# Find all periods that were skipped (no parquet files in S3)

jsonPayload.log_event="period.skipped"

```

## API Endpoints

### `GET /health`

Returns service health status.

```json

{"status": "ok"}

```

### `POST /run`

Runs the ETL pipeline. Accepts an optional JSON body:

| Field | Type | Description |
|---|---|---|
| `export_name` | string | Override the `EXPORT_NAME` env var for this run |
| `partition` | string | Process only this month (`YYYY-MM`, e.g. `"2024-01"`). Omit to run the default 3-period window. If the partition has no Parquet files in S3, it is skipped without failing the run; a `period.skipped` warning is logged. |

Returns a summary per period:

```json

{
  "run_id": "20240115-1705300800",
  "export_name": "my-export",
  "periods": [
    {
      "partition": "BILLING_PERIOD=2023-11",
      "files": 3,
      "gcs_uris": ["gs://my-bucket/billing/my-export/data/.../file.parquet"]
    },
    {"partition": "BILLING_PERIOD=2023-12", "files": 3, "gcs_uris": ["..."]},
    {"partition": "BILLING_PERIOD=2024-01", "files": 3, "gcs_uris": ["..."]}
  ],
  "bq_table": "my-project.billing.daily_costs"
}

```

Returns `500` with `{"error": "..."}` if the pipeline fails.