https://github.com/timfanda35/cost-to-bq
Transfer cost data to GCP BigQuery
- Host: GitHub
- URL: https://github.com/timfanda35/cost-to-bq
- Owner: timfanda35
- License: mit
- Created: 2026-04-23T14:14:27.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-23T14:23:06.000Z (2 months ago)
- Last Synced: 2026-04-23T16:30:03.404Z (2 months ago)
- Language: Python
- Size: 49.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# billing-loader
A FastAPI service that extracts billing files from AWS S3 (Cost and Usage Reports in Hive-partitioned format), stages them in Google Cloud Storage (GCS), and loads them into BigQuery. Designed to run on Cloud Run, triggered daily by Cloud Scheduler.
## Architecture
```
S3 (CUR Hive partitions) → GCS (staging) → BigQuery (partitioned WRITE_TRUNCATE)
```
By default each run loads **3 billing periods** (current month + previous two). The `/run` endpoint also accepts optional parameters to process a specific export or a single billing period.
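The default window can be sketched in a few lines of standard-library Python. This is only an illustration of "current month plus the previous two"; the function name `default_billing_periods` and its `count` parameter are hypothetical and are not taken from the service's source.

```python
from datetime import date


def default_billing_periods(today: date, count: int = 3) -> list[str]:
    """Return `count` billing periods as YYYY-MM strings, oldest first,
    ending with the month that contains `today`."""
    # Work in "months since year 0" so stepping back across January is plain arithmetic.
    month_index = today.year * 12 + (today.month - 1)
    periods = []
    for offset in range(count - 1, -1, -1):
        year, month0 = divmod(month_index - offset, 12)
        periods.append(f"{year:04d}-{month0 + 1:02d}")
    return periods


print(default_billing_periods(date(2024, 1, 15)))
# ['2023-11', '2023-12', '2024-01']
```

A run on 15 January 2024 would therefore cover November 2023 through January 2024, which matches the periods shown in the `POST /run` response example.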
## BigQuery Schemas
Explicit schemas for BigQuery loads are stored in `src/bq_schema/`:
| File | Format |
|---|---|
| `aws-cur-2.0-parquet.json` | AWS Cost and Usage Report (CUR) 2.0 — Parquet |
| `aws-focus-1.2-parquet.json` | AWS FOCUS 1.2 — Parquet |
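As a rough sketch, the schema files can be tied to the `BILLING_SCHEMA` values described under Configuration with a small lookup table. The pairing of `cur2` with `aws-cur-2.0-parquet.json` and `focus1.2` with `aws-focus-1.2-parquet.json` is an assumption based on the file names; `SchemaSpec`, `SCHEMAS`, `resolve_schema`, and `schema_path` are hypothetical names, not the service's own code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SchemaSpec:
    schema_file: str
    partition_field: str
    cluster_fields: tuple[str, ...]


# Partition and cluster fields are copied from the BILLING_SCHEMA description;
# the file-name pairing is an assumption.
SCHEMAS = {
    "cur2": SchemaSpec(
        "aws-cur-2.0-parquet.json",
        "bill_billing_period_start_date",
        ("line_item_usage_start_date", "line_item_usage_account_id"),
    ),
    "focus1.2": SchemaSpec(
        "aws-focus-1.2-parquet.json",
        "BillingPeriodStart",
        ("BillingAccountId",),
    ),
}


def resolve_schema(name: str = "cur2") -> SchemaSpec:
    """Look up a schema spec, failing loudly on an unknown BILLING_SCHEMA value."""
    try:
        return SCHEMAS[name]
    except KeyError:
        raise ValueError(f"unknown BILLING_SCHEMA {name!r}; expected one of {sorted(SCHEMAS)}") from None


def schema_path(name: str = "cur2", root: str = "src/bq_schema") -> Path:
    """Path of the JSON schema file for the given BILLING_SCHEMA value."""
    return Path(root) / resolve_schema(name).schema_file
```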
## Prerequisites
- Python 3.11+
- A GCP project with the following APIs enabled: Cloud Run, Cloud Scheduler, Cloud Storage, BigQuery (the deployment steps in this README also use Secret Manager and Cloud Build)
- A GCP service account with these roles:
  - `roles/storage.objectAdmin` on the GCS staging bucket
  - `roles/bigquery.dataEditor` and `roles/bigquery.jobUser` on the BQ project
## Configuration
Copy `.env.example` to `.env` and fill in the values.
| Variable | Required | Default | Description |
|---|---|---|---|
| `SOURCE_TYPE` | Yes | — | Must be `s3` |
| `SOURCE_BUCKET` | Yes | — | S3 bucket name |
| `SOURCE_PREFIX` | No | `""` | Path prefix in the bucket before the export name |
| `EXPORT_NAME` | Yes | — | CUR export name; forms the Hive path `{SOURCE_PREFIX}/{EXPORT_NAME}/data/BILLING_PERIOD=YYYY-MM/` |
| `GCS_BUCKET` | Yes | — | GCS staging bucket name |
| `GCS_DESTINATION_PREFIX` | No | `""` | Path prefix in GCS (e.g. `billing/`) |
| `BQ_PROJECT_ID` | Yes | — | GCP project for BigQuery |
| `BQ_DATASET_ID` | Yes | — | BigQuery dataset name |
| `BQ_TABLE_ID` | Yes | — | BigQuery table name (partition and cluster fields depend on `BILLING_SCHEMA`) |
| `AWS_REGION` | Yes | — | AWS region (e.g. `us-east-1`) |
| `AWS_ACCESS_KEY_ID` | No | — | AWS key ID; uses instance role if omitted |
| `AWS_SECRET_ACCESS_KEY` | No | — | Required if `AWS_ACCESS_KEY_ID` is set |
| `S3_ENDPOINT_URL` | No | — | Override the S3 endpoint (e.g. an AWS VPC/PrivateLink endpoint); omit to use the default public AWS endpoint |
| `BQ_CMEK_KEY_NAME` | No | — | Full Cloud KMS key resource name (`projects/{project}/locations/{location}/keyRings/{ring}/cryptoKeys/{key}`); when set, all BigQuery load jobs use this CMEK instead of Google-managed encryption |
| `BILLING_SCHEMA` | No | `cur2` | BigQuery schema to use. `cur2` = AWS CUR 2.0 (partition: `bill_billing_period_start_date`, cluster: `line_item_usage_start_date`, `line_item_usage_account_id`); `focus1.2` = AWS FOCUS 1.2 (partition: `BillingPeriodStart`, cluster: `BillingAccountId`) |
| `PORT` | No | `8080` | HTTP port for the uvicorn server |
| `LOG_LEVEL` | No | `INFO` | Python log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
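The rules in the table can be checked before a run with a short script. The following is a minimal sketch under the assumptions stated in the table (required variables, `SOURCE_TYPE` fixed to `s3`, the secret key required whenever the key ID is set, and the Hive path layout from the `EXPORT_NAME` row). The helper names `config_problems` and `partition_prefix` are hypothetical, and the service itself may validate its settings differently.

```python
REQUIRED = (
    "SOURCE_TYPE", "SOURCE_BUCKET", "EXPORT_NAME", "GCS_BUCKET",
    "BQ_PROJECT_ID", "BQ_DATASET_ID", "BQ_TABLE_ID", "AWS_REGION",
)


def config_problems(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the config looks usable."""
    problems = [f"{name} is required" for name in REQUIRED if not env.get(name)]
    if env.get("SOURCE_TYPE") and env["SOURCE_TYPE"] != "s3":
        problems.append("SOURCE_TYPE must be 's3'")
    if env.get("AWS_ACCESS_KEY_ID") and not env.get("AWS_SECRET_ACCESS_KEY"):
        problems.append("AWS_SECRET_ACCESS_KEY is required when AWS_ACCESS_KEY_ID is set")
    return problems


def partition_prefix(env: dict[str, str], period: str) -> str:
    """Build {SOURCE_PREFIX}/{EXPORT_NAME}/data/BILLING_PERIOD=YYYY-MM/ without
    doubled or leading slashes when SOURCE_PREFIX is empty."""
    parts = [env.get("SOURCE_PREFIX", ""), env["EXPORT_NAME"], "data", f"BILLING_PERIOD={period}"]
    return "/".join(p.strip("/") for p in parts if p.strip("/")) + "/"
```

Passing `dict(os.environ)` as `env` after loading `.env` would check the live configuration; an empty `SOURCE_PREFIX` yields a prefix that starts directly with the export name.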
## Local Development
```bash
pip install -r requirements-dev.txt
# Copy and fill in environment variables
cp .env.example .env
# Run the server
python main.py
```
Test the endpoints:
```bash
curl http://localhost:8080/health
# {"status": "ok"}
# Default run: current month + previous two
curl -X POST http://localhost:8080/run
# {"run_id": "20240115-1705300800", "export_name": "my-export", "periods": [...], "bq_table": "project.dataset.table"}
# Run a single specific partition
curl -X POST http://localhost:8080/run \
  -H 'Content-Type: application/json' \
  -d '{"partition": "2024-01"}'
# Override the export name and partition
curl -X POST http://localhost:8080/run \
  -H 'Content-Type: application/json' \
  -d '{"export_name": "other-export", "partition": "2024-01"}'
```
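The same calls can be made from Python with only the standard library. This sketch mirrors the `curl` commands above; `build_run_request` and `trigger_run` are hypothetical helper names, and the base URL assumes the local server on port 8080.

```python
import json
import urllib.request
from typing import Optional


def build_run_request(
    base_url: str,
    partition: Optional[str] = None,
    export_name: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST /run request; the JSON body is sent only when a field is set."""
    body = {}
    if export_name is not None:
        body["export_name"] = export_name
    if partition is not None:
        body["partition"] = partition
    data = json.dumps(body).encode("utf-8") if body else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/run", data=data, headers=headers, method="POST"
    )


def trigger_run(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON summary (raises HTTPError on a 500)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `trigger_run(build_run_request("http://localhost:8080", partition="2024-01"))` corresponds to the single-partition `curl` call.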
## Running Tests
```bash
pip install -r requirements-dev.txt
pytest
```
## Deployment to Cloud Run
**1. Store secrets in Secret Manager** (first deploy only):
```bash
echo -n "YOUR_AWS_KEY_ID" | gcloud secrets create billing-loader-aws-key-id --data-file=-
echo -n "YOUR_AWS_SECRET" | gcloud secrets create billing-loader-aws-secret-key --data-file=-
# Grant the service account access to each secret
for SECRET in billing-loader-aws-key-id billing-loader-aws-secret-key; do
  gcloud secrets add-iam-policy-binding "${SECRET}" \
    --member="serviceAccount:${SERVICE_ACCOUNT}" \
    --role="roles/secretmanager.secretAccessor"
done
```
**2. Build and deploy:**
```bash
IMAGE="gcr.io/${GCP_PROJECT_ID}/billing-loader"
gcloud builds submit --tag "${IMAGE}" .
gcloud run deploy billing-loader \
  --image "${IMAGE}" \
  --platform managed \
  --region "${GCP_REGION:-us-central1}" \
  --no-allow-unauthenticated \
  --service-account "${SERVICE_ACCOUNT}" \
  --set-env-vars "SOURCE_TYPE=s3,SOURCE_BUCKET=${SOURCE_BUCKET},SOURCE_PREFIX=${SOURCE_PREFIX:-},EXPORT_NAME=${EXPORT_NAME},GCS_BUCKET=${GCS_BUCKET},GCS_DESTINATION_PREFIX=${GCS_DESTINATION_PREFIX:-},BQ_PROJECT_ID=${BQ_PROJECT_ID},BQ_DATASET_ID=${BQ_DATASET_ID},BQ_TABLE_ID=${BQ_TABLE_ID},AWS_REGION=${AWS_REGION}" \
  --set-secrets "AWS_ACCESS_KEY_ID=billing-loader-aws-key-id:latest,AWS_SECRET_ACCESS_KEY=billing-loader-aws-secret-key:latest"
```
**3. Create the Cloud Scheduler job:**
```bash
SERVICE_URL=$(gcloud run services describe billing-loader \
  --platform managed --region "${GCP_REGION:-us-central1}" \
  --format "value(status.url)")
gcloud scheduler jobs create http billing-loader-daily \
  --schedule "${CRON_SCHEDULE:-0 6 * * *}" \
  --uri "${SERVICE_URL}/run" \
  --http-method POST \
  --oidc-service-account-email "${SERVICE_ACCOUNT}" \
  --location "${GCP_REGION:-us-central1}"
```
Trigger a manual run:
```bash
gcloud scheduler jobs run billing-loader-daily --location "${GCP_REGION:-us-central1}"
```
## Observability
The service emits structured JSON logs to stdout via `python-json-logger`. On Cloud Run these are captured automatically in Google Cloud Logging with queryable `jsonPayload` fields.
Every log line includes `log_event` (dotted name), `run_id`, and `export_name`.
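To illustrate the shape of such a log line, here is a stdlib-only sketch. It deliberately does not use `python-json-logger`; `JsonFormatter` and `make_logger` are hypothetical stand-ins, and the `severity` and `message` keys are an assumption about how the payload might be laid out, not a description of the service's actual output.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the three shared fields."""

    FIELDS = ("log_event", "run_id", "export_name")

    def format(self, record: logging.LogRecord) -> str:
        payload = {"severity": record.levelname, "message": record.getMessage()}
        for field in self.FIELDS:
            # Values passed via `extra=` become attributes on the record.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def make_logger(stream=sys.stdout, level: str = "INFO") -> logging.Logger:
    logger = logging.getLogger("billing_loader_sketch")
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger


log = make_logger()
log.warning(
    "no parquet files for period",
    extra={"log_event": "period.skipped", "run_id": "20240115-1705300800", "export_name": "my-export"},
)
```

Writing one JSON object per line to stdout is what lets Cloud Logging expose the fields under `jsonPayload`.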
### Log events
| `log_event` | Level | When |
|---|---|---|
| `request.received` | INFO | Start of `POST /run` |
| `pipeline.started` | INFO | After `run_id` is generated; includes `periods` list |
| `period.started` / `period.files_listed` / `period.complete` | INFO | Per billing period |
| `period.skipped` | WARNING | S3 partition has no parquet files; includes `reason: "no_parquet_files"` |
| `gcs.file.uploaded` | INFO | After each file uploaded; includes `s3_key`, `gcs_uri` |
| `bq.job.submitted` | INFO | Immediately after BQ job created; includes `job_id` |
| `bq.job.complete` | INFO | After `job.result()` returns; includes `output_rows`, `output_bytes` |
| `bq.job.failed` | ERROR | Before `RuntimeError` is raised; includes `job_id`, `errors` |
| `pipeline.complete` | INFO | After all periods; includes `periods_loaded`, `periods_skipped`, `duration_seconds` |
| `pipeline.failed` | ERROR | Any unhandled exception; re-raises after logging |
The BigQuery job ID is logged at `bq.job.submitted` so you can look up the job in the BQ console even while the run is still in progress.
Set `LOG_LEVEL=DEBUG` to lower the root log level (default `INFO`).
### Useful Cloud Logging filters
```
# Full timeline for one pipeline run
jsonPayload.run_id="20260423-1745400000"
# Audit log: every BQ partition written (includes output_rows and output_bytes)
jsonPayload.log_event="bq.job.complete"
# Find all periods that were skipped (no parquet files in S3)
jsonPayload.log_event="period.skipped"
```
## API Endpoints
### `GET /health`
Returns service health status.
```json
{"status": "ok"}
```
### `POST /run`
Runs the ETL pipeline. Accepts an optional JSON body:
| Field | Type | Description |
|---|---|---|
| `export_name` | string | Override the `EXPORT_NAME` env var for this run |
| `partition` | string | Process only this month (`YYYY-MM`, e.g. `"2024-01"`). Omit to run the default 3-period window. If the partition has no Parquet files in S3, it is skipped rather than treated as an error, and a `period.skipped` warning is logged. |
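The two optional body fields can be modelled and validated as follows. This is a stdlib-only sketch: the service is built on FastAPI and may well use its own request model, so `RunRequest` and `parse_run_request` are hypothetical names, and rejecting a malformed `partition` with `ValueError` is an assumption rather than documented behaviour.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Four-digit year, dash, two-digit month between 01 and 12.
_PARTITION_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


@dataclass(frozen=True)
class RunRequest:
    export_name: Optional[str] = None
    partition: Optional[str] = None


def parse_run_request(body: Optional[dict]) -> RunRequest:
    """Turn an optional JSON body into a RunRequest, checking the YYYY-MM format."""
    body = body or {}
    partition = body.get("partition")
    if partition is not None and not (
        isinstance(partition, str) and _PARTITION_RE.fullmatch(partition)
    ):
        raise ValueError(f"partition must look like YYYY-MM, got {partition!r}")
    return RunRequest(export_name=body.get("export_name"), partition=partition)
```

An empty or missing body yields `RunRequest()` with both fields unset, which corresponds to the default 3-period window.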
Returns a summary per period:
```json
{
"run_id": "20240115-1705300800",
"export_name": "my-export",
"periods": [
{
"partition": "BILLING_PERIOD=2023-11",
"files": 3,
"gcs_uris": ["gs://my-bucket/billing/my-export/data/.../file.parquet"]
},
{"partition": "BILLING_PERIOD=2023-12", "files": 3, "gcs_uris": ["..."]},
{"partition": "BILLING_PERIOD=2024-01", "files": 3, "gcs_uris": ["..."]}
],
"bq_table": "my-project.billing.daily_costs"
}
```
Returns `500` with `{"error": "..."}` if the pipeline fails.
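A caller might condense the response into a per-period file count, as in this sketch. It assumes the response shape shown above and that the `{"error": "..."}` body has already been decoded; `files_per_period` is a hypothetical helper, and the sample payload abbreviates the `gcs_uris` values.

```python
import json

SAMPLE_RESPONSE = """
{
  "run_id": "20240115-1705300800",
  "export_name": "my-export",
  "periods": [
    {"partition": "BILLING_PERIOD=2023-11", "files": 3, "gcs_uris": ["..."]},
    {"partition": "BILLING_PERIOD=2023-12", "files": 3, "gcs_uris": ["..."]},
    {"partition": "BILLING_PERIOD=2024-01", "files": 3, "gcs_uris": ["..."]}
  ],
  "bq_table": "my-project.billing.daily_costs"
}
"""


def files_per_period(response: dict) -> dict[str, int]:
    """Map each billing period (YYYY-MM) to the number of files it reported."""
    if "error" in response:
        # Body of a failed run, per the 500 response described above.
        raise RuntimeError(response["error"])
    return {
        period["partition"].removeprefix("BILLING_PERIOD="): period["files"]
        for period in response["periods"]
    }


counts = files_per_period(json.loads(SAMPLE_RESPONSE))
print(counts)
# {'2023-11': 3, '2023-12': 3, '2024-01': 3}
```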