https://github.com/timfanda35/aws-pricing-list-loader
Load AWS Pricing List to Postgres
- Host: GitHub
- URL: https://github.com/timfanda35/aws-pricing-list-loader
- Owner: timfanda35
- Created: 2026-04-27T16:38:36.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-10T06:57:15.000Z (about 2 months ago)
- Last Synced: 2026-05-10T08:40:45.224Z (about 2 months ago)
- Language: Python
- Size: 279 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
# AWS Pricing List Loader
Crawls the [AWS Pricing API](https://pricing.us-east-1.amazonaws.com) to discover all service and savings plan pricing URLs, and bulk-loads every region's pricing CSV into PostgreSQL using an ingestion/swap pattern.
Exposed as a FastAPI HTTP service, with a CLI for local use and a Cloud Run Job entry point for batch execution.
## Setup
### Docker (recommended)
```bash
cp .env.example .env # fill in Postgres credentials
docker compose up --build -d
```
Starts both PostgreSQL and the API. The API automatically runs any pending DB migrations on startup. Interactive docs at `http://localhost:8000/docs`.
### Docker with SSL (GCP Cloud SQL or local mTLS test)
For GCP Cloud SQL, download `server-ca.pem`, `client-cert.pem`, and `client-key.pem` from the Console → Cloud SQL → your instance → Connections → SSL, and place them in `certs/`.
For local testing, generate self-signed equivalents instead:
```bash
bash scripts/gen-dev-certs.sh # creates certs/ with matching file names
docker compose -f docker-compose.yml -f docker-compose.ssl.yml up --build -d
```
The SSL compose override enables TLS on the PostgreSQL container and injects these env vars into `api`:
| Env var | Value (in container) |
|---|---|
| `POSTGRES_SSL_MODE` | `verify-ca` |
| `POSTGRES_SSL_ROOTCERT` | `/app/certs/server-ca.pem` |
| `POSTGRES_SSL_CERT` | `/app/certs/client-cert.pem` |
| `POSTGRES_SSL_KEY` | `/app/certs/client-key.pem` |
For local dev without Docker, set these vars in `.env` pointing to your local cert paths.
### Local development
```bash
pip install -r requirements-dev.txt
cp .env.example .env # fill in Postgres credentials
docker compose up -d db # start Postgres only
uvicorn app.main:app --reload # migrations run automatically on startup
```
## Environment variables
All vars are read from `.env` (or the shell environment). Copy `.env.example` to get started.
| Variable | Default | Required | Description |
|---|---|---|---|
| `POSTGRES_HOST` | — | Yes | PostgreSQL host |
| `POSTGRES_PORT` | `5432` | No | PostgreSQL port |
| `POSTGRES_DB` | — | Yes | Database name |
| `POSTGRES_USER` | — | Yes | Database user |
| `POSTGRES_PASSWORD` | — | Yes | Database password |
| `POSTGRES_SSL_MODE` | — | No | SSL mode (e.g. `verify-ca`); omit for plain TCP |
| `POSTGRES_SSL_ROOTCERT` | — | No | Path to server CA cert |
| `POSTGRES_SSL_CERT` | — | No | Path to client cert |
| `POSTGRES_SSL_KEY` | — | No | Path to client private key |
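As a rough illustration of how these variables could map onto a PostgreSQL connection, the sketch below translates them into libpq-style connection keywords (`host`, `dbname`, `sslmode`, `sslrootcert`, and so on are standard libpq parameter names). The function name `connection_params` and the validation behaviour are hypothetical; the project's actual settings code may look different.

```python
import os
from typing import Dict, Mapping, Optional

# Variables the table above marks as required.
REQUIRED = ("POSTGRES_HOST", "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD")

# Environment variable -> libpq connection keyword.
ENV_TO_KEYWORD = {
    "POSTGRES_HOST": "host",
    "POSTGRES_PORT": "port",
    "POSTGRES_DB": "dbname",
    "POSTGRES_USER": "user",
    "POSTGRES_PASSWORD": "password",
    "POSTGRES_SSL_MODE": "sslmode",
    "POSTGRES_SSL_ROOTCERT": "sslrootcert",
    "POSTGRES_SSL_CERT": "sslcert",
    "POSTGRES_SSL_KEY": "sslkey",
}


def connection_params(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Build connection keywords from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))

    params = {"port": "5432"}  # default port from the table above
    for var, keyword in ENV_TO_KEYWORD.items():
        value = env.get(var)
        if value:  # unset SSL variables are omitted, i.e. plain TCP
            params[keyword] = value
    return params
```

The resulting dict can be passed as keyword arguments to a driver that accepts libpq keywords.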
## API
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Liveness check — returns `{"status":"ok"}` |
| `GET` | `/pricing/urls` | List all discovered pricing URLs; generates any missing schema files |
| `POST` | `/pricing/load` | Load pricing data into PostgreSQL (blocks until complete) |
| `GET` | `/versions` | List all loaded service versions |
**GET /pricing/urls**
```mermaid
flowchart LR
U1["Fetch AWS service index"] --> U2["Fetch region indexes per service"]
U2 --> U3["Generate missing schema files\n(schema/*.sql)"]
U3 --> U4["Return all pricing URLs"]
```
**POST /pricing/load**
```mermaid
flowchart LR
L1["Check versions table\n(skip already-loaded)"] --> L2["Union columns from all region CSVs\nresolve name collisions"]
L2 --> L3["CREATE TABLE {service}_ingestion\n(deduplicated columns)"]
L3 --> L4["COPY each region CSV → staging table\n→ INSERT with COALESCE merge\nON CONFLICT DO NOTHING → ingestion table"]
L4 --> L5["Swap ingestion → production table"]
L5 --> L6["Upsert version record"]
```
`POST /pricing/load` accepts an optional JSON body to target a single service:
```json
{ "name": "comprehend" }
```
Response:
```json
{ "loaded": 14, "services": 1, "elapsed_seconds": 42.3 }
```
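A minimal client sketch for calling this endpoint from Python, using only the standard library. The helper names (`build_load_request`, `load_pricing`) and the one-hour timeout are illustrative assumptions; the long timeout reflects that the endpoint blocks until loading completes.

```python
import json
import urllib.request
from typing import Optional


def build_load_request(base_url: str, name: Optional[str] = None) -> urllib.request.Request:
    """Build the POST /pricing/load request; `name` targets a single service."""
    body = json.dumps({"name": name}).encode("utf-8") if name else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(
        base_url.rstrip("/") + "/pricing/load",
        data=body,
        headers=headers,
        method="POST",
    )


def load_pricing(base_url: str, name: Optional[str] = None, timeout: float = 3600.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_load_request(base_url, name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

For example, `load_pricing("http://localhost:8000", name="comprehend")` would return a dict with the `loaded`, `services`, and `elapsed_seconds` keys shown above.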
## CLI
The CLI shares the same service layer as the API.
### List pricing URLs
```bash
python fetch_pricing_index.py
python fetch_pricing_index.py > output.txt # save to file
```
Output (stdout) is a CSV with columns: `type, name, region, csv_url, publication_date`.
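The listing can be post-processed with the standard `csv` module. This sketch assumes the five columns above appear in that order and tolerates an optional header row, since the README does not say whether one is emitted. `read_listing` and `regions_per_name` are hypothetical helper names.

```python
import csv
import io
from collections import Counter
from typing import Dict, List

COLUMNS = ["type", "name", "region", "csv_url", "publication_date"]


def read_listing(text: str) -> List[Dict[str, str]]:
    """Parse the listing CSV into one dict per pricing URL."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # Assumption: if the first row repeats the column names, it is a header.
    if rows and [cell.strip() for cell in rows[0]] == COLUMNS:
        rows = rows[1:]
    return [dict(zip(COLUMNS, row)) for row in rows]


def regions_per_name(entries: List[Dict[str, str]]) -> Counter:
    """Count how many region entries each service or savings plan has."""
    return Counter(entry["name"] for entry in entries)
```

After `python fetch_pricing_index.py > output.txt`, something like `regions_per_name(read_listing(open("output.txt").read()))` gives a quick per-service region count.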
### Load pricing data into PostgreSQL
```bash
python fetch_pricing_index.py --load
python fetch_pricing_index.py --load --name comprehend
python fetch_pricing_index.py --load --name AWSDatabaseSavingsPlans
```
Already-loaded versions are skipped automatically (tracked in `aws_pricing_list_versions`).
## Cloud Run Job
`run_job.py` is designed to run as a [Google Cloud Run Job](https://cloud.google.com/run/docs/create-jobs) (one-time batch execution). It runs three steps in sequence, exiting with code 1 on any failure so the job runtime can detect and retry failures.
### Steps
1. **DB migrations** — applies any pending SQL migrations (same as API startup)
2. **Version check** — queries current loaded versions, discovers how many service/region entries have new data available
3. **Load** — streams and loads all new pricing CSVs into PostgreSQL
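The "run in sequence, exit 1 on any failure" behaviour can be sketched as follows. This is not the actual `run_job.py`; `run_steps` and the step callables are hypothetical stand-ins for the three steps above.

```python
import logging
from typing import Callable, Sequence, Tuple

Step = Tuple[str, Callable[[], None]]


def run_steps(steps: Sequence[Step]) -> int:
    """Run steps in order; return 0 on success, 1 as soon as one fails."""
    for label, step in steps:
        try:
            logging.info("starting step: %s", label)
            step()
        except Exception:
            # Log the traceback and stop: later steps are not attempted.
            logging.exception("step failed: %s", label)
            return 1
    return 0
```

An entry point would then call `sys.exit(run_steps([...]))` with the migration, version-check, and load callables, so that a non-zero exit code reaches the job runtime.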
### Usage
```bash
# Load all services with new versions
python run_job.py
# Load a single service (useful for testing)
python run_job.py --name AWSComputeSavingsPlan
# Force reload all services (ignore already-loaded versions)
python run_job.py --force
# Force reload a single service
python run_job.py --name AWSComputeSavingsPlan --force
```
### Running as a Cloud Run Job
The same container image serves both the API and the job. The container entrypoint passes arguments through, so override the CMD via `--args`:
```bash
# Create the job
gcloud run jobs create aws-pricing-loader \
--image REGION-docker.pkg.dev/PROJECT/REPO/IMAGE \
--args "python,run_job.py" \
--set-env-vars "POSTGRES_HOST=...,POSTGRES_DB=...,POSTGRES_USER=...,POSTGRES_PASSWORD=..."
# Execute the job
gcloud run jobs execute aws-pricing-loader
# Force reload all services (skip version check)
gcloud run jobs update aws-pricing-loader \
--args "python,run_job.py,--force"
gcloud run jobs execute aws-pricing-loader
# Target a single service
gcloud run jobs update aws-pricing-loader \
--args "python,run_job.py,--name,comprehend"
gcloud run jobs execute aws-pricing-loader
```
Commas in `--args` delimit separate argv entries.
## How loading works
For each service with new data:
1. Fetches column headers from all region CSVs concurrently and unions them. Columns that normalise to the same snake_case name (e.g. `StorageType` and `Storage Type` → `storage_type`) are detected as collisions: the staging representation uses `_2`/`_3` suffixes to preserve CSV positions, while the ingestion table schema keeps only the base name. Generates the `CREATE TABLE` DDL for the ingestion table and executes it directly against the database.
2. Streams each region's CSV, strips the first 6 lines (metadata + header), and bulk-loads via `COPY … FROM STDIN` into a temporary `UNLOGGED` staging table (created with all staging columns including collision suffixes, all as `TEXT`). Rows are then merged into the ingestion table with `INSERT … SELECT … ON CONFLICT (rate_code, pricing_region) DO NOTHING`, using `COALESCE` to collapse suffix variants into a single column. Columns with non-TEXT types (e.g. `effective_date DATE`, `price_per_unit DECIMAL`) are cast via `NULLIF(…, '')::TYPE` to handle both empty strings and NULLs. Global items that appear identically in multiple region CSVs are silently deduplicated.
3. Atomically swaps the ingestion table into production: renames the existing `{service}` table to `drop_{service}`, renames `{service}_ingestion` to `{service}`, then drops `drop_{service}`.
4. Records the loaded version in `aws_pricing_list_versions` so subsequent runs skip it.
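
The collision handling in steps 1 and 2 can be illustrated with a small sketch. The names `to_snake_case`, `plan_columns`, and `select_expr` and their exact rules are assumptions made for illustration (the project's own `to_snake_case` and merge map may behave differently); the sketch only shows how suffixed staging columns can collapse into one ingestion column via `COALESCE`, with a `NULLIF(…, '')::TYPE` cast for non-TEXT columns.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def to_snake_case(name: str) -> str:
    """Normalise a CSV header, e.g. 'StorageType' and 'Storage Type' -> 'storage_type'."""
    s = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", s)  # split lower->Upper boundaries
    return re.sub(r"_+", "_", s).strip("_").lower()


def plan_columns(headers: Sequence[str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return (staging column names, merge map of base name -> staging variants)."""
    staging: List[str] = []
    merge_map: Dict[str, List[str]] = defaultdict(list)
    for header in headers:
        base = to_snake_case(header)
        position = len(merge_map[base]) + 1
        # First occurrence keeps the base name; later ones get _2, _3, ...
        staging_name = base if position == 1 else f"{base}_{position}"
        merge_map[base].append(staging_name)
        staging.append(staging_name)
    return staging, dict(merge_map)


def select_expr(base: str, variants: Sequence[str], sql_type: str = "TEXT") -> str:
    """Build the SELECT expression that fills one ingestion column."""
    if len(variants) == 1:
        expr = variants[0]
    else:
        expr = "COALESCE(" + ", ".join(variants) + ")"
    if sql_type != "TEXT":
        expr = f"NULLIF({expr}, '')::{sql_type}"  # empty string -> NULL before the cast
    return f"{expr} AS {base}"
```

The ingestion table's columns are then the keys of the merge map, and the staging table's columns are the first return value.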
Table names use the original AWS service name (e.g. `AmazonEC2`, `AWSDatabaseSavingsPlans`), not snake_case.
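
PostgreSQL folds unquoted identifiers to lower case, so mixed-case table names such as `AmazonEC2` only keep their casing when double-quoted. The sketch below builds the rename sequence from step 3 with quoted identifiers inside one transaction. Whether the project quotes identifiers or wraps the swap this way is an assumption; `quote_ident` and `swap_statements` are hypothetical helpers.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Double-quote an identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def swap_statements(service: str) -> List[str]:
    """SQL statements that swap {service}_ingestion into place."""
    live = quote_ident(service)
    ingestion = quote_ident(f"{service}_ingestion")
    retired = quote_ident(f"drop_{service}")
    return [
        "BEGIN",
        # IF EXISTS covers the first load, when no production table exists yet.
        f"ALTER TABLE IF EXISTS {live} RENAME TO {retired}",
        f"ALTER TABLE {ingestion} RENAME TO {live}",
        f"DROP TABLE IF EXISTS {retired}",
        "COMMIT",
    ]
```

Because DDL is transactional in PostgreSQL, readers see either the old table or the new one, never a missing table.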
## Schema files
Schema files in `schema/` are generated during URL listing (both CLI listing mode and `GET /pricing/urls`). Each file (`{service}_ingestion.sql`) contains the `CREATE TABLE` and index DDL for one service. In `--load` mode the DDL is generated on the fly and executed directly.
To force schema regeneration, delete the corresponding `.sql` file and re-run in listing mode.
## Migrations
DB migrations live in `migrations/` as numbered SQL files (`0001_*.sql`, `0002_*.sql`, …). They are applied automatically in filename order every time the API starts, and as the first step of `run_job.py`. Applied migrations are tracked in the `schema_migrations` table so each file runs exactly once.
To add a migration:
```bash
# Create the file
echo "ALTER TABLE aws_pricing_list_versions ADD COLUMN IF NOT EXISTS notes TEXT;" \
> migrations/0002_add_notes_column.sql
# Deploy — runs on next startup
uvicorn app.main:app
# Log: Applied migration: 0002_add_notes_column.sql
```
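The "filename order, each file exactly once" rule can be sketched as a pure function. `pending_migrations` is a hypothetical name, and the real runner may select files differently. Zero-padded prefixes are what make plain string sorting match numeric order.

```python
from typing import Iterable, List, Set


def pending_migrations(filenames: Iterable[str], applied: Set[str]) -> List[str]:
    """Return not-yet-applied .sql files in filename order."""
    sql_files = (name for name in filenames if name.endswith(".sql"))
    return sorted(name for name in sql_files if name not in applied)
```

Here `filenames` would come from listing `migrations/`, and `applied` from the names recorded in `schema_migrations`.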
## Testing
### Unit and API tests
No database or network connection required:
```bash
pip install -r requirements-dev.txt
pytest tests/
```
Covers:
- `tests/test_aws_client.py` — `to_snake_case` with real AWS service names
- `tests/test_schema_builder.py` — column type overrides, index name truncation, DDL generation, collision detection and merge_map
- `tests/test_loader.py` — column union, collision deduplication, COALESCE INSERT, staging/ingestion schema split
- `tests/test_api.py` — all endpoints via FastAPI `TestClient` with mocked service layer
- `tests/test_migrations.py` — migration runner: ordering, skip-applied, connection cleanup
- `tests/test_main.py` — lifespan calls `run_migrations()` on startup
### Integration tests
Requires a running PostgreSQL instance (`.env` configured):
```bash
pytest test_create_table.py
```
### Manual smoke tests
Smoke test a savings plan:
```bash
psql -c "DELETE FROM aws_pricing_list_versions WHERE name = 'aws_database_savings_plans';"
python fetch_pricing_index.py --load --name AWSDatabaseSavingsPlans
# Expected: [TABLE] created → [COPY] 36 regions → [SWAP] → [VERSION]
```
Smoke test a service:
```bash
psql -c "DELETE FROM aws_pricing_list_versions WHERE name = 'comprehend';"
python fetch_pricing_index.py --load --name comprehend
# Expected: [TABLE] created → [COPY] 14 regions → [SWAP] → [VERSION]
```
## References
See [references.md](references.md) for AWS Pricing API documentation links and JSON/CSV structure details.