https://github.com/vbalalian/roman_coins_data_pipeline
Work-in-progress, learning-focused, end-to-end ELT data pipeline project.
- Host: GitHub
- URL: https://github.com/vbalalian/roman_coins_data_pipeline
- Owner: vbalalian
- Created: 2023-11-03T03:54:56.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-03-05T02:55:11.000Z (over 1 year ago)
- Last Synced: 2024-03-05T22:45:49.665Z (over 1 year ago)
- Topics: airbyte, api, beautifulsoup, cicd, coins, dagster, data-engineering, data-pipeline, docker, docker-compose, duckdb, elt, fastapi, minio, postgresql, python, rest-api, sql, web-scraping, webscraping
- Language: Python
- Homepage:
- Size: 3.67 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 8
- Metadata Files:
  - Readme: README.md

# Roman Coins
## End-to-end ELT pipeline project
[CI](https://github.com/vbalalian/RomanCoins/actions/workflows/ci.yml)

Extracting, Loading, and Transforming data on Roman coins gathered from wildwinds.com.
**Tools:** Python, PostgreSQL, Docker, FastAPI, Airbyte, MinIO, Dagster, DuckDB, dbt
### [Web Scraper](web_scraping/web_scraper.py)
Scrapes data on coins from the Roman Empire from wildwinds.com and loads it into a PostgreSQL server. Because a 30-second delay is required between page requests, scraping takes several hours to complete; the data is loaded into the server as it arrives.
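A minimal sketch of what such a scrape-and-load loop can look like (not the project's actual scraper; the page URL, table, and columns are illustrative placeholders):

```python
# Sketch of a polite scrape-and-load loop: fetch a page, parse rows,
# insert into Postgres, then wait the required 30 seconds.
import time
import requests
from bs4 import BeautifulSoup
import psycopg2

conn = psycopg2.connect("dbname=roman_coins user=postgres password=postgres host=localhost")

def scrape_page(url: str) -> list[dict]:
    """Fetch one page and pull coin rows out of the HTML tables."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    coins = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            coins.append({"description": " ".join(cells)})
    return coins

for url in ["https://www.wildwinds.com/coins/ric/augustus/i.html"]:  # placeholder page list
    for coin in scrape_page(url):
        with conn.cursor() as cur:
            cur.execute("INSERT INTO roman_coins (description) VALUES (%s)", (coin["description"],))
        conn.commit()      # load data as it arrives
    time.sleep(30)         # required 30-second delay between page requests
```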
### [API](api/main.py)
Serves data from the Roman coins dataset and allows data addition and manipulation via POST, PUT, and PATCH endpoints. Data is continuously added during web scraping.
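A hedged example of interacting with the API from Python; the endpoint paths and fields below are illustrative, not the project's documented schema (see /docs for the real endpoints):

```python
# Hypothetical client calls against the running API at localhost:8010.
import requests

BASE = "http://localhost:8010"

# Read coins currently in the dataset
coins = requests.get(f"{BASE}/coins").json()

# Add a new coin record
new_coin = {"ruler": "Trajan", "denomination": "denarius", "year": 107}
resp = requests.post(f"{BASE}/coins", json=new_coin)
resp.raise_for_status()

# Partially update an existing record
requests.patch(f"{BASE}/coins/{resp.json()['id']}", json={"year": 108})
```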
### [Airbyte](extract-load-transform/airbyte-api-minio-connection/airbyte_connection_config.py)
[Custom Airbyte connector](extract-load-transform/custom-airbyte-connector/source_roman_coin_api/source.py) streams incremental data from the API to a standalone MinIO bucket.
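Conceptually, an incremental stream only requests records newer than the last cursor value it has seen. The sketch below illustrates that idea with plain requests; the query parameter and state handling are hypothetical, not the connector's actual implementation:

```python
# Conceptual incremental read: only fetch records newer than the last cursor.
import requests

API = "http://localhost:8010/coins"

def read_incremental(state: dict) -> tuple[list[dict], dict]:
    cursor = state.get("last_seen")                    # e.g. a created_at timestamp
    params = {"min_created": cursor} if cursor else {}
    records = requests.get(API, params=params).json()
    if records:
        state = {"last_seen": max(r["created_at"] for r in records)}
    return records, state

records, state = read_incremental({})    # first sync: full read
more, state = read_incremental(state)    # later syncs: only new records
```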
### [MinIO](https://min.io)
Resilient storage for the incoming data stream. Data is replicated ["at least once"](https://docs.airbyte.com/using-airbyte/core-concepts/sync-modes/incremental-append-deduped#inclusive-cursors) by Airbyte, so some duplicated data is acceptable at this stage. Deduplication will be easily handled by dbt at the next stage of the pipeline.
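One way to peek at what has landed in the bucket is the MinIO Python client; the endpoint, credentials, and bucket name below are placeholders for the values configured in the project's .env file:

```python
# List the objects Airbyte has written to the MinIO bucket.
from minio import Minio

client = Minio(
    "localhost:9000",          # MinIO API endpoint (placeholder)
    access_key="minioadmin",   # placeholder credentials
    secret_key="minioadmin",
    secure=False,
)

for obj in client.list_objects("airbyte-bucket", recursive=True):
    print(obj.object_name, obj.size)
```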
### [Dagster](orchestration/orchestration)
[Sensors](extract-load-transform/orchestration/orchestration/sensors/__init__.py) trigger Airbyte syncs and DuckDB loads on a minute-by-minute basis.
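For reference, a minimal Dagster sensor looks roughly like this (a sketch with a hypothetical load job, not the project's actual sensors):

```python
# Minimal Dagster sensor: once a minute, request a run of a placeholder load job.
from dagster import RunRequest, job, op, sensor

@op
def load_from_minio_to_duckdb():
    ...  # placeholder for the actual load logic

@job
def duckdb_load_job():
    load_from_minio_to_duckdb()

@sensor(job=duckdb_load_job, minimum_interval_seconds=60)
def minute_load_sensor(context):
    # The real project would first check that an Airbyte sync produced new data
    # before requesting a run.
    yield RunRequest(run_key=None)
```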
### [DuckDB](https://duckdb.org/)
Local data warehouse.
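Querying the warehouse from Python is straightforward with the duckdb package; the database path and table name here are placeholders:

```python
# Open the DuckDB file and run a quick sanity-check query.
import duckdb

con = duckdb.connect("warehouse.duckdb")
print(con.execute("SELECT COUNT(*) FROM roman_coins").fetchone())
con.close()
```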
### [dbt](https://docs.getdbt.com/docs/introduction)
Transforms data within the data warehouse.
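For example, the deduplication mentioned above can be expressed with a window function; the snippet below runs an equivalent query through DuckDB's Python API, with placeholder table and column names, to illustrate the kind of transformation a dbt model performs:

```python
# Deduplicate the raw, at-least-once stream by keeping the latest row per id.
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("""
    CREATE OR REPLACE TABLE roman_coins_deduped AS
    SELECT *
    FROM raw_roman_coins
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _extracted_at DESC) = 1
""")
con.close()
```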
## Requirements:
[Docker](https://docs.docker.com/engine/install/)\
[Docker Compose](https://docs.docker.com/compose/install/)\
[Airbyte](https://docs.airbyte.com/deploying-airbyte/local-deployment)

## To Run:
**Step 1:** Ensure Docker and Airbyte are both up and running.\
**Step 2: (Optional)** Set preferred credentials/variables in the project's .env file.\
**Step 3:** Run the following terminal commands:
```
git clone https://github.com/vbalalian/roman_coins_data_pipeline.git
cd roman_coins_data_pipeline
docker compose up
```
This will run the web scraper, the API, MinIO, and [Dagster](https://dagster.io); then build the custom Airbyte connector, configure the API-Airbyte-MinIO connection, and trigger Airbyte syncs and DuckDB load jobs automatically using sensors.

- View the web_scraper container logs in Docker to follow the progress of the web scraping
- Access the API directly at http://localhost:8010, or interact with the different endpoints at http://localhost:8010/docs
- Access the Airbyte UI at http://localhost:8000
- Access the MinIO Console at http://localhost:9090
- Access the Dagster UI at http://localhost:3000
- At the moment, DuckDB access is limited to `docker exec` commands on one of the Dagster services with access to the DuckDB volume.