https://github.com/tcd93/invoice-data-pipeline
A sample data pipeline for transforming invoice images and CSV files into beautiful numbers
airflow data-pipeline kubernetes python trino
- Host: GitHub
- URL: https://github.com/tcd93/invoice-data-pipeline
- Owner: tcd93
- License: mit
- Created: 2025-02-04T14:18:27.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-02-04T14:45:22.000Z (4 months ago)
- Last Synced: 2025-02-04T15:41:10.133Z (4 months ago)
- Topics: airflow, data-pipeline, kubernetes, python, trino
- Language: Shell
- Homepage:
- Size: 10.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Invoice Data Platform

**A sample data pipeline for transforming invoice images and CSV files into BI Service Dashboards**

## Summary
TL;DR
Raw data (images and CSV) from the repo's [/k8s/object_store](./k8s/object_store) is transformed into
beautiful numbers displayed in Apache Superset.

> * Invoice images are sampled from the CORDv2 dataset
> * The CSV file is from [Kaggle](https://www.kaggle.com/code/mahabubsheikh/cafe-sales-dirty-data-for-cleaning)

> This is a simplified data pipeline, meant to run on a single machine (e.g. your laptop). In a production environment, Airflow would only act as a scheduler that triggers jobs on a separate Spark cluster. Trino is probably not needed in that case and can be replaced with Spark SQL.
## Requirement
- **[Docker Desktop](https://www.docker.com/products/docker-desktop/)** (enable Kubernetes and WSL2) or **minikube**
- [Helm](https://helm.sh/docs/intro/install/)
- Python 3.12 ([Microsoft store](https://apps.microsoft.com/search?query=python+3.12))
- openssl: generates secrets for Superset and a certificate for Trino
  - For Windows users: install [Git for Windows](https://gitforwindows.org/); openssl is included in its Git Bash console
- \>16GB RAM, preferably 32GB

## Quick Start
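
Before deploying, it can help to confirm that the command-line tools from the requirement list are on your `PATH`. The following is a minimal sketch, not part of the repository: `missing_tools` is a hypothetical helper, and the command names passed to it (`docker`, `kubectl`, `helm`, `python3`, `openssl`) are assumptions that may differ on your platform (for example, Python may be installed as `python` on Windows).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repository):
# print each named command that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
  return 0
}

# Assumed command names; adjust them to match your installation.
missing_tools docker kubectl helm python3 openssl
```

If the call prints nothing, every listed command was found. This only checks that the commands exist, not their versions or whether Kubernetes is enabled.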
TL;DR
```bash
(cd ./k8s && ./deploy.sh)
```

Many services are of type NodePort; run `kubectl get svc -n everest` to get their exposed port numbers. See [defaults.sh](./k8s/defaults.sh)
for the default login credentials.

Step-by-step [guide](./guide/README.md)
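
To pick the exposed ports out of the `kubectl get svc -n everest` output, a small filter can help. This is a sketch under assumptions: `list_nodeports` is a hypothetical helper, not part of the repository, and it assumes kubectl's default table layout (`NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE`) with port entries of the form `port:nodePort/protocol`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the repository):
# read `kubectl get svc` table output on stdin and print "<service> <nodePort>"
# for every port exposed by a service of type NodePort.
list_nodeports() {
  awk '$2 == "NodePort" {
    n = split($5, ports, ",")
    for (i = 1; i <= n; i++) {
      # Each entry looks like port:nodePort/protocol; field 2 is the node port.
      split(ports[i], parts, "[:/]")
      print $1, parts[2]
    }
  }'
}

# Usage against the namespace named above (needs a running cluster):
#   kubectl get svc -n everest | list_nodeports
```

With Docker Desktop the printed ports are typically reachable on `localhost`; with minikube you would likely need the address reported by `minikube ip` instead.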