https://github.com/hamidurrahman1/dockerized-data-pipeline
A repository to learn Apache Airflow by integrating it with other technologies.
- Host: GitHub
- URL: https://github.com/hamidurrahman1/dockerized-data-pipeline
- Owner: HamidurRahman1
- Created: 2023-10-22T15:52:41.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-23T01:20:00.000Z (almost 2 years ago)
- Last Synced: 2025-02-04T23:51:15.875Z (8 months ago)
- Topics: apache-airflow, apache-maven, apache-spark, docker, hibernate, java, ldap, postgres, python, scala, shell, spring-boot
- Language: Java
- Homepage:
- Size: 127 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# dockerized-data-pipeline
* Steps to run:
1. Update `MOUNT_VOL` in the `dev.env` file if you want to store all data generated by the containers in a specific folder. If left unchanged, the path resolves relative to the project/repo directory (see the sketch after these steps).
2. `cd` into the project/repo directory.
3. Build one of the custom docker images (the image name and tag are used in `ddp-airflow-compose.yml`) -
1. Build with the multi-stage Dockerfile (image size: ~1.73 GB) -
`DOCKER_BUILDKIT=1 docker build -f ./dockerfiles/ddp-airflow-multi-stage -t ddp-airflow:v1 . --target=RUNTIME`
2. Build with the single-stage Dockerfile (image size: ~2.28 GB) -
`docker build -f ./dockerfiles/ddp-airflow -t ddp-airflow:v1 .`
4. Spin up the init compose file and wait until the vault server is up and running (see the note after these steps) -
`docker-compose -f ./ddp-init-compose.yml --env-file dev.env up`
5. Finally, spin up the airflow compose file -
`docker-compose -f ./ddp-airflow-compose.yml --env-file dev.env --env-file ./vault/vol/keys/vault-token.env up`
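Step 1 refers to `dev.env`; as a rough sketch (the actual file in the repo is the source of truth, and the value below is only a placeholder), the relevant entry is a plain `KEY=VALUE` assignment:

```shell
# dev.env (sketch) - only the variable mentioned in step 1 is shown;
# the real file may define more. The value is a placeholder, not the repo's default.
MOUNT_VOL=./data
```

Regarding step 4, one way to tell that the Vault server is up (besides opening the UI below) is that `curl -s http://localhost:8200/v1/sys/health` starts returning a JSON status response.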
* Vault UI: http://localhost:8200/
* Vault token is available in: `vault/vol/keys/vault-token.env`
* PHP LDAP Admin: http://localhost:8001/
* Login DN: `cn=admin,dc=ddp,dc=com`
* password: `admin`
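If you prefer the command line over the PHP LDAP Admin UI, a query along these lines should work, assuming the LDAP container publishes its port 389 to the host (the host port mapping is an assumption; check the compose files for the actual value):

```shell
# Bind with the admin DN/password listed above and dump the ddp.com subtree.
# Replace 389 with whatever host port the LDAP container actually exposes.
ldapsearch -x \
  -H ldap://localhost:389 \
  -D "cn=admin,dc=ddp,dc=com" -w admin \
  -b "dc=ddp,dc=com" "(objectClass=*)"
```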
* Airflow Webserver: http://localhost:8000/login/
* users:
* username: `hrahman`, password: `hrahman1` (Admin Role)
* username: `jdoe`, password: `jdoe1` (Viewer Role)
* You may trigger both `ddp.failed_banks_processor` and `ddp.nyc_parking_and_camera_violations` and then head over to the Flower UI to check whether the scheduler has distributed the work across both worker nodes (a curl alternative is sketched below).
* Invoke the test REST API DAG -
* `curl -X 'POST' 'http://localhost:8000/api/v1/dags/basic.called_via_rest_api/dagRuns' -u "hrahman:hrahman1" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ "conf": { "param1": "value 1", "param2": "value 2" } }'`
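The same Airflow stable REST API can be used to trigger the other DAGs and to check whether a run actually went through; the sketch below reuses the credentials and DAG IDs listed above (whether `ddp.failed_banks_processor` expects any `conf` is not documented here, so an empty one is passed):

```shell
# Trigger one of the heavier DAGs with an empty conf.
curl -X 'POST' 'http://localhost:8000/api/v1/dags/ddp.failed_banks_processor/dagRuns' \
  -u "hrahman:hrahman1" \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d '{ "conf": {} }'

# List recent runs of the test DAG and inspect their "state" field.
curl -s 'http://localhost:8000/api/v1/dags/basic.called_via_rest_api/dagRuns' \
  -u "hrahman:hrahman1" -H 'accept: application/json'
```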
* Celery flower UI: http://localhost:9000/
* DDP rest api: http://localhost:7000/
* Nice to have:
* Multi-stage docker build. (implemented)
* Use LDAP for airflow webserver. (implemented)
* Vault or similar for storing database credentials. (implemented)
* Proper logging instead of `System.out` prints.
* Use `hdfs` instead of local file system.
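Not covered above, but handy once you are done experimenting: tearing the stacks down is the usual docker-compose `down` with the same files and env-files used for `up` (add `-v` only if you also want to drop the volumes):

```shell
# Stop the Airflow stack first, then the init stack.
docker-compose -f ./ddp-airflow-compose.yml --env-file dev.env \
  --env-file ./vault/vol/keys/vault-token.env down
docker-compose -f ./ddp-init-compose.yml --env-file dev.env down
```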