{"id":15009056,"url":"https://github.com/hamidurrahman1/dockerized-data-pipeline","last_synced_at":"2026-03-10T20:35:54.741Z","repository":{"id":202941394,"uuid":"708451552","full_name":"HamidurRahman1/dockerized-data-pipeline","owner":"HamidurRahman1","description":"A repository to learn Apache Airlfow by integrating it with other technologies.","archived":false,"fork":false,"pushed_at":"2023-12-23T01:20:00.000Z","size":130,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-04T23:51:15.875Z","etag":null,"topics":["apache-airflow","apache-maven","apache-spark","docker","hibernate","java","ldap","postgres","python","scala","shell","spring-boot"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HamidurRahman1.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-10-22T15:52:41.000Z","updated_at":"2023-12-21T15:51:20.000Z","dependencies_parsed_at":"2023-11-22T05:23:33.909Z","dependency_job_id":"d712c20c-e402-48f7-b8c5-1f1da5cc70dd","html_url":"https://github.com/HamidurRahman1/dockerized-data-pipeline","commit_stats":null,"previous_names":["hamidurrahman1/dockerized-data-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HamidurRahman1%2Fdockerized-data-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HamidurRahman1%2Fdockerized-data-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Hamid
urRahman1%2Fdockerized-data-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HamidurRahman1%2Fdockerized-data-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HamidurRahman1","download_url":"https://codeload.github.com/HamidurRahman1/dockerized-data-pipeline/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243204842,"owners_count":20253415,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-maven","apache-spark","docker","hibernate","java","ldap","postgres","python","scala","shell","spring-boot"],"created_at":"2024-09-24T19:22:44.590Z","updated_at":"2025-12-16T06:36:35.553Z","avatar_url":"https://github.com/HamidurRahman1.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dockerized-data-pipeline\n\n* \u003cb\u003eSteps to run:\u003c/b\u003e\n\n  1. Update `MOUNT_VOL` in the `dev.env` file if you want to store all data generated by the containers in a specific folder. \n  If left unchanged, the path is relative to the project/repo directory.\n\n  2. `cd` into the project/repo directory.\n\n  3. Build one of the custom Docker images (the image name and tag are used in `ddp-airflow-compose.yml`) -\n     1. Build from the multi-stage Dockerfile (image size: ~1.73 GB) - \n     `DOCKER_BUILDKIT=1 docker build -f ./dockerfiles/ddp-airflow-multi-stage -t ddp-airflow:v1 . --target=RUNTIME`\n     2. 
Build from the single-stage Dockerfile (image size: ~2.28 GB) - \n     `docker build -f ./dockerfiles/ddp-airflow -t ddp-airflow:v1 .`\n\n  4. Spin up the init compose file and wait until the Vault server is up and running - \n  `docker-compose -f ./ddp-init-compose.yml --env-file dev.env up`\n  \n  5. Finally, spin up the Airflow compose file - \n  `docker-compose -f ./ddp-airflow-compose.yml --env-file dev.env --env-file ./vault/vol/keys/vault-token.env up`\n\n\n* Vault UI: http://localhost:8200/\n  * The Vault token is available in: `vault/vol/keys/vault-token.env`\n\n\n* PHP LDAP Admin: http://localhost:8001/\n  * Login DN: `cn=admin,dc=ddp,dc=com`\n  * Password: `admin`\n* Airflow Webserver: http://localhost:8000/login/ \n  * Users:\n    * username: `hrahman`, password: `hrahman1` (Admin Role)\n    * username: `jdoe`, password: `jdoe1` (Viewer Role)\n  * You may trigger both `ddp.failed_banks_processor` and `ddp.nyc_parking_and_camera_violations`, then open the Flower UI to check whether the scheduler has distributed the work across both worker nodes.\n  * Invoke the test REST API DAG - \n    * `curl -X 'POST' 'http://localhost:8000/api/v1/dags/basic.called_via_rest_api/dagRuns' -u \"hrahman:hrahman1\" -H 'accept: application/json' -H 'Content-Type: application/json' -d '{ \"conf\": { \"param1\": \"value 1\", \"param2\": \"value 2\" } }'`\n* Celery Flower UI: http://localhost:9000/\n* DDP REST API: http://localhost:7000/\n\n\n* \u003cb\u003eNice to have:\u003c/b\u003e\n  * \u003cs\u003eMulti-stage Docker build.\u003c/s\u003e (implemented)\n  * \u003cs\u003eUse LDAP for the Airflow webserver.\u003c/s\u003e (implemented)\n  * \u003cs\u003eVault or similar for storing database credentials.\u003c/s\u003e (implemented)\n  * Use logging instead of sout.\n  * Use `hdfs` instead of the local file 
system.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhamidurrahman1%2Fdockerized-data-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhamidurrahman1%2Fdockerized-data-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhamidurrahman1%2Fdockerized-data-pipeline/lists"}