{"id":22342943,"url":"https://github.com/pirate-emperor/bigdata-pipeline","last_synced_at":"2026-02-08T13:07:59.238Z","repository":{"id":268649443,"uuid":"882361120","full_name":"Pirate-Emperor/BigData-Pipeline","owner":"Pirate-Emperor","description":"BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.","archived":false,"fork":false,"pushed_at":"2024-11-02T16:24:51.000Z","size":8338,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-06T15:03:16.876Z","etag":null,"topics":["airflow","airflow-dags","airflow-docker","big-data","data-lake","data-lakestore","data-warehouse","dbt","dbt-core","distributed-computing","docker","docker-compose","hadoop","hive","hiveql","kudu","mysql","mysql-server","trino","trino-cli"],"latest_commit_sha":null,"homepage":"","language":"Dockerfile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pirate-Emperor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-02T16:15:45.000Z","updated_at":"2025-02-24T17:15:10.000Z","dependencies_parsed_at":"2024-12-18T05:04:18.886Z","dependency_job_id":null,"html_url":"https://github.com/Pirate-Emperor/BigData-Pipeline","commit_stats":null,"previous_names":["pirate-emperor/bigdata-pipeline"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Pirate-Emperor/BigData-Pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FBigData-Pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FBigData-Pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FBigData-Pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FBigData-Pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pirate-Emperor","download_url":"https://codeload.github.com/Pirate-Emperor/BigData-Pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pirate-Emperor%2FBigData-Pipeline/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":268250552,"owners_count":24219862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-01T02:00:08.611Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","airflow-dags","airflow-docker","big-data","data-lake","data-lakestore","data-warehouse","dbt","dbt-core","distributed-computing","docker","docker-compose","hadoop","hive","hiveql","kudu","mysql","mysql-server","trino","trino-cli"],"created_at":"2024-12-04T08:14:12.272Z","updated_at":"2026-02-08T13:07:59.114Z","avatar_url":"https://github.com/Pirate-Emperor.png","language":"Dockerfile","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003clink rel=\"stylesheet\" type=\"text/css\" href=\"style.css\"\u003e\n\n\u003c!-- \nAuthor: Pirate-Emperor\nDate: [Insert Date]\nDescription: README file for BigData Pipeline project.\n--\u003e\n\n# BigData Pipeline\n![BigData Pipeline](docs/bigdata-pipeline.png)\n\n## Project Overview\n\nBigData Pipeline is a local testing environment designed for experimenting with various storage solutions, query engines, schedulers, and ETL/ELT tools. The project includes:\n\n- **Storage Solutions**: RDB, HDFS, Columnar Storage\n- **Query Engines**: Trino\n- **Schedulers**: Airflow\n- **ETL/ELT Tools**: DBT\n\n## Pipeline Components\n\n| Pipeline Component | Version | Description                      | Port                         |\n|--------------------|---------|----------------------------------|------------------------------|\n| MySQL              | 8.36+   | Relational Database               | 3306                         |\n| Hadoop             | 3.3.6+  | Distributed Storage               | namenode: 9870, datanode: 9864 |\n| Trino              | 438+    | Distributed Query Engine          | 8080                         |\n| Hive               | 3.1.3   | DFS Query Solution                | hiveserver2(thrift): 10002   |\n| Kudu               | 2.3+    | Columnar Distributed Database     | master: 7051, tserver: 7050 |\n| Airflow            | 2.7+    | Scheduler                         | 8888                         |\n| DBT                | 1.7.1   | Analytics Framework               | -                            |\n\n## Connection Info\n\n| Pipeline Component | User    | Password | Database   |\n|--------------------|---------|----------|------------|\n| MySQL              | root    | root     | default    |\n| MySQL              | airflow | airflow | airflow_db |\n| Trino              | Allowing all | - | 8080      |\n| Hive               | hive    | hive     | default    |\n| Airflow            | airflow | airflow | -          |\n\nYou can create databases, schemas, and tables with these accounts.\n\n## Execution\n\nApache open-source software is manually installed on an Ubuntu image, downloading from Apache mirror servers (CDN) to improve overall installation speed. The installation speed may vary based on the user's network environment, so a stable network is recommended.\n\n- **MySQL**: For MySQL, the docker-compose file is set for Mac Silicon (platform: linux/amd64). If running on Windows, comment out this line.\n- **Trino**: For Trino's Web UI/JDBC connections (e.g., DBeaver), any string can be used as the User. There is no password. Ensure that the user in `dbt-trino`'s `profiles.yml` matches this.\n- **DBT**: DBT operates within Airflow using `airflow-dbt`. For local use, create a virtual environment. (Future plans include building improvements with poetry.)\n- **Kudu \u0026 Hadoop**: For local environments with limited resources, the replica count for `kudu-tserver` and `hadoop-datanode` has been set to 1. Kudu is a storage-only DB, requiring a separate engine (e.g., Impala, Trino) for executing queries.\n- **Hue**: If Hue is needed, uncomment the section in `docker-compose.yml` to use it.\n- **Airflow**: Airflow is configured with the Celery Executor. `airflow-trigger` is restricted due to resource constraints.\n\n## Getting Started\n\nTo get started with BigData Pipeline, follow these steps:\n\n### 1. Start the Containers\n\n- **1-1.** If you want to specify the required profile and bring up containers using the CLI:\n\n  ```bash\n  COMPOSE_PROFILES=trino,kudu,hive,dbt,airflow docker-compose -f docker-compose.yml up --build -d --remove-orphans\n  ```\n\n- **1-2.** If you want to bring up all containers at once:\n\n  ```bash\n  make up\n  ```\n\n### 2. Manage Containers\n\n- **2-1.** If you want to stop running containers:\n\n  ```bash\n  make down\n  ```\n\n- **2-2.** If you want to remove running containers while deleting Docker images, volumes, and network resources:\n\n  ```bash\n  make delete.all\n  ```\n\n## Checking if It's Running Properly\n\n- **Hive Metastore Initialization**: Check for an initialized file in the `./mnt/schematool-check` folder.\n- **Container Start Success**: Look for the following image when running Docker Compose:\n\n  \u003cimg src=\"./docs/docker-run-success.png\" style=\"width:300px;height:auto;\"\u003e\n\n- **Web UI Access**: If you can’t access the web UI for a specific platform after container startup, you may need to rebuild the containers.\n\n  \u003cimg src=\"./docs/hadoop-namenode-web-ui.png\" style=\"width:1000px;height:auto;\" alt=\"hadoop namenode\"\u003e \n  \u003chr\u003e \n  \u003cimg src=\"./docs/hive-server-2-web-ui.png\" style=\"width:1000px;height:auto;\" alt=\"hive-server-2\"\u003e \n  \u003chr\u003e \n  \u003cimg src=\"./docs/kudu-master-web-ui.png\" style=\"width:1000px;height:auto;\" alt=\"kudu-master\"\u003e \n  \u003chr\u003e \n  \u003cimg src=\"./docs/trino-web-ui.png\" style=\"width:1000px;height:auto;\" alt=\"trino\"\u003e \n  \u003chr\u003e \n  \u003cimg src=\"./docs/airflow-web-ui.png\" style=\"width:1000px;height:auto;\" alt=\"airflow\"\u003e \n  \u003chr\u003e\n\n- **Trino JDBC Connection**: If you see three catalogs (hive, kudu, mysql) after JDBC connection (`jdbc:trino://localhost:8080`) in DBeaver, it is working correctly.\n\n  \u003cimg src=\"./docs/dbeaver-trino.png\" alt=\"image\" style=\"width:300px;height:auto;\"\u003e\n\n## Trino Test Code\n\nTest codes are located in the `init-sql/trino` directory.\n\n- **test_code_1.sql**: Tests schema and table creation, data insertion, and selection in the Hive catalog.\n\n  \u003cimg src=\"./docs/trino-query-test1.png\" alt=\"image\" style=\"width:1000;height:auto;\"\u003e\n\n- **test_code_2.sql**: Tests Union queries between heterogeneous DB tables (Hive, Kudu).\n\n  \u003cimg src=\"./docs/trino-query-test2.png\" alt=\"image\" style=\"width:1000;height:auto;\"\u003e\n\n## Next Challenge\n\n- Enhance static analysis tools and build systems for clean code (black, ruff, isort, mypy, poetry).\n- Improve CI automation for static analysis (pre-commit).\n- Simulate ETL/ELT with DBT-Airflow integration.\n\n## Contributing\n\nFeel free to fork the repository, make changes, and submit pull requests. Contributions are welcome!\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n\n## Author\n\n**Pirate-Emperor**\n\n[![Twitter](https://skillicons.dev/icons?i=twitter)](https://twitter.com/PirateKingRahul)\n[![Discord](https://skillicons.dev/icons?i=discord)](https://discord.com/users/1200728704981143634)\n[![LinkedIn](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/in/piratekingrahul)\n\n[![Reddit](https://img.shields.io/badge/Reddit-FF5700?style=for-the-badge\u0026logo=reddit\u0026logoColor=white)](https://www.reddit.com/u/PirateKingRahul)\n[![Medium](https://img.shields.io/badge/Medium-42404E?style=for-the-badge\u0026logo=medium\u0026logoColor=white)](https://medium.com/@piratekingrahul)\n\n- GitHub: [Pirate-Emperor](https://github.com/Pirate-Emperor)\n- Reddit: [PirateKingRahul](https://www.reddit.com/u/PirateKingRahul/)\n- Twitter: [PirateKingRahul](https://twitter.com/PirateKingRahul)\n- Discord: [PirateKingRahul](https://discord.com/users/1200728704981143634)\n- LinkedIn: [PirateKingRahul](https://www.linkedin.com/in/piratekingrahul)\n- Skype: [Join Skype](https://join.skype.com/invite/yfjOJG3wv9Ki)\n- Medium: [PirateKingRahul](https://medium.com/@piratekingrahul)\n\nThank you for visiting the BigData Pipeline project!\n\n---\n\nFor more details, please refer to the [GitHub repository](https://github.com/Pirate-Emperor/BigData-Pipeline).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate-emperor%2Fbigdata-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpirate-emperor%2Fbigdata-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpirate-emperor%2Fbigdata-pipeline/lists"}