Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pirate-emperor/bigdata-pipeline
BigData Pipeline is a local testing environment for experimenting with various storage solutions (RDB, HDFS), query engines (Trino), schedulers (Airflow), and ETL/ELT tools (DBT). It supports MySQL, Hadoop, Hive, Kudu, and more.
- Host: GitHub
- URL: https://github.com/pirate-emperor/bigdata-pipeline
- Owner: Pirate-Emperor
- License: other
- Created: 2024-11-02T16:15:45.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-11-02T16:24:51.000Z (3 months ago)
- Last Synced: 2024-12-18T04:53:43.695Z (about 2 months ago)
- Topics: airflow, airflow-dags, airflow-docker, big-data, data-lake, data-lakestore, data-warehouse, dbt, dbt-core, distributed-computing, docker, docker-compose, hadoop, hive, hiveql, kudu, mysql, mysql-server, trino, trino-cli
- Language: Dockerfile
- Homepage:
- Size: 7.95 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# BigData Pipeline
![BigData Pipeline](docs/bigdata-pipeline.png)

## Project Overview
BigData Pipeline is a local testing environment designed for experimenting with various storage solutions, query engines, schedulers, and ETL/ELT tools. The project includes:
- **Storage Solutions**: RDB, HDFS, Columnar Storage
- **Query Engines**: Trino
- **Schedulers**: Airflow
- **ETL/ELT Tools**: DBT

## Pipeline Components
| Pipeline Component | Version | Description | Port |
|--------------------|---------|----------------------------------|------------------------------|
| MySQL | 8.36+ | Relational Database | 3306 |
| Hadoop | 3.3.6+ | Distributed Storage | namenode: 9870, datanode: 9864 |
| Trino | 438+ | Distributed Query Engine | 8080 |
| Hive | 3.1.3 | DFS Query Solution | hiveserver2(thrift): 10002 |
| Kudu | 2.3+ | Columnar Distributed Database | master: 7051, tserver: 7050 |
| Airflow | 2.7+ | Scheduler | 8888 |
| DBT                | 1.7.1   | Analytics Framework              | -                            |
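Once the stack is up, a quick sanity check is to probe the web UIs listed above. This is a minimal sketch, assuming the default localhost port mappings from the table (Kudu is omitted because 7051/7050 are RPC ports, not HTTP):

```bash
# Probe the exposed web UIs listed in the components table (assumed localhost mappings).
for endpoint in \
  "Hadoop NameNode=http://localhost:9870" \
  "Trino=http://localhost:8080/ui/" \
  "Airflow=http://localhost:8888/health"; do
  name="${endpoint%%=*}"
  url="${endpoint#*=}"
  code=$(curl -s -L -o /dev/null -w "%{http_code}" "$url")
  echo "$name ($url) -> HTTP $code"
done
```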
## Connection Info

| Pipeline Component | User    | Password | Database   |
|--------------------|---------|----------|------------|
| MySQL | root | root | default |
| MySQL | airflow | airflow | airflow_db |
| Trino              | any (all users allowed) | -        | - (port 8080) |
| Hive | hive | hive | default |
| Airflow            | airflow | airflow  | -          |

You can create databases, schemas, and tables with these accounts.
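For example, the MySQL accounts above can be exercised directly from the host with a standard `mysql` client. This is a sketch only, assuming port 3306 is published on localhost as in the components table; running the same statements via `docker exec` inside the MySQL container works equally well:

```bash
# Connect as root (root/root) and create a throwaway database and table.
mysql -h 127.0.0.1 -P 3306 -u root -proot <<'SQL'
CREATE DATABASE IF NOT EXISTS demo_db;
CREATE TABLE IF NOT EXISTS demo_db.events (id INT PRIMARY KEY, note VARCHAR(64));
INSERT INTO demo_db.events VALUES (1, 'hello pipeline');
SELECT * FROM demo_db.events;
SQL

# The airflow/airflow account should likewise reach its airflow_db database.
mysql -h 127.0.0.1 -P 3306 -u airflow -pairflow -e "SHOW DATABASES;"
```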
## Execution
The Apache open-source components are installed manually on an Ubuntu image, downloading from Apache mirror servers (CDN) to improve overall installation speed. Installation time still varies with the user's network environment, so a stable network connection is recommended.
- **MySQL**: The docker-compose file sets `platform: linux/amd64` for Apple Silicon Macs. If running on Windows, comment out this line.
- **Trino**: For Trino's Web UI/JDBC connections (e.g., DBeaver), any string can be used as the user; there is no password. Ensure that the user in `dbt-trino`'s `profiles.yml` matches (see the sketch after this list).
- **DBT**: DBT operates within Airflow using `airflow-dbt`. For local use, create a virtual environment. (Future plans include improving the build setup with poetry.)
- **Kudu & Hadoop**: For local environments with limited resources, the replica count for `kudu-tserver` and `hadoop-datanode` has been set to 1. Kudu is a storage-only DB, requiring a separate engine (e.g., Impala, Trino) for executing queries.
- **Hue**: If Hue is needed, uncomment the section in `docker-compose.yml` to use it.
- **Airflow**: Airflow is configured with the Celery Executor. `airflow-trigger` is restricted due to resource constraints.
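Tying the Trino and DBT notes together, the snippet below writes a minimal `dbt-trino` profile. It is a sketch only: the profile name `bigdata_pipeline`, the `hive` catalog, and the `default` schema are assumptions, not values taken from this repo; the key point is that `user` can be any string and no password is needed, matching the Trino note above.

```bash
# Write a hypothetical dbt-trino profile for local use (adjust names to your project).
mkdir -p ~/.dbt
cat > ~/.dbt/profiles.yml <<'YAML'
bigdata_pipeline:          # hypothetical profile name; must match profile: in dbt_project.yml
  target: dev
  outputs:
    dev:
      type: trino
      method: none         # no authentication, as configured for this Trino
      user: dbt_user       # any string works; keep it consistent with your JDBC user
      host: localhost
      port: 8080
      database: hive       # Trino catalog to target (assumed)
      schema: default      # schema within the catalog (assumed)
      threads: 1
YAML
```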
## Getting Started

To get started with BigData Pipeline, follow these steps:
### 1. Start the Containers
- **1-1.** If you want to specify the required profile and bring up containers using the CLI:
```bash
COMPOSE_PROFILES=trino,kudu,hive,dbt,airflow docker-compose -f docker-compose.yml up --build -d --remove-orphans
```

- **1-2.** If you want to bring up all containers at once:
```bash
make up
```

### 2. Manage Containers
- **2-1.** If you want to stop running containers:
```bash
make down
```

- **2-2.** If you want to remove running containers while deleting Docker images, volumes, and network resources:
```bash
make delete.all
```

## Checking if It's Running Properly
- **Hive Metastore Initialization**: Check for an initialized file in the `./mnt/schematool-check` folder.
- **Container Start Success**: Confirm in the Docker Compose output that all containers started successfully.
- **Web UI Access**: If you can’t access the web UI for a specific platform after container startup, you may need to rebuild the containers.
- **Trino JDBC Connection**: If you see three catalogs (hive, kudu, mysql) after connecting via JDBC (`jdbc:trino://localhost:8080`) in DBeaver, it is working correctly. A CLI-based check is sketched below.
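If you prefer checking from the command line instead of DBeaver, the sketch below verifies both the metastore marker file and the Trino catalogs. The container name `trino` and the Trino CLI being available inside it are assumptions; adjust them to match your `docker-compose.yml`.

```bash
# Hive Metastore initialization marker (path from the bullet above).
ls -l ./mnt/schematool-check

# List catalogs through the Trino CLI inside the (assumed) "trino" container.
docker exec -it trino trino --execute "SHOW CATALOGS"
# Expected output includes hive, kudu, and mysql (plus Trino's built-in system catalog).
```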
## Trino Test Code
Test SQL scripts are located in the `init-sql/trino` directory; a sketch for running them follows the list below.
- **test_code_1.sql**: Tests schema and table creation, data insertion, and selection in the Hive catalog.
- **test_code_2.sql**: Tests UNION queries across tables in heterogeneous catalogs (Hive, Kudu).
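A hedged way to run these scripts without DBeaver is to pipe them through the Trino CLI; the `trino` container name and the host-side path to `init-sql/trino` are assumptions to adjust for your setup.

```bash
# Feed the test scripts to the Trino CLI inside the (assumed) "trino" container.
docker exec -i trino trino < init-sql/trino/test_code_1.sql
docker exec -i trino trino < init-sql/trino/test_code_2.sql
```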
## Next Challenge
- Enhance static analysis tools and build systems for clean code (black, ruff, isort, mypy, poetry).
- Improve CI automation for static analysis (pre-commit); a minimal local setup is sketched after this list.
- Simulate ETL/ELT with DBT-Airflow integration.
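If and when those tools are adopted, a minimal local setup might look like the sketch below; the exact configuration (`pyproject.toml`, `.pre-commit-config.yaml`) is not part of this repository yet, so everything here is an assumption.

```bash
# Install the static-analysis toolchain mentioned above into a local virtualenv.
python -m venv .venv && source .venv/bin/activate
pip install black ruff isort mypy pre-commit

# Run the checks manually...
black --check . && isort --check-only . && ruff check . && mypy .

# ...or wire them into git via pre-commit once a .pre-commit-config.yaml exists.
pre-commit install
```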
## Contributing

Feel free to fork the repository, make changes, and submit pull requests. Contributions are welcome!
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Author
**Pirate-Emperor**
[![Twitter](https://skillicons.dev/icons?i=twitter)](https://twitter.com/PirateKingRahul)
[![Discord](https://skillicons.dev/icons?i=discord)](https://discord.com/users/1200728704981143634)
[![LinkedIn](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/in/piratekingrahul)
[![Reddit](https://img.shields.io/badge/Reddit-FF5700?style=for-the-badge&logo=reddit&logoColor=white)](https://www.reddit.com/u/PirateKingRahul)
[![Medium](https://img.shields.io/badge/Medium-42404E?style=for-the-badge&logo=medium&logoColor=white)](https://medium.com/@piratekingrahul)

- GitHub: [Pirate-Emperor](https://github.com/Pirate-Emperor)
- Reddit: [PirateKingRahul](https://www.reddit.com/u/PirateKingRahul/)
- Twitter: [PirateKingRahul](https://twitter.com/PirateKingRahul)
- Discord: [PirateKingRahul](https://discord.com/users/1200728704981143634)
- LinkedIn: [PirateKingRahul](https://www.linkedin.com/in/piratekingrahul)
- Skype: [Join Skype](https://join.skype.com/invite/yfjOJG3wv9Ki)
- Medium: [PirateKingRahul](https://medium.com/@piratekingrahul)

Thank you for visiting the BigData Pipeline project!
---
For more details, please refer to the [GitHub repository](https://github.com/Pirate-Emperor/BigData-Pipeline).