https://github.com/sanogotech/open-source-data-stack
modern open source data stack
https://github.com/sanogotech/open-source-data-stack
airbyte airflow data data-science dbt docker postgresql python
Last synced: 2 months ago
JSON representation
modern open source data stack
- Host: GitHub
- URL: https://github.com/sanogotech/open-source-data-stack
- Owner: sanogotech
- Created: 2022-11-10T14:00:25.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2023-02-02T16:36:53.000Z (over 3 years ago)
- Last Synced: 2025-04-03T10:44:22.622Z (about 1 year ago)
- Topics: airbyte, airflow, data, data-science, dbt, docker, postgresql, python
- Language: Shell
- Homepage:
- Size: 39.3 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Cartelis
Cartelis is an operational data & CRM consulting firm. We improve marketing ROI and streamline the customer's digital journey. We work with our clients to first design and implement martech solutions and then help them capitalize on their data to prioritize impactful marketing actions.
# Integration tool based on open-source data management frameworks
Designed to be self-sufficient in a local or remote machine ( limited to Ubuntu 18.04 as of 30/06/2021), this project brings together best-of-breed open-source frameworks in order to provide automated data migration, hosting, and transformation features in a single consolidated and fully customizable tool. Specifically, it currently puts to use:
- Airflow: to orchestrate and automate the entire data pipeline
- Airbyte: to integrate data, relying on a growing set of connectors
- DBT: to transform data hosted in a data warehouse with models based on largely on SQL syntax
- Postgresql: to host and transform data
## Requirements
An objective of this project is to avoid version conflicts due to the rapidly evolving nature of these open-source frameworks, so the tool is designed to rely on specific versions of each software package:
- Airflow:2.1.0
- Airbyte:0.26.4
- DBT:0.19.0
- PostgreSQL:13
## Setup
- Clone the repository
- Enter the repository: `cd data-integration`
- Run `./setup.sh`
Once the build has terminated:
- List the docker containers running: `sudo docker ps`
- Find the container named airflow-worker and run: `sudo docker exec -it /bin/bash`
Inside the docker container:
- Check if dbt is installed (not fully working as of 28/07/2021): `dbt --version`
- If an error is returned, run: `pip install dbt==0.19.0`
## Connections
* Airflow webserver UI: localhost:8080
* Airbyte webserver UI: localhost:8000
* Postgresql server: localhost:5400
* Postgresql server: localhost:5399
## Accessing Postgresql from the host machine
- docker exec -it psql -U airflow
- Useful commands:
* `\l` : list databases
* `\c` : enter database
* `\dt`: view database relations
* `TABLE `: view table data
* `SELECT * FROM `: view table data
## Manipulating the installation
- Stop and remove containers : `docker-compose down -v` in the docker folder
- Stop containers : `docker-compose stop` in the docker folder
- Run containers : `docker-compose run` in the docker folder
- Stop and delete containers : `docker-compose down --volumes --rmi all`
## Setting up dbt in the container (manually)
- access docker: `sudo docker exec -it /bin/bash`
- install vim: `apt-get up date` and `apt-get install vim`
- Check which version is installed: `dbt --version`
- Check which dbt python packages are installed: `pip list`, if only is present, install more packages: `pip install dbt`
In the docker folder:
- Create a local dbt project in the dbt folder: `dbt init `
- Enter into the dbt project : `cd `
- Edit dbt_project.yml file : `vim dbt_project.yml`
- Replace "my_new_project","default", and my_new_project in models by ``
- Navigate to `.dbt` folder (`/root/.dbt`) and edit file for a Postgresql connection (host is the host machine IP address, and the port number is 5400)
Ref: https://docs.getdbt.com/reference/warehouse-profiles/postgres-profile
## DAGS
Airflow's purpose in this implementation is to orchestrate Airbyte, DBT, and Postgresql task in order to provide as close to full automation as possible in this data pipeline.
### Airbyte DAGs
DAGS written to run integration tasks between external sources and the local Postgresql server will make use of existing Airbyte operators made available by downloading the `apache-airflow-providers-airbyte` python package.
Airbyte-Airflow connection: set-up in Airflow admin connections with connector as HTTP
The Airbyte task needs to contain the Airbyte connector ID contained in the connector URL.
### Airflow-Airbyte
Connect Airflow to airbyte with a new connection
```
By default, if you are using the docker-compose example provided by airflow, the user is:
user: airflow
password: airflow
```
```
If you want to create another account with docker use this (you have to be in the same folder of the docker-compose.yaml file):
docker-compose run airflow-worker airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
```
### DBT DAGs
While it may be easier to run all DBT models with a single task in a DAG (Directed Acyclic Graph) file, this method runs the risk of performing excessive computational tasks despite a potential lack of updated data and is inflexible when performing debugging operations.
A better way is to create a DAG file capable of parsing the `manifest.json` file and creating individual tasks based on identified nodes which are then run or tested while taking into account potential dependencies.
Ref: https://www.astronomer.io/blog/airflow-dbt-1.