https://github.com/debussy-labs/debussy_concert

Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.

airflow airflow-operators airflow-plugin big-data-platform bigquery data-architecture data-engineering data-pipeline dataform dataproc dbt gcp google-cloud mssql mysql postgresql spark sql workflow

[![GitHub issues](https://img.shields.io/github/issues/DotzInc/debussy_concert)](https://github.com/DotzInc/debussy_concert/issues)
[![GitHub forks](https://img.shields.io/github/forks/DotzInc/debussy_concert)](https://github.com/DotzInc/debussy_concert/network)
[![GitHub stars](https://img.shields.io/github/stars/DotzInc/debussy_concert)](https://github.com/DotzInc/debussy_concert/stargazers)
[![GitHub license](https://img.shields.io/github/license/DotzInc/debussy_concert)](https://github.com/DotzInc/debussy_concert/blob/master/LICENSE)

# Debussy Concert

[Debussy](https://github.com/DotzInc/debussy_concert/wiki) is a free, open-source, opinionated Data Architecture and Engineering framework. It enables data analysts and engineers to build better data platforms through first-class data pipelines, following a low-code and self-service approach.


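[Description](#description) · [Key Features](#key-features) · [Key Benefits](#key-benefits) · [Quick Start](#quick-start) · [Integrations](#integrations)

[Full Documentation](#full-documentation) · [Communication](#communication) · [Contributions](#contributions) · [License](#license)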

---

## Description

In the data engineering field, teams keep reinventing the wheel: it's still rare to see the adoption of software engineering best practices, such as [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), [KISS](https://en.wikipedia.org/wiki/KISS_principle) or [YAGNI](https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it). Despite the existence of several tools for data orchestration (e.g. [Apache Airflow](https://airflow.apache.org/), [Prefect](https://www.prefect.io/), [Dagster](https://dagster.io/)) and distributed data processing (e.g. [Apache Spark](https://spark.apache.org/), [Apache Beam](https://beam.apache.org/)), each new data pipeline demand usually turns into a lengthy development project. Think of developing a web application without the help of a web framework such as [Django](https://www.djangoproject.com/) or [Flask](https://palletsprojects.com/p/flask/)!

What's even worse, although they share key concepts, these data orchestration tools have very distinct syntaxes and features, making migrations a daunting task! Moreover, simply adopting these tools does not guarantee that best practices are being followed, including those for data architecture (data modeling and the data management lifecycle, among others).

While lots of companies have faced these same issues, most of them have decided to develop their own in-house solutions, missing the opportunity for collaboration and wider adoption of data architecture and software engineering best practices.

With that in mind, we created Debussy! Debussy Concert is the core component of Debussy. It's a code generation engine for orchestration tools, currently supporting only Airflow, but with others on the roadmap. It provides abstraction layers in the form of a musically themed semantic model, decoupling the pipeline logic from the underlying orchestration tool and enabling a low-code approach to data engineering. We also provide pipeline templates (e.g. data ingestion, data transformation and reverse ETL) built with our engine, while always striving to follow the aforementioned best practices.

## Key Features
- Dynamic data pipeline generation from YAML configuration files or directly through Python
- Provides a semantic model for data pipeline development, abstracting the inner orchestration engine
- Enables seamless integration of first-class data projects, such as Airflow, Spark, and dbt
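
To make the idea of configuration-driven pipeline generation more concrete, here is a minimal sketch in plain Python. It is not Debussy's API: the names `Step`, `Pipeline`, `build_pipeline` and `render_plan`, as well as the layout of the `config` dictionary, are hypothetical and exist only for illustration. The dictionary stands in for a parsed YAML file, and `render_plan` stands in for a backend that would emit code for a specific orchestrator.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical, tool-agnostic model (illustration only, not Debussy's classes).
@dataclass
class Step:
    name: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class Pipeline:
    name: str
    steps: list[Step]


def build_pipeline(config: dict) -> Pipeline:
    """Turn a parsed configuration (e.g. loaded from YAML) into the model."""
    steps = [Step(s["name"], list(s.get("depends_on", []))) for s in config["steps"]]
    return Pipeline(config["name"], steps)


def execution_order(pipeline: Pipeline) -> list[str]:
    """Order the steps so that each one comes after its dependencies."""
    graph = {step.name: set(step.depends_on) for step in pipeline.steps}
    return list(TopologicalSorter(graph).static_order())


def render_plan(pipeline: Pipeline) -> str:
    """Stand-in for a backend; a real one would generate orchestrator code."""
    return " >> ".join(execution_order(pipeline))


# Assumed configuration layout, standing in for a YAML file.
config = {
    "name": "sales_ingestion",
    "steps": [
        {"name": "extract"},
        {"name": "load", "depends_on": ["extract"]},
        {"name": "transform", "depends_on": ["load"]},
    ],
}

pipeline = build_pipeline(config)
print(render_plan(pipeline))  # extract >> load >> transform
```

Because the pipeline logic lives in the model rather than in orchestrator-specific code, supporting another tool would, in this sketch, only mean writing another rendering function.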

## Key Benefits

✔ Reduce delivery time and costs of data pipeline development, enabling higher ROI

✔ Avoid pipeline debt by following sound software engineering design principles

✔ Ensure your platform is following data architecture best practices

## Quick Start

Debussy works on any installation of Apache Airflow 2.0, but since we currently support only GCP-based data platforms as the target Data Lakehouse, we recommend a deployment to [Cloud Composer](https://cloud.google.com/composer).

To use Debussy, first go through the following steps:

1. [Select or create a Google Cloud Platform project](https://console.cloud.google.com/cloud-resource-manager).
2. [Enable billing for your project](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project).
3. [Create a Cloud Composer 2 environment](https://cloud.google.com/composer/docs/composer-2/create-environments).
4. Install Debussy on your Cloud Composer instance: just upload the project to your `plugins/` folder.
5. Check our [User's Guide](https://github.com/DotzInc/debussy_concert/wiki/User's-Guide) and [examples](https://github.com/DotzInc/debussy_concert/tree/master/examples) to learn how to use it!
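
As a rough sketch of step 4, the upload to the `plugins/` folder can be scripted around the `gcloud composer environments storage plugins import` command. The environment name, location and source path below are placeholders, and the helper functions are hypothetical. Which directory of the project needs to be uploaded is an assumption here, so check the User's Guide before running it.

```python
import subprocess


def plugins_import_command(environment: str, location: str, source: str) -> list[str]:
    """Build the gcloud command that copies `source` into the environment's plugins/ folder."""
    return [
        "gcloud", "composer", "environments", "storage", "plugins", "import",
        f"--environment={environment}",
        f"--location={location}",
        f"--source={source}",
    ]


def upload_plugins(environment: str, location: str, source: str) -> None:
    """Run the upload; requires an installed and authenticated gcloud CLI."""
    subprocess.run(plugins_import_command(environment, location, source), check=True)


# Placeholder values (assumptions): replace them with your own environment details.
command = plugins_import_command("my-composer-env", "us-central1", "./debussy_concert")
print(" ".join(command))
```

The script only prints the command. Call `upload_plugins` with your own values to perform the upload.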

## Integrations
Debussy works with the tools and systems that you're already using with your data, including:

| Integration | Notes |
| --- | --- |
| Apache Airflow | An open source orchestration engine |
| Spark | Open source distributed processing engine, used for the data ingestion pipelines |
| dbt | dbt is an open-source data transformation tool, used for the data transformation pipelines |
| Google Cloud Storage | Cloud based blob storage, supported as data source or destination |
| BigQuery | Google serverless massive-scale SQL analytics platform, supported as the analytical environment (aka. Data Lakehouse) |
| MySQL | Leading open source database, supported as a data source or destination |
| PostgreSQL | Leading open source database, supported as a data source or destination |
| Other SQL Relational DBs | Most RDBMS are supported as data sources via JDBC drivers through Spark |
| AWS S3 | Cloud based blob storage, supported as data source or destination |

## Full Documentation
See the [Wiki](https://github.com/DotzInc/debussy_concert/wiki) for full documentation, examples, operational details and other information.

## Communication
[GitHub Issues](https://github.com/DotzInc/debussy_concert/issues)

[Discord Server](https://discord.gg/FpNX79pY)

## Contributions
We welcome all community contributions!

In order to have a more open and welcoming community, Debussy adheres to a [code of conduct](https://github.com/DotzInc/debussy_concert/wiki/Code-of-Conduct) adapted from Contributor Covenant.

Please read through our [contributing guidelines](https://github.com/DotzInc/debussy_concert/wiki/Contributing-Guide). Included are directions for opening issues, coding standards, and notes on development.

## License
Copyright 2022 Dotz, Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.