Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Projects in Awesome Lists tagged with data-lake
A curated list of projects in awesome lists tagged with data-lake.
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 28 Sep 2024
https://github.com/apache/kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hacktoberfest hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 28 Sep 2024
https://github.com/dlt-hub/dlt
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
data data-engineering data-lake data-loading data-warehouse elt extract load python transform
Last synced: 31 Jul 2024
https://github.com/bytedance/bitsail
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
big-data data-integration data-lake data-pipeline data-synchronization flink high-performance real-time
Last synced: 30 Sep 2024
https://github.com/san089/udacity-data-engineering-projects
A few projects related to Data Engineering, including Data Modeling, infrastructure setup on the cloud, Data Warehousing, and Data Lake development.
airflow airflow-operators aws aws-ec2 aws-s3 aws-sdk cassandra cassandra-database cloudformation cluster data data-engineering data-engineering-pipeline data-lake data-modeling data-warehouse etl-pipeline infrastructure postgres postgresql-database
Last synced: 29 Sep 2024
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 28 Sep 2024
https://github.com/teradata/kylo
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
data-lake hadoop kylo nifi spark teradata
Last synced: 28 Sep 2024
https://github.com/alanchn31/data-engineering-projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 29 Sep 2024
https://github.com/canner/vulcan-sql
Data API Framework for AI Agents and Data Apps
ai ai-agent analytics api-builder bigquery clickhouse data-lake data-warehouse database duckdb ksqldb postgresql reporting restful-api snowflake spreadsheet sql typescript vulcan-sql vulcansql
Last synced: 26 Sep 2024
https://github.com/uber/marmaray
Generic Data Ingestion & Dispersal Library for Hadoop
avro-schema data-lake hadoop ingest-data schema-format spark
Last synced: 31 Jul 2024
https://github.com/awslabs/aws-serverless-data-lake-framework
Enterprise-grade, production-hardened, serverless data lake on AWS
analytics aws best-practices data-engineering data-lake etl framework iac lake-formation serverless
Last synced: 02 Aug 2024
https://github.com/cuebook/cuelake
Use SQL to build ELT pipelines on a data lakehouse.
apache-iceberg apache-spark data-engineering data-ingestion data-integration data-lake data-pipeline data-transfer datalake delta elt etl incremental-updates lakehouse pipelines spark-sql sql upsert zeppelin-notebook
Last synced: 28 Sep 2024
https://github.com/awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
amazon-s3 aws big-data ccpa data data-erasure data-lake gdpr parquet privacy right-to-be-forgotten s3
Last synced: 01 Aug 2024
https://github.com/maxi-k/btrblocks
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
compression data-lake databases research
Last synced: 01 Oct 2024
https://github.com/azure/azuredatalake
Samples and Docs for Azure Data Lake Store and Analytics
Last synced: 30 Sep 2024
https://github.com/learningjournal/spark-streaming-in-python
Apache Spark 3 - Structured Streaming Course Material
apache-spark big-data bigdata data-lake pyspark python spark-sql spark-streaming
Last synced: 28 Sep 2024
https://github.com/smart-data-lake/smart-data-lake
Smart Automation Tool for building modern Data Lakes and Data Pipelines
data-lake data-pipelines deltalake hadoop hive scala smart-data-lake spark transform-data
Last synced: 28 Sep 2024
https://github.com/learningjournal/sparkprogramminginscala
Apache Spark Course Material
apache-spark big-data bigdata data-lake datalake scala spark spark-scala spark-sql
Last synced: 28 Sep 2024
https://github.com/aws-samples/aws-dbs-refarch-datalake
Reference Architectures for Datalakes on AWS
amazon-emr data-analytics data-catalog data-lake data-transformation emr-cluster glue hive-metastore ingest-data
Last synced: 02 Aug 2024
https://github.com/camunda-community-hub/zeeqs
GraphQL API for Zeebe data
data-lake graphql zeebe zeebe-tool
Last synced: 30 Jul 2024
https://github.com/OElesin/querypal
Web UI for Amazon Athena
analytics aws aws-athena data data-lake sql
Last synced: 13 Aug 2024
https://github.com/kenthsu/udacity-data-engineering-nanodgree
Udacity Data Engineering Nanodegree Program
apache-airflow apache-cassandra apache-spark aws-redshift aws-s3 data-engineering data-lake data-pipelines data-quality data-warehouses postgresql
Last synced: 29 Sep 2024
https://github.com/GitDataAI/jiaozifs
A Git-like version control file system for data lineage & data collaboration.
aiops data-collaboration data-lake data-lake-management data-lineage data-mesh data-product data-version-control data-versioning datalake dataops digital-twins enterprise-datahub federated-learning git-filesystem git-for-data jiaozifs mlops version-controlled-filesystem
Last synced: 01 Aug 2024
https://github.com/nodestream-proj/nodestream
A declarative framework for building, maintaining, and analyzing graph data
api athena aws cli data-engineering data-lake data-science declarative etl framework graph graphql kafka knowledge-graph neo4j python s3 security visualization yaml
Last synced: 26 Sep 2024
https://github.com/suecodelabs/cnfuzz
Breaking Cloud Native Web APIs in their natural habitat.
aws aws-s3 cicd cloud-native data-lake fuzzing golang kubernetes microsoft openapi openapi-spec opensource rest-api rest-api-test restler security-tools service-mesh
Last synced: 30 Sep 2024
https://github.com/Canner/vulcan-sql-examples
Curated VulcanSQL showcases
analytics api-builder bigquery data data-lake data-warehouse database duckdb examples postgresql reporting restful-api sql vulcan-sql vulcansql
Last synced: 01 Aug 2024
https://github.com/AuFeld/Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousing, containerization, and a dashboard to monitor data pipeline KPIs
airflow aws cassandra data-engineering data-lake data-warehouse docker emr etl-pipeline infrastructure-as-code infrastructure-setup postgresql python redshift s3 spark
Last synced: 13 Aug 2024
https://github.com/apache/kyuubi-docker
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
data-lake hadoop hive jdbc kubernetes spark spark-sql sql thrift
Last synced: 30 Sep 2024
https://github.com/DataDrivenGit/Music-Streaming-App-using-AWS-ETL
Implements a data warehouse and a data lake on AWS, with data modeling in Postgres and Apache Cassandra and a data pipeline built with Apache Airflow
airflow-operators cassandra data-lake data-pipelines datawarehouse postgres python3 sql
Last synced: 08 Aug 2024
https://github.com/manuelandersen/football-pipeline
DE Zoomcamp 2024 Final Project 🧙
bigquery data-engineering data-lake data-warehouse dbt dbt-cloud etl-pipeline google-cloud looker-studio mageai python
Last synced: 29 Sep 2024
https://github.com/hsiehshujeng/dynamodb-streaming-datalake
A demo of DynamoDB CDC into data lake with AWS CDK v2
aws-cdk data-lake ddb dynamodb dynamodb-stream kinesis-firehose kinesis-stream typescript
Last synced: 02 Oct 2024