Projects in Awesome Lists tagged with data-engineering
A curated list of projects in awesome lists tagged with data-engineering.
https://github.com/apache/superset
Apache Superset is a Data Visualization and Data Exploration Platform
analytics apache apache-superset asf bi business-analytics business-intelligence data-analysis data-analytics data-engineering data-science data-visualization data-viz flask python react sql-editor superset
Last synced: 12 May 2025
https://github.com/airbnb/caravel
Apache Superset is a Data Visualization and Data Exploration Platform
analytics apache apache-superset asf bi business-analytics business-intelligence data-analysis data-analytics data-engineering data-science data-visualization data-viz flask python react sql-editor superset
Last synced: 23 Nov 2024
https://github.com/apache/incubator-superset
Apache Superset is a Data Visualization and Data Exploration Platform
analytics apache apache-superset asf bi business-analytics business-intelligence data-analysis data-analytics data-engineering data-science data-visualization data-viz flask python react sql-editor superset
Last synced: 09 Dec 2024
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 12 May 2025
https://github.com/gokumohandas/made-with-ml
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 12 May 2025
https://github.com/apache/incubator-airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
airflow apache apache-airflow automation dag data-engineering data-integration data-orchestrator data-pipelines data-science elt etl machine-learning mlops orchestration python scheduler workflow workflow-engine workflow-orchestration
Last synced: 23 Nov 2024
https://github.com/GokuMohandas/Made-With-ML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 Mar 2025
https://github.com/practicalAI/practicalAI
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 Feb 2025
https://github.com/GokuMohandas/MadeWithML
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml distributed-training llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 03 Mar 2025
https://github.com/datatalksclub/data-engineering-zoomcamp
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
data-engineering dbt docker kafka kestra spark
Last synced: 13 May 2025
https://github.com/DataTalksClub/data-engineering-zoomcamp
Data Engineering Zoomcamp is a free nine-week course that covers the fundamentals of data engineering.
data-engineering dbt docker kafka kestra spark
Last synced: 14 Mar 2025
https://github.com/eugeneyan/applied-ml
Papers & tech blogs by companies sharing their work on data science & machine learning in production.
applied-data-science applied-machine-learning computer-vision data-discovery data-engineering data-quality data-science deep-learning machine-learning natural-language-processing production recsys reinforcement-learning search
Last synced: 17 Mar 2025
https://github.com/prefecthq/prefect
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine
Last synced: 12 May 2025
https://github.com/airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake
Last synced: 12 May 2025
https://github.com/avaiga/taipy
Turns Data and AI algorithms into production-ready web applications in no time.
automation data-engineering data-integration data-ops data-visualization datascience developer-tools hacktoberfest hacktoberfest2023 job-scheduler mlops orchestration pipeline pipelines python scenario scenario-analysis taipy-core taipy-gui workflow
Last synced: 12 May 2025
https://github.com/PrefectHQ/prefect
Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine
Last synced: 24 Mar 2025
https://github.com/argoproj/argo-workflows
Workflow Engine for Kubernetes
airflow argo argo-workflows batch-processing cloud-native cncf dag data-engineering gitops hacktoberfest k8s knative kubernetes machine-learning mlops pipelines workflow workflow-engine
Last synced: 12 May 2025
https://argoproj.github.io/argo-workflows/
Workflow Engine for Kubernetes
airflow argo argo-workflows batch-processing cloud-native cncf dag data-engineering gitops hacktoberfest k8s knative kubernetes machine-learning mlops pipelines workflow workflow-engine
Last synced: 24 Mar 2025
https://github.com/Avaiga/taipy
Turns Data and AI algorithms into production-ready web applications in no time.
automation data-engineering data-integration data-ops data-visualization datascience developer-tools hacktoberfest hacktoberfest2023 job-scheduler mlops orchestration pipeline pipelines python scenario scenario-analysis taipy-core taipy-gui workflow
Last synced: 05 Apr 2025
https://github.com/argoproj/argo
Workflow Engine for Kubernetes
airflow argo argo-workflows batch-processing cloud-native cncf dag data-engineering gitops hacktoberfest k8s knative kubernetes machine-learning mlops pipelines workflow workflow-engine
Last synced: 22 Nov 2024
https://github.com/andkret/Cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 14 Mar 2025
https://github.com/andkret/cookbook
The Data Engineering Cookbook
best-practices big-data cookbook data-engineer data-engineering
Last synced: 24 Mar 2025
https://github.com/dagster-io/dagster
An orchestration platform for the development, production, and observation of data assets.
analytics dagster data-engineering data-integration data-orchestrator data-pipelines data-science etl metadata mlops orchestration python scheduler workflow workflow-automation
Last synced: 12 May 2025
https://github.com/datastacktv/data-engineer-roadmap
Roadmap to becoming a data engineer in 2021
cloud data-engineer-roadmap data-engineering roadmap
Last synced: 23 Mar 2025
https://github.com/great-expectations/great_expectations
Always know what to expect from your data.
cleandata data-engineering data-profilers data-profiling data-quality data-science data-unit-tests datacleaner datacleaning dataquality dataunittest eda exploratory-analysis exploratory-data-analysis exploratorydataanalysis mlops pipeline pipeline-debt pipeline-testing pipeline-tests
Last synced: 12 May 2025
https://github.com/xonsh/xonsh
Python-powered shell. Full-featured and cross-platform.
artificial-intelligence bash cli command-line console data-engineering data-science devops fish iterm2 python raspberry-pi security-automation shell xonsh zsh
Last synced: 12 May 2025
https://github.com/redpanda-data/connect
Fancy stream processing made operationally mundane
amqp cqrs data-engineering data-ops etl event-sourcing go golang kafka logs message-bus message-queue nats rabbitmq stream-processing stream-processor streaming-data
Last synced: 11 May 2025
https://github.com/mage-ai/mage-ai
Build, run, and manage data pipelines for integrating and transforming data.
artificial-intelligence data data-engineering data-integration data-pipelines data-science dbt elt etl machine-learning orchestration pipeline pipelines python reverse-etl spark sql transformation
Last synced: 13 May 2025
https://github.com/Jeffail/benthos
Fancy stream processing made operationally mundane
amqp cqrs data-engineering data-ops etl event-sourcing go golang kafka logs message-bus message-queue nats rabbitmq stream-processing stream-processor streaming-data
Last synced: 25 Mar 2025
https://github.com/risingwavelabs/risingwave
Stream processing and management platform.
data-engineering database kafka materialized-view postgresql rust stream-processing
Last synced: 13 May 2025
https://github.com/singularity-data/risingwave
Stream processing and management platform.
data-engineering database kafka materialized-view postgresql rust stream-processing
Last synced: 15 Apr 2025
https://github.com/RisingWaveLabs/risingwave
Best-in-class stream processing, analytics, and management. Perform continuous analytics, or build event-driven applications, real-time ETL pipelines, and feature stores in minutes. Unified streaming and batch. PostgreSQL compatible.
analytics big-data cloud-native data-engineering database distributed-database etl flink kafka ksqldb materialized-view postgres postgresql real-time real-time-analytics rust serverless spark-streaming sql stream-processing
Last synced: 29 Mar 2025
https://github.com/growthbook/growthbook
Open Source Feature Flagging and A/B Testing Platform
ab-testing abtest abtesting analytics bigquery clickhouse continuous-delivery data-analysis data-engineering data-science experimentation feature-flagging feature-flags mixpanel redshift remote-config snowflake split-testing statistics
Last synced: 12 May 2025
https://github.com/cloudquery/cloudquery
The developer first cloud governance platform
airbyte attack-surface-management aws azure bigquery cspm data data-analysis data-collection data-engineering data-integration elt etl etl-framework gcp github-api go google kubernetes sql
Last synced: 14 May 2025
https://github.com/feast-dev/feast
The Open Source Feature Store for AI/ML
big-data data-engineering data-quality data-science feature-store features machine-learning ml mlops python
Last synced: 13 May 2025
https://github.com/evidence-dev/evidence
Business intelligence as code: build fast, interactive data visualizations in SQL and markdown
analytics business-intelligence dashboard data-engineering data-science data-visualization dbt duckdb exploratory-data-analysis self-hosted sql svelte tailwindcss webassembly
Last synced: 13 May 2025
https://github.com/treeverse/lakefs
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 13 May 2025
https://github.com/treeverse/lakeFS
lakeFS - Data version control for your data lake | Git for data
apache-spark apache-sparksql aws-s3 azure-blob-storage azure-storage data-engineering data-lake data-quality data-version-control data-versioning datalake datalakes git-for-data go golang google-cloud-storage hadoop-filesystem lakefs object-storage
Last synced: 20 Mar 2025
https://github.com/whoiskatrin/sql-translator
SQL Translator is a tool for converting natural language queries into SQL code using artificial intelligence. This project is 100% free and open source.
data-analysis data-engineering dataquery datascience dataset openai postgresql query sql
Last synced: 14 May 2025
https://github.com/rudderlabs/rudder-server
Privacy and Security focused Segment-alternative, in Golang and React
bigquery cdp customer-data customer-data-lake customer-data-pipeline customer-data-platform data-engineering data-integration data-pipeline data-synchronization data-warehouse elt etl event-streaming privacy redshift segment-alternative snowflake warehouse-management warehouse-native
Last synced: 13 May 2025
https://github.com/aws/aws-sdk-pandas
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
amazon-athena amazon-sagemaker-notebook apache-arrow apache-parquet athena aws aws-glue aws-lambda data-engineering data-science emr etl glue-catalog lambda modin mysql pandas python ray redshift
Last synced: 12 May 2025
https://github.com/adilkhash/data-engineering-howto
A list of useful resources to learn Data Engineering from scratch
cloud-providers data-engineering data-pipeline distributed-systems scala
Last synced: 14 May 2025
https://github.com/adilkhash/Data-Engineering-HowTo
A list of useful resources to learn Data Engineering from scratch
cloud-providers data-engineering data-pipeline distributed-systems scala
Last synced: 28 Mar 2025
https://github.com/moataz-elmesmary/data-science-roadmap
Data Science Roadmap from A to Z
big-data chatgpt cheatsheet cv-template data-analysis data-engineering data-science data-visualization deep-learning interview-questions linear-algebra llms machine-learning mathematics neural-network nlp probability python sql statistics
Last synced: 14 May 2025
https://github.com/Moataz-Elmesmary/Data-Science-Roadmap
Data Science Roadmap from A to Z
big-data chatgpt cheatsheet cv-template data-analysis data-engineering data-science data-visualization deep-learning interview-questions linear-algebra llms machine-learning mathematics neural-network nlp probability python sql statistics
Last synced: 25 Mar 2025
https://github.com/quadratichq/quadratic
Spreadsheet with AI, Code, Connections
ai data data-analysis data-engineering data-science etl python quadratic spreadsheet sql wasm webgl
Last synced: 13 May 2025
https://github.com/hemansnation/god-level-ai
A collection of scientific methods, processes, algorithms, and systems to build stories & models.
computer-vision data-engineering data-science data-structures-and-algorithms data-system-design data-visualization datastructures deep-learning machine-learning matplotlib mlops natural-language-processing numpy pandas python pytorch scikit-learn statistics tableau
Last synced: 10 Apr 2025
https://github.com/ploomber/ploomber
The fastest way to build data pipelines. Develop iteratively, deploy anywhere.
data-engineering data-science jupyter jupyter-notebooks machine-learning mlops notebooks papermill pipelines pycharm vscode workflow
Last synced: 29 Apr 2025
https://github.com/hemansnation/God-Level-Data-Science-ML-Full-Stack
A collection of scientific methods, processes, algorithms, and systems to build stories & models.
computer-vision data-engineering data-science data-structures-and-algorithms data-system-design data-visualization datastructures deep-learning machine-learning matplotlib mlops natural-language-processing numpy pandas python pytorch scikit-learn statistics tableau
Last synced: 01 Feb 2025
https://github.com/hemansnation/God-Level-AI
A collection of scientific methods, processes, algorithms, and systems to build stories & models.
computer-vision data-engineering data-science data-structures-and-algorithms data-system-design data-visualization datastructures deep-learning machine-learning matplotlib mlops natural-language-processing numpy pandas python pytorch scikit-learn statistics tableau
Last synced: 28 Mar 2025
https://github.com/dlt-hub/dlt
data load tool (dlt) is an open source Python library that makes data loading easy
data data-engineering data-lake data-loading data-warehouse elt extract load python transform
Last synced: 26 Mar 2025
https://github.com/superstreamlabs/memphis
Memphis.dev is a highly scalable and effortless data streaming platform
data data-engineering data-pipeline data-stream-processing data-streaming enrichment golang kubernetes message-broker message-bus message-queue messaging-queue microservices schema-registry
Last synced: 14 May 2025
https://github.com/gokumohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 15 May 2025
https://github.com/GokuMohandas/mlops-course
Learn how to design, develop, deploy and iterate on production-grade ML applications.
data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray
Last synced: 27 Mar 2025
https://github.com/datafold/data-diff
Compare tables within or across databases
data data-diffing data-engineering data-quality data-quality-monitoring data-science database databricks-sql dataengineering dataquality dbt mysql oracle-database postgres postgresql python rdbms snowflake sql trino
Last synced: 24 Mar 2025
https://github.com/eventual-inc/daft
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
Last synced: 08 May 2025
https://github.com/dathere/qsv
Blazing-fast Data-Wrangling toolkit
ckan cli csv data-engineering data-wrangling dcat excel geocode libreoffice luau metadata opendata parquet polars postgresql sampling sql sqlite statistics timeseries
Last synced: 11 May 2025
https://github.com/apache/incubator-devlake
Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
dashboard-friendly data data-analysis data-engineering data-integration data-transfers devops domain-layer dora etl golang hacktoberfest integration jira open-source user-friendly
Last synced: 14 May 2025
https://github.com/Eventual-Inc/Daft
Distributed data engine for Python/SQL designed for the cloud, powered by Rust
big-data data-engineering data-science dataframe distributed-computing machine-learning python rust
Last synced: 09 Apr 2025
https://github.com/jqnatividad/qsv
Blazing-fast Data-Wrangling toolkit
ckan cli csv data-engineering data-wrangling dcat excel geocode luau metadata opendata parquet polars postgresql snappy sql sqlite statistics timeseries
Last synced: 25 Nov 2024
https://github.com/dagworks-inc/hamilton
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering hacktoberfest lineage llmops machine-learning mlops orchestration pandas python rag software-engineering
Last synced: 13 May 2025
https://github.com/metarank/metarank
A low code Machine Learning personalized ranking service for articles, listings, search results, recommendations that boosts user engagement. A friendly Learn-to-Rank engine
automl data-engineering data-science deep-learning feature-engineering feature-extraction kubernetes machine-learning neural-networks personalization ranking scala search
Last synced: 14 May 2025
https://github.com/running-elephant/datart
Datart is a next generation Data Visualization Open Platform
analytics bi business-analytics business-intelligence chart d3 dashboard data-analysis data-analytics data-engineering data-visualization data-viz datart davinci display echarts react report sql-editor typescript
Last synced: 14 May 2025
https://github.com/sodadata/soda-core
Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
data-contracts data-engineering data-governance data-monitoring data-observability data-profiling data-quality data-quality-checks data-quality-monitoring data-quality-testing data-reliability data-testing data-unit-tests data-validation dataquality datatesting dbt pipeline-testing python snowflake
Last synced: 14 May 2025
https://github.com/meltano/meltano
Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.
connectors data data-engineering data-pipelines dataops dataops-platform elt extract-data integration loaders meltano meltano-sdk open-source opensource pipelines singer tap taps target targets
Last synced: 12 May 2025
https://github.com/feathr-ai/feathr
Feathr - A scalable, unified data and AI engineering platform for enterprise
apache-spark artificial-intelligence azure data-engineering data-quality data-science feature-engineering feature-governance feature-management feature-marketplace feature-metadata feature-platform feature-store machine-learning mlops
Last synced: 14 May 2025
https://github.com/DAGWorks-Inc/hamilton
Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering hacktoberfest lineage llmops machine-learning mlops orchestration pandas python rag software-engineering
Last synced: 26 Mar 2025
https://github.com/alexioannides/pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
data-engineering data-science etl etl-job etl-pipeline pyspark python spark
Last synced: 14 Apr 2025
https://github.com/bytewax/bytewax
Python Stream Processing
data-engineering data-processing data-science dataflow machine-learning python rust stream-processing streaming-data
Last synced: 13 May 2025
https://github.com/data-engineering-community/data-engineering-wiki
The best place to learn data engineering. Built and maintained by the data engineering community.
data data-engineer data-engineering data-modeling data-pipelines database etl sql
Last synced: 14 May 2025
https://github.com/multiwoven/multiwoven
Open source composable CDP - alternative to hightouch and census.
bigquery cdp customer-data-platform data-activation data-engineering data-pipeline data-warehouse databricks dbt etl hacktoberfest open-source postresql react redshift reverse-etl ruby self-hosted snowflake typescript
Last synced: 13 May 2025
https://github.com/san089/Udacity-Data-Engineering-Projects
A few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
airflow airflow-operators aws aws-ec2 aws-s3 aws-sdk cassandra cassandra-database cloudformation cluster data data-engineering data-engineering-pipeline data-lake data-modeling data-warehouse etl-pipeline infrastructure postgres postgresql-database
Last synced: 15 Apr 2025
https://github.com/san089/udacity-data-engineering-projects
A few projects related to Data Engineering including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
airflow airflow-operators aws aws-ec2 aws-s3 aws-sdk cassandra cassandra-database cloudformation cluster data data-engineering data-engineering-pipeline data-lake data-modeling data-warehouse etl-pipeline infrastructure postgres postgresql-database
Last synced: 08 Apr 2025
https://github.com/Multiwoven/multiwoven
Open source composable CDP - alternative to hightouch and census.
bigquery cdp customer-data-platform data-activation data-engineering data-pipeline data-warehouse databricks dbt etl hacktoberfest open-source postresql react redshift reverse-etl ruby self-hosted snowflake typescript
Last synced: 01 Apr 2025
https://github.com/kantord/just-dashboard
Dashboards using YAML or JSON files
big-data business-intelligence chart csv d3 d3js dashboard data data-driven data-engineering data-science data-visualization gist github-gist json just-dashboard yaml
Last synced: 15 May 2025
https://github.com/mlrun/mlrun
MLRun is an open source MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications.
data-engineering data-science experiment-tracking kubernetes machine-learning mlops mlops-workflow model-serving python workflow
Last synced: 13 May 2025
https://github.com/pyjanitor-devs/pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
cleaning-data data data-engineering dataframe hacktoberfest pandas pydata
Last synced: 13 May 2025
https://github.com/googlecloudplatform/data-science-on-gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning
Last synced: 14 Apr 2025
https://github.com/san089/goodreads_etl_pipeline
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
airflow airflow-dag apache-airflow apache-spark data-engineering data-engineering-pipeline data-lake data-migration emr-cluster etl-framework etl-job etl-pipeline goodreads-data-pipeline livy python redshift s3 scheduler spark warehouse
Last synced: 16 May 2025
https://github.com/quixio/quix-streams
Python Streaming DataFrames for Kafka
data-engineering data-intensive-applications data-science event-driven-architecture kafka machine-learning python real-time-data-processing stream-processing stream-processor streaming-data streaming-data-pipelines streaming-data-processing time-series-data
Last synced: 13 May 2025
https://github.com/ericmjl/pyjanitor
Clean APIs for data cleaning. Python implementation of R package Janitor
cleaning-data data data-engineering dataframe hacktoberfest pandas pydata
Last synced: 07 Jan 2025
https://github.com/quiltdata/quilt
Quilt is a data mesh for connecting people with actionable data
data data-engineering data-version-control data-versioning parquet python serialization
Last synced: 13 May 2025
https://github.com/GoogleCloudPlatform/data-science-on-gcp
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning
Last synced: 27 Nov 2024
https://github.com/opendatadiscovery/odd-platform
First open-source data discovery and observability platform. We make life easy for data practitioners so you can focus on your business.
alerting bigdata data-catalog data-discovery data-engineering data-exploration data-governance data-lineage data-observability data-pipelines data-platform data-profiling data-quality data-science datacatalog lineage metadata metadata-management observability oss
Last synced: 15 May 2025
https://github.com/obenner/data-engineering-interview-questions
More than 2,000 data engineer interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 14 May 2025
https://github.com/OBenner/data-engineering-interview-questions
More than 2,000 data engineer interview questions.
airflow avro aws azure cassandra data-engineering data-structures flink flume hadoop hadoop-hdfs hbase hive impala interview interview-questions kafka nifi spark sql
Last synced: 10 Apr 2025
https://github.com/abhishek-ch/around-dataengineering
A Data Engineering & Machine Learning Knowledge Hub
airflow data-engineering datascience devops infrastructure machine-learning mlops spark
Last synced: 08 Apr 2025
https://github.com/cocoindex-io/cocoindex
ETL framework to make your data AI-ready, with real-time incremental updates and support for custom logic, like Lego.
ai change-data-capture data data-engineering data-indexing data-infrastructure data-processing dataflow etl help-wanted indexing knowledge-graph llm pipeline python rag real-time rust semantic-search streaming
Last synced: 14 May 2025
https://github.com/daochenzha/data-centric-ai
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 24 Mar 2025
https://github.com/daochenzha/data-centric-AI
A curated, but incomplete, list of data-centric AI resources.
ai artificial-intelligence data-centric data-centric-ai data-centric-machine-learning data-curation data-engineering data-quality data-science machine-learning
Last synced: 26 Mar 2025
https://github.com/alanchn31/data-engineering-projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 12 Apr 2025
https://github.com/dataform-co/dataform
Dataform is a framework for managing SQL based data operations in BigQuery
analytics business-intelligence data-engineering data-pipelines elt etl hacktoberfest
Last synced: 13 May 2025
https://github.com/yobulkdev/yobulkdev
Open Source & AI driven Data Onboarding Platform: Free flatfile.com alternative
csv-import csv-parser csv-reader data-engineering datacleaning embeddable javascript languagemodel mongodb nextjs nodejs open-source react stream streaming
Last synced: 21 Apr 2025
https://github.com/stitchfix/hamilton
A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton
dag data-engineering data-platform data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hamilton hamiltonian machine-learning numpy pandas python software-engineering stitch-fix
Last synced: 18 Jan 2025
https://github.com/neumtry/neumai
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
ai chatgpt data data-engineering database embeddings etl llm llmops mlops ops pipeline python rag retrieval vector-database vectors
Last synced: 21 Apr 2025
https://github.com/alanchn31/Data-Engineering-Projects
Personal Data Engineering Projects
airflow aws-redshift cassandra data-engineering data-engineering-nanodegree data-lake data-modeling data-warehouse ingest-data mongodb postgres scrapy spark star-schema
Last synced: 16 Apr 2025
https://github.com/odpi/egeria
Egeria core
data-engineering data-governance egeria governance hacktoberfest java linux-foundation metadata-management odpi odpi-egeria
Last synced: 14 May 2025
https://github.com/oleg-agapov/data-engineering-book
Accumulated knowledge and experience in the field of Data Engineering
data data-engineering engineering
Last synced: 15 Apr 2025
https://github.com/NeumTry/NeumAI
Neum AI is a best-in-class framework to manage the creation and synchronization of vector embeddings at large scale.
ai chatgpt data data-engineering database embeddings etl llm llmops mlops ops pipeline python rag retrieval vector-database vectors
Last synced: 11 Apr 2025
https://github.com/automaticmode/active_workflow
Polyglot workflows without leaving the comfort of your technology stack.
activeworkflow agents data-engineering data-ops event-driven ifttt orchestration-framework scheduler scheduling self-hosted services-platform workflow
Last synced: 14 Mar 2025