Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/GokuMohandas/mlops-course

Learn how to design, develop, deploy and iterate on production-grade ML applications.

data-engineering data-quality data-science deep-learning distributed-ml llms machine-learning mlops natural-language-processing python pytorch ray

Last synced: 13 May 2024

https://github.com/DataKitchen/data-observability-installer

Installer for DataKitchen's Open Source Data Observability Products. Data breaks. Servers break. Your toolchain breaks. Ensure your team is the first to know and the first to solve with visibility across and down your data estate. Save time with simple, fast data quality test generation and execution. Trust your data, tools, and systems end to end.

data data-engineering data-observability data-profiling data-quality data-reliability data-science datachecker datacleaner datacleaning dataops dataquality datatesting datavalidation mssql pipeline-tests postgresql redshift self-hosted snowflake

Last synced: 12 May 2024

https://github.com/morph-kgc/morph-kgc

Powerful RDF Knowledge Graph Generation with RML Mappings

data-engineering data-integration database etl knowledge-graph python r2rml rdf rdf-star rml

Last synced: 12 May 2024

https://github.com/datastacktv/data-engineer-roadmap

Roadmap to becoming a data engineer in 2021

cloud data-engineer-roadmap data-engineering roadmap

Last synced: 11 May 2024

https://github.com/airbytehq/airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.

bigquery change-data-capture data data-analysis data-collection data-engineering data-integration data-pipeline elt etl java mssql mysql pipeline postgresql python redshift s3 self-hosted snowflake

Last synced: 09 May 2024

https://github.com/ploomber/ploomber

The fastest ⚡️ way to build data pipelines. Develop iteratively, deploy anywhere. ☁️

data-engineering data-science jupyter jupyter-notebooks machine-learning mlops notebooks papermill pipelines pycharm vscode workflow

Last synced: 06 May 2024

https://github.com/insitro/redun

Yet another redundant workflow engine

aws bioinformatics data-engineering data-science docker etl gcp ml python workflow-engine

Last synced: 05 May 2024

https://github.com/rdagumampan/yuniql

Free and open source schema versioning and database migration made natively with .NET/6. NEW THIS MAY 2022! v1.3.15 released!

amazon-rds azure-sql-database data-engineering database-migrations datawarehouse dotnet-core dotnet-tool mariadb mysql oracle postgresql redshift snowflake sql sqlserver yuniql

Last synced: 05 May 2024

https://github.com/ocademy-ai/machine-learning

Learn AI together, for free. AI learning and teaching resources for everyone.

ai data-engineering data-science deep-learning jupyter jupyter-notebook machine-learning ml mlops python scikit-learn visualization

Last synced: 04 May 2024

https://ddotta.github.io/cookbook-rpolars/

Cookbook to provide solutions to common tasks and problems in using Polars with R

benchmark cookbook data-engineering data-science datatable dplyr polars r tidyr

Last synced: 02 May 2024

https://github.com/RisingWaveLabs/risingwave

Cloud-native SQL stream processing, analytics, and management. KsqlDB and Apache Flink alternative. 🚀 10x more productive. 🚀 10x more cost-efficient.

analytics big-data cloud-native data-engineering database distributed-database flink kafka ksqldb materialized-view postgres postgresql postgresql-database real-time rust serverless spark spark-streaming sql stream-processing

Last synced: 29 Apr 2024

https://github.com/e-alizadeh/sample_dbt_project

Companion template repo for the blog post "dbt for Data Transformation - A Hands-on Tutorial" (https://ealizadeh.com/blog/dbt-tutorial)

data-engineering data-transformation database dbt dbt-packages dbtcloud etl sql

Last synced: 28 Apr 2024

https://github.com/DAGWorks-Inc/hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. Runs and scales everywhere Python does.

dag data-analysis data-engineering data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hacktoberfest lineage llmops machine-learning mlops numpy orchestration pandas python software-engineering

Last synced: 28 Apr 2024

https://github.com/pyjanitor-devs/pyjanitor

Clean APIs for data cleaning. Python implementation of R package Janitor

cleaning-data data data-engineering dataframe hacktoberfest pandas pydata

Last synced: 28 Apr 2024

https://github.com/PrefectHQ/prefect

Prefect is a workflow orchestration tool empowering developers to build, observe, and react to data pipelines

automation data data-engineering data-ops data-science infrastructure ml-ops observability orchestration pipeline prefect python workflow workflow-engine

Last synced: 28 Apr 2024

https://github.com/aws/aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

amazon-athena amazon-sagemaker-notebook apache-arrow apache-parquet athena aws aws-glue aws-lambda data-engineering data-science emr etl glue-catalog lambda modin mysql pandas python ray redshift

Last synced: 28 Apr 2024

https://github.com/electronick1/stairs

A framework that helps you run parallel/distributed calculations using data pipelines

data-engineering data-pipeline data-science distributed-computing python

Last synced: 27 Apr 2024

https://github.com/ankurchavda/streamify

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

airflow data-engineering dbt gcp kafka python spark

Last synced: 22 Apr 2024

https://github.com/Desbordante/desbordante-core

Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.

anomaly-detection correlations data-analytics data-cleaning data-cleansing data-engineering data-exploration data-mining data-mining-algorithms data-preprocessing data-profiling data-science data-wrangling exploratory-data-analysis feature-engineering feature-extraction feature-selection knowledge-discovery spreadsheets tabular-data

Last synced: 21 Apr 2024

https://github.com/stitchfix/hamilton

A scalable general purpose micro-framework for defining dataflows. THIS REPOSITORY HAS BEEN MOVED TO www.github.com/dagworks-inc/hamilton

dag data-engineering data-platform data-science dataframe etl etl-framework etl-pipeline feature-engineering featurization hamilton hamiltonian machine-learning numpy pandas python software-engineering stitch-fix

Last synced: 20 Apr 2024

https://github.com/rupurt/odbc-scanner-duckdb-extension

A DuckDB extension to read data directly from databases supporting the ODBC interface

analytics bigquery columnar-database cpp data-engineering db2 duckdb mariadb mssql mysql nix odbc olap oracle postgres snowflake vector-engine

Last synced: 20 Apr 2024

https://github.com/GoogleCloudPlatform/data-science-on-gcp

Source code accompanying the book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017

cloud-computing data-analysis data-engineering data-pipeline data-processing data-science data-visualization machine-learning

Last synced: 17 Apr 2024

https://github.com/Dineshkarthik/awesome-data-science-and-engineering

A curated list of Data Science and Engineering frameworks, tools, and libraries, plus a related list of tutorials.

beginner-friendly data-engineering data-science tutorials

Last synced: 11 Apr 2024

https://github.com/yahwang/Awesome-Data-Engineering

📒(GitBook) A curated list of awesome Data Engineering resources

data-engineering data-lake data-pipeline

Last synced: 10 Apr 2024

https://github.com/yarncraft/awesome-edge

A qualitative compilation of production-ready frameworks, services and repositories with a focus on Edge Computing & IoT

awesome awesome-list cloud cloud-native cloudcomputing computing data-engineering database edge edge-computing iot iot-platform kubernetes

Last synced: 09 Apr 2024

https://github.com/iesahin/xvc

A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)

command-line-tool data data-engineering data-pipelines data-science devops machine-learning machine-learning-engineering mlops rust

Last synced: 01 Apr 2024

https://github.com/recap-build/recap

Work with your web service, database, and streaming schemas in a single format.

data-catalog data-discovery data-engineering data-integration data-pipelines etl metadata recap

Last synced: 01 Apr 2024

https://github.com/swoop-inc/spark-alchemy

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

data-engineering data-science scala spark

Last synced: 31 Mar 2024

https://pyjanitor-devs.github.io/pyjanitor/

Clean APIs for data cleaning. Python implementation of R package Janitor

cleaning-data data data-engineering dataframe hacktoberfest pandas pydata

Last synced: 29 Mar 2024

https://github.com/benthecoder/yt-channels-DS-AI-ML-CS

A comprehensive list of 180+ YouTube Channels for Data Science, Data Engineering, Machine Learning, Deep Learning, Computer Science, programming, software engineering, etc.

ai artificial-intelligence awesome awesome-list coding data data-analysis data-engineering data-science deep-learning machine-learning math ml programming python resources software-engineering statistics web-development youtube

Last synced: 29 Mar 2024

https://github.com/Minyus/pipelinex

PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more

data-engineering data-science deep-learning experimentation machine-learning pipeline

Last synced: 23 Mar 2024

https://github.com/dgarnitz/vectorflow

VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.

ai data-engineering embeddings machine-learning nlp vectors

Last synced: 21 Mar 2024

https://github.com/awslabs/aws-serverless-data-lake-framework

Enterprise-grade, production-hardened, serverless data lake on AWS

analytics aws best-practices data-engineering data-lake etl framework iac lake-formation serverless

Last synced: 19 Mar 2024

https://github.com/ris-tlp/audiophile-e2e-pipeline

Pipeline that extracts data from Crinacle's Headphone and InEarMonitor databases and finalizes data for a Metabase Dashboard.

airflow aws data-engineering metabase python terraform

Last synced: 18 Mar 2024

https://github.com/metarank/metarank

A low-code Machine Learning personalized ranking service for articles, listings, search results, and recommendations that boosts user engagement. A friendly Learn-to-Rank engine

automl data-engineering data-science deep-learning feature-engineering feature-extraction kubernetes machine-learning neural-networks personalization ranking scala search

Last synced: 17 Mar 2024

https://github.com/twosigma/uberjob

uberjob is a Python package for building and running call graphs.

data-engineering python

Last synced: 16 Mar 2024