# My Awesome Data Ops Resources [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)
> A curated list of data operations resources, with a focus on use by Cultural Heritage Organizations.
## Books
- [The DataOps Cookbook](https://www.datakitchen.io/dataops-cookbook-main.html) A 135-page book that describes the step-by-step implementation of DataOps.
- [Papers and Blogs](#papers-and-blogs)
## Papers and Blogs
### ETL
- [Managing Data in Motion](https://www.progress.com/docs/default-source/default-document-library/Progress/Documents/book-club/Managing-Data-in-Motion.p)
### Data Quality
- [A Deep Dive Into Data Quality](https://towardsdatascience.com/a-deep-dive-into-data-quality-c1d1ee576046)
### Metadata
- [Importance of Metadata in Data Warehousing](http://sdsu-dspace.calstate.edu/bitstream/handle/10211.10/2354/Dhiman_Abhinav.pdf;sequence=1)
### Pipeline Engineering
- [Smart pipelining — reactive approach to computation scheduling](https://medium.com/casumotech/smart-pipelining-reactive-approach-to-computation-scheduling-5a7e39658df5)
## Data Ops Software
### Data Pipeline Orchestration
- [Airflow](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8)
an open-source platform to programmatically author, schedule, and monitor data pipelines (a minimal DAG sketch follows this list).
- [Apache Oozie](http://oozie.apache.org/)
an open-source workflow scheduler system to manage Apache Hadoop jobs.
- [DBT (Data Build Tool)](https://www.getdbt.com/)
a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
- [BMC Control-M](http://www.bmc.com/it-solutions/control-m.html)
a digital business automation solution that simplifies and automates diverse batch application workloads.
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
- [Reflow](https://github.com/grailbio/reflow)
Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
- [ElementL](https://github.com/elementl)
A company founded by ex-Facebook director and GraphQL co-creator Nick Schrock, makers of the open-source orchestrator Dagster.
- [Astronomer.io](https://www.astronomer.io/)
Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
- [Piperr.io](http://piperr.io/)
Use Piperr's pre-built data pipelines across enterprise stakeholders: from IT to analytics, from tech and data science to LoBs.
- [Prefect Technologies](https://www.prefect.io/)
Open-source data engineering platform that builds, tests, and runs data workflows.
- [Genie](https://netflix.github.io/genie/)
Distributed Big Data Orchestration Service by Netflix.
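
As referenced in the Airflow entry above, here is a minimal sketch of authoring a pipeline with Airflow's 1.x-style Python API; the DAG id, schedule, and shell commands are illustrative assumptions, not taken from any project listed here.

```python
# A minimal Airflow DAG (Airflow 1.x-style API); names and commands are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,               # do not backfill missed intervals
)

# Two shell tasks wired into a simple extract -> load chain.
extract = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
load = BashOperator(task_id="load", bash_command="echo loading", dag=dag)

extract >> load  # "load" runs only after "extract" succeeds
```

Declaring dependencies with `>>` is what lets the scheduler run, retry, and monitor each task independently while preserving order.
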
### Testing and Production Quality
- [ICEDQ](https://icedq.com/)
software used to automate the testing of ETL/Data Warehouse and Data Migration.
- [Naveego](http://www.naveego.com/)
A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
- [FirstEigen](http://firsteigen.com/)
Automatic Data Quality Rule Discovery and Continuous Data Monitoring
- [Great Expectations](https://github.com/great-expectations/great_expectations)
Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time); a short usage sketch follows this list.
- [Enterprise Data Foundation](https://enterprise-data.org/)
Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
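
As noted in the Great Expectations entry above, here is a minimal sketch of a pipeline test applied to a batch of data rather than to code, using the older pandas-based `great_expectations` API (exact return types vary between releases); the file and column names are illustrative assumptions.

```python
# A minimal pipeline test with Great Expectations' pandas-based API (0.x-era style).
# The file and column names are illustrative assumptions.
import great_expectations as ge

# Load one batch of data as a Great Expectations-aware DataFrame.
orders = ge.read_csv("orders.csv")

# Expectations are evaluated against the data itself, at batch time.
not_null = orders.expect_column_values_to_not_be_null("order_id")
in_range = orders.expect_column_values_to_be_between("quantity", min_value=1, max_value=1000)

# Each expectation reports success or failure; fail the pipeline step on bad data.
if not (not_null.success and in_range.success):
    raise ValueError("Data quality checks failed for this batch")
```
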
### Deployment Automation and Development Sandbox Creation
- [Jenkins](https://jenkins-ci.org/)
a CI/CD tool used by software development teams to deploy code from development into production.
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that supports the deployment of all data analytics code and configuration.
- [Amaterasu](http://shinto.io/index.html)
is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.
- [Meltano](https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/)
aims to be a complete solution for data teams; the name stands for model, extract, load, transform, analyze, notebook, orchestrate, in other words, the data science lifecycle.
### Data Science Model Deployment
- [Domino](https://www.dominodatalab.com/)
accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
- [Hydrosphere.io](https://hydrosphere.io/)
deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.
- [Open Data Group](https://www.opendatagroup.com/)
a software solution that facilitates the deployment of analytics using models.
- [ParallelM](http://www.parallelm.com/)
moves machine learning into production, automates orchestration, and manages the ML pipeline.
- [Seldon](https://www.seldon.io/)
streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
- [Metis Machine](https://metismachine.com/)
Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
- [Datatron](http://www.datatron.com/)
Automate deployment and monitoring of AI Models.
- [DSFlow](http://dsflow.io/)
Go from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley's best practices.
- [DataMo-Datmo](https://datmo.com/)
tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
- [MLFlow](https://www.mlflow.org/)
An open source platform for the complete machine learning lifecycle, open-sourced by Databricks (a brief tracking sketch follows this list).
- [Studio.ML](https://www.studio.ml/)
Studio is a model management framework written in Python to help simplify and expedite your model building experience.
- [Comet.ML](https://www.comet.ml/)
Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.
- [Polyaxon](https://polyaxon.com/)
An open source platform for reproducible machine learning at scale.
- [Missinglink.ai](https://missinglink.ai/)
MissingLink helps data engineers streamline and automate the entire deep learning lifecycle.
- [kubeflow](https://www.kubeflow.org/)
The Machine Learning Toolkit for Kubernetes
- [Verta.ai](https://www.verta.ai/)
Models are the new code!
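
As mentioned in the MLflow entry above, here is a minimal sketch of experiment tracking with the `mlflow` Python API; the run name, parameters, and metric values are illustrative assumptions.

```python
# A minimal MLflow tracking sketch; run name, parameters, and metric are illustrative.
import mlflow

with mlflow.start_run(run_name="baseline"):
    # Record the configuration of this run...
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)
    # ...and its evaluation result, so runs can be compared later in the MLflow UI.
    mlflow.log_metric("rmse", 0.23)
```
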
## License
[![CC0](http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](http://creativecommons.org/publicdomain/zero/1.0)