# My Awesome Data Ops Resources [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)

> A curated list of data operations resources, with a focus on use by Cultural Heritage Organizations.

## Books

- [The DataOps Cookbook](https://www.datakitchen.io/dataops-cookbook-main.html) A 135-page book that describes the step-by-step implementation of DataOps.

## Papers and Blogs
### ETL
- [Managing Data in Motion](https://www.progress.com/docs/default-source/default-document-library/Progress/Documents/book-club/Managing-Data-in-Motion.p)

### Data Quality
- [A Deep Dive Into Data Quality](https://towardsdatascience.com/a-deep-dive-into-data-quality-c1d1ee576046)

### Metadata
- [Importance of Metadata in Data Warehousing](http://sdsu-dspace.calstate.edu/bitstream/handle/10211.10/2354/Dhiman_Abhinav.pdf;sequence=1)

### Pipeline Engineering
- [Smart pipelining — reactive approach to computation scheduling](https://medium.com/casumotech/smart-pipelining-reactive-approach-to-computation-scheduling-5a7e39658df5)

## Data Ops Software

### Data Pipeline Orchestration

- [Airflow](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8)
an open-source platform to programmatically author, schedule, and monitor data pipelines (see the DAG sketch after this list).
- [Apache Oozie](http://oozie.apache.org/)
an open-source workflow scheduler system to manage Apache Hadoop jobs.
- [DBT (Data Build Tool)](https://www.getdbt.com/)
is a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
- [BMC Control-M](http://www.bmc.com/it-solutions/control-m.html)
a digital business automation solution that simplifies and automates diverse batch application workloads.
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
- [Reflow](https://github.com/grailbio/reflow)
Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
- [ElementL](https://github.com/elementl)
a company founded by ex-Facebook director and GraphQL co-creator Nick Schrock, and the team behind the open-source orchestrator Dagster.
- [Astronomer.io](https://www.astronomer.io/)
Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
- [Piperr.io](http://piperr.io/)
pre-built data pipelines for use across enterprise stakeholders, from IT and analytics to data science and lines of business (LoBs).
- [Prefect Technologies](https://www.prefect.io/)
Open-source data engineering platform that builds, tests, and runs data workflows.
- [Genie](https://netflix.github.io/genie/)
Distributed Big Data Orchestration Service by Netflix
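To make "pipelines as code" concrete, here is a minimal sketch of an Airflow DAG. It assumes Apache Airflow 2.4+ is installed (`pip install apache-airflow`); the DAG id, schedule, and task bodies are illustrative placeholders, not taken from any project in this list.

```python
# A minimal Airflow DAG sketch: an extract task followed by a load task.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Placeholder extract step; a real task would pull from a source system.
    return {"rows": 3}


def load(ti):
    # Placeholder load step; reads the upstream task's return value via XCom.
    payload = ti.xcom_pull(task_ids="extract")
    print(f"loading {payload['rows']} rows")


with DAG(
    dag_id="example_etl",            # hypothetical pipeline name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day
    catchup=False,                   # do not backfill past runs
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task        # load runs only after extract succeeds
```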

### Testing and Production Quality
- [ICEDQ](https://icedq.com/)
software used to automate the testing of ETL/Data Warehouse and Data Migration.
- [Naveego](http://www.naveego.com/)
A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
- [FirstEigen](http://firsteigen.com/)
Automatic Data Quality Rule Discovery and Continuous Data Monitoring
- [Great Expectations](https://github.com/great-expectations/great_expectations)
Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of compile or deploy time); see the pipeline-test sketch after this list.
- [Enterprise Data Foundation](https://enterprise-data.org/)
Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
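As an illustration of the pipeline-test idea, here is a minimal sketch using the classic (pre-1.0) Pandas-backed Great Expectations API; the CSV path and column names are hypothetical.

```python
# A minimal batch-time data test with Great Expectations (classic Pandas API).
import great_expectations as ge

# Read a batch of data; the returned dataframe exposes expectation methods.
batch = ge.read_csv("orders.csv")

# Expectations are tests applied to the data itself, run on every batch.
batch.expect_column_values_to_not_be_null("order_id")
batch.expect_column_values_to_be_between("amount", min_value=0)

# validate() aggregates the expectation results; a failing batch can be used
# to halt the pipeline before bad data reaches downstream consumers.
results = batch.validate()
print(results.success)
```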

### Deployment Automation and Development Sandbox Creation
- [Jenkins](https://jenkins-ci.org/)
a ‘CI/CD’ tool used by software development teams to deploy code from development into production
- [DataKitchen](https://www.datakitchen.io/)
a DataOps Platform that supports the deployment of all data analytics code and configuration.
- [Amaterasu](http://shinto.io/index.html)
a deployment tool for data pipelines that lets developers write and deploy pipelines easily while clusters manage their configuration and dependencies.
- [Meltano](https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/)
aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle.

### Data Science Model Deployment
- [Domino](https://www.dominodatalab.com/)
accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
- [Hydrosphere.io](https://hydrosphere.io/)
deploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.
- [Open Data Group](https://www.opendatagroup.com/)
a software solution that facilitates the deployment of analytics using models.
- [ParallelM](http://www.parallelm.com/)
moves machine learning into production, automates orchestration, and manages the ML pipeline.
- [Seldon](https://www.seldon.io/)
streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
- [Metis Machine](https://metismachine.com/)
Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
- [Datatron](http://www.datatron.com/)
Automate deployment and monitoring of AI Models.
- [DSFlow](http://dsflow.io/)
Go from data extraction to business value in days, not months. Built on top of open-source tech, using Silicon Valley's best practices.
- [DataMo-Datmo](https://datmo.com/)
tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
- [MLFlow](https://www.mlflow.org/)
an open-source platform from Databricks for the complete machine learning lifecycle (see the tracking sketch after this list).
- [Studio.ML](https://www.studio.ml/)
Studio is a model management framework written in Python to help simplify and expedite your model building experience.
- [Comet.ML](https://www.comet.ml/)
Comet.ml allows data science teams and individuals to automatically track their datasets, code changes, experimentation history, and production models, creating efficiency, transparency, and reproducibility.
- [Polyaxon](https://polyaxon.com/)
An open source platform for reproducible machine learning at scale.
- [Missinglink.ai](https://missinglink.ai/)
MissingLink helps data engineers streamline and automate the entire deep learning lifecycle.
- [kubeflow](https://www.kubeflow.org/)
The Machine Learning Toolkit for Kubernetes
- [Verta.ai](https://www.verta.ai/)
Models are the new code!
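To show the kind of bookkeeping these model-lifecycle platforms automate, here is a minimal MLflow tracking sketch. It assumes `mlflow` and `scikit-learn` are installed; the dataset, model, and parameter values are illustrative only.

```python
# A minimal MLflow tracking sketch: train a model and log the run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    n_estimators = 50
    model = RandomForestClassifier(n_estimators=n_estimators).fit(X, y)

    # Record the parameter, a metric, and the fitted model so the run can be
    # compared, reproduced, and later promoted toward deployment.
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")
```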

## License

[![CC0](http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](http://creativecommons.org/publicdomain/zero/1.0)