{"id":13788075,"url":"https://github.com/chen1649chenli/dataOpsResource","last_synced_at":"2025-05-12T02:32:47.829Z","repository":{"id":71679074,"uuid":"179903104","full_name":"chen1649chenli/dataOpsResource","owner":"chen1649chenli","description":"Awesome List for Data Operations","archived":false,"fork":false,"pushed_at":"2020-08-14T20:36:59.000Z","size":15508,"stargazers_count":23,"open_issues_count":1,"forks_count":9,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-11-18T02:36:33.528Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chen1649chenli.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-04-07T00:57:34.000Z","updated_at":"2024-11-17T23:18:14.000Z","dependencies_parsed_at":"2023-02-27T09:00:13.785Z","dependency_job_id":null,"html_url":"https://github.com/chen1649chenli/dataOpsResource","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen1649chenli%2FdataOpsResource","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen1649chenli%2FdataOpsResource/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen1649chenli%2FdataOpsResource/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chen1649chenli%2FdataOpsResource/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chen1649chenli","download_url":"https://codeload.github.com/chen1649ch
enli/dataOpsResource/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253662746,"owners_count":21944123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T21:00:36.264Z","updated_at":"2025-05-12T02:32:46.767Z","avatar_url":"https://github.com/chen1649chenli.png","language":null,"readme":"# My Awesome Data Ops Resources [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n\n\u003e A curated list of data operations resources, focused on use by cultural heritage organizations.\n\n\n## Books\n\n- [The DataOps Cookbook](https://www.datakitchen.io/dataops-cookbook-main.html) A 135-page book that describes the step-by-step implementation of Data Ops. 
\n\n- [blogs](#another-section)\n\n\n## Papers and Blogs\n### ETL\n - [Managing Data in Motion](https://www.progress.com/docs/default-source/default-document-library/Progress/Documents/book-club/Managing-Data-in-Motion.p)\n\n\n### Data Quality\n- [A Deep Dive Into Data Quality](https://towardsdatascience.com/a-deep-dive-into-data-quality-c1d1ee576046)\n\n### Metadata\n - [Importance of Metadata in Data Warehousing](http://sdsu-dspace.calstate.edu/bitstream/handle/10211.10/2354/Dhiman_Abhinav.pdf;sequence=1)\n \n### Pipeline Engineering\n- [Smart pipelining — reactive approach to computation scheduling](https://medium.com/casumotech/smart-pipelining-reactive-approach-to-computation-scheduling-5a7e39658df5)\n\n\n\n## Data Ops Software\n\n### Data Pipeline Orchestration\n\n- [Airflow](https://medium.com/airbnb-engineering/airflow-a-workflow-management-platform-46318b977fd8)\nan open-source platform to programmatically author, schedule and monitor data pipelines.\n- [Apache Oozie](http://oozie.apache.org/)\nan open-source workflow scheduler system to manage Apache Hadoop jobs.\n- [DBT (Data Build Tool)](https://www.getdbt.com/)\nis a command line tool that enables data analysts and engineers to transform data in their warehouse more effectively.\n- [BMC Control-M](http://www.bmc.com/it-solutions/control-m.html)\na digital business automation solution that simplifies and automates diverse batch application workloads.\n- [DataKitchen](https://www.datakitchen.io/)\na DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.\n- [Reflow](https://github.com/grailbio/reflow)\nReflow is a system for incremental data processing in the cloud. 
Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.\n- [ElementL](https://github.com/elementl)\nA current stealth company founded by ex-Facebook director and GraphQL co-creator Nick Schrock. Dagster Open Source.\n- [Astronomer.io](https://www.astronomer.io/)\nAstronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.\n- [Piperr.io](http://piperr.io/) \nUse Piperr’s pre-built data pipelines across enterprise stakeholders: From IT to Analytics, From Tech, Data Science to LoBs.\n- [Prefect Technologies](https://www.prefect.io/)\nOpen-source data engineering platform that builds, tests, and runs data workflows.\n- [Genie](https://netflix.github.io/genie/)\nDistributed Big Data Orchestration Service by Netflix\n\n### Testing and Production Quality\n- [ICEDQ](https://icedq.com/)\nsoftware used to automate the testing of ETL/Data Warehouse and Data Migration.\n- [Naveego](http://www.naveego.com/)\nA simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.\n- [DataKitchen](https://www.datakitchen.io/)\na DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.\n- [FirstEigen](http://firsteigen.com/)\nAutomatic Data Quality Rule Discovery and Continuous Data Monitoring\n- [Great Expectations](https://github.com/great-expectations/great_expectations)\nGreat Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. 
Pipeline tests are applied to data (instead of code) and at batch time (instead of compiling or deploy time).\n- [Enterprise Data Foundation](https://enterprise-data.org/)\nOpen-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.\n\n### Deployment Automation and Development Sandbox Creation\n- [Jenkins](https://jenkins-ci.org/)\na ‘CI/CD’ tool used by software development teams to deploy code from development into production\n- [DataKitchen](https://www.datakitchen.io/)\na DataOps Platform that supports the deployment of all data analytics code and configuration.\n- [Amaterasu](http://shinto.io/index.html)\nis a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.\n- [Meltano](https://about.gitlab.com/2018/08/01/hey-data-teams-we-are-working-on-a-tool-just-for-you/)\naims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle.\n\n### Data Science Model Deployment\n- [Domino](https://www.dominodatalab.com/)\naccelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.\n- [Hydrosphere.io](https://hydrosphere.io/)\ndeploys batch Spark functions, machine-learning models, and assures the quality of end-to-end pipelines.\n- [Open Data Group](https://www.opendatagroup.com/)\na software solution that facilitates the deployment of analytics using models.\n- [ParallelM](http://www.parallelm.com/)\nmoves machine learning into production, automates orchestration, and manages the ML pipeline.\n- [Seldon](https://www.seldon.io/)\nstreamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.\n- [Metis Machine](https://metismachine.com/)\nEnterprise-scale Machine 
Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.\n- [Datatron](http://www.datatron.com/)\nAutomate deployment and monitoring of AI Models.\n- [DSFlow](http://dsflow.io/)\nGo from data extraction to business value in days, not months. Build on top of open source tech, using Silicon Valley’s best practices.\n- [DataMo-Datmo](https://datmo.com/)\ntools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.\n- [MLflow](https://www.mlflow.org/)\nAn open source platform for the complete machine learning lifecycle, originally developed by Databricks.\n- [Studio.ML](https://www.studio.ml/)\nStudio is a model management framework written in Python to help simplify and expedite your model building experience.\n- [Comet.ML](https://www.comet.ml/)\nComet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.\n- [Polyaxon](https://polyaxon.com/)\nAn open source platform for reproducible machine learning at scale.\n- [Missinglink.ai](https://missinglink.ai/)\nMissingLink helps data engineers streamline and automate the entire deep learning lifecycle.\n- [Kubeflow](https://www.kubeflow.org/)\nThe Machine Learning Toolkit for Kubernetes\n- [Verta.ai](https://www.verta.ai/)\nModels are the new code!\n\n\n\n\n## License\n\n[![CC0](http://mirrors.creativecommons.org/presskit/buttons/88x31/svg/cc-zero.svg)](http://creativecommons.org/publicdomain/zero/1.0)\n","funding_links":[],"categories":["Other Lists"],"sub_categories":["Vector 
Database"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen1649chenli%2FdataOpsResource","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchen1649chenli%2FdataOpsResource","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchen1649chenli%2FdataOpsResource/lists"}