{"id":13815038,"url":"https://github.com/Nike-Inc/koheesio","last_synced_at":"2025-05-15T06:34:00.002Z","repository":{"id":241566829,"uuid":"796773455","full_name":"Nike-Inc/koheesio","owner":"Nike-Inc","description":"Python framework for building efficient data pipelines. It promotes modularity and collaboration, enabling the creation of complex pipelines from simple, reusable components.","archived":false,"fork":false,"pushed_at":"2025-05-13T13:49:36.000Z","size":8295,"stargazers_count":637,"open_issues_count":29,"forks_count":36,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-05-13T14:33:25.805Z","etag":null,"topics":["data-engineering","delta-lake","pydantic","pyspark","python"],"latest_commit_sha":null,"homepage":"https://engineering.nike.com/koheesio/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Nike-Inc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-05-06T15:47:49.000Z","updated_at":"2025-05-12T20:03:17.000Z","dependencies_parsed_at":"2024-05-29T03:33:16.382Z","dependency_job_id":"e5d57c7e-0c7d-4fd4-8a9b-754e3ae835c2","html_url":"https://github.com/Nike-Inc/koheesio","commit_stats":null,"previous_names":["nike-inc/koheesio"],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nike-Inc%2Fkoheesio","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nike-Inc%2Fkoheesio/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nike-Inc%2Fkoheesio/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Nike-Inc%2Fkoheesio/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Nike-Inc","download_url":"https://codeload.github.com/Nike-Inc/koheesio/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253962999,"owners_count":21991279,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","delta-lake","pydantic","pyspark","python"],"created_at":"2024-08-04T04:02:51.968Z","updated_at":"2025-05-15T06:33:59.993Z","avatar_url":"https://github.com/Nike-Inc.png","language":"Python","readme":"# Koheesio\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/Nike-Inc/koheesio/main/docs/assets/logo_koheesio.svg\" alt=\"Koheesio logo\" width=\"500\" role=\"img\"\u003e\n\u003c/p\u003e\n\n|         |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                     |\n|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| CI/CD   | [![CI - Test](https://github.com/Nike-Inc/koheesio/actions/workflows/test.yml/badge.svg)](https://github.com/Nike-Inc/koheesio/actions/workflows/test.yml) [![CD - Release Koheesio](https://github.com/Nike-Inc/koheesio/actions/workflows/release.yml/badge.svg)](https://github.com/Nike-Inc/koheesio/actions/workflows/release.yml)                                                                                                                                                                                                                                                                                                                                                                                                 |\n| Package | [![PyPI - Version](https://img.shields.io/pypi/v/koheesio.svg?logo=pypi\u0026label=PyPI\u0026logoColor=gold)](https://pypi.org/project/koheesio/) [![PyPI - Python Version](https://img.shields.io/pypi/pyversions/koheesio.svg?logo=python\u0026label=Python\u0026logoColor=gold)](https://pypi.org/project/koheesio/) [![PyPI - 
Downloads](https://img.shields.io/pypi/dm/koheesio?color=blue\u0026label=Installs\u0026logo=pypi\u0026logoColor=gold)](https://pypi.org/project/koheesio/)                       |\n| Meta    | [![Hatch project](https://img.shields.io/badge/%F0%9F%A5%9A-Hatch-4051b5.svg)](https://github.com/pypa/hatch) [![linting - Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) [![types - Mypy](https://img.shields.io/badge/types-Mypy-blue.svg)](https://github.com/python/mypy) [![docstring - numpydoc](https://img.shields.io/badge/docstring-numpydoc-blue)](https://numpydoc.readthedocs.io/en/latest/format.html) [![code style - black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) [![License - Apache 2.0](https://img.shields.io/github/license/Nike-Inc/koheesio)](LICENSE.txt) |\n\n[//]: # (suggested edit: )\n# Koheesio: A Python Framework for Efficient Data Pipelines\n\nKoheesio - the Finnish word for cohesion - is a robust Python framework designed to build efficient data pipelines. It\nencourages modularity and collaboration, allowing the creation of complex pipelines from simple, reusable components.\n\n\n## What is Koheesio?\n\nKoheesio is a versatile framework that supports multiple implementations and works seamlessly with various data \nprocessing libraries or frameworks. This ensures that Koheesio can handle any data processing task, regardless of the\nunderlying technology or data scale.\n\nKoheesio uses [Pydantic] for strong typing, data validation, and settings management, ensuring a high level of type\nsafety and structured configurations within pipeline components.\n\n[Pydantic]: docs/includes/glossary.md#pydantic\n\nThe goal of Koheesio is to ensure predictable pipeline execution through a solid foundation of well-tested code and a\nrich set of features. 
This makes it an excellent choice for developers and organizations seeking to build robust and\nadaptable data pipelines.\n\n\n### What Koheesio is Not\n\nKoheesio is **not** a workflow orchestration tool. It does not serve the same purpose as tools like Luigi,\nApache Airflow, or Databricks workflows, which are designed to manage complex computational workflows and generate \nDAGs (Directed Acyclic Graphs).\n\nInstead, Koheesio is focused on providing a robust, modular, and testable framework for data tasks. It's designed to\nmake it easier to write, maintain, and test data processing code in Python, with a strong emphasis on modularity,\nreusability, and error handling.\n\nIf you're looking for a tool to orchestrate complex workflows or manage dependencies between different tasks, you might\nwant to consider dedicated workflow orchestration tools.\n\n\n### The Strength of Koheesio\n\nThe core strength of Koheesio lies in its **focus on the individual tasks within those workflows**. It's all about\nmaking these tasks as robust, repeatable, and maintainable as possible. Koheesio aims to break down tasks into small,\nmanageable units of work that can be easily tested, reused, and composed into larger workflows orchestrated with other\ntools or frameworks (such as Apache Airflow, Luigi, or Databricks Workflows).\n\nBy using Koheesio, you can ensure that your data tasks are resilient, observable, and repeatable, adhering to good\nsoftware engineering practices. This makes your data pipelines more reliable and easier to maintain, ultimately leading\nto more efficient and effective data processing.\n\n\n### Promoting Collaboration and Innovation\n\nKoheesio encapsulates years of software and data engineering expertise. 
It fosters a collaborative and innovative\ncommunity, setting itself apart with its unique design and focus on data pipelines, data transformation, ETL jobs,\ndata validation, and large-scale data processing.\n\nThe core components of Koheesio are designed to bring strong software engineering principles to data engineering. \n\n'Steps' break down tasks and workflows into manageable, reusable, and testable units. Each 'Step' comes with built-in\nlogging, providing transparency and traceability. The 'Context' component allows for flexible customization of task \nbehavior, making it adaptable to various data processing needs.\n\nIn essence, Koheesio is a comprehensive solution for data engineering challenges, designed with the principles of\nmodularity, reusability, testability, and transparency at its core. It aims to provide a rich set of features including\nutilities, readers, writers, and transformations for any type of data processing. It is not in competition with other\nlibraries, but rather aims to offer wide-ranging support and focus on utility in a multitude of scenarios. Our\npreference is for integration, not competition.\n\nWe invite contributions from all, promoting collaboration and innovation in the data engineering community.\n\n\n### Comparison to other libraries\n\n#### ML frameworks\n\nThe libraries listed under this section are primarily focused on Machine Learning (ML) workflows. They provide various\nfunctionalities, from orchestrating ML and data processing workflows, simplifying the deployment of ML workflows on \nKubernetes, to managing the end-to-end ML lifecycle. While these libraries have a strong emphasis on ML, Koheesio is a \nmore general data pipeline framework. It is designed to handle a variety of data processing tasks, not exclusively\nfocused on ML. 
This makes Koheesio a versatile choice for data pipeline construction, regardless of whether the\npipeline involves ML tasks or not.\n\n- [Flyte](https://flyte.org/): A cloud-native platform for orchestrating ML and data processing workflows. Unlike Koheesio, it requires Kubernetes for deployment and has a strong focus on workflow orchestration.\n- [Kubeflow](https://kubeflow.org/): A project dedicated to simplifying the deployment of ML workflows on Kubernetes. Unlike Koheesio, it is more specialized for ML workflows.\n- [Kedro](https://kedro.readthedocs.io/): A Python framework that applies software engineering best-practice to data and machine-learning pipelines. It is similar to Koheesio but has a stronger emphasis on machine learning pipelines.\n- [Metaflow](https://docs.metaflow.org/): A human-centric framework for data science that addresses the entire data science lifecycle. It is more focused on data science projects compared to Koheesio.\n- [MLflow](https://mlflow.org/docs/latest/index.html): An open source platform for managing the end-to-end machine learning lifecycle. It is more focused on machine learning projects compared to Koheesio.\n- [TFX](https://www.tensorflow.org/tfx/guide): An end-to-end platform for deploying production ML pipelines. It is more focused on TensorFlow-based machine learning pipelines compared to Koheesio.\n- [Seldon Core](https://docs.seldon.io/projects/seldon-core/en/latest/): An open source platform for deploying machine learning models on Kubernetes. Unlike Koheesio, it is more focused on model deployment.\n\n\n#### Orchestration tools\n\nThe libraries listed under this section are primarily focused on workflow orchestration. They provide various \nfunctionalities, from authoring, scheduling, and monitoring workflows, to building complex pipelines of batch jobs, and \ncreating and executing Directed Acyclic Graphs (DAGs). 
Some of these libraries are designed for modern infrastructure \nand powered by open-source workflow engines, while others use a Python-style language for defining workflows. While \nthese libraries have a strong emphasis on workflow orchestration, Koheesio is a more general data pipeline framework. It\nis designed to handle a variety of data processing tasks, not limited to workflow orchestration. Code written with\nKoheesio is often compatible with these orchestration engines. This makes Koheesio a versatile choice for data pipeline \nconstruction, regardless of how the pipeline orchestration is set up.\n\n- [Apache Airflow](https://airflow.apache.org/docs/): A platform to programmatically author, schedule and monitor workflows. Unlike Koheesio, it focuses on managing complex computational workflows.\n- [Luigi](https://luigi.readthedocs.io/): A Python module that helps you build complex pipelines of batch jobs. It is more focused on workflow orchestration compared to Koheesio.\n- [Databricks Workflows](https://www.databricks.com/product/workflows): A set of tools for building, debugging, deploying, and running Apache Spark workflows on Databricks.\n- [Prefect](https://docs.prefect.io/): A new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. It is more focused on workflow orchestration and management compared to Koheesio.\n- [Snakemake](https://snakemake.readthedocs.io/en/stable/): A workflow management system that uses a Python-style language for defining workflows. While it's powerful for creating complex workflows, Koheesio's focus on modularity and reusability might make it easier to build, test, and maintain your data pipelines.\n- [Dagster](https://docs.dagster.io/): A data orchestrator for machine learning, analytics, and ETL. It's more focused on orchestrating and visualizing data workflows compared to Koheesio.\n- [Ploomber](https://ploomber.readthedocs.io/): A Python library for building robust data pipelines. In some ways it is similar to Koheesio, but has a very different API design more focused on workflow orchestration.\n- [Pachyderm](https://docs.pachyderm.com/): A data versioning, data lineage, and workflow system running on Kubernetes. It is more focused on data versioning and lineage compared to Koheesio.\n- [Argo](https://argoproj.github.io/): An open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Unlike Koheesio, it requires Kubernetes for deployment.\n\n\n#### Others\n\nThe libraries listed under this section offer a variety of unique functionalities, from parallel and distributed\ncomputing, to SQL-first transformation workflows, to data versioning and lineage, to data relation definition and\nmanipulation, and data warehouse management. Some of these libraries are designed for specific tasks such as\ntransforming data in warehouses using SQL, building concurrent, multi-stage data ingestion and processing pipelines,\nor orchestrating parallel jobs on Kubernetes.\n\n- [Dask](https://dask.org/): A flexible parallel computing library for analytics. Unlike Koheesio, it is more focused on parallel and distributed computing. While not currently supported, Dask could be a future implementation pattern for Koheesio, just like Pandas and PySpark at the moment.\n- [dbt](https://www.getdbt.com/): A SQL-first transformation workflow that also supports Python. It excels in transforming data in warehouses using SQL. In contrast, Koheesio is a more general data pipeline framework with strong typing, capable of handling a variety of data processing tasks beyond transformations.\n- [Broadway](https://elixir-broadway.org/): An Elixir library for building concurrent, multi-stage data ingestion and processing pipelines. 
If your team is more comfortable with Python or if you're looking for a framework that emphasizes modularity and collaboration, Koheesio could be a better fit.\n- [Ray](https://docs.ray.io/en/latest/): A general-purpose distributed computing framework. Unlike Koheesio, it is more focused on distributed computing.\n- [DataJoint](https://docs.datajoint.io/): A language for defining data relations and manipulating data. Unlike Koheesio, it is more focused on data relation definition and manipulation.\n\n\n## Koheesio Core Components\n\nHere are the three core components included in Koheesio:\n\n- __Step__: This is the fundamental unit of work in Koheesio. It represents a single operation in a data pipeline,\n  taking in inputs and producing outputs.\n- __Context__: This is a configuration class used to set up the environment for a Task. It can be used to share\n  variables across tasks and adapt the behavior of a Task based on its environment.\n- __Logger__: This is a class for logging messages at different levels.\n\n## Installation\n\nYou can install Koheesio using pip, hatch, or poetry.\n\n### Using Pip\n\nTo install Koheesio using pip, run the following command in your terminal:\n\n```bash\npip install koheesio\n```\n\n### Using Hatch\n\nIf you're using Hatch for package management, you can add Koheesio to your project by simply adding `koheesio` to the dependencies in your\n`pyproject.toml`:\n\n```toml title=\"pyproject.toml\"\n[project]\ndependencies = [\n  \"koheesio==\u003cversion\u003e\"\n]\n```\n\n### Using Poetry\n\nIf you're using poetry for package management, you can add Koheesio to your project with the following command:\n\n```bash\npoetry add koheesio\n```\n\nor add the following line to your `pyproject.toml` (under `[tool.poetry.dependencies]`), making sure to replace\n`...` with the version you want to have installed:\n\n```toml title=\"pyproject.toml\"\nkoheesio = {version = \"...\"}\n```\n\n## Extras\n\nKoheesio also provides some additional features that can be useful in certain scenarios. We call these 'integrations'.\nBy 'integration' we mean a module that requires additional dependencies to be installed.\n\nExtras can be installed by adding `extras=['name_of_the_extra']` (poetry) or `koheesio[name_of_the_extra]` (pip/hatch) to\nthe `pyproject.toml` entry mentioned above, or by installing through pip.\n\n### Integrations\n\n- __Spark Expectations:__   \n    Available through the `koheesio.integrations.spark.dq.spark_expectations` module; installable through the `se` extra.\n    - Spark Expectations provides data quality checks for Spark DataFrames.\n    - For more information, refer to the [Spark Expectations docs](https://engineering.nike.com/spark-expectations).\n\n- __Spark Connect and Delta:__  \n    Koheesio is ready to be used with Spark Connect. If you use the Delta package in combination with a remote/Connect session, you get full support on Databricks and partial support for the Delta package in Apache Spark. Full support for Delta in Apache Spark is coming with the release of PySpark 4.0.\n    - The spark extra can be installed by adding `koheesio[spark]` to the `pyproject.toml` entry mentioned above.\n    - The spark module is available through the `koheesio.spark` module.\n    - The delta module is available through the `koheesio.spark.writers.delta` module.\n    - For more information, refer to the [Databricks documentation](https://docs.databricks.com/).\n    - For more information on Apache Spark, refer to the [Apache Spark documentation](https://spark.apache.org/docs/latest/).\n\n[//]: # (- **Brickflow:** Available through the `koheesio.integrations.workflow` module; installable through the `bf` extra.)\n[//]: # (    - Brickflow is a workflow orchestration tool that allows you to define and execute workflows in a declarative way.)\n[//]: # (    - For more information, refer to the [Brickflow docs]\u0026#40;https://engineering.nike.com/brickflow\u0026#41;)\n\n- __Box__:  \n    Available through the `koheesio.integrations.box` module; installable through the `box` extra.\n    - [Box](https://www.box.com) is a cloud content management and file sharing service for businesses.\n\n- __SFTP__:  \n    Available through the `koheesio.integrations.spark.sftp` module; installable through the `sftp` extra.\n    - SFTP is a network protocol used for secure file transfer over a secure shell.\n    - The SFTP integration of Koheesio relies on [paramiko](https://www.paramiko.org/).\n\n- __Snowflake__:  \n    Available through the `koheesio.integrations.snowflake` module; installable through the `snowflake` extra.\n    - [Snowflake](https://www.snowflake.com) is a cloud-based data warehousing platform.\n\n[//]: # (TODO: add implementations)\n[//]: # (## Implementations)\n[//]: # (TODO: add async extra)\n[//]: # (TODO: add spark extra)\n[//]: # (TODO: add pandas extra)\n\n\u003e __Note:__  \n\u003e Some of the steps require extra dependencies. See the [Extras](#extras) section for additional info.  \n\u003e Extras can be installed by adding `extras=['name_of_the_extra']` to the `pyproject.toml` entry mentioned above.\n\n## Contributing\n\n### How to Contribute\n\nWe welcome contributions to our project! Here's a brief overview of our development process:\n\n- __Code Standards__: We use `pylint`, `black`, and `mypy` to maintain code standards. Please ensure your code passes\n  these checks by running `make check`. No errors or warnings should be reported by the linter before you submit a pull\n  request.\n\n- __Testing__: We use `pytest` for testing. Run the tests with `make test` and ensure all tests pass before submitting\n  a pull request.\n\n- __Release Process__: We aim for frequent releases. 
Typically when we have a new feature or bugfix, a developer with\n  admin rights will create a new release on GitHub and publish the new version to PyPI.\n\nFor more detailed information, please refer to our [contribution guidelines](https://github.com/Nike-Inc/koheesio/blob/main/CONTRIBUTING.md).\nWe also adhere to [Nike's Code of Conduct](https://github.com/Nike-Inc/nike-inc.github.io/blob/master/CONDUCT.md).\n\n### Additional Resources\n\n- [General GitHub documentation](https://support.github.com/)\n- [GitHub pull request documentation](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests)\n- [Nike OSS](https://nike-inc.github.io/)\n","funding_links":[],"categories":["Python","🔄 Data Plattform Tools"],"sub_categories":["🧠 Prompt Engineering \u0026 Memory Bank"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNike-Inc%2Fkoheesio","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNike-Inc%2Fkoheesio","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNike-Inc%2Fkoheesio/lists"}