{"id":13459552,"url":"https://github.com/Minyus/pipelinex","last_synced_at":"2025-03-24T18:30:43.621Z","repository":{"id":41506639,"uuid":"221912857","full_name":"Minyus/pipelinex","owner":"Minyus","description":"PipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more","archived":false,"fork":false,"pushed_at":"2023-11-28T12:49:00.000Z","size":2609,"stargazers_count":221,"open_issues_count":3,"forks_count":11,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-09-19T18:37:19.657Z","etag":null,"topics":["data-engineering","data-science","deep-learning","experimentation","machine-learning","pipeline"],"latest_commit_sha":null,"homepage":"https://pipelinex.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Minyus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-11-15T11:43:39.000Z","updated_at":"2024-09-02T18:47:23.000Z","dependencies_parsed_at":"2024-01-07T01:44:13.671Z","dependency_job_id":"a6d63cfb-f7fc-42ae-a2bf-620d92b6dee6","html_url":"https://github.com/Minyus/pipelinex","commit_stats":{"total_commits":534,"total_committers":5,"mean_commits":106.8,"dds":0.02621722846441943,"last_synced_commit":"88ada1393962205e5532094bc0ef2bb4a5b3519b"},"previous_names":[],"tags_count":47,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Minyus%2Fpipelinex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Minyus%2Fpipelinex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/Minyus%2Fpipelinex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Minyus%2Fpipelinex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Minyus","download_url":"https://codeload.github.com/Minyus/pipelinex/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":221995704,"owners_count":16913546,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-science","deep-learning","experimentation","machine-learning","pipeline"],"created_at":"2024-07-31T10:00:19.437Z","updated_at":"2024-10-29T05:31:04.791Z","avatar_url":"https://github.com/Minyus.png","language":"Python","funding_links":[],"categories":["Example projects","Data Pipeline","Python"],"sub_categories":[],"readme":"# PipelineX\n\nPipelineX: Python package to build ML pipelines for experimentation with Kedro, MLflow, and more\n\n[![Python version](https://img.shields.io/badge/python-3.5%20%7C%203.6%20%7C%203.7%20%7C%203.8-blue.svg)](https://pypi.org/project/pipelinex/)\n[![PyPI version](https://badge.fury.io/py/pipelinex.svg)](https://badge.fury.io/py/pipelinex)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://github.com/Minyus/pipelinex/blob/master/LICENSEj)\n[![Documentation](https://readthedocs.org/projects/pipelinex/badge/?version=latest)](https://pipelinex.readthedocs.io/)\n\n## PipelineX Overview\n\nPipelineX is a Python package to build ML pipelines for experimentation with Kedro, MLflow, and more\n\nPipelineX provides the 
following options which can be used independently or together.\n\n- HatchDict: Python in YAML/JSON\n\n  `HatchDict` is a Python dict parser that enables you to include Python objects in YAML/JSON files. \n\n  Note: `HatchDict` can be used with or without Kedro.\n\n- Flex-Kedro: Kedro plugin for flexible config\n\n  - Flex-Kedro-Pipeline: Kedro plugin for quicker pipeline set up \n\n  - Flex-Kedro-Context: Kedro plugin for YAML lovers\n\n- MLflow-on-Kedro: Kedro plugin for MLflow users\n\n  `MLflow-on-Kedro` integrates Kedro with [MLflow](https://github.com/mlflow/mlflow) through Kedro DataSets and Hooks.\n\n  Note: You do not need to install MLflow if you do not use it.\n\n- Kedro-Extras: Kedro plugin to use various Python packages \n\n  `Kedro-Extras` provides Kedro DataSets, decorators, and wrappers to use various Python packages such as: \n\n  - \u003c[PyTorch](https://github.com/pytorch/pytorch)\u003e\n  - \u003c[Ignite](https://github.com/pytorch/ignite)\u003e\n  - \u003c[Pandas](https://github.com/pandas-dev/pandas)\u003e\n  - \u003c[OpenCV](https://github.com/skvark/opencv-python)\u003e\n  - \u003c[Memory Profiler](https://github.com/pythonprofilers/memory_profiler)\u003e\n  - \u003c[NVIDIA Management Library](https://github.com/gpuopenanalytics/pynvml)\u003e\n\n  Note: You do not need to install Python packages you do not use.\n\nPlease see [this comparison](https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow) to find out how PipelineX differs from other pipeline/workflow packages: Airflow, Luigi, Gokart, Metaflow, and Kedro.\n\n\n## Install PipelineX\n\n### [Option 1] Install from PyPI\n\n```bash\npip install pipelinex\n```\n\n### [Option 2] Development install \n\nThis is recommended only if you want to modify the source code of PipelineX.\n\n```bash\ngit clone https://github.com/Minyus/pipelinex.git\ncd pipelinex\npython setup.py develop\n```\n\n### Prepare development environment for PipelineX\n\nYou can install packages and organize the 
development environment with [pipenv](https://github.com/pypa/pipenv).\nRefer to the [pipenv](https://github.com/pypa/pipenv) documentation to install pipenv.\nOnce you have installed pipenv, you can use it to install and organize your environment.\n\n```sh\n# install dependent libraries\n$ pipenv install\n\n# install development libraries\n$ pipenv install --dev\n\n# install pipelinex\n$ pipenv run install\n\n# install pipelinex via setup.py\n$ pipenv run install_dev\n\n# lint python code\n$ pipenv run lint\n\n# format python code\n$ pipenv run fmt\n\n# sort imports\n$ pipenv run sort\n\n# apply mypy to python code\n$ pipenv run vet\n\n# get into shell\n$ pipenv shell\n\n# run test\n$ pipenv run test\n```\n\n### Prepare Docker environment for PipelineX\n\n```bash\ngit clone https://github.com/Minyus/pipelinex.git\ncd pipelinex\ndocker build --tag pipelinex .\ndocker run --rm -it pipelinex\n```\n\n## Getting Started with PipelineX\n\n### Kedro (0.17-0.18) Starter projects\n\nKedro starters (Cookiecutter templates) to use Kedro, Scikit-learn, MLflow, and PipelineX are available at:\n[kedro-starters-sklearn](https://github.com/Minyus/kedro-starters-sklearn)\n\nThe Iris dataset is included and used, but you can easily switch to the Kaggle Titanic dataset.\n\n### Example/Demo Projects tested with Kedro 0.16\n\n- [Computer Vision using PyTorch](https://github.com/Minyus/pipelinex_pytorch)\n\n  - `parameters.yml` at [conf/base/parameters.yml](https://github.com/Minyus/pipelinex_pytorch/blob/master/conf/base/parameters.yml)\n\n  - Essential packages: PyTorch, Ignite, Shap, Kedro, MLflow\n  - Application: Image classification\n  - Data: MNIST images\n  - Model: CNN (Convolutional Neural Network)\n  - Loss: Cross-entropy\n\n- [Kaggle competition using PyTorch](https://github.com/Minyus/kaggle_nfl)\n\n  - `parameters.yml` at [kaggle/conf/base/parameters.yml](https://github.com/Minyus/kaggle_nfl/blob/master/kaggle/conf/base/parameters.yml)\n\n  - Essential packages: PyTorch, Ignite, pandas, 
numpy, Kedro, MLflow\n  - Application: [Kaggle competition to predict the results of American Football plays](https://www.kaggle.com/c/nfl-big-data-bowl-2020/data)\n  - Data: Sparse heatmap-like field images and tabular data\n  - Model: Combination of CNN and MLP\n  - Loss: Continuous Ranked Probability Score (CRPS)\n\n- [Computer Vision using OpenCV](https://github.com/Minyus/pipelinex_image_processing)\n\n  - `parameters.yml` at [conf/base/parameters.yml](https://github.com/Minyus/pipelinex_image_processing/blob/master/conf/base/parameters.yml)\n  - Essential packages: OpenCV, Scikit-image, numpy, TensorFlow (pretrained model), Kedro, MLflow\n  - Application: Image processing to estimate the empty area ratio of a cuboid container on a truck\n  - Data: container images\n\n- [Uplift Modeling using CausalLift](https://github.com/Minyus/pipelinex_causallift)\n\n  - `parameters.yml` at [conf/base/parameters.yml](https://github.com/Minyus/pipelinex_causallift/blob/master/conf/base/parameters.yml)\n  - Essential packages: CausalLift, Scikit-learn, XGBoost, pandas, Kedro\n  - Application: Uplift Modeling to find which customers should and should not be targeted for a marketing campaign (treatment)\n  - Data: generated by simulation\n\n## HatchDict: Python in YAML/JSON\n\n[API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.hatch_dict.html)\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Minyus/pipelinex/blob/master/notebooks/HatchDict_demo.ipynb)\n\n### Python objects in YAML/JSON\n\n#### Introduction to YAML\n\nYAML is a common text format used for application config files.\n\nYAML's most notable advantage is that it allows users to mix two styles: block style and flow style.\n\nExample:\n\n```python\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\nblock_style_demo:\n  key1: 
value1\n  key2: value2\nflow_style_demo: {key1: value1, key2: value2}\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nprint(\"### 2 styles in YAML ###\")\npprint(parameters)\n```\n\n```\n### 2 styles in YAML ###\n{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},\n 'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}\n```\n\nTo store a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding in Python code.\n\n- YAML's block style, which uses indentation, allows users to omit opening and closing symbols to specify a Python dict or list (`{}` or `[]`).\n\n- YAML's flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.\n\nSo, is simply using YAML with Python the best way to run Machine Learning experiments?\n\nLet's check out the next example.\n\nExample:\n\n```python\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml = \"\"\"\nmodel_kind: LogisticRegression\nmodel_params:\n  C: 1.23456\n  max_iter: 987\n  random_state: 42\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nprint(\"### Before ###\")\npprint(parameters)\n\nmodel_kind = parameters.get(\"model_kind\")\nmodel_params_dict = parameters.get(\"model_params\")\n\nif model_kind == \"LogisticRegression\":\n    from sklearn.linear_model import LogisticRegression\n    model = LogisticRegression(**model_params_dict)\n\nelif model_kind == \"DecisionTree\":\n    from sklearn.tree import DecisionTreeClassifier\n    model = DecisionTreeClassifier(**model_params_dict)\n\nelif model_kind == \"RandomForest\":\n    from sklearn.ensemble import RandomForestClassifier\n    model = RandomForestClassifier(**model_params_dict)\n\nelse:\n    raise ValueError(\"Unsupported model_kind.\")\n\nprint(\"\\n### After ###\")\nprint(model)\n```\n\n```\n### Before ###\n{'model_kind': 'LogisticRegression',\n 'model_params': {'C': 
1.23456, 'max_iter': 987, 'random_state': 42}}\n\n### After ###\nLogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,\n                   intercept_scaling=1, l1_ratio=None, max_iter=987,\n                   multi_class='warn', n_jobs=None, penalty='l2',\n                   random_state=42, solver='warn', tol=0.0001, verbose=0,\n                   warm_start=False)\n```\n\nThis approach is inefficient because every new option requires adding `import` and `if` statements to the Python code in addition to modifying the YAML config file.\n\nAny better way?\n\n#### Python tags in YAML\n\nPyYAML provides [UnsafeLoader](\u003chttps://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation\u003e) which can load Python objects without `import`.\n\n\nExample usage of `!!python/object`\n\n```python\nimport yaml\n# You do not need `import sklearn.linear_model` using PyYAML's UnsafeLoader\n\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml = \"\"\"\nmodel:\n  !!python/object:sklearn.linear_model.LogisticRegression\n  C: 1.23456\n  max_iter: 987\n  random_state: 42\n\"\"\"\n\nparameters = yaml.unsafe_load(params_yaml)  # unsafe_load required\n\nmodel = parameters.get(\"model\")\n\nprint(\"### model object by PyYAML's UnsafeLoader ###\")\nprint(model)\n```\n\n```\n### model object by PyYAML's UnsafeLoader ###\nLogisticRegression(C=1.23456, class_weight=None, dual=None, fit_intercept=None,\n                   intercept_scaling=None, l1_ratio=None, max_iter=987,\n                   multi_class=None, n_jobs=None, penalty=None, random_state=42,\n                   solver=None, tol=None, verbose=None, warm_start=None)\n```\n\nExample usage of `!!python/name`\n\n```python\nimport yaml\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml = \"\"\"\nnumpy_array_func: \n  !!python/name:numpy.array\n\"\"\"\n\ntry:\n    parameters = yaml.unsafe_load(params_yaml)  # unsafe_load required for PyYAML 5.1 or later\nexcept AttributeError:  # PyYAML older than 5.1 has no unsafe_load\n    
parameters = yaml.load(params_yaml)\n\nnumpy_array_func = parameters.get(\"numpy_array_func\")\n\nimport numpy\n\nassert numpy_array_func == numpy.array\n```\n\n[PyYAML's `!!python/object` and `!!python/name`](https://pyyaml.org/wiki/PyYAMLDocumentation), however, have the following problems.\n\n- The `!!python/object` and `!!python/name` tags are too long to write.\n- Positional (unnamed) arguments are apparently not supported.\n\nAny better way?\n\nPipelineX provides the solution.\n\n#### Alternative to Python tags in YAML\n\nPipelineX's HatchDict provides an easier syntax, as follows, to convert Python dictionaries read from YAML or JSON files to Python objects without `import`.\n\n- Use the `=` key to specify the package, module, and class/function with the `.` separator in `foo_package.bar_module.baz_class` format.\n- [Optional] Use the `_` key to specify a (list of) positional (unnamed) argument(s) if any.\n- [Optional] Add keyword arguments (kwargs) if any.\n\nTo return an object instance like PyYAML's `!!python/object`, feed positional and/or keyword arguments. 
If it has no arguments, just feed null (known as `None` in Python) to `_` key.\n\nTo return an uninstantiated (raw) object like PyYAML's `!!python/name`, just feed `=` key without any arguments.\n\nExample alternative to `!!python/object` specifying keyword arguments:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n# You do not need `import sklearn.linear_model` using PipelineX's HatchDict\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\nmodel:\n  =: sklearn.linear_model.LogisticRegression\n  C: 1.23456\n  max_iter: 987\n  random_state: 42\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nmodel_dict = parameters.get(\"model\")\n\nprint(\"### Before ###\")\npprint(model_dict)\n\nmodel = HatchDict(parameters).get(\"model\")\n\nprint(\"\\n### After ###\")\nprint(model)\n```\n\n```\n### Before ###\n{'=': 'sklearn.linear_model.LogisticRegression',\n 'C': 1.23456,\n 'max_iter': 987,\n 'random_state': 42}\n\n### After ###\nLogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,\n                   intercept_scaling=1, l1_ratio=None, max_iter=987,\n                   multi_class='warn', n_jobs=None, penalty='l2',\n                   random_state=42, solver='warn', tol=0.0001, verbose=0,\n                   warm_start=False)\n```\n\nExample alternative to `!!python/object` specifying both positional and keyword arguments:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\nparams_yaml = \"\"\"\nmetrics:\n  - =: functools.partial\n    _:\n      =: sklearn.metrics.roc_auc_score\n    multiclass: ovr\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nmetrics_dict = parameters.get(\"metrics\")\n\nprint(\"### Before ###\")\npprint(metrics_dict)\n\nmetrics = HatchDict(parameters).get(\"metrics\")\n\nprint(\"\\n### After ###\")\nprint(metrics)\n```\n\n```\n### Before ###\n[{'=': 
'functools.partial',\n  '_': {'=': 'sklearn.metrics.roc_auc_score'},\n  'multiclass': 'ovr'}]\n\n### After ###\n[functools.partial(\u003cfunction roc_auc_score at 0x16bcf19d0\u003e, multiclass='ovr')]\n```\n\nExample alternative to `!!python/name`:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\nnumpy_array_func:\n  =: numpy.array\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nnumpy_array_func = HatchDict(parameters).get(\"numpy_array_func\")\n\nimport numpy\n\nassert numpy_array_func == numpy.array\n```\n\nThis import-less Python object feature supports nested objects (objects that receive object arguments) by recursive depth-first search.\n\nFor more examples, please see [Use with PyTorch](https://pipelinex.readthedocs.io/en/latest/section08.html#use-with-pytorch). \n\nThis import-less Python object feature, inspired by the fact that Kedro uses `load_obj` for file I/O (`DataSet`), uses `load_obj` copied from [kedro.utils](https://github.com/quantumblacklabs/kedro/blob/0.15.4/kedro/utils.py) which dynamically imports Python objects using [`importlib`](https://docs.python.org/3.6/library/importlib.html), a Python standard library.\n\n### Anchor-less aliasing in YAML/JSON\n\n#### Aliasing in YAML\n\nTo avoid repetition, YAML natively provides the [Anchor\u0026Alias](https://confluence.atlassian.com/bitbucket/yaml-anchors-960154027.html) feature, and [Jsonnet](https://github.com/google/jsonnet) provides the [Variable](https://github.com/google/jsonnet/blob/master/examples/variables.jsonnet) feature to JSON.\n\nExample:\n\n```python\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\ntrain_params:\n  train_batch_size: \u0026batch_size 32\n  val_batch_size: *batch_size\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\ntrain_params_dict = 
parameters.get(\"train_params\")\n\nprint(\"### Conversion by YAML's Anchor\u0026Alias feature ###\")\npprint(train_params_dict)\n```\n\n```\n### Conversion by YAML's Anchor\u0026Alias feature ###\n{'train_batch_size': 32, 'val_batch_size': 32}\n```\n\nUnfortunately, YAML and Jsonnet require an intermediary (an anchor or a variable) to share the same value.\n\nThis is why PipelineX provides an anchor-less aliasing feature.\n\n#### Alternative to aliasing in YAML\n\nYou can directly look up another value in the same YAML/JSON file using the \"$\" key without an anchor or a variable.\n\nTo specify a nested key (a key in a dict of dicts), use \".\" as the separator.\n\nExample:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\ntrain_params:\n  train_batch_size: 32\n  val_batch_size: {$: train_params.train_batch_size}\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\ntrain_params_dict = parameters.get(\"train_params\")\n\nprint(\"### Before ###\")\npprint(train_params_dict)\n\ntrain_params = HatchDict(parameters).get(\"train_params\")\n\nprint(\"\\n### After ###\")\npprint(train_params)\n```\n\n```\n### Before ###\n{'train_batch_size': 32,\n 'val_batch_size': {'$': 'train_params.train_batch_size'}}\n\n### After ###\n{'train_batch_size': 32, 'val_batch_size': 32}\n```\n\n### Python expression in YAML/JSON\n\nStrings wrapped in parentheses are evaluated as Python expressions.\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml = \"\"\"\ntrain_params:\n  param1_tuple_python: (1, 2, 3)\n  param1_tuple_yaml: !!python/tuple [1, 2, 3]\n  param2_formula_python: (2 + 3)\n  param3_neg_inf_python: (float(\"-Inf\"))\n  param3_neg_inf_yaml: -.Inf\n  param4_float_1e9_python: (1e9)\n  param4_float_1e9_yaml: 1.0e+09\n  param5_int_1e9_python: 
(int(1e9))\n\"\"\"\nparameters = yaml.unsafe_load(params_yaml)  # `!!python/tuple` is not supported by safe_load\n\ntrain_params_raw = parameters.get(\"train_params\")\n\nprint(\"### Before ###\")\npprint(train_params_raw)\n\ntrain_params_converted = HatchDict(parameters).get(\"train_params\")\n\nprint(\"\\n### After ###\")\npprint(train_params_converted)\n```\n\n```\n### Before ###\n{'param1_tuple_python': '(1, 2, 3)',\n 'param1_tuple_yaml': (1, 2, 3),\n 'param2_formula_python': '(2 + 3)',\n 'param3_neg_inf_python': '(float(\"-Inf\"))',\n 'param3_neg_inf_yaml': -inf,\n 'param4_float_1e9_python': '(1e9)',\n 'param4_float_1e9_yaml': 1000000000.0,\n 'param5_int_1e9_python': '(int(1e9))'}\n\n### After ###\n{'param1_tuple_python': (1, 2, 3),\n 'param1_tuple_yaml': (1, 2, 3),\n 'param2_formula_python': 5,\n 'param3_neg_inf_python': -inf,\n 'param3_neg_inf_yaml': -inf,\n 'param4_float_1e9_python': 1000000000.0,\n 'param4_float_1e9_yaml': 1000000000.0,\n 'param5_int_1e9_python': 1000000000}\n```\n\n## Introduction to Kedro\n\n### Why the unified data interface framework is needed\n\nMachine Learning projects involve loading and saving various data in various ways such as:\n\n- files in local/network file system, Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage\n  - e.g. CSV, JSON, YAML, pickle, images, models, etc.\n- databases \n  - PostgreSQL, MySQL, etc.\n- Spark\n- REST API (HTTP(S) requests)\n\nMachine Learning Engineers often mix data loading/saving code and data transformation code in the same Python module or Jupyter notebook during the experimentation/prototyping phase, and suffer later on because:\n\n- During experimentation/prototyping, we often want to save the intermediate data after each transformation. \n- In production environments, we often want to skip saving data to minimize latency and storage space.\n- To benchmark the performance or troubleshoot, we often want to switch the data source.\n  - e.g. 
read image files in local storage or download images through REST API\n\nThe proposed solution is the unified data interface.\n\nHere is a simple demo example to predict survival on the [Titanic](https://www.kaggle.com/c/titanic/data).\n\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"img/example_kedro_pipeline.PNG\"\u003e\nPipeline visualized by Kedro-viz\n\u003c/p\u003e\n\nCommon code to define the tasks/operations/transformations:\n\n```python\n# Define tasks\n\ndef train_model(model, df, cols_features, col_target):\n    # train a model here\n    return model\n\ndef run_inference(model, df, cols_features):\n    # run inference here\n    return df\n```\n\nNote that you do _not_ need to add any Kedro-related code here to use Kedro later on.\n\nFurthermore, you do _not_ need to add any MLflow-related code here to use MLflow later on, as Kedro hooks provided by PipelineX can handle it behind the scenes.\n\nThis advantage enables you to keep your pipelines for experimentation/prototyping/benchmarking production-ready.\n\n\n1. 
Plain code:\n\n```python\n# Configure: can be written in a config file (YAML, JSON, etc.)\n\ntrain_data_filepath = \"data/input/train.csv\"\ntrain_data_load_args = {\"float_precision\": \"high\"}\n\ntest_data_filepath = \"data/input/test.csv\"\ntest_data_load_args = {\"float_precision\": \"high\"}\n\npred_data_filepath = \"data/load/pred.csv\"\npred_data_save_args = {\"index\": False, \"float_format\": \"%.16e\"}\n\nmodel_kind = \"LogisticRegression\"\nmodel_params_dict = {\n    \"C\": 1.23456,\n    \"max_iter\": 987,\n    \"random_state\": 42,\n}\ncols_features = [\"Pclass\", \"Parch\"]  # Columns used as features\ncol_target = \"Survived\"  # Column used as the target\n\n# Run tasks\n\nimport pandas as pd\n\nif model_kind == \"LogisticRegression\":\n    from sklearn.linear_model import LogisticRegression\n    model = LogisticRegression(**model_params_dict)\n\ntrain_df = pd.read_csv(train_data_filepath, **train_data_load_args)\nmodel = train_model(model, train_df, cols_features, col_target)\n\ntest_df = pd.read_csv(test_data_filepath, **test_data_load_args)\npred_df = run_inference(model, test_df, cols_features)\npred_df.to_csv(pred_data_filepath, **pred_data_save_args)\n\n```\n\n2. 
Following the data interface framework (objects with `_load` and `_save` methods) proposed by [Kedro](https://github.com/quantumblacklabs/kedro) and supported by PipelineX:\n\n```python\n\n# Define a data interface: better ones such as \"CSVDataSet\" are provided by Kedro\n\nimport pandas as pd\nfrom pathlib import Path\n\n\nclass CSVDataSet:\n    def __init__(self, filepath, load_args={}, save_args={}):\n        self._filepath = filepath\n        self._load_args = {}\n        self._load_args.update(load_args)\n        self._save_args = {\"index\": False}\n        self._save_args.update(save_args)\n\n    def _load(self) -\u003e pd.DataFrame:\n        return pd.read_csv(self._filepath, **self._load_args)\n\n    def _save(self, data: pd.DataFrame) -\u003e None:\n        save_path = Path(self._filepath)\n        save_path.parent.mkdir(parents=True, exist_ok=True)\n        data.to_csv(str(save_path), **self._save_args)\n\n\n# Configure data interface: can be written in catalog config file using Kedro\n\ntrain_dataset = CSVDataSet(\n    filepath=\"data/input/train.csv\",\n    load_args={\"float_precision\": \"high\"},\n    # save_args={\"float_format\": \"%.16e\"},  # You can set save_args for future use\n)\n\ntest_dataset = CSVDataSet(\n    filepath=\"data/input/test.csv\",\n    load_args={\"float_precision\": \"high\"},\n    # save_args={\"float_format\": \"%.16e\"},  # You can set save_args for future use\n)\n\npred_dataset = CSVDataSet(\n    filepath=\"data/load/pred.csv\",\n    # load_args={\"float_precision\": \"high\"},  # You can set load_args for future use\n    save_args={\"float_format\": \"%.16e\"},\n)\n\nmodel_kind = \"LogisticRegression\"\nmodel_params_dict = {\n    \"C\": 1.23456,\n    \"max_iter\": 987,\n    \"random_state\": 42,\n}\ncols_features = [\n    \"Pclass\",  # The passenger's ticket class\n    \"Parch\",  # # of parents / children aboard the Titanic\n]\ncol_target = \"Survived\"  # Column used as the target: whether the passenger survived or not\n\n\n# 
Run tasks: can be configured as a pipeline using Kedro\n# and can be written in parameters config file using PipelineX\n\nif model_kind == \"LogisticRegression\":\n    from sklearn.linear_model import LogisticRegression\n    model = LogisticRegression(**model_params_dict)\n\ntrain_df = train_dataset._load()\nmodel = train_model(model, train_df, cols_features, col_target)\n\ntest_df = test_dataset._load()\npred_df = run_inference(model, test_df, cols_features)\n\npred_dataset._save(pred_df)\n\n```\n\nJust following the data interface framework might be somewhat beneficial in the long run, but it is not enough.\n\nLet's see what Kedro and PipelineX can do.\n\n\n### Kedro overview\n\nKedro is a Python package to develop pipelines consisting of:\n\n- data interface sets (data loading/saving wrappers, called \"DataSets\", that follow the unified data interface framework) such as:\n  - [`pandas.CSVDataSet`](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.CSVDataSet.html#kedro.extras.datasets.pandas.CSVDataSet): a CSV file in local or cloud (Amazon S3, Google Cloud Storage) utilizing [filesystem_spec (`fsspec`)](https://github.com/intake/filesystem_spec)\n  - [`pickle.PickleDataSet`](https://kedro.readthedocs.io/en/latest/kedro.extras.datasets.pickle.PickleDataSet.html): a pickle file in local or cloud (Amazon S3, Google Cloud Storage) utilizing [filesystem_spec (`fsspec`)](https://github.com/intake/filesystem_spec)\n  - [`pandas.SQLTableDataSet`](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.pandas.SQLTableDataSet.html#kedro.extras.datasets.pandas.SQLTableDataSet): a table in an SQL database supported by [SQLAlchemy](https://www.sqlalchemy.org/features.html)\n  - [data interface sets for Spark, Google BigQuery, Feather, HDF, Parquet, Matplotlib, NetworkX, Excel, and more provided by Kedro](https://kedro.readthedocs.io/en/stable/kedro.extras.datasets.html#data-sets)\n  - Custom data interface sets provided by Kedro users\n\n- 
tasks/operations/transformations (called \"Nodes\") provided by Kedro users such as:\n  - data pre-processing\n  - training a model\n  - inference using a model\n\n- inter-task dependency provided by Kedro users\n\nKedro pipelines can be run sequentially or in parallel.\n\nRegarding Kedro, please see:\n- \u003c[Kedro's documentation](https://kedro.readthedocs.io/en/stable/)\u003e\n- \u003c[YouTube playlist: Writing Data Pipelines with Kedro](https://www.youtube.com/playlist?list=PLTU89LAWKRwEdiDKeMOU2ye6yU9Qd4MRo)\u003e\n- \u003c[Python Packages for Pipeline/Workflow](https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow)\u003e\n\nHere is a simple example Kedro project.\n\n\n```yaml\n#  catalog.yml\n\ntrain_df:\n  type: pandas.CSVDataSet # short for kedro.extras.datasets.pandas.CSVDataSet\n  filepath: data/input/train.csv\n  load_args:\n    float_precision: high\n  # save_args: # You can set save_args for future use\n  #   float_format: \"%.16e\"\n\ntest_df:\n  type: pandas.CSVDataSet # short for kedro.extras.datasets.pandas.CSVDataSet\n  filepath: data/input/test.csv\n  load_args:\n    float_precision: high\n  # save_args: # You can set save_args for future use\n  #   float_format: \"%.16e\"\n\npred_df:\n  type: pandas.CSVDataSet # short for kedro.extras.datasets.pandas.CSVDataSet\n  filepath: data/load/pred.csv\n  # load_args: # You can set load_args for future use\n  #   float_precision: high\n  save_args:\n    float_format: \"%.16e\"\n```\n\n```yaml\n# parameters.yml\n\nmodel:\n  !!python/object:sklearn.linear_model.LogisticRegression\n  C: 1.23456\n  max_iter: 987\n  random_state: 42\ncols_features: # Columns used as features in the Titanic data table\n  - Pclass # The passenger's ticket class\n  - Parch # # of parents / children aboard the Titanic\ncol_target: Survived # Column used as the target: whether the passenger survived or not\n```\n\n```python\n# pipeline.py\n\nfrom kedro.pipeline import Pipeline, node\n\nfrom my_module import train_model, 
run_inference\n\ndef create_pipeline(**kwargs):\n    return Pipeline(\n        [\n            node(\n                func=train_model,\n                inputs=[\"params:model\", \"train_df\", \"params:cols_features\", \"params:col_target\"],\n                outputs=\"model\",\n            ),\n            node(\n                func=run_inference,\n                inputs=[\"model\", \"test_df\", \"params:cols_features\"],\n                outputs=\"pred_df\",\n            ),\n        ]\n    )\n```\n\n```python\n# run.py\n\nfrom kedro.runner import SequentialRunner\n\n# Set up ProjectContext here\n\ncontext = ProjectContext()\ncontext.run(pipeline_name=\"__default__\", runner=SequentialRunner())\n```\n\nKedro pipelines can be visualized using [kedro-viz](https://github.com/quantumblacklabs/kedro-viz).\n\nKedro pipelines can be productionized using:\n- [kedro-airflow](https://github.com/quantumblacklabs/kedro-airflow): converts a Kedro pipeline into Airflow Python operators.\n- [kedro-docker](https://github.com/quantumblacklabs/kedro-docker): builds a Docker image that can run a Kedro pipeline \n- [kedro-argo](https://github.com/nraw/kedro-argo): converts a Kedro pipeline into an Argo (backend of Kubeflow) pipeline\n\n\n## Flex-Kedro: Kedro plugin for flexible config\n\n[API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.flex_kedro.html)\n\nFlex-Kedro provides more options to configure Kedro projects flexibly, and thus quickly, through the Flex-Kedro-Pipeline and Flex-Kedro-Context features.\n\n### Flex-Kedro-Pipeline: Kedro plugin for quicker pipeline set up \n\nIf you want to define Kedro pipelines quickly, consider using `pipelinex.FlexiblePipeline` instead of `kedro.pipeline.Pipeline`. \n`pipelinex.FlexiblePipeline` adds the following options to `kedro.pipeline.Pipeline`.\n\n#### Dict for nodes\n\nTo define each node, a dict can be used instead of `kedro.pipeline.node`. 
\n\n  Example:\n\n  ```python\n  pipelinex.FlexiblePipeline(\n      nodes=[dict(func=task_func1, inputs=\"my_input\", outputs=\"my_output\")]\n  )\n  ```\n\n  will be equivalent to:\n\n  ```python\n  kedro.pipeline.Pipeline(\n      nodes=[\n          kedro.pipeline.node(func=task_func1, inputs=\"my_input\", outputs=\"my_output\")\n      ]\n  )\n  ```\n\n#### Sequential nodes\n\nFor sub-pipelines consisting of nodes of only single input and single output, you can optionally use Sequential API similar to PyTorch (`torch.nn.Sequential`) and Keras (`tf.keras.Sequential`)\n\n  Example:\n\n  ```python\n  pipelinex.FlexiblePipeline(\n      nodes=[\n          dict(\n              func=[task_func1, task_func2, task_func3],\n              inputs=\"my_input\",\n              outputs=\"my_output\",\n          )\n      ]\n  )\n  ```\n\n  will be equivalent to:\n\n  ```python\n  kedro.pipeline.Pipeline(\n      nodes=[\n          kedro.pipeline.node(\n              func=task_func1, inputs=\"my_input\", outputs=\"my_output__001\"\n          ),\n          kedro.pipeline.node(\n              func=task_func2, inputs=\"my_output__001\", outputs=\"my_output__002\"\n          ),\n          kedro.pipeline.node(\n              func=task_func3, inputs=\"my_output__002\", outputs=\"my_output\"\n          ),\n      ]\n  )\n  ```\n\n#### Decorators without using the method\n\n- Optionally specify the Python function decorator(s) to apply to multiple nodes under the pipeline using `decorator` argument instead of using [`decorate`](https://kedro.readthedocs.io/en/stable/kedro.pipeline.Pipeline.html#kedro.pipeline.Pipeline.decorate) method of `kedro.pipeline.Pipeline`.\n\n  Example:\n\n  ```python\n  pipelinex.FlexiblePipeline(\n      nodes=[\n          kedro.pipeline.node(func=task_func1, inputs=\"my_input\", outputs=\"my_output\")\n      ],\n      decorator=[task_deco, task_deco],\n  )\n  ```\n\n  will be equivalent to:\n\n  ```python\n  kedro.pipeline.Pipeline(\n      nodes=[\n          
kedro.pipeline.node(func=task_func1, inputs=\"my_input\", outputs=\"my_output\")\n      ]\n  ).decorate(task_deco, task_deco)\n\n  ```\n\n- Optionally specify the default Python module (path of .py file) if you do not want to repeat the same (deep and/or long) Python module (e.g. `foo.bar.my_task1`, `foo.bar.my_task2`, etc.)\n\n\n### Flex-Kedro-Context: Kedro plugin for YAML lovers\n\nIf you want to take advantage of YAML more than Kedro supports, consider using \n`pipelinex.FlexibleContext` instead of `kedro.framework.context.KedroContext`. \n`pipelinex.FlexibleContext` adds preprocessing of `parameters.yml` and `catalog.yml` to `kedro.framework.context.KedroContext` to provide flexibility.\nThis option is for YAML lovers only. \nIf you don't like YAML very much, skip this one.\n\n#### Define Kedro pipelines in `parameters.yml`\n  \nYou can define the inter-task dependency (DAG) for Kedro pipelines in `parameters.yml` using the `PIPELINES` key. To define each Kedro pipeline, you can use `kedro.pipeline.Pipeline` or a variant such as `pipelinex.FlexiblePipeline` as shown below.\n\n```yaml\n# parameters.yml\n\nPIPELINES:\n  __default__:\n    =: pipelinex.FlexiblePipeline\n    module: # Optionally specify the default Python module so you can omit the module name to which the functions belong\n    decorator: # Optionally specify function decorator(s) to apply to each node\n    nodes:\n      - inputs: [\"params:model\", train_df, \"params:cols_features\", \"params:col_target\"]\n        func: sklearn_demo.train_model\n        outputs: model\n\n      - inputs: [model, test_df, \"params:cols_features\"]\n        func: sklearn_demo.run_inference\n        outputs: pred_df\n```\n\n#### Configure Kedro run config in `parameters.yml`\n\nYou can specify the run config in `parameters.yml` using the `RUN_CONFIG` key instead of specifying the args for the `kedro run` command for every run. \n\nYou can still pass args to `kedro run` to override these settings. 
\n\nIn addition to the args for `kedro run`, you can opt to run only missing nodes (skipping tasks that have already been run, so that the pipeline resumes from the intermediate data files or databases) using the `only_missing` key.\n\n\n```yaml\n# parameters.yml\n\nRUN_CONFIG:\n  pipeline_name: __default__\n  runner: SequentialRunner # Set to \"ParallelRunner\" to run in parallel\n  only_missing: False # Set True to run only missing nodes\n  tags: # None\n  node_names: # None\n  from_nodes: # None\n  to_nodes: # None\n  from_inputs: # None\n  load_versions: # None\n```\n\n#### Use `HatchDict` feature in `parameters.yml`\n\nYou can use the `HatchDict` feature in `parameters.yml`.\n\n```yaml\n# parameters.yml\n\nmodel:\n  =: sklearn.linear_model.LogisticRegression\n  C: 1.23456\n  max_iter: 987\n  random_state: 42\ncols_features: # Columns used as features in the Titanic data table\n  - Pclass # The passenger's ticket class\n  - Parch # # of parents / children aboard the Titanic\ncol_target: Survived # Column used as the target: whether the passenger survived or not\n```\n\n#### Enable caching for Kedro DataSets in `catalog.yml`\n\nSet the `cached` key to True to enable caching if you do not want Kedro to reload from disk/database data that was already in memory. 
([`kedro.io.CachedDataSet`](https://kedro.readthedocs.io/en/latest/kedro.io.CachedDataSet.html#kedro.io.CachedDataSet) is used under the hood.)\n\n#### Use `HatchDict` feature in `catalog.yml`\n\nYou can use the `HatchDict` feature in `catalog.yml`.\n\n\n## MLflow-on-Kedro: Kedro plugin for MLflow users\n\n[API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.html)\n\n### How to use MLflow from Kedro projects\n\nKedro DataSet and Hooks (callbacks) are provided to use MLflow without adding any MLflow-related code in the node (task) functions.\n\n- [`pipelinex.MLflowDataSet`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.datasets.mlflow.html)\n  \n  Kedro Dataset that saves data to or loads data from MLflow depending on the `dataset` argument as follows.\n\n  - If set to \"p\", the value will be saved/loaded as an MLflow parameter (string).\n\n  - If set to \"m\", the value will be saved/loaded as an MLflow metric (numeric).\n\n  - If set to \"a\", the value will be saved/loaded based on the data type.\n\n      - If the data type is one of {float, int}, the value will be saved/loaded as an MLflow metric.\n\n      - If the data type is one of {str, list, tuple, set}, the value will be saved/loaded as an MLflow parameter.\n\n      - If the data type is dict, the value will be flattened with dot (\".\") as the separator and then saved/loaded as either an MLflow metric or parameter based on each data type as explained above.\n\n  - If set to one of {\"json\", \"csv\", \"xls\", \"parquet\", \"png\", \"jpg\", \"jpeg\", \"img\", \"pkl\", \"txt\", \"yml\", \"yaml\"}, the backend dataset instance will be created accordingly to save/load as an MLflow artifact.\n\n  - If set to a Kedro DataSet object or a dictionary, it will be used as the backend dataset to save/load as an MLflow artifact.\n\n  - If set to None (default), MLflow logging will be skipped.\n\n  Regarding all the options, please see the [API 
document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.datasets.mlflow.html)\n\n- Kedro Hooks \n\n  - [`pipelinex.MLflowBasicLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#module-pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_basic_logger): Configures MLflow logging and logs duration time for the pipeline to MLflow.\n\n  - [`pipelinex.MLflowArtifactsLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#module-pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_artifacts_logger): Logs artifacts of specified file paths and dataset names to MLflow.\n    \n  - [`pipelinex.MLflowDataSetsLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_datasets_logger.MLflowDataSetsLoggerHook): Logs datasets of (list of) float/int and str classes to MLflow.\n\n  - [`pipelinex.MLflowTimeLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_time_logger.MLflowTimeLoggerHook): Logs duration time for each node (task) to MLflow and optionally visualizes the execution logs as a Gantt chart by [`plotly.figure_factory.create_gantt`](https://plotly.github.io/plotly.py-docs/generated/plotly.figure_factory.create_gantt.html) if `plotly` is installed. 
\n  \n  - [`pipelinex.AddTransformersHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.extras.hooks.html#pipelinex.extras.hooks.add_transformers.AddTransformersHook): Adds Kedro transformers such as:\n    - [`pipelinex.MLflowIOTimeLoggerTransformer`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.transformers.mlflow.html#pipelinex.mlflow_on_kedro.transformers.mlflow.mlflow_io_time_logger.MLflowIOTimeLoggerTransformer): Logs duration time to load and save each dataset with args:\n  \n  Regarding all the options, please see the [API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html)\n\nMLflow-ready Kedro projects can be generated by the [Kedro starters](https://github.com/Minyus/kedro-starters-sklearn) (Cookiecutter template) which include the following example config:\n\n```yaml\n# catalog.yml\n\n# Write a pickle file \u0026 upload to MLflow\nmodel:\n  type: pipelinex.MLflowDataSet\n  dataset: pkl\n\n# Write a csv file \u0026 upload to MLflow\npred_df: \n  type: pipelinex.MLflowDataSet\n  dataset: csv\n\n# Write an MLflow metric\nscore:\n  type: pipelinex.MLflowDataSet\n  dataset: m  \n```\n\n```python\n# catalog.py (alternative to catalog.yml)\n\ncatalog_dict = {\n  \"model\": MLflowDataSet(dataset=\"pkl\"),  # Write a pickle file \u0026 upload to MLflow\n  \"pred_df\": MLflowDataSet(dataset=\"csv\"),  # Write a csv file \u0026 upload to MLflow\n  \"score\": MLflowDataSet(dataset=\"m\"),  # Write an MLflow metric\n}\n```\n\n```python\n# mlflow_config.py\n\nimport pipelinex\n\nmlflow_hooks = (\n    pipelinex.MLflowBasicLoggerHook(\n        uri=\"sqlite:///mlruns/sqlite.db\",\n        experiment_name=\"experiment_001\",\n        artifact_location=\"./mlruns/experiment_001\",\n        offset_hours=0,\n    ),\n    pipelinex.MLflowCatalogLoggerHook(\n        auto=True,\n    ),\n    pipelinex.MLflowArtifactsLoggerHook(\n        filepaths_before_pipeline_run=[\"conf/base/parameters.yml\"],\n    
    filepaths_after_pipeline_run=[\n            \"info.log\",\n            \"errors.log\",\n        ],\n    ),\n    pipelinex.MLflowEnvVarsLoggerHook(\n        param_env_vars=[\"HOSTNAME\"],\n        metric_env_vars=[],\n    ),\n    pipelinex.MLflowTimeLoggerHook(),\n)\n``` \n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/Minyus/pipelinex/master/_doc_images/mlflow_ui_metrics.png\"\u003e\nLogged metrics shown in MLflow's UI\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/Minyus/pipelinex/master/_doc_images/mlflow_ui_gantt.png\"\u003e\nGantt chart for execution time, generated using Plotly, shown in MLflow's UI\n\u003c/p\u003e\n\n### Comparison with `kedro-mlflow` package\n\nBoth [PipelineX](https://pipelinex.readthedocs.io/)'s MLflow-on-Kedro and [kedro-mlflow](https://kedro-mlflow.readthedocs.io/) provide integration of MLflow to Kedro. \nHere are the comparisons.\n\n- Features supported by both PipelineX and kedro-mlflow\n  - Kedro DataSets and Hooks to log (save/upload) artifacts, parameters, and metrics to MLflow.\n  - Truncate MLflow parameter values to 250 characters to avoid error due to MLflow parameter length limit.\n  - Dict values can be flattened using dot (\".\") as the separator to log each value inside the dict separately.\n\n- Features supported by only PipelineX\n  - [Time logging] Option to log execution time for each task (Kedro node) as MLflow metrics\n  - [Gantt logging] Option to log Gantt chart HTML file that visualizes execution time using Plotly as an MLflow artifact (inspired by [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/ui.html#gantt-chart))\n  - [Automatic backend Kedro DataSets for common artifacts] Option to specify a common file extension ({\"json\", \"csv\", \"xls\", \"parquet\", \"png\", \"jpg\", \"jpeg\", \"img\", \"pkl\", \"txt\", \"yml\", \"yaml\"}) so the Kedro DataSet object will be created behind the scene 
instead of manually specifying a Kedro DataSet including filepath in the catalog (inspired by [Kedro Wings](https://github.com/tamsanh/kedro-wings#default-datasets)).\n  - [Automatic logging for MLflow parameters and metrics] Option to log each dataset not listed in the catalog as an MLflow parameter or metric, instead of manually specifying a Kedro DataSet in the catalog.\n    - If the data type is one of {float, int}, the value will be saved/loaded \n    as an MLflow metric.\n    - If the data type is one of {str, list, tuple, set}, the value will be \n    saved/loaded as an MLflow parameter.\n    - If the data type is dict, the value will be flattened with dot (\".\") as\n    the separator and then saved/loaded as either an MLflow metric or parameter \n    based on each data type as explained above. \n    - For example, `\"data_loading_config\": {\"train\": {\"batch_size\": 32}}` will be logged as MLflow metric of `\"data_loading_config.train.batch_size\": 32`\n  - [Flexible config per DataSet] Each Kedro DataSet can be configured differently. For example, a dict value can be logged as an MLflow parameter (string) as is while another one can be logged as an MLflow metric after being flattened.\n  - [Direct artifact logging] Option to specify the paths of any data to log as MLflow artifacts after Kedro pipeline runs without using a Kedro DataSet, which is useful if you want to save local files (e.g. info/warning/error log files, intermediate model weights saved by Machine Learning packages such as PyTorch and TensorFlow, etc.) 
\n  - [Environment Variable logging] Option to log Environment Variables\n  - [Downloading] Option to download MLflow artifacts, params, metrics from an existing MLflow experiment run using the Kedro DataSet\n  - [Up to date] Support for Kedro 0.17.x (released in Dec 2020) or later \n\n- Features provided by only kedro-mlflow\n  - A wrapper for MLflow's `log_model`\n  - Configure MLflow logging in a YAML file\n  - Option to use MLflow tag or raise error if MLflow parameter values exceed 250 characters\n\n\n## Kedro-Extras: Kedro plugin to use various Python packages \n\n[API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.extras.html)\n\nKedro-Extras provides Kedro DataSets and decorators not available in [kedro.extras](https://github.com/quantumblacklabs/kedro/tree/master/kedro/extras).\n\nContributors willing to help prepare the test code and send pull requests to Kedro, following Kedro's [CONTRIBUTING.md](https://github.com/quantumblacklabs/kedro/blob/master/CONTRIBUTING.md#contribute-a-new-feature), are welcome.\n\n### Additional Kedro datasets (data interface sets)\n  \n[pipelinex.extras.datasets](https://github.com/Minyus/pipelinex/tree/master/src/pipelinex/extras/datasets) provides the following Kedro Datasets (data interface sets) mainly for Computer Vision applications using PyTorch/torchvision, OpenCV, and Scikit-image.\n\n- [pipelinex.ImagesLocalDataSet](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/pillow/images_dataset.py)\n  - loads/saves multiple numpy arrays (RGB, BGR, or monochrome image) from/to a folder in local storage using `pillow` package, working like `kedro.extras.datasets.pillow.ImageDataSet` and\n  `kedro.io.PartitionedDataSet` with conversion between numpy arrays and Pillow images.\n  - an example project is at [pipelinex_image_processing](https://github.com/Minyus/pipelinex_image_processing)\n- 
[pipelinex.APIDataSet](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/requests/api_dataset.py)\n  - modified version of [kedro.extras.APIDataSet](https://github.com/quantumblacklabs/kedro/blob/master/kedro/extras/datasets/api/api_dataset.py) with more flexible options including downloading multiple contents (such as images and json) by HTTP requests to multiple URLs using `requests` package\n  - an example project is at [pipelinex_image_processing](https://github.com/Minyus/pipelinex_image_processing)\n- [pipelinex.AsyncAPIDataSet](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/httpx/async_api_dataset.py)\n  - downloads multiple contents (such as images and json) by asynchronous HTTP requests to multiple URLs using `httpx` package\n  - an example project is at [pipelinex_image_processing](https://github.com/Minyus/pipelinex_image_processing)\n\n- [pipelinex.IterableImagesDataSet](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/torchvision/iterable_images_dataset.py)\n  - wrapper of [`torchvision.datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) that loads images in a folder as an iterable data loader to use with PyTorch.\n\n- [pipelinex.PandasProfilingDataSet](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/pandas_profiling/pandas_profiling.py)\n  - generates a pandas dataframe summary report using [pandas-profiling](https://github.com/pandas-profiling/pandas-profiling)\n\n- [more data interface sets for pandas dataframe summarization/visualization provided by PipelineX](https://github.com/Minyus/pipelinex/tree/master/src/pipelinex/extras/datasets)\n\n### Additional function decorators for benchmarking\n\n[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Minyus/pipelinex/blob/master/notebooks/decorators_demo.ipynb)\n\n[pipelinex.extras.decorators](https://github.com/Minyus/pipelinex/tree/master/src/pipelinex/extras/decorators) provides Python decorators for benchmarking.\n\n- [log_time](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/decorators/decorators.py)\n  - logs the duration time of a function (difference of timestamp before and after running the function).\n  - Slightly modified version of Kedro's [log_time](https://github.com/quantumblacklabs/kedro/blob/develop/kedro/pipeline/decorators.py#L59)\n\n- [mem_profile](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/decorators/memory_profiler.py)\n  - logs the peak memory usage during running the function.\n  - `memory_profiler` needs to be installed.\n  - Slightly modified version of Kedro's [mem_profile](https://github.com/quantumblacklabs/kedro/blob/develop/kedro/extras/decorators/memory_profiler.py#L48)\n\n- [nvml_profile](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/decorators/nvml_profiler.py)\n  - logs the difference of NVIDIA GPU usage before and after running the function.\n  - `pynvml` or `py3nvml` needs to be installed.\n\n```python\nfrom pipelinex import log_time\nfrom pipelinex import mem_profile  # Need to install memory_profiler for memory profiling\nfrom pipelinex import nvml_profile  # Need to install pynvml for NVIDIA GPU profiling\nfrom time import sleep\nimport logging\n\nlogging.basicConfig(level=logging.INFO)\n\n@nvml_profile\n@mem_profile\n@log_time\ndef foo_func(i=1):\n    sleep(0.5)  # Needed to avoid the bug reported at https://github.com/pythonprofilers/memory_profiler/issues/216\n    return \"a\" * i\n\noutput = foo_func(100_000_000)\n```\n\n```\nINFO:pipelinex.decorators.decorators:Running 'foo_func' took 549ms 
[0.549s]\nINFO:pipelinex.decorators.memory_profiler:Running 'foo_func' consumed 579.02MiB memory at peak time\nINFO:pipelinex.decorators.nvml_profiler:Ran: 'foo_func', NVML returned: {'_Driver_Version': '418.67', '_NVML_Version': '10.418.67', 'Device_Count': 1, 'Devices': [{'_Name': 'Tesla P100-PCIE-16GB', 'Total_Memory': 17071734784, 'Free_Memory': 17071669248, 'Used_Memory': 65536, 'GPU_Utilization_Rate': 0, 'Memory_Utilization_Rate': 0}]}, Used memory diff: [0]\n```\n\n### Use with PyTorch\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Minyus/pipelinex/blob/master/notebooks/PyTorch_demo.ipynb)\n\nTo develop a simple neural network, it is convenient to use Sequential API\n(e.g. `torch.nn.Sequential`, `tf.keras.Sequential`).\n\n- Hardcoded:\n\n```python\nfrom torch.nn import Sequential, Conv2d, ReLU\n\nmodel = Sequential(\n    Conv2d(in_channels=3, out_channels=16, kernel_size=[3, 3]),\n    ReLU(),\n)\n\nprint(\"### model object by hard-coding ###\")\nprint(model)\n```\n\n```\n### model object by hard-coding ###\nSequential(\n  (0): Conv2d(3, 16, kernel_size=[3, 3], stride=(1, 1))\n  (1): ReLU()\n)\n```\n\n- Using import-less Python object feature:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\nmodel:\n  =: torch.nn.Sequential\n  _:\n    - {=: torch.nn.Conv2d, in_channels: 3, out_channels: 16, kernel_size: [3, 3]}\n    - {=: torch.nn.ReLU, _: }\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nmodel_dict = parameters.get(\"model\")\n\nprint(\"### Before ###\")\npprint(model_dict)\n\nmodel = HatchDict(parameters).get(\"model\")\n\nprint(\"\\n### After ###\")\nprint(model)\n```\n\n```\n### Before ###\n{'=': 'torch.nn.Sequential',\n '_': [{'=': 'torch.nn.Conv2d',\n        'in_channels': 3,\n        'kernel_size': [3, 3],\n        
'out_channels': 16},\n       {'=': 'torch.nn.ReLU', '_': None}]}\n\n### After ###\nSequential(\n  (0): Conv2d(3, 16, kernel_size=[3, 3], stride=(1, 1))\n  (1): ReLU()\n)\n```\n\nIn addition to `Sequential`, TensorFlow/Keras provides modules to merge branches such as\n`tf.keras.layers.Concatenate`, but PyTorch provides only functional interfaces such as `torch.cat`.\n\nPipelineX provides modules to merge branches such as `ModuleConcat`, `ModuleSum`, and `ModuleAvg`.\n\n- Hardcoded:\n\n```python\nfrom torch.nn import Sequential, Conv2d, AvgPool2d, ReLU\nfrom pipelinex import ModuleConcat\n\nmodel = Sequential(\n    ModuleConcat(\n        Conv2d(in_channels=3, out_channels=16, kernel_size=[3, 3], stride=[2, 2], padding=[1, 1]),\n        AvgPool2d(kernel_size=[3, 3], stride=[2, 2], padding=[1, 1]),\n    ),\n    ReLU(),\n)\nprint(\"### model object by hard-coding ###\")\nprint(model)\n```\n\n```\n### model object by hard-coding ###\nSequential(\n  (0): ModuleConcat(\n    (0): Conv2d(3, 16, kernel_size=[3, 3], stride=[2, 2], padding=[1, 1])\n    (1): AvgPool2d(kernel_size=[3, 3], stride=[2, 2], padding=[1, 1])\n  )\n  (1): ReLU()\n)\n```\n\n- Using import-less Python object feature:\n\n```python\nfrom pipelinex import HatchDict\nimport yaml\nfrom pprint import pprint  # pretty-print for clearer look\n\n# Read parameters dict from a YAML file in actual use\nparams_yaml=\"\"\"\nmodel:\n  =: torch.nn.Sequential\n  _:\n    - =: pipelinex.ModuleConcat\n      _:\n        - {=: torch.nn.Conv2d, in_channels: 3, out_channels: 16, kernel_size: [3, 3], stride: [2, 2], padding: [1, 1]}\n        - {=: torch.nn.AvgPool2d, kernel_size: [3, 3], stride: [2, 2], padding: [1, 1]}\n    - {=: torch.nn.ReLU, _: }\n\"\"\"\nparameters = yaml.safe_load(params_yaml)\n\nmodel_dict = parameters.get(\"model\")\n\nprint(\"### Before ###\")\npprint(model_dict)\n\nmodel = HatchDict(parameters).get(\"model\")\n\nprint(\"\\n### After ###\")\nprint(model)\n```\n\n```\n### Before ###\n{'=': 
'torch.nn.Sequential',\n '_': [{'=': 'pipelinex.ModuleConcat',\n        '_': [{'=': 'torch.nn.Conv2d',\n               'in_channels': 3,\n               'kernel_size': [3, 3],\n               'out_channels': 16,\n               'padding': [1, 1],\n               'stride': [2, 2]},\n              {'=': 'torch.nn.AvgPool2d',\n               'kernel_size': [3, 3],\n               'padding': [1, 1],\n               'stride': [2, 2]}]},\n       {'=': 'torch.nn.ReLU', '_': None}]}\n\n### After ###\nSequential(\n  (0): ModuleConcat(\n    (0): Conv2d(3, 16, kernel_size=[3, 3], stride=[2, 2], padding=[1, 1])\n    (1): AvgPool2d(kernel_size=[3, 3], stride=[2, 2], padding=[1, 1])\n  )\n  (1): ReLU()\n)\n```\n\n### Use with PyTorch Ignite\n\nWrappers of PyTorch Ignite provide most of the features available in Ignite, including integration with MLflow, in an easy declarative way.\n\nIn addition, the following optional features are available in PipelineX.\n\n- Use only partial samples in dataset (Useful for quick preliminary check before using the whole dataset)\n- Time limit for training (Useful for code-only (Kernel-only) Kaggle competitions with time limit)\n\nHere are the arguments for [`NetworkTrain`](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/ops/ignite/declaratives/declarative_trainer.py):\n\n```\nloss_fn (callable): Loss function used to train.\n    Accepts an instance of loss functions at https://pytorch.org/docs/stable/nn.html#loss-functions\nepochs (int, optional): Max epochs to train\nseed (int, optional): Random seed for training.\noptimizer (torch.optim, optional): Optimizer used to train.\n    Accepts optimizers at https://pytorch.org/docs/stable/optim.html\noptimizer_params (dict, optional): Parameters for optimizer.\ntrain_data_loader_params (dict, optional): Parameters for data loader for training.\n    Accepts args at https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader\nval_data_loader_params (dict, optional): Parameters for 
data loader for validation.\n    Accepts args at https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader\nevaluation_metrics (dict, optional): Metrics to compute for evaluation.\n    Accepts dict of metrics at https://pytorch.org/ignite/metrics.html\nevaluate_train_data (str, optional): When to compute evaluation_metrics using training dataset.\n    Accepts events at https://pytorch.org/ignite/engine.html#ignite.engine.Events\nevaluate_val_data (str, optional): When to compute evaluation_metrics using validation dataset.\n    Accepts events at https://pytorch.org/ignite/engine.html#ignite.engine.Events\nprogress_update (bool, optional): Whether to show progress bar using tqdm package\nscheduler (ignite.contrib.handlers.param_scheduler.ParamScheduler, optional): Param scheduler.\n    Accepts a ParamScheduler at\n    https://pytorch.org/ignite/contrib/handlers.html#module-ignite.contrib.handlers.param_scheduler\nscheduler_params (dict, optional): Parameters for scheduler\nmodel_checkpoint (ignite.handlers.ModelCheckpoint, optional): Model Checkpoint.\n    Accepts a ModelCheckpoint at https://pytorch.org/ignite/handlers.html#ignite.handlers.ModelCheckpoint\nmodel_checkpoint_params (dict, optional): Parameters for ModelCheckpoint at\n    https://pytorch.org/ignite/handlers.html#ignite.handlers.ModelCheckpoint\nearly_stopping_params (dict, optional): Parameters for EarlyStopping at\n    https://pytorch.org/ignite/handlers.html#ignite.handlers.EarlyStopping\ntime_limit (int, optional): Time limit for training in seconds.\ntrain_dataset_size_limit (int, optional): If specified, only the subset of training dataset is used.\n    Useful for quick preliminary check before using the whole dataset.\nval_dataset_size_limit (int, optional): If specified, only the subset of validation dataset is used.\n    Useful for quick preliminary check before using the whole dataset.\ncudnn_deterministic (bool, optional): Value for torch.backends.cudnn.deterministic.\n    See 
https://pytorch.org/docs/stable/notes/randomness.html for details.\ncudnn_benchmark (bool, optional): Value for torch.backends.cudnn.benchmark.\n    See https://pytorch.org/docs/stable/notes/randomness.html for details.\nmlflow_logging (bool, optional): If True and MLflow is installed, MLflow logging is enabled.\n```\n\nPlease see the [example code using MNIST dataset](https://github.com/Minyus/pipelinex/blob/master/examples/mnist/mnist_with_declarative_trainer.py) prepared based on the [original code](https://github.com/pytorch/ignite/blob/master/examples/mnist/mnist.py).\n\nIt is also possible to use:\n\n- [FlexibleModelCheckpoint](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/ops/ignite/handlers/flexible_checkpoint.py) handler, which lets you use a timestamp in the model checkpoint file name to clarify which one is the latest.\n- [CohenKappaScore](https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/ops/ignite/metrics/cohen_kappa_score.py) metric, which can compute the Quadratic Weighted Kappa metric used in some Kaggle competitions. See [sklearn.metrics.cohen_kappa_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) for details.\n\nIt is planned to port some [code used with PyTorch Ignite](https://github.com/Minyus/pipelinex/tree/master/src/pipelinex/ops/ignite) to [PyTorch Ignite](https://github.com/pytorch/ignite) repository once test and example codes are prepared.\n\n### Use with OpenCV\n\nA challenge of image processing is that the parameters and algorithms that work with one image often do not work with another. You will want to output intermediate images from each image processing pipeline step for visual checks during development, but you will not want to output all the intermediate images in production, to save time and disk space.\n\nWrappers of OpenCV and `ImagesLocalDataSet` are the solution. 
You can concentrate on developing your image processing pipeline for an image (3-D or 2-D numpy array), and it will run for all the images in a folder.\n\nIf you are developing an image processing pipeline consisting of 5 steps and you have 10 images, for example, you can check 10 generated images in each of 5 folders, 50 images in total, during development.\n\n\n## Story behind PipelineX\n\nWhen I was working on a Deep Learning project, it was very time-consuming to develop the pipeline for experimentation.\nI wanted 2 features.\n\nThe first was an option to resume the pipeline using the intermediate data files instead of running the whole pipeline.\nThis was important for rapid Machine/Deep Learning experimentation.\n\nThe second was modularity, which means keeping the 3 components, task processing, file/database access, and DAG definition, independent.\nThis was important for efficient software engineering.\n\nAfter this project, I looked for a long-term solution.\nI researched 3 Python packages for pipeline development, Airflow, Luigi, and Kedro, but none of these could be a solution.\n\nLuigi provided a resuming feature, but did not offer modularity.\nKedro offered modularity, but did not provide a resuming feature.\n\nAfter this research, I decided to develop my own package that works on top of Kedro.\nBesides, I added syntactic sugar including a Sequential API similar to Keras and PyTorch to define DAG.\nFurthermore, I added integration with MLflow, PyTorch, Ignite, pandas, OpenCV, etc. 
while working on more Machine/Deep Learning projects.\n\nAfter I confirmed my package worked well with the Kaggle competition, I released it as PipelineX.\n\n## Author\n\n[Yusuke Minami @Minyus](https://github.com/Minyus)\n\n- \u003c[Linkedin](https://www.linkedin.com/in/yusukeminami/)\u003e\n- \u003c[Twitter](https://twitter.com/Minyus86)\u003e\n\n## Contributors are welcome!\n\n### How to contribute\n\nPlease see [CONTRIBUTING.md](https://github.com/Minyus/pipelinex/blob/master/CONTRIBUTING.md) for details.\n\n### Contributor list\n\n- \u003c[@shibuiwilliam](https://github.com/shibuiwilliam)\u003e\n- \u003c[@MarchRaBBiT](https://github.com/MarchRaBBiT)\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMinyus%2Fpipelinex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMinyus%2Fpipelinex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMinyus%2Fpipelinex/lists"}