{"id":30348885,"url":"https://github.com/rvandewater/recipies","last_synced_at":"2025-08-18T19:14:52.674Z","repository":{"id":115800574,"uuid":"570525006","full_name":"rvandewater/ReciPies","owner":"rvandewater","description":"🥧 Easily define reproducible preprocessing steps for ML on Polars and Pandas dataframes. ","archived":false,"fork":false,"pushed_at":"2025-07-30T09:24:53.000Z","size":4179,"stargazers_count":4,"open_issues_count":0,"forks_count":3,"subscribers_count":1,"default_branch":"development","last_synced_at":"2025-08-02T01:34:54.893Z","etag":null,"topics":["data-science","machine-learning","pandas","polars","python","scikit-learn","tidymodels"],"latest_commit_sha":null,"homepage":"https://rvandewater.github.io/ReciPies/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rvandewater.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-11-25T11:56:34.000Z","updated_at":"2025-07-30T09:24:56.000Z","dependencies_parsed_at":null,"dependency_job_id":"edfc8094-5883-4b0b-a1da-20b8343899fe","html_url":"https://github.com/rvandewater/ReciPies","commit_stats":{"total_commits":173,"total_committers":4,"mean_commits":43.25,"dds":0.5028901734104047,"last_synced_commit":"b0bb6301166553866ab96fe11ccbd77293d307c5"},"previous_names":["rvandewater/recipys"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/rvandewater/ReciPies","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rvandewater%2FReciPies","tags_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rvandewater%2FReciPies/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rvandewater%2FReciPies/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rvandewater%2FReciPies/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rvandewater","download_url":"https://codeload.github.com/rvandewater/ReciPies/tar.gz/refs/heads/development","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rvandewater%2FReciPies/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271044714,"owners_count":24690014,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","machine-learning","pandas","polars","python","scikit-learn","tidymodels"],"created_at":"2025-08-18T19:14:47.720Z","updated_at":"2025-08-18T19:14:52.656Z","avatar_url":"https://github.com/rvandewater.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/rvandewater/ReciPies/blob/development/docs/figures/recipies_logo.svg?raw=true\" \nalt=\"recipies logo\" height=\"300\"\u003e\n\u003c/div\u003e\n\n# ReciPies 
🥧\n\n[![CI](https://github.com/rvandewater/ReciPies/actions/workflows/ci.yml/badge.svg)](https://github.com/rvandewater/ReciPies/actions/workflows/ci.yml)\n![Platform](https://img.shields.io/badge/platform-linux--64%20|%20win--64%20|%20osx--64-lightgrey)\n[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)\n[![PyPI version shields.io](https://img.shields.io/pypi/v/recipies.svg)](https://pypi.python.org/pypi/recipies/)\n[![Python Version](https://img.shields.io/pypi/pyversions/recipies.svg)](https://pypi.python.org/pypi/recipies/)\n[![Downloads](https://pepy.tech/badge/recipies)](https://pepy.tech/project/recipies)\n[![arXiv](https://img.shields.io/badge/arXiv-2306.05109-b31b1b.svg)](http://arxiv.org/abs/2306.05109)\n\nModern machine learning (ML) workflows live or die by their data‑preprocessing steps, yet in Python—a language with a \nrich ecosystem for data science and ML—these steps are often scattered across ad‑hoc scripts or opaque Scikit-Learn \n(sklearn) snippets that are hard to read, audit, or reuse. `ReciPies` provides a concise, human‑readable, and fully \nreproducible way to declare, execute, and share preprocessing pipelines, adhering to Configuration as Code principles. \nIt lets users describe transformations as a recipe made of ordered *steps* (e.g., imputing, encoding, normalizing) \napplied to variables identified by semantic roles (predictor, outcome, ID, time stamp, etc.). Recipes can be *prepped* \n(trained) once, *baked* many times, and cleanly separated between training and new data—preventing data leakage by \nconstruction. Under the hood, `ReciPies` targets both Pandas and Polars backends for performance and flexibility, and \nit is easily extensible: users can register custom steps with minimal boilerplate. Each recipe is serializable to \nJSON/YAML for provenance tracking, collaboration, and publication, and integrates smoothly with downstream modeling \nlibraries. 
Packaging preprocessing as clear, declarative objects, `ReciPies` lowers the cognitive load of feature \nengineering, improves reproducibility, and makes methodological choices explicit, benefiting individual researchers, \nengineering teams, and peer reviewers alike.\n\nThe backend can be either [Polars](https://github.com/pola-rs/polars) or [Pandas](https://github.com/pandas-dev/pandas) dataframes. \nThe operation of this package is inspired by the R package [recipes](https://recipes.tidymodels.org/). Please check the [documentation](https://rvandewater.github.io/ReciPies/) for more details.\n\n## Installation\n\nYou can install ReciPies from pip using:\n\n```\npip install recipies\n```\n\n\u003e Note that the package is called `recipies` on pip.\n\nYou can install ReciPies from source to ensure you have the latest version:\n\n```\nconda env update -f environment.yml\nconda activate ReciPies\npip install -e .\n```\n\n\u003e Note that the last command installs the package called `recipies`.\n\n## Quick Start\n\nHere's a simple example of using ReciPies:\n\n```python\n# Import necessary libraries\nimport polars as pl\nimport numpy as np\nfrom datetime import datetime, MINYEAR\nfrom recipies import Ingredients, Recipe\nfrom recipies.selector import all_numeric_predictors, all_predictors\nfrom recipies.step import StepSklearn, StepHistorical, Accumulator, StepImputeFill\nfrom sklearn.impute import MissingIndicator\n\n# Set up random state for reproducible results\nrand_state = np.random.RandomState(42)\n\n# Create time columns for two different groups\ntimecolumn = pl.concat([\n  pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), \"1h\", eager=True),\n  pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), \"1h\", eager=True)\n])\n\n# Create sample DataFrame\ndf = pl.DataFrame({\n  \"id\": [1] * 6 + [2] * 4,\n  \"time\": timecolumn,\n  \"y\": rand_state.normal(size=(10,)),\n  \"x1\": rand_state.normal(loc=10, scale=5, 
size=(10,)),\n  \"x2\": rand_state.binomial(n=1, p=0.3, size=(10,)),\n  \"x3\": pl.Series([\"a\", \"b\", \"c\", \"a\", \"c\", \"b\", \"c\", \"a\", \"b\", \"c\"], dtype=pl.Categorical),\n  \"x4\": pl.Series([\"x\", \"y\", \"y\", \"x\", \"y\", \"y\", \"x\", \"x\", \"y\", \"x\"], dtype=pl.Categorical),\n})\n\n# Introduce some missing values\ndf = df.with_columns(\n  pl.when(pl.int_range(pl.len()).is_in([1, 2, 4, 7]))\n  .then(None)\n  .otherwise(pl.col(\"x1\"))\n  .alias(\"x1\")\n)\n\ndf2 = df.clone()\n\n# Create Ingredients and Recipe\ning = Ingredients(df)\nrec = Recipe(\n  ing,\n  outcomes=[\"y\"],\n  predictors=[\"x1\", \"x2\", \"x3\", \"x4\"],\n  groups=[\"id\"],\n  sequences=[\"time\"]\n)\n\nrec.add_step(StepSklearn(MissingIndicator(features=\"all\"), sel=all_predictors()))\nrec.add_step(StepImputeFill(sel=all_predictors(), strategy=\"forward\"))\nrec.add_step(StepHistorical(sel=all_predictors(), fun=Accumulator.MEAN, suffix=\"mean_hist\"))\n\n# Apply the recipe to the ingredients\ndf = rec.prep()\n\n# Apply the recipe to a new DataFrame (e.g., test set)\ndf2 = rec.bake(df2)\n```\n\n## Core Concepts\n\n**Ingredients**  \nA wrapper around DataFrames that maintains column role information, ensuring data semantics are preserved during transformations.\n\n**Recipe**  \nA collection of processing steps that can be applied to Ingredients objects to create reproducible data pipelines.\n\n**Step**  \nIndividual data transformation operations that understand column roles and can work with both Polars and Pandas backends.\n\n**Selector**  \nUtilities for selecting columns based on their roles or other criteria.\n\n## Backend Support\n\nReciPies supports both Polars and Pandas backends:\n\n- **Polars**: High-performance DataFrame library with lazy evaluation\n- **Pandas**: Traditional DataFrame library with extensive ecosystem support\n\nThe package automatically detects the backend and provides a consistent API regardless of the underlying DataFrame implementation.\n\n## 
Examples\n\nThe `examples/` directory contains Jupyter notebooks demonstrating various use cases of ReciPies, and the `benchmarks/` directory contains performance comparisons between the Polars and Pandas backends.\n\n## Contributing\n\nContributions are welcome! Please see our contributing guidelines and open an issue or submit a pull request on the [GitHub repository](https://github.com/rvandewater/ReciPies).\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](https://github.com/rvandewater/ReciPies/blob/main/LICENSE) file for details.\n\n\nTo define preprocessing operations, one has to assign _roles_ to the columns of the DataFrame.\nThis lets the user group columns that serve a particular function.\nReciPies then provides several \"steps\" that can be applied to the data, including historical accumulation,\nresampling of the time resolution, a number of imputation methods, and a wrapper for any\n[Scikit-learn](https://github.com/scikit-learn/scikit-learn) preprocessing step.\nWe believe these steps cover the basic preprocessing needs for prepared datasets.\nAny missing step can be added by following the step interface.\n\n## 📄 Paper\n\nIf you use this code in your research, please cite the following publication, which uses ReciPys extensively to create a \ncustomisable preprocessing pipeline (a standalone paper is in preparation):\n\n```\n@inproceedings{vandewaterYetAnotherICUBenchmark2024,\n  title = {Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML},\n  shorttitle = {Yet Another ICU Benchmark},\n  booktitle = {The Twelfth International Conference on Learning Representations},\n  author = {van de Water, Robin and Schmidt, Hendrik Nils Aurel and Elbers, Paul and Thoral, Patrick and Arnrich, Bert and Rockenschaub, Patrick},\n  year = {2024},\n  month = oct,\n  urldate = {2024-02-19},\n  langid = {english},\n}\n```\n\nThis paper can also be found on arXiv: 
https://arxiv.org/pdf/2306.05109.pdf\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frvandewater%2Frecipies","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frvandewater%2Frecipies","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frvandewater%2Frecipies/lists"}