{"id":15288203,"url":"https://github.com/data-science-lab-amsterdam/skippa","last_synced_at":"2025-09-08T08:34:21.780Z","repository":{"id":57140492,"uuid":"430746088","full_name":"data-science-lab-amsterdam/skippa","owner":"data-science-lab-amsterdam","description":"SciKIt-learn Pipeline in PAndas","archived":false,"fork":false,"pushed_at":"2023-08-18T07:39:31.000Z","size":433,"stargazers_count":42,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-08T12:05:53.354Z","etag":null,"topics":["data-science","machine-learning","pandas","pandas-dataframe","pipeline","preprocessing","python","scikit-learn","sklearn"],"latest_commit_sha":null,"homepage":"https://skippa.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/data-science-lab-amsterdam.png","metadata":{"files":{"readme":"README.md","changelog":"HISTORY.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-11-22T14:45:08.000Z","updated_at":"2024-04-10T11:44:19.000Z","dependencies_parsed_at":"2023-01-22T07:02:38.790Z","dependency_job_id":"341c6e67-985a-4bfe-a977-e86831bd406b","html_url":"https://github.com/data-science-lab-amsterdam/skippa","commit_stats":{"total_commits":80,"total_committers":1,"mean_commits":80.0,"dds":0.0,"last_synced_commit":"1eac9e40dc8a8b50ab28b354e696dcd0b05c8784"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-science-lab-amsterdam%2Fskippa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-science-lab-amsterdam%2Fskippa/tags","releases_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-science-lab-amsterdam%2Fskippa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/data-science-lab-amsterdam%2Fskippa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/data-science-lab-amsterdam","download_url":"https://codeload.github.com/data-science-lab-amsterdam/skippa/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248673257,"owners_count":21143462,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","machine-learning","pandas","pandas-dataframe","pipeline","preprocessing","python","scikit-learn","sklearn"],"created_at":"2024-09-30T15:44:41.454Z","updated_at":"2025-04-13T06:27:30.657Z","avatar_url":"https://github.com/data-science-lab-amsterdam.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![pypi](https://img.shields.io/pypi/v/skippa)\n![python versions](https://img.shields.io/pypi/pyversions/skippa)\n![downloads](https://img.shields.io/pypi/dm/skippa)\n![Build status](https://img.shields.io/azure-devops/build/data-science-lab/Intern/263)\n\n\u003cbr\u003e\u003cbr\u003e\n\u003cimg src=\"skippa-logo-transparent.png\" alt=\"logo\" width=\"200\"/\u003e\n\n# Skippa \n\nSciKIt-learn Pre-processing Pipeline in PAndas\n\n\u003e __*Read more in the [introduction blog on towardsdatascience](https://towardsdatascience.com/introducing-skippa-bab260acf6a7)*__\n\n\n\nWant to create a machine learning model using pandas \u0026 scikit-learn? 
This should make your life easier.\n\nSkippa helps you to easily create a pre-processing and modeling pipeline, based on scikit-learn transformers but preserving pandas dataframe format throughout all pre-processing. This makes it a lot easier to define a series of subsequent transformation steps, while referring to columns in your intermediate dataframe.\n\nSo basically the same idea as `scikit-pandas`, but a different (and hopefully better) way to achieve it.\n\n- [pypi](https://pypi.org/project/skippa/)\n- [Documentation](https://skippa.readthedocs.io/)\n\n## Installation\n```\npip install skippa\n```\nOptional, if you want to use the [gradio app functionality](./examples/04-gradio-app.py):\n```\npip install skippa[gradio]\n```\n\n## Basic usage\n\nImport `Skippa` class and `columns` helper function\n```\nimport numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LogisticRegression\n\nfrom skippa import Skippa, columns\n```\n\nGet some data\n```\ndf = pd.DataFrame({\n    'q': [0, 0, 0],\n    'date': ['2021-11-29', '2021-12-01', '2021-12-03'],\n    'x': ['a', 'b', 'c'],\n    'x2': ['m', 'n', 'm'],\n    'y': [1, 16, 1000],\n    'z': [0.4, None, 8.7]\n})\ny = np.array([0, 0, 1])\n```\n\nDefine your pipeline:\n```\npipe = (\n    Skippa()\n        .select(columns(['x', 'x2', 'y', 'z']))\n        .cast(columns(['x', 'x2']), 'category')\n        .impute(columns(dtype_include='number'), strategy='median')\n        .impute(columns(dtype_include='category'), strategy='most_frequent')\n        .scale(columns(dtype_include='number'), type='standard')\n        .onehot(columns(['x', 'x2']))\n        .model(LogisticRegression())\n)\n```\n\nand use it for fitting / predicting like this:\n```\npipe.fit(X=df, y=y)\n\npredictions = pipe.predict_proba(df)\n```\n\nIf you want details on your model, use:\n```\nmodel = pipe.get_model()\nprint(model.coef_)\nprint(model.intercept_)\n```\n\n## (de)serialization\nAnd of course you can save and load your model pipelines (for 
deployment).\nN.B. [`dill`](https://pypi.org/project/dill/) is used for serialization and deserialization because joblib and pickle don't provide enough support.\n```\npipe.save('./models/my_skippa_model_pipeline.dill')\n\n...\n\nmy_pipeline = Skippa.load_pipeline('./models/my_skippa_model_pipeline.dill')\npredictions = my_pipeline.predict(df_new_data)\n```\n\nSee the [./examples](./examples) directory for more examples:\n- [01-standard-pipeline.py](./examples/01-standard-pipeline.py)\n- [02-preprocessing-only.py](./examples/02-preprocessing-only.py)\n- [03a-gridsearch.py](./examples/03a-gridsearch.py)\n- [03b-hyperopt.py](./examples/03b-hyperopt.py)\n- [04-gradio-app.py](./examples/04-gradio-app.py)\n- [05-PCA.py](./examples/05-PCA.py)\n\n## To Do\n- [x] Support pandas assign for creating new columns based on existing columns\n- [x] Support cast / astype transformer\n- [x] Support for .apply transformer: wrapper around `pandas.DataFrame.apply`\n- [x] Check how GridSearch (or other param search) works with Skippa\n- [x] Add a method to inspect a fitted pipeline/model by creating a Gradio app defining raw features input and model output\n- [x] Support PCA transformer\n- [ ] Facilitate random seed in Skippa object that is dispatched to all downstream operations\n- [ ] fit-transform does lazy evaluation: casting to category and then selecting category columns doesn't work. Each fit/transform should work on the expected output state of the previous transformer, rather than on the original dataframe\n- [ ] Investigate if Skippa can directly extend sklearn's Pipeline -\u003e using __getitem__ trick\n- [ ] Use sklearn's new dataframe output setting\n- [ ] Validation of pipeline steps\n- [ ] Input validation in transformers\n- [ ] Transformer for replacing values (pandas .replace)\n- [ ] Support arbitrary transformer (if column-preserving)\n- [ ] Eliminate the need to call columns explicitly\n\n\n## Credits\n- Skippa is powered by [Data Science Lab 
Amsterdam](https://www.datasciencelab.nl)\n- This project structure is based on the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdata-science-lab-amsterdam%2Fskippa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdata-science-lab-amsterdam%2Fskippa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdata-science-lab-amsterdam%2Fskippa/lists"}