{"id":20397948,"url":"https://github.com/bcg-x-official/sklearndf","last_synced_at":"2025-04-09T21:20:58.271Z","repository":{"id":37856350,"uuid":"285236275","full_name":"BCG-X-Official/sklearndf","owner":"BCG-X-Official","description":"DataFrame support for scikit-learn.","archived":false,"fork":false,"pushed_at":"2023-11-15T17:48:38.000Z","size":16034,"stargazers_count":62,"open_issues_count":3,"forks_count":6,"subscribers_count":9,"default_branch":"2.3.x","last_synced_at":"2024-03-14T23:24:06.294Z","etag":null,"topics":["cross-validation","data-science","feature-traceability","hyper-parameter-tuning","machine-learning","model-selection","pandas-dataframe","python"],"latest_commit_sha":null,"homepage":"https://bcg-x-official.github.io/sklearndf/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BCG-X-Official.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-05T09:04:19.000Z","updated_at":"2024-02-21T02:58:19.000Z","dependencies_parsed_at":"2023-02-13T00:15:46.940Z","dependency_job_id":"90435932-adac-4839-9431-4d85807dcc44","html_url":"https://github.com/BCG-X-Official/sklearndf","commit_stats":null,"previous_names":["bcg-x-official/sklearndf","bcg-gamma/sklearndf"],"tags_count":28,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Fsklearndf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Fsklearndf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/BCG-X-Official%2Fsklearndf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BCG-X-Official%2Fsklearndf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BCG-X-Official","download_url":"https://codeload.github.com/BCG-X-Official/sklearndf/tar.gz/refs/heads/2.3.x","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248112365,"owners_count":21049646,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-validation","data-science","feature-traceability","hyper-parameter-tuning","machine-learning","model-selection","pandas-dataframe","python"],"created_at":"2024-11-15T04:17:32.203Z","updated_at":"2025-04-09T21:20:58.235Z","avatar_url":"https://github.com/BCG-X-Official.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":".. image:: sphinx/source/_images/sklearndf_logo.png\n\n----\n\n.. Begin-Badges\n\n|pypi| |conda| |azure_build| |azure_code_cov|\n|python_versions| |code_style| |made_with_sphinx_doc| |License_badge|\n\n.. End-Badges\n\n*sklearndf* is an open source library designed to address a common need with\n`scikit-learn \u003chttps://github.com/scikit-learn/scikit-learn\u003e`__: the outputs of\ntransformers are numpy arrays, even when the input is a\ndata frame. 
However, to inspect a model it is essential to keep track of the\nfeature names.\n\nTo this end, *sklearndf* enhances scikit-learn's estimators as follows:\n\n- **Preserve data frame structure**:\n  Return data frames as results of transformations, preserving feature names as the\n  column index.\n- **Feature name tracing**:\n  Add estimator properties that enable tracing a feature name back to its\n  original input feature; this is especially useful for transformers that create new\n  features (e.g., one-hot encoders), and for pipelines that include such transformers.\n- **Easy use**:\n  Simply append ``DF`` to the end of your usual scikit-learn class names to get enhanced\n  data frame support!\n\nThe following quickstart guide provides a minimal example workflow to get up and running\nwith *sklearndf*.\nFor additional tutorials and the API reference,\nsee the `sklearndf documentation \u003chttps://bcg-x-official.github.io/sklearndf/\u003e`__.\nChanges and additions in new versions are summarized in the\n`release notes \u003chttps://bcg-x-official.github.io/sklearndf/release_notes.html\u003e`__.\n\n\nInstallation\n------------\n\n*sklearndf* is available from both PyPI and Anaconda.\nWe recommend installing *sklearndf* into a dedicated environment.\n\n\nAnaconda\n~~~~~~~~\n\n.. code-block:: sh\n\n    conda create -n sklearndf\n    conda activate sklearndf\n    conda install -c bcg_gamma -c conda-forge sklearndf\n\n\nPip\n~~~\n\nmacOS and Linux:\n^^^^^^^^^^^^^^^^\n\n.. code-block:: sh\n\n    python -m venv sklearndf\n    source sklearndf/bin/activate\n    pip install sklearndf\n\nWindows:\n^^^^^^^^\n\n.. 
code-block:: dosbatch\n\n    python -m venv sklearndf\n    sklearndf\\Scripts\\activate.bat\n    pip install sklearndf\n\n\nQuickstart\n----------\n\nCreating a DataFrame-friendly scikit-learn preprocessing pipeline\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe Titanic dataset includes categorical features such as class and sex, and also has\nmissing values for numeric features (e.g., age) and categorical features (e.g., embarked).\nThe aim is to predict whether or not a passenger survived.\nA standard sklearn example for this dataset can be found\n`here \u003chttps://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#sphx-glr-auto-examples-compose-plot-column-transformer-mixed-types-py\u003e`_.\n\n\nWe will build a preprocessing pipeline which:\n\n- for categorical variables, fills missing values with the string 'Unknown' and then one-hot encodes\n- for numerical variables, fills missing values with the median\n\nThe strength of *sklearndf* is that it maintains the scikit-learn conventions and\nexpressiveness, while also preserving data frames, and hence feature names. We can see\nthis after using ``fit_transform`` on our preprocessing pipeline.\n\n.. 
code-block:: Python\n\n    import numpy as np\n    from sklearn.datasets import fetch_openml\n    from sklearn.model_selection import train_test_split\n\n    # relevant sklearndf imports\n    from sklearndf.transformation import (\n        ColumnTransformerDF,\n        OneHotEncoderDF,\n        SimpleImputerDF,\n    )\n    from sklearndf.pipeline import (\n        PipelineDF,\n        ClassifierPipelineDF,\n    )\n    from sklearndf.classification import RandomForestClassifierDF\n\n    # load titanic data\n    titanic_X, titanic_y = fetch_openml(\n        \"titanic\", version=1, as_frame=True, return_X_y=True\n    )\n\n    # select features\n    numerical_features = ['age', 'fare']\n    categorical_features = ['embarked', 'sex', 'pclass']\n\n    # create a preprocessing pipeline\n    preprocessing_numeric_df = SimpleImputerDF(strategy=\"median\")\n\n    preprocessing_categorical_df = PipelineDF(\n        steps=[\n            ('imputer', SimpleImputerDF(strategy='constant', fill_value='Unknown')),\n            ('one-hot', OneHotEncoderDF(sparse=False, handle_unknown=\"ignore\")),\n        ]\n    )\n\n    preprocessing_df = ColumnTransformerDF(\n        transformers=[\n            ('categorical', preprocessing_categorical_df, categorical_features),\n            ('numeric', preprocessing_numeric_df, numerical_features),\n        ]\n    )\n\n    # run preprocessing\n    transformed_df = preprocessing_df.fit_transform(X=titanic_X, y=titanic_y)\n    transformed_df.head()\n\n\n+-------------+------------+------------+---+------------+--------+--------+\n| feature_out | embarked_C | embarked_Q | … | pclass_3.0 | age    | fare   |\n+=============+============+============+===+============+========+========+\n| **0**       | 0          | 0          | … | 0          | 29     | 211.34 |\n+-------------+------------+------------+---+------------+--------+--------+\n| **1**       | 0          | 0          | … | 0          | 0.9167 | 151.55 
|\n+-------------+------------+------------+---+------------+--------+--------+\n| **2**       | 0          | 0          | … | 0          | 2      | 151.55 |\n+-------------+------------+------------+---+------------+--------+--------+\n| **3**       | 0          | 0          | … | 0          | 30     | 151.55 |\n+-------------+------------+------------+---+------------+--------+--------+\n| **4**       | 0          | 0          | … | 0          | 25     | 151.55 |\n+-------------+------------+------------+---+------------+--------+--------+\n\n\nTracing features from post-transform to original \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe *sklearndf* pipeline has a ``feature_names_original_`` attribute\nwhich returns a *pandas* ``Series``, mapping the output column names (the series' index)\nto the input column names (the series' values).\nWe can therefore easily select all output features generated from a given input feature,\nsuch as in this case for embarked.\n\n.. code-block:: Python\n\n    embarked_type_derivatives = preprocessing_df.feature_names_original_ == \"embarked\"\n    transformed_df.loc[:, embarked_type_derivatives].head()\n\n\n+-------------+------------+------------+------------+------------------+\n| feature_out | embarked_C | embarked_Q | embarked_S | embarked_Unknown |\n+=============+============+============+============+==================+\n| **0**       | 0.0        | 0.0        | 1.0        | 0.0              |\n+-------------+------------+------------+------------+------------------+\n| **1**       | 0.0        | 0.0        | 1.0        | 0.0              |\n+-------------+------------+------------+------------+------------------+\n| **2**       | 0.0        | 0.0        | 1.0        | 0.0              |\n+-------------+------------+------------+------------+------------------+\n| **3**       | 0.0        | 0.0        | 1.0        | 0.0              
|\n+-------------+------------+------------+------------+------------------+\n| **4**       | 0.0        | 0.0        | 1.0        | 0.0              |\n+-------------+------------+------------+------------+------------------+\n\n\nCompleting the pipeline with a classifier\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nScikit-learn regressors and classifiers have a *sklearndf* sibling obtained by appending\n``DF`` to the class name; the API of the native estimators is preserved.\nThe result of any predict and decision function will be returned as a *pandas*\n``Series`` (single output) or ``DataFrame`` (class probabilities or multi-output).\n\nWe can combine the preprocessing pipeline above with a classifier to create a full\npredictive pipeline. *sklearndf* provides two useful, specialised pipeline objects for\nthis, ``RegressorPipelineDF`` and ``ClassifierPipelineDF``.\nBoth implement a special two-step pipeline with one preprocessing step and one\nprediction step, while staying compatible with the general sklearn pipeline idiom.\n\nUsing ``ClassifierPipelineDF`` we can combine the preprocessing pipeline with\n``RandomForestClassifierDF`` to fit a model to a selected training set and then score\non a test set.\n\n.. 
code-block:: Python\n\n    # create full pipeline\n    pipeline_df = ClassifierPipelineDF(\n        preprocessing=preprocessing_df,\n        classifier=RandomForestClassifierDF(\n            n_estimators=1000,\n            max_features=2/3,\n            max_depth=7,\n            random_state=42,\n            n_jobs=-3,  # use all but two CPU cores\n        )\n    )\n\n    # split data and then fit and score random forest classifier\n    df_train, df_test, y_train, y_test = train_test_split(\n        titanic_X, titanic_y, random_state=42\n    )\n    pipeline_df.fit(df_train, y_train)\n    print(f\"model score: {pipeline_df.score(df_test, y_test):.2f}\")\n\n\n|\n\n    model score: 0.79\n\n\nContributing\n------------\n\n*sklearndf* is stable and is being supported long-term.\n\nContributions to *sklearndf* are welcome and appreciated.\nFor any bug reports or feature requests/enhancements please use the appropriate\n`GitHub form \u003chttps://github.com/BCG-X-Official/sklearndf/issues\u003e`_, and if you wish to do\nso, please open a PR addressing the issue.\n\nFor any major changes, we ask that you first discuss them with us via an issue.\n\nFor further information on contributing please see our\n`contribution guide \u003chttps://bcg-x-official.github.io/sklearndf/contribution_guide.html\u003e`__.\n\n\nLicense\n-------\n\n*sklearndf* is licensed under Apache 2.0 as described in the\n`LICENSE \u003chttps://github.com/BCG-X-Official/sklearndf/blob/develop/LICENSE\u003e`_ file.\n\n\nAcknowledgements\n----------------\n\nLearners and pipelining from the popular machine learning package\n`scikit-learn \u003chttps://github.com/scikit-learn/scikit-learn\u003e`__ support\nthe corresponding *sklearndf* implementations.\n\n\nBCG GAMMA\n---------\n\nWe are always on the lookout for passionate and talented data scientists to join the\nBCG GAMMA team. 
If you would like to know more you can find out about\n`BCG GAMMA \u003chttps://www.bcg.com/en-gb/beyond-consulting/bcg-gamma/default\u003e`_,\nor have a look at\n`career opportunities \u003chttps://www.bcg.com/en-gb/beyond-consulting/bcg-gamma/careers\u003e`_.\n\n.. Begin-Badges\n\n.. |conda| image:: https://anaconda.org/bcg_gamma/sklearndf/badges/version.svg\n   :target: https://anaconda.org/BCG_Gamma/sklearndf\n\n.. |pypi| image:: https://badge.fury.io/py/sklearndf.svg\n   :target: https://pypi.org/project/sklearndf/\n\n.. |azure_build| image:: https://dev.azure.com/gamma-facet/facet/_apis/build/status/BCG-X-Official.sklearndf?repoName=BCG-X-Official%2Fsklearndf\u0026branchName=develop\n   :target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8\u0026_a=summary\n\n.. |azure_code_cov| image:: https://img.shields.io/azure-devops/coverage/gamma-facet/facet/8/2.1.x\n   :target: https://dev.azure.com/gamma-facet/facet/_build?definitionId=8\u0026_a=summary\n\n.. |python_versions| image:: https://img.shields.io/badge/python-3.7|3.8|3.9-blue.svg\n   :target: https://www.python.org/downloads/release/python-380/\n\n.. |code_style| image:: https://img.shields.io/badge/code%20style-black-000000.svg\n   :target: https://github.com/psf/black\n\n.. |made_with_sphinx_doc| image:: https://img.shields.io/badge/Made%20with-Sphinx-1f425f.svg\n   :target: https://bcg-x-official.github.io/sklearndf/index.html\n\n.. |license_badge| image:: https://img.shields.io/badge/License-Apache%202.0-olivegreen.svg\n   :target: https://opensource.org/licenses/Apache-2.0\n\n.. End-Badges\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcg-x-official%2Fsklearndf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbcg-x-official%2Fsklearndf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbcg-x-official%2Fsklearndf/lists"}