{"id":39289555,"url":"https://github.com/crowdcent/centimators","last_synced_at":"2026-01-18T01:16:11.525Z","repository":{"id":292923942,"uuid":"982386935","full_name":"crowdcent/centimators","owner":"crowdcent","description":"Python library for building and sharing dataframe-agnostic, sklearn-style transformers and ml models for data science competitions.","archived":false,"fork":false,"pushed_at":"2025-12-30T18:05:58.000Z","size":7023,"stargazers_count":24,"open_issues_count":2,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-03T14:49:30.146Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crowdcent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-12T20:02:37.000Z","updated_at":"2025-12-30T18:05:31.000Z","dependencies_parsed_at":"2025-05-12T21:26:05.267Z","dependency_job_id":"a319d594-6a56-4df7-b149-aa2f1aad8b39","html_url":"https://github.com/crowdcent/centimators","commit_stats":null,"previous_names":["crowdcent/centimators"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/crowdcent/centimators","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdcent%2Fcentimators","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdcent%2Fcentimators/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowd
cent%2Fcentimators/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdcent%2Fcentimators/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crowdcent","download_url":"https://codeload.github.com/crowdcent/centimators/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crowdcent%2Fcentimators/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28525962,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-18T00:39:45.795Z","status":"ssl_error","status_checked_at":"2026-01-18T00:39:39.467Z","response_time":85,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-18T01:16:08.474Z","updated_at":"2026-01-18T01:16:11.517Z","avatar_url":"https://github.com/crowdcent.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"https://raw.githubusercontent.com/crowdcent/centimators/main/docs/overrides/assets/images/centimators_banner_transparent_thinner.png\" alt=\"Centimators\" width=\"100%\" style=\"max-width: 800px;\"/\u003e\n\n# Centimators: essential data transformers and model estimators for ML and data science competitions\n\n`centimators` is an open-source python library built on scikit-learn, keras, and narwhals: 
designed for building and sharing **dataframe-agnostic** (pandas/polars), **multi-framework** (jax/tf/pytorch), **sklearn-style** (fit/transform/predict) transformers, meta-estimators, and machine learning models for data science competitions like Numerai, Kaggle, and the CrowdCent Challenge. \n\n`centimators` makes heavy use of advanced scikit-learn concepts such as metadata routing. Familiarity with these concepts is recommended for optimal use of the library. You can learn more about metadata routing in the [scikit-learn documentation](https://scikit-learn.org/stable/metadata_routing.html).\n\nDocumentation is available at [https://crowdcent.github.io/centimators/](https://crowdcent.github.io/centimators/).\n\n## Installation\n\n```bash\n# Feature transformers only (minimal)\nuv pip install centimators # or\nuv add centimators\n\n# With Keras neural networks (JAX backend)\nuv add 'centimators[keras-jax]'\n\n# With DSPy LLM estimators\nuv add 'centimators[dspy]'\n\n# Everything\nuv add 'centimators[all]'\n```\n\n## Keras Backend Configuration\n\n**Note:** Only relevant if using `centimators[keras-jax]` or `centimators[all]`.\n\n`centimators` uses Keras 3 for its neural network models, which supports multiple backends (JAX, TensorFlow, PyTorch). By default, `centimators` uses **JAX** as the backend.\n\n### Using the Default JAX Backend\n\nNo configuration needed! 
Just import and use:\n\n```python\nfrom centimators.model_estimators import MLPRegressor\n\n# JAX backend is automatically set\nmodel = MLPRegressor()\n```\n\n### Switching Backends\n\nIf you want to use TensorFlow or PyTorch instead, you have two options:\n\n**Option 1: Set environment variable before importing**\n```python\nimport os\nos.environ[\"KERAS_BACKEND\"] = \"tensorflow\"  # or \"torch\"\n\n# Now import centimators\nfrom centimators.model_estimators import MLPRegressor\n```\n\n**Option 2: Use the configuration function**\n```python\nimport centimators\ncentimators.set_keras_backend(\"tensorflow\")  # or \"torch\"\n\n# Now import model estimators\nfrom centimators.model_estimators import MLPRegressor\n```\n\n**Note:** If you choose TensorFlow or PyTorch, you'll need to install them separately:\n```bash\nuv add tensorflow\nuv add torch\n```\n\n## Quick Start\n\n`centimators` transformers and estimators are dataframe-agnostic, powered by [narwhals](https://narwhals-dev.github.io/narwhals/). You can use the same transformer seamlessly with both Pandas and Polars DataFrames. 
Here's an example with RankTransformer, which calculates the normalized rank of features for all tickers over time *by date*.\n\nFirst, let's define some common data:\n```python\nimport pandas as pd\nimport polars as pl\n# Create sample OHLCV data for two stocks over four trading days\ndata = {\n    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02', \n             '2021-01-03', '2021-01-03', '2021-01-04', '2021-01-04'],\n    'ticker': ['AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL', 'MSFT', 'AAPL', 'MSFT'],\n    'open': [150.0, 280.0, 151.0, 282.0, 152.0, 283.0, 153.0, 284.0],    # Opening prices\n    'high': [152.0, 282.0, 153.0, 284.0, 154.0, 285.0, 155.0, 286.0],    # Daily highs\n    'low': [149.0, 278.0, 150.0, 280.0, 151.0, 281.0, 152.0, 282.0],     # Daily lows\n    'close': [151.0, 281.0, 152.0, 283.0, 153.0, 284.0, 154.0, 285.0],   # Closing prices\n    'volume': [1000000, 800000, 1200000, 900000, 1100000, 850000, 1050000, 820000]  # Trading volume\n}\n\n# Create both Pandas and Polars DataFrames\ndf_pd = pd.DataFrame(data)\ndf_pl = pl.DataFrame(data)\n\n# Define the OHLCV features we want to transform\nfeature_cols = ['volume', 'close']\n```\n\nNow, let's use the transformer:\n```python\nfrom centimators.feature_transformers import RankTransformer\n\ntransformer = RankTransformer(feature_names=feature_cols)\nresult_pd = transformer.fit_transform(df_pd[feature_cols], date_series=df_pd['date'])\nresult_pl = transformer.fit_transform(df_pl[feature_cols], date_series=df_pl['date'])\n```\n\nBoth `result_pd` (from Pandas) and `result_pl` (from Polars) will contain the same transformed data in their native DataFrame formats. You may find significant performance gains using Polars for certain operations.\n\n## Advanced Pipeline\n\n`centimators` transformers are designed to work seamlessly within scikit-learn Pipelines, leveraging its metadata routing capabilities. 
This allows you to pass data like date or ticker series through the pipeline to the specific transformers that need them, while also chaining together multiple transformers. This is useful for building more complex feature pipelines, but also allows for better cross-validation, hyperparameter tuning, and model selection. For example, if you add a Regressor at the end of the pipeline, you can imagine searching over various combinations of lags, moving average windows, and model hyperparameters during the training process.\n\n![output_chart](https://raw.githubusercontent.com/crowdcent/centimators/main/docs/overrides/assets/images/pipeline_output_example.png)\n```python\nfrom sklearn import set_config\nfrom sklearn.pipeline import make_pipeline\nfrom centimators.feature_transformers import (\n    LogReturnTransformer,\n    RankTransformer,\n    LagTransformer,\n    MovingAverageTransformer\n)\n\n# Enable metadata routing globally\nset_config(enable_metadata_routing=True)\n\n# Define individual transformers with their parameters\nlog_return_transformer = LogReturnTransformer().set_transform_request(\n    ticker_series=True\n)\nranker = RankTransformer().set_transform_request(date_series=True)\nlag_windows = [0, 5, 10, 15]\nlagger = LagTransformer(windows=lag_windows).set_transform_request(\n    ticker_series=True\n)\nma_windows = [5, 10, 20, 40]\nma_transformer = MovingAverageTransformer(\n    windows=ma_windows\n).set_transform_request(ticker_series=True)\n\n# Create the pipeline\nfeature_pipeline = make_pipeline(\n    log_return_transformer, ranker, lagger, ma_transformer\n)\n```\n![centimators_pipeline](https://raw.githubusercontent.com/crowdcent/centimators/main/docs/overrides/assets/images/centimators_pipeline.png)\n\n**Explanation:**\n\n- `set_config(enable_metadata_routing=True)` turns on scikit-learn's metadata routing.\n- `set_transform_request(metadata_name=True)` on each transformer tells the pipeline that this transformer expects `metadata_name` (e.g., 
`date_series`).\n- When `pipeline.fit_transform(X, date_series=dates, ticker_series=tickers)` is called:\n    - The `date_series` is automatically passed to `RankTransformer`.\n    - The `ticker_series` is automatically passed to `LagTransformer`, `MovingAverageTransformer`, and `LogReturnTransformer`.\n    - The output of `LogReturnTransformer` is passed to `RankTransformer`.\n    - The output of `RankTransformer` is passed to `LagTransformer`.\n    - The output of `LagTransformer` is passed to `MovingAverageTransformer`.\n\nThis allows for complex data transformations where different steps require different auxiliary information, all managed cleanly by the pipeline.\n\n```python\n# Now you can use this pipeline with your data\nfeature_names = ['open', 'high', 'low', 'close']\ntransformed_df = feature_pipeline.fit_transform(\n    df_pl[feature_names],\n    date_series=df_pl[\"date\"],\n    ticker_series=df_pl[\"ticker\"],\n)\n```\n\nWe can take a closer look at a sample output for a single ticker and for a single initial feature. This shows how the close price for a cross-sectional dataset is transformed into a log return, ranked (between 0 and 1) by date, lagged by ticker, and smoothed (moving average windows) by ticker:\n![feature_example](https://raw.githubusercontent.com/crowdcent/centimators/main/docs/overrides/assets/images/feature_example.png)\n\n## End-to-End Pipeline with an Estimator\n\nThe previous \"Advanced Pipeline\" example constructed only the *feature engineering* part of a workflow.  
Thanks to Centimators' Keras-backed estimators you can seamlessly append a model as the final step and train everything through a single `fit` call.\n\n```python\nfrom sklearn.impute import SimpleImputer\nfrom centimators.model_estimators import MLPRegressor\n\n\nlag_windows = [0, 5, 10, 15]\nma_windows = [5, 10, 20, 40]\n\nmlp_pipeline = make_pipeline(\n    # Start with the existing feature pipeline\n    feature_pipeline,\n    # Replace NaNs created by lagging with a constant value\n    SimpleImputer(strategy=\"constant\", fill_value=0.5).set_output(transform=\"pandas\"),\n    # Train a neural network in-place\n    MLPRegressor().set_fit_request(epochs=True),\n)\n\nfeature_names = [\"open\", \"high\", \"low\", \"close\"]\n\nmlp_pipeline.fit(\n    df_pl[feature_names],\n    df_pl[\"target\"],\n    date_series=df_pl[\"date\"],\n    ticker_series=df_pl[\"ticker\"],\n    epochs=5,\n)\n```\n\n![centimators_pipeline_estimator](https://raw.githubusercontent.com/crowdcent/centimators/main/docs/overrides/assets/images/centimators_pipeline_estimator.png)\n\nJust as before, scikit-learn's *metadata routing* ensures that auxiliary inputs (`date_series`, `ticker_series`, `epochs`) are forwarded only to the steps that explicitly requested them.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrowdcent%2Fcentimators","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrowdcent%2Fcentimators","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrowdcent%2Fcentimators/lists"}