{"id":27058507,"url":"https://github.com/cvs-health/coldstart","last_synced_at":"2025-10-03T18:08:46.142Z","repository":{"id":62921669,"uuid":"563494086","full_name":"cvs-health/coldstart","owner":"cvs-health","description":"A package for automatic data collection and feature engineering","archived":false,"fork":false,"pushed_at":"2022-11-14T16:17:33.000Z","size":6527,"stargazers_count":25,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-04T20:48:46.711Z","etag":null,"topics":["bigquery","feature-engineering","python","sql","sqlalchemy"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cvs-health.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-08T18:22:50.000Z","updated_at":"2024-10-28T20:36:13.000Z","dependencies_parsed_at":"2023-01-21T19:00:11.083Z","dependency_job_id":null,"html_url":"https://github.com/cvs-health/coldstart","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fcoldstart","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fcoldstart/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fcoldstart/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cvs-health%2Fcoldstart/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cvs-health","download_url":"https://codeload.github.com/cvs-health/coldstart/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247312066,"owners_count":20918340,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","feature-engineering","python","sql","sqlalchemy"],"created_at":"2025-04-05T12:15:29.274Z","updated_at":"2025-10-03T18:08:46.003Z","avatar_url":"https://github.com/cvs-health.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![](examples/coldstart.png)\n\nColdstart is a package for automatic data collection and feature engineering that should be used by new and seasoned data scientists/engineers interested in accelerating model development.\n\nData collection and feature engineering are among the most tedious and time-consuming steps in the data science workflow. Coldstart aims to solve this problem by encapsulating efficient patterns and abstracting away low-level details associated with dynamic query templating, query optimization, concurrent execution, memory management, data leakage, and pipeline deployment. The general order of events looks like this:\n\n\u003cdiv style=\"text-align:center\"\u003e\u003cimg src=\"examples/coldstart_flow.png\" /\u003e\u003c/div\u003e\n\nColdstart is meant to be a \"Goldilocks\" solution that sits somewhere between a collection of version-controlled queries and a full-fledged feature store. If you're making batch predictions that do not require ultra-low latency guarantees or if you're not taking full advantage of the warehouse’s available computing resources (i.e., waiting for dozens of queries to run one-by-one), then this package might be perfect for you!\n\n## Documentation\n\n[Documentation](https://cvs-health.github.io/coldstart/) is hosted on GitHub Pages.\n\n## Installation\n\nThe latest version can be installed from PyPI:\n\n```bash\npip install coldstart\n```\n\n## Usage\n\nColdstart embraces a code that writes code mindset by exposing a powerful class (`FeatureFactory`) that retrieves data from various user-defined domains by establishing 1:1 or 1:M relationships with peer-reviewed queries that are templated at runtime based on user input and executed concurrently in one or many batches. The output comes in the form of a single wide dataframe that can be held in memory (i.e., pandas) or on disk (i.e., dask) and then fed directly into a feature engineering/modeling pipeline (e.g., sklearn). Row-level observations are identifiable through the use of composite indexes that have two parts to them: an entity component and a temporal component, which satisfy most tabular supervised ML use cases. When it's time to move from development to production, users can \"freeze\" queries that they will be using in their production pipeline.\n\nHere is a basic example of how to use `FeatureFactory`:\n\n```python\nfrom coldstart import FeatureFactory\n\n# Instantiate FeatureFactory\nff = FeatureFactory()\n\n# List available query domains\nmy_domains = ff.list_domains(\n    dialect='bigquery',\n    entity_id='team_id' # replace with your entity_id\n)\nprint(f'Available domains: {my_domains}')\n\n# Set database spec\ndb_spec = {\n    'dialect': 'bigquery', # should match query DIALECT tags and SQLAlchemy name naming\n    'project_id': PROJECT_ID, # replace with your project if using Big Query\n    'schema': 'my_schema' # replace with your schema/dataset\n}\n\n# Start engine\nff.start_engine(db_spec)\n\n# Run FeatureFactory\nff.run(\n    leftmost_table='my_schema.my_leftmost_table', # table with columns: team_id and y\n    feature_table='my_schema.my_feature_table',\n    entity_id='team_id',\n    domains=my_domains,\n    date_range=['2010-01-01', '2015-01-01'],\n)\n\n# Stop engine\nff.stop_engine()\n\n# Return dataframe\nfeatures_df = ff.get_dataframe()\nfeatures_df.head()\n```\n\nSee [this notebook](examples/Quickstart.ipynb) for a more thorough example.\n\nNote that `leftmost_table` must be a predefined table with at least 2 columns: `entity_id` and `y` where entity_id corresponds with the tagged queries in the query bank and y corresponds with the dependent variable that you're eventually modeling. Optionally, you can also include a `min_date` and a `max_date` column so that each row is parameterized accordingly (if you do not include dates in your table, the `date_range` argument will be used for all records). A typical `leftmost_table` will look like this:\n\nentity_id|y\n---|---\nabc|0\ndef|1\n...|...\n\nOr:\n\nentity_id|y|min_date|max_date\n---|---|---|---\nabc|0|2020-01-01|2020-12-31\nabc|1|2021-01-01|2021-12-31\ndef|0|2020-01-01|2020-12-31\ndef|1|2021-01-01|2021-12-31\n...|...|...|...\n\nQueries in the query bank must adhere to an established pattern. It's this pattern that makes consistent dynamic runtime templating possible. All queries must:\n\n- Have a unique file name\n    \u003e **Tip**: By beginning the file name with the corresponding entity and domain name, you will easily be able to establish feature lineage back to the query because the final table appends the query name to the column name to ensure uniqueness \n- Be tagged with `DIALECT`, `ENTITY`, and `DOMAIN`\n- Use the default `idx` column (which is a concatenation of `entity_id` + `min_date` + `max_date`) in the **SELECT** in all CTEs/subqueries and the outer-most query\n    \u003e **Tip**: You do not need to carry the `entity_id`, `min_date`, or `max_date` down through CTEs/subqueries because it is baked into `idx`\n- Use the `{LEFTMOST_TABLE}` variable as the left-most table in the **FROM**\n- Use `min_date` and `max_date` to constrain relevant date columns in the **WHERE** if dates are involved\n- Have an outer-most query that uses `idx` in the **GROUP BY** if aggregation is involved\n\nA typical query (e.g., `teamGameStats.sql`) will look something like this:\n\n```sql\n-- DIALECT: your_dialect (e.g., bigquery)\n-- ENTITY: your_entity_id (e.g., team_id)\n-- DOMAIN: your_domain (e.g., game)\nWITH T1 AS (\n    SELECT\n        LMT.idx,\n        A.some_column,\n        B.some_other_column,\n        ...\n    FROM\n        {LEFTMOST_TABLE} AS LMT\n        INNER JOIN some_schema.some_table_1 AS A\n            ON LMT.team_id = A.team_id\n        LEFT JOIN some_schema.some_table_2 AS B\n            ON A.id = B.team_id\n        ...\n    WHERE\n        some_date_column \u003e= LMT.min_date\n        AND some_date_column \u003c= LMT.max_date\n        ...\n)\nSELECT\n    T1.idx,\n    SUM(T1.some_column) AS some_sum,\n    SUM(T1.some_other_column) AS some_other_sum,\n    ...\nFROM\n    T1\nGROUP BY\n    T1.idx\n```\n\nIf you're looking for more fine-grained control over which queries to run, you can use the `queries` parameter, as opposed to the `domains` parameter. Before doing so though, you'll need to familiarize yourself with the queries in the query bank.\n\nAfter running, you should get back a table/dataframe that is as wide as the total number of columns returned in all underlying queries' outer-most SELECT (plus `idx` and `y`). Building off of the earlier example, the `feature_table` and/or returned dataframe would look like this:\n\nidx|y|teamGameStats_some_sum|teamGameStats_some_other_sum|...\n---|---|---|---|---\nabc_2010-01-01_2015-01-01|0|50|100|...\ndef_2010-01-01_2015-01-01|0|25|200|...\n...|...|...|...|...\n\nWhich can then be passed into a boilerplate pipeline like this:\n\n```python\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.compose import ColumnTransformer, make_column_selector\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.impute import SimpleImputer\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\nfrom sklearn.model_selection import train_test_split\n\n# Construct pipeline\nnumeric_transformer = Pipeline(\n    steps=[\n        ('imputer', SimpleImputer(strategy='constant', fill_value=0, copy=False)),\n        ('scaler', StandardScaler()),\n    ]\n)\ncategorical_transformer = Pipeline(\n    steps=[\n        ('imputer', SimpleImputer(strategy='constant', fill_value='NA', copy=False)),\n        ('encoder', OneHotEncoder(sparse=False, handle_unknown='ignore')),\n    ]\n)\npreprocessor = ColumnTransformer(\n    transformers=[\n        ('num', numeric_transformer, make_column_selector(dtype_include=np.number)),\n        ('cat', categorical_transformer, make_column_selector(dtype_include=pd.CategoricalDtype)),\n    ]\n)\npipe = Pipeline(\n    steps=[\n        ('preprocessor', preprocessor), \n        ('classifier', RandomForestClassifier()),\n    ]\n)\n\n# Set features and class label\nX = features_df.iloc[:, 1:]\ny = features_df.iloc[:, 0]\n\n# Train test split\nX_train, X_test, y_train, y_test = train_test_split(X, y)\n\n# Fit pipeline\npipe.fit(X_train, y_train)\n```\n\nAgain, see [this notebook](examples/Quickstart.ipynb) for a more thorough example.\n\n## Features Under Development\n\n- Switch for writing intermediate results to Parquet files\n- Option to return Dask DataFrame\n- Testing for more databases\n- Retrying decorator for run_query\n\n## Contributor Guide\n\n1. Before contributing to this CVS Health sponsored project, you will need to sign the associated [Contributor License Agreement](https://forms.office.com/r/HvYxTheDG5).\n2. See [contributing](CONTRIBUTING.md) page.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvs-health%2Fcoldstart","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcvs-health%2Fcoldstart","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcvs-health%2Fcoldstart/lists"}