{"id":19582947,"url":"https://github.com/tomgorb/ds-utils","last_synced_at":"2026-05-16T07:39:57.034Z","repository":{"id":92216843,"uuid":"281322437","full_name":"tomgorb/ds-utils","owner":"tomgorb","description":"pre-processing of a DataFrame into a sparse matrix for model input","archived":false,"fork":false,"pushed_at":"2024-07-24T10:21:36.000Z","size":5853,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-09T05:25:28.838Z","etag":null,"topics":["machine-learning","preprocessing","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomgorb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-07-21T07:07:06.000Z","updated_at":"2024-07-24T13:06:59.000Z","dependencies_parsed_at":"2024-11-11T07:38:51.631Z","dependency_job_id":"64847bd7-bd26-4623-8ae8-927851b4ee8c","html_url":"https://github.com/tomgorb/ds-utils","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomgorb%2Fds-utils","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomgorb%2Fds-utils/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomgorb%2Fds-utils/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomgorb%2Fds-utils/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomgorb","
download_url":"https://codeload.github.com/tomgorb/ds-utils/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240849793,"owners_count":19867755,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","preprocessing","scikit-learn"],"created_at":"2024-11-11T07:38:44.685Z","updated_at":"2026-05-16T07:39:52.015Z","avatar_url":"https://github.com/tomgorb.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"ds-utils\n-----\n\nA set of classes to ease the pre-processing of data to feed machine learning algorithms.\n\n**python 2.7 and python 3.12 compatible**\n\nMain tool:\n\n```python\nclass Preprocessor(Model):\n    \"\"\" Preprocessor\n    This class allows for the complete pre-processing of a DataFrame into a sparse matrix for model input.\n    \"\"\"\n\n    def __init__(self):\n        super(Preprocessor, self).__init__(name=\"Preprocessor\")\n        self.prunificator = None\n        self.counterizor = None\n        self.vectorizor = None\n        self.imputor = None\n        self.sparsifior = None\n        self.variance_selector = None\n\n    def fit_transform(self, df, pruning_frequency=None, do_not_use=None, sharp_categorical_dict=None, na_strategy=MeanStrategy(), variance_threshold=None, low_memory=True):\n        \"\"\" Pre-process input_files for the training phase. 
Once completed, you should save the resulting Preprocessor object for the predict phase.\n\n        Args:\n            df (pandas DataFrame): dataframe to be pre-processed.\n\n            pruning_frequency (float or None): Frequency below which values in categorical features are pruned (set to *misc*). (deactivated by default)\n\n            do_not_use (list or None): Columns to leave untouched.\n\n            sharp_categorical_dict (dict): {'column': {'sep': \"#\", 'norm': True/False} }.\n\n                                           If not provided, the program looks for columns ending in *_cat* and automatically\n                                           creates an entry in the dict with value {'sep': \"#\", 'norm': True}.\n\n            na_strategy (Strategy): Strategy used to impute missing values.\n\n            variance_threshold (float or None): Threshold for variance selector. (deactivated by default)\n\n            low_memory (bool): If True, counterizor will not use parallel computation. default: True\n\n        Returns:\n            namedtuple('data', ['X', 'other', 'names'])\n\n                data.X (scipy sparse matrix): model input\n\n                data.other (pandas DataFrame): columns unused\n        \"\"\"\n```\n\nFirst release in 2016.\n\nDocumentation compiled using *Sphinx*.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomgorb%2Fds-utils","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomgorb%2Fds-utils","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomgorb%2Fds-utils/lists"}