{"id":13738471,"url":"https://github.com/kozodoi/dptools","last_synced_at":"2025-05-08T16:34:08.714Z","repository":{"id":57423900,"uuid":"255876247","full_name":"kozodoi/dptools","owner":"kozodoi","description":"Python package with utilities for data processing, aggregation, feature engineering and data versioning","archived":false,"fork":false,"pushed_at":"2022-04-19T13:41:33.000Z","size":111,"stargazers_count":4,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-11-07T03:49:35.117Z","etag":null,"topics":["aggregation","data-preparation","data-preprocessing","data-science","feature-engineering","python"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/dptools/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kozodoi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-15T10:01:28.000Z","updated_at":"2024-09-25T12:14:42.000Z","dependencies_parsed_at":"2022-08-30T02:11:06.027Z","dependency_job_id":null,"html_url":"https://github.com/kozodoi/dptools","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kozodoi%2Fdptools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kozodoi%2Fdptools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kozodoi%2Fdptools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kozodoi%2Fdptools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kozodoi","download_url":"http
s://codeload.github.com/kozodoi/dptools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224746768,"owners_count":17363109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aggregation","data-preparation","data-preprocessing","data-science","feature-engineering","python"],"created_at":"2024-08-03T03:02:23.404Z","updated_at":"2024-11-15T07:31:10.894Z","avatar_url":"https://github.com/kozodoi.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# dptools: data preprocessing functions for Python\n\n---\n\n[![PyPI Latest Release](https://img.shields.io/pypi/v/dptools.svg)](https://pypi.org/project/dptools/)\n[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://pypi.org/project/dptools/)\n[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)\n[![Licence](https://img.shields.io/github/license/mashape/apistatus.svg)](http://choosealicense.com/licenses/mit/)\n[![Build Status](https://travis-ci.org/kozodoi/dptools.svg?branch=master)](https://travis-ci.com/kozodoi/dptools)\n[![Downloads](https://img.shields.io/pypi/dm/dptools)](https://pypi.org/project/dptools/)\n\n---\n\n## Overview\n\nThe `dptools` Python package provides helper functions to simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, working with missing values and more.\n\nThe package 
currently includes the following functions:\n- Feature engineering:\n    - `add_date_features()`: create date and time-based features\n    - `add_text_features()`: create text-based features (including counts and TF-IDF)\n    - `aggregate_data()`: aggregate data and create features based on aggregated statistics\n    - `encode_factors()`: perform label or dummy encoding of categorical features\n- Data processing:\n    - `split_nested_features()`: split features nested in a single column\n    - `fill_missings()`: replace missing values with specified values\n    - `correct_colnames()`: correct column names to be unique and remove foreign symbols\n    - `print_missings()`: print information on features with missing values\n    - `print_factor_levels()`: print levels of categorical features\n- Data cleaning:\n    - `find_correlated_features()`: identify features with a high pairwise correlation\n    - `find_constant_features()`: identify features with a single unique value\n- Import and versioning:\n    - `read_csv_with_json()`: read a CSV file in which some columns are in JSON format\n    - `save_csv_version()`: save CSV with an automatically assigned version to prevent overwriting\n\n\n## Installation\n\nThe latest stable release of `dptools` can be installed from PyPI:\n```\npip install dptools\n```\n\nYou may also install the development version from GitHub:\n```\npip install git+https://github.com/kozodoi/dptools.git\n```\n\nAfter installation, you can import the included functions:\n```py\nfrom dptools import *\n```\n\n\n## Examples\n\nThis section contains a few examples of using functions from `dptools` for different data preprocessing tasks. 
Please refer to the docstring documentation in the implemented functions for further examples.\n\n\n### Creating a toy data set\n\nFirst, let us create a toy data frame to demonstrate the package functionality.\n\n```py\n# import dependencies\nimport pandas as pd\nimport numpy as np\n\n# create data frame\ndata = {'age': [27, np.nan, 30, 25, np.nan],\n        'height': [170, 168, 173, 177, 165],\n        'gender': ['female', 'male', np.nan, 'male', 'female'],\n        'income': ['high', 'medium', 'low', 'low', 'no income']}\ndf = pd.DataFrame(data)\n```\n| age | height | gender | income |\n|---:| ---:| ---:| ---:|   \n| 27.0 | 170 | female | high |\n| NaN | 168 | male | medium |\n| 30.0 | 173 | NaN | low |\n| 25.0 | 177 | male | low |\n| NaN | 165 | female | no income |\n\n\n### Aggregating features\n\n```py\n# aggregating the data\nfrom dptools import aggregate_data\ndf_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')   \n```\n| gender | age_mean | age_max | height_mean | height_max | income_mode |\n|---:| ---:| ---:| ---:| ---:| ---:|    \n| female | 27.0 | 27.0 | 167.5 | 170 | 'high' |\n| male | 25.0 | 25.0 | 172.5 | 177 | 'low' |\n\n\n### Creating text-based features\n\n```py\n# creating text-based features\nfrom dptools import add_text_features\ndf_new = add_text_features(df, text_vars = 'income')\n```\n| age | height | gender | income_word_count | income_char_count |  income_tfidf_0 | ... | income_tfidf_3 |\n|---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|\n| 27.0 | 170 | female | 1 | 4 | 1.0 | ... | 0.0 |\n| NaN | 168 | male | 1 | 6 | 0.0 | ... | 1.0 |\n| 30.0 | 173 | NaN | 1 | 3 | 0.0 | ... | 0.0 |\n| 25.0 | 177 | male | 1 | 3 | 0.0 | ... | 0.0 |\n| NaN | 165 | female | 2 | 9 | 0.0 | ... 
| 0.0 |\n\n\n### Working with missing values\n\n```py\n# print statistics on missing values\nfrom dptools import print_missings\nprint_missings(df)\n```\n| | Total | Percent |\n|---:| ---:| ---:|\n| age | 2 | 0.4 |\n| gender | 1 | 0.2 |\n\n\n### Finding correlated features\n\n```py\n# returns one feature from each correlated pair\nfrom dptools import find_correlated_features\nfeats = find_correlated_features(df, cutoff = 0.4, method = 'spearman')\nfeats\n```\n\u003e Found 1 correlated features.\n\n\u003e ['age']\n\n### Data versioning\n\n```py\n# first call saves df as 'data_v1.csv'\nfrom dptools import save_csv_version\nsave_csv_version('data.csv', df, index = False)\n\n# second call saves df as 'data_v2.csv' because 'data_v1.csv' already exists\nsave_csv_version('data.csv', df, index = False)\n```\n\n\n## Dependencies\n\nInstallation requires Python 3.7+ and the following packages:\n- [numpy](https://www.numpy.org)\n- [pandas](https://pandas.pydata.org)\n- [scikit-learn](https://scikit-learn.org)\n- [scipy](https://scipy.org)\n\n\n## Feedback\n\nIf you need help with the included data preprocessing functions or want to report an issue, please open an issue on the project's [GitHub page](https://github.com/kozodoi/dptools/issues).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkozodoi%2Fdptools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkozodoi%2Fdptools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkozodoi%2Fdptools/lists"}