{"id":15914625,"url":"https://github.com/justinhchae/pd-helper","last_synced_at":"2025-03-23T01:31:25.523Z","repository":{"id":53055256,"uuid":"355688254","full_name":"justinhchae/pd-helper","owner":"justinhchae","description":"A helpful package to streamline Pandas DataFrame optimization.","archived":false,"fork":false,"pushed_at":"2022-01-19T18:39:23.000Z","size":98,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-18T16:14:45.382Z","etag":null,"topics":["bigdata","dataframes","developer-tools","optimization-tools","pandas","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/justinhchae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-04-07T21:36:35.000Z","updated_at":"2021-12-07T03:48:51.000Z","dependencies_parsed_at":"2022-09-04T10:40:42.207Z","dependency_job_id":null,"html_url":"https://github.com/justinhchae/pd-helper","commit_stats":null,"previous_names":[],"tags_count":21,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/justinhchae%2Fpd-helper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/justinhchae%2Fpd-helper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/justinhchae%2Fpd-helper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/justinhchae%2Fpd-helper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/justinhchae","download_url":"https://codeload.github.com/j
ustinhchae/pd-helper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245043953,"owners_count":20551855,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","dataframes","developer-tools","optimization-tools","pandas","python3"],"created_at":"2024-10-06T17:04:46.844Z","updated_at":"2025-03-23T01:31:25.244Z","avatar_url":"https://github.com/justinhchae.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pd-helper\n \n A helpful package to streamline Pandas DataFrame optimization.\n \n Save 50-75% on DataFrame memory usage by running the optimizer. 
\n \n Automatically configure an appropriate dtype for each column with **helper**.\n\n Generate a random DataFrame of controlled random variables for testing with **maker**.\n\n## Install\n ```bash\n pip install pd-helper\n ```\n\n## Basic Usage to Iterate over DataFrame\n```python\nfrom pd_helper.maker import MakeData\nfrom pd_helper.helper import optimize\nfaker = MakeData()\n\nif __name__ == \"__main__\":\n   # MakeData() generates a fake dataframe, convenient for testing\n   df = faker.make_df()\n   df = optimize(df)\n```\n## Better Usage With Multiprocessing\n```python\nfrom pd_helper.maker import MakeData\nfrom pd_helper.helper import optimize\nfaker = MakeData()\n\nif __name__ == \"__main__\":\n   # MakeData() generates a fake dataframe, convenient for testing\n   df = faker.make_df()\n   df = optimize(df, enable_mp=True)\n```\n\n## Specify Special Mappings\n```python\nfrom pd_helper.maker import MakeData\nfrom pd_helper.helper import optimize\nfaker = MakeData()\n\nif __name__ == \"__main__\":\n   # MakeData() generates a fake dataframe, convenient for testing\n   df = faker.make_df()\n   special_mappings = {'string': ['object_id'],\n                       'category': ['item_name']}\n   \n   # Columns listed in special_mappings get the given dtype instead of the one\n   # the optimize ruleset would choose; they are still returned in the result.\n   df = optimize(df,\n                 enable_mp=True,\n                 special_mappings=special_mappings\n                 )\n```\n\n\n## Sample Results with Helper\n\n```bash\nStarting with 175.63 MB memory.\n\nAfter optmization. \n\nEnding with 65.33 MB memory.\n```\n\n## Generating a Randomly Imperfect DataFrame with Maker\n\n Maker provides a class, MakeData(), to generate a table of made-up records. \n \n Each row is an event where an item was retrieved. \n \n Options are available to make the table imperfectly random in various ways. 
\n \n Sample table below:\n\n|  | Retrieved Date  | Item Name | Retrieved | Condition | Sector |\n| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |\n| Example | 2019-01-01, 2019-03-4  | Toaster, Lighter  | True, False  | Junk, Excellent  | 1, 2 |\n| Data Type | String  | String  | String  | String | Integer |\n\n\n## References\n\n* Pandas Categorical: \u003chttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html\u003e\n\n* Pandas Pickle: \u003chttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_pickle.html\u003e\n\n* Pandas CSV: \u003chttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html\u003e\n\n* Pandas Datetime: \u003chttps://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html\u003e\n\n### TODO\n\n* Improve efficiency of iterating on DataFrame.\n\n* Allow user to toggle logging.\n\n* Provide tools for imputing missing data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustinhchae%2Fpd-helper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjustinhchae%2Fpd-helper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjustinhchae%2Fpd-helper/lists"}