{"id":23229187,"url":"https://github.com/eltoulemonde/datapreparation","last_synced_at":"2025-08-19T15:31:07.258Z","repository":{"id":22393900,"uuid":"96125447","full_name":"ELToulemonde/dataPreparation","owner":"ELToulemonde","description":"Data preparation for data science projects. ","archived":false,"fork":false,"pushed_at":"2023-07-04T13:26:34.000Z","size":5436,"stargazers_count":31,"open_issues_count":1,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-04-24T19:33:20.855Z","etag":null,"topics":["data-preparation","data-preprocessing","data-science","date-conversion","r","speed","variable-elimination","variable-selection"],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ELToulemonde.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-07-03T15:29:17.000Z","updated_at":"2024-02-09T06:10:30.000Z","dependencies_parsed_at":"2022-09-22T20:50:39.292Z","dependency_job_id":"1e1a0447-d3e0-4338-bed6-8981b35e2ba4","html_url":"https://github.com/ELToulemonde/dataPreparation","commit_stats":{"total_commits":74,"total_committers":4,"mean_commits":18.5,"dds":0.06756756756756754,"last_synced_commit":"d2e22a9e5035e9b1978dc06995741e21b5bb12af"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELToulemonde%2FdataPreparation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELToulemonde%2FdataPreparation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/E
LToulemonde%2FdataPreparation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ELToulemonde%2FdataPreparation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ELToulemonde","download_url":"https://codeload.github.com/ELToulemonde/dataPreparation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230359874,"owners_count":18214159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-preparation","data-preprocessing","data-science","date-conversion","r","speed","variable-elimination","variable-selection"],"created_at":"2024-12-19T01:17:33.245Z","updated_at":"2024-12-19T01:17:33.882Z","avatar_url":"https://github.com/ELToulemonde.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"dataPreparation\n===============\n[![Github actions](https://github.com/ELToulemonde/dataPreparation/actions/workflows/r.yml/badge.svg)](https://github.com/ELToulemonde/dataPreparation/actions/workflows/r.yml) [![codecov](https://codecov.io/gh/ELToulemonde/dataPreparation/branch/master/graph/badge.svg)](https://codecov.io/gh/ELToulemonde/dataPreparation)   [![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/dataPreparation)](https://cran.r-project.org/package=dataPreparation)  [![](http://cranlogs.r-pkg.org/badges/dataPreparation)](https://CRAN.R-project.org/package=dataPreparation) [![](https://cranlogs.r-pkg.org/badges/grand-total/dataPreparation)](https://CRAN.R-project.org/package=dataPreparation)\n  
[![HitCount](http://hits.dwyl.com/eltoulemonde/dataPreparation.svg?style=flat-square)](http://hits.dwyl.com/eltoulemonde/dataPreparation)\n\nData preparation accounts for about 80% of the work during a data science project. Let's take that number down.\n__dataPreparation__ will allow you to do most of the painful data preparation for a data science project with a minimum amount of code.\n\n\nThis package is\n- fast (uses `data.table` and exponential search)\n- RAM efficient (performs operations by reference and column-wise to avoid copying data)\n- stable (most exceptions are handled)\n- verbose (logs a lot)\n\n\n\n--------------------------\n\nMain preparation steps\n=======================\n\nBefore using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are the following:\n\n  * __Read__: load the data set (this package doesn't cover this step; for CSV files we recommend `data.table::fread`)\n  * __Correct__: most of the time there are some mistakes after reading (wrong formats, for example) that have to be corrected\n  * __Transform__: creating new features from date, categorical, character... 
in order to have information usable by an ML algorithm (i.e., numeric or categorical)\n  * __Filter__: get rid of useless information in order to speed up computation\n  * __Pre model transformation__: Specific manipulation for the chosen model (handling NA, discretization, one hot encoding, scaling...)\n  * __Shape__: put your data set in a nice shape usable by an ML algorithm\n \nHere are the functions available in this package to tackle those issues:\n\nCorrect                     | Transform                | Filter                  | Pre model manipulation| Shape             \n---------                   |-----------               |--------                 |-----------            |------------------------\nun_factor                    | generate_date_diffs        | fast_filter_variables     | fast_handle_na          | shape_set          \nfind_and_transform_dates       | generate_factor_from_date   | which_are_constant        | fast_discretization    | same_shape         \nfind_and_transform_numerics    | aggregate_by_key           | which_are_in_double        | fast_scale             | set_as_numeric_matrix\nset_col_as_character           | generate_from_factor       | which_are_bijection       |                       | one_hot_encoder\nset_col_as_numeric             | generate_from_character    |remove_sd_outlier        |                       |\nset_col_as_date                | fast_round                |remove_rare_categorical  |                       |\nset_col_as_factor              | target_encode            |remove_percentile_outlier|                       |\n\nAll of those functions are integrated in the __full pipeline__ function `prepare_set`.\n\n\nFor more details on how it works, check our [tutorial](https://cran.r-project.org/web/packages/dataPreparation/vignettes/dataPreparation.html).\n\nGetting started: 30 seconds to dataPreparation\n==============================================\n\n### Installation\nInstall the package from 
CRAN:\n```R\ninstall.packages(\"dataPreparation\")\n```\n\nTo have the latest features, install the package from GitHub:\n```R\nlibrary(devtools)\ninstall_github(\"ELToulemonde/dataPreparation\")\n```\n\n### Test it\nLoad a toy data set\n```R\nlibrary(dataPreparation)\ndata(messy_adult)\nhead(messy_adult)\n```\n\nRun the full pipeline function\n```R\nclean_adult \u003c- prepare_set(messy_adult)\nhead(clean_adult)\n```\n\n__That's it.__ For all functions, you can check out the documentation and/or the tutorial vignette.\n\nHow to Contribute\n=================\n\ndataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.\n\n- Check out the call for [contributions](https://github.com/ELToulemonde/dataPreparation/blob/master/CONTRIBUTING.rst) to see what can be improved, or open an issue if you want something.\n- Contribute new useful features.\n- Contribute to the [tests](https://github.com/ELToulemonde/dataPreparation/tree/master/tests/testthat) to make it more reliable.\n- Contribute to the documentation to make it clearer for everyone.\n- Contribute to the [examples](https://github.com/ELToulemonde/dataPreparation/tree/master/vignettes) to share your experience with other users.\n- Open an [issue](https://github.com/ELToulemonde/dataPreparation/issues/) if you run into problems during development.\n\nFor more details, please refer to CONTRIBUTING.\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feltoulemonde%2Fdatapreparation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feltoulemonde%2Fdatapreparation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feltoulemonde%2Fdatapreparation/lists"}