{"id":22944259,"url":"https://github.com/otrecoding/otrecod","last_synced_at":"2025-09-05T02:16:51.018Z","repository":{"id":41078671,"uuid":"183584955","full_name":"otrecoding/OTrecod","owner":"otrecoding","description":"R package for optimal transportation to recode variables","archived":false,"fork":false,"pushed_at":"2023-03-14T20:24:56.000Z","size":1825,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-04T10:09:35.550Z","etag":null,"topics":["optimal-transport","r"],"latest_commit_sha":null,"homepage":"https://otrecoding.github.io/OTrecod","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/otrecoding.png","metadata":{"files":{"readme":"README.Rmd","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-04-26T07:59:44.000Z","updated_at":"2023-07-25T14:25:25.000Z","dependencies_parsed_at":"2025-02-07T14:13:43.586Z","dependency_job_id":null,"html_url":"https://github.com/otrecoding/OTrecod","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/otrecoding/OTrecod","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/otrecoding%2FOTrecod","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/otrecoding%2FOTrecod/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/otrecoding%2FOTrecod/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/otrecoding%2FOTrecod/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/otrecoding","download_url":"https://codeload.github.com/otrecoding/OTrecod/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/otrecoding%2FOTrecod/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273699842,"owners_count":25152310,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["optimal-transport","r"],"created_at":"2024-12-14T14:17:23.501Z","updated_at":"2025-09-05T02:16:50.989Z","avatar_url":"https://github.com/otrecoding.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"---\ntitle: \"OTrecod package\"\noutput: github_document\n---\n\n```{r setup, include = FALSE}\nknitr::opts_chunk$set(\n  collapse = TRUE,\n  comment = \"#\u003e\",\n  fig.path = \"man/figures/README-\",\n  out.width = \"100%\"\n)\n```\n# A package dedicated to data fusion\n\n## Introduction\n\n\u003c!-- badges: start --\u003e\n[![build status](https://github.com/otrecoding/OTrecod/workflows/R-CMD-check/badge.svg)](https://github.com/otrecoding/OTrecod/actions/workflows/check-release.yaml)\n[![CRAN status](https://www.r-pkg.org/badges/version/OTrecod)](https://cran.r-project.org/package=OTrecod)\n[![Launch 
binder](http://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/otrecoding/OTrecod/master)\n[![Website](https://img.shields.io/website?url=https%3A%2F%2Fotrecoding.github.io%2FOTrecod%2F)](https://otrecoding.github.io/OTrecod/)\n[![Codecov test coverage](https://codecov.io/gh/otrecoding/OTrecod/branch/master/graph/badge.svg)](https://app.codecov.io/gh/otrecoding/OTrecod?branch=master)\n\u003c!-- badges: end --\u003e\n\nThe **OTrecod** package gives access to a set of original functions dedicated to data fusion.\n\n\u003cp align=\"justify\"\u003e Given two separate data sources with no overlapping units, which share only a set of common variables X and a target information that is encoded differently in each source and never jointly observed (Y in A and Z in B), the functions **OT\\_outcome** and **OT\\_joint** aim to provide users with a complete synthetic database in which the missing information is available for every unit.\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003e This recoding problem is solved using optimal transportation theory, which provides a map that transfers the joint distribution of the first target variable and X to the joint distribution of the second one and X, or conversely. 
The algorithms used in these two functions come from references (1) and (2).\u003c/p\u003e\n\n\u0026nbsp;\n\n## Package installation\n\nIf the **OTrecod** package is not yet installed in the current R version, users can install it with the standard instruction:\n\n```{r,eval=FALSE}\ninstall.packages(\"OTrecod\")\n```\n\nEach time an R session is opened, the **OTrecod** library must be loaded with:\n\n```{r,results='hide',message=FALSE,warning=FALSE}\nlibrary(OTrecod)\n```\n\nThe development version of **OTrecod** can currently be installed from [GitHub](https://github.com/otrecoding/OTrecod) with:\n\n```r\n# Install development version from GitHub\ndevtools::install_github(\"otrecoding/OTrecod\")\n```\n\u0026nbsp;\n\n## Database examples and expected structure before data fusion\n\n\u003cp align=\"justify\"\u003e The available databases called **tab\\_test** and **simu\\_data** correspond to overlaid databases used as examples in the documentation of all the functions.\nTheir structures can help users understand the database structure expected as input by the functions **OT\\_outcome** and **OT\\_joint**. \nThe first rows of the two overlaid data sources of **simu\\_data** are displayed below to illustrate the expected database structure: \u003c/p\u003e\n\n```{r,comment=''}\ndata(simu_data)\ndim(simu_data)\nsimu_data[c(1:5,301:305),]\n```\n\n\u003cp align=\"justify\"\u003e The first column, called *DB*, is the database identifier (the two data sources are labeled 1 and 2, with data source 1 placed above data source 2).\nThe second column, called *Yb1*, is the target variable of data source 1. 
The values of *Yb1* in data source 2 are missing and will be predicted using an optimal transportation algorithm implemented in one of the two functions **OT\\_outcome** and **OT\\_joint**.\nIn the same way, the variable *Yb2* (third column) is the target variable of data source 2, whose values in data source 1 are unknown. These missing values can also be predicted using **OT\\_outcome** and **OT\\_joint**.\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003e These three variables are required in any database prepared for data fusion with the **OTrecod** package, whatever their names and their positions in the database.\nThe remaining columns correspond to shared variables of any type, complete or not. Note that continuous variables (like age in years) are not allowed with the **OT\\_joint** function. \u003c/p\u003e\n \n\u003cp align=\"justify\"\u003e Support functions are available in the package (**merge_dbs**, **imput_cov**) to assist users in this preparation. \u003c/p\u003e\n\n\u003cp align=\"justify\"\u003e Finally, the supplementary datasets **api29** and **api35** are simple datasets extracted from the API program (https://www.cde.ca.gov/re/pr/api.asp)\nto allow users to practice with convenient databases. \u003c/p\u003e\n\n\u0026nbsp;\n\n\n## Support functions\n\n\u003cp align=\"justify\"\u003e Among the available functions, the **OTrecod** package provides a set of support functions to assist users in each step of their data fusion projects. \u003c/p\u003e\n\n\n\n### merge\\_dbs\n\n\u003cp align=\"justify\"\u003e The **merge\\_dbs** function is a pre-processing function dedicated to the harmonization of two data sources. 
By default, variables (other than the target variables) with the same labels are considered shared between the two databases.\nThe **merge\\_dbs** function detects potential discrepancies between the variables before merging by:\n\n- first excluding variables whose labels appear in only one of the two databases.\n- excluding a priori shared variables with different types.\n- excluding a priori shared factors with different levels.\n\nThe current version of the function does not offer automatic reconciliation to reintroduce the problematic variables, but its output gives users enough information to do so themselves if necessary.\nThe signature of the **merge\\_dbs** function is currently:  \u003c/p\u003e\n\n```{r, eval=FALSE}\nmerge_dbs = function(DB1, DB2, row_ID1 = NULL, row_ID2 = NULL, NAME_Y, NAME_Z, order_levels_Y = levels(DB1[, NAME_Y]), order_levels_Z = levels(DB2[, NAME_Z]), ordinal_DB1 = NULL, ordinal_DB2 = NULL,\n                     impute = \"NO\", R_MICE = 5, NCP_FAMD = 3, seed_choice = sample(1:1000000, 1))\n```\n\n\u003cp align=\"justify\"\u003e The **merge\\_dbs** function notably returns a single database, obtained by overlaying the two initial data sources, in the structure expected by the **OT\\_outcome** and **OT\\_joint** functions. 
\u003c/p\u003e\n\n\u0026nbsp;\n\n### select\\_pred\n\n\u003cp align=\"justify\"\u003e The **select\\_pred** function is a pre-processing function dedicated to the selection of matching variables.\nThis selection is essential when the initial set of shared variables is large, and more generally because the choice of predictors greatly influences the quality of the data fusion, whichever optimal transportation algorithm is chosen afterwards.\n\n\nThe signature of the **select\\_pred** function is currently: \u003c/p\u003e\n\n```{r, eval=FALSE}\nselect_pred = function(databa,Y = NULL, Z = NULL, ID = 1, OUT = \"Y\", quanti = NULL, nominal = NULL, ordinal = NULL, logic = NULL,\n                       convert_num = NULL, convert_class = NULL, thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.20,\n                       RF = TRUE, RF_ntree = 500, RF_condi = FALSE, RF_condi_thr = 0.20, RF_SEED = sample(1:1000000, 1))\n```\n\n\u0026nbsp;\n\n### verif\\_OT\n\n\u003cp align=\"justify\"\u003e The **verif\\_OT** function is a post-processing function dedicated to the validation of the fusion.\nIt provides a set of tools to assess the quality of the recoding proposed by the optimal transportation algorithms when predicting the \nmissing information of the target variables in one or both data sources. 
\u003c/p\u003e\n\nThe signature of the **verif\\_OT** function is currently:\n\n```{r, eval=FALSE}\nverif_OT = function(ot_out, group.class = FALSE, ordinal = TRUE, stab.prob = FALSE, min.neigb = 1, R = 10, seed.stab = sample(1:1000000, 1))\n```\n\n\u0026nbsp;\n\n## Optimal transportation functions\n\n\u003cp align=\"justify\"\u003e  The **OTrecod** package provides two algorithms that use optimal transportation theory to solve recoding problems in data fusion contexts (see (1) and (2) for more details).\nEach algorithm is implemented in one function, and each function returns a single synthetic database in which the two initial data sources are overlaid and the missing information of one or both target variables is fully completed.\n\nEach of the two algorithms also offers enrichments that relax the initial distributional constraints and add regularization terms, as described in (2). \u003c/p\u003e\n\n\u0026nbsp;\n\n### OT_outcome\n\n\u003cp align=\"justify\"\u003e The **OT_outcome** function can provide individual predictions of the incomplete target variables by considering\nthe recoding problem involving only optimal transportation of outcomes (see (1) and (2) for more details).\n\nThe signature of the **OT_outcome** function is: \u003c/p\u003e\n\n```{r, eval=FALSE}\nOT_outcome = function(datab, index_DB_Y_Z = 1:3, quanti = NULL, nominal = NULL, ordinal = NULL,logic = NULL,\n                      convert.num = NULL, convert.class = NULL, FAMD.coord = \"NO\", FAMD.perc = 0.8,\n                      dist.choice = \"E\", percent.knn = 1, maxrelax = 0, indiv.method = \"sequential\",\n                      prox.dist = 0.30, solvR = \"glpk\", which.DB = \"BOTH\")\n```\n\n\u0026nbsp;\n\n### OT_joint\n\n\u003cp align=\"justify\"\u003e The **OT_joint** function can provide individual predictions of the incomplete target variables by considering\nthe recoding problem involving optimal transportation of shared variables and outcomes (see (2) for more 
details).\n\n\nThe signature of the **OT_joint** function is:  \u003c/p\u003e\n\n```{r, eval=FALSE}\nOT_joint = function(datab, index_DB_Y_Z = 1:3, nominal = NULL, ordinal = NULL,logic = NULL,\n                    convert.num = NULL, convert.class = NULL, dist.choice = \"E\", percent.knn = 1,\n                    maxrelax = 0, lambda.reg = 0.0, prox.X = 0.10, solvR = \"glpk\", which.DB = \"BOTH\")\n```\n\n\u0026nbsp;\n\n\n## References\n\n(1) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. \n\n(2) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fotrecoding%2Fotrecod","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fotrecoding%2Fotrecod","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fotrecoding%2Fotrecod/lists"}