{"id":27608881,"url":"https://github.com/polkas/cat2cat","last_synced_at":"2025-10-23T20:50:07.480Z","repository":{"id":44533691,"uuid":"250753014","full_name":"Polkas/cat2cat","owner":"Polkas","description":"Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset","archived":false,"fork":false,"pushed_at":"2024-01-22T22:19:51.000Z","size":15731,"stargazers_count":4,"open_issues_count":4,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-22T22:54:03.983Z","etag":null,"topics":["categories","cran","encoding","encodings","factor","longitudinal","mapping","mappings","panel","r","r-package","transitions"],"latest_commit_sha":null,"homepage":"https://polkas.github.io/cat2cat","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Polkas.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":".github/CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":"codemeta.json","zenodo":null}},"created_at":"2020-03-28T09:06:45.000Z","updated_at":"2023-10-27T08:16:58.000Z","dependencies_parsed_at":"2023-10-15T18:01:23.383Z","dependency_job_id":"84cb7af3-5435-43f5-b5b8-1e04378c354c","html_url":"https://github.com/Polkas/cat2cat","commit_stats":{"total_commits":113,"total_committers":1,"mean_commits":113.0,"dds":0.0,"last_synced_commit":"10c42070096d1c919aed9bf4bb043c663b2dd102"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/Polkas/cat2cat","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Polkas%2Fcat2cat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/Polkas%2Fcat2cat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Polkas%2Fcat2cat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Polkas%2Fcat2cat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Polkas","download_url":"https://codeload.github.com/Polkas/cat2cat/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Polkas%2Fcat2cat/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271998854,"owners_count":24856123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-25T02:00:12.092Z","response_time":1107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["categories","cran","encoding","encodings","factor","longitudinal","mapping","mappings","panel","r","r-package","transitions"],"created_at":"2025-04-22T22:51:53.977Z","updated_at":"2025-10-23T20:50:07.396Z","avatar_url":"https://github.com/Polkas.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cat2cat \u003ca href='https://github.com/polkas/cat2cat'\u003e\u003cimg src='man/figures/cat2cat_logo.png' align=\"right\" width=\"200px\" /\u003e\u003c/a\u003e\n[![R build 
status](https://github.com/polkas/cat2cat/workflows/R-CMD-check/badge.svg)](https://github.com/polkas/cat2cat/actions)\n[![CRAN](http://www.r-pkg.org/badges/version/cat2cat)](https://cran.r-project.org/package=cat2cat)\n[![codecov](https://codecov.io/gh/Polkas/cat2cat/branch/main/graph/badge.svg)](https://app.codecov.io/gh/Polkas/cat2cat)\n[![Dependencies](https://tinyverse.netlify.com/badge/cat2cat)](https://cran.r-project.org/package=cat2cat)\n\n## Handling an Inconsistently Coded Categorical Variable in a Longitudinal Dataset\n\nUnify an inconsistently coded categorical variable in a panel/longitudinal dataset.  \nThe package offers the novel `cat2cat` procedure, which maps a categorical variable according to a mapping (transition) table between two different time points.\nThe mapping (transition) table should have a candidate for each category from the period targeted for the update. The main rule is to replicate an observation if it could be assigned to several categories, and then to use simple frequencies or modern statistical methods to approximate the probabilities of it being assigned to each of them.\n\n**This algorithm was invented and implemented in the paper by [(Nasinski, Majchrowska and Broniatowska (2020))](https://doi.org/10.24425/cejeme.2020.134747).**\n\n**For more details please read the paper by [(Nasinski, Gajowniczek (2023))](https://doi.org/10.1016/j.softx.2023.101525).**\n\n[**Please visit the cat2cat webpage for more information**](https://polkas.github.io/cat2cat/articles/cat2cat.html)\n\n[**Python Version**](https://pypi.org/project/cat2cat/)\n\n## Installation\n\n```r\n# install.packages(\"remotes\")\nremotes::install_github(\"polkas/cat2cat\")\n# or\ninstall.packages(\"cat2cat\")\n```\n\n## Example\n\nThe `occup` dataset is an example of an unbalanced panel dataset.\nThe data are simulated, although they reflect real-world characteristics from a national statistical office survey.\nThe original survey is anonymous and takes place **every two 
years**.\n\nThe `trans` dataset contains mappings (transitions) between old (2008) and new (2010) occupational codes.\nThis table can be used to map encodings in both directions.\n\nPanel dataset without unique identifiers and with only two periods, backward direction and simple frequencies:\n\n```r\nlibrary(\"cat2cat\")\ndata(\"occup\", package = \"cat2cat\")\ndata(\"trans\", package = \"cat2cat\")\n\noccup_old \u003c- occup[occup$year == 2008, ]\noccup_new \u003c- occup[occup$year == 2010, ]\n\noccup_simple \u003c- cat2cat(\n  data = list(\n    old = occup_old, new = occup_new,\n    cat_var_old = \"code\", cat_var_new = \"code\", time_var = \"year\"\n  ),\n  mappings = list(trans = trans, direction = \"backward\")\n)\n```\n\nPanel dataset without unique identifiers and with four periods, backward direction and ML models:\n\n```r\nlibrary(\"cat2cat\")\ndata(\"occup\", package = \"cat2cat\")\ndata(\"trans\", package = \"cat2cat\")\n\noccup_2006 \u003c- occup[occup$year == 2006,]\noccup_2008 \u003c- occup[occup$year == 2008,]\noccup_2010 \u003c- occup[occup$year == 2010,]\noccup_2012 \u003c- occup[occup$year == 2012,]\n\nlibrary(\"caret\")\n\nml_setup \u003c- list(\n  data = occup_2010,\n  cat_var = \"code\",\n  method = c(\"knn\"),\n  features = c(\"age\", \"sex\", \"edu\", \"exp\", \"parttime\", \"salary\"),\n  args = list(k = 10, ntree = 50)\n)\n\nmappings \u003c- list(trans = trans, direction = \"backward\")\n\n# ml model performance check\nprint(cat2cat_ml_run(mappings, ml_setup))\n\n# from 2010 to 2008\noccup_back_2008_2010 \u003c- cat2cat(\n  data = list(\n    old = occup_2008, new = occup_2010, \n    cat_var_old = \"code\", cat_var_new = \"code\", time_var = \"year\"\n  ),\n  mappings = mappings,\n  ml = ml_setup\n)\n\n# from 2008 to 2006\noccup_back_2006_2008 \u003c- cat2cat(\n  data = list(\n    old = occup_2006, new = occup_back_2008_2010$old,\n    cat_var_new = \"g_new_c2c\", cat_var_old = \"code\", time_var = \"year\"\n  ),\n  mappings = mappings,\n  ml = 
ml_setup\n)\n\no_2006_new \u003c- occup_back_2006_2008$old\no_2008_new \u003c- occup_back_2008_2010$old # or occup_back_2006_2008$new\no_2010_new \u003c- occup_back_2008_2010$new\no_2012_new \u003c- dummy_c2c(\n  occup_2012, cat_var = \"code\", ml = c(\"knn\")\n)\n\nfinal_data_back \u003c- do.call(\n  rbind, \n  list(o_2006_new, o_2008_new, o_2010_new, o_2012_new)\n)\n\n# possible processing, leaving only one obs per subject and period\n# still it is recommended to leave all replications and use the weights in the statistical models\nlibrary(magrittr)\nff \u003c- final_data_back %\u003e% \n  split(.$year) %\u003e% \n  lapply(function(x) cross_c2c(x)) %\u003e% \n  lapply(function(x) \n    prune_c2c(x, column = \"wei_cross_c2c\", method = \"highest1\")\n  ) %\u003e% \n  do.call(rbind, .)\nall.equal(nrow(ff), sum(ff$wei_cross_c2c))\nall.equal(nrow(ff), sum(final_data_back$wei_freq_c2c))\n```\n\n**More complex examples are presented in the \"Get Started\" vignette.**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolkas%2Fcat2cat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpolkas%2Fcat2cat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpolkas%2Fcat2cat/lists"}