{"id":47891473,"url":"https://github.com/ihmeuw/person_linkage_case_study","last_synced_at":"2026-04-04T03:07:30.045Z","repository":{"id":242141518,"uuid":"808793794","full_name":"ihmeuw/person_linkage_case_study","owner":"ihmeuw","description":"Emulates the methods the US Census Bureau uses to link people across multiple data sources, using open-source software (Splink) and simulated data (from pseudopeople).","archived":false,"fork":false,"pushed_at":"2024-10-14T18:00:19.000Z","size":4645,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-25T15:15:42.668Z","etag":null,"topics":["census-bureau","dask","data-matching","data-science","entity-resolution","fuzzy-matching","record-linkage","spark","splink"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ihmeuw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-31T20:44:00.000Z","updated_at":"2025-03-01T00:57:43.000Z","dependencies_parsed_at":"2024-06-24T21:20:17.970Z","dependency_job_id":"44cbe561-319a-470a-abb3-4287496bcc11","html_url":"https://github.com/ihmeuw/person_linkage_case_study","commit_stats":null,"previous_names":["ihmeuw/person_linkage_case_study"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ihmeuw/person_linkage_case_study","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ihmeuw%2Fperson_linkage_case_study","tags_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/ihmeuw%2Fperson_linkage_case_study/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ihmeuw%2Fperson_linkage_case_study/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ihmeuw%2Fperson_linkage_case_study/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ihmeuw","download_url":"https://codeload.github.com/ihmeuw/person_linkage_case_study/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ihmeuw%2Fperson_linkage_case_study/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31385960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T01:22:39.193Z","status":"online","status_checked_at":"2026-04-04T02:00:07.569Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["census-bureau","dask","data-matching","data-science","entity-resolution","fuzzy-matching","record-linkage","spark","splink"],"created_at":"2026-04-04T03:07:29.298Z","updated_at":"2026-04-04T03:07:30.036Z","avatar_url":"https://github.com/ihmeuw.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Person linkage case study\n\nThis case study emulates the methods the US Census Bureau\nuses to link people across multiple data sources,\nusing open-source 
software\n([Splink](https://moj-analytical-services.github.io/splink/index.html))\nand simulated data (from [pseudopeople](https://pseudopeople.readthedocs.io/en/latest/)).\nIt is based on public descriptions of the [Person Identification Validation System](https://www.census.gov/about/adrm/linkage/projects/pvs.html),\nwhich is one of the Census Bureau's primary person linkage pipelines.\nThe case study runs at multiple scales, including at full-USA scale --\nhundreds of billions of record comparisons.\nThis presents a realistic test case for assessing the\ncomputational performance and accuracy of improvements to record linkage methods.\nThe case study also serves as a concrete example of the sorts of methods\nthe Census Bureau uses to link files, and may be of interest to those who would\nlike to understand those methods better.\n\n## Quickstart\n\nYou can run the case study at small scale on your laptop in just a few minutes.\nFirst, install conda if you don't have it already --\nwe recommend [Miniforge](https://github.com/conda-forge/miniforge).\n\nClone this repository to your computer and enter this directory:\n\n```console\n$ git clone https://github.com/ihmeuw/person_linkage_case_study\n$ cd person_linkage_case_study\n```\n\nInstall dependencies with conda and pip:\n\n```console\n$ conda create --name person_linkage_case_study --file conda.lock.txt\n$ conda activate person_linkage_case_study\n(person_linkage_case_study) $ pip install -r pip.lock.txt\n(person_linkage_case_study) $ pip install -e .\n```\n\nIf the `conda.lock.txt` line doesn't work because you aren't on Linux or for some other reason, you can roughly recreate\nthe necessary environment with `conda create --name person_linkage_case_study python=3.10`.\nIf the `pip.lock.txt` line doesn't work for some reason, it can be skipped to\napproximate the environment.\n\nNow, you can run the case study like so:\n\n```console\n(person_linkage_case_study) $ snakemake --forceall\n```\n\nThis will run the case 
study on about 10,000 rows of sample data.\n\nThe `diagnostics` folder will be updated with diagnostics about the linkage models\nyou ran (in the `small_sample` subfolders).\nSpecifically, take a look at\n[`diagnostics/executed_notebooks/small_sample/04_calculate_ground_truth_accuracy.ipynb`](./diagnostics/executed_notebooks/small_sample/04_calculate_ground_truth_accuracy.ipynb)\nfor information on how accurate the linkage was.\nIn addition, [`benchmarks/benchmark-small_sample.txt`](./benchmarks/benchmark-small_sample.txt)\nwill be updated with computational performance information about the core linking part\nof the case study.\nThese two files are the most important outputs for evaluating new methods:\ncan a methods improvement increase the accuracy, or reduce the runtime/resources without\nsubstantially decreasing the accuracy?\n\n**Note: In actual use, these questions should be investigated at full scale; this case study\ndoes not attempt to realistically represent how the Census Bureau would link smaller files.\nThe smaller scales are included for experimentation, getting started, and testing.**\n\n## Dask and Spark\n\nThe small-scale run in the previous section did not use any parallel or distributed processing,\nwhich will be needed in order to scale up.\nYou can set up the technologies needed and test them on the small-scale data.\n\nCreate a file called `overrides.yaml` in the `config/` directory. 
Give it the following contents:\n\n```yaml\npapermill_params:\n  small_sample:\n    all:\n      compute_engine: dask_local\n    link_datasets:\n      splink_engine: spark\n      spark_local: True\n```\n\nThese overrides say to use Dask and Spark locally (on your computer).\n[Dask](https://www.dask.org/) is used by the case study itself and\n[Spark](https://spark.apache.org/) is used within Splink.\n\nYou will need to install [Singularity](https://docs.sylabs.io/guides/latest/user-guide/) to run this, because Spark cannot\nbe installed via conda.\n\nYou also need to install additional Python packages, like so:\n\n```console\n$ pip install -r pip.lock-dask.txt\n$ pip install -r pip.lock-spark.txt\n$ pip install -e .[dask,spark]\n```\n\nNow, run `snakemake --forceall` to re-run the entire case study using these settings.\nYou will see a lot more output this time, but you should get the same result.\n\n**Note: Both Dask and Spark can spill data to disk. As you run at larger scales,\nyou will want to make sure you have both enough RAM and enough empty disk space.**\n\n## Distributed Dask\n\nIn the previous section, Dask ran entirely on a single computer, which puts limits\non how many resources it can use: as many as you have on one machine.\nYou can run Dask across many computers in a cluster like this:\n\n```yaml\npapermill_params:\n  small_sample:\n    all:\n      compute_engine: dask\n      compute_engine_scheduler: slurm\n      queue: \u003cyour Slurm partition\u003e\n      account: \u003cyour Slurm account\u003e\n      walltime: 1-00:00:00 # 1 day\n```\n\nThis is powered by [dask_jobqueue](https://jobqueue.dask.org/en/latest/), which supports a number of schedulers besides\nSlurm.\nYou will likely also want to configure the resources requested; see [`config/defaults.yaml`](./config/defaults.yaml)\nfor examples of this.\n\nYou will now need access to your job scheduler from inside the Singularity image for Spark.\nAn example container for this purpose, which works 
on IHME's Slurm cluster, is included in\nthis repository. To use this example, run\n\n```console\n$ singularity build --fakeroot spark_slurm_container/spark.sif spark_slurm_container/Singularity\n```\n\nand then add `custom_spark_container_path: spark_slurm_container/spark.sif` to the top\nlevel of your configuration YAML.\n\n## Distributed Spark\n\nThe same reasoning goes for Spark as for Dask: you may want to run it across multiple machines\nto utilize more compute resources.\n\nUnfortunately, a flexible library like dask_jobqueue doesn't exist for Spark.\nCurrently, **this will only work on a Slurm cluster**.\nSupport for more schedulers may be added in the future.\nThe configuration looks as follows:\n\n```yaml\npapermill_params:\n  small_sample:\n    all:\n      scheduler: slurm\n      queue: \u003cyour Slurm partition\u003e\n      account: \u003cyour Slurm account\u003e\n      walltime: 1-00:00:00 # 1 day\n    link_datasets:\n      splink_engine: spark\n      spark_local: False\n```\n\n## Distributed Dask *and* Spark\n\nYou can use any combination of local or distributed Dask or Spark.\nFor example, to run both distributed, your configuration would look like this:\n\n```yaml\npapermill_params:\n  small_sample:\n    all:\n      compute_engine: dask\n      scheduler: slurm\n      queue: \u003cyour Slurm partition\u003e\n      account: \u003cyour Slurm account\u003e\n      walltime: 1-00:00:00 # 1 day\n    link_datasets:\n      splink_engine: spark\n      spark_local: False\n```\n\n## Scaling up\n\nThere are three scales to choose from:\n\n- `small_sample`, which is the default of about 10,000 simulated people\n- `ri`, which uses the simulated state of Rhode Island (about 1 million simulated people)\n- `usa`, which uses an entire simulated USA population (about 330 million simulated people)\n\n**Note: as mentioned above, the smaller-scale options are intended for experimentation, setup, and testing.\nThe USA scale is the only one that attempts to realistically 
emulate Census Bureau processes.**\n\nTo move up from `small_sample`, you'll need access to a larger simulated population.\nInstructions to request access can be found [in the pseudopeople documentation](https://pseudopeople.readthedocs.io/en/latest/simulated_populations/index.html).\nOnce you have downloaded and unzipped your simulated population, you can configure\nthe case study to use it like so:\n\n```yaml\ndata_to_use: ri\npapermill_params:\n  ri:\n    generate_pseudopeople_simulated_datasets:\n      ri_simulated_population: /path/to/your/simulated/population\n```\n\nThe default configuration contains defaults adapted to each scale, which request\nroughly the amount of resources you will need.\nYou can also override these, of course.\n\n## Making methods changes\n\nThe primary purpose of the case study is to be used in evaluating changes in linkage methods.\nOnce you have the case study running, the next step is to tweak something and\nsee what effect that has on accuracy and/or runtime.\nYou'll want to make edits only in `03_link_datasets.ipynb`, as this is the part of the\ncase study that simulates the linkage process; the other notebooks are setup and evaluation.\nWhen you edit the notebook and run `snakemake`, Snakemake should recognize that it only needs\nto re-run the linkage and the accuracy assessment, skipping the redundant setup.\n\n## What's next?\n\nWe plan to iterate on this person linkage case study over time to improve its realism,\nusability, and computational efficiency.\nPlease feel free to open an issue if you have ideas or need help!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fihmeuw%2Fperson_linkage_case_study","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fihmeuw%2Fperson_linkage_case_study","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fihmeuw%2Fperson_linkage_case_study/lists"}