{"id":23421191,"url":"https://github.com/nhsdigital/asc_la_peer_groups","last_synced_at":"2025-10-27T06:07:26.863Z","repository":{"id":217281749,"uuid":"741994515","full_name":"NHSDigital/ASC_LA_Peer_Groups","owner":"NHSDigital","description":"Calculates statistical neighbours (aka peers) for Local Authorities in England.","archived":false,"fork":false,"pushed_at":"2024-09-12T11:21:10.000Z","size":67,"stargazers_count":15,"open_issues_count":1,"forks_count":5,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-26T08:37:30.769Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NHSDigital.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-11T14:42:59.000Z","updated_at":"2025-03-19T11:44:17.000Z","dependencies_parsed_at":"2024-09-12T18:44:36.384Z","dependency_job_id":null,"html_url":"https://github.com/NHSDigital/ASC_LA_Peer_Groups","commit_stats":null,"previous_names":["nhsdigital/asc_la_peer_groups"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NHSDigital%2FASC_LA_Peer_Groups","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NHSDigital%2FASC_LA_Peer_Groups/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NHSDigital%2FASC_LA_Peer_Groups/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NHSDigital%2FASC_LA_Peer_Groups/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/NHSDigital","download_url":"https://codeload.github.com/NHSDigital/ASC_LA_Peer_Groups/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248578860,"owners_count":21127713,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-23T02:14:22.343Z","updated_at":"2025-10-27T06:07:26.846Z","avatar_url":"https://github.com/NHSDigital.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Peer Groups for Local Authorities\r\n\r\nCalculates statistical neighbours (aka peers) for Local Authorities in England, for use in Adult Social Care statistics.\r\n\r\n## Contact\r\n\r\nThis repository is maintained by the NHS England Adult Social Care Statistics Team.\r\n\r\n\u003e To contact us raise an issue on Github or via email at [socialcare.statistics@nhs.net](mailto:socialcare.statistics@nhs.net).\r\n\u003e See our (and our colleagues') other work here: [NHS England Analytical Services](https://github.com/NHSDigital/data-analytics-services).\r\n\r\n## Description\r\n\r\nThis repository was developed by the Data Science team for the Adult Social Care Statistics team, to provide a way of comparing statistics between 'similar' Local Authorities.\r\n\r\nWe have calculated a metric of similarity (Euclidean distance) based on standardised, normalised input features from Census 2021 data, including population demographics such as age, ethnicity and educational attainment.\r\n\r\n## Setup\r\n\r\n* This project was developed using Python 
3.10.5\r\n* Required Python libraries are listed in `requirements.txt`\r\n* _Optional:_ Python libraries used for linting are included in `dev_requirements.txt`. See the [developing the pipeline](#developing-the-pipeline) section for more details about linting configuration.\r\n\r\n### Set up a virtual environment\r\n\r\nClone this project and ensure you're in the root directory, ASC_LA_Peer_Groups. You can change your current directory in the terminal e.g.\r\n\r\n```bash\r\ncd ASC_LA_Peer_Groups\r\n```\r\n\r\nSet up a virtual environment and install requirements:\r\n\r\n```bash\r\npy -m venv .venv\r\n.venv\\Scripts\\activate\r\npip install -r requirements.txt\r\n\r\n```\r\n\r\n## Getting started\r\n\r\n### Configuring the pipeline\r\n\r\nThe configuration for the pipeline is defined in `config.toml`. If you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions etc., make the required edits in `config.toml`.\r\n\r\nThere are five sections:\r\n\r\n1. `[LOCATION]` - locations used by the pipeline. This includes where the input data is stored, as well as where logs and outputs are saved. The locations of your input and output directories need to be set in `config.toml`.\r\n    \r\n    \u003cdetails\u003e\r\n            \u003csummary\u003eNotes on file paths\u003c/summary\u003e\r\n\r\n \r\n\r\n    * Data files should be downloaded and stored to the location specified in `config.toml`. This must be outside of this repository, and can be a shared drive location\r\n    * Also note that file paths should contain **forward** slashes e.g. \"C:/Users/username/Documents/data\"\r\n\r\n\r\n    \u003c/details\u003e\r\n2. `[LOCAL_AUTHORITY]` - defines the local authority codes to use\r\n3. `[MODEL_OUTPUT]` - defines changeable characteristics of the output. 
Currently this only includes `n_peers`, which limits the number of nearest neighbours output by the model.\r\n\r\n    \u003cdetails\u003e\r\n        \u003csummary\u003eModel output defaults\u003c/summary\u003e\r\n\r\n    ```toml\r\n    n_peers = 15\r\n    ```\r\n\r\n    \u003c/details\u003e\r\n\r\n4. `[FEATURE_WEIGHTS]` - lists features along with their associated weights. Note that a weight of zero reduces the effect of the feature to zero and thereby excludes it completely.\r\n\r\n    \u003cdetails\u003e\r\n        \u003csummary\u003eFeature weight defaults\u003c/summary\u003e\r\n\r\n    ```toml\r\n    \"Over 15 Population\" = 1\r\n    \"85 and over Population %\" = 1\r\n    \"Aged 65 to 84 Population %\" = 0\r\n    \"black african %\" = 0.5\r\n    \"black caribbean %\" = 0.5\r\n    \"bangladeshi %\" = 0.5\r\n    \"indian %\" = 0.5\r\n    \"chinese %\" = 0.5\r\n    \"pakistani %\" = 0.5\r\n    \"mixed %\" = 0\r\n    \"white %\" = 0\r\n    \"home_owners %\" = 0\r\n    \"social_renters %\" = 1\r\n    \"student %\" = 1\r\n    \"routine_manual %\" = 0\r\n    \"low_english_proficiency %\" = 1\r\n    \"People per square km\" = 1\r\n    \"higher_level_qualifications %\" = 1\r\n    \"few_rooms %\" = 0\r\n    \"Distance to Sea (km)\" = 0.5\r\n    \"Sparsity (% population living in low density areas)\" = 1\r\n    ```\r\n\r\n    \u003c/details\u003e\r\n\r\n5. `[REMOVE_LAS]` - lists UTLAs to be excluded from analysis. For example [\"Isles of Scilly\", \"City of London\"]. Please use the name of the Local Authority, using the `la_name` field defined in `[LOCAL_AUTHORITY]` above.\r\n    \u003cdetails\u003e\r\n        \u003csummary\u003eDefault removed UTLAs\u003c/summary\u003eNHS England removes Isles of Scilly and City of London from the default model. 
I.e.:\r\n\r\n    ```toml\r\n    las_to_remove = [\"Isles of Scilly\", \"City of London\"]\r\n    ```\r\n\r\n    \u003c/details\u003e\r\n\r\n### Data\r\n\r\nData files should be downloaded and stored to the location specified in `config.toml`. This must be outside of this repository, and can be a shared drive location.\r\n\r\n1. Download ten CSV files and save them to the input location specified in the `config.toml`. The names of the files and their sources are provided below. Some files are only available as part of a collection, in which case the source is listed as a zip file containing more than one csv. Where this is the case, download and extract the zip file saving the file version which ends 'lsoa'.\r\n\r\n\r\n| Save as file name | Source | Details |\r\n| ----------------- | ------ | ------- |\r\n| area_sqkm.csv | https://geoportal.statistics.gov.uk/datasets/a488cb8fc9a74accb63cb52961e456ef/about | Click the Download button at the top of the page. Within the subfolder \"Measurements\", rename the file \"SAM_LSOA_DEC_2021_EW_in_KM\" to \"area_sqkm.csv\" | \r\n| distance_to_sea.csv | https://digital.nhs.uk/supplementary-information/2024/distance-to-sea-calculations | Download the csv data file and rename to distance_to_sea.csv |\r\n| english_proficiency.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts029.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"english_proficiency\" |\r\n| ethnicity.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts021.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"ethnicity\" |\r\n| housing_tenure.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts054.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"housing_tenure\" |\r\n| ns-sec.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts062.zip | Download the zip folder. 
Rename the csv ending in \"lsoa\" to \"ns-sec\" |\r\n| population_data.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts007a.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"population_data\" |\r\n| qualification_level.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts067.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"qualification_level\" |\r\n| rooms.csv | https://www.nomisweb.co.uk/output/census/2021/census2021-ts051.zip | Download the zip folder. Rename the csv ending in \"lsoa\" to \"rooms\" |\r\n| LSOA21_to_UTLA22.csv | https://www.data.gov.uk/dataset/14d8efd0-b14c-46ac-b2fe-a7892ea51ca5/lsoa-2021-to-utlas-december-2022-best-fit-lookup-in-ew-v2 | Under \"Data links\", click the CSV hyperlink to download the file. Rename this file to LSOA21_to_UTLA22 |\r\n\r\n**A note on lookups:** The final CSV file listed above, `LSOA21_to_UTLA22.csv`, maps LSOAs to UTLAs (local authorities).\r\n    \r\n    E.g.:\r\n\r\n    | LSOA21_CODE | LSOA21_NAME       | UTLA_CODE | UTLA_NAME     |\r\n    |-------------|-------------------|-----------|---------------|\r\n    | E01012052   |Middlesbrough 014D | E06000002 | Middlesbrough |\r\n\r\n\r\n### Running the pipeline\r\n\r\n\u003e **NOTE:** Please edit the `LOCATION` section in `config.toml` before running the pipeline.\r\n\r\nOnce you've set up the virtual environment in the previous steps, activate it by running `.venv\\Scripts\\activate` in the terminal.\r\n\r\nOnce you've activated your virtual environment, run the following command from the terminal:\r\n\r\n```bash\r\npython main.py\r\n```\r\n\r\nIf you want to adjust the weights of any of the input features (including adding or removing features), change the UTLA definitions etc., make the required edits in `config.toml`.\r\n\r\n**_(Optional)_ Adding a custom hash:**\r\n\r\nTo make your pipeline run easier to identify, it is possible to pass a custom hash to 
name your pipeline. This means log names and your output pipeline folder name will include the hash.\r\n\r\nThe hash length is set in `config.toml`- if you supply a shorter hash this is fine, but be aware that a longer hash will be cropped to the first n characters using the hash length.\r\n\r\n```bash\r\npython main.py --hash my_run\r\n```\r\n\r\nWhere `my_run` is the custom hash you have supplied.\r\n\r\n### Outputs\r\n\r\nThis pipeline produces the following as _final_ outputs, saved to the `outputs` directory:\r\n\r\n* `features.csv` - The final features used to produce the distances\r\n* `distances.csv` - Distance between each pair of local authorities\r\n* `peers.csv` - N most similar peers for each local authority (n defined in `config.toml`)\r\n* `example_peers.csv` - The above but limited to a subset of local authorities specified in `src/params.py`\r\n\r\nReports to accompany these outputs, including details of correlation between features and feature distributions, are saved to the `reports` directory.\r\n\r\nFinal outputs and reports are saved to a pipeline folder saved in the output directory defined in `config.toml`. The name of each pipeline folder corresponds to the time the pipeline was initialised, and any custom hash that was provided.\r\n\r\nInterim data processing produces files saved to the `data/` directory- these are NOT copied to the pipeline output location.\r\n\r\n## Updating the data/lookup files\r\n\r\nNew data and lookups can be added easily to the pipeline. All new data and lookups should be stored in the input directory, as specified in the config file as `input_dir`.\r\n\r\nFirst check that the format of the new/updated file matches the old one (see the earlier Data section for links). Move the new file into the input directory (you may want to archive the old file). \r\n\r\nIn `src/params.py`, check the \"Data File Names\" section, and ensure the name of the replaced file matches the corresponding file name in the params. 
Further to this, check the column values in the “Columns” section of params for the feature you have changed, and ensure these match the columns within the new data.\r\n\r\nIf updating the lookup file, open `config.toml` and check that `la_code` and `la_name` point to the correct columns in the new lookup.\r\n\r\n### Example: updating the LSOA to UTLA lookup\r\n\r\nAs of 2024 the latest lookup can be found here: https://www.data.gov.uk/dataset/801d40f6-fa98-40ef-ba16-0193ef04cff0/lsoa-2021-to-utlas-april-2023-best-fit-lookup-in-ew\r\n\r\nCopy the new lookup to the input directory, ensure it has a unique name (e.g. LSOA21_to_UTLA23) and that it is saved as a CSV. Ensure the new lookup has no blank space above the headers, and make a note of the header names.\r\n\r\nNavigate to `src/params.py`, go to the \"Data File Names\" section and change the name of the `LSOA_UTLA_lookup_file` to the new lookup file e.g.\r\n```\r\nLSOA_UTLA_lookup_file = \"LSOA21_to_UTLA23.csv\"\r\n```\r\n\r\nIf the LSOA code column name in the new lookup has changed, you will also need to update `LSOA_code` (in the \"Pathway Parameters\" section) to point to the correct column name.\r\n\r\nNavigate to `config.toml` and ensure the `la_code` and `la_name` match the names of the relevant columns in your new lookup e.g.\r\n```\r\nla_code = \"UTLA23CD\"\r\nla_name = \"UTLA23NM\"\r\n```\r\n\r\nYou can now run the pipeline with the updated lookup.\r\n\r\n### Boundary changes\r\n\r\nIf there are boundary changes, the LSOA area file (`area_sqkm.csv`) will need updating (if there is an available update), along with the LSOA to UTLA lookup. See the above section on how to update the data/lookup. The code will then use the new boundary definitions when calculating the Euclidean distances between local authorities.\r\n\r\n\r\n## Project structure\r\n\r\n```text\r\n| .gitignore                \u003c- ignores data and virtual environment files\r\n| config.toml               \u003c- options for modelling, e.g. 
output location, k etc.\r\n| requirements.txt          \u003c- python libraries required\r\n| dev_requirements.txt      \u003c- python libraries required for development (optional, includes linting libraries)\r\n| LICENSE                   \u003c- license info for public distribution\r\n|\r\n+---reports                 \u003c- This is a placeholder which the pipeline populates with report outputs (e.g. histograms showing feature distributions)\r\n|\r\n+---output                 \u003c- This is a placeholder which the pipeline populates with output data\r\n|\r\n| main.py                   \u003c- Runs the pipeline\r\n|\r\n+---data                    \u003c- This is a placeholder which the pipeline populates with data\r\n|   +---raw\r\n|   +---interim\r\n|   +---primary\r\n|\r\n+--- src                    \u003c- Scripts with functions used in main.py\r\n|   |   __init__.py         \u003c- Makes the scripts importable python modules\r\n|   |   params.py           \u003c- configures column names, file paths etc.\r\n|   |   load.py             \u003c- Copies input files from location specified in config\r\n|   |   clean.py            \u003c- Cleans input data to LSOA level\r\n|   |   process.py          \u003c- Aggregates cleaned data to UTLA level\r\n|   |   model.py            \u003c- Calculates distance metric\r\n|   |   report.py           \u003c- Produces accompanying reports e.g. 
correlation\r\n|   |   utils.py            \u003c- Useful functions used across modules\r\n|\r\n```\r\n\r\n## Developing the pipeline\r\n\r\n_(Optional)_ Install dev requirements:\r\n\r\n\r\n```bash\r\npip install -r dev_requirements.txt\r\n```\r\n\r\nYou can also run the testing suite once these requirements have been installed:\r\n\r\n```bash\r\npytest\r\n```\r\n\r\n## Contributors\r\n\r\nThis codebase was originally developed by data scientists at NHS England: [Harriet Sands](https://github.com/harrietrs) and [Will Poulett](https://github.com/willpoulett), with help from the Adult Social Care Team at NHS England. \r\n\r\n## Licence\r\n\r\nThis codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation.\r\n\r\nAny HTML or Markdown documentation is [© Crown copyright](https://www.nationalarchives.gov.uk/information-management/re-using-public-sector-information/uk-government-licensing-framework/crown-copyright/) and available under the terms of the [Open Government 3.0 licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnhsdigital%2Fasc_la_peer_groups","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnhsdigital%2Fasc_la_peer_groups","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnhsdigital%2Fasc_la_peer_groups/lists"}