{"id":17174972,"url":"https://github.com/corneliusroemer/desh-data","last_synced_at":"2025-04-13T16:23:30.517Z","repository":{"id":45241286,"uuid":"441703685","full_name":"corneliusroemer/desh-data","owner":"corneliusroemer","description":"Sequence lineage information extracted from RKI sequence data repo","archived":false,"fork":false,"pushed_at":"2022-05-17T05:30:43.000Z","size":530890,"stargazers_count":24,"open_issues_count":1,"forks_count":3,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-27T07:21:19.231Z","etag":null,"topics":["dataset","germany","lineages","pangolin","robert-koch-institut","sars-cov-2","sequencing"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"unlicense","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/corneliusroemer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-12-25T15:12:40.000Z","updated_at":"2024-01-11T19:14:26.000Z","dependencies_parsed_at":"2022-09-10T22:31:50.814Z","dependency_job_id":null,"html_url":"https://github.com/corneliusroemer/desh-data","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/corneliusroemer%2Fdesh-data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/corneliusroemer%2Fdesh-data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/corneliusroemer%2Fdesh-data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/corneliusroemer%2Fdesh-data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/corn
eliusroemer","download_url":"https://codeload.github.com/corneliusroemer/desh-data/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248742181,"owners_count":21154448,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","germany","lineages","pangolin","robert-koch-institut","sars-cov-2","sequencing"],"created_at":"2024-10-14T23:55:23.727Z","updated_at":"2025-04-13T16:23:30.494Z","avatar_url":"https://github.com/corneliusroemer.png","language":"Jupyter Notebook","readme":"# Pango lineage information for German SARS-CoV-2 sequences\n\nThis repository contains a join of the metadata and pango lineage tables of all German SARS-CoV-2 sequences published by the Robert-Koch-Institut on [GitHub](https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland).\n\nThe data here is updated automatically every hour through a GitHub Action, so whenever new data appears in the RKI repo, it shows up here within an hour at most.\n\nThe resulting dataset can be downloaded here (beware, it is currently around 50 MB in size): \u003chttps://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv\u003e\n\n## Omicron share plot\n\nType `N` means representative surveillance. 
Type `X` means unknown; since such samples are unlikely to be heavily targeted and come from quite a number of labs, I now include them in the main plot (hence type `NX`).\n\n![Omicron Logit Plot](plots/omicron_N_logit.png)\n\n![Omicron Linear Plot](plots/omicron_N_linear.png)\n\n![Omicron share by zip code area](plots/omi_share_by_area.png)\n\n## Description of data\n\nColumn description:\n\n- IMS_ID: Unique identifier of the sequence\n- DATE_DRAW: Date the sample was taken from the patient\n- SEQ_REASON: Reason for sequencing, one of:\n  - X: Unknown\n  - N: Random sampling\n  - Y: Targeted sequencing (exact reason unknown)\n  - A[\\\u003creason\\\u003e]: Targeted sequencing because variant PCR indicated VOC\n- PROCESSING_DATE: Date the sample was processed by the RKI and added to the GitHub repo\n- SENDING_LAB_PC: Postcode (PLZ) of the lab that did the initial PCR\n- SEQUENCING_LAB_PC: Postcode (PLZ) of the lab that did the sequencing\n- lineage: Pango lineage as reported by `pangolin`\n- scorpio_call: Alternative, coarser variant call as determined by `scorpio` (part of `pangolin`); it is less precise but a bit more robust than the `pangolin` lineage.\n\n## Excerpt\n\nHere are the first lines of the dataset (a header row followed by 12 records).\n\n```csv\nIMS_ID,DATE_DRAW,SEQ_REASON,PROCESSING_DATE,SENDING_LAB_PC,SEQUENCING_LAB_PC,lineage,scorpio_call\nIMS-10294-CVDP-00001,2021-01-14,X,2021-01-25,40225,40225,B.1.1.297,\nIMS-10025-CVDP-00001,2021-01-17,N,2021-01-26,10409,10409,B.1.389,\nIMS-10025-CVDP-00002,2021-01-17,N,2021-01-26,10409,10409,B.1.258,\nIMS-10025-CVDP-00003,2021-01-17,N,2021-01-26,10409,10409,B.1.177.86,\nIMS-10025-CVDP-00004,2021-01-17,N,2021-01-26,10409,10409,B.1.389,\nIMS-10025-CVDP-00005,2021-01-18,N,2021-01-26,10409,10409,B.1.160,\nIMS-10025-CVDP-00006,2021-01-17,N,2021-01-26,10409,10409,B.1.1.297,\nIMS-10025-CVDP-00007,2021-01-18,N,2021-01-26,10409,10409,B.1.177.81,\nIMS-10025-CVDP-00008,2021-01-18,N,2021-01-26,10409,10409,B.1.177,\nIMS-10025-CVDP-00009,2021-01-18,N,2021-01-26,10409,10409,B.1.1.7,Alpha 
(B.1.1.7-like)\nIMS-10025-CVDP-00010,2021-01-17,N,2021-01-26,10409,10409,B.1.1.7,Alpha (B.1.1.7-like)\nIMS-10025-CVDP-00011,2021-01-17,N,2021-01-26,10409,10409,B.1.389,\n```\n\n## Suggested import into pandas\n\nYou can import the data into pandas as follows:\n\n```python\n#%%\nimport pandas as pd\n\n#%%\n# Read the CSV straight from GitHub, using IMS_ID (first column) as the index\ndf = pd.read_csv(\n    'https://raw.githubusercontent.com/corneliusroemer/desh-data/main/data/meta_lineages.csv',\n    index_col=0,\n    # Parse both date columns; referenced by name rather than by position\n    parse_dates=['DATE_DRAW', 'PROCESSING_DATE'],\n    cache_dates=True,\n    # Categorical dtypes keep memory usage low for these repetitive columns\n    dtype = {'SEQ_REASON': 'category',\n             'SENDING_LAB_PC': 'category',\n             'SEQUENCING_LAB_PC': 'category',\n             'lineage': 'category',\n             'scorpio_call': 'category'\n             }\n)\n#%%\n# Rename columns to shorter, lowercase names\ndf.rename(columns={\n    'DATE_DRAW': 'date',\n    'PROCESSING_DATE': 'processing_date',\n    'SEQ_REASON': 'reason',\n    'SENDING_LAB_PC': 'sending_pc',\n    'SEQUENCING_LAB_PC': 'sequencing_pc',\n    'lineage': 'lineage',\n    'scorpio_call': 'scorpio'\n    },\n    inplace=True\n)\ndf\n```\n\n## License\n\nThe underlying files that I use as input are licensed by RKI under CC-BY 4.0; see more details here: \u003chttps://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland#lizenz\u003e.\n\nThe software here is licensed under the \"Unlicense\". You can do with it whatever you want.\n\nFor the data, just cite the original source; there is no need to cite this repo since it's just a trivial join.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcorneliusroemer%2Fdesh-data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcorneliusroemer%2Fdesh-data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcorneliusroemer%2Fdesh-data/lists"}