{"id":21905045,"url":"https://github.com/textcorpuslabs/covid19","last_synced_at":"2026-05-10T19:42:57.207Z","repository":{"id":174805308,"uuid":"248488503","full_name":"TextCorpusLabs/covid19","owner":"TextCorpusLabs","description":"Walk through to convert Kaggle's COVID-19 Open Research Dataset Challenge into a text corpus","archived":false,"fork":false,"pushed_at":"2020-03-23T21:39:54.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-27T07:27:29.376Z","etag":null,"topics":["covid-19","python3","text-corpus"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TextCorpusLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-19T11:42:47.000Z","updated_at":"2021-06-14T19:47:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"4e3d7549-2f99-4e23-849c-daa07255f78d","html_url":"https://github.com/TextCorpusLabs/covid19","commit_stats":null,"previous_names":["textcorpuslabs/covid19"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fcovid19","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fcovid19/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fcovid19/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fcovid19/manifests","owner_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TextCorpusLabs","download_url":"https://codeload.github.com/TextCorpusLabs/covid19/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244918709,"owners_count":20531686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["covid-19","python3","text-corpus"],"created_at":"2024-11-28T16:20:34.549Z","updated_at":"2026-05-10T19:42:52.170Z","avatar_url":"https://github.com/TextCorpusLabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# COVID-19 To Text Corpus\n\n[Kaggle](https://www.kaggle.com/) has provided an excelent [data source](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) for the COVID-19 courtesy of [AI2](https://allenai.org/)\nThe purpose of this repo is to convert it from the given format into the normal text corpus format.\nI.E. 
one document per file, one sentence per line, and a blank line between paragraphs.\n\n# Prerequisites\n\nThe following packages need to be installed.\nI recommend using [Chocolatey](https://chocolatey.org/install).\n\n* [7-zip](https://www.7-zip.org/)\n* [Python](https://www.python.org/downloads/)\n\n  \n```{ps1}\nif('Unrestricted' -ne (Get-ExecutionPolicy)) { Set-ExecutionPolicy Bypass -Scope Process -Force }\niex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))\nrefreshenv\n\nchoco install 7zip.install -y\nchoco install python3 -y\n```\n\n# Modules\n\nAll scripts have been tested on Python 3.8.2.\nThe modules below are needed to run the scripts.\nThe scripts were tested on the noted versions, so YMMV.\n**Note**: not all modules are required for all scripts.\nIf this is the first time running the scripts, the modules will need to be installed.\nThey can be installed by navigating to the `~/code` folder, then running the code below.\n\n* nltk 3.4.5\n* progressbar2 3.47.0\n\n```{shell}\npip install -r requirments.txt\npython -c \"import nltk;nltk.download('punkt')\"\n```\n\n# Steps\n\nThe steps below describe how to recreate the text corpus.\nThey assume that a particular path structure will be used, but the commands can be modified to target a different directory structure without changing the code.\nI am choosing the `d:/covid19` directory because my d drive is big enough to hold everything.\n\n1. Clone this repo, then open a shell to the `~/code` directory.\n2. Retrieve the dataset _by hand_.\n   Click on the [download](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/download) link, saving the file to a known location.   \n3. Extract the data in-place with no folder structure.\n   * The `e` switch flattens the extract so the custom code does not need to recursively search the folder structure.\n```{shell}\n\"C:/Program Files/7-Zip/7z.exe\" e -od:/covid19/raw \"d:/covid19/*.zip\"\n```\n4. 
[Extract](./code/extract_metadata.py) the meta-data.\n   This will create a single `metadata.csv` containing some useful information.\n   In general this would be used as part of segmentation or as part of a MANOVA.\n```{shell}\npython extract_metadata.py -in d:/covid19/raw -out d:/covid19/metadata.csv\n```\n5. [Convert](./code/convert_to_corpus.py) the raw JSON files into the normal folder corpus format.\n   This will create a text corpus folder at the given location, i.e. `./corpus`, containing two subfolders, one for the abstract and one for the body.\n   Some of the files provided by Kaggle are not full-text articles, i.e. they have an empty abstract or body.\n   These _incomplete_ files are filtered out of the final folders and noted in `error.csv`.\n```{shell}\npython convert_to_corpus.py -in d:/covid19/raw -out d:/covid19/corpus\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fcovid19","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftextcorpuslabs%2Fcovid19","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fcovid19/lists"}