{"id":21905040,"url":"https://github.com/textcorpuslabs/wikimedia","last_synced_at":"2025-03-22T07:14:53.573Z","repository":{"id":174805654,"uuid":"285336669","full_name":"TextCorpusLabs/wikimedia","owner":"TextCorpusLabs","description":"Walk through to convert WikiMedia into a text corpus","archived":false,"fork":false,"pushed_at":"2023-01-26T20:31:39.000Z","size":81,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-27T07:27:30.667Z","etag":null,"topics":["python3","text-corpus","wikimedia"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TextCorpusLabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-05T15:50:40.000Z","updated_at":"2023-01-22T15:04:32.000Z","dependencies_parsed_at":null,"dependency_job_id":"ac25c5d5-ae7c-4b79-ae78-2858c48bbb82","html_url":"https://github.com/TextCorpusLabs/wikimedia","commit_stats":null,"previous_names":["textcorpuslabs/wikimedia"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fwikimedia","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fwikimedia/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fwikimedia/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TextCorpusLabs%2Fwikimedia/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/TextCorpusLabs","download_url":"https://codeload.github.com/TextCorpusLabs/wikimedia/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244918710,"owners_count":20531686,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python3","text-corpus","wikimedia"],"created_at":"2024-11-28T16:20:24.380Z","updated_at":"2025-03-22T07:14:53.546Z","avatar_url":"https://github.com/TextCorpusLabs.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wikimedia To Text Corpus\n\n![Python](https://img.shields.io/badge/python-3.x-blue.svg)\n![MIT license](https://img.shields.io/badge/License-MIT-green.svg)\n![Last Updated](https://img.shields.io/badge/Last%20Updated-2023.01.22-success.svg)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3975690.svg)](https://doi.org/10.5281/zenodo.3975690)\n\n[Wikimedia](https://www.wikimedia.org/) is the driving force behind [Wikipedia](https://www.wikipedia.org/).\nThey provide a monthly full backup of all the data on Wikipedia as well as their properties.\nThe purpose of this repo is to convert the Wikimedia dump from the given format into the text corpus format we use.\nI.E.\n\n* The full corpus consisting of one or more TXT files in a single folder\n* One or more articles in a single TXT file\n* Each article will have a header in the form \"--- {id} ---\"\n* Each article will have its abstract and body extracted\n* One sentence per line\n* Paragraphs are separated by a blank line\n\n# Operation\n\n## Install\n\nYou can 
install the package using the following steps:\n\nInstall with `pip` from an _admin_ prompt.\n\n```{ps1}\npip uninstall wikimedia\npython -OO -m pip install -v git+https://github.com/TextCorpusLabs/wikimedia.git\n```\n\nor, if you have the code locally:\n\n```{ps1}\npip uninstall wikimedia\npython -OO -m pip install -v c:/repos/TextCorpusLabs/wikimedia\n```\n\n## Run\n\nYou are responsible for getting the source files.\nThey can be found at this [site](https://dumps.wikimedia.org/backup-index.html).\nYou will need to navigate further into the particular wiki you want to download.\n\nYou are responsible for decompressing and validating the source files.\nI recommend using [7zip](https://www.7-zip.org/).\nI installed my copy using [Chocolatey](https://community.chocolatey.org/packages/7zip).\n\nYou are responsible for these steps because each dump is a single **MASSIVE** file.\nSometimes Wikimedia will be busy and the download will be slow.\nModern browsers support resume for exactly this case.\nAs of 2023/01/22, it is over 90 GB in _.xml_ form.\nYou must make sure you have enough space before you start.\n\nAll the commands below assume the corpus is an extracted _.xml_ file.\n\n1. Extract the metadata from the corpus.\n\n```{ps1}\nwikimedia metadata -source d:/data/wiki/enwiki.xml -dest d:/data/wiki/enwiki.meta.csv\n```\n\nThe following are required parameters:\n\n* `source` is the _.xml_ file sourced from Wikimedia.\n* `dest` is the CSV file used to store the metadata.\n\nThe following are optional parameters:\n\n* `log` is the folder that receives the raw XML chunks that failed to process.\n  It defaults to empty (not saved).\n\n2. 
Convert the data to our standard format.\n\n```{ps1}\nwikimedia convert -source d:/data/wiki/enwiki.xml -dest d:/data/wiki.std\n```\n\nThe following are required parameters:\n\n* `source` is the _.xml_ file sourced from Wikimedia.\n* `dest` is the folder for the converted TXT files.\n\nThe following are optional parameters:\n\n* `lines` is the number of lines per TXT file.\n  The default is 1000000.\n* `dest_pattern` is the format of the TXT file name.\n  It defaults to `wikimedia.{id:04}.txt`.\n  `id` is a counter that increments each time `lines` lines have been stored in a file.\n* `log` is the folder that receives the raw XML chunks that failed to process.\n  It defaults to empty (not saved).\n\n## Debug/Test\n\nThe code in this repo is set up as a module.\n[Debugging](https://code.visualstudio.com/docs/python/debugging#_module) and [testing](https://code.visualstudio.com/docs/python/testing) are based on the assumption that the module is already installed.\nIn order to debug (F5) or run the tests (Ctrl + ; Ctrl + A), make sure to install the module as editable (see below).\n\n```{ps1}\npip uninstall wikimedia\npython -m pip install -e c:/repos/TextCorpusLabs/wikimedia\n```\n\n# Academic boilerplate\n\nBelow is the suggested text to add to the \"Methods and Materials\" section of your paper when using this _process_.\nThe references can be found [here](./references.bib).\n\n\u003e The 2022/10/01 English version of Wikipedia [@wikipedia2020] was downloaded using Wikimedia's download service [@wikimedia2020].\n\u003e The single-file data dump was then converted to a corpus of plain text articles using the process described in [@wikicorpus2020].\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fwikimedia","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftextcorpuslabs%2Fwikimedia","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftextcorpuslabs%2Fwikimedia/lists"}