{"id":15662199,"url":"https://github.com/macbre/mediawiki-dump","last_synced_at":"2025-08-20T01:31:16.556Z","repository":{"id":36981040,"uuid":"153651912","full_name":"macbre/mediawiki-dump","owner":"macbre","description":"Python package for working with MediaWiki XML content dumps","archived":false,"fork":false,"pushed_at":"2024-10-30T03:03:33.000Z","size":292,"stargazers_count":23,"open_issues_count":6,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-30T04:18:53.007Z","etag":null,"topics":["fandom","mediawiki-dump","python","python3-library","wikia","wikipedia","wikipedia-corpus","wikipedia-dump","xml-dump"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/mediawiki_dump/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/macbre.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-18T16:08:48.000Z","updated_at":"2024-10-25T03:50:55.000Z","dependencies_parsed_at":"2024-11-16T06:06:18.008Z","dependency_job_id":"e5cdc8df-9789-4767-8677-4f55da9fab61","html_url":"https://github.com/macbre/mediawiki-dump","commit_stats":{"total_commits":310,"total_committers":5,"mean_commits":62.0,"dds":"0.49677419354838714","last_synced_commit":"517962cee764e755d43485e5e6d4072fbdc5f0db"},"previous_names":[],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/macbre%2Fmediawiki-dump","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/macbre%2Fmediawiki-dump/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/macbre%2Fmediawiki-dump/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/macbre%2Fmediawiki-dump/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/macbre","download_url":"https://codeload.github.com/macbre/mediawiki-dump/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229982103,"owners_count":18154512,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fandom","mediawiki-dump","python","python3-library","wikia","wikipedia","wikipedia-corpus","wikipedia-dump","xml-dump"],"created_at":"2024-10-03T13:30:39.381Z","updated_at":"2024-12-19T05:07:17.747Z","avatar_url":"https://github.com/macbre.png","language":"Python","readme":"# mediawiki-dump\n[![PyPI](https://img.shields.io/pypi/v/mediawiki_dump.svg)](https://pypi.python.org/pypi/mediawiki_dump)\n[![Downloads](https://pepy.tech/badge/mediawiki_dump)](https://pepy.tech/project/mediawiki_dump)\n[![CI](https://github.com/macbre/mediawiki-dump/actions/workflows/tests.yml/badge.svg)](https://github.com/macbre/mediawiki-dump/actions/workflows/tests.yml)\n[![Coverage Status](https://coveralls.io/repos/github/macbre/mediawiki-dump/badge.svg?branch=master)](https://coveralls.io/github/macbre/mediawiki-dump?branch=master)\n\n```\npip install mediawiki_dump\n```\n\n[Python3 package](https://pypi.org/project/mediawiki_dump/) for working with [MediaWiki XML content dumps](https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)).\n\n[Wikipedia](https://dumps.wikimedia.org/) (bz2 compressed) and [Wikia](https://community.fandom.com/wiki/Help:Database_download) (7zip) content dumps are supported.\n\n## Dependencies\n\nIn order to read 7zip archives (used by Wikia's XML dumps) you need to install [`libarchive`](http://libarchive.org/):\n\n```\nsudo apt install libarchive-dev\n```\n\n## API\n\n### Tokenizer\n\nAllows you to clean up the wikitext:\n\n```python\nfrom mediawiki_dump.tokenizer import clean\nclean('[[Foo|bar]] is a link')\n'bar is a link'\n```\n\nAnd then tokenize the text:\n\n```python\nfrom mediawiki_dump.tokenizer import tokenize\ntokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')\n['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']\n```\n\n### Dump reader\n\nFetch and parse dumps (using a local file cache):\n\n```python\nfrom mediawiki_dump.dumps import WikipediaDump\nfrom mediawiki_dump.reader import DumpReader\n\ndump = WikipediaDump('fo')\npages = DumpReader().read(dump)\n\n[page.title for page in pages][:10]\n\n['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']\n```\n\n`read` method yields the `DumpEntry` object for each revision.\n\nBy using `DumpReaderArticles` class you can read article pages only:\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikipediaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikipediaDump('fo')\nreader = DumpReaderArticles()\npages = reader.read(dump)\n\nprint([page.title for page in pages][:25])\n\nprint(reader.get_dump_language())  # fo\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing XML dump...\nINFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...\nINFO:WikipediaDump:Fetching fo dump from \u003chttps://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2\u003e...\nINFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)\nINFO:WikipediaDump:Cache set\n...\n['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']\n```\n\n## Reading Wikia's dumps\n\n ```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikiaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikiaDump('plnordycka')\npages = DumpReaderArticles().read(dump)\n\nprint([page.title for page in pages][:25])\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing XML dump...\nINFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...\nINFO:WikiaDump:Fetching plnordycka dump from \u003chttps://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z\u003e...\nINFO:WikiaDump:HTTP 200 (129 kB will be fetched)\nINFO:WikiaDump:Cache set\nINFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump\n...\nINFO:DumpReaderArticles:Parsing completed, entries found: 615\n['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']\n```\n\n## Fetching full history\n\nPass `full_history` to `BaseDump` constructor to fetch the XML content dump with full history:\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikiaDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions\npages = DumpReaderArticles().read(dump)\n\nprint('\\n'.join([repr(page) for page in pages]))\n```\n\nWill give you:\n\n```\nINFO:DumpReaderArticles:Parsing completed, entries found: 384\n\u003cDumpEntry \"Macbre Wiki\" by Default at 2016-10-12T19:51:06+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by Wikia at 2016-10-12T19:51:05+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by Macbre at 2016-11-04T10:33:20+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by FandomBot at 2016-11-04T10:37:17+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by FandomBot at 2017-01-25T14:47:37+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by Ryba777 at 2017-04-10T11:20:25+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by Ryba777 at 2017-04-10T11:21:20+00:00\u003e\n\u003cDumpEntry \"Macbre Wiki\" by Macbre at 2018-03-07T12:51:12+00:00\u003e\n\u003cDumpEntry \"Main Page\" by Wikia at 2016-10-12T19:51:05+00:00\u003e\n\u003cDumpEntry \"FooBar\" by Anonymous at 2016-11-08T10:15:33+00:00\u003e\n\u003cDumpEntry \"FooBar\" by Anonymous at 2016-11-08T10:15:49+00:00\u003e\n...\n\u003cDumpEntry \"YouTube tag\" by FANDOMbot at 2018-06-05T11:45:44+00:00\u003e\n\u003cDumpEntry \"Maps\" by Macbre at 2018-06-06T08:51:24+00:00\u003e\n\u003cDumpEntry \"Maps\" by Macbre at 2018-06-07T08:17:13+00:00\u003e\n\u003cDumpEntry \"Maps\" by Macbre at 2018-06-07T08:17:36+00:00\u003e\n\u003cDumpEntry \"Scary transclusion\" by Macbre at 2018-07-24T14:52:20+00:00\u003e\n\u003cDumpEntry \"Lua\" by Macbre at 2018-09-11T14:04:15+00:00\u003e\n\u003cDumpEntry \"Lua\" by Macbre at 2018-09-11T14:14:24+00:00\u003e\n\u003cDumpEntry \"Lua\" by Macbre at 2018-09-11T14:14:37+00:00\u003e\n```\n\n## Reading dumps of selected articles\n\nYou can use [`mwclient` Python library](https://mwclient.readthedocs.io/en/latest/index.html)\nand fetch \"live\" dumps of selected articles from any MediaWiki-powered site.\n\n```python\nimport mwclient\nsite = mwclient.Site('vim.fandom.com', path='/')\n\nfrom mediawiki_dump.dumps import MediaWikiClientDump\nfrom mediawiki_dump.reader import DumpReaderArticles\n\ndump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])\n\npages = DumpReaderArticles().read(dump)\n\nprint('\\n'.join([repr(page) for page in pages]))\n```\n\nWill give you:\n\n```\n\u003cDumpEntry \"Vim documentation\" by Anonymous at 2019-07-05T09:39:47+00:00\u003e\n\u003cDumpEntry \"Tutorial\" by Anonymous at 2019-07-05T09:41:19+00:00\u003e\n```\n\n## Finding pages with a specific [parser tag](https://www.mediawiki.org/wiki/Manual:Tag_extensions)\n\nLet's find pages where no longer supported `\u003cplace\u003e` tag is still used:\n\n```python\nimport logging; logging.basicConfig(level=logging.INFO)\n\nfrom mediawiki_dump.dumps import WikiaDump\nfrom mediawiki_dump.reader import DumpReader\n\ndump = WikiaDump('plpoznan')\npages = DumpReader().read(dump)\n\nwith_places_tag = [\n    page.title\n    for page in pages\n    if '\u003cplace ' in page.content\n]\n\nlogging.info('Pages found: %d', len(with_places_tag))\n\nwith open(\"pages.txt\", mode=\"wt\", encoding=\"utf-8\") as fp:\n    for entry in with_places_tag:\n        fp.write(entry + \"\\n\")\n\nlogging.info(\"pages.txt file created\")\n```\n\n## Reading dumps from local files\n\nYou can also read dumps from local, non-compressed XML files:\n\n```python\nfrom mediawiki_dump.dumps import LocalFileDump\nfrom mediawiki_dump.reader import DumpReader\n\ndump = LocalFileDump(dump_file=\"test/fixtures/dump.xml\")\nreader = DumpReader()\n\npages = [entry.title for entry in reader.read(dump)]\nprint(dump, pages)\n```\n\n## Reading dumps from compressed local files\n\nOr any other iterators (like HTTP responses):\n\n```python\nimport bz2\n\nfrom mediawiki_dump.dumps import IteratorDump\nfrom mediawiki_dump.reader import DumpReader\n\ndef get_content(file_name: str):\n    with bz2.open(file_name, mode=\"r\") as fp:\n        yield from fp\n\ndump = IteratorDump(iterator=get_content(file_name=\"test/fixtures/dump.xml.bz2\"))\nreader = DumpReader()\n\npages = [entry.title for entry in reader.read(dump)]\nprint(dump, pages)\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmacbre%2Fmediawiki-dump","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmacbre%2Fmediawiki-dump","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmacbre%2Fmediawiki-dump/lists"}