https://github.com/macbre/mediawiki-dump
Python package for working with MediaWiki XML content dumps
- Host: GitHub
- URL: https://github.com/macbre/mediawiki-dump
- Owner: macbre
- License: MIT
- Created: 2018-10-18T16:08:48.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2024-10-30T03:03:33.000Z (6 months ago)
- Last Synced: 2024-10-30T04:18:53.007Z (6 months ago)
- Topics: fandom, mediawiki-dump, python, python3-library, wikia, wikipedia, wikipedia-corpus, wikipedia-dump, xml-dump
- Language: Python
- Homepage: https://pypi.org/project/mediawiki_dump/
- Size: 285 KB
- Stars: 23
- Watchers: 3
- Forks: 3
- Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE
# mediawiki-dump
[PyPI](https://pypi.python.org/pypi/mediawiki_dump)
[Downloads](https://pepy.tech/project/mediawiki_dump)
[Tests](https://github.com/macbre/mediawiki-dump/actions/workflows/tests.yml)
[Coverage](https://coveralls.io/github/macbre/mediawiki-dump?branch=master)

```
pip install mediawiki_dump
```

[Python3 package](https://pypi.org/project/mediawiki_dump/) for working with [MediaWiki XML content dumps](https://www.mediawiki.org/wiki/Manual:Backing_up_a_wiki#Backup_the_content_of_the_wiki_(XML_dump)).
[Wikipedia](https://dumps.wikimedia.org/) (bz2 compressed) and [Wikia](https://community.fandom.com/wiki/Help:Database_download) (7zip) content dumps are supported.
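Both flavours expose the same interface, so switching between them is a one-line change; a minimal sketch using the two wikis that appear later in this README:

```python
from mediawiki_dump.dumps import WikipediaDump, WikiaDump

fo_wikipedia = WikipediaDump('fo')        # bz2-compressed dump of the Faroese Wikipedia
nordycka_wikia = WikiaDump('plnordycka')  # 7zip-compressed dump of a Fandom/Wikia wiki
```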
## Dependencies
In order to read 7zip archives (used by Wikia's XML dumps) you need to install [`libarchive`](http://libarchive.org/):
```
sudo apt install libarchive-dev
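# optional sanity check: the libarchive Python bindings (assumed here to be libarchive-c)
# should import once the system library is installed
python3 -c "import libarchive"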
```

## API
### Tokenizer
Allows you to clean up the wikitext:
```python
from mediawiki_dump.tokenizer import clean
clean('[[Foo|bar]] is a link')
'bar is a link'
```

And then tokenize the text:
```python
from mediawiki_dump.tokenizer import tokenize
tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
```

### Dump reader
Fetch and parse dumps (using a local file cache):
```python
from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')
pages = DumpReader().read(dump)

[page.title for page in pages][:10]
['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']
```

The `read` method yields a `DumpEntry` object for each revision.
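For instance, the reader and the tokenizer can be combined to build a simple text corpus; a minimal sketch, assuming (as in the examples in this README) that each entry exposes its wikitext via `content`:

```python
from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader
from mediawiki_dump.tokenizer import clean, tokenize

dump = WikipediaDump('fo')

for entry in DumpReader().read(dump):
    # strip the wiki markup first, then split the plain text into tokens
    tokens = tokenize(clean(entry.content))
    print(entry.title, len(tokens))
```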
By using the `DumpReaderArticles` class you can read article pages only:
```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikipediaDump('fo')
reader = DumpReaderArticles()
pages = reader.read(dump)

print([page.title for page in pages][:25])
print(reader.get_dump_language()) # fo
```

Will give you:
```
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from ...
INFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
```

## Reading Wikia's dumps
```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('plnordycka')
pages = DumpReaderArticles().read(dump)

print([page.title for page in pages][:25])
```

Will give you:
```
INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from ...
INFO:WikiaDump:HTTP 200 (129 kB will be fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']
```

## Fetching full history
Pass `full_history` to the `BaseDump` constructor to fetch the XML content dump with full history:
```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))
```

Will give you:
```
INFO:DumpReaderArticles:Parsing completed, entries found: 384...
```
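When working with a full-history dump you often want just the newest revision of every page. A minimal sketch, assuming revisions of a page appear oldest-to-newest in the dump (the order used by standard MediaWiki XML exports):

```python
from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)

latest = {}
for entry in DumpReaderArticles().read(dump):
    # later revisions overwrite earlier ones, so the dict ends up
    # holding the newest content seen for each title
    latest[entry.title] = entry.content

print(len(latest), 'pages')
```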
## Reading dumps of selected articles
You can use the [`mwclient` Python library](https://mwclient.readthedocs.io/en/latest/index.html)
to fetch "live" dumps of selected articles from any MediaWiki-powered site.

```python
import mwclient
site = mwclient.Site('vim.fandom.com', path='/')

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles

dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])
pages = DumpReaderArticles().read(dump)
print('\n'.join([repr(page) for page in pages]))
```

Will give you:
```
```
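The list of titles does not have to be hard-coded; it can, for example, be built with `mwclient` itself. A sketch under the assumption that the wiki has a category named `Vim Tips` (a hypothetical name used purely for illustration):

```python
import mwclient

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles

site = mwclient.Site('vim.fandom.com', path='/')

# collect the titles of all pages in a category via the MediaWiki API
titles = [page.name for page in site.categories['Vim Tips']]

dump = MediaWikiClientDump(site, titles)
pages = DumpReaderArticles().read(dump)
```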
## Finding pages with a specific [parser tag](https://www.mediawiki.org/wiki/Manual:Tag_extensions)
Let's find pages where the no longer supported `<place>` tag is still used:
```python
import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReader

dump = WikiaDump('plpoznan')
pages = DumpReader().read(dump)

with_places_tag = [
    page.title
    for page in pages
    if '<place ' in page.content  # page.content holds the raw wikitext of the revision
]
```