https://github.com/gambolputty/wiktionary-de-parser
Extract data from German Wiktionary XML files.
- Host: GitHub
- URL: https://github.com/gambolputty/wiktionary-de-parser
- Owner: gambolputty
- License: mit
- Created: 2019-04-10T19:57:39.000Z (almost 7 years ago)
- Default Branch: main
- Last Pushed: 2025-11-16T13:48:42.000Z (4 months ago)
- Last Synced: 2025-11-16T14:41:26.470Z (4 months ago)
- Topics: data-extraction, dewiktionary, german, german-language, nlp, wiktionary, wiktionary-dump, wiktionary-parser
- Language: Python
- Homepage:
- Size: 416 KB
- Stars: 26
- Watchers: 3
- Forks: 8
- Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/FUNDING.yml
- License: LICENSE.txt
# wiktionary-de-parser
A Python module to extract data from German Wiktionary XML files (for Python 3.11+).
## Features
- Extracts _IPA transcriptions_, _hyphenation_, _language_, basic _part of speech_ information, _genus_ (grammatical gender) and _flexion_ (inflection) tables of a word.
- Yields results per entry, not per page: a page can contain multiple entries, because a word can have different meanings.
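
To illustrate why one page can produce several entries, here is a minimal sketch that splits page wikitext at level-3 headings, which is where German Wiktionary pages usually start a new entry. It is not the parser's own splitting logic; `split_entries` and the abridged sample text are hypothetical, and the sketch ignores level-2 language headings for brevity.

```python
import re

# A level-3 heading: exactly three "=" on each side ("====" is excluded).
ENTRY_HEADING = re.compile(r"^===(?!=)\s*(.+?)\s*===\s*$", re.MULTILINE)


def split_entries(wikitext: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, one per level-3 heading (hypothetical helper)."""
    matches = list(ENTRY_HEADING.finditer(wikitext))
    entries = []
    for i, match in enumerate(matches):
        # The body runs until the next level-3 heading or the end of the page.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        entries.append((match.group(1), wikitext[match.end():end].strip()))
    return entries


# Abridged, made-up page text with two entries for the same word.
SAMPLE = """== Abend ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}}, {{m}} ===
{{Worttrennung}}
:Abend
==== Übersetzungen ====
=== {{Wortart|Substantiv|Deutsch}}, {{Wortart|Nachname|Deutsch}} ===
{{Worttrennung}}
:Abend
"""

entries = split_entries(SAMPLE)
for heading, _body in entries:
    print(heading)
```

Each pair corresponds to one entry; the level-4 heading stays inside the body of the entry it belongs to.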
## Installation
`pip install wiktionary-de-parser`
Or with [Poetry](https://python-poetry.org/):
`poetry add wiktionary-de-parser`
## Usage
### Loading the XML dump file
```python
from wiktionary_de_parser import WiktionaryParser
from wiktionary_de_parser.dump_processor import WiktionaryDump
# To download the dump file, specify the directory where the
# dump file should be stored.
dump = WiktionaryDump(dump_dir_path="directory-of-dump-file")
# This will download "dewiktionary-latest-pages-articles-multistream.xml.bz2" to
# the directory specified in `dump_dir_path`.
dump.download_dump()
# Alternatively you can specify a different dump file to download.
dump = WiktionaryDump(
    dump_dir_path="directory-of-dump-file",
    dump_download_url="url-to-dump-file.xml.bz2",
)
dump.download_dump()
# If you already have the dump file locally, specify the path to the file.
dump = WiktionaryDump(dump_file_path="path-to-dump-file.xml.bz2")
dump.download_dump()
```
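
For orientation, the dump is a bzip2-compressed MediaWiki XML export. The sketch below streams pages from such a file using only the standard library. It is not how `WiktionaryDump` is implemented: `RawPage`, `iter_pages`, `local_name` and the sample XML are made up for illustration, and only the attribute names `name` and `redirect_to` are borrowed from the usage example in this README.

```python
import bz2
import io
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple, Optional


class RawPage(NamedTuple):
    # Hypothetical stand-in, not the library's page class.
    name: str
    redirect_to: Optional[str]
    wikitext: str


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream) -> Iterator[RawPage]:
    """Stream <page> elements from a MediaWiki XML export, one at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        name, redirect_to, wikitext = "", None, ""
        for child in elem.iter():
            tag = local_name(child.tag)
            if tag == "title":
                name = child.text or ""
            elif tag == "redirect":
                redirect_to = child.get("title")
            elif tag == "text":
                wikitext = child.text or ""
        yield RawPage(name, redirect_to, wikitext)
        elem.clear()  # release the children of pages already processed


# Tiny made-up export, compressed in memory so the sketch runs without a download.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Abend</title><revision><text>== Abend ==</text></revision></page>
  <page><title>abend</title><redirect title="Abend" /><revision><text>#REDIRECT [[Abend]]</text></revision></page>
</mediawiki>"""

with bz2.open(io.BytesIO(bz2.compress(SAMPLE)), "rb") as stream:
    pages = list(iter_pages(stream))

for page in pages:
    print(page.name, page.redirect_to)
```

For a real dump, pass the path of the `.xml.bz2` file to `bz2.open` instead of the in-memory buffer.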
### Parsing the dump file
```python
from pprint import pprint
from wiktionary_de_parser import WiktionaryParser
# ... (see above)
parser = WiktionaryParser()
for page in dump.pages():
    # Skip redirects
    if page.redirect_to:
        continue

    if page.name == "Abend":
        # Parse all entries for "Abend"
        for entry in parser.entries_from_page(page):
            results = parser.parse_entry(entry)
            pprint(results)
        break
```
## Output
All page entries for "Abend":
```python
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion={
        "Genus": "m",
        "Nominativ Singular": "Abend",
        "Nominativ Plural": "Abende",
        "Genitiv Singular": "Abends",
        "Genitiv Plural": "Abende",
        "Dativ Singular": "Abend",
        "Dativ Plural": "Abenden",
        "Akkusativ Singular": "Abend",
        "Akkusativ Plural": "Abende",
    },
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=),
    pos={"Substantiv": []},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=),
    pos={"Substantiv": ["Nachname"]},
    rhymes=["aːbn̩t"],
)
ParsedWiktionaryPageEntry(
    name="Abend",
    hyphenation=["Abend"],
    flexion=None,
    ipa=["ˈaːbn̩t", "ˈaːbm̩t"],
    language=Language(lang="Deutsch", lang_code="de"),
    lemma=Lemma(lemma="Abend", reference_type=),
    pos={"Substantiv": ["Toponym"]},
    rhymes=["aːbn̩t"],
)
```
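
As a sketch of how results shaped like the ones above might be consumed, the example below selects the common-noun entry and collects its distinct inflected forms. `EntryStub` and `inflected_forms` are hypothetical: the stub only mirrors a few of the fields shown in the output so that the example runs without the library or a dump file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EntryStub:
    # Hypothetical stand-in with a subset of the fields shown in the output.
    name: str
    flexion: Optional[dict[str, str]]
    pos: dict[str, list[str]]


def inflected_forms(entry: EntryStub) -> set[str]:
    """Collect the distinct word forms from an entry's flexion table."""
    if entry.flexion is None:
        return set()
    # "Genus" holds the grammatical gender, not a word form.
    return {form for key, form in entry.flexion.items() if key != "Genus"}


entries = [
    EntryStub(
        name="Abend",
        flexion={
            "Genus": "m",
            "Nominativ Singular": "Abend",
            "Nominativ Plural": "Abende",
            "Genitiv Singular": "Abends",
            "Genitiv Plural": "Abende",
            "Dativ Singular": "Abend",
            "Dativ Plural": "Abenden",
            "Akkusativ Singular": "Abend",
            "Akkusativ Plural": "Abende",
        },
        pos={"Substantiv": []},
    ),
    EntryStub(name="Abend", flexion=None, pos={"Substantiv": ["Nachname"]}),
    EntryStub(name="Abend", flexion=None, pos={"Substantiv": ["Toponym"]}),
]

# Keep nouns without a sub-category such as "Nachname" or "Toponym".
common_nouns = [e for e in entries if e.pos.get("Substantiv") == []]
forms = sorted(inflected_forms(common_nouns[0]))
print(forms)
```

Checking `flexion` for `None` matters because, as the output shows, not every entry carries a flexion table.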
## Development
This project uses [Poetry](https://python-poetry.org/).
1. Install [Poetry](https://python-poetry.org/).
2. Clone this repository.
3. Run `poetry install` inside the project folder to install dependencies.
4. Use `notebook.ipynb` to try out the parser.
5. Run `poetry run pytest` to run tests.
## License
[MIT](https://github.com/gambolputty/wiktionary-de-parser/blob/main/LICENSE.txt) © Gregor Weichbrodt