https://github.com/danieldk/nl-lemmatizer-ext
Dutch lemmatizer extensions
- Host: GitHub
- URL: https://github.com/danieldk/nl-lemmatizer-ext
- Owner: danieldk
- Created: 2023-11-26T08:53:32.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-17T14:07:06.000Z (over 1 year ago)
- Last Synced: 2025-02-08T21:23:32.438Z (3 months ago)
- Language: Python
- Size: 23.4 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# `nl-lemmatizer-ext`
## Introduction
This package contains extensions for Dutch lemmatization.
Pipelines:

- `gigant_lemmatizer`: a pipe that uses the GiGaNT-Molex lexicon for
  lemmatization.

CLI commands:

- `nl-lemmatizer-util convert`: convert GiGaNT-Molex TSV file to a
  JSON lexicon for the `gigant_lemmatizer` pipe.
- `nl-lemmatizer-util extend-model`: add the `gigant_lemmatizer` pipe
  to an existing pipeline.

## Adding the `gigant_lemmatizer` pipe to a pipeline
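Conceptually, a lexicon-based lemmatizer can be pictured as a lookup from a word form and its part of speech to a lemma. The sketch below illustrates that idea only; it is not this package's implementation. The three-column sample data, the `form|pos` key format, the function names `build_lexicon` and `lemmatize`, and the fallback to the lowercased form are all assumptions made for illustration, and the real GiGaNT-Molex TSV and the JSON lexicon produced by `nl-lemmatizer-util convert` may be laid out differently.

```python
import csv
import io
import json

# Hypothetical three-column layout (lemma, part of speech, word form).
# The real GiGaNT-Molex TSV has its own column layout.
SAMPLE_TSV = "lopen\tVERB\tliep\nlopen\tVERB\tloopt\nhuis\tNOUN\thuizen\n"


def build_lexicon(tsv_text):
    """Map "form|pos" string keys to lemmas so the result is JSON-serializable."""
    lexicon = {}
    for lemma, pos, form in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        lexicon[f"{form.lower()}|{pos}"] = lemma
    return lexicon


def lemmatize(lexicon, form, pos):
    """Return the lexicon lemma, or the lowercased form when there is no entry.

    The fallback is an assumption of this sketch, not documented behavior.
    """
    return lexicon.get(f"{form.lower()}|{pos}", form.lower())


# Round-trip through JSON to mimic storing the lexicon in a file.
lexicon = json.loads(json.dumps(build_lexicon(SAMPLE_TSV)))

print(lemmatize(lexicon, "huizen", "NOUN"))
print(lemmatize(lexicon, "Liep", "VERB"))
print(lemmatize(lexicon, "fiets", "NOUN"))
```

Keying on both the form and the part of speech lets one spelling map to different lemmas depending on its tag, which is the usual reason to place a lexicon lookup after a tagger in a pipeline.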
Install this package and get the
[GiGaNT-Molex](https://taalmaterialen.ivdnt.org/download/tstc-gigant-molex-c/)
dataset from Instituut voor de Nederlandse Taal.

First convert the tab-separated file from the dataset:
```shell
nl-lemmatizer-util convert molex_22_02_2022.tsv/molex_22_02_2022.tsv gigant-molex.json
```

Then add the `gigant_lemmatizer` pipe to an existing pipeline:
```shell
nl-lemmatizer-util extend-pipeline nl_core_news_lg gigant-molex.json nl_core_news_gigant
```

## Authors
This package was developed by Daniël de Kok (Biaffine) and Jeroen van de
Nieuwenhof (Tolkie).