Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thjbdvlt/spacy-french-parser
syntactic dependency parser for french using spacy
french nlp nlp-french spacy spacy-parser syntactic-dependency-parsing universal-dependencies
- Host: GitHub
- URL: https://github.com/thjbdvlt/spacy-french-parser
- Owner: thjbdvlt
- License: other
- Created: 2024-08-26T07:57:18.000Z (2 months ago)
- Default Branch: sea
- Last Pushed: 2024-08-27T08:21:19.000Z (2 months ago)
- Last Synced: 2024-10-10T11:42:48.267Z (26 days ago)
- Topics: french, nlp, nlp-french, spacy, spacy-parser, syntactic-dependency-parsing, universal-dependencies
- Language: Python
- Homepage:
- Size: 1.07 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
__syntactic dependency parser__ for french with [spacy](https://spacy.io/api/).
this repository comprises scripts that fetch and prepare data to train a [syntactic dependencies](https://universaldependencies.org/u/dep/) parser with [spacy](https://spacy.io/api/) for the french language, along with a configuration file and a script to train it. the __model__ itself is available under __releases__.
the training data is an aggregation of three [UD](https://universaldependencies.org/) datasets, with some minor changes applied to them.
in the datasets i used, the word _du_ is split into its logical components _de_ and _le_: a text like _on parle du ciel_ becomes _on parle de le ciel_ in the `.conllu` files. but in the texts i have to analyze, _du_ isn't split at all, so i need to unsplit it. thus the following:
```conllu
11-12 du ... _ _ _ _
11 de ... 19 case _ _
12 le ... 11 det _ _
```

is transformed into:
```conllu
11 du ... 19 case:det _ _
```

on top of that, some labels are replaced by others, and sentences containing certain labels (such as `dep`, which indicates that the parsing failed) are removed. for a list of replaced or removed labels, refer to the file [lookup_labels.txt](./lookup_labels.txt).
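the unsplit step above can be sketched in a few lines of python. this is a minimal illustration of the idea, not the repository's actual script: it assumes a simplified four-column layout (`ID`, `FORM`, `HEAD`, `DEPREL`) and ignores the renumbering of later token IDs that a real `.conllu` rewrite would also need.

```python
def unsplit_contractions(lines):
    """Merge a multiword-token range line (e.g. '11-12 du') and its two
    sub-token lines back into one token, keeping the head of the first
    sub-token and joining the two deprels (e.g. 'case:det')."""
    out = []
    i = 0
    while i < len(lines):
        cols = lines[i].split("\t")
        if "-" in cols[0]:  # range line such as '11-12'
            first = lines[i + 1].split("\t")
            second = lines[i + 2].split("\t")
            merged = [first[0], cols[1], first[2], first[3] + ":" + second[3]]
            out.append("\t".join(merged))
            i += 3  # skip the range line and both sub-tokens
        else:
            out.append(lines[i])
            i += 1
    return out

block = [
    "11-12\tdu\t_\t_",
    "11\tde\t19\tcase",
    "12\tle\t11\tdet",
]
print(unsplit_contractions(block))  # ['11\tdu\t19\tcase:det']
```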
usage
-----

the __parser__ is not a full pipeline: you have to source it from another pipeline as a component:
```python3
import spacy

# load your main pipeline
nlp = spacy.load('fr_core_news_sm', exclude=['parser'])

# load the model containing the parser
nlp_deps = spacy.load('./model', exclude=['tokenizer'])

# put the parser in the main pipeline
nlp.add_pipe('parser', source=nlp_deps)
```