https://github.com/thjbdvlt/solipcysme
spaCy pipeline for French, focused on personal pronouns, fiction, and first-person point-of-view texts.
french french-nlp lemmatization morphological-analysis natural-language-processing nlp nlp-french normalization part-of-speech-tagging pos-tagging spacy spacy-extensions tokenization word-embeddings
- Host: GitHub
- URL: https://github.com/thjbdvlt/solipcysme
- Owner: thjbdvlt
- License: gpl-3.0
- Created: 2024-09-18T12:21:41.000Z (about 1 year ago)
- Default Branch: sea
- Last Pushed: 2025-05-07T08:14:58.000Z (5 months ago)
- Last Synced: 2025-07-12T00:03:12.377Z (3 months ago)
- Topics: french, french-nlp, lemmatization, morphological-analysis, natural-language-processing, nlp, nlp-french, normalization, part-of-speech-tagging, pos-tagging, spacy, spacy-extensions, tokenization, word-embeddings
- Language: Python
- Homepage:
- Size: 974 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: COPYING
README
solipCysme
==========

[spaCy](https://spacy.io/) [pipeline](https://spacy.io/usage/processing-pipelines) for French fiction and first-person point-of-view texts (with a focus on personal pronouns, moods and tenses), mostly trained on novels.

| Feature | Description |
| --- | --- |
| __Language__ | French |
| __Name__ | `fr_solipcysme` |
| __Default Pipeline__ | `jusqucy_tokenizer`, `commecy_normalizer`, `jusqucy_normalizer`, `pretagger_hunspell`, `morphologizer`, `viceverser_lemmatizer`, `parser` |
| __Components__ | [jusqucy_tokenizer](https://github.com/thjbdvlt/jusquci), [jusqucy_normalizer](https://github.com/thjbdvlt/jusquci), [commecy_normalizer](https://github.com/thjbdvlt/commecy), `morphologizer`, [viceverser_lemmatizer](https://github.com/thjbdvlt/spacy-viceverser), `parser` |
| __Sources__ | Corpus [narraFEATS](https://github.com/thjbdvlt/corpus-narraFEATS) (morphologizer), [Universal Dependencies](https://universaldependencies.org/fr/) (parser), [french-word-vectors](https://github.com/thjbdvlt/french-word-vectors) (vectors) |
| __License__ | [GPL](https://www.gnu.org/licenses/gpl-3.0.html) |
| __Author__ | [thjbdvlt](https://github.com/thjbdvlt) |

installation
------------

```bash
# Main pipeline
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_lg-0.2.6-py3-none-any.whl

# Faster, less accurate, smaller model
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_sm-0.2.6-py3-none-any.whl
```

usage
-----

```python
import spacy

nlp = spacy.load("fr_solipcysme_sm")
doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

for i in doc:
    print(
        i.norm_,  # commecy_normalizer / jusqucy_normalizer
        i.pos_,  # morphologizer
        i.morph,  # morphologizer
        i.lemma_,  # viceverser_lemmatizer
        i.dep_,  # parser
        i.head,  # parser
        i.sent_start,  # jusqucy_tokenizer
        i._.ttype,  # jusqucy_tokenizer
        i._.isword,  # jusqucy_tokenizer
    )

print(
    # these attributes are not especially useful on their own;
    # they are mostly used to make the morphologizer more accurate.
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,  # pretagger_hunspell
    doc._.hunspell_is,  # pretagger_hunspell
)
```

components and architectures
------------

solipCysme is not only a *trained pipeline* but also a set of minimal pipeline components and model architectures that can be used independently.

### SolipcysmeMultiHashed
a modified [MultiHashEmbed](https://spacy.io/api/architectures#MultiHashEmbed) that makes it possible to use `Doc` underscore attributes as features. The value of an attribute must be a `list` of `int`, and must have the same length as the `Doc` itself.
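
To make the constraint on attribute values concrete, here is a minimal sketch in plain Python. It does not use solipCysme or spaCy: `check_doc_feature` is a hypothetical helper, the whitespace split stands in for a tokenized `Doc`, and the token-length feature is only an illustration. With spaCy, such a list would typically be stored on a custom extension registered with `Doc.set_extension`.

```python
def check_doc_feature(values, n_tokens):
    """Raise if `values` is not a list of int with one entry per token."""
    if not isinstance(values, list):
        raise TypeError("feature must be a list")
    if len(values) != n_tokens:
        raise ValueError(f"expected {n_tokens} values, got {len(values)}")
    if not all(isinstance(v, int) for v in values):
        raise TypeError("every value must be an int")
    return values


# Stand-in for a tokenized Doc: plain whitespace tokens (assumption of this sketch).
tokens = "je ne sais pas où je suis".split()

# Hypothetical feature: the length of each token, one int per token.
token_lengths = check_doc_feature([len(t) for t in tokens], len(tokens))
print(token_lengths)
```

A list that is shorter or longer than the document fails the check, which mirrors the requirement stated above.
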
### SolipcysmeCharEmbed
a modified [CharacterEmbed](https://spacy.io/api/architectures#CharacterEmbed) that makes it possible to use underscore attributes as features and that replaces `nC` (number of characters) with `nCstart` and `nCend`, so that one can choose an asymmetric representation of words (e.g., for French, to use only the suffix, with `nCstart = 0` and `nCend = 6`).
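
The idea of an asymmetric character window can be illustrated with a small sketch. This is not the architecture's actual code: `char_window` is a hypothetical function, and the padding character and the handling of short words are assumptions made for the example.

```python
def char_window(word, n_start, n_end, pad="_"):
    """Return the first `n_start` and last `n_end` characters of `word`.

    Words shorter than the window are padded with `pad`, so every
    result has exactly n_start + n_end characters.
    """
    prefix = word[:n_start].ljust(n_start, pad)
    # word[-0:] would return the whole word, so handle n_end == 0 explicitly.
    suffix = word[-n_end:].rjust(n_end, pad) if n_end > 0 else ""
    return prefix + suffix


# Suffix-only window, as in the French example (nCstart = 0, nCend = 6).
for word in ["déraillée", "brouiller", "temps"]:
    print(word, "->", char_window(word, n_start=0, n_end=6))
```

With `n_start=0`, only word endings are kept, which is where most French inflectional morphology appears.
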
### pretagger_hunspell
a component that makes Hunspell morphological analysis available as *features* for the `SolipcysmeMultiHashed` or `SolipcysmeCharEmbed` architectures.
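
As a rough illustration of the kind of information involved, the sketch below extracts the `po:` and `is:` fields from a Hunspell-style analysis string. It is not the component's implementation: `parse_hunspell_analysis` is a hypothetical helper, and the sample string and its tag values are invented for the example.

```python
def parse_hunspell_analysis(analysis):
    """Collect the `po:` and `is:` fields of one Hunspell analysis string."""
    fields = {}
    for item in analysis.split():
        key, sep, value = item.partition(":")
        # Keep only the two field types used as features; skip e.g. `st:`.
        if sep and key in ("po", "is"):
            fields.setdefault(key, []).append(value)
    return fields


# Invented analysis string, in the "key:value" layout Hunspell uses.
print(parse_hunspell_analysis("st:brouiller po:v1 is:inf"))
```
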

limits and specificities
------

- only knows about straight apostrophes (`'`) and quotes (`"`).
- the morphologizer depends on the `jusqucy_tokenizer`, because this tokenizer sets the value of a doc extension (`Doc._.jusqucy_ttypes`) that the morphologizer uses.
- the morphologizer also depends on the `pretagger_hunspell` component, because it uses the output of Hunspell as token features (`po:` and `is:` features).
- no `Gender` feature.

license
------

this work is released under the [GPL](https://www.gnu.org/licenses/gpl-3.0.html) license (v3).