https://github.com/thjbdvlt/solipcysme

spaCy pipeline for French, focused on personal pronouns, fiction, and first-person point-of-view texts.

Host: GitHub
URL: https://github.com/thjbdvlt/solipcysme
Owner: thjbdvlt
License: gpl-3.0
Created: 2024-09-18T12:21:41.000Z (about 1 year ago)
Default Branch: sea
Last Pushed: 2025-05-07T08:14:58.000Z (5 months ago)
Last Synced: 2025-07-12T00:03:12.377Z (3 months ago)
Topics: french, french-nlp, lemmatization, morphological-analysis, natural-language-processing, nlp, nlp-french, normalization, part-of-speech-tagging, pos-tagging, spacy, spacy-extensions, tokenization, word-embeddings
Language: Python
Homepage:
Size: 974 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: COPYING

README

solipCysme
==========

[spaCy](https://spacy.io/) [pipeline](https://spacy.io/usage/processing-pipelines) for French fiction or first-person point-of-view texts (with a focus on personal pronouns, moods and tenses), mostly trained on novels.

| Feature | Description |
| --- | --- |
| __Language__ | French |
| __Name__ | `fr_solipcysme` |
| __Default Pipeline__ | `jusqucy_tokenizer`, `commecy_normalizer`, `jusqucy_normalizer`, `pretagger_hunspell`, `morphologizer`, `viceverser_lemmatizer`, `parser` |
| __Components__ | [jusqucy_tokenizer](https://github.com/thjbdvlt/jusquci), [jusqucy_normalizer](https://github.com/thjbdvlt/jusquci), [commecy_normalizer](https://github.com/thjbdvlt/commecy), `morphologizer`, [viceverser_lemmatizer](https://github.com/thjbdvlt/spacy-viceverser), `parser` |
| __Sources__ | Corpus [narraFEATS](https://github.com/thjbdvlt/corpus-narraFEATS) (morphologizer), [Universal Dependencies](https://universaldependencies.org/fr/) (parser), [french-word-vectors](https://github.com/thjbdvlt/french-word-vectors) (vectors) |
| __License__ | [GPL](https://www.gnu.org/licenses/gpl-3.0.html) |
| __Author__ | [thjbdvlt](https://github.com/thjbdvlt) |

installation
------------

```bash
# Main pipeline
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_lg-0.2.6-py3-none-any.whl

# Faster, less accurate, smaller model
pip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_sm-0.2.6-py3-none-any.whl
```

usage
-----

```python
import spacy

nlp = spacy.load("fr_solipcysme_sm")
doc = nlp(
    "la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?"
)

# token-level attributes, with the component that sets each one
for token in doc:
    print(
        token.norm_,       # commecy_normalizer / jusqucy_normalizer
        token.pos_,        # morphologizer
        token.morph,       # morphologizer
        token.lemma_,      # viceverser_lemmatizer
        token.dep_,        # parser
        token.head,        # parser
        token.sent_start,  # jusqucy_tokenizer
        token._.ttype,     # jusqucy_tokenizer
        token._.isword,    # jusqucy_tokenizer
    )

print(
    # these doc-level attributes are not especially useful on their own;
    # they are mostly used to make the morphologizer more accurate.
    doc._.jusqucy_ttypes,  # jusqucy_tokenizer
    doc._.hunspell_po,     # pretagger_hunspell
    doc._.hunspell_is,     # pretagger_hunspell
)
```

components and architectures
----------------------------

solipCysme is not only a *trained pipeline* but also a set of minimal pipeline components and model architectures that can be used independently.

### SolipcysmeMultiHashed

a modified [MultiHashEmbed](https://spacy.io/api/architectures#MultiHashEmbed) that makes it possible to use `Doc` underscore attributes as features. The value of an attribute must be a `list` of `int`, and must have the same length as the `Doc` itself.
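The shape constraint above can be illustrated without spaCy. The following is a minimal sketch in plain Python, not the library's own code: `encode_feature` and the tag names are hypothetical. It maps one string tag per token to an integer ID and checks that the result has exactly one `int` per token, which is the shape expected of a `Doc` underscore attribute used as a feature.

```python
def encode_feature(tokens, tags, vocab=None):
    """Map one string tag per token to an integer ID.

    Hypothetical helper: returns a list of int aligned with `tokens`.
    `vocab` is assumed to have been built by this function only.
    """
    if len(tags) != len(tokens):
        raise ValueError(
            f"expected {len(tokens)} values (one per token), got {len(tags)}"
        )
    vocab = {} if vocab is None else vocab
    # IDs start at 1 so that 0 stays free for "unknown / no value".
    return [vocab.setdefault(tag, len(vocab) + 1) for tag in tags]


tokens = ["je", "ne", "sais", "pas", "."]
tags = ["word", "word", "word", "word", "punct"]  # illustrative tag names
feature = encode_feature(tokens, tags)
print(feature)
```

With spaCy, a list like this would typically be stored on the document through an extension registered with `Doc.set_extension`.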

### SolipcysmeCharEmbed

a modified [CharacterEmbed](https://spacy.io/api/architectures#CharacterEmbed) that makes it possible to use underscore attributes as features and that replaces `nC` (number of characters) with `nCstart` and `nCend`, so that one can choose an asymmetric representation of words (e.g., for French, only the suffix, with `nCstart = 0` and `nCend = 6`).
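To make the asymmetric window concrete, here is a small sketch in plain Python. It only shows which characters a given (`nCstart`, `nCend`) setting would look at; the function name `char_window` and the padding symbol are assumptions, and the way the architecture actually pads or embeds characters is not shown.

```python
def char_window(word, n_start, n_end, pad="_"):
    """Return the first `n_start` and last `n_end` characters of `word`.

    Illustrative only: short words are padded with `pad` so that the
    result always has n_start + n_end entries.
    """
    start = list(word[:n_start])
    start += [pad] * (n_start - len(start))
    end = list(word[-n_end:]) if n_end > 0 else []
    end = [pad] * (n_end - len(end)) + end
    return start + end


# Suffix-only setting mentioned above: nCstart = 0, nCend = 6.
print(char_window("déraillée", 0, 6))
print(char_window("temps", 0, 6))
```

A suffix-only window keeps the part of a French word where most inflectional information sits, such as verb endings.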

### pretagger_hunspell

a component that makes Hunspell morphological analysis available as *features* for the `SolipcysmeMultiHashed` or `SolipcysmeCharEmbed` architectures.
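Hunspell reports morphological analyses as space-separated `key:value` fields, such as `st:` (stem), `po:` (part of speech) and `is:` (inflectional suffix). The sketch below parses such a line in plain Python. It is not the component's code: `parse_analysis` is a hypothetical helper and the sample line is made up, since real values depend on the dictionary in use.

```python
def parse_analysis(analysis):
    """Split a Hunspell-style analysis line into {field: [values]}."""
    fields = {}
    for item in analysis.split():
        key, sep, value = item.partition(":")
        if sep:  # skip anything that is not a `key:value` pair
            fields.setdefault(key, []).append(value)
    return fields


# Made-up analysis line; real values depend on the Hunspell dictionary.
line = " st:manger po:verb is:inf"
fields = parse_analysis(line)
print(fields.get("po"), fields.get("is"))
```

Since feature values must be integers (see `SolipcysmeMultiHashed`), strings like these would still need to be mapped to integer IDs; how `pretagger_hunspell` does this is not shown here.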

limits and specificities
------------------------

- only knows about straight apostrophes (`'`) and quotes (`"`).
- morphologizer depends on the `jusqucy_tokenizer`, because this tokenizer sets the value of a doc extension (`Doc._.jusqucy_ttypes`) that the morphologizer uses.
- morphologizer also depends on the `pretagger_hunspell` component, because it uses the output of Hunspell as token features (`po:` and `is:` features).
- no `Gender` feature.
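Because the pipeline only knows straight apostrophes and quotes, text with typographic ones may need to be normalized before processing. A minimal sketch, assuming that replacing the four characters below is enough for your data; `straighten_quotes` is a hypothetical helper, not part of solipCysme, and French guillemets are left untouched.

```python
# Typographic characters mapped to their straight equivalents.
# The choice of characters is an assumption; extend it as needed.
STRAIGHTEN = str.maketrans({
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def straighten_quotes(text):
    """Replace curly apostrophes and quotes with straight ones."""
    return text.translate(STRAIGHTEN)


print(straighten_quotes("l\u2019heure s\u2019est \u201carr\u00eat\u00e9e\u201d"))
```

The cleaned string can then be passed to `nlp(...)` as in the usage section.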

license
-------

this work is released under the [GPL](https://www.gnu.org/licenses/gpl-3.0.html) license (v3).
