{"id":24306537,"url":"https://github.com/thjbdvlt/solipcysme","last_synced_at":"2025-10-28T13:10:02.236Z","repository":{"id":272657126,"uuid":"859292243","full_name":"thjbdvlt/solipCysme","owner":"thjbdvlt","description":"spaCy pipeline for french focused on personal pronouns, fictions and first person point of view texts.","archived":false,"fork":false,"pushed_at":"2025-05-07T08:14:58.000Z","size":997,"stargazers_count":2,"open_issues_count":2,"forks_count":1,"subscribers_count":1,"default_branch":"sea","last_synced_at":"2025-07-12T00:03:12.377Z","etag":null,"topics":["french","french-nlp","lemmatization","morphological-analysis","natural-language-processing","nlp","nlp-french","normalization","part-of-speech-tagging","pos-tagging","spacy","spacy-extensions","tokenization","word-embeddings"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thjbdvlt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-09-18T12:21:41.000Z","updated_at":"2025-06-04T00:47:33.000Z","dependencies_parsed_at":null,"dependency_job_id":"62a4c9c0-2ee9-4381-85f3-69f43005c197","html_url":"https://github.com/thjbdvlt/solipCysme","commit_stats":null,"previous_names":["thjbdvlt/solipcysme"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/thjbdvlt/solipCysme","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thjbdvlt%2FsolipCysme","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thjbdvlt%2FsolipCysme/tags","relea
ses_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thjbdvlt%2FsolipCysme/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thjbdvlt%2FsolipCysme/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thjbdvlt","download_url":"https://codeload.github.com/thjbdvlt/solipCysme/tar.gz/refs/heads/sea","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thjbdvlt%2FsolipCysme/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264915991,"owners_count":23682957,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["french","french-nlp","lemmatization","morphological-analysis","natural-language-processing","nlp","nlp-french","normalization","part-of-speech-tagging","pos-tagging","spacy","spacy-extensions","tokenization","word-embeddings"],"created_at":"2025-01-17T03:20:02.835Z","updated_at":"2025-10-28T13:09:57.192Z","avatar_url":"https://github.com/thjbdvlt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"solipCysme\n==========\n\n[spaCy](https://spacy.io/) [pipeline](https://spacy.io/usage/processing-pipelines) for french fictions or first person point of view texts (with a focus on personal pronouns, moods and tenses), mostly trained on novels.\n\n| Feature | Description |\n| --- | --- |\n| __Language__ | french |\n| __Name__ | `fr_solipcysme` |\n| __Default Pipeline__ | `jusqucy_tokenizer`,`commecy_normalizer`, `jusqucy_normalizer`, `pretagger_hunspell`,`morphologizer`, `viceverser_lemmatizer`, `parser` |\n| 
__Components__ | [jusqucy_tokenizer](https://github.com/thjbdvlt/jusquci), [jusqucy_normalizer](https://github.com/thjbdvlt/jusquci), [commecy_normalizer](https://github.com/thjbdvlt/commecy), `morphologizer`, [viceverser_lemmatizer](https://github.com/thjbdvlt/spacy-viceverser), `parser` |\n| __Sources__ | Corpus [narraFEATS](https://github.com/thjbdvlt/corpus-narraFEATS) (morphologizer), [Universal Dependencies](https://universaldependencies.org/fr/) (parser), [french-word-vectors](https://github.com/thjbdvlt/french-word-vectors) (vectors)|\n| __License__ | [GPL](https://www.gnu.org/licenses/gpl-3.0.html) |\n| __Author__ | [thjbdvlt](https://github.com/thjbdvlt) |\n\ninstallation\n------------\n\n```bash\n# Main pipeline\npip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_lg-0.2.6-py3-none-any.whl\n\n# Faster, less accurate, smaller model\npip install https://github.com/thjbdvlt/solipCysme/releases/download/0.2.6/fr_solipcysme_sm-0.2.6-py3-none-any.whl\n```\n\nusage\n-----\n\n```python\nimport spacy\n\nnlp = spacy.load(\"fr_solipcysme_sm\")\n\ndoc = nlp(\n    \"la MACHINE à (b)rouiller le temps s'est peuuut-etre déraillée..?\"\n)\n\nfor i in doc:\n    print(\n        i.norm_,      # commecy_normalizer / jusqucy_normalizer\n        i.pos_,       # morphologizer\n        i.morph,      # morphologizer\n        i.lemma_,     # viceverser_lemmatizer\n        i.dep_,       # parser\n        i.head,       # parser\n        i.sent_start, # jusqucy_tokenizer\n        i._.ttype,    # jusqucy_tokenizer\n        i._.isword,   # jusqucy_tokenizer\n    )\n\nprint(\n    # these attributes are not especially useful.\n    # mostly used to make morphologizer more accurate.\n    doc._.jusqucy_ttypes,  # jusqucy_tokenizer\n    doc._.hunspell_po,     # pretagger_hunspell\n    doc._.hunspell_is,     # pretagger_hunspell\n)\n```\n\ncomponents and architectures\n------------\n\nsolipCysme not only is a *trained pipeline*, but also a set of minimal 
pipeline components and model architectures that can be used independently.\n\n### SolipcysmeMultiHashed\n\na modified [MultiHashEmbed](https://spacy.io/api/architectures#MultiHashEmbed) that makes it possible to use `Doc` underscore attributes as features. The value of an attribute must be a `list` of `int`, and must have the same length as the `Doc` itself.\n\n### SolipcysmeCharEmbed\n\na modified [CharacterEmbed](https://spacy.io/api/architectures#CharacterEmbed) that makes it possible to use underscore attributes as features and that replaces `nC` (number of characters) with `nCstart` and `nCend`, so that one can choose an asymmetric representation of words (e.g., for french, to use only the suffix, with `nCstart = 0` and `nCend = 6`).\n\n### pretagger_hunspell\n\na component that makes Hunspell morphological analysis available as *features* for the `SolipcysmeMultiHashed` or `SolipcysmeCharEmbed` architectures.\n\nlimits and specificities\n------\n\n- only knows about straight apostrophes (`'`) and quotes (`\"`).\n- morphologizer depends on the `jusqucy_tokenizer`, because this tokenizer sets the value of a doc extension (`Doc._.jusqucy_ttypes`), which the morphologizer uses.\n- morphologizer depends on the `pretagger_hunspell` component too, because the morphologizer uses the output of Hunspell as token features (`po:` and `is:` features).\n- no `Gender` feature\n\nlicense\n------\n\nthis work is released under the [GPL](https://www.gnu.org/licenses/gpl-3.0.html) license (v3).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthjbdvlt%2Fsolipcysme","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthjbdvlt%2Fsolipcysme","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthjbdvlt%2Fsolipcysme/lists"}