Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/thjbdvlt/quelquhui
tokenizer for french
- Host: GitHub
- URL: https://github.com/thjbdvlt/quelquhui
- Owner: thjbdvlt
- License: MIT
- Created: 2024-03-13T14:08:25.000Z (11 months ago)
- Default Branch: sea
- Last Pushed: 2024-09-18T10:57:37.000Z (4 months ago)
- Last Synced: 2024-09-25T19:41:27.850Z (4 months ago)
- Topics: french, french-nlp, nlp, spacy, tokenizer-nlp
- Language: Python
- Homepage:
- Size: 93.8 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
quelqu'hui
==========

tokenizer for contemporary french.
| text | tokens |
| ----------------------- | --------------------------- |
| peut-on | `peut` `-on` |
| prends-les | `prends` `-les` |
| Villar-les-bois | `Villar-les-bois` |
| lecteur-rice-x-s | `lecteur-rice-x-s` |
| correcteur·rices | `correcteur·rices` |
| mais.maintenant | `mais` `.` `maintenant` |
| relecteur.rice.s | `relecteur.rice.s` |
| autre(s) | `autre(s)` |
| (autres) | `(` `autres` `)` |
| (autre(s)) | `(` `autre(s)` `)` |
| www.on-tenk.com. | `www.on-tenk.com` `.` |
| oui..? | `oui` `..?` |
| aujourd'hui | `aujourd'hui` |
| c'est | `c'` `est` |
| dedans/dehors | `dedans` `/` `dehors` |
| 02/10/2024 | `02/10/2024` |
| :-) | `:-)` |
| (:happy:) | `(` `:happy:` `)` |

usage
-----

use as a tokenizer in a [spacy](https://spacy.io/api) pipeline:
```python
import quelquhui
import spacy

nlp = spacy.load('fr_core_news_sm')
nlp.tokenizer = quelquhui.Toquenizer(nlp.vocab)
```

if you save the pipeline and want to load it back:
```python
nlp2 = spacy.load("./model_output", config={
"nlp": {"tokenizer": {"@tokenizers": "quelquhui_tokenizer"}}
})
```

use as an independent tokenizer (with no dependencies):
```python
import quelquhui

qh = quelquhui.light.Toquenizer()
doc = qh("la machine à (b)rouiller le temps s'est peut-être dérailler...")
```

installation
------------

```bash
pip install git+https://github.com/thjbdvlt/quelquhui
```

configuration
-------------

very few options can be set to modify the tokenizer behavior:
```python
import quelquhui

qh = quelquhui.Toquenizer(
abbrev = ["ref", "ed[s]"], # supports regex
inclusive = True, # default
emoticon = True, # default
url = True, # default
regexurl = r"(?:\w+://|www\.)[\S]+[\w/]", # default
regexemoticon = r":-?[\)\(]", # (default one is too long to be reproduced here.)
chars = {
"APOSTROPHE": "'`´’" # default
"HYPHEN": "-–—", # default
# signs you'll set here will replace defaults.
# other won't be changed.
# complete list with default values can be found with
# `quelquhui.default.Chars.__dict__`
},
words = {
"ELISION": ["j", "s", "jusqu"], # ...
"INVERSION": ["on", "y", "ci"], # ...
"SUFF_FEMININE": ["e", "rice", "ère"], # ...
"SUFF_NONBINARY": ["x"],
"SUFF_PLURAL": ["s", "x"],
# there's only these 5.
# (default lists for the first three are longer.)
}
)
```
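as the comment above notes, the complete list of configurable characters with their default values lives in `quelquhui.default.Chars.__dict__`; a small sketch to print them (assuming the attributes are plain strings):

```python
import quelquhui

# list every configurable character class with its default value
# (attribute names as used above, e.g. APOSTROPHE, HYPHEN)
for name, value in quelquhui.default.Chars.__dict__.items():
    if not name.startswith("_"):
        print(name, repr(value))
```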
how it works
------------

1. _split text on spaces._
2. re-split using a few functions (applied in a loop) that produce _frozen_ tokens which won't be tokenized by the next functions/steps (typically: urls, or text-emojis like `:happy:`, which may be hard to tokenize in cases like `(:happy:)`; we don't want the regex looking for _emoticons_ to match `:)`, so i need to define rules that are applied in a specific order).
3. for each resulting substring:
    1. *list the characters on which words must be split*: typically punctuation marks, such as the comma or the period. they are then considered *token boundaries*.
    2. *list the characters that must be kept together, even if they have been listed in step __3.i__*.
    3. remove the matches of __3.ii__ from those of __3.i__, and split on the remaining splitting characters (a sketch of this step follows the list).
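a minimal sketch of step 3, assuming two compiled regexes for __3.i__ and __3.ii__ (an illustration of the technique, not the library's actual code):

```python
import re

def split_substring(s, boundary, keep):
    # 3.ii: positions that must not be split on
    protected = {m.start() for m in keep.finditer(s)}
    # 3.i minus 3.ii: the remaining splitting positions
    cuts = [m.start() for m in boundary.finditer(s) if m.start() not in protected]
    tokens, prev = [], 0
    for i in cuts:
        if i > prev:
            tokens.append(s[prev:i])
        tokens.append(s[i])  # the boundary character becomes its own token
        prev = i + 1
    if prev < len(s):
        tokens.append(s[prev:])
    return tokens
```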
### period

in most cases, a period is a token distinct from the word it follows: a period ending a sentence obviously isn't part of that word. but in some cases, a period actually is part of a word (abbreviations: _p. 10_), and in some other cases the period _and the letters following it_ must be kept in the token (inclusive language: _auteur.rice.s_). these cases are exceptions, hence they are handled in __3.ii__: i remove them from the periods found in __3.i__. the pattern in __3.i__ will be `\.` (match a period wherever it is, without any condition), while the pattern in __3.ii__ could be (simplified) `(?<=[^a-z][a-z])\.|\.(?=rice|s)` (match a period preceded by a single letter or followed by _rice_ or _s_).
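plugged into the `split_substring` sketch above, these two (simplified) patterns reproduce the behavior shown in the examples table:

```python
import re

boundary = re.compile(r"\.")                          # 3.i: any period
keep = re.compile(r"(?<=[^a-z][a-z])\.|\.(?=rice|s)") # 3.ii: simplified exceptions

print(split_substring("mais.maintenant", boundary, keep))
# -> ['mais', '.', 'maintenant']
print(split_substring("relecteur.rice.s", boundary, keep))
# -> ['relecteur.rice.s']
```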
### hyphen
in most cases, a hyphen isn't a token boundary, because in french a hyphen says "these two words are actually one word", as in _Vaison-la-romaine_. but in some cases it is: mostly verb-subject inversions. these cases are easily described and handled with a regular expression, because the subjects are always personal pronouns: `-(?=je|tu|...)`. there are also a few cases where the following word is not a pronominalized subject but a pronominalized object, such as _prends-les_, which is also easily handled with a regular expression. hence, the pattern for the hyphen in __3.i__ is not (as for the period) unconditional and simple, but rather complex and conditional (match a hyphen only if followed by a pronominalized subject or object).
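a hedged illustration with a heavily shortened pronoun list (the real one is longer) and an end-of-substring anchor, which is an assumption here; the zero-width lookahead keeps the hyphen attached to the pronoun, as in the table above (`peut` `-on`, `prends` `-les`):

```python
import re

# hypothetical, shortened pronoun list; the `$` anchor is what lets
# Villar-les-bois stay whole while prends-les is split.
hyphen = re.compile(r"(?=-(?:je|tu|on|les|la|le|moi|toi)$)")

for s in ("peut-on", "prends-les", "Villar-les-bois"):
    print(hyphen.split(s))
# -> ['peut', '-on']
# -> ['prends', '-les']
# -> ['Villar-les-bois']
```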
dependencies
------------

- python3
- optional: [spacy](https://spacy.io/api)