https://github.com/thjbdvlt/jusquci

french tokenizer for postgresql text search / spacy
nlp nlp-french postgresql postgresql-extension spacy tokenizer

Host: GitHub
URL: https://github.com/thjbdvlt/jusquci
Owner: thjbdvlt
License: other
Created: 2024-12-12T22:44:55.000Z (2 months ago)
Default Branch: sea
Last Pushed: 2025-01-25T09:05:27.000Z (17 days ago)
Last Synced: 2025-01-25T10:17:43.747Z (17 days ago)
Topics: nlp, nlp-french, postgresql, postgresql-extension, spacy, tokenizer
Language: C
Homepage:
Size: 113 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

        __jusquci__ -- tokenizer for french (PostgreSQL/spaCy).

| text                    | tokens                      |
| ----------------------- | --------------------------- |
| jusqu'ici=>             | `jusqu'` `ici` `=>`         |
| celle-ci-->ici          | `celle` `-ci` `-->` `ici`   |
| lecteur-rice-x-s        | `lecteur-rice-x-s`          |
| peut-être--là           | `peut-être` `--` `là`       |
| correcteur·rices        | `correcteur·rices`          |
| mais.maintenant         | `mais` `.` `maintenant`     |
| \[re\]lecteur.rice.s    | `[re]lecteur.rice.s`        |
| autre(s)                | `autre(s)`                  |
| (autres)                | `(` `autres` `)`            |
| (autre(s))              | `(` `autre(s)` `)`          |
| www.on-tenk.com         | `www.on-tenk.com`           |
| \[@becker_1982,p.12\]   | `[` `@becker_1982` `,` `p.` `12` `]` |
| oui..?                  | `oui` `..?`                 |
| dedans/dehors           | `dedans` `/` `dehors`       |
| :happy: :) pour:        | `:happy:` `:)` `pour` `:`   |
| ô.ô^^=):-)xd            | `ô.ô` `^^` `=)` `:-)` `xd`  |

## postgresql extension

the primary role of this tokenizer is to serve as a [text search parser](https://www.postgresql.org/docs/current/textsearch-parsers.html) in postgresql, hence it is provided here as a postgresql extension.

```bash
make install install_stop
```

```sql
create extension jusquci;

select to_tsvector(
    'jusquci',
    'le quotidien,s''invente-t-il par mille.manière de braconner???'
);
```

## in python

the single provided function (`tokenize`) returns four lists:

- __tokens__: a list of strings.

- __token types__: a list of token type IDs; the types are defined as an *Enum* (`jusqucy.ttypes.TokenType`).

- __spaces__: a list of boolean values that indicate whether each token is followed by a space (for spaCy, mostly).

- __is_sent_start__: a list of boolean values used to set `Token.is_sent_start` (based on the __token types__).
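
as a rough illustration of how these parallel lists fit together, the sketch below rebuilds a text from __tokens__ and __spaces__. the values are hand-written for the example (they are not actual output of `tokenize`), and `detokenize` is a hypothetical helper, not part of the package.

```python
def detokenize(tokens, spaces):
    """Rebuild a text from parallel lists of tokens and trailing-space flags."""
    if len(tokens) != len(spaces):
        raise ValueError("tokens and spaces must have the same length")
    # append a single space after each token whose flag is True
    return "".join(tok + (" " if sp else "") for tok, sp in zip(tokens, spaces))


# hand-written example values, not actual output of the tokenizer
tokens = ["mais", ".", "maintenant", "dedans", "/", "dehors"]
spaces = [False, False, True, False, False, False]

text = detokenize(tokens, spaces)
print(text)
```

keeping the space flags separate from the tokens is what allows a spaCy `Doc` to reproduce the input text.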

the tokenizer can be used in a spacy pipeline. it tokenizes the text and adds an attribute to the resulting `Doc` object, `Doc._.ttypes`, in which the token types are stored (assigning a type to each token individually takes much more time).

```python
import spacy
import jusqucy

nlp = spacy.blank('fr')
nlp.tokenizer = jusqucy.JusqucyTokenizer(nlp.vocab)

# or:
nlp = spacy.load(your_model, config={
    "nlp": {"tokenizer": {"@tokenizers": "jusqucy_tokenizer"}}
})
```

to get the token types:

```python
from jusqucy.ttypes import TokenType

# `nlp` is the pipeline built above, with the jusqucy tokenizer set
doc = nlp("jusqu'ici=>")

for token, ttype in zip(doc, doc._.jusqucy_ttypes):
    print(token, TokenType[ttype])
```

### normalizer

a normalizer can also be used as a spacy component. it replaces the `norm_` attribute of tokens of certain `ttypes`, to make the work of the following components (e.g. morphologizer or parser) easier. each of these token types gets a fixed norm:

- `url`: `https://`

- `number`: `2`

- `ordinal`: `2ème`

- `emoticon`: `:)`

- `emoji`: `:)`
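
the idea can be sketched in plain python as a lookup from type name to fixed norm. this is only an illustration under assumptions: `NORMS` and `normalize` are hypothetical names, the type name `word` is made up, and the actual component works on spaCy tokens rather than on lists.

```python
# assumption: this mapping simply mirrors the list above; it is not
# the package's own data structure.
NORMS = {
    "url": "https://",
    "number": "2",
    "ordinal": "2ème",
    "emoticon": ":)",
    "emoji": ":)",
}


def normalize(tokens, ttype_names):
    """Return one norm per token: a fixed value for the listed types,
    the token itself for every other type."""
    return [NORMS.get(name, tok) for tok, name in zip(tokens, ttype_names)]


# hand-written example values; "word" is a made-up type name
tokens = ["voir", "www.on-tenk.com", "12", ":-)"]
ttype_names = ["word", "url", "number", "emoticon"]

norms = normalize(tokens, ttype_names)
print(norms)
```

collapsing every url or number to the same norm means later components see one frequent form instead of many rare ones.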

## as a command line tool

to use __jusquci__ as a simple command line tokenizer (that reads from `stdin`), just compile it with the makefile in the `cli` directory.

the program reads a text from standard input and outputs tokens separated by spaces. it also adds a newline after strong punctuation signs (`.`, `?`, `!`).
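
to make the output format concrete, here is a small python sketch of the same layout. it is not the C program itself: `format_tokens` is a hypothetical helper, it assumes a line break only after a token that is exactly `.`, `?` or `!`, and the real program's whitespace may differ in detail.

```python
STRONG_PUNCT = {".", "?", "!"}


def format_tokens(tokens):
    """Join tokens with spaces, starting a new line after strong punctuation."""
    lines, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in STRONG_PUNCT:
            # close the current line right after the punctuation sign
            lines.append(" ".join(current))
            current = []
    if current:
        # flush tokens that follow the last strong punctuation sign
        lines.append(" ".join(current))
    return "\n".join(lines)


# hand-written token list, taken from two rows of the table above
tokens = ["mais", ".", "maintenant", "dedans", "/", "dehors"]
out = format_tokens(tokens)
print(out)
```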

## sources

- [tsexample](https://github.com/postgrespro/tsexample), for the code.

- the stopwords list is the concatenation of the postgresql default stopwords for french (`french.stop`) and a list established by Jacques Savoy[^Savoy]. I've also added a few words: *elided* words with their apostrophe (e.g. `c'`), to be consistent with the `jusquci` parser (postgresql doesn't include the apostrophe), and non-binary pronouns (e.g. `iel`, `celleux`).

[^Savoy]: *A stemming procedure and stopword list for general french corpora*, Jacques Savoy, Institut interfacultaire d'informatique, *Journal of the American Society for Information Science*, 50(10), 1999, 944-952. I removed a word from this list: `passé`.
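
the way such a list can be assembled is sketched below. this is not the script used for the repository: `merge_stopwords` is a hypothetical helper, the two source lists are tiny placeholder excerpts rather than the real files, and dropping duplicates is an assumption.

```python
from itertools import chain


def merge_stopwords(*lists, extra=(), removed=()):
    """Concatenate stopword lists, skipping duplicates and removed words,
    and keeping the order in which words are first seen."""
    seen = set(removed)  # removed words are treated as already seen
    merged = []
    for word in chain(*lists, extra):
        if word not in seen:
            seen.add(word)
            merged.append(word)
    return merged


# placeholder entries, not the real content of either list
postgres_fr = ["au", "aux", "avec", "ce"]
savoy = ["avec", "ce", "passé", "puis"]
extra = ["c'", "iel", "celleux"]

stopwords = merge_stopwords(postgres_fr, savoy, extra=extra, removed=["passé"])
print(stopwords)
```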

## os

only tested on linux (debian) and postgresql 16

## license

licensed under [GPLv3](https://www.gnu.org/licenses/gpl-3.0).