Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/sloev/spacy-syllables

Multilingual syllable annotation pipeline component for spacy
https://github.com/sloev/spacy-syllables

Last synced: 2 months ago
JSON representation

Multilingual syllable annotation pipeline component for spacy

Host: GitHub
URL: https://github.com/sloev/spacy-syllables
Owner: sloev
License: mit
Created: 2020-03-13T16:29:12.000Z (almost 5 years ago)
Default Branch: master
Last Pushed: 2023-03-08T11:15:47.000Z (almost 2 years ago)
Last Synced: 2024-10-01T05:41:25.947Z (3 months ago)
Language: Python
Size: 160 KB
Stars: 34
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

Awesome Lists containing this project

README

        ![spacy syllables](https://raw.githubusercontent.com/sloev/spacy-syllables/master/header.jpg)

# Spacy Syllables

![example workflow](https://github.com/sloev/spacy-syllables/actions/workflows/test.yml/badge.svg) [![Latest Version](https://img.shields.io/pypi/v/spacy-syllables.svg)](https://pypi.python.org/pypi/spacy-syllables) [![Python Support](https://img.shields.io/pypi/pyversions/spacy-syllables.svg)](https://pypi.python.org/pypi/spacy-syllables)



A [spacy 2+ pipeline component](https://spacy.io/universe/category/pipeline) for adding multilingual syllable annotation to tokens. 

* Uses well established [pyphen](https://github.com/Kozea/Pyphen) for the syllables.

* Supports [a ton of languages](https://github.com/Kozea/Pyphen/tree/master/pyphen/dictionaries)

* Ease of use thx to the awesome pipeline framework in spacy

## Install

```bash

$ pip install spacy_syllables

```

which also installs the following dependencies:

* spacy = "^2.2.3"

* pyphen = "^0.9.5"

## Usage

The [`SpacySyllables`](spacy_syllables/__init__.py) class autodetects language from the given spacy nlp instance, but you can also override the detected language by specifying the `lang` parameter during instantiation, see how [here](tests/test_all.py).

### Normal usecase

```python

import spacy

from spacy_syllables import SpacySyllables

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("syllables", after="tagger")

assert nlp.pipe_names == ["tok2vec", "tagger", "syllables", "parser", "ner", "attribute_ruler", "lemmatizer"]

doc = nlp("terribly long")

data = [(token.text, token._.syllables, token._.syllables_count) for token in doc]

assert data == [("terribly", ["ter", "ri", "bly"], 3), ("long", ["long"], 1)]

```

more examples in [tests](tests/test_all.py)

## Migrating from spacy 2.x to 3.0

In spacy 2.x, spacy_syllables was originally added to the pipeline by instantiating a [`SpacySyllables`](spacy_syllables/__init__.py) object with the desired options and adding it to the pipeline: 

```python

from spacy_syllables import SpacySyllables

syllables = SpacySyllables(nlp, "en_US")

nlp.add_pipe(syllables, after="tagger")

```

In spacy 3.0, you now add the component to the pipeline simply by adding it by name, setting custom configuration information in the `add_pipe()` parameters:

```python

from spacy_syllables import SpacySyllables

nlp.add_pipe("syllables", after="tagger", config={"lang": "en_US"})

```

In addition, the default pipeline components have changed between 2.x and 3.0; please make sure to update any asserts you have that check for these.

e.g.:

spacy 2.x:

```python

assert nlp.pipe_names == ["tagger", "syllables", "parser", "ner"]

```

spacy 3.0:

```python

assert nlp.pipe_names == ["tok2vec", "tagger", "syllables", "parser", "ner", "attribute_ruler", "lemmatizer"]

```

## Dev setup / testing

### install

install the dev package and pyenv versions

```bash

$ pip install -e ".[dev]"

$ python -m spacy download en_core_web_sm

```

### run tests

```bash

$ black .

$ pytest

```