https://github.com/explosion/spacymoji

💙 Emoji handling and meta data for spaCy with custom extension attributes
https://github.com/explosion/spacymoji

emoji emoji-unicode emojis natural-language-processing nlp spacy spacy-extension spacy-pipeline

Last synced: 9 months ago
JSON representation

💙 Emoji handling and meta data for spaCy with custom extension attributes

Host: GitHub
URL: https://github.com/explosion/spacymoji
Owner: explosion
License: mit
Created: 2017-10-12T21:39:45.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2023-05-10T14:06:51.000Z (over 2 years ago)
Last Synced: 2025-03-29T10:08:52.143Z (9 months ago)
Topics: emoji, emoji-unicode, emojis, natural-language-processing, nlp, spacy, spacy-extension, spacy-pipeline
Language: Python
Homepage: https://spacy.io
Size: 32.2 KB
Stars: 181
Watchers: 15
Forks: 20
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # spacymoji: emoji for spaCy

[spaCy](https://spacy.io) extension and pipeline component for adding emoji meta

data to `Doc` objects. Detects emoji consisting of one or more unicode

characters, and can optionally merge multi-char emoji (combined pictures, emoji

with skin tone modifiers) into one token. Human-readable emoji descriptions are

added as a custom attribute, and an optional lookup table can be provided for

your own descriptions. The extension sets the custom `Doc`, `Token` and `Span`

attributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and `._.emoji`. You

can read more about custom pipeline components and extension attributes

[here](https://spacy.io/usage/processing-pipelines).

Emoji are matched using spaCy's

[`PhraseMatcher`](https://spacy.io/api/phrasematcher), and looked up in the data

table provided by the [`emoji` package](https://github.com/carpedm20/emoji).

[![tests](https://github.com/explosion/spacymoji/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacymoji/actions/workflows/tests.yml)

[![Current Release Version](https://img.shields.io/github/release/explosion/spacymoji.svg?style=flat-square&logo=github)](https://github.com/explosion/spacymoji/releases)

[![pypi Version](https://img.shields.io/pypi/v/spacymoji.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacymoji/)

# ⏳ Installation

`spacymoji` requires `spacy` v3.0.0 or higher. For spaCy v2.x, install

`spacymoji==2.0.0`.

```bash

pip install spacymoji

```

# ☝️ Usage

Import the component and add it anywhere in your pipeline using the string name

of the `"emoji"` component factory:

```python

import spacy

nlp = spacy.load("en_core_web_sm")

nlp.add_pipe("emoji", first=True)

doc = nlp("This is a test 😻 👍🏿")

assert doc._.has_emoji is True

assert doc[2:5]._.has_emoji is True

assert doc[0]._.is_emoji is False

assert doc[4]._.is_emoji is True

assert doc[5]._.emoji_desc == "thumbs up dark skin tone"

assert len(doc._.emoji) == 2

assert doc._.emoji[1] == ("👍🏿", 5, "thumbs up dark skin tone")

```

`spacymoji` only cares about the token text, so you can use it on a blank

`Language` instance (it should work for all

[available languages](https://spacy.io/usage/models#languages)!), or in a

pipeline with a loaded pipeline. If your pipeline includes a tagger, parser and

entity recognizer, make sure to add the emoji component as `first=True`, so the

spans are merged right after tokenization, and _before_ the document is parsed.

If your text contains a lot of emoji, this might even give you a nice boost in

parser accuracy.

## Available attributes

The extension sets attributes on the `Doc`, `Span` and `Token`. You can change

the attribute names (and other parameters of the Emoji component) by passing

them via the `config` parameter in the `nlp.add_pipe(...)` method. For more

details on custom components and attributes, see the

[processing pipelines documentation](https://spacy.io/usage/processing-pipelines#custom-components).

| Attribute            | Type                       | Description                                                   |

| -------------------- | -------------------------- | ------------------------------------------------------------- |

| `Token._.is_emoji`   | bool                       | Whether the token is an emoji.                                |

| `Token._.emoji_desc` | str                        | A human-readable description of the emoji.                    |

| `Doc._.has_emoji`    | bool                       | Whether the document contains emoji.                          |

| `Doc._.emoji`        | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the document's emoji. |

| `Span._.has_emoji`   | bool                       | Whether the span contains emoji.                              |

| `Span._.emoji`       | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the span's emoji.     |

## Settings

You can configure the `emoji` factory by setting any of the following parameters

in the `config` dictionary:

| Setting       | Type                      | Description                                                                                                                            |

| ------------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |

| `attrs`       | Tuple[str, str, str, str] | Attributes to set on the `._` property. Defaults to `('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')`.                                |

| `pattern_id`  | str                       | ID of match pattern, defaults to `'EMOJI'`. Can be changed to avoid ID conflicts.                                                      |

| `merge_spans` | bool                      | Merge spans containing multi-character emoji, defaults to `True`. Will only merge combined emoji resulting in one icon, not sequences. |

| `lookup`      | Dict[str, str]            | Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations.                          |

```python

emoji_config = {"attrs": ("has_e", "is_e", "e_desc", "e"), lookup={"👨‍🎤": "David Bowie"})

nlp.add_pipe(emoji, first=True, config=emoji_config)

doc = nlp("We can be 👨‍🎤 heroes")

assert doc[3]._.is_e

assert doc[3]._.e_desc == "David Bowie"

```

If you're training a pipeline, you can define the component config in your

[`config.cfg`](https://spacy.io/usage/training):

```ini

[nlp]

pipeline = ["emoji", "ner"]

# ...

[components.emoji]

factory = "emoji"

merge_spans = false

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/explosion/spacymoji

Awesome Lists containing this project

README