{"id":15014044,"url":"https://github.com/explosion/spacymoji","last_synced_at":"2025-04-05T12:09:19.296Z","repository":{"id":52736935,"uuid":"106748160","full_name":"explosion/spacymoji","owner":"explosion","description":"💙 Emoji handling and meta data for spaCy with custom extension attributes","archived":false,"fork":false,"pushed_at":"2023-05-10T14:06:51.000Z","size":33,"stargazers_count":181,"open_issues_count":4,"forks_count":20,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-03-29T10:08:52.143Z","etag":null,"topics":["emoji","emoji-unicode","emojis","natural-language-processing","nlp","spacy","spacy-extension","spacy-pipeline"],"latest_commit_sha":null,"homepage":"https://spacy.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/explosion.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-12T21:39:45.000Z","updated_at":"2025-02-25T08:59:34.000Z","dependencies_parsed_at":"2024-06-18T18:23:47.108Z","dependency_job_id":"fd11a71f-9357-4075-b3ba-123bb11fc150","html_url":"https://github.com/explosion/spacymoji","commit_stats":{"total_commits":19,"total_committers":3,"mean_commits":6.333333333333333,"dds":"0.26315789473684215","last_synced_commit":"38d49e3785d9f8df5d33bc50eeb8d28a941887fe"},"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacymoji","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacymoji/tags","releases_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacymoji/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/explosion%2Fspacymoji/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/explosion","download_url":"https://codeload.github.com/explosion/spacymoji/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247332612,"owners_count":20921853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["emoji","emoji-unicode","emojis","natural-language-processing","nlp","spacy","spacy-extension","spacy-pipeline"],"created_at":"2024-09-24T19:45:06.625Z","updated_at":"2025-04-05T12:09:19.267Z","avatar_url":"https://github.com/explosion.png","language":"Python","readme":"# spacymoji: emoji for spaCy\n\n[spaCy](https://spacy.io) extension and pipeline component for adding emoji meta\ndata to `Doc` objects. Detects emoji consisting of one or more unicode\ncharacters, and can optionally merge multi-char emoji (combined pictures, emoji\nwith skin tone modifiers) into one token. Human-readable emoji descriptions are\nadded as a custom attribute, and an optional lookup table can be provided for\nyour own descriptions. The extension sets the custom `Doc`, `Token` and `Span`\nattributes `._.is_emoji`, `._.emoji_desc`, `._.has_emoji` and `._.emoji`. 
You\ncan read more about custom pipeline components and extension attributes\n[here](https://spacy.io/usage/processing-pipelines).\n\nEmoji are matched using spaCy's\n[`PhraseMatcher`](https://spacy.io/api/phrasematcher), and looked up in the data\ntable provided by the [`emoji` package](https://github.com/carpedm20/emoji).\n\n[![tests](https://github.com/explosion/spacymoji/actions/workflows/tests.yml/badge.svg)](https://github.com/explosion/spacymoji/actions/workflows/tests.yml)\n[![Current Release Version](https://img.shields.io/github/release/explosion/spacymoji.svg?style=flat-square\u0026logo=github)](https://github.com/explosion/spacymoji/releases)\n[![pypi Version](https://img.shields.io/pypi/v/spacymoji.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/spacymoji/)\n\n# ⏳ Installation\n\n`spacymoji` requires `spacy` v3.0.0 or higher. For spaCy v2.x, install\n`spacymoji==2.0.0`.\n\n```bash\npip install spacymoji\n```\n\n# ☝️ Usage\n\nAdd the component anywhere in your pipeline using the string name\nof the `\"emoji\"` component factory:\n\n```python\nimport spacy\n\nnlp = spacy.load(\"en_core_web_sm\")\nnlp.add_pipe(\"emoji\", first=True)\ndoc = nlp(\"This is a test 😻 👍🏿\")\nassert doc._.has_emoji is True\nassert doc[2:5]._.has_emoji is True\nassert doc[0]._.is_emoji is False\nassert doc[4]._.is_emoji is True\nassert doc[5]._.emoji_desc == \"thumbs up dark skin tone\"\nassert len(doc._.emoji) == 2\nassert doc._.emoji[1] == (\"👍🏿\", 5, \"thumbs up dark skin tone\")\n```\n\n`spacymoji` only cares about the token text, so you can use it on a blank\n`Language` instance (it should work for all\n[available languages](https://spacy.io/usage/models#languages)!), or with a\nloaded trained pipeline. 
If your pipeline includes a tagger, parser and\nentity recognizer, make sure to add the emoji component as `first=True`, so the\nspans are merged right after tokenization, and _before_ the document is parsed.\nIf your text contains a lot of emoji, this might even give you a nice boost in\nparser accuracy.\n\n## Available attributes\n\nThe extension sets attributes on the `Doc`, `Span` and `Token`. You can change\nthe attribute names (and other parameters of the Emoji component) by passing\nthem via the `config` parameter in the `nlp.add_pipe(...)` method. For more\ndetails on custom components and attributes, see the\n[processing pipelines documentation](https://spacy.io/usage/processing-pipelines#custom-components).\n\n| Attribute            | Type                       | Description                                                   |\n| -------------------- | -------------------------- | ------------------------------------------------------------- |\n| `Token._.is_emoji`   | bool                       | Whether the token is an emoji.                                |\n| `Token._.emoji_desc` | str                        | A human-readable description of the emoji.                    |\n| `Doc._.has_emoji`    | bool                       | Whether the document contains emoji.                          |\n| `Doc._.emoji`        | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the document's emoji. |\n| `Span._.has_emoji`   | bool                       | Whether the span contains emoji.                              |\n| `Span._.emoji`       | List[Tuple[str, int, str]] | `(emoji, index, description)` tuples of the span's emoji.     
|\n\n## Settings\n\nYou can configure the `emoji` factory by setting any of the following parameters\nin the `config` dictionary:\n\n| Setting       | Type                      | Description                                                                                                                            |\n| ------------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |\n| `attrs`       | Tuple[str, str, str, str] | Attributes to set on the `._` property. Defaults to `('has_emoji', 'is_emoji', 'emoji_desc', 'emoji')`.                                |\n| `pattern_id`  | str                       | ID of match pattern, defaults to `'EMOJI'`. Can be changed to avoid ID conflicts.                                                      |\n| `merge_spans` | bool                      | Merge spans containing multi-character emoji, defaults to `True`. Will only merge combined emoji resulting in one icon, not sequences. |\n| `lookup`      | Dict[str, str]            | Optional lookup table that maps emoji strings to custom descriptions, e.g. translations or other annotations.                          
|\n\n```python\nimport spacy\n\nnlp = spacy.blank(\"en\")\n# Custom attribute names plus a lookup table that overrides the description\nemoji_config = {\"attrs\": (\"has_e\", \"is_e\", \"e_desc\", \"e\"), \"lookup\": {\"👨‍🎤\": \"David Bowie\"}}\nnlp.add_pipe(\"emoji\", first=True, config=emoji_config)\ndoc = nlp(\"We can be 👨‍🎤 heroes\")\nassert doc[3]._.is_e\nassert doc[3]._.e_desc == \"David Bowie\"\n```\n\nIf you're training a pipeline, you can define the component config in your\n[`config.cfg`](https://spacy.io/usage/training):\n\n```ini\n[nlp]\npipeline = [\"emoji\", \"ner\"]\n# ...\n\n[components.emoji]\nfactory = \"emoji\"\nmerge_spans = false\n```\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacymoji","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexplosion%2Fspacymoji","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexplosion%2Fspacymoji/lists"}