https://github.com/mpuig/spacy-lookup

Named Entity Recognition based on dictionaries
named-entity-recognition natural-language-processing ner nlp spacy spacy-extension spacy-pipeline

Host: GitHub
URL: https://github.com/mpuig/spacy-lookup
Owner: mpuig
License: mit
Created: 2018-01-15T17:32:39.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2019-03-03T18:17:59.000Z (over 6 years ago)
Last Synced: 2025-03-30T12:09:18.457Z (4 months ago)
Topics: named-entity-recognition, natural-language-processing, ner, nlp, spacy, spacy-extension, spacy-pipeline
Language: Python
Size: 3.55 MB
Stars: 242
Watchers: 9
Forks: 38
Open Issues: 5
Metadata Files:
- Readme: README.rst
- License: LICENSE

spacy-lookup: Named Entity Recognition based on dictionaries
************************************************************

spaCy v2.0 extension and pipeline component for adding Named Entities metadata
to ``Doc`` objects. It detects Named Entities using dictionaries. The extension
sets the custom ``Doc``, ``Token`` and ``Span`` attributes ``._.is_entity``,
``._.entity_type``, ``._.has_entities`` and ``._.entities``.

Named Entities are matched using the Python module ``flashtext``, which looks
terms up in the data provided by one or more dictionaries.
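
To make the idea of dictionary-based matching concrete, here is a minimal sketch in plain Python. It is not ``flashtext``'s implementation and not the code used by ``spacy-lookup``; the function name ``find_keywords`` and the simple regex tokenizer are assumptions made for illustration. It scans the tokens left to right and prefers the longest keyword at each position.

```python
import re


def find_keywords(text, keywords):
    """Return (keyword, start_token, end_token) for each longest match.

    Hypothetical helper: a plain-Python illustration of dictionary lookup,
    not the algorithm flashtext uses internally.
    """
    # Index each keyword by its lowercased token tuple.
    lookup = {tuple(k.lower().split()): k for k in keywords}
    max_len = max((len(key) for key in lookup), default=0)
    # Very rough tokenizer: words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())

    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lookup:
                matches.append((lookup[candidate], i, i + n))
                i += n  # Skip past the matched tokens.
                break
        else:
            i += 1  # No keyword starts at this token.
    return matches


matches = find_keywords(
    "I am a product manager for a java platform and python.",
    ["python", "product manager", "java platform"],
)
print(matches)
```

The end index is exclusive, so each match can be used directly as a slice over the token list.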

Installation
===============

``spacy-lookup`` requires ``spacy`` v2.0.16 or higher.

.. code:: bash

    pip install spacy-lookup

Usage
=====

First, you need to download a language model.

.. code:: bash

    python -m spacy download en

Import the component and initialise it with the shared ``nlp`` object (i.e. an
instance of ``Language``), which is used to initialise ``flashtext`` with the
shared vocab and to create the match patterns. Then add the component anywhere
in your pipeline.

.. code:: python

    import spacy
    from spacy_lookup import Entity

    # Load the English model downloaded above.
    nlp = spacy.load('en')

    # Build the lookup component from a list of terms and add it last.
    entity = Entity(keywords_list=['python', 'product manager', 'java platform'])
    nlp.add_pipe(entity, last=True)

    doc = nlp(u"I am a product manager for a java and python.")

    assert doc._.has_entities
    assert not doc[0]._.is_entity
    # The matched span "product manager" sits at token index 3.
    assert doc[3]._.entity_desc == 'product manager'
    assert doc[3]._.is_entity

    # List every entity token together with its canonical form.
    print([(token.text, token._.canonical) for token in doc if token._.is_entity])

``spacy-lookup`` only cares about the token text, so you can use it on a blank
``Language`` instance (it should work for all available languages) or in a
pipeline with a loaded model. If you're loading a model and your pipeline
includes a tagger, parser and entity recognizer, add the entity component with
``last=True`` so that the spans are merged at the end of the pipeline.

Available attributes
--------------------

The extension sets attributes on the ``Doc``, ``Span`` and ``Token``. You can
change the attribute names on initialisation of the extension. For more details
on custom components and attributes, see the spaCy processing pipelines
documentation.

======================== ======= ===
``Token._.is_entity``    bool    Whether the token is an entity.
``Token._.entity_type``  unicode A human-readable description of the entity.
``Doc._.has_entities``   bool    Whether the document contains any entities.
``Doc._.entities``       list    ``(entity, index, description)`` tuples of the document's entities.
``Span._.has_entities``  bool    Whether the span contains any entities.
``Span._.entities``      list    ``(entity, index, description)`` tuples of the span's entities.
======================== ======= ===
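
As a rough illustration of working with the ``(entity, index, description)`` tuples listed above, the sketch below groups them by description. The sample values and the helper name ``group_by_description`` are invented for this example; real values come from ``doc._.entities`` and may look different.

```python
from collections import defaultdict

# Hypothetical sample shaped like the documented (entity, index, description)
# tuples. The values are made up and are not output from spacy-lookup.
sample_entities = [
    ("python", 9, "language"),
    ("java platform", 7, "platform"),
    ("python", 14, "language"),
]


def group_by_description(entities):
    """Map each description to a sorted list of (index, entity) pairs."""
    grouped = defaultdict(list)
    for entity, index, description in entities:
        grouped[description].append((index, entity))
    # Sort by token index so results follow document order.
    return {desc: sorted(items) for desc, items in grouped.items()}


grouped = group_by_description(sample_entities)
print(grouped)
```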

Settings
--------

On initialisation of ``Entity``, you can define the following settings:

================= ============ ===
``nlp``           ``Language`` The shared ``nlp`` object. Used to initialise the matcher with the shared ``Vocab``, and create ``Doc`` match patterns.
``attrs``         tuple        Attributes to set on the ``._`` property. Defaults to ``('has_entities', 'is_entity', 'entity_type', 'entity')``.
``keywords_list`` list         Optional lookup table with the list of terms to look for.
``keywords_dict`` dict         Optional lookup table with the list of terms to look for.
``keywords_file`` string       Optional filename with the list of terms to look for.
================= ============ ===

.. code:: python

    entity = Entity(nlp, keywords_list=['python', 'java platform'], label='ACME')
    nlp.add_pipe(entity)

    doc = nlp(u"I am a product manager for a java platform and python.")

    # "product manager" is not a keyword here; the "java platform" match
    # starts at token index 7.
    assert doc[7]._.is_entity
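
The settings table does not describe the file format that ``keywords_file`` expects. If your terms live in a plain-text file, one cautious option is to read them yourself and pass the result as ``keywords_list``. The sketch below assumes one term per line and drops blank lines and case-insensitive duplicates; ``load_terms`` is a hypothetical helper, not part of ``spacy-lookup``.

```python
import io


def load_terms(lines):
    """Collect stripped, de-duplicated terms from an iterable of lines.

    Hypothetical helper. Assumes one term per line; duplicates are
    detected case-insensitively and the first spelling is kept.
    """
    seen = set()
    terms = []
    for line in lines:
        term = line.strip()
        if not term or term.lower() in seen:
            continue  # Skip blank lines and repeated terms.
        seen.add(term.lower())
        terms.append(term)
    return terms


# An in-memory stand-in for an open text file.
sample_file = io.StringIO("python\n\njava platform\nPython\n  product manager  \n")
terms = load_terms(sample_file)
print(terms)
```

With a real file, something like ``with open(path) as f: terms = load_terms(f)`` should produce a list that can be passed as ``keywords_list``.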
