Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/centrefordigitalhumanities/textminer
A script to detect named entities and store them in an Elasticsearch annotated_text field
- Host: GitHub
- URL: https://github.com/centrefordigitalhumanities/textminer
- Owner: CentreForDigitalHumanities
- License: MIT
- Created: 2023-11-01T13:39:48.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-29T11:41:05.000Z (9 months ago)
- Last Synced: 2024-04-24T11:12:06.868Z (7 months ago)
- Topics: annotation, elasticsearch, ner, spacy
- Language: Python
- Homepage:
- Size: 17.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 4
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# TextMiNER
TextMiNER is a collection of scripts to perform named entity recognition (NER) on text, using the Python library [spaCy](https://spacy.io/). The detected named entities are saved in an Elasticsearch [annotated-text](https://www.elastic.co/guide/en/elasticsearch/plugins/8.10/mapper-annotated-text.html) field.

## Requirements
- Python 3.10 or newer
- Elasticsearch 8 or newer
- Elasticsearch's annotated-text plugin. To install it, run:
```
sudo bin/elasticsearch-plugin install mapper-annotated-text
```

## Docker
This repository contains Docker images and a `docker-compose` file for running and testing the scripts locally. `docker-compose` requires a `.env` file, created next to `docker-compose.yaml`, with the following values:
```
ES_HOST=elasticsearch
ELASTIC_ROOT_PASSWORD={password-of-your-choice}
```

## Usage
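Whether defined in the `.env` file above or exported in your shell, these settings reach the scripts as ordinary environment variables. A minimal sketch of collecting them in Python — the variable names are those used in this README, but the function and the `localhost` fallback are illustrative assumptions, not the repository's actual code:

```python
import os

def es_settings() -> dict:
    """Collect Elasticsearch connection settings from the environment.

    Variable names follow this README; the localhost fallback for
    ES_HOST is an assumption, not documented behaviour.
    """
    return {
        "host": os.environ.get("ES_HOST", "localhost"),
        # Only needed when connecting to a cluster with an API key:
        "api_id": os.environ.get("API_ID"),
        "api_key": os.environ.get("API_KEY"),
        "certs_location": os.environ.get("CERTS_LOCATION"),
    }
```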
### Environment
Before running the script, set environment variables: `ES_HOST` if you don't run Elasticsearch on localhost, and `API_ID`, `API_KEY`, and `CERTS_LOCATION` if you access an Elasticsearch cluster using an API key.

### spaCy models
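`spacy.load` raises `OSError` when a requested model is not installed. A sketch of resolving a language name to a model and failing with a download hint — the mapping is assumed from the download commands below, and the function is illustrative, not the repository's code:

```python
# Hypothetical mapping from a language name to a spaCy model name,
# assumed from the two download commands in this README.
MODELS = {
    "english": "en_core_web_sm",
    "dutch": "nl_core_news_sm",
}

def load_model(language: str):
    """Load the spaCy model for a language, with helpful errors."""
    try:
        model = MODELS[language]
    except KeyError:
        raise SystemExit(f"No model configured for language {language!r}")
    import spacy  # imported lazily so the mapping works without spaCy
    try:
        return spacy.load(model)
    except OSError:
        raise SystemExit(f"Run: python -m spacy download {model}")
```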
Make sure you have the required SpaCy models for NER analysis by running
```
python -m spacy download en_core_web_sm
python -m spacy download nl_core_news_sm
```

### Run the script (without Docker)
To analyze data from an Elasticsearch index with spaCy, and save the result back into an annotated field, change to the `code` directory (`cd code`) and run:
`python process_documents.py -i {index_name} -f {field_name} -l {language}`

To run this for an English-language corpus indexed as "test", with text data saved in the field "content", you could run
`python process_documents.py -i test -f content -l english`

### Run the script locally (with Docker)
Alternatively, to run with Docker (without changing to `code` first), run
`docker-compose run --rm backend python process_documents.py -i {index_name} -f {field_name} -l {language}`
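Under the hood, the annotated-text field stores annotations inline in the source text using a markdown-like `[text](value)` syntax, with annotation values URL-encoded. A sketch of converting entity character spans — such as spaCy produces via `ent.start_char`, `ent.end_char`, and `ent.label_` — into that markup; the function name and structure are illustrative, not the repository's actual code:

```python
from urllib.parse import quote

def to_annotated_text(text: str, entities) -> str:
    """Wrap entity spans in annotated-text markup.

    entities: iterable of (start, end, label) character offsets,
    in the shape spaCy's doc.ents would provide.
    """
    out = []
    last = 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])
        # Annotation values are URL-encoded per the plugin's syntax.
        out.append(f"[{text[start:end]}]({quote(label)})")
        last = end
    out.append(text[last:])
    return "".join(out)

print(to_annotated_text("Rembrandt lived in Amsterdam.",
                        [(0, 9, "PERSON"), (19, 28, "GPE")]))
# → [Rembrandt](PERSON) lived in [Amsterdam](GPE).
```

A string in this form, indexed into an `annotated_text` mapped field, makes the entity labels searchable alongside the original text.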