{"id":22020654,"url":"https://github.com/uudigitalhumanitieslab/multiner","last_synced_at":"2025-03-23T10:17:09.087Z","repository":{"id":231541777,"uuid":"170504243","full_name":"UUDigitalHumanitieslab/multiNER","owner":"UUDigitalHumanitieslab","description":"A multiNER websevice based on the KB's multiNER","archived":false,"fork":false,"pushed_at":"2024-03-28T15:55:28.000Z","size":94,"stargazers_count":0,"open_issues_count":6,"forks_count":0,"subscribers_count":2,"default_branch":"develop","last_synced_at":"2025-01-28T16:43:48.084Z","etag":null,"topics":["ensemble-classifier","named-entity-recognition"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UUDigitalHumanitieslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-02-13T12:27:33.000Z","updated_at":"2024-04-04T12:42:03.000Z","dependencies_parsed_at":"2024-04-04T15:04:48.869Z","dependency_job_id":null,"html_url":"https://github.com/UUDigitalHumanitieslab/multiNER","commit_stats":null,"previous_names":["uudigitalhumanitieslab/multiner"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2FmultiNER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2FmultiNER/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2FmultiNER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UUDigitalHumanitieslab%2FmultiNER/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UUDigitalHumanitieslab","download_url":"https://codeload.github.com/UUDigitalHumanitieslab/multiNER/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245084608,"owners_count":20558251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ensemble-classifier","named-entity-recognition"],"created_at":"2024-11-30T06:07:25.713Z","updated_at":"2025-03-23T10:17:09.059Z","avatar_url":"https://github.com/UUDigitalHumanitieslab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# multiNER\n\nMultiNER is a webservice that combines the output from four different [named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) packages into one answer. This impementation is based on the [multiNER software implemented by the KB](https://github.com/KBNLresearch/multiNER), where it is part of the [DAC  project](https://github.com/KBNLresearch/dac) (Entity linker for the Dutch historical newspaper collection of the National Library of the Netherlands).\n\nThe following packages are used:\n\n- [Stanford NER](https://nlp.stanford.edu/software/CRF-NER.shtml)\n- [DBpedia Spotlight](https://www.dbpedia-spotlight.org/)\n- [spaCy](https://spacy.io/)\n- [polyglot](http://polyglot.readthedocs.io/)\n\n## Overview\n\nMultiNER is a Flask application that exposes one method (`collect_from_text`) that returns the entities found in a text based on the Named Entities suggested by the various NER packages. 
These packages are its main dependencies. Stanford and DbPedia Spotlight are Java applications that run as webservices, whereas Spacy and Polyglot are Python packages.\n\nMultiNER returns entities of four types: LOCATION, PERSON, ORGANIZATION and OTHER. All entity types suggested by the various NER packages are translated into these four.\n\nOne of the nice features of multiNER is that the user can configure the weight of the different packages for each API call. See below for details.\n\n## Prerequisites\n\n### Stanford\n\n#### Info and download\n\nMore information and a download link can be found on [this page](https://nlp.stanford.edu/software/CRF-NER.html).\n\n#### Models\n\nStanford by default includes some English models (in the folder `classifiers`). Models for other languages are available as tar archives, e.g. `dutch.tar.gz`. Untar such a model and supply it as `\u003cthemodelyouwanttouse\u003e`.\n\n#### Run\n\n```bash\njava -mx400m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -outputFormat inlineXML -encoding \"utf-8\" -loadClassifier classifiers/\u003cthemodelyouwanttouse\u003e.gz -port \u003cwhateverportyoulike\u003e\n```\n\nFor example, with the complete English model loaded and running on port 9899:\n\n```bash\njava -mx400m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -outputFormat inlineXML -encoding \"utf-8\" -loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz -port 9899\n```\n\n#### Access\n\nAccessing the Stanford webservice directly is done via the [Telnet protocol](https://en.wikipedia.org/wiki/Telnet). 
For example:\n\n```python\nimport telnetlib\n\n# Try to connect a limited number of times before giving up.\nconn = None\nretry = 0\nmax_retry = 10\n\nwhile conn is None and retry \u003c max_retry:\n    try:\n        conn = telnetlib.Telnet(host='\u003chostname\u003e', port=\u003cport\u003e, timeout=1000)\n    except Exception as e:\n        print(e)\n        retry += 1\n\nif conn is None:\n    raise ConnectionError('Could not connect to the Stanford NER server')\n\n# Send the text as UTF-8, terminated by a newline.\ntext = 'a text with lots of nice entities'\nconn.write(text.encode('utf-8') + b'\\n')\n# With `-outputFormat inlineXML`, the response is the text with inline XML tags around the entities.\nresult = conn.read_all().decode('utf-8')\nconn.close()\n```\n\n### DbPedia Spotlight\n\n#### Info and download\n\nMore info on the DbPedia API can be found [here](https://www.dbpedia-spotlight.org/api). Note that it returns multiple types for each entity (e.g. 'DbPlace', 'DbLocation', 'Administrative Region', 'Governmental Region', etc.).\n\nDownloads are available [here](https://sourceforge.net/projects/dbpedia-spotlight/files/spotlight/).\n\n#### Models\n\nModels for DbPedia Spotlight are available [here](https://sourceforge.net/projects/dbpedia-spotlight/files/).\n\nDownload the Dutch model:\n\n```bash\nwget https://sourceforge.net/projects/dbpedia-spotlight/files/2016-10/nl/model/nl.tar.gz/download\n```\n\n#### Run\n\nJava versions above 8 (i.e. 9 and higher) do not include all required Java modules by default (see [this SO answer](https://stackoverflow.com/a/43574427)). 
Therefore add `--add-modules java.se.ee` when you start the webservice (the `java.se.ee` module was removed in Java 11, so this flag only works on Java 9 and 10):\n\n```bash\njava --add-modules java.se.ee -jar dbpedia-spotlight-1.0.0.jar models/\u003cwhatevermodelyouwanttouse\u003e http://\u003chostname\u003e:\u003cportnumber\u003e/rest\n```\n\nFor example, run with the Italian model on localhost, port 2222:\n\n```bash\njava --add-modules java.se.ee -jar dbpedia-spotlight-1.0.0.jar models/it http://localhost:2222/rest\n```\n\n#### Access\n\nYou can access the DbPedia webservice via HTTP; it even works in your browser:\n\n```\nhttp://localhost:2222/rest/annotate/?text='Another pretty text with some incredibly relevant entities'\n```\n\nYou can also make the request with another tool, such as [curl](https://curl.haxx.se/) or [Postman](https://www.getpostman.com/).\n\n### Spacy\n\n#### Info and download\n\nSpacy is a Python package and can be installed via Pip. Setting up the multiNER application will automatically install Spacy for you. Note that installing Spacy can take a long time.\n\n#### Models\n\nSpacy models are Python packages as well, and they are also installed automatically. They are handled in a special way, though: as [the Spacy documentation specifies](https://spacy.io/usage/models#production), they are loaded directly from GitHub. In `requirements.txt` they are listed as, for example, `https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm`.\n\n#### Run and Access\n\nSpacy is run and accessed from the multiNER code. See the documentation if you need to consume it yourself.\n\n### Polyglot\n\n#### Info and download\n\nPolyglot, like Spacy, is a Python package and can be installed via Pip. Setting up the multiNER application will automatically install Polyglot for you.\n\nIt is somewhat special in that it tries to guess the language of the text it is being supplied (although you can suggest a language). 
This potentially leads to errors if models are not installed for the language Polyglot thinks it is NER-ing.\n\n#### Prerequisites\n\nInstalling Polyglot requires the package `libicu-dev` (`sudo apt install libicu-dev`). If you do not have this package installed, you will get errors like `Command \"python setup.py egg_info\" failed with error code 1 in /tmp/tmp3wehjwuobuild/pyicu/` when installing via pip (or compiling with pip-tools). The requirement that triggers this error is pyicu. Solutions for other operating systems can be found [in this SO post](https://stackoverflow.com/questions/40940188/error-installing-pip-pyicu).\n\n\n#### Models\n\nPolyglot models are Python packages, but these are not automatically installed. You need to download them manually. For example, for Dutch:\n\n```bash\npolyglot download embeddings2.nl\npolyglot download ner2.nl\n```\n\nFind all available models [here](https://polyglot.readthedocs.io/en/latest/Download.html#langauge-task-support).\n\n#### Run and Access\n\nPolyglot is run and accessed from the multiNER code. See the documentation if you need to consume it yourself.\n\n## Installation\n\nOnce you have the Stanford and DbPedia Spotlight webservices up and running, you can set up multiNER itself.\n\nSet up a virtualenv with Python 3.4:\n\n```bash\nvirtualenv .env -p python3.4 --prompt \"(multiner) \"\n```\n\nEnter the virtualenv: `source .env/bin/activate`\n\nInstall requirements: `pip install -r requirements.txt`\n\nSet Flask variables and run:\n\n```bash\nexport FLASK_APP=app.py\nexport FLASK_DEBUG=1\nflask run\n```\n\n## Tests\n\nThe multiNER application includes a bunch of unit tests. To run these, install pytest into your virtualenv (`pip install pytest`) and run: `pytest`.\n\n\n## Usage\n\n### Request\n\nYou should now have a webservice running at `http://localhost:5000/ner/collect_from_text`. 
It takes three parameters, of which one is required:\n\n| Name          | Required | Info                                         |\n| ------------- | -------- | -------------------------------------------- |\n| title         | False    | If provided this will also be NER-ed         |\n| text          | True     | The main text to be NER-ed                   |\n| configuration | False    | If omitted, the default configuration is used |\n\nExample request:\n\n```json\n{\n    \"title\": \"Mooi tekstje\",\n    \"text\": \"Utrecht ligt ver weg van Rotterdam\",\n    \"configuration\": {\n        \"language\": \"nl\",\n        \"context_length\": 3,\n        \"leading_ner_packages\": [\n            \"stanford\",\n            \"spotlight\"\n        ],\n        \"other_packages_min\": 2\n    }\n}\n```\n\n#### Configuration\n\nThe configuration parameter lets the user configure how multiNER creates one Named Entity from the suggestions made by the various NER packages. The following settings are available:\n\n| Setting | Info | Allowed values | Default |\n| ------- | ---- | -------------- | ------- |\n| language | The language of the text | 'nl', 'en', 'it' | 'en' |\n| context_length | Number of words to collect left and right of each entity | 0 to 15 | 5 |\n| leading_ner_packages | List of packages whose suggestions should always be included | 'stanford', 'spotlight', 'spacy', 'polyglot' | ['stanford', 'spotlight'] |\n| other_packages_min | The minimum number of non-leading packages required to suggest a Named Entity before it is included. E.g. if only Spacy suggests that a particular piece of text is a LOCATION and this setting's value is 2, the suggestion will be ignored. | 1 to 4 | 2 |\n| type_preference | A dictionary with format `{\u003cnumber\u003e:\u003ctype\u003e}` specifying which type is preferred when more than one type is suggested. Note that this determines which type an entity is if there is no agreement (i.e. 
all packages suggest a different type). Currently, 'type_preference' is not integrated with 'leading_ner_packages', i.e. it just considers the different types, not which package suggested it. | Keys: [1, 2, 3, 4], Values: ['LOCATION', 'PERSON', 'ORGANIZATION', 'OTHER'] | { 1: 'LOCATION', 2: 'PERSON', 3: 'ORGANIZATION', 4: 'OTHER' } |\n\n### Response\n\n#### General form\n\nMultiNER returns a JSON object with the following basic structure:\n\n```json\n    {\n        \"title\": {\n            \"text\": \"The title\",\n            \"entities\": [\n                // A list of the entities found\n            ]\n        }\n    },\n    {\n        \"text\": {\n            \"text\": \"The text\",\n            \"entities\": [\n                // A list of the entities found\n            ]\n        }\n    }\n```\n\nThat is, it returns an embedded JSON object for each part you send: the title and the text. Each of these contains the original text and the entities found in that text.\n\n#### Entities\n\nMultiNER returns Named Entities in which the suggestions from the various NER packages are integrated. Therefore, each entity has some special fields pertaining to the details of the integration, such as 'type_certainty' and 'alt_nes'. In addition, a list of all types suggested is also present.\n\n| Attribute | Info |\n| --- | --- |\n| pos | The start index of the entity in the text |\n| ne | The actual named entity, i.e. the text for which a type is suggested |\n| alt_nes | Alternative entities suggested for the same position in the text. For example, Stanford might suggest 'Angelina Jolie', whereas Spacy might suggest 'Angelina'. Such overlapping suggestions are stored here. Note that it is also possible that an `alt_ne` does not have the same starting index. For example, Polyglot may find 'Universiteit Utrecht' (an ORGANIZATION), and Spotlight suggests 'Utrecht' (a LOCATION). As long as Spotlight is referring to the same positions in the text, this is also considered an 'alt_ne'. 
|\n| right_context | A configurable number of words to the right of the entity |\n| left_context | A configurable number of words to the left of the entity |\n| count | The number of NER packages that suggest this entity |\n| type | The type of the entity found. Based on 'type_preference' if there is no agreement between the various NER packages. |\n| type_certainty | The number of packages that suggested the entity's type. |\n| ner_src | The packages that suggested this entity |\n| types | A list of the types suggested |\n\nExample response:\n\n```json\n    \"count\": 3,\n    \"type_certainty\": 2,\n    \"type\": \"PERSON\",\n    \"right_context\": \"zich mij te vragen, of\",\n    \"left_context\": \"op zijn allerlaatst verwaardigde mevrouw\",\n    \"pos\": 1324,\n    \"ne_context\": \"Manchon\",\n    \"ne\": \"Manchon\",\n    \"ner_src\": [\n        \"stanford\",\n        \"spacy\",\n        \"polyglot\"\n    ],\n    \"types\": [\n        \"PERSON\",\n        \"LOCATION\"\n    ]\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuudigitalhumanitieslab%2Fmultiner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fuudigitalhumanitieslab%2Fmultiner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fuudigitalhumanitieslab%2Fmultiner/lists"}