{"id":19852563,"url":"https://github.com/rosette-api-community/rosette-named-entity-conversion-sample","last_synced_at":"2025-07-13T01:04:22.141Z","repository":{"id":84699326,"uuid":"75213928","full_name":"rosette-api-community/rosette-named-entity-conversion-sample","owner":"rosette-api-community","description":"Python example for converting Rosette named entity extraction results to other formats.","archived":false,"fork":false,"pushed_at":"2019-04-22T18:23:35.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-01-11T13:27:55.282Z","etag":null,"topics":["adm","entity-extraction","named-entities","named-entity-recognition","natural-language-processing","nlp","python","rosette"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rosette-api-community.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-11-30T18:12:58.000Z","updated_at":"2019-04-22T18:23:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"6a2fdc8f-43f0-4fca-9532-dda662cf9850","html_url":"https://github.com/rosette-api-community/rosette-named-entity-conversion-sample","commit_stats":{"total_commits":5,"total_committers":3,"mean_commits":"1.6666666666666667","dds":0.4,"last_synced_commit":"1a8fb62fc6e6d604cd5a858dfb9c1e878980f6d8"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rosette-api-community%2Fro
sette-named-entity-conversion-sample","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rosette-api-community%2Frosette-named-entity-conversion-sample/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rosette-api-community%2Frosette-named-entity-conversion-sample/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rosette-api-community%2Frosette-named-entity-conversion-sample/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rosette-api-community","download_url":"https://codeload.github.com/rosette-api-community/rosette-named-entity-conversion-sample/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241241449,"owners_count":19932751,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adm","entity-extraction","named-entities","named-entity-recognition","natural-language-processing","nlp","python","rosette"],"created_at":"2024-11-12T14:03:31.449Z","updated_at":"2025-02-28T21:17:43.886Z","avatar_url":"https://github.com/rosette-api-community.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"#Introduction\nThis repository contains an example Python script demonstrating how one might go about converting results from Rosette API's named entity extraction to the data format used in the [CoNLL 2003 shared task for named entity extraction](http://www.aclweb.org/anthology/W03-0419).\n\n##The Annotated Data Model\nTo convert the named entity annotations we take advantage of 
[Rosette's A(nnotated) D(ata) M(odel)](https://github.com/basis-technology-corp/annotated-data-model) via the Python bindings.  The following is a sample ADM you might receive when you set the `\"output\"` parameter to `\"rosette\"` and make an `entities` call to the Rosette API:\n\n    {\n        \"data\": \"New York City or NYC is the most populous city in the United States.\\n\",\n        \"attributes\": {\n            \"entities\": {\n                \"items\": [\n                    {\n                        \"headMentionIndex\": 0, \n                        \"mentions\": [\n                            {\n                                \"source\": \"gazetteer\", \n                                \"subsource\": \"/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin\", \n                                \"normalized\": \"New York City\", \n                                \"startOffset\": 0, \n                                \"endOffset\": 13\n                            }, \n                            {\n                                \"source\": \"gazetteer\", \n                                \"subsource\": \"/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin\", \n                                \"normalized\": \"NYC\", \n                                \"startOffset\": 17, \n                                \"endOffset\": 20\n                            }\n                        ], \n                        \"confidence\": 0.501718114501715, \n                        \"type\": \"LOCATION\", \n                        \"entityId\": \"Q60\"\n                    }, \n                    {\n                        \"headMentionIndex\": 0, \n                        \"mentions\": [\n                            {\n                                \"source\": \"gazetteer\", \n                                \"subsource\": \"/data/roots/rex/data/gazetteer/eng/accept/gaz-LE.bin\", \n                                \"normalized\": \"United States\", \n  
                              \"startOffset\": 54, \n                                \"endOffset\": 67\n                            }\n                        ], \n                        \"confidence\": 0.08375498050536179, \n                        \"type\": \"LOCATION\", \n                        \"entityId\": \"Q30\"\n                    }\n                ], \n                \"type\": \"list\", \n                \"itemType\": \"entities\"\n            }, \n            \"token\": {\n                \"items\": [\n                    {\n                        \"text\": \"New\", \n                        \"startOffset\": 0, \n                        \"endOffset\": 3\n                    }, \n                    {\n                        \"text\": \"York\", \n                        \"startOffset\": 4, \n                        \"endOffset\": 8\n                    }, \n                    {\n                        \"text\": \"City\", \n                        \"startOffset\": 9, \n                        \"endOffset\": 13\n                    }, \n                    {\n                        \"text\": \"or\", \n                        \"startOffset\": 14, \n                        \"endOffset\": 16\n                    }, \n                    {\n                        \"text\": \"NYC\", \n                        \"startOffset\": 17, \n                        \"endOffset\": 20\n                    }, \n                    {\n                        \"text\": \"is\", \n                        \"startOffset\": 21, \n                        \"endOffset\": 23\n                    }, \n                    {\n                        \"text\": \"the\", \n                        \"startOffset\": 24, \n                        \"endOffset\": 27\n                    }, \n                    {\n                        \"text\": \"most\", \n                        \"startOffset\": 28, \n                        \"endOffset\": 32\n                    }, \n                 
   {\n                        \"text\": \"populous\", \n                        \"startOffset\": 33, \n                        \"endOffset\": 41\n                    }, \n                    {\n                        \"text\": \"city\", \n                        \"startOffset\": 42, \n                        \"endOffset\": 46\n                    }, \n                    {\n                        \"text\": \"in\", \n                        \"startOffset\": 47, \n                        \"endOffset\": 49\n                    }, \n                    {\n                        \"text\": \"the\", \n                        \"startOffset\": 50, \n                        \"endOffset\": 53\n                    }, \n                    {\n                        \"text\": \"United\", \n                        \"startOffset\": 54, \n                        \"endOffset\": 60\n                    }, \n                    {\n                        \"text\": \"States\", \n                        \"startOffset\": 61, \n                        \"endOffset\": 67\n                    }, \n                    {\n                        \"text\": \".\", \n                        \"startOffset\": 67, \n                        \"endOffset\": 68\n                    }\n                ], \n                \"type\": \"list\", \n                \"itemType\": \"token\"\n            }, \n            \"scriptRegion\": {\n                \"items\": [\n                    {\n                        \"script\": \"Latn\", \n                        \"startOffset\": 0, \n                        \"endOffset\": 69\n                    }\n                ], \n                \"type\": \"list\", \n                \"itemType\": \"scriptRegion\"\n            }, \n            \"languageDetection\": {\n                \"detectionResults\": [\n                    {\n                        \"confidence\": 0.981137482980466, \n                        \"script\": \"Latn\", \n                        
\"language\": \"eng\", \n                        \"encoding\": \"UTF-16BE\"\n                    }\n                ], \n                \"type\": \"languageDetection\", \n                \"startOffset\": 0, \n                \"endOffset\": 69\n            }, \n            \"sentence\": {\n                \"items\": [\n                    {\n                        \"startOffset\": 0, \n                        \"endOffset\": 69\n                    }\n                ], \n                \"type\": \"list\", \n                \"itemType\": \"sentence\"\n            }\n        }, \n        \"responseHeaders\": {\n            \"X-RosetteAPI-Concurrency\": \"2\", \n            \"transfer-encoding\": \"chunked\", \n            \"Strict-Transport-Security\": \"max-age=63072000; includeSubdomains; preload\", \n            \"Server\": \"openresty\", \n            \"Connection\": \"keep-alive\", \n            \"X-RosetteAPI-Request-Id\": \"a53453af-7c40-4bd3-8849-513405f7cba0\", \n            \"Content-Encoding\": \"gzip\", \n            \"Vary\": \"Accept-Encoding\", \n            \"X-RosetteAPI-App-Id\": \"1409612466626\", \n            \"Date\": \"Tue, 29 Nov 2016 21:31:11 GMT\", \n            \"Content-Type\": \"application/json\"\n        }, \n        \"version\": \"1.1.0\", \n        \"documentMetadata\": {\n            \"processedBy\": [\n                \"whole-document-language@10.233.73.125\", \n                \"entity-extraction@10.233.177.187\", \n                \"entity-linking@10.233.177.187\"\n            ], \n            \"res-docid\": [\n                \"res-document-964ec8f4-f361-494f-828b-0bc746decdc0\"\n            ]\n        }\n    }\n\nFrom this result we can access all the information we need to pull out the entity extractions and format them in the way we want.\n\n##`rosette_to_conll2003.py`\nThis script traverses the words, sentences, and named entities identified in the ADM to produce CoNLL 2003-style output with one token per 
line.\n\n###Installing Dependencies with Virtualenv\nThe script is written for Python 3.  If you are comfortable installing external Python packages globally, you may skip this section.\n\nYou can install the dependencies using `virtualenv` so that you don't alter your global site packages.\n\nThe process for installing the dependencies using `virtualenv` is as follows for `bash` or similar shells:\n\nEnsure your `virtualenv` is up to date.\n\n    $ pip install -U virtualenv\n\n**Note**: You may need to use `pip3` depending on your Python installation.\n\n`cd` into the directory where the `rosette_to_conll2003.py` script is located and create a Python virtual environment (this is the same location as this README):\n\n    $ virtualenv .\n\nActivate the virtual environment:\n\n    $ source bin/activate\n\nOnce you've activated the virtual environment you can proceed to install the requirements safely without affecting your global site packages.\n\n###Installing the Dependencies\nYou can install the dependencies via `pip` (or `pip3` depending on your installation of Python 3) as follows using the provided `requirements.txt`:\n\n    $ pip install -r requirements.txt\n\n###Usage\nOnce you've installed the dependencies you can run the script as follows:\n\n    $ ./rosette_to_conll2003.py -h\n    usage: rosette_to_conll2003.py [-h] [-k KEY] [-u URL] [-l LANGUAGE] input\n    \n    Get Rosette API named entity results in CoNLL 2003-style BIO format\n    \n    positional arguments:\n      input                 A plain-text document to process\n    \n    optional arguments:\n      -h, --help            show this help message and exit\n      -k KEY, --key KEY     Rosette API Key (default: None)\n      -u URL, --url URL     Alternative API URL (default:\n                            https://api.rosette.com/rest/v1/)\n      -l LANGUAGE, --language LANGUAGE\n                            A three-letter (ISO 639-2 T) code that will override\n                            Rosette language 
detection (default: None)\n\nIf you do not use the `--key` option the script will prompt you to type in your Rosette API key before running.  If you find yourself running the script repeatedly, it may be convenient to set your Rosette API key as an environment variable in your shell:\n\n    $ export ROSETTE_USER_KEY=\u003cyour user key\u003e\n\nThen you can pass your key as an option with `-k $ROSETTE_USER_KEY`.\n\n###Example\nThe CoNLL 2003 data format has 4 fields separated by spaces:\n\n| Field | Description                |\n|:-----:|----------------------------|\n| 1     | A word token               |\n| 2     | A part-of-speech (POS) tag |\n| 3     | A syntactic chunk tag      |\n| 4     | A named entity tag         |\n\nThe following is a sample sentence annotated in the [CoNLL 2003 format](http://www.aclweb.org/anthology/W03-0419):\n\n    U.N. NNP I-NP I-ORG\n    official NN I-NP O\n    Ekeus NNP I-NP I-PER\n    heads VBZ I-VP O\n    for IN I-PP O\n    Baghdad NNP I-NP I-LOC\n    . . O O\n\nThe CoNLL 2003 format uses so-called BIO or B(eginning) I(nside) O(utside) tags to indicate the relative position of word tokens within named entity boundaries.  Tags for tokens that are part of a named entity are suffixed with a named entity type: `LOC`, `ORG`, `PER`, or `MISC`.  Note that in this script's output the first word within a named entity gets prefixed with `B-` because it is at the *beginning* of the mention.  (The original CoNLL 2003 data, as in the sample above, uses `B-` only when an entity immediately follows another entity of the same type.)  Subsequent tokens within a named entity are prefixed with `I-` indicating they are *inside* the entity mention.  All other tokens that are *outside* of an entity mention are tagged as `O`.\n\n**Note**: In this example we will ignore the second field.  You can get POS tags from the [Rosette API via the `morphology/parts-of-speech` endpoint](https://developer.rosette.com/features-and-functions#parts-of-speech), but that is a separate API call, and we are only concerned with the named entity tags here.  
Rosette does not currently offer syntactic chunking, so we will also ignore the third field (though we do offer [dependency parsing](https://developer.rosette.com/features-and-functions#syntactic-dependencies)).  In the fourth and final field, we use Rosette named entity tags, which include a larger, more informative set of named entity tags than the four tags used in the CoNLL 2003 shared task.\n\nYou can view the example text, `example/ny.txt`, as follows:\n\n    $ cat example/ny.txt \n    New York City or NYC is the most populous city in the United States.\n\nYou can run the script on the example file as follows:\n\n    $ ./rosette_to_conll2003.py example/ny.txt\n    Enter your Rosette API key: \n    -DOCSTART- -X- O O\n   \n    New   B-LOCATION\n    York   I-LOCATION\n    City   I-LOCATION\n    or   O\n    NYC   B-LOCATION\n    is   O\n    the   O\n    most   O\n    populous   O\n    city   O\n    in   O\n    the   O\n    United   B-LOCATION\n    States   I-LOCATION\n    .   O\n\nTo translate Rosette API named entity tags to CoNLL 2003 named entity tags, use the `--use-conll-ne-tags` option:\n\n    $ ./rosette_to_conll2003.py --use-conll-ne-tags example/ny.txt\n    Enter your Rosette API key: \n    -DOCSTART- -X- O O\n   \n    New   B-LOC\n    York   I-LOC\n    City   I-LOC\n    or   O\n    NYC   B-LOC\n    is   O\n    the   O\n    most   O\n    populous   O\n    city   O\n    in   O\n    the   O\n    United   B-LOC\n    States   I-LOC\n    .   O\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frosette-api-community%2Frosette-named-entity-conversion-sample","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frosette-api-community%2Frosette-named-entity-conversion-sample","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frosette-api-community%2Frosette-named-entity-conversion-sample/lists"}