{"id":19316579,"url":"https://github.com/tokenmill/dictionary-annotator","last_synced_at":"2025-04-22T17:30:26.275Z","repository":{"id":138584182,"uuid":"72636070","full_name":"tokenmill/dictionary-annotator","owner":"tokenmill","description":"Fast and configurable UIMA dictionary annotator.","archived":false,"fork":false,"pushed_at":"2023-04-17T15:52:33.000Z","size":65,"stargazers_count":7,"open_issues_count":3,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-02T02:02:09.555Z","etag":null,"topics":["annotators","csv","dictionary","dkpro","nlp","ruta"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tokenmill.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-11-02T12:03:24.000Z","updated_at":"2022-12-03T11:20:32.000Z","dependencies_parsed_at":"2023-09-29T07:19:30.934Z","dependency_job_id":null,"html_url":"https://github.com/tokenmill/dictionary-annotator","commit_stats":{"total_commits":26,"total_committers":5,"mean_commits":5.2,"dds":"0.15384615384615385","last_synced_commit":"5a26eb7a26d9e8a627dd39b663d541f7bafce913"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fdictionary-annotator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fdictionary-annotator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tokenmill%2Fdictionary-annotator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
tokenmill%2Fdictionary-annotator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tokenmill","download_url":"https://codeload.github.com/tokenmill/dictionary-annotator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250287337,"owners_count":21405588,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotators","csv","dictionary","dkpro","nlp","ruta"],"created_at":"2024-11-10T01:11:56.812Z","updated_at":"2025-04-22T17:30:25.927Z","avatar_url":"https://github.com/tokenmill.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca href=\"http://www.tokenmill.lt\"\u003e\n      \u003cimg src=\".github/tokenmill-logo.svg\" width=\"125\" height=\"125\" align=\"right\" /\u003e\n\u003c/a\u003e\n\n# dictionary-annotator\n\nDictionary Annotator is inspired by DKPro's [dictionary-annotator](https://github.com/dkpro/dkpro-core/tree/master/dkpro-core-dictionaryannotator-asl) and UIMA Ruta's [MARKTABLE](https://uima.apache.org/d/ruta-current/tools.ruta.book.html#ugr.tools.ruta.language.actions.marktable) action.\n\n## Features\n\n* Annotates JCas with phrases from a CSV file (supported by DKPro and MARKTABLE)\n* Supports multiple annotations with different features on the same block of text (not supported by DKPro or MARKTABLE)\n* Configurable case sensitivity (supported by MARKTABLE)\n* Supports an unlimited number of annotation features (supported by MARKTABLE)\n* Configurable tokenizer (not supported by DKPro or MARKTABLE)\n\n## Performance\n\nA simple 
performance benchmark was run to compare this annotator with the alternatives. Numbers are averages over 3 trials.\nTexts from [20 Newsgroups](http://qwone.com/~jason/20Newsgroups/) were used.\n\n| Annotator | Tokenization | Time (tokenization + dictionary) | Tokens/sec |\n|---|---|---|---|\n| DKPro dictionary-annotator | OpenNLP Simple Tokenizer | 368.2 sec | 8 724 |\n| Ruta MARKTABLE | OpenNLP Simple Tokenizer for dictionary, Ruta tokenizer for texts | 21.9 sec | 146 684 |\n| **This dictionary annotator** | OpenNLP Simple Tokenizer | 1.7 sec | 1 889 637 |\n\nHowever, this benchmark might be inaccurate because of the following differences between the annotators:\n\n * DKPro requires text to be segmented into sentences and tokens. During testing, the text was marked as a single sentence.\n * Ruta has its own rich tokenizer, which takes a significant amount of time.\n\nBenchmarking can be done by running ```./benchmark.sh```\n\n## Usage\n\nMaven dependency\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003elt.tokenmill.uima\u003c/groupId\u003e\n    \u003cartifactId\u003edictionary-annotator\u003c/artifactId\u003e\n    \u003cversion\u003e0.1.1\u003c/version\u003e\n\u003c/dependency\u003e\n\n```\n\nDictionary (leaders.csv)\n\n```csv\nBarack Obama,US,2009-01-20,2017-01-20,president,100023\nDalia Grybauskaite,Lithuania,2009-06-12,,president,100049\nDalia Grybauskaite,EU,2004-11-22,2009-06-01,commissioner,100050\n\n```\nConfiguration\n\n```java\nAnalysisEngineDescription description = AnalysisEngineFactory.createEngineDescription(DictionaryAnnotator.class,\n        DictionaryAnnotator.PARAM_DICTIONARY_LOCATION, \"classpath:leaders.csv\",\n        DictionaryAnnotator.PARAM_ANNOTATION_TYPE, Person.class.getName(),\n        DictionaryAnnotator.PARAM_DICTIONARY_CASE_SENSITIVE, true,\n        DictionaryAnnotator.PARAM_FEATURE_MAPPING, asList(\n                \"1 -\u003e country\", \"2 -\u003e from\", \"3 -\u003e to\", \"5 
-\u003e id\", \"4 -\u003e role\"));\n```\n\nRunning it on the text ```Barack Obama met Dalia Grybauskaite in Vilnius``` produces 3 annotations:\n\n```\nPerson(id=100023, from=\"2009-01-20\", to=\"2017-01-20\", country=\"US\", role=\"president\"),\nPerson(id=100049, from=\"2009-06-12\", to=null, country=\"Lithuania\", role=\"president\"),\nPerson(id=100050, from=\"2004-11-22\", to=\"2009-06-01\", country=\"EU\", role=\"commissioner\")\n```\n\nA working example can be found in [DictionaryAnnotatorTest](https://github.com/tokenmill/dictionary-annotator/blob/master/src/test/java/lt/tokenmill/uima/dictionaryannotator/DictionaryAnnotatorTest.java).\n\n## Configuration\n\n### Basic Example\n\n```java\nAnalysisEngineDescription description = AnalysisEngineFactory.createEngineDescription(DictionaryAnnotator.class,\n        DictionaryAnnotator.PARAM_DICTIONARY_LOCATION, \"classpath:dictionary.csv\",\n        DictionaryAnnotator.PARAM_ANNOTATION_TYPE, DictionaryEntry.class.getName(),\n        DictionaryAnnotator.PARAM_DICTIONARY_CASE_SENSITIVE, false,\n        DictionaryAnnotator.PARAM_FEATURE_MAPPING, asList(\n                \"1 -\u003e feature1\", \"2 -\u003e feature2\"));\n```\n\n### Tokenizer\n\nBy default, a whitespace tokenizer is used to tokenize dictionary entries. 
\nBut you can provide a custom one (usually you want your text and dictionary tokenized by the same tokenizer):\n\n```java\nAnalysisEngineDescription description = AnalysisEngineFactory.createEngineDescription(DictionaryAnnotator.class,\n        DictionaryAnnotator.PARAM_DICTIONARY_LOCATION, \"classpath:dictionary.csv\",\n        DictionaryAnnotator.PARAM_TOKENIZER_CLASS, YourDictionaryTokenizer.class.getName(),\n        DictionaryAnnotator.PARAM_ANNOTATION_TYPE, DictionaryEntry.class.getName(),\n        DictionaryAnnotator.PARAM_DICTIONARY_CASE_SENSITIVE, false,\n        DictionaryAnnotator.PARAM_FEATURE_MAPPING, asList(\n                \"1 -\u003e feature1\", \"2 -\u003e feature2\"));\n```\n\nNOTE: The tokenizer must implement ```lt.tokenmill.uima.dictionaryannotator.DictionaryTokenizer```\n\n### Accent-insensitive matching\n\nThe dictionary annotator can match text while ignoring letter accents. To enable this feature, set the following configuration property to ```false```:\n\n```java\nDictionaryAnnotator.PARAM_DICTIONARY_ACCENT_SENSITIVE\n```\n\n## Known issues\n\nIf a line in a long CSV file lacks a closing quote character, the CSV reader might struggle to finish its job. If you know that each line corresponds to exactly one dictionary entry, check whether any lines contain exactly one quote character and fix those lines. One possible solution is to get rid of the problematic lines altogether. For example, if the quote character is `\"`, then 
with `sed` delete those lines in the same file:\n```bash\nsed -i -e '/^[^\\\"]*\\\"[^\\\"]*$/d' input-file.csv\n```\n\n## TODO\n\n* Phrase matching using stemmed tokens\n* Configurable CSV separator\n* Configurable ignored characters (as in MARKTABLE)\n\n## License\n\nCopyright \u0026copy; 2019 [TokenMill UAB](http://www.tokenmill.lt).\n\nDistributed under the Apache License, Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fdictionary-annotator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftokenmill%2Fdictionary-annotator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftokenmill%2Fdictionary-annotator/lists"}