{"id":18947484,"url":"https://github.com/pepperkit/corenlp-stop-words-annotator","last_synced_at":"2025-09-09T07:45:46.068Z","repository":{"id":37961883,"uuid":"341634386","full_name":"pepperkit/corenlp-stop-words-annotator","owner":"pepperkit","description":"Stop words annotator for Stanford's CoreNLP library.","archived":false,"fork":false,"pushed_at":"2024-06-10T14:19:40.000Z","size":65,"stargazers_count":3,"open_issues_count":6,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-29T04:04:37.460Z","etag":null,"topics":["annotator","corenlp","nlp","stop-words"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pepperkit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-02-23T17:28:05.000Z","updated_at":"2023-11-24T23:57:09.000Z","dependencies_parsed_at":"2023-11-23T13:30:34.328Z","dependency_job_id":"c7598961-3c6b-4d0e-982c-61654feacc96","html_url":"https://github.com/pepperkit/corenlp-stop-words-annotator","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pepperkit%2Fcorenlp-stop-words-annotator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pepperkit%2Fcorenlp-stop-words-annotator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pepperkit%2Fcorenlp-stop-words-annotator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/p
epperkit%2Fcorenlp-stop-words-annotator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pepperkit","download_url":"https://codeload.github.com/pepperkit/corenlp-stop-words-annotator/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249166210,"owners_count":21223408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotator","corenlp","nlp","stop-words"],"created_at":"2024-11-08T13:10:06.472Z","updated_at":"2025-04-15T22:31:42.979Z","avatar_url":"https://github.com/pepperkit.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CoreNLP Stop Words Annotator\n\n![StopWordsAnnotator](https://img.shields.io/badge/CoreNLP%20Compatible-v4.3-blue)\n[![Java CI with Maven](https://github.com/pepperkit/corenlp-stop-words-annotator/actions/workflows/maven.yml/badge.svg?branch=master)](https://github.com/pepperkit/corenlp-stop-words-annotator/actions/workflows/maven.yml)\n[![Coverage](https://sonarcloud.io/api/project_badges/measure?project=pepperkit_corenlp-stop-words-annotator\u0026metric=coverage)](https://sonarcloud.io/dashboard?id=pepperkit_corenlp-stop-words-annotator)\n[![Maintainability Rating](https://sonarcloud.io/api/project_badges/measure?project=pepperkit_corenlp-stop-words-annotator\u0026metric=sqale_rating)](https://sonarcloud.io/dashboard?id=pepperkit_corenlp-stop-words-annotator)\n[![Reliability 
Rating](https://sonarcloud.io/api/project_badges/measure?project=pepperkit_corenlp-stop-words-annotator\u0026metric=reliability_rating)](https://sonarcloud.io/dashboard?id=pepperkit_corenlp-stop-words-annotator)\n[![Security Rating](https://sonarcloud.io/api/project_badges/measure?project=pepperkit_corenlp-stop-words-annotator\u0026metric=security_rating)](https://sonarcloud.io/dashboard?id=pepperkit_corenlp-stop-words-annotator)\n\nAn annotator for the CoreNLP library that lets you define a set of rules and/or the words themselves that should be filtered out during\nCoreNLP pipeline processing.\n\n## Usage\nAdd the annotator and the CoreNLP library with its models to the dependencies list like this:\n```xml\n        \u003cdependency\u003e\n            \u003cgroupId\u003eio.github.pepperkit\u003c/groupId\u003e\n            \u003cartifactId\u003ecorenlp-stop-words-annotator\u003c/artifactId\u003e\n            \u003cversion\u003e1.0.0\u003c/version\u003e\n        \u003c/dependency\u003e\n\n        \u003cdependency\u003e\n            \u003cgroupId\u003eedu.stanford.nlp\u003c/groupId\u003e\n            \u003cartifactId\u003estanford-corenlp\u003c/artifactId\u003e\n            \u003cversion\u003e4.2.2\u003c/version\u003e\n        \u003c/dependency\u003e\n        \u003cdependency\u003e\n            \u003cgroupId\u003eedu.stanford.nlp\u003c/groupId\u003e\n            \u003cartifactId\u003estanford-corenlp\u003c/artifactId\u003e\n            \u003cversion\u003e4.2.2\u003c/version\u003e\n            \u003cclassifier\u003emodels\u003c/classifier\u003e\n        \u003c/dependency\u003e\n```\n\nThe annotator is configured with `Properties`. It marks words as stopped using one or more of the following rules:\n- a provided list of particular words (and/or their lemmas), given as a string of comma-separated words or as a file of newline-separated \n  words (from any place in the file system or from a bundled resource) - `stopwords.customList`, `stopwords.customListFilePath`, \n  and 
`stopwords.customListResourcesFilePath` properties (if more than one of these properties is provided, only one list of words\n  is initialized, in this order of precedence: the string of words, then the file, then the bundled resource);\n- POS (part-of-speech) categories (of the words' lemmas) as a string containing a comma-separated list of the categories - `stopwords.withPosCategories` property;\n- the length of a word or its lemma - `stopwords.shorterThan` and `stopwords.withLemmasShorterThan` properties.\n\nDescriptions of the available POS categories can be found here (see also the complex example below):\n - https://nlp.stanford.edu/software/pos-tagger-faq.html\n - https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf\n\n### Requirements\n- Java 8 or higher;\n- the annotator must be added to the project's POM as a dependency;\n- the CoreNLP library must be present on the classpath;\n- the *tokenize*, *ssplit*, *pos*, and *lemma* annotators must come before the *stopwords* annotator in the pipeline.\n\n### Simple Example\nTo filter out only the words from a list of stop words, do the following:\n```java\nimport java.util.HashSet;\nimport java.util.List;\nimport java.util.Properties;\nimport java.util.Set;\n\nimport edu.stanford.nlp.ling.CoreAnnotations;\nimport edu.stanford.nlp.ling.CoreLabel;\nimport edu.stanford.nlp.pipeline.Annotation;\nimport edu.stanford.nlp.pipeline.StanfordCoreNLP;\nimport io.github.pepperkit.corenlp.stopwords.StopWordsAnnotator;\n\nclass Example {\n    public Set\u003cString\u003e getInterestingWords() {\n        final String text = \"Once upon a time there was a dear little girl who was loved by everyone who looked at her\";\n        \n        final Properties props = new Properties();\n        props.setProperty(\"annotators\", \"tokenize, ssplit, pos, lemma, stopwords\");\n        props.setProperty(\"customAnnotatorClass.stopwords\", \"io.github.pepperkit.corenlp.stopwords.StopWordsAnnotator\");\n        props.setProperty(\"ssplit.isOneSentence\", \"true\");\n        \n        // Filter out these words\n        props.setProperty(\"stopwords.customList\", \"once,upon,a,little,girl\");\n\n        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);\n        Annotation document = new Annotation(text);\n        
pipeline.annotate(document);\n\n        Set\u003cString\u003e result = new HashSet\u003c\u003e();\n        List\u003cCoreLabel\u003e tokens = document.get(CoreAnnotations.TokensAnnotation.class);\n\n        for (CoreLabel token : tokens) {\n            // token.get(StopWordsAnnotator.class) will be TRUE if the word is stopped\n            if (!token.get(StopWordsAnnotator.class)) {\n                result.add(token.get(CoreAnnotations.LemmaAnnotation.class));\n            }\n        }\n        return result;\n    }\n}\n```\n\n### Complex Example\nLet's use the *stopwords* annotator in a more complex scenario: we need to process a text and extract the set of lemmas of only \n\"interesting\" words, where a word is considered \"interesting\" if it is not a common word (a more detailed definition follows below).\n\n**Scenario**:\n\n*Given* I have the text  \n  *And* the stop words are defined in the resources file (containing the most common English words)  \n*When* I launch text processing using StanfordCoreNLP pipeline with StopWordsAnnotator  \n  *And* set it to mark a word as stopped if its lemma is shorter than 3 letters (to remove all the punctuation and simple words like be, so, etc.)  \n  *And* to mark it as stopped if it is of a POS category I am not interested in  \n  *And* to mark it as stopped if it is in the list of stop words I provided in the resources file  \n*Then* I should be able to filter out the common words from the text  \n\n```java\nclass Example {\n    public Set\u003cString\u003e getInterestingWords() {\n        // I have the text:\n        final String text = \"Once upon a time there was a dear little girl who was loved by everyone who looked at \" +\n                \"her, but most of all by her grandmother, and there was nothing that she would not have given to the \" +\n                \"child. 
Once she gave her a little riding hood of red velvet, which suited her so well that she would\" +\n                \" never wear anything else; so she was always called 'Little Red Riding Hood.'\";\n\n        // I want to get the list of lemmas created from the text, excluding words from the provided list and all the\n        // common or simple words (like prepositions, conjunctions, etc.), since I want to extract only the words\n        // I might be interested in learning\n        String[] expectedWords = {\"dear\", \"look\", \"have\", \"give\", \"riding\", \"hood\", \"velvet\", \"suit\", \"wear\", \"call\"};\n\n        // And the stop words in resources (containing the most common English words)\n        final String stopWordsResourcePath = \"common-words-list-it.txt\";\n\n        final Properties props = new Properties();\n        props.setProperty(\"annotators\", \"tokenize, ssplit, pos, lemma, stopwords\");\n        props.setProperty(\"customAnnotatorClass.stopwords\", \"io.github.pepperkit.corenlp.stopwords.StopWordsAnnotator\");\n        props.setProperty(\"ssplit.isOneSentence\", \"true\");\n\n        // to filter out all the punctuation and simple words like be, so, etc.\n        props.setProperty(\"stopwords.withLemmasShorterThan\", \"3\");\n\n        // to filter out all the common and simple words\n        // Description of the available POS categories can be found here:\n        // - https://nlp.stanford.edu/software/pos-tagger-faq.html\n        // - https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf\n        props.setProperty(\"stopwords.withPosCategories\",\n                \"NNP,NNPS,\" + // proper noun, singular and plural\n                        \"PDT,\" + // predeterminer\n                        \"IN,CC,\" + // preposition or subordinating conjunction, and coordinating conjunction (but, and, etc.)\n                        \"DT,\" + // determiner - the, a, etc.\n                        \"UH,\" + // interjection - oh, uh, etc.\n                        
\"FW,\" + // foreign word\n                        \"MD,\" + // modal verb\n                        \"RP,\" + // particle\n                        \"PRP,PRP$,\" + // personal and possessive pronouns\n                        \"EX,\" + // existential there\n                        \"POS,\" + // possessive ending: 's\n                        \"SYM,\" + // symbol\n                        \"WDT,WP,WP$,\" + // wh-determiner (which, that), wh-pronoun (who, what, whom) and possessive wh-pronoun (whose)\n                        \"WRB\" // wh-adverb\n        );\n\n        // provide the file with stop words list\n        props.setProperty(\"stopwords.customListResourcesFilePath\", stopWordsResourcePath);\n\n        // Annotate the text using StanfordCoreNLP pipeline\n        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);\n        Annotation document = new Annotation(text);\n        pipeline.annotate(document);\n\n        // Process returned tokens\n        Set\u003cString\u003e result = new HashSet\u003c\u003e();\n        List\u003cCoreLabel\u003e tokens = document.get(CoreAnnotations.TokensAnnotation.class);\n\n        // Keep only the lemmas of interesting words\n        for (CoreLabel token : tokens) {\n            if (!token.get(StopWordsAnnotator.class)) {\n                result.add(token.get(CoreAnnotations.LemmaAnnotation.class));\n            }\n        }\n        return result;\n    }\n}\n```\n\n## Project structure\n```\n└── src\n    ├── main                # code of the annotator\n    ├── test                # unit tests\n    └── integration-test    # integration tests\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpepperkit%2Fcorenlp-stop-words-annotator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpepperkit%2Fcorenlp-stop-words-annotator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpepperkit%2Fcorenlp-stop-words-annotator/lists"}