{"id":13482570,"url":"https://github.com/ispras/atr4s","last_synced_at":"2025-04-10T23:41:59.085Z","repository":{"id":57742722,"uuid":"74143779","full_name":"ispras/atr4s","owner":"ispras","description":"Toolkit with state-of-the-art Automatic Terms Recognition methods in Scala","archived":false,"fork":false,"pushed_at":"2018-07-23T21:35:28.000Z","size":184,"stargazers_count":35,"open_issues_count":3,"forks_count":5,"subscribers_count":19,"default_branch":"master","last_synced_at":"2025-03-24T20:37:53.837Z","etag":null,"topics":["nlp-keywords-extraction","nlp-library","scala","terminology-extraction"],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ispras.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-11-18T15:50:10.000Z","updated_at":"2024-11-16T22:53:31.000Z","dependencies_parsed_at":"2022-09-09T11:21:17.137Z","dependency_job_id":null,"html_url":"https://github.com/ispras/atr4s","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ispras%2Fatr4s","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ispras%2Fatr4s/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ispras%2Fatr4s/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ispras%2Fatr4s/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ispras","download_url":"https://codeload.github.com/ispras/atr4s/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://gi
thub.com","kind":"github","repositories_count":248317732,"owners_count":21083527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nlp-keywords-extraction","nlp-library","scala","terminology-extraction"],"created_at":"2024-07-31T17:01:03.356Z","updated_at":"2025-04-10T23:41:59.061Z","avatar_url":"https://github.com/ispras.png","language":"Scala","funding_links":[],"categories":["Packages","函式庫"],"sub_categories":["Libraries","書籍"],"readme":"# ATR4S\n\nAn open-source library for [Automatic Term Recognition](https://en.wikipedia.org/wiki/Terminology_extraction)\nwritten in Scala.\n\nTo cite ATR4S:\n\nN.Astrakhantsev.\nATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala.\narXiv preprint [arXiv:1611.07804](http://arxiv.org/abs/1611.07804), 2016.\n\n## Implemented algorithms\n\n1. AvgTermFreq\n2.  ResidualIDF\n3.  TotalTF-IDF\n4.  CValue\n5.  Basic\n6.  ComboBasic\n7.  PostRankDC\n8.  Relevance\n9.  Weirdness\n10.  DomainPertinence\n11.  NovelTopicModel\n12.  LinkProbability\n13.  KeyConceptRelatedness\n14.  Voting\n15.  
PU-ATR\n\n\n[//]: # (See details in the paper.)\n\n## Requirements\n\n### Libraries\n\nScala 2.11\n\nSpark 1.5+ (for Voting and PU-ATR)\n\n[Emory nlp4j](https://emorynlp.github.io/nlp4j/)\n\n([Apache OpenNLP](http://opennlp.apache.org/) is also supported, but\npreliminary experiments showed that its quality is not better than Emory nlp4j, while it is not thread-safe;\nif you are going to use OpenNLP, download models from Apache OpenNLP and place them into `src/main/resources`)\n\n([Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) is also supported by\n[this helper](https://github.com/ispras/atr4s/releases/download/v1.2/StanfordNLPPreprocessor.scala),\nwhich is moved to a separate module licensed by GPL, due to GPL licensing of Stanford CoreNLP).\n\n### Data\n\nIn order to use some algorithms you need to download auxiliary files and place them into\n`WORKING_DIRECTORY/data` directory (note that working directory can be specified in `gradle.properties` - by default, this is `experiments`)\nor specify path in the corresponding configuration/builder class\n(e.g. 
`Word2VecAdapterConfig` of `KeyConceptRelatedness`).\n\nNamely,\n- for **LinkProbability** download [info_measure.txt](https://github.com/ispras/atr4s/releases/download/v1.2/info-measure.txt); \n- for **Relevance** download [COHA_term_occurrences.txt](https://github.com/ispras/atr4s/releases/download/v1.2/COHA_term_occurrences.txt);\n- for **KeyConceptRelatedness** download [w2vConcepts.model](https://github.com/ispras/atr4s/releases/download/v1.2/w2vConcepts.model).\n\nDatasets used in the experiments can be downloaded from the [Release page](https://github.com/ispras/atr4s/releases/tag/v1.2).\n\n### OS\n\nThe PU-ATR algorithm may not work on Windows due to bugs in Spark (see these related Stack Overflow questions, \nwhich may help: \n[1](https://stackoverflow.com/questions/41825871/exception-while-deleting-spark-temp-dir-in-windows-7-64-bit),\n[2](https://stackoverflow.com/questions/31274170/spark-error-error-utils-exception-while-deleting-spark-temp-dir), \n[3](https://stackoverflow.com/questions/43731967/spark-failed-to-delete-temp-directory)).\n\n## Linking\n\nThe library is published to Maven Central and JCenter.\nAdd the following lines depending on your build system.\n\n### Gradle\n\n```gradle\ncompile 'ru.ispras:atr4s:1.2.2'\n```\n\n### Maven\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eru.ispras\u003c/groupId\u003e\n    \u003cartifactId\u003eatr4s\u003c/artifactId\u003e\n    \u003cversion\u003e1.2.2\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\n### SBT\n\n```\nlibraryDependencies += \"ru.ispras\" % \"atr4s\" % \"1.2.2\"\n```\n\n## Building from Sources\n\nBuild the library with Gradle:\n\n```shell\n./gradlew jar\n```\n\n## Usage\n\n### Command line example\n\n```shell\n./gradlew recognize -Pdataset=acl2 -PtopCount=10 -Pconfig=CValue.conf -Poutput=cvalueterms.txt\n```\n\nHere we recognize the top 10 terms from text files stored in the `acl2` directory \n(which should be a subdirectory of `WORKING_DIRECTORY`) using the CValue measure\n(configured in the `CValue.conf` 
file) and write the recognized terms with their weights to `cvalueterms.txt`.\n\nNote that if the encoding of the input text files differs from UTF-8, you should specify the correct encoding in the config of `NLPPreprocessor`\n(or convert the input files; there are many [tools](http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets) for that).\n\n### Program API\n\nSee the `ATRConfig` class, which is a configuration/builder for the facade class `AutomaticTermsRecognizer`.\n\nSee the `AutomaticTermsRecognizer` object for an example.\n\n### Program API (Java)\n\nUsage in Java does not differ significantly, so see the same classes for examples. \nHowever, since Java does not support parameters with default values, \nwe provide static helper functions named `make()` \nfor most classes containing parameters with default values or parameters with Scala collections; \nsee the example below.\n\nAlso note that there is a special method returning weighted terms as a Java Iterable, \nso that you won't need to convert Scala collections to Java ones.\n\n```java\nclass ATRExample {\n    public static void main(String[] args) {\n        String datasetDir = args[0];\n        // Command-line arguments are strings, so parse the count explicitly.\n        int topCount = Integer.parseInt(args[1]);\n        ATRConfig atrConfig = new ATRConfig(EmoryNLPPreprocessorConfig.make(),\n                TCCConfig.make(),\n                new OneFeatureTCWeighterConfig(Weirdness.make()));\n        Iterable\u003cWeightedTerm\u003e terms = atrConfig.build().recognizeAsJavaIterable(datasetDir, topCount);\n        for (WeightedTerm termAndWeight: terms) {\n            System.out.println(termAndWeight);\n        }\n    }\n}\n```\n\n## License\n\nApache License Version 2.0.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fispras%2Fatr4s","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fispras%2Fatr4s","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fispras%2Fatr4s/lists"}