Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/ispras/atr4s

Toolkit with state-of-the-art Automatic Terms Recognition methods in Scala

nlp-keywords-extraction nlp-library scala terminology-extraction


Host: GitHub
URL: https://github.com/ispras/atr4s
Owner: ispras
License: apache-2.0
Created: 2016-11-18T15:50:10.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2018-07-23T21:35:28.000Z (almost 6 years ago)
Last Synced: 2024-04-17T06:08:31.454Z (about 2 months ago)
Topics: nlp-keywords-extraction, nlp-library, scala, terminology-extraction
Language: Scala
Size: 180 KB
Stars: 34
Watchers: 20
Forks: 4
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-nlp - ATR4S - Toolkit with state-of-the-art [automatic term recognition](https://en.wikipedia.org/wiki/Terminology_extraction) methods. (Libraries / Books)
https-github.com-keon-awesome-nlp - ATR4S - Toolkit with state-of-the-art [automatic term recognition](https://en.wikipedia.org/wiki/Terminology_extraction) methods. (Packages / Libraries)

README

# ATR4S

An open-source library for [Automatic Term Recognition](https://en.wikipedia.org/wiki/Terminology_extraction) written in Scala.

To cite ATR4S:

N. Astrakhantsev.
ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala.
arXiv preprint [arXiv:1611.07804](http://arxiv.org/abs/1611.07804), 2016.

## Implemented algorithms

1. AvgTermFreq
2. ResidualIDF
3. TotalTF-IDF
4. CValue
5. Basic
6. ComboBasic
7. PostRankDC
8. Relevance
9. Weirdness
10. DomainPertinence
11. NovelTopicModel
12. LinkProbability
13. KeyConceptRelatedness
14. Voting
15. PU-ATR

[//]: # (See details in the paper.)
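
As a rough illustration of what one of the listed measures computes, the sketch below scores a term by Weirdness in its textbook form: the term's relative frequency in the domain corpus divided by its relative frequency in a general reference corpus. This is not the ATR4S implementation. The class `WeirdnessSketch`, its method names, the add-one smoothing, and the toy counts are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a textbook Weirdness score, not the ATR4S implementation.
class WeirdnessSketch {

    // Relative frequency of a term in a corpus. Add-one smoothing is an
    // assumption made here so that unseen terms do not cause division by zero.
    static double relativeFrequency(Map<String, Integer> counts, long totalTokens, String term) {
        int count = counts.getOrDefault(term, 0);
        return (count + 1.0) / (totalTokens + 1.0);
    }

    // Ratio of the domain-corpus frequency to the reference-corpus frequency.
    static double weirdness(String term,
                            Map<String, Integer> domainCounts, long domainTotal,
                            Map<String, Integer> referenceCounts, long referenceTotal) {
        return relativeFrequency(domainCounts, domainTotal, term)
                / relativeFrequency(referenceCounts, referenceTotal, term);
    }

    public static void main(String[] args) {
        // Toy counts, invented for the example.
        Map<String, Integer> domain = new HashMap<>();
        domain.put("parser", 49);
        domain.put("the", 59);
        Map<String, Integer> reference = new HashMap<>();
        reference.put("parser", 9);
        reference.put("the", 5999);

        for (String term : new String[] {"parser", "the"}) {
            double score = weirdness(term, domain, 999, reference, 99999);
            System.out.printf("%s\t%.2f%n", term, score);
        }
    }
}
```

With these toy counts, a domain-specific word such as "parser" scores far above 1, while a word that is equally common in both corpora scores close to 1.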

## Requirements

### Libraries

- Scala 2.11
- Spark 1.5+ (for Voting and PU-ATR)
- [Emory nlp4j](https://emorynlp.github.io/nlp4j/)

[Apache OpenNLP](http://opennlp.apache.org/) is also supported, but preliminary experiments showed that its quality is no better than that of Emory nlp4j, and it is not thread-safe. If you use OpenNLP, download the models from Apache OpenNLP and place them into `src/main/resources`.

[Stanford CoreNLP](http://stanfordnlp.github.io/CoreNLP/) is also supported through [this helper](https://github.com/ispras/atr4s/releases/download/v1.2/StanfordNLPPreprocessor.scala), which lives in a separate module licensed under the GPL because Stanford CoreNLP itself is GPL-licensed.

### Data

To use some algorithms, download the auxiliary files and place them in the `WORKING_DIRECTORY/data` directory (the working directory can be set in `gradle.properties`; by default it is `experiments`), or specify the path in the corresponding configuration/builder class (e.g. `Word2VecAdapterConfig` of `KeyConceptRelatedness`).

Namely:

- for **LinkProbability** download [info_measure.txt](https://github.com/ispras/atr4s/releases/download/v1.2/info-measure.txt);
- for **Relevance** download [COHA_term_occurrences.txt](https://github.com/ispras/atr4s/releases/download/v1.2/COHA_term_occurrences.txt);
- for **KeyConceptRelatedness** download [w2vConcepts.model](https://github.com/ispras/atr4s/releases/download/v1.2/w2vConcepts.model).
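
Before running these algorithms, it can help to confirm that the files are where the library will look for them. The sketch below is a hypothetical standalone helper, not part of ATR4S: the class `AuxiliaryDataCheck` and its methods are invented for this example. It assumes that each file keeps the name from its download URL, so note that the LinkProbability link text says `info_measure.txt` while the URL says `info-measure.txt`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ATR4S.
class AuxiliaryDataCheck {

    // Algorithm name -> auxiliary file name, following the list above.
    // File names are taken from the download URLs (an assumption).
    static Map<String, String> requiredFiles() {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("LinkProbability", "info-measure.txt");
        files.put("Relevance", "COHA_term_occurrences.txt");
        files.put("KeyConceptRelatedness", "w2vConcepts.model");
        return files;
    }

    // Returns the algorithms whose file is absent from <workingDirectory>/data.
    static List<String> algorithmsMissingData(Path workingDirectory) {
        Path dataDir = workingDirectory.resolve("data");
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, String> entry : requiredFiles().entrySet()) {
            if (!Files.isRegularFile(dataDir.resolve(entry.getValue()))) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "experiments" is the default working directory mentioned above.
        Path workingDirectory = Paths.get(args.length > 0 ? args[0] : "experiments");
        for (String algorithm : algorithmsMissingData(workingDirectory)) {
            System.out.println("Missing auxiliary data for " + algorithm);
        }
    }
}
```

If a path is set through a configuration/builder class instead, this check does not apply to that algorithm.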

The datasets used in the experiments can be downloaded from the [release page](https://github.com/ispras/atr4s/releases/tag/v1.2).

### OS

The PU-ATR algorithm may fail on Windows because of bugs in Spark. These Stack Overflow questions about Spark failing to delete its temp directory may help:
[1](https://stackoverflow.com/questions/41825871/exception-while-deleting-spark-temp-dir-in-windows-7-64-bit),
[2](https://stackoverflow.com/questions/31274170/spark-error-error-utils-exception-while-deleting-spark-temp-dir),
[3](https://stackoverflow.com/questions/43731967/spark-failed-to-delete-temp-directory).

## Linking

The library is published to Maven Central and JCenter.
Add the lines below that match your build system.

### Gradle

```gradle
compile 'ru.ispras:atr4s:1.2.2'
```

### Maven

```xml
<dependency>
    <groupId>ru.ispras</groupId>
    <artifactId>atr4s</artifactId>
    <version>1.2.2</version>
</dependency>
```

### SBT

```scala
libraryDependencies += "ru.ispras" % "atr4s" % "1.2.2"
```

## Building from Sources

Build the library with Gradle:

```shell
./gradlew jar
```

## Usage

### Command line example

```shell
./gradlew recognize -Pdataset=acl2 -PtopCount=10 -Pconfig=CValue.conf -Poutput=cvalueterms.txt
```

This command recognizes the top 10 terms in the text files stored in the `acl2` directory (which should be a subdirectory of `WORKING_DIRECTORY`) using the CValue measure (configured in the `CValue.conf` file) and writes the recognized terms with their weights to `cvalueterms.txt`.

If the input text files are not encoded in UTF-8, specify the correct encoding in the config of `NLPPreprocessor` (or convert the input files; there are many [tools](http://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets) for that).
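
For the conversion route, here is a minimal sketch in plain Java that re-encodes every file in a dataset directory to UTF-8 in place. It uses no ATR4S classes. The class name `ToUtf8`, its methods, and the charset in the usage comment are hypothetical. Run it on a copy of the data, since decoding a file that is already UTF-8 with the wrong charset would garble its non-ASCII characters.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ATR4S.
class ToUtf8 {

    // Decodes the file with sourceCharset and rewrites it in place as UTF-8.
    static void convertFile(Path file, Charset sourceCharset) throws IOException {
        String text = new String(Files.readAllBytes(file), sourceCharset);
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    // Converts every regular file directly inside dir (no recursion)
    // and returns the number of files converted.
    static int convertDirectory(Path dir, Charset sourceCharset) throws IOException {
        int converted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)) {
                    convertFile(file, sourceCharset);
                    converted++;
                }
            }
        }
        return converted;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ToUtf8 <datasetDir> <sourceCharset>
        // e.g.   java ToUtf8 experiments/acl2 windows-1251
        int count = convertDirectory(Paths.get(args[0]), Charset.forName(args[1]));
        System.out.println("Converted " + count + " files");
    }
}
```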

### Program API

See the `ATRConfig` class, which is a configuration/builder for the facade class `AutomaticTermsRecognizer`.
See the `AutomaticTermsRecognizer` object for an example.

### Program API (Java)

Usage from Java does not differ significantly, so see the same classes for examples.
However, since Java does not support parameters with default values, we provide static helper functions named `make()` for most classes that have parameters with default values or parameters with Scala collections; see the example below.

Also note that there is a special method that returns weighted terms as a Java `Iterable`, so you do not need to convert Scala collections to Java ones.

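```java
// Imports of the ATR4S classes used below are omitted.
class ATRExample {
    public static void main(String[] args) {
        String datasetDir = args[0];
        // Command-line arguments are strings, so parse the count explicitly.
        int topCount = Integer.parseInt(args[1]);
        // The make() helpers stand in for Scala's default parameter values.
        ATRConfig atrConfig = new ATRConfig(EmoryNLPPreprocessorConfig.make(),
                TCCConfig.make(),
                new OneFeatureTCWeighterConfig(Weirdness.make()));
        Iterable<WeightedTerm> terms = atrConfig.build().recognizeAsJavaIterable(datasetDir, topCount);
        for (WeightedTerm termAndWeight : terms) {
            System.out.println(termAndWeight);
        }
    }
}
```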

## License

Apache License Version 2.0.