https://github.com/neuw84/cvalue-termextraction

A free implementation of the C-Value algorithm based on this paper http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/ijodl2000.pdf

cvalue java keyword-extraction nlp pos-tagger


Host: GitHub
URL: https://github.com/neuw84/cvalue-termextraction
Owner: Neuw84
License: bsd-3-clause
Created: 2013-12-12T19:26:18.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2018-02-08T09:25:12.000Z (over 7 years ago)
Last Synced: 2025-04-02T16:52:17.758Z (7 months ago)
Topics: cvalue, java, keyword-extraction, nlp, pos-tagger
Language: Java
Size: 32.2 KB
Stars: 9
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


CValue-TermExtraction
=====================

A free Java 8 implementation of the C-Value algorithm based on this paper:

http://personalpages.manchester.ac.uk/staff/sophia.ananiadou/ijodl2000.pdf

C-Value is a keyphrase extraction algorithm that achieves good results without any training by combining statistical measures with POS (part-of-speech) information.
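To make the statistical part concrete: the paper scores a candidate string a as log2|a| · f(a) when a is not nested in any longer candidate, and as log2|a| · (f(a) − (1/P(Ta)) · Σ f(b)) otherwise, where |a| is the length in words, f is the frequency, Ta is the set of longer candidates containing a, and P(Ta) is their count. The following is a minimal sketch of that formula only. The class and method names (`CValueSketch`, `cValues`) are hypothetical and are not part of this library's API, and the sketch sums raw frequencies rather than reproducing the paper's incremental bookkeeping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration class; not part of cvalue-termextraction.
final class CValueSketch {

    /** Base-2 logarithm, used for the length factor log2|a|. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /**
     * Scores every candidate term.
     * freq maps a candidate (words separated by single spaces) to its frequency.
     */
    static Map<String, Double> cValues(Map<String, Integer> freq) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : freq.entrySet()) {
            String a = entry.getKey();
            int lengthInWords = a.split(" ").length;

            // Find the longer candidates that contain 'a' as a whole-word sequence.
            int containingTerms = 0;   // P(Ta)
            int containingFreq = 0;    // sum of f(b) for b in Ta
            for (Map.Entry<String, Integer> other : freq.entrySet()) {
                String b = other.getKey();
                if (!b.equals(a) && (" " + b + " ").contains(" " + a + " ")) {
                    containingTerms++;
                    containingFreq += other.getValue();
                }
            }

            double f = entry.getValue();
            double score = containingTerms == 0
                    ? log2(lengthInWords) * f
                    : log2(lengthInWords) * (f - (double) containingFreq / containingTerms);
            scores.put(a, score);
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("basal cell carcinoma", 10);
        freq.put("cystic basal cell carcinoma", 4);
        freq.put("ulcerated basal cell carcinoma", 2);
        cValues(freq).forEach((term, score) -> System.out.println(term + "\t" + score));
    }
}
```

Note that single-word candidates always score zero under this formula, because log2(1) = 0; the measure is aimed at multi-word terms.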

It supports English, using the Penn Treebank POS tag set, and Spanish, using the EAGLES tag set.

This implementation requires a POS tagger in order to work. For example, the Illinois POS Tagger can be used for English:

http://cogcomp.cs.illinois.edu/page/software_view/POS

For Spanish it has been tested using Freeling. 

http://nlp.lsi.upc.edu/freeling/

The implementation has been tested with English. For documents that contain a lot of noise (such as text extracted via OCR) there are some fixes, but the English filters should be changed to use regular expressions, as the Spanish one does, in order to avoid problems with the Java stack (although 26,000 papers have been processed with the current implementation).
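One way a regular-expression filter for English could look is sketched below. It encodes each Penn Treebank tag as a single letter and matches the paper's (Adj|Noun)+ Noun pattern against the encoded sentence, so no recursion over the token list is needed. This is an assumption about how such a filter might be written, not the code used by this project or by its Spanish filter; `RegexFilterSketch`, `code` and `candidates` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of cvalue-termextraction.
final class RegexFilterSketch {

    // (Adj|Noun)+ Noun over one-letter tag codes: A = adjective, N = noun.
    private static final Pattern ADJ_NOUN = Pattern.compile("[AN]+N");

    /** Maps a Penn Treebank tag to a one-letter code; 'O' stands for any other tag. */
    static char code(String pennTag) {
        if (pennTag.startsWith("NN")) {
            return 'N'; // NN, NNS, NNP, NNPS
        }
        if (pennTag.startsWith("JJ")) {
            return 'A'; // JJ, JJR, JJS
        }
        return 'O';
    }

    /** Returns the non-overlapping word sequences whose tags match the pattern. */
    static List<String> candidates(String[] words, String[] tags) {
        StringBuilder encoded = new StringBuilder();
        for (String tag : tags) {
            encoded.append(code(tag));
        }
        List<String> found = new ArrayList<>();
        Matcher matcher = ADJ_NOUN.matcher(encoded);
        while (matcher.find()) {
            // Each tag is one character, so match offsets are token indices.
            String[] span = Arrays.copyOfRange(words, matcher.start(), matcher.end());
            found.add(String.join(" ", span));
        }
        return found;
    }

    public static void main(String[] args) {
        String[] words = {"the", "soft", "contact", "lens", "is", "clean"};
        String[] tags = {"DT", "JJ", "NN", "NN", "VBZ", "JJ"};
        System.out.println(candidates(words, tags));
    }
}
```

The sketch only returns the longest match in each run of adjectives and nouns; nested sub-sequences, which C-Value also needs to count, would have to be enumerated separately.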

License: GPL V2

TODO: 

     - Change the English filters to use regular expressions instead of the recursive approach.

     - Implement the NC-Value measure (will require a corpus)

Below is an example parser for English that provides the required data (using the Illinois POS Tagger):

```java
    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    import LBJ2.nlp.SentenceSplitter;
    import LBJ2.nlp.WordSplitter;
    import LBJ2.nlp.seg.PlainToTokenParser;
    import LBJ2.parse.Parser;
    import edu.illinois.cs.cogcomp.lbj.chunk.Chunker;
    import edu.illinois.cs.cogcomp.lbj.pos.POSTagger;
    import edu.ehu.galan.cvalue.model.Token;

    ......

    // One token list per sentence, plus the plain text of each sentence.
    List<LinkedList<Token>> tokenizedSentenceList = new ArrayList<>();
    List<String> sentenceList = new ArrayList<>();
    POSTagger tagger = new POSTagger();
    Chunker chunker = new Chunker();
    boolean first = true; // true when the next word starts a new sentence
    // pFile is the path of the plain-text file to process.
    Parser parser = new PlainToTokenParser(new WordSplitter(new SentenceSplitter(pFile)));
    String sentence = "";
    LinkedList<Token> tokenList = null;
    for (LBJ2.nlp.seg.Token word = (LBJ2.nlp.seg.Token) parser.next(); word != null;
            word = (LBJ2.nlp.seg.Token) parser.next()) {
        String chunked = chunker.discreteValue(word);
        // Run the tagger; the tag is read from word.partOfSpeech below.
        tagger.discreteValue(word);
        if (first) {
            // Start a new token list for this sentence.
            tokenList = new LinkedList<>();
            tokenizedSentenceList.add(tokenList);
            first = false;
        }
        tokenList.add(new Token(word.form, word.partOfSpeech, null, chunked));
        sentence = sentence + " " + (word.form);
        if (word.next == null) {
            // Last word of the sentence: store the text and reset the state.
            sentenceList.add(sentence);
            first = true;
            sentence = "";
        }
    }
    parser.reset();
```

The C-Value can then be computed:

```java
    // full_path and name identify the source document.
    Document doc = new Document(full_path, name);
    // Pass in the lists built by the parser example.
    doc.setSentenceList(sentenceList);
    doc.setTokenList(tokenizedSentenceList);
    CValueAlgortithm cvalue = new CValueAlgortithm();
    cvalue.init(doc); // initializes the algorithm for the given document
    cvalue.addNewProcessingFilter(use_one_of_the_provides); // for example the AdjNounFilter
    cvalue.runAlgorithm(); // runs the C-Value algorithm with the registered filters
    doc.getTermList(); // get the results
```
