nltk-maxent-pos-tagger
======================

`nltk-maxent-pos-tagger` is a part-of-speech (POS) tagger for
[NLTK](http://nltk.org/ "Python's Natural Language Toolkit") based on Maximum
Entropy (ME) principles. It is built on NLTK's Maximum Entropy classifier
(`nltk.classify.maxent.MaxentClassifier`), which uses
[MEGAM](http://hal3.name/megam "Hal Daume's MEGA Model Optimization Package")
for the number crunching.
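
To get a feel for the underlying NLTK classifier interface on its own, here is
a minimal sketch with made-up toy features (not the tagger's real feature set);
the `algorithm='megam'` argument assumes the MEGAM binary is installed and
findable by NLTK (see Installation below):

    # Sketch only: toy feature dicts, not the tagger's actual features.
    from nltk.classify.maxent import MaxentClassifier

    train_toks = [
        ({'word': 'the', 'prev_tag': '<START>'}, 'DT'),
        ({'word': 'dog', 'prev_tag': 'DT'}, 'NN'),
        ({'word': 'barks', 'prev_tag': 'NN'}, 'VBZ'),
    ]

    # algorithm='megam' hands the optimisation to the external MEGAM binary.
    classifier = MaxentClassifier.train(train_toks, algorithm='megam', trace=0)
    print(classifier.classify({'word': 'dog', 'prev_tag': 'DT'}))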

Part-of-Speech Tagging
----------------------

`nltk-maxent-pos-tagger` uses the feature set proposed by
[Ratnaparkhi (1996)](http://www.aclweb.org/anthology-new/W/W96/W96-0213.pdf "A Maximum Entropy Model for Part-of-Speech Tagging"),
which is also used in his Java implementation, [MXPOST](ftp://ftp.cis.upenn.edu/pub/adwait/jmx/).
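
As a rough illustration only (the actual feature extraction lives in
`mxpost.py` and may differ in detail), Ratnaparkhi-style contextual features
for a token typically combine the word itself, prefixes and suffixes for rare
words, flags for digits, uppercase letters and hyphens, the surrounding words,
and the one or two previously assigned tags:

    # Illustrative sketch of Ratnaparkhi (1996)-style features; the real
    # feature extraction in mxpost.py may differ in its details.
    def ratnaparkhi_features(sentence, i, history):
        """sentence: list of words; i: index of the current token;
        history: tags already assigned to sentence[:i]."""
        word = sentence[i]
        features = {
            'word': word,
            'prev_tag': history[i - 1] if i > 0 else '<START>',
            'prev_two_tags': '+'.join(history[i - 2:i]) if i > 1 else '<START>',
            'prev_word': sentence[i - 1] if i > 0 else '<START>',
            'next_word': sentence[i + 1] if i + 1 < len(sentence) else '<END>',
            'contains_number': any(c.isdigit() for c in word),
            'contains_uppercase': any(c.isupper() for c in word),
            'contains_hyphen': '-' in word,
        }
        # Prefix/suffix features (up to length 4) are used for rare words.
        for n in range(1, 5):
            features['prefix-%d' % n] = word[:n]
            features['suffix-%d' % n] = word[-n:]
        return features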

Installation
------------

1. Install Python and NLTK.

NLTK offers a number of data sets, which you can download and install from
within a Python shell:

    import nltk
    nltk.download()

Download at least `brown` or `treebank`, as nltk-maxent-pos-tagger uses them
for its `demo()` function (a scripted alternative is sketched after these
installation steps).

2. (Mac) Install MEGAM.

On Mac, it is easy to install MEGAM using brew:

    brew tap homebrew/science
    brew install megam
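
If you prefer to script the setup, something like the following sketch should
work: `nltk.download()` with an explicit package name fetches a single corpus
non-interactively, and NLTK's `config_megam()` tells it where the `megam`
executable lives (the path below is only an example for a Homebrew install):

    import nltk

    # Fetch just the corpora that demo() needs, without the GUI.
    nltk.download('brown')
    nltk.download('treebank')

    # Point NLTK at the MEGAM binary; with no argument it searches the PATH.
    # The explicit path below is only an example.
    from nltk.classify import megam
    megam.config_megam('/usr/local/bin/megam')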

Usage
-----

Have a look at the example given in the `demo()` function in `mxpost.py`.
Basically, you import the tagger, train it on labelled data, and then use it
to tag new sentences:

    import mxpost

    # Train on a list of tagged sentences, where each sentence is a list
    # of (word, tag) tuples.
    maxent_tagger = mxpost.MaxentPosTagger()
    maxent_tagger.train(tagged_training_sentences)

    # Tag new sentences, given as lists of plain word strings.
    for sentence in unlabeled_sentences:
        maxent_tagger.tag(sentence)
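
For a self-contained example, the sketch below trains on a small slice of the
Brown corpus (assuming `brown` is installed; a few hundred sentences is only
enough for a quick smoke test) and tags a fresh sentence:

    import mxpost
    from nltk.corpus import brown

    # A small slice keeps training fast; use more sentences for real accuracy.
    tagged_training_sentences = brown.tagged_sents(categories='news')[:500]

    maxent_tagger = mxpost.MaxentPosTagger()
    maxent_tagger.train(tagged_training_sentences)

    # tag() takes a list of word strings and returns (word, tag) pairs.
    print(maxent_tagger.tag(['The', 'dog', 'barks', '.']))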

Meta
----

* Status: Beta. I wrote this in 2008 as a semester project for a class on NLP tools.
* Licence: GPL Version 3
* Original Author: Arne Neumann
* Contributors: Arne Neumann, Andrew Drozdov

TODO
----

1. *speed / memory consumption*
As you might expect, a Python implementation is much slower and consumes
much more RAM than similar tools written in Java or C/C++ (MXPOST,
acopost, C&C etc.). That said, most of the time isn't spent in
Python but in MEGAM (which is written in OCaml and therefore
shouldn't have such issues). NLTK is currently only able to encode POS
features explicitly when converting data for MEGAM; according to the MEGAM
website, implicit feature encoding should be much faster.

2. *accuracy*
I trained several taggers on the WSJ corpus (90% training / 10% test data).
nltk-maxent-pos-tagger achieved an accuracy of 93.64% (100 iterations, rare
feature cutoff = 5), while MXPOST reached 96.93% (100 iterations). Since
both implementations use the same feature set, the results shouldn't differ
that much. Unfortunately, the source code of `MXPOST` is not available,
but comparing `nltk-maxent-pos-tagger` with OpenNLP's implementation should
help narrow down the difference. A rough sketch of the 90/10 evaluation
setup is given below.
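
The following sketch reproduces the 90/10 evaluation setup, assuming the small
`treebank` sample that ships with NLTK (the full WSJ corpus is not freely
available) and the `MaxentPosTagger` interface shown in the Usage section;
accuracy is computed by hand rather than via any evaluation helper:

    import mxpost
    from nltk.corpus import treebank

    # NLTK only ships a small sample of the Penn Treebank, so the numbers
    # will not match the WSJ results reported above.
    sents = treebank.tagged_sents()
    split = int(len(sents) * 0.9)
    train_sents, test_sents = sents[:split], sents[split:]

    tagger = mxpost.MaxentPosTagger()
    tagger.train(train_sents)

    # Token-level accuracy, computed without assuming any particular
    # evaluation method on the tagger class.
    correct = total = 0
    for gold in test_sents:
        words = [word for word, tag in gold]
        predicted = tagger.tag(words)
        for (word, gold_tag), (_, pred_tag) in zip(gold, predicted):
            if gold_tag == pred_tag:
                correct += 1
            total += 1

    print('accuracy: %.2f%%' % (100.0 * correct / total))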