Transition-based statistical parser
https://github.com/syllog1sm/redshift


Redshift
========

**This library is research code, and is in maintenance mode.**

**For my actively developed, commercially-focussed NLP library, see http://honnibal.github.io/spaCy/**

Redshift is a natural-language syntactic dependency parser. The current release features fast and accurate parsing,
but requires the text to be pre-processed. Future releases will integrate tokenisation and part-of-speech tagging,
and have special features for parsing informal text.

If you don't know what a syntactic dependency is, read this:
http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html

Main features:

* Fast linear-time parsing: the slowest model is still over 100 sentences/second
* State-of-the-art accuracy: 93.5% UAS on English (Stanford scheme, WSJ 23)
* Super fast "greedy" mode: over 1,000 sentences per second at 91.5% accuracy
* Native Python interface (the parser is written in Cython)
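
For readers unfamiliar with the metric, UAS (unlabelled attachment score) is the proportion of tokens whose predicted head matches the gold-standard head. The sketch below only illustrates the definition; it is not the code in scripts/evaluate.py. The function name and the list-of-head-indices representation are assumptions made for the example, and it does not exclude punctuation or anything else from the count.

::

    def unlabelled_attachment_score(gold_heads, predicted_heads):
        """Return the fraction of tokens whose predicted head equals the gold head.

        Both arguments are lists of sentences; each sentence is a list holding
        the head index of every token (0 stands for the artificial root).
        """
        correct = 0
        total = 0
        for gold, predicted in zip(gold_heads, predicted_heads):
            if len(gold) != len(predicted):
                raise ValueError("sentences must have the same number of tokens")
            for g, p in zip(gold, predicted):
                correct += int(g == p)
                total += 1
        return float(correct) / total if total else 0.0

    # Hypothetical data: two sentences, one wrong head in the first.
    gold = [[2, 0, 2], [2, 0]]
    pred = [[2, 0, 1], [2, 0]]
    score = unlabelled_attachment_score(gold, pred)
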

Key techniques:

* Arc-eager transition-based dependency parser
* Averaged perceptron for learning
* redshift.parser.BeamParser is basically the model of Zhang and Nivre (2011)
* redshift.parser.GreedyParser adds the non-monotonic model of Honnibal et al. (2013) to the dynamic oracle model of Goldberg and Nivre (2012)
* redshift.features includes the standard Zhang and Nivre (2011) feature set, and also some work pending publication.
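
As a rough illustration of the arc-eager transition system named above, here is a minimal textbook-style sketch in plain Python. It is not redshift's Cython implementation: the function name, the transition constants and the convention of putting the root at index 0 are all assumptions made for the example, and the transition sequence is supplied by hand rather than predicted by a model. Each word is pushed at most once and popped at most once, which is where the linear run-time comes from.

::

    SHIFT, LEFT, RIGHT, REDUCE = "SHIFT", "LEFT", "RIGHT", "REDUCE"

    def apply_transitions(n_words, transitions):
        """Run arc-eager transitions over tokens 1..n_words; 0 is the root.

        Returns a dict mapping each attached token to its head.
        """
        stack = [0]
        buffer = list(range(1, n_words + 1))
        heads = {}
        for move in transitions:
            if move == SHIFT:
                stack.append(buffer.pop(0))
            elif move == LEFT:
                # Top of the stack becomes a dependent of the first buffer word.
                dependent = stack.pop()
                assert dependent != 0 and dependent not in heads
                heads[dependent] = buffer[0]
            elif move == RIGHT:
                # First buffer word becomes a dependent of the top of the stack,
                # and is pushed so it can take its own dependents later.
                heads[buffer[0]] = stack[-1]
                stack.append(buffer.pop(0))
            elif move == REDUCE:
                # Only a word that already has a head may be popped.
                assert stack[-1] in heads
                stack.pop()
            else:
                raise ValueError("unknown transition: %r" % (move,))
        return heads

    # "She eats fish": "eats" heads both "She" and "fish", and attaches to the root.
    heads = apply_transitions(3, [SHIFT, LEFT, RIGHT, RIGHT])
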

Example usage
-------------

Here is an example of how the parser is called from Python, once you have a model trained:

::

    >>> import redshift.parser
    >>> from redshift.sentence import Input
    >>> parser = redshift.parser.Parser()
    >>> sentence = Input.from_untagged(['A', 'list', 'of', 'tokens', 'is', 'required', '.'])
    >>> parser.parse(sentence)
    >>> print sentence.to_conll()

The command-line interfaces have a lot of probably-confusing options for my current research. The main scripts I use are
scripts/train.py, scripts/parse.py, and scripts/evaluate.py. All print usage information, and require the plac library.

From a Unix/OSX terminal, after compilation, and within the "redshift" directory:

::

    $ export PYTHONPATH=`pwd`
    $ ./scripts/train.py # Use -h or --help for more detailed info. Most of these are research flags.
    usage: train.py [-h] [-a static] [-i 15] [-k 1] [-f 10] [-r] [-d] [-u] [-n 0] [-s 0] train_loc model_loc
    train.py: error: too few arguments
    $ ./scripts/train.py -k 16 train_loc model_loc
    $ ./scripts/parse.py
    $ ./scripts/evaluate.py output_dir/parses

In more detail:

* Ensure your PYTHONPATH variable includes the redshift directory
* Most of the training-script flags refer to research settings.
* The k parameter controls the speed-accuracy trade-off via the beam width. Run time is roughly O(nk), where n is the number of words and k is the beam width. In practice it's slightly sub-linear in k due to some simple memoisation. Accuracy plateaus at about k=64. For k=1, use "-a dyn -r -d" to enable some recent special-case wizardry that gives the k=1 case over 1% extra accuracy, at no run-time cost.
* parse.py reads in the training configuration from "parser.cfg", which sits in the output model directory.
* The parser currently expects one sentence per line, space-separated tokens, tokens of the form word/POS.
* evaluate.py runs as a separate script from parse.py so that the parser never sees the answers, and cannot "accidentally cheat".
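
To make the expected input format concrete, here is a small sketch of how a line of word/POS tokens could be split into pairs. It is not taken from the redshift source: the function name is hypothetical, and splitting on the last slash (so that a word such as "1/2" keeps its own slash) is an assumption about how ambiguous tokens should be handled.

::

    def read_tagged_line(line):
        """Split one 'word/POS word/POS ...' line into (word, tag) pairs."""
        pairs = []
        for token in line.split():
            # Split on the last slash so words that contain '/' keep it.
            word, _, tag = token.rpartition("/")
            if not word or not tag:
                raise ValueError("expected word/POS, got %r" % (token,))
            pairs.append((word, tag))
        return pairs

    # One sentence per line, space-separated tokens.
    pairs = read_tagged_line("The/DT parser/NN is/VBZ fast/JJ ./.")
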

Installation
------------

The following commands will set up a virtualenv with Python 2.7.5, the parser, and its core dependencies from scratch::

    $ git clone https://github.com/syllog1sm/redshift.git
    $ cd redshift
    $ git checkout develop

    # EITHER (a):
    $ virtualenv .env
    # OR (b):
    $ ./make_virtualenv.sh # Downloads Python 2.7.5 and virtualenv

    $ source .env/bin/activate
    $ pip install distribute
    $ pip install cython
    $ pip install thinc
    $ pip install -r requirements.txt
    $ export PYTHONPATH=`pwd`:$PYTHONPATH # ...and set PYTHONPATH.
    $ fab make test

The make_virtualenv.sh script downloads and compiles Python 2.7.5, and uses it to create a virtualenv. This is one way to use a version of Python that isn't system-wide, or to control the compiler that Cython will use. You may not need to do this, or you may wish to do it manually; it's up to you.

virtualenv is not a requirement, although it's useful. If a virtualenv is not active (i.e. if the $VIRTUAL_ENV
environment variable is not set), you'll need to ensure that the setup.py file knows where to find the C headers that the murmurhash dependency installs.

Installation requires a recent version of pip, which is provided by the version of virtualenv that the make_virtualenv.sh script downloads. If you don't use the make_virtualenv.sh script, ensure you're using a recent version of pip.

Cython
------

redshift is written almost entirely in Cython, a superset of the Python language that additionally supports
calling C/C++ functions and declaring C/C++ types on variables and class attributes. This allows the compiler to
generate very efficient C/C++ code from Cython code. Many popular Python packages, such as numpy, scipy and lxml,
rely heavily on Cython code.

A Cython source file such as redshift/parser.pyx is compiled into redshift/parser.cpp and redshift/parser.so by
the project's setup.py file. The module can then be imported by standard Python code, although only the functions
callable from Python (declared with "def" or "cpdef", rather than "cdef") will be accessible.

The parser currently has Cython as a requirement. Cython's documentation recommends distributing the generated
.cpp files as part of a release so that users don't need Cython installed, but this release doesn't do that. This could
change in future, but currently it feels strange to have a "source" release that users wouldn't be able to modify.

LICENSE
---------------

This software is available for non-commercial use only. You may download, run and modify the code for research purposes,
personal interest, education, teaching, etc. My commercial NLP suite is spaCy: http://spacy.io.

::

    Copyright (C) 2014 Matthew Honnibal