Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mjpost/extract-spfeatures

Extracts the Charniak & Johnson (ACL 2005) feature set from PTB-style trees.
https://github.com/mjpost/extract-spfeatures

Last synced: about 2 months ago
JSON representation

Extracts the Charniak & Johnson (ACL 2005) feature set from PTB-style trees.

Awesome Lists containing this project

README

        

Matt Post
September 2011
--

NOTE (2012-02-28): A rather serious bug was found that removed the
final digit from any feature ID ending in 0 and having a count >=2
for a particular parse tree.

--

This package is a wrapper around Eugene Charniak and Mark Johnson's
parse tree feature extractor, described in their 2005 paper:

@inproceedings{charniak2005coarse,
Title = {Coarse-to-fine n-best parsing and {MaxEnt} discriminative
Author = {Charniak, Eugene and Johnson, Mark},
Address = {Ann Arbor, Michigan, USA},
Booktitle = {Proc.\ ACL},
Month = {June},
Year = {2005}
}

The purpose of the wrapper is to allow features to be extracted over
single parse tree instances. As written, the feature extractor
distributed with the code published by Mark Johnson (and available at
http://web.science.mq.edu.au/~mjohnson/code/reranking-parser-Nov2009.tgz)
is highly suited for their particular purpose of discriminative n-best
parse tree reranking, requiring corresponding gold standard parse
trees and outputting information in a format used by their MaxEnt
training code.

In addition to the wrapper script, this package includes the
a modified version of the "extract-spfeatures" program with the
following changes:

- Pseudo-constant features (ones that are the same for all items on an
n-best list) are not output in the C&J code, because they have no
discriminative ability. My changes disable this disabling.

- It defins a "local" feature set, which is the set of features that
can be computed over single hyper-edges (including their head and
tail nodes).

- The minimum feature count threshold has been changed from 5 to 1.
This is done so that you'll see features even if you pass it a
single parse tree. If you plan to compute features over a whole set
of parse trees, you should increase this setting to 5 (--min 5).

See the INSTALL file for details on compiling (hint: type "make").

== USAGE =============================================================

To run the program, you need a file or stream of parse trees, one per
line. Then invoke the wrapper as follows:

$ cat parse_trees | extract.pl [options] > features 2> feature_map

This writes the features extracted for each parse tree to the file
"features" and the mapping between feature id and feature name to the
file "feature_map". The script can also map the feature IDs back to
the feature names they were computed from, but be warned that this
generates an enormous amount of text:

$ cat parse_trees | extract.pl --undo-map [options] > features

Parse trees should follow standard bracketing format, with a
root-level label and all no mixing of terminals and preterminals
(i.e., all terminals must be immediately governed by a preterminal).
For example,

(TOP (S (NP (DT The) (NN boy)) (VP (VBD was)) (. .)))

The command-line arguments are:

--min N

This changes the rule count threshold, i.e., the minumum number of
times a feature must have been observed to be output. The default
is 1 so that passing a single parse tree to the wrapper will still
output something. If you are extracting features over a large
feature set, you should change this to 5, which is C&J's default.

--local

This computes only features that can be computed over a single
hyperedge. It disables the following features and feature
classes:

RightBranch, Heavy, CoPar, WProj, Rule(0,1), NGram*, NGramTree*,
HeadTree*, Heads*, Edges*

--undo-map

By default, C&J's 'extract-spfeatures' program maps the (nicely)
verbose feature names to integers. This wrapper script then
outputs that map to STDERR. If you'd like to have the original
feature names (with spaces replaced with underscores), this option
will do that.

--root-label LABEL

If you pass the script trees that have no root label, e.g.,

( (S (NP (DT The) (NN boy)) (VP (VBD was)) (. .)) )

(output, for one, by the Berkeley parser), the root will be
renamed to "TOP" by default, since label-less roots cause the
extract-spfeatures binary to crash. You can change the default
label with this flag. Note that this will not change the root
label of already-labeled roots.

== FAQ ===============================================================

Q. Why do I get so many features?

A. The minimum feature count threshold (which determines how many
separate parse trees must exhibit a feature before that feature is
output for any of them) has been changed to 1 from Charniak and
Johnson's default of 5. You can change this by passing "--min 5"
to the wrapper script.

Q. Why does the script crash with this error?

$ cat parse_trees | ./extract.pl
readline() on closed filehandle FEATURES at ./extract.pl line 131.

A. There was a problem calling the extract-spfeatures binary; either
it doesn't exist or isn't executable on your system. Recompile it
with:

make clean
make

== Acknowledgments ==============================================

Thanks to the following people for committing bug fixes and other
feedback:

- Ritwik Banerjee