post2011judging
https://github.com/mjpost/post2011judging

Data used in my 2011 ACL paper, "Judging Grammaticality with Tree Substitution Grammar Derivations".

README
Matt Post
January 30, 2012

--

This document describes how to repeat the experiments described in my
2011 paper:

    @inproceedings{post2011judging,
      Address = {Portland, Oregon, USA},
      Author = {Post, Matt},
      Booktitle = ACL2011,
      Month = {June},
      Title = {Judging Grammaticality with Tree Substitution Grammar Derivations},
      Year = {2011},
      Url = {www.aclweb.org/anthology/P/P11/P11-2038.pdf}
    }

It includes data and code used to extract TSG derivations and the
Charniak & Johnson (2005) feature set, plus the environment used to
evaluate arbitrary feature sets in a simple, extendable way. Due to
LDC licensing restrictions, it does not include the data splits that
we used for our experiments. If you wish to have those splits and
have the appropriate LDC license, please email me, and I'll send them
to you.

1. Download my code for building TSGs, which can be found on Github.
Note that you do not need to build your own TSG since this
repository includes the TSG I used in my experiments, but that code
contains a number of support scripts that you will need here.

    git clone [email protected]:mjpost/dptsg.git

Then set the environment variable DPTSG to point to that
directory. In bash:

    export DPTSG=$(pwd)/dptsg

Next, download my modifications to Mark Johnson's code for CKY
parsing.

    git clone [email protected]:mjpost/cky.git

This code includes modifications I added to enable parsing
flattened versions of TSGs, to work with our black-box
parallelizer, and to incorporate some convenient command-line
options.

2. Edit the file builddir.sh. At the top, there are two environment
variables you need to define: (1) DPTSG (as above), and (2) "basedir",
which should point to the directory containing this README file.

    export basedir=$(pwd)

3. Compile Mark Johnson's CKY code. My version of this code contains
some modifications that enable it to parse TSG grammars.

    make -C cky/
4. To compute TSG features over a corpus, you need to parse the corpus
with the TSG grammar and then extract the TSG features from the
resulting derivations. This requires a number of pre- and
post-processing steps which convert unknown words in the corpus,
flatten the TSG, parse with it, and expand it afterwards.

All of this functionality is contained in the "builddir.sh"
script. To run that script, you simply point it at a directory
which contains a single file named "words". This file contains the
sentences of the corpus, one per line.

    bash builddir.sh DIR

Alternately, you can pass the directory as an environment variable
(which makes it amenable to qsub), e.g.,

    qsub -v dir=DIR builddir.sh

As mentioned, in the directory DIR, builddir.sh expects to find a
file named "words", which contains the sentences to parse and
process, one per line. It will then

- preprocess the file to mark and convert OOVs
- parse with the grammar
- restore the TSG fragments from the flattened versions the Johnson
  parser produces

Note that the script I've provided does sequential parsing of
sentences with at most 100 words. Mark Johnson's CKY parser is
exhaustive, which makes it somewhat slow. If you want to
parallelize the parsing, you can use the included black-box
parallelizer (written by Adam Lopez). You can enable this by
uncommenting the appropriate line in builddir.sh and commenting
out the sequential version. You will also have to edit
environment/LocalConfig.pm to add your environment, which describes
how to call qsub. If you want to use this, compile it by typing

    make -C parallelize/
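
To recap step 4 as a concrete end-to-end sketch (the directory name
"mycorpus" and the file "sentences.txt" are illustrative stand-ins,
not part of the release):

    # builddir.sh only requires DIR/words, one sentence per line
    mkdir -p mycorpus
    cp sentences.txt mycorpus/words
    bash builddir.sh mycorpus    # or: qsub -v dir=mycorpus builddir.sh
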
5. When builddir.sh is done, the directory you passed it will contain
a number of files containing different feature sets. These files
are all parallel to "words", so that, for example, line 17 of each
file will correspond to the features extracted for sentence 17.
With respect to TSGs, the feature file you care about is "rules",
which contains counts of the TSG fragments used in the Viterbi
derivation of each sentence. The format of this file is

    fragment:count fragment:count ...

where "fragment" is a TSG fragment (collapsed to remove colons and
spaces) and "count" is the number of times it was seen.
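
Purely for illustration (the exact collapsed encoding of fragments is
not specified here, so this line is hypothetical), a line of "rules"
might look like

    (S(NP)(VP)):1 (NP(DT_the)(NN)):2
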
This facilitates conversion for toolkits such as SVM-light.

6. My classification environment relies on six data sets: positive and
negative training, development, and test data. As described in the
paper, training proceeds on the training data. Dev is used to tune
the regularization parameter, and the best model is then used to
score the test set.

The training and evaluation script is eval.sh. It assumes the
existence of the following six directories that correspond to the
six data sets just described:

    train/good
    train/bad
    dev/good
    dev/bad
    test/good
    test/bad

The script is called with

    ./eval.sh FEATURE1 FEATURE2 FEATURE3 ...
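
A quick way to create that directory layout (a sketch using bash
brace expansion; populating the directories is described in step 7):

    mkdir -p {train,dev,test}/{good,bad}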

eval.sh then searches for a file named FEATURE1 in *each* of the six
directories. Each of these files contains a single training or
testing example, with any number of feature:value pairs on each
line, corresponding to the features extracted for that sentence or
training instance. For example, the following invocation

    ./eval.sh sentlens rules

would look for the files {train,dev,test}/{good,bad}/{sentlens,rules}.
The sentlens file would have something like

    sentlen:34
    sentlen:25
    ...

and the "rules" file is as described above. The eval.sh script
constructs files usable by liblinear in a directory named
"run.FEATURE1+FEATURE2+...".

The main thing to note in adding your own features is that each
file must contain feature:value pairs, and the feature names should
be globally unique.

7. The builddir.sh script described above can be used to easily
produce feature sets. Just create the six directories, and within
each, create a file "words" that contains the sentences. Then call:

    ./builddir.sh train/good
    ./builddir.sh train/bad
    ./builddir.sh dev/good
    ./builddir.sh dev/bad
    ./builddir.sh test/good
    ./builddir.sh test/bad
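
Equivalently, the six calls can be written as a loop (a bash sketch,
not part of the original release):

    for d in {train,dev,test}/{good,bad}; do
        ./builddir.sh $d
    done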

8. Download liblinear from
http://www.csie.ntu.edu.tw/~cjlin/liblinear/ . Then edit the
variables "train" and "predict" at the top of eval.sh to point to
the liblinear binaries.
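
For example, if eval.sh defines them as ordinary shell variables, the
edit might look like this (the liblinear location is an assumption
about where you unpacked and built it, next to this README):

    # at the top of eval.sh
    train=$basedir/liblinear/train
    predict=$basedir/liblinear/predict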

--

If you have any questions, please feel free to email me and ask.