https://github.com/dcavar/treebankparser
Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars
https://github.com/dcavar/treebankparser
bnf bnfc context-free-grammar lexical-functional-grammar parser penn-treebank probabilistic-context-free-grammar syntax treebank
Last synced: 4 months ago
JSON representation
Parser for treebanks based on Penn Treebank type of encoding that generates Probabilistic Context Free Grammars
- Host: GitHub
- URL: https://github.com/dcavar/treebankparser
- Owner: dcavar
- License: apache-2.0
- Created: 2018-10-15T13:21:21.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2018-10-17T23:31:14.000Z (almost 7 years ago)
- Last Synced: 2024-11-07T17:36:26.808Z (11 months ago)
- Topics: bnf, bnfc, context-free-grammar, lexical-functional-grammar, parser, penn-treebank, probabilistic-context-free-grammar, syntax, treebank
- Language: C
- Homepage: http://damir.cavar.me/
- Size: 186 KB
- Stars: 3
- Watchers: 3
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# TreebankParser
(C) 2016-2018 by [Damir Cavar] <[dcavar@iu.edu](mailto:dcavar@iu.edu)>
This code and the binaries are made available under the
[Apache License, Version 2.0, January 2004](http://www.apache.org/licenses/). For details see the included
*LICENSE.txt* file.This is a tool that reads treebank files and generates a probabilistic grammar for use in [FLE].
Currently it can generate all Context-free Grammar rules from a treebank in the Penn-treebank format.
Take for example the *test1.txt* file in the current source repository. You can run treebankparser to generate a frequency profile of the rules:
./treebankparser -y S test1.txt
The *-y S* parameter generates an S-symbol for empty root nodes, as in *test1.txt*. The default is to generate *ROOT* as the label for such root nodes.
The out put should look like this:
1 ADJP --> JJ
1 IP-HLN --> VP
1 JJ --> 重要
1 NN --> 企业
1 NN --> 增长点
1 NN --> 外商
1 NN --> 外贸
1 NN --> 投资
2 NP --> NN
1 NP --> NP
1 NP-OBJ --> NP
1 NP-PN --> NR
1 NP-SBJ --> NN NN NN
1 NR --> 中国
1 S --> IP-HLN
1 VP --> NP-OBJ
1 VV --> 成为The probability is tab-delimited from the rule. It can also be generated as a float using the *-r* parameter:
./treebankparser -r -y S test1.txt > res.log
The output should look like:
0.0555556 ADJP --> JJ
0.0555556 IP-HLN --> VP
0.0555556 JJ --> 重要
0.0555556 NN --> 企业
0.0555556 NN --> 增长点
0.0555556 NN --> 外商
0.0555556 NN --> 外贸
0.0555556 NN --> 投资
0.111111 NP --> NN
0.0555556 NP --> NP
0.0555556 NP-OBJ --> NP
0.0555556 NP-PN --> NR
0.0555556 NP-SBJ --> NN NN NN
0.0555556 NR --> 中国
0.0555556 S --> IP-HLN
0.0555556 VP --> NP-OBJ
0.0555556 VV --> 成为The rules are printed to standard out with absolute or relative frequencies.
I am adding more features, e.g.:
- reloading existing grammars (multi-batch cycles for larger corpus collections)
- elimination of terminal rules
- parsing alternative coding formats for syntactic trees or treebanks (e.g. XML, TEI XML)
- output probabilities for Left-hand-side symbols only, rather than rules
- generation of a Weighted Finite State Transducer representation, as coded in [FLE]If you have ideas or suggestions, let me know.
## Prerequisites
The tool is written in [C++11] and requires the following libraries:
- [Boost]
- [Xerces-C++]## Compile
Use [CLion] or otherwise run:
cmake CMakeLists.txt
make[Damir Cavar]: http://damir.cavar.me/ "Damir Cavar"
[CLion]: https://www.jetbrains.com/clion/ "CLion IDE"
[Boost]: http://www.boost.org/ "Boost C++ Libraries"
[C++11]: https://en.wikipedia.org/wiki/C%2B%2B11 "C++11"
[Xerces-C++]: https://xerces.apache.org/xerces-c/ "Xerces-C++ XML Parser"
[FLE]: http://gorilla.linguistlist.org/fle/ "Free Linguistic Environment"