https://github.com/hankcs/treebankpreprocessing
Python scripts preprocessing Penn Treebank and Chinese Treebank
corpus-processing natural-language-processing
- Host: GitHub
- URL: https://github.com/hankcs/treebankpreprocessing
- Owner: hankcs
- License: gpl-3.0
- Created: 2017-11-04T02:59:57.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2020-09-02T15:22:05.000Z (over 5 years ago)
- Last Synced: 2025-02-27T18:06:40.177Z (over 1 year ago)
- Topics: corpus-processing, natural-language-processing
- Language: Python
- Homepage: http://www.hankcs.com/nlp/ptb-ctb-python.html
- Size: 30.3 KB
- Stars: 161
- Watchers: 6
- Forks: 42
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# TreebankPreprocessing
Python scripts preprocessing [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) and [Chinese Treebank 5.1 (CTB)](https://catalog.ldc.upenn.edu/LDC2005T01). They can convert treebanks to:
| Corpus | Format | Description |
| --- | --- | --- |
| constituency parse tree | `.txt` | one line for one sentence |
| dependency parse tree | `.conllx` | [Basic Stanford Dependencies (SD)](https://nlp.stanford.edu/software/stanford-dependencies.shtml) |
| word segmentation corpus | `.tsv` | first column for characters, second column for BMES tags, sentences separated by a blank line |
| part-of-speech tagging corpus | `.tsv` | first column for words, second column for tags, sentences separated by a blank line |
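To make the word segmentation format concrete, here is a minimal sketch of how segmented words map to character/BMES rows. The function names `words_to_bmes` and `to_tsv` are made up for illustration and are not taken from the scripts in this repository. The trailing newline is also an assumption.

```python
def words_to_bmes(words):
    """Yield (character, tag) pairs for a list of already segmented words.

    B = begin, M = middle, E = end of a multi-character word,
    S = single-character word.
    """
    for word in words:
        if len(word) == 1:
            yield word, "S"
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            yield ch, tag


def to_tsv(sentences):
    """Render sentences as two-column TSV with a blank line between sentences."""
    blocks = []
    for words in sentences:
        rows = (f"{ch}\t{tag}" for ch, tag in words_to_bmes(words))
        blocks.append("\n".join(rows))
    # Assumption: the file ends with a single newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(to_tsv([["中国", "人"], ["民主党"]]), end="")
```

The part-of-speech format has the same layout, with whole words in the first column and tags in the second.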
When designing a tagger or parser, preprocessing treebanks is a troublesome problem. We need to:
- Split dataset into train/dev/test, following conventional splits.
- Remove XML tags inside CTB.
- Combine the multiline bracketed files into one file, one line for one sentence.
I wondered why there was no open-source tool handling these tedious chores, so I finally decided to write one myself. Hopefully it will save you some time.
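As an illustration of the last two chores (dropping markup lines and joining each multiline tree into one line), here is a rough sketch. It is not the code used by `ctb.py` or `ptb.py`. It assumes that markup lines start with `<`, that tree lines never do, and that literal parentheses never appear as tokens (PTB escapes them as `-LRB-`/`-RRB-`). The helper name `one_line_trees` is hypothetical.

```python
def one_line_trees(text):
    """Return each top-level bracketed tree in `text` as a single line."""
    trees, buf, depth = [], [], 0
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: markup lines start with "<"; blank lines carry no content.
        if not stripped or stripped.startswith("<"):
            continue
        buf.append(stripped)
        # Track bracket depth; a tree is complete when depth returns to zero.
        depth += stripped.count("(") - stripped.count(")")
        if depth == 0:
            # Collapse all runs of whitespace into single spaces.
            trees.append(" ".join(" ".join(buf).split()))
            buf = []
    return trees


if __name__ == "__main__":
    sample = (
        "<S ID=1>\n"
        "( (IP (NP (NN 经济))\n"
        "      (VP (VV 增长))) )\n"
        "</S>\n"
    )
    for tree in one_line_trees(sample):
        print(tree)
```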
### Required software
- Python 3
- NLTK
- Stanford Parser (optional, only needed for converting to dependency parse trees)
## Overview
What kinds of tasks can we perform on treebanks?
### Chinese Word Segmentation
For CTB, the segmentation corpus is split as per Jiang et al. (2009):
- **CTB** Training: 001–270, 400–1151. Development: 301–325. Test: 271–300.
### Part-of-Speech Tagging
- **PTB** Training: 0-18. Development: 19-21. Test: 22-24. As per Collins (2002) and Choi (2016).
- **CTB** Same as for Chinese Word Segmentation.
### Phrase Structure Parsing
These scripts can also convert treebanks into the conventional data setup from Chen and Manning (2014) and Dyer et al. (2015). The detailed splits are:
- **PTB** Training: 02-21. Development: 22. Test: 23.
- **CTB** Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147.
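The splits above are inclusive ranges over PTB section numbers and CTB file numbers, so assigning a file to a set is a range lookup. The sketch below encodes the phrase structure parsing splits exactly as listed above. The names `PTB_PARSING`, `CTB_PARSING` and `which_split` are illustrative and are not the identifiers used in the scripts.

```python
# Inclusive (low, high) ranges, copied from the phrase structure parsing splits.
PTB_PARSING = {
    "train": [(2, 21)],
    "dev": [(22, 22)],
    "test": [(23, 23)],
}
CTB_PARSING = {
    "train": [(1, 815), (1001, 1136)],
    "dev": [(886, 931), (1148, 1151)],
    "test": [(816, 885), (1137, 1147)],
}


def which_split(number, splits):
    """Return 'train', 'dev' or 'test' for a section/file number, else None."""
    for name, ranges in splits.items():
        if any(low <= number <= high for low, high in ranges):
            return name
    return None  # the number is not used by this setup


if __name__ == "__main__":
    print(which_split(23, PTB_PARSING))
    print(which_split(900, CTB_PARSING))
```

For example, `which_split(23, PTB_PARSING)` returns `"test"`, and PTB sections 00, 01 and 24 return `None` because this setup does not use them.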
### Dependency Parsing
You will need Stanford Parser to convert phrase structure trees to dependency parse trees. Please download [Stanford Parser Version 3.3.0](https://nlp.stanford.edu/software/stanford-parser-full-2013-11-12.zip) and place the two jar files shown below in this folder:
```
TreebankPreprocessing
├── ...
├── stanford-parser-3.3.0-models.jar
└── stanford-parser.jar
```
OK, let's do it on the fly.
## PTB
### 1. Import PTB into NLTK
Parsing the bracketed files relies on NLTK. Please follow the [NLTK instructions](http://www.nltk.org/howto/corpus.html#parsed-corpora) and put `BROWN` and `WSJ` into `nltk_data/corpora/ptb`, e.g.
```
ptb
├── BROWN
└── WSJ
```
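To check that the import worked, something like the sketch below can be used. It assumes NLTK's `ptb` corpus reader exposes file ids of the form `WSJ/02/WSJ_0201.MRG`, as in the NLTK corpus howto. The helpers `wsj_section` and `iter_section_trees` are hypothetical names and are not part of `ptb.py`.

```python
def wsj_section(fileid):
    """Return the WSJ section number of an NLTK ptb file id, or None.

    Assumes ids look like 'WSJ/02/WSJ_0201.MRG'; BROWN ids return None.
    """
    parts = fileid.split("/")
    if len(parts) == 3 and parts[0] == "WSJ" and parts[1].isdigit():
        return int(parts[1])
    return None


def iter_section_trees(sections):
    """Yield parsed trees from the given WSJ sections (needs NLTK and the corpus)."""
    from nltk.corpus import ptb  # imported lazily so the helper above works without NLTK

    for fileid in ptb.fileids():
        if wsj_section(fileid) in sections:
            yield from ptb.parsed_sents(fileid)


if __name__ == "__main__":
    # Prints the first tree of section 23 if the corpus is installed.
    for tree in iter_section_trees({23}):
        print(tree)
        break
```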
### 2. Run `ptb.py`
This script does all the work for you; it only requires a path in which to store the output.
```text
$ python3 ptb.py --help
usage: ptb.py [-h] --output OUTPUT [--task TASK]

Combine Penn Treebank WSJ MRG files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --output OUTPUT  The folder where to store the output train/dev/test files
  --task TASK      Which task (par, pos)? Use par for phrase structure
                   parsing, pos for part-of-speech tagging
```
* You will get three `.txt` files corresponding to the train/dev/test sets.
* If you want part-of-speech tagging corpora, simply append `--task pos`. This time, you will get three `.tsv` files.
* `.txt` files can be converted to `.conllx` files by `tb_to_stanford.py`:
```
$ python3 tb_to_stanford.py --help
usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT

Convert combined Penn Treebank files (.txt) to Stanford Dependency format
(.conllx)

optional arguments:
  -h, --help       show this help message and exit
  --input INPUT    The folder containing train.txt/dev.txt/test.txt in
                   bracketed format
  --lang LANG      Which language? Use en for English, cn for Chinese
  --output OUTPUT  The folder where to store the output
                   train.conllx/dev.conllx/test.conllx in Stanford Dependency
                   format
```
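For readers curious about what such a conversion involves, here is a rough sketch of calling the Stanford Parser converter from Python. It is an assumption about how the conversion can be done, not a copy of `tb_to_stanford.py`. The class names and the `-basic -keepPunct -conllx -treeFile` flags are those of the Stanford Parser command-line converter as far as I know, so please verify them against the version you downloaded. `stanford_command`, `convert` and `jar_dir` are hypothetical names.

```python
import os
import subprocess

# Converter entry points (assumed; check against your Stanford Parser version).
CONVERTERS = {
    "en": "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
    "cn": "edu.stanford.nlp.trees.international.pennchinese.ChineseGrammaticalStructure",
}
JARS = ("stanford-parser.jar", "stanford-parser-3.3.0-models.jar")


def stanford_command(tree_file, lang="en", jar_dir="."):
    """Build the java command converting one bracketed file to CoNLL-X."""
    classpath = os.pathsep.join(os.path.join(jar_dir, jar) for jar in JARS)
    return [
        "java", "-cp", classpath, CONVERTERS[lang],
        "-basic",      # basic (non-collapsed) dependencies
        "-keepPunct",  # keep punctuation tokens
        "-conllx",     # CoNLL-X output
        "-treeFile", tree_file,
    ]


def convert(tree_file, conllx_file, lang="en", jar_dir="."):
    """Run the converter; it prints to stdout, so redirect that into the output file."""
    with open(conllx_file, "w", encoding="utf-8") as out:
        subprocess.run(stanford_command(tree_file, lang, jar_dir), stdout=out, check=True)


if __name__ == "__main__":
    print(" ".join(stanford_command("train.txt", lang="en")))
```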
## CTB
The CTB is a little messier: it contains extra XML tags in every gold tree and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing `index.html`).
```
$ python3 ctb.py --help
usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK]

Combine Chinese Treebank 5.1 fid files into train/dev/test set

optional arguments:
  -h, --help       show this help message and exit
  --ctb CTB        The root path to Chinese Treebank 5.1
  --output OUTPUT  The folder where to store the output
                   train.txt/dev.txt/test.txt
  --task TASK      Which task (seg, pos, par)? Use seg for word segmentation,
                   pos for part-of-speech tagging, par for phrase structure
                   parsing
```
- Tagging and dependency parsing corpora can be obtained in the same way as for PTB.
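Once the `.tsv` files exist, they can be loaded with a few lines of Python. The sketch below assumes exactly two tab-separated columns per row and a blank line between sentences, as described in the format table. `read_tsv` is an illustrative name, and the file path in the usage part is hypothetical.

```python
def read_tsv(lines):
    """Group two-column TSV rows into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")  # raises ValueError on malformed rows
        current.append((token, tag))
    if current:  # the last sentence may not be followed by a blank line
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    import sys

    # Usage: python3 read_tsv.py path/to/train.tsv  (hypothetical path)
    with open(sys.argv[1], encoding="utf-8") as f:
        print(len(read_tsv(f)), "sentences")
```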
Then you can start your research. Enjoy it!