Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yuanzh/SegParser
Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
- Host: GitHub
- URL: https://github.com/yuanzh/SegParser
- Owner: yuanzh
- License: MIT
- Created: 2014-09-24T17:35:48.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2015-11-29T05:31:19.000Z (about 9 years ago)
- Last Synced: 2024-08-04T04:07:37.071Z (4 months ago)
- Language: C++
- Size: 16.3 MB
- Stars: 9
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- low-resource-languages - SegParser - Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing. (Software / Utilities)
README
#### SegParser
Randomized Greedy algorithm for joint segmentation, POS tagging and dependency parsing
#### Usage
##### 1. Compilation
To compile the project, first make sure you have Boost (including Boost.Regex) installed on your machine. Next, go to the "Release" directory and run `make all` to compile the code. Note that the implementation uses some C++0x/C++11 features, so make sure your compiler supports them.
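A quick sketch of the build steps above, assuming a Unix-like shell and that the provided Makefile sits in the "Release" directory:

```sh
# From the repository root; Boost (including Boost.Regex) must already be installed.
cd Release
make all   # requires a compiler with C++0x/C++11 support
```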
##### 2. Data Format
The data format for each sentence has two parts. The first part is similar to the format used in the CoNLL-X shared task; the only difference is the index in the first column, which here has the form "token index/segment index", where the token index starts from 1 (0 is reserved for the root) and the segment index starts from 0.
The second part encodes the search space for segmentation and POS tagging: each line contains a string describing the lattice structure of one token. The format is as follows:
```
line := Token form\tCandidate1\tCandidate2\t...
Candidate := Segmentation||Al index||Morphology index||Morphology value||Candidate probability
Segmentation := Segment1&&Segment2&&...
Segment := Surface form@#Lemma form@#POS candidate1@#POS candidate2@#...
POS candidate := POS tag_probability
```
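For illustration only, a made-up lattice line for a hypothetical token "wfilm" with two candidate analyses (one segmenting it into "w" + "film", one keeping it whole) might look roughly like the following, with tab-separated fields and the morphology fields left as "_" placeholders; the sample files in the "data" directory are the authoritative reference:

```
wfilm	w@#w@#CC_1.0&&film@#film@#NN_0.9@#NNP_0.1||0||_||_||0.8	wfilm@#wfilm@#NN_1.0||0||_||_||0.2
```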
"data" directory includes sample data files for the SPMRL dataset.
##### 3. Datasets
Because of licensing issues, the datasets are not released here directly. You can find sample files in the "data" directory. Please contact me if you are interested in the full dataset.
UPDATE: a data generator for the SPMRL dataset and the files needed for generating test data have been added to the directory "spmrl_data_generator".
##### 4. Usage
Take a look at the scripts `run_DATA.sh` and `run_DATA_test.sh`, where DATA=spmrl|classical|chinese. For example, to train a model on the SPMRL dataset, you can simply run

`run_spmrl.sh run1`

The model and the development-set results will be saved in the "runs" directory. Note that the model is evaluated on the development set (if one exists) after each epoch, *in parallel* with training. After the model is trained, you can evaluate it on the test set by running

`run_spmrl_test.sh run1`
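Putting the two steps together, a minimal end-to-end sketch (assuming the scripts are run from the repository root and are executable) would be:

```sh
# Train on the SPMRL data; the model and development-set results are written to the "runs" directory.
./run_spmrl.sh run1

# Once training has finished, evaluate the saved model on the test set.
./run_spmrl_test.sh run1
```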