An open API service indexing awesome lists of open source software.

https://github.com/unhammer/cg-mwesplit

DEPRECATED: now part of vislcg3. Splits constraint grammar-formatted multiwords created by hfst-tokenise
https://github.com/unhammer/cg-mwesplit

divvun vislcg3

Last synced: 4 months ago
JSON representation

DEPRECATED: now part of vislcg3. Splits constraint grammar-formatted multiwords created by hfst-tokenise

Awesome Lists containing this project

README

        

#+TITLE: cg-mwesplit
#+STARTUP: showall

#+CAPTION: Build Status
[[https://travis-ci.org/unhammer/cg-mwesplit][https://travis-ci.org/unhammer/cg-mwesplit.svg]]

* Description

This program reads input in Constraint Grammar format, and splits
special "multiword cohorts" into separate cohorts, leaving other
cohorts and intervening blanks as they were.

For examples of input/output, see the files in =test/=, e.g.
[[file:test/input.simple.cg][test/input.simple.cg]].

* Prerequisites
A C++ compiler that goes all the way to 11.

Tested with gcc-5.2.0, gcc-5.3.1 and clang-703.0.29.

(Should work all the way down to gcc-4.9, but will fail with e.g.
gcc-4.8.4 or clang-3.5.0.)

* Building

#+BEGIN_SRC sh
./autogen.sh
./configure # optionally with argument --prefix=$HOME/my/prefix
make
make install # with sudo if you didn't specify a prefix
#+END_SRC

* Usage

Takes no options, just stdin and stdout:
#+BEGIN_SRC sh
cg-mwesplit < infile > outfile
#+END_SRC

More typically, it'll be in a pipeline after =hfst-tokenise= and some
step that disambiguates multiwords using =vislcg3=:

#+BEGIN_SRC sh
echo words go here | hfst-tokenise --gtd tokeniser.pmhfst | vislcg3 -g mwe-dis.cg3 | cg-mwesplit
#+END_SRC

* Troubleshooting

If you get
: terminate called after throwing an instance of 'std::regex_error'
: what(): regex_error
then your C++ compiler is too old. See [[./README.org::*Prerequisites][Prerequisites]].