https://github.com/bitextor/bleualign-cpp



# Bleualign-cpp
C++ sentence alignment tool based on [Bleualign](https://github.com/rsennrich/Bleualign).
Bleualign-cpp is expected to be used together with [document-aligner](https://github.com/bitextor/bitextor/tree/master/document-aligner).

### Requirements
- GCC, C++11 compiler
- [Boost](https://www.boost.org/) 1.58.0 or later
- [CMake](https://cmake.org/download/) 3.7.2 or later
- [GTest](https://github.com/google/googletest) (for tests)
- [kpu/preprocess](https://github.com/kpu/preprocess) (already included in this repository as a submodule)

### Compile with CMake

```bash
mkdir build
cd build
cmake .. -DBUILD_TEST=on -DCMAKE_BUILD_TYPE=Release
# if 'preprocess' is in a different folder, use `cmake .. -DBUILD_TEST=on -DCMAKE_BUILD_TYPE=Release -DPREPROCESS_PATH=/home/user/preprocess/` instead
make -j 4
tests/test_all
```

### Usage

Bleualign-cpp takes two texts in two different languages and aligns them to produce parallel sentences. To this end, it also needs a translation of one of these texts.

Each input line has the format `url1 url2 text1 text2 text1translated [ text2processed ] [ text1metadata text2metadata ]`. Every text column is base64-encoded; once decoded, it must contain one sentence per line. The translation (`text1translated`) must correspond line by line with the original text (`text1`).

The first input line is a header that names the columns in the same order: `src_url trg_url src_text trg_text src_translated [ trg_translated ] [ src_metadata trg_metadata ]`. The first output line is also a header; depending on the arguments provided, its fields are: `src_url trg_url src_text trg_text bleualign_score [ src_deferred_hash trg_deferred_hash ] [ src_metadata_field_1 trg_metadata_field_1 ... ]`.

Optionally, a processed version of `text2` can be provided as a sixth column. It should match the processing applied to `text1translated` more closely, which helps when calculating alignment scores. The output of Bleualign-cpp only mentions `text1` and `text2`.
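
As an illustration of the input format, the following is a minimal sketch that builds a header line and one input record from sentence-split text. It assumes that columns are separated by tabs, which this README does not state explicitly. The helper names (`b64`, `make_input`), the URLs, and the example sentences are all hypothetical.

```bash
#!/bin/sh
# Sketch: build one Bleualign-cpp input record from sentence-split text.
# Assumption: columns are separated by tabs (not stated in this README).

# Base64-encode stdin onto a single line (strip any line wrapping).
b64() { base64 | tr -d '\n'; }

# Print the header line followed by one record.
# Arguments: src_url trg_url src_text trg_text src_translated
make_input() {
  printf 'src_url\ttrg_url\tsrc_text\ttrg_text\tsrc_translated\n'
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$1" "$2" \
    "$(printf '%s\n' "$3" | b64)" \
    "$(printf '%s\n' "$4" | b64)" \
    "$(printf '%s\n' "$5" | b64)"
}

# Hypothetical example data: one sentence per line in every text column.
# The translation has the same number of lines as the source text.
src="$(printf 'Hola mundo.\nHasta luego.')"
trg="$(printf 'Hello world.\nSee you later.')"
src_translated="$(printf 'Hello world.\nUntil later.')"

make_input 'http://example.com/es' 'http://example.com/en' \
  "$src" "$trg" "$src_translated"
```

Redirecting the output of `make_input` to a file gives something that could be piped into the aligner, provided the tab-separator assumption holds.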

Bleualign-cpp outputs aligned sentences to standard output. Output format is (mandatory fields only): `url1 url2 source_sentence target_sentence score` per line.
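
To show how the output columns might be consumed, here is a small sketch that keeps the header line and only those aligned pairs whose score (the fifth column) reaches a minimum value. It assumes tab-separated output, and `filter_by_score` is a hypothetical helper, not part of Bleualign-cpp.

```bash
#!/bin/sh
# Sketch: post-filter Bleualign-cpp output by alignment score.
# Assumption: output columns are separated by tabs.

# Arguments: minimum score. Reads aligned output on stdin.
# Line 1 is the header and is always kept; the score is column 5.
filter_by_score() {
  awk -F '\t' -v min="$1" 'NR == 1 || ($5 + 0) >= (min + 0)'
}

# Usage (hypothetical file names):
#   filter_by_score 0.5 < aligned.tsv > aligned.filtered.tsv
```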

Bleualign-cpp reads its input from stdin and writes its output to stdout.

##### Optional Parameters
* **--help** - Print help dialog
* **--bleu_threshold** - Sentence-level BLEU score threshold (Default: 0.0)
* **--print-sent-hash** - Print hash for each sentence
* **--metadata-header-fields** - Language-agnostic, comma-separated list of metadata header fields (the prefixes `src_` and `trg_` are added automatically)
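
The options above could be combined in a small wrapper like the one sketched below. The binary path `./build/bleualign_cpp` is an assumption about where the build places the executable (override it with `BLEUALIGN_BIN`), and passing the threshold as a separate argument is also assumed rather than confirmed here. Check `--help` for the exact syntax.

```bash
#!/bin/sh
# Sketch: run Bleualign-cpp as a stdin/stdout filter.
# Assumptions: the default binary path is hypothetical, and options
# take their value as a separate argument.
BLEUALIGN_BIN="${BLEUALIGN_BIN:-./build/bleualign_cpp}"

# Arguments: BLEU threshold (falls back to the documented default 0.0).
align() {
  "$BLEUALIGN_BIN" --bleu_threshold "${1:-0.0}" --print-sent-hash
}

# Usage (hypothetical file names):
#   align 0.2 < input.tsv > aligned.tsv
```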