https://github.com/jankaszel/eurol1
- Host: GitHub
- URL: https://github.com/jankaszel/eurol1
- Owner: jankaszel
- License: mit
- Created: 2021-06-09T19:51:25.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-07-07T14:42:25.000Z (over 3 years ago)
- Last Synced: 2024-10-17T00:58:35.937Z (22 days ago)
- Language: Go
- Size: 8.79 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# eurol1
eurol1 is a tool for post-processing parts of the [Europarl parallel corpus](https://www.statmt.org/europarl/index.html) and enriching the aligned sentences with additional metadata from the corpus source. For now, its purpose is to filter the aligned sentences down to those with a particular original language, since an aligned sentence may originally have been spoken in another language. This doesn't properly identify the speakers' L1, but comes close.
## Usage
Say you want to process all sentences of the parallel Spanish-English corpus that were originally spoken in Spanish. Your directory may look like this, where the first two files contain the aligned sentences in each language and the `txt` folder contains the corpus source (which includes metadata):

```
europarl-v7.es-en.es
europarl-v7.es-en.en
txt/
  es/
  ...
```

Now, run eurol1:
```bash
$ eurol1 ./europarl-v7.es-en.es ./europarl-v7.es-en.en ./txt/es es-en.filtered.json
```
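
To illustrate the kind of metadata the corpus source carries, here is a minimal sketch of reading the original language from a speaker tag. It assumes speaker turns in the `txt` files are marked up as lines like `<SPEAKER ID=1 LANGUAGE="ES" NAME="...">`. The function name `originalLanguage`, the regular expression, and the sample lines are hypothetical and for illustration only; this is not necessarily how eurol1 itself parses the source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// speakerLang captures the LANGUAGE attribute of a <SPEAKER ...> tag line.
// Assumption: the attribute value is a short alphabetic code in double quotes.
var speakerLang = regexp.MustCompile(`^<SPEAKER\b[^>]*\bLANGUAGE="([A-Za-z]+)"`)

// originalLanguage returns the lower-cased language code found in a SPEAKER
// tag line, and false if the line is not a SPEAKER tag or has no LANGUAGE.
func originalLanguage(line string) (string, bool) {
	m := speakerLang.FindStringSubmatch(strings.TrimSpace(line))
	if m == nil {
		return "", false
	}
	return strings.ToLower(m[1]), true
}

func main() {
	// Hypothetical sample lines, not taken from the corpus.
	lines := []string{
		`<SPEAKER ID=1 LANGUAGE="ES" NAME="Example Speaker">`,
		`<SPEAKER ID=2 NAME="Another Speaker">`,
		`Plain sentence text.`,
	}
	for _, line := range lines {
		if lang, ok := originalLanguage(line); ok {
			fmt.Printf("original language %q: %s\n", lang, line)
		} else {
			fmt.Printf("no language metadata: %s\n", line)
		}
	}
}
```

Speaker tags without a `LANGUAGE` attribute are reported as having no metadata rather than guessed at, which matches the caveat above that the original language only approximates the speaker's L1.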