https://github.com/jonsafari/multiway-corpus

Build an n-way multilingual corpus
corpus-data machine-translation mt multilingual multiway-corpus zero-shot

Host: GitHub
URL: https://github.com/jonsafari/multiway-corpus
Owner: jonsafari
License: lgpl-3.0
Created: 2017-01-15T17:52:30.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2017-01-30T14:45:11.000Z (over 9 years ago)
Last Synced: 2025-10-28T06:07:29.903Z (9 months ago)
Topics: corpus-data, machine-translation, mt, multilingual, multiway-corpus, zero-shot
Language: Python
Homepage:
Size: 68.4 KB
Stars: 4
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Multiway Corpus

This builds an *n*-way multilingual corpus from the data in the awesome [Tatoeba](http://tatoeba.org) dataset.

An *n*-way corpus lets you do pivot-free [zero-shot](https://arxiv.org/abs/1611.04558) machine translation and work with unusual language combinations.

Usage is:

    python3 intersect_tatoeba.py Spanish jpn English

The arguments are the languages that you want to intersect, given either as [ISO 639-3](data/lang_codes_iso-639-3.tsv) names (e.g. `English`) or codes (e.g. `eng`). The two forms can be mixed, as in the example above.

The output in this example will be `corpus.jpn`, `corpus.spa`, and `corpus.eng`.
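
To make the idea of "intersecting" languages concrete, here is a minimal sketch of one way it could be done. It is not the code in `intersect_tatoeba.py`. It assumes the sentences have already been loaded into a dict of `id -> (lang, text)` and the links into `(sentence_id, translation_id)` pairs; the function name `intersect` and the choice to anchor on the first requested language are hypothetical.

```python
from collections import defaultdict


def intersect(sentences, links, langs):
    """Return rows of mutually available translations, one column per language.

    sentences: dict mapping sentence id -> (lang_code, text)   (assumed layout)
    links:     iterable of (sentence_id, translation_id) pairs (assumed layout)
    langs:     list of ISO 639-3 codes, e.g. ["eng", "spa", "jpn"]
    """
    # For every sentence, remember one linked translation per language.
    linked = defaultdict(dict)
    for src, dst in links:
        if src in sentences and dst in sentences:
            dst_lang = sentences[dst][0]
            linked[src].setdefault(dst_lang, dst)

    first, rest = langs[0], langs[1:]
    rows = []
    for sid, (lang, text) in sentences.items():
        if lang != first:
            continue
        translations = linked.get(sid, {})
        # Keep the sentence only if every other requested language is covered.
        if all(code in translations for code in rest):
            rows.append([text] + [sentences[translations[code]][1] for code in rest])
    return rows


if __name__ == "__main__":
    toy_sentences = {
        1: ("eng", "Hello"),
        2: ("spa", "Hola"),
        3: ("jpn", "こんにちは"),
        4: ("eng", "Bye"),
        5: ("spa", "Adiós"),
    }
    toy_links = [(1, 2), (2, 1), (1, 3), (3, 1), (4, 5), (5, 4)]
    for row in intersect(toy_sentences, toy_links, ["eng", "spa", "jpn"]):
        print("\t".join(row))
```

In the toy data, "Bye" is dropped because it has no Japanese translation, so only the sentence available in all three languages survives. This sketch only follows links from the first language; a stricter version could also require the other languages to be linked to each other.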

Before running the script, download two files into this directory (they are constantly being updated upstream, so fetch fresh copies):

```bash
wget -c http://downloads.tatoeba.org/exports/sentences.tar.bz2 && tar jxvf sentences.tar.bz2
wget -c http://downloads.tatoeba.org/exports/links.tar.bz2 && tar jxvf links.tar.bz2
```

Then run the script.  Enjoy!
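
After a run, a quick sanity check can be useful. The sketch below assumes the `corpus.*` files are line-aligned, i.e. line *k* of each file is a translation of line *k* of the others; the README does not state this explicitly, so treat it as an assumption. The helper names `line_counts` and `is_aligned` are hypothetical.

```python
def line_counts(paths):
    """Count lines in each file; returns {path: number_of_lines}."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            counts[str(path)] = sum(1 for _ in handle)
    return counts


def is_aligned(paths):
    """True if all files have the same number of lines (assumed alignment test)."""
    return len(set(line_counts(paths).values())) <= 1
```

For the example above this could be called as `is_aligned(["corpus.jpn", "corpus.spa", "corpus.eng"])`. Equal line counts are necessary for alignment but do not prove it.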

Here are some languages in the upstream dataset:

| Language | ISO 639-3 Code | Sentences |
| --- | --- | --- |
| English | eng | 641421 |
| Esperanto | epo | 511221 |
| Turkish | tur | 503109 |
| Russian | rus | 479397 |
| Italian | ita | 474880 |
| German | deu | 366934 |
| French | fra | 315677 |
| Spanish | spa | 265058 |
| Portuguese | por | 231807 |
| Hungarian | hun | 191328 |
| Japanese | jpn | 184296 |
| Hebrew | heb | 153655 |
| Berber | ber | 104842 |
| (Hundreds more languages) | | |
