
# EC-NL

Code and data for the paper [Linking Emergent and Natural Languages via Corpus Transfer](https://openreview.net/pdf?id=49A1Y6tRhaq) at ICLR 2022 (spotlight).
```bibtex
@inproceedings{yao2022linking,
  title = {Linking Emergent and Natural Languages via Corpus Transfer},
  author = {Yao, Shunyu and Yu, Mo and Zhang, Yang and Narasimhan, Karthik and Tenenbaum, Joshua and Gan, Chuang},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2022},
  html = {https://openreview.net/pdf?id=49A1Y6tRhaq},
}
```

## Dependencies

- PyTorch 1.8
- SciPy 1.4
- Transformers 4.4.2
- (Optional) Wandb
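
One way to set up a matching environment with pip; only the major/minor pins come from the list above, so the exact patch versions below are assumptions:

```bash
# Sketch of an environment matching the dependency list above.
# Patch versions (1.8.0, 1.4.1) are assumptions; adjust as needed.
pip install torch==1.8.0 scipy==1.4.1 transformers==4.4.2
pip install wandb  # optional, for experiment logging
```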

## Data

[Google Drive](https://drive.google.com/drive/folders/1dBdGaZzvQ4yn-RMpDMxLlFNLzSSbkOWF?usp=sharing) includes

- ```image_features```: Image features of the MS-COCO 2014 (``coco.pt``) and Conceptual Captions (``cc.pt``) datasets, extracted with a pre-trained ResNet and used in EC pre-training.

- ```lm_corpora```: Corpora used for language modeling transfer experiments.

| Name | Usage | Comment |
|--------------|-----------|---------|
| cc.pt | pre-train | Emergent language |
| paren-zipf.pt | pre-train | Regular language of nesting parentheses |
| wiki-es.pt | pre-train | Spanish (IE-Romance) Wikipedia |
| wiki-da.pt | fine-tune | Danish (IE-Germanic) Wikipedia |
| wiki-eu.pt | fine-tune | Basque (Basque) Wikipedia |
| wiki-ja.pt | fine-tune | Japanese (Japonic) Wikipedia |
| wiki-ro.pt | fine-tune | Romanian (IE-Romance) Wikipedia |
| wiki-fi.pt | fine-tune | Finnish (Uralic) Wikipedia |
| wiki-id.pt | fine-tune | Indonesian (Austronesian) Wikipedia |
| wiki-kk.pt | fine-tune | Kazakh (Turkic) Wikipedia |
| wiki-he.pt | fine-tune | Hebrew (Afro-Asiatic) Wikipedia |
| wiki-ur.pt | fine-tune | Urdu (IE-Indic) Wikipedia |
| wiki-fa.pt | fine-tune | Persian (IE-Iranian) Wikipedia |
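
The sections below place both Drive folders under ```./ec-pretrain/data```. A quick sanity check of the expected layout (whether the Drive folder names are kept as subdirectories is an assumption):

```bash
# Sanity check after downloading (layout is an assumption based on the
# download instructions in the sections below):
ls ./ec-pretrain/data/image_features   # expect: cc.pt  coco.pt
ls ./ec-pretrain/data/lm_corpora       # expect: cc.pt  paren-zipf.pt  wiki-*.pt
```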

## Experiments

### Emergent Communication (EC) Game
This part generates an emergent language corpus for the downstream transfer experiments.
Download ```image_features``` from Google Drive to ```./ec-pretrain/data```.
To run the emergent communication training,
```bash
cd ec-game
python train.py
```

Some major options:
- ```--dataset```: Use the Conceptual Captions (```cc```) or MS-COCO (```coco_2014```) dataset.
- ```--vocab_size```: Vocabulary size (default ```4035```).
- ```--seq_len```: Sequence length limit (default ```15```).
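
For example, to train on Conceptual Captions with the default vocabulary and sequence-length settings (a sketch that simply combines the flags listed above):

```bash
cd ec-game
# Train the EC game on Conceptual Captions with the default settings above
python train.py --dataset cc --vocab_size 4035 --seq_len 15
```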

Game training automatically stores EC agents (e.g. ```./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt```) and emergent language corpora (e.g. ```./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt-cc.pt```) at different training steps. A generated corpus can be used in place of ```lm_corpora/cc.pt``` from Google Drive, as sketched below. In the example filename, ```90.6_1000_4035``` encodes the game accuracy (90.6), the number of game training steps (1000), and the game vocabulary size (4035).
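
For instance, to pre-train on a freshly generated emergent language corpus instead of the released one (the run directory and filename below are the hypothetical examples from above):

```bash
# Replace the released EC corpus with a generated one (example paths from above)
cp ./ckpt/cc_vocab_4035_seq_15_reset_-1_nlayers_1/run77926/model_90.6_1000_4035.pt-cc.pt \
   ./ec-pretrain/data/lm_corpora/cc.pt
```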

### Language Modeling Transfer
This part aims to reproduce Figure 2 of the paper.
Download ```lm_corpora``` from Google Drive to ```./ec-pretrain/data```.

To run the pre-training,
```bash
export size=2 # 2,5,10,15,30
export pt_name="wiki-es" # "paren-zipf", "cc"
. pretrain.sh
```

To run the fine-tuning,
```bash
export size=2 # 2,5,10,15,30
export pt_name="wiki-es" # "paren-zipf", "cc"
export ft_name="wiki-ro"
export ckpt=3000
. finetune.sh
```

Meaning of variables above:
- ```size```: Token count of the pre-training corpus, in millions (```[2, 5, 10, 15, 30]```).
- ```pt_name```: Name of pre-training corpus (```["wiki-es", "paren-zipf", "cc"]```).
- ```ft_name```: Name of the fine-tuning corpus; any ```fine-tune``` entry from the table above (e.g. ```"wiki-ro"```, ```"wiki-da"```). See the sweep sketch below.
- ```ckpt```: Which pre-training checkpoint to use for fine-tuning (default ```3000```).
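
To reproduce one curve of Figure 2, the corpus-size sweep can be scripted. A minimal sketch, assuming ```pretrain.sh``` and ```finetune.sh``` read the variables above from the environment:

```bash
# Sweep pre-training corpus sizes for one pre-train/fine-tune pair (Figure 2 style)
export pt_name="cc" ft_name="wiki-ro" ckpt=3000
for size in 2 5 10 15 30; do
  export size
  . pretrain.sh
  . finetune.sh
done
```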

## Acknowledgements

The EC part of the code is based on [ECNMT](https://github.com/cambridgeltl/ECNMT), which was partly based on [Translagent](https://github.com/facebookresearch/translagent).

The LM part of the code is based on [Huggingface run_clm.py](https://github.com/huggingface/transformers/blob/v4.4.2/examples/language-modeling/run_clm.py).

The datasets for our EC experiments include [MS COCO](http://cocodataset.org/#home) and [Conceptual Captions](https://ai.google.com/research/ConceptualCaptions).

The datasets for our LM experiments derive from [tilt-transfer](https://github.com/toizzy/tilt-transfer).

Please cite these resources accordingly. For any questions, contact [Shunyu](mailto:shunyuyao.cs@gmail.com).