https://github.com/thesofakillers/claficle
Official repository for the paper "CLAfICLe: Cross-Lingual Adaptation for In-Context Learning". Not Published.
adapters gpt2 multilingual-nlp nlp python transformers
- Host: GitHub
- URL: https://github.com/thesofakillers/claficle
- Owner: thesofakillers
- License: mit
- Created: 2022-05-12T09:15:44.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2023-06-28T10:30:51.000Z (almost 3 years ago)
- Last Synced: 2025-05-07T19:03:32.284Z (about 1 year ago)
- Topics: adapters, gpt2, multilingual-nlp, nlp, python, transformers
- Language: TeX
- Homepage:
- Size: 13.9 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# CLAfICLe
Read our paper:
[Cross-Lingual Adaptation for In-Context Learning [PDF]](./reports/report/main.pdf) (Not submitted for publication)
**Contents**
* [Requirements and Setup](#requirements-and-setup)
* [Required Packages](#required-packages)
* [Checkpoints](#checkpoints)
* [Model Reference](#model-reference)
* [Usage](#usage)
* [Project Organization](#project-organization)
## Requirements and Setup
### Required Packages
Details such as Python and package versions can be found in the generated
[pyproject.toml](pyproject.toml) and [poetry.lock](poetry.lock) files.
We recommend using an environment manager such as
[conda](https://docs.conda.io/en/latest/). After setting up your environment
with the correct Python version, please proceed with the installation of the
required packages.
For [poetry](https://python-poetry.org/) users, getting set up is as easy as
running
```terminal
poetry install
```
We also provide a [requirements.txt](requirements.txt) file for
[pip](https://pypi.org/project/pip/) users who do not wish to use poetry. In
this case, simply run
```terminal
pip install -r requirements.txt
```
This `requirements.txt` file is generated by running the following:
```terminal
sh gen_pip_reqs.sh
```
### Checkpoints
If you wish to run evaluation without first training the model, we provide our
checkpoints via [The Internet Archive](https://archive.org/) at
[this link](https://archive.org/download/claficle/checkpoints.zip). Please unzip
this folder and organize it such that the checkpoints are in the `checkpoints`
folder at the root of this repository.
We do not provide the bare `hr_to_lr` MetaICL model checkpoint. For this
checkpoint, please refer to the instructions on the
[MetaICL repo](https://github.com/facebookresearch/MetaICL) for downloading
their `metaicl` model in the `hr_to_lr` setting. Once downloaded, rename this to
`metaicl.pt` and place it in the relevant checkpoints directory.
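The download-and-unzip step described above can be sketched in Python as follows. This is a minimal sketch, not part of the repository: only the archive URL comes from this README, while the function names `extract_checkpoints` and `download_checkpoints` are hypothetical, and the handling of a nested `checkpoints/` folder is an assumption because the internal layout of the archive is not documented here.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path

# URL taken from this README; everything else is an illustrative assumption.
CHECKPOINTS_URL = "https://archive.org/download/claficle/checkpoints.zip"


def extract_checkpoints(zip_path: Path, repo_root: Path) -> Path:
    """Unzip an already-downloaded archive into <repo_root>/checkpoints."""
    dest = repo_root / "checkpoints"
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Assumption: the archive may wrap its contents in a top-level
    # "checkpoints/" folder. If so, lift the contents one level up so we
    # do not end up with checkpoints/checkpoints/.
    nested = dest / "checkpoints"
    if nested.is_dir():
        for item in list(nested.iterdir()):
            shutil.move(str(item), str(dest / item.name))
        nested.rmdir()
    return dest


def download_checkpoints(repo_root: Path) -> Path:
    """Download the archive into the repository root, then extract it."""
    zip_path = repo_root / "checkpoints.zip"
    urllib.request.urlretrieve(CHECKPOINTS_URL, zip_path)
    return extract_checkpoints(zip_path, repo_root)
```

Calling `download_checkpoints(Path.cwd())` from the repository root would fetch and unpack the archive. The MetaICL `metaicl.pt` checkpoint still has to be obtained and placed separately, as described above.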
## Model Reference
The following table provides a reference for the models evaluated in our paper.
| **Model Name** | **Evaluation Languages** | **Description** |
| ------------------------------- | ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `metaicl` | en | `direct` `hr_to_lr` checkpoint from the [MetaICL repo](https://github.com/facebookresearch/MetaICL) |
| `sandwich-{lang}` | fr, de | `metaicl` sandwiched in a translation API for `lang`, serving as a baseline |
| `metaicl-gewechselt-{lang}-clm` | fr, de | `metaicl` adapted to a `lang` (fr or de) using [WECHSEL](https://github.com/CPJKU/wechsel), 0 shot or with the additional recommended CLM training. |
| `gpt2-gewechselt-{lang}-clm` | not evaluated | `gpt2` adapted to `lang` (fr or de) using [WECHSEL](https://github.com/CPJKU/wechsel) with additional recommended CLM training. Note that we do not evaluate this model; we only use it as a base. |
| `{base}-metaicla` | fr, de | A `base` (any of the `gpt2-gewechselt-{lang}-clm`) with a MetaICL adapter, trained the standard way. |
| `{base}-metaiclva` | fr, de | A `base` (any of the `gpt2-gewechselt-{lang}-clm`) with a MetaICL _vessel_ adapter, trained with targeted distillation. |
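To make the naming templates in the table concrete, the sketch below expands them into full model names for French and German. It only substitutes `{lang}` and `{base}` as the table describes. Whether the repository's configs or checkpoint files use exactly these strings is an assumption, and `model_names` is a hypothetical helper rather than a function from the codebase.

```python
from itertools import product

LANGS = ("fr", "de")  # target languages listed in the table
ADAPTERS = ("metaicla", "metaiclva")  # standard adapter, vessel adapter


def model_names(langs=LANGS):
    """Expand the name templates from the table into concrete strings."""
    names = ["metaicl"]
    names += [f"sandwich-{lang}" for lang in langs]
    names += [f"metaicl-gewechselt-{lang}-clm" for lang in langs]
    # The geWECHSELt GPT-2 models serve as bases for both adapter variants.
    bases = [f"gpt2-gewechselt-{lang}-clm" for lang in langs]
    names += bases
    names += [f"{base}-{adapter}" for base, adapter in product(bases, ADAPTERS)]
    return names
```

For example, the German base with a vessel adapter expands to `gpt2-gewechselt-de-clm-metaiclva`.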
## Usage
We use [hydra](https://hydra.cc/) for configuring our project.
To download and process the data, run
[claficle/data/oscar.py](claficle/data/oscar.py) for OSCAR and
[claficle/data/benchmark.py](claficle/data/benchmark.py) for our
multilingual multi-task benchmark. You may have to configure or
override [claficle/conf/setup_data.yaml](claficle/conf/setup_data.yaml)
accordingly. We suggest inspecting [slurm/data/](slurm/data/) for examples of
how we ran these.
Note that to process the French and German OSCAR data you need the
trained tokenizers from WECHSEL initialization. You can either download these
along with our checkpoints or run WECHSEL initialization yourself by running
[claficle/models/gewechselt.py](claficle/models/gewechselt.py), configured with
[claficle/conf/wechsel_init.yaml](claficle/conf/wechsel_init.yaml). We have
examples of how we ran this in [slurm/wechsel/](slurm/wechsel_init/).
Once the data is downloaded, run evaluation with
[claficle/run/eval.py](claficle/run/eval.py), configured with
[claficle/conf/eval.yaml](claficle/conf/eval.yaml). Examples are in
[slurm/eval/](slurm/eval/).
Of course, to run evaluation you need trained checkpoints. You can once again
either download these or train them yourself. For geWECHSELt models, you can
run [claficle/run/train.py](claficle/run/train.py). For MetaICLVA, you can run
[claficle/run/distil.py](claficle/run/distil.py). For MetaICLA, please refer to
[our MetaICL fork](https://github.com/thesofakillers/metaICLA). As always,
these are configured with the relevant files in
[claficle/conf/](claficle/conf/) and are accompanied by examples of how we ran
them in [slurm/](slurm/).
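The order of the steps above can be summarized with a small sketch. The script paths are copied from this section; invoking them as plain `python <script>` with no Hydra overrides is an assumption, as are the helper names `PIPELINE`, `build_commands` and `run_pipeline`. The SLURM examples remain the authoritative reference for how each step was actually run. MetaICLA training is omitted because it lives in the separate MetaICL fork, and the WECHSEL step can be skipped if the tokenizers are downloaded with the checkpoints.

```python
import subprocess
import sys

# Script paths are copied from the README. Running them directly with the
# default configs (no Hydra overrides) is an assumption.
PIPELINE = [
    ("WECHSEL initialization", "claficle/models/gewechselt.py"),
    ("OSCAR data", "claficle/data/oscar.py"),
    ("benchmark data", "claficle/data/benchmark.py"),
    ("geWECHSELt CLM training", "claficle/run/train.py"),
    ("MetaICLVA distillation", "claficle/run/distil.py"),
    ("evaluation", "claficle/run/eval.py"),
]


def build_commands(python=sys.executable):
    """Return one command (as an argument list) per pipeline stage."""
    return [[python, script] for _, script in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each stage; execute it too when dry_run is False."""
    for (stage, _), cmd in zip(PIPELINE, build_commands()):
        print(f"[{stage}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Running `run_pipeline()` from the repository root only prints the commands; pass `dry_run=False` to execute them in sequence.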
## Project Organization
```plaintext
├── LICENSE
├── README.md <- The top-level README
├── data/
│ ├── interim/ <- Intermediate data that has been transformed.
│ ├── processed/ <- The final, canonical data sets for modeling.
│ └── raw/ <- The original, immutable data dump.
├── checkpoints/ <- Trained and serialized models.
├── notebooks/ <- Jupyter notebooks.
├── slurm/ <- SLURM scripts
├── logs/ <- logs
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
├── pyproject.toml <- project metadata, handled by poetry.
├── poetry.lock <- resolving and locking dependencies, handled by poetry.
├── requirements.txt <- for non-poetry users.
├── gen_pip_reqs.sh <- for generating the pip requirements.txt file
└── claficle/ <- Source code for use in this project.
    ├── __init__.py <- Makes claficle a Python module
    ├── data/ <- Scripts to download or generate data
    ├── models/ <- Model definitions
    ├── run/ <- scripts to train, evaluate and use models
    ├── conf/ <- config files
    ├── utils/ <- miscellaneous utils
    └── visualization/ <- Scripts for visualization
```
The project structure is largely based on the
[cookiecutter data-science template](https://github.com/drivendata/cookiecutter-data-science).
This is purposely opinionated so that paths align across collaborators without
having to edit config files. Users may find the
[cookiecutter data-science opinions page](http://drivendata.github.io/cookiecutter-data-science/#opinions)
relevant.
The top-level `data/` and `models/` directories are in version control only to
show structure. Their contents are not committed and are ignored via
`.gitignore`.