https://github.com/ljvmiranda921/libertus

Multilingual BERT model for Ancient and Historical Languages for SIGTYP Shared Task 2024

Host: GitHub
URL: https://github.com/ljvmiranda921/libertus
Owner: ljvmiranda921
Created: 2023-11-13T07:46:34.000Z (almost 2 years ago)
Default Branch: master
Last Pushed: 2024-03-27T21:04:45.000Z (over 1 year ago)
Last Synced: 2025-01-22T03:06:13.325Z (9 months ago)
Language: Python
Homepage:
Size: 12.8 MB
Stars: 3
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Citation: CITATION.cff

# 🪐 LiBERTus - A Multilingual Language Model for Ancient and Historical Languages

Submission to Task 1 (Constrained) of the [SIGTYP 2024 Shared Task on Word
Embedding Evaluation for Ancient and Historical
Languages](https://sigtyp.github.io/st2024.html). The system is built by
first pretraining a multilingual language model and then finetuning it for a
downstream task. The submissions for Phases 1 and 2 of the Shared Task can be
found in the `submission_p1` and `submission_p2` directories.

## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[Weasel documentation](https://github.com/explosion/weasel).

### ⏯ Commands

The following commands are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run).
Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `create-pretraining` | Create corpus for multilingual LM pretraining |
| `create-vocab` | Train a tokenizer to create a vocabulary |
| `pretrain-model` | Pretrain a multilingual LM from a corpus |
| `pretrain-model-from-checkpoint` | Pretrain a multilingual LM from a corpus based on a checkpoint |
| `upload-to-hf` | Upload pretrained model and corresponding tokenizer to the HuggingFace repository |
| `convert-to-spacy-merged` | Convert CoNLL-U files into spaCy format for finetuning |
| `convert-to-spacy` | Convert CoNLL-U files into spaCy format for finetuning |
| `finetune-tok2vec-model` | Finetune a tok2vec model given training and validation corpora |
| `finetune-trf-model` | Finetune a transformer model given training and validation corpora |
| `finetune-with-merged-corpus` | Finetune a transformer model on the combined training and validation corpora |
| `package-model` | Package model and upload to HuggingFace |
| `evaluate-model-dev` | Evaluate a model on the validation set |
| `plot-figures` | Plot figures for the writeup |
| `setup-test` | Install models from HuggingFace via pip |
| `download-models-locally` | Download models from HuggingFace |
| `get-test-results` | Get results from the test file |
| `zip-results-p1` | Zip the results into a single file for submission (Phase 1) |
| `zip-results-p2` | Zip the results into a single file for submission (Phase 2) |
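
The "only re-run if inputs have changed" behaviour can be pictured as a checksum comparison. The sketch below is purely illustrative and is not Weasel's actual implementation: the function names, the JSON lock-file layout, and the choice of MD5 are all assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path: Path) -> str:
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def needs_rerun(name: str, inputs: list[Path], lock_path: Path) -> bool:
    """Return True if any input of command `name` differs from the recorded run.

    `lock_path` is a hypothetical JSON file mapping command names to
    {input path: checksum} dictionaries.
    """
    current = {str(p): file_checksum(p) for p in inputs}
    recorded = {}
    if lock_path.exists():
        recorded = json.loads(lock_path.read_text()).get(name, {})
    return current != recorded


def record_run(name: str, inputs: list[Path], lock_path: Path) -> None:
    """Store the current input checksums for command `name`."""
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    lock[name] = {str(p): file_checksum(p) for p in inputs}
    lock_path.write_text(json.dumps(lock, indent=2))
```

A command runner built on this idea would call `needs_rerun` before executing a command and `record_run` after it succeeds, so that an unchanged corpus does not trigger another pretraining run.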

### ⏭ Workflows

The following workflows are defined by the project. They
can be executed using [`weasel run [name]`](https://github.com/explosion/weasel/tree/main/docs/cli.md#rocket-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed.

| Workflow | Steps |
| --- | --- |
| `pretrain` | `create-pretraining` → `create-vocab` → `pretrain-model` |
| `finetune` | `convert-to-spacy` → `finetune-trf-model` → `evaluate-model-dev` |
| `experiment-merged` | `convert-to-spacy-merged` → `finetune-with-merged-corpus` |
| `experiment-sampling` | `create-vocab` → `pretrain-model` |
| `make-submission-p1` | `setup-test` → `get-test-results` → `zip-results-p1` |
| `make-submission-p2` | `download-models-locally` → `zip-results-p2` |
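
To see which individual commands a workflow stands for, the table can be expressed as a small lookup. This is a minimal sketch: the `WORKFLOWS` dictionary is copied by hand from the table above rather than read from `project.yml`, and `expand_workflow` is a hypothetical helper, not part of Weasel.

```python
# Workflow steps, transcribed from the table above (not parsed from project.yml).
WORKFLOWS = {
    "pretrain": ["create-pretraining", "create-vocab", "pretrain-model"],
    "finetune": ["convert-to-spacy", "finetune-trf-model", "evaluate-model-dev"],
    "experiment-merged": ["convert-to-spacy-merged", "finetune-with-merged-corpus"],
    "experiment-sampling": ["create-vocab", "pretrain-model"],
    "make-submission-p1": ["setup-test", "get-test-results", "zip-results-p1"],
    "make-submission-p2": ["download-models-locally", "zip-results-p2"],
}


def expand_workflow(name: str) -> list[list[str]]:
    """Return one `weasel run <command>` argument list per step, in order."""
    if name not in WORKFLOWS:
        raise KeyError(f"Unknown workflow: {name}")
    return [["weasel", "run", step] for step in WORKFLOWS[name]]


if __name__ == "__main__":
    # Print the step-by-step equivalent of the `finetune` workflow.
    for argv in expand_workflow("finetune"):
        print(" ".join(argv))
```

Running the steps one at a time like this can help when a single stage fails and needs to be retried; in normal use, `weasel run finetune` runs the same steps in order.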

### 🗂 Assets

The following assets are defined by the project. They can
be fetched by running [`weasel assets`](https://github.com/explosion/weasel/tree/main/docs/cli.md#open_file_folder-assets)
in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/train/` | Git | CoNLL-U training datasets for Task 0 (morphology/lemma/POS) |
| `assets/dev/` | Git | CoNLL-U validation datasets for Task 0 (morphology/lemma/POS) |
| `assets/test/` | Git | CoNLL-U test datasets for Task 0 (morphology/lemma/POS) |
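
For a quick look at what these files contain, the sketch below reads the form, lemma, POS tag, and morphological features from CoNLL-U text. It assumes the assets follow the standard ten-column, tab-separated CoNLL-U layout; the `Token` class, the helper names, and the two-token Latin sample are made up for illustration and are not taken from the shared task data.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    upos: str
    feats: dict[str, str]


def parse_feats(raw: str) -> dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if raw == "_":
        return {}
    return dict(pair.split("=", 1) for pair in raw.split("|"))


def parse_conllu(text: str) -> list[list[Token]]:
    """Parse CoNLL-U text into a list of sentences, each a list of tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as '# text = ...'
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"Expected 10 columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        current.append(Token(cols[1], cols[2], cols[3], parse_feats(cols[5])))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample for illustration only.
SAMPLE = "\n".join([
    "# text = Gallia est",
    "\t".join(["1", "Gallia", "Gallia", "PROPN", "_",
               "Case=Nom|Number=Sing", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "est", "sum", "AUX", "_",
               "Number=Sing|Person=3", "0", "root", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for token in parse_conllu(SAMPLE)[0]:
        print(token.form, token.lemma, token.upos, token.feats)
```

The three fields kept here (lemma, POS tag, features) correspond to the morphology/lemma/POS annotations named in the asset descriptions.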

## 📄 Cite

If you use any of the code or the models, please cite:

```
@inproceedings{miranda-2024-allen,
    title = "{A}llen Institute for {AI} @ {SIGTYP} 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages",
    author = "Miranda, Lester",
    booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
    month = mar,
    year = "2024",
    address = "St. Julian's, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sigtyp-1.18",
    pages = "151--159",
}
```
