https://github.com/isaacus-dev/emubert-creator
The training code behind EmuBert, the largest open-source masked language model for Australian law.
- Host: GitHub
- URL: https://github.com/isaacus-dev/emubert-creator
- Owner: isaacus-dev
- License: mit
- Created: 2024-05-12T08:31:41.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-02T11:22:19.000Z (about 2 years ago)
- Last Synced: 2025-02-04T04:25:03.792Z (over 1 year ago)
- Topics: australia, bert, law, legal, llm, llms, model, models, nlp, training, transformers
- Language: Python
- Homepage: https://huggingface.co/umarbutler/emubert
- Size: 15.6 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# EmuBert Creator
EmuBert is the largest open-source masked language model for Australian law. This repository preserves the code used to create EmuBert.
If you're looking to download EmuBert, you may do so on [Hugging Face](https://huggingface.co/umarbutler/emubert).
## Setup 🛠️
The EmuBert Creator has only been tested on Python 3.11. It should work on later versions and *may* also work on earlier ones.
To set up the Creator, start by running the following commands:
```bash
git clone https://github.com/umarbutler/emubert-creator.git
cd emubert-creator
pip install -r requirements.txt
```
Next, download the version of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) you'd like to train EmuBert on. To do so, navigate to its [changelog](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus/blob/main/CHANGELOG.md), click on the version number you'd like to use, click on the file named `corpus.jsonl` and finally hit 'download'. Any version of the Corpus that begins with the number 4 should be compatible with the Creator. The specific version of the Corpus used to produce EmuBert is 4.2.1 and can be downloaded [here](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus/blob/fe0cd918dbe0a1fb5afe09cfa682ec3dbc1b94ca/corpus.jsonl).
Finally, make the Corpus available to the Creator in one of three ways: place it in a directory named `data` in the root of the repository, define an environment variable named `OALC` that points to it or override the `corpus_path` variable in `scripts/config.py`.
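The three options for locating the Corpus can be pictured as a simple lookup chain. The sketch below is illustrative only and is not the code in `scripts/config.py`: the function name `resolve_corpus_path`, the order of precedence (explicit override, then `OALC`, then `data`) and the assumption that the file inside `data` is called `corpus.jsonl` are all hypothetical.

```python
import os
from pathlib import Path


def resolve_corpus_path(override=None, repo_root=Path("."), env=None):
    """Pick a corpus location from an override, the OALC variable or data/.

    Hypothetical helper: the precedence order and the default file name
    are assumptions, not behaviour confirmed by the repository.
    """
    env = os.environ if env is None else env

    # 1. An explicit override (analogous to editing `corpus_path`).
    if override is not None:
        return Path(override)

    # 2. The OALC environment variable, if it is set and non-empty.
    if env.get("OALC"):
        return Path(env["OALC"])

    # 3. Fall back to a `data` directory in the root of the repository.
    return Path(repo_root) / "data" / "corpus.jsonl"
```

Calling `resolve_corpus_path()` with no arguments reads the real environment, while passing a dictionary as `env` makes the lookup easy to try out without changing any environment variables.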
## Usage 👩‍💻
To train EmuBert, run the following scripts in the `scripts` directory in order:
1. `preprocess.py`, which cleans documents, splits them into training, validation and test sets, filters out short documents from the training set, deduplicates the training set, trains a tokeniser and finally saves the resulting data.
2. `block.py`, which splits texts into blocks of the same size as EmuBert's context window and saves them.
3. `train.py`, which trains EmuBert and saves it to a directory named `model` (unless the `model_dir` variable in `config.py` is overridden). If training is interrupted at any point, set the script's `RESUME` variable to `True`.
4. `convert.py`, which converts EmuBert from a Better Transformer into a vanilla Transformer.
5. `benchmark.py`, which benchmarks EmuBert against other popular masked language models.
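To give a feel for the blocking step, here is a minimal sketch of one common way to cut tokenised texts into fixed-size blocks: concatenate the token IDs of successive documents and slice the stream into chunks of `block_size`. This is not the code in `block.py`. The function name `block_token_ids`, the choice to concatenate across document boundaries and the choice to drop the final partial block are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def block_token_ids(
    documents: Iterable[List[int]],
    block_size: int,
    drop_remainder: bool = True,
) -> Iterator[List[int]]:
    """Yield blocks of exactly `block_size` token IDs.

    Hypothetical helper: documents are concatenated into one stream
    before slicing, which may differ from what `block.py` does.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    buffer: List[int] = []
    for token_ids in documents:
        buffer.extend(token_ids)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]

    # Optionally emit the leftover tokens as a final, shorter block.
    if buffer and not drop_remainder:
        yield buffer


# Toy example with a block size of 4 (a real context window is far larger).
toy_documents = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
toy_blocks = list(block_token_ids(toy_documents, block_size=4))
```

Packing documents together in this way avoids padding, at the cost of some blocks spanning more than one document.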
## Licence 📜
The Creator is licensed under the [MIT License](LICENCE).