Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Multi-Domain Expert Learning
https://github.com/huu4ontocord/mdel
- Host: GitHub
- URL: https://github.com/huu4ontocord/mdel
- Owner: huu4ontocord
- License: apache-2.0
- Created: 2023-04-05T01:52:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-23T20:13:27.000Z (10 months ago)
- Last Synced: 2024-06-02T05:52:03.736Z (6 months ago)
- Language: Python
- Size: 37.2 MB
- Stars: 68
- Watchers: 21
- Forks: 14
- Open Issues: 24
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# MDEL
Multi-Domain Expert Learning
# Environment Setup
To set up the development environment, run `make setup_dev`. This will set up
the pre-commit hooks.

## Creating Expert Datasets
First, make sure you have followed the Environment Setup guidelines.
To create an expert dataset using the Pile data, follow these steps:
1. Download the Pile shard 1 data: `./scripts/get_pile_shard1_data.sh`
2. To set the domain, edit the variable `SUBSET_NAME` in
`scripts/create_domain_pile_mix.sh`. This should be set to a valid value of
the Pile's variable `pile_set_name`. A list of valid values can be found
below.
3. Run the above script to process the dataset (a Python sketch of this
   filtering step appears after the step list).
4. Authenticate with Hugging Face:
   `export HF_ACCESS_TOKEN={YOUR HUGGINGFACE TOKEN}`
5. Set the dataset name in `scripts/upload_to_hf.sh`
6. Run the above script to upload the processed dataset to Hugging Face (see
   the upload sketch after the subset list below).
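The filtering performed in steps 2 and 3 can be pictured in a few lines of
Python. This is a minimal sketch rather than the script itself, assuming the
downloaded shard has been decompressed to JSON Lines with the Pile's usual
`meta.pile_set_name` field; both file paths are placeholders.

```python
# Minimal sketch of domain filtering (not create_domain_pile_mix.sh itself).
# Assumes a decompressed Pile shard in JSON Lines format; paths are placeholders.
import json

SUBSET_NAME = "ArXiv"  # any valid pile_set_name value from the list below

with open("pile/train/00.jsonl") as src, open("domain_subset.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record.get("meta", {}).get("pile_set_name") == SUBSET_NAME:
            dst.write(line)
```

### Pile Subsets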
- Pile-CC
- PubMed Central
- Books3†
- OpenWebText2
- ArXiv
- Github
- FreeLaw
- Stack Exchange
- USPTO Backgrounds
- PubMed Abstracts
- Gutenberg (PG-19)†
- OpenSubtitles†
- Wikipedia (en)†
- DM Mathematics†
- Ubuntu IRC
- BookCorpus2
- EuroParl†
- HackerNews
- YoutubeSubtitles
- PhilPapers
- NIH ExPorter
- Enron Emails†
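For step 6, the upload can also be done directly from Python with the
`datasets` library. A sketch assuming the subset file produced above and the
MDEL dataset name used in the perplexity example later in this README:

```python
# Sketch of the upload step (what upload_to_hf.sh does conceptually).
# Assumes the processed subset is a local JSON Lines file.
import os
from datasets import load_dataset

ds = load_dataset("json", data_files="domain_subset.jsonl", split="train")
ds.push_to_hub(
    "Multi-Domain-Expert-Layers/arxiv",
    token=os.environ["HF_ACCESS_TOKEN"],
)
```

# Training Expert Models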
1. Clone this repo and follow the Environment Setup instructions
2. Set up HF authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
3. Set up W&B authentication: `export WANDB_API_KEY=[FILL ME]`
4. Edit the variable `DATASET` in the script `src/mdel/train.sh` to match a valid
dataset name on the
[MDEL HF](https://huggingface.co/Multi-Domain-Expert-Layers).
5. Run the above script in background mode to start the training: `./train.sh &`
6. The trained model should be uploaded to the MDEL HF org.
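A quick way to confirm the upload is to load the expert back from the Hub.
The model id below is the one used in the perplexity example at the end of
this README:

```python
# Load a trained expert back from the Multi-Domain-Expert-Layers org.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
print(sum(p.numel() for p in model.parameters()), "parameters")
```

# Merging Expert Models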
1. Clone this repo and follow the Environment Setup instructions
2. Set up HF authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
3. Run the merge script:

```bash
python src/mdel/merge_experts.py \
--hf-repo your_hf_username/desired_name_of_merged_model \
-e mdel/expert_1 \
-e mdel/expert_2 \
-e mdel/expert_n
```
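Conceptually, merging can be as simple as averaging the experts' weights. The
sketch below is illustrative only: `merge_experts.py` may well merge
differently (for example, only specific layers), and every model id except
`expert-arxiv` is a placeholder.

```python
# Illustrative only: naive parameter averaging across expert checkpoints.
import torch
from transformers import AutoModelForCausalLM

expert_ids = [
    "Multi-Domain-Expert-Layers/expert-arxiv",   # from the perplexity example
    "Multi-Domain-Expert-Layers/expert-github",  # placeholder name
]
experts = [AutoModelForCausalLM.from_pretrained(i) for i in expert_ids]

merged = experts[0]
other_params = [dict(e.named_parameters()) for e in experts[1:]]
with torch.no_grad():
    for name, param in merged.named_parameters():
        stacked = torch.stack([param] + [p[name] for p in other_params])
        param.copy_(stacked.mean(dim=0))  # elementwise mean of all experts

merged.push_to_hub("your_hf_username/desired_name_of_merged_model")
```

# Evaluating Perplexity of Models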
1. Clone this repo and follow the Environment Setup instructions
2. Set up HF authentication: `export HUGGING_FACE_HUB_TOKEN=[FILL ME]`
3. Run the perplexity script:

```bash
python3 src/mdel/calculate_perplexity.py \
--model Multi-Domain-Expert-Layers/expert-arxiv \
--dataset Multi-Domain-Expert-Layers/arxiv \
--split validation_domain
```
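The same computation can be sketched directly with `transformers`. This is a
minimal illustration rather than the repo's script; it assumes the dataset
exposes a `text` column and scores only a small sample.

```python
# Minimal perplexity sketch (assumption: dataset has a "text" column).
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Multi-Domain-Expert-Layers/expert-arxiv"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

ds = load_dataset("Multi-Domain-Expert-Layers/arxiv", split="validation_domain")
total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in ds["text"][:100]:  # small sample, for illustration only
        ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
        if ids.size(1) < 2:
            continue
        loss = model(ids, labels=ids).loss  # mean NLL over predicted tokens
        total_nll += loss.item() * (ids.size(1) - 1)
        total_tokens += ids.size(1) - 1

print("perplexity:", math.exp(total_nll / total_tokens))
```

# References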
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... &
Leahy, C. (2020). The Pile: An 800GB dataset of diverse text for language
modeling. _arXiv preprint arXiv:2101.00027_.