https://github.com/haozhg/lmd
Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models
https://github.com/haozhg/lmd
bert deep-learning language-models multilingual-bert natural-language-processing nlp pretrained-models python pytorch roberta transformers xlm-roberta
Last synced: 5 months ago
JSON representation
Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models
- Host: GitHub
- URL: https://github.com/haozhg/lmd
- Owner: haozhg
- License: apache-2.0
- Created: 2022-10-08T14:38:11.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-12-22T15:00:01.000Z (over 3 years ago)
- Last Synced: 2025-10-26T02:48:23.551Z (8 months ago)
- Topics: bert, deep-learning, language-models, multilingual-bert, natural-language-processing, nlp, pretrained-models, python, pytorch, roberta, transformers, xlm-roberta
- Language: Python
- Homepage:
- Size: 1.82 MB
- Stars: 10
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# lmd
Code for paper titled "Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models" (accepted to EMNLP 2022). The arxiv version is here: https://arxiv.org/abs/2210.10289
## Install
Create virtual env if needed
```
python3 -m venv .venv
source .venv/bin/activate
```
Install from pip (https://pypi.org/project/nlp.lmd/)
```
pip install nlp.lmd
```
Install from source
```
git clone git@github.com:haozhg/lmd.git
cd lmd
pip install -e .
````
To use lmd cli, run `lmd --help` or `python -m lmd.cli --help`
```
$ lmd --help
usage: Language Model Decomposition [-h] [--target TARGET] [--basis BASIS]
[--tokenizer-name TOKENIZER_NAME]
[--max-seq-length MAX_SEQ_LENGTH]
[--batch-size BATCH_SIZE]
[--dataset-name DATASET_NAME]
[--dataset-config-name DATASET_CONFIG_NAME]
[--val-split-percentage VAL_SPLIT_PERCENTAGE]
[--test-split-percentage TEST_SPLIT_PERCENTAGE]
[--max-train-samples MAX_TRAIN_SAMPLES]
[--max-val-samples MAX_VAL_SAMPLES]
[--max-test-samples MAX_TEST_SAMPLES]
[--preprocessing-num-workers PREPROCESSING_NUM_WORKERS]
[--overwrite_cache OVERWRITE_CACHE]
[--preprocess-dir PREPROCESS_DIR]
[--embedding-dir EMBEDDING_DIR]
[--results-dir RESULTS_DIR]
[--models-dir MODELS_DIR] [--alpha ALPHA]
[--log-level LOG_LEVEL]
[--try-models TRY_MODELS]
[--pre-select-multiplier PRE_SELECT_MULTIPLIER]
[--seed SEED]
```
## Results
To reproduce the results in Appendix B of the paper, run `bash scripts/run.sh`. The results are also stored in [`results/128k`](./results/128k/)
## Citation
If you find this paper/code useful, please cite us:
```
@misc{https://doi.org/10.48550/arxiv.2210.10289,
doi = {10.48550/ARXIV.2210.10289},
url = {https://arxiv.org/abs/2210.10289},
author = {Zhang, Hao},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7, 68T50 (Primary) 68T30, 68T07 (Secondary)},
title = {Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
```