Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/flbbb/locost-summarization
- Host: GitHub
- URL: https://github.com/flbbb/locost-summarization
- Owner: flbbb
- License: apache-2.0
- Created: 2024-01-29T11:39:57.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-03-22T03:26:06.000Z (10 months ago)
- Last Synced: 2024-08-01T04:02:11.137Z (6 months ago)
- Language: Assembly
- Size: 6.76 MB
- Stars: 23
- Watchers: 4
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-state-space-models - Document Summarization
README
# LOCOST
This repo contains the code used to pretrain and finetune LOCOST.
The state-space model code is adapted from the official [H3 repository](https://github.com/HazyResearch/H3).
Pre-trained models are available on the [HuggingFace model hub](https://huggingface.co/flbbb/locost-gsg-pretrained).
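For example, one way to fetch a checkpoint locally (Hub model repositories are plain git repositories, so any standard Hub download workflow also works):

```bash
# Clone the pretrained checkpoint from the HuggingFace Hub.
# Requires git-lfs so that the weight files are actually downloaded.
git clone https://huggingface.co/flbbb/locost-gsg-pretrained
```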
## Setup
Install both packages in the `csrc/` folder:
```bash
# Build and install the FFT convolution kernel.
cd csrc/fftconv
pip install ./
# Build and install the Cauchy kernel.
cd ../cauchy
pip install ./
```
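A quick way to check that the builds succeeded (the module names below are assumptions carried over from the upstream H3/S4 kernels, not confirmed by this repository; adjust them if the setup scripts register different names):

```bash
# Sanity check: both extensions should import without error.
# "fftconv" and "cauchy_mult" are assumed module names.
python -c "import fftconv, cauchy_mult"
```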
## Data
We expect the datasets to be tokenized with the base LongT5 tokenizer. This preprocessing can be done with the script `preprocess_data.py`, as sketched below.
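A hypothetical invocation (the flag names are illustrative, not taken from the script; check `preprocess_data.py` itself for the actual arguments):

```bash
# Illustrative only: flag names below are assumptions, not the script's real interface.
python preprocess_data.py \
    --dataset path/to/raw/dataset \
    --tokenizer $TOKENIZER_PATH \
    --output $DATASET_PATH/my-dataset
```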
## Env
These scripts rely on a `.env` file, loaded through the [python-dotenv](https://pypi.org/project/python-dotenv/) package. Make sure to define the following variables there (a sample file is sketched after this list):
- `DATASET_PATH`, the base folder where the datasets are stored.
- `TOKENIZER_PATH`, the path to the model tokenizer (we used the LongT5 tokenizer).
- `CHECKPOINT_PATH`, the folder where model checkpoints are saved during training.
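A minimal `.env` sketch (the paths are placeholders, not values from the repository):

```bash
# .env, loaded by python-dotenv; all paths below are illustrative.
DATASET_PATH=/path/to/datasets
TOKENIZER_PATH=/path/to/longt5-tokenizer
CHECKPOINT_PATH=/path/to/checkpoints
```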
## Pretraining
Pretraining is run with PyTorch Lightning and tracked with `wandb`.
```bash
TRANSFORMERS_NO_ADVISORY_WARNINGS="true" python pretrain_script.py --dataset path/to/pretraining/dataset --config configs/pretraining/locost.yaml --wandb_name locost-pretraining
```