Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/flbbb/locost-summarization
- Host: GitHub
- URL: https://github.com/flbbb/locost-summarization
- Owner: flbbb
- License: apache-2.0
- Created: 2024-01-29T11:39:57.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-03-22T03:26:06.000Z (10 months ago)
- Last Synced: 2024-08-01T04:02:11.137Z (6 months ago)
- Language: Assembly
- Size: 6.76 MB
- Stars: 23
- Watchers: 4
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-state-space-models - Document Summarization
README
# LOCOST
This repo contains the code used to pretrain and finetune LOCOST.
The state-space model code is adapted from the official [H3 repository](https://github.com/HazyResearch/H3).
Pre-trained models are available on the [HuggingFace model hub](https://huggingface.co/flbbb/locost-gsg-pretrained).
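For example, one way to fetch a checkpoint locally (Hub model repositories are plain git repositories, so any standard Hub download workflow also works):

```bash
# Clone the pretrained checkpoint from the HuggingFace Hub.
# Requires git-lfs so that the weight files are actually downloaded.
git clone https://huggingface.co/flbbb/locost-gsg-pretrained
```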
## Setup
Install both packages in the `csrc/` folder:
```bash
# Build and install the FFT convolution kernel.
cd csrc/fftconv
pip install ./
# Build and install the Cauchy kernel.
cd ../cauchy
pip install ./
```
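A quick way to check that the builds succeeded (the module names below are assumptions carried over from the upstream H3/S4 kernels, not confirmed by this repository; adjust them if the setup scripts register different names):

```bash
# Sanity check: both extensions should import without error.
# "fftconv" and "cauchy_mult" are assumed module names.
python -c "import fftconv, cauchy_mult"
```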
## Data
We expect the datasets to be tokenized with the base LongT5 tokenizer. This preprocessing can be done with the script `preprocess_data.py`, as sketched below.
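A hypothetical invocation (the flag names are illustrative, not taken from the script; check `preprocess_data.py` itself for the actual arguments):

```bash
# Illustrative only: flag names below are assumptions, not the script's real interface.
python preprocess_data.py \
    --dataset path/to/raw/dataset \
    --tokenizer $TOKENIZER_PATH \
    --output $DATASET_PATH/my-dataset
```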
## Env
These scripts rely on a `.env` file, loaded through the [python-dotenv](https://pypi.org/project/python-dotenv/) package. Make sure to define the following variables there (a sample file is sketched after this list):
- `DATASET_PATH`, the base folder where the datasets are stored.
- `TOKENIZER_PATH`, the path to the model tokenizer (we used the LongT5 tokenizer).
- `CHECKPOINT_PATH`, the folder where model checkpoints are saved during training.
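A minimal `.env` sketch (the paths are placeholders, not values from the repository):

```bash
# .env, loaded by python-dotenv; all paths below are illustrative.
DATASET_PATH=/path/to/datasets
TOKENIZER_PATH=/path/to/longt5-tokenizer
CHECKPOINT_PATH=/path/to/checkpoints
```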
## Pretraining
Pretraining is run with PyTorch Lightning and tracked with `wandb`.
```bash
TRANSFORMERS_NO_ADVISORY_WARNINGS="true" python pretrain_script.py --dataset path/to/pretraining/dataset --config configs/pretraining/locost.yaml --wandb_name locost-pretraining
```