https://github.com/bigscience-workshop/catalogue_data

# catalogue_data
Scripts to prepare catalogue data.

## Setup
Clone this repo.

Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation
```shell
sudo apt-get install git-lfs
git lfs install
```

Install system dependencies:
```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```

Create a virtual environment, activate it, and install the Python dependencies:
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create a User Access Token (with write access) on the Hugging Face Hub: https://huggingface.co/settings/token
and set the environment variables in a `.env` file at the root directory:
```
HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
```
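The scripts can then read these variables from the `.env` file. A minimal sketch of that step, assuming simple `KEY=VALUE` lines (this hand-rolled parser is an illustrative stand-in for a library such as `python-dotenv`, which the repo may use instead):

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and malformed lines.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
```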

## Create metadata
To create the dataset metadata (written to `dataset_infos.json`), run:
```shell
python create_metadata.py --repo <dataset_repository_name>
```
where you should replace `<dataset_repository_name>` with the name of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`
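For reference, `dataset_infos.json` is the metadata file format used by the `datasets` library: a JSON object mapping each dataset config name to its metadata (description, features, splits, sizes, and so on). The sketch below writes a drastically simplified example of that shape; the field values are illustrative, not the full schema:

```python
import json

# Illustrative structure only: the real file is produced by the `datasets`
# library and contains many more fields (download sizes, checksums, ...).
dataset_infos = {
    "default": {
        "description": "Example dataset entry.",
        "features": {"text": {"dtype": "string", "_type": "Value"}},
        "splits": {"train": {"name": "train", "num_examples": 0}},
    }
}

with open("dataset_infos.json", "w") as f:
    json.dump(dataset_infos, f, indent=2)
```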

## Aggregate datasets
To create an aggregated dataset from multiple datasets and save it as sharded JSON Lines GZIP files, run:
```shell
python aggregate_datasets.py --dataset_ratios_path <path_to_file_with_dataset_ratios> --save_path <save_path>
```
where you should replace:
- `<path_to_file_with_dataset_ratios>`: path to a JSON file containing a dict with dataset names (keys) and their sampling ratios (values) between 0 and 1
- `<save_path>`: directory path where the aggregated dataset is saved
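The aggregation step can be sketched as follows: sample each dataset according to its ratio, shuffle the union, and write it out as sharded `.jsonl.gz` files. This is a simplified illustrative stand-in for `aggregate_datasets.py` (function name, shard naming, and shard size are assumptions, not the script's actual interface):

```python
import gzip
import json
import random

def aggregate(datasets, ratios, save_path, shard_size=2):
    """Sample each dataset by its ratio and write sharded JSON Lines GZIP files.

    datasets: dict mapping dataset name -> list of JSON-serializable records
    ratios:   dict mapping dataset name -> sampling ratio in [0, 1]
    Returns the number of shards written.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    sampled = []
    for name, records in datasets.items():
        k = int(len(records) * ratios[name])
        sampled.extend(rng.sample(records, k))
    rng.shuffle(sampled)
    shards = [sampled[i:i + shard_size] for i in range(0, len(sampled), shard_size)]
    for idx, shard in enumerate(shards):
        with gzip.open(f"{save_path}/shard-{idx:05d}.jsonl.gz", "wt") as f:
            for record in shard:
                f.write(json.dumps(record) + "\n")
    return len(shards)
```

With this sketch, a ratios file containing `{"a": 0.5, "b": 1.0}` would keep half of dataset `a` and all of dataset `b`.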

## Downloads for cleaning

### Stanza

```python
import stanza

for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
    stanza.download(lang, logging_level="WARNING")
```

### Indic NLP library

```bash
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=$(pwd)/indic_nlp_resources  # path to the cloned repo
```

### NLTK
```python
import nltk

nltk.download("punkt")
```