https://github.com/bigscience-workshop/catalogue_data
Scripts to prepare catalogue data
- Host: GitHub
- URL: https://github.com/bigscience-workshop/catalogue_data
- Owner: bigscience-workshop
- License: apache-2.0
- Created: 2022-02-01T05:35:53.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-04-25T11:51:09.000Z (almost 4 years ago)
- Last Synced: 2025-04-04T16:41:46.789Z (11 months ago)
- Language: Jupyter Notebook
- Size: 275 KB
- Stars: 8
- Watchers: 21
- Forks: 1
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
# catalogue_data
Scripts to prepare catalogue data.
## Setup
Clone this repo.
Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation
```shell
sudo apt-get install git-lfs
git lfs install
```
Install dependencies:
```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```
Create a virtual environment, activate it, and install the Python dependencies:
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a User Access Token (with write access) on the Hugging Face Hub: https://huggingface.co/settings/token
Then set the environment variables in a `.env` file at the root directory:
```
HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
```
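The repository's scripts presumably read these variables from the `.env` file (for example via a helper such as `python-dotenv`; how exactly they are loaded is not shown here). As a minimal stdlib sketch of that loading step, using the filename and variable names from the block above:

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '=' separator.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
token = os.environ.get("HF_USER_ACCESS_TOKEN")
```

Real `.env` files support more syntax (quoting, exports, interpolation), so treat this only as an illustration of the file format.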
## Create metadata
To create the dataset metadata (in the file `dataset_infos.json`), run:
```shell
python create_metadata.py --repo <repo_id>
```
where you should replace `<repo_id>` with the name of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`
## Aggregate datasets
To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:
```shell
python aggregate_datasets.py --dataset_ratios_path <dataset_ratios_path> --save_path <save_path>
```
where you should replace:
- `<dataset_ratios_path>`: path to a JSON file containing a dict with dataset names (keys) and their ratios (values) between 0 and 1
- `<save_path>`: directory path to save the aggregated dataset
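The ratios file is just a JSON object mapping dataset names to sampling ratios in [0, 1]. A hypothetical example (the second dataset name is invented for illustration), together with a quick stdlib check of the 0–1 constraint before passing the file to `aggregate_datasets.py`:

```python
import json

# Hypothetical ratios file: keys are dataset repository names,
# values are sampling ratios between 0 and 1.
ratios = {
    "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
    "bigscience-catalogue-lm-data/lm_xx_example": 0.25,  # illustrative name
}

with open("dataset_ratios.json", "w") as f:
    json.dump(ratios, f, indent=2)

# Validate the file before using it.
with open("dataset_ratios.json") as f:
    loaded = json.load(f)
assert all(0 <= r <= 1 for r in loaded.values())
```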
## Downloads for cleaning
### Stanza
```python
import stanza
for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
stanza.download(lang, logging_level="WARNING")
```
### Indic NLP library
```bash
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=<path_to_indic_nlp_resources>
```
### NLTK
```python
import nltk
nltk.download("punkt")
```