https://github.com/bigscience-workshop/catalogue_data
Scripts to prepare catalogue data
- Host: GitHub
- URL: https://github.com/bigscience-workshop/catalogue_data
- Owner: bigscience-workshop
- License: apache-2.0
- Created: 2022-02-01T05:35:53.000Z (about 4 years ago)
- Default Branch: master
- Last Pushed: 2022-04-25T11:51:09.000Z (almost 4 years ago)
- Last Synced: 2025-04-04T16:41:46.789Z (11 months ago)
- Language: Jupyter Notebook
- Size: 275 KB
- Stars: 8
- Watchers: 21
- Forks: 1
- Open Issues: 8
Metadata Files:
- Readme: README.md
- License: LICENSE
# catalogue_data
Scripts to prepare catalogue data.
## Setup
Clone this repo.
Install git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation
```shell
sudo apt-get install git-lfs
git lfs install
```
Install dependencies:
```shell
sudo apt-add-repository non-free
sudo apt-get update
sudo apt-get install unrar
```
Create a virtual environment, activate it, and install the Python dependencies:
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
Create a User Access Token (with write access) on the Hugging Face Hub: https://huggingface.co/settings/token
Then set the environment variables in a `.env` file at the root directory:
```
HF_USERNAME=
HF_USER_ACCESS_TOKEN=
GIT_USER=
GIT_EMAIL=
```
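The repository's scripts presumably read these variables from the `.env` file (for example via a helper such as `python-dotenv`; how exactly they are loaded is not shown here). As a minimal stdlib sketch of that loading step, using the filename and variable names from the block above:

```python
import os

def load_env(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into os.environ."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blank lines, comments, and lines without an '=' separator.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Do not overwrite variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env()
token = os.environ.get("HF_USER_ACCESS_TOKEN")
```

Real `.env` files support more syntax (quoting, exports, interpolation), so treat this only as an illustration of the file format.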
## Create metadata
To create the dataset metadata (in the file `dataset_infos.json`), run:
```shell
python create_metadata.py --repo <repo_id>
```
where you should replace `<repo_id>` with the name of the dataset repository, e.g. `bigscience-catalogue-lm-data/lm_ca_viquiquad`
## Aggregate datasets
To create an aggregated dataset from multiple datasets, and save it as sharded JSON Lines GZIP files, run:
```shell
python aggregate_datasets.py --dataset_ratios_path <dataset_ratios_path> --save_path <save_path>
```
where you should replace:
- `<dataset_ratios_path>`: path to a JSON file containing a dict with dataset names (keys) and their ratios (values) between 0 and 1
- `<save_path>`: directory path to save the aggregated dataset
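The ratios file is just a JSON object mapping dataset names to sampling ratios in [0, 1]. A hypothetical example (the second dataset name is invented for illustration), together with a quick stdlib check of the 0–1 constraint before passing the file to `aggregate_datasets.py`:

```python
import json

# Hypothetical ratios file: keys are dataset repository names,
# values are sampling ratios between 0 and 1.
ratios = {
    "bigscience-catalogue-lm-data/lm_ca_viquiquad": 1.0,
    "bigscience-catalogue-lm-data/lm_xx_example": 0.25,  # illustrative name
}

with open("dataset_ratios.json", "w") as f:
    json.dump(ratios, f, indent=2)

# Validate the file before using it.
with open("dataset_ratios.json") as f:
    loaded = json.load(f)
assert all(0 <= r <= 1 for r in loaded.values())
```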
## Downloads for cleaning
### Stanza
```python
import stanza
for lang in {"ar", "ca", "eu", "id", "vi", "zh-hans", "zh-hant"}:
stanza.download(lang, logging_level="WARNING")
```
### Indic NLP library
```bash
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
export INDIC_RESOURCES_PATH=<path_to_indic_nlp_resources>
```
### NLTK
```python
import nltk
nltk.download("punkt")
```