https://github.com/vzhong/wrangl

Parallel data preprocessing for NLP and ML.

Host: GitHub
URL: https://github.com/vzhong/wrangl
Owner: vzhong
License: apache-2.0
Created: 2021-08-31T01:17:20.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2022-11-10T03:47:32.000Z (about 2 years ago)
Last Synced: 2024-09-07T11:38:35.859Z (2 months ago)
Language: Python
Homepage:
Size: 1.62 MB
Stars: 33
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Wrangl

[![Tests](https://github.com/vzhong/wrangl/actions/workflows/test.yml/badge.svg)](https://github.com/vzhong/wrangl/actions/workflows/test.yml)

Parallel data preprocessing and fast experiments for NLP and ML.

See [docs here](https://www.victorzhong.com/wrangl).

## Why?

I built this library to prototype ideas quickly.

In essence, it combines [Hydra](https://hydra.cc), [PyTorch Lightning](https://www.pytorchlightning.ai), [moolib](https://github.com/facebookresearch/moolib), and [Ray](https://ray.io) for fast data processing and (supervised/reinforcement) learning.

The following are supported with command line or config tweaks (i.e. no additional boilerplate code):

- checkpointing
- early stopping
- auto git diffs
- logging to S3 (along with an auto-generated seaborn plot) and wandb
- Slurm launcher

## Installation

```bash
pip install -e .  # add [dev] if you want to run tests and build docs.

# for latest
pip install git+https://github.com/vzhong/wrangl

# pypi release
pip install wrangl
```

If the [moolib](https://github.com/facebookresearch/moolib) install fails because you do not have CUDA, try installing it yourself with `env USE_CUDA=0 pip install moolib`.

## Usage

See [the documentation](https://victorzhong.com/wrangl) for how to use Wrangl.

Examples of projects using Wrangl are found in `wrangl.examples`.

In particular, `wrangl.examples.learn.xor_clf` shows how to use Wrangl to quickly set up a supervised classification task.

`wrangl.examples.learn.atari_rl` shows an example of reinforcement learning using IMPALA VTrace.

For parallel data preprocessing, `wrangl.examples.preprocess.using_stanza` shows how to use Stanford NLP Stanza to parse text in parallel across CPU cores.
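To give a rough feel for what parallel preprocessing across CPU cores means, here is a minimal sketch that uses only the Python standard library. It is not Wrangl's API (see the documentation and `wrangl.examples.preprocess.using_stanza` for that). The `tokenize` and `preprocess` names are hypothetical, and `tokenize` is a toy stand-in for a heavier parser such as Stanza.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def tokenize(text):
    """Toy stand-in for an expensive parser (hypothetical helper, not part of Wrangl)."""
    return text.lower().split()


def preprocess(texts, workers=None):
    """Apply tokenize to every text, spreading the work across worker processes."""
    workers = workers or os.cpu_count() or 1
    if workers <= 1:
        # Serial fallback: skips process start-up cost for small inputs.
        return [tokenize(t) for t in texts]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order; chunksize batches items per task
        # to reduce inter-process communication overhead.
        return list(pool.map(tokenize, texts, chunksize=64))
```

When running this as a script, call `preprocess(corpus)` from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Processes rather than threads are used here because CPU-bound parsing in Python generally does not speed up with threads.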

If you find this work helpful, please consider citing

```
@misc{zhong2021wrangl,
  author = {Zhong, Victor},
  title = {Wrangl: Parallel data preprocessing for NLP and ML},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/vzhong/wrangl}}
}
```