https://github.com/zphang/nlprunners


Host: GitHub
URL: https://github.com/zphang/nlprunners
Owner: zphang
Created: 2019-08-01T19:21:46.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2020-04-13T07:41:14.000Z (over 6 years ago)
Last Synced: 2025-01-29T13:43:36.508Z (over 1 year ago)
Language: Python
Size: 1.13 MB
Stars: 1
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# NLP Runners

This repository contains code for various NLP fine-tuning/transfer learning experiments, including semi-supervised learning, multi-task training, and adapters.

**This codebase is heavily WIP.**

----

### Quick Setup

* For a quick environment setup, see: [Simple Setup](packaging/simple_setup.md)
* For Adapters/Multi-Adapters, see: [Adapters/Multi-Adapters](packaging/adapters.md)

----

### Dependencies

These are the main notable dependencies. For more, see: [Simple Setup](packaging/simple_setup.md)

* PyTorch 1.2+
* HuggingFace/Transformers (usually the latest version; currently 2.3.0)
* My own set of Python utility libraries: [zutils](https://github.com/zphang/zutils)

----

### Overview

#### Running

* Different research projects can be found in [nlpr/proj](nlpr/proj). The basic fine-tuning version can be found in [nlpr/proj/simple](nlpr/proj/simple).
* Each proj has one or more run scripts (`runscript.py`). Run scripts are the command-line scripts for kicking off a run, but also a good entry point for reading code.
* Run scripts use a `zconf.RunConfiguration` object, which allows arguments to be instantiated easily from the command line or in-session. Importantly, you can pass `--ZZsrc {path.json}` to point to a JSON file whose keys/values correspond to the attributes of the `RunConfiguration`, which makes it more convenient to instantiate a configuration/script.
* More broadly, we make heavy use of JSON files for configuration (e.g. model configs, task configs).
* `Runner` objects contain the core logic for the training/eval loop of a project. Often, the goal of a run script is simply to set up the `Runner` object and let it do all the work.
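
The JSON-plus-command-line pattern described above can be sketched as follows. This is only an illustration and not the actual `zconf` API: the `RunConfiguration` class below is a hypothetical stand-in built on `dataclasses` and `argparse`, its fields (`task_config_path`, `learning_rate`, `num_train_epochs`) are made up, and the rule that command-line values override values from the JSON file is an assumption. Only the `--ZZsrc` flag name comes from this README.

```python
import argparse
import json
import tempfile
from dataclasses import dataclass, fields


@dataclass
class RunConfiguration:
    """Hypothetical stand-in, NOT the real zconf.RunConfiguration."""
    task_config_path: str = ""
    learning_rate: float = 1e-5
    num_train_epochs: int = 3

    @classmethod
    def from_args(cls, argv):
        parser = argparse.ArgumentParser()
        # JSON file whose keys match the attribute names of this class.
        parser.add_argument("--ZZsrc", default=None)
        for f in fields(cls):
            # Default of None lets us tell "not passed" apart from a real value.
            parser.add_argument(f"--{f.name}", type=type(f.default), default=None)
        args = vars(parser.parse_args(argv))

        values = {}
        src = args.pop("ZZsrc")
        if src is not None:
            with open(src) as fh:
                values.update(json.load(fh))
        # Assumption: explicit command-line values win over the JSON file.
        values.update({k: v for k, v in args.items() if v is not None})
        return cls(**values)


# Write a small JSON config, then combine it with a command-line override.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({"task_config_path": "tasks/example.json", "learning_rate": 3e-5}, fh)

config = RunConfiguration.from_args(["--ZZsrc", fh.name, "--num_train_epochs", "5"])
print(config)
```

The same `from_args` call works with a plain list of strings, which is what makes this style usable both from a shell and from a notebook session.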

#### Tasks

* Tasks are defined in [nlpr/tasks/lib](nlpr/tasks/lib), one per file.
* Each task broadly needs to specify the following:
    * Loading data.
    * Tokenization (`Example.tokenize`), giving a `TokenizedExample`.
    * Featurization (`TokenizedExample.featurize`), giving a `DataRow`. This converts the tokenized data into a format that the model can take in (e.g. concatenating inputs, truncating to the maximum sequence length, adding `[SEP]` tokens).
    * Conversion to a `Batch` (`Batch.from_data_rows`), which our dataloaders know how to split up.
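
As a rough illustration of the steps above, here is a minimal, self-contained sketch of the `Example` → `TokenizedExample` → `DataRow` → `Batch` flow for a sentence-pair task. Only the class and method names come from this README. The fields, the method signatures, the whitespace tokenizer, the toy vocabulary, and the truncation rule are all assumptions made for the example and may differ from the real code in `nlpr/tasks/lib`.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DataRow:
    input_ids: List[int]
    input_mask: List[int]
    label_id: int


@dataclass
class Batch:
    input_ids: List[List[int]]
    input_mask: List[List[int]]
    label_ids: List[int]

    @classmethod
    def from_data_rows(cls, data_rows: List[DataRow]) -> "Batch":
        # Stack per-example features into per-batch lists.
        return cls(
            input_ids=[row.input_ids for row in data_rows],
            input_mask=[row.input_mask for row in data_rows],
            label_ids=[row.label_id for row in data_rows],
        )


@dataclass
class TokenizedExample:
    premise: List[str]
    hypothesis: List[str]
    label_id: int

    def featurize(self, vocab: Dict[str, int], max_seq_length: int) -> DataRow:
        if max_seq_length < 3:
            raise ValueError("need room for [CLS] and two [SEP] tokens")
        premise, hypothesis = list(self.premise), list(self.hypothesis)
        # Trim the longer segment until both fit alongside 3 special tokens.
        while len(premise) + len(hypothesis) > max_seq_length - 3:
            longer = premise if len(premise) >= len(hypothesis) else hypothesis
            longer.pop()
        tokens = ["[CLS]"] + premise + ["[SEP]"] + hypothesis + ["[SEP]"]
        ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
        padding = max_seq_length - len(ids)
        return DataRow(
            input_ids=ids + [vocab["[PAD]"]] * padding,
            input_mask=[1] * len(ids) + [0] * padding,
            label_id=self.label_id,
        )


@dataclass
class Example:
    premise: str
    hypothesis: str
    label: str

    def tokenize(self, label_to_id: Dict[str, int]) -> TokenizedExample:
        # Whitespace tokenization stands in for a real subword tokenizer.
        return TokenizedExample(
            premise=self.premise.lower().split(),
            hypothesis=self.hypothesis.lower().split(),
            label_id=label_to_id[self.label],
        )


vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "a": 4, "dog": 5,
         "runs": 6, "an": 7, "animal": 8, "moves": 9}
label_to_id = {"entailment": 0, "contradiction": 1}
examples = [
    Example("A dog runs", "An animal moves", "entailment"),
    Example("A dog runs", "A dog sleeps", "contradiction"),
]
data_rows = [ex.tokenize(label_to_id).featurize(vocab, max_seq_length=10)
             for ex in examples]
batch = Batch.from_data_rows(data_rows)
print(batch.input_ids)
```

Keeping each stage as its own small class means every intermediate result can be inspected on its own, for instance in a notebook.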

----

### Guiding Principles

* Use simple data formats (dictionaries, JSON/JSONL for serialization)
* Code should be straightforward to run either on command-line or within notebooks. See: `zconf` from [zutils](https://github.com/zphang/zutils)
* Explicit is better than implicit. Use classes rather than dicts for known data structures, refrain from using `kwargs`, use keyword arguments where possible, etc.
* Verbose is better than implicit. Use an IDE.
* There should be a clean separation of messy "research" code, and solid "software engineering" code.
* [PEP 8](https://www.python.org/dev/peps/pep-0008/), [PEP 20](https://www.python.org/dev/peps/pep-0020/)
