https://github.com/zphang/nlprunners
- Host: GitHub
- URL: https://github.com/zphang/nlprunners
- Owner: zphang
- Created: 2019-08-01T19:21:46.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-04-13T07:41:14.000Z (about 6 years ago)
- Last Synced: 2025-01-29T13:43:36.508Z (over 1 year ago)
- Language: Python
- Size: 1.13 MB
- Stars: 1
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# NLP Runners
This repository contains code for various NLP fine-tuning/transfer learning experiments, including semi-supervised learning, multi-task training, and adapters.
**This codebase is heavily WIP.**
----
### Quick Setup
* For a quick environment setup, see: [Simple Setup](packaging/simple_setup.md)
* For Adapters/Multi-Adapters, see: [Adapters/Multi-Adapters](packaging/adapters.md)
----
### Dependencies
These are the main notable dependencies. For more, see: [Simple Setup](packaging/simple_setup.md)
* PyTorch 1.2+
* HuggingFace/Transformers (usually the latest version; currently 2.3.0)
* My own set of Python utility libraries: [zutils](https://github.com/zphang/zutils)
----
### Overview
#### Running
* Different research projects can be found in [nlpr/proj](nlpr/proj). The basic fine-tuning version can be found in [nlpr/proj/simple](nlpr/proj/simple).
* Each project has one or more run scripts (`runscript.py`). Run scripts are the command-line scripts for kicking off a run, and also a good entry point for reading the code.
* Run scripts use a `zconf.RunConfiguration` object, which allows arguments to be instantiated easily from the command line or in-session. Importantly, the `--ZZsrc {path.json}` argument points to a JSON file whose keys/values correspond to the attributes of the `RunConfiguration`, which makes instantiating a configuration/script more convenient.
* More broadly, we make heavy use of JSON files for various configuration (e.g. model configs, task configs).
* `Runner` objects contain the core logic for the training/eval loop of a project. Often, the goal of a run script is simply to set up the `Runner` object and let the `Runner` object do all the work.
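
To make the `--ZZsrc` idea concrete, here is a minimal sketch of a configuration object filled from a JSON file plus command-line arguments. It is a stand-in written with `argparse` and `dataclasses`, not the `zconf` API: the attribute names (`task_config_path`, `learning_rate`, `num_train_epochs`), the `parse_configuration` helper, and the rule that explicit command-line values override the JSON file are all assumptions made for illustration.

```python
import argparse
import json
import os
import tempfile
from dataclasses import dataclass, fields


@dataclass
class RunConfiguration:
    # Hypothetical attributes; the real run scripts define their own.
    task_config_path: str = ""
    learning_rate: float = 1e-5
    num_train_epochs: int = 3


def parse_configuration(argv):
    """Build a RunConfiguration from an optional JSON file plus CLI arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ZZsrc", default=None, help="path to a JSON file of attributes")
    for field in fields(RunConfiguration):
        # default=None lets us tell "not given" apart from an explicit value.
        parser.add_argument("--" + field.name, type=field.type, default=None)
    args = vars(parser.parse_args(argv))

    values = {}
    src_path = args.pop("ZZsrc")
    if src_path is not None:
        with open(src_path) as f:
            values.update(json.load(f))
    # Assumption of this sketch: explicit CLI values win over the JSON file.
    values.update({key: value for key, value in args.items() if value is not None})
    return RunConfiguration(**values)


if __name__ == "__main__":
    # Write a small JSON config, then load it with one command-line override.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump({"task_config_path": "rte.json", "num_train_epochs": 5}, f)
    config = parse_configuration(["--ZZsrc", f.name, "--learning_rate", "0.001"])
    os.remove(f.name)
    print(config)
```

The same `parse_configuration` call works from a notebook by passing a list of strings, which mirrors the "command-line or in-session" goal described above.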
#### Tasks
* Tasks are defined in [nlpr/tasks/lib](nlpr/tasks/lib), one per file.
* Each task broadly needs to specify the following:
  * Loading data.
  * Tokenization (`Example.tokenize`), giving a `TokenizedExample`.
  * Featurization (`TokenizedExample.featurize`), giving a `DataRow`. This converts the tokenized data into a format that the model can take in (e.g. concatenating inputs, truncating sequence length, adding `[SEP]` tokens).
  * Conversion to a `Batch` (`Batch.from_data_rows`), which our dataloaders know how to split up.
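
The steps above can be sketched with plain dataclasses. This is a toy illustration, not the code in `nlpr/tasks/lib`: the class and method names come from the list above, but every field (`premise`, `hypothesis`, `label`), every method signature, the whitespace tokenizer, the tiny vocabulary, and the use of Python lists instead of tensors are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataRow:
    input_ids: List[int]
    input_mask: List[int]
    label_id: int


@dataclass
class TokenizedExample:
    premise: List[str]
    hypothesis: List[str]
    label_id: int

    def featurize(self, vocab, max_seq_length):
        if max_seq_length < 3:
            raise ValueError("max_seq_length must leave room for [CLS] and two [SEP] tokens")
        premise, hypothesis = list(self.premise), list(self.hypothesis)
        # Truncate the longer input until both fit alongside the 3 special tokens.
        while len(premise) + len(hypothesis) > max_seq_length - 3:
            longer = premise if len(premise) >= len(hypothesis) else hypothesis
            longer.pop()
        tokens = ["[CLS]"] + premise + ["[SEP]"] + hypothesis + ["[SEP]"]
        input_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
        padding = max_seq_length - len(input_ids)
        return DataRow(
            input_ids=input_ids + [vocab["[PAD]"]] * padding,
            input_mask=[1] * len(input_ids) + [0] * padding,
            label_id=self.label_id,
        )


@dataclass
class Example:
    premise: str
    hypothesis: str
    label: str

    def tokenize(self, tokenizer, label_to_id):
        return TokenizedExample(
            premise=tokenizer(self.premise),
            hypothesis=tokenizer(self.hypothesis),
            label_id=label_to_id[self.label],
        )


@dataclass
class Batch:
    input_ids: List[List[int]]
    input_mask: List[List[int]]
    label_ids: List[int]

    @classmethod
    def from_data_rows(cls, data_rows):
        # Stack per-example rows into per-field columns.
        return cls(
            input_ids=[row.input_ids for row in data_rows],
            input_mask=[row.input_mask for row in data_rows],
            label_ids=[row.label_id for row in data_rows],
        )


# Toy vocabulary and a whitespace tokenizer, for illustration only.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "a": 4, "dog": 5, "runs": 6, "an": 7, "animal": 8, "moves": 9}
example = Example(premise="a dog runs", hypothesis="an animal moves", label="entailment")
tokenized = example.tokenize(tokenizer=str.split, label_to_id={"entailment": 0, "contradiction": 1})
data_row = tokenized.featurize(vocab=vocab, max_seq_length=12)
batch = Batch.from_data_rows(data_rows=[data_row])
```

Each stage returns a dedicated class rather than a dict, so the fields available at every step are explicit and easy to inspect in an IDE.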
----
### Guiding Principles
* Use simple data formats (dictionaries, JSON/JSONL for serialization)
* Code should be straightforward to run either on command-line or within notebooks. See: `zconf` from [zutils](https://github.com/zphang/zutils)
* Explicit is better than implicit. Use classes rather than dicts for known data structures, refrain from using `kwargs`, use keyword arguments where possible, etc.
* Verbose is better than implicit. Use an IDE.
* There should be a clean separation between messy "research" code and solid "software engineering" code.
* [PEP 8](https://www.python.org/dev/peps/pep-0008/), [PEP 20](https://www.python.org/dev/peps/pep-0020/)