https://github.com/imyizhang/pytorch-bolt

A simple PyTorch wrapper making multi-node multi-GPU training much easier on Slurm
- Host: GitHub
- URL: https://github.com/imyizhang/pytorch-bolt
- Owner: imyizhang
- License: MIT
- Created: 2021-05-11T01:03:08.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-05-11T20:16:48.000Z (over 3 years ago)
- Last Synced: 2024-11-01T14:17:10.076Z (13 days ago)
- Topics: distributed, pytorch-implementation, pytorch-template, slurm
- Language: Python
- Homepage: https://pypi.org/project/PyTorch-Bolt/
- Size: 14.6 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# PyTorch Bolt
PyTorch Bolt is
* a simple PyTorch wrapper making multi-node multi-GPU training much easier on Slurm
PyTorch Bolt supports:
* single-node single-GPU training on a specified GPU device
* multi-node (or single-node) multi-GPU [`DistributedDataParallel`](https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html) (DDP) training
  - with the [`torch.distributed.launch`](https://pytorch.org/docs/stable/distributed.html#launch-utility) module
  - with the [Slurm](https://slurm.schedmd.com/quickstart.html) cluster workload manager

## Quickstart
### All-in-One Template Using PyTorch Bolt
#### Recommended Structure
```
.
├── data
│   ├── __init__.py
│   └── customized_datamodule.py
├── model
│   ├── __init__.py
│   └── customized_model.py
├── main.py
├── main.sbatch
└── requirements.txt
```

#### Demo
[MNIST classification using PyTorch Bolt](https://github.com/yzhang-dev/PyTorch-with-Slurm/tree/main/Tutorials/All-in-One-Template-Using-PyTorch-Bolt) (you might need to go through the relevant [tutorials](https://github.com/yzhang-dev/PyTorch-with-Slurm) step by step).
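For orientation, here is a hedged sketch of what the `main.py` entry point in this structure might look like, wiring the modules documented below together through their `add_argparse_args` / `from_argparse_args` helpers. It assumes those helpers chain by returning a parser built on `parent_parser`, that `MyDataModule` and `MyModel` (the placeholder classes from the practical templates further down) are re-exported from the `data` and `model` packages, and that `Trainer` receives the datamodule, model, and loggers through the constructor keywords shown in its signature.

```python
# main.py -- illustrative sketch, not the verbatim demo code
import argparse

import pytorch_bolt
# assumed re-exports from the `data` and `model` packages shown above
from data import MyDataModule
from model import MyModel


def main():
    # chain the per-module argument parsers documented below
    parser = argparse.ArgumentParser()
    parser = MyDataModule.add_argparse_args(parser)
    parser = MyModel.add_argparse_args(parser)
    parser = pytorch_bolt.Loggers.add_argparse_args(parser)
    parser = pytorch_bolt.Trainer.add_argparse_args(parser)
    args = parser.parse_args()

    datamodule = MyDataModule.from_argparse_args(args)
    model = MyModel.from_argparse_args(args)
    loggers = pytorch_bolt.Loggers.from_argparse_args(args)

    # constructor keywords follow the `Trainer` signature documented below
    trainer = pytorch_bolt.Trainer(
        loggers=loggers,
        datamodule=datamodule,
        model=model,
    )
    trainer.fit()       # train, validating each epoch on the valset
    trainer.test()      # evaluate once on the testset
    trainer.destroy()   # clean up the trainer


if __name__ == "__main__":
    main()
```

On Slurm, `main.sbatch` would presumably launch one such process per task; the Appendix lists how Slurm's environment variables map onto the names `torch.distributed` expects.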
### Dependencies and Installation
#### Package Dependencies
Running `pip install -r requirements.txt` installs all package dependencies.
#### Install PyTorch Bolt
```bash
$ pip install pytorch-bolt
```

## Documentation
### Module `DataModule`
```python
class pytorch_bolt.DataModule(data_dir='data', num_splits=10, batch_size=1, num_workers=0, pin_memory=False, drop_last=False)
```

#### `use_dist_sampler()`
Call this to enable a `DistributedSampler` when using `DistributedDataParallel` (DDP).
#### `train_dataloader()`
Returns the `DataLoader` for the trainset.
#### `val_dataloader()`
Returns the `DataLoader` for the valset.
#### `test_dataloader()`
Returns the `DataLoader` for the testset.
#### `add_argparse_args(parent_parser)`
Returns `argparse` parser. (**Staticmethod**)
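For a quick feel of how these methods fit together, here is a hedged sketch. `ToyDataModule` and its random dataset are made up for illustration, the constructor keywords come from the signature above, the base class is assumed to consume whatever `_setup_dataset()` returns, and calling `use_dist_sampler()` before requesting the dataloaders is an assumption about the intended call order.

```python
import torch
from torch.utils.data import TensorDataset

import pytorch_bolt


class ToyDataModule(pytorch_bolt.DataModule):
    # hypothetical minimal subclass; the base class is assumed to consume
    # the datasets returned by `_setup_dataset()` (see the template below)
    def _setup_dataset(self):
        dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
        return dataset, dataset, dataset  # trainset, valset, testset


# constructor keywords follow the signature above
dm = ToyDataModule(data_dir="data", batch_size=16, num_workers=2)

# under DistributedDataParallel, switch to a DistributedSampler first (assumed call order)
# dm.use_dist_sampler()

train_loader = dm.train_dataloader()  # DataLoader over the trainset
val_loader = dm.val_dataloader()      # DataLoader over the valset
test_loader = dm.test_dataloader()    # DataLoader over the testset
```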
**Practical template**:
```python
import argparse

import pytorch_bolt


class MyDataModule(pytorch_bolt.DataModule):
    def __init__(self, args):
        super().__init__(args)
        # arguments for the customized dataset

    # optional helper function can be used
    def _prepare_data(self):
        pass

    def _setup_dataset(self):
        # trainset and valset for the fit stage
        # (`self.num_splits` can be used for splitting trainset and valset)
        # testset for the test stage
        return trainset, valset, testset

    @staticmethod
    def add_argparse_args(parent_parser):
        parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
        parser = pytorch_bolt.DataModule.add_argparse_args(parser)
        # TODO: add dataset-specific arguments
        return parser

    @classmethod
    def from_argparse_args(cls, args):
        return cls(args)
```

### Module `Module`
```python
class pytorch_bolt.Module()
```

#### `parameters_to_update()`
Returns model parameters that have `requires_grad=True`.
#### `configure_criterion()`
Returns criterion.
#### `configure_metric()`
Returns metric.
#### `configure_optimizer()`
Returns optimizer (and learning rate scheduler).
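For a concrete sense of what these hooks might return, here is a hedged sketch of a small classifier. The criterion, metric, and optimizer choices are illustrative rather than mandated by PyTorch Bolt, and the assumption that any callable scoring `(outputs, targets)` works as a metric is mine.

```python
import torch
import torch.nn as nn

import pytorch_bolt


class TinyClassifier(pytorch_bolt.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(28 * 28, 10)  # toy model for illustration

    def forward(self, inputs):
        return self.model(inputs.flatten(1))

    def parameters_to_update(self):
        # e.g. for transfer learning: only the parameters left trainable
        return [p for p in self.model.parameters() if p.requires_grad]

    def configure_criterion(self):
        return nn.CrossEntropyLoss()

    def configure_metric(self):
        # top-1 accuracy; assumes any callable scoring (outputs, targets) is acceptable
        return lambda outputs, targets: (outputs.argmax(dim=1) == targets).float().mean()

    def configure_optimizer(self):
        optimizer = torch.optim.Adam(self.parameters_to_update(), lr=1e-3)
        lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)
        return optimizer, lr_scheduler
```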
**Practical template**:
```python
import argparse

import pytorch_bolt


class MyModel(pytorch_bolt.Module):
    def __init__(self, args):
        super().__init__()
        # hyperparameters for the model
        self.model = self._setup_model()
        # hyperparameters for criterion, metric, optimizer and lr_scheduler

    def _setup_model(self):
        # TODO: build the model
        return model

    def forward(self, inputs):
        return self.model(inputs)

    # return parameters that have requires_grad=True
    # (`parameters_to_update` can be useful for transfer learning)
    def parameters_to_update(self):
        return

    # return criterion
    def configure_criterion(self):
        return

    # return metric
    def configure_metric(self):
        return

    # return optimizer (and lr_scheduler)
    def configure_optimizer(self):
        return

    @staticmethod
    def add_argparse_args(parent_parser):
        parser = argparse.ArgumentParser(parents=[parent_parser], add_help=False)
        # TODO: add model-specific arguments
        return parser

    @classmethod
    def from_argparse_args(cls, args):
        return cls(args)
```

### Module `Loggers`
```python
class pytorch_bolt.Loggers(logs_dir='logs', loggerfmt='%(asctime)s | %(levelname)-5s | %(name)s - %(message)s', datefmt=None, tracker_keys=None (Required), tracker_reduction='mean')
```

#### `configure_root_logger(root)`
Returns `root` logger.
#### `configure_child_logger(child)`
Returns `root.child` logger.
#### `configure_tracker()`
Returns tracker for tracking forward propagation step outputs and statistics.
#### `configure_progressbar()`
Returns progress bar for showing forward propagation step progress and details.
#### `configure_writer()`
Returns Tensorboard writer for visualizing forward propagation epoch outputs.
#### `add_argparse_args(parent_parser)`
Returns `argparse` parser. (**Staticmethod**)
#### `from_argparse_args(args)`
`Loggers` constructor.
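A hedged usage sketch of the helpers above: the constructor keywords follow the signature, the logger names and tracker keys are made up, and the returned objects are assumed to behave like standard `logging` loggers, an output tracker, a progress bar, and a TensorBoard writer, respectively.

```python
import pytorch_bolt

# `tracker_keys` is marked as required in the signature above; these keys are illustrative
loggers = pytorch_bolt.Loggers(logs_dir="logs", tracker_keys=["loss", "score"])

root_logger = loggers.configure_root_logger("train")    # logger named "train"
child_logger = loggers.configure_child_logger("epoch")  # logger named "train.epoch"
root_logger.info("starting training")                   # assuming a standard logging.Logger

tracker = loggers.configure_tracker()          # tracks step outputs and statistics
progressbar = loggers.configure_progressbar()  # shows step progress and details
writer = loggers.configure_writer()            # TensorBoard writer for epoch outputs
```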
### Module `Trainer`
```python
class pytorch_bolt.Trainer(loggers=None (Required), device=None, distributed=False, use_slurm=False, dist_backend='nccl', master_addr='localhost', master_port='29500', world_size=1, rank=0, local_rank=0, datamodule=None (Required), model=None (Required), max_epochs=5, verbose=False)
```

#### `get_rank()`
Gets the rank of the current process. (**Staticmethod**)
#### `fit()`
Fits the model on trainset, validating each epoch on valset.
#### `validate()`
Validates trained model by running one epoch on valset.
#### `test()`
Tests trained model by running one epoch on testset.
#### `destroy()`
Destroys the trainer.
#### `add_argparse_args(parent_parser)`
Returns `argparse` parser. (**Staticmethod**)
#### `from_argparse_args(args)`
`Trainer` constructor.
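Because every process runs the same script under DDP, `get_rank()` is handy for restricting work such as reporting or checkpointing to a single process. A hedged sketch follows, with the constructor keywords taken from the signature above and `loggers`, `datamodule`, and `model` assumed to be instances built as in the earlier sections.

```python
import pytorch_bolt

trainer = pytorch_bolt.Trainer(
    loggers=loggers,        # pytorch_bolt.Loggers instance (see above)
    distributed=True,
    use_slurm=True,         # presumably reads the Slurm variables listed in the Appendix
    datamodule=datamodule,  # DataModule subclass instance
    model=model,            # Module subclass instance
    max_epochs=5,
)

trainer.fit()       # train, validating each epoch on the valset
trainer.validate()  # one extra validation pass, if desired
trainer.test()      # one pass over the testset
if pytorch_bolt.Trainer.get_rank() == 0:
    print("done")   # e.g. report or save artifacts only on the rank-0 process
trainer.destroy()   # tear down the trainer
```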
**Practical template for customized trainer**:
```python
import pytorch_bolt


class MyTrainer(pytorch_bolt.Trainer):
    def _training_step(self, batch_idx, batch):
        return

    def _training_step_end(self, batch_idx, batch, step_outs):
        return

    # if it returns anything, it should return a dict
    # containing at least 2 keys: "loss", "score"
    def _training_epoch_end(self):
        return
```

## Related Projects
* Inspired by [PyTorch Lightning](https://www.pytorchlightning.ai/)
## Appendix
### Environment Variable Mapping
| `torch.distributed` / launcher variable | Slurm variable | Description |
| --- | --- | --- |
| **WORLD_SIZE** | **SLURM_NTASKS** (and **SLURM_NPROCS** for backwards compatibility) | Same as **-n, --ntasks** |
| **RANK** | **SLURM_PROCID** | The MPI rank (or relative process ID) of the current process |
| **LOCAL_RANK** | **SLURM_LOCALID** | Node local task ID for the process within a job |
| **MASTER_ADDR** | **SLURM_SUBMIT_HOST** | The hostname of the machine from which **sbatch** was invoked |
| **NPROC_PER_NODE** | **SLURM_NTASKS_PER_NODE** | Number of tasks requested per node. Only set if the **--ntasks-per-node** option is specified |
| **NNODES** | **SLURM_JOB_NUM_NODES** (and **SLURM_NNODES** for backwards compatibility) | Total number of nodes in the job's resource allocation |
| **NODE_RANK** | **SLURM_NODEID** | ID of the nodes allocated |
| | **SLURM_JOB_NODELIST** (and **SLURM_NODELIST** for backwards compatibility) | List of nodes allocated to the job |
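If you ever need to perform this mapping by hand (the `Trainer`'s `use_slurm=True` presumably does the equivalent internally), it amounts to copying the Slurm variables into the names that `torch.distributed`'s environment-variable initialization expects; a sketch:

```python
import os

import torch.distributed as dist

# copy Slurm's variables into the names expected by torch.distributed
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]
os.environ["RANK"] = os.environ["SLURM_PROCID"]
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]
# the table above maps MASTER_ADDR to the submit host; many setups instead
# resolve the first hostname in SLURM_JOB_NODELIST
os.environ["MASTER_ADDR"] = os.environ["SLURM_SUBMIT_HOST"]
os.environ.setdefault("MASTER_PORT", "29500")  # default port in the Trainer signature above

dist.init_process_group(backend="nccl", init_method="env://")
```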
### Reference
* [`sbatch` Output Environment Variables](https://slurm.schedmd.com/sbatch.html#lbAK)
* [`torch.distributed` TCP Initialization Environment Variables](https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization)